WEBVTT

00:00.000 --> 00:14.080
So, yeah, the topic is FPGAs in finance, a practical 101 as of 2026.

00:14.080 --> 00:20.840
It's an ongoing hobby project, so nothing commercial, but the important thing is

00:20.840 --> 00:25.160
it's all open source tools I'm using.

00:25.160 --> 00:32.560
Why this project, and what are the applications of FPGAs in finance, which I'll talk about,

00:32.560 --> 00:42.680
then why, basically, we would compare FPGAs versus CGRAs, and, if time permits, a quick

00:42.680 --> 00:49.240
demo. But yeah, the only good demo is the demo which does not work, so let's see if it works,

00:49.240 --> 00:59.200
if time permits. Why this project? So, you know, I mean, hardware is

00:59.200 --> 01:06.600
hard, and FPGAs are even harder. And the other thing is that I witnessed that when

01:06.600 --> 01:14.320
we are talking about the applications we've been running on FPGAs when it comes to finance

01:14.320 --> 01:23.240
they are gatekept. So, you know, you have access to open source hardware and FPGA

01:23.240 --> 01:30.680
tools, but still, traditionally the struggle has been that the learning curve is, you know,

01:30.680 --> 01:38.640
quite high for the applications to run on FPGAs, when it comes to FPGA finance applications.

01:38.640 --> 01:46.840
So how to fix it? My mental model was that, you know, you have to first find out

01:46.840 --> 01:53.040
what the actual applications are, and then break it down in a Lego approach:

01:53.040 --> 01:59.760
you know, what are the computations being run on FPGAs and the most important part for

01:59.760 --> 02:05.320
me was that you know, which open source projects can be helpful to implement and understand

02:05.400 --> 02:11.440
those computations, you know, because that's the spirit of FOSDEM and also the open

02:11.440 --> 02:17.880
source community, because traditionally silicon tools have not been open source, but now

02:17.880 --> 02:28.160
we have them. So why not just see, in a hobby project, what kind of tools we can use to actually

02:28.160 --> 02:34.800
run those applications which were traditionally very closed source and gatekept.

02:34.800 --> 02:40.040
So to break it down, what we're talking about is, you know, when we talk about financial

02:40.040 --> 02:47.680
applications, we traditionally mean HFT, high-frequency trading, and then you have

02:47.680 --> 02:57.400
these quantitative workloads, and you see a block diagram picture on the right side.

02:57.400 --> 03:03.720
The most common thing is that you know, you have a plumbing pipeline, so regardless

03:03.760 --> 03:09.640
of what kind of applications you have, whether it's HFT or quantitative, you have a plumbing

03:09.640 --> 03:18.040
pipeline, and basically, you know, most FPGA-related techniques exist in that plumbing

03:18.040 --> 03:24.480
pipeline and this is what, when I was basically working on those applications in a hobby

03:24.480 --> 03:32.480
project, I found out that you know, okay, if I can somehow deep dive into this plumbing

03:32.560 --> 03:42.040
pipeline, I could, using those tools, see what kind of FPGA computations

03:42.040 --> 03:45.280
you can understand from these tools.

03:45.280 --> 03:51.440
Also, here I have a small table showing the differences. You know, for instance,

03:51.440 --> 04:00.840
with HFT, the goal is traditionally what's called ultra-low latency.

04:00.840 --> 04:05.880
In the quantitative case, you are not, you know, targeting low latency;

04:05.880 --> 04:08.600
you actually target throughput.

04:08.600 --> 04:13.520
The workload is also different in both applications, so you have stream-driven, like, you

04:13.520 --> 04:20.760
know, event-driven, an event stream in a sense, and then in the quantitative case, it's

04:20.760 --> 04:23.560
more batch oriented.

04:23.560 --> 04:28.400
A more important part is the control flow. So you have minimal branching in HFT,

04:28.400 --> 04:35.200
but on the other side, in quantitative finance, it's basically branch heavy.

04:35.200 --> 04:39.920
Memory is also important, so in quantitative finance, of course you have, you know,

04:39.920 --> 04:47.280
large data sets, so the memory access is actually irregular, and on the other hand with

04:47.280 --> 04:54.760
HFT, it's predictable, because it's a mostly deterministic workload.

04:54.760 --> 05:04.520
I mean, that's the background, but the most important part is, you know, this

05:04.520 --> 05:12.920
is the most basic workflow, which you have in, let's say, in HFT, you know, you have an

05:12.920 --> 05:20.680
exchange server, then you have a data feed, you have the tick data, that is the FPGA side,

05:20.760 --> 05:27.080
and then you have the exchange server again. So, to cut it very short, I mean, the

05:27.080 --> 05:33.160
terminologies: tick data, that is, you know, a timed stream of signals, with which, basically,

05:33.160 --> 05:38.760
you know, you mutate the market state in hardware, so that is, you know, you can imagine

05:38.760 --> 05:46.920
that the stream is coming in from the SmartNIC, as a stream, you

05:47.720 --> 05:53.560
can say. And then you have an order, basically, that is a hardware-generated command packet,

05:53.560 --> 06:00.520
so that is the most basic terminology, which you can summarize in this FPGA part.

06:03.480 --> 06:13.960
So, when I say plumbing HFT, the main point was that, you know,

06:14.040 --> 06:21.640
the knowledge gatekeeping was mostly on the HFT part. In the next slide, what I would like to

06:21.640 --> 06:27.320
mostly focus on would be that, you know, which tools I have been using to

06:28.760 --> 06:34.200
understand the flow and how that can be implemented on FPGA.

06:37.480 --> 06:43.880
And when I say plumbing HFT, you know, my mental model was that, you know, you have a data feed,

06:43.880 --> 06:49.160
okay, that's fine, that's a stream coming in. What else did I want to look into? Okay:

06:49.160 --> 06:53.880
networking and high-speed I/O, memory, clock and timing, these three things: networking,

06:53.880 --> 07:02.200
memory, clock and timing. And in this networking, memory and clock timing, what I would like

07:02.200 --> 07:13.080
to see more of inside is, you know, you have the physical silicon, that is

07:13.160 --> 07:17.720
vendor specific, but then you can go into this MAC pipeline. So you have, you know,

07:17.720 --> 07:23.560
elastic buffers and zero-buffer design, so you have to really play with the buffers in the MAC pipeline.

07:24.520 --> 07:31.400
And then, understandably, the interconnects, you know, the Ethernet MAC,

07:33.160 --> 07:38.280
but most important in my experience was, you know, timing closure and

07:38.280 --> 07:43.640
clock domains, because I'll show you in the next slides that, you know, the CDC crossings are

07:43.640 --> 07:49.320
numerous in this HFT pipeline. And we know that, you know,

07:49.320 --> 07:56.840
you can't actually simulate the clock crossing, so this is where the open source tools

07:56.840 --> 08:03.240
become useful. And, yeah, the memory subsystem micro-architecture as well. So,

08:04.120 --> 08:10.920
also, there are some tools, which I will show on the next slide, open source tools,

08:10.920 --> 08:19.560
which help you to actually unfold or break down the memory modules on the FPGA.

08:19.560 --> 08:29.080
So that is what I was also looking forward to in this practice. So when I'm talking about

08:29.160 --> 08:37.480
the market data feed, you know, I mean, if I had to start with the tools,

08:37.480 --> 08:43.960
we knew that, you know, the L1 part, which is the physical layer, that is

08:43.960 --> 08:49.000
hard IP. So, you know, that is proprietary and tied to the silicon vendor, so that's basically

08:49.000 --> 08:57.720
not open RTL. What I was, understandably, more interested in was this

08:57.800 --> 09:05.960
L2/L3 layer, where I had to see, you know, what kind of tools exist which have open RTL for the

09:05.960 --> 09:13.320
L2 part and, understandably, L3, you know, where I can actually see the buffering, the minimized-buffering part.

09:14.600 --> 09:24.680
And so I actually started looking into the tools which have completely open

09:24.760 --> 09:35.640
RTL, so nothing proprietary. And even in the L2 layer, you have some vendor tools which are

09:36.760 --> 09:42.920
still proprietary, but you can customize them using their own tools; still, that does not serve

09:42.920 --> 09:49.400
the purpose. I actually wanted to see the actual open RTL where you can play with,

09:50.360 --> 10:03.640
let's say, the PCS/PMA and MAC pipelines. So, I was using mostly, you know, just as background,

10:03.640 --> 10:11.560
when I started using some open RTL for L2/L3: AMD,

10:11.560 --> 10:19.480
it's basically a vendor, but their project OpenNIC is basically open RTL. So, you can

10:19.480 --> 10:25.720
actually — that is also part of my demo if time allows. So, you can actually implement the

10:25.720 --> 10:31.800
MAC part inside the FPGA, and you have your own user logic in SystemVerilog.

10:31.800 --> 10:39.720
You can try that, you can simulate that, you can try different

10:39.720 --> 10:45.480
frequencies, like 350 MHz for an UltraScale FPGA, and 250 MHz, also for UltraScale.

10:47.160 --> 10:53.800
NetFPGA — I think most of the people in this room may already know it's an old one. I tried it.

10:53.800 --> 11:00.120
I mean, of course, you know, NetFPGA is a very, very powerful platform

11:00.120 --> 11:07.960
for the Virtex-7 and other FPGAs, but it's somewhat outdated, right? I mean, it's good

11:08.040 --> 11:16.040
for prototyping, and, having said that, it's still very helpful.

11:16.040 --> 11:22.120
The other tool which I came across was this SNAT open core. My understanding with that open core

11:22.120 --> 11:31.480
was also that the RTL which exists is around, like, nine, ten years old. So, I don't know

11:31.560 --> 11:37.320
how much I believe in it; I never tried that. The most important was Corundum,

11:37.320 --> 11:45.080
and I'm quite sure people in this room already know about Corundum, because

11:45.080 --> 11:51.480
— I always say, you know, we owe a lot to Corundum, because that is

11:51.480 --> 12:00.280
the niche open source project which exists where you can actually, you know, tailor your RTL

12:01.000 --> 12:08.280
for the L2 and even the L3 layer. And there was also a talk on Corundum at FOSDEM,

12:08.280 --> 12:18.440
by the main contributor. Also, for me, what was helpful was, you know, the video

12:18.440 --> 12:22.840
that existed, and also the code walkthrough for Corundum was very helpful.

12:23.800 --> 12:29.880
It's basically, like I mentioned, niche, and it's complex using Corundum.

12:32.600 --> 12:39.720
And it's basically an open NIC platform. So, what I

12:40.920 --> 12:49.560
experienced in recent days was that there has been an effort to actually make it easier to use.

12:49.720 --> 12:57.240
That is basically the fpga.ninja Taxi project. So, that comes from the Corundum

12:57.240 --> 13:06.520
code base, but the main team has broken down the complex parts of Corundum into more

13:07.320 --> 13:15.880
reusable IP. And I can't recall the license, but of course,

13:16.520 --> 13:26.040
understandably, it's an open source license. This is one I was playing with in recent weeks

13:26.040 --> 13:33.960
or months. This one, as I mentioned, is OpenNIC. It's completely

13:33.960 --> 13:45.000
based on SystemVerilog RTL. So, I just have a snapshot of only, like, the

13:45.080 --> 13:51.880
QDMA interface and CMAC, and you can develop your own RTL in the user logic.

13:53.720 --> 14:01.160
It works. But what I still haven't done is, you know, I haven't implemented any

14:01.240 --> 14:08.920
user logic, which is the trading logic, or financial application, in

14:09.560 --> 14:18.440
OpenNIC so far. But someone who really wants to break down those computations and

14:18.440 --> 14:24.520
understand what's happening inside the silicon logic, you can use OpenNIC; that was,

14:24.520 --> 14:33.480
for me, what was helpful. This is Corundum. So, I have been using this

14:33.480 --> 14:43.480
Taxi platform, but I shared the Corundum, yeah, high-level view, because it just tells

14:43.560 --> 14:57.640
you that you can actually fine-grain control the hardware queues, and also, which is important,

14:57.640 --> 15:06.680
is that even if you have, like, an UltraScale or Virtex-7 FPGA from AMD, or Intel,

15:06.760 --> 15:15.080
you can use that with your own L2 stack. So that is basically — like, this is,

15:15.080 --> 15:21.800
I have to mention that I have cited this from the Corundum page as well,

15:21.800 --> 15:29.480
from the page only. So, I'm using more of this Taxi part. So, what I —

15:29.560 --> 15:39.960
because, like I mentioned, Corundum is a bit more complex. So, I mean,

15:39.960 --> 15:46.200
for me, it was overkill if I had to just run, like, an HFT application. So,

15:47.080 --> 15:55.000
luckily, the contributors of Corundum have released this Taxi project,

15:55.320 --> 16:03.640
which basically gives you reusable IPs in terms of, you know, an Ethernet MAC, and then you can use the

16:04.920 --> 16:14.280
AXI IPs as well, you can use timestamping. So, what I did was take the bare minimum

16:14.280 --> 16:22.680
inside the Taxi project which you need to deal with the L2 and L3 layer RTL, and,

16:22.760 --> 16:33.000
understandably, cocotb — mainly I'm using cocotb for the regression verification.

16:33.320 --> 16:41.400
And Verilator — it's also an open source tool — I'm using it for the more cycle-

16:41.400 --> 16:49.240
accurate simulation of the FPGA design. So, if someone is interested in using, you know,

16:49.960 --> 16:54.680
SmartNICs like Corundum, and it's complex, you can actually break it down to this tool chain;

16:55.560 --> 17:03.560
you can really, really get started with the SmartNIC computations.

17:05.720 --> 17:18.520
Parsing. So, I mean, apart from the financial terms, what was the most important part when I

17:18.600 --> 17:25.800
started implementing the parsing part inside that architecture? And why, I mean,

17:27.080 --> 17:36.680
why basically — what is actually common in FPGA HFT — and why I'm using these

17:37.080 --> 17:41.000
exchange-native binary protocols. Because, you know, the point is that this is a hobby project.

17:41.000 --> 17:47.640
So, you have a lot of protocols, and for me, it was important that, you know, I need to see,

17:47.720 --> 17:56.200
which is the most implementable in a less complex way. So, I chose this

17:56.200 --> 18:04.360
exchange-native binary protocol. Why? Because, you know, understandably,

18:04.360 --> 18:09.960
I had to see, you know, which has a fixed or semi-fixed binary layout, because I don't want

18:11.080 --> 18:16.920
complex binary layouts, and I wanted to have known schemas, meaning that, you know,

18:17.960 --> 18:26.360
I don't want to really go into the nitty-gritty details of the financial application. I just want to

18:26.360 --> 18:35.400
get started on the FPGA. And, yeah, no dynamic fields — that makes the implementation easy,

18:35.480 --> 18:49.480
because, of course, if it's static, it's also easier to reconfigure on the FPGA part.

18:49.480 --> 18:56.440
And then deterministic, because that is common. So, in any case of parsing,

18:56.440 --> 19:03.640
you need a deterministic protocol. So, that's why I chose that protocol. Easy to

19:03.720 --> 19:08.920
pipeline in RTL. So, also, when I was choosing this protocol, I was mostly

19:08.920 --> 19:17.400
interested in whether, if I have to use this open RTL, and if I have to have a POC which

19:17.400 --> 19:26.520
implements this exchange-native binary protocol, I can also reduce, or have an acceptable, tick-to-trade

19:26.600 --> 19:36.040
latency. I needed an easy protocol so that it's easy to pipeline as well inside the RTL.
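To make "fixed binary layout, known schema, no dynamic fields" concrete, here is a software sketch of parsing a hypothetical exchange-native "add order" message. The field names, widths and offsets are made up for illustration, not any real exchange spec; in hardware, the same fixed offsets simply become byte slices of the bus word.

```python
import struct

# Hypothetical fixed-layout "add order" message: every field at a fixed offset,
# big-endian, no optional or variable-length fields -- the property that makes
# it cheap to parse in a fixed-latency FPGA pipeline.
ADD_ORDER = struct.Struct(">cQIQc")  # msg_type, order_id, shares, price, side

def parse_add_order(payload: bytes) -> dict:
    msg_type, order_id, shares, price, side = ADD_ORDER.unpack(payload)
    return {"type": msg_type, "order_id": order_id,
            "shares": shares, "price": price, "side": side}

raw = ADD_ORDER.pack(b"A", 42, 100, 10150, b"B")
msg = parse_add_order(raw)
assert msg["order_id"] == 42 and msg["side"] == b"B"
```

Because the layout is static, the parser never has to compute offsets at runtime — exactly the determinism the speaker wants for the RTL pipeline.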

19:39.880 --> 19:48.840
So, when we say parsing, in most cases, what I understood was, you know,

19:48.840 --> 19:55.720
what are the core design principles? So, you have this parsing module and what I want to achieve

19:55.800 --> 20:01.720
on the FPGA? So, one thing was, you know, it should have a zero-buffer

20:01.720 --> 20:10.360
cut-through, meaning, you know, when you have the market feed

20:11.480 --> 20:19.640
coming into the FPGA, you need the data path optimized for field extraction. This is what I was

20:19.640 --> 20:25.320
targeting. And then, what I also wanted, as I mentioned on the last slide, was

20:25.720 --> 20:31.000
pipeline determinism — you know, I need to have this fixed-latency execution part. So,

20:31.800 --> 20:41.000
this was the one core principle I was targeting in the FPGA implementation.

20:41.000 --> 20:49.000
So, it's mostly streaming — not mostly, always streaming. So, it's pipelined FSMs;

20:49.960 --> 20:56.200
in any parsing, it would be pipelined FSMs, and the barrel shifter. So, that

20:56.200 --> 21:05.720
was the most common computation when doing this parsing. So, you have this, you know,

21:05.720 --> 21:14.440
realignment within the AXI-Stream bus. So, when I was using this

21:15.400 --> 21:25.320
open RTL, I had this thing: okay, when I have to implement the parsing part,

21:25.320 --> 21:32.680
I need these multi-stage, you know, barrel shifters, because they help you with this

21:33.400 --> 21:38.200
so-called dynamic field realignment. So, that was the one parsing part.
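The multi-stage barrel shifter idea can be modeled in software: the bus word is rotated by a byte offset in log2(N) stages (shift by 1, 2, 4 bytes), each stage enabled by one bit of the offset. The 8-byte bus width and the staging below are illustrative, not the exact RTL.

```python
def barrel_shift_bytes(word: bytes, offset: int) -> bytes:
    """Rotate an 8-byte bus word left by `offset` bytes using log2 stages,
    mirroring how a multi-stage barrel shifter is built from mux layers."""
    assert len(word) == 8 and 0 <= offset < 8
    data = word
    for stage, shift in enumerate((1, 2, 4)):   # one mux layer per offset bit
        if offset & (1 << stage):               # is this stage enabled?
            data = data[shift:] + data[:shift]  # rotate by 1, 2 or 4 bytes
    return data

w = bytes(range(8))                             # 00 01 02 ... 07
assert barrel_shift_bytes(w, 0) == w
assert barrel_shift_bytes(w, 3) == bytes([3, 4, 5, 6, 7, 0, 1, 2])
```

In hardware, each stage is just a layer of 2:1 muxes, so the latency is fixed regardless of the offset — which is what makes it a good fit for the deterministic pipeline described above.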

21:39.160 --> 21:51.400
This is the parser FSM. So, I mean, it's like the most common FSM. So, basically,

21:52.120 --> 21:59.800
it starts getting the payload, and then you get the bytes inside the parsing module

21:59.800 --> 22:06.600
from the network part, and then it detects, you know, what type of message it is, and you have

22:06.680 --> 22:14.760
common message types, like, you know, cancel, add message. So, that is implemented

22:14.760 --> 22:25.720
inside the parsing module, which is interfacing with the L2 layer. So, I took this

22:25.800 --> 22:35.960
image because that is also an HFT accelerator project which uses open source

22:35.960 --> 22:44.440
RTL. So, just to understand, you know, what kind of FSM, what the state machine inside

22:45.880 --> 22:52.040
the parsing module is.
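The state machine just described — idle, detect the message type, collect the fixed-length payload, emit — can be sketched as a small software model. The state names, the type bytes and the per-type lengths here are illustrative, not the exact states of the cited project.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()     # waiting for a message-type byte
    PAYLOAD = auto()  # collecting a fixed-length body

# Illustrative message lengths by type byte (not a real exchange spec):
MSG_LEN = {ord("A"): 4, ord("X"): 2}   # 'A' = add order, 'X' = cancel

def parse_stream(stream: bytes):
    """One byte per 'cycle', like a byte-wide parsing FSM fed from the MAC."""
    state, msgs, buf, need = State.IDLE, [], bytearray(), 0
    for b in stream:
        if state is State.IDLE:
            if b in MSG_LEN:            # message-type byte detected
                need, buf = MSG_LEN[b], bytearray([b])
                state = State.PAYLOAD
        elif state is State.PAYLOAD:
            buf.append(b)
            if len(buf) == 1 + need:    # full fixed-length message collected
                msgs.append(bytes(buf))
                state = State.IDLE
    return msgs

msgs = parse_stream(b"A\x01\x02\x03\x04X\x05\x06")
assert msgs == [b"A\x01\x02\x03\x04", b"X\x05\x06"]
```

Because every message type has a fixed length, the FSM needs no backtracking and advances one state per byte — which in RTL translates to a fixed number of cycles per message.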

22:52.360 --> 23:03.080
Memory parts. So, when it comes to the order book, well, there's only one main data structure

23:03.080 --> 23:10.280
which actually affects your, you know, latency. So, what you're mostly dealing with is

23:10.280 --> 23:16.600
this memory and combinational logic, you know. So, you have pipelining and streaming.

23:16.600 --> 23:27.080
In most cases, which I can show on the next slide, you know, it's the BRAM.

23:27.080 --> 23:33.720
So, you mostly play with BRAM, because that is more than enough. And when I mentioned

23:33.720 --> 23:40.120
memory and combinational logic, meaning, understandably,

23:40.280 --> 23:46.600
the two basic things which you're dealing with on the FPGA silicon when it comes to order

23:46.600 --> 23:56.200
book architecture. Yeah, like I mentioned, that is the on-chip BRAM memory — mostly playing

23:56.200 --> 24:02.600
with BRAM. You can also use UltraRAM, but that is device specific, so not every FPGA

24:02.600 --> 24:12.040
has UltraRAM. So, when it comes to memory, mostly they're using BRAMs. And

24:12.840 --> 24:21.400
external memory is overkill, because we're not bound by capacity inside the order book module. So,

24:21.400 --> 24:28.600
you can implement, you can do anything with BRAM for order books.
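The BRAM-backed order book can be modeled as a fixed-size array indexed by price level, which is essentially what a direct-mapped on-chip memory gives you: one read-modify-write per update, fixed depth, no pointer chasing. The depth, tick scaling and index mapping below are illustrative assumptions, not a production book.

```python
DEPTH = 1 << 12   # 4096 price levels (illustrative, fits easily in BRAM)
TICK = 1          # price-to-index scale (illustrative)

class LevelBook:
    """Direct-indexed price-level book: each slot maps onto one BRAM word."""
    def __init__(self):
        self.qty = [0] * DEPTH

    def add(self, price: int, shares: int):
        # In RTL this is one BRAM read-modify-write at a computed address.
        self.qty[(price // TICK) % DEPTH] += shares

    def cancel(self, price: int, shares: int):
        idx = (price // TICK) % DEPTH
        self.qty[idx] = max(0, self.qty[idx] - shares)

book = LevelBook()
book.add(100, 500)
book.add(100, 200)
book.cancel(100, 300)
assert book.qty[100] == 400
```

The point of the array shape is latency determinism: every add/cancel touches exactly one memory location, so the update path has the same cycle count regardless of book state.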

24:29.160 --> 24:39.880
Since we are running out of time — so, yeah, actually this is one project

24:39.880 --> 24:51.240
which dates back to 2016, 2017; they implemented the complete tool chain, except, I think,

24:51.960 --> 24:59.480
the Ethernet part, in C++ high-level synthesis. So, what I have been doing recently

24:59.480 --> 25:05.240
is — you know, that project was implemented around 2016, 2017 — so, what I do now is,

25:06.280 --> 25:14.040
I try to use some other compiler optimizations which did not exist at that time,

25:14.120 --> 25:22.440
to see if we can achieve better latency in this pipeline. So, this is one part, which I actually

25:22.440 --> 25:33.160
wanted to show in the code walkthrough, but yeah, I can also make a video later on and put it there.

25:33.160 --> 25:39.480
So, I mentioned at the beginning of the talk, you know, what kind of

25:39.560 --> 25:46.200
open source memory modules I was looking into. I mean, I did not use

25:46.200 --> 25:55.000
anything so far regarding the BRAM or memory modules, but I found the following projects,

25:55.000 --> 26:00.920
which were really interesting to see, you know, how you can use these,

26:01.640 --> 26:11.640
a couple of projects which have memory modules where, for instance, you know,

26:11.640 --> 26:22.520
you can implement memory banks where traditionally you are dependent on the proprietary tools.

26:22.520 --> 26:28.200
So, yeah, I think time is up. Timing closure — that is

26:29.160 --> 26:35.560
a hard problem, and these are the tools which I was using

26:35.560 --> 26:43.800
for CDC. So, that's it. Yeah, the last thing is that I have been actually fixing bugs

26:43.800 --> 26:50.440
in one open source tool, which is from AMD, but it's open source. So, what I am

26:50.440 --> 26:57.880
actually doing in this tool is making it easier for users to use, because

26:57.960 --> 27:04.360
if you go into this project, you'll find out that it's really tough to even get started.

27:04.360 --> 27:11.560
So, this is what I wanted to show as well in a demo, but yeah, it's on

27:11.560 --> 27:18.840
GitHub. So, I'll release that, and you can see it. That's it. Yeah. Thank you.

