WEBVTT

00:00.000 --> 00:22.400
All right, hello everybody, my name is Josse and I will be talking

00:22.400 --> 00:27.560
today about generating programmable NPUs from linalg with MLIR

00:27.560 --> 00:33.480
and CIRCT. I only have 20 minutes, so I can't show you everything, but I'll do my best.

00:33.480 --> 00:38.000
So I'm Josse, I'm currently a PhD researcher at KU Leuven in the MICAS research group

00:38.000 --> 00:42.920
and I've been working on compilers for AI hardware accelerators since 2021.

00:42.920 --> 00:48.200
Fun fact, I'm also looking for a job at the end of this year, so if you know anybody who

00:48.200 --> 00:55.400
has work, then send them my way. So let's briefly go over the agenda for this talk, so what

00:55.480 --> 01:01.960
will you learn in this talk? What are NPUs? How do people design NPUs right now, or as far

01:01.960 --> 01:07.320
as I'm aware of how people design NPUs right now? And how should people design NPUs in the

01:07.320 --> 01:14.760
future, according to my very opinionated view? Spoiler: it's going to be with compilers.

01:16.440 --> 01:22.360
So NPUs are processors that are optimized for dense linear algebra. If you're not quite sure

01:22.440 --> 01:28.040
what dense linear algebra is: it's usually a bunch of for-loops with some arithmetic

01:28.040 --> 01:33.000
on a bunch of data. So in this case, for example, this is the C code for an element-wise multiplication.
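
A minimal sketch of such an element-wise multiplication kernel in C (hypothetical code, not necessarily what was on the slide):

```c
#include <assert.h>

// Element-wise multiplication: one for-loop with one multiply per element --
// the simplest kind of kernel an NPU is built to accelerate.
void eltwise_mul(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] * b[i];
    }
}
```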

01:33.720 --> 01:39.080
It's quite simple: there's a multiplication and there's the for-loop. There are other kernels as well,

01:39.080 --> 01:44.280
like a matrix multiplication. There are the for-loops and there's a multiply-and-accumulate.
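
As a sketch (hypothetical code, assuming row-major layout), the matrix multiplication kernel is three nested for-loops around a multiply-accumulate:

```c
#include <assert.h>

// Matrix multiplication C = A * B: A is m x k, B is k x n, C is m x n.
void matmul(const float *a, const float *b, float *c, int m, int k, int n) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++) {
                acc += a[i * k + p] * b[p * n + j];  // multiply-accumulate
            }
            c[i * n + j] = acc;
        }
    }
}
```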

01:44.920 --> 01:50.600
And recently there's also the softmax layer that's super popular, for example, in large language

01:50.680 --> 01:54.280
models. So there are a lot of different operations happening there.

01:56.840 --> 02:01.720
I think what is quite important for you to know at this stage of the talk is that in MLIR,

02:02.600 --> 02:08.760
we typically represent this in the linalg dialect, and so you'll see that at a much higher level of

02:08.760 --> 02:14.360
abstraction, you will be representing kind of the same thing. So the for-loops are represented

02:14.440 --> 02:20.120
as affine maps. And then the operations themselves are actually represented in the various

02:20.120 --> 02:26.120
dialects that MLIR has, for example a floating-point multiplication in this case.

02:27.640 --> 02:36.280
Now, what is an NPU? There are many examples of NPUs: you can see the Google Tensor,

02:36.280 --> 02:43.800
sorry, Tensor Processing Unit v4; the Apple M1 and all the processors after it in that series are supposed to

02:43.800 --> 02:51.800
also have this Neural Engine. Basically, if you're buying a laptop today or a phone or whatever,

02:51.800 --> 02:57.640
chances are that you're getting one of these NPUs. So you might be wondering: are they useful?

02:58.520 --> 03:04.040
Well, according to research, and I'm mainly doing research, our stance on this is: of course.

03:04.520 --> 03:10.120
There's this graph on the left that shows you that a lot of things that we have been trying to

03:10.120 --> 03:16.600
do are currently not possible anymore. So we cannot scale up the frequency anymore, scaling up

03:16.600 --> 03:21.320
single-thread performance is also difficult, scaling up power is difficult, and the number of

03:21.320 --> 03:27.080
logical cores is also difficult. But we can keep adding more transistors, so there's more area

03:27.080 --> 03:34.520
for specialization, essentially. And then on the right side, there's this plot that shows you

03:34.520 --> 03:42.280
the performance-per-watt increase when, in 2017, they compared the TPU v1 against various

03:42.280 --> 03:48.120
configurations of CPUs and GPUs, and you can see that for this example, specifically, they get a

03:48.120 --> 03:54.920
200x increase. I should note that this is for a configuration that was not actually implemented; it's still research.

03:55.800 --> 04:01.720
But in reality, are they really useful? Yes, but in my opinion it's mostly for marketing,

04:01.720 --> 04:07.640
because in reality, it's quite difficult to use them. These units, they can be too specific

04:07.640 --> 04:13.000
for newer workloads. So maybe they were designed two years ago, and two years ago they were

04:13.000 --> 04:16.840
ideal for the AI workloads of the time, but people have moved on, there are newer AI

04:16.840 --> 04:22.520
workloads, and they just don't work on these old accelerators anymore. Also, most compilers cannot

04:22.600 --> 04:27.320
leverage them directly. So if you send that C code into Clang and expect it to work on

04:27.320 --> 04:32.360
your Apple Neural Engine, well, it will probably never end up there. And also, the hardware,

04:32.360 --> 04:38.040
firmware, and libraries are often closed source. So tough luck working with these things.

04:39.480 --> 04:44.360
So in principle, there are multiple interdependent problems. There are new algorithms

04:44.360 --> 04:49.640
for which we fail to quickly create new hardware. For the new hardware that

04:49.640 --> 04:54.920
we do create, it takes too long to ship compiler support. And for people to actually use our devices,

04:54.920 --> 04:58.520
there's this closed-source issue: people make these new things, and they're like, "oh, this works

04:58.520 --> 05:02.280
really well", and then they won't share it with you. So this is kind of annoying.

05:03.480 --> 05:10.200
And so maybe before we think about any solution, it's useful to think about how we ended up

05:10.200 --> 05:15.800
with this problematic situation. So how do people typically create NPUs? In the beginning,

05:15.880 --> 05:20.360
there's this time where people look at the different algorithms and they start thinking,

05:20.360 --> 05:26.440
like, it would be cool if we could offload this specific part of an algorithm to an accelerator.

05:26.440 --> 05:32.040
Then let's say two months later, they start working on the hardware design. Then six months later,

05:32.040 --> 05:36.200
they're far enough into the hardware design that they can come up with some kind of ISA specification.

05:36.760 --> 05:42.600
And then six months later, after it has properly settled, library and compiler development

05:42.680 --> 05:48.680
can start. Then, another year later maybe, if you're lucky, you can start shipping it to your customers.

05:49.720 --> 05:54.040
But I should note that these are very optimistic estimates. And it's still taking two

05:54.040 --> 05:58.840
plus years to go through this full cycle. And if by that time someone has come up with

05:58.840 --> 06:02.600
a new algorithm, then all of the work is for nothing, because we're starting from scratch again.

06:03.320 --> 06:09.080
So this is way too slow. The question is, can we do this faster? And the answer in my opinion is,

06:09.080 --> 06:14.280
yes, with compilers, we can do this thing, which I want to call programmable hardware synthesis.

06:16.040 --> 06:20.040
But how? Well, we have all of these really cool compiler technologies that we can leverage

06:20.040 --> 06:25.480
to do this. So how do we quickly create a machine learning compiler? We have MLIR for that.

06:26.520 --> 06:30.520
How do we quickly generate hardware? Well, we can use the CIRCT project, which is a counterpart

06:30.520 --> 06:36.280
of MLIR that specifically focuses on hardware design. How do we adapt the compiler to the hardware?

06:36.280 --> 06:41.960
Well, we dynamically register new backends for the created hardware. And how do we quickly

06:41.960 --> 06:46.360
develop this? Well, that is by using a few secret shortcuts, which I will share with you later.

06:46.360 --> 06:50.200
And of course, everything is happening fully open source, which is cool. So everybody can

06:50.200 --> 06:56.920
hack on this. Now, what are we trying to create? So I think it's useful if we go over the

06:56.920 --> 07:05.320
typical anatomy of an NPU platform. Typically, what you have is a bunch of memory at the top.

07:05.400 --> 07:11.000
Connected to the memory is a CPU. Usually you use the CPU to also control your NPU, so the

07:11.000 --> 07:19.320
NPU is never really alone. And then there are these processing elements. Usually

07:19.320 --> 07:24.360
we use a bunch of them, so you can fully exploit the parallelism of your workload. And then to get

07:24.360 --> 07:28.760
data into the processing elements, you have a load unit, or rather a load unit that

07:28.760 --> 07:32.920
loads in a lot of data. And then there's also typically a store unit that stores the data back

07:33.000 --> 07:39.400
once the processing elements are done with it. And how do we quickly create this? Well,

07:39.400 --> 07:45.240
what I'm proposing is this programmable hardware synthesis flow in which we basically just create

07:45.240 --> 07:51.240
one giant loop to do the full design of everything. So we start off with the supported operations.

07:51.240 --> 07:55.560
So the things that we want to make an accelerator for, the things that we think are interesting to

07:55.560 --> 08:02.520
accelerate. It's still an open question which operations are actually useful to choose. But I hope

08:02.520 --> 08:08.520
that by making this quicker we can find that out sooner. Then out of the supported operations,

08:08.520 --> 08:14.520
we generate this PE, or processing element, array. And then we stick it together with the

08:14.520 --> 08:21.080
load unit, the store unit, and the CPU. And then we can actually create a simulation of the platform.

08:21.080 --> 08:25.880
And then we have this machine learning compiler, which also ingests the backend information,

08:25.880 --> 08:30.520
which we also get from the supported operations in MLIR. And then we use that to create a binary

08:30.600 --> 08:36.200
which you can run in the simulation of this platform. With this, we solve all of the issues that we

08:36.200 --> 08:42.680
have. Well, for new algorithms, we can just create new PEs, or at least at the design stage we can

08:42.680 --> 08:50.280
play around with multiple new PEs. So that's not really an issue. For new hardware, there's also

08:50.280 --> 08:55.160
no issue with compiler support, because it's also happening in the loop. And everything is open

08:55.160 --> 09:01.480
source. So everybody can just play around with it, which is a lot of fun. And what I want this

09:01.480 --> 09:07.160
to work in is two minutes, as opposed to two years and two months. So we'll see if we get there.

09:07.720 --> 09:14.520
I'm still working on it. And because I can obviously not share everything with you today,

09:14.520 --> 09:21.640
we'll focus a bit more on this concept of PE array conversion. So to remind you,

09:21.640 --> 09:28.760
those are the processing element parts of the NPUs. And so what we're going to do, in more detail,

09:28.760 --> 09:33.880
is we have this list of supported operations, which we typically express in linalg generics.

09:33.880 --> 09:40.520
So you have linalg generic A and linalg generic B. Then you convert each of these linalg generics into actual

09:40.520 --> 09:48.760
processing elements in a dialect of MLIR that I created myself. Then there is a way to aggregate

09:48.760 --> 09:56.440
these PEs into one PE that supports multiple of these operations. And then you just replicate them

09:56.440 --> 10:01.800
in a way that fits your parallelism. And then you end up with an NPU that supports

10:01.800 --> 10:08.120
linalg generic A and linalg generic B. And if the source file that you use, so if the software

10:08.120 --> 10:14.600
workload that you want to run on your hardware, contains the same supported operations as the

10:15.480 --> 10:20.120
supported operations that you used to generate your hardware, well, then it's guaranteed to work on this NPU.

10:21.400 --> 10:30.520
So let's go over some examples. Simple example number one: the task is to create a

10:30.520 --> 10:36.920
four-way parallel processing element array that supports element-wise operations, namely addition,

10:36.920 --> 10:43.960
subtraction, multiplication, and bitwise exclusive or. And so we start off with these workloads,

10:44.040 --> 10:50.360
so you can see that there are multiple linalg generics, for multiplication, addition, subtraction, and

10:50.920 --> 10:57.320
XOR. In the next step, what we do is generalize all of these operations, so we get a linalg generic

10:57.320 --> 11:02.920
for each of these different operations. And then we will create a PE for each of these

11:02.920 --> 11:11.000
operations. Then we aggregate all of these PEs into one PE, and then we create the four-way array.
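
A behavioral sketch of that aggregation step in C (hypothetical; the real generator works on MLIR, this only models the functional behavior of the merged PE):

```c
#include <assert.h>
#include <stdint.h>

// One aggregated processing element that supports all four element-wise
// operations from the example, selected by an opcode. How the hardware
// generator multiplexes these in the actual circuit is up to the tool.
typedef enum { OP_ADD, OP_SUB, OP_MUL, OP_XOR } pe_op;

int32_t pe_compute(pe_op op, int32_t a, int32_t b) {
    switch (op) {
        case OP_ADD: return a + b;
        case OP_SUB: return a - b;
        case OP_MUL: return a * b;
        case OP_XOR: return a ^ b;
    }
    return 0;
}
```

A four-way array would then just instantiate four of these PEs side by side.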

11:11.000 --> 11:18.120
And let's now go over an example where we actually deploy an algorithm on this simple accelerator.

11:18.120 --> 11:23.960
In this case, we use the same software as we used to create the hardware array.

11:24.840 --> 11:30.120
So first, we have the array that supports all of these operations.

11:30.120 --> 11:34.200
The first operation we encounter, maybe it's difficult to see in the back, is the addition.

11:34.920 --> 11:40.680
So we program it, red is the addition color obviously, we program it to run the addition.

11:41.320 --> 11:46.040
Then the next step is to do the subtraction. So we load in the data, we

11:46.040 --> 11:48.520
reprogram it to do the subtraction, and we send the data back.

11:49.720 --> 11:52.360
And so we do this for all the other operations as well.

11:53.320 --> 11:59.160
So if you look at the end, what you had to do was: you load data four times into your

11:59.160 --> 12:04.440
processing element array, you store data four times back into your memory, and you need to program

12:04.440 --> 12:08.360
the accelerator four times to choose all of these different operations.

12:09.080 --> 12:13.400
But being in MLIR, there are a few cool tricks that we can pull off.

12:13.400 --> 12:17.960
So one of these is something that people use sometimes in software as well, which is called

12:17.960 --> 12:26.200
operator fusion. What operator fusion essentially is: instead of fetching data from memory

12:26.200 --> 12:31.240
multiple times, you just fetch it once, and you do all of these subsequent operations on it

12:31.240 --> 12:36.840
before you send it back to memory. Obviously, that doesn't always work in each algorithm,

12:37.240 --> 12:43.480
and you might not always want to do it, because it can generate quite complex workloads.

12:43.480 --> 12:47.240
But in some cases you might, and because it's supported anyway, we can just try it.
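
As a sketch of what fusion means at the C level (hypothetical code, assuming the four element-wise operations from the example chained together):

```c
#include <assert.h>
#include <stdint.h>

// Unfused: four separate passes, each one reads and writes the full array,
// so the data travels to and from memory four times.
void unfused(const int32_t *a, const int32_t *b, int32_t *out, int n) {
    for (int i = 0; i < n; i++) out[i] = a[i] + b[i];    // pass 1: add
    for (int i = 0; i < n; i++) out[i] = out[i] - b[i];  // pass 2: subtract
    for (int i = 0; i < n; i++) out[i] = out[i] * b[i];  // pass 3: multiply
    for (int i = 0; i < n; i++) out[i] = out[i] ^ b[i];  // pass 4: XOR
}

// Fused: one pass, each element is fetched once, all four operations run
// on it, and only the final result is stored back.
void fused(const int32_t *a, const int32_t *b, int32_t *out, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = (((a[i] + b[i]) - b[i]) * b[i]) ^ b[i];
    }
}
```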

12:47.240 --> 12:54.840
So in this case, for example, this is the first workload. If you look more closely, you'll see

12:54.840 --> 13:00.200
that actually all of these workloads are connected. So you can fuse them together by

13:00.200 --> 13:04.760
running a few MLIR passes. And so now all of these workloads are actually contained in the same

13:04.760 --> 13:09.720
linalg generic. So all that's left to do is to go over this full linalg generic,

13:09.720 --> 13:19.080
convert it into one PE. Then we make another four-way PE array. Then, to deploy the workload,

13:19.080 --> 13:23.640
we just program it to run this thing. And now the performance would be: store data once,

13:24.600 --> 13:30.040
load data once, and program the accelerator once, which obviously helps with performance by a lot.

13:30.680 --> 13:37.720
But obviously, it leads to a trade-off. So we create a more specific accelerator

13:38.760 --> 13:46.760
for more performance. Or maybe this one is too specific, and it's not future-proof enough.

13:47.560 --> 13:54.680
So with this PHS system, there should be a way to try these things out more quickly.

13:55.000 --> 14:02.360
So you might be wondering, I told you about a few secret shortcuts. How did we create this?

14:02.360 --> 14:06.200
So I mean, how did we create this? It's not finished yet, we're still working on it. But like,

14:06.200 --> 14:11.960
how did we start working on it? Shortcut number one is that we use xDSL for quick

14:11.960 --> 14:18.040
prototyping. So for people who are not familiar with xDSL: xDSL is an open-source, MLIR- and

14:18.040 --> 14:22.520
also CIRCT-compatible framework written in Python. So you can very quickly

14:23.160 --> 14:28.040
prototype new MLIR dialects and combinations of compiler passes. You don't need to recompile each time, you

14:28.040 --> 14:33.800
can just run it immediately, and then do interactive debugging without any issues.

14:35.560 --> 14:41.720
You should totally check it out, it's a really cool project. Then shortcut number two is for the PE array.

14:42.600 --> 14:47.720
The PE array alone is not enough. You need load units, you need store units, memory, you need a CPU.

14:48.280 --> 14:54.680
For that, we use this SNAX platform that was developed at KU Leuven to create an NPU. That platform is also

14:54.680 --> 14:59.800
fully open source. It uses a RISC-V core to control everything, and all of the source

14:59.800 --> 15:02.600
code for the hardware is totally open source, so you can also check that out.

15:03.880 --> 15:10.440
And, well, before we started working on this programmable flow, we also worked

15:10.440 --> 15:18.760
on a flow to just compile to, let's say, hand-made accelerators, which we call

15:18.760 --> 15:24.440
the SNAX-MLIR software compilation toolchain, which is also written in xDSL and is also fully open source.

15:25.000 --> 15:29.880
And it's also the repository where you'll find the source code for the approach from today.

15:31.640 --> 15:37.240
And in the future, we want to overlay functionality on parts of the PE array. Right now we're just copying

15:37.240 --> 15:43.720
the PEs, but it could be possible, and maybe interesting, to choose different combinations of PEs.

15:44.280 --> 15:50.520
We also want to support multiple data flows. So sometimes you need to do reductions on your data,

15:50.520 --> 15:55.240
or, well, this was just a simple element-wise example, but you need to do multiple different things.

15:56.760 --> 16:01.240
Sometimes it makes more sense to group certain operations together, and other operations in

16:01.240 --> 16:05.720
different pools. In that case, you would generate different accelerators, which is also something

16:05.720 --> 16:11.560
we're working on. And then also support for multi-cycle data paths, especially for floating-

16:11.560 --> 16:15.800
point support, not necessarily because we like floating-point support in hardware, it's kind

16:15.800 --> 16:22.840
of a pain, but we just don't like quantization. So that's the main message there.

16:23.560 --> 16:30.440
So with that, I thank you very much for attending this talk, and here are the takeaways of this talk.

16:30.680 --> 16:32.200
If you have any questions, shoot.

16:41.320 --> 16:42.200
All right, yes, yeah.

16:42.920 --> 16:48.280
What do you think about going to the extreme of very model-specific accelerators,

16:49.320 --> 16:55.480
maybe going from high-level frameworks and IRs down to RTL, and so on?

16:56.120 --> 16:59.960
So the question is, what do I think about making very specific accelerators for models?

17:01.160 --> 17:06.760
I think making very specific accelerators for models is not very sensible, because it's

17:07.000 --> 17:12.840
like, making an ASIC, or taping out an ASIC, which is the goal of this work at some point,

17:12.840 --> 17:17.960
is super expensive. So just generating hardware for one specific model, I don't think it

17:17.960 --> 17:24.200
makes sense. The question that I think is more interesting is: what operations do you need to support

17:24.200 --> 17:29.560
a broader range of models without giving up too much of the specific performance you gain by

17:29.560 --> 17:34.280
going for more specific circuits? But I don't know the answer to that question, that's why I'm

17:34.280 --> 17:37.560
making this tool. Fair enough. Yes?

18:04.760 --> 18:12.120
So I guess the question is, is there a place for more SRAM on the NPU, because NPUs usually

18:12.120 --> 18:14.440
require more SRAM? I guess that's the question.

18:16.760 --> 18:21.240
So the SNAX platform, which is where the memory comes from, is quite

18:21.240 --> 18:26.280
parameterizable, so if you have more money, you can make the chip bigger, and you can add more memory.

18:34.280 --> 18:43.720
Yeah, so how programmable is the resulting hardware? I think currently it's quite simple,

18:43.720 --> 18:49.000
so it just generates kind of like functional units, and then wraps the rest of the platform around them.

18:50.120 --> 18:56.840
But depending on how you combine them together, you can squeeze things together and offload it.

18:57.800 --> 19:04.680
And then, combined with the load and store units that we have, that actually determines

19:05.320 --> 19:10.440
how many dimensions your input data can have, but those are design-time parameterizable.

19:16.440 --> 19:20.440
So that depends on which supported operations you give it. So the question was: how

19:20.440 --> 19:23.960
do you decide on the operator fusion? If you give the operations as fused

19:23.960 --> 19:29.560
operations in the supported operations, then it will create a fused accelerator out of that.

19:29.560 --> 19:34.360
If you supply both of them, then it should generate, like, both the unfused and the fused

19:34.360 --> 19:39.880
ones. It should generate an accelerator that supports both unfused and fused, and it should automatically

19:39.880 --> 19:43.640
detect that, that's kind of the goal. Yes?

19:43.640 --> 19:47.880
Can I just pick up on that first question, but turned around? You said you don't want

19:47.880 --> 19:54.200
model-specific things. If I go and try to accelerate something now, I might make a convolution accelerator,

19:54.200 --> 19:59.880
but it does matrix multiplications and convolutions as well. What's the approach that you take?

19:59.880 --> 20:05.080
If you're coming from a single linalg operator, how do you say, oh, there are these three operators

20:05.080 --> 20:09.480
that belong together, and I can handle them, so we'll actually accelerate all three in one

20:09.480 --> 20:15.880
accelerator instead of just having three? Sorry, so can you repeat the question?

20:15.880 --> 20:19.080
So, you were asking, like the first question, if you already have a convolution accelerator?

20:26.280 --> 20:28.680
On getting more generic accelerators?

20:28.680 --> 20:33.080
Yeah, so, say you've got three linalg operators.

20:33.080 --> 20:38.760
Yeah. If they're related, you can have a single accelerator.

20:38.760 --> 20:46.520
Yeah. I think that is, so I think the question is, how do you aggregate these accelerators?

20:46.520 --> 20:50.040
Yes. So, whether you have one accelerator or three different accelerators for each

20:50.040 --> 20:57.000
new operation you add. I think part of that is in the way we support the various data flows.

20:57.000 --> 21:01.720
So, the load and store units of SNAX are quite programmable, so they will support many different things.

21:02.600 --> 21:07.640
And then, currently, the aggregation is quite limited, but that's something I'm

21:07.640 --> 21:12.680
working on to make it possible so you can aggregate more of these, even if they have different data flows.

21:12.680 --> 21:13.480
For example.

21:13.480 --> 21:17.160
Thank you. Thank you.

