WEBVTT

00:00.000 --> 00:14.680
We are going to explain what this title means, hopefully by the end of the talk.

00:14.680 --> 00:16.400
So this is the agenda.

00:16.400 --> 00:18.640
Is the microphone too high?

00:18.640 --> 00:19.640
No.

00:19.640 --> 00:20.640
Just fine.

00:20.640 --> 00:25.840
So, we are going to explain, we are going to introduce the ET Minion, what it is, by

00:25.840 --> 00:31.120
describing the architecture and the custom extensions, and then some very brief conclusions.

00:31.120 --> 00:38.000
About me: well, this slide hasn't changed since last year, except this line and this line.

00:38.000 --> 00:39.000
Oh.

00:39.000 --> 00:40.000
Right.

00:40.000 --> 00:48.280
So, the only thing is that now I am at Ainekko, and I have zero free time to do my

00:48.280 --> 00:50.800
personal projects.

00:50.800 --> 00:53.040
So what is Ainekko and why am I here?

00:53.040 --> 00:57.760
So it's an early-stage startup; we are really trying to take the word "open" seriously when

00:57.760 --> 01:03.800
it comes to AI, and among other things we started AI Foundry; you will

01:03.800 --> 01:05.920
find it at that link.

01:05.920 --> 01:11.480
We famously, I think, for many people, acquired the Esperanto Technologies IP, and then we

01:11.480 --> 01:13.320
have been open-sourcing it.

01:13.320 --> 01:17.760
We don't stop there, and in every discussion that we have internal to the company, open

01:17.760 --> 01:22.560
source hardware, what it means, and how we are going to do it is the kind of thing that takes 90%

01:22.560 --> 01:25.400
of the discussion.

01:25.400 --> 01:28.400
So what does the ET in ET Minion mean?

01:28.400 --> 01:33.960
What ET stands for is a bit of a mystery, but essentially it's the prefix that was used everywhere

01:33.960 --> 01:38.640
throughout the architecture and the technology when it was open sourced, because all the code

01:38.640 --> 01:42.800
referred to Esperanto Technologies, so okay, let's call it ET.

01:42.800 --> 01:47.520
On this slide there are various links, just to show how open we are; this

01:47.520 --> 01:52.520
one, for example, is for the emulator. But the thing that is still important

01:52.520 --> 01:57.560
for this specific case is that here you are going to find all the manuals, and even

01:57.560 --> 02:02.680
the schematics of the board you are going to see soon.

02:02.680 --> 02:09.040
So why am I here? Good question.

02:09.040 --> 02:13.680
What I wanted to do is, now that we have open sourced this, we got the

02:13.680 --> 02:19.120
IP, we saw the amount of work that was done by Esperanto, and we open sourced it, I

02:19.120 --> 02:23.680
thought this was the perfect room to actually discuss this architecture and what they

02:23.680 --> 02:29.440
did, and the things that I like; but I'm not going to express opinions in this talk, by the way.

02:29.440 --> 02:33.280
This is not going to be a comparison between existing proposals in the direction of the

02:33.280 --> 02:39.040
RISC-V vector extension (we already had a talk about that), and it's not a declaration that these extensions,

02:39.040 --> 02:45.440
or whatever is in the original design, are absolutely perfect; far from it, and you

02:45.440 --> 02:47.920
will see why.

02:47.920 --> 02:52.000
So in order to start doing this we need to start talking about the architecture, because

02:52.000 --> 02:55.760
you cannot really appreciate an ISA extension without describing the architecture for which it

02:55.760 --> 02:57.000
was created.

02:57.000 --> 03:04.800
So here's the board. On the board, behind this very optimistic heatsink, you will

03:04.800 --> 03:11.200
find the ET-SoC-1 chip, you find the LPDDR4, which is the most valuable thing on the whole

03:11.200 --> 03:15.280
board, a PMIC, and an FTDI.

03:15.280 --> 03:21.760
So behind this chip, and this is what the ET-SoC-1 is, you are actually going to find

03:21.760 --> 03:28.160
one thousand and eighty-eight minions, which are small in-order cores, plus four big out-of-order cores, and they

03:28.160 --> 03:35.600
all run in a very modest power budget; again, do not trust that heatsink.

03:35.600 --> 03:41.120
Right, so this is how logically I see the board when I have to think about it.

03:41.360 --> 03:45.200
You're going to find the ET-SoC-1, which is the main chip; you're going to find the PCIe

03:45.200 --> 03:51.360
that connects to the ET-SoC-1; the LPDDR4; a microcontroller that controls

03:51.360 --> 03:56.240
the voltage regulators, which is an I2C slave of the ET-SoC-1; and then you've got the

03:56.240 --> 03:59.600
various UARTs you use to actually control the board.

03:59.600 --> 04:04.960
And now we're going to start looking inside the ET-SoC-1. At least one person in this

04:04.960 --> 04:10.320
room will be offended by this slide, because this is a logical view, and it doesn't

04:10.400 --> 04:15.760
show what kind of beautiful mesh architecture the chip is actually made of.

04:16.720 --> 04:22.800
Essentially you've got six by six shires (everything is a "shire") and you've got four and four

04:22.800 --> 04:28.720
memory shires on the sides, and we need to start getting used to this kind of

04:28.720 --> 04:32.800
nomenclature; it is the Esperanto domain, for sure: they've got shires, they've got minions, and

04:33.760 --> 04:40.080
so the shire essentially is a module into which the chip is divided.

04:40.080 --> 04:44.880
So let's start from the shires. We're now going to look at the I/O shire.

04:44.880 --> 04:49.680
The I/O shire contains all the various devices, but it also contains four

04:49.680 --> 04:54.160
Maxions, which are the opposite of minions: big out-of-order cores that you will come

04:54.160 --> 04:58.880
to know. And there's a very small minion, just a minion, called the service processor, which

04:58.960 --> 05:05.680
is the actual microcontroller that controls the board. PCIe shire, memory shires: you can imagine there's

05:05.680 --> 05:12.720
a lot of IP there, but it's not ours. And the one we are going to look at is the

05:12.720 --> 05:18.480
minion shire. The minion shire is the actual compute shire, and this is where all the compute

05:18.480 --> 05:26.640
cores that I showed are. And so, what is a shire in minion terms?

05:27.120 --> 05:34.560
It's going to be a hierarchy of definitions, so bear with me. So a shire, a compute shire, a

05:34.560 --> 05:40.320
minion shire, is actually composed of four neighborhoods (again, nomenclature) and four megabytes

05:40.320 --> 05:48.320
of L2-slash-L3 cache. Now, what "L2-slash-L3 cache" means is a complicated thing, and

05:49.040 --> 05:58.320
I'm going to arrive there by successive approximation. So each shire has these four megabytes

05:58.320 --> 06:02.880
of L2 cache; and you can actually see in the Esperanto manual

06:02.880 --> 06:09.440
how it's done, and it happens to be quite elegant, in my opinion. The cache module can actually

06:09.440 --> 06:16.240
be configured across the whole chip: these four megabytes can actually be divided into

06:16.240 --> 06:22.320
slices of an L3 system-wide cache, which then actually becomes the global cache before

06:22.320 --> 06:28.960
the RAM, or into L2 cache banks, and this can actually be configured at runtime, but I wouldn't suggest

06:28.960 --> 06:35.440
doing that. So keep in mind: the cache is going to be very important for everything.
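
NOTE
Back-of-the-envelope arithmetic for the figures above, as a sketch. The per-shire counts and the 4 MB figure come from the talk; the fraction-based L2/L3 split is only an illustration of the configurability, not the chip's actual granularity:

```python
MINIONS = 1088               # total minions quoted in the talk
PER_NEIGHBORHOOD = 8         # minions per neighborhood
NEIGHBORHOODS_PER_SHIRE = 4  # neighborhoods per minion shire
SHIRE_SRAM_MB = 4            # configurable L2/L3 SRAM per minion shire

minions_per_shire = PER_NEIGHBORHOOD * NEIGHBORHOODS_PER_SHIRE
minion_shires = MINIONS // minions_per_shire

def split(l2_fraction):
    """Keep a fraction of each shire's SRAM as private L2 and pool
    the rest into the system-wide L3 (illustrative only)."""
    l2_mb = SHIRE_SRAM_MB * l2_fraction
    l3_mb = (SHIRE_SRAM_MB - l2_mb) * minion_shires
    return l2_mb, l3_mb
```

With 32 minions per shire this gives 34 minion shires; an even split would leave 2 MB of private L2 per shire and pool 68 MB into the system-wide L3.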

06:38.000 --> 06:41.680
Then we go to the neighborhood, and we open up the neighborhood, and what we find in the neighborhood

06:41.680 --> 06:48.320
is eight minions, which are the actual CPUs, and one shared cache, the icache, the instruction cache.

06:49.360 --> 06:54.480
That's the simplest slide, so we go down and we actually see what a minion looks like, and I repeat,

06:54.480 --> 07:01.680
we have one thousand and eighty-eight of them in the chip. Well, it's a simple in-order core,

07:01.680 --> 07:10.000
two harts per core, RV64IMFC. So it's 64-bit; as you can see, part of that string is

07:10.080 --> 07:15.520
written in italics, and we'll see why. And then, as you can see, there's a big thing called

07:15.520 --> 07:24.960
a VPU with eight lanes, and that's going to be most of the talk. There's also a four-kilobyte L1 cache,

07:24.960 --> 07:31.680
an L1 data cache. So please note: the instruction cache is shared in the neighborhood, while the D-cache is

07:31.760 --> 07:39.120
specific to the core, and you start to see the pattern of the design here. The D-cache is

07:39.120 --> 07:45.520
actually configurable in each minion, and I can split it. So there's the simplest mode, in which

07:45.520 --> 07:51.280
both harts share the same D-cache, which is only four kilobytes; or I can split it, and then I

07:51.360 --> 08:00.560
can not only split it, but also carve out, per hart, half a kilobyte of cache and

08:00.560 --> 08:05.840
a kilobyte of scratchpad. So what is a scratchpad? It's just a buffer, and of course there is also an L2

08:05.840 --> 08:11.840
scratchpad, a bit farther away and shared across the minions; the L1 scratchpad is a very fast memory

08:11.840 --> 08:19.120
right by the core. And this is when things start to get complicated, if you thought this was

08:19.200 --> 08:24.320
fun to program, because there's absolutely no coherency across all the caches that I spoke about.

08:25.120 --> 08:30.560
So you need to think in terms of cache lines when you program this chip, and you have to decide

08:30.560 --> 08:37.360
what sits where. And there are actually quite a few important things here: you

08:37.360 --> 08:42.640
control the CPU, but the caches will be the thing we speak about the most.
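
NOTE
"Thinking in cache lines" can be made concrete with two tiny helpers. The 64-byte line size here is an assumption for illustration; check the manual for the real one:

```python
LINE = 64  # assumed cache-line size in bytes

def line_of(addr):
    """Index of the cache line a byte address falls into."""
    return addr // LINE

def pad_to_line(nbytes):
    """Round a buffer size up so the next buffer starts on a fresh line."""
    return -(-nbytes // LINE) * LINE

# Two buffers packed back to back at 100 bytes would share a line;
# padding the first one to 128 bytes gives each buffer its own lines.
```

On a non-coherent machine this is more than a performance concern: two cores writing to different halves of the same line can silently lose each other's updates, so line-granular ownership of data is the baseline discipline.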

08:43.520 --> 08:50.000
Right. So, as I said, the minion (now we know what the "ET Minion" name means)

08:50.640 --> 09:00.720
has custom extensions, but let's start from the basics. So, right, the manual is originally optimistic:

09:00.720 --> 09:06.640
it lists, as you can see, fancy features, although the fancy parts are essentially not quite

09:06.720 --> 09:14.560
usable. But, you know, this is an RV64IMFC standard core, and it has machine mode and supervisor mode; I say

09:14.560 --> 09:21.120
nice, I can run an operating system on it, except for one slightly major caveat: page tables are

09:21.120 --> 09:28.800
unusable. But there is a PMP of some sort, so you're going to be fine. One of the biggest

09:29.520 --> 09:36.800
"minor deviations" of the design is that the performance monitoring unit actually

09:36.800 --> 09:42.800
counts everything in the neighborhood. Then you can actually start seeing how this

09:42.800 --> 09:47.760
thing is supposed to be programmed: because when you have the instruction cache and the performance

09:47.760 --> 09:54.960
counters shared, essentially the neighborhood becomes the compute unit. So this is where we start from:

09:55.040 --> 10:04.400
this is the vanilla base ISA, and from this they built. So we're going to look at

10:04.400 --> 10:10.640
three custom extensions. The first one is the SIMD extension, which might get some people excited in

10:10.640 --> 10:16.400
the room; then there is the atomics extension, and what it means to have atomics in a system with

10:17.280 --> 10:24.080
non-coherent caches; and the third one is the tensor extension. I don't have the time (no one, I think, has the

10:24.160 --> 10:30.320
time) to explain the full tensor extension at all, but I'm going to describe the basic mechanism for it.

10:32.080 --> 10:40.720
Right, let's talk about the SIMD instructions. So this is the thing about the SIMD:

10:40.720 --> 10:47.600
this chip has not been designed recently; I think the design dates to around 2016, well before

10:47.600 --> 10:54.560
RVV was done. And to me, coming from x86, it feels like more of a

10:54.560 --> 10:59.120
classic SIMD extension for RISC-V; it doesn't feel like RVV when you have to program it.

11:01.520 --> 11:06.560
The biggest thing is that they definitely didn't do, for example, what the P extension

11:06.560 --> 11:10.880
does, which just groups smaller integers together; they didn't do what RVV does, which keeps a different

11:10.880 --> 11:18.480
bank of registers. What they did was extend the floating-point registers to 256 bits, and people can

11:18.480 --> 11:24.240
start imagining what could possibly go wrong with that. But in general it's actually quite elegant,

11:24.240 --> 11:33.360
if you fix the compiler when spilling the registers, which we did not. So we have these 256 bits,

11:33.360 --> 11:39.440
and this is why, if you look at the previous slide, you saw eight lanes: because each of these registers

11:39.520 --> 11:44.880
is treated as eight 32-bit elements.

11:45.920 --> 11:52.320
The other thing they did: of course there is masking, and zeroing is implicit, that's kind of a standard thing,

11:52.320 --> 12:01.280
but they actually added eight 8-bit mask registers. The reason for eight bits is, of

12:01.280 --> 12:06.000
course, that there are eight lanes; together the masks fit in one 64-bit register, and that is the mask state.

12:06.960 --> 12:12.480
So, where can we start? If you actually look at the manual, you will see a lot of instructions,

12:12.480 --> 12:18.480
so I decided to kind of logically group them, going from the simplest to the most

12:18.480 --> 12:25.680
complicated. So the first ones we're going to look at are the mask instructions. Not much to say:

12:26.240 --> 12:30.480
you've got these eight registers, and then you've just got the various operations: you

12:30.560 --> 12:35.520
set one from a register, you get a broadcast from one, and then you can popcount

12:35.520 --> 12:42.560
the ones, count the zeros, AND, OR, NOT and so on. So with these you can actually start masking

12:42.560 --> 12:46.320
operations, because you will find masked operations everywhere, and reasoning about masks is

12:46.320 --> 12:55.040
much easier like this. Load and store: again, keep in mind that this is an architecture

12:55.360 --> 13:03.440
where caches are non-coherent, so all of these loads still go to the L1, to the L1 D-cache; so

13:03.920 --> 13:11.360
keep this in mind, because it will be important later. This is essentially a masked load/store

13:11.360 --> 13:18.160
of the full floating-point register, very useful for spilling. These are actually instructions that

13:18.160 --> 13:23.600
feel very simple to me, because they feel very classical: you can see there's the load/store that is

13:23.680 --> 13:31.440
masked, the broadcast, with some weird things going on. For example, you can see there are two different

13:31.440 --> 13:37.840
prefixes: one is PS, which stands for "packed single", meaning floating point,

13:37.840 --> 13:44.400
and the other is PI, "packed integer", which means int32. For example, the things that you

13:44.400 --> 13:49.920
will find when you look at these in detail are some very strange surprises, like the immediate

13:49.920 --> 13:55.840
of the broadcast: when it's floating point, it's the top 20 bits, and the low

13:55.840 --> 14:03.760
bits are taken as zero. And, as you can see, there's the scatter/gather,

14:03.760 --> 14:09.760
everything that you expect, conditional move, permutation and all these things; so this feels kind of normal.
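
NOTE
The mask registers and masked operations just described can be sketched with a scalar emulation: eight lanes, one mask bit per lane, with the implicit zeroing mentioned earlier. This models the semantics only; the names are not the chip's actual mnemonics:

```python
LANES = 8

def mask_popcount(m):
    """Count the set bits of an 8-bit mask (the 'count the ones' op)."""
    return bin(m & 0xFF).count("1")

def mask_not(m):
    """Bitwise NOT of an 8-bit mask."""
    return ~m & 0xFF

def masked_op(mask, a, b, op):
    """Apply op per lane where the mask bit is set; zero the lane otherwise."""
    return [op(x, y) if (mask >> i) & 1 else 0
            for i, (x, y) in enumerate(zip(a, b))]
```

For example, adding two 8-lane vectors under mask 0b00000101 touches only lanes 0 and 2 and zeroes the rest.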

14:12.320 --> 14:16.960
This feels almost normal at the beginning, until you go to the last page, and these are the

14:16.960 --> 14:24.720
converts, to convert the elements. So, you know, they supported fp16 to fp32, int32 to fp32 and

14:24.720 --> 14:30.320
vice versa, but this is when you start seeing that the minion was meant for graphics:

14:30.320 --> 14:35.920
when it was designed, they actually wanted to build a graphics GPU, not an AI accelerator. So

14:37.040 --> 14:41.600
they support (and I don't really want to describe the formats of these things) fixed-point

14:41.680 --> 14:47.280
and normalized-integer formats, all the various graphics normalization numbers, including unorm;

14:49.120 --> 14:56.960
it's part of the legacy of the design. And then, yes, there are the various

14:58.400 --> 15:02.640
actual integer operations, and there are actually things there that we don't

15:02.720 --> 15:11.520
even use, imagine. These are all masked, of course. Interestingly, the floating point has the

15:11.520 --> 15:19.120
multiply-add and the integer side does not. And so this is essentially where most of the

15:19.120 --> 15:24.160
specification is. I mean, I could list all the instructions, but I don't think that's very useful, because

15:24.240 --> 15:33.040
there's an index for that. Right, so this is where all the various fun things start happening,

15:33.040 --> 15:38.480
because, for example, not all the instructions that you'll find in the manual

15:38.480 --> 15:42.960
are actually implemented in the hardware, and if you look at the Verilog (which soon you will be able

15:42.960 --> 15:49.040
to, because we're open sourcing that soon), you will see that some of them were implemented

15:49.040 --> 15:53.760
and then removed to save space. So, for example, the transcendental instructions,

15:53.760 --> 16:00.640
like sine, cosine, I think even exponential, weren't implemented. And there are things that

16:00.640 --> 16:04.320
actually create problems when you program these things with a compiler, which is that

16:04.320 --> 16:10.480
we don't have an integer divide, and we're actually breaking the C standard,

16:10.480 --> 16:18.000
because you cannot convert from the normal integer register size, which is, you know, int64,

16:18.000 --> 16:21.840
to fp32, because there's no converter in hardware. So you need to be very careful,

16:21.840 --> 16:29.440
if you're going to use these things, to actually use float with int, not long. The architecture actually

16:29.440 --> 16:34.880
generated a new exception, "emcode emulation", instead of relying on the illegal

16:34.880 --> 16:38.720
instruction exception, because this way the exception can actually pass you some metadata that helps you

16:38.720 --> 16:45.120
understand which instruction has to be emulated, at the price of a more complicated decoder;

16:46.080 --> 16:52.560
but it's not used by the firmware yet, so that's something we need to add. This is something

16:52.560 --> 16:58.160
that I wanted to show: this is how the actual CPU code looks when you program it, and I don't know

16:58.160 --> 17:04.720
about you, but this feels very Intel-inspired to me. At the same time, essentially what they did is just

17:04.720 --> 17:14.720
add PS as a suffix, and this is essentially part of something I wrote when

17:14.720 --> 17:18.880
I was porting a convolution using the SIMD extension. So, as you can see,

17:20.400 --> 17:25.760
actually, from the point of view of the user, the idea of actually using the floating-point registers

17:25.760 --> 17:30.320
like this makes it feel much more natural, and that's something I like, despite all the hairy

17:30.320 --> 17:38.960
details of the project. This is where the cache becomes important: this is why I introduced it first,

17:38.960 --> 17:44.640
and now we're going to see the consequences of it. As I said, the way I like to think about the

17:44.640 --> 17:51.680
system (despite the fact that soon we'll realize this is a lie) is that you essentially have a

17:51.680 --> 17:59.440
minion level, and each minion sees an L1 and an L1 scratchpad; there's an L2, or an L2

17:59.440 --> 18:05.280
scratchpad, that is shared within the shire; and then there's an L3 that is global. So the problem that we have,

18:05.280 --> 18:12.320
in a system that is actually non-coherent, is that it gets very hard: flushing

18:12.320 --> 18:17.120
everything, minding the L1s, not controlling them; and since the caches do not communicate with each other,

18:17.200 --> 18:23.840
you can actually easily create a lot of problems on a cache line. So the way this is solved, essentially,

18:23.840 --> 18:31.440
and the atomics are stuck into this, is that, in the actual hardware,

18:31.440 --> 18:36.960
the atomics are actually implemented in the L2 and L3 cache modules. So essentially

18:36.960 --> 18:41.680
these operations don't touch the L1 cache module at all, and for all the operations that are

18:41.680 --> 18:47.600
defined as atomics, you specify whether they execute at L2 or L3, and this is usually

18:47.600 --> 18:53.680
specified as "global" and "local". And when you use an atomic operation, you actually completely bypass

18:53.680 --> 18:59.600
the L1. So essentially the way you program this machine is by deciding: some data is shire

18:59.600 --> 19:06.720
local, some data is global, and you just use the atomic instructions to access it.
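
NOTE
A toy model of the local/global atomics just described, with the cache levels as plain dicts. This sketches the semantics only (atomics execute at L2 or L3 and bypass L1); it is not any real API:

```python
l1 = {"x": 0}  # per-minion D-cache: never touched by atomics
l2 = {"x": 0}  # shire-local cache: where "local" atomics execute
l3 = {"x": 0}  # global cache: where "global" atomics execute

def amoadd(var, val, scope):
    """Fetch-and-add executed at the chosen cache level, bypassing L1."""
    level = l2 if scope == "local" else l3
    old = level[var]
    level[var] = old + val
    return old

amoadd("x", 5, "local")   # shire-local counter
amoadd("x", 7, "global")  # system-wide counter
# l1["x"] is still 0: regular loads through L1 can be stale,
# which is exactly why the atomics must bypass it.
```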

19:06.800 --> 19:15.200
It requires a lot of, how can I say, discipline, but we had to do it ourselves,

19:15.200 --> 19:22.400
so it can be done. Note there is no real LR/SC; of course, you cannot have the

19:22.400 --> 19:27.600
usual reservations. The reason we are not "GC", why we don't have the A extension, is that the atomics need to do this.

19:28.720 --> 19:33.680
And so you have the various AMOs, amoswap global and local and so forth. So actually, when you see

19:33.760 --> 19:39.280
them, you essentially realize that it's quite the same as the standard atomic instruction extension, with

19:39.280 --> 19:46.080
global and local on top. So everything is sorted. Are there any questions about this so far?

19:47.600 --> 19:58.640
I'm going to continue. Right, so what else is there? Something that I was personally,

19:59.200 --> 20:06.000
I don't know, shocked by, because I remember working on this back in my previous life:

20:06.000 --> 20:11.360
they actually have a compare-and-swap, what they call the amocas, and of course it's local

20:11.360 --> 20:16.320
and global. So that's nice, although not exactly what the architecture would like you to build a spinlock on.

20:18.880 --> 20:21.920
Yes. And what actually becomes interesting is, since, essentially,

20:22.000 --> 20:27.920
what you're going to see is, since the atomics are actually not used the way you are used to

20:27.920 --> 20:32.640
(you can have atomic operations because it's the cache layer that actually serializes them;

20:32.640 --> 20:38.560
that's the whole point), they are actually used to control at which level of the cache

20:38.560 --> 20:44.160
you actually write the data. And so you're actually going to find swap, swap

20:44.880 --> 20:51.360
local and global; you're going to find all the various operations that usually exist for

20:52.640 --> 20:57.840
floating point, in the form of AMOs, global and local. So actually all the

20:57.840 --> 21:02.640
operations you usually have for floating point actually become atomics, thanks to this.

21:03.520 --> 21:09.680
In this architecture, atomics are not meant to prevent someone from doing something, because everything is

21:09.760 --> 21:14.480
non-coherent; they're actually meant to let you control at which level of the cache

21:14.480 --> 21:23.840
your data lives, and that's very important. But now, well, I actually hoped I would be running out of time before

21:23.840 --> 21:33.440
starting this. Right, so: if you think the cache story was complicated, you should see this one.

21:34.320 --> 21:44.080
Um, right, we definitely have 15 minutes, and I'm definitely going to finish this. Okay, so

21:45.040 --> 21:51.920
that's our processor. The view you had of the minions so far was very simple and nice, because you

21:51.920 --> 21:59.520
had simple atomics and things; the tensor operations actually work on matrices. And

21:59.680 --> 22:07.600
this slide actually looks nice, because essentially you have a matrix which is 64 bytes by 16 rows,

22:08.880 --> 22:13.040
and of course, since it's defined in bytes, you can actually have different sizes of matrices,

22:13.040 --> 22:22.320
so you can actually do that. Um, I mean, that's about one line here, but what's behind it is essentially

22:22.320 --> 22:28.080
the fact that you can actually have int8, fp16, fp32, so you can actually have these kinds of

22:28.640 --> 22:36.400
m, k and n shapes. And so these are the kinds of operations that you have, and you can actually define them,

22:36.400 --> 22:43.120
because you have various FMA operations for matrices. So the problem is: where do you put all this,

22:43.120 --> 22:50.240
all the state? Because if you do some calculation, 64 bytes by 16 rows, that's 512 bits per row, I believe;

22:50.320 --> 22:55.520
I did it at home. And so this becomes kind of important for us: where do we

22:57.920 --> 23:03.040
where we keep all of this. And this is where it gets complicated, because essentially,

23:04.160 --> 23:09.120
when you have this kind of operation on matrices, in this kind of machine, where

23:09.120 --> 23:15.200
you have a distributed chip with caches on a NoC, the way you handle data becomes very,

23:15.360 --> 23:22.320
very important. And this is where we actually start saying that what I told you before about

23:22.320 --> 23:30.560
the caches was an actual lie, because the way caches work in the ET-SoC-1 is that

23:31.600 --> 23:38.000
they're all mapped in a global address space, not the L1s; essentially, I cannot access

23:38.000 --> 23:44.480
another minion's L1 scratchpad, but I can access someone else's L2 scratchpad, because

23:45.360 --> 23:49.280
all of those get mapped in different regions of the address space, and so I can access them.

23:50.160 --> 23:54.640
So the way I see it, when I'm talking about my minion, how the memory

23:55.600 --> 23:59.600
works, is: I'm going to see an L1 scratchpad, if I enable the scratchpad (which

23:59.600 --> 24:04.000
you need for the tensor operations), and I'm going to see that

24:04.000 --> 24:10.400
there's an L2 scratchpad. And I'm not bound by the hardware; I cannot even know which one is

24:10.400 --> 24:15.840
my L2 scratchpad and which are others'. What I'm going to notice is that the one that is closer to me, because

24:15.840 --> 24:21.120
it's in my shire, is going to be faster to access, and that's the one I'm going to use. So essentially

24:21.680 --> 24:28.160
you need to understand that, even though the model is neat, it can be bypassed. So how does the

24:28.160 --> 24:40.320
tensor operation work in this case? Well, it's a GEMM between matrices A, B and C, and what is

24:40.320 --> 24:47.600
interesting about this is that the FMA, which is the operation (there are many of them, by the way;

24:47.600 --> 24:54.080
this is a bunch of simplification), takes the first matrix from the L1 scratchpad, so A needs to be there.

24:55.520 --> 25:01.760
Then there is something called the TensorB register; you should picture it as a streaming register

25:01.760 --> 25:08.720
(this is again a simplification), but essentially you issue a load B and then immediately you issue

25:08.800 --> 25:14.080
the tensor operation, or whatever uses it after this. And so this is not an actual register: you don't

25:14.080 --> 25:21.120
really ever load the full value of the matrix into it; it just streams. And then

25:21.120 --> 25:28.400
the tensor operation actually saves the result matrix in the full

25:29.360 --> 25:36.240
floating-point registers of the machine, and then you need to issue a tensor store. So why did I

25:36.400 --> 25:40.880
put this up? Because usually (and there's going to be a talk that covers this much better,

25:40.880 --> 25:47.840
because of time) the pattern, I'm pretty sure, is going to be a tensor load to the scratchpad that I can actually do

25:47.840 --> 25:52.960
in parallel; and then the tensor load can actually load from everywhere, but usually the

25:52.960 --> 25:57.280
way it works is that I have something to load from external memory, or from wherever it is,

25:57.360 --> 26:06.720
to the scratchpad: I do a copy and I load from there, and I also issue the tensor load A, if that is the case,

26:07.280 --> 26:12.400
and then I do the tensor store back. So essentially you should see these two as parallel; just two things

26:12.400 --> 26:18.480
that run in parallel, but the system doesn't force you to do that, and usually what happens is that you actually

26:18.480 --> 26:27.200
load A once and then you keep streaming B. So this is the system;

26:27.360 --> 26:32.240
it's kind of complicated. As I said at the beginning, in the very first slide, this is not going to be

26:33.040 --> 26:39.040
a substitute for reading the manual; I simplified massively, because there are two FMAs: the

26:39.040 --> 26:44.640
tensor FMA can also read the B matrix from the L1 scratchpad, completely avoiding that register.
tensor FMA can answer read from L1 SP the for the B matrix so completely avoiding that

26:46.240 --> 26:52.480
And so, as you can see, we started from this very beautiful idea of what this does,

26:52.560 --> 27:00.160
and now we are here, and it's extremely convoluted. Right, there's going to be a

27:00.160 --> 27:05.120
talk about this later, and I actually believe it's going to give

27:05.120 --> 27:10.720
a much better explanation than this; it's going to actually show how to program these things.

27:10.720 --> 27:18.800
It's in this room, I believe, this afternoon. So, I'm actually past the

27:18.800 --> 27:27.040
tensor things; I'm quite happy. So, what to make of this? Personally, I felt very stupid when I

27:27.040 --> 27:32.720
programmed RVV in my past life; I don't know, it really hit me the wrong way,

27:33.360 --> 27:39.920
mostly with rage. Looking at this, and at the code that I write for it from time to time, you know, despite

27:39.920 --> 27:47.120
all the cache issues and all that stuff, I found it extremely refreshing. There's a lot more

27:47.120 --> 27:52.400
in the ET-SoC-1 extensions; for example, I talked a lot about the caches and how to handle the

27:52.960 --> 27:58.880
non-coherent stuff, but of course there's a lot more to talk about. The thing that I skipped

27:58.880 --> 28:07.440
the most is that the ET actually allows you to send tensor data from one minion to another; there's actually,

28:07.440 --> 28:15.280
there are a lot of ways to actually control that; there are a lot of primitives for

28:16.000 --> 28:21.520
synchronization and for data passing, from low-level to cooperative, so it's quite

28:21.520 --> 28:26.960
complicated, and I didn't go into that. I just wanted to show how this cache system, with these

28:27.680 --> 28:33.760
very interesting scratchpads, these configurable caches at L1 and L2, can become very interesting.
28:35.680 --> 28:41.280
Is it ready to be, like, the future, the next big thing? Now,

28:41.520 --> 28:46.240
that is tough. If you look, for example, you're going to find

28:46.240 --> 28:54.880
pack-eight, pack-four; for example, you're going to find pack-eight and various instructions

28:54.880 --> 29:00.560
that were necessary for them, but clearly RISC-V has evolved since, so, you know, they're

29:00.560 --> 29:06.880
not exactly very well defined. And this is the thing that, for example, anyone trying to

29:06.880 --> 29:13.040
push to be neutral some to ashamed of doing something like this because as we know we all have

29:13.040 --> 29:20.480
in this file some issues with the instruction length instruction and the way the space is actually

29:20.480 --> 29:26.400
allocated and the designer of these is a quantity that I didn't care so they just said ah I'm

29:26.400 --> 29:34.000
going to use the 48 and 64 bit space which in LVM is actually fine because you just defined

29:34.080 --> 29:41.440
the table generator, TableGen; but in binutils you actually need to have an #ifdef where they calculate

29:41.440 --> 29:47.520
the instruction length, and I think that makes the patch completely un-upstreamable. But the good news is:

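The length calculation in question can be sketched as follows. This follows the standard RISC-V base encoding scheme, where the low bits of the first 16-bit parcel determine the instruction length, and the 48- and 64-bit spaces mentioned above are the 0b011111 and 0b0111111 prefixes. This is a hypothetical helper for illustration, not the actual binutils or LLVM code:

```c
#include <assert.h>
#include <stdint.h>

/* Instruction length in bytes, derived from the low bits of the
 * first 16-bit parcel, per the RISC-V base encoding scheme. */
static unsigned insn_length(uint16_t first_parcel)
{
    if ((first_parcel & 0x3) != 0x3)
        return 2;               /* compressed (C extension) */
    if ((first_parcel & 0x1f) != 0x1f)
        return 4;               /* standard 32-bit encoding */
    if ((first_parcel & 0x3f) == 0x1f)
        return 6;               /* 48-bit space: low bits 011111 */
    if ((first_parcel & 0x7f) == 0x3f)
        return 8;               /* 64-bit space: low bits 0111111 */
    return 0;                   /* >64-bit / reserved encodings */
}
```

A toolchain that only ever expects 16- and 32-bit instructions can effectively hard-code the first two cases, which is why reusing the longer prefixes for ordinary custom instructions forces a conditional into this shared computation.
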
29:48.240 --> 29:53.200
we actually have a lot of support, and you can actually go now to Compiler

29:53.200 --> 30:02.560
Explorer and play with it. And I guess that's the end of my talk. The beauty of this is that

30:03.520 --> 30:09.280
there's not a company technically gatekeeping this; you can just join the Discord. I don't like Discord,

30:09.280 --> 30:17.200
but we have Discord, so I love Discord now. But definitely go there, and it will be quite interesting

30:17.200 --> 30:24.640
to see what you do. Soon we'll also have the RTL for this, so yes, please let us know. And

30:24.640 --> 30:29.840
everything is here in case you want to see the manual, the implementation, the actual details, the

30:30.720 --> 30:34.480
schematics of the board and everything else, and how the firmware handles everything.

30:39.520 --> 30:47.760
Any questions? I think you said at the beginning, or

30:48.640 --> 30:55.760
rather, you mentioned that some instructions are implemented via a software

30:55.760 --> 31:03.280
handler; is that correct? What was the rationale behind this decision, low overhead or something specific?

31:03.280 --> 31:10.240
And that's a very good point, and that's exactly what I asked, because I didn't

31:10.240 --> 31:15.360
decide it, I wasn't there, but I could talk to the very smart people that did.

31:16.240 --> 31:23.520
Ah, sorry, the question is: what was the reason for adding another exception, which is

31:23.520 --> 31:29.920
cause 30, by the way, to flag these not-implemented instructions as "please emulate me"?

31:29.920 --> 31:35.520
What was the reason behind it? Because, as I remember, virtualization used to be done

31:35.520 --> 31:40.160
with trap-and-emulate when I was young, using illegal-instruction exceptions.

31:41.120 --> 31:46.080
Ah, the reason. I asked this question first thing, and I've been told it's like,

31:46.080 --> 31:51.760
well, we decoded the instruction in the chip, why should we waste that information? Which is possibly

31:51.760 --> 31:58.640
fine, but I don't know how much the RTL was actually complicated by that. So essentially,

31:58.640 --> 32:03.440
in the trap, in the tval register or whatever that is, you get information about the instruction you should emulate.

32:04.160 --> 32:06.000
That's why.

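A minimal sketch of the trap-and-emulate dispatch described above, in C. The custom cause number 30 and the idea that the trap-value register carries the already-decoded instruction bits come from the talk; the names and structure here are a hypothetical illustration, not the actual firmware:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cause number for "please emulate this instruction",
 * as described in the talk; the illegal-instruction cause (2) is the
 * standard one from the RISC-V privileged spec. */
#define CAUSE_EMULATE       30
#define CAUSE_ILLEGAL_INSN   2

/* Sketch of the trap dispatch: because the hardware already decoded
 * the instruction, the tval register hands the handler the raw
 * instruction bits directly, so no re-fetch from the faulting PC is
 * needed. Returns 1 if the trap was routed to the emulator. */
static int handle_trap(uint64_t cause, uint64_t tval)
{
    if (cause == CAUSE_EMULATE) {
        uint32_t insn = (uint32_t)tval;  /* instruction bits from tval */
        (void)insn;                      /* emulate_insn(insn) would go here */
        return 1;
    }
    /* Classic trap-and-emulate would instead catch CAUSE_ILLEGAL_INSN
     * and reload the instruction from the trapping PC itself. */
    return 0;
}
```

The trade-off the speaker mentions is visible in the two branches: with the custom cause, the handler never has to re-fetch and re-decode the instruction from the trapping PC, at the cost of extra exception logic in the hardware.
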
32:10.640 --> 32:21.520
Were you able to... can you buy them? OK: one, we do have the chips,

32:21.520 --> 32:27.680
we do have the boards, and we do have actual open access for developers, if you want to play with it.

32:28.880 --> 32:35.360
We do have some scheme; it's not my decision, I'm not in the loop of deciding who

32:35.360 --> 32:39.600
gets a board and who doesn't. But essentially, yes: we definitely have boards, we definitely have

32:39.600 --> 32:44.960
chips, and we're definitely going to do something with them. And even today, if you join the Discord

32:44.960 --> 32:48.800
and say, can I have SSH access to a machine, you're going to get a machine with these things,

32:48.800 --> 32:53.760
where you can do whatever you want; you can build the compiler, everything. Everything is open.

32:56.480 --> 32:57.360
please

32:58.320 --> 33:05.760
Thanks for the presentation. My question is that, since you plan to open-source the very low-level code and also

33:05.760 --> 33:14.400
upstream the compiler stuff: the complexity of this design, both at the compiler and at the

33:14.400 --> 33:23.600
kernel level, for this complex architecture, is such that having the code

33:25.040 --> 33:36.160
upstreamed is just too complex for this design. So do you think that, for your company or

33:36.160 --> 33:42.640
the community to maintain that, will you try to simplify, or will you want to continue with

33:43.280 --> 33:50.480
this kind of design? So the question is: this architecture is complicated,

33:50.480 --> 33:57.440
and we open-source the thing, so what happens to the maintainability of the code? And it's a

33:57.440 --> 34:02.080
very good question. To be honest, the ISA is not that complicated: you just implement

34:02.080 --> 34:08.400
the spilling and you're done. The problem is actually GCC, extending GCC for the spilling, for that

34:08.400 --> 34:15.440
stuff. And you're right, it's complicated, but I believe that if they manage to support x86 they can

34:15.440 --> 34:22.000
support this architecture, right? Essentially, the thing is this: yes, but this is why you

34:22.000 --> 34:27.760
upstream, right? I mean, I can imagine that there are definitely now some bugs in the

34:28.400 --> 34:36.000
Motorola 68000 code in GCC, because no one uses it. But if you upstream, and this is where we should go,

34:36.000 --> 34:42.480
because if somebody uses it and it's upstream, we can fix it. So, are we going to keep going

34:42.480 --> 34:47.680
with this architecture? Well, first of all, if a chip exists it should be supported; we cannot

34:47.680 --> 34:53.360
say, ah, sorry, we have no software, right? That's not how open-source work goes, especially if you have access

34:53.360 --> 34:58.160
to this level of detail. Are we going to change the architecture? Yes; it's not that we're going to

34:58.160 --> 35:03.680
tape out the same thing all over again. And by the way, if you join our Discord and look around,

35:04.640 --> 35:08.720
we are actually going through a process of tape-out right now, defining the chip, and since we are

35:08.720 --> 35:13.840
really serious about open source, you can actually find the spec of the chip we're going to

35:13.840 --> 35:20.640
tape out the moment the engineer actually changes it. So you can actually go

35:20.640 --> 35:25.760
and look, and you will actually find the specs. And we're definitely evolving it, fixing

35:25.760 --> 35:31.040
errata; this is like the first test chip that we're going to do, but then of course,

35:31.120 --> 35:37.200
yeah, the people that are at the back know how our next architecture

35:37.200 --> 35:43.200
will look; no pressure. The discussions about the architecture have been such that

35:43.200 --> 35:50.720
someone is laughing, because I didn't say it, I promise. So yeah, time's up. Thank you very much for everything.

