WEBVTT

00:00.000 --> 00:06.000
Sorry, great, great.

00:06.000 --> 00:08.000
Scoping out the Tenstorrent Wormhole.

00:08.000 --> 00:09.000
Take it away, Peter.

00:09.000 --> 00:10.000
OK.

00:10.000 --> 00:17.000
APPLAUSE

00:17.000 --> 00:18.000
OK.

00:18.000 --> 00:21.000
So I'm holding here the Tenstorrent n300

00:21.000 --> 00:24.000
that I purchased from Tenstorrent's online

00:24.000 --> 00:25.000
store.

00:25.000 --> 00:27.000
And if you like what you hear in this talk and in the next

00:27.000 --> 00:31.000
talk, you can go to the store and buy your own Wormhole card.

00:31.000 --> 00:34.000
There's a bunch of other form factors in the store as well.

00:34.000 --> 00:35.000
But this is not a sales talk.

00:35.000 --> 00:36.000
It's a FOSDEM talk.

00:36.000 --> 00:39.000
I'm here to talk about open source software.

00:39.000 --> 00:43.000
So, point one: Tenstorrent make an open source software stack for their

00:43.000 --> 00:44.000
hardware.

00:44.000 --> 00:46.000
And all of that software is up on GitHub.

00:46.000 --> 00:47.000
We can find it.

00:47.000 --> 00:48.000
It's all great.

00:48.000 --> 00:50.000
But I don't want to talk about software.

00:50.000 --> 00:51.000
I want to talk about hardware.

00:51.000 --> 00:53.000
Because I believe that good software

00:53.000 --> 00:57.000
really understands the hardware on which it is running.

00:57.000 --> 00:59.000
So I'm going to tell you all about the hardware here.

00:59.000 --> 01:02.000
And then you can think of what the software should be or better

01:02.000 --> 01:05.000
understand why the Tenstorrent software is the way it is.

01:05.000 --> 01:07.000
Now, I don't want to take my card apart.

01:07.000 --> 01:08.000
I, like, paid good

01:08.000 --> 01:09.000
money for it.

01:09.000 --> 01:12.000
But if you did take off the case and the heat sink and the fan,

01:12.000 --> 01:14.000
you would see this photo here.

01:14.000 --> 01:18.000
But I'm not going to talk about PCBs because I don't do PCBs.

01:18.000 --> 01:21.000
So we're going to switch to this much simpler block diagram.

01:22.000 --> 01:26.000
So on the bottom of it, a PCI Express thing; on the left,

01:26.000 --> 01:31.000
we've got two QSFP cages, and then two proprietary Ethernet ports on the top.

01:31.000 --> 01:33.000
And then in the middle of the card, we've got two

01:33.000 --> 01:37.000
Wormhole chips, each surrounded by a bunch of DRAM.

01:37.000 --> 01:40.000
Now, if I shuffle things around just a little bit,

01:40.000 --> 01:43.000
I can show you how everything is wired together.

01:43.000 --> 01:44.000
There we go.

01:44.000 --> 01:46.000
So I've drawn three types of connection here.

01:46.000 --> 01:49.000
Each green line is representing DRAM.

01:49.000 --> 01:52.000
The red line is representing PCI express.

01:52.000 --> 01:55.000
And each blue line is representing bidirectional 100 gigabit

01:55.000 --> 01:56.000
Ethernet.

01:56.000 --> 02:01.000
Now, in theory, each Wormhole chip can drive 16 ports of that 100 gigabit

02:01.000 --> 02:04.000
Ethernet, but the form factor of this particular board

02:04.000 --> 02:07.000
only gives you 8 ports of that on the left chip,

02:07.000 --> 02:09.000
and 4 ports of that on the right chip.

02:09.000 --> 02:12.000
Which is good, but it could be better.

02:12.000 --> 02:15.000
But in any event, the scale-up factor here,

02:15.000 --> 02:17.000
the scale-up story here, is Ethernet.

02:17.000 --> 02:19.000
At some point, we're going to get larger and larger,

02:19.000 --> 02:22.000
and we're going to get too large for a single chip.

02:22.000 --> 02:24.000
Even if your single chip is like the size of a whole

02:24.000 --> 02:26.000
wafer, à la Cerebras,

02:26.000 --> 02:29.000
You will still need multiple chips at some point.

02:29.000 --> 02:32.000
So the hope is you could buy as many chips as you need,

02:32.000 --> 02:34.000
and then link them together with point-to-point

02:34.000 --> 02:38.000
Ethernet in whatever topology makes sense for you.

02:38.000 --> 02:40.000
Is that what I wanted to say here?

02:40.000 --> 02:41.000
I think so.

02:41.000 --> 02:44.000
Yep, okay, so.

02:45.000 --> 02:47.000
However, if you scale out too much,

02:47.000 --> 02:50.000
then giving each chip its own PCI Express connection,

02:50.000 --> 02:52.000
it's going to get a little bit pricey.

02:52.000 --> 02:54.000
So the hope is that we can get away with giving

02:54.000 --> 02:56.000
only some chips a PCI

02:56.000 --> 03:00.000
Express connection, and rely on Ethernet to reach the other chips.

03:00.000 --> 03:02.000
And this board is designed to give you a little taste of that.

03:02.000 --> 03:04.000
So as you can see, the chip on the left,

03:04.000 --> 03:06.000
has a PCI Express connection,

03:06.000 --> 03:08.000
but the chip on the right does not.

03:08.000 --> 03:11.000
So it is only reachable via that Ethernet link

03:11.000 --> 03:13.000
printed onto the circuit board.

03:14.000 --> 03:16.000
So from the software point of view, this means that you can

03:16.000 --> 03:18.000
mmap memory from the left chip,

03:18.000 --> 03:21.000
you can't mmap memory from the right chip.

03:21.000 --> 03:23.000
Implications for software there,

03:23.000 --> 03:25.000
I'm just going to leave that sitting there,

03:25.000 --> 03:26.000
and let me flip the arm.

03:26.000 --> 03:28.000
So if we zoom in to one of these chips,

03:28.000 --> 03:31.000
we see a 10 by 12 grid of tiles.

03:31.000 --> 03:34.000
We've got various types of interface tile around the edge.

03:34.000 --> 03:37.000
We've got Ethernet tiles, left and right,

03:37.000 --> 03:39.000
DRAM tiles, top and bottom,

03:39.000 --> 03:43.000
also on the bottom, one PCI Express tile and one ARC tile.

03:43.000 --> 03:47.000
Now, on the form factor of this particular board,

03:47.000 --> 03:50.000
two columns of compute tile, which is all this stuff

03:50.000 --> 03:53.000
in the middle, are disabled for yield reasons.

03:53.000 --> 03:55.000
So I'm just going to mark two columns here.

03:55.000 --> 03:57.000
I've randomly chosen two columns,

03:57.000 --> 03:59.000
but it will vary from chip to chip,

03:59.000 --> 04:00.000
just based on the yield,

04:00.000 --> 04:03.000
where the defects are on these tiles.

04:03.000 --> 04:05.000
If you buy other form factors,

04:05.000 --> 04:06.000
you can get more compute tiles.

04:06.000 --> 04:08.000
If you buy an n150s board,

04:08.000 --> 04:10.000
which I think someone has

04:10.000 --> 04:12.000
back here, you get 72 tiles,

04:12.000 --> 04:14.000
and if you buy a Galaxy system,

04:14.000 --> 04:16.000
you get all 80 compute tiles on every chip.

04:16.000 --> 04:19.000
I.e. they save the really good yielding chips

04:19.000 --> 04:20.000
for that particular product,

04:20.000 --> 04:22.000
which will cost you lots of money,

04:22.000 --> 04:25.000
because people have to make some money.

04:25.000 --> 04:27.000
Okay.

04:27.000 --> 04:29.000
So what's next?

04:29.000 --> 04:31.000
Yep.

04:31.000 --> 04:35.000
So I've numbered each DRAM chip.

04:36.000 --> 04:38.000
Each DRAM tile, based on the DRAM module

04:38.000 --> 04:39.000
it takes,

04:39.000 --> 04:40.000
connects to,

04:40.000 --> 04:41.000
I've numbered each

04:41.000 --> 04:42.000
Ethernet tile individually,

04:42.000 --> 04:43.000
because point-to-point

04:43.000 --> 04:45.000
Ethernet links are unique.

04:45.000 --> 04:46.000
I've not numbered any,

04:46.000 --> 04:47.000
of the other tiles,

04:47.000 --> 04:48.000
because all the compute tiles

04:48.000 --> 04:50.000
are identical to each other.

04:50.000 --> 04:53.000
Now, before we delve into the various types of tile here,

04:53.000 --> 04:57.000
I'd like to talk about the memory hierarchy.

04:57.000 --> 04:58.000
Or,

04:58.000 --> 04:59.000
more to the point,

04:59.000 --> 05:01.000
the lack of a memory hierarchy.

05:01.000 --> 05:02.000
So on most GPUs,

05:02.000 --> 05:04.000
if you get one,

05:04.000 --> 05:05.000
today,

05:05.000 --> 05:08.000
you'll have a single address space that spans the entire

05:08.000 --> 05:09.000
chip,

05:09.000 --> 05:11.000
kind of covering everything from the compute to the memory,

05:11.000 --> 05:15.000
and you'll have various layers of caches in between the compute and the memory.

05:15.000 --> 05:16.000
But we have,

05:16.000 --> 05:17.000
none of that here:

05:17.000 --> 05:18.000
no hierarchy of memory,

05:18.000 --> 05:19.000
no caches.

05:19.000 --> 05:20.000
And instead,

05:20.000 --> 05:23.000
each tile has its own address space.

05:23.000 --> 05:26.000
And if you're running code on one of these tiles,

05:26.000 --> 05:30.000
then you can only access the memory of the local tile.

05:30.000 --> 05:31.000
So your load and store

05:31.000 --> 05:32.000
instructions can

05:32.000 --> 05:34.000
only access the local tile.

05:34.000 --> 05:35.000
But of course,

05:35.000 --> 05:39.000
we need some way to move things around further than just a single tile.

05:39.000 --> 05:41.000
Otherwise life would be incredibly boring.

05:41.000 --> 05:44.000
So that's where the network on chip comes in.

05:44.000 --> 05:46.000
Now, before I can add that network to this diagram,

05:46.000 --> 05:48.000
I need to make some space.

05:48.000 --> 05:51.000
So I'm going to get rid of the external things here.

05:51.000 --> 05:54.000
And shuffle around from the physical grid to the logical grid.

05:54.000 --> 05:56.000
So all the same tiles,

05:56.000 --> 05:57.000
but we're just now

05:57.000 --> 05:59.000
viewing them in a slightly changed way.

05:59.000 --> 06:04.000
And now I can add the network on chip, or NoC, to the diagram here.

06:04.000 --> 06:06.000
So this is all of these purple arrows.

06:06.000 --> 06:12.000
And this links every tile to every other tile in a 2D torus.

06:12.000 --> 06:14.000
So if you go off the right edge,

06:14.000 --> 06:15.000
you end up on the left edge,

06:15.000 --> 06:16.000
if you go off the top edge,

06:16.000 --> 06:18.000
you end up on the bottom edge here.

06:18.000 --> 06:19.000
And each,

06:19.000 --> 06:21.000
NoC transaction,

06:21.000 --> 06:23.000
any tile can initiate a NoC transaction.

06:23.000 --> 06:28.000
And each transaction is effectively an asynchronous memory copy.

06:28.000 --> 06:31.000
Now, I've put the conceptual API to this on the slide here.

06:31.000 --> 06:36.000
You choose two tiles by giving the x and y coordinates in the tile grid.

06:36.000 --> 06:41.000
And you copy some bytes from the address space of one tile to the address space of the other.

06:41.000 --> 06:42.000
And in most cases,

06:42.000 --> 06:49.000
the tile that is causing this transaction to happen will be either the src or the dst.

06:49.000 --> 06:54.000
But you can do more exotic things if you so wish.
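
NOTE
Editor's sketch, not part of the talk. The conceptual API described above
(pick two tiles by x and y, copy bytes between their private address spaces,
return immediately, poll later) might be modelled as below. The class and
method names (Noc, async_copy, step), the 4 KiB per-tile memory and the
orientation of the 10 by 12 grid are all assumptions made for illustration;
they are not Tenstorrent's API.
```python
from dataclasses import dataclass
GRID_W, GRID_H = 10, 12  # grid size quoted in the talk; orientation assumed
@dataclass
class Transaction:
    src: tuple
    src_addr: int
    dst: tuple
    dst_addr: int
    size: int
    done: bool = False
class Noc:
    """Toy model: every tile owns a private byte-addressed memory."""
    def __init__(self, mem_bytes=4096):
        self.tiles = {(x, y): bytearray(mem_bytes)
                      for x in range(GRID_W) for y in range(GRID_H)}
        self.pending = []
    def async_copy(self, src, src_addr, dst, dst_addr, size):
        """Queue a copy and return immediately; nothing has moved yet."""
        txn = Transaction(src, src_addr, dst, dst_addr, size)
        self.pending.append(txn)
        return txn
    def step(self):
        """Stand-in for the hardware completing one queued transaction."""
        if self.pending:
            t = self.pending.pop(0)
            data = self.tiles[t.src][t.src_addr:t.src_addr + t.size]
            self.tiles[t.dst][t.dst_addr:t.dst_addr + t.size] = data
            t.done = True
noc = Noc()
noc.tiles[(1, 2)][0:5] = b"hello"
txn = noc.async_copy(src=(1, 2), src_addr=0, dst=(3, 4), dst_addr=16, size=5)
still_pending = not txn.done  # issued, but not yet performed
while not txn.done:  # the caller polls for completion
    noc.step()
result = bytes(noc.tiles[(3, 4)][16:21])
```
The point of the sketch is only the shape of the call: the copy is requested,
control returns straight away, and completion is observed by polling.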

06:55.000 --> 07:02.000
Now, I should have also mentioned that each NoC transaction has some overhead to it.

07:02.000 --> 07:05.000
So if you care about peak performance,

07:05.000 --> 07:06.000
the transfer size,

07:06.000 --> 07:07.000
the number of bytes,

07:07.000 --> 07:11.000
wants to be in the range of like 1 to 8 kilobytes.

07:11.000 --> 07:13.000
Now, if you're used to the NVIDIA world,

07:13.000 --> 07:15.000
it's going to sound like quite a lot.

07:15.000 --> 07:19.000
So quick change over to the CUDA world.

07:19.000 --> 07:20.000
In that world,

07:20.000 --> 07:22.000
you're thinking of like 128 bytes

07:22.000 --> 07:24.000
as the kind of typical transfer size,

07:24.000 --> 07:25.000
because in that world,

07:25.000 --> 07:27.000
you have 32 threads in a warp.

07:27.000 --> 07:29.000
And each thread issues a four-byte load,

07:29.000 --> 07:33.000
and the hardware will coalesce all of that into units of 128 bytes.

07:33.000 --> 07:35.000
But switching back to Tenstorrent,

07:35.000 --> 07:37.000
we have none of that.

07:37.000 --> 07:40.000
We have no coalescing of memory.

07:40.000 --> 07:41.000
You have one thread,

07:41.000 --> 07:44.000
and you issue a single large 1D memory copy.

07:44.000 --> 07:46.000
Which is a little bit limiting,

07:46.000 --> 07:48.000
but hopefully not too much so.

07:48.000 --> 07:50.000
I guess also limiting is that we can only move

07:50.000 --> 07:53.000
rightwards and upwards on this particular network.

07:53.000 --> 07:54.000
But for that problem,

07:54.000 --> 07:56.000
we have a very simple solution,

07:56.000 --> 07:57.000
which is,

07:57.000 --> 08:00.000
a second network on chip that is identical in every way,

08:00.000 --> 08:03.000
except all of the arrows go in the other direction.

08:03.000 --> 08:04.000
So this guy can go,

08:04.000 --> 08:05.000
you know,

08:05.000 --> 08:06.000
left and down,

08:06.000 --> 08:07.000
and again it's a 2D torus.

08:07.000 --> 08:08.000
If you go off the far left,

08:08.000 --> 08:09.000
you end up on the far right.

08:09.000 --> 08:10.000
If you go off the far bottom,

08:10.000 --> 08:11.000
you end up on the far top.
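
NOTE
Editor's sketch, not part of the talk. With two one-way torus networks, one
moving only rightwards and upwards and the other only leftwards and downwards,
the hop count between two tiles is a pair of modular differences. The function
names, the choice that "up" means increasing y, and the 10 wide by 12 tall
orientation are assumptions made for illustration.
```python
GRID_W, GRID_H = 10, 12  # assumed orientation of the 10 by 12 grid
def hops_right_up(src, dst):
    """Hops on the network that only moves rightwards and upwards."""
    (sx, sy), (dx, dy) = src, dst
    return (dx - sx) % GRID_W + (dy - sy) % GRID_H
def hops_left_down(src, dst):
    """Hops on the second network, where every arrow is reversed."""
    (sx, sy), (dx, dy) = src, dst
    return (sx - dx) % GRID_W + (sy - dy) % GRID_H
def cheaper_network(src, dst):
    """Pick whichever network reaches dst in fewer hops."""
    a, b = hops_right_up(src, dst), hops_left_down(src, dst)
    return ("right/up", a) if a <= b else ("left/down", b)
wrap = hops_right_up((9, 0), (0, 0))      # one step off the right edge
long_way = hops_left_down((9, 0), (0, 0))  # nine steps leftwards
choice = cheaper_network((0, 0), (9, 11))
```
Hop count is only a rough proxy for cost; real routing and congestion
behaviour are not modelled here.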

08:11.000 --> 08:13.000
And both of these networks,

08:13.000 --> 08:15.000
in addition to doing plain copies,

08:15.000 --> 08:16.000
can do multicast and atomics,

08:17.000 --> 08:19.000
and all three of these things,

08:19.000 --> 08:20.000
plain copies,

08:20.000 --> 08:21.000
multicast,

08:21.000 --> 08:22.000
and atomics,

08:22.000 --> 08:24.000
are naturally asynchronous.

08:24.000 --> 08:25.000
So you fire them off,

08:25.000 --> 08:27.000
and then if you care about whether they

08:27.000 --> 08:28.000
complete,

08:28.000 --> 08:29.000
and then you can,

08:29.000 --> 08:30.000
then you can poll to ask whether they have

08:30.000 --> 08:31.000
yet completed.

08:31.000 --> 08:32.000
And this is,

08:32.000 --> 08:34.000
the second major difference,

08:34.000 --> 08:36.000
to the usual GPU model,

08:36.000 --> 08:38.000
in that if you need to hide any,

08:38.000 --> 08:40.000
latency,

08:40.000 --> 08:41.000
i.e. of memory,

08:41.000 --> 08:45.000
then you are using asynchronous APIs to do that.

08:45.000 --> 08:46.000
No automatic

08:46.000 --> 08:47.000
context switching

08:47.000 --> 08:49.000
between multiple warps.

08:49.000 --> 08:50.000
So again,

08:50.000 --> 08:53.000
another major thing to bear in mind there.

08:53.000 --> 08:54.000
So,

08:54.000 --> 08:55.000
you're going to want to hear about,

08:55.000 --> 08:57.000
these various types of tile.

08:57.000 --> 08:59.000
And obviously the

08:59.000 --> 09:00.000
compute ones are

09:00.000 --> 09:01.000
where the magic really happens,

09:01.000 --> 09:03.000
but we're going to keep you waiting just a little bit,

09:03.000 --> 09:04.000
and look at,

09:04.000 --> 09:06.000
every other type of tile first.

09:06.000 --> 09:07.000
So.

09:07.000 --> 09:08.000
So,

09:08.000 --> 09:10.000
looking at the ARC tile first,

09:10.000 --> 09:12.000
this guy is boring,

09:12.000 --> 09:13.000
but it has to be there to do

09:13.000 --> 09:14.000
power

09:14.000 --> 09:15.000
and thermal

09:15.000 --> 09:16.000
and reset management,

09:16.000 --> 09:17.000
you know,

09:17.000 --> 09:18.000
things that have to happen somewhere,

09:18.000 --> 09:20.000
but you don't really have to care about what they,

09:20.000 --> 09:21.000
really are.

09:21.000 --> 09:24.000
So code for this particular tile is currently closed source,

09:24.000 --> 09:26.000
but it's not really a problem,

09:26.000 --> 09:27.000
because you can basically forget

09:27.000 --> 09:29.000
that this tile is even there,

09:29.000 --> 09:30.000
other than if you need to reset the board

09:30.000 --> 09:32.000
or change the clock speed,

09:32.000 --> 09:36.000
so we can move swiftly on to a more interesting type of tile.

09:36.000 --> 09:37.000
So,

09:37.000 --> 09:38.000
DRAM tiles.

09:38.000 --> 09:40.000
Each DRAM tile is a bridge between the

09:40.000 --> 09:41.000
network on chip,

09:41.000 --> 09:43.000
drawn here in purple and in teal,

09:43.000 --> 09:45.000
and a DRAM module,

09:45.000 --> 09:49.000
each DRAM module having two gigs of GDDR6,

09:49.000 --> 09:51.000
and these tiles can't initiate

09:51.000 --> 09:53.000
NoC transactions of their own,

09:53.000 --> 09:55.000
but you can read or write from them from other tiles,

09:55.000 --> 09:57.000
using a NoC read or write.

09:57.000 --> 09:58.000
And if you do do that,

09:58.000 --> 10:00.000
then the address space of these tiles is extremely simple.

10:00.000 --> 10:02.000
The first one gig is the first one gig channel.

10:02.000 --> 10:05.000
The second one gig is the second one gig channel,

10:05.000 --> 10:08.000
and the rest of the address space is unmapped.
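
NOTE
Editor's sketch, not part of the talk. The DRAM tile layout described above
(first gigabyte is the first channel, second gigabyte is the second channel)
can be written as a small address decoder. The function name is hypothetical,
and treating everything past 2 GiB as unmapped is the editor's reading of the
speaker's wording rather than a documented fact.
```python
GIB = 1 << 30
def decode_dram_address(addr):
    """Map a DRAM-tile address to (channel, offset within that channel)."""
    if addr < 0:
        raise ValueError("address must be non-negative")
    channel, offset = divmod(addr, GIB)
    if channel >= 2:
        raise ValueError("address is outside the two 1 GiB channels")
    return channel, offset
first = decode_dram_address(0)
second = decode_dram_address(GIB + 5)
```
A caller placing a large buffer would use this kind of split to decide which
channel, and which offset inside it, a given byte lives at.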

10:08.000 --> 10:09.000
So, these tiles are, you know,

10:09.000 --> 10:12.000
where you're storing all of your large...

10:12.000 --> 10:13.000
tensors.

10:13.000 --> 10:15.000
If we move on again,

10:15.000 --> 10:17.000
we reach the PCI Express,

10:17.000 --> 10:18.000
the PCIe tile.

10:18.000 --> 10:21.000
So, this guy is a bridge between the network on chip,

10:21.000 --> 10:22.000
again in purple and in teal,

10:22.000 --> 10:25.000
and the host address space.

10:25.000 --> 10:28.000
So, if the host system issues a read or write

10:28.000 --> 10:29.000
against the board,

10:29.000 --> 10:31.000
this tile will translate that from a PCI Express

10:31.000 --> 10:32.000
read or write into a NoC

10:32.000 --> 10:36.000
read or write against the address space of some other tile.

10:36.000 --> 10:38.000
And we've got a four hundred

10:38.000 --> 10:39.000
and ninety-six megabyte

10:39.000 --> 10:41.000
aperture for that purpose,

10:41.000 --> 10:43.000
broken up into a number of regions,

10:43.000 --> 10:47.000
and each region can be mapped to a separate tile's

10:47.000 --> 10:49.000
address space,

10:49.000 --> 10:51.000
depending on where you want to map them to.

10:51.000 --> 10:52.000
Going the other way,

10:52.000 --> 10:54.000
if some other tile issues a read or write

10:54.000 --> 10:56.000
against this tile using the NoC,

10:56.000 --> 10:58.000
this tile will translate it to a PCI Express read or

10:58.000 --> 11:01.000
write against pinned physical host memory.

11:01.000 --> 11:02.000
And if we're using one gig,

11:02.000 --> 11:03.000
huge pages,

11:03.000 --> 11:04.000
you can pin a four

11:04.000 --> 11:08.000
gigabyte aperture going in that direction.

11:08.000 --> 11:10.000
So, moving on again,

11:10.000 --> 11:13.000
we finally reach something particularly interesting.

11:13.000 --> 11:15.000
So, this is an Ethernet tile.

11:15.000 --> 11:17.000
And at the top now,

11:17.000 --> 11:19.000
we have a RISC-V core,

11:19.000 --> 11:21.000
which is what makes this guy interesting.

11:21.000 --> 11:23.000
So, this has the RV32

11:23.000 --> 11:25.000
IM instruction set.

11:25.000 --> 11:27.000
So, we've got a set of

11:27.000 --> 11:29.000
integer GPRs, there's 32,

11:29.000 --> 11:31.000
each 32 bits wide,

11:32.000 --> 11:35.000
and you can do all the usual RISC-V things.

11:35.000 --> 11:36.000
So, you can do integer

11:36.000 --> 11:37.000
arithmetic.

11:37.000 --> 11:39.000
We can do control flow with branching and calling.

11:39.000 --> 11:42.000
And we've got a little load-store unit.

11:42.000 --> 11:45.000
And then a number of things on the memory bus,

11:45.000 --> 11:47.000
which can be accessed by that load-store unit.

11:47.000 --> 11:49.000
So, you've got a block of SRAM there,

11:49.000 --> 11:52.000
in pink, that holds all the code and

11:52.000 --> 11:55.000
data for this RISC-V core.

11:55.000 --> 11:57.000
And then on the left of SRAM,

11:57.000 --> 12:00.000
two NoC interface units, or NIUs,

12:00.000 --> 12:05.000
and these are how the RISC-V talks to the NoC.

12:05.000 --> 12:07.000
So, these guys can either initiate

12:07.000 --> 12:09.000
NoC transactions of their own,

12:09.000 --> 12:10.000
or they can service transactions

12:10.000 --> 12:13.000
which have been initiated by other tiles.

12:13.000 --> 12:16.000
So, if you are running code on this RISC-V,

12:16.000 --> 12:19.000
and you want to initiate a NoC transaction,

12:19.000 --> 12:21.000
then you take the arguments for the async

12:21.000 --> 12:25.000
memcpy, you put them into some memory-mapped registers via this

12:25.000 --> 12:28.000
load-store unit, and then you write a go signal, again

12:28.000 --> 12:29.000
to some memory-mapped

12:29.000 --> 12:30.000
registers.

12:30.000 --> 12:33.000
And if you care whether a thing has completed

12:33.000 --> 12:35.000
yet, again, you can read from some memory-mapped

12:35.000 --> 12:37.000
registers to find out whether the things

12:37.000 --> 12:38.000
have finished.

12:38.000 --> 12:41.000
So, if you are used to doing any kind of low-level

12:41.000 --> 12:43.000
kind of like driver-dev,

12:43.000 --> 12:46.000
then this pattern should be familiar to you.

12:46.000 --> 12:48.000
And if not, you can wrap it up in a nicer

12:48.000 --> 12:53.000
API that hides all of the finicky little details.
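
NOTE
Editor's sketch, not part of the talk. The driver-style pattern described
above (write the arguments into memory-mapped registers, write a go signal,
then poll a status register) is shown here against a software stand-in for
the hardware. The register offsets, the FakeNiu class and the wrapper names
are all hypothetical; the real NIU register map is not given in the talk.
```python
# Hypothetical register offsets, chosen only for this illustration.
REG_SRC, REG_DST, REG_SIZE, REG_GO, REG_STATUS = 0x00, 0x04, 0x08, 0x0C, 0x10
class FakeNiu:
    """Software stand-in for a memory-mapped NoC interface unit."""
    def __init__(self, latency=3):
        self.regs = {REG_SRC: 0, REG_DST: 0, REG_SIZE: 0, REG_STATUS: 0}
        self.latency = latency
        self.countdown = 0
        self.issued = []
    def write(self, reg, value):
        if reg == REG_GO:
            # The go signal latches the argument registers and starts the copy.
            self.issued.append((self.regs[REG_SRC], self.regs[REG_DST],
                                self.regs[REG_SIZE]))
            self.regs[REG_STATUS] = 1  # busy
            self.countdown = self.latency
        else:
            self.regs[reg] = value
    def read(self, reg):
        if reg == REG_STATUS and self.countdown > 0:
            self.countdown -= 1  # pretend time passes on every status poll
            if self.countdown == 0:
                self.regs[REG_STATUS] = 0  # idle again
        return self.regs[reg]
def noc_async_copy(niu, src, dst, size):
    """The nicer wrapper: hides the individual register writes."""
    niu.write(REG_SRC, src)
    niu.write(REG_DST, dst)
    niu.write(REG_SIZE, size)
    niu.write(REG_GO, 1)
def noc_wait(niu):
    """Poll the status register until the unit reports idle."""
    polls = 0
    while niu.read(REG_STATUS) != 0:
        polls += 1
    return polls
niu = FakeNiu()
noc_async_copy(niu, src=0x1000, dst=0x2000, size=4096)
polls = noc_wait(niu)
```
Between noc_async_copy and noc_wait the core is free to do other work, which
is how latency gets hidden when there is only one thread per core.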

12:54.000 --> 12:56.000
Now, on the other side of SRAM,

12:56.000 --> 12:57.000
we have two other units,

12:57.000 --> 12:59.000
we have the Ethernet transmit unit,

12:59.000 --> 13:01.000
and the Ethernet receive unit.

13:01.000 --> 13:03.000
And this pair of units is again an asynchronous

13:03.000 --> 13:05.000
memcpy thing.

13:05.000 --> 13:08.000
So, you can do an asynchronous

13:08.000 --> 13:11.000
memcpy, albeit only from the Ethernet tile

13:11.000 --> 13:14.000
that is the source here to some other Ethernet tile.

13:14.000 --> 13:17.000
So, you know, data can flow from the SRAM in

13:17.000 --> 13:20.000
this tile, through the Ethernet transmit unit,

13:20.000 --> 13:22.000
over a point-to-point

13:22.000 --> 13:24.000
Ethernet link, to some other Ethernet

13:24.000 --> 13:27.000
tile, through that tile's receive unit,

13:27.000 --> 13:29.000
you know, and into that other tile's SRAM.

13:29.000 --> 13:32.000
And the sender

13:32.000 --> 13:33.000
specifies the destination address

13:33.000 --> 13:36.000
to write to, hence why I have not had to put

13:36.000 --> 13:39.000
the receive unit on the memory bus here.

13:39.000 --> 13:42.000
So, I should also say that if the Ethernet link

13:42.000 --> 13:45.000
flaps, then the RISC-V is responsible for

13:45.000 --> 13:48.000
noticing this and retraining the Ethernet link.

13:48.000 --> 13:50.600
And again, there's a tiny little bit of a closed source

13:50.600 --> 13:54.400
somewhere for doing that retraining should you need to.

13:54.400 --> 13:57.600
Which is unfortunate, but it is what it is.

13:57.600 --> 13:59.800
Time check, good.

13:59.800 --> 14:02.720
So moving on again, we reach what is actually

14:02.720 --> 14:04.400
kind of properly interesting.

14:04.400 --> 14:07.200
So this is a compute tile.

14:07.200 --> 14:09.200
Somewhat similar to what we just saw,

14:09.200 --> 14:11.400
albeit with five RISC-V cores up there,

14:11.400 --> 14:13.000
rather than just one.

14:13.000 --> 14:16.000
Again, it's the RV32IM instruction set on each.

14:16.000 --> 14:21.920
And one bare metal thread running on each.

14:21.920 --> 14:25.320
So again, it is one thread per RISC-V core.

14:25.320 --> 14:28.760
And if you want to hide any latency, then you

14:28.760 --> 14:31.840
are using asynchronous APIs to do so.

14:31.840 --> 14:34.080
There is no automatic context switching

14:34.080 --> 14:35.240
between multiple threads.

14:38.920 --> 14:39.280
Yeah, OK.

14:39.280 --> 14:43.600
So again, on the left of SRAM, two NIUs, just as previously,

14:43.600 --> 14:46.720
and then a huge block of RAM here.

14:46.720 --> 14:48.640
We've got more RAM here than on the previous tile,

14:48.640 --> 14:52.280
so the Ethernet tiles had 256K of RAM on them.

14:52.280 --> 14:54.480
These guys have a meg and a half.

14:54.480 --> 14:57.200
But it serves the same role: it holds all the code

14:57.200 --> 15:02.320
and the data for these five RISC-V cores.

15:02.320 --> 15:04.240
And then the majority of the tile on the right

15:04.240 --> 15:06.720
is taken up by this Tensix coprocessor, which is where

15:06.720 --> 15:09.120
the magic really happens.

15:09.120 --> 15:12.440
Now, I've drawn it on the general memory bus here,

15:12.440 --> 15:14.400
but in practice, only the three RISC-Vs

15:14.400 --> 15:17.000
on the right can meaningfully talk to it.

15:17.000 --> 15:18.960
Which leads to a very natural division of labor,

15:18.960 --> 15:20.760
where the three RISC-Vs on the right

15:20.760 --> 15:23.320
drive the coprocessor, leaving the two RISC-Vs

15:23.320 --> 15:26.120
on the left to drive the two NIUs.

15:26.120 --> 15:30.400
And then they can all communicate through SRAM to share things.

15:30.400 --> 15:31.480
So obviously, we need to zoom in.

15:31.480 --> 15:34.440
on this guy, because this is where the magic

15:34.440 --> 15:36.160
really happens.

15:36.160 --> 15:37.400
And there we go.

15:37.400 --> 15:39.800
So I've drawn the front-end type bits

15:39.800 --> 15:43.680
in green, back-end type bits in brown,

15:43.680 --> 15:47.200
memory is in pink, and instruction arrows in lilac,

15:47.200 --> 15:51.280
if the lilac is visible on the slide.

15:51.280 --> 15:53.960
So at the top here, we've got three instruction

15:53.960 --> 15:59.080
FIFOs, FIFO being another word for queue,

15:59.080 --> 16:01.560
if that's not a term that's familiar to you.

16:01.560 --> 16:06.640
And there's no real control flow in the front-end here.

16:06.640 --> 16:08.760
If there is any control flow, which there

16:08.760 --> 16:13.120
will be, then the RISC-V cores, now zoomed out,

16:13.120 --> 16:16.120
will resolve all of the control flow

16:16.120 --> 16:18.200
that you have, and then each RISC-V

16:18.200 --> 16:20.560
will push a stream of Tensix instructions

16:20.560 --> 16:22.880
into one of these FIFOs.

16:22.880 --> 16:24.120
And we've got three FIFOs here.

16:24.120 --> 16:25.720
Hence why only three RISC-Vs

16:25.720 --> 16:27.800
can meaningfully talk to this thing.

16:31.560 --> 16:34.240
And then, so there are two ways to push instructions

16:34.240 --> 16:36.200
into these FIFOs.

16:36.200 --> 16:37.600
The first one, you can probably guess

16:37.600 --> 16:39.560
on the theme of what I've been saying,

16:39.560 --> 16:42.640
is that you can do memory-mapped writes to some memory-mapped

16:42.640 --> 16:46.000
registers to write into these FIFOs.

16:46.000 --> 16:48.640
But you're going to be pushing a lot of Tensix instructions

16:48.640 --> 16:51.920
into this guy to do the work.

16:51.920 --> 16:54.920
So there's a second, cheaper way to push things in,

16:54.920 --> 16:58.320
which requires a quick diversion into RISC-V.

16:58.320 --> 17:01.040
So if you're familiar with RISC-V,

17:01.040 --> 17:04.400
you should know that three quarters of the opcode space

17:04.520 --> 17:07.840
is reserved for the RVC extension.

17:07.840 --> 17:11.360
But none of these cores implement the RVC extension,

17:11.360 --> 17:13.400
which leaves that three quarters of the opcode space

17:13.400 --> 17:16.160
available for other use.

17:16.160 --> 17:18.520
So that's been repurposed.

17:18.520 --> 17:21.160
If any of these RISC-V cores are given

17:21.160 --> 17:24.400
an instruction in the RVC opcode space,

17:24.400 --> 17:26.880
they will not treat it as an RVC instruction,

17:26.880 --> 17:29.120
and they will instead treat it as an opaque

17:29.120 --> 17:31.480
32-bit value to push onto the FIFO attached

17:31.480 --> 17:34.000
to that core.
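
NOTE
Editor's sketch, not part of the talk. In the standard RISC-V encoding, a
32-bit instruction has its two lowest bits set to 0b11, and the other three
of the four low-bit patterns belong to the compressed (RVC) extension, which
is where "three quarters of the opcode space" comes from. The Core class below
is a hypothetical model of the repurposing described above; how the hardware
transforms the word before it reaches the FIFO is not modelled.
```python
def is_rvc_encoding(word):
    """True if the low two bits are not 0b11, i.e. the RVC opcode space."""
    return (word & 0b11) != 0b11
class Core:
    """Toy core: ordinary instructions execute, RVC-space words go to a FIFO."""
    def __init__(self):
        self.fifo = []      # stand-in for the instruction FIFO attached to this core
        self.executed = []  # ordinary RISC-V instructions
    def issue(self, word):
        if is_rvc_encoding(word):
            self.fifo.append(word)  # pushed as an opaque value
        else:
            self.executed.append(word)
core = Core()
core.issue(0x00000013)  # addi x0, x0, 0 (nop): low bits 0b11, a normal instruction
core.issue(0x12345670)  # low bits 0b00: lands in the RVC opcode space
rvc_fraction = sum(is_rvc_encoding(w) for w in range(4)) / 4
```
The appeal of this route over memory-mapped writes is that the push costs a
single instruction slot rather than an address computation plus a store.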

17:34.520 --> 17:35.680
Whichever way we're pushing,

17:35.680 --> 17:38.480
into these FIFOs, if we then look down,

17:38.480 --> 17:42.480
we get to the next unit, which are these macro-expanders,

17:42.480 --> 17:44.120
which are configurable units that allow

17:44.120 --> 17:47.240
one instruction to expand out to several.

17:47.240 --> 17:49.600
And if you configure and use these guys properly,

17:49.600 --> 17:54.120
then you can keep the back end of the Tensix unit fully

17:54.120 --> 17:57.440
saturated while only needing to issue

17:57.440 --> 18:01.960
one instruction every kind of ten cycles or so from the RISC-V,

18:02.040 --> 18:06.280
because you can expand out one to ten on average,

18:06.280 --> 18:08.800
which frees up the RISC-V to do other things,

18:08.800 --> 18:11.440
such as resolving all of the control flow.

18:12.760 --> 18:15.320
Stepping down again, the next unit here is the synchronization

18:15.320 --> 18:17.800
unit, and it contains, well,

18:17.800 --> 18:20.000
it gets three streams of instructions flowing

18:20.000 --> 18:22.000
in from the top, and it then muxes them out

18:22.000 --> 18:24.680
to the various back end units underneath.

18:24.680 --> 18:26.520
And it contains the mutexes and semaphores

18:26.520 --> 18:28.480
to control the relative ordering of things

18:28.480 --> 18:30.200
between these three streams,

18:30.200 --> 18:31.760
which you're going to want to use

18:31.760 --> 18:33.680
because the various back end units here

18:33.680 --> 18:37.000
are shared between the three incoming streams,

18:37.000 --> 18:41.600
both their execution units and the registers in all of them,

18:41.600 --> 18:43.640
which is a little punchy.

18:43.640 --> 18:46.440
Didn't rock quite right, just coming up.

18:46.440 --> 18:48.560
OK, fine.

18:48.560 --> 18:51.040
I will therefore go very briefly

18:51.040 --> 18:53.520
over all of the units here.

18:53.520 --> 18:56.320
Unpackers on the left to move stuff from SRAM

18:56.320 --> 18:59.200
into the unit; we've got matmuls in the middle unit there,

18:59.200 --> 19:01.680
that's where the matmuls really happen.

19:01.680 --> 19:04.320
FPA-T4, or BF16, or half-rate.

19:04.320 --> 19:08.160
Well, BF16-ish at half-rate is a little weird.

19:08.160 --> 19:10.560
F16-ish, it should curate,

19:10.560 --> 19:12.360
and then you can push the data back out

19:12.360 --> 19:14.960
through the packers back out to SRAM, after you've finished

19:14.960 --> 19:18.000
working with it, and then all of the config on the side.

19:18.000 --> 19:20.520
You can either drive it from memory-mapped writes

19:20.520 --> 19:22.800
from the RISC-Vs or from the scalar unit

19:22.800 --> 19:24.400
on the side, which helps you drive

19:24.400 --> 19:27.200
all of this config. And I'm out of time,

19:27.200 --> 19:28.840
so that will have to do.

19:28.840 --> 19:31.160
And you can then zoom out and we are back

19:31.160 --> 19:35.000
to, well, where we started with this. Nice, perfect, thank you.

19:35.000 --> 19:36.160
Give him a round of applause.

19:36.160 --> 19:43.280
And just like the previous speakers, I'm going to kick Peter

19:43.280 --> 19:45.720
out over there, so you can go find him if you have some more questions

19:45.720 --> 19:47.280
about his presentation.

19:47.280 --> 19:48.440
Martin, there you are, ready?

