WEBVTT

00:00.000 --> 00:11.840
Good morning, everybody. Hi, I'm Felix. You might know me as FCLC, here

00:11.840 --> 00:16.800
there, and everywhere. The talk this morning is a fun one. There are going to be parts where

00:16.800 --> 00:22.000
I'm being humorous. There are portions where I might be getting a little bit more serious.

00:22.000 --> 00:25.840
There are portions where I'm going to call things very silly. Namely, I'm going to be talking

00:25.840 --> 00:30.960
about the actual architectural implications of various portions of the RISC-V ISA specification.

00:30.960 --> 00:35.840
I want to make it clear that my problems are with the specification, not the people that made

00:35.840 --> 00:41.920
that specification. I want to make that explicitly clear because when I call things silly online,

00:41.920 --> 00:46.480
sometimes people get a little bit angry. Granted, that's how I've gotten every job I've ever had,

00:46.480 --> 00:52.000
was making people angry and fixing it, but that's neither here nor there. So the talk is titled

00:52.080 --> 00:55.440
"RISC-V had 40 years to learn from. What did they get right and what did they get

00:55.440 --> 01:01.760
hilariously wrong?" We're going to talk about a few things. Namely, the good, the bad, the ugly.

01:01.760 --> 01:07.440
We're going to talk about the RV64 specification more broadly, on the application side and the

01:07.440 --> 01:12.560
embedded side. We're going to talk a little bit about the 32-bit spec. We're going to talk about

01:13.520 --> 01:19.280
RISC-V compressed, and we're going to talk about RISC-V vector. All of these are fundamental

01:19.280 --> 01:22.720
contracts in the way that we actually have to address the machine. Because at the end of the

01:22.720 --> 01:26.640
day, we're building hardware for people and we're building hardware so that people can actually

01:26.640 --> 01:31.520
get things done. If we don't get the interfaces right, then that fundamentally compromises all the

01:31.520 --> 01:35.760
software at the input, it also compromises the software at the output when it actually gets

01:35.760 --> 01:40.880
lowered to the machine. Be that a micro-op sequencer, be that an in-order core, an out-of-order core,

01:40.960 --> 01:49.760
in a server or beyond. First, who's this bozo? I realize not everyone outside of Canada knows

01:49.760 --> 01:54.960
the definition of bozo. I think it's apt and accurate: "a stupid or insignificant person,

01:54.960 --> 02:01.440
typically used of a man." That's me. I'm on Mastodon as FCLC. I'm proudly known as a practical

02:01.440 --> 02:07.440
troublemaker. My employer is AI-neco. We're an open source hardware RISC-V company. I want to

02:07.440 --> 02:13.280
make it clear. I am representing my opinions. These are mine. They can't have them. I help run HPC

02:13.280 --> 02:17.840
dot social on the side and I'm a contributor for the publication, chips and cheese, where we do

02:17.840 --> 02:21.680
hardcore architecture deep dives for performance modeling across different types of devices.

02:22.960 --> 02:28.720
So, why do we like RISC-V? What is the fundamental premise of RISC-V? I know we're in

02:28.720 --> 02:32.960
the RISC-V devroom, so it should be self-evident. But I want to reiterate the beauty of RISC-V

02:33.200 --> 02:38.480
as a concept. For the first time ever, we actually have a true open source, community-driven

02:39.200 --> 02:45.120
instruction set architecture. We have a set of specifications that allow anyone and everyone to go

02:45.120 --> 02:50.640
through and build a chip. Be that a toy chip in academia for your first undergraduate electrical engineering

02:50.640 --> 02:56.240
course, or actual high-performance designs. Be that from the likes of Tenstorrent, from Rivos,

02:56.240 --> 03:01.040
Andes, and so on. There's a bunch of vendors and manufacturers actually going to market,

03:01.040 --> 03:05.040
betting their lives and their careers on this specification that is fundamentally controlled

03:05.040 --> 03:10.640
by the community. That is a really good thing. This is starting to move in the direction of hardware

03:10.640 --> 03:15.920
that is open and that is actually readable and implementable. The same way that we had the compiler

03:15.920 --> 03:20.560
change in the 80s and 90s moving towards open source with implementations of Linux and so on.

03:21.680 --> 03:26.160
Giving that power to the user anywhere and everywhere and giving the user more freedom means that

03:26.160 --> 03:31.840
implementations for specific use cases can thrive. That's really important and it's a good step forward.

03:33.600 --> 03:37.360
It's real. It's actually real. This would have been a dream.

03:38.080 --> 03:44.800
Now, RV64 embedded is the version of the RV64 spec that is meant for embedded use cases.

03:45.360 --> 03:51.440
It's a really weird spec in no small part because why am I making an embedded microcontroller type

03:52.400 --> 03:58.240
system where I actually need a 64-bit machine? But one of the things you end up with in RISC-V

03:58.240 --> 04:04.160
is that because anyone and everyone has a use case that may or may not differ from the original

04:04.160 --> 04:08.960
use case, you need to support that use case. We end up with a plethora of possible implementation

04:08.960 --> 04:15.120
spaces because it may be that someone actually does need an embedded 64 bit CPU and that's fine.

04:15.360 --> 04:18.400
This is a fun one.

04:22.960 --> 04:27.760
It has been said that RISC-V 32, especially on the embedded side, is "I Can't Believe It's Not

04:27.760 --> 04:33.040
MIPS," and I think that's pretty much true. If you look at the academic usage and so on, where

04:33.040 --> 04:37.600
are we actually using it? And that's fine, because MIPS feels familiar, and that means people can

04:37.600 --> 04:42.960
get to work faster. There are differences, and that's fine. But it means now I have a 32-bit embedded ISA

04:43.920 --> 04:49.280
that has consistent, mostly, support for the way that the memory model works, the way that it

04:49.280 --> 04:53.760
addresses things, the compilers, the toolchains. Which means I can go and take

04:53.760 --> 04:58.640
my design and, for the most part, use someone else's RISC-V 32-bit core, because this is an ISA.

04:58.640 --> 05:03.520
An ISA is a contract, and when you go and implement that contract, you have guarantees of how things

05:03.520 --> 05:08.240
will work. You don't have guarantees on how the implementation may do things. An out-of-order

05:08.240 --> 05:12.720
versus an in-order core is a classical case where my performance characterization, my visibility

05:12.720 --> 05:17.920
of the machine in terms of how I get an output changes. But the actual underlying output itself

05:17.920 --> 05:22.720
does not change. If I'm saying I'm doing an FMA, the answer to that FMA is consistent and I will always

05:22.720 --> 05:27.360
get the same behavior if you implement things properly before you go through and get to the machine.

05:28.320 --> 05:39.840
So, the meat of the discussion. What is RVB? RVB is the RISC-V embedded specification.

05:40.400 --> 05:45.120
You'll see it change every few years as the committees get together and we add new instructions,

05:45.120 --> 05:50.560
new extensions and say now we are iterating on this specification. We are adding new guarantees

05:50.560 --> 05:55.680
for what you as the developer as a software person as the hardware person can implement with consistency.

05:55.680 --> 06:01.680
What are your guarantees there? What's in it? Well, how do we read and decipher this?

06:01.680 --> 06:10.320
RISC-V, B for "basic" is how I read it, for the embedded '23 profile. So, this is the set of ratified specifications

06:10.320 --> 06:13.760
that is based on: okay, what instructions am I going to have as a guarantee? You can

06:13.760 --> 06:20.160
essentially map this to things like Armv8, or v8.2, v8.3, and so on, all the way

06:20.160 --> 06:26.800
now to 9.4. You have consistencies and when I have my compiler and when I'm developing software

06:26.800 --> 06:32.640
I can just add my flag of "this is the profile I'm targeting," and then, because you can target that

06:32.640 --> 06:37.120
profile it also means that you can target those instructions. They're also guaranteed super

06:37.120 --> 06:42.960
sets of instructions. So, I am guaranteed that if I'm targeting RVB23, it also guarantees that I can

06:42.960 --> 06:48.960
target everything in RVB22, everything in RVB20, and so on and so forth. I should have been saying

06:48.960 --> 06:56.240
"B" there, but it's because I never deal with the embedded spec. What is RVA? RVA is the general

06:56.240 --> 07:02.160
purpose RISC-V application profile. This is the one that everyone basically wants and has been

07:02.160 --> 07:09.360
crying out for for years. How do we go through and have a real CPU? Something that I would be able to

07:09.360 --> 07:15.360
put in this laptop or in my desktop or in my server. These are the application class processors

07:15.440 --> 07:18.720
when you're designing a spec that means that you're making certain underlying assumptions

07:18.720 --> 07:22.720
of what the machine can and will be used for and that's what the application spec is.

07:23.920 --> 07:28.880
So, it's a general-purpose ISA target. It has everything from your floats, it has your doubles,

07:28.880 --> 07:33.840
it has your vectors, it has your compressed, it has your atomics, it has your memory model,

07:33.840 --> 07:39.440
and simple things like multiplication, integer support. This is a general-purpose ISA that

07:39.440 --> 07:45.200
can be used for any and all cases. It does not, however, prescribe what that use case needs to be.

07:45.760 --> 07:50.720
That can be a small core that's just powering up a Raspberry Pi like device and we're seeing a lot

07:50.720 --> 07:55.600
of those. What I would refer to as glorified TV set-top boxes, which is where the chips for the original

07:55.600 --> 08:01.360
Raspberry Pi came from. These are meant for small machines. At the same time, hey, you look at

08:01.360 --> 08:07.280
NEC, NEC, the big HPC provider from Japan. They have taken their vector engines, their vector

08:07.280 --> 08:10.880
cores, from what used to be the craziest of vector computers, and they're porting those to RISC-V,

08:07.280 --> 08:10.880
specifically the application profile, and their intention is to do an in-order core with

08:16.080 --> 08:20.880
really wide vectors. And what that allows them to do is, hey, I don't need to design an ISA.

08:20.880 --> 08:24.720
I don't need to redesign all the compilers in the world in terms of how do I support those

08:24.720 --> 08:29.840
and how the operating systems address the machine. Instead, they can get to the work of

08:29.840 --> 08:34.400
how do I actually extract performance from the machine. How do I make sure the compilers understand

08:34.400 --> 08:43.760
how wide my vectors are and what are the implications thereof? RISC-V compressed. We're going to,

08:43.760 --> 08:49.600
I'm going to start getting a little bit spicier in a second. The idea of a compressed instruction set

08:49.600 --> 08:56.800
is very simple. There are certain use cases where code density matters. When I say code density,

08:56.800 --> 09:02.800
remember that all RISC-V instructions are 32 bits. That is the fundamental design decision. And when

09:02.880 --> 09:07.360
you're designing a core, when you're designing a decoder for the hardware, what that means

09:07.360 --> 09:13.440
is I guarantee consistency in the size of my encodings. But there are cases. Think of: I'm actually,

09:13.440 --> 09:18.720
for work right now, as a cruel joke, having to write a boot ROM in the mask. That's actually going

09:18.720 --> 09:24.960
to be printed in silicon. It turns out silicon is really honking expensive. So having instructions

09:24.960 --> 09:31.520
that are compressed that are only 16 bits means that I can quote unquote fit twice as many instructions.

09:31.600 --> 09:35.200
Which means I'm not going to run out of space or I don't need to increase the amount of space

09:35.200 --> 09:39.920
on my actual lithography, like on my actual chip. So it has its use cases there.

09:41.360 --> 09:48.320
There are also arguments that I will refer to as reasonable, that mean that if you're looking

09:48.320 --> 09:53.520
at an I-cache, an instruction cache, being able to fit more instructions in there without having

09:53.520 --> 09:58.720
to go to a design that is fully microcoded, or that executes its own internal, I'd say its own internal

09:59.520 --> 10:03.520
language that does direct execution of the opcodes. Well, if I'm storing those all in the I-

10:03.520 --> 10:07.840
cache and I can fit twice as many instructions in the I-cache, that means my probability of a

10:07.840 --> 10:12.560
cache hit goes up significantly. That is a very reasonable argument for when and where to use compressed.

10:13.440 --> 10:23.360
But what's the cost? So, the cost. Yes, that went too fast there. So the underlying cost here

10:23.440 --> 10:30.080
is when I think about the machine, think about the encoding of an instruction. I have two bytes.

10:30.080 --> 10:36.480
I have 16 bits. Oh, the camera only fits here. Sorry, I thought it was the whole front.

10:37.840 --> 10:42.960
What that means though is now, if I have a set of bytes that represents an opcode, the 32-bit

10:42.960 --> 10:48.400
opcode. If I take that and then I say I'm going to do a compressed version,

10:49.360 --> 10:53.440
think about the encodings when I now put two of those instructions directly beside each other.

10:54.560 --> 11:01.120
Because now I don't necessarily know, at the front end, if what I'm executing is two

11:01.120 --> 11:08.480
RISC-V compressed instructions, or one single 32-bit instruction, because I have a variable

11:08.480 --> 11:13.680
boundary. And this is the biggest complaint you'll get out of all the x86 core designers,

11:13.680 --> 11:17.680
is we have to do variable length instructions because you don't know what you're getting.

11:17.680 --> 11:21.360
Am I getting a VEX instruction, an EVEX instruction, a classical instruction, and so on and so forth?
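To make the boundary problem concrete: RISC-V marks instruction length in the two lowest bits of the first 16-bit parcel (0b11 means a full 32-bit instruction; anything else is compressed). A minimal Python sketch of the decode walk, showing that finding where instruction k starts depends on the length of every instruction before it:

```python
def insn_len_bytes(parcel: int) -> int:
    """Length of a RISC-V instruction from its first 16-bit parcel.

    The base spec reserves the two lowest bits: 0b11 marks a 32-bit
    instruction; any other value marks a 16-bit compressed one.
    """
    return 4 if (parcel & 0b11) == 0b11 else 2

def find_boundaries(parcels):
    """Walk a fetch block and return the byte offset of each instruction.

    Note the serial dependency: offset k is unknown until every earlier
    length has been decoded, which is why a wide front end must instead
    speculate at every 16-bit boundary.
    """
    data = b"".join(p.to_bytes(2, "little") for p in parcels)
    offsets, pos = [], 0
    while pos < len(data):
        offsets.append(pos)
        parcel = int.from_bytes(data[pos:pos + 2], "little")
        pos += insn_len_bytes(parcel)
    return offsets

# c.addi (compressed, low bits 0b01) followed by the two parcels of a
# full-width addi (0x00000513, low bits 0b11):
print(find_boundaries([0x0505, 0x0513, 0x0000]))  # → [0, 2]
```

A sketch only; a real decoder also handles the reserved longer-than-32-bit encodings, which this ignores.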

11:23.200 --> 11:29.200
You also have the unfortunate reality that the promise of a RISC machine or a RISC ISA is: we're

11:29.200 --> 11:34.080
going to do very simple instructions that allow you to do anything you're trying to do.

11:34.080 --> 11:38.960
We're going to try to avoid complexity. What that means though is as new things come along,

11:38.960 --> 11:44.720
we need more fundamental building blocks. We need simple tools to be able to get the compiler to

11:44.720 --> 11:49.680
target different new cases, the assembly programmer and so on, the interfaces. But because of this

11:49.680 --> 11:55.120
problem of two opcodes being right beside each other in the encoding, you very quickly run into

11:55.120 --> 12:02.160
the scenario of: am I overloading an opcode? The solution to that very quickly becomes, well,

12:02.160 --> 12:08.640
now I have to reserve a significant portion of the opcode space, all the possibilities of encoding,

12:08.640 --> 12:14.240
to make sure that I don't have an overlap. I don't have a malformed instruction being executed.

12:15.040 --> 12:20.320
Net-net though, we only have 32 bits to encode all possible opcodes that RISC-V

12:20.320 --> 12:26.480
could or would ever want to use if we want to keep that promise of consistent usage of being able

12:26.480 --> 12:32.400
to stay in that 32-bit space, without also having to move to a 64-bit space on the instruction opcode side.

12:33.280 --> 12:38.640
Which means now I actually have to go through, and the committee has done this, and say: this entire

12:38.720 --> 12:43.760
portion of the all potential instructions that could ever be encoded has to be thrown away.
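The scale of that reservation is easy to state: of the four possible values of an instruction's two lowest bits, compressed claims three, leaving a single quadrant for all full-width encodings. A small sketch:

```python
# The two lowest bits of every RISC-V instruction select a "quadrant" of
# the encoding space. Compressed (RVC) claims three of the four; all
# 32-bit instructions must squeeze into the one remaining 0b11 quadrant.
ALL_QUADRANTS = [0b00, 0b01, 0b10, 0b11]
RVC_QUADRANTS = [0b00, 0b01, 0b10]

reserved_fraction = len(RVC_QUADRANTS) / len(ALL_QUADRANTS)
print(reserved_fraction)  # → 0.75
```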

12:45.280 --> 12:52.320
For a use case that I don't believe is frankly reasonable. This was the meme that I made as a joke

12:52.320 --> 12:58.640
that got me to write this talk in the first place. It's a problem. Specifically,

12:59.600 --> 13:04.080
I think we're going to get into a little bit more of the engineering now, but my biggest problem

13:04.080 --> 13:12.160
with the way that compressed was added is that RVA23 mandates RISC-V compressed as part of

13:12.160 --> 13:16.400
the application profile. When I'm designing and when I was designing in a previous life a high

13:16.400 --> 13:22.720
performance RISC-V core, we had to deal with things that were forced on us from people that

13:22.720 --> 13:30.320
thought engineering and technical decisions were ultimately a marketing and business question for RISC-V

13:30.320 --> 13:38.800
International and the RISC-V architecture. The silliness and, frankly, stupidity of being able to say

13:38.800 --> 13:44.960
that the way that we design the machine, the actual fundamental interface of how the machine must

13:44.960 --> 13:50.240
and will work now and into the future for successful architecture is actually a marketing decision.

13:51.120 --> 13:55.600
I think that offloads the reality that we have people that are giving their lives and their careers

13:55.600 --> 14:00.480
to try to build real chips for real people to serve real use cases, and yet the interfaces that

14:00.480 --> 14:07.760
allow us to define that machine are considered marketing? That's not okay. So, architectural discussion:

14:08.160 --> 14:13.040
non-consistent front end design and variable length decode. The way that we've made processors

14:13.040 --> 14:18.720
fast, one of many ways in the last decades, is by moving to what's known as an out-of-order machine with

14:18.720 --> 14:24.560
massively wide front-end decodes. You get an instruction and, you see, when you receive

14:24.640 --> 14:29.520
that instruction, I don't actually execute it. I receive data that is the implication that the machine

14:29.520 --> 14:34.240
needs to run. What is the operation that you are asking the machine to provide to you?

14:34.960 --> 14:41.920
That is what an instruction fundamentally is, but what that means is now I can decouple the actual

14:41.920 --> 14:47.280
execution of that instruction from its decoding, and that allows you to do things like reordering.

14:47.280 --> 14:51.600
It also allows you to do things, as was brought up during the FFmpeg talk, like instruction fusion,

14:51.600 --> 14:56.240
because this is an implementation question. I can look and see that someone's doing a

14:56.240 --> 15:01.680
gather and then a store. Well, okay, why am I not pipelining that? The instructions might not be

15:01.680 --> 15:05.600
back to back, but I can do the dependency analysis, the same way we do the dependency analysis

15:05.600 --> 15:10.000
in compilers, to say, well, I can go and I can fuse these blocks together, I can schedule these

15:10.000 --> 15:15.120
blocks in order right here, but I can also do the same thing for power, because when you're building

15:15.200 --> 15:21.840
a high-performance design meant for mobile applications, I know the heuristics on my race to sleep.

15:22.560 --> 15:26.160
I know that if I'm going to execute certain instructions, say on the vector side,

15:26.960 --> 15:32.560
it takes about, call it a hundred to 150 cycles for that unit to wake up. When that happens,

15:33.600 --> 15:38.480
that's going to burn a lot of power. So if I go and I execute a single or two or three vector

15:38.480 --> 15:43.600
instructions or SIMD instructions, and then I execute that and I go back to moving the data in the

15:43.600 --> 15:49.280
primary part of the core, it's still going to take 150 cycles for that to power back down.

15:49.280 --> 15:55.280
It's very easy in an out-of-order core to instead decide, hey, I can look at my instruction stream,

15:55.280 --> 16:01.200
schedule all these independent instructions to be in the same block and then turn the frequency

16:01.200 --> 16:06.720
back down. I can burn a lot less power because I'm not wasting power idle in a high power state.
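The batching argument can be sketched as a toy power model. The 150-cycle wake figure is the talk's illustrative number, not any real core's, and `powered_cycles` is a hypothetical helper:

```python
WAKE = 150  # cycles to power the vector unit up (illustrative figure)

def powered_cycles(issue_cycles):
    """Toy model: cycles the vector unit spends powered on, assuming it
    wakes before the first vector op of a burst and stays up until WAKE
    idle cycles pass. Batching ops into one burst pays the wake cost
    once; scattering them pays it on every gap."""
    if not issue_cycles:
        return 0
    total, prev = WAKE, issue_cycles[0]       # first wake-up
    for t in issue_cycles[1:]:
        gap = t - prev
        total += gap if gap < WAKE else WAKE  # stay up, or sleep and re-wake
        prev = t
    return total

scattered = powered_cycles([0, 500, 1000])  # three isolated vector ops
batched = powered_cycles([0, 1, 2])         # the same ops back to back
print(scattered, batched)  # → 450 152
```

The scheduler's job in the transcript's example is exactly turning the scattered case into the batched one.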

16:06.720 --> 16:10.720
This is one of the tricks we do for saving power, and you'll see this all over the place. But

16:10.720 --> 16:16.480
now, because of RISC-V compressed, what it means is I have variable-length input.

16:17.440 --> 16:22.880
I cannot as a machine designer predict the way that interfaces will be handed to me.

16:23.840 --> 16:30.880
Now, a trivial example is cache lines. Cache lines are powers of two, they're 32-byte aligned, right,

16:30.880 --> 16:36.480
64-byte aligned, implementation detail here. But now, what happens when I have an odd number of

16:36.480 --> 16:39.680
compressed instructions? Remember, compressed instructions are 16 bits.

16:40.960 --> 16:47.920
So now I have the weird, terrible case where, if I have, call it, three compressed instructions

16:48.480 --> 16:53.440
in a row, followed by one non-compressed instruction, because the compressed instructions by

16:53.440 --> 16:58.720
definition cannot hold the entire encoding space of the entire machine or else we would do everything

16:58.800 --> 17:07.360
in 16 bits, right. What it means is now I've got 16, 32, 48, but now at the 49th bit,

17:08.240 --> 17:12.640
my instruction now has to be crossing a boundary, if I have a 64-bit cache line, say.

17:13.200 --> 17:18.240
So now when I go to execute that, if I'm doing a pointer chasing pattern effectively in the machine,

17:18.960 --> 17:26.080
now the next half of my instruction at decode stage will be split across a different cache line.
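The arithmetic of that case, sketched with the example's illustrative 64-bit (8-byte) alignment window:

```python
WINDOW = 8  # bytes: the 64-bit alignment boundary from the example

def straddles(offset: int, size: int) -> bool:
    """True if an instruction at byte `offset` crosses a window boundary."""
    return offset // WINDOW != (offset + size - 1) // WINDOW

# Three 16-bit compressed instructions, then one 32-bit instruction:
# the full-width instruction starts at byte 6 and spans bytes 6..9,
# so its second half lands in the next window (in the worst case,
# the next cache line).
offset = 0
for size in (2, 2, 2, 4):
    print(offset, size, straddles(offset, size))
    offset += size
```

Only the last instruction straddles, which is exactly the "49th bit" case described above.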

17:27.040 --> 17:32.720
And heuristically, 50% of the time that I have any RISC-V compressed instructions in that

17:32.720 --> 17:37.680
machine, I have to deal with this. But it's not that easy, it's not that simple. Because one of the

17:37.680 --> 17:43.280
ways we've made machines fast over the past 30 years is we've gone to massively wider front ends,

17:43.920 --> 17:50.400
massively wider. What that means is I'll be decoding up to say 8, 10, 12 instructions at the same time

17:50.400 --> 17:55.520
as a big, strided load. And then I go through, I do my dependency analysis, I do my temporal analysis,

17:55.520 --> 17:59.600
I reorder them, I execute them, I retire them, and then the user gets math. Great.

18:00.720 --> 18:05.360
But now think about that 48-bit example in the way that things can be unaligned.

18:06.000 --> 18:10.320
Notice that now you have the case where for everything except the first port of the machine,

18:10.880 --> 18:18.080
you now have unknown-size entry points at every single boundary if you're trying to do a consistent,

18:18.080 --> 18:24.960
long load. So now every single portion of my decode has to be off by one or off by two,

18:24.960 --> 18:29.360
one way or the other. I never know how wide I'm going to be. I also now don't know how many instructions

18:29.360 --> 18:35.440
I'm getting. Because in the case where I'm getting 16 normal instructions, 32-bit encoded instructions,

18:35.440 --> 18:40.000
instead of the 16 just now, I could be getting 32 compressed instructions,

18:40.000 --> 18:45.440
which means now the delay of the entry into my front end. I have twice as much work to do on the

18:45.440 --> 18:48.640
same load, which means my heuristics, I don't know for my instruction scheduling yet.
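A quick sketch of that count ambiguity, using the transcript's own 16-versus-32 figures (so an illustrative 64-byte fetch):

```python
FETCH = 64  # bytes handed to the decoders per cycle (illustrative width)

# With fixed 32-bit instructions, the decoder always sees exactly 16.
fixed_count = FETCH // 4

# With compressed, instructions are 2 or 4 bytes, so the same fetch may
# hold anywhere from 16 to 32 instructions, and the decoder cannot size
# its downstream structures for one known count.
min_count, max_count = FETCH // 4, FETCH // 2
print(fixed_count, min_count, max_count)  # → 16 16 32
```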

18:49.520 --> 18:54.560
The natural conclusion there is: let's make a bigger front end. Let's add more decode capabilities.

18:54.560 --> 19:00.080
Let's go and amortize more space in the machine. But that's area. That's power. That's gates.

19:00.080 --> 19:04.000
That's SRAM. All these things that are really expensive. I don't know about you, but I would

19:04.000 --> 19:09.280
much rather use that area and that power to go into better branch prediction, or to save power

19:09.280 --> 19:14.240
to the user and give them better battery life when they pull out their phone. These are real

19:14.240 --> 19:20.640
end-use cases. You also go and talk to the people that would have been using this, because

19:20.720 --> 19:26.160
RISC-V compressed is effectively Thumb, but worse, for the people that have dealt with Arm assembly.

19:30.000 --> 19:36.240
Yeah, great. Because I care. But you do naturally end up in the case there where, okay,

19:37.280 --> 19:41.440
If we're implementing it, it's because we have users. Because you don't implement something for

19:41.440 --> 19:47.520
nobody. So then, if the use case was "okay, Arm has it, Arm must have a good reason for it,"

19:48.240 --> 19:52.960
I follow with the very natural follow-up question of: does Arm still have Thumb?

19:54.400 --> 19:59.040
It's been deprecated. It isn't available anymore. This was known to be happening before

19:59.040 --> 20:04.320
RISC-V compressed was even ratified. We knew it was going away, and the primary reason

20:04.320 --> 20:09.200
to make this exist was to have feature parity with arm. So if arm doesn't care about it,

20:09.920 --> 20:13.840
we can't find our own users that are justifying it within the spec. And the only thing that

20:13.840 --> 20:17.440
really happened was one implementation already had it and was like, well, we already put

20:17.440 --> 20:21.760
engineering resources into it. That's not good enough. That's not good enough folks.

20:22.880 --> 20:28.320
So we'll come back to RVC later. How much time do I have left, by the way?

20:30.160 --> 20:33.040
15? Okay, perfect. I want to leave some time for questions.

20:33.600 --> 20:38.000
RISC-V vector. Sometimes we want to do math, and we want to do a lot of math.

20:38.560 --> 20:42.720
The point of vector is to be very similar to Arm SVE, the Scalable Vector Extension.

20:42.800 --> 20:47.600
Let's go do the RISC-V vector extension. As a design consideration,

20:47.600 --> 20:55.200
it is a vector length agnostic design. The way the machine executes is intentionally such

20:55.200 --> 21:00.880
that the user does not need to worry about how wide the machine is when writing code.

21:00.880 --> 21:07.120
They may choose to worry about the size of the machine, but they don't have to.

21:07.360 --> 21:15.760
What this looks like is I go and I say execute on this block of memory.

21:16.480 --> 21:19.920
I'm not telling you how big that block is. Yeah, because it can be dynamic.

21:20.720 --> 21:23.760
And I'm saying this is the precision and run these instructions.

21:24.720 --> 21:28.960
Mentally, think of it as vertical lanes. This is just like the vector processors of

21:28.960 --> 21:33.840
old, which is why NEC is implementing it. This says I'm going to have a consistent stream of

21:33.840 --> 21:38.800
instructions where the same input, same type of input is always getting the same

21:38.800 --> 21:45.120
instruction executed upon it. I'm not going to care about how wide the machine is. I just go and say,

21:45.120 --> 21:50.720
great, here's some data, run instructions on it. That sounds actually really nice.
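That execution model is the classic strip-mined, vector-length-agnostic loop. A Python sketch of the pattern, where `vsetvl` models the hardware's grant and `VLMAX` stands in for whatever width the implementation happens to have (the code never hardcodes it):

```python
VLMAX = 4  # lanes this hypothetical hardware has; the code below never assumes it

def vsetvl(remaining: int) -> int:
    """Model of RVV's vsetvl: the machine grants min(remaining, VLMAX)."""
    return min(remaining, VLMAX)

def vector_add(dst, a, b):
    """Strip-mined, vector-length-agnostic add: each iteration asks the
    machine how many elements it will handle, so the same code runs
    unchanged on a 128-bit or a 1024-bit implementation."""
    i, n = 0, len(dst)
    while i < n:
        vl = vsetvl(n - i)
        for lane in range(vl):          # one vadd.vv in hardware
            dst[i + lane] = a[i + lane] + b[i + lane]
        i += vl
    return dst

print(vector_add([0] * 5, [1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
# → [11, 22, 33, 44, 55]
```

Widening `VLMAX` changes the iteration count but not the code or the result, which is exactly the portability promise being described.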

21:51.600 --> 21:56.160
That sounds actually great, because what it means is, from the open source community and the

21:56.160 --> 22:01.360
standard packaging community. What it allows me to do is I don't have to worry about the implementation

22:01.440 --> 22:07.040
that my user is going to get. I can just give them code and it will run as

22:07.040 --> 22:13.040
performantly as that machine is capable of executing. So reasonable goal, right? Why do we want,

22:13.040 --> 22:19.120
why did this come to be as a need? Well, think about it. Right now in the x86 world,

22:19.120 --> 22:24.320
which is where I come from, we have a case where AVX-512 is awesome, like, truly.

22:25.120 --> 22:31.680
But I never know if my user is going to have it. Well, I don't know if that user is going to have

22:31.680 --> 22:36.560
it, because 10 years ago when I was designing some of these algorithms, AVX-512 didn't exist,

22:36.560 --> 22:41.840
right? I got my timeline a little bit off there. But what it meant is I can't take advantage of the

22:41.840 --> 22:47.920
new machine that does new things. The use case of RVV and of vector-length-agnostic architectures

22:47.920 --> 22:53.920
and design choices is in five years' time when silicon gets faster, when vector engines get

22:53.920 --> 22:59.200
wider and bigger and deeper. Well, the machine can just automatically go and say, well,

22:59.840 --> 23:05.280
I don't care that it was 128 bits when it was originally designed. My machine does 1024 bits,

23:05.280 --> 23:10.160
and it can reorder, and because it's just a block of memory with operations operating on it,

23:10.160 --> 23:17.600
that's great. I don't have a problem. It'll just get faster. This assumes that the machine,

23:17.600 --> 23:21.760
the way the machines work, and the assumptions you made 10 years ago, are still true.

23:22.240 --> 23:28.000
Fundamentally, because as the machines change, as we went from in-order to out-of-order,

23:28.720 --> 23:33.200
in-order to superscalar out-of-order machines, to branch prediction, the way you extract

23:33.200 --> 23:38.480
performance out of the machine has also drastically changed. And the point I'm driving to is

23:38.480 --> 23:43.920
the promise of it'll just get faster automatically, when you don't, when you fundamentally

23:43.920 --> 23:49.360
cannot know what is coming, is a false assumption that doesn't actually help anyone. It doesn't

23:49.360 --> 23:56.480
hold up to second order scrutiny. But this is supposed to be the good part. The goal is a reasonable

23:56.480 --> 24:02.720
one: offloading complexity to the hardware. There are a lot of software people; me, I'm an assembly

24:02.720 --> 24:07.760
nerd. I wear that as a badge of honor. I know a lot of incredible programmers that just want

24:07.760 --> 24:11.600
to get shit to work. They want to work on the desktop environment. They want to work on a browser,

24:11.600 --> 24:17.920
and that's fine. The machine is there to serve them. And if this takes complexity away from the

24:18.000 --> 24:23.680
compiler implementers, it takes complexity away from the general purpose software implementers.

24:23.680 --> 24:28.080
That's a great and noble goal. I just don't know that it holds up to scrutiny.

24:30.640 --> 24:34.400
It wouldn't be a RISC-V vector mention if I didn't talk about the foolishness of this article.

24:35.200 --> 24:41.280
Specifically, "SIMD Instructions Considered Harmful," by some guy that nobody's heard of.

24:41.680 --> 24:51.840
Patterson for those in the back. This is a fundamental design decision. When you look at X86, X86 did

24:51.840 --> 24:59.600
SSE. Well, they did MMX, but no one wants to talk about that anymore. They did SSE. They did AVX-1, AVX-2,

24:59.600 --> 25:04.480
AVX-3, which never saw the light of day. Then they went to AVX-512. AVX-512 was really good, but it was a

25:04.480 --> 25:09.440
mess, and now we're getting AVX-10, and that's a very reasonable standard. What is the premise of AVX?

25:10.320 --> 25:18.560
Or of Arm's NEON? It's: I have a fixed-size boundary. I have 128 bits, and those 128 bits

25:18.560 --> 25:25.280
can address 64-bit entries, be they integer, unsigned, floats. They can do the same thing in 32-bit,

25:25.280 --> 25:30.400
and as of recently you can do the same thing in 16-bit entries, and it's 8 bit on the integer side.

25:31.760 --> 25:36.800
Great. I know the exact size of my types. I know the exact size of my register.
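That fixed contract is just division, and the compiler can bake the results into shuffles and unrolls at build time. A sketch for an illustrative 128-bit register:

```python
REG_BITS = 128  # SSE/NEON-style fixed register width, known at compile time

def lanes(element_bits: int) -> int:
    """Lane count per element type: a compile-time constant when the
    register width is fixed by the ISA."""
    return REG_BITS // element_bits

print({bits: lanes(bits) for bits in (64, 32, 16, 8)})
# → {64: 2, 32: 4, 16: 8, 8: 16}
```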

25:37.600 --> 25:44.400
Why is that interesting? Notice earlier, when I was talking about vectors, I would keep going vertical.

25:45.120 --> 25:51.680
Vertical pipelining, without dependency chains. Running a DAXPY, I'm running an FMA consistently.

25:51.680 --> 25:56.480
I can go and I can pipeline that very simply. It turns out a lot of high-performance software

25:56.480 --> 26:01.120
doesn't actually do that. That's great when you have an HPC background, and you want the number

26:01.120 --> 26:05.600
one on the top 500 super-computer. That's not actually useful when you want to get real work done.
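That vertical, dependency-free pattern is essentially DAXPY. As a minimal sketch (my own scalar example, not from the talk), this is the loop shape a vectorizing compiler can turn into back-to-back pipelined FMAs:

```c
#include <stddef.h>

/* daxpy: y[i] = a*x[i] + y[i] -- the canonical "vertical" kernel.
 * Every iteration is independent, so the FMAs can be issued
 * back to back with no cross-iteration dependency chain. */
void daxpy(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```
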

26:06.320 --> 26:11.440
Because in real work, you have dependency chains. You're doing hashes. You're doing all these

26:11.440 --> 26:16.480
computations that have dependencies. I'm bringing all different types of data together, and I'm doing

26:16.480 --> 26:22.800
that dynamically. I'm doing logical operators. When you're trying to

26:22.800 --> 26:28.720
do these manipulations, you make the decision fundamentally that because there are dependencies in the

26:28.800 --> 26:32.880
vector, I need to be able to manipulate those entries. I need them to work together.

26:33.760 --> 26:37.200
So when you're trying to do something where the machine takes care of everything, and you're just

26:37.200 --> 26:43.920
doing linear algebra all day, you build a BLAS machine, and this is a BLAS machine. This is

26:43.920 --> 26:52.640
DAXPY all day, all night. That's not the code we actually run. So, the not-so-good part in more

26:52.640 --> 26:59.360
detail: the architectural discussion. Implied state in out-of-order machines. You can do vector-length

26:59.360 --> 27:04.480
agnostic machines that are good. Arm's SVE design is actually quite good. It's quite tasteful. It's not

27:04.480 --> 27:09.520
my preferred style, but that's fine. We can have differences of opinion, but still respect good implementations.

27:10.720 --> 27:15.360
One of the major issues with RVV, and it was brought up during the FFmpeg talk earlier,

27:16.240 --> 27:24.000
is VLMAX and so on, where your vector length can change underneath you. What does RVV do? RVV tells you

27:24.000 --> 27:30.960
that thou shalt implement vectors of either 128 bits, 256 bits, 1024 bits, all the way up to 2K.

27:32.000 --> 27:40.800
It also has this wild mechanism called LMUL, the length multiplier. LMUL is a way to create a virtual

27:40.800 --> 27:48.960
vector register a power of two larger than the actual physical register. So, I can have a VLEN=128 RVV

27:48.960 --> 27:57.200
implementation, and the user can set LMUL equal to 8, and then to software, your interface into the

27:57.200 --> 28:07.680
machine, you now have 1024-bit registers. Okay, why would you do this? You would do this when you're

28:07.680 --> 28:12.720
trying to amortize latency and go for peak throughput. Because what this tells

28:12.720 --> 28:18.160
the machine is when I'm reordering my buffers, when I'm reordering my vectors, well now I can just

28:18.160 --> 28:23.680
simply do decomposition and pipeline through, and because LMUL is very large, that means my vectors

28:23.680 --> 28:28.880
are very large, which means I have guarantees on how many instructions I'm going to run,

28:29.440 --> 28:33.440
and the same way I was looking at the power analysis earlier, I'll be able to do batches of things

28:33.440 --> 28:38.400
together, and then when I'm finished, ship them off. I can do those same tricks, but better.
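Concretely, the spec's rule is VLMAX = LMUL × VLEN / SEW, so a VLEN=128 machine at LMUL=8 with 64-bit elements presents 16-element, 1024-bit register groups. A toy model of that arithmetic (my own sketch; integer LMUL only, fractional LMUL omitted):

```c
/* Model of RVV's register grouping: the number of elements a vector
 * instruction covers is VLMAX = LMUL * VLEN / SEW, where VLEN is the
 * physical register width in bits, SEW is the selected element width
 * in bits, and LMUL is the register-group multiplier (1, 2, 4, or 8). */
unsigned vlmax(unsigned vlen_bits, unsigned sew_bits, unsigned lmul) {
    return lmul * vlen_bits / sew_bits;
}
```

With vlmax(128, 64, 8) you get 16 doubles per group, i.e. the 1024-bit virtual register described above.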

28:38.400 --> 28:44.560
It's a great idea. Until you realize that the implied state can change,

28:45.760 --> 28:50.720
because it's not actually encoded in the instruction itself. My vector FMA does not know

28:50.720 --> 28:58.960
that it's operating on LMUL 8 or LMUL 2 or LMUL 1. The machine must track this, and you have no

28:59.040 --> 29:03.360
consistent guarantees, and I consider this a bug in the ISA; others consider it a feature.

29:04.080 --> 29:10.480
You have no explicit guarantees on when LMUL will change, which now means that when I'm reorganizing,

29:10.480 --> 29:15.760
and I'm doing my vector pipelining across the entire machine, I don't know if my implied vector

29:15.760 --> 29:24.640
right now at any given time is 128 bits, 256, 512, all the way to 2K, or if you have a 2K

29:24.720 --> 29:30.880
native implementation all the way to 16K virtual. I have to track all of those implied dependency

29:30.880 --> 29:37.840
states across the whole machine at all times, because I can only look ahead so far.

29:37.840 --> 29:41.280
I don't know what the next instruction stream is going to be. I have no guarantees on what that's

29:41.280 --> 29:49.520
going to look like, and this is why hardware designers go mad. So, I wanted to leave a lot of

29:49.520 --> 29:54.000
time for questions, and I'm kind of coming up to the five minute mark. So, what do we want? I want

29:54.000 --> 29:59.200
to kill off the C extension broadly, but I want to learn from where we came from, and be more

29:59.200 --> 30:05.680
willing to change opinions if and when things don't work out. When I gave a talk for

30:05.680 --> 30:11.520
EasyBuild years ago on AVX-10, I got this amazing quote from Tom Forsyth. Tom Forsyth is one of

30:11.520 --> 30:18.320
the architects of Larrabee. He did AVX-512. An ISA is the train tracks to which we chain our screaming

30:18.400 --> 30:27.280
future colleagues. We have a specification. That specification is out in the wild now, which means

30:27.280 --> 30:34.240
users are depending on it. I cannot change the semantics of the machine without a hard break.

30:34.240 --> 30:40.080
Without telling everyone, the software you previously compiled is all gone, dead, dusted.

30:41.280 --> 30:45.200
Think of what that means now for all the package maintainers, all the software developers.

30:45.760 --> 30:50.320
Sure, it's great when all of your software is open source, but I also don't know how many people in

30:50.320 --> 30:54.400
the room are, like, Gentoo zealots who want to go and recompile everything every week for nothing more

30:54.400 --> 31:00.880
than new optimization flags. That's not a reasonable idea. So, then the question becomes now we have this

31:00.880 --> 31:07.840
spec. How and when are we going to improve it? How do we learn? The first part:

31:08.480 --> 31:12.880
RISC-V is going to continue to evolve. It's an open-source, community-driven project. That's a good

31:12.960 --> 31:19.680
thing. As it evolves, though, having more humility and being able to engage more broadly with the

31:19.680 --> 31:25.280
industry, with other instruction sets, be that OpenPOWER, be that MIPS, be that Arm, be that x86,

31:25.280 --> 31:31.120
and so on and so forth. Being able to interact with these machines

31:31.120 --> 31:35.120
and learn from what worked and what didn't. What were the contracts that those machines were

31:35.120 --> 31:40.720
enforcing and where were they useful and where did they get in the way? That's what I'm actually

31:40.720 --> 31:45.120
asking for from the broader RISC-V community. And also, maybe the Monday morning spec meetings where

31:45.120 --> 31:48.720
you do five hours of matrix discussions, that's going to go away. Let's be a little less

31:48.720 --> 31:58.560
annoying for those. Let's have some questions. Or did I bore everyone to sleep on a Saturday morning?

31:58.640 --> 32:00.640
I'm so sorry.

32:08.480 --> 32:16.160
I've seen some alternatives to the, uh, to the unique way things work,

32:16.160 --> 32:20.320
in particular with RVV. Which way do you think things are going?

32:21.040 --> 32:27.680
So the question is, you've seen we have RVV, but there's alternatives being developed,

32:27.680 --> 32:33.760
pushed, opened up. What do I think of them? That's a nuanced question.

32:34.560 --> 32:38.880
The nuance being, I'm working on open-sourcing an alternative to RVV right now.

32:40.080 --> 32:43.040
And I don't want it to be super biased where I'm just saying, hey, this is the thing you should

32:43.040 --> 32:47.280
be doing, why are you not doing this, come on. Now, I think more broadly it's a

32:47.280 --> 32:54.320
design paradigm question. I am a fan and almost every really good programmer I know

32:56.960 --> 33:05.040
is in favor of fixed-length packed SIMD, because of the consistency guarantees it gives you.

33:06.000 --> 33:09.920
Because it means the machine can be simpler, the tracking infrastructure can be simpler,

33:09.920 --> 33:14.560
the dependency chain can be simpler in the machine, and also me as a programmer, I can

33:14.560 --> 33:21.360
reason about those interfaces, right? I'm not mentally offloading the complexity to the machine

33:21.360 --> 33:27.120
in a way that I will never understand and can only holistically determine at the end.

33:28.240 --> 33:33.200
So being able to present consistent interfaces is something I think RVV needs,

33:34.400 --> 33:39.040
getting rid of LMUL, I think, is a good idea, but that's out there, right? It's out in the wild.

33:39.120 --> 33:46.560
Um, implied state is just nasty. I think vector length agnostic is interesting, but it's

33:46.560 --> 33:50.880
more academic in its usage than actual implementations. There's a reason Apple still doesn't

33:50.880 --> 33:57.680
support SVE on their Arm cores, and they set up a bastardized thing called SSVE specifically

33:57.680 --> 34:03.760
because of the way it works. Um, yeah, and there's other alternatives. I would really like a fixed

34:03.760 --> 34:10.960
length SIMD. I would love an AVX 10.2 that's actually just in RISC-V. Give me good

34:10.960 --> 34:15.280
consistency, swizzle support, give me good compression, give me masked predication, which I didn't

34:15.280 --> 34:19.360
have time to cover. Um, but effectively, if you've ever heard of a mask register for

34:19.360 --> 34:23.920
lane selection, that's what predication is, that gets a lot easier to implement when you're doing

34:24.480 --> 34:30.320
dynamic data changes in a SIMD machine than in a vector machine. Um, so there's lots to go.
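Since predication only got a passing mention here: as a rough scalar sketch (my own example, not from the talk), a mask register amounts to a per-lane write-enable:

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of mask-register predication: bit i of `mask` decides
 * whether lane i takes the new result (here a[i] + b[i]) or keeps
 * its previous value. */
void masked_add(size_t n, const int32_t *a, const int32_t *b,
                int32_t *out, uint64_t mask) {
    for (size_t i = 0; i < n; i++)
        if ((mask >> i) & 1)
            out[i] = a[i] + b[i];
}
```
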

34:30.320 --> 34:47.440
Those are probably best over a beer. And I mean that very seriously. So as a user, the biggest promise

34:47.440 --> 34:55.680
I make to you when I give you hardware is you give me software. I run your software. The interface,

34:55.760 --> 35:01.120
the application binary interface, the compiler standard and so on, says thou shalt, not thou may,

35:01.120 --> 35:08.400
thou shalt support RISC-V compressed. Which means: is it reasonable for me to build hardware

35:09.040 --> 35:14.960
for you to just compile as normal? The compiler determines, oh, actually, there's a reason here to

35:14.960 --> 35:20.320
generate a compressed instruction, and then suddenly I fault. You're writing compliant software,

35:21.120 --> 35:26.400
you're writing it in ISO C23, because you like yourself. You go through with a standard

35:26.400 --> 35:31.760
compiler that fully implements things. That goes to the machine. You hit go, you hit enter,

35:31.760 --> 35:36.960
and it breaks. That is not okay. That is breaking the trust contract between hardware and software.

35:37.840 --> 35:42.640
Now, I think it's more tasteful if you go through and you just disable them on the command line.

35:43.520 --> 35:48.640
You can do that. I think your code will be better, but because of the reality that

35:48.720 --> 35:53.680
they're part of the spec, I don't have the choice of not implementing them. I can make them slow,

35:54.720 --> 35:57.600
but I need to implement them. Sorry. Thank you, we're about at time.

35:58.880 --> 36:02.880
Right, thank you. Yeah, thanks everybody.

