WEBVTT

00:00.000 --> 00:08.000
Okay, we're ready to let Felix start his job.

00:08.000 --> 00:12.000
We're on day 20 of the 36-day road trip.

00:12.000 --> 00:15.000
Coming from Canada to Europe back to the U.S.

00:15.000 --> 00:17.000
And I'm back, Bob.

00:17.000 --> 00:19.000
And he has 47 slides.

00:19.000 --> 00:21.000
Go ahead, Bob.

00:21.000 --> 00:28.000
Everybody, I'm Felix.

00:28.000 --> 00:30.000
We're going to be talking about floating point.

00:30.000 --> 00:33.000
We're going to be skipping some of the nonsense of floating point.

00:33.000 --> 00:34.000
I'm going to talk a little bit of history.

00:34.000 --> 00:36.000
A little bit of how we got where we are today.

00:36.000 --> 00:38.000
What were the decisions that we made when we built hardware,

00:38.000 --> 00:43.000
and when we built hardware for the software-hardware contract of how we do math.

00:43.000 --> 00:46.000
And how we do computing in general.

00:46.000 --> 00:48.000
So who's this bozo?

00:48.000 --> 00:49.000
I'm Felix.

00:49.000 --> 00:52.000
I'm the need to see it in the air over at 10 o'clock.

00:52.000 --> 00:53.000
We built hardware.

00:53.000 --> 00:55.000
We've open-sourced all of our software.

00:55.000 --> 00:57.000
And then I'd like to make this very clear.

00:57.000 --> 00:58.000
Because they're mine.

00:58.000 --> 00:59.000
They're all mine.

00:59.000 --> 01:00.000
They're not anyone else's.

01:00.000 --> 01:01.000
They belong to me.

01:01.000 --> 01:02.000
Don't happen.

01:02.000 --> 01:05.000
So what is floating point?

01:05.000 --> 01:06.000
Very quick version.

01:06.000 --> 01:08.000
You can leave in three minutes.

01:08.000 --> 01:12.000
Floating point is just a data layout.

01:12.000 --> 01:13.000
All we're trying to do.

01:13.000 --> 01:15.000
Sometimes the integers are great.

01:15.000 --> 01:16.000
The integers are great.

01:16.000 --> 01:17.000
Most of the time.

01:17.000 --> 01:18.000
Sometimes you have a decimal point.

01:18.000 --> 01:19.000
Because we're a little cool.

01:19.000 --> 01:21.000
And people like to think in fractions.

01:21.000 --> 01:22.000
Fractions are great.

01:22.000 --> 01:25.000
Floating point allows us to do that in a standard way.

01:25.000 --> 01:30.000
There we go.

01:30.000 --> 01:32.000
Yeah.

01:32.000 --> 01:33.000
Yeah.

01:33.000 --> 01:36.000
For the stream, this is a talk about some HPC.

01:36.000 --> 01:38.000
We're going to talk about floating point.

01:38.000 --> 01:40.000
All kind of.

01:41.000 --> 01:44.000
So, I'd like to point out that little asterisk.

01:44.000 --> 01:46.000
Floating point has some nonsense in it.

01:46.000 --> 01:47.000
This is ominous.

01:47.000 --> 01:50.000
There's a little devil there for a good reason.

01:50.000 --> 01:52.000
But fundamentally, it's simple.

01:52.000 --> 01:55.000
It's, we have a power-of-2 format.

01:55.000 --> 01:57.000
We have a sign bit, unsurprisingly.

01:57.000 --> 01:58.000
Sometimes things are positive.

01:58.000 --> 01:59.000
Sometimes things are negative.

01:59.000 --> 02:01.000
You have an order of magnitude.

02:01.000 --> 02:02.000
So scaling up.

02:02.000 --> 02:05.000
And then you have a precision within that order of magnitude.

02:05.000 --> 02:06.000
What that allows you to do is.

02:06.000 --> 02:08.000
And we're going to talk about the fixed point.

02:08.000 --> 02:09.000
A little bit.

02:09.000 --> 02:12.000
I can say, well, I'm this far away from zero.

02:12.000 --> 02:14.000
And then within that range above zero,

02:14.000 --> 02:16.000
I am within this sub range.

02:16.000 --> 02:19.000
What that allows you to do is that you effectively get to scale,

02:19.000 --> 02:21.000
you know, the precision you need at all orders of magnitude.

02:21.000 --> 02:24.000
That also works for the opposite, when you're going small,

02:24.000 --> 02:26.000
Or within the interval of zero to one.

02:26.000 --> 02:28.000
You can flip that.

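[Editor's sketch, not from the talk: the sign / order-of-magnitude / precision split described above can be made concrete by pulling the three fields out of an FP32 value. The field widths are the standard IEEE 754 single-precision ones.]

```python
import struct

def fp32_fields(x: float) -> tuple[int, int, int]:
    """Decode an FP32 value into its sign, biased exponent, and fraction bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # 1 bit: positive or negative
    exponent = (bits >> 23) & 0xFF   # 8 bits: order of magnitude (biased by 127)
    fraction = bits & 0x7FFFFF       # 23 bits: precision within that magnitude
    return sign, exponent, fraction

sign, exp, frac = fp32_fields(-6.5)
# -6.5 = -1.625 * 2**2, so the stored exponent field is 2 + 127 = 129.
print(sign, exp - 127, 1 + frac / 2**23)  # → 1 2 1.625
```

The exponent is stored with a bias of 127, so a raw field of 129 means a scale of 2 to the 2.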
02:34.000 --> 02:35.000
This is me trying again.

02:35.000 --> 02:37.000
Hopefully, I'm.

02:37.000 --> 02:38.000
A lot of all.

02:47.000 --> 02:48.000
Okay.

02:48.000 --> 02:50.000
Well, we got there in the end.

02:50.000 --> 02:52.000
So, that's the short version.

02:52.000 --> 02:53.000
You can all leave now.

02:53.000 --> 02:54.000
There's no nonsense after this.

02:54.000 --> 02:56.000
We're all in good shape.

02:56.000 --> 02:57.000
Except for this part.

02:57.000 --> 02:58.000
The medium length version,

02:58.000 --> 03:02.000
because I was told I have 20 minutes of which I've already used three.

03:02.000 --> 03:04.000
So, why do we care?

03:04.000 --> 03:06.000
The answer isn't fixed point.

03:06.000 --> 03:08.000
So, math is hard.

03:08.000 --> 03:09.000
At least for me.

03:09.000 --> 03:10.000
You guys are smarter than me.

03:10.000 --> 03:12.000
So, probably you guys have less trouble with it.

03:12.000 --> 03:15.000
But fundamentally, making sure you get the right answer is even harder.

03:15.000 --> 03:18.000
And floating point is a way where when we're using computers,

03:18.000 --> 03:20.000
computers are a fixed finite resource.

03:20.000 --> 03:22.000
The way we do bits, the way we do encoding,

03:22.000 --> 03:24.000
is a fixed resource.

03:24.000 --> 03:26.000
So, one of the things that we have to do is decide,

03:26.000 --> 03:27.000
this is how many bits I have.

03:27.000 --> 03:31.000
Do I want to represent numbers up to 20 billion?

03:31.000 --> 03:33.000
Or do I want to be able to represent really tiny numbers

03:33.000 --> 03:36.000
at fractions of millimeters, down at the scale of nanometers?

03:36.000 --> 03:37.000
Right?

03:37.000 --> 03:39.000
So, the obvious thing is when we're counting on our hands

03:39.000 --> 03:42.000
and we're in elementary school, then integers are totally fine.

03:42.000 --> 03:46.000
But then, let's say I have a number where I have 0.51 of something.

03:46.000 --> 03:49.000
Call it millimeters, call it meters, whatever you're into.

03:49.000 --> 03:51.000
What happens if you print that value?

03:51.000 --> 03:52.000
Right?

03:52.000 --> 03:55.000
If you print that as an integer, well, you get zero.

03:55.000 --> 03:57.000
Because integers truncate to zero.

03:57.000 --> 04:00.000
Right? You cannot have a decimal point as part of an integer.

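[Editor's sketch: a quick illustration of that truncation.]

```python
# Integer conversion truncates toward zero: everything past the point is lost.
print(int(0.51))    # → 0
print(int(-0.51))   # → 0 (truncation toward zero, not floor)
print(int(2.99))    # → 2
```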
04:01.000 --> 04:02.000
Okay, great.

04:02.000 --> 04:04.000
How do I fix that problem?

04:04.000 --> 04:07.000
If integers can't store anything past the point?

04:07.000 --> 04:09.000
Well, my domain is a meter,

04:09.000 --> 04:12.000
but sometimes I need to go all the way down to millimeters.

04:12.000 --> 04:16.000
I can move the point in my data, in my format.

04:16.000 --> 04:20.000
So, we'll say, if I need to represent something that's 1,000

04:20.000 --> 04:23.000
of something, then 2 to the 10 is 1024.

04:23.000 --> 04:27.000
So, then I can represent up to 1024 of something below the point.

04:27.000 --> 04:31.000
And then that leaves me with, say, 2 to the 22,

04:31.000 --> 04:34.000
or 2 to the 21, amount of states above the point.

04:34.000 --> 04:38.000
And that allows me to scale very precisely within that range.

04:38.000 --> 04:42.000
And if we call this format a fixed point format,

04:42.000 --> 04:45.000
or an fxp, and you print that, what do you get?

04:45.000 --> 04:48.000
Well, you get 0.509 on and on and on.

04:48.000 --> 04:49.000
Right?

04:49.000 --> 04:52.000
So, it's not the correct answer, but it's a lot closer.

04:52.000 --> 04:55.000
And it's better than printing 0 when my value is 0.51.

04:55.000 --> 04:57.000
So, you're getting closer to what you actually want,

04:57.000 --> 04:58.000
and what you actually care about.

04:58.000 --> 04:59.000
And it is close.

04:59.000 --> 05:01.000
And then, for the nerds that want to verify my math,

05:01.000 --> 05:03.000
that's the fixed-point bit pattern.

05:03.000 --> 05:05.000
Someone always asks.

05:05.000 --> 05:08.000
Now, it's only 0.04% off.

05:08.000 --> 05:10.000
And that's pretty close, and that's pretty good.

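[Editor's sketch of the arithmetic above, assuming 10 fractional bits to match the 2-to-the-10 in the talk; the helper names are made up for illustration.]

```python
FRAC_BITS = 10  # 2**10 = 1024 steps below the point, as in the talk

def to_fixed(x: float) -> int:
    """Quantize to a fixed-point integer with FRAC_BITS fractional bits."""
    return round(x * (1 << FRAC_BITS))

def from_fixed(q: int) -> float:
    return q / (1 << FRAC_BITS)

stored = from_fixed(to_fixed(0.51))     # 522 / 1024
print(stored)                           # → 0.509765625
print(abs(stored - 0.51) / 0.51 * 100)  # relative error, roughly 0.05%
```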
05:10.000 --> 05:13.000
But notice that the number I picked, 0.51,

05:13.000 --> 05:16.000
is awfully close to one half.

05:16.000 --> 05:19.000
So, within that range, when you do a tapered value

05:19.000 --> 05:21.000
format, a fixed point format,

05:21.000 --> 05:23.000
representing values that are very close to

05:23.000 --> 05:25.000
powers of 2, with both positive and negative

05:25.000 --> 05:27.000
exponents, is really easy.

05:27.000 --> 05:29.000
But as you diverge from that, if you try to do,

05:29.000 --> 05:32.000
like an inverse of a prime, you end up being very far away

05:32.000 --> 05:33.000
from those values.

05:33.000 --> 05:36.000
And the amount of distance between those values

05:36.000 --> 05:38.000
ends up growing larger and larger.

05:38.000 --> 05:41.000
So, while numbers very close are about 0.04% away,

05:41.000 --> 05:43.000
as you go further and further, you can easily run

05:43.000 --> 05:45.000
into issues where, A, the number is non-representable,

05:45.000 --> 05:48.000
or, B, you're actually, say,

05:48.000 --> 05:51.000
3, 5, 10% off from your values.

05:51.000 --> 05:53.000
And in HPC and data science,

05:53.000 --> 05:55.000
I don't know about you, but if every time I read or

05:55.000 --> 05:56.000
write a data point,

05:56.000 --> 05:59.000
it's suddenly 10% off, every time I do a time

05:59.000 --> 06:01.000
step, I'm going to have a bad time.

06:01.000 --> 06:02.000
Let's not do that.

06:02.000 --> 06:05.000
So, in the beginning, there was base 10.

06:05.000 --> 06:07.000
It looked like that machine.

06:07.000 --> 06:09.000
This is an IBM 650.

06:09.000 --> 06:13.000
Yeah, it's really cool for its time.

06:13.000 --> 06:15.000
But one of the problems you have when you're trying

06:15.000 --> 06:17.000
to do a base 10 system with transistors

06:17.000 --> 06:19.000
is transistors are fundamentally base 2.

06:19.000 --> 06:22.000
So, if I want to be able to represent up to 10,

06:22.000 --> 06:24.000
in my encoding, and I want to do it as a fixed format

06:24.000 --> 06:26.000
instead of a packed format, you end up having

06:26.000 --> 06:29.000
to use enough bits to represent 16 states,

06:29.000 --> 06:31.000
which means, hey, now suddenly,

06:31.000 --> 06:35.000
six of my 16 states are actually just wasted garbage.

06:35.000 --> 06:37.000
So, I'm incredibly inefficient from the get-go.

06:37.000 --> 06:39.000
So, that's not good for power, it's not good for

06:39.000 --> 06:41.000
density on a chip, and all the rest of it at that time.

06:41.000 --> 06:43.000
And it's also, when the computers are

06:43.000 --> 06:45.000
doing all of their encoding, all of their math,

06:45.000 --> 06:47.000
fundamentally with base 2,

06:47.000 --> 06:49.000
to represent everything in base 10 means

06:49.000 --> 06:51.000
I'm actually adding overhead every time.

06:51.000 --> 06:54.000
So, I'm better off realizing the computer works this way.

06:54.000 --> 06:56.000
This is how I encode data.

06:56.000 --> 06:59.000
Instead, I can move on and move forward and just do something

06:59.000 --> 07:01.000
at the end where I just print in decimal.

07:01.000 --> 07:02.000
The C standard library is there.

07:02.000 --> 07:03.000
It's a good thing.

07:03.000 --> 07:04.000
It'll help you out.

07:04.000 --> 07:05.000
That's the part that matters.

07:05.000 --> 07:06.000
Avoid decimal types.

07:06.000 --> 07:08.000
But, for the time, that's what we have.

07:08.000 --> 07:10.000
See, yeah, I'm making good time.

07:10.000 --> 07:12.000
I've got 12 minutes, and I'm less than 10 minutes in,

07:12.000 --> 07:15.000
and I'm 19 slides in; all is well in the world.

07:15.000 --> 07:18.000
So, on to binary floating point.

07:18.000 --> 07:20.000
IEEE 754 is the GOAT.

07:20.000 --> 07:22.000
It is the greatest standard ever created,

07:22.000 --> 07:24.000
and I will fight you about that.

07:24.000 --> 07:27.000
What it does is it finally gave us a single

07:27.000 --> 07:29.000
frame of reference.

07:29.000 --> 07:31.000
Everything is IEEE 754.

07:31.000 --> 07:33.000
You know that when I'm running in an arm,

07:33.000 --> 07:36.000
a PowerPC, a VAX, an IBM, an x86 machine.

07:36.000 --> 07:38.000
If I go and I tread my language and my standard

07:38.000 --> 07:41.000
onto IEEE 754 conformance,

07:41.000 --> 07:44.000
the language, the hardware software contract,

07:44.000 --> 07:47.000
guarantees that it will do this.

07:47.000 --> 07:50.000
This will be when I say a float or a double in C,

07:50.000 --> 07:52.000
or something else.

07:52.000 --> 07:55.000
You guarantee that you are going to do math the same way

07:55.000 --> 07:56.000
every time.

07:56.000 --> 07:58.000
And that's one of the first few things that really

07:58.000 --> 08:00.000
unlocked the concept of portability within the

08:00.000 --> 08:02.000
wide world of HPC, is that we could trust

08:02.000 --> 08:04.000
our computations to get the same answer every time.

08:04.000 --> 08:06.000
And again, if we're doing science,

08:06.000 --> 08:08.000
that's a fundamentally important and

08:08.000 --> 08:10.000
fundamental building block of it.

08:10.000 --> 08:12.000
So, how did we get there, though,

08:12.000 --> 08:14.000
is that the standard, the Bible,

08:14.000 --> 08:15.000
did not come out of nowhere.

08:15.000 --> 08:16.000
It came out of Intel.

08:16.000 --> 08:18.000
Good friends, love them to bits,

08:18.000 --> 08:20.000
they're having a bad time right now.

08:20.000 --> 08:21.000
We'll leave that where it is.

08:21.000 --> 08:24.000
But in the 1980s, Intel had their X86 system,

08:24.000 --> 08:26.000
specifically it was the 8086.

08:26.000 --> 08:28.000
Great chip for the time, little 8 and

08:28.000 --> 08:30.000
slash 16 bits depending on how you look at it,

08:30.000 --> 08:32.000
little microprocessor, and it worked really well.

08:32.000 --> 08:34.000
But once again, sometimes you want to do math

08:34.000 --> 08:36.000
and sometimes that math needs decimal points.

08:36.000 --> 08:39.000
So, Intel came out with the 8087 co-processor in 1980,

08:40.000 --> 08:42.000
and it implemented three data types.

08:44.000 --> 08:47.000
FP32, FP64, and FP80.

08:47.000 --> 08:50.000
FP, in this case, obviously, is floating point.

08:50.000 --> 08:52.000
Now, what happened is,

08:52.000 --> 08:54.000
these were the formats for the data in memory.

08:54.000 --> 08:56.000
But when you went to actually do a computation,

08:56.000 --> 08:59.000
the computation actually took FP32 data,

08:59.000 --> 09:02.000
extended it to FP80 for all of my inputs,

09:02.000 --> 09:06.000
Do all the math in FP80, then convert back to FP32.

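[Editor's sketch of that compute-wide-then-narrow flow. Here float64 stands in for the 8087's FP80 working precision, and FP32 storage is emulated by a round trip through a 32-bit encoding; the point is that rounding after every step can give a different answer than computing wide and converting once at the end.]

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (double) to FP32, the way storing to memory would."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

a = 1.0
b = c = f32(2**-25 + 2**-45)    # exactly representable in FP32

stepwise = f32(f32(a + b) + c)  # round to FP32 after every operation
widened = f32(a + b + c)        # do the math wide, convert back at the end

print(stepwise)  # → 1.0
print(widened)   # → 1.0000001192092896 (one FP32 ulp higher)
```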
09:06.000 --> 09:08.000
Yes, that is as cursed as you think it is.

09:09.000 --> 09:10.000
Yes, that was horrible.

09:10.000 --> 09:12.000
But it gave us at least something to work with,

09:12.000 --> 09:14.000
where we had a fixed format that was,

09:14.000 --> 09:17.000
a floating point format that was relatively common,

09:17.000 --> 09:19.000
became ubiquitous through the micro computers,

09:19.000 --> 09:22.000
the IBMs of the era, the personal computers.

09:22.000 --> 09:25.000
And what it meant is people could start to build libraries,

09:25.000 --> 09:27.000
and it was also implemented in such a way that

09:27.000 --> 09:29.000
you could do real math on it.

09:29.000 --> 09:30.000
But if you look at the amount of gates,

09:30.000 --> 09:32.000
the amount of multipliers that you needed in silicon

09:32.000 --> 09:34.000
to actually be able to build the things,

09:34.000 --> 09:37.000
it was relatively inexpensive to build as well.

09:37.000 --> 09:38.000
Right?

09:38.000 --> 09:39.000
Now, we're talking prices in 1980,

09:39.000 --> 09:41.000
so it was still 200 US dollars in that era.

09:41.000 --> 09:43.000
So that's like a thousand bucks today.

09:43.000 --> 09:46.000
Oh look, that's the cost of a graphics card. Floating point works.

09:46.000 --> 09:49.000
But it was there, and everyone could start to build on it.

09:49.000 --> 09:50.000
And that was in 1980.

09:50.000 --> 09:54.000
So from there, we could go on and use part of it

09:54.000 --> 09:56.000
to go and say, this is the standard.

09:56.000 --> 09:58.000
So let's go and build upon that standard,

09:58.000 --> 10:01.000
and make it ubiquitous across all machines.

10:01.000 --> 10:03.000
And this is when IEEE 754 was published,

10:03.000 --> 10:05.000
in 1985.

10:05.000 --> 10:09.000
And fundamentally what 754 did is that it took all of the formats,

10:09.000 --> 10:15.000
it took the two formats that were reasonable in the 8087,

10:15.000 --> 10:19.000
and brought that to a point where it was codified by a standard

10:19.000 --> 10:20.000
that anyone could use.

10:20.000 --> 10:26.000
And where that gets even better is these are the two formats.

10:26.000 --> 10:30.000
We had a standard, and then the big point is this last line

10:30.000 --> 10:31.000
down here.

10:31.000 --> 10:33.000
The languages which included it.

10:34.000 --> 10:37.000
So the two languages that matter,

10:37.000 --> 10:40.000
probably won't be hugely surprising to this

10:40.000 --> 10:44.000
room, to this audience: we had ISO C in 1989.

10:44.000 --> 10:47.000
It adopted IEEE 754 as its fundamental underlying

10:47.000 --> 10:50.000
mathematical arithmetic standard.

10:50.000 --> 10:53.000
Fortran adopted 754 for floating point.

10:53.000 --> 10:56.000
That means, and then the language of the system,

10:56.000 --> 10:59.000
is everything is C; C is how we define the ABI,

10:59.000 --> 11:00.000
the application binary interface.

11:00.000 --> 11:02.000
It's a hardware software contract.

11:02.000 --> 11:05.000
And then when you're looking at doing math at scale,

11:05.000 --> 11:08.000
that's all Fortran in a lot of ways.

11:08.000 --> 11:11.000
If you look at the actual way that we do math on Windows,

11:11.000 --> 11:14.000
even Windows, if you call a PyTorch call,

11:14.000 --> 11:16.000
actually uses Fortran BLAS bindings.

11:16.000 --> 11:19.000
The Fortran BLAS bindings are built on Fortran 90,

11:19.000 --> 11:22.000
and Fortran 90 actually uses 754 from 1985.

11:22.000 --> 11:23.000
It's turtles all the way down.

11:23.000 --> 11:26.000
These are the standards that allow us to do the math across

11:26.000 --> 11:28.000
all of our applications.

11:29.000 --> 11:31.000
Now, 754, I've talked about.

11:31.000 --> 11:32.000
Let's talk about the weird ones.

11:32.000 --> 11:36.000
Because there are certain domains where working on floating

11:36.000 --> 11:38.000
point, you can do some really interesting work

11:38.000 --> 11:41.000
when you're working on domain-specific problems.

11:41.000 --> 11:45.000
So, RGB packed. RGB, for anyone who's done game programming,

11:45.000 --> 11:48.000
shaders, and so on, will make you think of red, green, and blue.

11:48.000 --> 11:50.000
One of the problems that a lot of the consoles had,

11:50.000 --> 11:53.000
especially in the late 90s, and the early 2000s was,

11:53.000 --> 11:57.000
how do I do a format that has three different data types in it?

11:57.000 --> 11:58.000
That's not a power of 2.

11:58.000 --> 11:59.000
That's really ugly.

11:59.000 --> 12:01.000
That's a pain in the butt to deal with.

12:01.000 --> 12:03.000
But also, I know that I'm going to be

12:03.000 --> 12:05.000
outputting to an 8-bit screen,

12:05.000 --> 12:07.000
and I know that I want to work in powers of 2.

12:07.000 --> 12:09.000
So, what the idea was is,

12:09.000 --> 12:12.000
I'm going to go and do two of my formats.

12:12.000 --> 12:14.000
Two of my entries in that packed format

12:14.000 --> 12:16.000
that are 11 bits.

12:16.000 --> 12:17.000
One of them that is 10 bits,

12:17.000 --> 12:19.000
and because I'm dealing with color,

12:19.000 --> 12:22.000
I know that my color will always be positive numbers.

12:22.000 --> 12:25.000
So, what that allows you to do is your two 11-bit formats

12:25.000 --> 12:28.000
and your 10-bit format fit in that 32-bit format,

12:28.000 --> 12:30.000
which means that for the memory architecture,

12:30.000 --> 12:32.000
the memory subsystems of consoles,

12:32.000 --> 12:34.000
it's really effective, very fast.

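[Editor's sketch of just the bit layout. Real console and GPU formats of this kind also define a tiny float encoding inside each 11- and 10-bit field, which is omitted here; this shows only how the three fields share one 32-bit word.]

```python
def pack_r11g11b10(r: int, g: int, b: int) -> int:
    """Pack two 11-bit fields and one 10-bit field into one 32-bit word."""
    assert 0 <= r < 2**11 and 0 <= g < 2**11 and 0 <= b < 2**10
    return r | (g << 11) | (b << 22)

def unpack_r11g11b10(word: int) -> tuple[int, int, int]:
    return word & 0x7FF, (word >> 11) & 0x7FF, (word >> 22) & 0x3FF

word = pack_r11g11b10(1500, 42, 1000)
print(hex(word))
print(unpack_r11g11b10(word))  # → (1500, 42, 1000)
```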
12:34.000 --> 12:36.000
We're kind of following the history here.

12:36.000 --> 12:37.000
Why do we care?

12:37.000 --> 12:39.000
32-bit intervals are great for memory,

12:39.000 --> 12:41.000
fitting three things in hard,

12:41.000 --> 12:42.000
but it's something you can do,

12:42.000 --> 12:44.000
and there is value in doing that.

12:44.000 --> 12:46.000
And sure it's non-standard,

12:46.000 --> 12:47.000
but you do run into the case where,

12:47.000 --> 12:49.000
if you're designing for a console,

12:49.000 --> 12:50.000
you're like,

12:50.000 --> 12:53.000
you know that code is never going to need

12:53.000 --> 12:56.000
to run across multiple different types of device classes.

12:56.000 --> 12:58.000
Especially in the late 90s,

12:58.000 --> 12:59.000
early 2000s,

12:59.000 --> 13:01.000
you were designing for maybe one console,

13:01.000 --> 13:03.000
maybe two in certain areas,

13:03.000 --> 13:05.000
and maybe an IBM PC,

13:05.000 --> 13:07.000
or it was an early Macintosh,

13:07.000 --> 13:09.000
or it was the Ataris of the era.

13:09.000 --> 13:11.000
Right?

13:11.000 --> 13:15.000
The other thing is that it enabled the idea

13:15.000 --> 13:17.000
for better or for worse,

13:17.000 --> 13:19.000
of domain-specific data types.

13:19.000 --> 13:21.000
And enabling certain industries to say,

13:21.000 --> 13:23.000
this format is great across everyone

13:23.000 --> 13:24.000
when I need to be portable.

13:24.000 --> 13:25.000
Simultaneously,

13:25.000 --> 13:26.000
when I'm in other domains,

13:26.000 --> 13:29.000
when there is a domain-specific reason

13:29.000 --> 13:30.000
to do a custom format,

13:30.000 --> 13:32.000
there is a time and place for it.

13:32.000 --> 13:33.000
Just be careful,

13:33.000 --> 13:34.000
and when you're doing it,

13:34.000 --> 13:35.000
make sure that you specify it.

13:35.000 --> 13:37.000
Because the last thing you want to do is

13:37.000 --> 13:39.000
go and try to use this thing

13:39.000 --> 13:41.000
that, oh, this looks really cool for my domain,

13:41.000 --> 13:44.000
but it's not properly specified,

13:44.000 --> 13:46.000
which in turn means that I can't use it,

13:46.000 --> 13:48.000
the behavior will change across my machines.

13:48.000 --> 13:52.000
And there's one thing that you cannot get

13:52.000 --> 13:54.000
wrong if you're designing hardware and designing APIs,

13:54.000 --> 13:57.000
and that is that you cannot break the trust of the user.

13:57.000 --> 13:59.000
So making sure that if you're going to do something

13:59.000 --> 14:00.000
weird and specific,

14:00.000 --> 14:02.000
make sure you specify it.

14:02.000 --> 14:05.000
So, there are two slides.

14:05.000 --> 14:06.000
I'm flying.

14:06.000 --> 14:07.000
Back to greatness.

14:07.000 --> 14:09.000
I believe I truly still got it.

14:09.000 --> 14:11.000
Still to go.

14:11.000 --> 14:12.000
Still the greatest of all time.

14:12.000 --> 14:15.000
And what happened was we had a lot of formats,

14:15.000 --> 14:17.000
but IEEE 754 was from the 80s.

14:17.000 --> 14:20.000
And we had use cases where, for example,

14:20.000 --> 14:21.000
if you look at mobile phones,

14:21.000 --> 14:24.000
there were times where we needed lower precision

14:24.000 --> 14:25.000
data types for our phones,

14:25.000 --> 14:27.000
because we needed to lower power,

14:27.000 --> 14:29.000
but we still wanted to be able to do the math properly.

14:29.000 --> 14:30.000
Simultaneously,

14:30.000 --> 14:31.000
you talk to, say, the nuclear physicists,

14:31.000 --> 14:32.000
the particle physicists,

14:32.000 --> 14:33.000
some of the cosmologists,

14:33.000 --> 14:35.000
and they're like, FP64 is nowhere near

14:35.000 --> 14:37.000
precise enough of what I need.

14:37.000 --> 14:38.000
So, even if you're not always going to be

14:38.000 --> 14:40.000
implementing the different types in hardware,

14:40.000 --> 14:43.000
I can go through and, in software, implement FP128,

14:43.000 --> 14:44.000
which means now it's portable.

14:44.000 --> 14:46.000
If you dealt with GCC in the past,

14:46.000 --> 14:49.000
you might be familiar with libquadmath.

14:49.000 --> 14:51.000
That's a library that's specifically

14:51.000 --> 14:54.000
is a software float implementation that is fully compliant,

14:54.000 --> 14:57.000
and that works on all GCC compiler implementations.

14:57.000 --> 14:59.000
And also works on your other devices,

14:59.000 --> 15:01.000
such as your GPUs, if they're running on GCC.

15:01.000 --> 15:03.000
LLVM has something equivalent.

15:03.000 --> 15:05.000
And same for FP128.

15:05.000 --> 15:06.000
Broadly,

15:06.000 --> 15:08.000
you can assume that almost every device

15:08.000 --> 15:10.000
will always support FP32.

15:10.000 --> 15:13.000
Most support FP64,

15:13.000 --> 15:16.000
and anything that's new will support FP16.

15:16.000 --> 15:18.000
FP128 only existed in two devices

15:18.000 --> 15:20.000
that I'm aware of outside of FPGAs.

15:20.000 --> 15:22.000
Those were the NEC vector engines,

15:22.000 --> 15:24.000
as well as the IBM POWER9 machines.

15:24.000 --> 15:26.000
Everywhere else, you can count on FP32

15:26.000 --> 15:28.000
and FP64 being reasonable.

15:30.000 --> 15:32.000
The other thing is that we got this really new

15:32.000 --> 15:34.000
cool mode called FMA.

15:34.000 --> 15:37.000
FMA stands for fused multiply-add.

15:37.000 --> 15:39.000
So one of the things you run into is

15:39.000 --> 15:40.000
we don't actually,

15:40.000 --> 15:42.000
the reason for the amount of precision

15:42.000 --> 15:44.000
that we have is not because

15:44.000 --> 15:47.000
we're running against the boundary of

15:47.000 --> 15:49.000
rounding at every single equation.

15:49.000 --> 15:51.000
It's more so when you're running

15:51.000 --> 15:53.000
big HPC workloads, big simulations,

15:53.000 --> 15:56.000
you're actually iterating on the same data point.

15:56.000 --> 15:58.000
Hundreds of thousands of times.

15:58.000 --> 16:01.000
And it's that the error accumulates

16:01.000 --> 16:04.000
every time you go and you touch a piece of data.

16:04.000 --> 16:06.000
So it's not that every single time

16:06.000 --> 16:08.000
step needs more than FP32.

16:08.000 --> 16:10.000
It's that at the accumulation of your simulation,

16:10.000 --> 16:12.000
you will no longer be able to converge for certain

16:12.000 --> 16:15.000
types of simulations with FP32.

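[Editor's sketch of that accumulation effect. FP32 is emulated by rounding every intermediate result through a 32-bit encoding; the per-step error is tiny, but it compounds over many iterations.]

```python
import struct

def f32(x: float) -> float:
    """Emulate an FP32 store: round a double to single precision."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

# Add 0.01 a hundred thousand times; the true answer is 1000.
acc32, acc64 = f32(0.0), 0.0
for _ in range(100_000):
    acc32 = f32(acc32 + f32(0.01))
    acc64 = acc64 + 0.01

print(abs(acc64 - 1000.0))  # tiny
print(abs(acc32 - 1000.0))  # much larger: the per-step error has accumulated
```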
16:15.000 --> 16:16.000
You'll see this, for example,

16:16.000 --> 16:18.000
if you're doing computational fluid dynamics

16:18.000 --> 16:20.000
and you're looking at the transsonic regime,

16:20.000 --> 16:22.000
or you have really weird boundary conditions

16:22.000 --> 16:24.000
of flows hitting a Mach value.

16:24.000 --> 16:25.000
I digress.

16:25.000 --> 16:28.000
So the point of FMA was,

16:28.000 --> 16:30.000
normally I go through,

16:30.000 --> 16:33.000
and the classic way of doing it without FMA

16:33.000 --> 16:37.000
is let's say I'm doing A is equal to A times X plus B.

16:37.000 --> 16:39.000
What happens is I go and I do A times X,

16:39.000 --> 16:41.000
I do a rounding step.

16:41.000 --> 16:43.000
That's fully compliant.

16:43.000 --> 16:44.000
Then I have to go do my add.

16:44.000 --> 16:46.000
Then I round again.

16:46.000 --> 16:48.000
The idea of FMA is, hey,

16:48.000 --> 16:50.000
I'm already in my vector register.

16:50.000 --> 16:52.000
I'm already in my floating point unit, you know.

16:52.000 --> 16:55.000
It'd be really good if I didn't have to round

16:55.000 --> 16:56.000
between because then I could pipeline it

16:56.000 --> 16:57.000
when I'm designing hardware.

16:57.000 --> 16:58.000
I could do some really cool stuff.

16:58.000 --> 17:01.000
But if you do that, then you've broken compliance.

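[Editor's sketch of the two-roundings-versus-one difference. Exact rational arithmetic stands in for the FMA's single final rounding, since a built-in fused operation is not assumed to be available.]

```python
from fractions import Fraction

def fma_single_rounding(a: float, x: float, b: float) -> float:
    """a*x + b with one rounding at the end, like a hardware FMA would do."""
    return float(Fraction(a) * Fraction(x) + Fraction(b))

a = x = 0.1
b = -(a * x)  # the rounded product, negated

two_roundings = a * x + b  # round after the multiply, round after the add
one_rounding = fma_single_rounding(a, x, b)

print(two_roundings)  # → 0.0: the rounded product cancels itself exactly
print(one_rounding)   # nonzero: the rounding error of 0.1 * 0.1 survives
```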
17:01.000 --> 17:03.000
And we ran into a situation where the hardware

17:03.000 --> 17:05.000
was working faster than the standards

17:05.000 --> 17:07.000
and faster than software could.

17:07.000 --> 17:08.000
So we had to go through

17:08.000 --> 17:10.000
and standardize this.

17:10.000 --> 17:11.000
And that's what we did in 2008.

17:11.000 --> 17:17.000
That was IEEE 754-2008.

17:17.000 --> 17:20.000
I'm only going to lightly make fun of Google here.

17:20.000 --> 17:23.000
Google came out with a type called

17:23.000 --> 17:26.000
brain float 16, or bfloat16, which is an AI-specific data type.

17:26.000 --> 17:28.000
Essentially it's a truncated FP32.

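[Editor's sketch of that truncation; real bfloat16 hardware may round to nearest rather than simply chopping, but the idea is the same: keep the sign and exponent, drop most of the mantissa.]

```python
import struct

def to_bfloat16(x: float) -> float:
    """Truncate an FP32 value to bfloat16 by keeping only the top 16 bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    bits &= 0xFFFF0000  # keep sign + 8 exponent bits + top 7 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(to_bfloat16(3.14159))  # → 3.140625: only a few decimal digits survive
print(to_bfloat16(1.0e30))   # the huge FP32 dynamic range is kept intact
```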
17:28.000 --> 17:31.000
So it keeps the same exponent range as FP32.

17:31.000 --> 17:34.000
I have to show the goose.

17:34.000 --> 17:38.000
Because one of the things you run into you very quickly

17:38.000 --> 17:41.000
is when you have all this massive diversity of data types,

17:41.000 --> 17:43.000
all these different ways of doing the math.

17:43.000 --> 17:45.000
Everyone says, oh, I've got 50 teraflops.

17:45.000 --> 17:47.000
I've got 200 teraflops.

17:47.000 --> 17:51.000
I've got half an exaflop, you know, in a 6U machine.

17:51.000 --> 17:52.000
Wow, that sounds great.

17:52.000 --> 17:53.000
And then you realize, well, no,

17:53.000 --> 17:55.000
it's actually a two-bit floating point type.

17:55.000 --> 17:58.000
That is in no way reasonable or comparable

17:58.000 --> 17:59.000
to the other type.

17:59.000 --> 18:00.000
And that's why you'll see a lot of HPC people

18:00.000 --> 18:02.000
get really mad whenever new hardware comes out.

18:03.000 --> 18:05.000
Well, no, it's not that great a type of flop.

18:05.000 --> 18:06.000
What are you doing?

18:06.000 --> 18:07.000
Stop it.

18:07.000 --> 18:09.000
I'm going to hit you on the knuckles with the ruler.

18:09.000 --> 18:14.000
So anyway, the idea there was to truncate down

18:14.000 --> 18:17.000
to, I only need these bits, these upper bits.

18:17.000 --> 18:19.000
Because those are important for AI.

18:19.000 --> 18:20.000
When you're looking at AI workloads,

18:20.000 --> 18:22.000
they care a lot about orders of magnitude.

18:22.000 --> 18:24.000
They don't care as much about precision.

18:24.000 --> 18:26.000
So the other thing that comes with that, though,

18:26.000 --> 18:28.000
is there's a few benefits.

18:28.000 --> 18:30.000
You don't care about the precision all that much.

18:30.000 --> 18:31.000
Prefer the orders of magnitude.

18:31.000 --> 18:32.000
It's always a trade-off.

18:32.000 --> 18:33.000
It's always a balance point.

18:33.000 --> 18:35.000
Multipliers, which is actually what you use,

18:35.000 --> 18:38.000
if you go back to an electrical engineering design course.

18:38.000 --> 18:41.000
When you're looking at the amount of precision in a format,

18:41.000 --> 18:43.000
that's actually defined by the mantissa;

18:43.000 --> 18:46.000
those use multipliers instead of just adders when you're designing them.

18:46.000 --> 18:48.000
And the thing is, when you double the precision,

18:48.000 --> 18:51.000
you actually go up by a factor four on the amount of multipliers you need.

18:51.000 --> 18:54.000
And multipliers are already significantly more expensive

18:54.000 --> 18:56.000
in silicon than adders.

18:56.000 --> 18:59.000
So if I have a type that's more relevant for my domain,

18:59.000 --> 19:03.000
specifically, then I can go through and be more precise for my domain.

19:03.000 --> 19:05.000
And also use a lot less power.

19:05.000 --> 19:08.000
And fit a lot more multipliers, more floating point units,

19:08.000 --> 19:10.000
within my chip, which means that for that domain,

19:10.000 --> 19:14.000
I can drive a lot more performance.

19:14.000 --> 19:17.000
TF32, or why not?

19:17.000 --> 19:19.000
This was an Nvidia specific format,

19:19.000 --> 19:22.000
which I get how they got there, but it's a little bit silly.

19:22.000 --> 19:24.000
So it's called TensorFloat-32.

19:24.000 --> 19:26.000
It has 19 bits.

19:29.000 --> 19:34.000
But the idea there was, if I'm designing all of these gates for bfloat16,

19:34.000 --> 19:37.000
I'm simultaneously supporting all of these gates for

19:37.000 --> 19:39.000
IEEE FP16.

19:39.000 --> 19:41.000
Well, I can just take the exponent from one,

19:41.000 --> 19:43.000
the mantissa from the other, and smash them together.

19:43.000 --> 19:44.000
Right?

19:44.000 --> 19:47.000
And then compare that to, say, a full FP32.

19:47.000 --> 19:52.000
It's not as good as FP32, but it's better than BF16 and FP16.

19:52.000 --> 19:55.000
And because I'm already going to be supporting both of those data types,

19:55.000 --> 19:58.000
it means that I effectively get TF32 for free.

19:58.000 --> 20:01.000
And once again, it means I can fit a lot more of them in hardware,

20:01.000 --> 20:05.000
but both Google and Nvidia have decided that they don't want to have

20:05.000 --> 20:07.000
an official specification of how those data types work.

20:07.000 --> 20:09.000
So if you look at each generation,

20:09.000 --> 20:12.000
TF32 on Hopper is different than on Ampere,

20:12.000 --> 20:14.000
which is different than on Blackwell.

20:14.000 --> 20:15.000
And it's a pain in the butt.

20:15.000 --> 20:17.000
Good luck with your performance.

20:17.000 --> 20:19.000
Have fun.

20:19.000 --> 20:21.000
Yeah.

20:21.000 --> 20:24.000
I spoke about multipliers; that was covered up there.

20:24.000 --> 20:26.000
I'll add this part, only minor headaches.

20:26.000 --> 20:29.000
It's why I'm growing my hair out so that it's easier to rip out

20:29.000 --> 20:30.000
when I get angry.

20:30.000 --> 20:31.000
It's a different story.

20:31.000 --> 20:33.000
FP8 is mostly nonsense.

20:33.000 --> 20:37.000
There are 17 current FP8 types.

20:37.000 --> 20:38.000
Some of them are log formats.

20:38.000 --> 20:39.000
Some of them have infinity.

20:39.000 --> 20:40.000
Most of them don't.

20:40.000 --> 20:42.000
It's a little bit silly.

20:42.000 --> 20:44.000
There are two of them that matter.

20:44.000 --> 20:45.000
You can ignore the rest.

20:45.000 --> 20:46.000
They're the OCP ones.

20:46.000 --> 20:49.000
Everyone else is deprecating their support.

20:50.000 --> 20:53.000
And then this was supposed to be a question period.

20:53.000 --> 20:55.000
But in case I don't get to show the slide,

20:55.000 --> 20:57.000
I did want to show this one.

20:57.000 --> 20:59.000
Very quickly.

20:59.000 --> 21:00.000
Questions.

21:00.000 --> 21:01.000
Please be nice.

21:01.000 --> 21:02.000
Thank you.

21:16.000 --> 21:17.000
[Inaudible audience question about vendors and IEEE 754 compliance.]

21:38.000 --> 21:43.000
So the question is fundamentally from the vendor side.

21:43.000 --> 21:47.000
You'll see a lot of implementations that are close to IEEE 754 compliant.

21:47.000 --> 21:49.000
But are not fully compliant.

21:49.000 --> 21:52.000
Compliance matters to a lot of people.

21:52.000 --> 21:56.000
So how do we make sure that vendors are honest about how they implement things?

21:56.000 --> 21:59.000
And the answer is it's not easy.

21:59.000 --> 22:02.000
I am a vendor and I'm fighting for conformance.

22:02.000 --> 22:05.000
But one of the things you quickly run into when you're doing silicon design

22:05.000 --> 22:11.000
and you're down at the logic gate level with your libraries, is that doing things like

22:11.000 --> 22:17.000
IEEE 754-compliant and Fortran/C-compliant addition means that I actually have

22:17.000 --> 22:19.000
to do everything ordered.

22:19.000 --> 22:22.000
So that means I'm actually taking linear steps to preserve that ordering.

22:22.000 --> 22:28.000
Whereas in hardware, you can do different tricks, where I can say: turn it into a bunch of parallel trees.

22:28.000 --> 22:33.000
And the thing is, that parallel tree is log(n) levels instead of n sequential steps.

22:33.000 --> 22:35.000
I can pre-align on the exponent.

22:35.000 --> 22:37.000
Which means it's a lot faster.

22:37.000 --> 22:40.000
I only want to do that when I absolutely have to.

22:40.000 --> 22:45.000
So what we typically do is you'll break down the vector unit portion of it.

22:45.000 --> 22:50.000
You'll make that part compliant, and you'll have it as an opt-in mode.

22:50.000 --> 22:52.000
There's no good answer there.

22:52.000 --> 22:54.000
It's give and take.

22:54.000 --> 22:59.000
Some of the compilers offer the option to drop conformance for certain types of numbers.

22:59.000 --> 23:02.000
What are your thoughts on that?

23:02.000 --> 23:05.000
Yeah, I've hunted down most of the compiler engineers that enable that.

23:05.000 --> 23:07.000
They've disappeared off the face of the Earth.

23:07.000 --> 23:09.000
I'm covering for you.

23:09.000 --> 23:11.000
Oh sorry.

23:11.000 --> 23:12.000
Sorry.

23:12.000 --> 23:15.000
The question was definitely not a joke.

23:15.000 --> 23:17.000
Definitely not bait for yours truly.

23:17.000 --> 23:20.000
A lot of compilers allow you to disable or

23:20.000 --> 23:25.000
innately disable floating-point compliance from the get-go.

23:25.000 --> 23:27.000
What are my thoughts on that?

23:27.000 --> 23:29.000
My thoughts on that is that it's very silly.

23:29.000 --> 23:32.000
Beyond the joke, it's because once again,

23:33.000 --> 23:36.000
a user should only ever have to opt into something non-standard.

23:36.000 --> 23:40.000
The standard is there because it forms the fundamentals of the hardware software

23:40.000 --> 23:43.000
contract that you as a vendor, you as a compiler

23:43.000 --> 23:47.000
developer, and you as a user have all agreed to, so that yes,

23:47.000 --> 23:50.000
I can trust that my math is going to be done the way I expect it to be.

23:50.000 --> 23:54.000
And when you disable that as an opt-in feature,

23:54.000 --> 23:55.000
That is fine.

23:55.000 --> 23:58.000
Because that is the user telling everyone else I'm smart.

23:58.000 --> 23:59.000
I know what I'm doing.

23:59.000 --> 24:03.000
I can throw away compliance because in this specific subdomain of a domain

24:03.000 --> 24:07.000
that is a reasonable thing to do in the name of my end goal.

24:07.000 --> 24:12.000
The problem comes around when you disable it from the get go.

24:12.000 --> 24:16.000
Because then between hardware implementations between different compilers

24:16.000 --> 24:18.000
between different compiler generations,

24:18.000 --> 24:22.000
I can no longer trust that my math will be done correctly.

24:22.000 --> 24:26.000
And for applications that have sensitivity to underlying mathematics,

24:26.000 --> 24:28.000
say a lot of HPC software,

24:28.000 --> 24:30.000
I can no longer trust my results.

24:30.000 --> 24:34.000
Now, at the same time, I kind of want to bookend this.

24:34.000 --> 24:37.000
I used to work for the weather service in Canada.

24:37.000 --> 24:42.000
And one of the things that we had is we simultaneously had operational weather models

24:42.000 --> 24:44.000
that have to be precise,

24:44.000 --> 24:46.000
but they also have to run like clockwork.

24:46.000 --> 24:48.000
But to validate the models,

24:48.000 --> 24:51.000
one of our problems was the whole bit reproduction problem,

24:51.000 --> 24:56.000
where a numerical result was calculated in the early 70s,

24:56.000 --> 25:00.000
way before 754, before the 8087 was even around.

25:00.000 --> 25:02.000
This was written on punch cards.

25:02.000 --> 25:06.000
And the requirement of the new model is to replicate that exact numerical result.

25:06.000 --> 25:09.000
Now, that is conformance for the sake of conformance, and that is not reasonable.

25:09.000 --> 25:11.000
And I know that that's very prevalent in the weather world.

25:11.000 --> 25:14.000
I know it's prevalent in some of the banking world,

25:14.000 --> 25:17.000
where you have that exact result problem.

25:17.000 --> 25:19.000
So there's a balance point,

25:19.000 --> 25:22.000
and I don't want to be a zealot for the sake of being a zealot,

25:22.000 --> 25:25.000
and I don't want to be a zealot for the sake of conformance.

25:25.000 --> 25:28.000
It's just: do not disable conformance out from under the user.

25:28.000 --> 25:30.000
Because it's the user that gets to make the decision of knowing

25:30.000 --> 25:32.000
what matters and what doesn't matter.

25:32.000 --> 25:35.000
And it's not reasonable for the hardware to disable that conformance for them.

25:35.000 --> 25:36.000
Okay.

25:36.000 --> 25:38.000
I'll get off my soapbox.

25:38.000 --> 25:39.000
Thanks.

25:39.000 --> 25:41.000
Thank you very much for being here.

