WEBVTT

00:00.000 --> 00:13.200
All right, again, this is John. He's going to be talking about the bare metal perspective

00:13.200 --> 00:23.320
on AMDGPUs. And I think, oh, maybe, oh, no, I thought we were going. They're going to

00:23.320 --> 00:30.320
the laser. You got to connect. You got to say yes. Oh, really?

00:30.320 --> 00:36.320
Quick. If we can't have it, it's okay.

00:36.320 --> 00:38.320
Okay.

00:38.320 --> 00:39.320
Yes.

00:39.320 --> 00:40.320
Hey, it works.

00:40.320 --> 00:43.320
All right, John, you ready? Yeah, my turn.

00:43.320 --> 00:44.320
Is this on?

00:44.320 --> 00:45.320
Hello.

00:45.320 --> 00:46.320
All right, take away John.

00:46.320 --> 00:48.320
Okay, awesome.

00:48.320 --> 00:52.320
There you go. There's an echo from the back of the room, which is incredibly off-putting.

00:53.320 --> 00:58.320
But nevertheless, I want to talk about programming AMD GPUs.

00:58.320 --> 01:02.320
It is very late on the Sunday. You are looking tired.

01:02.320 --> 01:05.320
We're going to keep this very simple.

01:05.320 --> 01:06.320
Okay.

01:06.320 --> 01:11.320
Look, there's one idea I want to get across. I'm going to hit it very hard.

01:11.320 --> 01:13.320
This is the state of play.

01:13.320 --> 01:18.320
We are programming GPUs. We use GPU programming languages.

01:19.320 --> 01:23.320
I am very lucky, in that my previous speaker demonstrated beautifully the

01:23.320 --> 01:25.320
joys of OpenCL and Vulkan.

01:25.320 --> 01:29.320
I could not have asked for a better introduction.

01:29.320 --> 01:32.320
And we're very happy with this.

01:32.320 --> 01:34.320
Me, I love OpenMP.

01:34.320 --> 01:36.320
Other people love other languages.

01:36.320 --> 01:41.320
They are all uniformly very complicated.

01:41.320 --> 01:43.320
They offer convenience.

01:43.320 --> 01:46.320
And what they take from you is time.

01:46.320 --> 01:52.320
But they also generate huge amounts of revenue for some companies, which successfully

01:52.320 --> 01:55.320
defend this moat.

01:55.320 --> 02:04.320
There's a small glitch in this axiom, which is that GPUs are not actually special.

02:04.320 --> 02:06.320
A GPU is

02:06.320 --> 02:11.320
very much the same thing as a CPU.

02:11.320 --> 02:14.320
That is the idea I'm trying to get across.

02:14.320 --> 02:16.320
I'd like a brief show of hands.

02:16.320 --> 02:21.320
Stick a hand up in the air if you think I'm wrong.

02:21.320 --> 02:23.320
One person.

02:23.320 --> 02:24.320
Okay.

02:24.320 --> 02:26.320
We need to wake up a little bit.

02:26.320 --> 02:27.320
Okay.

02:27.320 --> 02:31.320
These are meant to be, like, reasonable statements.

02:31.320 --> 02:33.320
There are companies that make GPUs.

02:33.320 --> 02:37.320
They charge money for GPUs, and they make you program the damn things in OpenCL.

02:37.320 --> 02:39.320
We don't do this just for a laugh.

02:39.320 --> 02:43.320
We do it because they are supposed to look different.

02:43.320 --> 02:45.320
That particular one.

02:45.320 --> 02:48.320
I particularly like it, because it's one block of memory

02:48.320 --> 02:50.320
with a load of x86 interpreters on it,

02:50.320 --> 02:53.320
and a load of GCN interpreters on it.

02:53.320 --> 02:58.320
At that point, it's particularly easy to forget which one you're working with.

02:58.320 --> 03:00.320
Or what you're programming it with. Thank you.

03:00.320 --> 03:06.320
But for the sake of the one person who doesn't agree with us.

03:06.320 --> 03:12.320
If this claim holds, if CPUs and GPUs are the same thing, then it follows.

03:12.320 --> 03:18.320
The GPU languages for programming the special thing don't have to exist.

03:18.320 --> 03:20.320
Right?

03:20.320 --> 03:25.320
If GPUs are special, you need special languages to work with them.

03:25.320 --> 03:32.320
If GPUs and CPUs are the same thing, this didn't have to happen.

03:32.320 --> 03:34.320
You are a tough crowd.

03:35.320 --> 03:38.320
Come on. Give me something here.

03:38.320 --> 03:43.320
We're just great. Good. One person disagrees.

03:43.320 --> 03:49.320
So I think what we are actually looking at here is a protective moat,

03:49.320 --> 03:53.320
which means your wonderful software stacks.

03:53.320 --> 04:00.320
carefully built on industry standard, reliable, proprietary software.

04:00.320 --> 04:04.320
It's not a feature. This is something you've done wrong.

04:04.320 --> 04:07.320
Because there's a thing called vendor lock-in.

04:07.320 --> 04:10.320
Which a FOSDEM crowd should have heard of.

04:10.320 --> 04:12.320
Right?

04:12.320 --> 04:20.320
This lovely castle with its wonderful features comes with things like support contracts.

04:20.320 --> 04:25.320
When your kit stops working, the expensive box full of fans.

04:25.320 --> 04:29.320
You can't do anything with it until the vendor gets back to you.

04:29.320 --> 04:31.320
We know of a solution to that.

04:31.320 --> 04:35.320
And it goes something like open source.

04:35.320 --> 04:37.320
Right?

04:37.320 --> 04:39.320
I am standing here in AMD branding.

04:39.320 --> 04:46.320
I am aware of the open source perception of AMD's compute stack.

04:46.320 --> 04:52.320
Outside of the HPC world, which loves us, everyone else is not quite as convinced.

04:52.320 --> 04:58.320
Which is, I suspect, part of why we go to the safety of industry standard software.

04:59.320 --> 05:05.320
But what AMD does come with is you've got our driver on your laptop now.

05:05.320 --> 05:07.320
It's in the Linux kernel.

05:07.320 --> 05:10.320
The entire user space is ROCm.

05:10.320 --> 05:13.320
It's on GitHub.

05:13.320 --> 05:15.320
It's on github.com/ROCm.

05:15.320 --> 05:21.320
So when you do not like how our software works, change it.

05:21.320 --> 05:23.320
That's exciting.

05:23.320 --> 05:26.320
Oh no, I've lost the slides.

05:26.320 --> 05:29.320
Do we know what to do about that?

05:29.320 --> 05:31.320
There.

05:31.320 --> 05:34.320
The laptop has gone on standby.

05:34.320 --> 05:36.320
It doesn't matter much.

05:36.320 --> 05:41.320
I guess so.

05:41.320 --> 05:46.320
So where I'm going with this is:

05:46.320 --> 05:51.320
People believe very firmly that CPUs and GPUs are different things.

05:51.320 --> 05:55.320
Even though the underlying silicon looks really the same.

05:55.320 --> 06:01.320
And the reason people think they're different is that the experience of using them is very different.

06:01.320 --> 06:04.320
You grab a random x86 machine.

06:04.320 --> 06:05.320
It boots.

06:05.320 --> 06:06.320
You write code.

06:06.320 --> 06:07.320
You have a debugger.

06:07.320 --> 06:08.320
It's easy.

06:08.320 --> 06:10.320
You grab a random GPU.

06:10.320 --> 06:12.320
You try to make it do stuff.

06:12.320 --> 06:13.320
And it fights you.

06:13.320 --> 06:15.320
It's a bad experience.

06:15.320 --> 06:20.320
And this leads to fear and sorrow.

06:20.320 --> 06:25.320
And the belief that the GPU-specific languages are important.

06:25.320 --> 06:28.320
To help you keep things working.

06:28.320 --> 06:33.320
But it's not anything to do with the hardware.

06:33.320 --> 06:34.320
On one of them,

06:34.320 --> 06:38.320
the horrible, gnarly embedded programming is done by the Linux kernel people.

06:38.320 --> 06:42.320
They present you this friendly world that you build upon.

06:42.320 --> 06:45.320
In the other world, you're actually doing embedded programming.

06:45.320 --> 06:47.320
Which is hard.

06:47.320 --> 06:49.320
That's not a GPU CPU thing.

06:49.320 --> 06:51.320
You program x86 directly.

06:51.320 --> 06:55.320
You have a bad time too.

06:55.320 --> 06:59.320
And this, this picture, which I really thought would get laughs.

06:59.320 --> 07:01.320
You guys are fighting me here.

07:01.320 --> 07:04.320
Have we not seen this picture of software engineering?

07:04.320 --> 07:05.320
Oh.

07:05.320 --> 07:08.320
This is how we build all our code, right?

07:08.320 --> 07:11.320
The GPU-driven picture.

07:11.320 --> 07:13.320
The sound sucks.

07:13.320 --> 07:14.320
Yes.

07:14.320 --> 07:15.320
Yes.

07:15.320 --> 07:16.320
Let's go be true.

07:16.320 --> 07:18.320
I'm trying to shout at you all.

07:18.320 --> 07:19.320
Ah.

07:19.320 --> 07:20.320
Yes.

07:20.320 --> 07:23.320
We work with what we got.

07:23.320 --> 07:24.320
Where am I going?

07:24.320 --> 07:25.320
All right.

07:25.320 --> 07:26.320
Fine.

07:26.320 --> 07:28.320
This is meant to be funny and wake people up.

07:28.320 --> 07:31.320
But you're not loving the ball of mud metaphor. Okay.

07:31.320 --> 07:34.320
So we will continue.

07:34.320 --> 07:40.320
So I claim that CPUs and GPUs are the same thing.

07:40.320 --> 07:43.320
That's exciting.

07:43.320 --> 07:45.320
That's x86, which works really easily.

07:45.320 --> 07:47.320
It's fighting me a bit.

07:47.320 --> 07:49.320
Never mind.

07:49.320 --> 07:58.320
And I think it is a shame that so much GPU programming is done in a sort of copy

07:58.320 --> 08:01.320
and paste, follow-the-API-guide fashion.

08:01.320 --> 08:04.320
Because most vendors don't really tell you anything about the hardware.

08:04.320 --> 08:06.320
But this vendor does.

08:06.320 --> 08:07.320
You can go read the docs.

08:07.320 --> 08:11.320
We tell you how the damn thing works, all the way down to the ISA.

08:11.320 --> 08:16.320
If you don't like our software, not only can you ignore our driver, like at least one very

08:16.320 --> 08:18.320
vocal person has decided to do.

08:18.320 --> 08:20.320
You can also ignore the assembler.

08:20.320 --> 08:21.320
And the compiler.

08:21.320 --> 08:22.320
You can ignore all of it.

08:22.320 --> 08:26.320
We've given you enough documentation that you can push raw bits across PCI.

08:26.320 --> 08:29.320
And it will do what you tell it to.

08:29.320 --> 08:37.320
But this will be incomplete without mentioning that there is one difference between

08:37.320 --> 08:39.320
the two architectures.

08:39.320 --> 08:41.320
And it is fundamental.

08:41.320 --> 08:42.320
And it is historical.

08:42.320 --> 08:44.320
And it is important.

08:44.320 --> 08:49.320
But it is also invisible at the software layer.

08:49.320 --> 08:51.320
And it goes roughly.

08:51.320 --> 08:53.320
If you've got one process.

08:53.320 --> 08:55.320
And you're only running one thing.

08:55.320 --> 08:59.320
You work really, really hard to make that one thing run very fast.

08:59.320 --> 09:04.320
This is what most of x86, and AArch64, spends all of its time doing.

09:05.320 --> 09:08.320
But we don't have one process.

09:08.320 --> 09:12.320
We haven't for 20 years.

09:12.320 --> 09:13.320
40 years.

09:13.320 --> 09:15.320
Quite a long time now.

09:15.320 --> 09:18.320
And if you've got more than one process running anyway,

09:18.320 --> 09:20.320
on a memory stall,

09:20.320 --> 09:23.320
Instead of all this careful speculative work.

09:23.320 --> 09:26.320
You can just run the other process.

09:26.320 --> 09:31.320
And that is what a GPU gives you.

09:32.320 --> 09:39.320
Provided you're willing to turn up with at least a few hundred separate things to run.

09:39.320 --> 09:44.320
Like every server any of you are using is already into the thousands.

09:44.320 --> 09:47.320
It's just going to run something else on memory stall.

09:47.320 --> 09:50.320
That is so much simpler.

09:50.320 --> 09:55.320
The experience of programming a GPU is very complicated.

09:55.320 --> 09:57.320
It takes a lot of work.

09:57.320 --> 10:01.320
The silicon underneath is doing something much more straightforward.

10:01.320 --> 10:05.320
And that is the core of the efficiency gain.

10:05.320 --> 10:12.320
It's why your compute per watt can be so high.

10:12.320 --> 10:14.320
How long do you have?

10:14.320 --> 10:15.320
Do you know?

10:15.320 --> 10:17.320
Anyone know how?

10:17.320 --> 10:18.320
Ten minutes.

10:18.320 --> 10:20.320
Ooh.

10:20.320 --> 10:22.320
Let me assess my options.

10:22.320 --> 10:28.320
Um.

10:28.320 --> 10:30.320
How disposed are you guys to questions?

10:30.320 --> 10:32.320
So we've got two good paths to take here.

10:32.320 --> 10:39.320
We can try to do questions, or I can rant to you about CUDA.

10:39.320 --> 10:40.320
CUDA.

10:40.320 --> 10:41.320
Awesome.

10:41.320 --> 10:42.320
Awesome.

10:42.320 --> 10:43.320
We're doing that.

10:43.320 --> 10:44.320
So I've been waiting for an opportunity.

10:44.320 --> 10:47.320
And you guys have basically not heckled me with anything.

10:47.320 --> 10:49.320
So I still have time left.

10:49.320 --> 10:50.320
Okay.

10:50.320 --> 10:51.320
So this thing.

10:51.320 --> 10:53.320
This thing down here.

10:53.320 --> 10:55.320
Single instruction, multiple thread.

10:55.320 --> 10:57.320
This is a hack.

10:57.320 --> 11:01.320
This is a really clever hack from quite a long time ago.

11:01.320 --> 11:07.320
And it is essentially what has filled the moat around Nvidia's castle with tar.

11:07.320 --> 11:10.320
This is what they have done wrong.

11:10.320 --> 11:12.320
How is that still not contentious?

11:12.320 --> 11:14.320
Do you guys not like CUDA?

11:14.320 --> 11:17.320
No one loves pixel shaders.

11:18.320 --> 11:19.320
All right.

11:19.320 --> 11:24.320
So in the olden days, a GPU had a vector unit.

11:24.320 --> 11:26.320
And memory operations.

11:26.320 --> 11:28.320
That was it.

11:28.320 --> 11:33.320
So if you wanted to do control flow, you had to do masking on that vector unit.

11:33.320 --> 11:39.320
And if you have to write that by hand, you get very sad, very quickly, and very confused.

11:39.320 --> 11:44.320
So what can you do instead? This was worked out quite a while ago.

11:44.320 --> 11:50.320
What you can do, instead of masking the thing by hand, is invent a programming model

11:50.320 --> 11:51.320
which hides it from you.

11:51.320 --> 11:54.320
And that makes everyone very happy.
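The hand-written masking described above can be sketched in plain C (an imaginary 8-lane vector unit; the lane width and values are invented): every lane computes both sides of the branch, and a mask selects which result each lane keeps.

```c
#define LANES 8

/* Per-lane branch: if (x[i] < 0) y[i] = -x[i]; else y[i] = x[i].
 * A bare vector unit has no branch instruction for this: you compute
 * a lane mask, evaluate both arms, and blend the results lane by lane. */
static void vector_abs(const int *x, int *y) {
    int mask[LANES];
    for (int i = 0; i < LANES; i++)
        mask[i] = x[i] < 0;                /* compare producing a lane mask */
    for (int i = 0; i < LANES; i++) {
        int then_val = -x[i];              /* both arms are always computed */
        int else_val = x[i];
        y[i] = mask[i] ? then_val : else_val; /* masked blend/select */
    }
}
```

Writing every if/else this way by hand is exactly the sadness being described.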

11:54.320 --> 12:01.320
The problem is, 20 years later, you've still got that great idea of hiding the vector unit.

12:01.320 --> 12:03.320
But you don't just have a vector unit anymore.

12:03.320 --> 12:07.320
You can now do scalar instructions; you can use them for branches.

12:07.320 --> 12:13.320
And what SIMT gives you is a programming model

12:14.320 --> 12:21.320
where, when you talk about an int or a float, you are implicitly talking about a vector of 32 ints or floats.

12:21.320 --> 12:26.320
So when you want to talk about a single int, you don't really have a language to do it.

12:26.320 --> 12:31.320
There's a clever piece of work from Intel, where you annotate stuff as uniform,

12:31.320 --> 12:36.320
and try to make sense of the semantics from there.

12:36.320 --> 12:45.320
But for the most part, SIMT means that writing easy, one-element-of-a-lane-at-a-time code works very well,

12:45.320 --> 12:48.320
and writing complicated stuff works very badly.
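The SIMT side of that trade can be sketched the same way (plain C again, with an invented 8-wide machine standing in for a real 32- or 64-lane one): you write the kernel as scalar code about one int, and an implicit per-lane loop, playing the role of the hardware, runs it across the whole vector.

```c
#define LANES 8 /* a hypothetical 8-wide machine; real GPUs use 32 or 64 */

/* The "kernel": written as scalar code about one int,
 * exactly as SIMT languages encourage. */
static int kernel(int x) {
    return x * x + 1;
}

/* What actually happens: the scalar body is executed
 * once per lane across an implicit vector. */
static void launch(const int *in, int *out) {
    for (int lane = 0; lane < LANES; lane++)
        out[lane] = kernel(in[lane]);
}
```

The kernel is trivial to write precisely because the vector, the lanes, and any masking are hidden; the trouble starts when you need to talk about one lane, or all of them at once.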

12:48.320 --> 12:54.320
And that's a shame, because it means you are prone to using GPUs for writing things like graphics.

12:54.320 --> 13:00.320
And we should be using GPUs for things like running the processes of an arbitrary server.

13:00.320 --> 13:03.320
Because it's a general purpose computer.

13:07.320 --> 13:10.320
And sadly, SIMT is really deeply ingrained.

13:10.320 --> 13:21.320
If you pick up LLVM and you feed some C into it, because you want to run some C on a GPU, you are stuck with the SIMT model.

13:21.320 --> 13:24.320
And I would personally very much like to burn it all down.

13:24.320 --> 13:30.320
But I am having trouble winning widespread support for changing the model of GPU programming,

13:30.320 --> 13:33.320
to one which looks like writing AVX code.

13:33.320 --> 13:39.320
Which confuses me a bit, because people don't mind AVX code that much, and they don't like SIMT that much.

13:39.320 --> 13:44.320
So it should be possible to persuade people to do the simpler thing,

13:44.320 --> 13:52.320
which lets you tell the hardware what you want to do, instead of telling the compiler roughly what you would like to happen,

13:52.320 --> 13:59.320
and then being cross when the code that comes out isn't what you had in mind when you turned your code into a scalar representation.

14:00.320 --> 14:07.320
In particular, a whole load of bug reports about how the quality of code emitted by LLVM is unacceptable,

14:07.320 --> 14:13.320
would disappear if people would talk to the damn thing in terms of vectors, instead of in terms of scalars.
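What "talking in terms of vectors" might look like, as a plain C sketch (the `vec` type and lane count are invented stand-ins for real wave-level types): the program's unit of data is the whole vector, so there is no imaginary scalar program for a compiler to re-vectorize.

```c
#define LANES 8

typedef struct { int v[LANES]; } vec; /* the whole vector is the value */

/* Vector add written about vectors, not about one scalar lane:
 * the width is explicit, and what you write is what the hardware does. */
static vec vec_add(vec a, vec b) {
    vec r;
    for (int i = 0; i < LANES; i++)
        r.v[i] = a.v[i] + b.v[i];
    return r;
}
```

This is the AVX-style shape: explicit vector values and operations over them, rather than a scalar fiction that the toolchain must widen back out.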

14:13.320 --> 14:16.320
No, no, no, no.

14:16.320 --> 14:20.320
Well, we have five. Five minutes.

14:20.320 --> 14:22.320
Okay, cool.

14:22.320 --> 14:26.320
So, this is actually the last slide.

14:27.320 --> 14:30.320
I am so dependent on you guys talking to me.

14:30.320 --> 14:35.320
You're just not doing it, and it's messed up all my timing.

14:35.320 --> 14:44.320
So instead of OpenCL, or Vulkan, or OpenMP, or whatever else has shown up in the meantime, or HIP,

14:44.320 --> 14:50.320
I think we should program the GPU, the same way we program the CPU.

14:50.320 --> 14:52.320
We really should.

14:52.320 --> 14:57.320
At the least, start using C++ instead of... I don't care which.

14:57.320 --> 15:06.320
Then you've still got the same horrible semantics you're used to, but you get all the happy feelings from the C++ ecosystem,

15:06.320 --> 15:12.320
instead of working from translated documentation.

15:12.320 --> 15:18.320
But if you would like to program the thing efficiently and effectively,

15:18.320 --> 15:24.320
just write the damn thing in assembly, like you should be doing on x86.

15:24.320 --> 15:26.320
Seriously.

15:26.320 --> 15:27.320
Okay.

15:27.320 --> 15:31.320
Show of hands: people who like writing assembly.

15:31.320 --> 15:32.320
All right.

15:32.320 --> 15:34.320
Four of you.

15:34.320 --> 15:36.320
You should all like writing assembly.

15:36.320 --> 15:42.320
Writing assembly is much more fun than any of these other things.

15:42.320 --> 15:44.320
One guy shaking head.

15:45.320 --> 15:47.320
Oh, cool.

15:47.320 --> 15:50.320
All right.

15:50.320 --> 15:54.320
So, some closing thoughts.

15:54.320 --> 16:01.320
It's worth noting that instead of the various languages which are being marketed as a good idea,

16:01.320 --> 16:12.320
if you are building on LLVM, you can use anything which emits IR: typing it into your editor, or C, or Rust,

16:12.320 --> 16:14.320
or whatever you see fit.

16:14.320 --> 16:20.320
Anything which will emit suitably structured IR can be fed to your GPU.

16:20.320 --> 16:27.320
And if you want to go down the whole PyTorch, TensorFlow fashion, that's also fine.

16:27.320 --> 16:33.320
I would personally like people to stop running this through LLVM, because it drives you through the SIMT model.

16:33.320 --> 16:39.320
I think you should, instead, map from the tensors in your machine learning representation

16:39.320 --> 16:42.320
directly to the vectors of your machine.

16:42.320 --> 16:47.320
Instead of going through this weird scalar imaginary world,

16:47.320 --> 16:54.320
and that would yield much faster code, and much faster compilation, and much happier users.

16:54.320 --> 16:58.320
But if you don't fancy that, you can still do the scalar thing.

16:58.320 --> 17:00.320
And I might be out of slides.

17:00.320 --> 17:03.320
Oh, yes, there's a disclaimer.

17:04.320 --> 17:08.320
So, my employer is very nice.

17:08.320 --> 17:16.320
This set of ideas does not correlate very well with our general marketing approach to GPUs.

17:16.320 --> 17:23.320
In particular, all of this, back here: our GPU-specific languages.

17:23.320 --> 17:26.320
We really like them, and we put a lot of effort into implementing them.

17:26.320 --> 17:28.320
Like, we read your bug reports. We do.

17:28.320 --> 17:30.320
Sometimes we even act on some of them.

17:30.320 --> 17:36.320
So, my employer would very much like you to continue programming things like you currently do.

17:36.320 --> 17:41.320
But if you would, instead, like huge amounts of your life back,

17:41.320 --> 17:46.320
and you fancy bypassing a whole load of compute stack instead of doing the SIMT model thing,

17:46.320 --> 17:51.320
I want you to know that you can, and it's okay, it will still work.

17:51.320 --> 17:56.320
Okay, I'm going to give up at that point, but thank you all.

17:56.320 --> 17:58.320
I'm grateful for your time.

17:59.320 --> 18:01.320
Thank you. Thank you very much.

18:01.320 --> 18:02.320
Thank you.

18:02.320 --> 18:04.320
Great presentation.

18:04.320 --> 18:06.320
I love writing.

18:06.320 --> 18:07.320
Excellent.

18:07.320 --> 18:08.320
Very good.

18:08.320 --> 18:10.320
Yes, it brings me much joy.

