WEBVTT

00:00.000 --> 00:28.200
Nvidia, with its GPUs and its CUDA software ecosystem, has between 70 and 95% of the world's AI

00:28.200 --> 00:31.600
chip market.

00:31.600 --> 00:41.760
If AI is going to thrive, we need a wider ecosystem of both hardware and software.

00:41.760 --> 00:46.760
And the question I put to you today is: how?

00:46.760 --> 00:50.200
I'm Jeremy Bennett.

00:50.200 --> 00:57.200
Today we're going to give you a step towards the answer, using open source.

00:57.200 --> 01:02.560
I hope that you'll come away from this with an understanding of how you can have a new chip

01:02.560 --> 01:09.680
design for AI and bring up the software ecosystem you need, so that you can run

01:09.680 --> 01:12.920
all your favorite AI systems.

01:12.920 --> 01:17.720
I'm joined by my colleague William Jones, who will take you through the practical, real-world

01:17.720 --> 01:18.720
side of this.

01:18.720 --> 01:22.360
I'm going to give you an overview to start.

01:22.360 --> 01:26.520
We're focusing on neural networks, and I am fully aware that neural networks are not

01:26.520 --> 01:30.640
the whole of machine learning and AI, but it's the big one.

01:30.640 --> 01:37.040
The way all these systems work is that the neural network is represented as a graph, and the software,

01:37.040 --> 01:43.920
whether it's PyTorch, TensorFlow or whatever you're using, may do a bit of graph level

01:43.920 --> 01:47.920
transformation to make the graph a bit more efficient, but fundamentally sits there

01:47.920 --> 01:52.840
walking over that graph, looking at the nodes and the nodes tell it what the arguments

01:52.840 --> 01:57.560
are, the tensors, the glorified matrices, which are the data, and what the operation

01:57.560 --> 02:02.920
to perform is, whether it's an add or a matrix multiplication or a convolution.

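NOTE
A toy sketch, illustrative only, of the graph walk just described: each
node names an operation and its tensor arguments, and a runner walks the
nodes and calls the matching kernel. These types are invented for
illustration, not any framework's real internals.
  #include <functional>
  #include <map>
  #include <string>
  #include <vector>
  struct Tensor { std::vector<float> data; };  // a "glorified matrix"
  struct Node {
    std::string op;                     // e.g. "add", "matmul", "conv2d"
    std::vector<const Tensor*> args;    // the data the op consumes
  };
  using Kernel = std::function<Tensor(const std::vector<const Tensor*>&)>;
  // Walk the graph, look at each node, run the operation it names.
  Tensor run(const std::vector<Node>& graph,
             const std::map<std::string, Kernel>& kernels) {
    Tensor last;
    for (const Node& n : graph) last = kernels.at(n.op)(n.args);
    return last;
  }
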
02:02.920 --> 02:07.560
They sit in a world that is a host and accelerator-based system.

02:07.560 --> 02:13.120
The host may be an x86 today, the accelerator almost certainly, unfortunately, is Nvidia,

02:13.120 --> 02:18.760
not because Nvidia is bad, but because they've got 90% of the market.

02:18.760 --> 02:23.800
The dispatcher sits in there and works out which of those to run on.

02:23.800 --> 02:24.920
Do I run on the host?

02:24.920 --> 02:26.560
Do I run on the accelerator?

02:26.560 --> 02:30.960
Or do I run across both of them?

02:30.960 --> 02:35.920
And we're focused, particularly, today, on the dispatcher: how it makes

02:35.920 --> 02:40.680
that decision and pushes software onto the accelerator.

02:40.680 --> 02:47.440
And this works for microcontrollers and standard software: PyTorch, TensorFlow, you've

02:47.440 --> 02:53.720
got ExecuTorch, you've got LiteRT, and it's probably handing single nodes off to be accelerated

02:53.720 --> 02:58.760
by a special accelerator, and it's probably only doing that with some operations.

02:58.760 --> 03:03.280
But it goes right up to huge co-processors.

03:03.280 --> 03:05.080
And we've worked on both these scenarios.

03:05.080 --> 03:06.080
We work on ExecuTorch.

03:06.080 --> 03:12.120
We work on one of these, which is a RISC-V chip that has more than 1,000 cores on

03:12.120 --> 03:13.120
it.

03:13.120 --> 03:17.040
And in this case, the dispatcher's got to work out how to get work across all those cores.

03:17.040 --> 03:21.400
It's probably trying to handle multiple operations at one time, postponing delivery of

03:21.400 --> 03:25.480
results, and so forth.

03:25.480 --> 03:31.080
So William is now going to talk you through a real open source example. William is my head

03:31.080 --> 03:32.080
of AI.

03:32.080 --> 03:37.320
He is also responsible for the UK's national guidelines on best practice in AI for the

03:37.320 --> 03:40.080
electronic systems industry.

03:40.080 --> 03:46.680
William, there's now going to be a brief pause while we change over the microphone.

03:46.680 --> 03:57.960
I'm going to find which PowerPoint I have, but I think you can see it.

03:57.960 --> 03:58.960
Cool.

03:58.960 --> 04:00.960
It's the earlier one, OK.

04:00.960 --> 04:04.520
Yes, so the project I want to talk you through is a student project that we've

04:04.520 --> 04:08.280
been doing for the last four or five months, I think.

04:08.280 --> 04:13.520
So every year, we do a student project with Southampton University in the UK, and we host

04:13.520 --> 04:19.880
a set of the students to do some sort of relevant industrial project to help

04:19.880 --> 04:21.800
them along their degrees.

04:21.800 --> 04:26.800
And this year, we had six students for ten weeks, and we asked them to go away and integrate

04:26.800 --> 04:32.160
well, go through the process and create a demonstrator project of integrating a new accelerator

04:32.160 --> 04:34.520
into the PyTorch framework.

04:34.520 --> 04:39.320
And in particular, by the way, we were hoping to have pictures of the students up here, but we

04:39.320 --> 04:41.280
didn't get that sorted in time, so we just have their names.

04:41.280 --> 04:45.080
And they deserve a lot of credit, because this was an incredibly good project.

04:45.080 --> 04:51.600
So yeah, so particularly what we wanted to do with this project is we asked the students

04:51.600 --> 04:56.560
to bring up a RISC-V core as the accelerator, on an FPGA of their choice.

04:56.560 --> 05:01.560
We asked them to go into PyTorch and create a new device in PyTorch and modify the dispatcher

05:01.560 --> 05:03.360
to dispatch to this device.

05:03.360 --> 05:06.880
But we also, most importantly, asked them to go away and create a tool chain between

05:06.880 --> 05:10.040
the two that would let this dispatch to any hardware.

05:10.040 --> 05:12.520
And sort of work in a hardware-agnostic way.

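NOTE
A minimal sketch, illustrative only, of PyTorch's standard extension point
for a new device: registering an op implementation under the PrivateUse1
(out-of-tree device) dispatch key. my_device_add is a hypothetical
stand-in, not the students' code; it shows where an offload call would go.
  #include <torch/extension.h>
  // The dispatcher routes aten::add here for tensors on the custom device.
  at::Tensor my_device_add(const at::Tensor& self, const at::Tensor& other,
                           const at::Scalar& alpha) {
    // A real backend would capture the arguments and ship them to the
    // accelerator here; this placeholder just falls back to the CPU.
    return at::add(self.cpu(), other.cpu(), alpha);
  }
  TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
    m.impl("add.Tensor", my_device_add);
  }
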
05:12.520 --> 05:14.400
And this is in many ways the tricky bit.

05:14.400 --> 05:18.000
And in terms of the slide that Jeremy put up earlier, we're sort of looking at this

05:18.000 --> 05:19.000
bottom bit here.

05:19.000 --> 05:20.000
The students have to go in.

05:20.000 --> 05:23.080
They have to modify the dispatcher in PyTorch and create a tool chain to connect it

05:23.080 --> 05:27.320
to a, well, it's an accelerator, but it's obviously not much of an accelerator, because

05:27.320 --> 05:33.640
it's just a RISC-V core, because the ultimate goal of this is just a demonstrator.

05:33.640 --> 05:37.680
And the sort of end goal of this for the students or their stretch goal at the end was

05:37.680 --> 05:43.040
to try and get ResNet-18 working on this RISC-V core as an accelerator.

05:43.040 --> 05:48.760
Now, we asked the students to do this with the oneAPI ecosystem and the oneAPI Construction

05:48.760 --> 05:49.760
Kit.

05:49.760 --> 05:54.200
So if you haven't met this, this is a construction kit that's largely based on the efforts

05:54.200 --> 05:59.680
of SYCL and OpenCL, which implement a model of heterogeneous computing.

05:59.680 --> 06:03.040
And the way that this would work, and the way that we were expecting the students to solve

06:03.040 --> 06:04.040
this project

06:04.040 --> 06:07.840
(they actually did something a little bit different) is that they would go to PyTorch.

06:07.840 --> 06:13.960
They would modify the dispatcher to dispatch to this new piece of hardware for the operations that

06:13.960 --> 06:14.960
they cared about.

06:14.960 --> 06:18.360
For the ResNet-18 ones, they would provide SYCL implementations.

06:18.360 --> 06:24.000
I think this was going to be add, batch norm, and the convolution 2D.

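NOTE
A minimal sketch, illustrative only, of a SYCL element-wise add, the
simplest of the three operations just mentioned; it assumes a, b, and out
are USM allocations visible to the device behind q. The students' actual
kernels are not reproduced here.
  #include <sycl/sycl.hpp>
  void sycl_add(sycl::queue& q, const float* a, const float* b, float* out,
                size_t n) {
    // One work-item per element; the SYCL runtime maps this onto whatever
    // device the queue targets (host CPU, GPU, or a custom accelerator).
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
      out[i] = a[i] + b[i];
    }).wait();
  }
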
06:24.000 --> 06:30.000
And then with the SYCL compiler, which is the DPC++ compiler in the oneAPI Construction

06:30.000 --> 06:34.000
Kit, they would produce a multi-architecture binary.

06:34.000 --> 06:39.840
They'd have the code on the host that is driving what is happening, and the code

06:39.840 --> 06:44.520
on the accelerator, which is obviously, like, actually what is doing things.

06:44.520 --> 06:50.000
And this multi-architecture binary would call out to the OpenCL API, which would implement

06:50.000 --> 06:52.280
this heterogeneous computing.

06:52.280 --> 06:57.160
And in the oneAPI Construction Kit, there is an extremely generic and simple implementation

06:57.160 --> 07:03.960
of this OpenCL API, which calls out to a really basic low-level hardware abstraction layer.

07:03.960 --> 07:08.440
It just sort of defines, I think, six functions, which define writing

07:08.440 --> 07:12.680
to the device, reading from the device, and things like this.

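NOTE
A guess, illustrative only, at the shape of such a minimal hardware
abstraction layer; the real oneAPI Construction Kit HAL differs in detail,
and all the names below are invented for illustration.
  #include <cstddef>
  #include <cstdint>
  // Roughly the handful of entry points such a HAL needs: allocate and
  // free device memory, move bytes each way, load a program, run a kernel.
  struct hal_device {
    virtual uint64_t mem_alloc(size_t size) = 0;
    virtual bool mem_free(uint64_t addr) = 0;
    virtual bool mem_write(uint64_t dst, const void* src, size_t size) = 0;
    virtual bool mem_read(void* dst, uint64_t src, size_t size) = 0;
    virtual uint64_t program_load(const void* elf, size_t size) = 0;
    virtual bool kernel_exec(uint64_t kernel_addr, uint64_t args_addr) = 0;
    virtual ~hal_device() = default;
  };
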
07:12.680 --> 07:15.160
And the scope of the student project was essentially, they'd have to go in, they'd have

07:15.160 --> 07:21.320
to do a little bit of work on producing the multi-architecture binary, and probably

07:21.320 --> 07:23.840
a bit more work on the hardware abstraction layer.

07:23.840 --> 07:26.800
And then they'd be able to go through this, demonstrate it could all work, stitch everything

07:26.800 --> 07:30.400
together, be a nice project.

07:30.400 --> 07:35.080
And ultimately, the goal of this, if the students had time, or if we were doing this

07:35.080 --> 07:39.160
for a real project, because this is how, you know, this is the familiar way for us to

07:39.160 --> 07:40.160
approach a project like this.

07:40.160 --> 07:44.560
SYCL is a very mature tool chain for doing this type of thing.

07:44.560 --> 07:47.880
You start with this generic, simple implementation of OpenCL, and you develop something

07:47.880 --> 07:53.520
more target specific and rich over time.

07:53.520 --> 07:56.680
Now our students actually ended up doing something a little bit different to this, about

07:56.680 --> 07:59.320
four weeks into the project, the students came to us.

07:59.840 --> 08:02.400
And I think they felt they were running out of time a little bit.

08:02.400 --> 08:06.760
It's a tough thing to get done in 60 engineer-weeks when you're still

08:06.760 --> 08:10.240
a young student and you haven't done this type of thing before.

08:10.240 --> 08:14.600
And they basically said, look, we think we can do this in an even quicker and simpler way,

08:14.600 --> 08:18.600
which gives us an even better chance of success, and we were very proud of them for doing this.

08:18.600 --> 08:22.200
It's not easy to have conversations like this with your sort of friendly industrial customers

08:22.200 --> 08:24.040
at the best of times.

08:24.040 --> 08:31.120
And they did a really good job of not just explaining what they needed,

08:31.120 --> 08:33.960
but what they proposed to do about the problem.

08:33.960 --> 08:39.040
And what the students ended up doing is they ended up sort of writing a micro hardware

08:39.040 --> 08:42.640
abstraction layer that sort of sidestepped quite a lot of what we'd originally expected

08:42.640 --> 08:44.560
them to have to do.

08:44.560 --> 08:50.760
So the flow that I described before, how we were expecting them to solve the problem,

08:50.760 --> 08:55.800
was that there would be implementations of the operations in SYCL.

08:55.800 --> 09:01.280
The DPC++ compiler would take these, with the OpenCL library, and compile them into this multi-architecture

09:01.280 --> 09:05.440
binary, where things would happen.

09:05.440 --> 09:08.680
And this multi-architecture binary would make calls into the OpenCL API, which would

09:08.680 --> 09:11.840
drive how things happened.

09:11.840 --> 09:15.320
But what the students did, instead of going through this whole process

09:15.320 --> 09:19.960
and implementing this whole hardware abstraction layer and way of interfacing with the hardware

09:20.040 --> 09:24.160
through the ecosystem, is they just wrote a little interposer in the OpenCL library that

09:24.160 --> 09:29.080
captured the sort of two calls that they were actually interested in dealing with, which

09:29.080 --> 09:34.480
are setting the arguments for an operation and making the operation call.

09:34.480 --> 09:38.400
And they just captured those and then went away, dispatched to the hardware on their own

09:38.400 --> 09:40.560
and got the results back.

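NOTE
A sketch, illustrative only, of the interposer idea: provide our own
definitions of the two OpenCL entry points the students cared about,
capture what they carry, and dispatch to the board ourselves. The helpers
record_argument and run_on_accelerator are hypothetical stand-ins for
their capture and offload code.
  #include <CL/cl.h>
  void record_argument(cl_kernel k, cl_uint idx, size_t sz, const void* v);
  void run_on_accelerator(cl_kernel k, cl_uint dims, const size_t* gsize);
  // Capture the "set arguments for an operation" call.
  extern "C" cl_int clSetKernelArg(cl_kernel kernel, cl_uint arg_index,
                                   size_t arg_size, const void* arg_value) {
    record_argument(kernel, arg_index, arg_size, arg_value);
    return CL_SUCCESS;
  }
  // Capture the "do the operation" call and dispatch to the hardware.
  extern "C" cl_int clEnqueueNDRangeKernel(
      cl_command_queue queue, cl_kernel kernel, cl_uint work_dim,
      const size_t* global_offset, const size_t* global_size,
      const size_t* local_size, cl_uint num_events,
      const cl_event* wait_list, cl_event* event) {
    run_on_accelerator(kernel, work_dim, global_size);
    return CL_SUCCESS;
  }
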
09:40.560 --> 09:44.480
And this was like, it was a really good solution to what we'd asked them to do.

09:44.480 --> 09:48.960
And in many ways it was better than what we'd asked them to do,

09:48.960 --> 09:52.560
or it was a better demonstrator, because it's even more minimal than what we'd originally

09:52.560 --> 09:56.720
expected, and that is what this was supposed to be: a sort of minimum viable

09:56.720 --> 10:02.080
demonstrator.

10:02.080 --> 10:05.560
In a bit more detail, because their solution was interesting:

10:05.560 --> 10:12.520
the students actually ended up using TCP to implement the communication with their

10:12.520 --> 10:13.520
FPGA accelerator.

10:13.520 --> 10:15.960
And it was quite a nice little system.

10:15.960 --> 10:23.600
And the students ended up using this Xilinx Zynq board as their FPGA, hosting the RISC-V

10:23.600 --> 10:24.600
core.

10:24.600 --> 10:28.120
And this actually is one of these FPGA boards that comes with a little processing system

10:28.120 --> 10:31.400
on it; it had a few Arm cores and a few peripherals, and it meant that the students were

10:31.400 --> 10:36.760
able to sort of just, like, almost overnight set up a TCP communication with it and use

10:36.760 --> 10:39.400
that to offload things to the core.

10:39.400 --> 10:42.960
Now, obviously, this isn't a realistic thing that you'd do with sort of a real chip.

10:42.960 --> 10:48.240
You could probably try and use PCIe or something, but it was a good demonstrator.

10:48.240 --> 10:53.000
It meant that the flow for the students' work was that they have these SYCL implementations

10:53.000 --> 10:59.720
of the ResNet operations; their OpenCL interposer would capture the arguments and the data

10:59.720 --> 11:00.720
from these.

11:00.720 --> 11:06.640
It would send them through TCP to the FPGA board's processing system.

11:06.640 --> 11:08.200
That would put this into shared memory.

11:08.200 --> 11:11.440
The shared memory would be operated on by the RISC-V core.

11:11.800 --> 11:15.680
The output went back into shared memory and everything was sent all the way back.

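NOTE
A sketch, illustrative only, of the host side of such a TCP offload using
plain POSIX sockets; the port argument and the raw fixed-size framing are
assumptions, not the students' actual protocol.
  #include <arpa/inet.h>
  #include <sys/socket.h>
  #include <unistd.h>
  #include <cstddef>
  #include <cstdint>
  bool offload_over_tcp(const char* board_ip, uint16_t port, const void* in,
                        size_t in_len, void* out, size_t out_len) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return false;
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, board_ip, &addr.sin_addr);
    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
      close(fd);
      return false;
    }
    send(fd, in, in_len, 0);  // the processing system copies this into
                              // the shared memory the core works on
    ssize_t got = recv(fd, out, out_len, MSG_WAITALL);  // result comes back
    close(fd);
    return got == static_cast<ssize_t>(out_len);
  }
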
11:16.680 --> 11:21.880
And yeah, it's a really nice, elegant solution that the students came up with for this problem.

11:21.880 --> 11:27.440
And it actually, yeah, it sidesteps one of the issues with doing things the way we'd

11:27.440 --> 11:30.560
originally proposed a few slides back.

11:30.560 --> 11:34.960
In that way of doing things, building things up via the sort of simple,

11:34.960 --> 11:38.240
very basic hardware abstraction layer that's provided in the Construction Kit,

11:38.360 --> 11:44.240
you actually end up compiling your whole program down for your RISC-V core,

11:44.240 --> 11:47.880
sending that to the accelerator and executing it there. With the solution of the students,

11:47.880 --> 11:49.080
you don't have to do that.

11:49.080 --> 11:51.920
You can just send the data, which is a major improvement.

11:54.120 --> 11:59.360
So in terms of the overall success of the project, to sort of sum up,

11:59.360 --> 12:04.120
we set the students the task of creating a minimum viable product of basically integrating

12:04.160 --> 12:10.320
a new piece of hardware and a tool chain with an accelerator framework, which was

12:10.320 --> 12:14.920
PyTorch, and we asked them to see if they could get ResNet-18 working.

12:14.920 --> 12:18.760
So obviously, they achieved what we'd asked them to do in getting a demo to work.

12:18.760 --> 12:19.960
That was fantastic.

12:19.960 --> 12:22.320
All of this is available for you, or if not

12:22.320 --> 12:28.160
now, it will shortly be available as open source, after their work has been marked by their examiners.

12:28.160 --> 12:29.640
So they've achieved that.

12:29.720 --> 12:32.320
They very nearly got ResNet-18 working as well.

12:32.320 --> 12:36.360
We initially wanted to get the sort of three dominant operations in ResNet-18 working,

12:36.360 --> 12:39.000
which were add, batch norm, and 2D convolution.

12:40.720 --> 12:48.440
The students got add and batch norm done very well, and didn't quite finish off 2D convolution

12:48.440 --> 12:50.920
basically by the time they had to stop and write up.

12:50.920 --> 12:55.760
They got a very long way along finishing that.

12:55.760 --> 12:57.120
So that was very good.

12:57.240 --> 13:00.520
And I spoke earlier about making a hardware agnostic solution as well,

13:00.520 --> 13:06.360
so that they'd be able to basically substitute out any piece of hardware for any other piece of hardware

13:06.360 --> 13:09.800
that was supported by the SYCL toolchain that we designed.

13:09.800 --> 13:10.720
And that worked as well.

13:10.720 --> 13:16.880
The students initially started working with, I think, the Xilinx MicroBlaze-V core for their FPGA,

13:16.880 --> 13:21.960
a RISC-V core that Xilinx provide that does actually work particularly well with FPGAs.

13:21.960 --> 13:26.720
And they were able to demonstrate that by sort of just substituting it out for a new piece of hardware.

13:26.720 --> 13:29.560
And swapping out for even something with a completely different instruction set.

13:29.560 --> 13:35.320
I think it still basically worked out of the box, and yeah, it was very good.

13:35.320 --> 13:37.600
And the students deserve a lot of praise for the work they've done.

13:37.600 --> 13:40.280
Anyway, I think I'm now handing back to Jeremy to finish off.

13:52.200 --> 13:56.120
So William has just described to you that we're not talking about theory.

13:56.120 --> 13:56.960
This is practical.

13:56.960 --> 13:59.320
We're able to talk about this one because it was a student project.

13:59.320 --> 14:02.760
We do this commercially for real customers as well.

14:02.760 --> 14:07.840
And that work is all freely available for students to use as a starting project.

14:07.840 --> 14:11.720
And I think it says something that six students in ten weeks,

14:11.720 --> 14:19.560
who had no previous exposure to AI software and infrastructure were able to bring that up successfully.

14:19.560 --> 14:22.080
So how do you do it?

14:22.160 --> 14:25.920
And let's look at what we mean by AI.

14:25.920 --> 14:27.640
And there is a pyramid problem here.

14:27.640 --> 14:33.400
We've all played with ChatGPT, millions of people around the world have, or these days DeepSeek.

14:34.400 --> 14:36.040
There's a lot of professional engineers.

14:36.040 --> 14:38.160
This is actually the big revolution in AI.

14:38.160 --> 14:42.600
It's people using standard models to make businesses run better.

14:42.600 --> 14:44.000
Get rid of the drudgery.

14:44.000 --> 14:48.280
Helping the lawyers find all their cases automatically.

14:48.280 --> 14:53.440
Helping people buying homes find all the legal documents they need automatically.

14:53.440 --> 14:57.960
Automating all the paperwork you need if you're in England and want to take a

14:57.960 --> 15:00.520
lorry load of goods into Europe.

15:00.520 --> 15:03.560
That's where the big revolution is.

15:03.560 --> 15:07.880
Actually, the number of people developing models is much smaller: the people developing

15:07.880 --> 15:11.120
ResNet-18 and all the new models, that's a smaller group.

15:11.120 --> 15:15.960
And sitting on the top are people like us who actually develop the AI tools.

15:15.960 --> 15:18.040
And it's not just people.

15:18.040 --> 15:21.760
Some of that development is itself done by AIs.

15:21.760 --> 15:25.520
So how do you get involved?

15:25.520 --> 15:28.280
ExecuTorch is part of PyTorch.

15:28.280 --> 15:29.240
LiteRT,

15:29.240 --> 15:32.840
what used to be called TensorFlow Lite for Microcontrollers, is part of TensorFlow.

15:32.840 --> 15:34.320
And they've got their official tutorials.

15:34.320 --> 15:37.720
These slides will all be on the FOSDEM site.

15:37.720 --> 15:39.960
So you'll get the links.

15:39.960 --> 15:43.320
SYCL and OpenCL are already in PyTorch 2.4.

15:43.320 --> 15:45.160
So you don't have to write the implementations.

15:45.160 --> 15:48.600
You've got implementations already available.

15:48.600 --> 15:54.600
The oneAPI Construction Kit is freely available on Codeplay's GitHub.

15:54.600 --> 15:57.640
The work that was done in Southampton, and other work we're doing elsewhere,

15:57.640 --> 15:59.920
we're turning into some more of our how-tos.

15:59.920 --> 16:02.440
And I know many of you will have used our how-tos before.

16:02.440 --> 16:04.720
They're coming this year.

16:04.720 --> 16:06.960
And ultimately it's what we do for our day job.

16:06.960 --> 16:10.560
So if you'd like to do more and you want some help,

16:10.560 --> 16:11.680
come and ask us.

16:11.680 --> 16:14.120
We'll be here, William and I, all day.

16:14.120 --> 16:19.400
And we're at the AI Plumbers conference tomorrow as well.

16:19.400 --> 16:20.560
So thank you all very much.

16:20.560 --> 16:22.560
I asked a question at the beginning.

16:22.560 --> 16:27.560
How do we get AI working on new hardware?

16:27.560 --> 16:29.840
And I hope we've given you a bit of insight

16:29.840 --> 16:31.400
into how that can be done.

16:31.400 --> 16:34.000
And I think we have a minute or two for a few questions.

16:34.000 --> 16:35.200
Thank you.

