WEBVTT

00:00.000 --> 00:09.500
Thank you very much, Mark. Good morning, everyone. I'm sorry for the remote attendees, I'm not sure.

00:09.500 --> 00:18.120
You can see my slides. They're mainly images. And I will, by the way, put the slides also

00:18.120 --> 00:24.960
on GitHub, just after the talk; I mean, as soon as I have a working internet connection.

00:24.960 --> 00:31.360
So I am Sylvain, F4GKR on the radio. And my presentation this morning is a kind of follow-up

00:31.360 --> 00:35.520
to something I presented last year, but I will come back to this soon. And it's called

00:35.520 --> 00:42.320
using AI hardware accelerators for real-time DSP and embedded devices. I was looking for

00:42.320 --> 00:51.720
an even longer name, but I didn't find one. Okay. So, more seriously, the talk is organized as follows.

00:51.720 --> 01:01.960
I will briefly explain how this idea came about, share with you the fun I had

01:01.960 --> 01:11.800
doing this, and show you what the outcome is. And what can we do with this? What can

01:11.800 --> 01:20.200
be done with this, sorry. So last year, in the previous episode, I presented a GPU-based, so

01:20.200 --> 01:26.840
NVIDIA only, I would say, solution to process a very wide bandwidth and do multiple digital

01:26.840 --> 01:37.360
down converters in CUDA. Just to be very honest with you, this code was extracted from some

01:37.360 --> 01:45.560
code we use in my company. And it's working. So the idea was, so basically the idea

01:45.640 --> 01:50.280
is as follows, you have a very wide stream coming from a radio. And you want to extract from

01:50.280 --> 01:58.520
this stream slices, sub-bands. You have many ways to do this. And the algorithm I proposed was to

01:58.520 --> 02:04.200
use the GPU. So you basically push all the samples to the GPU. The kernels are running inside

02:04.200 --> 02:15.160
and just extract the sub-bands. That works. And one of the questions was, what could be
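To make the digital down-converter idea concrete, here is a plain-Python sketch of the per-channel processing just described (this is not the speaker's CUDA code; the function name, the taps, and the decimation factor are illustrative): mix the wideband stream with a local oscillator, low-pass filter, then decimate.

```python
import cmath
import math

def ddc(samples, f_shift, fs, taps, decim):
    """Digital down-converter: mix, low-pass filter, then decimate."""
    # 1) mix with a complex local oscillator to move the sub-band to 0 Hz
    mixed = [s * cmath.exp(-2j * math.pi * f_shift * n / fs)
             for n, s in enumerate(samples)]
    # 2) FIR low-pass filter (direct-form convolution, real taps)
    m = len(taps)
    filtered = [sum(taps[k] * mixed[n - k] for k in range(m))
                for n in range(m - 1, len(mixed))]
    # 3) decimate: keep one sample out of `decim`
    return filtered[::decim]

# toy usage: a 1 kHz tone sampled at 48 kHz, shifted down to DC
fs = 48_000.0
tone = [cmath.exp(2j * math.pi * 1000.0 * n / fs) for n in range(4096)]
taps = [1.0 / 16] * 16  # crude moving-average low-pass
baseband = ddc(tone, 1000.0, fs, taps, decim=8)  # ~constant 1+0j
```

On a GPU the same three steps run as kernels over large batches of samples; the structure of the computation is identical.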

02:15.160 --> 02:22.520
done, or what could be used as another solution to do the same? And what if we used these

02:22.520 --> 02:29.240
super fancy, super nice AI chips that they are promoting a lot? You see now that if you buy a new

02:29.240 --> 02:36.760
PC, you have an AI PC. What is an AI PC? It's a PC plus some AI chips inside.

02:36.760 --> 02:44.200
So my point was, can we use these specific devices for signal processing? And the short

02:44.200 --> 02:49.400
answer is yes, but it's not so easy. It's not difficult, but there are tricks, as usual. So the idea here

02:49.400 --> 02:55.320
is to share with you the tricks I used. So the motivation is to be able to use some

02:55.320 --> 03:02.200
either specific chips like the Google Coral, you see on the right, which are just inference

03:02.200 --> 03:08.840
engines that consist of Tensor Processing Units; I will come back to this in a minute. Or devices like the

03:08.840 --> 03:17.960
Rockchip; some of the Rockchip CPUs have multiple ARM cores and a small unit dedicated

03:17.960 --> 03:24.760
to AI. So these are very cheap. I mean you see the prices, the Google Coral stuff is 30 euros,

03:24.760 --> 03:32.200
depending on how you buy it. And the Rockchips, these are complete boards, meaning

03:32.200 --> 03:38.520
CPU plus RAM plus I/Os, everything; you just have to connect your screen and

03:38.520 --> 03:50.360
Ethernet and it's a full computer. The RK3588, you can see this as a Rockchip equivalent of a Raspberry Pi

03:50.360 --> 03:55.720
Compute Module 5 or 4. That's basically the same. But there is something specific in the chip.

03:57.720 --> 04:05.800
This is the AI unit. You also have these other things. Axelera, for example, they claim

04:05.880 --> 04:11.640
very big figures. I will show that. And on the right, something that we're testing in my company,

04:11.640 --> 04:17.720
which is this Tenstorrent machine, which is a monster. And the good thing with this is that

04:18.920 --> 04:24.040
it uses only RISC-V hardware. It means that this chip, the

04:26.440 --> 04:34.680
sorry, I can't read the name because there's different ones. The chip is using a network of

04:34.680 --> 04:39.960
RISC-V units. And you code this in C, full C. I mean, as long as you can code in C, you can do

04:39.960 --> 04:45.560
something with it, I would say. Which is quite different from the artificial intelligence accelerators.

04:45.560 --> 04:53.800
And I will come to this now. So the promise they have is, sorry, I will increase the font size. Okay.

04:53.800 --> 05:01.720
So the promise: if you look at the datasheet of the Rockchip RK3588, it claims 6 TOPS.

05:01.720 --> 05:07.800
What is a TOPS? Tera operations per second. We are not talking about FLOPS, but TOPS.

05:07.800 --> 05:13.160
There is a small difference; it's not the same thing. But basically, it means

05:13.160 --> 05:21.320
a lot of math in real time or fast, at least. What's interesting is that they claim triple

05:21.320 --> 05:26.360
core. I say, sorry, excuse me, I say "they claim". I'm not saying that it's wrong. I'm saying

05:26.360 --> 05:34.520
that's what is written in the commercial sheet. So you have, basically, what we can understand

05:34.520 --> 05:41.000
from the second line, "includes triple NPU core", and so on, is that you have a kind of

05:41.000 --> 05:45.960
parallelism. We can understand from this that we can run multiple streams simultaneously.

05:46.920 --> 05:52.040
And just for those who attended my talk last year, the fact that you can use multiple streams

05:53.000 --> 05:59.320
simultaneously is very important if you want to use multi-threaded or multi-task DSP systems.

06:01.320 --> 06:05.400
The thing also is that if you look at this last line, support deep learning frameworks,

06:05.400 --> 06:11.400
TensorFlow, Caffe, TensorFlow Lite, PyTorch, ONNX NN, Android NN, et cetera.

06:12.680 --> 06:16.680
That looks very open. Thank you.

06:17.160 --> 06:24.600
So the promise is that if you look at the models, these things are running,

06:27.160 --> 06:29.000
you don't think there's something? No, no.

06:32.280 --> 06:36.680
If you look at the performances, you will see that these things are able to run some

06:37.320 --> 06:43.640
AI tasks: classification, object detection. So this is the well-known YOLO, You Only Look Once,

06:43.640 --> 06:47.640
the algorithm where you feed in images and it's able to look at an image and say you have a

06:47.640 --> 06:52.600
cat, you have a dog, and so on. That's what these things are made for. And in this column,

06:52.600 --> 06:58.520
the third, sorry, if I can move the mouse to the right screen, yes, here, this column,

06:58.520 --> 07:05.720
the last column shows you the supported platforms; that's the list of Rockchip CPUs that can do this.

07:05.720 --> 07:11.640
And if you look on the table here, you see some figures like this is able, for example,

07:11.640 --> 07:16.680
for the YOLO, so that is the one that is doing object detection, this is able to run at

07:16.680 --> 07:25.480
something like 33 frames per second. That looks amazing. But first of all, it is int16

07:25.480 --> 07:33.320
or int8. It means that we're processing bytes, I mean integers, or small floating-point numbers

07:33.320 --> 07:38.520
with limited resolution. This has an impact. So first of all, you need to be aware that

07:39.160 --> 07:46.360
the performances that are claimed are usually for int8 or FP16. Of course.

07:47.480 --> 07:53.800
So everything here is said to be ONNX, and if you look at the files here,

07:53.800 --> 07:59.560
they say you have a lot of models you can download, and these models are ONNX files. So what is

07:59.560 --> 08:12.840
an ONNX model? ONNX is an open-source framework, and I copied this image from their website.

08:13.720 --> 08:17.640
So you have on the left the tools where you design the model. So you have

08:17.640 --> 08:24.280
PyTorch, TensorFlow, and so on. On the right-hand side, you have the

08:24.280 --> 08:31.160
platforms. So you see CPU, GPU, FPGA, and so on, and in the middle, the ONNX runtime is that

08:31.160 --> 08:38.040
magic glue that somehow makes an abstraction of the hardware below. So far, I haven't been speaking

08:38.040 --> 08:46.840
at all about radio, but stay tuned, it's coming fast. So in fact, the key ONNX concept is the

08:46.920 --> 08:53.800
so-called execution provider. So you have your model, that is, the file that describes the

08:53.800 --> 09:00.680
process you want to perform. So you can see this as the microcode that will run inside the processor.

09:00.680 --> 09:06.840
On the left, you have your input data here, on the right, the output. So for example, let's say that

09:06.840 --> 09:14.120
you inject into this the samples coming from the radio, and at the right-hand side, for example,

09:14.200 --> 09:19.800
you want to have the audio. And you want this chip to do everything inside without touching the main

09:19.800 --> 09:26.680
CPU. And the idea is that this execution provider is able to read this file and configure the

09:26.680 --> 09:34.920
underlying hardware for you. That looks very interesting, doesn't it? The good thing is that this works

09:34.920 --> 09:42.120
for GPU chips. So the former, I mean what I'm used to using, that is, the NVIDIA GPUs,

09:43.000 --> 09:47.240
should be working, and that should work also for these nice chips.

09:49.960 --> 09:57.720
A very brief introduction to ONNX. You describe the execution flow as a graph, where you have

09:58.440 --> 10:05.560
more than one input, multiple inputs, multiple outputs, and you describe the execution flow like a graph

10:05.640 --> 10:13.960
and each node in the graph is an operation. So for example, this graph, which is from the

10:13.960 --> 10:24.120
documentation, is doing a MatMul, a matrix multiply, and an Add operation. And that's a very simple ONNX file that

10:24.120 --> 10:34.120
just expresses this. They don't speak about numbers or vectors, they talk about tensors.

10:35.720 --> 10:42.760
And you will see that sometimes this is a mess. The other good thing with ONX is that it's

10:42.760 --> 10:50.120
quite widely documented. There are a lot of versions, 11, 12, and so on, but basically you will find

10:50.120 --> 10:57.320
online a lot of documentation explaining what each operation is doing, how it works,

10:57.320 --> 11:02.840
and what inputs you need to feed in, and what outputs you would get.

11:03.720 --> 11:11.560
So just to make a very brief summary of what ONNX is: it's designed for AI, meaning that the

11:11.560 --> 11:16.280
operations you will find are basically those who are used in deep learning, and mainly in image

11:16.280 --> 11:25.480
processing. And of course, those chips, for example the Rockchip ones, are usually used in remote

11:25.480 --> 11:31.640
IP cameras that are doing motion detection, object detection. So the idea is that these chips

11:31.720 --> 11:35.960
are used in these cameras that take images in real time, check what's going on, and if there is

11:35.960 --> 11:43.160
a movement, then they can send something over the GSM connection. So these chips are optimized for

11:43.160 --> 11:49.240
that kind of input, that is, matrices of pixels, either black and white or, usually,

11:49.240 --> 11:55.640
three colors. So the dimensions of the inputs and the data structures are organized to process images.

11:55.960 --> 12:02.280
And in our case, we're not processing images, we're processing streams of complex signals,

12:02.280 --> 12:10.360
so there will be some tricks to make this work. So if that works, what would we love to have?

12:11.720 --> 12:16.920
So the idea would be to be able to run some of our DSP blocks

12:17.800 --> 12:25.240
inside these chips, and for example, that could be audio processing. You may have seen recently

12:25.240 --> 12:30.360
that some audio software has a plug-in system, and they're not using ONNX,

12:30.360 --> 12:35.080
they're using another format, but basically that's the same idea: you have a model you can load,

12:35.080 --> 12:41.080
and the model is offloaded to the local accelerator if you want. It could be, for example,

12:41.080 --> 12:45.000
an extension to something like GNU Radio Companion, where you have this graph,

12:45.880 --> 12:51.720
and we can imagine that instead of generating Python code or C++ code, it would generate

12:51.720 --> 12:55.880
the ONNX files that would be executed by the NPU. Why not? That would be interesting.

12:56.920 --> 13:03.720
But, but, there are some tricks. If you look in detail at the documentation from Rockchip,

13:03.720 --> 13:11.640
for example, it is called RKNN, which is Rockchip Neural Network. And if you look in even more detail,

13:11.640 --> 13:17.480
there is only a subset of operations. It means that most of the operations that you would

13:17.480 --> 13:27.720
love to see are in fact not implemented. So then the tricks come. The Google Coral uses TensorFlow

13:27.720 --> 13:33.080
Lite, which is basically the same, that is to say, you have the specification that everybody has

13:33.080 --> 13:38.760
agreed on, but the operators are too complex. So let's make it simple, let's trash everything

13:38.760 --> 13:44.360
that is complex, and just keep very basic operations. So in fact, the list of available

13:44.360 --> 13:49.160
operations is limited. If you look at the table here, which is an extract from the,

13:50.200 --> 13:57.080
sorry, the RKNN toolkit, you see that most of the functions are not implemented. Abs:

13:57.080 --> 14:03.000
not implemented. Acos: not implemented, et cetera, et cetera; many functions are just not available.

14:03.560 --> 14:16.200
And in fact, you will see that RKNN is basically at ONNX opset version 11. But just for information,

14:16.200 --> 14:21.960
today's version is 20-something. So there's a big difference between the tools you can use that are

14:21.960 --> 14:31.000
available on GitHub and what runs in the chip. And this is the RKNN info from their website.

14:31.000 --> 14:40.440
But that's basically the same kind of thing; it's quite similar to this slide, sorry,

14:41.160 --> 14:47.240
to this one, but on the Rockchip side. So basically, what it says is that you have tools,

14:47.240 --> 14:54.520
PyTorch, ONNX, TensorFlow. Their toolkit does the translation to the API that is understood by the chip.

14:54.520 --> 15:01.200
So my understanding was, and I confirmed it, that as long as you are able to generate an

15:01.200 --> 15:09.680
ONNX file that does the DSP, it works directly on the hardware, almost.

15:09.680 --> 15:11.280
So what do we need in the end?

15:11.280 --> 15:15.280
We need complex numbers, okay, because we are processing raw IQ samples.

15:15.280 --> 15:18.760
So now let's go to the DSP side of the problem.

15:18.760 --> 15:23.680
Now that we think that the chip can be used, the question that comes is, how

15:23.680 --> 15:26.600
do I feed my samples in that stuff?

15:26.600 --> 15:27.600
What do we get?

15:27.600 --> 15:31.640
We get IQ samples, complex numbers, stream of complex numbers from the radio.

15:31.640 --> 15:34.880
So we need to do complex number arithmetic, basic.

15:34.880 --> 15:40.160
We need to be able to do convolutions, because we want to filter,

15:40.160 --> 15:44.640
so low-pass filter, band-pass filter, rejection, and so on.

15:44.640 --> 15:48.880
We may need some trig functions, like cosine, sine, tangent. Let's say we

15:48.880 --> 15:56.120
want to demodulate FM; at some point we will need some trigonometry functions.

15:56.120 --> 15:57.320
Do we have this ready?

15:57.320 --> 15:58.320
No.

15:58.320 --> 16:01.920
Those chips have no clue what a complex number is, okay?

16:01.920 --> 16:02.920
But so what?

16:02.920 --> 16:07.000
We learned at school what a complex number is, so let's do it, why not?

16:07.000 --> 16:11.720
So let's go to the basics, and that's where the fun starts, because in fact, you realize

16:11.720 --> 16:17.480
that you have to do everything, it's not so difficult, but it took me some hours.

16:17.480 --> 16:24.160
So first of all, how do you represent in the chip the complex numbers?

16:24.160 --> 16:29.480
There are two approaches. Either the classical one, which is this one: IQ

16:29.480 --> 16:32.360
interleaved, real part, imaginary part, real part, imaginary part.

16:32.360 --> 16:33.480
That's the classical one.

16:33.480 --> 16:38.680
Or you could say: these devices were designed to process images with color planes, so

16:38.680 --> 16:43.360
you had one plane for the red, one for the green, and one for the blue.

16:43.360 --> 16:47.440
So the first idea I had was, yeah, let's split the complex number into the real

16:47.440 --> 16:51.360
part is one image, and the imaginary part is a second image.

16:51.360 --> 16:57.120
In some cases, that makes sense, because some operations in the processor have

16:57.120 --> 17:01.960
been designed to work like that, but in fact, in most of the cases, it's a bad idea.

17:01.960 --> 17:02.960
It doesn't work.

17:02.960 --> 17:08.360
So I finally tested the two and selected the classical one, that is to say, the interleaved

17:08.360 --> 17:10.360
IQ samples.

17:10.360 --> 17:16.920
As it's done in, I would say, 99% of the existing platforms.

17:16.920 --> 17:25.560
So we want to multiply a complex vector A, a series of complex numbers, with a second

17:25.560 --> 17:30.160
one, point-wise multiplication, that is, this one.

17:30.160 --> 17:36.680
So we have A, which is, for example, the stream of samples you get

17:36.680 --> 17:44.880
from the radio, and B is some taps from filters, because you want to filter, you

17:44.880 --> 17:46.080
want to make a low-pass filter.

17:46.080 --> 17:50.840
So at some point, you will need to compute for each of them the multiplication point

17:50.840 --> 17:54.400
by point; that's the point-wise multiplication.

17:54.400 --> 17:59.040
So I used colors, and this was for me to debug to be honest.

17:59.040 --> 18:04.440
So you want, in the end, this multiplication, and that's where the problem comes:

18:04.440 --> 18:10.480
the architecture is not designed at all to swap pairs, and it doesn't work well.

18:10.480 --> 18:15.800
So you end up with something which looks like this. Honestly, it's not difficult, it's

18:15.800 --> 18:18.720
just that you need to do this step by step.

18:18.720 --> 18:23.600
So the yellow box is GatherElements, one of the operators provided in the

18:23.600 --> 18:24.600
ONNX operator set.

18:24.600 --> 18:32.200
So it means that you take vector one, you take vector two, and you have a table

18:32.200 --> 18:35.480
of indices that tells you which element you have to take.

18:35.480 --> 18:40.440
That selects Q, I; it just swaps real part and imaginary part, one by one.

18:40.440 --> 18:48.120
Then you do the sign changing, and then you make a multiplication, and so on and so on.

18:48.120 --> 18:55.240
And in the end, you end up with a nice graph like that, which just does the whole multiplication.
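What that graph computes can be written down in a few lines of plain Python (a sketch of the arithmetic only, not the speaker's generator script; the ONNX version does the same thing with GatherElements, Mul, and Add nodes over the interleaved arrays):

```python
def complex_mult_interleaved(a, b):
    """Point-wise complex multiply on interleaved [I0, Q0, I1, Q1, ...] lists,
    using an explicit index table the way the ONNX graph uses GatherElements:
    the chip has no complex type, so everything is real arithmetic."""
    n = len(a)
    swap = [i + 1 if i % 2 == 0 else i - 1 for i in range(n)]  # pair-swap table
    b_sw = [b[j] for j in swap]           # [Q0, I0, Q1, I1, ...]
    p = [x * y for x, y in zip(a, b)]     # [Ia*Ib, Qa*Qb, ...]
    q = [x * y for x, y in zip(a, b_sw)]  # [Ia*Qb, Qa*Ib, ...]
    out = [0.0] * n
    for i in range(0, n, 2):
        out[i] = p[i] - p[i + 1]      # real part: Ia*Ib - Qa*Qb
        out[i + 1] = q[i] + q[i + 1]  # imaginary part: Ia*Qb + Qa*Ib
    return out

# (1 + 2j) * (3 + 4j) = -5 + 10j
product = complex_mult_interleaved([1.0, 2.0], [3.0, 4.0])  # [-5.0, 10.0]
```

The swap table and the final pair-wise combine are exactly the steps the accelerator is not natively designed for, hence the extra nodes in the graph.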

18:55.240 --> 19:00.480
So to do this, I used in fact Python code, because there is an API in Python that helps

19:00.480 --> 19:06.800
you generate this ONNX file.

19:06.800 --> 19:10.400
I will tell you where at the end of the talk, but all these files are on my GitHub, you can

19:10.400 --> 19:13.680
take them, of course.

19:13.680 --> 19:18.560
So you generate a graph, and at the end, you feed in vector A, your first input, and vector B, that

19:18.560 --> 19:24.720
will be the second one, and, magic, at the end you have the multiplication.

19:24.720 --> 19:31.360
So it means that one of the issues you may have seen is that you need to know how many

19:31.360 --> 19:34.000
complex numbers you will process.

19:34.000 --> 19:38.440
Because the problem is that if you process images, you need to know the size of the image.

19:38.440 --> 19:39.440
That's exactly the same.

19:39.440 --> 19:47.840
It means that you generate the ONNX file based on a known number of samples you want to process.

19:47.840 --> 19:53.400
So it means that if you generate something for, I don't know, 500 samples,

19:53.400 --> 19:54.600
it handles 500 samples.

19:54.600 --> 19:59.000
So if you have 1000, you need to call it twice, or generate a second model.

19:59.000 --> 20:09.000
And so I ended up with Python files where I have put variables saying how many

20:09.000 --> 20:10.560
samples I'm processing.

20:10.560 --> 20:15.960
And then it generates all the tables and all the indexes based on this input.
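So the fixed-size constraint pushes the chunking out to the caller. A hedged sketch of what that looks like (`run_model` is a stand-in for the actual inference call, which is not shown in the talk; padding the tail with zeros is my own choice):

```python
def process_stream(samples, block_size, run_model):
    """Run a fixed-size model over an arbitrary-length stream by slicing it
    into model-sized chunks; the last, incomplete chunk is zero-padded.
    `run_model` stands in for the actual inference call on one block."""
    out = []
    for start in range(0, len(samples), block_size):
        chunk = samples[start:start + block_size]
        if len(chunk) < block_size:  # pad the tail to the model's size
            chunk = chunk + [0.0] * (block_size - len(chunk))
        out.extend(run_model(chunk))
    return out

# toy usage: a "model" generated for blocks of 4 samples that doubles them
doubled = process_stream([float(n) for n in range(10)], 4,
                         lambda c: [2.0 * x for x in c])
```

Ten input samples become three model calls of four samples each, exactly like calling a 500-sample model twice for 1000 samples.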

20:15.960 --> 20:16.960
It's quick and dirty code.

20:16.960 --> 20:20.640
I'm sorry for that, but it was just a test, a proof of concept.

20:20.640 --> 20:21.640
So it works.

20:21.640 --> 20:29.800
And then I made the complex multiplier, the IQ oscillator, and, just to show,

20:29.800 --> 20:32.000
this noise stuff here.

20:32.000 --> 20:45.720
So this generates the local oscillator signal you would need for the down or up conversions.

20:45.720 --> 20:48.720
And this was just an FFT plot to see the quality of the signal.

20:48.720 --> 20:55.560
First, to see if it works; second, to have an idea of the rounding errors and issues in

20:55.560 --> 20:59.080
the sine and cosine generators I use in the chip.

20:59.080 --> 21:01.640
And honestly, this depends on the chip.

21:01.640 --> 21:06.800
I suspect that the sine and cosine are not so nice.

21:06.800 --> 21:13.640
So we may have some issues; I have not finished the tests, to be honest with you.

21:13.640 --> 21:19.440
I have also made the filter. So, sorry, I'm going too fast.

21:19.440 --> 21:24.720
So the complex multiplication is not very difficult; it's just swapping the values and making

21:24.720 --> 21:28.520
them line up in the right way, so that the multiplication works.

21:28.520 --> 21:29.520
That's easy.

21:29.520 --> 21:34.200
The local oscillator is basically very stupid, you generate the phase.

21:34.200 --> 21:37.640
And then you apply cosine and sine to the phase and you're happy.

21:37.640 --> 21:42.360
So, on the left, you generate the phase, then you generate cosine and sine.

21:42.360 --> 21:46.480
And then you just reorganize sine and cosine to have the real part and imaginary part in the

21:46.480 --> 21:47.480
right place.

21:47.480 --> 21:49.120
That's easy.
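That oscillator pipeline (phase ramp, then cosine and sine, then re-interleave) can be sketched in plain Python; passing the initial phase in, and getting the final phase back, mirrors the fact that the model keeps no state between runs (function and parameter names here are illustrative, not the speaker's):

```python
import math

def iq_oscillator(n, freq, fs, phase0=0.0):
    """Generate n local-oscillator samples as an interleaved [I, Q, ...] list.
    The initial phase is an input and the final phase is returned, because the
    model has no persistent local variable: phase continuity across chunks
    must be handled by the caller."""
    two_pi = 2.0 * math.pi
    step = two_pi * freq / fs
    out = []
    for k in range(n):
        p = (phase0 + step * k) % two_pi  # the phase ramp
        out.append(math.cos(p))           # I (real part)
        out.append(math.sin(p))           # Q (imaginary part)
    next_phase = (phase0 + step * n) % two_pi
    return out, next_phase

# 8 samples of a 1 kHz local oscillator at 48 kHz
lo, next_phase = iq_oscillator(8, freq=1000.0, fs=48_000.0)
```

Feeding `next_phase` back in as `phase0` of the following block gives a phase-continuous oscillator across runs, which is the workaround discussed later in the Q&A.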

21:49.120 --> 21:52.440
The filter is a bit more tricky.

21:52.440 --> 21:55.480
In fact, a convolution that is done on an image is a filter.

21:55.480 --> 21:58.040
So basically, you don't have to reinvent the wheel.

21:58.040 --> 22:06.120
The problem is that those stupid things are expecting multidimensional tensors.

22:06.120 --> 22:11.960
So they expect a multi-plane image, with the structure organized by colors.

22:11.960 --> 22:16.920
So you cannot send one-dimensional data; at least on Rockchip, it doesn't work.

22:16.920 --> 22:23.880
So you have to fake a tensor, you have to add a lot of annotations, so that the NPU

22:23.880 --> 22:26.600
believes it's an image.

22:26.600 --> 22:34.520
So you have specific blocks, sorry, here, this Reshape stuff that I just use to make

22:34.520 --> 22:36.160
the NPU happy.

22:36.160 --> 22:39.480
It's ugly, but it works.

22:39.480 --> 22:45.320
The good thing is, this filter only works for real taps, not for complex

22:45.320 --> 22:49.880
taps, because it avoids the complex multiplication; well, you've seen the implementation,

22:49.880 --> 22:52.040
it's already there.
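Here is a plain-Python picture of what the Conv-based filter computes on an interleaved stream with real taps; I and Q are filtered independently, which is exactly why the image-style convolution can be reused once the data is reshaped into channels (a sketch of the arithmetic, not the speaker's graph):

```python
def fir_real_taps(iq, taps):
    """FIR filter with real taps on an interleaved I/Q stream. The I and Q
    sequences are filtered independently, so an image-style Conv node can do
    the job once the stream is reshaped into a two-channel 'image'."""
    i_ch = iq[0::2]  # de-interleave: channel 0 = I
    q_ch = iq[1::2]  # channel 1 = Q

    def conv(x):
        m = len(taps)
        return [sum(taps[k] * x[j - k] for k in range(m))
                for j in range(m - 1, len(x))]

    out = []
    for a, b in zip(conv(i_ch), conv(q_ch)):  # re-interleave the two channels
        out += [a, b]
    return out

# two-tap moving average over I samples [1, 3, 5]; Q stays zero
filtered = fir_real_taps([1.0, 0.0, 3.0, 0.0, 5.0, 0.0], [0.5, 0.5])
```

With complex taps, the cross terms between I and Q would reappear and the point-wise complex multiplication machinery from earlier would be needed inside the convolution.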

22:52.040 --> 22:57.360
I also tested the AM demodulation; that works.

22:57.360 --> 23:03.480
I wanted to try FM, because on VHF and above, we use FM most of the time; sadly, I couldn't

23:03.480 --> 23:09.240
find the arctangent operation, and I had no time to implement

23:09.240 --> 23:10.240
it.

23:10.240 --> 23:14.400
Anyway, it's not working yet.
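For reference, here is what the two demodulators compute, in plain Python. The AM envelope needs only squaring, adding, and a square root, all available as ONNX operators; the FM discriminator is the step that needs the missing arctangent (sketches with illustrative names, not the speaker's models):

```python
import math

def am_demod(iq):
    """AM envelope: the magnitude of each complex sample in the interleaved
    stream. Only multiply, add and square root are needed."""
    return [math.hypot(iq[k], iq[k + 1]) for k in range(0, len(iq), 2)]

def fm_demod(iq):
    """FM discriminator: the angle of s[n] * conj(s[n-1]). The atan2 here is
    exactly the arctangent that was missing from the RKNN operator subset."""
    out = []
    for k in range(2, len(iq), 2):
        i0, q0 = iq[k - 2], iq[k - 1]  # previous sample
        i1, q1 = iq[k], iq[k + 1]      # current sample
        re = i1 * i0 + q1 * q0         # Re(s[n] * conj(s[n-1]))
        im = q1 * i0 - i1 * q0         # Im(s[n] * conj(s[n-1]))
        out.append(math.atan2(im, re))
    return out

envelope = am_demod([3.0, 4.0])               # [5.0]
phase_steps = fm_demod([1.0, 0.0, 0.0, 1.0])  # one quarter-turn: [pi/2]
```

Everything in `fm_demod` except the `atan2` is a complex multiplication, so only that one operator blocks an FM model on the chip.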

23:14.400 --> 23:22.560
So, is this a mature tool? Well, I wanted to share with you some of the cryptic messages

23:22.560 --> 23:28.160
I got when I tested this. Basically, I got very strange error messages

23:28.160 --> 23:34.880
like "this is an invalid model, tensor double of type blah blah", and you see error messages

23:34.880 --> 23:38.800
like, hey, what the fuck is it?

23:38.800 --> 23:46.800
It's sometimes crazy, I mean, reading the messages you get from the API. And I realized

23:46.800 --> 23:54.320
that there is one very stupid thing: if you use the very latest ONNX version on your

23:54.320 --> 24:01.800
PC, which is operation set 21 or something, and you target a device that is at version 11,

24:01.800 --> 24:04.840
like the Rockchip, then sometimes it just doesn't work.

24:04.840 --> 24:16.080
So you may have to download old packages. So, some references: I put on my slides

24:16.080 --> 24:22.560
all the stuff I found while testing this; there is a very good introduction in Python, and the last

24:22.560 --> 24:34.680
link here gives you a lot of hints on how to do that. A kind of conclusion:

24:34.680 --> 24:38.920
I haven't shared with you any figures at this stage. Does this work?

24:38.920 --> 24:44.120
It does. Is it efficient? I don't know yet, to be honest, but I think it's

24:44.120 --> 24:53.120
interesting, because it may enable you to use very low-cost hardware to do

24:53.120 --> 24:57.680
some intensive processing. I mean, I did this with a GPU, but that's not a low

24:57.680 --> 25:05.840
cost device, while these chips are very cheap, and you can easily find some add-ons, some

25:05.840 --> 25:11.120
PCI Express or M.2 modules with these devices. It means that, for example, even on

25:11.120 --> 25:18.840
the Raspberry Pi, that could add a significant amount of DSP power. But there is a long

25:18.840 --> 25:27.200
way still ahead to make this work correctly, let's be honest.

25:27.200 --> 25:33.440
So, you can find all the code I made, but these are just snippets of code for now; I put only what

25:33.440 --> 25:40.080
works, in my GitHub; it's called ONNX Radio. I put there, quick and dirty, the code I

25:40.080 --> 25:47.800
have made to do this. In the GitHub, you will find the Python

25:47.800 --> 25:56.040
code that is used to generate these basic functions: complex multiplication, filtering,

25:56.040 --> 26:01.480
demodulation, blah blah. So that gives you ideas if you want to go one step further. There

26:01.480 --> 26:08.560
is also a C++ example that loads the ONNX file, generates numbers, and calls the

26:08.560 --> 26:14.040
complex multiplication, because I tested that it works. But of course, it's far from finished;

26:14.040 --> 26:19.800
it's just a proof of concept at this stage. So, where is this... that's where you can

26:19.800 --> 26:28.480
find it. Sorry, did I... yes, I think those were the last two slides. Okay, that's all,

26:28.480 --> 26:33.160
folks. There is still a long way ahead. If you have questions, I'll be happy to try not to

26:33.160 --> 26:58.840
answer them, thank you. Any questions? Yes? No, no, that doesn't avoid that. So yes, for those

26:58.840 --> 27:05.000
who cannot hear the question, the question is: does this somehow remove

27:05.000 --> 27:09.840
or simplify the bottleneck we usually see in GPU systems, where we have to copy the data

27:09.840 --> 27:16.400
from the main memory to the processing unit, as in the GPU case? No, that's exactly the same architecture,

27:16.400 --> 27:22.760
and the problem is, if you look at the Rockchip architecture, I'm going back to the slide where

27:22.760 --> 27:27.480
it is said, you will see that the available memory for transfers is quite limited,

27:27.480 --> 27:41.200
in fact; it is roughly 400 kilobytes, which is very small. If you compare this to an NVIDIA GPU,

27:41.200 --> 27:47.160
where you have, let's say, 8 gigabytes, you see that it's a huge difference. What it means is

27:47.160 --> 27:52.840
that you will need some threads polling and pushing data, otherwise

27:52.840 --> 27:57.320
it doesn't work, it's not efficient, but at least what is interesting here is that if

27:57.320 --> 28:03.880
you look at the classical architecture, where you have a USB stack pulling samples from a USB device,

28:03.880 --> 28:08.920
you could imagine that the samples you get from the USB device, you just directly push them,

28:08.920 --> 28:13.520
assuming the stuff is fast enough, you have chunks coming from the USB, for example, you

28:13.520 --> 28:21.720
would be able, and that's what I'm trying to do, to push the USB payload directly

28:21.720 --> 28:28.800
into the queue to the NPU to be processed. By the way, you have to write the code for at least

28:28.800 --> 28:34.840
this part. But for sure, it does not simplify this issue, that's clear. Is it more efficient?

28:34.840 --> 28:49.040
I haven't tested yet, and I don't think so. Yes? Not all of them, yes: the ONNX architecture

28:49.040 --> 28:57.840
has a "for" operation and a "while", for example. In many models, I spent quite a lot of time

28:57.840 --> 29:02.080
opening existing ONNX files and looking at how they were implemented, and in many cases

29:02.080 --> 29:08.720
you have ifs and you have two branches. Typically, when you process audio, you have cases

29:08.720 --> 29:13.360
depending on the sampling rate; so instead of having a stage that resamples, they've

29:13.360 --> 29:19.760
made branches, 8k, 16k, and they switch, and when you have streams like that, they have

29:19.760 --> 29:25.920
while loops. So I think it works, but I'm not sure all the chips support it, to be honest;

29:25.920 --> 29:31.320
I don't know if there's a break, a return, stuff like this, I don't know, I couldn't

29:31.320 --> 29:45.200
say. Yes, I think it comes also to the point that you somehow need to loop inside the

29:45.200 --> 29:53.080
chip. I don't know. No, I mean, no, sorry, I'm saying something wrong. It means

29:53.080 --> 29:58.560
that you would have to store somewhere the current phase, lock status, or phase counter.

29:58.640 --> 30:05.200
I tried that; I couldn't find any way to have a local variable, so that's why in the oscillator

30:05.200 --> 30:16.080
I had to push the phase. Sorry, I couldn't find it; probably it's possible.

30:16.160 --> 30:30.320
Excuse me, I see time is running. Where is it... yes, here: the IQ oscillator I made, I had

30:30.320 --> 30:36.320
to store the phase, the initial phase; so in fact I pass in the initial phase, because if you

30:36.320 --> 30:41.280
want continuity between runs to make sure that you don't have a phase shift, you need to store

30:41.360 --> 30:48.160
the current phase somewhere, and there's no local variable; probably there is one, but I couldn't

30:48.160 --> 31:04.800
find it in the documentation. I see, yes, sorry. Yes, yes, they also have floating

31:04.800 --> 31:11.680
point, yeah, and 32 bits also, but it means that the throughput

31:11.680 --> 31:17.120
is divided: the throughput is given for one-byte operations, so if you have six tera operations,

31:17.120 --> 31:23.120
it is int8, for one byte; it means that if you go to floating point 32, you divide by at least

31:23.120 --> 31:29.840
at least four. Yes, you have to specify the type. No, no, no,
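To make that arithmetic concrete (the proportional scaling is a rough assumption, not a measured figure): datasheet TOPS ratings are quoted for one-byte INT8 operands, so wider types cut the effective throughput at least in proportion to their size:

```python
# NPU datasheet throughput is typically quoted for 1-byte (INT8) ops.
int8_tops = 6.0                 # e.g. a "6 TOPS" rating
fp16_tops = int8_tops / 2       # 2-byte operands: at most 3.0 TOPS
fp32_tops = int8_tops / 4       # 4-byte operands: at most 1.5 TOPS
print(fp16_tops, fp32_tops)     # 3.0 1.5
```

In practice the drop can be larger than this upper bound, since some accelerators have dedicated INT8 datapaths with no full-rate floating-point equivalent.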

31:30.080 --> 31:39.120
you specify the type at the input, you say this is floating-point FP32 or FP16, but not in the

31:39.120 --> 31:45.280
Rockchip ones, I haven't seen that. Yeah, sorry, I'm just trying to keep track of the questions,

31:45.280 --> 31:47.560
Yeah, you had one question, yeah.

31:47.560 --> 31:50.440
Question, thank you, very simple, yeah.

31:50.440 --> 31:54.520
I see that you're using the question that you use.

31:54.520 --> 31:56.760
Do you use that or do you use both?

31:56.760 --> 32:01.960
In the past, I used CUDA only, because it's faster.

32:01.960 --> 32:05.920
Sorry, there was another question, yeah.

32:05.920 --> 32:08.400
Yeah, I think it was kind of easy for all in five to four months,

32:08.400 --> 32:10.520
like, how many maps can this kind of edit like that?

32:10.520 --> 32:11.560
No, I haven't, that's smart here.

32:11.560 --> 32:13.760
No, no, I have to do it for sure, because otherwise,

32:13.840 --> 32:16.520
I mean, first, the first idea was, does it work?

32:16.520 --> 32:17.720
Yeah, it works.

32:17.720 --> 32:19.000
Now, is it efficient?

32:19.000 --> 32:22.560
I, to be honest, I don't know.

32:22.560 --> 32:25.560
One last question, I mean, then I'll leave.

32:25.560 --> 32:27.880
Okay, okay.

32:27.880 --> 32:28.720
Yes?

32:28.720 --> 32:33.840
Do you know how it would work to combine several ONNX models,

32:33.840 --> 32:36.160
or did you look at this problem?

32:36.160 --> 32:39.840
Ah, okay, that's a pretty good question, in fact.

32:39.840 --> 32:45.000
The one thing that I didn't test, but it's claimed to be working,

32:45.000 --> 32:47.080
is that you can chain models.

32:47.080 --> 32:49.400
So, basically, I made blocks with the idea

32:49.400 --> 32:52.000
that you could chain them.

32:52.000 --> 32:53.640
So, for example, you would have the local oscillator,

32:53.640 --> 32:56.600
which is inside, and somehow you will connect the output

32:56.600 --> 32:59.480
of the yellow block to the inputs of the next one, so basically,

32:59.480 --> 33:02.960
you have, you imagine a GNU Radio flow graph,

33:02.960 --> 33:06.600
you have these blocks, and instead of generating a big model,

33:06.600 --> 33:10.160
you would download the blocks into the chip,

33:10.160 --> 33:12.840
and also the connections.

33:12.840 --> 33:17.200
It looks like, from the specifications, it might be working.

33:17.200 --> 33:22.040
And in fact, if you look at the files, the ones they provide as

33:22.040 --> 33:25.080
examples in the model zoo, that's what they're doing.

33:25.080 --> 33:27.840
The problem is that it's extremely difficult

33:27.840 --> 33:30.200
to reverse engineer this, because when you open them with a,

33:30.200 --> 33:33.280
there is a graph-viewer application,

33:33.280 --> 33:38.840
there is, sorry, I mean, you'll find the name on the last slide, I think.

33:38.840 --> 33:40.960
That's Netron.

33:40.960 --> 33:45.040
Netron is, it's an open-source software

33:45.040 --> 33:48.120
that draws the graph of the operations,

33:48.120 --> 33:49.600
the one I used for the slides.

33:49.600 --> 33:52.880
The problem is, if you look, if you open a YOLO model,

33:52.880 --> 33:57.280
you have thousands of nodes, and then reverse engineering

33:57.280 --> 34:00.600
this and understand the tricks, and then it's, no.

34:01.560 --> 34:06.560
So, I think it's doable, because I think that's exactly what they're doing.

34:06.560 --> 34:11.720
They have blocks, they work on small blocks, I mean, that makes sense.

34:11.720 --> 34:17.280
And they add, and they pile them up, they add additional processing on top.

34:17.280 --> 34:22.920
So, very probably, there are options to connect the different blocks,

34:22.920 --> 34:26.360
but I haven't been able yet to spend enough time on this to go on.

34:26.400 --> 34:30.080
That was a part-time project for fun, and,

34:30.080 --> 34:32.320
it's only a lack of time.

34:32.320 --> 34:36.400
No, they claim in the datasheet that you can have up to three.

34:36.400 --> 34:39.000
And that's what I was using in, in CUDA.

34:39.000 --> 34:40.560
That's why I was using CUDA, basically.

34:40.560 --> 34:42.160
You have multiple streams at the same time.

34:42.160 --> 34:44.640
So, you should have three models working,

34:44.640 --> 34:47.320
and you could have, as far as I understood,

34:47.320 --> 34:52.360
possibilities to connect blocks, but you have to spend time

34:52.360 --> 34:56.640
digging through the Chinese documentation.

34:56.640 --> 35:01.800
And the problem is, in this case, you fall into very chip-

35:01.800 --> 35:05.240
specific solutions, because if you want something to be working

35:05.240 --> 35:08.240
on other platforms, that's what I tried.

35:08.240 --> 35:12.480
I wanted to stick to the ONNX implementation,

35:12.480 --> 35:17.400
keep it standard, and use the minimum opset version, 11.

35:17.400 --> 35:19.040
Otherwise, I would have problems with another chip,

35:19.040 --> 35:21.080
and I would not be sure it would be working.

35:21.080 --> 35:21.920
Yeah?

35:21.920 --> 35:25.720
I have a question: when you're saying minimum version,

35:25.720 --> 35:30.640
if you target this R&N or R&D.

35:30.640 --> 35:31.480
So, yeah.

35:31.480 --> 35:35.000
So, running on the runtime for simulation.

35:35.000 --> 35:38.600
Yeah, I tested this, that's exactly how I did it.

35:38.600 --> 35:41.640
First, I wrote this in Python, generated it in Python,

35:41.640 --> 35:43.080
tested it in Python.

35:43.080 --> 35:46.440
So, it means that I could generate outputs.

35:46.440 --> 35:49.200
Those are the test cases that you can find on the GitHub,

35:49.200 --> 35:53.760
and tested that the numbers were correct, then debugging.

35:53.760 --> 35:55.880
And then, start to target.

35:55.880 --> 35:58.640
So, then I generated the real one and bypassed Python,

35:58.640 --> 36:02.400
and used the C API to generate the numbers, in memory,

36:02.400 --> 36:11.240
and tried, but still with the Linux PC-based emulation.

36:11.240 --> 36:12.840
That works.

36:12.840 --> 36:15.280
And then, tested on the Rockchip.

36:15.520 --> 36:20.280
The only type is only used, or you need it for testing on the Rockchip?

36:20.280 --> 36:22.880
Yes, the problem is, if you don't pay attention,

36:22.880 --> 36:23.960
yeah, that's very good.

36:23.960 --> 36:27.280
Because it means that you write once, run anywhere, almost.

36:27.280 --> 36:28.280
That's the idea.

36:28.280 --> 36:30.160
That's a very cool thing.

36:30.160 --> 36:32.400
The issue is that you have some,

36:32.400 --> 36:36.080
the real problem is that they don't implement all the operations.

36:36.080 --> 36:38.040
Means that if you don't pay attention from the beginning

36:38.040 --> 36:40.880
to which ops you use, at the end,

36:40.880 --> 36:43.080
your model doesn't work, it crashes.

36:43.080 --> 36:45.960
Because the chip doesn't recognize the operation,

36:45.960 --> 36:47.720
and there is an exception.

36:47.720 --> 36:51.360
So, I learned that from pain, I would say,

36:51.360 --> 36:54.080
but basically trying to understand why it doesn't work.

36:54.080 --> 36:57.920
And then, realizing that I was not using the right version.

36:57.920 --> 37:02.560
And sometimes, in the messages, you see that,

37:02.560 --> 37:05.120
if you read it, the operator schema

37:05.120 --> 37:08.160
and other functionality may change before blah blah blah.

37:08.160 --> 37:10.560
The runtime will not guarantee backward compatibility.

37:10.560 --> 37:12.240
What the, why?

37:12.240 --> 37:14.040
Current official support for the domain

37:14.040 --> 37:17.000
is only up to opset 21.

37:17.000 --> 37:18.320
And to understand the cryptic message,

37:18.320 --> 37:20.480
you need to read the documentation,

37:20.480 --> 37:22.600
and you're like, ah, that's not the right chip,

37:22.600 --> 37:24.360
that's not the right opcode.

37:25.800 --> 37:27.600
For those of you who have already programmed

37:27.600 --> 37:30.640
in assembly language, that's exactly the same.

37:30.640 --> 37:32.000
It's the same problem.

37:34.200 --> 37:35.040
One last question?

37:37.160 --> 37:38.000
Yes?

37:38.800 --> 37:41.520
In your oscillator, the sine and cosine,

37:41.520 --> 37:43.040
because in the beginning, you said that

37:43.040 --> 37:45.200
that not all operations are implemented,

37:45.200 --> 37:46.520
the sine, the cosine,

37:46.520 --> 37:47.720
did you do them yourself?

37:47.720 --> 37:49.520
No, no, I tested the, no,

37:49.520 --> 37:52.560
what I was using here, I used the existing ones.

37:52.560 --> 37:54.200
And on the Rockchip, it should not work.

37:54.200 --> 37:55.040
That's a good point.

37:57.040 --> 37:59.320
I would have to use tables, generated ones.

37:59.320 --> 38:02.000
Yes, yes, yes, yes, no, no, I mean,

38:02.000 --> 38:06.680
what I've used here, sorry, to be clear on your question.

38:08.000 --> 38:10.160
That's this one.

38:10.160 --> 38:13.160
Here, I have used the sine and cosine functions,

38:13.160 --> 38:15.240
provided by the API, okay?

38:15.240 --> 38:17.920
So it means that in certain versions of chips,

38:17.920 --> 38:19.000
it will not work.

38:19.000 --> 38:22.240
What is not implemented in the Rockchip,

38:22.240 --> 38:24.960
is the arccosine and the arcsine.

38:24.960 --> 38:27.920
But cosine and sine are implemented, yes.

38:27.920 --> 38:29.400
So cosine and sine work.

38:29.400 --> 38:31.960
But the problem is, if you want to use the arctangent

38:31.960 --> 38:35.040
to do FM demodulation,

38:35.040 --> 38:37.960
then you have a problem, because you cannot use

38:38.040 --> 38:41.320
the arctangent, there is no arctangent.

38:41.320 --> 38:44.000
So you need to implement a lookup table,

38:44.000 --> 38:46.840
or another trick to generate that.
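A sketch of that lookup-table trick (table size and tolerances are my choices, not from the talk): fold atan2 into the first octant, read a precomputed arctangent table, then demodulate FM as the phase difference of consecutive IQ samples, so no Atan op is ever needed:

```python
import numpy as np

# Precomputed arctangent over [0, 1]; the only place atan is evaluated,
# and that happens offline, never on the chip.
N = 1 << 12                                   # table size (illustrative)
TBL = np.arctan(np.linspace(0.0, 1.0, N))

def atan2_lut(y, x):
    """Table-based atan2: fold into the first octant, look up, unfold."""
    ay, ax = np.abs(y), np.abs(x)
    r = np.minimum(ay, ax) / np.maximum(np.maximum(ay, ax), 1e-30)
    a = TBL[np.round(r * (N - 1)).astype(np.intp)]
    a = np.where(ay > ax, np.pi / 2 - a, a)   # unfold the octant
    a = np.where(x < 0, np.pi - a, a)         # quadrants II/III
    return np.where(y < 0, -a, a)             # lower half-plane

def fm_demod(iq):
    """FM discriminator: phase difference of consecutive IQ samples."""
    d = iq[1:] * np.conj(iq[:-1])
    return atan2_lut(d.imag, d.real)
```

With a 4096-entry table the angle error stays below about 1e-3 radians, which is well under typical FM quantization noise; on a fixed-point NPU the table would be baked in as a constant tensor and indexed with a Gather.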

38:46.840 --> 38:51.000
No, but for sine and cosine, sorry, I'm not on the right slide.

38:51.000 --> 38:57.640
For sine and cosine, they are available in ONNX

38:57.640 --> 39:02.400
opset version 11, which is the very basic one.

39:02.400 --> 39:04.000
So that works.

39:04.960 --> 39:06.960
At least that works.

39:06.960 --> 39:08.960
Okay?

39:08.960 --> 39:12.160
Maybe you've seen that in 2015 or 2016,

39:12.160 --> 39:14.760
if I remember well, in this room, there was a presentation

39:14.760 --> 39:17.600
about an implementation of the arctangent.

39:17.600 --> 39:21.600
Yes, yes, yes.

39:21.600 --> 39:24.880
The thing is that I was lazy, and I wanted to test

39:24.880 --> 39:27.040
all the things in this, including the arctangent.

39:27.040 --> 39:29.880
So I have to be clever, and I am not.

39:29.880 --> 39:33.280
So I did not, but you're right.

39:33.440 --> 39:35.640
That's all for me, unless there is a very last question,

39:35.640 --> 39:38.640
otherwise, I will pack my things.

39:38.640 --> 39:40.440
That's all. Okay? Thank you, folks.

39:40.440 --> 39:42.640
Thank you very much for your time.

