WEBVTT

00:00.000 --> 00:29.600
Thank you. Sorry that the only thing standing between you and your dinner is me rambling

00:29.600 --> 00:36.400
about high-performance SDR runtimes in embedded systems. But since I'm going to speak

00:36.400 --> 00:44.560
about GNU Radio 3.10, GNU Radio 4.0, FutureSDR, and the new SDR runtime I'm

00:44.560 --> 00:52.400
developing. I would like to begin with this slide; it's the mandatory xkcd quote, right?

00:52.400 --> 00:58.400
So the story goes like this: last summer I was doing a project for GNU Radio 4.0,

00:58.400 --> 01:03.600
which was kind of kicking the tires and trying to make a packet modem that runs

01:03.600 --> 01:09.920
in 4.0. And through that I got quite familiar with how 4.0 works. I've been

01:09.920 --> 01:17.360
using GNU Radio 3 for a long time also, so I could compare between the two. I've also been interested

01:17.360 --> 01:24.960
in FutureSDR since it launched, really, because I like developing in Rust, but I had never used

01:24.960 --> 01:31.680
it before embarking on this project. So I was kind of interested in comparing all these three

01:31.680 --> 01:38.800
SDR frameworks, and I was really thinking: well, which is the one I should use, which is the best,

01:38.800 --> 01:43.600
right? I was asking these kinds of questions to myself. And of course this is a silly question

01:43.600 --> 01:50.400
because it's not a technical question that you can answer, and SDR runtimes are just tools. So

01:50.400 --> 01:57.440
maybe for different problems you should use different tools. And so the more technical question I

01:57.440 --> 02:04.240
asked myself is about performance. So if I want to have a really fast SDR framework or SDR

02:04.240 --> 02:10.240
runtime, what things should I do and what things should I not do because they cost

02:10.240 --> 02:17.120
too much performance. And as I said, I wanted to look at these three SDR runtimes because they are

02:18.080 --> 02:22.800
the ones I am most interested in, and probably the most popular ones right now, I think.

02:25.200 --> 02:31.680
And when I was thinking about all these runtimes, I realized they all look the same. And here you have

02:31.680 --> 02:38.400
GNU Radio Companion. This is only available in 3.10, really, or the GNU Radio 3 series.

02:38.800 --> 02:44.720
But you can do the same thing with code on the other runtimes. You write a flow graph. But that's

02:44.720 --> 02:50.160
not what I mean. What I mean when I say they all look the same is not the blocks. Sure, there's a graph.

02:50.160 --> 02:57.040
But I mean the arrows. And this is like a stupid thing to say. But here what we draw as an arrow

02:57.040 --> 03:03.840
is a circular buffer, which is connecting the blocks. So right, the connections in the graph

03:03.840 --> 03:09.520
are circular buffers. Usually they are single-producer, multiple-consumer, because you want

03:09.520 --> 03:17.120
fan-out: this noise source could go to multiple blocks downstream. And that's how this works.

03:17.680 --> 03:23.840
And I was starting to think, well, I know a way to do this differently. Maybe it's not better,

03:23.840 --> 03:33.280
but it's different. And perhaps it does well in some respects. So the disadvantages of using

03:33.280 --> 03:39.040
these circular buffers for the connections are these two. The first is, if you look at this

03:39.040 --> 03:45.680
multiply-const block, its work function has an input buffer and an output buffer, and it's

03:45.680 --> 03:51.440
moving the data from the input to the output and multiplying by 42. And what happens is for this

03:51.440 --> 03:56.960
block to be very, very fast, you need both the input data and the output data to be in L1 cache.
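To make the cache argument concrete, here is a minimal sketch of the two variants of a multiply-const work function (the function names are illustrative, not an actual runtime API): the out-of-place version touches two buffers, the in-place version touches one, halving the cache footprint.

```rust
// Out-of-place: both input and output must fit in L1 cache for the
// block to run at full speed.
fn multiply_const(input: &[f32], output: &mut [f32], k: f32) {
    for (o, i) in output.iter_mut().zip(input) {
        *o = *i * k;
    }
}

// In-place: the same operation on a single buffer, so the block's
// working set in cache is half as large.
fn multiply_const_in_place(buf: &mut [f32], k: f32) {
    for x in buf.iter_mut() {
        *x *= k;
    }
}

fn main() {
    let input = [1.0f32, 2.0, 3.0];
    let mut output = [0.0f32; 3];
    multiply_const(&input, &mut output, 42.0);

    let mut buf = [1.0f32, 2.0, 3.0];
    multiply_const_in_place(&mut buf, 42.0);

    // Both variants compute the same result.
    assert_eq!(output, buf);
    println!("{:?}", buf);
}
```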

03:58.000 --> 04:03.040
You could multiply in place, right? Multiplying by 42 is a simple operation. You could have a

04:03.040 --> 04:08.960
single buffer, which is both your input and your output. But by having separate

04:08.960 --> 04:15.200
input and output buffers, you're effectively reducing the size of the cache that this block can

04:15.200 --> 04:22.960
work with by half. So that's one thing. The other thing is transmit latency. And I always point

04:22.960 --> 04:29.680
people to the same GRCon talk, from maybe four or five years back,

04:30.000 --> 04:35.760
where he considered the following problem. If instead of the null sink I have an SDR sink,

04:35.760 --> 04:42.800
which is consuming data at a certain rate. And I have a bunch of blocks. So I have like a long chain

04:42.800 --> 04:50.000
of blocks. What happens is that the way back pressure works in this kind of system is that buffers

04:50.000 --> 04:56.400
get full. So when you have many blocks, you have many buffers between all these blocks. And that

04:56.400 --> 05:03.280
drives your transmit latency up. And if I were to change this 42 to any other number at some point,

05:03.280 --> 05:10.160
I might see the effect at my RF output a couple of seconds afterwards. And that's something that

05:10.160 --> 05:18.000
happens with 3.10. It's something that can happen with 4.0. So there's a concern that maybe

05:18.000 --> 05:25.840
this buffer approach is not good for controlling latency, and for having something where you can control

05:25.840 --> 05:34.560
the final latency when trying to get low-latency flow graphs. So my alternative idea was to send samples in packets.

05:34.560 --> 05:40.000
Rather than having a continuous stream which is realized by circular buffers, you send the samples

05:40.000 --> 05:45.840
in packets. And the idea is you make a circuit of packets. Why a circuit? The thing

05:45.840 --> 05:51.280
is you have some fixed number of packets which are constructed at the beginning of a flow graph.

05:51.280 --> 05:57.600
They are allocated once and they get recycled. Let's say we have just one packet. So there will be one

05:57.600 --> 06:04.080
packet. First it is at this noise source. The noise source uses the packet to write its output, then the packet

06:04.080 --> 06:09.680
goes here. This block is going to multiply in place by 42, because we want to do things in place as much as

06:09.680 --> 06:14.880
possible. And then it goes to the null sink, which doesn't do anything. And then the packet goes

06:14.880 --> 06:23.280
back to the noise source so that it can generate more noise and write it on the packet. And rather

06:23.280 --> 06:27.840
than having just one packet in the circuit, you can have multiple of them, and you have these blocks

06:27.840 --> 06:34.320
on different threads working at the same time on different packets. So this is the idea.
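The circuit just described can be sketched with standard-library channels and threads: a fixed set of buffers circulates from source to multiply-const to sink and then back to the source to be refilled. This is a minimal stand-in, not the QSDR implementation; the function and channel names are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

// A "quantum": a fixed-size buffer that circulates around the flow graph.
type Quantum = Vec<f32>;

/// Runs a source -> multiply-by-42 -> sink circuit with `n_buffers`
/// recycled quanta of `len` samples each, producing `n_quanta` quanta
/// total, and returns everything the sink saw.
fn run_circuit(n_buffers: usize, len: usize, n_quanta: usize) -> Vec<f32> {
    // Channels forming the circuit: source -> mult -> sink -> source.
    let (to_mult, mult_in) = mpsc::channel::<Quantum>();
    let (to_sink, sink_in) = mpsc::channel::<Quantum>();
    let (recycle, source_in) = mpsc::channel::<Quantum>();

    // The only allocations: a fixed set of quanta injected up front.
    for _ in 0..n_buffers {
        recycle.send(vec![0.0; len]).unwrap();
    }

    // Source: fills each recycled quantum with a ramp 0, 1, 2, ...
    let source = thread::spawn(move || {
        let mut n = 0.0f32;
        for _ in 0..n_quanta {
            let mut q = source_in.recv().unwrap();
            for x in q.iter_mut() {
                *x = n;
                n += 1.0;
            }
            to_mult.send(q).unwrap();
        }
    });

    // Multiply-const: works in place on each quantum, no copy.
    let mult = thread::spawn(move || {
        while let Ok(mut q) = mult_in.recv() {
            for x in q.iter_mut() {
                *x *= 42.0;
            }
            to_sink.send(q).unwrap();
        }
    });

    // Sink: records the samples and returns the quantum to be recycled.
    let mut seen = Vec::new();
    while let Ok(q) = sink_in.recv() {
        seen.extend_from_slice(&q);
        let _ = recycle.send(q); // the source may already have exited
    }
    source.join().unwrap();
    mult.join().unwrap();
    seen
}

fn main() {
    let seen = run_circuit(2, 4, 3);
    assert_eq!(seen.len(), 12);
    assert_eq!(&seen[..4], &[0.0, 42.0, 84.0, 126.0]);
    println!("ok");
}
```

With `n_buffers` greater than one, the three blocks genuinely overlap on different quanta, which is the multi-threading point made above.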

06:35.040 --> 06:42.240
And I know, you may say this is not new. This is not really new. The advantages are: many

06:42.240 --> 06:48.640
blocks can work in place on a packet. And so you are effectively making more use of your

06:48.640 --> 06:53.760
CPU cache. Latency is determined by the number of packets you have in a circuit. That's

06:53.760 --> 06:58.720
really trivial. If I only put four packets on the circuit, transmit latency is going to be

06:58.720 --> 07:03.120
at maximum the time it takes to transmit four packets through the final sink.
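That worst-case bound is simple arithmetic; here is the calculation spelled out (a sketch with an assumed packet size and sample rate, not figures from the talk):

```rust
/// Worst-case transmit latency when a circuit holds `n_packets` packets
/// of `samples_per_packet` samples at `sample_rate` Hz: at most, every
/// in-flight packet must drain through the final sink first.
fn max_latency_seconds(n_packets: u32, samples_per_packet: u32, sample_rate: f64) -> f64 {
    f64::from(n_packets) * f64::from(samples_per_packet) / sample_rate
}

fn main() {
    // Four packets of 4096 samples at 1 Msps: at most ~16.4 ms of latency.
    let t = max_latency_seconds(4, 4096, 1e6);
    assert!((t - 0.016384).abs() < 1e-12);
    println!("{t}");
}
```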

07:05.120 --> 07:09.520
Also something else. As I was working with packet modems, not only in the

07:09.520 --> 07:16.080
GNU Radio 4.0 project but elsewhere as well, I was realizing that many protocols have packets; even if they

07:16.800 --> 07:23.280
don't have like a packetized nature, they have sections or frames. And oftentimes you need to

07:23.280 --> 07:29.520
mark those with tags. You might need to align your work functions to the tags. So something good

07:29.520 --> 07:36.000
about packets is that maybe they can be more natural when dealing with this kind of data.

07:36.640 --> 07:40.320
So there is potentially less need for tags and these kinds of things.

07:41.200 --> 07:47.600
It is also more similar to a handcrafted implementation. If I told you, write this for me from scratch in

07:47.600 --> 07:53.520
any language you want, you would probably have like one single buffer maybe and then you pass the

07:53.520 --> 07:58.480
buffer to a noise-generating function and then pass the buffer to a multiply-constant function which

07:58.480 --> 08:03.200
works in place. And then you're done. And if I tell you, I want it multi-threaded, maybe you have

08:04.080 --> 08:10.400
several buffers, for the noise source and the multiply-constant to be able to work at the same

08:10.400 --> 08:19.680
time on multiple buffers, right? So this "packet", as I've been calling it in quotation marks, I want to

08:19.680 --> 08:26.000
call a quantum. First because it's cool: people get excited about quantum physics or anything

08:26.000 --> 08:32.800
physics related. I know this is not the only use of the word quantum in SDR applications, but

08:33.440 --> 08:39.760
anyhow. There are also two serious reasons for calling it a quantum. The first is that

08:39.760 --> 08:47.120
if I were to call this a packet, it would clash with packets elsewhere, because packet is a heavily overloaded term.

08:47.120 --> 08:53.360
You have multiple protocol layers, and each of them has its own concept of packet: L1, L2, etc.

08:53.360 --> 08:59.840
And if your SDR is an IP SDR which is network connected, it also has UDP packets. So

08:59.840 --> 09:05.840
let's not call this thing again a packet because that's very confusing. And the second reason is

09:05.840 --> 09:12.160
well, the inspiration for this: the idea is not to consider samples as a continuous stream of

09:12.720 --> 09:17.600
data but rather to chunk them up in packets. It is the same as with energy in quantum physics, right?

09:17.600 --> 09:23.280
You have these photons, which are the quanta in which energy can be transmitted. So it's kind of

09:23.440 --> 09:29.360
the same thing. A quantum contains a buffer with flexible margins, and I want to explain these

09:29.360 --> 09:34.960
margins with an example. So think about computing and appending a CRC. Can you do that in place?

09:35.360 --> 09:39.840
Well, no, because you need to grow the size of the packet. But if you know beforehand that you

09:39.840 --> 09:45.760
want to add a CRC then you can prepare your buffer so that it has some extra capacity at the end.

09:45.760 --> 09:50.720
And then you can do it in place by growing the buffer. This is like the difference in

09:50.720 --> 09:57.040
C++ between a vector's size and its capacity. If you design your capacity right from the start,

09:57.040 --> 10:03.040
then you can grow your packets by some postamble or preamble. So you have the left margin

10:03.040 --> 10:08.480
and the right margin to be able to add things like a synchronization word or a header to the

10:08.480 --> 10:15.520
beginning without reallocating or going to a different buffer. And you can also strip those off.
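The CRC example maps directly onto the vector size-versus-capacity analogy. Here is a sketch using a plain `Vec` (the checksum is a placeholder, not a real CRC): reserving the right margin up front lets the packet grow in place, which we can verify because the buffer address does not change.

```rust
/// Sketch of the "right margin" idea: reserve capacity for a 4-byte
/// checksum up front, then grow the packet in place later.
fn main() {
    let payload_len = 1024;
    let crc_len = 4;

    // Allocate with extra capacity at the end (the right margin).
    let mut packet: Vec<u8> = Vec::with_capacity(payload_len + crc_len);
    packet.extend(std::iter::repeat(0xabu8).take(payload_len));
    let data_ptr = packet.as_ptr();

    // A placeholder checksum (a byte sum, not a real CRC) just to have
    // something to append.
    let crc = packet.iter().map(|&b| u32::from(b)).sum::<u32>().to_le_bytes();

    // Growing into the reserved margin does not reallocate: the buffer
    // address is unchanged, so this is effectively an in-place operation.
    packet.extend_from_slice(&crc);
    assert_eq!(packet.len(), payload_len + crc_len);
    assert_eq!(packet.as_ptr(), data_ptr);
    println!("ok");
}
```

Stripping a header on receive is the mirror image: the data stays put and only the margin offset moves.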

10:15.520 --> 10:20.240
So on receive, if you say, I'm no longer interested in the header, you just set the

10:20.320 --> 10:28.480
margin to somewhere else. So that's the idea. You can also have tags in here, marking any offset

10:28.480 --> 10:35.040
of interest within the packet, and perhaps other metadata. Here I have a more complex

10:35.040 --> 10:41.600
example flow graph. This is not a real application, but it's a realistic-looking application.

10:41.600 --> 10:47.600
And what is different between this flow graph and the previous one is two things.

10:47.600 --> 10:54.400
First, here there's a block which is doing a decimation by two. And here I have a fan-out

10:54.400 --> 11:00.240
connection. So I have one output going to two inputs. How do we deal with this?

11:02.240 --> 11:08.080
The idea is to go like this. Let me explain first how I deal with decimation. What we do

11:08.080 --> 11:14.720
is if I want to decimate by two then it's no longer reasonable to do it in place because the size

11:14.720 --> 11:19.360
of the output is going to be half in this case. So what's going to happen is there's a packet

11:19.360 --> 11:24.320
which is the input. There's a packet which is the output. It's going to be half the size of the input

11:24.320 --> 11:30.560
because you always do one packet per work function call, even if you are decimating. And then

11:30.560 --> 11:36.240
that means you have different circuits. Because this is a decimation, this block is basically

11:36.240 --> 11:41.680
the sink of one circuit, it sends the input packet back, and it's the source of this other

11:41.680 --> 11:48.560
circuit. So that's the way you deal with decimation. How do you deal with one-to-many connections?

11:48.560 --> 11:55.040
Because the idea is not to pay for things you are not using all the time. So doing things in place

11:55.040 --> 11:59.920
is great for one-to-one connections. Of course you cannot do it here. And the idea is rather than

11:59.920 --> 12:05.280
sending the packet (you would have to copy it out to two different locations), you send just a reference.

12:05.280 --> 12:12.880
So you send a reference to these two blocks. It's a read-only reference. So these blocks also

12:12.880 --> 12:18.400
cannot work in place. They read from the input, which is just a read-only reference, and they write

12:18.400 --> 12:25.920
onto an output packet which has been recycled from here. And then what happens is because of the

12:25.920 --> 12:34.560
way the data structures for these references are organized, the last block which keeps the reference alive,

12:34.560 --> 12:42.400
when it drops it, is supposed to return the packet to wherever it needs to go to be recycled.

12:43.040 --> 12:48.160
So that's the way you can deal with more complicated flow graphs by still using the idea of packets.
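The "last reference recycles the buffer" behavior can be sketched with `Arc`: each consumer holds a read-only reference, and whoever drops the last one recovers the buffer and sends it back for recycling. This is only a stand-in for the mechanism described; the actual QSDR data structures may differ, and `release` is a hypothetical helper.

```rust
use std::sync::{mpsc, Arc};

type Quantum = Vec<f32>;

/// Drop one read-only reference; if it was the last one, the underlying
/// buffer is recovered and returned to the recycling channel.
fn release(q: Arc<Quantum>, recycle: &mpsc::Sender<Quantum>) {
    // try_unwrap succeeds only for the last live reference.
    if let Ok(buf) = Arc::try_unwrap(q) {
        recycle.send(buf).unwrap();
    }
}

fn main() {
    let (recycle, recycled) = mpsc::channel::<Quantum>();
    let q = Arc::new(vec![1.0f32, 2.0, 3.0]);

    // Fan-out: two downstream blocks hold read-only references.
    let a = Arc::clone(&q);
    let b = Arc::clone(&q);
    drop(q); // the producer no longer needs its handle

    // First consumer finishes: the buffer is still shared, nothing happens.
    release(a, &recycle);
    assert!(recycled.try_recv().is_err());

    // Last consumer finishes: the buffer goes back to be recycled.
    release(b, &recycle);
    assert_eq!(recycled.try_recv().unwrap(), vec![1.0, 2.0, 3.0]);
    println!("ok");
}
```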

12:50.560 --> 12:57.280
I've spent a few months working on this idea, and I've implemented something I call QSDR.

12:57.280 --> 13:03.280
The Q is for quantum, of course. This is up on GitHub, maybe since a month ago or so.

13:04.240 --> 13:10.560
And I'm doing this in Rust. I'm using async, so this is the same as FutureSDR.

13:11.040 --> 13:16.880
But that's just about the only commonality. The rest looks quite different, I think.

13:18.320 --> 13:25.760
But most importantly, and this is how I really began this project: in the QSDR repository there are some

13:25.760 --> 13:32.960
benchmarks comparing GNU Radio 3.10, GNU Radio 4.0, FutureSDR and this QSDR implementation.

13:32.960 --> 13:39.360
So you can see where things are regarding performance. This is highly experimental at the moment,

13:39.360 --> 13:46.320
highly work in progress. Now, custom schedulers. This is like a huge topic, right?

13:47.600 --> 13:54.000
GNU Radio 4.0 supports custom schedulers. FutureSDR supports custom schedulers.

13:54.720 --> 14:01.680
And QSDR does also. This is my approach to custom schedulers, because I wasn't really too happy with the

14:01.680 --> 14:08.720
way GNU Radio 4.0 or FutureSDR approach this idea of custom schedulers. And the reason is: yes,

14:08.720 --> 14:16.000
you can write your own scheduler, but it's kind of hard. Before this project, I hadn't written any

14:16.000 --> 14:24.160
custom schedulers for GNU Radio 4.0 or FutureSDR. And it's hard. Now I've written some simple ones.

14:24.880 --> 14:30.400
You basically need to know the internals of how the SDR runtime you are dealing with works.

14:31.120 --> 14:39.200
You are basically calling some functions from the API of that particular SDR runtime

14:39.280 --> 14:45.200
to write your scheduler. And that makes them not so easy to write. On the other hand, in QSDR,

14:45.840 --> 14:51.200
these schedulers are based on Rust streams, which is a concept I will explain in a second.

14:52.080 --> 14:58.000
And this makes them completely independent of the SDR framework, completely independent of anything.

14:59.280 --> 15:06.160
These schedulers are things which run Rust streams, and those streams could be anything. They do not

15:06.160 --> 15:13.600
need to do SDR or DSP or anything. So it's a more general concept. If your

15:13.600 --> 15:21.360
custom scheduler really needs to, it might dig into the internals of what's going on that is SDR related.

15:21.360 --> 15:27.120
But if not, it doesn't need to. It's the same as with 3.10: 3.10 uses the Linux kernel scheduler

15:27.120 --> 15:34.000
and its tasks, right, processes or threads. And Linux doesn't know anything about SDR. So it's the same

15:34.080 --> 15:43.040
kind of idea, but without using the operating system. So, a quick primer on Rust streams.

15:43.840 --> 15:52.000
A stream is an asynchronous object which can produce a sequence of values. And in this example,

15:52.000 --> 16:00.240
this is a stream built from an iterator, which is going to produce the values 1, 2 and 3. Basically we have the

16:00.320 --> 16:07.120
object, we call its next method, and every time we call it, it returns us one value. So it returns

16:07.120 --> 16:14.480
1. Then it returns 2. It returns 3. And the next time we call next, there are no more values. So it

16:14.480 --> 16:20.960
returns None, saying: I'm finished, there is nothing more that I can give you. And something important here is

16:20.960 --> 16:27.680
this await. And, if you are familiar with await in any language which supports

16:28.320 --> 16:35.920
async, this means that at this point the stream might not be ready to give us some value. So

16:35.920 --> 16:43.360
we are kind of waiting for it. And if this is a task, then some other piece of code might decide to run

16:43.360 --> 16:49.760
some other task while we are waiting for this. Without the intervention of the operating system

16:49.760 --> 16:57.120
scheduler. You can imagine a more elaborate example where this stream might produce 1, 2 and 3.

16:57.120 --> 17:03.040
But rather than producing them immediately, it waits for one second between each of the values. So when I call

17:03.040 --> 17:08.160
await, I'm going to block for one second and maybe some other code in my application runs.
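The 1-2-3 example can be shown with a synchronous stand-in from the standard library: a plain iterator has exactly the `next` behavior described, except that a real stream's `next` is awaited and may suspend the task instead of blocking (the async version would use the `futures` crate's `stream::iter`).

```rust
/// Synchronous stand-in for the async stream example: an iterator
/// yields Some(value) until it is exhausted, then None.
fn main() {
    let mut stream = [1, 2, 3].into_iter();
    assert_eq!(stream.next(), Some(1));
    assert_eq!(stream.next(), Some(2));
    assert_eq!(stream.next(), Some(3));
    // No more values: the stream signals that it is finished.
    assert_eq!(stream.next(), None);
    println!("ok");
}
```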

17:10.320 --> 17:16.720
That's the idea. So in QSDR, each block has a work function. It is an async function

17:17.280 --> 17:22.400
because it needs to wait for the input to be available. So it's the same as here: we

17:22.480 --> 17:30.400
await. And that's the reason it's async. The work function in general processes one input quantum

17:30.400 --> 17:37.920
to produce one output quantum. So it's one quantum per work call. And then it returns either Run,

17:37.920 --> 17:44.000
which means: call me again, I can do more work; that's the usual thing. Or it says Done:

17:45.040 --> 17:50.400
I'm finished. Maybe I'm a file source and I've reached the end of the file, or I'm a head block and I've already reached my

17:51.360 --> 17:54.960
count. Or maybe there's a fatal error. Those are the three things that can happen.
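The work-function contract just described can be sketched as a small enum plus a head block. The names are illustrative, not the actual QSDR API, and the real work function is async; a synchronous version keeps the sketch self-contained.

```rust
// The three possible outcomes of a work call: Run, Done, or an error
// (carried by the Result).
enum Status {
    Run,  // call me again, I can do more work
    Done, // e.g. a file source at EOF, or a head block at its count
}

struct Head {
    remaining: u64, // how many quanta are left to pass through
}

impl Head {
    // In QSDR this would be an async fn awaiting its input quantum.
    fn work(&mut self, _quantum: &mut [f32]) -> Result<Status, String> {
        if self.remaining == 0 {
            return Ok(Status::Done);
        }
        self.remaining -= 1;
        Ok(Status::Run)
    }
}

fn main() {
    let mut head = Head { remaining: 2 };
    let mut q = [0.0f32; 4];
    assert!(matches!(head.work(&mut q), Ok(Status::Run)));
    assert!(matches!(head.work(&mut q), Ok(Status::Run)));
    assert!(matches!(head.work(&mut q), Ok(Status::Done)));
    println!("ok");
}
```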

17:56.000 --> 18:02.560
A QSDR block, once you have it connected in the flow graph, is converted to a stream.

18:03.280 --> 18:09.760
And each time you call next on the stream, next is going to call the work function once. And

18:09.760 --> 18:15.600
it returns either an error or nothing. And the difference between Run and Done, you have it in

18:15.680 --> 18:23.040
this None versus Some difference. So it either says, with None, I'm really finished, or it says,

18:23.920 --> 18:31.040
with Some, I've done some work. I don't have any output value to give you, like a number; just: I've done some work.

18:31.040 --> 18:36.560
And you could call next again and expect more work to happen. So that's how

18:36.560 --> 18:40.960
streams appear as a natural concept when you are thinking of scheduling work functions.

18:41.200 --> 18:50.960
Then a QSDR scheduler is just some code that polls multiple streams, either in

18:50.960 --> 18:58.400
parallel or in one thread or in any specific order. And those streams might be anything.

18:59.040 --> 19:06.240
The concept of streams also appears naturally when you are thinking about channels, which are ubiquitous

19:06.240 --> 19:11.840
in flow graphs. So if you have a channel, there is a transmitter object and a receiver object.

19:11.840 --> 19:18.000
And then you send some data and you get it here. And of course receiving is asynchronous.

19:18.000 --> 19:23.440
It is like await receive, because maybe the transmitter hasn't sent the data yet.

19:25.520 --> 19:29.520
And the receiver is a stream object because you can keep receiving messages

19:29.520 --> 19:35.200
until the transmitter is done and decides to close the channel. So any code you have

19:35.200 --> 19:40.720
like a web application framework which is designed to run multiple channels in parallel,

19:40.720 --> 19:45.040
you can use as a QSDR scheduler, and there will be an example. There's nothing

19:45.040 --> 19:50.560
custom about QSDR schedulers. And I think that makes them way easier to write.

19:51.920 --> 19:58.240
Something which is very useful for writing simple schedulers is stream combinators. The idea of

19:58.240 --> 20:03.520
a stream combinator is an operation which takes some streams and produces another stream.

20:04.400 --> 20:10.400
So let's say we have two streams which produce either nothing or an error.

20:10.400 --> 20:15.360
And the way this is written in Rust is this Result of the empty type or an error.

20:16.880 --> 20:22.000
If I have these two streams, I can produce a new stream whose next method means

20:22.800 --> 20:27.280
poll the first stream and get something. If it doesn't produce an error, poll the second

20:27.280 --> 20:32.560
stream and get something. And if that doesn't produce an error either, then that's what your next

20:33.440 --> 20:39.520
function did. If one of those produces an error, that's the error. Basically you're trying to

20:39.520 --> 20:44.080
receive two empty messages, and if there's an error you fail as soon as you see the error.
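Here is a synchronous sketch of that combinator, using plain iterators as stand-ins for streams (the real QSDR version operates on async streams; the name `sequence2` follows the talk, the signature is assumed):

```rust
/// Combine two "streams" of Result<(), E> into one: each next pulls the
/// first, then the second, and fails as soon as either fails.
fn sequence2<E>(
    mut a: impl Iterator<Item = Result<(), E>>,
    mut b: impl Iterator<Item = Result<(), E>>,
) -> impl Iterator<Item = Result<(), E>> {
    std::iter::from_fn(move || {
        // Poll the first stream; if it is finished, the sequence is too.
        match a.next()? {
            Err(e) => return Some(Err(e)),
            Ok(()) => {}
        }
        // Then poll the second stream.
        match b.next()? {
            Err(e) => Some(Err(e)),
            Ok(()) => Some(Ok(())),
        }
    })
}

fn main() {
    let a = vec![Ok::<(), &str>(()), Ok(())].into_iter();
    let b = vec![Ok(()), Err("boom")].into_iter();
    let mut s = sequence2(a, b);
    assert_eq!(s.next(), Some(Ok(())));
    assert_eq!(s.next(), Some(Err("boom")));
    println!("ok");
}
```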

20:45.200 --> 20:50.800
And this is called sequence2 because it's basically like running two blocks, always first block's

20:50.800 --> 20:55.680
work, then second block's work, and there are many situations in which you would like to do so because it

20:55.760 --> 21:03.520
makes sense for the flow graph. And yeah, there is sequence3, sequence4, etc., to combine

21:04.160 --> 21:11.520
more blocks. There is also a helper function which is called run, and it takes a stream and

21:11.520 --> 21:17.760
produces a future. A future is an asynchronous value. So you await on it and it returns

21:17.840 --> 21:26.000
something. This run is going to poll the stream until it is done or until it produces an error.

21:26.960 --> 21:32.640
Thinking again of the message receiver, it's basically going to drain the message receiver by

21:32.640 --> 21:38.480
receiving all messages until an error or until there are no more messages. And this is the

21:38.480 --> 21:45.440
way you run a flow graph. Basically you call run on your blocks or combinations of blocks

21:46.160 --> 21:53.280
to get them to work until they say they are done, because some block is done and it's saying

21:53.280 --> 22:00.080
that it's not going to produce any more output, and so the next block cannot have any more input and

22:01.120 --> 22:06.400
so it is done as well. And this run function is usually the top level for a scheduler on its thread.
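The run helper has the same synchronous sketch as sequence2: drain the stream until it finishes or produces an error. Again, iterators stand in for async streams, and the signature is an assumption, not the QSDR API.

```rust
/// Poll a "stream" of Result<(), E> items until it finishes or fails.
fn run<E>(stream: impl Iterator<Item = Result<(), E>>) -> Result<(), E> {
    for item in stream {
        item?; // stop at the first error
    }
    Ok(()) // the stream is done: this chain of blocks has finished
}

fn main() {
    let ok: Vec<Result<(), &str>> = vec![Ok(()), Ok(())];
    assert_eq!(run(ok.into_iter()), Ok(()));

    let failing: Vec<Result<(), &str>> = vec![Ok(()), Err("fatal")];
    assert_eq!(run(failing.into_iter()), Err("fatal"));
    println!("ok");
}
```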

22:06.480 --> 22:16.160
Here is a code example, just to give you an idea of how friendly or ugly the code is. I want to

22:16.160 --> 22:22.160
dwell on a couple of specific points. Here we are defining a type for the buffer. This is

22:23.760 --> 22:30.480
like a shorthand to avoid writing too much. It's the same as using in C++. And of course we

22:30.480 --> 22:38.000
have custom buffers in QSDR, because everyone has them, and this is the basic cache-line-aligned

22:38.000 --> 22:47.440
buffer for float32. Here we are defining our buffers. So we create an iterator which is going to

22:47.440 --> 22:56.160
instantiate a number of buffers. So four buffers of 4096 floats each. But this isn't creating the

22:56.160 --> 23:01.760
buffers at this point, because this is an iterator. Iterators in Rust are lazy: until you

23:01.760 --> 23:10.000
execute them, nothing happens. We create a flow graph. We add some blocks. Something to point out here

23:10.000 --> 23:16.960
is that I have this single-producer single-consumer and single-producer single-consumer reference. These

23:16.960 --> 23:22.320
are the input and output channels for this particular head block. This is null source, head and null

23:22.320 --> 23:30.960
sink, the most simple flow graph you can have. And the thing is, many things here have type

23:30.960 --> 23:38.000
parameters. In the C++ world you would say they are templates. And the data types or the buffer

23:38.000 --> 23:45.680
types are part of the type parameters, and so are the channel types, because if single-producer

23:45.680 --> 23:51.680
single-consumer is the most efficient channel, why would you pay for a more general channel to do

23:51.680 --> 23:59.120
connections where that suffices? So that's the reason you see different channels specified here.

24:00.160 --> 24:08.080
Then we do this circuit thing. So we have these closed circuits. To do connections we create a new circuit,

24:09.280 --> 24:15.840
use that circuit to make each connection, and then for the last connection we do, we need to say where

24:15.840 --> 24:22.560
the packets need to return to be recycled. So basically I have my source. It's connected to the head.

24:22.560 --> 24:28.480
Then the head gets connected to the sink. And the return point for that connection because the

24:28.480 --> 24:36.960
sink basically is the finishing point for that chain, is to go back again to the source to be recycled.

24:37.920 --> 24:43.040
Then we validate the flow graph to make sure there are no missing connections, etc. And then there is

24:43.040 --> 24:51.760
this piece of code which is somewhat technical, and probably we would not need it if we used a macro

24:51.760 --> 24:57.600
to generate this for us. This is because when we are adding the blocks we are not really adding the

24:57.600 --> 25:02.880
blocks as they will exist when the flow graph runs, because, for example, the channels that these blocks

25:02.880 --> 25:09.840
will use to communicate do not exist until you do the connections. So we are adding a kind of

25:10.480 --> 25:15.600
seed for the block, something that is going to become the block once it's properly connected and

25:15.600 --> 25:21.920
everything. And here we are finalizing the block into something which can really run. That's the

25:21.920 --> 25:29.360
reason we need this. And then finally what happens here is we convert each of the blocks into a stream

25:29.360 --> 25:37.840
and we use our sequence3 to say it's always going to run source, head, sink. run and block_on:

25:38.800 --> 25:47.360
this is nothing QSDR-specific. This is an async executor in Rust, which means: run this thing on the

25:47.360 --> 25:54.800
current thread until it's done. And this is really a custom scheduler. This run of sequence3 is itself a custom

25:54.800 --> 25:59.840
scheduler, a very simple one, but it's a custom scheduler for QSDR which says: run the blocks

25:59.920 --> 26:08.480
in this order always. How do you write the blocks? There's a whole bunch of template parameters

26:08.480 --> 26:15.520
or Rust type parameters here that you can basically copy and paste from a similar example. And then

26:15.520 --> 26:24.320
what you have are ports. This is kind of inspired by GNU Radio 4.0, which also has the ports

26:25.120 --> 26:32.240
as class members. But the nice thing about Rust is that these are really zero-sized objects.

26:32.240 --> 26:39.120
They do not do anything at runtime. They are just there to make our life easier at compile time.
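The zero-sized-ports claim is easy to demonstrate with `PhantomData`; the port and block definitions below are illustrative, not the actual QSDR types:

```rust
use std::marker::PhantomData;

// Zero-sized port markers: they carry the item type in the type system
// but occupy no memory and cost nothing at runtime.
struct InputPort<T>(PhantomData<T>);
struct OutputPort<T>(PhantomData<T>);

// A block with two ports and one real field: at runtime the block is
// just the f32 constant.
struct MultiplyConst {
    input: InputPort<f32>,
    output: OutputPort<f32>,
    constant: f32,
}

fn main() {
    assert_eq!(std::mem::size_of::<InputPort<f32>>(), 0);
    assert_eq!(std::mem::size_of::<OutputPort<f32>>(), 0);
    // The ports add nothing to the block's size.
    assert_eq!(std::mem::size_of::<MultiplyConst>(), 4);
    println!("ok");
}
```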

26:40.000 --> 26:44.640
When I was saying before that this is like the seed for a block, which then is going to become the

26:44.640 --> 26:51.440
block which can run and has channels: these ports will be replaced by proper channels,

26:52.240 --> 27:00.560
senders or receivers, to send these quanta between the blocks. And then we can have any local

27:00.560 --> 27:09.840
variables we want. This is a head block. So we have the remaining number of items. We have a constructor.

27:09.840 --> 27:16.800
This is like a regular constructor. And then we have the work function. There are different

27:16.800 --> 27:22.880
flavors of work functions; you can pick one depending on what you need to do. And in this case this

27:22.880 --> 27:29.200
is work_in_place, because it's going to work in place on the item. So it's going to work in place on

27:29.200 --> 27:36.640
this quantum or buffer. In the particular case of head, it doesn't do anything with the data.

27:36.640 --> 27:43.600
It's just going to say it's done at some point. And that's also another place where this concept

27:43.600 --> 27:50.080
is nice: it doesn't need to do a memory copy. So here basically I get a mutable reference,

27:50.960 --> 27:58.800
basically a pointer, to my input and output data. I could go in there and multiply every element by 42,

27:58.800 --> 28:04.240
if that's the kind of thing I want to do. In this head I'm just counting elements, and when

28:04.240 --> 28:11.120
I reach zero I say I'm done. But you could multiply by 42. An example multiply-constant would be

28:11.120 --> 28:15.920
like iterating over the samples here and multiplying by the constant. It would be a

28:15.920 --> 28:23.280
very similar example. Okay, so benchmarks. This is probably the most interesting part of the talk.

28:25.120 --> 28:32.560
So to do benchmarks, this is the process I wanted to follow. First choose a family of simple

28:32.560 --> 28:38.880
flow graphs, because if it's a complicated flow graph it's hard to understand where the bottlenecks are.

28:38.880 --> 28:42.960
You also need to implement it in four different SDR runtimes, as I said, so it's going

28:42.960 --> 28:48.480
to be a lot of work. And maybe you don't implement it in a very smart way in one of them. And you

28:48.480 --> 28:55.360
are underperforming because of some choice you made mistakenly. So:

28:55.360 --> 29:03.200
simple flow graphs. And first, very importantly, write by hand an implementation of this problem

29:03.280 --> 29:10.400
that performs the best you can, without using an SDR framework. Just ask yourself how fast the hardware

29:10.400 --> 29:16.960
can actually run this task, and write it and benchmark it. Then you can go and write implementations

29:16.960 --> 29:22.560
using flow graphs in these four different SDR runtimes. And you basically want to measure the rate

29:22.560 --> 29:31.040
at which your flow graph is running, so basically the sample rate. The benchmarking platform

29:31.120 --> 29:39.680
I've chosen is the embedded ARM CPU which is on the Xilinx MPSoC. That's a quad-core

29:39.680 --> 29:48.800
Cortex-A53 CPU. It's 64-bit ARM. It runs with a 1.33 GHz clock. If you are thinking

29:48.800 --> 29:55.840
in Raspberry Pi terms, this is the CPU on the Raspberry Pi 3. Maybe not exactly the

29:55.840 --> 30:03.840
same cache sizes, but it's the same CPU core, the A53. The reason I'm interested in this is because

30:03.840 --> 30:08.880
I'm mostly interested in embedded systems. These are the kinds of systems I can launch into space.

30:09.600 --> 30:18.160
And also because if you look at the newest generation of SDRs, they usually use these Xilinx

30:18.240 --> 30:25.920
FPGA SoCs, which have this Cortex-A53 platform. So here, this is the X400 series

30:26.400 --> 30:33.600
USRP. This is from Analog Devices, the Jupiter SDR. And this is a development platform for

30:33.600 --> 30:40.240
the RFSoC, which is called the RFSoC 4x2. And these two are RFSoCs. And this one is

30:40.240 --> 30:48.320
not an RFSoC. It's an MPSoC with some other ADI RF chip. And I know what you are thinking.

30:48.320 --> 30:54.880
These platforms, they all have an FPGA, you should be doing your compute on an FPGA. Sure, sure,

30:54.880 --> 31:02.400
I mean, but the Cortex-A53 is rather capable. It looks like an old ARM CPU, but you can do

31:02.400 --> 31:09.200
nice things with it if you program in an efficient way. And then the advantage is it's much easier

31:09.200 --> 31:16.000
to write software than FPGA gateware. You do not need an FPGA engineer. So by having a very efficient

31:16.000 --> 31:22.240
SDR runtime to use on these platforms, you can solve problems for which maybe you would have needed to go to

31:22.240 --> 31:32.320
the FPGA. For this, I'm using the Kria KV260 board. It's $250, which is some money, but not so much

31:32.320 --> 31:38.720
compared with all of these. It doesn't have any RF hardware in it, but you can run software. You could

31:38.720 --> 31:45.040
also program the FPGA if you want something nice. Also, if you want to replicate these benchmarks

31:45.040 --> 31:50.560
and run the same thing, you can get the same board. If I were to run this on my desktop computer,

31:50.560 --> 31:57.040
it would be hard for you to replicate exactly my desktop computer. This is how the flowgraph is

31:57.040 --> 32:03.840
going to look. So there is a math kernel, which is going to be the SAXPY kernel. And this

32:03.920 --> 32:09.840
basically takes an input, multiplies it by a constant, and adds a constant. And here's

32:09.840 --> 32:16.480
a note: usually in the literature, when people speak about SAXPY, they mean that the Y here is a vector.

32:16.480 --> 32:22.640
I want to have a constant, because I want to have just one input and one output. So this is just

32:22.640 --> 32:29.840
like multiply-by-constant, but you also add another constant; it's not particularly relevant for SDR in general.
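The kernel just described — multiply each sample by a constant and add a constant — can be sketched in Python for clarity (a toy stand-in only; the real implementation discussed later is hand-written NEON assembly, and the function name here is hypothetical):

```python
def saxpy_const(x, a, b):
    """y[i] = a * x[i] + b.

    Unlike textbook SAXPY (y = a*x + y, where y is a vector), both a and b
    are scalar constants here, so the block has exactly one input stream and
    one output stream, like a multiply-const block with an extra addition.
    """
    return [a * xi + b for xi in x]
```

With `a = 2.0` and `b = 1.0`, the input `[1.0, 2.0, 3.0]` maps to `[3.0, 5.0, 7.0]`.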

32:30.720 --> 32:35.120
I thought it was just more interesting than a simple multiply-by-constant. And there is also

32:35.120 --> 32:43.120
this fancy FMA operation on many CPUs. So here is your FMA operation. Unfortunately, on ARM,

32:43.840 --> 32:49.920
FMA is not very good for implementing this, because the A doesn't mean add, it means accumulate. And that

32:49.920 --> 32:55.440
means that the B goes in the same register where you put your result, and that's no good for implementing

32:55.440 --> 33:01.440
this operation. But anyhow, the family of flowgraphs starts with a null source, which is a real

33:01.440 --> 33:06.720
null source, not like the one in GNU Radio. It doesn't do anything. It doesn't even write the output;

33:06.720 --> 33:13.120
it just pretends it has produced some output immediately. Then we have a chain of SAXPY kernels,

33:13.120 --> 33:19.120
and we will play with how many. And finally a benchmark sink, which is just counting how many samples
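The flowgraph family just described can be sketched as follows (an illustrative Python toy, not any of the benchmarked runtimes; the kernel and function names are hypothetical): a null source feeding a chain of SAXPY kernels into a benchmark sink that measures the sample rate.

```python
import time

def saxpy(x, a=2.0, b=1.0):
    # Toy stand-in for the hand-written NEON kernel: y[i] = a*x[i] + b
    return [a * xi + b for xi in x]

def run_flowgraph(n_kernels, buf_len, n_iters):
    """Null source -> chain of n_kernels SAXPY kernels -> benchmark sink.

    Returns the measured rate in samples per second."""
    buf = [0.0] * buf_len              # null source: a buffer it pretends to produce
    start = time.perf_counter()
    total = 0
    for _ in range(n_iters):
        data = buf
        for _ in range(n_kernels):     # the chain of math kernels
            data = saxpy(data)
        total += len(data)             # benchmark sink: count samples
    elapsed = time.perf_counter() - start
    return total / elapsed
```

In the real benchmarks the interesting knobs are exactly these two: how many kernels are chained and how large the buffers are.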

33:19.120 --> 33:29.120
and calculating the sample rate. SAXPY kernel implementation: I said before, do this the best

33:29.120 --> 33:35.840
way you can. And the best way I can is hand-written assembly, using NEON SIMD, for this particular

33:35.840 --> 33:43.840
CPU. Knowing the exact limitations the hardware has, I know I can do as well as one floating point

33:43.840 --> 33:50.080
number per clock cycle. And that's two floating point operations, because you have the multiplication

33:50.080 --> 33:57.680
and you have the addition. And I have a list of limitations of this CPU here. If you're interested,

33:57.680 --> 34:03.680
I can go into more detail about why this is the best you can do with the hardware. For comparison,

34:03.680 --> 34:10.800
an optimal memory copy is only slightly faster. It's 1.33 floats per clock cycle. And the reason

34:10.800 --> 34:16.880
for that is basically that on this CPU you have a 64-bit load path and a 128-bit store path;

34:16.880 --> 34:23.600
it's kind of asymmetric for a weird reason. And if you want to memcpy, you need to go through those paths,

34:23.600 --> 34:29.440
and you cannot store and load at the same time. That's where the memcpy performance comes from.
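The 1.33 floats-per-cycle figure follows from a back-of-the-envelope model of those paths (my reading of the talk's argument, not official ARM numbers): moving one 16-byte NEON vector needs two cycles on the 64-bit load path plus one cycle on the 128-bit store path, and loads and stores cannot issue in the same cycle.

```python
# Memcpy throughput estimate for the Cortex-A53, per the asymmetric-path model.
BYTES_PER_FLOAT = 4
vector_bytes = 16                       # one 128-bit NEON vector = 4 floats
load_cycles = vector_bytes / 8          # 64-bit (8-byte) load path -> 2 cycles
store_cycles = vector_bytes / 16        # 128-bit (16-byte) store path -> 1 cycle
floats_per_cycle = (vector_bytes / BYTES_PER_FLOAT) / (load_cycles + store_cycles)
print(round(floats_per_cycle, 2))       # 4 floats per 3 cycles ~ 1.33
```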

34:30.880 --> 34:36.000
And here maybe another note: I've seen people using memcpy to benchmark SDR runtimes,

34:36.960 --> 34:42.800
like having memory copy blocks. And I don't think that's a good idea, for two reasons. First,

34:42.800 --> 34:48.000
you do not know exactly how your C library does a memcpy. Maybe I move to a different

34:48.400 --> 34:54.800
distribution, it has a slightly different memcpy, and it does something else. So that's one reason.

34:54.800 --> 35:00.560
The second reason is, I keep saying in this talk, let's do things in place, and a memcpy in place doesn't

35:00.560 --> 35:07.280
make sense. What else? This hand-written NEON assembly is twice as fast as the code you'd get

35:07.280 --> 35:14.560
from GCC, clang, the Rust compiler, everything, if you write this. And the reason for that is,

35:15.360 --> 35:22.640
well, LLVM has a nice tool which is called llvm-mca. It will show you how your machine code

35:22.640 --> 35:28.240
executes through the CPU pipeline, how many cycles it takes to run, et cetera. And that's

35:28.400 --> 35:34.560
the same kind of model LLVM uses to generate efficient code for you. But if you look at

35:34.560 --> 35:43.120
this for the A53, the results are completely wrong in llvm-mca. Really, llvm-mca has no clue how this

35:44.560 --> 35:51.600
CPU executes code. And the main reason for that is the documentation about how this CPU runs code

35:51.600 --> 35:57.360
is not public. The things that people know are because they've read the errata or because they've

35:57.440 --> 36:04.480
just done reverse engineering. And there's like some folklore knowledge of how some

36:04.480 --> 36:09.680
instructions can dual-issue and some of them cannot. So that's the kind of thing you need.

36:11.520 --> 36:17.200
This is to give you an idea of what the kernel looks like. So it's a whole bunch of code. And this is

36:17.200 --> 36:24.080
not just because the loop is unrolled. There is a branch here which goes somewhere there. So the main loop

36:24.080 --> 36:28.960
is like half of this code. And you need to have all of this to actually be that fast.

36:29.840 --> 36:35.280
Some other nice details: we have these prefetch memory instructions which are free to call,

36:35.280 --> 36:40.320
because they dual-issue with the preceding instruction. And that makes it more likely that the

36:40.320 --> 36:47.360
data we need is in L1 cache by the time we need it. So we want to benchmark this. And

36:47.360 --> 36:52.720
the way to benchmark it is, we call this function many times on the same buffer of a fixed size. And

36:52.800 --> 37:01.360
we measure how many samples per second we get. And as I said, this is a 1.33 GHz CPU. The thing
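The benchmark loop just described — call the kernel many times over one fixed-size buffer and measure the rate — looks roughly like this (a Python sketch for illustration; the real benchmark calls the NEON kernel, and these function names are hypothetical):

```python
import time

def saxpy_inplace(buf, a=2.0, b=1.0):
    # In-place toy kernel: buf[i] = a*buf[i] + b
    for i in range(len(buf)):
        buf[i] = a * buf[i] + b

def bench(buf_len, repeats=100):
    """Run the kernel `repeats` times over one fixed-size buffer and
    return the measured rate in samples per second."""
    buf = [0.0] * buf_len
    start = time.perf_counter()
    for _ in range(repeats):
        saxpy_inplace(buf)
    elapsed = time.perf_counter() - start
    return buf_len * repeats / elapsed

# Sweeping buf_len and plotting bench(buf_len) is what reveals the cache
# hierarchy: throughput drops past the 32 KiB L1 and again past the shared L2.
```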

37:01.360 --> 37:08.880
ideally runs at one floating point sample per cycle. So my absolute maximum, which is the top of this

37:08.880 --> 37:15.920
plot and all the ones in the rest of the talk, is 1.33 gigasamples per second. That's

37:15.920 --> 37:21.760
the maximum hardware limit. And you can see that when the buffer is small there is overhead and I'm getting

37:22.400 --> 37:28.240
less performance than that. When the buffer increases, then I almost reach it and then there is a drop

37:28.240 --> 37:33.600
and then another drop. And I can ask you, what's the cache size on this machine? And it's

37:33.600 --> 37:40.880
pretty clear from the graph, right? It has an L1 cache which is 32 kilobytes per CPU, and then there is

37:40.880 --> 37:46.640
an L2 cache which is shared by all the cores, the four cores in the system. And basically you can see

37:46.640 --> 37:51.200
when you're running in L1 cache, you're in here; L2 cache, you're in here. And DDR, which is painfully

37:51.200 --> 38:00.640
slow, you are in there. So, some quick comments about SDR runtime performance. We want to spend most

38:00.640 --> 38:05.600
of the time running the work functions because that is where our problem gets solved. The rest is

38:05.600 --> 38:13.120
just deciding what is going to run, copying data from point A to point B, etc. SDR runtime performance

38:13.120 --> 38:18.720
depends on the number of work calls per second because if you are doing many work calls, you need

38:18.720 --> 38:24.880
to schedule many things per second and you're going to have a lot of overhead. So a naive idea would

38:24.880 --> 38:31.760
be: I can reduce the number of work calls if I use larger buffers. But no, because you need to

38:31.760 --> 38:37.520
stay within L1 cache, otherwise you're going to be painfully slow as you saw. So you definitely

38:37.520 --> 38:45.440
do not want to go to DDR. The worst examples you can benchmark are really simple math

38:45.520 --> 38:53.120
kernels which can run really fast out of L1 cache, so basically your work function is very fast

38:53.120 --> 38:58.480
and you are going to need to call it many, many times per second. For example, with this SAXPY

38:58.480 --> 39:07.120
kernel, if you do a 16 kilobyte buffer, which is half the L1 cache, that's just 4096 floats.

39:07.520 --> 39:13.120
So it takes three microseconds per work call. If you think about this, and this is an embedded system:
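The three-microsecond figure is just arithmetic on the numbers already given (half the 32 KiB L1 cache, 4-byte floats, one sample per cycle at 1.33 GHz):

```python
# Arithmetic behind the "three microseconds per work call" figure.
L1_BYTES = 32 * 1024
buf_bytes = L1_BYTES // 2            # use half the L1 cache for the buffer
floats = buf_bytes // 4              # 4-byte floats -> samples per work call
clock_hz = 1.33e9                    # Cortex-A53 clock, ~1 sample per cycle
seconds_per_call = floats / clock_hz
print(floats, round(seconds_per_call * 1e6, 2))  # 4096 samples, ~3.08 us
```

Any syscall or scheduler interaction costing tens of microseconds therefore dwarfs the useful work in each call.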

39:13.200 --> 39:20.000
if you ever call the Linux kernel to sleep or to do something for you, that's a huge overhead

39:20.000 --> 39:29.440
compared with three microseconds. So I have some figures in here and maybe I'll go over them quickly. So what I'm doing

39:29.440 --> 39:37.840
here is I'm basically running the kernel on multiple CPUs. So on two CPUs, the one on the left,

39:37.840 --> 39:44.720
three CPUs, the one in the middle, four CPUs, the one on the right. And the thing that I see

39:44.720 --> 39:50.560
is a reduction in performance as I increase the number of CPUs and this is because of communication

39:50.560 --> 39:56.880
between CPUs because the data which is in L1 cache of one CPU needs to be read by another CPU.

39:57.600 --> 40:02.720
And it's quite tricky how it gets there. There's not enough documentation from

40:03.440 --> 40:09.360
ARM for us to be able to know what happens on a cycle-by-cycle basis, but basically this is

40:09.360 --> 40:15.200
the ballpark. There is a price to pay for CPU-to-CPU communication.

40:17.760 --> 40:25.760
This I probably also want to gloss over, because I can explain it in the context of the SDR runtimes.

40:25.760 --> 40:31.600
So let's benchmark the SDR runtimes. First I want to do a single kernel, single core.

40:32.160 --> 40:37.920
Very simple. So the flowgraph is null source, SAXPY, benchmark sink. And I use the simplest way

40:37.920 --> 40:43.840
I can to run everything on the same CPU; that depends on the SDR framework. I can give you

40:43.840 --> 40:50.160
the details later if you are interested. And this is the sample rate I get. Again,

40:50.160 --> 40:57.200
the maximum possible is the top of the graph. FutureSDR is here. It's doing 50%.

40:57.200 --> 41:06.320
And maybe the reason is that FutureSDR is using regular Rust channels. And for qsdr I really

41:06.320 --> 41:12.000
needed to write custom Rust channels, like high-performance channels, because the ones which are normally

41:12.000 --> 41:19.120
used are not fast enough. So maybe FutureSDR would benefit from using the same kind of optimized

41:19.120 --> 41:25.440
channels. We have GNU Radio 3.10 here. It's awful. I don't know why it's awful with this problem,

41:25.440 --> 41:32.160
but if you look at this, it's running the 3 threads on the same CPU. This one is supposed to do

41:32.160 --> 41:39.120
all the math. These are supposed to do nothing. The 3 threads are each using 33% of the CPU. That tells

41:39.120 --> 41:45.040
me there is a huge overhead in everything besides my work function. And then we have GNU

41:45.040 --> 41:50.080
Radio 4.0, which is doing pretty well. And qsdr, which is also doing pretty well.

41:50.320 --> 41:56.160
In this one slightly less than GNU Radio 4.0, but we will see cases in which it does better.

41:57.200 --> 42:03.920
So now it's the same kind of problem, but I can probably explain it directly with 4 CPU cores.

42:04.960 --> 42:11.280
So what I do is I have 4 CPU cores and I'm going to run the null source, several

42:11.360 --> 42:20.720
SAXPY kernels, and then the benchmark sink. And the way I can do this is I can pin my SAXPY kernels

42:20.720 --> 42:28.160
to fixed CPUs. So for example, if I only have 4 kernels, I pin each of them to a different CPU.

42:28.160 --> 42:35.040
If I have 5, then I need 2 of them to go on 1 CPU and the rest can go on their own CPU.

42:35.040 --> 42:41.040
That's cutting my performance by a factor of 2, because now my bottleneck is I have 2 math kernels

42:41.040 --> 42:48.160
sharing the same CPU. But that's like a simple thing to do. And you can see that the maximum

42:48.160 --> 42:54.320
performance you could achieve with that is this line here. Basically you are cutting it by a

42:54.320 --> 43:01.520
factor of 2 when 2 of them need to share a CPU, and then this is when 3 of them need to share

43:01.520 --> 43:08.640
at least 1 CPU. So you have this staircase pattern. There's something else, which is a work-stealing

43:08.960 --> 43:18.320
runtime. And that means it's basically figuring out on which CPU tasks can run,

43:18.320 --> 43:27.200
depending on usage. And then if I for example have 5 kernels to run on 4 CPUs, rather than always

43:27.200 --> 43:32.960
running 2 kernels on the same CPU, it can sort of round-robin. And so basically the performance is

43:32.960 --> 43:39.840
like 4 divided by the number of kernels you have. So it goes like this.
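The two scaling curves just described — the staircase from static pinning and the smooth 4/n curve from work stealing — can be written down as a simple model (my formalization of the argument, with hypothetical function names; relative throughput of 1.0 means one kernel running at full speed per core):

```python
import math

def pinned_throughput(n_kernels, n_cpus=4):
    """Static pinning: the bottleneck is the CPU holding the most kernels,
    giving the staircase 1 / ceil(n_kernels / n_cpus)."""
    return 1.0 / math.ceil(n_kernels / n_cpus)

def stealing_throughput(n_kernels, n_cpus=4):
    """Idealized work stealing: load spreads evenly, so relative throughput
    is n_cpus / n_kernels once there are more kernels than CPUs."""
    return min(1.0, n_cpus / n_kernels)
```

For 5 kernels on 4 cores, pinning drops straight to 0.5 while ideal work stealing only drops to 0.8, which is why the staircase sits below the 4/n curve between the steps.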

43:40.720 --> 43:48.960
And then, what are the things we have in here? So basically, this light blue is my hand implementation

43:48.960 --> 43:56.640
of things. This pink is qsdr with the custom scheduler, the really simple sequential

43:56.640 --> 44:05.920
thing we saw, one on each CPU core. This gray thing is qsdr with the async executor.

44:05.920 --> 44:13.680
This is like a regular async executor from Rust that people use to run any kind of async code.

44:13.680 --> 44:22.400
So it can be used as a runtime or a scheduler. And then we have GNU Radio 4.0 over here.

44:22.400 --> 44:27.280
This is basically a custom scheduler I wrote, which is supposed to be doing the same as the

44:27.280 --> 44:33.600
custom schedulers in qsdr and FutureSDR, and also the simple scheduler you get by default.

44:33.600 --> 44:40.160
And this has quite poor performance if you compare it, for example, to 2 CPU cores, where it

44:40.160 --> 44:47.840
does better, or even 1 CPU core, where it does quite comparably. Yeah, it's the

44:48.800 --> 44:54.400
brown line here. So that's comparable to qsdr on one CPU core. On four cores I don't

44:54.400 --> 45:02.800
know why it's performing quite a bit worse. And then we have FutureSDR somewhere in here; it's the blue

45:02.800 --> 45:11.200
and orange. It's always like half the performance of GNU Radio 4.0. And GNU Radio 3.10 basically

45:11.280 --> 45:16.080
performs the same no matter what kind of flowgraph I give it; within these constraints it is always

45:18.080 --> 45:27.200
this low. And you can see no difference. And that's basically it about benchmarks, and I'm running

45:27.200 --> 45:34.640
out of time. So just to say, as these graphs show, there's a lot of room for improvement in

45:34.640 --> 45:41.440
SDR runtime performance. And who knows where this qsdr project might

45:41.440 --> 45:46.080
develop; it might stay an experiment or it might go further. I don't know.

45:53.520 --> 45:57.920
I think everyone agrees that we can take one, two, three questions.

45:58.880 --> 46:06.000
Yeah, not a question, just some feedback. I built a custom framework and I kind of

46:06.000 --> 46:12.080
cut it down for our resources. But the approach is very similar, just with buffers recycling

46:12.080 --> 46:17.600
the blocks. My personal experience is that sometimes the benchmarks differ from reality.

46:18.240 --> 46:24.240
Because if you use the same buffer size, in fact, the memory behavior can be quite variable. Yeah, it depends

46:25.200 --> 46:31.120
on that. And the second point is, in my case I need multi-channel blocks,

46:31.920 --> 46:37.920
for multiple antennas. So you have an SDR providing multiple synchronized channels.

46:37.920 --> 46:45.040
Yeah. And then one key point is how you manage synchronized channels. And

46:45.600 --> 46:51.680
I came to the view that it was more efficient to put in the same buffer the blocks from the two

46:51.840 --> 46:58.320
channels. Yeah, you can have like a tensor; in this case it would be like two by N samples.

46:58.320 --> 47:04.000
Yeah, because then your blocks, like a DOA or stuff like this, don't need to consume

47:04.000 --> 47:10.080
from two streams. They'll see the two channels, which are synchronized. Yeah, just feedback. Yeah, thanks.

47:11.040 --> 47:12.640
There were some hands in there.

47:12.640 --> 47:24.480
Yeah, so the question was how using external accelerators affects all of these? And yes, I haven't mentioned it,

47:24.480 --> 47:33.440
but I always had like the FPGA accelerator use case

47:33.440 --> 47:40.080
in mind.

47:40.080 --> 47:48.160
When you are using an FPGA, you might be using packets over an

47:48.160 --> 47:53.840
AXI-Stream bus. And that's also where this whole idea comes from. Things can be packets on your

47:53.840 --> 47:59.920
software side. Then they get transferred to the FPGA by DMA or maybe as UDP packets, it depends.

47:59.920 --> 48:07.120
And then they are AXI-Stream packets on your FPGA. And on a GPU, you have like a device-to-host

48:07.200 --> 48:18.800
copy; it's DMA-ing into some buffer. Other questions? Johannes? Yeah, so if you look at the flow

48:18.800 --> 48:25.200
graph, like the first few stages, before the synchronization, those are mainly the ones where you

48:25.200 --> 48:31.440
have to look at this continuous streaming case, basically like your example here.

48:31.520 --> 48:37.920
But usually in digital communications, you synchronize and then everything is packets anyway.

48:37.920 --> 48:45.120
So is that more of saying, hey, we need to look at different parts of the flowgraph differently?

48:46.800 --> 48:52.320
Yeah, so the question was, when you have a communications receiver, at some point it's a continuous

48:52.320 --> 49:01.520
stream of samples, but then you kind of have packets. And my answer is, yes, so at some point,

49:01.520 --> 49:06.640
you really need to use packets, because that's the kind of thing you are processing.

49:06.640 --> 49:12.800
But also in the beginning, where it's a continuous stream of samples, maybe if you chunk it up,

49:12.800 --> 49:18.880
you are like defining what's the buffer size you want to work with, because for high

49:18.880 --> 49:23.520
performance, maybe you've benchmarked this or have it as a parameter you can tune,

49:23.520 --> 49:28.720
so it can also make sense. For example, I don't know, an OFDM receiver: it's going to have

49:29.440 --> 49:35.920
a fixed FFT size, so maybe you do things in blocks that come out nicely when you get to the FFT.

49:39.440 --> 49:46.160
Yep. You mentioned the bit about the buffer and the number of floats you can process —

49:46.160 --> 49:51.200
was it a memory constraint or a computational constraint?

49:53.040 --> 49:53.840
In here you mean?

49:57.440 --> 50:04.800
No, the one with the memory benchmarks — yeah, this one. So you mentioned that the buffer fits the cache

50:04.800 --> 50:12.880
and is easiest to deal with. So, was the core algorithm for the whole flowgraph constrained by

50:12.880 --> 50:19.440
how much you can read from memory, or how fast you can compute? Yeah, it could be

50:19.440 --> 50:28.160
quite different. Yeah, so the question is about whether compute or reading from memory

50:28.160 --> 50:37.840
is the limitation. For this particular math operation, both reading the data into the CPU registers

50:37.840 --> 50:46.240
and computing set the limit, and that's what gives you the maximum of one float per

50:46.240 --> 50:52.400
cycle, provided the CPU can read the data in one cycle because it's in L1 cache. If you

50:52.400 --> 50:57.760
miss the L1 cache, you're going to be stalling the CPU, because this is an in-order

50:57.760 --> 51:04.960
execution CPU, for many cycles, and that's where these performance drops come from. So in this

51:04.960 --> 51:18.000
part of the graph, you are memory bandwidth constrained. Other questions? Okay, thank you.

