WEBVTT

00:00.000 --> 00:19.000
Okay, I guess we're good now. So again, good morning, everyone. And I'm going to be talking

00:19.000 --> 00:29.000
about, as I obviously just said, RISC-V optimizations in the FFmpeg project. So this is going

00:29.000 --> 00:37.000
to be a little bit of a story of how we did that work, and then I'll talk a little bit about

00:37.000 --> 00:43.000
what has been done, so you can make your own assessment of it. And then also a little

00:43.000 --> 00:51.000
bit of feedback, kind of a post-mortem, although it's not finished: the experience

00:51.000 --> 00:58.000
from this activity so far. This is only my personal opinion, and not necessarily that of my employer,

00:58.000 --> 01:03.000
thank you. And if I speak too fast, or if I don't articulate enough, do interrupt me.

01:03.000 --> 01:08.000
Otherwise, please keep your questions until the end of the talk. So, who am I? I've been attending

01:08.000 --> 01:15.000
FOSDEM for more than 20 years now, and this is my second presentation. I do open source

01:15.000 --> 01:20.000
mostly in my free time. I am a software engineer, and I am the RISC-V maintainer in

01:20.000 --> 01:27.000
FFmpeg, though I mostly don't do that as part of my employment. Not formally, but in practice, I'm the biggest

01:27.000 --> 01:38.000
contributor to the RISC-V code in FFmpeg. So, what I'll do first is a little bit of history,

01:38.000 --> 01:44.000
because this has been going on for a while now. So, until 2022, there wasn't really

01:44.000 --> 01:49.000
anything. I think there was one patch, on OpenBSD, to just allow it to compile, and

01:49.000 --> 01:53.000
on RISC-V it was just plain C code, because the build system insisted on knowing the architecture,

01:53.000 --> 01:59.000
but that was basically it. Then, at the end of 2022, just the end of COVID,

01:59.000 --> 02:06.000
that's when I started: I got a board from StarFive Tech, and I started, well,

02:06.000 --> 02:11.000
first things first, with the build automation, and also writing some simple

02:11.000 --> 02:19.000
optimizations; I mean, writing some simple optimizations using Spike, because at the time

02:19.000 --> 02:27.000
QEMU wasn't ready, and the hardware I had obviously didn't support vectors yet.

02:27.000 --> 02:35.000
I think the room is getting full, so please come in. Anyway, the point is that, at the time,

02:35.000 --> 02:38.000
normally, in FFmpeg, we don't accept optimizations this early in the process; we only

02:38.000 --> 02:42.000
target real hardware, and at that time there was RISC-V hardware, obviously, but

02:42.000 --> 02:48.000
there was no vector hardware. And anyway, it was quite tedious to use Spike.

02:48.000 --> 02:52.000
In the same year, later on, I actually started doing some proper

02:52.000 --> 02:57.000
vector optimizations, not just scalar ones, because obviously that's where it's much more relevant,

02:57.000 --> 03:00.000
in the inner loops; that's what you can optimize with the vector units. And

03:00.000 --> 03:04.000
QEMU finally got good vector support, so we could actually use more

03:04.000 --> 03:10.000
proper emulation than Spike, really. Then we got the first hardware with the

03:10.000 --> 03:15.000
B extension, and I implemented the checkasm tester, but we will see later on what

03:15.000 --> 03:22.000
checkasm actually is. From there, I started posting patches, and it carried on.

03:22.000 --> 03:27.000
Then came the second generation, the draft vector hardware. I think

03:27.000 --> 03:32.000
some of you have probably heard about this problem with the incompatible draft

03:32.000 --> 03:35.000
edition of the vector specification, which was implemented in some

03:35.000 --> 03:40.000
hardware. While it was not really compatible, we were able to

03:40.000 --> 03:43.000
backport the code, or I was able to backport the code, with some hacks, and

03:43.000 --> 03:48.000
at least get some benchmarks on real hardware, because it was still

03:48.000 --> 03:52.000
similar to what would be the final spec. And then, finally, came

03:52.000 --> 04:00.000
the first hardware for the ratified spec, the small K230 camera board, and we started getting

04:00.000 --> 04:05.000
patches from China. A few years in, we started getting money;

04:05.000 --> 04:09.000
that's when funding for RISC-V optimization work started.

04:09.000 --> 04:13.000
Unfortunately, that's now more or less finished, and there's no money

04:13.000 --> 04:20.000
anymore, anyway. Then we got the X60 board, on the Banana Pi, which is the

04:20.000 --> 04:25.000
single-board computer most people would be using nowadays if they want to do vector

04:25.000 --> 04:29.000
optimizations; or, generally speaking, a very popular RISC-V board, I guess.

04:29.000 --> 04:32.000
And that's when we started having the problem with the VLMAX, which

04:32.000 --> 04:40.000
we will get to later. And then, last year, we got the X280 from

04:40.000 --> 04:44.000
SiFive, which was the first 512-bit vector unit. It's a bit expensive;

04:44.000 --> 04:47.000
obviously, I don't have one, so I don't have any benchmarks. So all the

04:47.000 --> 04:50.000
benchmark comments I'm going to make in the rest of the talk are

04:50.000 --> 04:57.000
for the C908 and the X60; they're not criticizing SiFive,

04:57.000 --> 05:02.000
which is simply hardware that I just don't have experience with.

05:02.000 --> 05:05.000
So the development, honestly, I have to say,

05:05.000 --> 05:08.000
has slowed down since last year, and it's still slow.

05:08.000 --> 05:12.000
One contributor from China was reassigned.

05:12.000 --> 05:16.000
Another contributor to the project, not from China, quit.

05:16.000 --> 05:21.000
He got a job, so that's less free time, actually.

05:21.000 --> 05:25.000
And, as you can guess, nobody really likes to do this,

05:25.000 --> 05:30.000
especially for free, and especially not when it's long stretches of

05:30.000 --> 05:35.000
assembler; and we'll see why it's assembler and not something

05:35.000 --> 05:39.000
else. So, how do you do it? As you can guess,

05:39.000 --> 05:42.000
and this is an important thing, what I recommend at this point is

05:42.000 --> 05:46.000
cross-compiling, because the boards are improving, but they're still

05:46.000 --> 05:50.000
slower, and they still have slow storage, so you're

05:50.000 --> 05:54.000
better off working on an x86 desktop. And then, to run it,

05:54.000 --> 05:57.000
you use QEMU, with the RVV 1.0 support: you can

05:57.000 --> 06:00.000
pretty much run FFmpeg, which is a user-space

06:00.000 --> 06:06.000
program, under QEMU on Linux, so that's very easy. With that in

06:06.000 --> 06:11.000
mind, you can just download, configure, compile, and run the test

06:11.000 --> 06:14.000
suite with it. You can also do that natively, but it's

06:14.000 --> 06:18.000
going to take maybe one hour. So, the workflow: you

06:18.000 --> 06:21.000
find one function that hasn't been optimized yet, obviously;

06:21.000 --> 06:25.000
then you refactor the C code, so that it looks like

06:25.000 --> 06:28.000
a loop that is easy to optimize in assembler; then you write

06:28.000 --> 06:33.000
the assembler; then you compile; then you test; then it doesn't work.

06:33.000 --> 06:38.000
Then you debug it, and you recompile, and you restart the

06:38.000 --> 06:40.000
process. And when you're finally done, you can go and

06:40.000 --> 06:46.000
boast about your benchmarks on the mailing list.
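
To make the refactoring step concrete, here is a hypothetical C sketch (a made-up function, not an actual FFmpeg one) of the flat, element-wise loop shape that translates directly into a vector strip-mining loop in assembler:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example of a vector-friendly loop: flat, element-wise,
 * with no dependency between iterations, so each chunk of elements can
 * be handled by one pass of a strip-mining loop in assembler. */
static void add_clip_u8(uint8_t *dst, const uint8_t *a,
                        const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int sum = a[i] + b[i];
        dst[i] = sum > 255 ? 255 : (uint8_t)sum; /* saturate, don't wrap */
    }
}
```

Once the C version looks like this, the assembler is mostly a mechanical translation, and the C version doubles as the reference for testing.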

06:46.000 --> 06:51.000
What you do not use is intrinsics, and we will explain why; nor inline

06:51.000 --> 06:54.000
assembler, because you can't express the clobbers

06:54.000 --> 06:57.000
correctly in RISC-V. So, absolutely, you cannot use

06:57.000 --> 07:00.000
inline assembler for vector optimizations. You can use inline

07:00.000 --> 07:04.000
assembler for normal RISC-V, but not for vectors.

07:04.000 --> 07:09.000
Optimizing functions without test cases: that's a bad idea.

07:09.000 --> 07:13.000
Specializing for the vector length: this is controversial, but

07:13.000 --> 07:19.000
we'll get to it. So, do use the whole instruction set,

07:19.000 --> 07:23.000
for instance the masking instructions; it helps a lot. Obviously,

07:23.000 --> 07:27.000
because we don't have a clear performance profile for all the CPUs that

07:27.000 --> 07:30.000
might exist in the future, from many different vendors, just try to

07:30.000 --> 07:33.000
avoid data dependencies between consecutive instructions, so

07:33.000 --> 07:35.000
that the CPU pipeline isn't stalled, especially

07:35.000 --> 07:39.000
on in-order execution CPUs, and do check benchmarks on real

07:39.000 --> 07:43.000
hardware whenever possible. So, why do we use

07:43.000 --> 07:47.000
assembler and not intrinsics? Partly, it's historical:

07:47.000 --> 07:51.000
on x86, when they actually started, they

07:51.000 --> 07:55.000
had bad experience with intrinsics. There are general

07:55.000 --> 08:00.000
problems that affect all platforms, including RISC-V:

08:00.000 --> 08:03.000
you can't really prevent the compiler from

08:03.000 --> 08:07.000
spilling vectors, or registers in general,

08:07.000 --> 08:09.000
at the wrong time, and breaking the

08:09.000 --> 08:12.000
optimization, making it slower, rather. So, you don't have

08:12.000 --> 08:14.000
fine enough control. Of course, it's easier, because

08:14.000 --> 08:16.000
you don't have to worry about allocating your registers: you can name your

08:16.000 --> 08:18.000
variables and let the compiler do the register

08:18.000 --> 08:21.000
allocation. But the resulting code is usually not

08:21.000 --> 08:24.000
stable, and sometimes it's just slower. Also, you can't

08:24.000 --> 08:28.000
do run-time detection, because you need to enable

08:28.000 --> 08:32.000
the target extension, for vectors, in this case,

08:32.000 --> 08:35.000
statically, in your binary. This means your binary

08:35.000 --> 08:38.000
is compiled with RISC-V vectors enabled, and you won't

08:38.000 --> 08:40.000
be able to run your binary on non-vector-capable hardware with

08:40.000 --> 08:43.000
run-time detection. So, we also don't like that, because

08:43.000 --> 08:46.000
FFmpeg wants to have the same binary for both vector and

08:46.000 --> 08:49.000
non-vector CPUs. And in RISC-V vectors,

08:49.000 --> 08:52.000
specifically, you have LMUL, which is

08:52.000 --> 08:55.000
the group multiplier, where you can adjust

08:55.000 --> 08:58.000
the size of the registers, depending on the register pressure of

08:58.000 --> 09:01.000
your function. Which means you need to have really

09:01.000 --> 09:05.000
fine-grained control and understanding of the register pressure in the

09:05.000 --> 09:08.000
function you're currently optimizing. And you cannot do that if

09:08.000 --> 09:11.000
the compiler does the register allocation for you. So, do not

09:11.000 --> 09:14.000
use intrinsics. I mean, you'll see people advocating them, who will say

09:14.000 --> 09:16.000
otherwise if you ask, but the major multimedia projects

09:16.000 --> 09:19.000
recommend against using intrinsics. Checkasm:

09:19.000 --> 09:26.000
well, it's a bit more complex than that. So, it's specifically designed to

09:26.000 --> 09:31.000
test assembler optimizations. So, it will

09:31.000 --> 09:34.000
enable each optional feature the CPU has, as defined by the

09:34.000 --> 09:38.000
project. It then gives random inputs, and compares the assembler

09:38.000 --> 09:41.000
against the reference C code; that's how it does the

09:41.000 --> 09:44.000
testing. It's not just running things and checking the results by eye. It

09:44.000 --> 09:47.000
also has all sorts of micro-benchmarks, and it can

09:47.000 --> 09:50.000
test ABI conformance, which is not something you would normally

09:50.000 --> 09:53.000
test, because the C compiler takes care of it, but if you

09:53.000 --> 09:55.000
write assembler, you have to worry about it yourself.
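
In spirit, the per-function check boils down to something like this. This is a heavily simplified sketch of the idea, not the actual checkasm code; the real harness also benchmarks and verifies the calling convention:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified sketch of the checkasm idea: run the reference C function
 * and the supposedly equivalent optimized one on the same random input,
 * then compare the outputs byte for byte. */
typedef void (*double_fn)(uint8_t *dst, const uint8_t *src, size_t n);

static void ref_double(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(src[i] * 2);
}

/* Stand-in for a hand-written assembler version of the same function. */
static void opt_double(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)(src[i] << 1);
}

static int check_equivalent(double_fn ref, double_fn opt, size_t n)
{
    uint8_t src[256], out_ref[256], out_opt[256];
    for (size_t i = 0; i < n; i++)
        src[i] = (uint8_t)rand();    /* random input vector */
    ref(out_ref, src, n);
    opt(out_opt, src, n);
    return memcmp(out_ref, out_opt, n) == 0; /* 1 if they match */
}
```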

09:55.000 --> 09:59.000
So, this comes from x264, but it has been copied into

09:59.000 --> 10:02.000
FFmpeg, and now it's also available separately, as

10:02.000 --> 10:06.000
a standalone project that you can take into use in

10:06.000 --> 10:10.000
other projects with similar issues; any computational project,

10:10.000 --> 10:14.000
really. So, I do recommend looking at it if you want to

10:14.000 --> 10:18.000
use it in some other project. Now, getting involved with

10:18.000 --> 10:22.000
FFmpeg specifically: I must warn you that there is a

10:22.000 --> 10:26.000
high barrier to entry at this point, because all the easy

10:26.000 --> 10:29.000
functions have been optimized, by me or by other people. So, what's

10:29.000 --> 10:31.000
left is either stuff that doesn't have test cases, and

10:31.000 --> 10:33.000
there, you're going to have to write the test case, and you're going

10:33.000 --> 10:36.000
to need to understand what you have to test; or

10:36.000 --> 10:39.000
it's going to be a complex function. So, if you're familiar with

10:39.000 --> 10:43.000
RISC-V vectors, do come and help us. If you're not, I'd

10:43.000 --> 10:47.000
recommend starting on easier projects first. There are

10:47.000 --> 10:51.000
also non-technical aspects. As I mentioned, code reviews are

10:51.000 --> 10:53.000
a bit slow, because I do all the review in my

10:53.000 --> 10:57.000
free time. And FFmpeg has a bit of a reputation

10:57.000 --> 10:59.000
for being a bit of a difficult community, for certain

10:59.000 --> 11:03.000
reasons that I will not get into, and that this community

11:03.000 --> 11:07.000
can do nothing about. So, there are probably plenty of

11:07.000 --> 11:10.000
other open source projects to help if you want to play with

11:10.000 --> 11:12.000
this. And, of course, you're welcome to join

11:12.000 --> 11:15.000
FFmpeg if you know what you're doing, but you've been

11:15.000 --> 11:22.000
warned. So, right. And, I mean, even in multimedia, there are

11:22.000 --> 11:26.000
quite a few projects that exist whose RISC-V

11:26.000 --> 11:31.000
support is completely unoptimized at the moment. So, the pain

11:31.000 --> 11:37.000
points. I will try to separate between what I think are

11:37.000 --> 11:41.000
implementation issues of the two CPU implementations that we have

11:41.000 --> 11:46.000
been working with, and what I see as kind of

11:46.000 --> 11:50.000
missing in the specification, as compared to the

11:50.000 --> 12:00.000
competition, that is, Neon and SVE, and also x86 AVX. So, the first

12:00.000 --> 12:05.000
problem, and that's an implementation problem, is

12:05.000 --> 12:09.000
what I call the VLMAX problem: the VL-versus-VLMAX issue.

12:09.000 --> 12:15.000
Basically, the question is: if the number of inputs, so,

12:15.000 --> 12:19.000
if the size of your useful input, is smaller than what fits in

12:19.000 --> 12:24.000
the vector, what is the execution time? What is the

12:24.000 --> 12:27.000
execution time for the CPU? So, specifically: if you

12:27.000 --> 12:31.000
need less than half of the vector for the calculation, because

12:31.000 --> 12:34.000
you have a vector that can fit 16 elements, and you only need

12:34.000 --> 12:38.000
to calculate on seven or eight elements, then you are

12:38.000 --> 12:42.000
wasting the second half of the vector. But the hardware is

12:42.000 --> 12:46.000
what you have; you can't really just change the vector size.

12:46.000 --> 12:50.000
And the spirit of the specification is that execution time should

12:50.000 --> 12:53.000
more or less scale with the number of actually useful elements you

12:53.000 --> 12:56.000
have in the vector, and that's the VL, the vector length.

12:56.000 --> 13:01.000
But both of the implementations that I have tested

13:01.000 --> 13:05.000
instead scale the execution time with VLMAX, which is the

13:05.000 --> 13:08.000
number of elements that your vector can fit. And this is a

13:08.000 --> 13:12.000
problem, because it means that, if you have an input

13:12.000 --> 13:17.000
that fits in 128 bits, and we have a lot of

13:17.000 --> 13:22.000
fixed-size inputs, and you optimize it for a 128-bit

13:22.000 --> 13:26.000
CPU, it works all right. But then you take the same function

13:26.000 --> 13:30.000
and you run it on a 256-bit vector CPU, and it

13:30.000 --> 13:33.000
runs at the same speed, even though you doubled the vector size,

13:33.000 --> 13:36.000
and, probably, also the vector computation units. And that's

13:36.000 --> 13:41.000
not really what you want.
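
As a toy model of why this matters (the cycle counts are illustrative assumptions, not measurements), count the cost of a small job when per-pass time scales with VL, the useful length, versus VLMAX, the register capacity:

```c
/* Toy model of the VL-versus-VLMAX issue (illustrative numbers only):
 * "cycles" for a strip-mined loop over job elements, on hardware whose
 * per-pass cost scales either with VL (the useful element count) or
 * with VLMAX (the full register capacity). */
static unsigned cycles(unsigned job, unsigned vlmax, int scale_with_vl)
{
    unsigned total = 0;
    while (job > 0) {
        unsigned vl = job < vlmax ? job : vlmax; /* elements this pass */
        total += scale_with_vl ? vl : vlmax;
        job -= vl;
    }
    return total;
}
/* cycles(8, 16, 1) gives 8, but cycles(8, 16, 0) gives 16, and
 * cycles(8, 32, 0) gives 32: the wider CPU handles the same small
 * job more slowly when time scales with VLMAX. */
```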

13:41.000 --> 13:44.000
So, there are two ways to fix this. One is to fix the implementation to actually scale according to

13:44.000 --> 13:47.000
the VL, which is the spirit of the specification. The

13:47.000 --> 13:51.000
other approach, which I really hate, is to just

13:51.000 --> 13:54.000
halve the vector size using the group multiplier, which

13:54.000 --> 13:57.000
then adds, well, complexity. If you're not

13:57.000 --> 14:00.000
familiar with the vector spec, this is going to be a bit obscure,

14:00.000 --> 14:06.000
but, essentially, you can choose. So, you normally have

14:06.000 --> 14:10.000
32 vectors in the specification, and they have a size equal to

14:10.000 --> 14:14.000
the stated bit width of the vector unit

14:14.000 --> 14:18.000
you have: so 128, 256, 512, or even

14:18.000 --> 14:23.000
1,024 at this point. But you can also use a

14:23.000 --> 14:26.000
multiplier of two, and in that case you will get double

14:26.000 --> 14:29.000
the vector size, but, of course, you will only get 16 vectors instead

14:29.000 --> 14:32.000
of 32, because you're not inventing bits out of

14:32.000 --> 14:35.000
thin air. Or you can use a multiplier of four, or you can

14:35.000 --> 14:38.000
use a multiplier of eight. And this is useful when you want to

14:38.000 --> 14:41.000
unroll. Because, for instance, if you do a memset or

14:41.000 --> 14:44.000
some similarly simple function, you only need four, or even fewer

14:44.000 --> 14:48.000
than four, vectors in your loop; you only need four variables,

14:48.000 --> 14:51.000
basically. Then you can use larger vectors, and your CPU will

14:51.000 --> 14:55.000
run faster, because you'll have a bit bigger bandwidth.

14:55.000 --> 14:58.000
Or you can also use a fractional multiplier, if the

14:58.000 --> 15:01.000
vector is too big and you want a smaller vector. And, normally,

15:01.000 --> 15:03.000
you should only do this if you are mixing element widths; like, you

15:03.000 --> 15:05.000
have 32-bit values, and at some point you

15:05.000 --> 15:07.000
narrow them down to 16-bit, and then you do something,

15:07.000 --> 15:10.000
and then you widen them back to 32-bit, and so on.

15:10.000 --> 15:14.000
Then you can use fractional sizes. In this case, you

15:14.000 --> 15:16.000
will shrink the vector size, but you will not shrink the

15:16.000 --> 15:18.000
number of vectors, because the instruction set has no space to

15:18.000 --> 15:21.000
describe more than 32 vectors. So, you have

15:21.000 --> 15:24.000
these seven different multiplier settings that you can use

15:24.000 --> 15:28.000
to scale the vector size.
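
The trade-off can be written down as a small helper. To avoid fractions, the multiplier is expressed in eighths (so 1 means LMUL=1/8 and 64 means LMUL=8), and a 128-bit vector unit is assumed for illustration:

```c
/* The seven LMUL (group multiplier) settings of the RISC-V vector
 * spec, expressed in eighths: 1, 2, 4 (fractional), 8 (LMUL=1),
 * 16, 32, 64. Grouping multiplies the effective vector width but
 * divides the number of usable register names; fractional settings
 * shrink the width yet keep all 32 names usable. */
struct lmul_info {
    int width_bits; /* effective bits per vector operand */
    int regs;       /* usable register names out of 32 */
};

static struct lmul_info lmul_info(int vlen_bits, int lmul_eighths)
{
    struct lmul_info info;
    info.width_bits = vlen_bits * lmul_eighths / 8;
    info.regs = lmul_eighths <= 8 ? 32 : 32 * 8 / lmul_eighths;
    return info;
}
```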

15:28.000 --> 15:31.000
And so, if the hardware scales with VLMAX, then you need to select, depending on the CPU's vector

15:31.000 --> 15:34.000
width, a different multiplier

15:34.000 --> 15:37.000
for your optimization to run at optimal speed.

15:37.000 --> 15:39.000
And this is no good, because now it means the optimization

15:39.000 --> 15:41.000
that you wrote for the 128-bit CPU, you have to

15:41.000 --> 15:44.000
rewrite for the 256-bit CPU, the X

15:44.000 --> 15:47.000
60, and then you would have to rewrite it for the

15:47.000 --> 15:50.000
512-bit CPU, of which there is already one model, and now

15:50.000 --> 15:52.000
you also have, as of yesterday, or the day before,

15:52.000 --> 15:55.000
to rewrite it for a 1,024-bit CPU, because

15:55.000 --> 15:59.000
SpacemiT has announced a 1,024-bit CPU. Now, that doesn't

15:59.000 --> 16:02.000
really work, especially for a portable open source project.

16:02.000 --> 16:06.000
And this is not the idea of having a scalable vector

16:06.000 --> 16:09.000
extension. So, I'm told that SiFive does this correctly,

16:09.000 --> 16:11.000
and they really scale according to the VL. I haven't

16:11.000 --> 16:13.000
been able to confirm this, because I don't have access

16:13.000 --> 16:16.000
to the hardware. But if you don't want to take my

16:16.000 --> 16:19.000
word for it, the specification authors have said

16:19.000 --> 16:21.000
the same thing on the official

16:21.000 --> 16:26.000
RISC-V vector mailing list. So, the

16:26.000 --> 16:29.000
spirit of the spec is to use the VL, and also, what we need, as

16:29.000 --> 16:31.000
open source developers, is that you use the VL. But,

16:31.000 --> 16:34.000
unfortunately, so far, these two implementations

16:34.000 --> 16:36.000
are using the VLMAX instead, and they're going to have

16:36.000 --> 16:39.000
problems. And it's not just multimedia. Because,

16:39.000 --> 16:44.000
think about, again, some simple example, like,

16:44.000 --> 16:47.000
let's say you have a memset or a

16:47.000 --> 16:50.000
memcpy optimized in your libc,

16:50.000 --> 16:55.000
and that could be any libc.

16:55.000 --> 16:59.000
Sorry, yes, it's the mic.

16:59.000 --> 17:02.000
So, and now let's say... So, a typical memcpy

17:02.000 --> 17:05.000
or memset function doesn't need many variables,

17:05.000 --> 17:07.000
right? You know, you basically only need one variable:

17:07.000 --> 17:09.000
the vector of data you're copying, or

17:09.000 --> 17:12.000
initializing. So, you can use the multiplier of

17:12.000 --> 17:16.000
eight, since you need fewer than four vectors, and this

17:16.000 --> 17:20.000
means that, now, on a 1,024-bit CPU, you're going to

17:20.000 --> 17:23.000
process one kilobyte of data per

17:23.000 --> 17:27.000
loop iteration.
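
That kilobyte figure is simple arithmetic; a back-of-the-envelope sketch, with the 1,024-bit width being the newly announced CPU mentioned above:

```c
/* Back-of-the-envelope for the memcpy/memset example: with the group
 * multiplier at 8, one vector "variable" spans 8 registers, so a
 * single load/store pair moves VLEN * LMUL bits per loop iteration. */
static unsigned bytes_per_iteration(unsigned vlen_bits, unsigned lmul)
{
    return vlen_bits * lmul / 8; /* bits to bytes */
}
/* bytes_per_iteration(128, 8) is 128 bytes on a 128-bit CPU, while
 * bytes_per_iteration(1024, 8) is 1024 bytes, one whole kibibyte. */
```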

17:27.000 --> 17:30.000
Your memset is probably going to be copying or initializing a lot less than one kilobyte of

17:30.000 --> 17:33.000
data at most call sites, and now, if you pay for a

17:33.000 --> 17:36.000
whole kilobyte of data to be processed every

17:36.000 --> 17:39.000
time you want to use that function, you're going

17:39.000 --> 17:42.000
to halve your performance, and you're going to end up in

17:42.000 --> 17:45.000
a situation where increasing the vector capacity of

17:45.000 --> 17:47.000
your CPU is going to make your code slower,

17:47.000 --> 17:49.000
or, at least, not faster, so you're just wasting

17:49.000 --> 17:52.000
gates on your CPU. So, point being: it's not just a

17:52.000 --> 17:55.000
problem with multimedia code like FFmpeg, which has

17:55.000 --> 17:58.000
a lot of fixed-size functions; a lot of functions working

17:58.000 --> 18:00.000
on fixed-size inputs. General

18:00.000 --> 18:04.000
code, which has variable but typically small inputs,

18:04.000 --> 18:06.000
will also be affected by this. So, in my opinion,

18:06.000 --> 18:11.000
this really needs to be fixed. Also, as I

18:11.000 --> 18:13.000
said, this kind of defect would also require

18:13.000 --> 18:15.000
portable open source projects to have a specialization for each

18:15.000 --> 18:17.000
vector length. There are only three commonly

18:17.000 --> 18:19.000
available vector lengths today, and it would seem that a

18:19.000 --> 18:21.000
fourth one is on the way, so it's even worse.

18:21.000 --> 18:25.000
So, anyway, that's the biggest problem, in my opinion.

18:25.000 --> 18:30.000
The second problem is the poor performance of what's

18:30.000 --> 18:34.000
known in the spec as segmented loads and stores. So, in

18:34.000 --> 18:37.000
RISC-V, you have two aspects when you

18:37.000 --> 18:41.000
load or store vectors to or from memory. You have the

18:41.000 --> 18:44.000
stride, which is the distance between the elements; normally,

18:44.000 --> 18:46.000
you don't need a stride, and all the elements are

18:46.000 --> 18:49.000
together in memory. But you can also have a variable stride,

18:49.000 --> 18:52.000
when you want to load, for instance, the column of a

18:52.000 --> 18:54.000
2D matrix. Of course, this is kind of

18:54.000 --> 18:57.000
slow, because the CPU has to issue

18:57.000 --> 18:59.000
many, many different requests on the memory

18:59.000 --> 19:02.000
bus. But there are also segments, which is about

19:02.000 --> 19:05.000
interleaving. So, for instance, if you have an

19:05.000 --> 19:09.000
RGB picture, you have the components, R, G, and B for each

19:09.000 --> 19:12.000
pixel, together, and when you load, you probably

19:12.000 --> 19:15.000
want to load each R value into

19:15.000 --> 19:18.000
one vector, each G value into another

19:18.000 --> 19:21.000
vector, and each B value into a third vector. And for that,

19:21.000 --> 19:23.000
we use segmented loads, which automatically

19:23.000 --> 19:26.000
de-interleave the data: contiguous

19:26.000 --> 19:30.000
elements are split across multiple vectors.
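
What a three-way segmented load does can be written out as scalar C. This sketch only shows the semantics; the vector instruction performs the whole de-interleave in one operation:

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar equivalent of a 3-way segmented load: de-interleave packed
 * R,G,B,R,G,B,... bytes into three separate planes, the way the
 * vector instruction would fill three separate vector registers. */
static void deinterleave_rgb(uint8_t *r, uint8_t *g, uint8_t *b,
                             const uint8_t *rgb, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        r[i] = rgb[3 * i + 0];
        g[i] = rgb[3 * i + 1];
        b[i] = rgb[3 * i + 2];
    }
}
```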

19:30.000 --> 19:34.000
On Arm, for instance, segmented loads and stores

19:34.000 --> 19:36.000
are as fast as non-segmented loads and stores.

19:36.000 --> 19:38.000
But, unfortunately, on both implementations seen so far,

19:38.000 --> 19:42.000
the segmentation carries a performance overhead. And that's, again,

19:42.000 --> 19:44.000
I mean, it's an implementation issue,

19:44.000 --> 19:47.000
obviously, but it's an issue in terms of

19:47.000 --> 19:51.000
competitiveness with, at least, Arm. And it affects us in

19:51.000 --> 19:53.000
multimedia a lot, because we need these kinds of

19:53.000 --> 19:57.000
instructions a lot, mostly for power-of-two segments.

19:57.000 --> 20:00.000
I mean, three, five, and, let alone, seven, I absolutely

20:00.000 --> 20:03.000
don't care about. Six, not really either. We want two, four, and

20:03.000 --> 20:08.000
eight; please, those need to be fast. So, yeah: on Arm,

20:08.000 --> 20:12.000
it's fast, it's basically as fast as

20:12.000 --> 20:16.000
if there's no segmentation. Yeah. So, that was

20:16.000 --> 20:19.000
for the implementation issues. Now, the spec issues; so,

20:19.000 --> 20:22.000
stuff that we really need and actually can't write.

20:22.000 --> 20:26.000
Transposition. So, with the segmented loads

20:26.000 --> 20:28.000
and stores, you can do memory-to-register transposition,

20:28.000 --> 20:30.000
and register-to-memory transposition, but you

20:30.000 --> 20:33.000
can't do register-to-register. And, in

20:33.000 --> 20:35.000
multimedia, we need that a lot, especially for the

20:35.000 --> 20:37.000
two-dimensional transforms, where, I won't go through the

20:37.000 --> 20:40.000
math, but, you do some calculation, like,

20:40.000 --> 20:43.000
horizontally, and then do it vertically on a

20:43.000 --> 20:45.000
matrix, and so forth; you need to transpose the data

20:45.000 --> 20:48.000
between the two steps. And, currently, we have to

20:48.000 --> 20:50.000
spill to the stack, because there's no way to

20:50.000 --> 20:53.000
transpose the data inside the CPU registers. And that's

20:53.000 --> 20:57.000
kind of slow. There is a Zvp extension being

20:57.000 --> 20:59.000
developed for this, so let's see

20:59.000 --> 21:01.000
how long it takes for it to be actually implemented in

21:01.000 --> 21:06.000
real hardware. Next: narrowing clips. So, there's the signed

21:06.000 --> 21:08.000
narrowing clip and the unsigned narrowing clip.

21:08.000 --> 21:11.000
A clip is when you take an integer and you

21:11.000 --> 21:14.000
shrink it to a smaller bit width, and, instead of

21:14.000 --> 21:16.000
dropping the high bits, you clamp to the maximum

21:16.000 --> 21:19.000
and minimum values. In multimedia, we,

21:19.000 --> 21:21.000
for some reason, because of how our video codecs work,

21:21.000 --> 21:24.000
always need to take signed 16-bit values and

21:24.000 --> 21:26.000
clip them to unsigned 8-bit values, and that's not

21:26.000 --> 21:28.000
possible in one instruction, because the spec doesn't have

21:28.000 --> 21:32.000
this mixed-sign narrowing clip. And, because of

21:32.000 --> 21:35.000
this, we need a combination of two or three instructions

21:35.000 --> 21:38.000
to do it instead. This is something that

21:38.000 --> 21:41.000
Intel and Arm both have as a single instruction, so

21:41.000 --> 21:44.000
it's a bit... well, it's not fatal, because we can

21:44.000 --> 21:47.000
still do it in a reasonable way, but it is a disadvantage.
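
The operation itself is trivial to state in C. This is the scalar reference of the mixed-sign clip; the point is that RVV has no single instruction for it, while x86 and Arm do:

```c
#include <stdint.h>

/* Clip a signed 16-bit intermediate value to an unsigned 8-bit pixel:
 * negative values clamp to 0, values above 255 clamp to 255. */
static uint8_t clip_int16_to_uint8(int16_t v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t)v;
}
```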

21:48.000 --> 21:51.000
And then there's the integer

21:51.000 --> 21:54.000
absolute difference.

21:54.000 --> 21:56.000
Why is that one a problem? So, on the competition,

21:56.000 --> 21:58.000
you can do it in two instructions: you

21:58.000 --> 22:00.000
take the difference between a and b, and then you

22:00.000 --> 22:02.000
take the absolute value, with saturating

22:02.000 --> 22:06.000
instructions. But, in RISC-V, that doesn't work,

22:06.000 --> 22:09.000
exactly: you have to calculate a minus b, and b minus a,

22:09.000 --> 22:11.000
and then you have to take the max of both, and

22:11.000 --> 22:13.000
you have to hope that there's no overflow in the

22:13.000 --> 22:15.000
differences; otherwise, your

22:15.000 --> 22:17.000
calculation doesn't work, and you have to

22:17.000 --> 22:20.000
use widening instructions instead, and it can

22:20.000 --> 22:22.000
take up to six instructions, where you would only

22:22.000 --> 22:25.000
need two.
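
The max-of-two-differences trick looks like this in scalar C. This is a sketch of the semantics the RVV instruction sequence has to implement for unsigned 8-bit elements:

```c
#include <stdint.h>

/* Absolute difference of two unsigned 8-bit values without widening:
 * a plain a - b can underflow, so take both differences, saturating
 * negative results to 0, and keep the larger of the two. */
static uint8_t absdiff_u8(uint8_t a, uint8_t b)
{
    uint8_t d1 = a > b ? (uint8_t)(a - b) : 0;  /* saturating a - b */
    uint8_t d2 = b > a ? (uint8_t)(b - a) : 0;  /* saturating b - a */
    return d1 > d2 ? d1 : d2;
}
```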

22:25.000 --> 22:27.000
And, again, there is an extension being worked on for this; it's now

22:27.000 --> 22:29.000
in the stabilization state, so it's looking quite

22:29.000 --> 22:33.000
good: Zvabd. But, obviously, it's still probably

22:33.000 --> 22:37.000
a few years before we see that in real hardware.

22:37.000 --> 22:39.000
There would also have been nice-to-haves, like a way to

22:39.000 --> 22:41.000
change the vector configuration without changing

22:41.000 --> 22:43.000
everything else, but

22:43.000 --> 22:45.000
that goes back to the VLMAX argument, and

22:45.000 --> 22:49.000
I'm about to run out of time, so that's less

22:49.000 --> 22:51.000
important anyway. So, yes.

22:51.000 --> 22:53.000
These are the references: obviously, the

22:53.000 --> 22:56.000
RISC-V vector extension specification, version 1.0. If you

22:56.000 --> 23:00.000
need to learn the vector spec, you're

23:00.000 --> 23:02.000
going to have to read it. It's obviously not

23:02.000 --> 23:05.000
meant as a, you know, how-to, but, I mean, I

23:05.000 --> 23:07.000
learned from that one. So, if you know

23:07.000 --> 23:09.000
at least the basics of assembler, you can learn

23:09.000 --> 23:13.000
vector programming from the spec. The

23:13.000 --> 23:15.000
same for FFmpeg. There's also

23:15.000 --> 23:21.000
the link to the nightly... okay, the nightly test suite,

23:21.000 --> 23:25.000
and then you also have the checkasm link.

23:25.000 --> 23:29.000
Yeah, I think I spent 25 minutes, which is what I

23:29.000 --> 23:33.000
said I would do. But, all right, I realize

23:33.000 --> 23:37.000
we started early, so... we only have one mic, so

23:37.000 --> 23:39.000
you'll have to wait for it to get to you.

23:39.000 --> 23:43.000
Okay, sure. Sorry, the gentleman first, here.

23:43.000 --> 23:47.000
There's a lack of instructions; for specific

23:47.000 --> 23:49.000
operations there are sequences you have to run instead.

23:49.000 --> 23:53.000
Would instruction fusion help, and

23:53.000 --> 23:55.000
are vendors interested?

23:55.000 --> 23:59.000
So the question is, when we have a

23:59.000 --> 24:01.000
combination of instructions for an

24:01.000 --> 24:03.000
instruction that doesn't exist, could

24:03.000 --> 24:05.000
instruction fusion, I guess, in the CPU pipeline,

24:05.000 --> 24:07.000
help? Well, obviously in theory, yes.

24:07.000 --> 24:11.000
The problem is more that, from our standpoint, we have to

24:11.000 --> 24:13.000
support kind of all vendors, and it's hard to rely

24:13.000 --> 24:15.000
on the fact that there will be

24:15.000 --> 24:19.000
fusion anywhere. Unless there's

24:19.000 --> 24:21.000
a new version of the spec that, like, says

24:21.000 --> 24:25.000
you really should, because the

24:25.000 --> 24:27.000
spec doesn't specify performance, but it kind of

24:27.000 --> 24:29.000
gives hints, right? So unless it really says, like,

24:29.000 --> 24:31.000
you really, really should fuse this combination,

24:31.000 --> 24:33.000
it's not something that we're going to be able to rely on.

24:33.000 --> 24:37.000
But if that were to happen, then maybe yes.

24:37.000 --> 24:41.000
So what's the minimum version of the RISC-V spec,

24:41.000 --> 24:43.000
let's say, this can run on?

24:43.000 --> 24:47.000
So, FFmpeg, at least... yes, sorry,

24:47.000 --> 24:51.000
so the question is, what's the

24:51.000 --> 24:53.000
minimum spec, I guess the version,

24:53.000 --> 24:55.000
that's required.

24:55.000 --> 24:59.000
So FFmpeg has extensive run-time

24:59.000 --> 25:01.000
CPU feature detection, at least on Linux,

25:01.000 --> 25:05.000
using hwprobe. So

25:05.000 --> 25:08.000
technically, you can run FFmpeg

25:08.000 --> 25:11.000
with only IMA; of course you will definitely

25:11.000 --> 25:14.000
want to have F and D because otherwise performance

25:14.000 --> 25:16.000
for floating-point workloads is going to be really

25:16.000 --> 25:19.000
bad. Technically, you can run FFmpeg

25:19.000 --> 25:21.000
with just RVA20, and then it will run-

25:21.000 --> 25:24.000
time detect whatever your CPU has on top of

25:24.000 --> 25:27.000
this that it can use, especially the B

25:27.000 --> 25:29.000
extension and the V extension; in this case,

25:29.000 --> 25:33.000
V version 1.0. Of course, if you want all of

25:33.000 --> 25:37.000
this stuff, if you want a CPU that has all

25:37.000 --> 25:39.000
of the stuff that we have in FFmpeg at the moment,

25:39.000 --> 25:43.000
I think you would not only need RVA22, but you probably

25:43.000 --> 25:47.000
also would need Zvbb, which I'm not sure is

25:47.000 --> 25:49.000
included in RVA22; ask somebody

25:49.000 --> 25:52.000
more familiar with the profile definitions.

25:52.000 --> 25:55.000
But we already have a few optimizations. Okay, so

25:55.000 --> 25:58.000
basically, we need RVA22, sorry; but then if you

25:58.000 --> 26:00.000
want the new extensions that I mentioned earlier,

26:00.000 --> 26:03.000
then you would need even some future stuff.

26:03.000 --> 26:05.000
Yes.

26:05.000 --> 26:10.000
If there is a chip with RVV

26:10.000 --> 26:14.000
0.7, can at least part of it be used already?

26:14.000 --> 26:18.000
So the question is, do we have any kind of

26:18.000 --> 26:22.000
support for the 0.7.1 version of the vector spec,

26:22.000 --> 26:26.000
nowadays called the XTheadVector extension.

26:26.000 --> 26:28.000
No.

26:28.000 --> 26:31.000
To my knowledge, there's only two

26:31.000 --> 26:34.000
major designs that have been released with this.

26:34.000 --> 26:38.000
One is the really, really small C906, which is not interesting

26:38.000 --> 26:41.000
for FFmpeg, really, because it's really a microcontroller,

26:41.000 --> 26:43.000
mostly, kind of thing.

26:43.000 --> 26:46.000
And the other is the C910. Also,

26:46.000 --> 26:48.000
there's no new hardware coming with this,

26:48.000 --> 26:52.000
and we have to mind that I think both of these designs

26:52.000 --> 26:55.000
are affected by a halt-and-catch-fire instruction

26:55.000 --> 26:57.000
vulnerability, which requires you to disable

26:57.000 --> 26:59.000
the vector unit to not trigger it.

26:59.000 --> 27:01.000
So effectively, if you want to be secure, you can't

27:01.000 --> 27:03.000
use the vector unit, so it's kind of pointless.

27:03.000 --> 27:05.000
I did write some macros that partially

27:05.000 --> 27:08.000
automate the process, but it's at compilation time.

27:08.000 --> 27:10.000
And we don't want to drop that.

27:10.000 --> 27:12.000
So, in my opinion, there's really no

27:12.000 --> 27:15.000
sane way to have runtime support for both at the same time.

27:15.000 --> 27:19.000
So it is possible, with some not-so-big changes,

27:19.000 --> 27:21.000
to take the existing code

27:21.000 --> 27:24.000
and make it work on 0.7.1, but you have to do it manually,

27:24.000 --> 27:27.000
and you have to recompile, which is not very convenient.

27:27.000 --> 27:30.000
I mean, it won't work for distros, basically.

27:30.000 --> 27:31.000
Yes.

27:52.000 --> 27:58.000
So the point is that there's already ways to do transposition;

27:58.000 --> 28:01.000
yes, there's plenty of ways to do a transposition.

28:01.000 --> 28:04.000
I mean, you can do a bunch of slides and moves.

28:04.000 --> 28:06.000
I'm not sure I heard everything you were saying,

28:06.000 --> 28:09.000
but the point was, yes, that you can already transpose.

28:09.000 --> 28:11.000
So yes, you can do it with slides and

28:11.000 --> 28:14.000
moves, you can probably do it with the compression instruction

28:14.000 --> 28:18.000
too. What we do is store to memory and then read back,

28:18.000 --> 28:21.000
because that's the simplest, and it's slower.

28:21.000 --> 28:22.000
It's not that slow.

28:34.000 --> 28:37.000
So I think there was an attempt,

28:37.000 --> 28:39.000
a couple of years ago,

28:39.000 --> 28:43.000
to try a bunch of different things

28:43.000 --> 28:46.000
with the SiFive guys, and nothing really panned out,

28:46.000 --> 28:50.000
but I wasn't in those discussions, so I can't really say.

28:50.000 --> 28:53.000
I think the point is more about performance.

28:53.000 --> 28:56.000
Yes, I mean, because obviously there are plenty of ways you can do it;

28:56.000 --> 28:58.000
in the worst case you can do gathers.

28:58.000 --> 29:00.000
It's actually inefficient, but you can do it,

29:00.000 --> 29:04.000
but then it takes more time than going through memory.

29:04.000 --> 29:10.000
So I think it also depends on the device,

29:10.000 --> 29:14.000
because the SiFive ones have rather slow memory, but

29:15.000 --> 29:17.000
yeah, there might be slightly better ways.

29:17.000 --> 29:19.000
But anyway, there is an extension being worked on

29:19.000 --> 29:22.000
that's likely to do this properly, and in a better way.

29:22.000 --> 29:24.000
So yes, maybe. I mean, by all means,

29:24.000 --> 29:26.000
if there's a way to improve on what we currently have

29:26.000 --> 29:29.000
without requiring an extension, we're interested to hear about it,

29:29.000 --> 29:32.000
but I expect that it's going to be

29:32.000 --> 29:34.000
maybe marginally faster.

29:34.000 --> 29:36.000
The question is mainly,

29:36.000 --> 29:38.000
because obviously you've worked on the x86 side of it,

29:38.000 --> 29:41.000
I think, as well:

29:42.000 --> 29:44.000
are there any major gaps that you've seen,

29:44.000 --> 29:48.000
missing capabilities in RISC-V,

29:48.000 --> 29:52.000
for getting similar performance on typical workloads?

29:52.000 --> 29:57.000
So the question is, having worked on x86

29:57.000 --> 30:00.000
and then RISC-V, what would be the

30:00.000 --> 30:04.000
missing pieces, the gaps, in RISC-

30:04.000 --> 30:06.000
V, for getting similar performance on

30:06.000 --> 30:09.000
similar workloads.


30:09.000 --> 30:11.000
So, to be clear, I have very little x86 experience.

30:11.000 --> 30:13.000
I don't really know x86 assembler, for that reason.

30:13.000 --> 30:15.000
I'm starting to pick it up a bit. I mostly know

30:15.000 --> 30:18.000
ARM, from my day job, and a bit of

30:18.000 --> 30:20.000
64-bit, but.

30:20.000 --> 30:24.000
So, I mean, the big gaps are

30:24.000 --> 30:26.000
the ones I mentioned. There's also

30:26.000 --> 30:27.000
another, more complete list,

30:27.000 --> 30:29.000
I think, in the presentation

30:29.000 --> 30:32.000
from David, and also,

30:32.000 --> 30:34.000
I think, in RISE, the

30:34.000 --> 30:37.000
system libraries working group, or something like that,

30:37.000 --> 30:40.000
a committee.

30:40.000 --> 30:42.000
So, yes, there's a few of those;

30:42.000 --> 30:44.000
there's also the ones I mentioned.

30:44.000 --> 30:46.000
Now, in terms of implementation,

30:46.000 --> 30:48.000
we are missing a lot. Like,

30:48.000 --> 30:51.000
we have nothing for VVC,

30:51.000 --> 30:52.000
because nobody cares about VVC,

30:52.000 --> 30:54.000
I guess, but

30:54.000 --> 30:55.000
x86 is not finished

30:55.000 --> 30:57.000
for VVC either. There's some

30:57.000 --> 31:00.000
initial implementations coming from

31:00.000 --> 31:01.000
the

31:01.000 --> 31:03.000
Chinese Academy of Sciences

31:03.000 --> 31:04.000
and from Alibaba,

31:04.000 --> 31:05.000
but I don't think

31:05.000 --> 31:08.000
anybody has had time to review them.

31:08.000 --> 31:10.000
Yeah.

31:10.000 --> 31:11.000
The only thing

31:11.000 --> 31:12.000
that's doing kind of well

31:12.000 --> 31:13.000
at the moment is AV1,

31:13.000 --> 31:15.000
and that's because it's done by dav1d,

31:15.000 --> 31:16.000
and not by FFmpeg.

31:16.000 --> 31:17.000
MPEG-2

31:17.000 --> 31:18.000
is not implemented,

31:18.000 --> 31:19.000
of course.

31:19.000 --> 31:20.000
Now, you can

31:20.000 --> 31:21.000
pretty much decode MPEG-2

31:21.000 --> 31:22.000
with scalar code,

31:22.000 --> 31:23.000
and

31:23.000 --> 31:24.000
that works perfectly,

31:24.000 --> 31:25.000
but performance-wise

31:25.000 --> 31:26.000
it won't be as fast as

31:26.000 --> 31:27.000
on

31:27.000 --> 31:29.000
x86.

31:29.000 --> 31:30.000
The

31:30.000 --> 31:31.000
coverage,

31:31.000 --> 31:33.000
the code coverage, is

31:33.000 --> 31:34.000
way, way worse

31:34.000 --> 31:35.000
than x86.

31:35.000 --> 31:37.000
We do have a few things that

31:37.000 --> 31:38.000
x86 doesn't,

31:38.000 --> 31:39.000
because I implemented them

31:39.000 --> 31:40.000
and nobody had cared to implement them there,

31:40.000 --> 31:41.000
usually,

31:41.000 --> 31:43.000
especially on the audio side,

31:43.000 --> 31:45.000
like things like

31:45.000 --> 31:47.000
slicing,

31:47.000 --> 31:49.000
so, more like on the

31:49.000 --> 31:50.000
audio

31:50.000 --> 31:52.000
side.

31:52.000 --> 31:55.000
But there's a lot of work that's not been done,

31:55.000 --> 31:57.000
including on some major codecs,

31:57.000 --> 31:59.000
and also major

31:59.000 --> 32:00.000
video processing filters

32:00.000 --> 32:01.000
that

32:01.000 --> 32:02.000
haven't been

32:02.000 --> 32:03.000
done,

32:03.000 --> 32:04.000
and so on.

32:04.000 --> 32:07.000
Right.

32:07.000 --> 32:20.000
well,

32:20.000 --> 32:24.000
this is a somewhat special problem,

32:24.000 --> 32:27.000
in case it doesn't work anyhow,

32:27.000 --> 32:28.000
but it contains,

32:28.000 --> 32:30.000
it contains,

