WEBVTT

00:00.000 --> 00:18.720
It costs anything from $700,000 to $700 million to make a new chip, just because you want to make

00:18.720 --> 00:33.280
your code smaller or faster. I actually made the code for a particular chip 7% smaller

00:33.280 --> 00:40.680
in 24 hours by the application of benchmarking. And that's what I'm going to tell

00:40.680 --> 00:48.680
you about today. My name is Jeremy Bennett. I'm chief executive of Embecosm. Welcome

00:48.920 --> 00:55.720
to benchmarking in the RISC-V devroom. I hope by the end of this you'll have a good

00:55.720 --> 01:02.280
insight into how much open source benchmarking technology there is around that you can take

01:02.280 --> 01:10.440
advantage of, and how it can help you improve your software and your hardware designs. So first

01:10.440 --> 01:15.880
let's think about benchmarking. How do we benchmark? For a good open source benchmark, there are a

01:15.880 --> 01:22.920
number of criteria; these are just the main ones. Benchmark on real code, not synthetic code made up because

01:22.920 --> 01:29.080
someone thought 'this looks like a typical program', but programs that real people have run. And make

01:29.080 --> 01:33.880
it open source. There's an idea of the closed source benchmark: trust me, it benchmarks your program, but

01:33.880 --> 01:40.680
you're not allowed to see it. Doesn't float my boat. Make sure you can benchmark everything you need

01:40.760 --> 01:46.760
to: certainly speed and code size; embedded people worry more about code size than code speed.

01:46.760 --> 01:51.400
And indeed, we've benchmarked things like energy consumption. It's quite important. How you

01:51.400 --> 02:01.160
write your code can affect how much energy it uses. Keep it simple: how good is this system or

02:01.160 --> 02:06.280
program according to your benchmark? Just give a number. Don't give a large paragraph of

02:07.000 --> 02:15.480
explanation. And lastly, the technology moves, so the benchmarks need to keep up. Change your

02:15.480 --> 02:21.240
benchmarks regularly. Update them so that you don't end up with a compiler that understands

02:21.240 --> 02:28.200
your benchmarks and just magically does them really well. So I'm going to focus today on two benchmarks.
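
NOTE
Editor's sketch, not part of the talk: the "just give a number" criterion.
Embench, for instance, reduces a whole suite to one figure by taking each
benchmark's result relative to a reference platform and combining the
ratios with a geometric mean. A minimal Python illustration; the benchmark
names and values here are invented.
    import math
    # per-benchmark results relative to the reference platform (>1.0 is better)
    relative = {"crc32": 1.10, "statemate": 0.95, "nettle-aes": 1.04}
    # geometric mean: exp of the mean of the logs
    score = math.exp(sum(math.log(r) for r in relative.values()) / len(relative))
    print(f"suite score: {score:.2f}")   # one number, no paragraph of explanation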

02:28.920 --> 02:34.840
One is SPEC CPU, an excellent set of benchmarks. It's marginal for inclusion here, because

02:36.280 --> 02:42.600
SPEC itself you have to pay for; it's not open source. But all the underlying benchmarks are open source.

02:42.600 --> 02:46.600
So we'll live with what we get from that. And I'm mostly going to talk about the underlying

02:46.600 --> 02:55.320
benchmarks, not the SPEC CPU infrastructure. The other is Embench, which is actually a

02:55.320 --> 03:01.000
family of benchmarks, but it's best known for its very small benchmark suite, Embench-IoT. We'll use that for

03:01.000 --> 03:07.800
microcontrollers, of course. And this will be, I think, the first time I've actually spoken about

03:07.800 --> 03:13.800
Embench 2.0. It's not officially released yet, but it's at final release candidate stage. So

03:13.800 --> 03:20.120
this is us doing the regular updating: it's the second version of Embench-IoT. So what

03:20.120 --> 03:25.160
am I going to use as my benchmarking targets? Well, the focus is RISC-V, so lots of those.

03:25.160 --> 03:29.960
We heard mention in the earlier talk, which I was involved with, of running these on the CORE-V

03:29.960 --> 03:36.600
CV32E40Pv2 on FPGA boards; I've used that for a lot of the data here. I've used a HiFive

03:36.600 --> 03:45.320
Unmatched as a bigger machine. I've got a Milk-V Pioneer, which is a more modern

03:46.280 --> 03:54.520
server-class RISC-V platform. I've also used one of these, a Banana Pi. This has got the RISC-V

03:55.320 --> 04:04.120
vector extension, version 1.0, on it. And I've used some QEMU, because sometimes you don't have a chip, and

04:04.120 --> 04:13.160
you can use QEMU to at least give you a rough insight. For Arm, this is the go-to board. That's

04:13.160 --> 04:18.680
just an ST Discovery board. It's been around for years. It's cheap. We use it as the baseline

04:18.680 --> 04:23.640
reference platform for Embench, precisely because anyone can get hold of one for a few dollars.

04:24.920 --> 04:28.280
I've also used an Apple M1, and I've also used QEMU user mode a bit.

04:29.160 --> 04:35.480
And finally, we have looked at x86 in one or two places. And for that, I've used a Threadripper.

04:35.480 --> 04:40.280
It's an old Threadripper, about seven years old, I think. But the focus here is not

04:40.280 --> 04:46.200
particularly on x86. The focus is RISC-V; if anything, RISC-V against Arm,

04:46.200 --> 04:51.320
being the closest to like-for-like. You may wonder: what about Dhrystone and CoreMark?

04:51.880 --> 04:59.400
Well, we don't use those. They don't meet the criterion of being

05:01.560 --> 05:07.240
a set of real programs, and they're not regularly updated. Dhrystone actually tells you to turn off

05:07.320 --> 05:10.760
all the compiler optimizations, because otherwise it gets optimized away completely and everything runs

05:10.760 --> 05:16.440
unrealistically fast. CoreMark is so well understood by the compilers that it's not meaningful any

05:16.440 --> 05:20.840
more, and it's a tiny, tiny benchmark. And you can see this in the graph here. This is

05:20.840 --> 05:29.560
looking at the impact of a number of RISC-V ISA extensions: the basic ones (I, M, C); adding in a

05:29.560 --> 05:35.800
whole load of the bit-manipulation ones, which are the ones on the Milk-V; and then some of the

05:36.360 --> 05:44.840
CORE-V special extensions: the multiply-accumulate and the extended memory addressing

05:44.840 --> 05:51.320
instructions. And with Embench, a rich set of programs, you can see variability across

05:51.320 --> 05:56.440
all of them. But CoreMark and Dhrystone are so tied to the baseline that you can't tell

05:56.440 --> 06:01.240
whether those last four extensions are any good at all, because they'll give the same result.

06:01.880 --> 06:09.720
So we're going to ignore them. First up, I'm going to talk about application-class processors,

06:09.720 --> 06:16.280
so these are the big RV64 machines. I've quoted one or two of those we're using here.

06:16.280 --> 06:23.160
And we're going to focus on SPEC CPU 2006 and 2017. I know 2017 is the up-to-date one, but we have one

06:23.160 --> 06:29.480
customer that was very keen to produce 2006 figures, so I've got a lot of those. So first of all,

06:29.480 --> 06:37.000
let's have a look at architecture. We asked: how does a modern RV64 big

06:37.000 --> 06:44.040
application-class machine compare against x86 and against AArch64? Now, these all run at different

06:44.040 --> 06:50.520
clock rates, so I've scaled the results as if they were all at one gigahertz. And what

06:50.520 --> 06:57.480
you can see here is that RV64 still has some way to go. You know, this was a relatively early-

06:57.560 --> 07:04.120
stage design, so that RV64 has improved since then. But generally, whenever we benchmark RV64,

07:04.120 --> 07:09.720
we find at the application-class processor level that there's a way to go before it catches up with

07:09.720 --> 07:18.280
even x86. And actually the figures for AArch64 are from an Apple M1, which is a phenomenal machine;

07:19.080 --> 07:24.200
it's in a different league. So we are a long way from replacing the M1 with a

07:24.280 --> 07:33.240
RISC-V. This talk is going to be mostly just graphs, and I will take questions at the end,

07:33.240 --> 07:39.080
but if I say something particularly egregious, just wave your hands around a lot, and I will address the issue.
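
NOTE
Editor's sketch, not part of the talk: the scaling to a common one
gigahertz clock mentioned above is simple proportional normalisation,
on the assumption that cycle counts stay fixed as the clock changes.
Illustrative numbers only.
    results = {"rv64 board": (42.0, 1.5e9), "x86 server": (9.0, 3.4e9)}
    for name, (seconds, clock_hz) in results.items():
        # cycles = time * frequency; at 1 GHz those cycles take cycles/1e9 seconds
        print(name, seconds * clock_hz / 1e9, "seconds at 1 GHz")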

07:40.440 --> 07:46.440
GCC versus LLVM: which compiler should I use? I run a compiler company; we get asked that question.

07:46.440 --> 07:51.960
The answer is they're both good compilers. The quality of the compiler you get will not depend

07:51.960 --> 07:56.760
on which of those you choose. It'll depend on the quality of the compiler team that you get to do the

07:56.760 --> 08:05.560
work. A good compiler team can make an excellent compiler out of either. This is just looking

08:05.560 --> 08:11.080
at instruction count: at the time I did this, I didn't have machines to compare, so I'm just

08:11.080 --> 08:20.360
looking at the dynamic instruction count as counted by QEMU. In general, LLVM generates more instructions.
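
NOTE
Editor's sketch, not part of the talk: one way to gather dynamic
instruction counts is QEMU's example "insn" TCG plugin. A hedged Python
wrapper; the plugin path, the binary name, and the exact report format
all vary by QEMU version and setup.
    import re, subprocess
    def dynamic_insns(binary, plugin="./libinsn.so"):
        # user-mode QEMU; "-d plugin" sends the plugin's report to the log (stderr)
        proc = subprocess.run(
            ["qemu-riscv64", "-plugin", plugin, "-d", "plugin", binary],
            capture_output=True, text=True)
        # take the largest "insns: N" figure reported (the grand total)
        return max(int(n) for n in re.findall(r"insns: (\d+)", proc.stderr))
    print(dynamic_insns("./benchmark.riscv"))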

08:20.440 --> 08:24.680
That doesn't necessarily mean it's slower; it may be generating longer sequences

08:24.680 --> 08:29.240
that are more efficient. As for the peculiar example of 462.libquantum, there's a reason why that's

08:29.240 --> 08:33.960
not in the newer version of SPEC CPU: it is the most unpredictable benchmark. You make a

08:33.960 --> 08:39.080
slight change, and it doubles in speed or halves in speed. It's very sensitive to the precise

08:39.080 --> 08:46.360
compiler configuration. There's some interesting stuff around 456.hmmer,

08:46.360 --> 08:52.280
and I'll come back to that later. But broadly, there's not a huge difference, and absent actual hardware

08:52.280 --> 08:59.480
to run these on, we're not sure there's much to be alarmed about; hmmer and libquantum are

08:59.480 --> 09:08.680
the two of interest there. What about LTO and PGO? That's link-time optimization and profile-

09:08.680 --> 09:14.120
guided optimization, the two most important compiler optimizations of the last few decades. And

09:15.080 --> 09:22.200
what we've got here is a baseline, where we measured the performance of SPEC CPU 2006,

09:22.760 --> 09:28.600
and this is only SPECint. Then we turned on LTO, then we turned on PGO, then we turned

09:28.600 --> 09:36.200
both on. And in one or two cases, it does make a big difference, hmmer being an example. But in most

09:36.200 --> 09:42.440
of the cases, it doesn't help, and quite often it makes things significantly worse. That's a bit

09:42.520 --> 09:49.480
worrying; optimizations that make things slower are not really popular. And when you look at x86,

09:49.480 --> 09:56.040
you see x86 does much better. We understand a bit of what's going on with PGO. Profile-

09:56.040 --> 10:02.280
guided optimisation tries to straighten out the code flow so you don't take branches as often, by

10:02.280 --> 10:07.800
profiling and working out what the preferred flow is. The problem with RISC-V is that, as PGO is set

10:07.880 --> 10:12.600
up at the moment, it tends to make blocks of code move further apart, and you switch from using short

10:12.600 --> 10:17.240
branches to long branches, and then you've actually got more branching activity than you had before.
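
NOTE
Editor's sketch, not part of the talk: the GCC flags behind the four
configurations compared above. A Python driver; the source file and the
training run are placeholders.
    import subprocess
    def build(out, extra):
        subprocess.run(["gcc", "-O2", *extra, "bench.c", "-o", out], check=True)
    build("bench.base", [])                         # baseline
    build("bench.lto",  ["-flto"])                  # link-time optimisation
    build("bench.prof", ["-fprofile-generate"])     # instrumented build
    subprocess.run(["./bench.prof"], check=True)    # training run writes .gcda profiles
    build("bench.pgo",  ["-fprofile-use"])          # profile-guided rebuild
    build("bench.both", ["-flto", "-fprofile-use"]) # LTO and PGO together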

10:19.080 --> 10:29.320
We are working slowly on fixes for that. The next thing is RISC-V vector, and at the time I did this,

10:31.800 --> 10:37.000
we didn't have RVV 1.0 hardware. This is why this board is so important. I had hoped

10:37.000 --> 10:45.400
to have actual hardware figures for this, but I haven't. What this is based on is QEMU instruction

10:45.400 --> 10:52.200
counts. And what we found, at the time I did this, is that for most programs, whether you have vector

10:52.200 --> 11:00.120
instructions doesn't make much difference. But there are a couple of benchmarks, notably the x264

11:00.440 --> 11:07.800
benchmark, which does the sort of operations where vectors really apply. And actually we found

11:07.800 --> 11:13.640
the number of instructions halved there. So if you execute with RVV, you execute half as many

11:13.640 --> 11:21.720
instructions as you do if you don't have RVV. The data point I'd hoped to have by now, but don't

11:21.800 --> 11:29.560
have, though I have heard from others who have done this, is that instruction count is too crude.

11:29.560 --> 11:35.320
Vector instructions are so different from scalar instructions that instruction count really isn't

11:35.320 --> 11:40.440
telling you terribly much. And what I've heard people say is that when you come to real RVV

11:40.440 --> 11:49.000
hardware, on x264 the number of instructions is halved, but the execution time is actually twice as long,

11:49.000 --> 11:54.680
because those vector instructions take so long to run. I had hoped to have that proved out on the

11:57.240 --> 12:03.080
Banana Pi, but I haven't managed to get that data yet. So I'd say the jury is still out

12:03.080 --> 12:09.400
on RISC-V vector, but the data starting to come in as the first real chips arrive suggests that

12:09.400 --> 12:14.680
when you generate RISC-V vector instructions needs to be thought about very carefully.
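
NOTE
Editor's note, not part of the talk: the arithmetic behind "half the
instructions, twice the time". Illustrative cycle counts only.
    scalar_insns, scalar_cpi = 1.0e9, 1.0   # 1e9 cycles in total
    vector_insns, vector_cpi = 0.5e9, 4.0   # half the instructions, each far slower
    print("scalar cycles:", scalar_insns * scalar_cpi)   # 1.0e9
    print("vector cycles:", vector_insns * vector_cpi)   # 2.0e9, twice as long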

12:14.680 --> 12:18.600
And sometimes even though it's a smaller number of instructions, it might not be the right

12:18.600 --> 12:25.240
thing to do. So that's, if you like, the application-class processors; what I really

12:25.240 --> 12:30.840
want to focus on is microcontrollers. I do more work on microcontrollers, and this is fully open source

12:30.840 --> 12:38.440
Embench-IoT. So one of the questions I wanted to look at was compilers over time: are the compilers

12:38.440 --> 12:42.280
getting better? I'm a compiler engineer. Benchmarking, yes, is about your hardware,

12:42.280 --> 12:48.120
but it's about your software tools and your libraries as well. And this is code size,

12:48.120 --> 12:53.320
and for the embedded world, these being microcontrollers, remember, that matters. The good news is

12:53.320 --> 12:59.000
that with GCC, looking back over the last seven or eight years, code size has been improving.

12:59.640 --> 13:04.600
If we compare against the obvious competitor, the Arm Cortex, in this case a Cortex-M4,

13:04.600 --> 13:11.000
we see that's also got better. But the Embench score for Arm has improved by about 4%,

13:11.000 --> 13:17.400
whereas the Embench score for RISC-V has improved by about 7%. So they are getting closer,

13:17.400 --> 13:23.640
and that's good news. When we look at speed, and of course you want to go faster, that's why

13:23.640 --> 13:31.400
the graph goes up, we see that GCC's generally getting better, though it did get worse at one

13:31.400 --> 13:40.600
stage. You'll notice that GCC 10 was a lot worse than GCC 9. But generally it's got better, and over the

13:40.600 --> 13:49.720
last seven or eight years the score's gone from 0.91 to 1.03. So we're looking at about a 12 or 13% difference.

13:50.760 --> 13:57.080
The thing is, so has Arm. If you're comparing against Arm, Arm has also been getting

13:57.080 --> 14:06.280
better. Interestingly though, the RISC-V, and this was done with the CV32E40Pv2 on an FPGA board,

14:06.360 --> 14:12.600
normalizing out differences in clock speed, is actually slightly higher:

14:12.600 --> 14:18.680
it's slightly better than the Arm Cortex-M4, which is a nice result. But there's nothing to be

14:18.680 --> 14:23.880
taken for granted. We do need to keep on developing the compilers if we want those trends to continue.
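
NOTE
Editor's sketch, not part of the talk: code-size trends like these can be
tracked with the GNU size tool across successive toolchain builds. The
toolchain name and file names are assumptions.
    import subprocess
    def text_bytes(elf, tool="riscv32-unknown-elf-size"):
        # "size" prints a header row, then: text data bss dec hex filename
        out = subprocess.run([tool, elf], capture_output=True, text=True, check=True)
        return int(out.stdout.splitlines()[1].split()[0])
    for build in ["crc32.gcc10.elf", "crc32.gcc14.elf"]:
        print(build, text_bytes(build), "bytes of code")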

14:27.160 --> 14:31.480
Comparing compilers: do I use GCC or LLVM? That's the question I asked with

14:31.480 --> 14:37.080
SPEC for the application class; now the same question for this class. You can see here the scores

14:37.080 --> 14:41.400
for all the individual benchmarks that make up the suite. This was actually done with Embench 1.0

14:41.400 --> 14:48.120
for this comparison. And what you see here is that for individual programs, one compiler is better,

14:48.120 --> 14:53.320
or the other is. But if you look at the average, there is little in it, just a tiny bit:

14:53.320 --> 15:01.400
I think one's 0.98, the other's 1.01. They're very similar in performance, and LLVM

15:01.400 --> 15:07.240
18 is slightly better than GCC 14. Historically, that's changed recently, because when I

15:07.240 --> 15:12.680
did this a few years ago, GCC was the quicker one. There is something important here. This is where

15:12.680 --> 15:19.080
benchmarking is more than just looking at the numbers. For compiler engineers: the LLVM engineers

15:19.080 --> 15:23.400
can go and look at where GCC does much better and ask, what have they got that we haven't got?

15:23.960 --> 15:29.400
And the GCC engineers can do the same where LLVM wins, asking what have they got, and both compilers improve.

15:29.400 --> 15:33.800
So we do a lot of this comparative analysis: try two

15:33.800 --> 15:37.320
different compilers; what's the difference, why the difference, can we make it better?
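
NOTE
Editor's sketch, not part of the talk: the comparative analysis just
described, mechanised. Rank benchmarks by how far the two compilers
diverge, then start digging at the top. All scores are hypothetical.
    gcc14  = {"crc32": 1.05, "statemate": 0.92, "nettle-aes": 1.10}
    llvm18 = {"crc32": 0.97, "statemate": 1.08, "nettle-aes": 1.11}
    # biggest divergence first: that's where one compiler has something the other lacks
    for bench in sorted(gcc14, key=lambda b: abs(gcc14[b] - llvm18[b]), reverse=True):
        print(f"{bench:12s} gcc={gcc14[bench]:.2f} llvm={llvm18[bench]:.2f}")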

15:40.840 --> 15:45.000
I want to take another look at comparing architectures. This is Embench 2.0 again.

15:45.720 --> 15:50.680
What I've done is I've taken all the benchmarks and looked at the code size,

15:50.680 --> 15:57.160
because, once again, code size matters for the embedded world. And I've compared

15:57.160 --> 16:05.800
RISC-V against Arm. Basically, this is looking at Arm code size relative to RISC-V; remember, small is good.

16:07.000 --> 16:12.280
If you're down at the bottom, you're smaller, and if you're at the top, you're bigger. So, what we see

16:12.280 --> 16:18.360
is, ordering them by how well they do, that the very best benchmark is nearly

16:18.360 --> 16:25.880
40% smaller if you compile it for Arm, whereas if you go to the other end, statemate,

16:27.240 --> 16:33.640
the code on Arm is 20% bigger than RISC-V. But if we look at the averages, we start to see

16:33.640 --> 16:42.840
that actually, on average, on code size, Arm is about 9% more compact than RISC-V. That's

16:46.120 --> 16:54.200
slightly worrying. It isn't where you want to end up, and it's something to be

16:54.200 --> 17:06.040
worked on. The effect of ISA extensions: this is looking at the benefit of the multiplication and

17:06.040 --> 17:14.040
compression extensions, and then we've looked at a couple of the CORE-V extensions as well.

17:14.040 --> 17:19.240
And I think when it comes to code size, what you see is that multiplication makes code a bit smaller,

17:19.240 --> 17:22.840
because you're able to just use multiply instructions rather than have, you know,

17:22.840 --> 17:29.720
all the emulation of multiplication. Compression, of course, makes a huge difference; the big

17:29.720 --> 17:35.800
drop down is compression. And then multiply-accumulate and the extended memory addressing

17:35.800 --> 17:44.840
instructions from CORE-V improve it further still. When we look at speed, then not surprisingly,

17:44.840 --> 17:50.280
having hardware multiplication makes a huge difference, so we get a big jump as we turn on M. The compression

17:50.280 --> 17:58.520
instructions slightly slow things down on the implementation I have, and then the MAC and

17:58.520 --> 18:07.560
the memory extensions from CORE-V slightly improve things further. So you can use benchmarking

18:07.560 --> 18:12.120
to measure the impact of extensions: is it a good extension? And I encourage people to do this

18:12.120 --> 18:19.880
pre-silicon, to check that as part of the extension design process. Code is not the only thing

18:19.880 --> 18:28.520
we can benchmark: what about the compiler community? Is the compiler community healthy? I've got the

18:28.520 --> 18:39.080
statistics here. In terms of commits, this is looking at GCC 14.1, and there were 158 RISC-V-specific

18:39.160 --> 18:50.200
commits for GCC 14.1 and 600 for LLVM. That is partly down to a huge contribution from Juzhe Zhong,

18:50.200 --> 18:59.800
if I've got the name right, in China, and his colleague Pan Li, who contributed the initial RISC-V vector

19:00.440 --> 19:05.240
implementation. That was a huge contribution in that release, so it made a big difference.

19:05.240 --> 19:11.960
In terms of the number of committers: 45 for RISC-V, 43 for Arm. The biggest committer: because of that

19:11.960 --> 19:19.480
vector work, 363 commits from the most prolific RISC-V committer, only 173 for Arm. The number of

19:19.480 --> 19:25.880
committers that accounted for 90% of the commits: 15 for RISC-V, 13 for Arm. Similar communities.
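
NOTE
Editor's sketch, not part of the talk: statistics like these can be pulled
from the GCC git history. The release tags and the backend path are
assumptions about the repository layout.
    import subprocess
    def backend_activity(repo, old, new, path="gcc/config/riscv"):
        # one author name per backend-specific commit between two release tags
        out = subprocess.run(
            ["git", "-C", repo, "log", "--format=%an", old + ".." + new, "--", path],
            capture_output=True, text=True, check=True)
        authors = [a for a in out.stdout.splitlines() if a]
        return len(authors), len(set(authors))
    commits, committers = backend_activity("gcc", "releases/gcc-13.1.0", "releases/gcc-14.1.0")
    print(commits, "commits from", committers, "committers")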

19:25.880 --> 19:30.440
And the number of companies committing: this is a bit strange, actually. There are a lot of companies

19:30.440 --> 19:35.320
committing to Arm, 12, even though Arm is only a single company; though you might expect some

19:35.320 --> 19:41.960
from Qualcomm and some from Apple. But you'd think RISC-V is used by all sorts of people. The lifting

19:41.960 --> 19:49.960
is being done by just 16 of those 4,000-odd RISC-V members, okay? And that is an alarm bell for me

19:49.960 --> 19:58.040
about the RISC-V community. Now, this is GCC, incidentally. You can do the same analysis for LLVM,

19:58.040 --> 20:01.480
but I don't think it tells you anything different. And this is the graph

20:01.480 --> 20:06.680
looking at the contributions: what percentage each contributor makes, ordered by

20:06.680 --> 20:13.080
count. And you can see the first two contributors for RISC-V make a huge contribution. That's also

20:13.080 --> 20:17.000
true of the first two contributors for Arm. And then you've got a body of contributors

20:17.000 --> 20:22.360
contributing regularly. And I think I talked about this in the talk I was giving over in the GCC room.

20:22.360 --> 20:29.400
There is a small corps of committers who commit more than once a week, and they're responsible

20:29.400 --> 20:37.320
for the majority of the code in the compiler. Okay? So I would say here that the community

20:37.320 --> 20:43.400
is reasonably healthy, but it's not big. And I am worried that of 4,000 members of the RISC-V

20:44.680 --> 20:50.120
community, only 16 are actually doing any work to make sure the compiler works. And that's

20:50.760 --> 21:01.160
a red flag for me. So I just want to finish with one example of using benchmarks. In my case,

21:01.160 --> 21:09.400
I'm a compiler engineer looking at improving the compiler. And I want to introduce you to a technique

21:10.520 --> 21:14.760
called iterative compilation and combined elimination; iterative compilation is the base technique, and

21:14.760 --> 21:20.760
combined elimination is an optimized version. What we did is we took the Embench benchmark suite

21:21.720 --> 21:30.040
and we compiled it with GCC and measured its code size. We got an

21:30.040 --> 21:39.240
Embench score of 1.0-something. Now, GCC has something like 300 flags controlling which optimization

21:39.240 --> 21:46.600
passes you turn on. We started off by turning pretty much all of them on. Then

21:46.600 --> 21:51.240
we did 1,000 runs. There's a whole load of parameters to control them too, so you've got about

21:51.240 --> 21:57.400
1,000 flags you can change. We tried changing each flag in turn, turning one optimisation off,

21:57.960 --> 22:03.560
and ran it all again. Over those 1,000 runs we saw that some of these optimisations were making

22:03.560 --> 22:09.160
things worse. Okay? And we found the one optimisation that was making things worse than anything else,

22:09.160 --> 22:16.200
and we disabled it. Then we went through and did the same thing with the 999 remaining flags

22:17.880 --> 22:23.240
to find the next worst one. And we kept going through, turning off flags, until none of the

22:23.240 --> 22:30.760
flags were making anything worse. That took about 35 rounds, and it took 24 hours on my laptop.

22:30.760 --> 22:38.920
This wasn't a big server. And the result of that was that we made the code 7% smaller.

22:39.000 --> 22:44.360
Now, for an automated run over 24 hours, that's quite a big win for no effort.

22:44.360 --> 22:49.400
And I didn't change the compiler at all. We did take away a lot of useful information about

22:49.400 --> 22:53.960
what was going wrong, which led to ideas that might further improve the compiler in future,

22:53.960 --> 22:58.920
from those flags we got rid of. But it's a very simple technique. That's iterative compilation.
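
NOTE
Editor's sketch, not part of the talk: the elimination loop just described,
in miniature. Here measure() stands for "rebuild the whole suite with these
flags and return its code size"; both it and the flag list are placeholders.
    def iterative_elimination(flags, measure):
        enabled = set(flags)
        while True:
            baseline = measure(enabled)
            # try turning each remaining optimisation off, one at a time
            trials = {f: measure(enabled - {f}) for f in enabled}
            worst_flag, best_size = min(trials.items(), key=lambda kv: kv[1])
            if best_size >= baseline:     # nothing is hurting code size any more
                return enabled
            enabled.remove(worst_flag)    # drop the most harmful flag and repeat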

22:58.920 --> 23:05.800
That was about 100,000 compilations I had to do over 24 hours; they're small compiles, a

23:05.880 --> 23:11.960
second or so each. The problem comes when you do this with much bigger systems. We've done this,

23:11.960 --> 23:20.520
optimising for the performance of SPEC CPU. And a single SPEC CPU benchmark, even with its test

23:20.520 --> 23:27.880
data set, takes a long time to compile and run. We're talking minutes, maybe hours, for the

23:27.880 --> 23:34.120
longest-running ones. So you can't do 100,000 runs as you turn the flags off. And that's where

23:34.120 --> 23:39.240
combined elimination comes in, because combined elimination is a set of heuristics for guessing

23:39.240 --> 23:47.560
what's not worth bothering about. We used that with one of our customers, who is developing a

23:47.560 --> 23:58.360
massively powerful RV64 many-core architecture. It took about three months to

23:58.360 --> 24:05.800
do all those runs. But we gave them a 15% improvement in execution speed. And that's not bad

24:05.800 --> 24:09.640
for something that, well, took three months. But it wasn't three months of paying someone;

24:09.640 --> 24:17.480
it was three months of waiting while your FPGA boards burned electricity. So that was a pretty

24:17.480 --> 24:20.920
good result. And we're going round again, because they've learnt from that

24:20.920 --> 24:24.920
and improved their architecture, and we're going to do a second round of work. Then they're going

24:24.920 --> 24:29.320
to add vector instructions, and we're going to do it again for vector. So it's all done on

24:29.320 --> 24:34.920
the back of benchmarking, whether it's Embench or whether it's SPEC CPU, and we've done it even for

24:34.920 --> 24:40.760
energy efficiency. For those of you who don't realize, compiled code makes a difference to energy

24:40.760 --> 24:47.560
consumption. It's not a huge difference, 5 or 10% with the right optimisation flags. And it's

24:47.560 --> 24:56.680
worth bearing in mind that the likes of Google and Amazon burn of the order of a gigawatt

24:57.480 --> 25:03.320
of electricity. And a useful rule of thumb is that a gigawatt costs a gigadollar a year. So if

25:03.320 --> 25:11.320
you save 5% or 10% of your gigadollar, that's a decent win to have. So I leave you with that as my

25:11.320 --> 25:16.120
final thought: benchmarking isn't just about measuring; you can use it to improve your systems.
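
NOTE
Editor's note, not part of the talk: checking the "a gigawatt costs a
gigadollar a year" rule of thumb. The electricity price is an assumed
round number.
    megawatts, hours_per_year = 1000, 24 * 365        # 1 GW, 8760 hours
    usd_per_mwh = 115.0                               # assumed bulk rate
    annual_cost = megawatts * hours_per_year * usd_per_mwh
    print(f"about ${annual_cost / 1e9:.2f} billion per year")   # roughly a gigadollar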

25:17.000 --> 25:20.920
And I'd just like to say thank you to these people. Dave Patterson, who leads the

25:20.920 --> 25:26.280
Embench initiative; it's an entirely open community. We meet on the second

25:27.400 --> 25:32.280
or third Monday of every month. Just sign up to the mailing list to

25:32.280 --> 25:40.200
join. We're working on the next Embench, for DSP-class processors. Come and join that.

25:40.760 --> 25:45.320
The OpenHW Foundation: a lot of this was done in support of the OpenHW Foundation

25:45.320 --> 25:51.880
and CORE-V. There are colleagues there. The GCC and LLVM communities: there's a tremendous

25:51.880 --> 25:57.240
amount of input and back-and-forth with them. And I know, particularly on GCC, there's a lot

25:57.240 --> 26:04.600
of SPEC CPU work being done by some of the companies working on RISC-V vector machines.

26:05.160 --> 26:10.600
And lastly, all my colleagues at Embecosm, because I get to come up and talk to you, but much of the work

26:10.600 --> 26:15.960
here has been done by other people, and I'm just showing off their work. So thank you all very

26:15.960 --> 26:19.800
much and I'll take any questions you may have.

26:27.320 --> 26:37.240
Hello. Thank you for your presentation.

26:37.240 --> 26:44.760
Can you speak up? Yeah, I'll repeat the question for this end.

26:44.760 --> 26:51.480
I want to understand the choice for the Arm: the 64-bit one, the Apple M1. It seems

26:51.480 --> 26:57.560
like it's in a different category. Yes. About the RISC-V we've got to test at the moment:

26:57.560 --> 27:03.080
it was being compared against a RISC-V core that's intended to be in a thousand-core

27:03.080 --> 27:10.120
high-performance computing scenario. So it was against a big core. I can't tell you which one

27:10.120 --> 27:17.400
because it's secret; that bit isn't open source. As for the Apple M1: we wanted a decent

27:17.800 --> 27:24.120
AArch64, and I've got an Apple M1 in the office. Sometimes your choices are governed by

27:24.120 --> 27:29.960
simple things, like: I've got one in the office. And the question was, were we comparing

27:29.960 --> 27:35.240
like for like when we were comparing the Apple M1 against RISC-V 64?

27:37.640 --> 27:38.840
Okay, shall I have it?

27:39.400 --> 27:46.600
So, you mentioned why PGO makes things worse. Why does LTO make things worse?

27:47.640 --> 27:54.760
So, the question is: I explained why PGO made things worse, but not why LTO made things worse.

27:54.760 --> 27:58.520
The answer is that it's still an open question. We haven't actually had the time;

27:59.320 --> 28:04.120
we're a commercial compiler company, and people haven't paid me to have the time to investigate that further.

28:05.000 --> 28:11.720
But it is a cause of deep frustration because LTO can sometimes be the best optimization you have available

28:11.720 --> 28:16.680
with some applications. I do not understand why it doesn't do a good job with RISC-V, and

28:17.960 --> 28:22.120
if someone would like to say 'please could you investigate this?', I'm happy to take on the business.

28:25.000 --> 28:31.960
Can you just, you said about the vector extension that sometimes

28:32.280 --> 28:38.520
fewer instructions may not help; can you explain the reasoning a bit more? Yes. So, when you compile

28:38.520 --> 28:43.720
for RISC-V vector, you may well replace a whole load of scalar instructions by a single vector

28:43.720 --> 28:49.160
instruction; that's the whole point of it. Auto-vectorization works well; both GCC and LLVM support

28:49.160 --> 28:57.960
RVV. But RISC-V vector instructions are not necessarily as quick. Okay, so you may

28:57.960 --> 29:04.360
end up replacing a few quick scalar instructions by a vector instruction that takes rather longer.

29:04.360 --> 29:09.640
It's hugely dependent on the implementation. And certainly the initial analysis suggests that

29:09.640 --> 29:13.880
might be happening with some of these early RISC-V vector implementations.

29:16.280 --> 29:24.120
And QEMU, yes, shows it uses fewer instructions. But if I've got five instructions that each

29:25.000 --> 29:29.400
run in one nanosecond, and I replace them by one instruction that runs in 10 nanoseconds, that's a loss.

29:30.280 --> 29:34.280
That's the problem. And we don't know the answer. I had hoped to have the answer. I have

29:34.280 --> 29:40.280
got a spreadsheet here, which I finally got working at two o'clock this afternoon.

29:40.840 --> 29:45.400
And I have not managed to get the comparison done; it's only Embench, it's not

29:45.400 --> 29:51.800
SPEC CPU; to really see whether RISC-V vector goes plus or minus on that. Yeah, I had hoped

29:51.800 --> 29:54.280
to have that by today, but I just ran out of hours.

29:57.880 --> 30:03.560
Regarding the compression: so there are now some new instructions that came out for

30:03.560 --> 30:07.800
more compression. One of them is now part of the RISC-V standard.

30:10.200 --> 30:14.360
Yeah. So did you also benchmark those, or just the base C extension?

30:14.920 --> 30:20.840
So the question is: what about the Zc*, the new generation of compression extensions?

30:20.920 --> 30:24.920
I didn't measure those, mostly because the work I was doing was with a particular

30:24.920 --> 30:30.040
implementation that didn't have Zc* in it. I have done some work looking at code size, because

30:30.040 --> 30:37.480
I don't actually have to run the code to measure code size for Embench. And I actually benchmarked

30:37.480 --> 30:43.720
the code size effect of all the different extensions, then combined them. And I think that came to the conclusion that

30:43.720 --> 30:48.920
across Embench 1.0, you got between 3 and 4% improvement in code size

30:49.640 --> 30:55.880
over standard C. Okay, so that's not as much as had been predicted, but I think

30:55.880 --> 30:59.960
it's still worth having. And that was with the first generation of compiler support;

31:01.000 --> 31:03.080
it may improve with future versions of the compiler.

31:11.720 --> 31:16.360
Yes, a lot of work. I've been working with various colleagues on the QEMU side of this;

31:16.360 --> 31:21.400
Nathan's been working with me on that. And actually the length of the vector is hugely

31:22.280 --> 31:28.520
influential. So, yeah, I can't just say 'here is a RISC-V vector result'; I've got to say what length

31:28.520 --> 31:36.360
of vector; this one is length 256. It definitely makes a big difference. Yeah, question?

31:46.440 --> 31:51.800
For GCC, some combinations of flags can be a little unreliable, and it even sometimes crashes.

31:53.000 --> 31:57.480
So did you have occasion to check different combinations of flags? Because if you combine

31:57.480 --> 32:04.040
those, do you always get the exact same results, or will it crash sooner or later?

32:04.040 --> 32:09.880
So I confess that I was only doing code size; I didn't try and run the programs. Okay, so

32:10.840 --> 32:17.320
one of the things: Embench 1.0 took a decision that when it did its code size benchmarking,

32:17.640 --> 32:21.560
it would do it with dummy libraries that were always the same size, so we could get rid of the

32:21.560 --> 32:26.440
library overhead. That meant, of course, we had no way of checking whether the results were real. That was a

32:26.440 --> 32:31.880
mistake I made in Embench 1.0, I think. Okay, one of the good things that comes with Embench 2.0 is

32:31.880 --> 32:37.240
that the stupidity of that decision has been made abundantly clear. So Embench 2.0 does do code size,

32:37.320 --> 32:40.360
but it makes sure you can also execute the code, so you check you get the right result.

32:41.240 --> 32:46.760
I know with the stuff we did with SPEC CPU, we did have a few cases where

32:47.560 --> 32:53.640
the combined elimination blew up because a compilation failed, or an execution failed.

32:53.640 --> 32:56.840
Now, if execution failed, it might have been because it's an architecture under development:

32:56.840 --> 33:00.520
it could have been a bug in the architecture, or it could have been a bug in the code generation.

33:01.480 --> 33:05.400
But you know, as a compiler developer, half the time is spent being

33:05.480 --> 33:09.720
a compiler tester, because that's where all of that shows up, I guess. Yes?

33:26.440 --> 33:31.640
Yeah, so the question is: are we exercising things like atomics and synchronization in Embench?

33:31.720 --> 33:38.680
Embench is aimed at the very smallest of chips. It is defined to require no more than 64

33:38.680 --> 33:43.000
kilobytes of memory. It's the sort of thing that's in the electronic key lock on your hotel room door.

33:43.000 --> 33:47.720
Synchronization is not of any significance there. They're generally single-threaded, they're not doing

33:47.720 --> 33:51.320
complex synchronization, they're not running an operating system, they're bare metal, so it's not particularly

33:51.320 --> 33:56.600
relevant. When we come to things like Embench DSP, which is just coming out now, that's going to be

33:56.600 --> 34:01.080
the sort of thing we will have to worry about there. Last question.

34:26.600 --> 34:39.080
So the question is about using handwritten assembly optimizations. I'll make two comments about

34:39.080 --> 34:43.160
that. Yeah, you're right, SPEC doesn't allow you to go around hacking the source code if it's to be

34:43.160 --> 34:49.320
a valid SPEC score. You do optimize: you turn on whatever optimisation flags you want. I just want

34:49.320 --> 34:54.440
to qualify one thing. I get this from my customers: oh, I've improved the code, because I hand-

34:54.440 --> 34:59.000
optimised this by writing assembler. Why is it going 100 times slower? To which the answer is:

34:59.000 --> 35:03.000
don't try and second-guess a modern compiler. It will generally do it better than you. You've

35:03.000 --> 35:07.480
just wrecked the data flow analysis, and the optimisation's gone out the window. So that wasn't

35:07.480 --> 35:11.960
the question you asked, but I used it as an excuse to make that comment, as I do any time I get the chance. So thank

35:11.960 --> 35:18.120
you all very much; we're out of time. I'm around all weekend, so please do come and find me with any questions.

