WEBVTT

00:00.000 --> 00:09.480
Thank you very much to everybody attending this talk, and to the organisers of the

00:09.480 --> 00:12.560
conference and this track.

00:12.560 --> 00:15.120
I'm Chema Casanova, and she is Maíra Canal.

00:15.120 --> 00:21.240
We are both presenting this talk about getting more juice out of the Raspberry Pi GPU.

00:21.240 --> 00:24.000
A brief introduction.

00:24.000 --> 00:27.200
We work together in the graphics team at Igalia.

00:27.200 --> 00:32.000
We have been working on the graphics stack for the Raspberry Pi for the last years.

00:32.000 --> 00:36.680
I'm mainly working on the user-space side, related to Mesa, and she has been working on the

00:36.680 --> 00:41.280
kernel side of the project, supporting the different parts.

00:41.280 --> 00:45.680
So, well, we are talking about the latest Raspberry Pi.

00:45.680 --> 00:53.920
We're going to focus on the Raspberry Pi 5, which of course launched in 2023, around October,

00:53.920 --> 00:54.920
I think.

00:54.920 --> 01:01.400
And it uses the next generation of the Broadcom VideoCore architecture, which is

01:01.400 --> 01:02.400
number 7.

01:02.400 --> 01:07.680
It's the next step after the Raspberry Pi 4, with several improvements, mainly

01:07.680 --> 01:13.400
support for up to eight render targets, which allows us to get support

01:13.400 --> 01:21.640
for desktop OpenGL 3.1; it improves several operations, and there is more parallel

01:21.640 --> 01:22.640
ization.

01:22.640 --> 01:26.400
We have more flexibility to change the way that we handle the registers.

01:26.400 --> 01:32.640
And, well, from the moment the hardware was available on the market, there

01:32.640 --> 01:41.360
was source code available on the kernel side and on the Mesa side, completely upstream.

01:41.360 --> 01:48.960
So I'm going to give a brief introduction to what the graphics stack is like across the Raspberry

01:48.960 --> 01:51.520
Pi generations.

01:51.520 --> 01:54.600
Because sometimes it's complex to identify which drivers you are using.

01:54.600 --> 02:00.480
If you're using a Raspberry Pi 1, 2 or 3, it is based on the VideoCore IV generation.

02:00.480 --> 02:05.600
On the kernel side, we have the vc4 driver, which is the one that handles the display and the

02:05.600 --> 02:06.600
rendering.

02:06.600 --> 02:14.000
On the user-space side, we have a vc4 driver that supports OpenGL ES 2.0.

02:14.000 --> 02:18.600
When we change to the newer generations, the Raspberry Pi 4 and 5 have Video

02:18.600 --> 02:20.520
core 6 and 7.

02:20.520 --> 02:25.960
We maintain the vc4 name, but it's only for the display, the things that you put on the

02:25.960 --> 02:28.320
screen at the end.

02:28.320 --> 02:32.240
On the render side, there is a different driver, which is v3d.

02:32.240 --> 02:37.360
It has the same name as the user-space driver, v3d, which we use for OpenGL

02:37.360 --> 02:39.520
and OpenGL ES.

02:39.520 --> 02:44.440
We also developed the Vulkan driver, which is v3dv.

02:44.480 --> 02:52.000
Those are the complete names of the different modules and drivers.

02:52.000 --> 02:57.440
Now I'll focus on the part that we thought was more interesting

02:57.440 --> 03:03.600
for this presentation. We are talking about performance, and I'm going to talk about the

03:03.600 --> 03:11.120
user-space part. We use the v3d and v3dv drivers, which are part of Mesa.

03:11.120 --> 03:16.120
So the common infrastructure that is available for all the drivers in the open-

03:16.120 --> 03:18.240
source stack is there.

03:18.240 --> 03:26.240
In our case, the OpenGL driver supports up to version 3.1 of desktop OpenGL, because of

03:26.240 --> 03:32.560
limitations of the hardware — some features have to be emulated in some way — and

03:32.560 --> 03:39.840
also OpenGL ES 3.1, the version of OpenGL adapted for embedded devices. That was the situation

03:39.840 --> 03:44.400
when the Raspberry Pi 5 was launched, and it was the same on the Raspberry Pi 4.

03:44.400 --> 03:51.880
On the Vulkan side, when the product was launched, we supported, at that time, Vulkan

03:51.880 --> 03:59.520
1.2, and a few months ago we got the conformance for the new version, Vulkan 1.3.

03:59.520 --> 04:05.920
If you are interested in what we support in the APIs, we have a presentation from a previous

04:05.920 --> 04:14.360
XDC where we went into detail about the new extensions and what you can do with these drivers.

04:14.360 --> 04:22.720
Today we would like to focus on performance. Last year we focused on working on some scenarios,

04:22.720 --> 04:31.480
namely the ones where we are GPU-limited: working at high resolution,

04:31.480 --> 04:39.240
which implies that the limiting factor is the GPU. And we have been working with the

04:39.240 --> 04:48.080
GFXBench suite, a common industry standard for analyzing the performance of mobile

04:48.080 --> 05:00.720
phones. And measuring performance from, well, December 2023 to the end

05:00.720 --> 05:07.720
of last year, 2024, we got on average a 100% performance increase.

05:07.720 --> 05:18.400
We did this analysis on Android 15, because GFXBench is not available for

05:18.400 --> 05:27.120
arm Linux, so we needed to use this version of Android, thanks to existing work

05:27.120 --> 05:36.120
that lets us run the Mesa user-space driver on Android. And then we have the real results

05:36.120 --> 05:42.760
from the commit at the end of last year to just one month ago, so we see that there

05:42.760 --> 05:48.080
are different demos, and we are getting in some cases a performance improvement from

05:48.160 --> 05:56.880
this, going from basically a slideshow to getting 13 frames per second in some cases. So,

05:56.880 --> 06:02.880
well, now we're going to go into detail on the main things we have been

06:02.880 --> 06:12.240
doing. We need to understand a bit how a tile-based renderer works, which is the kind of

06:12.320 --> 06:21.040
design that we find in the Broadcom GPUs. In our case, when you prepare a job to submit

06:21.040 --> 06:29.640
to the GPU, there are two stages. The first one is that you prepare, for the GPU, a

06:29.640 --> 06:35.120
bin job, which is the one that is in charge of doing the following: you will have multiple

06:35.280 --> 06:42.720
draw calls, so it's going to analyze the geometry and identify which parts of the frame

06:42.720 --> 06:48.320
buffer each draw call is affecting; so at the end it creates, for each tile —

06:48.320 --> 06:55.680
a tile is a small piece of the image — the list of draw calls affecting it. With these lists, we go to

06:55.680 --> 07:04.080
the render job, which reads those lists and starts loading each tile; it already knows which

07:04.160 --> 07:10.720
draw calls are affecting that tile, and it only executes that part. So it loads the tile once,

07:11.280 --> 07:19.040
does all the draw call operations, and then stores the result of that.
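The bin/render split described here can be sketched in a few lines; the tile size, names, and data layout below are illustrative assumptions, not the real V3D job format:

```python
# Hypothetical sketch of a two-stage tile-based pipeline (bin job + render job).
TILE = 64  # tile edge in pixels (illustrative, not the real V3D tile size)

def bin_job(draw_calls, fb_w, fb_h):
    """Binning: build, per tile, the list of draw calls whose screen-space
    bounding box touches that tile."""
    tiles_x = (fb_w + TILE - 1) // TILE
    tiles_y = (fb_h + TILE - 1) // TILE
    tile_lists = {(tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)}
    for dc in draw_calls:
        x0, y0, x1, y1 = dc["bbox"]  # bounding box of the geometry, in pixels
        for ty in range(max(0, y0 // TILE), min(tiles_y, y1 // TILE + 1)):
            for tx in range(max(0, x0 // TILE), min(tiles_x, x1 // TILE + 1)):
                tile_lists[(tx, ty)].append(dc["name"])
    return tile_lists

def render_job(tile_lists):
    """Rendering: load each touched tile once, run only the draw calls that
    affect it, store the result once. Untouched tiles cost nothing."""
    ops = []
    for tile, dcs in tile_lists.items():
        if not dcs:
            continue  # never loaded, never stored
        ops.append(("load", tile))
        ops.extend(("draw", tile, dc) for dc in dcs)
        ops.append(("store", tile))
    return ops
```

This also shows why splitting work into more jobs hurts: every extra job repeats the load/store pair for each tile it touches.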

07:19.040 --> 07:25.680
So the main way of getting more performance is avoiding these loads and stores, because if you do more loads and

07:25.680 --> 07:34.160
stores, splitting things into different jobs, it costs you. So the first optimization that gave us a

07:34.160 --> 07:41.200
great performance increase — around 40% on average — was one that we discovered maybe

07:41.200 --> 07:46.240
by chance; it was an extension that was being tested, which was hitting, in the driver, a

07:46.240 --> 07:51.200
suboptimal implementation: the driver was waiting, because if you were writing to a texture

07:51.760 --> 07:57.760
and then going to sample it, there was a job finishing to store it, and another one to load it;

07:57.760 --> 08:03.040
but if the framebuffer configuration was the same, you could reuse it — you had access to the texture

08:03.040 --> 08:10.480
in the cache of the GPU — so the results would already be available;

08:10.480 --> 08:16.080
removing that load and store got us really nice results. We can see here two different streams:

08:17.040 --> 08:22.240
on the right side is the current version, and on the left side the old one;

08:23.840 --> 08:32.720
you can feel the difference, from 16 to 24 frames per second. We also did a lot of compiler

08:32.720 --> 08:38.480
optimizations — maybe there are like 15 merge requests — improving things, reducing the stalls,

08:39.680 --> 08:44.640
improving the scheduling of the instructions; and with that work we reduced the

08:44.720 --> 08:52.160
number of instructions we run by almost 5%, and we got a performance gain of about 3.5%

08:52.160 --> 08:57.520
on average. A lot of work and, well, maybe not so amazing next to the previous one,

08:57.520 --> 09:04.400
which was just one commit once the issue was identified. We also took advantage of the fact

09:04.400 --> 09:10.640
that you have the ability, in a tile-based renderer, to skip stores: if you don't need the

09:10.640 --> 09:16.560
results at the end of the render — which usually happens with depth or stencil buffers:

09:16.560 --> 09:25.040
you need them to render, but in some cases you can avoid storing them at the end — so, applying

09:25.040 --> 09:32.480
some heuristics, we could improve that behavior, and we got another 1% of improvement. This

09:32.560 --> 09:41.600
kind of store skipping has more uses: not in this demo, but Google Chrome does use this

09:41.600 --> 09:47.920
operation, and it improves the results there. Another interesting improvement was the work we did

09:47.920 --> 09:54.080
on the early fragment test optimization, which is supported by the hardware,

09:54.080 --> 09:59.360
but there are some situations where you cannot use it. One of them is when you have a

09:59.360 --> 10:06.240
draw call that has a discard operation, which is to say, well, you may not write anything

10:06.240 --> 10:12.560
for a fragment to the framebuffers, and in that case, you need to disable the optimization.

10:12.560 --> 10:20.160
But there is a situation, a scenario, that allows you to keep it: if depth writes are disabled

10:20.160 --> 10:27.120
for that draw call. So with that, we got a 14% performance improvement. Everything I

10:27.200 --> 10:33.200
am telling you accumulates — 40% in one case, on top of the rest — but it is not linear.
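The rule just described — disable early fragment tests when the shader can discard, unless depth writes are off — can be sketched as a tiny decision function. Names are hypothetical, not the actual Mesa code:

```python
# Hypothetical sketch of the early-fragment-test decision described above.
def can_use_early_fragment_tests(shader_has_discard, depth_writes_enabled):
    """Early tests run depth/stencil before the fragment shader.

    A shader with `discard` may kill a fragment after the early test would
    already have updated the depth buffer, so the optimization must normally
    be disabled -- unless depth writes are disabled, in which case the early
    test only reads and cannot leave a wrong value behind.
    """
    if not shader_has_discard:
        return True
    return not depth_writes_enabled
```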

10:35.360 --> 10:45.600
The last one I would like to comment on: some kinds of jobs, usually happening in

10:45.600 --> 10:51.280
a situation, in a scenario, where we have transform feedback, where we are only interested

10:51.280 --> 11:00.480
in the results of the geometry; the application can disable the rasterization, so you do not

11:00.480 --> 11:04.800
need to execute the fragment shading at the end, because you are not interested in the

11:04.800 --> 11:13.120
rendered result at the end. In the case of Manhattan, this happens a lot. So for every transform

11:13.120 --> 11:19.440
feedback operation, if rasterization is not enabled for any of the draw calls that

11:19.440 --> 11:25.920
the job contains, you can disable the load and the store of the buffers. Maybe you have

11:25.920 --> 11:31.680
five transform feedback calls, which would imply, for each frame, five loads and five stores that

11:31.680 --> 11:36.400
you are not going to use; so taking that into account improved the performance a lot.
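A minimal sketch of that job-planning idea: when no draw call in a job actually rasterizes (transform feedback with rasterizer discard), the tile load and store can be dropped entirely. Function and field names here are illustrative assumptions, not the driver's API:

```python
# Hypothetical sketch: skip framebuffer traffic for non-rasterizing jobs.
def plan_job(draw_calls):
    """Return the operations a job needs; drop tile loads/stores entirely
    when no draw call in the job rasterizes anything."""
    rasterizes = any(not dc.get("rasterizer_discard", False) for dc in draw_calls)
    ops = []
    if rasterizes:
        ops.append("load_tiles")
    ops.extend(f"draw:{dc['name']}" for dc in draw_calls)
    if rasterizes:
        ops.append("store_tiles")
    return ops
```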

11:36.400 --> 11:41.360
Manhattan is one of the demos that gains the most. You can see, on the left side, before all the optimizations

11:41.360 --> 11:48.640
I have commented on today, and on the right side the latest result, in this case at 230% of the performance.

11:48.720 --> 11:57.680
These results are obtained with traces of the benchmark executions, because it is easier to

11:57.680 --> 12:03.520
compare the same trace, rather than doing live runs where we do not have the guarantee

12:03.520 --> 12:10.640
of having the same frames drawn. And we can see here, for the different commits on the right,

12:10.640 --> 12:16.640
the performance improvement at the different moments. We can see that,

12:16.960 --> 12:22.320
in the case of Manhattan, the improvement is huge; we saw on the previous slide

12:22.320 --> 12:28.480
that we are almost near 300% of the performance of the original, from the end of 2023.

12:30.240 --> 12:34.320
There was also a lot of work on performance tools, and that is what Maíra is going to

12:34.320 --> 12:42.240
explain now; I hand her the floor. Okay, so the thing about studying performance is that we also need

12:42.640 --> 12:49.360
tools to help us measure the performance, the current performance, and see scenarios where we

12:49.360 --> 12:56.320
would like to work to get better performance. So the first thing that we did was CPU jobs and time

12:56.320 --> 13:05.200
stamp queries. Last FOSDEM, Chema and I talked about how we implemented CPU jobs, because there

13:05.360 --> 13:12.240
are some Vulkan commands that we cannot perform on the GPU alone, so we had CPU jobs,

13:12.240 --> 13:18.640
and we moved the CPU jobs from the user space to the kernel space in order to avoid GPU

13:18.640 --> 13:27.600
flushes and CPU stalls. And in 2023, we landed timestamp queries and the CPU jobs in the

13:27.600 --> 13:36.560
kernel, in the Vulkan driver only. But last year, we were able to support timestamp queries

13:36.560 --> 13:44.080
in Mesa as well, for the GL driver. And using timestamp queries is really useful for us when

13:44.080 --> 13:50.960
analyzing performance, because it helps us identify jobs that are taking longer, and if a

13:51.040 --> 13:58.960
job is taking longer, we can analyze it and think about new ways to improve that job. That is

13:58.960 --> 14:05.840
probably a scenario that is also happening in other applications. And with timestamp queries,

14:05.840 --> 14:13.280
we can have very accurate timestamps that are properly synchronized with the

14:13.280 --> 14:21.120
graphics pipeline, so this is really helpful for evaluating the timing of the jobs.

14:21.840 --> 14:29.600
And we also implemented Perfetto support. Perfetto is an open-source stack for performance

14:29.600 --> 14:38.560
instrumentation; it means that we can access system-level information and also app-level

14:38.560 --> 14:48.160
traces, to help us analyze, basically, data from your whole system. And we also have Mesa data sources

14:48.160 --> 14:56.800
in Perfetto, which means that now we can also add producers for GPU information, such as frequency,

14:57.440 --> 15:06.640
utilization, and performance counters, and this helps us have a unified timeline to work on performance

15:06.720 --> 15:13.920
debugging and performance tuning. So you can see in this slide that you can have a

15:13.920 --> 15:20.640
system-wide view in a timeline, which is very useful, because at the top you have, like, CPU information,

15:20.640 --> 15:27.680
the CPU frequency, and more that, I mean, couldn't fit here. But I only opened the CPU

15:27.680 --> 15:34.960
information and the DRM fences, because we use fences to synchronize jobs in the kernel. And you

15:35.040 --> 15:40.800
can see all the fences that are being used in the jobs. And right here we have information

15:40.800 --> 15:47.920
from Mesa that is coming from GLMark2, which is a well-known OpenGL benchmark. And we can see,

15:47.920 --> 15:53.200
you know, when the job was submitted, and if you look at the fences, you can understand when the

15:53.200 --> 16:00.320
fence was signalled at the job end; you can see, like, operations where we are waiting for the

16:00.400 --> 16:06.240
fence in user space. So this is amazing, because we can have a system-wide view and

16:06.240 --> 16:14.720
identify places where we can start thinking about performance. And now, jumping to the kernel work.

16:16.480 --> 16:26.640
Last year we started enabling a feature in V3D that was historically unused. The V3D GPU

16:26.720 --> 16:34.640
has support for 4-kilobyte pages, 64-kilobyte pages that are called big pages, and one-

16:34.640 --> 16:41.840
megabyte pages that are called super pages. Enabling them looks very simple: you just need to have

16:41.840 --> 16:49.360
a contiguous memory block of, you know, one megabyte, for example, and add the page table entries.

16:50.160 --> 16:57.040
But the Linux driver didn't have support for it. And, I mean, you can see why it's beneficial:

16:57.040 --> 17:03.840
it's just like on the CPU, you know. We can improve the performance by using huge pages,

17:03.840 --> 17:13.040
reducing the MMU misses, especially when we have memory-intensive applications. Nowadays,

17:14.000 --> 17:21.920
shaders have, like, large buffer objects, so this is important. But we had a very important

17:21.920 --> 17:29.440
issue: we couldn't get a contiguous block of memory using shmem, the default in DRM,

17:29.440 --> 17:35.920
which is the graphics subsystem in the kernel. So we had to think about a solution for it.

17:36.320 --> 17:44.160
By default, tmpfs and shmem allocate memory in page-size chunks. This means that if your

17:44.160 --> 17:50.640
page size is 4 kilobytes, we are going to allocate 4-kilobyte chunks. But we needed a contiguous

17:50.640 --> 17:58.400
block of memory bigger than one page, right? So we decided to create a tmpfs mount point with the

17:58.400 --> 18:04.640
huge=within_size option. What this means is that we enable transparent

18:04.640 --> 18:10.960
huge pages support in that mount point. Transparent huge pages is something that exists in

18:10.960 --> 18:18.240
the kernel for a while. And it's basically an abstraction that helps us utilize huge pages

18:18.240 --> 18:24.080
without the application having to manage them explicitly. It's an abstraction that lets

18:25.040 --> 18:31.040
the applications, basically, see the allocation as a contiguous block of memory, and they just

18:31.040 --> 18:36.160
don't need to understand what's going on underneath. It's a bit different from an explicit huge page, for example.
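For reference, the tmpfs mount described here looks roughly like the following; the mount point path is just an illustrative example, not the one the driver actually uses, and `huge=within_size` is one of the tmpfs THP policies documented in the kernel admin guide:

```shell
# Illustrative: mount a tmpfs with transparent huge pages enabled.
# Valid huge= values are: never, always, within_size, advise.
mount -t tmpfs -o huge=within_size tmpfs /mnt/example-thp
```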

18:37.680 --> 18:44.640
With that contiguous block of memory, it's just a matter of placing the page table entries

18:44.640 --> 18:52.160
in the right places, setting the bits, and then it's done. We also work on reducing the virtual

18:52.160 --> 18:58.720
address alignment, the virtual address alignment, to four kilobytes. This helps us reduce the memory pressure

18:59.760 --> 19:04.800
on the Raspberry Pi, because, you know, memory is very limited on embedded devices. And we were using

19:05.360 --> 19:18.560
a 128-kilobyte virtual address alignment, and this was basically wasting addresses in our virtual address space.
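The address-space cost of a coarse alignment is easy to see with a little arithmetic; the buffer sizes below are made up for the example, and only the two alignment values (128 KiB vs 4 KiB) come from the talk:

```python
# Illustrative arithmetic: virtual address space consumed by aligned allocations.
def aligned_span(sizes, alignment):
    """Total address space used when each buffer start is rounded up to
    `alignment` (simple bump allocator)."""
    addr = 0
    for size in sizes:
        addr = -(-addr // alignment) * alignment  # round up to the alignment
        addr += size
    return addr

buffers = [4096, 6000, 300, 20000]  # hypothetical buffer-object sizes in bytes
wide = aligned_span(buffers, 128 * 1024)  # old 128 KiB alignment
tight = aligned_span(buffers, 4 * 1024)   # new 4 KiB alignment
```

Even with four small buffers, the 128 KiB alignment burns an order of magnitude more address space than the 4 KiB one.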

19:18.640 --> 19:23.680
So we reduced it, and reducing memory pressure was really useful. We had an average

19:23.680 --> 19:31.680
improvement of 1.33%, which is not that impressive, but we had a significant performance boost

19:31.680 --> 19:39.280
in some emulation cases. Just remember that when we are using an embedded device, it's important to

19:39.280 --> 19:47.680
set madvise when using transparent huge pages, otherwise it can use a lot of memory

19:47.680 --> 19:57.920
that we would really want to avoid. This is a demo running in a PS2 emulator, with Burnout

19:57.920 --> 20:05.120
3. And you can see the difference, you know: it's just this small feature in the kernel, and you

20:05.120 --> 20:16.160
can see a huge difference in these applications that utilize big buffers. Apart from that,

20:16.160 --> 20:21.680
I mean, it is really important to use huge pages, but we had an issue with them:

20:24.080 --> 20:32.400
huge pages, THP, by default use huge pages of the size of the

20:33.200 --> 20:42.080
PMD. On arm64 this means two megabytes. And, as you can see, our interest is just 4 kilobytes,

20:42.160 --> 20:50.000
64 kilobytes, and 1 megabyte. So we didn't really need to use two megabyte pages. This was

20:50.000 --> 21:00.000
leading to some unnecessary fragmentation in our system. So we decided to use multi-size THP, which

21:00.000 --> 21:11.840
allows us to use huge pages only from 64 kilobytes to 1 megabyte. So we are just selecting the range

21:11.920 --> 21:17.920
of page sizes that we want to support. And mTHP is something that exists in the kernel and

21:17.920 --> 21:24.560
gives us the ability to allocate memory in blocks that are bigger than the base

21:24.560 --> 21:33.760
page size, but smaller than the traditional PMD size. And we created two kernel command-line parameters

21:33.760 --> 21:40.080
to help us set the policies that we wanted for the pages. Because transparent huge pages,

21:40.720 --> 21:53.120
multi-size THP, in shmem, had an issue: we also had to configure the THP through sysfs. And, as

21:53.120 --> 21:58.800
you know, with sysfs, every time we reboot, it goes back to the defaults. And this is not great:

21:58.800 --> 22:04.880
when we have this and we want to create a product for a client, I mean, we need to have the

22:04.880 --> 22:09.920
configuration set for the client with the best performance. So we created these kernel command

22:09.920 --> 22:16.720
lines where you can just set the policy for the transparent huge pages, just as you do for

22:16.720 --> 22:23.200
tmpfs, for example, but this is for shmem. And you can use different policies for different

22:23.200 --> 22:30.480
page sizes. And then you can just configure that, say, 16 kilobytes to 64 kilobytes is going to have

22:30.480 --> 22:36.640
the policy 'always'. And this is really useful for anyone with an application using shmem.

22:37.280 --> 22:43.920
Our case is that we use shmem to back our buffer objects, so it's really useful for

22:43.920 --> 22:46.400
huge pages, but it can have other applications.
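As an illustration, the per-size shmem policy can be set through sysfs (which resets on reboot) or made persistent on the kernel command line; the parameter names below follow the upstream shmem mTHP work as we understand it, so check the admin guide of your kernel version before relying on them:

```shell
# Per-size shmem THP policy via sysfs (lost on reboot):
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/shmem_enabled
# Illustrative persistent equivalent on the kernel command line
# (parameter names from the recent shmem mTHP patches):
#   transparent_hugepage_shmem=within_size thp_shmem=64K-1M:always
```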

22:48.400 --> 22:50.400
So that's all. Questions?

22:51.360 --> 22:57.360
APPLAUSE

23:04.480 --> 23:08.640
Hi, did you have to make any trade-offs while writing the compiler optimizations — say,

23:09.440 --> 23:12.640
like, longer compilation times or higher register pressure or something of the sort?

23:13.600 --> 23:21.440
Well, there are a lot of different heuristics there. It depends whether you want to optimize the

23:21.440 --> 23:27.600
compiler to be faster at compiling and not try different strategies, or to get the best performance

23:27.600 --> 23:33.600
in some cases. But the default is, you try to get the maximum number of threads working, and

23:33.600 --> 23:39.040
the system already has a shader cache. So once a shader is built, you have the cached

23:39.040 --> 23:41.520
version, so it kind of takes care of that.

23:44.800 --> 23:52.400
So, the question is about the hardware: there is now the 16-gigabyte model, and about

23:52.400 --> 24:00.240
the GPU benchmarks. And, I mean, you know, even the CPU revision changes between the eight-

24:00.240 --> 24:07.040
gigabyte and the sixteen-gigabyte models, so the defaults change, and, you know,

24:07.120 --> 24:16.000
the memory available. Have you tried it on the 16-gigabyte device?

24:16.000 --> 24:18.880
Yeah, we haven't tried that product yet.

24:18.880 --> 24:31.360
I think there is a difference, I think there is a difference from the eight-gigabyte one, because of the memory

24:31.440 --> 24:35.760
handling, and, I mean, have you talked with Broadcom?

24:35.760 --> 24:41.200
Yes, maybe; I do not know about that, but we know that there is a difference in the memory

24:41.200 --> 24:46.080
handling there, and there is work on that; I don't know if it is already available.

24:54.720 --> 25:00.400
Hi, how much support, if any, did you get from Broadcom, or the Raspberry Pi Foundation,

25:00.480 --> 25:05.680
to implement this? We are working for Raspberry Pi, at the end, so.

25:08.080 --> 25:12.560
Broadcom provides the documentation. We can read the specs.

25:21.200 --> 25:26.480
Hello, thank you, nice presentation. If you compare the drivers that Broadcom gives

25:27.120 --> 25:34.160
to their customers, their proprietary customers — their proprietary drivers —

25:34.960 --> 25:40.800
compared to the open-source drivers in Mesa that you use here, do you see —

25:41.840 --> 25:45.440
are they the same, or do you see a big difference in performance?

25:45.440 --> 25:52.320
We cannot check the difference, because the drivers are not for the same kernel; they are different

25:52.320 --> 25:57.680
things, in that case. We don't have the comparatives. Are they on par, or are they,

25:57.680 --> 26:03.760
you don't know at all? We don't know, we know that there is room for improvement from their numbers,

26:03.760 --> 26:09.680
but we cannot do the same run on both configurations, because we don't have access to the other

26:09.680 --> 26:17.120
one, and we need to work on the same platform. One more question, do you have a way to

26:18.080 --> 26:27.600
to see, to ask the GPU driver what processes — per process ID —

26:29.600 --> 26:36.800
what resources they are using in the GPU? I'll tell you the use case: you have an application

26:36.800 --> 26:43.280
manager on the system. We landed, I think, one or two years ago, the GPU stats, so you can get

26:43.280 --> 26:52.000
information from fdinfo. In 6.8. From 6.8 it is available upstream, and downstream in Raspberry Pi OS it is already available.

26:52.000 --> 26:57.360
You can use gputop, I think. Yeah, it shows the information.

26:57.360 --> 27:00.240
Those tools are there. Thank you.

