WEBVTT

00:00.000 --> 00:13.400
Welcome to my talk. Thank you for attending. I'm Vladislav. I'm a tech and rendering

00:13.400 --> 00:22.880
team lead, and in my free time I'm working on 0 A.D. 0 A.D. is a free and open-source

00:22.880 --> 00:28.440
cross-platform game. If you like games such as Age of Empires, Empire Earth, or StarCraft, then you

00:28.440 --> 00:35.920
might find 0 A.D. interesting as well. It's a cross-platform game, so it works on Linux,

00:35.920 --> 00:42.600
macOS, and Windows. It works on different architectures. We use our own custom

00:42.600 --> 00:48.080
engine. It's called Pyrogenesis. It's written in C++, so we have some technologies

00:48.080 --> 00:57.280
under the hood. For rendering, we have an abstract rendering interface. It helps us

00:57.280 --> 01:02.720
have multiple backends: OpenGL, Vulkan, and dummy. The last one is for tests and CPU performance

01:02.720 --> 01:09.920
checks. Also, we use MoltenVK to be able to run Vulkan on macOS. It converts Vulkan API

01:09.920 --> 01:18.200
calls to Metal API calls. So this is how it looks. We were trying to design our rendering

01:18.200 --> 01:24.840
interface as close as possible to the Vulkan API, but we still have limitations because we support

01:24.840 --> 01:31.920
OpenGL. For example, we still have UploadTexture and UploadBuffer, so no ring buffers or

01:31.920 --> 01:37.440
a uniform buffer on top. Also, we are binding resources directly

01:37.440 --> 01:43.320
in the device command context, so you might see SetTexture and SetUniform. We are binding by

01:43.320 --> 01:52.320
slot. So we have several limitations. So let's talk about Vulkan. We added Vulkan

01:52.320 --> 02:00.280
in 2022-2023, and the first version of the game with Vulkan was released in 2025. It was

02:00.280 --> 02:07.160
Alpha 27. By the way, we are currently preparing the next version, Alpha

02:07.160 --> 02:17.280
28. We are hoping to release it soon. So this is how the timeline looks. You might

02:17.280 --> 02:25.000
see that in the submit timeline, we have reordered our uploads, because uploads are not

02:25.000 --> 02:30.320
allowed within render passes, because we don't support dynamic rendering on all

02:30.320 --> 02:36.320
platforms, but we need to support Vulkan on all platforms. So we split them into two

02:36.320 --> 02:43.360
VkCommandBuffers, prepare and main. Why do we have that? Because we still have some legacy

02:43.360 --> 02:50.520
components. For example, the user interface, where a component might load and upload a texture

02:50.520 --> 02:57.320
right before the draw call. So it might just upload a texture right before a draw call

02:57.320 --> 03:02.480
inside a render pass. So we have to upload them separately before the main command

03:02.480 --> 03:08.760
buffer. And because of that, we don't allow uploading the same texture or the same buffer

03:08.760 --> 03:14.880
multiple times during a single render pass. So let's talk about obstacles we had.

03:14.880 --> 03:23.320
First one: Vulkan support detection. We tried to do that with SDL. Nothing really hard

03:23.320 --> 03:30.960
here, just trying to load the library and query instance support. But it was crashing for some

03:30.960 --> 03:38.520
drivers, because if you try to load Vulkan and then OpenGL, some drivers crash.

03:38.520 --> 03:43.600
So we just disabled that. Unfortunately, we don't have any suitable solution for us.

03:43.600 --> 03:50.200
So now users can switch backends in the options. Device detection. We have multiple physical

03:50.200 --> 03:58.080
devices, and we have to choose one. Because we have OpenGL, we had a problem that OpenGL

03:58.080 --> 04:03.440
sometimes uses the integrated GPU instead of the discrete one. So we thought that it would be

04:03.520 --> 04:12.440
great to sort all devices. So first we sort by type: discrete GPU, integrated GPU,

04:12.440 --> 04:18.560
CPU, and virtual. Then by device-local memory. And then by initial order. So the first

04:18.560 --> 04:25.320
error was using device-local memory, because some drivers report that wrappers have more memory

04:25.320 --> 04:36.280
than native cards. The second problem was this one. So that's the initial order reported

04:36.280 --> 04:42.160
by the driver. We were selecting the second one, the discrete. But it wasn't working, because

04:42.160 --> 04:49.160
it failed during initialization. So we had to remove the type sorting as well. So we

04:49.240 --> 04:57.400
always use the zeroth one, the first one from the list reported by the driver. So the problem

04:57.400 --> 05:09.880
is that vkEnumeratePhysicalDevices might list devices even if they're not going to work. So another problem:

05:09.880 --> 05:16.040
texture compression. Some platforms don't support all BC formats. And because of

05:16.120 --> 05:21.880
that, the textureCompressionBC feature might be false. Because of that, we were switching devices.

05:21.880 --> 05:28.280
That's not correct, because actually those platforms supported all the BC formats we need.

05:28.280 --> 05:37.560
So instead of checking this property, we just check all BC formats individually. Out of memory.

05:38.440 --> 05:43.240
In our game, we have multiple options to control quality. We have shadow quality,

05:43.240 --> 05:48.040
texture quality; we can enable water refraction, water reflection, and so on. And it

05:48.040 --> 05:55.720
consumes memory. And we need to handle that situation. So when might that occur? Usually when

05:55.720 --> 06:01.560
we allocate resources: vkAllocateMemory or vkCreate* calls. It's easy to fall back in that stack,

06:01.560 --> 06:08.840
because we can return the error to the caller. But there are rarer situations when Vulkan

06:08.840 --> 06:15.640
can fail in vkAcquireNextImageKHR, vkQueuePresentKHR, vkQueueSubmit, and vkWaitForFences. In those situations,

06:15.640 --> 06:20.840
we unfortunately can't handle it. So currently, in those situations, we just crash.

06:21.880 --> 06:27.960
Because it's in another layer; it's low level. But information about quality is high level.

06:27.960 --> 06:34.280
So the low level can't access the high level. So currently we just crash. No proper solution for that.
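The allocation-time fallback can be sketched like this. It is a hypothetical helper, not the engine's actual code: the creation call reports failure to the caller, which retries with a cheaper quality level until something fits or every level has failed.

```cpp
#include <cassert>
#include <functional>

// Result of a hypothetical resource-creation attempt.
enum class Result { Success, OutOfDeviceMemory };

// Try qualities from the requested level down to 0; returns the quality
// level that fit, or -1 if even the cheapest one failed. tryCreate stands
// in for the real vkCreate*/vkAllocateMemory call site.
int CreateWithFallback(int requestedQuality,
                       const std::function<Result(int)>& tryCreate) {
    for (int quality = requestedQuality; quality >= 0; --quality) {
        if (tryCreate(quality) == Result::Success)
            return quality;
    }
    return -1;
}
```

This only works where the error surfaces at a call site that knows about quality settings, which is exactly why the failures inside acquire/present/submit/wait are the hard ones.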

06:39.000 --> 06:45.800
We use VulkanMemoryAllocator. It's been really helpful for us. It really

06:45.800 --> 06:53.160
simplified some code related to memory allocations. But at the same time, if we free memory,

06:53.160 --> 06:59.800
it doesn't go back to the GPU immediately, because VulkanMemoryAllocator uses intermediate buffers:

06:59.800 --> 07:07.160
bigger buffers to allocate smaller buffers from. We're trying to use VK_EXT_memory_budget,

07:07.240 --> 07:16.360
but it doesn't really help in those situations. The only solution we have is to use some ratio,

07:16.360 --> 07:23.960
like 80% of total available memory, and the rest should be on the operating system side.
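That ratio heuristic can be sketched as follows. The 80% figure is the one mentioned in the talk; the helper itself is illustrative, and the ratio is a tunable guess, not a hard guarantee from the driver.

```cpp
#include <cassert>
#include <cstdint>

// Stay under a fixed fraction (here 80%) of the total reported memory and
// leave the rest to the OS/driver. Integer math avoids float rounding.
bool FitsInBudget(uint64_t usedBytes, uint64_t requestBytes, uint64_t totalBytes) {
    const uint64_t budget = totalBytes / 10 * 8; // 80% of total
    return usedBytes + requestBytes <= budget;
}
```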

07:26.360 --> 07:32.680
GPU skinning artifacts. Skinning is the process of applying skeleton animation to a model,

07:32.680 --> 07:39.160
to a mesh. And this is how the frame should look. It's a regular frame from our game.

07:39.720 --> 07:45.800
And that's how it looks with the bug. And another one. So it might look like a driver bug,

07:45.800 --> 07:51.240
or like another synchronization problem. But actually, it was pretty simple. It was just

07:51.240 --> 07:58.600
incorrectly selected data. We didn't invalidate a flag when the user was switching from

07:58.680 --> 08:05.400
CPU skinning to GPU skinning. But it was looking like a serious bug, which we weren't able to

08:05.400 --> 08:11.800
reproduce. We are collecting GPU statistics to be able to optimize our game, to know the

08:11.800 --> 08:19.720
corner cases. So, for example, for OpenGL, we have the following reported names for the same

08:20.600 --> 08:30.120
game version, the same GPU, the same platform. With Vulkan, it's much better.

08:31.960 --> 08:41.320
The same game version, the same GPU, but all supported platforms. It's better. It's simpler to

08:41.320 --> 08:46.440
parse. It still has some problems, like it includes additional information, and we need to

08:46.440 --> 08:56.440
remove trademark markers and so forth, but it's much simpler. There are some helpers that might help

08:56.440 --> 09:01.960
you distinguish different GPUs. For example, deviceID, but it's not enough because it might

09:01.960 --> 09:08.280
be equal for different GPUs. deviceUUID is possible if it's present, because in some cases it might

09:08.280 --> 09:13.640
be just zeroed. So it's not really helpful in that case. So the final solution: we just

09:13.720 --> 09:21.400
parse the device name, but it's much simpler than for OpenGL. Now, debugging. That's the most

09:21.400 --> 09:29.640
interesting part for us. We have a lot of players, but not many of them have programming skills

09:29.640 --> 09:35.560
or can build the game or debug the game. So we had to introduce some helpers, configuration options,

09:35.560 --> 09:42.680
to be able to retrieve some useful information for us to debug the game. For example, we have

09:42.760 --> 09:48.440
RenderDoc helpers. So we can teach someone how to make a capture, or maybe they already know

09:48.440 --> 09:58.280
RenderDoc. We can enable debug labels. We can enable debug scope labels. So each resource in our

09:58.280 --> 10:05.640
engine is marked with a constant name, so we can distinguish them. Also, we can enable messages.

10:05.640 --> 10:12.120
If the driver has something to report to us. And we have a debug context; it enables different features,

10:12.120 --> 10:18.200
including validation layers, if they are present on the user's platform.

10:19.320 --> 10:24.680
Also, currently, by default where available, we use descriptor indexing.
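Deciding whether the descriptor-indexing path can be enabled might look like this sketch. The struct mirrors a few fields of VkPhysicalDeviceDescriptorIndexingFeatures; real code would fill it via vkGetPhysicalDeviceFeatures2, and which fields an engine actually requires is an assumption here.

```cpp
#include <cassert>

// Stand-in mirroring a few fields of VkPhysicalDeviceDescriptorIndexingFeatures;
// real code fills the actual struct via vkGetPhysicalDeviceFeatures2.
struct DescriptorIndexingFeatures {
    bool shaderSampledImageArrayNonUniformIndexing;
    bool descriptorBindingPartiallyBound;
    bool runtimeDescriptorArray;
};

// Enable the descriptor-indexing path only when everything the renderer
// relies on is present; otherwise fall back to plain slot binding.
bool CanUseDescriptorIndexing(const DescriptorIndexingFeatures& f) {
    return f.shaderSampledImageArrayNonUniformIndexing &&
           f.descriptorBindingPartiallyBound &&
           f.runtimeDescriptorArray;
}
```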

10:25.640 --> 10:31.720
But there is a corner case. When you enable validation without GPU-assisted validation,

10:32.840 --> 10:38.760
the validation layers might complain, because they don't really know when a resource will be accessed.

10:38.760 --> 10:44.440
So, for example, in a single descriptor set, you might have a framebuffer target and some

10:44.440 --> 10:52.760
texture sample, and there will be a validation warning for it. So in a real case, it's not a problem.

10:53.320 --> 11:02.680
So we need to be able to disable that. Also, we would like to have in the future an option to

11:02.760 --> 11:07.640
choose the GPU: not only the backend, but also the GPU from a list. But we don't have

11:07.640 --> 11:14.680
that yet. So we use a configuration option. And the last one helps us debug

11:14.680 --> 11:23.080
different synchronization problems and driver issues. We are able to insert a debug barrier. It's a barrier

11:23.080 --> 11:28.680
from all stages to all stages, from all access masks to all access masks. So it's really a full execution

11:28.680 --> 11:36.680
and memory barrier. Also, we have waits: waits on different stages, and around present, before and

11:36.680 --> 11:47.720
after. Back to artifacts. At the beginning of 2025, I found visual artifacts in the main menu

11:47.720 --> 11:53.960
on a Raspberry Pi 4, with Mesa 24. So on the left, with the bug; on the right, without the bug.

11:54.840 --> 12:01.960
During the investigation, I tried to enable the mentioned debug barriers with full stages and masks;

12:01.960 --> 12:08.360
it didn't help. Tried vkDeviceWaitIdle; didn't help. The only thing

12:09.880 --> 12:17.320
that worked was to split the vkQueueSubmit into two. We have a semaphore relation between them.
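That workaround can be modeled as two submits chained by a semaphore instead of one submit carrying both command buffers. This is a toy sketch with stand-in names, not the real vkQueueSubmit calls; in real code each entry would become a VkSubmitInfo with wait/signal semaphores.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of one queue submission: which command buffer it carries and
// which semaphores it waits on / signals (empty string means none).
struct Submit {
    std::string commandBuffer;
    std::string waitSemaphore;
    std::string signalSemaphore;
};

// Instead of one submit with {prepare, main}, issue two submits where the
// main buffer waits on the semaphore the prepare buffer signals.
std::vector<Submit> BuildWorkaroundSubmits() {
    return {
        {"prepare", "", "prepareDone"},
        {"main", "prepareDone", "frameDone"},
    };
}
```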

12:17.960 --> 12:26.280
So the code was looking like this. And actually, it was a driver bug, and thanks to Sam

12:26.280 --> 12:33.720
and his colleagues. By the way, he has a talk today about the Raspberry Pi. It was, in my opinion,

12:35.880 --> 12:45.160
the fastest fix I have seen for a driver from a vendor or driver author. So very many thanks to them.

12:48.040 --> 12:53.000
The main conclusion is that the more unusually an application uses the Vulkan API,

12:54.280 --> 13:01.240
The more likely a driver error will occur. So if you are doing like a simple quad rendering,

13:01.240 --> 13:08.920
then the chance that you will have an error, an artifact or something like that, is way, way

13:08.920 --> 13:15.720
low. But if you are doing something specific, for example, like we do, like I mentioned on the timeline,

13:16.040 --> 13:21.640
we split our device common context on two. So we have prepared to the K common buffer and

13:21.640 --> 13:28.200
main common buffer. And we have a synchronization between them. That already isn't so usual behavior

13:28.200 --> 13:35.240
for some platform, so at least. Because usual recommendation is to avoid multiple case

13:35.240 --> 13:43.240
of meets or multiple common buffer. So use only one, but not for all. And another thing that

13:43.240 --> 13:50.840
really helped in that situation that I was able to reproduce that back myself on my Raspberry Pi 4.

13:50.840 --> 13:58.920
And after a few evenings of debugging, I finally figured out that it was a driver bug, and I filed an

13:58.920 --> 14:07.800
issue for Mesa. And it was fixed really fast. I am really glad about that. And the last thing:

14:08.120 --> 14:14.440
GPU performance measurements. Because we have players with different hardware, from low-end to high-end,

14:15.000 --> 14:22.760
we need to be able to measure how expensive our frame is. What do we need to optimize?

14:24.760 --> 14:30.280
Usually we prefer using tools from vendors. So if we are debugging locally, we are trying to use

14:31.240 --> 14:40.120
tools that are provided by vendors, if available. Not all platforms have enough tools.

14:41.320 --> 14:50.840
Otherwise, we fall back to timestamp queries. They have limitations in terms of how they measure data,

14:50.840 --> 14:58.760
because when you insert a timestamp query, you choose where you want to capture. So, for example,

14:58.760 --> 15:07.080
if you have two overlapping jobs, then you can really distinguish only a single one.
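Reading back such a query pair is a simple conversion; timestampPeriod (nanoseconds per GPU tick) comes from VkPhysicalDeviceLimits. A minimal sketch:

```cpp
#include <cassert>
#include <cstdint>

// Convert a pair of GPU timestamp-query results (in ticks) to milliseconds.
// timestampPeriodNs is VkPhysicalDeviceLimits::timestampPeriod.
double TimestampDeltaMs(uint64_t beginTicks, uint64_t endTicks, float timestampPeriodNs) {
    return double(endTicks - beginTicks) * timestampPeriodNs / 1e6;
}
```

Note that the delta only brackets a span of GPU work; as the talk says, two overlapping jobs inside that span cannot be separated this way.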

15:09.240 --> 15:15.880
And actually, timestamp measurements might be affected. For example, they will be affected by other processes,

15:15.880 --> 15:24.440
because they are using your GPU as well. They can be affected by temperature. That's not so

15:24.520 --> 15:32.200
relevant for discrete GPUs; it's mostly for mobile GPUs or energy-conserving ones.

15:33.880 --> 15:46.760
And sometimes you might get measurements where the results are slower than the previous ones,

15:46.840 --> 15:54.360
even though you are sure that you're using better code. That's happening because the GPU driver

15:55.560 --> 16:05.320
sees the code, or sees the real load, and it thinks that it might make sense to decrease the

16:05.320 --> 16:16.280
GPU frequency. We are still at 60 FPS, but we are using less energy. So better code doesn't mean

16:16.360 --> 16:24.760
that there will be better performance. But in terms of energy consumption, it will be better anyway.

16:25.640 --> 16:29.240
So that was the last one. Thank you very much.

16:46.840 --> 16:58.840
You need tips about measuring GPU performance, right?

17:04.120 --> 17:13.400
It's really not a simple question, because usually you need to take a look at each platform independently.

17:13.960 --> 17:21.720
For example, some vendors provide special functions which can fix your GPU frequency.

17:23.000 --> 17:31.880
In that case, or, for example, not the GPU frequency, but switching from energy-saving to performance mode,

17:31.880 --> 17:39.080
where it might be fixed. Not will, but might. Also, some vendor tools

17:39.720 --> 17:46.840
help you get more metrics, for example those which are not available in the Vulkan API,

17:47.720 --> 17:53.000
in frame statistics or somewhere else, or via timestamp queries or something else.

17:53.960 --> 17:58.920
So: look at each platform independently, and use vendor tools.

17:59.880 --> 18:08.600
Looking back over the last three years now, was it worth it to switch to Vulkan

18:09.880 --> 18:14.840
and to put the effort in? And do you think, if you had to do it again, would you do it again?

18:16.680 --> 18:26.280
Was it worth it to switch to Vulkan? I had a talk in 2024, and yes, I said that for some platforms

18:26.280 --> 18:33.640
we got up to a 300% improvement in performance. The best improvement was for macOS,

18:34.840 --> 18:40.920
because we were using OpenGL there, and their OpenGL implementation is far from ideal.

18:42.680 --> 18:54.200
We got 10% to 300% improvements, and we got more stable performance: fewer fluctuations across frames.

18:54.920 --> 19:02.680
So yes, it was worth it. Another reason why we switched to Vulkan: I was really interested in Vulkan,

19:02.680 --> 19:09.320
so I had internal motivation to add it. If you don't have that, or you don't have time,

19:09.320 --> 19:18.600
then it might be worth taking a look at some other libraries that might be integrated into your

19:18.920 --> 19:25.960
application. We were limited, because we still support OpenGL 2.1; it has limitations,

19:25.960 --> 19:32.920
so libraries like ANGLE or bgfx are not really useful for us yet.

19:34.760 --> 19:40.120
So the short answer: yes, it was worth it. Yes?

19:48.600 --> 20:11.000
You mean like choosing different GPUs, how does it affect performance?

20:19.560 --> 20:33.880
Do we notice any difference for similar GPUs, or for the same?

20:34.280 --> 20:51.960
I think no, because we don't have many GPUs available to us, so generally we have developer machines;

20:52.680 --> 20:58.520
they are pretty limited, so we don't have many of them, maybe 10 to 20, not more,

20:59.160 --> 21:05.560
so we don't have much variety. From users we usually get pretty rough statistics,

21:06.280 --> 21:12.120
I mean from their reports, because they report to us voluntarily, so we are not

21:12.120 --> 21:21.160
collecting any data without their consent. So usually the most useful data we have

21:21.160 --> 21:27.720
is FPS, or frame time, or something like that.

21:41.000 --> 21:48.280
Yes, in terms of supported platforms, because, for example, we still support OpenGL 2.1,

21:48.360 --> 21:55.880
and for... yeah, the question: do you notice any difference between

21:55.880 --> 22:02.920
commercial engine implementations and open-source ones? So, in variety of support:

22:03.560 --> 22:11.000
commercial ones usually try to avoid platforms where there aren't many people,

22:11.880 --> 22:16.120
where there isn't money or something like that. In open source, it's vice versa:

22:17.080 --> 22:23.560
where we have many people, we have much support, so in that case we need to support a

22:23.560 --> 22:37.720
much wider range of hardware and much older hardware. It's really interesting;

22:37.720 --> 22:48.760
they are, like, two different areas to investigate, so it's not worse, not better, but different,

22:49.320 --> 22:52.920
so for me, both of those areas are interesting.

22:57.400 --> 22:59.960
Yep. So, what's the problem on the Raspberry Pi?

23:00.920 --> 23:07.320
I can show you the ticket in Mesa. What's the problem with the Raspberry Pi?

23:08.120 --> 23:17.160
If I'm not mistaken, it was that with two different command buffers, the internal state

23:17.720 --> 23:27.160
wasn't synchronized regarding some pending barriers, but I can show you in more detail.

23:27.560 --> 23:31.560
So, that's all?

23:31.560 --> 23:38.600
You mentioned you used bindless textures, so when you have to debug something, do you switch to

23:40.760 --> 23:46.440
non-bindless? As far as I know, with RenderDoc, it's a little bit difficult to

23:46.440 --> 23:49.400
debug if you enable bindless textures.

23:49.400 --> 24:01.400
Yes, it's possible. Are we trying to disable bindless,

24:01.400 --> 24:08.360
descriptor indexing, if we have a bug? Yes, it's the first step. So if we have a bug, we try to

24:08.360 --> 24:13.800
disable descriptor indexing first. If it still reproduces, then we work on that.

24:13.800 --> 24:21.160
If not, then we try to investigate the descriptor indexing path. But, if I'm not mistaken,

24:21.160 --> 24:28.840
we had only one bug related to descriptor indexing, so most of our bugs reproduce in both cases.

24:30.360 --> 24:35.880
Because the client code, which is calling our backends, is absolutely the same.

24:36.600 --> 24:38.520
Just the binding of resources is different.

