WEBVTT

00:00.000 --> 00:12.080
A little about us before we get started. I'm Annanay, and I'm a software engineer at Grafana.

00:12.080 --> 00:16.200
I was one of the founding engineers on Tempo, which is our distributed tracing backend.

00:16.200 --> 00:22.640
And now I work on the machine learning team, and I'm based in Seattle. And yeah, I'll

00:22.640 --> 00:28.360
pass it off to Marc. Yeah, my name is Marc Tudurí, and I live in Berlin.

00:29.280 --> 00:37.000
I work at Grafana, in the Beyla squad. And this is our first picture together.

00:37.000 --> 00:41.720
And we didn't know. That's me there, and Annanay is there.

00:41.720 --> 00:46.120
Yeah. And in fact, it's our only picture together in the same frame.

00:46.120 --> 00:52.680
Cool. So yeah, I'll set the stage for our talk. We'll split it roughly into two

00:52.680 --> 00:57.880
halves. In the first half, we'll talk about the challenges of deploying GPUs to production

00:58.360 --> 01:03.240
and the current solutions out there to monitor GPU performance. And in the second half,

01:03.240 --> 01:08.200
we'll talk about our proposed solution to monitor GPU performance using eBPF and Beyla.

01:08.200 --> 01:14.720
Cool. And with that, let's meet Randy. So Randy's real, in the sense that he has a YouTube

01:14.720 --> 01:21.240
channel where he does comedy. But for the purposes of this talk, we'll say Randy is an entrepreneur

01:21.240 --> 01:27.120
who's raised money to start an AI research lab. And with this money, Randy's bought a bunch

01:27.200 --> 01:36.640
of GPUs and deployed a fleet. Five GPUs is not a fleet, but imagine a few more. And these GPUs have been

01:36.640 --> 01:40.800
handed off to different teams for experimentation. So typically the way this works is we have

01:40.800 --> 01:45.840
a scheduler, and different teams submit tasks to the scheduler. Some of these could be

01:45.840 --> 01:50.320
really long-running tasks like model training, which can span hours and days, while

01:50.320 --> 01:54.960
others could be short-lived tasks like live model inference. And this scheduler decides

01:55.040 --> 02:02.240
which GPUs are free and assigns these tasks to those GPUs. But a couple of days after deploying,

02:02.240 --> 02:06.720
these teams start complaining about either repeatedly crashing

02:06.720 --> 02:13.120
training runs or very slow live inference. And it turns out this is more common than you'd think.
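The scheduler just described can be reduced to a toy sketch: hand each queued task to a free GPU, and keep the rest queued until capacity opens up. All names and data structures here are purely illustrative, not any real scheduler's API:

```python
import collections

def assign_tasks(tasks, free_gpus):
    """Toy scheduler: assign queued tasks to free GPUs, in order.

    Returns (assignments, still_pending). Tasks that don't get a GPU
    stay queued, like a training job waiting for capacity.
    """
    queue = collections.deque(tasks)
    free = list(free_gpus)
    assignments = {}
    while queue and free:
        task = queue.popleft()        # oldest submitted task first
        gpu = free.pop(0)             # any free GPU will do
        assignments[task] = gpu
    return assignments, list(queue)

assignments, pending = assign_tasks(
    ["train-llm", "batch-eval", "live-inference"], ["gpu-0", "gpu-1"]
)
print(assignments)  # {'train-llm': 'gpu-0', 'batch-eval': 'gpu-1'}
print(pending)      # ['live-inference']
```

Real schedulers (Kubernetes, Slurm) of course add priorities, preemption, and gang scheduling on top of this basic matching.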

02:13.120 --> 02:18.560
So GPUs are incredibly complex pieces of hardware compared to CPUs, and they have a lot more

02:18.560 --> 02:25.200
failure modes. This is from the Llama 3 paper released by Meta in July last year, where they talk

02:25.200 --> 02:31.920
about how they trained the model on some 16,000 GPUs over the course of 50-something days. And they had

02:31.920 --> 02:39.120
500-odd errors — 500-odd instances where the training just stopped. And they categorized

02:39.120 --> 02:44.480
those, and you can see that GPUs are right there at the top: I think 58%

02:44.480 --> 02:50.800
of the stalls were because of GPU issues. And it turns out this number is actually pretty standard —

02:50.800 --> 02:57.200
something like 1 to 3% of GPUs fail: they overheat and fall off the bus. But yeah, there are lots of other issues

02:57.200 --> 03:03.600
besides these. Spoiler alert: eBPF doesn't help with tracking hardware-related errors, so we'll

03:06.480 --> 03:13.360
skip a couple of slides. But coming to the software side: this is typically how we

03:14.000 --> 03:21.760
write a program that's geared towards GPUs — CUDA software, as you'd say. We first load data into

03:21.760 --> 03:27.520
CPU memory, and then we allocate memory on the GPU side, for the device. Then we transfer the

03:27.520 --> 03:32.960
bits of that data that we want for our computation over to the GPU. Then we launch kernels, which

03:32.960 --> 03:37.680
are basically functions, or basic computations, that you launch on the GPU. They could be

03:37.920 --> 03:44.160
matrix multiplications or softmax — a lot of math that you run on the GPU. And the results

03:44.160 --> 03:51.600
of those computations are then copied back over to the CPU.
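The flow just described — allocate on the device, copy in, launch a kernel, copy the results back — can be sketched in a few lines. This is a toy Python mock, not real CUDA: plain lists stand in for device buffers, and the "kernel" is just an element-wise square.

```python
def run_gpu_style_pipeline(host_data):
    """Mimic the host-side CUDA flow: allocate, copy in, launch, copy out."""
    # 1. Allocate memory on the device (cudaMalloc in real CUDA).
    device_buf = [0.0] * len(host_data)
    # 2. Transfer input host -> device (cudaMemcpy, host-to-device).
    device_buf[:] = host_data
    # 3. Launch a "kernel" -- here an element-wise square, standing in
    #    for matmul, softmax, etc.
    device_buf = [x * x for x in device_buf]
    # 4. Copy the results device -> host (cudaMemcpy, device-to-host).
    return list(device_buf)

print(run_gpu_style_pipeline([1.0, 2.0, 3.0]))  # [1.0, 4.0, 9.0]
```

Each numbered step corresponds to one of the CUDA calls the talk goes on to instrument.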

03:51.600 --> 03:56.480
So the TL;DR is that the CPU is the orchestrator for all of the GPU tasks. Kernels are launched from the CPU, so it's important to know

03:56.480 --> 04:01.600
how many of those were launched, and what the dependencies between kernels were — was the output of one

04:01.920 --> 04:07.600
kernel being consumed by another, and did that cause slowdowns? Then, memory is also

04:07.600 --> 04:11.200
allocated and deallocated from the CPU side, so it's important to know how much was

04:11.200 --> 04:18.000
allocated, and was it freed up before the next operation? And is our data transfer happening

04:18.000 --> 04:24.560
asynchronously with the computation? Typically, transfers should be done async while other computation

04:24.560 --> 04:31.600
tasks are underway. So, like all of us would, Randy wants to monitor GPU performance. He goes to the

04:31.600 --> 04:36.160
internet and searches: how do I do this, and what are the solutions out there?

04:37.040 --> 04:43.200
And the first thing that you read is to set up hardware metrics, and there are some

04:43.200 --> 04:49.840
pretty solid exporters out there. The first screenshot is from NVIDIA's DCGM exporter — that

04:49.840 --> 04:56.800
stands for Data Center GPU Manager — and it plugs into the hardware APIs of the GPU and reports

04:56.800 --> 05:01.280
some pretty interesting metrics, like the temperature. Temperature tracking is pretty important:

05:01.280 --> 05:09.600
these GPUs can overheat and die. There's also framebuffer usage, GPU utilization, all of

05:09.600 --> 05:14.160
this stuff. There's also a Slurm job exporter, which is linked from this slide. Slurm is another

05:14.160 --> 05:18.640
workload scheduling system, which is more popular in the HPC world — most of the top

05:18.720 --> 05:25.840
supercomputers use Slurm. And then this screenshot is from Azure HPC node health, which is

05:25.840 --> 05:32.000
an open-source repository of node health checks that are targeted towards GPUs. So, similar

05:32.000 --> 05:36.800
to the health checks that we have in Kubernetes, but specifically for GPUs, they'll check things like

05:36.800 --> 05:43.440
ECC errors, and if there are any errors related to the GPU, they won't bring it online, so that

05:43.520 --> 05:48.480
training doesn't stall. But the problem with hardware metrics is that while they're helpful

05:48.480 --> 05:53.040
to know which GPU or which job failed, and the failure states, they don't lead you to the root cause of

05:53.040 --> 06:01.280
the problem. So the next advice is to use profiling systems, and again there are some pretty solid

06:01.280 --> 06:05.920
profiling systems out there. This is actually a good point in time to get a show of hands: how

06:05.920 --> 06:13.040
many of you are using GPUs in production? So there's a couple — you might be familiar with

06:13.200 --> 06:20.240
these, then; maybe you're using them to profile your CUDA programs. The first one is NVIDIA

06:20.240 --> 06:26.240
Nsight Systems, and it helps break down all of your Python functions into the operations that are

06:26.240 --> 06:32.240
launched on the GPU. And the second one is the PyTorch profiler, and this also helps break down

06:32.240 --> 06:38.640
how much time was spent in different parts of your code. But these are pretty complex to use —

06:38.800 --> 06:45.040
this is an example of a command found in one of the user forums, and you can

06:45.040 --> 06:51.360
start seeing the issues with this. First off, every team in Randy's company now needs to learn

06:51.360 --> 06:58.000
how to use these tools — learn which libraries to profile, and whether or not to sample the CPU

06:58.000 --> 07:04.880
call stack. And by the way, if you enable CPU call stack sampling, you have a performance overhead,

07:04.880 --> 07:10.640
because the profiler is running in the same process as your main Python code, and people have seen

07:10.640 --> 07:15.040
up to 2x performance overhead — it takes twice as long to run your code with CPU

07:16.480 --> 07:20.560
call stack sampling enabled. And then, finally, the output lands on your local machine, and you have to

07:20.560 --> 07:24.720
load it into a GUI; it's not a platform capability that can be provided to multiple teams.

07:25.600 --> 07:29.600
And if you have a more complex, multi-GPU setup, each of those GPUs will generate its own

07:30.240 --> 07:35.920
output file. So again, there's performance overhead. And with the PyTorch profiler, you have

07:35.920 --> 07:40.720
to manually instrument the code to enable profiling. There's also a lack of CPU context before and

07:40.720 --> 07:48.400
after GPU events, so that's also something that we want to highlight. And finally, what

07:48.400 --> 07:53.600
if Randy runs out of money and says: I cannot afford NVIDIA GPUs anymore; we're going to switch

07:53.600 --> 07:59.360
to AMD. And you know, with the DeepSeek and NVIDIA stuff going on, I don't know what the

07:59.360 --> 08:04.480
price of GPUs is going to be. But suppose we swap out some of these for AMD GPUs — now

08:04.480 --> 08:10.720
your whole profiling setup changes. Each of these teams learned how to use the NVIDIA profiler,

08:10.720 --> 08:17.600
and now they have to learn all of that again for the AMD toolset. So these were the

08:17.680 --> 08:26.880
problems that we decided to tackle at Grafana. And now we're going to talk about how

08:26.880 --> 08:32.320
we're going to use Beyla, which is Grafana's eBPF-based open-source auto-instrumentation tool,

08:32.320 --> 08:39.040
which is also going to be part of OpenTelemetry, and how we're going to use it to monitor the

08:39.040 --> 08:47.280
performance of GPUs. And with that, I'll pass it off to Marc. Thank you. Yeah, so what

08:47.360 --> 08:53.600
are the advantages of using eBPF in this context? First of all, auto-instrumentation: that means

08:53.600 --> 09:01.280
that we can deploy Beyla in our clusters that use GPUs, and we're going to have,

09:02.400 --> 09:11.600
by default, automatic instrumentation for every CUDA call that happens in the system.

09:11.760 --> 09:17.760
It is framework agnostic, so whether you use PyTorch or any other machine learning

09:17.760 --> 09:25.040
library doesn't matter — if you're using CUDA, it's all good. And it has lower

09:25.040 --> 09:31.120
overhead. You probably know that eBPF is quite fast. We haven't measured the performance yet,

09:31.120 --> 09:38.960
but we assume the overhead is low, like the other probes that we have in Beyla. As for how

09:39.040 --> 09:43.840
this works — please take it with a pinch of salt; it's an experimental feature, and

09:44.720 --> 09:49.120
things might be broken. First of all, we had to identify the important CUDA calls that

09:49.680 --> 09:58.080
are available. We took a small subset of them, just to test, and then we started

09:58.080 --> 10:03.680
writing some probes — some quite straightforward, some complicated. Then we could start

10:04.240 --> 10:10.400
getting data in and see which metrics and which labels we could create from it. We had to do

10:10.400 --> 10:18.400
some inspection of the CUDA libraries, and there's a symbol discovery process that's necessary

10:18.400 --> 10:26.880
to get the symbols and the names of the kernels. And it's important to mention that we have access

10:27.840 --> 10:34.240
to the CPU context before and after the GPU call. So we are not able to instrument the GPU itself,

10:34.240 --> 10:39.440
but we are able to instrument every time the CPU makes a call to the GPU.
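That symbol discovery step can be pictured like this: once a probe captures the raw function pointer passed to cudaLaunchKernel, the human-readable kernel name has to come from the symbol table of the loaded binary. A hypothetical Python sketch, with the symbol table reduced to a hand-written list of (start address, size, name) entries:

```python
import bisect

def build_symbol_index(symbols):
    """symbols: list of (start_addr, size, name), e.g. parsed from an ELF file."""
    table = sorted(symbols)
    starts = [s[0] for s in table]
    return starts, table

def resolve_kernel_name(starts, table, addr):
    """Find the symbol whose [start, start + size) range contains addr."""
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0:
        start, size, name = table[i]
        if start <= addr < start + size:
            return name
    return "unknown"

starts, table = build_symbol_index([
    (0x1000, 0x200, "softmax_kernel"),
    (0x2000, 0x400, "matmul_kernel"),
])
print(resolve_kernel_name(starts, table, 0x2010))  # matmul_kernel
```

The real implementation has to parse symbols out of the CUDA libraries and the application binary, but the lookup itself is this kind of address-range search.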

10:41.520 --> 10:48.000
This is how it works, roughly. Imagine that we have a prompt to our favorite AI assistant,

10:48.000 --> 10:56.480
and it goes to an LLM. The LLM needs to do some sort of operation, and for that it triggers

10:56.560 --> 11:05.120
the cudaLaunchKernel operation. This typically goes to a GPU, which does some calculation

11:05.120 --> 11:10.720
and returns the result back to the CPU. Beyla is capable of adding a probe to inspect what's going on

11:10.720 --> 11:18.320
in the cudaLaunchKernel operation, and we create a GPU kernel launch

11:18.400 --> 11:25.760
calls metric, which is ingested by Prometheus or OpenTelemetry, to later be visualized in Grafana.

11:27.600 --> 11:34.880
This is the first function that we tried to instrument: cudaLaunchKernel. We can see that we

11:34.880 --> 11:41.440
have the func offset here, which is going to be used later to get the name of the kernel, and we

11:41.520 --> 11:49.920
have these grid and block coordinates, which are useful for figuring out the cardinality

11:49.920 --> 11:56.960
of the operation — we'll talk about this in a minute. Here we can see, in Prometheus, the visualization

11:56.960 --> 12:06.240
of this function — sorry, of this metric — and we can track which of them are

12:06.240 --> 12:17.200
called more than others. And since Beyla is a first-class Kubernetes citizen, we are able to automatically add

12:17.200 --> 12:26.320
Kubernetes metadata decoration to all the metrics, so we are able to figure out, for this metric,

12:26.320 --> 12:32.800
for this operation, in which pod and which namespace it's running, so we can track which teams

12:32.800 --> 12:39.120
are running more operations than others. Here we have a Grafana dashboard

12:42.160 --> 12:50.240
visualizing the dimensions of the cudaLaunchKernel function. In the image above —

12:50.240 --> 12:57.360
that is, the panel above — we can see the average of the

12:57.360 --> 13:05.440
grid cardinality and the block cardinality, so we can identify which kernel functions are using

13:07.440 --> 13:17.920
more resources on our GPU in terms of blocks and grids.
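Grid and block cardinality, as used here, are just the products of the launch coordinates: how many thread blocks the grid contains, and how many threads each block contains. A small sketch — the function name and tuple layout are our own; real launches pass dim3 structs to cudaLaunchKernel:

```python
def launch_cardinalities(grid_dim, block_dim):
    """Cardinality of a kernel launch from its grid/block coordinates.

    grid_dim / block_dim are (x, y, z) tuples, mirroring the dim3
    arguments of cudaLaunchKernel.
    """
    gx, gy, gz = grid_dim
    bx, by, bz = block_dim
    grid_card = gx * gy * gz       # number of thread blocks in the launch
    block_card = bx * by * bz      # threads per block
    total_threads = grid_card * block_card
    return grid_card, block_card, total_threads

# e.g. a 2x2 grid of 256-thread blocks:
print(launch_cardinalities((2, 2, 1), (256, 1, 1)))  # (4, 256, 1024)
```

Larger cardinalities roughly mean a kernel is asking the GPU for more parallel work, which is why the dashboard tracks them per kernel function.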

13:17.920 --> 13:26.000
In the panel below, we can see the rate of this cardinality — how it grows over time. This is just running a prompt

13:26.080 --> 13:33.760
in a loop, but if we run a second prompt, we can see a spike in the cardinality

13:33.760 --> 13:41.600
over time, which makes sense, because we now need to do more operations. We are also tracking

13:41.600 --> 13:54.160
cudaMalloc, but this is not so useful, because this call only happens when you load

13:54.240 --> 14:02.960
the model at bootstrap. But we do have cudaMemcpy, and it allows us to identify

14:02.960 --> 14:08.240
host-to-device or device-to-host transfers — which in this case means CPU to GPU and GPU to CPU.
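A probe on cudaMemcpy can read its kind argument — the cudaMemcpyKind enum of the CUDA runtime API — and turn it into a metric label for the transfer direction. A sketch of that mapping: the enum values are the documented ones, while the label strings and function name are our own choices.

```python
# cudaMemcpyKind values from the CUDA runtime API.
CUDA_MEMCPY_KIND = {
    0: "host_to_host",
    1: "host_to_device",    # CPU -> GPU
    2: "device_to_host",    # GPU -> CPU
    3: "device_to_device",
    4: "default",           # direction inferred via unified virtual addressing
}

def memcpy_labels(kind, nbytes):
    """Build metric labels for one observed cudaMemcpy call."""
    return {"direction": CUDA_MEMCPY_KIND.get(kind, "unknown"), "bytes": nbytes}

print(memcpy_labels(1, 4096))  # {'direction': 'host_to_device', 'bytes': 4096}
```

Splitting the byte counters by this direction label is what makes the host-to-device vs. device-to-host panels in the next slide possible.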

14:09.280 --> 14:16.160
And we can look at things like, for example, this panel, where on the left side we can see —

14:16.720 --> 14:24.400
again over time, with the same queries that we had before — the amount of kilobytes

14:24.400 --> 14:29.520
that are transferred from the CPU to the GPU and the other way around. And of course,

14:30.960 --> 14:37.040
we usually send more data to the GPU than the other way around, because

14:37.600 --> 14:43.520
the GPU typically only returns one parameter, or one result, while the CPU has to send

14:44.480 --> 14:49.440
the whole text of your query, for example. And on the right side, we can see how the rate

14:50.160 --> 14:58.080
of CUDA kernel launches correlates with the memory allocations.

15:00.080 --> 15:05.040
We can also do profiling. This is a piece of code that we have in the first probe that we

15:05.040 --> 15:12.800
showed, to capture the stack trace thanks to this eBPF helper. And we can see some interesting

15:12.800 --> 15:20.880
stuff, like PyTorch kernels and vLLM kernels, because we have support for those, but the rest of the

15:20.880 --> 15:30.400
stack is gone. That's because this is using frame pointers, and they are optimized

15:30.400 --> 15:39.920
away, so they are gone. This is how we envision CPU profiling for this operation to look —

15:40.560 --> 15:46.320
and this is a screenshot from GDB. Our idea is to use the

15:47.440 --> 15:57.760
OpenTelemetry profiler to unwind the native stack. It's a bit tricky to make it work in our

15:57.760 --> 16:05.280
platform, but it's a work in progress, and it's going to give us the full picture.

16:05.360 --> 16:16.000
Some of the limitations of the eBPF approach: there is no information available on kernel execution

16:16.000 --> 16:24.320
time. That means that we cannot, for example, measure the latency of a kernel function's execution.

16:25.120 --> 16:31.440
And we don't have access to the GPU hardware itself, so we cannot measure things like the temperature.

16:31.520 --> 16:38.720
Just to recap: the idea here is to close the gap between traditional GPU monitoring

16:39.520 --> 16:46.160
and modern monitoring solutions like Grafana and Beyla. In the future, we would like to

16:46.160 --> 16:51.520
support more architectures — as Annanay said, AMD, for example —

16:52.240 --> 16:57.600
but at the same time, we don't want to have to instrument every LLM and every framework out there.

16:58.560 --> 17:03.120
We would also like to capture the context before and after a GPU call. What does that mean?

17:04.240 --> 17:09.920
Currently we are just generating a metric, but we could also generate the whole trace:

17:10.720 --> 17:18.160
when the request was made, the time that the request — the prompt — took, and how that impacts

17:18.160 --> 17:21.840
the number of kernel executions that

17:21.920 --> 17:30.320
your prompt triggers. So we could also do cost association: depending on which operation you are

17:30.320 --> 17:39.920
running, we could track how GPU-intensive that operation is. And finally, we would like

17:39.920 --> 17:46.880
to instrument more CUDA operations. And I think that's it. Thank you.

17:52.800 --> 17:59.440
Are there any questions? Do we have time for questions?

18:14.960 --> 18:16.480
yeah that's something we'd like to try

18:16.560 --> 18:22.880
Okay — so that is the problem: the way the kernels are launched, there's no

18:22.880 --> 18:28.560
way to track their run times. But maybe we can try a couple of other approaches and see if those

18:28.560 --> 18:34.080
work. Any other questions?

18:39.520 --> 18:45.600
So, okay — oh, sorry, the question was: how long have we been running it in

18:45.600 --> 18:52.240
production? So right now we're using it to monitor just a single GPU that's running

18:52.240 --> 18:57.680
some internal queries and stuff. We don't have a large-scale GPU deployment at Grafana — if there's

18:57.680 --> 19:04.640
anyone in the crowd that does, please reach out; we can collaborate and work on this. And yeah, so

19:04.640 --> 19:10.320
right now it's been a couple of months that we've been running it internally on one GPU, but that's

19:10.320 --> 19:15.760
not the ideal use case. Thank you.


