WEBVTT

00:00.000 --> 00:06.000
OK, sir.

00:06.000 --> 00:08.000
What's your name?

00:08.000 --> 00:09.000
I think Laurence.

00:09.000 --> 00:10.000
Laurence.

00:10.000 --> 00:13.000
And so let me introduce Laurence and Frank,

00:13.000 --> 00:15.000
who will be talking about GPUStack.

00:15.000 --> 00:17.000
And we are just about to start.

00:17.000 --> 00:18.000
So let's start.

00:18.000 --> 00:20.000
How is everyone?

00:20.000 --> 00:25.000
My talk is GPUStack: building a simple and scalable

00:25.000 --> 00:28.000
management experience for diverse AI models.

00:28.000 --> 00:33.000
So my name is Laurence Lee.

00:33.000 --> 00:36.000
This is my colleague, Frank Mai.

00:36.000 --> 00:40.000
So we are building some open source projects

00:40.000 --> 00:42.000
for using AI models.

00:42.000 --> 00:46.000
So this talk is mainly about GPUStack,

00:46.000 --> 00:51.000
which is a tool to manage your GPU clusters

00:51.000 --> 00:53.000
for running AI models.

00:53.000 --> 00:55.000
And another one is,

00:55.000 --> 00:57.000
the GGUF parser.

00:57.000 --> 01:00.000
The GGUF parser plays an important role

01:00.000 --> 01:01.000
in GPUStack.

01:01.000 --> 01:05.000
It is a tool for parsing GGUF files

01:05.000 --> 01:09.000
so that we can read the metadata

01:09.000 --> 01:12.000
and do estimations like how much memory

01:12.000 --> 01:14.000
should be used for this model.

01:14.000 --> 01:20.000
And the GGUF parser itself is actually a standalone project.

01:20.000 --> 01:24.000
It is also being used by other open source projects

01:24.000 --> 01:25.000
in the local AI space.

01:25.000 --> 01:29.000
And you can also use it as a standalone CLI

01:29.000 --> 01:34.000
for evaluating GGUF model files.
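
NOTE
A minimal Python sketch (not the GGUF parser's own code) of what "reading the
metadata" starts with: every GGUF file begins with a fixed little-endian header
giving the magic, format version, tensor count, and metadata key/value count.
"model.gguf" is a placeholder path.
    import struct
    with open("model.gguf", "rb") as f:
        magic = f.read(4)                                # b"GGUF" for valid files
        version, = struct.unpack("<I", f.read(4))        # GGUF format version
        tensor_count, = struct.unpack("<Q", f.read(8))   # number of tensors
        kv_count, = struct.unpack("<Q", f.read(8))       # number of metadata entries
    print(magic, version, tensor_count, kv_count)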

01:34.000 --> 01:36.000
So in this talk,

01:36.000 --> 01:40.000
we will basically split it into two parts.

01:40.000 --> 01:45.000
In the first part, we will introduce GPUStack,

01:45.000 --> 01:48.000
what problems we are trying to address,

01:48.000 --> 01:52.000
and the experience we have had

01:53.000 --> 01:57.000
when building these projects.

01:57.000 --> 02:00.000
And the second part, Frank will discuss

02:00.000 --> 02:03.000
what is under the hood of the GGUF parser.

02:03.000 --> 02:07.000
How do we compute those memory estimations?

02:10.000 --> 02:13.000
So let me do a quick recap.

02:13.000 --> 02:18.000
How have we run LLMs locally in 2024?

02:19.000 --> 02:23.000
So we all know that llama.cpp itself

02:23.000 --> 02:27.000
is an outstanding piece of software.

02:27.000 --> 02:31.000
It helps us to be able to run LLMs

02:31.000 --> 02:36.000
on many kinds of hardware, especially on GPUs.

02:36.000 --> 02:41.000
And Ollama and LM Studio are also

02:41.000 --> 02:42.000
famous tools.

02:42.000 --> 02:46.000
They help many developers to be able

02:46.000 --> 02:51.000
to serve LLMs or large models on their laptops.

02:51.000 --> 02:55.000
And actually, we also have other inference engines.

02:55.000 --> 02:58.000
for example, vLLM,

02:58.000 --> 03:01.000
SGLang, and many others like TensorRT-LLM,

03:01.000 --> 03:03.000
etc.

03:03.000 --> 03:06.000
The list can be very long.

03:06.000 --> 03:09.000
So how do we scale?

03:09.000 --> 03:13.000
We know that we can use tools like LM Studio

03:13.000 --> 03:15.000
on a laptop easily.

03:15.000 --> 03:18.000
But in terms of the scalability,

03:18.000 --> 03:21.000
we can face some challenges.

03:21.000 --> 03:24.000
Because most inference engines

03:24.000 --> 03:28.000
themselves do not address the scalability issues.

03:28.000 --> 03:32.000
They mainly focus on the inference performance

03:32.000 --> 03:37.000
or how to improve the throughput of your service.

03:37.000 --> 03:40.000
So for the scalability,

03:40.000 --> 03:44.000
we will probably need to rely on something else.

03:44.000 --> 03:47.000
For example, you can use Ray.

03:47.000 --> 03:53.000
It's a distributed system for running machine learning workloads.

03:53.000 --> 03:58.000
Or you can use DeepSpeed, which is a library

03:58.000 --> 04:02.000
which is useful for making distributed training

04:02.000 --> 04:04.000
and inference effective.

04:04.000 --> 04:07.000
And of course, it can be Kubernetes.

04:07.000 --> 04:13.000
Kubernetes is an orchestration platform for running containers.

04:13.000 --> 04:20.000
You can also combine multiple technologies together

04:20.000 --> 04:23.000
to satisfy your scalability demands.

04:23.000 --> 04:26.000
For example, you can use KubeRay,

04:26.000 --> 04:29.000
which is to run Ray on Kubernetes.

04:29.000 --> 04:32.000
So that kind of stuff,

04:32.000 --> 04:38.000
we can see many examples in production workloads.

04:38.000 --> 04:43.000
So the problem with these approaches is that,

04:43.000 --> 04:48.000
for example, Kubernetes itself is a general purpose container

04:48.000 --> 04:50.000
orchestration platform.

04:50.000 --> 04:57.000
It does not have built-in concepts for GPUs and AI models.

04:57.000 --> 05:01.000
So if you want to use these to run models well,

05:01.000 --> 05:07.000
users generally need to have a certain level of expertise

05:07.000 --> 05:12.000
to run those platforms effectively.

05:12.000 --> 05:17.000
So what we are trying to address is that we want to make it easy

05:17.000 --> 05:21.000
for the general public, not just professionals, to run

05:21.000 --> 05:27.000
and to easily scale their platforms.

05:27.000 --> 05:32.000
So what's the view of the general public?

05:32.000 --> 05:36.000
So unlike the audience here,

05:36.000 --> 05:39.000
many of the community users

05:39.000 --> 05:43.000
we encounter actually lack much of the knowledge

05:43.000 --> 05:44.000
that is under the hood.

05:44.000 --> 05:49.000
They may not know what the model architecture is.

05:49.000 --> 05:52.000
They may not know the detail of how the matrix

05:52.000 --> 05:55.000
multiplication works for inference.

05:55.000 --> 05:58.000
They may not know what those different kinds of

05:58.000 --> 06:00.000
quantizations mean.

06:00.000 --> 06:05.000
So what they know are that they have the devices

06:05.000 --> 06:07.000
or hardware at hand.

06:07.000 --> 06:12.000
It can be as powerful as H100 clusters.

06:12.000 --> 06:17.000
It can also be some old 3090 GPUs

06:17.000 --> 06:20.000
or even some desktop computers.

06:20.000 --> 06:25.000
And they also know that there are many good models out there.

06:25.000 --> 06:29.000
For example, DeepSeek R1; they can just download it

06:29.000 --> 06:33.000
for free from Hugging Face or somewhere else.

06:33.000 --> 06:36.000
And when referring to AI models,

06:36.000 --> 06:40.000
actually large language models are not the only ones.

06:40.000 --> 06:44.000
We also have Stable Diffusion,

06:44.000 --> 06:46.000
which is for image generation.

06:46.000 --> 06:50.000
We also have embedding and reranker models,

06:50.000 --> 06:54.000
which are useful for building RAG systems.

06:54.000 --> 06:57.000
We also have audio models.

06:57.000 --> 07:00.000
So in terms of the end users,

07:00.000 --> 07:03.000
they may need to, for example, chat

07:03.000 --> 07:06.000
with the large language models.

07:06.000 --> 07:10.000
In terms of the AI application developers,

07:10.000 --> 07:14.000
they actually want to have some useful APIs

07:14.000 --> 07:18.000
to interact with all those kinds of models.

07:18.000 --> 07:23.000
Actually, I should put llama.cpp in the first place on this slide,

07:23.000 --> 07:27.000
but it just doesn't have an official logo.

07:27.000 --> 07:31.000
So the user demand is simple.

07:31.000 --> 07:34.000
So they want to get a model, get it running,

07:34.000 --> 07:37.000
and get their job done.

07:37.000 --> 07:42.000
But the complexity is there.

07:42.000 --> 07:51.000
So we have the topology complexity for running AI models.

07:51.000 --> 07:55.000
For example, if you have some powerful GPUs,

07:55.000 --> 07:59.000
you can do full offloading to those GPUs.

07:59.000 --> 08:01.000
But for some users,

08:01.000 --> 08:03.000
they may need to do partial offloading.

08:03.000 --> 08:06.000
Or they may not even have a GPU.

08:06.000 --> 08:09.000
For example, you can do CPU offloading

08:09.000 --> 08:12.000
to run your DeepSeek R1.

08:12.000 --> 08:15.000
And for some users,

08:15.000 --> 08:19.000
they also might need some distributed topology.

08:19.000 --> 08:23.000
For example, their models are too large

08:23.000 --> 08:27.000
to fit in a single GPU or multiple GPUs

08:27.000 --> 08:28.000
on a single node.

08:28.000 --> 08:33.000
So distributed computing may be involved in this process.

08:33.000 --> 08:37.000
And the diversity is here.

08:37.000 --> 08:41.000
Different users may have different hardware.

08:41.000 --> 08:45.000
For example, different vendors of the GPUs

08:45.000 --> 08:47.000
or different accelerators.

08:47.000 --> 08:51.000
And the inference engine can also be different.

08:51.000 --> 08:55.000
For example, we can use llama.cpp to do our

08:55.000 --> 08:58.000
LLM inference, but for Whisper,

08:58.000 --> 09:01.000
we may need to rely on something else.

09:01.000 --> 09:04.000
And the models can be different.

09:04.000 --> 09:06.000
The quantizations can be different.

09:06.000 --> 09:10.000
And when you want to run something well in production,

09:10.000 --> 09:14.000
you also need to tune your parameters well.

09:14.000 --> 09:17.000
For example, to change the batching strategy.

09:17.000 --> 09:20.000
For example, to optimize your cache,

09:20.000 --> 09:25.000
including the prefix cache and the KV cache strategy.

09:25.000 --> 09:28.000
And you may also use different execution modes.

09:28.000 --> 09:32.000
For example, eager mode or the graph mode for running the inference.

09:32.000 --> 09:39.000
And when you have to serve a high-throughput LLM service,

09:39.000 --> 09:43.000
you can do optimization at a higher level

09:43.000 --> 09:48.000
when you have a bunch of requests.

09:48.000 --> 09:51.000
So to address these challenges,

09:51.000 --> 09:55.000
we want to build a tool that is smart.

09:55.000 --> 09:58.000
It can understand the environment well.

09:58.000 --> 10:00.000
It can understand the models well.

10:00.000 --> 10:04.000
It can understand the inference engines as well.

10:04.000 --> 10:09.000
So this is a screenshot of GPUStack.

10:10.000 --> 10:13.000
To know the environment well,

10:13.000 --> 10:18.000
we should know the details or the information of each

10:18.000 --> 10:21.000
node added to the cluster.

10:21.000 --> 10:25.000
For example, the operating system, the architecture,

10:25.000 --> 10:31.000
those matter when running your large language models.

10:31.000 --> 10:37.000
Because different inference engines may rely on different platforms.

10:37.000 --> 10:42.000
For example, MLX is for Apple devices.

10:42.000 --> 10:45.000
And when we run the vLLM backend,

10:45.000 --> 10:49.000
we usually cannot do that on Windows servers.

10:49.000 --> 10:53.000
So we need to know that information well.

10:53.000 --> 10:58.000
And for the GPUs, there are many different kinds of GPUs,

10:58.000 --> 11:01.000
and the software can group everything together.

11:01.000 --> 11:06.000
And it can do load balancing of your workloads

11:06.000 --> 11:12.000
or serve inference for you.

11:12.000 --> 11:15.000
And to know the models well,

11:15.000 --> 11:21.000
for example, GPUStack needs to be aware of the models.

11:21.000 --> 11:25.000
So there are many different kinds of models,

11:25.000 --> 11:30.000
including audio, LLM, image, and others.

11:30.000 --> 11:37.000
So we can properly schedule them to the supported GPUs or devices.

11:37.000 --> 11:39.000
For example, in this screenshot,

11:39.000 --> 11:44.000
when the user tried to run the large language model,

11:44.000 --> 11:47.000
it cannot fit entirely into the GPU.

11:47.000 --> 11:50.000
So partial offloading is used here.

11:50.000 --> 11:52.000
And for this situation,

11:52.000 --> 11:57.000
when we detect that, to run this massive model,

11:57.000 --> 12:03.000
we need to rely on multiple GPUs on different nodes.

12:03.000 --> 12:05.000
We will do that for you.

12:05.000 --> 12:11.000
So this is distributed inference across workers.
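
NOTE
A minimal sketch, assuming we already have the model's estimated VRAM need and
each GPU's free VRAM; GPUStack's real scheduler considers much more. It only
illustrates the full / distributed / partial offload decision described above.
    def choose_placement(model_vram_mib, gpu_free_mib):
        # gpu_free_mib: free VRAM of every GPU across all workers, in MiB
        if any(free >= model_vram_mib for free in gpu_free_mib):
            return "full offload to a single GPU"
        if sum(gpu_free_mib) >= model_vram_mib:
            return "distributed across multiple GPUs / workers"
        return "partial offload, remaining layers on the CPU"
    print(choose_placement(24_000, [16_000, 16_000]))  # -> distributed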

12:11.000 --> 12:15.000
And the tool needs to know the inference engines well.

12:15.000 --> 12:21.000
So when you, or the user, search for a model on Hugging Face,

12:21.000 --> 12:25.000
there are many different kinds of formats,

12:25.000 --> 12:27.000
quantization, etc.

12:27.000 --> 12:30.000
So when you choose one of the models,

12:30.000 --> 12:35.000
we should know which inference engine fits your environment,

12:35.000 --> 12:38.000
which inference engine fits this model,

12:38.000 --> 12:40.000
so we can choose it automatically for you.
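
NOTE
A minimal sketch, not GPUStack's actual logic: pick an inference engine by
matching the model format against what the current platform can run. The
capability table is an illustrative assumption.
    SUPPORTED = {
        "llama.cpp": {"formats": {"gguf"}, "platforms": {"linux", "darwin", "windows"}},
        "vLLM": {"formats": {"safetensors"}, "platforms": {"linux"}},
        "MLX": {"formats": {"safetensors"}, "platforms": {"darwin"}},
    }
    def pick_backend(model_format, platform):
        for name, caps in SUPPORTED.items():
            if model_format in caps["formats"] and platform in caps["platforms"]:
                return name
        return None
    print(pick_backend("gguf", "linux"))  # -> llama.cpp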

12:40.000 --> 12:45.000
And you can do any optimization you want.

12:45.000 --> 12:49.000
You can easily configure those backend parameters

12:49.000 --> 12:53.000
according to your need.

12:53.000 --> 12:57.000
And we also have decoupled support for multiple inference engines.

12:57.000 --> 12:59.000
What does that mean?

12:59.000 --> 13:05.000
As we know, innovation in AI models happens fast.

13:05.000 --> 13:09.000
So when a new model comes up,

13:09.000 --> 13:13.000
the inference engine needs to support it quickly.

13:13.000 --> 13:17.000
So we can see that for most major inference engines,

13:17.000 --> 13:23.000
their release cadence is very fast; they need to release frequently.

13:23.000 --> 13:27.000
So bugs can show up easily as well.

13:27.000 --> 13:31.000
To balance the stability of running the models

13:31.000 --> 13:33.000
and trying the new models,

13:33.000 --> 13:37.000
we need to decouple the inference engine version

13:37.000 --> 13:39.000
and the tool itself.

13:39.000 --> 13:41.000
So when using this GPU stack platform,

13:41.000 --> 13:45.000
you can easily choose different inference engine versions,

13:45.000 --> 13:48.000
and they can serve different models

13:48.000 --> 13:50.000
and it's totally decoupled.

13:50.000 --> 13:54.000
And you can easily do experiments.

13:56.000 --> 14:00.000
So next, we will talk about how

14:00.000 --> 14:02.000
the GGUF parser works under the hood.

14:02.000 --> 14:03.000
Frank.

14:03.000 --> 14:06.000
Thank you for joining us.

14:06.000 --> 14:11.000
And I will take you on a deep dive into the runtime memory usage

14:11.000 --> 14:13.000
of GGUF models,

14:13.000 --> 14:15.000
using the GGUF parser,

14:15.000 --> 14:19.000
which is the key challenge for efficiently running AI models.

14:19.000 --> 14:23.000
As the model grows in size,

14:23.000 --> 14:31.000
memory management directly affects the inference speed,

14:31.000 --> 14:34.000
cost, and hardware utilization.

14:34.000 --> 14:40.000
Our team tries to simplify this process

14:40.000 --> 14:43.000
and enable users to optimize the resources

14:43.000 --> 14:46.000
without needing to be experts.

14:48.000 --> 14:50.000
Okay, let's see this.

14:50.000 --> 14:55.000
When we load a model using llama.cpp or something

14:55.000 --> 14:59.000
based on llama.cpp, just like llama-box,

14:59.000 --> 15:03.000
we can get some logs of the memory usage.

15:03.000 --> 15:07.000
As you see, we can divide the memory usage

15:07.000 --> 15:10.000
into four parts: the footprint,

15:10.000 --> 15:13.000
the base overhead of loading the model

15:13.000 --> 15:15.000
and running the inference server;

15:15.000 --> 15:18.000
the weights,

15:18.000 --> 15:22.000
the memory occupied by the model's parameters;

15:22.000 --> 15:25.000
the KV cache,

15:25.000 --> 15:27.000
the memory that stores the KV pairs

15:27.000 --> 15:30.000
in the attention mechanism;

15:30.000 --> 15:32.000
and the computation,

15:32.000 --> 15:36.000
the memory used during inference.

15:37.000 --> 15:43.000
For example, this is the Qwen 2.5 0.5B model.

15:43.000 --> 15:48.000
You can see that the weights are almost one gigabyte,

15:48.000 --> 15:53.000
the KV cache is almost 100 megabytes,

15:53.000 --> 15:59.000
and the computation is almost 300 megabytes.
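
NOTE
A minimal sketch of the four-part breakdown, using rough numbers like the ones
in this example (the footprint value is an assumption, not from the slide):
the estimate is simply the sum of the parts.
    footprint_mib = 200   # base overhead of loading the model / serving (assumed)
    weights_mib = 1024    # ~1 GiB of model parameters
    kv_cache_mib = 100    # KV pairs for the attention mechanism
    compute_mib = 300     # scratch buffers used during inference
    total_mib = footprint_mib + weights_mib + kv_cache_mib + compute_mib
    print(f"estimated total ~= {total_mib} MiB")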

15:59.000 --> 16:02.000
These numbers indicate that

16:02.000 --> 16:09.000
neglecting any part can lead to improper resource allocation.

16:09.000 --> 16:14.000
So let's break it down and see

16:14.000 --> 16:20.000
which are the key factors, for example.

16:20.000 --> 16:23.000
Sorry.

16:23.000 --> 16:26.000
First, the hardware backend.

16:26.000 --> 16:28.000
The choice of the hardware backend,

16:28.000 --> 16:30.000
such as CUDA, ROCm, or Metal,

16:30.000 --> 16:34.000
can influence the footprint.

16:34.000 --> 16:40.000
A typical example is that when we call cudaSetDevice,

16:40.000 --> 16:49.000
it occupies 160 megabytes on the NVIDIA P40,

16:49.000 --> 16:54.000
while it occupies 215 megabytes

16:54.000 --> 16:59.000
on the RTX 4090.

16:59.000 --> 17:03.000
And then the weights. In this part,

17:03.000 --> 17:07.000
the number of the model's parameters has a great impact

17:07.000 --> 17:08.000
on the weights,

17:08.000 --> 17:10.000
and choosing the right

17:10.000 --> 17:13.000
quantization

17:13.000 --> 17:17.000
format can reduce that part.

17:17.000 --> 17:19.000
And the embedding:

17:19.000 --> 17:22.000
the dimension of the embedding layer

17:22.000 --> 17:25.000
and the number of layers of the model

17:25.000 --> 17:31.000
largely determine the KV cache and the computation.

17:31.000 --> 17:35.000
And then the hyperparameters.

17:35.000 --> 17:37.000
As you see, the hyperparameters,

17:37.000 --> 17:40.000
like the context size, flash attention,

17:40.000 --> 17:42.000
and the cache type,

17:42.000 --> 17:48.000
play crucial roles in the KV cache

17:48.000 --> 17:51.000
and the computation estimation.

17:52.000 --> 17:54.000
Next,

17:54.000 --> 17:58.000
this is the case for the memory used by the KV cache.

17:58.000 --> 17:59.000
Let's break it down.

17:59.000 --> 18:03.000
The attention head size is determined by the embedding length

18:03.000 --> 18:05.000
and the attention head count.

18:05.000 --> 18:09.000
And then the embedding KV

18:09.000 --> 18:13.000
is calculated by multiplying the

18:13.000 --> 18:19.000
attention head size by the attention head count for KV.

18:19.000 --> 18:22.000
And then, as you see,

18:22.000 --> 18:24.000
the KV cache dimension

18:24.000 --> 18:28.000
is calculated by multiplying the embedding KV

18:28.000 --> 18:31.000
for GQA by the context size.

18:31.000 --> 18:35.000
So finally, we can get the KV cache per layer

18:35.000 --> 18:38.000
from the cache type

18:38.000 --> 18:40.000
and the KV cache dimension.

18:40.000 --> 18:43.000
The total KV cache adds up

18:43.000 --> 18:47.000
the KV cache of each layer.

18:48.000 --> 18:52.000
This shows how even a small parameter change

18:52.000 --> 18:59.000
can have a significant impact on the memory calculation.
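
NOTE
A minimal Python sketch of the per-layer KV cache arithmetic described above.
The hyperparameter values are illustrative assumptions, not read from a real
GGUF file.
    embedding_length = 4096   # embedding length
    head_count = 32           # attention head count
    head_count_kv = 8         # attention head count for KV (GQA)
    block_count = 32          # number of transformer layers
    context_size = 8192       # context size in tokens
    cache_type_bytes = 2      # f16 cache type -> 2 bytes per element
    head_size = embedding_length // head_count           # attention head size
    embedding_kv = head_size * head_count_kv             # per-token K (or V) width
    kv_dimension = embedding_kv * context_size           # K (or V) elements per layer
    kv_per_layer = kv_dimension * 2 * cache_type_bytes   # K + V, in bytes
    kv_total = kv_per_layer * block_count                 # add up every layer
    print(f"KV cache ~= {kv_total / 1024**2:.0f} MiB")    # -> 1024 MiB here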

19:00.000 --> 19:03.000
So, actually,

19:03.000 --> 19:06.000
estimating memory usage is more complex

19:06.000 --> 19:10.000
when we consider, as you see,

19:10.000 --> 19:15.000
flash attention, memory mapping, RPC,

19:15.000 --> 19:18.000
offloading, and the tensor split.

19:18.000 --> 19:21.000
So,

19:21.000 --> 19:24.000
to address this problem,

19:24.000 --> 19:27.000
we produced a tool

19:27.000 --> 19:28.000
called

19:28.000 --> 19:30.000
the GGUF parser.

19:30.000 --> 19:32.000
And it can remotely

19:32.000 --> 19:35.000
parse the metadata of the GGUF file

19:35.000 --> 19:38.000
and then assess the memory requirements

19:38.000 --> 19:42.000
without needing to download the model.
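
NOTE
A minimal sketch of the "no full download" idea: fetch only the first 24 bytes
of a remote GGUF file with an HTTP Range request and read the header counts.
The URL is a placeholder, and parsing the full metadata section needs more
than is shown here.
    import struct, urllib.request
    req = urllib.request.Request(
        "https://example.com/model.gguf",   # placeholder URL
        headers={"Range": "bytes=0-23"},    # GGUF header is 24 bytes
    )
    head = urllib.request.urlopen(req).read(24)
    magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", head)
    print(magic, version, n_tensors, n_kv)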

19:43.000 --> 19:48.000
Now, let's do a case study.

19:48.000 --> 19:52.000
We will try to calculate the difference

19:52.000 --> 19:56.000
between the actual running and the estimation

19:56.000 --> 19:58.000
results, as you see.

19:58.000 --> 20:01.000
In the RAM part,

20:01.000 --> 20:05.000
we can maintain the difference

20:05.000 --> 20:08.000
within 200 MB.

20:08.000 --> 20:14.000
And for the VRAM on one GPU,

20:14.000 --> 20:17.000
we can maintain

20:17.000 --> 20:23.000
the difference within 120 MB.

20:23.000 --> 20:27.000
When we change the context size

20:27.000 --> 20:32.000
from 4K to 64K,

20:32.000 --> 20:35.000
we can keep the difference in

20:35.000 --> 20:38.000
the RAM and in one GPU's VRAM

20:38.000 --> 20:40.000
still

20:40.000 --> 20:44.000
under 120 MB.

20:44.000 --> 20:47.000
When we use the tensor split

20:47.000 --> 20:51.000
across the two GPUs with a different model,

20:51.000 --> 20:53.000
and we compare the two GPUs,

20:53.000 --> 20:55.000
you can see that

20:55.000 --> 20:59.000
we can keep the difference in memory use

20:59.000 --> 21:02.000
under 15 MB.

21:02.000 --> 21:04.000
That's all.

21:04.000 --> 21:06.000
Thank you.

21:06.000 --> 21:08.000
Thank you very much.

