WEBVTT

00:00.720 --> 00:09.860
[inaudible]

00:10.660 --> 00:12.880
hello everyone

00:42.880 --> 00:58.420
Hello everyone, so my name is Mateusz, and today we will be talking about

00:58.420 --> 01:07.320
Paddler, which is a tool for self-hosted large language models at scale.

01:07.320 --> 01:12.080
And to start with, as an introduction, I would like to mention that Paddler is a tool

01:12.080 --> 01:19.280
for LLMOps practitioners, and this is important because a lot of people do not even want

01:19.280 --> 01:24.720
to acknowledge that LLMOps exists at all, but I really think it deserves its own niche,

01:24.720 --> 01:30.360
it has its own place, because the tooling that we need to deal with when it comes

01:30.360 --> 01:35.860
to AI inference and the production usage of models is much different from what we

01:35.860 --> 01:41.080
usually see in DevOps, and what we see when deploying, for example, classical websites.

01:41.080 --> 01:45.740
And as an industry we are still figuring out what works and what doesn't; it is still

01:45.740 --> 01:51.020
an infancy stage, but at the same time, businesses that want to use models in production,

01:51.020 --> 01:55.420
self-hosted models, yearn for some kind of stability.

01:55.420 --> 02:00.260
And this is where Paddler comes in. Paddler is primarily a load balancer, custom

02:00.260 --> 02:03.260
tailored for LLMOps.

02:03.260 --> 02:07.900
We are working on more tools related to the infrastructure, but at this time it is primarily

02:07.900 --> 02:16.420
a load balancer, and it aims to provide scalability and some resilience in production.

02:16.420 --> 02:23.100
And to explain what Paddler is and how we got there, we will start from the beginning.

02:23.100 --> 02:28.180
So if you are wondering how to end up with a project like Paddler, you can follow these

02:28.180 --> 02:30.380
few simple steps.

02:30.380 --> 02:36.420
So let's say you want to host an open-source large language model in production.

02:36.420 --> 02:41.580
So you will probably want to start with a single host and you need to pick a runner.

02:41.580 --> 02:47.060
So you can pick between vLLM, llama.cpp, or Ollama, which is also based on llama.cpp,

02:47.060 --> 02:49.140
maybe something else.

02:49.140 --> 02:54.860
I chose llama.cpp because of how fast it implements new models, how fast it moves

02:54.860 --> 02:58.420
forward, and how many internals it exposes.

02:58.420 --> 03:03.460
It is really easy to build custom tools on top of llama.cpp.

03:03.460 --> 03:08.540
And let's say we downloaded llama.cpp, we downloaded the weights from Hugging

03:08.540 --> 03:14.140
Face or wherever else, we installed all that on a server, and we quickly notice

03:14.140 --> 03:16.780
that something does not quite add up.

03:16.780 --> 03:21.860
Because even if we take some smaller model, like a 10 or 12 billion parameter model, like

03:21.860 --> 03:27.980
Mistral Nemo for example, it requires about 10 gigabytes of VRAM.

03:27.980 --> 03:32.780
So you need 10 gigabytes of VRAM, and the cheapest solution is something with a T4

03:32.780 --> 03:39.900
GPU, which currently costs about $300 a month on AWS, and for that we can have an

03:39.900 --> 03:43.420
inference speed of about 200 tokens per second.

03:43.420 --> 03:49.780
The way concurrency works, it more or less divides tokens between users of the service.

03:49.780 --> 03:54.100
So if we have two users, we will have less than 100 tokens per user, which in general

03:54.140 --> 03:55.900
is the generation speed.

03:55.900 --> 04:02.620
If we have about eight users, we will have like 20 to 30 tokens per user, more or less.

04:02.620 --> 04:08.500
And to me, like 20 to 30 tokens per second is something on the verge of usability.

04:08.500 --> 04:12.620
And if you go lower than that, I think that makes your service unusable.
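
NOTE
As a rough sketch of the arithmetic above (the 200 tokens per second total and the even split between users are the talk's approximations, not measurements):

```python
# Approximate per-user generation speed under continuous batching,
# assuming total throughput is divided evenly between concurrent users.
def tokens_per_user(total_tps: float, concurrent_users: int) -> float:
    return total_tps / max(concurrent_users, 1)

for users in (1, 2, 8):
    print(users, "users ->", tokens_per_user(200, users), "tokens/s each")
```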

04:12.620 --> 04:18.380
So does this mean we can only deliver this kind of poor-quality experience for 300 dollars

04:18.380 --> 04:24.420
a month? Is this it? And the answer is a resounding yes; we are kind of coming back to the

04:24.420 --> 04:28.420
90s with this, and what do I mean by that?

04:28.420 --> 04:33.820
We are so used to the fact that we can buy a server for five bucks a month, and we can

04:33.820 --> 04:38.220
install some kind of web framework, and we can handle like hundreds or thousands of users

04:38.220 --> 04:43.380
concurrently on that service, and we are back to the situation where we need expensive

04:43.380 --> 04:46.980
hardware, and we can handle maybe tens of users at once.

04:46.980 --> 04:53.420
So in this sense, we need scalability from the start, and this is not only a technical

04:53.420 --> 04:59.140
issue, this is also a mentality issue, because when we start a new service, everybody tells

04:59.140 --> 05:03.660
us, you do not need scalability from the start, just push it on production, see if it

05:03.660 --> 05:11.100
gains some traction, maybe later you will need scalability; but here we need that from the beginning.

05:11.100 --> 05:16.300
So if we need any kind of scalability, it looks like we need some load balancing also,

05:16.300 --> 05:20.500
and there are options, there are a lot of load-balancing algorithms, but let us go through

05:20.500 --> 05:23.500
the most popular ones.

05:23.500 --> 05:25.740
So the simplest one would be round robin.

05:25.740 --> 05:30.700
So the way this works: let's say we have a few servers, each with llama.cpp and some

05:30.700 --> 05:35.980
model running, and the way round robin works, it will distribute requests in circles.

05:35.980 --> 05:39.860
So request number one goes to server number one, server number two, server number three,

05:39.860 --> 05:42.860
and then back to server number one, just in circles.
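
NOTE
The circular dispatch just described can be sketched in a few lines (server names are placeholders, not any real configuration):

```python
from itertools import cycle

# Round robin: hand out requests to servers in a fixed circular order.
servers = ["server-1", "server-2", "server-3"]
rr = cycle(servers)

def pick_server() -> str:
    return next(rr)

# Request 4 wraps back around to server-1.
assignments = [pick_server() for _ in range(4)]
print(assignments)
```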

05:42.860 --> 05:47.300
And this approach works well, if the response times are more or less the same all the

05:47.300 --> 05:48.300
time.

05:48.300 --> 05:55.660
For example, in a REST API. But the issue with LLMs is that the response times are really

05:55.660 --> 06:00.540
variable, because there is a seed of randomness, so even the exact same prompt

06:00.540 --> 06:07.500
can produce various results, and if we are exposing those prompts to users, one user can

06:07.500 --> 06:12.220
for example ask a yes-or-no question that the model can answer, and another user can ask

06:12.220 --> 06:17.020
the model to generate something like half of a book, really, which will take a few minutes.

06:17.020 --> 06:22.020
And if you are unlucky, and you will be, those requests will pile up on one server while

06:22.020 --> 06:26.940
the other servers are not busy at all, so that is not good.

06:26.940 --> 06:31.340
So maybe we can monitor resource usage, but this is really tricky also.

06:31.340 --> 06:34.860
It is possible, but it is really hard to get it right.

06:34.860 --> 06:40.740
So the GPU usage, for example, will stay more or less the same no matter

06:40.740 --> 06:46.340
how many users use the service, if you use, for example, continuous batching as the algorithm

06:46.340 --> 06:53.900
for parallelization. So to catch those differences, you would need to have

06:53.900 --> 07:01.340
some kind of high-resolution monitoring to check the GPU usage very often. And again, we

07:01.340 --> 07:05.820
are used to the fact, for example with websites, that the more users we have, the load

07:05.820 --> 07:10.820
gradually grows, and it is easy to decide when to scale up or down, but it is much harder

07:10.820 --> 07:14.020
here, so maybe something else.
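
NOTE
A sketch of the kind of high-frequency polling this would require, assuming an NVIDIA GPU with nvidia-smi on the PATH (an illustration of the idea, not anything Paddler actually does):

```python
import subprocess
import time

def parse_utilization(raw: str) -> int:
    # With --format=csv,noheader,nounits nvidia-smi prints a bare number per GPU.
    return int(raw.strip().splitlines()[0])

def gpu_utilization_percent() -> int:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_utilization(out.stdout)

def poll(interval_s: float = 0.5, samples: int = 10) -> list:
    # High-resolution sampling; with continuous batching the readings look
    # much the same whether 1 or 80 users are active, which is why raw GPU
    # utilization alone is a weak scaling signal.
    readings = []
    for _ in range(samples):
        readings.append(gpu_utilization_percent())
        time.sleep(interval_s)
    return readings
```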

07:14.020 --> 07:17.620
The option number three would be to use least connections.

07:17.620 --> 07:26.620
So we can, for example, serve inference over HTTP, and from the load balancer's perspective,

07:26.620 --> 07:31.980
maybe we can issue new requests to the servers that are handling the least amount of requests

07:31.980 --> 07:38.460
at the moment, which in itself would work well, but the issue is, the resources are so limited

07:38.460 --> 07:42.940
that we need something application-specific, we need a load balancer that knows what it is

07:42.940 --> 07:48.100
balancing exactly, so it can react accordingly to the situation, because again, it

07:48.100 --> 07:52.580
is really easy to clog up servers with new requests.
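
NOTE
A minimal sketch of least-connections selection (not Paddler's implementation; the counters here are illustrative):

```python
# Track in-flight requests per server and route each new request to the
# server currently handling the fewest.
in_flight = {"server-1": 4, "server-2": 1, "server-3": 3}

def pick_least_connections(counts: dict) -> str:
    return min(counts, key=counts.get)

target = pick_least_connections(in_flight)
in_flight[target] += 1  # the new request is now counted against that server
print(target)
```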

07:52.580 --> 07:58.540
So this is finally where Paddler comes in. It uses different balancing strategies depending

07:58.540 --> 08:03.860
on the endpoint, so when there is an endpoint that uses the inference device, it uses

08:03.860 --> 08:10.700
least connections, plus it uses llama.cpp internals to know how many resources the server can

08:10.700 --> 08:18.500
handle, so it doesn't bog it down further. It introduces some kind of resiliency to the infrastructure,

08:18.500 --> 08:24.500
which I will explain later, and also it is capable of sending stats and metrics telling if the

08:24.500 --> 08:30.340
infrastructure needs to scale up or down, which is, for example, useful in staging

08:30.340 --> 08:38.820
environments, because it can also scale from zero hosts. And the way Paddler works in general,

08:38.820 --> 08:47.180
you do not specify the llama.cpp workers at the load balancer level; instead you

08:47.180 --> 08:54.020
start the load balancer, and the agents that are installed alongside llama.cpp register

08:54.060 --> 09:00.020
themselves in Paddler, so it is kind of the reverse of what most load balancers do, and this

09:00.020 --> 09:05.700
allows Paddler to scale from zero hosts, which means you can start some cheap compute

09:05.700 --> 09:13.700
instance, and it can start more expensive GPU or whatever inference device instances

09:13.700 --> 09:23.860
when they are actually needed. It can also pose as a llama.cpp server itself, because

09:23.940 --> 09:30.980
for example, it can aggregate some metrics: for example, if you have two llama.cpp instances,

09:30.980 --> 09:36.500
and one of them is configured to handle eight requests at most, and the other one for example seven,

09:36.500 --> 09:44.180
it can pose itself as a single llama.cpp instance with 15 slots to handle. It can also buffer

09:44.180 --> 09:49.940
requests, delay them, or redirect them to llama.cpp instances that have free slots to handle them,
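
NOTE
The slot aggregation and buffering just described can be sketched like this (field names and the dispatch policy here are illustrative assumptions, not Paddler's actual data model):

```python
from __future__ import annotations
from dataclasses import dataclass

# Two llama.cpp instances with 8 and 7 slots are presented upstream
# as one instance with 15 slots.
@dataclass
class Upstream:
    name: str
    slots_total: int
    slots_busy: int = 0

    @property
    def slots_free(self) -> int:
        return self.slots_total - self.slots_busy

upstreams = [Upstream("llama-a", 8), Upstream("llama-b", 7)]

def aggregate_slots(ups: list) -> int:
    return sum(u.slots_total for u in ups)

def dispatch(ups: list):
    # Route to the upstream with the most free slots; None means
    # "buffer/delay the request until a slot or a new host appears".
    candidates = [u for u in ups if u.slots_free > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda u: u.slots_free)

print(aggregate_slots(upstreams))
```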

09:50.020 --> 09:56.100
so for example, if your infrastructure is scaling up or down, it can buffer the request,

09:56.100 --> 10:01.540
it can wait until a new host appears in the infrastructure, and only then it will direct

10:01.540 --> 10:08.980
it to that llama.cpp instance. So it helps to avoid situations where, for example, it would be enough

10:08.980 --> 10:13.780
for a user to wait just a few seconds for the request to be handled, where otherwise it

10:14.100 --> 10:21.540
would be dropped. In itself, Paddler is lightweight; it is based on the Pingora framework from

10:21.540 --> 10:28.900
Cloudflare, so if that kind of network framework is good enough for Cloudflare's stack, it is

10:28.900 --> 10:34.580
most likely good enough for your stack also. And the heaviest thing it adds on top

10:34.580 --> 10:40.660
of llama.cpp is just health checks, which I think are standard in application load balancers,

10:40.660 --> 10:47.060
so it does a health check once every 10 or 30 seconds, mostly for integrity's sake, to see how many

10:47.060 --> 10:56.820
slots are actually left in llama.cpp, and that's it, so scalability is also simple, because

10:57.620 --> 11:04.420
since it can pose as llama.cpp itself, it can aggregate some statistics, and you can experiment;

11:04.420 --> 11:09.620
you can, for example, balance Paddlers with Paddlers, or you can use some other

11:09.700 --> 11:13.860
products in front of it, like HAProxy, so you can experiment. This is also useful for

11:13.860 --> 11:20.180
staging environments, because you can set it up so only a cheap compute instance is running,

11:20.180 --> 11:25.620
and it will only start more expensive instances when they are actually needed. I have also seen

11:25.620 --> 11:31.620
people who are using Paddler in their home labs; for example, you have a Raspberry Pi Zero running all

11:31.620 --> 11:37.860
the time, and if it needs to do some inference, it starts your GPU device, with Wake-on-LAN or

11:37.940 --> 11:44.740
stuff like that. So it has a lot of uses. High availability is also kind of easy, because

11:45.620 --> 11:53.540
you can keep a Paddler instance in, for example, your backup region, because sometimes there are issues,

11:54.740 --> 12:02.580
maybe I am unlucky, but in the last 16 years, the data center in which I had servers burned

12:02.580 --> 12:09.060
down. So you do not really need high availability until you do. So the idea is, for example, you

12:09.060 --> 12:14.900
can keep your infrastructure in the primary region, and you can have a backup region

12:14.900 --> 12:21.220
on stand-by, and if your primary region goes down, you redirect requests to the backup region,

12:21.220 --> 12:26.740
and in that backup region, you don't have to keep your entire GPU infrastructure

12:26.740 --> 12:34.340
up all the time. And how does Paddler compare to other projects? So Paddler was not

12:34.340 --> 12:39.380
made to compete with any other project; Paddler was made to help projects cooperate.

12:39.380 --> 12:45.460
Still, we can compare it to some other projects, and people often ask me about llama.cpp RPC,

12:47.140 --> 12:53.540
so llama.cpp RPC is built into llama.cpp, and it does a similar thing to Paddler, but in the opposite way:

12:53.780 --> 13:02.980
you define the hosts in llama.cpp upfront, and this is a static setup: you just attach

13:02.980 --> 13:08.420
inference devices to llama.cpp and it will distribute requests, but Paddler is more focused on the

13:08.420 --> 13:15.140
infrastructure, it does those metrics, it can scale from zero hosts, and it makes sense to use

13:15.140 --> 13:20.500
both at the same time really, because you can configure an infrastructure, for example, in a way

13:20.500 --> 13:28.260
where you bring up small clusters of llama.cpp RPC and let Paddler manage them and balance

13:28.260 --> 13:36.100
requests between them. So you can kind of look at Paddler as a reverse proxy, while llama.cpp RPC is kind of

13:36.100 --> 13:43.780
a forward proxy. And for the future plans, we are considering bundling llama.cpp with Paddler,

13:43.780 --> 13:51.540
because that might help: currently you need to have a compatible version of llama.cpp with

13:51.540 --> 13:57.700
a compatible version of Paddler, and if we bundle both, it will avoid such issues. We are also

13:57.700 --> 14:04.180
considering adding support for other runners, for example Ollama, and we are planning to expose

14:05.220 --> 14:12.100
a semantically versioned API, not only for inference, but also to manage servers, because we want to

14:12.100 --> 14:18.660
go further into this infrastructure and operations path, and I would like to use this opportunity

14:18.660 --> 14:24.500
to give a shout-out to contributors, especially Luis Miguel, who recently contributed a lot of

14:24.500 --> 14:31.300
awesome stuff, like the console dashboard and the initial sketch of the Supervisor, but really to everyone.

14:32.820 --> 14:39.220
And if you would like to join our community, we are on GitHub, we have a Discord server, you can

14:39.300 --> 14:44.100
reach out to me personally, and thank you for your attention, if you have any questions, I will be

14:44.100 --> 14:49.780
happy to answer them.

14:50.820 --> 14:55.780
We have a few minutes for questions, so if somebody has a question, please shout it out.

15:01.140 --> 15:04.260
Okay, no questions so far. Thank you very much, Mateusz.

