WEBVTT

00:00.000 --> 00:04.000
Thank you so much for joining in.

00:04.000 --> 00:08.000
I'm going to be talking about the Hugging Face ecosystem for local AI/ML.

00:08.000 --> 00:14.000
When I started to write these slides, it was supposed to be more of a talk about

00:14.000 --> 00:17.000
Hugging Face, the libraries that we have, and so on.

00:30.000 --> 00:34.000
But as I started finishing these slides, it became more of a snapshot in

00:34.000 --> 00:39.000
time of where we are as a community of local AI/ML:

00:39.000 --> 00:46.000
engineers, frameworks, you know, models, and so on, as of today.

00:46.000 --> 00:51.000
So I hope you take it more as a snapshot of where we are, and through

00:51.000 --> 00:55.000
time we can sort of look back and see where we were back in 2025,

00:55.000 --> 01:00.000
and so on.

01:00.000 --> 01:01.000
Cool, who am I?

01:01.000 --> 01:02.000
I'm VB.

01:02.000 --> 01:06.000
I lead on-device and open-source collaborations at Hugging Face.

01:06.000 --> 01:08.000
I am a local AI/ML

01:08.000 --> 01:14.000
enjoyer, much like most of you.

01:14.000 --> 01:18.000
So a lot of my day-to-day work involves working with our partners like

01:18.000 --> 01:22.000
Meta, Google, and so on, to bring the Llamas closer to you, to bring the

01:22.000 --> 01:27.000
Gemmas closer to you and so on and so forth.

01:27.000 --> 01:32.000
I do a lot of glue work, which means I make tools work better together, through

01:32.000 --> 01:39.000
our ecosystem of libraries or through our contributions and so on.

01:39.000 --> 01:41.000
So just a quick sort of poll.

01:41.000 --> 01:45.000
How many of you have run an LLM or a model on device?

01:45.000 --> 01:46.000
All right.

01:46.000 --> 01:49.000
We're in the right room.

01:49.000 --> 01:50.000
Perfect.

01:50.000 --> 01:53.000
Just to showcase what is possible right now,

01:53.000 --> 02:01.000
I'm just going to do a quick demo of running an LLM on WebGPU.

02:01.000 --> 02:05.000
If only I can figure out how exactly this works.

02:05.000 --> 02:06.000
Perfect.

02:06.000 --> 02:14.000
So this is a very quick demo of running a 1.7-billion-parameter model in the browser using

02:14.000 --> 02:19.000
WebGPU, and this is powered by MLC, which is powered by TVM.

02:19.000 --> 02:25.000
And the idea behind this is a very simple one, wherein you take some huge amount

02:25.000 --> 02:28.000
of unstructured text, or in this case a very small amount,

02:28.000 --> 02:30.000
and define a schema.

02:30.000 --> 02:31.000
So in this case.

02:31.000 --> 02:32.000
Can you make it bigger?

02:32.000 --> 02:33.000
Again.

02:33.000 --> 02:36.000
Is that visible?

02:36.000 --> 02:41.000
Or maybe it's the contrast, like the lights that are above?

02:42.000 --> 02:44.000
In the interest of time maybe.

02:44.000 --> 02:45.000
Yeah.

02:45.000 --> 02:46.000
Okay.

02:46.000 --> 02:47.000
All right.

02:47.000 --> 02:53.000
So there's some unstructured text over there, which is essentially an issue where a person

02:53.000 --> 02:56.000
describes the problem that they're facing.

02:56.000 --> 02:59.000
And then you define some sort of a schema.

02:59.000 --> 03:01.000
So for example, from this issue we want to get a description,

03:01.000 --> 03:04.000
we want to assign some labels, and so on.

03:04.000 --> 03:07.000
In a non-ML world,

03:07.000 --> 03:11.000
you would do this with a series of if-else statements, some regular expressions,

03:11.000 --> 03:13.000
you know, some control flow, and so on.

03:13.000 --> 03:20.000
But in this case we're just going to throw this at an LLM, a small LLM in this case.

03:20.000 --> 03:29.000
And we ask it to take this text and give us nice, pretty JSON out of it.

03:29.000 --> 03:30.000
Right.

03:30.000 --> 03:34.000
And so what you see over here is that it created a title, missing password or whatever,

03:34.000 --> 03:39.000
a description, labels, a priority; it also estimated how much time it might take.

03:39.000 --> 03:42.000
And it creates a nice issue preview over here.

03:42.000 --> 03:43.000
Right.

03:43.000 --> 03:48.000
All of this is just two lines of LLM calls for the end user.
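
NOTE
The demo itself runs in the browser on WebGPU via MLC/WebLLM, but the same
schema-to-JSON idea can be sketched locally with llama-cpp-python's JSON-schema
mode. A minimal sketch, assuming a small instruct GGUF on disk; the filename,
schema, and prompts below are illustrative, not the demo's actual code.
from llama_cpp import Llama
llm = Llama(model_path="smollm2-1.7b-instruct-q4_k_m.gguf")  # hypothetical local GGUF
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "description": {"type": "string"},
        "labels": {"type": "array", "items": {"type": "string"}},
        "priority": {"type": "string"},
    },
    "required": ["title", "description", "labels", "priority"],
}
issue = "After the last update the login form has no password field, so I cannot sign in."
result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Extract a structured issue from the report."},
        {"role": "user", "content": issue},
    ],
    response_format={"type": "json_object", "schema": schema},  # constrain output to the schema
)
print(result["choices"][0]["message"]["content"])  # nice, pretty JSON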

03:48.000 --> 03:54.000
However, in the backend it's thousands and thousands of lines of code, written in

03:54.000 --> 03:57.000
TVM, that make it possible for you to do this.

03:57.000 --> 03:58.000
Right.

03:58.000 --> 03:59.000
And this is just one runner.

03:59.000 --> 04:00.000
This is just TVM.

04:00.000 --> 04:05.000
Similarly, we have other runners: there's of course CUDA,

04:05.000 --> 04:10.000
there's of course Metal for Apple, and so on.

04:10.000 --> 04:15.000
Let me see, is it this one?

04:15.000 --> 04:17.000
Okay.

04:17.000 --> 04:25.000
So now that we've looked at a very small use case of what you can use

04:25.000 --> 04:29.000
some of these models for, how do you run stuff like this?

04:29.000 --> 04:30.000
Right.

04:30.000 --> 04:33.000
There's an entire ecosystem of, well, first of all,

04:33.000 --> 04:36.000
Sorry for the visually explosive slide here.

04:36.000 --> 04:44.000
but there's an entire ecosystem of tools and apps through which you can do a lot of this stuff.

04:44.000 --> 04:46.000
So, first of all,

04:46.000 --> 04:52.000
my personal favorite is llama.cpp, which sort of creates an ecosystem of tools.

04:52.000 --> 04:54.000
There's Ollama,

04:54.000 --> 04:55.000
there is llamafile,

04:55.000 --> 04:57.000
there's LM Studio, PocketPal,

04:57.000 --> 04:59.000
llama-cpp-python, and so on.

04:59.000 --> 05:05.000
This is an ecosystem pretty much built on top of llama.cpp,

05:05.000 --> 05:09.000
which is built on GGML, as Roma was talking about before.

05:09.000 --> 05:11.000
Then there is MLX.

05:11.000 --> 05:14.000
This is seemingly the new kid on the block.

05:14.000 --> 05:21.000
It's been, I guess, a year since MLX has been around, and it's specifically for the

05:21.000 --> 05:28.000
Apple ecosystem, specifically for M-series Apple chips.

05:28.000 --> 05:32.000
And there's support for vision language models,

05:32.000 --> 05:38.000
LLMs, and so on, and there are some third-party apps, like LM Studio and local chat apps, that you can use for

05:38.000 --> 05:42.000
running stuff on your phone or on your Mac, and so on.
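
NOTE
A minimal sketch of the MLX path on an M-series Mac, assuming `pip install
mlx-lm`; the repo id is an example quant from the mlx-community org.
from mlx_lm import load, generate
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")  # example 4-bit quant
text = generate(model, tokenizer, prompt="What is on-device inference?", max_tokens=128)
print(text)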

05:42.000 --> 05:44.000
Then there is ONNX Runtime.

05:44.000 --> 05:49.000
This has been around for a long while; it's actually from before my time as well.

05:49.000 --> 05:52.000
It's been here since, like, ancient times.

05:52.000 --> 05:55.000
And there's an ecosystem of libraries that use it.

05:55.000 --> 05:57.000
So there is ONNX Runtime Web,

05:57.000 --> 06:00.000
through which you can do some WebGPU stuff.

06:00.000 --> 06:05.000
There's normal ONNX Runtime, through which you can run stuff on Metal as well as CUDA, and so on.

06:05.000 --> 06:11.000
And there's an entire ecosystem of libraries over there as well.
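
NOTE
A minimal sketch of the ONNX path via Optimum, assuming `pip install
optimum[onnxruntime]`; export=True converts the PyTorch checkpoint to ONNX on
the fly, and the model id is an example.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer
model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
inputs = tokenizer("Hello from ONNX Runtime:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))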

06:11.000 --> 06:13.000
Then there's TVM.

06:13.000 --> 06:16.000
My example before was powered by that as well.

06:16.000 --> 06:19.000
They have a WebGPU implementation called WebLLM.

06:19.000 --> 06:22.000
Then they have MLC LLM,

06:22.000 --> 06:29.000
which is both for deployments as well as for local users.
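
NOTE
A minimal sketch of MLC LLM's Python engine, assuming `pip install mlc-llm`
plus a matching mlc-ai wheel; the HF:// model id is an example prebuilt model
from MLC's docs.
from mlc_llm import MLCEngine
engine = MLCEngine("HF://mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC")  # example prebuilt model
response = engine.chat.completions.create(
    messages=[{"role": "user", "content": "Summarize what TVM does."}],
)
print(response.choices[0].message.content)
engine.terminate()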

06:29.000 --> 06:35.000
Then there's of course whisper.cpp, which is also built on top of GGML.

06:35.000 --> 06:37.000
And there are apps like

06:37.000 --> 06:40.000
MacWhisper, Superwhisper, and so on.

06:40.000 --> 06:42.000
Then of course, more on the deployment side,

06:42.000 --> 06:46.000
there are vLLM, TGI, and others,

06:46.000 --> 06:50.000
which are not on the slides, but I will add them.

06:50.000 --> 06:56.000
So that's not an exhaustive, but sort of an okay,

06:56.000 --> 07:01.000
representation of what kind of stuff we have going on right now.

07:01.000 --> 07:06.000
And now that we've seen what the ecosystem looks like,

07:06.000 --> 07:11.000
what's the typical workflow of running these models on device?

07:11.000 --> 07:23.000
And when I say on device, it could be anything, ranging from a Raspberry Pi to your Mac to even your Apple Watch, whatever it may be.

07:23.000 --> 07:32.000
The typical workflow... okay, this is where I need to get my...

07:32.000 --> 07:36.000
Well, okay, it's fine, we don't need it.

07:36.000 --> 07:42.000
Okay, so typically you would go on huggingface.co, or hf.co, which is a hub of models, in case

07:42.000 --> 07:45.000
you don't know, where anyone can upload

07:45.000 --> 07:53.000
models. In this snapshot you can see some of the recent, really, really good models, like DeepSeek R1.

07:53.000 --> 07:57.000
There's Mistral Small 24B, a bunch of 3D models, and so on.

07:57.000 --> 08:05.000
And typically you go on the Hub and you look for a use case; you know, the use case could be

08:05.000 --> 08:09.000
text generation, whatever it may be.

08:09.000 --> 08:13.000
Then you select an LLM of your choice.

08:13.000 --> 08:18.000
Let's say we choose DeepSeek R1 1.5B; it's a distilled model which DeepSeek

08:18.000 --> 08:22.000
recently released, a 1.5B version of it.

08:22.000 --> 08:25.000
And then you start chatting with it.

08:25.000 --> 08:29.000
A typical example of that would be that you just use the llama.cpp CLI

08:29.000 --> 08:34.000
and then a descriptor of where the model is

08:34.000 --> 08:36.000
and its precision.

08:36.000 --> 08:42.000
Similarly, if you use Ollama, then you can do it with Ollama as well.
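
NOTE
A minimal sketch of the same flow from Python via llama-cpp-python, which can
pull a GGUF straight off the Hub by repo id, filename pattern, and quant; the
repo below is an example, not necessarily the one on the slide.
from llama_cpp import Llama
llm = Llama.from_pretrained(
    repo_id="unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF",  # example GGUF repo
    filename="*Q4_K_M.gguf",  # the precision/quant you want
)
out = llm.create_chat_completion(messages=[{"role": "user", "content": "hey"}])
print(out["choices"][0]["message"]["content"])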

08:42.000 --> 08:46.000
And typically this is how the interface would look.

08:46.000 --> 08:51.000
Not sure if you're familiar with the DeepSeek models;

08:51.000 --> 08:58.000
they are known to sort of think, quote unquote, or reason, and I find this snippet very funny because I just say "hey".

08:58.000 --> 09:06.000
And then it just goes on a thinking spree: is the user being friendly to me, are they being approachable, and so on.

09:06.000 --> 09:11.000
I find it a really nice exercise to...

09:11.000 --> 09:13.000
or, like, just...

09:13.000 --> 09:15.000
Okay.

09:15.000 --> 09:27.000
Then let's look at some facts about the usage of these models, how much do we see

09:27.000 --> 09:37.000
llama.cpp quants, which are GGUF, or MLX quants, and so on, being used right now, in this snapshot.

09:37.000 --> 09:48.000
So typically on the Hub we see multi-billion-plus pulls every month, which means that roughly more than 2 billion

09:49.000 --> 09:54.000
pulls or downloads of these models are happening every month.

09:54.000 --> 09:57.000
I hope that this continues to increase.

09:57.000 --> 10:04.000
There are roughly 200,000 LLMs on the Hub right now, which are spread across,

10:04.000 --> 10:09.000
you know, llama.cpp quants, MLX, ONNX,

10:09.000 --> 10:12.000
And so on.

10:13.000 --> 10:21.000
Then, for GGUF, as I mentioned, which is the file format for llama.cpp,

10:21.000 --> 10:27.000
there are roughly 80,000 of those on the Hub, which people are pulling,

10:27.000 --> 10:35.000
as I mentioned up there, and we see roughly 6 to 6.5 petabytes per day of egress on the Hub.

10:35.000 --> 10:45.000
Which is people pulling these models, using them either for CI, for their use cases, or in production, of course.

10:45.000 --> 10:46.000
Okay.

10:46.000 --> 11:01.000
Now comes the meatier part: how do you prepare these models? These models in most cases are just a bunch of PyTorch scripts

11:01.000 --> 11:07.000
in terms of the modeling file, and you have safetensors equivalents in terms of the model weights.

11:07.000 --> 11:13.000
So how do you go from that to actually running them on device?

11:13.000 --> 11:18.000
And so actually I'm going to move this here.

11:18.000 --> 11:28.000
So you would typically start with picking your desired LLM; this could be a Mistral, a DeepSeek, pick your favorite one.

11:28.000 --> 11:39.000
And this could be fine-tuned, so you could have a fine-tuned model for your specific use case, or this could be pretrained, as the provider gave it.

11:39.000 --> 11:46.000
You would first check if it's supported in transformers, which is a library

11:46.000 --> 11:53.000
by Hugging Face that is typically used as the backbone for all the modeling files, and then

11:53.000 --> 12:00.000
you can use the model via PyTorch or JAX within that.
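
NOTE
A minimal sketch of that first check, assuming PyTorch; from_pretrained raises
if the architecture is not supported in your transformers version, and the
model id is an example.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # fails fast if unsupported
inputs = tokenizer("Hello!", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))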

12:00.000 --> 12:08.000
Then you check if it's supported in any of the frameworks that you want to use it with, which is llama.cpp, MLX, ONNX, and so on.

12:08.000 --> 12:18.000
First of all, if the model is not supported in transformers, you either open a pull request or you look through existing pull requests to see if it's being added.

12:18.000 --> 12:25.000
Then you check if it's there in llama.cpp, MLX, ONNX, whichever one you want to use it with.

12:25.000 --> 12:32.000
If it's not there, again, you search, you open a pull request, or you follow along.

12:32.000 --> 12:38.000
Assuming that you've made it so far, you convert the model to whichever specific format they have.

12:38.000 --> 12:47.000
So in the case of llama.cpp it's the GGUF format, in the case of MLX it's the safetensors format, and in the case of ONNX, of course, it is the ONNX format.

12:47.000 --> 13:05.000
And once the model is converted, the task is not done: you check if the outputs are good, if they match the original model, and so on.

13:05.000 --> 13:16.000
Spoiler alert: it almost never does on the first try, probably because of bad quantization, or because the architecture is not supported properly, or whatever.

13:16.000 --> 13:19.000
We will go through some of those in the next slide.
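
NOTE
A hedged sketch of that output check: greedy-decode the same prompt in
transformers and in the freshly converted GGUF and compare by eye. The model id
and filename are examples.
from llama_cpp import Llama
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
prompt = "The capital of France is"
tok = AutoTokenizer.from_pretrained(model_id)
hf = AutoModelForCausalLM.from_pretrained(model_id)
ids = tok(prompt, return_tensors="pt")
ref = tok.decode(hf.generate(**ids, max_new_tokens=16, do_sample=False)[0])
gguf = Llama(model_path="model-f16.gguf")  # the converted file
conv = gguf(prompt, max_tokens=16, temperature=0.0)["choices"][0]["text"]
print("transformers:", ref)
print("gguf:", conv)  # spoiler: they rarely match on the first try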

13:19.000 --> 13:33.000
And last but not least, you double-check the chat template. The chat template is how the model knows how to process an input conversation, how to output, what kind of tools to use, and so on.

13:33.000 --> 13:44.000
This is also another one of those risk vectors where you have an issue almost 99% of the time, and then, if you've made it so far, maybe profit.
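
NOTE
A minimal sketch of sanity-checking a chat template with transformers: render a
conversation to a string and compare it against what the converted runtime
produces. The model id is an example.
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
chat = [{"role": "user", "content": "hey"}]
rendered = tok.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print(rendered)  # eyeball the special tokens and role markers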

13:45.000 --> 14:13.000
And this is a fairly complex process, especially when you have a new architecture, or a non-standard architecture: maybe it's a Mamba-style model, maybe it's some linear attention variant that you might need to re-implement from scratch, and so on. And whenever I'm explaining to someone how to run a new architecture, this is typically how it sounds.

14:13.000 --> 14:31.000
And of course, it does sound simple when it's just five bullet points, but after having helped quite a lot of our collaborators and partners bring

14:32.000 --> 14:39.000
their permissive models to different frameworks, there are a lot of things that can go wrong.

14:39.000 --> 14:47.000
First of all, tokenizers can typically go wrong. That could be

14:48.000 --> 15:01.000
just that the tokenizer does not convert well, or there is an issue with the way it tokenizes certain Unicode characters, and so on.
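
NOTE
A hedged sketch of a tokenizer-parity check between the original transformers
tokenizer and a converted GGUF, including some tricky Unicode; the model id and
filename are examples.
from llama_cpp import Llama
from transformers import AutoTokenizer
hf_tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
gguf = Llama(model_path="model-f16.gguf", vocab_only=True)  # load only the vocab
for text in ["hello world", "naïve café", "日本語テスト", "🤗🚀"]:
    hf_ids = hf_tok.encode(text, add_special_tokens=False)
    gg_ids = gguf.tokenize(text.encode("utf-8"), add_bos=False)
    status = "OK" if hf_ids == gg_ids else "MISMATCH"
    print(f"{status}: {text!r}")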

15:01.000 --> 15:13.000
When Llama 3.1 was released, there was an issue with RoPE, because when you scale from a smaller context to a longer context, it could be that it's not supported properly for that particular architecture.

15:13.000 --> 15:20.000
Chat templates: this is, in my opinion, the root of all evil on this planet.

15:20.000 --> 15:32.000
I'm quite excited that llama.cpp now has support for Jinja, yes, and we can finally say goodbye to a lot of these issues.

15:32.000 --> 15:41.000
And then, of course, we now see a rise of multimodal models. You know, people who use

15:43.000 --> 15:51.000
ChatGPT or whatever advanced voice mode see that there are new modalities intertwined with the same LLM backbone.

15:51.000 --> 16:06.000
That is non-standard in llama.cpp right now, and there's ongoing work on refactoring it and making it easy to bring in vision language models and so on.

16:07.000 --> 16:19.000
And of course, if there is a new architecture (for example, DeepSeek R1 was an MoE, a mixture of experts, which was an unsupported architecture), you're adding support for that, and so on.

16:19.000 --> 16:33.000
Here are just some examples, which I quite enjoy. This is a PR from my colleague Pedro: when Llama 3 was released, there was a change in the tokenizer in llama.cpp. This was supposed to be a seemingly two-line change.

16:33.000 --> 16:47.000
But actually, if you later click on the ref and see how the PR progressed, it became a seemingly complex PR

16:47.000 --> 16:51.000
just to get all the rough edges and cases mapped.

16:51.000 --> 17:05.000
This is a fairly recent one, to add support for DeepSeek V3 and R1 within llama.cpp. This is done by Son, who's also going to be presenting later today,

17:05.000 --> 17:09.000
and who's a core contributor to llama.cpp as well,

17:09.000 --> 17:16.000
where he adds support for the distilled models, because the distilled models had a different tokenizer.

17:16.000 --> 17:25.000
Nice. So, just as a sidebar: how does Hugging Face contribute to this ecosystem?

17:25.000 --> 17:34.000
Of course we have the Hub, where we host millions of models. We do not pass on the cost of keeping these models up;

17:34.000 --> 17:42.000
we will continue to keep these models up and have them be available for everyone for free.

17:42.000 --> 17:46.000
We have our core set of libraries: there's transformers, tokenizers,

17:46.000 --> 17:52.000
safetensors, huggingface_hub, and a lot of other libraries. These are

17:52.000 --> 18:04.000
the libraries that all ecosystem players, be it MLX, be it ONNX, be it MLC,

18:04.000 --> 18:08.000
use in one shape or form.

18:08.000 --> 18:16.000
We have specialized libraries for different backends: there are the Optimum libraries for exporting to ONNX, to Neuron,

18:16.000 --> 18:20.000
also to TRT-LLM, and so on.

18:20.000 --> 18:24.000
We have some really specialized

18:24.000 --> 18:30.000
on-device libraries. There's Candle, which is a framework that a good friend and colleague of mine,

18:30.000 --> 18:33.000
Laurent, wrote.

18:33.000 --> 18:38.000
It is cross-backend and scales quite nicely.

18:38.000 --> 18:41.000
There's an ecosystem of libraries on top of it as well.

18:41.000 --> 18:43.000
There's mistral.rs, and so on.

18:43.000 --> 18:48.000
And we have transformers.js, which utilizes ONNX Runtime Web

18:49.000 --> 18:54.000
for running LLMs and models on the web.

18:54.000 --> 19:01.000
We have a slew of converter Spaces, which you can use to create llama.cpp, MLX, and ONNX equivalents of models.

19:01.000 --> 19:07.000
And of course we're core contributors to llama.cpp, MLX, TGI, and so on.

19:07.000 --> 19:11.000
And as we go through this year,

19:11.000 --> 19:17.000
we want to increase these quite a bit more, and we want to make sure that we help

19:17.000 --> 19:22.000
and contribute to the ecosystem as much as possible.

19:22.000 --> 19:27.000
And this is just a snapshot of the stuff that you can do on the Hub.

19:27.000 --> 19:34.000
There are open conversion Spaces that you can use to create your own GGUF or MLX quants,

19:34.000 --> 19:40.000
and so on. Even if you want to create simple imatrix quants,

19:40.000 --> 19:46.000
you can do that just by going on this Space.

19:46.000 --> 19:50.000
You can just pass it any transformers model, and it converts it and gives it to you.

19:50.000 --> 19:54.000
And then you can use it as you want.

19:54.000 --> 20:00.000
This is just a snapshot of our Optimum optimized-inference libraries.

20:00.000 --> 20:14.000
And I have two quick graphs from the Hub about how we see llama.cpp and MLX, and how we see the craze for on-device

20:14.000 --> 20:17.000
and these ecosystem players increasing.

20:17.000 --> 20:21.000
This is the number of quants created in a self-serve way

20:21.000 --> 20:24.000
via this Space that I mentioned before.

20:24.000 --> 20:27.000
As you can see, April is when we started building it.

20:27.000 --> 20:31.000
In April it was roughly about 600 quants created per month,

20:31.000 --> 20:35.000
and now you can see that it's well over 2,500.

20:35.000 --> 20:37.000
This is people just creating them on their own.

20:37.000 --> 20:45.000
I find it quite nice to look at this graph every now and then to see how much interest there is from people, and so on.

20:45.000 --> 20:51.000
Another one: this is very interesting for those who are interested in development on Apple silicon.

20:51.000 --> 20:56.000
This is MLX. You can see that for the past year it was just hovering around there;

20:56.000 --> 20:59.000
these are unique users of MLX quants.

20:59.000 --> 21:03.000
You can see that it's hovering around there all the way to October.

21:03.000 --> 21:09.000
And then I don't know what happened in October, but it pretty much quadrupled around that time.

21:09.000 --> 21:15.000
And in January, I also don't know what happened, it quadrupled or roughly 3x'd there again.

21:15.000 --> 21:21.000
So this is something which I'm monitoring quite closely, to see how we can help there, and so on.

21:21.000 --> 21:24.000
This is my last slide.

21:24.000 --> 21:28.000
Going into this year, what am I most excited about?

21:28.000 --> 21:34.000
There's of course major Jinja chat templating support in llama.cpp.

21:34.000 --> 21:39.000
And there's a multimodality revamp coming up in llama.cpp as well, which is very exciting.

21:39.000 --> 21:51.000
I'm quite excited about us shipping a transformers backend in both TGI and vLLM, which means that you don't really need to have a different set of architectures for it.

21:51.000 --> 21:57.000
There's a little push for DDUF, which is single-file diffusion models.

21:57.000 --> 22:01.000
That's something that I'm quite excited to scale up this year.

22:01.000 --> 22:07.000
More accurate quantization schemes, better calibration for the existing quants.

22:08.000 --> 22:16.000
And, yeah, leaner and easier packaging, like what llamafile has been doing, also with the stable-diffusion file and whisperfile, and so on.

22:16.000 --> 22:19.000
That's it and thank you so much.

