WEBVTT

00:00.000 --> 00:09.000
Let's talk about learning about Lumigator.

00:09.000 --> 00:10.000
OK, thank you very much.

00:10.000 --> 00:11.000
And hello, everyone.

00:11.000 --> 00:13.000
I'm Davide Eynard.

00:13.000 --> 00:16.000
As you can tell from my T-shirt, I work on cutting-edge technology

00:16.000 --> 00:18.000
at Mozilla AI.

00:18.000 --> 00:20.000
I didn't change the T-shirt, sorry.

00:20.000 --> 00:22.000
Also, they kind of ruined my joke before,

00:22.000 --> 00:24.000
because they were telling me, nice T-shirt.

00:24.000 --> 00:28.000
So I still hope that some of you didn't hear that part before.

00:28.000 --> 00:33.000
So, on to other things: I want to give a shout-out in advance.

00:33.000 --> 00:35.000
I kind of lied to you, so you could join this talk.

00:35.000 --> 00:38.000
It should have been "evaluating LLMs made simpler."

00:38.000 --> 00:41.000
But I hope it's still fine.

00:41.000 --> 00:44.000
I think evaluating is hard, per se, as a problem.

00:44.000 --> 00:47.000
What we are striving to do is to make it simpler and simpler.

00:47.000 --> 00:50.000
At some point, it will become actually simple.

00:50.000 --> 00:53.000
I hope we are already a good way through that.

00:53.000 --> 00:56.000
So, this is our starting point.

00:56.000 --> 00:59.000
About one year ago, we started interviewing people from companies,

00:59.000 --> 01:03.000
research teams, and trying to get an idea about what their experience

01:03.000 --> 01:05.000
with large language models was.

01:05.000 --> 01:10.000
And we kind of found that it was aligned to our own experience.

01:10.000 --> 01:12.000
Just a few months before doing these interviews,

01:12.000 --> 01:16.000
we were participating in the NeurIPS fine-tuning,

01:16.000 --> 01:19.000
one GPU, 24 hours, one LLM contest.

01:19.000 --> 01:23.000
And the idea was for people to fine tune the model using only one GPU

01:23.000 --> 01:25.000
for 24 hours at most.

01:25.000 --> 01:28.000
And they had to evaluate how their models performed.

01:28.000 --> 01:32.000
And we used what was available in terms of academic benchmarks

01:32.000 --> 01:34.000
to evaluate these things.

01:34.000 --> 01:37.000
And first of all, it was a daunting job.

01:37.000 --> 01:40.000
We had to run on many different models and architectures.

01:40.000 --> 01:44.000
We had to run whatever evaluation came out of the box.

01:44.000 --> 01:46.000
And we had to rely on kind of standard data sets,

01:46.000 --> 01:51.000
which didn't really reflect what a very specific use case could have been for anyone.

01:51.000 --> 01:55.000
And so we decided to try and build something that would help

01:55.000 --> 02:01.000
users work on their own specific use cases in the simplest possible way.

02:01.000 --> 02:08.000
So our target user is somebody who codes and works with AI and LLMs,

02:08.000 --> 02:11.000
but is not necessarily publishing at NeurIPS.

02:11.000 --> 02:15.000
Like there's plenty of people who just want to use models without having

02:15.000 --> 02:18.000
to develop models from scratch or even fine-tune them.

02:18.000 --> 02:22.000
There are plenty of models coming out which are already good enough for us to be using

02:22.000 --> 02:23.000
in some application.

02:23.000 --> 02:27.000
We just need to understand like which is the best one for our specific use case.

02:27.000 --> 02:32.000
And we are also aware of the fact that just trying all these models

02:32.000 --> 02:33.000
takes a lot of time.

02:33.000 --> 02:36.000
And so we would like to make this step easier for everyone.

02:36.000 --> 02:39.000
So, we are a lot of people.

02:39.000 --> 02:42.000
Right now, the actual contributors are just about a dozen.

02:42.000 --> 02:45.000
It's great that we already have some who are not on our team.

02:45.000 --> 02:49.000
And we would like to make AI in general easier, more transparent.

02:49.000 --> 02:52.000
Understand what is broken and possibly fix it.

02:52.000 --> 02:57.000
We are striving to also contribute back to the project that we are using.

02:57.000 --> 03:00.000
Also trying to avoid reinventing the wheel.

03:00.000 --> 03:04.000
So integrating existing tools where they are available.

03:04.000 --> 03:06.000
And finally, only building what is missing.

03:06.000 --> 03:08.000
So sometimes we make mistakes.

03:08.000 --> 03:10.000
We try to build something on our own.

03:10.000 --> 03:14.000
Then we find there's a very good project we can rely on, and then we move to that.

03:14.000 --> 03:20.000
And just try to use what's available in the ecosystem and give back to the ecosystem.

03:20.000 --> 03:23.000
So, building Lumigator.

03:23.000 --> 03:29.000
This is the definition of what we aim for, or how we want Lumigator to be.

03:29.000 --> 03:35.000
So a platform that guides users through the process of selecting the right language model for their specific needs.

03:35.000 --> 03:40.000
It's very ambitious, and we're still on the path of building the final version.

03:40.000 --> 03:44.000
But there are already some key features which I think are worth sharing.

03:44.000 --> 03:48.000
So the first one is that it's kind of infrastructure agnostic.

03:48.000 --> 03:54.000
That is, it builds on a stack that can be run on a single computer, on a laptop,

03:54.000 --> 03:57.000
on computers without GPUs.

03:57.000 --> 04:01.000
It can run on a local cluster or directly in the cloud.

04:01.000 --> 04:03.000
It can be hybrid.

04:03.000 --> 04:05.000
It can be fully distributed.

04:05.000 --> 04:07.000
You name it.

04:07.000 --> 04:12.000
It relies, again, not just on existing tools but also on standards for interoperability.

04:12.000 --> 04:22.000
Just to give an example, the OpenAI API, which is a de facto standard for communication between large language models and the tools that use those models.

04:22.000 --> 04:31.000
It's something we rely on, so we can easily switch from somebody using OpenAI to somebody using a llamafile or any other kind of local model.
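
A minimal sketch of what that interoperability means in practice, assuming the openai Python client and any OpenAI-compatible local server such as a llamafile or Ollama (the URL and model name here are illustrative):

    # Sketch: the same client code can talk to OpenAI or to any local
    # OpenAI-compatible server; only base_url (and the model name) change.
    from openai import OpenAI

    # hosted: OpenAI(api_key=...)  /  local: point base_url at the server
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

    resp = client.chat.completions.create(
        model="local-model",  # whatever model the local server serves
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(resp.choices[0].message.content)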

04:31.000 --> 04:33.000
We have built an API.

04:33.000 --> 04:36.000
This is what our system basically relies on.

04:36.000 --> 04:44.000
On top of this API we have built an SDK, and we have also built a UI so that users who want a friendlier interface can try and use it.

04:44.000 --> 04:50.000
We designed it to be extensible, not just by us but by the community and for the community.

04:50.000 --> 04:53.000
So you might not have heard about Lumigator yet.

04:53.000 --> 04:55.000
So it's great that you're here today.

04:55.000 --> 05:01.000
But the project has been developed publicly, in the open, for about six months.

05:01.000 --> 05:06.000
We just released an announcement of our, let's say, first MVP,

05:06.000 --> 05:11.000
the first version we thought was worth really advertising somehow.

05:11.000 --> 05:18.000
And we would like the community to tell us how to grow what is most important and useful for them.

05:18.000 --> 05:22.000
And possibly to contribute in any way they are capable of.

05:22.000 --> 05:30.000
We chose a very specific use case to get started with because, of course, evaluating any model for any use case would have been too complicated to

05:30.000 --> 05:36.000
immediately provide something for people to test, and we focused on summarization as a specific use case.

05:36.000 --> 05:39.000
This is what the UI looks like.

05:39.000 --> 05:44.000
It kind of looks like many experiment tracking tools but it's actually way easier than that.

05:44.000 --> 05:50.000
The main idea is you have two functionalities: you can upload datasets and you can run experiments on these datasets.

05:50.000 --> 05:53.000
The fact that we kind of

05:54.000 --> 06:05.000
overfit on summarization for now allows us to already suggest some models that we have tried for summarization and that we consider good ones, open-source models.

06:05.000 --> 06:14.000
We also allow people to use existing APIs for closed models, if they want to see how an open model compares to an existing API.

06:14.000 --> 06:19.000
In case they are thinking about the transition from one technology to another.

06:19.000 --> 06:32.000
And we already provide a lot of parameters which are pre-chosen for what generally works well in the summarization use case, and that can of course be tuned if people want to delve deeper into this.

06:32.000 --> 06:43.000
And for this we also allow people to directly access the API (I'll show it in a few slides), where you have a much more fine-grained possibility of choosing parameters.

06:43.000 --> 06:47.000
Finally we show different models within the same experiment.

06:47.000 --> 07:01.000
We allow you to dig deeper into a single model to do some vibe-check evaluations, so you can see how the model's summaries compare to the ground truth that was available, and see a few performance metrics that are calculated.

07:01.000 --> 07:09.000
Again, given this is the summarization example, performance metrics are chosen specifically for that, so even if you don't know exactly which ones to use,

07:01.000 --> 07:09.000
you will find the ones that are pre-selected for your specific use case. Of course, you might not know what a number means exactly, and this is something we are working on right now.

07:19.000 --> 07:27.000
We have also provided some documentation to tell you how to tell apart one metric from another and what it means for one metric to be higher or lower.
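
The talk doesn't name the exact metrics here, but ROUGE is one of the standard choices for summarization; a small sketch with Hugging Face's evaluate library (it also needs the rouge_score package), just to show what such a metric call looks like:

    # Sketch: scoring a model summary against the ground truth with ROUGE.
    import evaluate

    rouge = evaluate.load("rouge")
    scores = rouge.compute(
        predictions=["The model's one-sentence summary."],
        references=["The dataset's ground-truth summary."],
    )
    # higher is better for all ROUGE variants
    print(scores)  # keys like 'rouge1', 'rouge2', 'rougeL', 'rougeLsum'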

07:27.000 --> 07:42.000
You can also compare many models at the same time and see how they perform relative to each other, and we are also starting to have some extra metrics that tell you something more, to allow you to choose between one model and another.

07:42.000 --> 07:58.000
Maybe a model doesn't perform as well as GPT, but, I don't know, it takes only 240 megabytes, and so it's worth trying for your specific use case; or it has a much smaller runtime than another one.

07:58.000 --> 08:10.000
Some of them have a very low one because they are just APIs that we are hitting. You might or might not have a GPU, so it's worth knowing whether you can actually run a model or not.

08:10.000 --> 08:26.000
This is a very high-level description of the architecture of our tool. We rely on a Ray cluster for computation. One of the advantages of kind of devoting a full Ray cluster (which can actually run on your own laptop,

08:26.000 --> 08:35.000
so don't think about a cluster as many computers doing the work for you) is that it automatically schedules your jobs, so you can really just tell it:

08:35.000 --> 08:43.000
"I want to run evaluation on these models," and it will take care of enqueueing them one after the other, depending on the amount of resources you have available.
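
A minimal sketch of that scheduling behavior with plain Ray; the task and the model names below are placeholders for illustration, not Lumigator code:

    # Sketch: Ray queues tasks based on the resources each one requests.
    import ray

    ray.init()  # on a laptop this starts a local, single-node "cluster"

    @ray.remote(num_cpus=1)  # declare what each job needs
    def evaluate_model(name: str) -> str:
        return f"evaluated {name}"  # placeholder for real evaluation work

    # submit everything at once; Ray runs jobs as resources free up
    print(ray.get([evaluate_model.remote(m)
                   for m in ["model-a", "model-b", "model-c"]]))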

08:43.000 --> 08:52.000
Another nice thing: I saw people talking about models, LLMs or small LLMs, that are not necessarily Python models.

08:52.000 --> 08:59.000
The Ray cluster is agnostic to which kind of code you are running. What we are calling is basically a shell script.

08:59.000 --> 09:06.000
So when we call a Python script, we call: python, the script name, --config, and the configuration.

09:06.000 --> 09:12.000
So if you have something that is not a classical Python model you can still run it in this infrastructure.
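
As a sketch of that calling convention, a hypothetical job entrypoint could look like the following; the script and its config format are illustrative, not Lumigator's actual job code:

    # Sketch: a job entrypoint invoked as `python job.py --config config.json`.
    import argparse
    import json

    def main() -> None:
        parser = argparse.ArgumentParser(description="toy evaluation job")
        parser.add_argument("--config", required=True,
                            help="path to a JSON configuration file")
        args = parser.parse_args()
        with open(args.config) as f:
            config = json.load(f)
        # ...do the actual work here; it need not be a Python model...
        print(f"running with config: {config}")

    if __name__ == "__main__":
        main()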

09:12.000 --> 09:19.000
Of course we have not written any plugin for that, but we can work together to make it happen.

09:20.000 --> 09:29.000
Then there is the REST API, which is based on FastAPI and surfaces the different functionalities that the software is able to perform right now.

09:29.000 --> 09:35.000
We have an SQL database, which right now tracks the main information about what you have shared with the system.

09:35.000 --> 09:41.000
For instance the data sets that you have made available and the experiments that you run.

09:41.000 --> 09:56.000
We are moving a lot of the information we were writing in the SQL database onto MLflow, because we decided that, to allow people who want a more granular way of tracking their own experiments, it is worth using a third-party tool instead of writing everything on our own.

09:56.000 --> 10:04.000
So, coming back to one of the first slides: we don't want to reinvent the wheel, and this is how we are iterating on our tool.

10:04.000 --> 10:16.000
And finally we have a nice S3 object storage. You can think of AWS S3, but what we are actually deploying for custom, personal use cases is MinIO as the S3 object store.

10:16.000 --> 10:26.000
So it can run on your computer; again, the language it talks is the S3 API, so you can easily exchange it with another technology if you want.
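
A small sketch of talking to that object store through the S3 API, assuming a local MinIO with placeholder credentials (the endpoint and keys here are illustrative defaults):

    # Sketch: boto3 pointed at a local MinIO instead of AWS S3.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",  # local MinIO endpoint
        aws_access_key_id="minioadmin",        # placeholder credentials
        aws_secret_access_key="minioadmin",
    )

    # list the buckets holding datasets and results
    for bucket in s3.list_buckets()["Buckets"]:
        print(bucket["Name"])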

10:26.000 --> 10:33.000
This is the web UI; you already saw it before. This is just another page: the datasets instead of the experiments.

10:33.000 --> 10:45.000
Once you start your system (in our case it's one command, make local-up, after you have cloned the repository), you connect to localhost in your browser and you access this UI.

10:46.000 --> 11:07.000
Another thing you can do: you connect to localhost on port 8000, and you can see our FastAPI-based REST API. And you can already see (sorry, it went off the screen) that the API makes available way more parameters than are usually available in the UI.
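
A sketch of poking that REST API directly, assuming a local deployment on port 8000; the endpoint path below is a hypothetical example, and FastAPI's interactive docs (typically at /docs) list the real ones:

    # Sketch: calling the REST API directly instead of going through the UI.
    import requests

    BASE = "http://localhost:8000"

    resp = requests.get(f"{BASE}/datasets")  # hypothetical endpoint name
    resp.raise_for_status()
    print(resp.json())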

11:07.000 --> 11:16.000
We are now navigating the trade-off between making many functionalities available and making the system actually simple for the user to use.

11:16.000 --> 11:31.000
So we are hiding a lot of these, and this is where you can come in, telling us "I think it would be better if this was exposed in the UI" or "it's good to have it there, I'm just going to use it as an SDK or directly call the API."

11:31.000 --> 11:45.000
And then we have the Ray backend, which is what I talked to you about. You can check the logs of the jobs that are running and understand better what's happening, but at least the central part of this is already surfaced in the UI, so you don't have to use it to understand what's happening.

11:45.000 --> 11:56.000
And finally, as I told you before, the object store: in this case you can connect, again locally, and see what datasets you have saved and what results you have generated by running our tool.

11:56.000 --> 12:15.000
We have the SDK. I couldn't really show you the SDK itself, so I'm showing you how we are using it: we have made a test Jupyter notebook available that shows you how to use the SDK to programmatically call all the functionalities that we make available in our code.
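
Roughly the shape such a notebook could take; the client and method names below are hypothetical placeholders, so check the project's actual notebook and docs for the real interface:

    # Sketch: hypothetical SDK usage; all names here are placeholders.
    from lumigator_sdk import LumigatorClient  # hypothetical import

    client = LumigatorClient("http://localhost:8000")

    dataset = client.upload_dataset("samples.csv")       # hypothetical
    experiment = client.run_experiment(                  # hypothetical
        dataset_id=dataset.id,
        models=["facebook/bart-large-cnn"],              # example HF model
    )
    print(client.get_results(experiment.id))             # hypothetical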

12:15.000 --> 12:25.000
And there is also some documentation about how the different metrics work and how you can interpret the results from this specific evaluation.

12:25.000 --> 12:39.000
So, the main decisions that we took about this tool were that you should be able to plug different kinds of components in and out, to move stuff around.

12:39.000 --> 13:00.000
First of all, you can have different classes of supported models. We are supporting Hugging Face transformer models; APIs, again the OpenAI API is the first one, but there's the Mistral API for instance; and local models: vLLM, llamafiles, Ollama, everything which supports the OpenAI-compatible APIs.

13:00.000 --> 13:08.000
Now, I know the question of the month is, well, before it was "can it run Doom", now it's "can it run DeepSeek".

13:08.000 --> 13:22.000
It couldn't run it? Like, what, Lumigator cannot run DeepSeek? I changed one line in the configuration and I could run DeepSeek using Ollama, a locally served version of Ollama on my laptop. So I can say it can run DeepSeek, I think.
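
That one-line switch might look like the following, assuming Ollama's OpenAI-compatible endpoint and an already-pulled DeepSeek distill (the model tag is an example, not necessarily the one from the talk):

    # Sketch: pointing the same OpenAI-compatible client at Ollama
    # serving a DeepSeek model locally.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    resp = client.chat.completions.create(
        model="deepseek-r1:7b",  # the "one line" that changes
        messages=[{"role": "user", "content": "Summarize: evals are hard."}],
    )
    print(resp.choices[0].message.content)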

13:22.000 --> 13:35.000
I'm going to send in a PR so you can do it too. Different types of jobs: again, you can customize the jobs that you run. Right now we have annotation, inference, and evaluation.

13:35.000 --> 13:46.000
But you can also have composite jobs: you can choose to have a workflow that runs inference and evaluation all together, so you can just submit everything at once and have the results at the very end.

13:46.000 --> 14:02.000
You can have different levels of access (again: API, SDK, UI) and, finally, different kinds of deployments: as I told you before, some components can be non-local; you can move them around as long as you have an IP address to point to.

14:02.000 --> 14:18.000
So, what next? These are, at a very high level, our next steps. We want to make our evaluation results easier to interpret; we want to add further metrics, so new ways to evaluate what we already have, that is, summarization.

14:18.000 --> 14:31.000
And we want to expand our use cases, and translation is one of the next ones, partially because it's kind of close to summarization in terms of analyzing language results with respect to a ground truth, but also because we found it's a

14:31.000 --> 14:36.000
relevant use case, with potential collaborations with

14:36.000 --> 14:41.000
Firefox itself, which uses translations. And it kind of works as a

14:41.000 --> 14:56.000
Ulysses pact, you know: we build the evaluation, and we are also able to evaluate our own tools against others, so whenever we build a tool which is not as good as we expect, people can use this other tool to check whether it works well or not.

14:57.000 --> 15:05.000
Thank you very much. If you want to know more, there's a list of links on these slides, which are available on the FOSDEM website.

15:05.000 --> 15:15.000
And if you want to contribute, I like to cite this power law of participation that I learned about in 2006, when I was doing my PhD.

15:15.000 --> 15:25.000
There's really no barrier to entry. In the last 24 hours we got two pieces of feedback. One of them was, "oh, I didn't know

15:25.000 --> 15:36.000
where to get a dataset I could test this stuff on," and we realized we didn't provide any dataset to play with; we just supposed people already had their own or already knew what they could use for that.

15:36.000 --> 15:44.000
But it happens. The other feedback I got was a PR for a one-word typo in the documentation.

15:44.000 --> 15:49.000
I made the typo; I'm Italian, sorry. I'm not sorry for being Italian, I'm sorry for the typo.

15:49.000 --> 15:52.000
I'm very happy to be Italian.

15:52.000 --> 16:01.000
So it can really be next to nothing; you just choose your level. Everything is going to be super appreciated, especially if it's something that helps this grow more.

16:01.000 --> 16:03.000
So thank you very much everyone.

16:03.000 --> 16:16.000
Do I have questions?

16:16.000 --> 16:20.000
I'm not seeing any hands raised. Do we have questions in the chat?

16:20.000 --> 16:21.000
We don't have questions.

16:21.000 --> 16:23.000
I already know all your questions.

16:23.000 --> 16:24.000
Oh no there's a question.

16:24.000 --> 16:25.000
Thank you.

16:26.000 --> 16:28.000
Let me get you the mic.

16:28.000 --> 16:29.000
Yes.

16:29.000 --> 16:32.000
Thank you for always.

16:32.000 --> 16:34.000
Well that's not mine.

16:40.000 --> 16:41.000
Hi thank you.

16:41.000 --> 16:44.000
My question is about the UI.

16:44.000 --> 16:53.000
Why did you decide to display it in a list with all the different responses?

16:53.000 --> 16:55.000
Like this one or the previous one.

16:55.000 --> 16:56.000
This one.

16:56.000 --> 17:01.000
So this one is not useful to compare one model to another.

17:01.000 --> 17:03.000
But one thing we found,

17:03.000 --> 17:05.000
I mean, we all realized,

17:05.000 --> 17:12.000
is that when we actually compute evaluation metrics, which is what you would expect in a very scientific approach,

17:12.000 --> 17:17.000
you just have a table, you see which are the best results, and so on.

17:17.000 --> 17:19.000
Beauty is in the eye of the beholder.

17:20.000 --> 17:22.000
So look at this ground truth.

17:22.000 --> 17:26.000
This is the official ground truth provided in the DialogSum dataset.

17:26.000 --> 17:28.000
This is like an academic data set.

17:28.000 --> 17:29.000
This is provided.

17:29.000 --> 17:31.000
It's a great summary.

17:31.000 --> 17:33.000
It's a one sentence summary.

17:33.000 --> 17:35.000
If I run an LLM on this,

17:35.000 --> 17:38.000
I can tell it to just keep it to one sentence.

17:38.000 --> 17:41.000
But it might start giving different wording,

17:41.000 --> 17:43.000
a longer sentence, and everything.

17:43.000 --> 17:46.000
If I'm fine with that, sure; but then

17:46.000 --> 17:49.000
the result I might get for the LLM might not be great.

17:49.000 --> 17:53.000
And so having also a chance to just see the individual examples.

17:53.000 --> 17:58.000
And, for instance, sorting by every single evaluation metric,

17:58.000 --> 18:02.000
and comparing the ground truth with what you actually get from the model,

18:02.000 --> 18:10.000
was, in our opinion, a good way to always keep the user in control of what kind of mapping there is

18:10.000 --> 18:15.000
between what they perceive as good and what they get from the evaluation metrics.

18:16.000 --> 18:18.000
Do I have more questions?

18:22.000 --> 18:23.000
No.

18:23.000 --> 18:25.000
Thank you. You're free to go.

18:25.000 --> 18:26.000
Thanks a lot.

