WEBVTT

00:00.000 --> 00:10.000
Tarek is going to talk to you about the Firefox AI runtime.

00:10.000 --> 00:12.000
Hey.

00:12.000 --> 00:18.000
Sorry, it's a picture for a friend.

00:18.000 --> 00:19.000
All right.

00:19.000 --> 00:20.000
Hello, everyone.

00:20.000 --> 00:24.000
I hope you're enjoying the amazing weather outside.

00:24.000 --> 00:25.000
This is not Brussels.

00:25.000 --> 00:29.000
I don't know what's wrong, but it's pretty amazing.

00:29.000 --> 00:31.000
I hope you enjoy your weekend.

00:31.000 --> 00:33.000
So, I'm Tarek Ziadé.

00:33.000 --> 00:35.000
I work at Mozilla.

00:35.000 --> 00:38.000
I used to be pretty involved in the Python community.

00:38.000 --> 00:43.000
I created the French Python user group, AFPy, a long time ago.

00:43.000 --> 00:47.000
I also wrote some books about Python.

00:47.000 --> 00:51.000
And I'm part of the Firefox AI ML team at Mozilla.

00:51.000 --> 00:55.000
And I have 15 minutes sharp to finish my talk

00:55.000 --> 00:58.000
that takes 20 minutes, or this guy will punch me.

00:58.000 --> 00:59.000
All right.

00:59.000 --> 01:02.000
So, the Firefox AI runtime's goal:

01:02.000 --> 01:07.000
We want to provide an inference API inside the browser that can run offline.

01:07.000 --> 01:09.000
that we can use for all the stuff

01:09.000 --> 01:11.000
we want to do inside the browser,

01:11.000 --> 01:16.000
but that we also want to surface to web extension developers.

01:16.000 --> 01:21.000
Because we believe that it shouldn't be just our team building cool stuff on top of

01:21.000 --> 01:22.000
inference.

01:22.000 --> 01:25.000
It should be something that anyone out there could do.

01:25.000 --> 01:31.000
So that's roughly the goal we have with this new set of APIs.

01:31.000 --> 01:33.000
So inference.

01:33.000 --> 01:38.000
That's been happening in Firefox for some years already.

01:38.000 --> 01:41.000
Do you guys use Firefox?

01:41.000 --> 01:42.000
Everyone.

01:42.000 --> 01:43.000
Yeah.

01:43.000 --> 01:45.000
So yeah.

01:45.000 --> 01:50.000
And I'm pretty sure a lot of you are not native English speakers.

01:50.000 --> 01:53.000
So you always get pages in English.

01:53.000 --> 01:55.000
You might want to translate.

01:55.000 --> 01:59.000
And that's where the awesome Firefox translation feature comes in.

01:59.000 --> 02:01.000
It's running fully offline.

02:01.000 --> 02:05.000
And when we started to add some new AI features last year,

02:05.000 --> 02:07.000
there was a lot of drama in the community:

02:07.000 --> 02:08.000
Wow.

02:08.000 --> 02:10.000
Firefox is doing AI in the browser.

02:10.000 --> 02:11.000
But guess what?

02:11.000 --> 02:16.000
We had that back in 2019 with that project.

02:16.000 --> 02:21.000
It's based on Bergamot, which is itself based on Marian NMT.

02:21.000 --> 02:22.000
So, NMT:

02:22.000 --> 02:26.000
it's a neural machine translation engine.

02:26.000 --> 02:31.000
So it's roughly an inference runtime that specializes in doing translation.

02:31.000 --> 02:32.000
It's a very neat project.

02:32.000 --> 02:34.000
I don't have a QR code,

02:34.000 --> 02:36.000
but this one is easy to type:

02:36.000 --> 02:39.000
browser.mt, and you get all the details.

02:39.000 --> 02:41.000
And it's using RNN models.

02:41.000 --> 02:43.000
They're trained on language pairs.

02:43.000 --> 02:47.000
And it's really fast and works really nicely.

02:47.000 --> 02:50.000
This is the architecture as of today.

02:50.000 --> 02:55.000
So with Firefox translations, when you want to translate a page,

02:55.000 --> 02:58.000
it's going to fork a dedicated inference process.

02:58.000 --> 03:01.000
It doesn't use the fork server yet,

03:01.000 --> 03:07.000
but I guess that's one of the candidates for future work.

03:07.000 --> 03:11.000
And once the process is forked, it's running Bergamot,

03:11.000 --> 03:15.000
the library, which is compiled as WASM.

03:15.000 --> 03:19.000
And it calls our online service called Remote Settings,

03:19.000 --> 03:21.000
where it's going to grab the WASM file

03:21.000 --> 03:23.000
and the models it's going to use.

03:23.000 --> 03:26.000
In my case, for example, English to French.

03:26.000 --> 03:29.000
Those are 10 to 20 megabytes each.

03:29.000 --> 03:34.000
It puts them inside your browser in IndexedDB and then runs the inference.

03:34.000 --> 03:38.000
And there is also this cool project called gemmology,

03:38.000 --> 03:44.000
which is going to take part of the work that happens during inference

03:44.000 --> 03:48.000
and run it natively on the best backend on your computer.

03:48.000 --> 03:52.000
So if you have money and you have the latest Apple M3,

03:52.000 --> 03:58.000
it's going to pick the NEON i8mm backend to do the matrix multiplications.

03:58.000 --> 04:00.000
And it's going to go very fast.

04:00.000 --> 04:06.000
And if you have less money and you have like a smaller CPU,

04:06.000 --> 04:08.000
it's going to look at it and see what it can use out of it.

04:08.000 --> 04:12.000
So this is what we have with Firefox translations.

04:12.000 --> 04:16.000
But we want to do more stuff with inference in the browser.

04:16.000 --> 04:20.000
So when we started to look at it last year,

04:20.000 --> 04:24.000
we couldn't use Bergamot, which is specialized in translation.

04:24.000 --> 04:28.000
So we started to list all the features

04:28.000 --> 04:32.000
we thought about experimenting with in the browser.

04:32.000 --> 04:36.000
So, describing images: for this, we need an image-to-text model.

04:36.000 --> 04:38.000
Recognizing words.

04:38.000 --> 04:42.000
For example, what if when I visit a page,

04:42.000 --> 04:46.000
I can detect that the word Brussels is in the page,

04:46.000 --> 04:48.000
and that it's a city?

04:48.000 --> 04:52.000
For that, you would use a named-entity recognition model.

04:52.000 --> 04:56.000
What if I could classify all my tabs using the title of each tab?

04:56.000 --> 05:00.000
Some for traveling, some for the best French fries in Brussels,

05:00.000 --> 05:02.000
etc. etc.

05:02.000 --> 05:06.000
That would use a text classification or sentiment analysis model.

05:06.000 --> 05:08.000
So that's another one.

05:08.000 --> 05:10.000
Semantic search.

05:10.000 --> 05:12.000
Grab your text from the page, create vectors, index them,

05:12.000 --> 05:14.000
and then you can do some cool stuff with this.

05:14.000 --> 05:16.000
etc. etc.

05:16.000 --> 05:18.000
Text to speech, speech to text.

05:18.000 --> 05:22.000
There are a bunch of models we would love to run offline into the browser.

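NOTE
A minimal sketch of the semantic-search task listed above, using the
Transformers.js feature-extraction pipeline; the package and model names
here are illustrative assumptions, not necessarily what Firefox ships.
import { pipeline } from "@huggingface/transformers";
// Turn page text into vectors that can be indexed for semantic search.
const embed = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");
const vectors = await embed(
  ["best French fries in Brussels", "trip planning"],
  { pooling: "mean", normalize: true }, // one normalized vector per string
);
// Store the vectors, then rank them by cosine similarity against a query vector.
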
05:22.000 --> 05:26.000
So we looked at all the projects out there,

05:26.000 --> 05:30.000
and we decided to use Transformers.js.

05:30.000 --> 05:34.000
So Transformers.js is a project run by Hugging Face.

05:34.000 --> 05:38.000
It is a JavaScript port of Hugging Face's

05:38.000 --> 05:42.000
Transformers project, which is in Python,

05:42.000 --> 05:48.000
one of the most used libraries out there for people

05:48.000 --> 05:52.000
that build machine learning and want to do some stuff,

05:52.000 --> 05:54.000
some training.

05:54.000 --> 06:00.000
And the Transformers.js project is built on top of the ONNX Runtime

06:00.000 --> 06:08.000
from Microsoft, which is an inference runtime that can be compiled to WASM,

06:08.000 --> 06:12.000
and that can also run against WebGPU.

06:12.000 --> 06:14.000
And by using this stack,

06:14.000 --> 06:20.000
we're able to use over 1,000 models

06:20.000 --> 06:22.000
that are on Hugging Face,

06:22.000 --> 06:24.000
for all the tasks I've described in my previous slide.

06:24.000 --> 06:30.000
So that gives us a super high level API we can use to do inference in JavaScript.

06:30.000 --> 06:34.000
Here is an example of using Transformers.js.

06:34.000 --> 06:40.000
So if I want to describe an image with cats,

06:40.000 --> 06:44.000
I can pass the URL to the pipeline.

06:44.000 --> 06:46.000
I just say, yeah, I want to use an image-to-text model.

06:46.000 --> 06:48.000
I provide the name of the model,

06:48.000 --> 06:52.000
and I get back the description of the image.

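NOTE
A minimal sketch of the pipeline call just described; the Transformers.js
pipeline API is as documented by Hugging Face, but the model name and image
URL are illustrative assumptions.
import { pipeline } from "@huggingface/transformers";
// Create an image-to-text pipeline; the model is downloaded and cached on first use.
const captioner = await pipeline("image-to-text", "Mozilla/distilvit");
// Pass the image URL directly; fetching and pre-processing happen under the hood.
const output = await captioner("https://example.com/cats.jpg");
console.log(output); // e.g. [{ generated_text: "two cats lying on a couch" }]
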
06:52.000 --> 06:54.000
And that is super high level.

06:54.000 --> 06:56.000
This is doing a lot of stuff under the hood.

06:56.000 --> 07:04.000
Transformers.js implements all the classes that do the pre-processing and post-processing,

07:04.000 --> 07:08.000
because when you get an image, you can't just pass the URL like that.

07:08.000 --> 07:10.000
You need to grab the image,

07:10.000 --> 07:14.000
and then you need to convert the image into arrays

07:14.000 --> 07:18.000
that you pass to the model, etc.

07:18.000 --> 07:22.000
For the model, you need to grab the model from Hugging Face,

07:22.000 --> 07:26.000
download it, put it in your cache on disk,

07:26.000 --> 07:29.000
and then run the inference engine on top of that.

07:29.000 --> 07:33.000
So all this is abstracted away by the Transformers.js project,

07:33.000 --> 07:36.000
and we think it's a very good way for people

07:36.000 --> 07:40.000
who are not machine learning specialists,

07:40.000 --> 07:44.000
to be able to experiment and play with inference.

07:44.000 --> 07:50.000
So we shipped Transformers.js in Firefox 133.

07:50.000 --> 07:52.000
We added ONNX Runtime Web,

07:52.000 --> 07:56.000
which is the WASM runtime from ONNX,

07:56.000 --> 07:59.000
as a backend, alongside Bergamot.

07:59.000 --> 08:02.000
So now when you're running an inference process,

08:02.000 --> 08:05.000
if you use a feature that needs a specific backend,

08:05.000 --> 08:09.000
you pull that from our server, compile it,

08:09.000 --> 08:11.000
put it in the inference process,

08:11.000 --> 08:14.000
which is completely isolated from the web page,

08:14.000 --> 08:17.000
which makes it more robust and secure.

08:17.000 --> 08:22.000
And then we store all the models in IndexedDB

08:22.000 --> 08:24.000
in a way that's cross origin.

08:24.000 --> 08:29.000
So if I download model A and it's used in several places,

08:29.000 --> 08:31.000
it's not going to download it twice,

08:31.000 --> 08:34.000
it's a single cache of models.

08:34.000 --> 08:37.000
And then it uses the same inference process

08:37.000 --> 08:39.000
as the Firefox translations implementation.

08:39.000 --> 08:43.000
So that's roughly what we've shipped in Firefox.

08:43.000 --> 08:46.000
And the first feature that uses it is in PDF.js.

08:46.000 --> 08:49.000
So, who uses PDF.js?

08:49.000 --> 08:50.000
All right.

08:50.000 --> 08:54.000
Did you know that you can add images in PDF.js?

08:54.000 --> 08:56.000
Who does that?

08:56.000 --> 08:58.000
All right.

08:58.000 --> 09:03.000
Okay, it might sound like a weird use case,

09:03.000 --> 09:07.000
but one use case is when you want to add signatures to documents:

09:07.000 --> 09:10.000
you can put your signature like that.

09:10.000 --> 09:13.000
But for us, it's also a way to start experimenting

09:13.000 --> 09:15.000
with inference in Firefox,

09:15.000 --> 09:17.000
because you can put an image there,

09:17.000 --> 09:19.000
and then we can process stuff on the image.

09:19.000 --> 09:21.000
So the demo here is about that.

09:21.000 --> 09:24.000
So, let me click there.

09:24.000 --> 09:26.000
So here it's a PDF,

09:26.000 --> 09:30.000
so this is me with my cat.

09:30.000 --> 09:35.000
And I'm taking a bit longer to resize it

09:35.000 --> 09:39.000
because the model works in the background.

09:39.000 --> 09:44.000
And yeah, I have a caption.

09:44.000 --> 09:47.000
So this caption was generated by a small,

09:47.000 --> 09:50.000
image-to-text model.

09:50.000 --> 09:52.000
That is quite small.

09:52.000 --> 09:55.000
It's around 200 megabytes,

09:55.000 --> 09:57.000
if I remember the size on disk.

09:58.000 --> 10:02.000
And it's 180 million parameters.

10:02.000 --> 10:06.000
It's a ViT encoder for the image,

10:06.000 --> 10:10.000
and a GPT-2 decoder for the text.

10:10.000 --> 10:13.000
Yeah, so that's the demo for alt text.

10:13.000 --> 10:15.000
So that's, I don't know,

10:15.000 --> 10:22.000
like maybe 15 lines of code on the PDF.js side.

10:22.000 --> 10:24.000
So that's cool,

10:24.000 --> 10:26.000
but we're not a lot of people.

10:26.000 --> 10:28.000
And we need help from the community.

10:28.000 --> 10:32.000
So, yeah, we want to enable

10:32.000 --> 10:34.000
the Mozilla community to build some cool features

10:34.000 --> 10:36.000
we have not thought about.

10:36.000 --> 10:39.000
Don't zoom in, don't look at the people's faces.

10:39.000 --> 10:41.000
This was AI generated,

10:41.000 --> 10:44.000
and it's very creepy.

10:44.000 --> 10:50.000
Very, that's scary.

10:50.000 --> 10:52.000
So web extension AI API.

10:52.000 --> 10:56.000
It's a wrapper on top of the AI API we use internally,

10:56.000 --> 10:58.000
that we want to provide to the community.

10:58.000 --> 11:00.000
It's available in Nightly.

11:00.000 --> 11:02.000
It's behind a pref in Firefox 134.

11:02.000 --> 11:06.000
I would use the Nightly version because I've already fixed some bugs there.

11:06.000 --> 11:09.000
So if you use the 134, it's not as good.

11:09.000 --> 11:11.000
It wraps the runtime,

11:11.000 --> 11:15.000
and it offers a very high-level API to do inference in the browser,

11:15.000 --> 11:16.000
with low friction.

11:16.000 --> 11:18.000
Of course, you could do that with vanilla,

11:18.000 --> 11:21.000
Transformers.js, but you get a bunch of benefits.

11:21.000 --> 11:23.000
It runs in a separate process.

11:23.000 --> 11:27.000
You have a cache, like a list of models

11:27.000 --> 11:29.000
that are cached in your browser,

11:29.000 --> 11:32.000
not in the web cache, but in IndexedDB.

11:32.000 --> 11:35.000
So we don't have the same restrictions.

11:35.000 --> 11:39.000
And we're going to iterate on making it faster

11:39.000 --> 11:43.000
by doing some native things in there.

11:43.000 --> 11:46.000
So it's just going to be better.

11:46.000 --> 11:49.000
And it gives us a way to iterate with the community,

11:49.000 --> 11:52.000
because we don't know yet what could be a good API

11:52.000 --> 11:54.000
to do AI in the browser.

11:54.000 --> 11:57.000
So our hope is that through the experiments people can do,

11:57.000 --> 12:01.000
they will be able to help us shape the best API

12:01.000 --> 12:03.000
for inference in the browser.

12:03.000 --> 12:05.000
So that's how it looks.

12:05.000 --> 12:07.000
It's very similar to what you've seen before.

12:07.000 --> 12:11.000
We have, under the trial namespace in web extensions,

12:11.000 --> 12:13.000
an ml namespace, where you can create an engine.

12:13.000 --> 12:16.000
Here I'm creating an engine for summarization.

12:16.000 --> 12:20.000
I'm trying to talk faster because of this guy.

12:20.000 --> 12:21.000
He's looking at me.

12:21.000 --> 12:24.000
And it's going to pick a model by default for you,

12:24.000 --> 12:26.000
and you can run it, and it's going to do inference.

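NOTE
A sketch of the slide's code, based on the trial API as described in this
talk; the option names and result shape follow Firefox's docs at the time,
and are best treated as assumptions since the API may evolve.
// Requires the "trialML" permission (see the manifest sketch later on).
await browser.trial.ml.createEngine({ taskName: "summarization" });
// A default model is picked for the task; inference runs in the isolated process.
const result = await browser.trial.ml.runEngine({
  args: ["The long text to summarize ..."],
});
// The result mirrors Transformers.js output, e.g. [{ summary_text: "..." }]
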
12:26.000 --> 12:28.000
So I have a demo here.

12:28.000 --> 12:31.000
Same stuff as in PDF.js, but in a web extension;

12:31.000 --> 12:33.000
all the code is in the web extension.

12:33.000 --> 12:38.000
So here, if I load my temporary extension,

12:38.000 --> 12:42.000
and I grant permission for the extension to download models from the web,

12:42.000 --> 12:52.000
that comes from another presentation where I talked a lot about that.

12:52.000 --> 12:58.000
And here, if I right-click, it's going to download the model,

12:58.000 --> 12:59.000
the first time you use it.

12:59.000 --> 13:03.000
So that takes a little bit of time because they're big files.

13:03.000 --> 13:07.000
But once they're there, the next run is going to be super fast.

13:07.000 --> 13:08.000
What happened?

13:08.000 --> 13:10.000
Oh, oh, crap.

13:11.000 --> 13:16.000
Sorry.

13:16.000 --> 13:21.000
Oops.

13:21.000 --> 13:24.000
Yeah.

13:24.000 --> 13:27.000
Almost there.

13:27.000 --> 13:30.000
Yeah, running inference, and then it

13:30.000 --> 13:32.000
describes what's in the image.

13:32.000 --> 13:35.000
So that's roughly the three lines you've seen.

13:35.000 --> 13:38.000
And if I run it again here, it doesn't

13:38.000 --> 13:42.000
download the model, because it's already cached.

13:42.000 --> 13:43.000
That's it.

13:43.000 --> 13:47.000
Thank you.

13:47.000 --> 13:48.000
Thank you, Tarek.

13:48.000 --> 13:52.000
So you have to memorize the first link.

13:52.000 --> 13:54.000
It's an invitation.

13:54.000 --> 13:58.000
Or just look it up on the internet: it's the Discord, the Mozilla AI Discord.

13:58.000 --> 13:59.000
You can go there.

13:59.000 --> 14:00.000
We can interact.

14:00.000 --> 14:02.000
You can try to build cool web extension.

14:02.000 --> 14:03.000
And we can talk about it.

14:03.000 --> 14:06.000
And the second link is the documentation.

14:07.000 --> 14:10.000
And for the record, every slide is supposed to be available on

14:10.000 --> 14:13.000
Pretalx, which is the FOSDEM website.

14:13.000 --> 14:16.000
So if you're a speaker here, please upload your presentation to

14:16.000 --> 14:19.000
Pretalx, so people can find them later.

14:19.000 --> 14:21.000
Thank you for joining.

14:21.000 --> 14:24.000
Do you have any questions?

14:24.000 --> 14:25.000
Yes.

14:25.000 --> 14:46.000
I want to ask you: is this a Mozilla-browser-specific API, or

14:46.000 --> 14:48.000
are there discussions to make it standard?

14:48.000 --> 14:51.000
Can you repeat? Sorry, I can't hear well.

14:51.000 --> 14:55.000
Is this a Mozilla-browser-specific API?

14:55.000 --> 15:01.000
Or are there discussions to make it a standard for the whole set of

15:01.000 --> 15:03.000
JS web APIs?

15:03.000 --> 15:04.000
Yeah.

15:04.000 --> 15:06.000
So that's the next question.

15:06.000 --> 15:09.000
So right now, it's Mozilla-specific.

15:09.000 --> 15:12.000
It's what we call a trial API.

15:12.000 --> 15:15.000
It's under a trial namespace when you build a web extension.

15:15.000 --> 15:19.000
You have to provide a specific permission in your web extension.

15:19.000 --> 15:21.000
And that's Mozilla-specific.

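NOTE
A sketch of the manifest side of that permission; "trialML" is the permission
name used in Firefox's web extension docs, included here as an assumption.
{
  "manifest_version": 3,
  "name": "ml-demo",
  "version": "0.1",
  "permissions": ["trialML"]
}
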
15:21.000 --> 15:26.000
Chrome, on the other hand, Google, they're working on something at the browser level.

15:26.000 --> 15:29.000
But we think it's premature.

15:29.000 --> 15:34.000
We want to iterate with the community to try to understand what it means

15:34.000 --> 15:36.000
to do inference in the browser.

15:36.000 --> 15:38.000
At this point, we don't really know.

15:38.000 --> 15:45.000
So our strategy there is to try to iterate with this web extension API.

15:45.000 --> 15:47.000
We want to do some cycles.

15:47.000 --> 15:49.000
And at some point, if we feel like there is something

15:49.000 --> 15:54.000
surfacing that we could propose as an official API

15:54.000 --> 15:58.000
for the browser, we will do it at the W3C level.

15:58.000 --> 16:02.000
But for now, we think it's a bit premature.

16:02.000 --> 16:04.000
Do we have other questions?

16:04.000 --> 16:05.000
Okay.

16:05.000 --> 16:07.000
While we're waiting for another question:

16:07.000 --> 16:08.000
People want to sit.

16:08.000 --> 16:12.000
So please pack up.

16:12.000 --> 16:15.000
What's your status on other inference?

16:15.000 --> 16:19.000
Like Whisper, for example, speech to text?

16:19.000 --> 16:20.000
Sorry.

16:20.000 --> 16:24.000
What's the status for other inference that you want to run?

16:24.000 --> 16:27.000
For other, like other stuff than PDF.js?

16:27.000 --> 16:28.000
No.

16:28.000 --> 16:33.000
On one slide, you showed that you want to be able to run stuff like speech to text, text to speech.

16:33.000 --> 16:34.000
Right.

16:34.000 --> 16:36.000
What's the status right now?

16:36.000 --> 16:37.000
Okay.

16:37.000 --> 16:40.000
We're still experimenting.

16:40.000 --> 16:42.000
We don't have it yet.

16:42.000 --> 16:48.000
So for text to speech, some people have shown interest in running models like

16:48.000 --> 16:50.000
Kokoro.

16:50.000 --> 16:52.000
And we're going to enable it.

16:52.000 --> 16:55.000
So if you build a web extension and you want to do text to speech,

16:55.000 --> 16:57.000
it's going to be super simple.

16:57.000 --> 16:59.000
For the other side,

17:00.000 --> 17:05.000
speech to text.

17:05.000 --> 17:12.000
This is something we're starting to look at, to revive what we started some years ago,

17:12.000 --> 17:17.000
and trying to build something that works well in the browser.

17:17.000 --> 17:20.000
But yeah, it's coming this year.

17:20.000 --> 17:21.000
Thank you.

17:21.000 --> 17:23.000
More questions?

17:23.000 --> 17:25.000
Thank you very much.

17:25.000 --> 17:26.000
Thank you everyone.

17:26.000 --> 17:28.000
Enjoy FOSDEM.

17:28.000 --> 17:31.000
Thank you.

