WEBVTT

00:00.000 --> 00:09.000
Hello, my name is Kasia. I'm a little bit stressed out about the time limitation. I've

00:09.000 --> 00:17.000
been warned. I'll do my best and I'm going to tell you where to find more. So I'm going

00:17.000 --> 00:23.000
to tell you about the project that I've been working on for the last several months. It was

00:23.000 --> 00:29.000
a project with the community on open datasets for LLM training. This project I've done

00:29.000 --> 00:33.000
with a colleague at Mozilla, but also in partnership with EleutherAI. If you don't know

00:33.000 --> 00:38.000
them, they are a fantastic independent AI research lab, an open source research lab, and Sebastian

00:38.000 --> 00:42.000
Majstorovic was part of this. He's sitting there, standing up now, so that you know that

00:42.000 --> 00:46.000
you can bombard him with questions as well. He's much more technical than me, if you

00:46.000 --> 00:54.000
have any questions. He also has an afternoon talk. So this project is part of a bigger series of efforts that Mozilla

00:54.000 --> 00:59.000
is doing to create space for conversations with the leaders of the open source AI space.

00:59.000 --> 01:04.000
Because if you went to the keynote of Mitchell this morning, you definitely are aware of

01:04.000 --> 01:07.000
that there's a lot of disagreement, different opinions. What

01:07.000 --> 01:11.000
would open source and openness actually mean in the context of AI? The definitions are

01:11.000 --> 01:15.000
flying around and not everybody agrees, but we are all part of one

01:15.000 --> 01:20.000
bigger tent. Mozilla wants to create space to have discussions, to ask the right

01:20.000 --> 01:27.000
questions and to find answers, one by one, to the thorny questions, and possibly create

01:27.000 --> 01:31.000
some common artifacts like definitions and things so that we can start looking

01:31.000 --> 01:36.000
in a common direction, because, you know, in the fight that we are fighting,

01:36.000 --> 01:43.000
the enemy is really dominant and huge. So I wanted to tell you a little bit about the project.

01:43.000 --> 01:47.000
We brought people together to talk about what are the challenges of putting together

01:47.000 --> 01:52.000
open data sets for AI training, like what is the way forward, what are the successes

01:52.000 --> 01:57.000
out there, and what are the recommendations, where the investment can come from. It's a story of

01:57.000 --> 02:06.000
community, of frustration, of community struggle, of PDFs, and of a lot of successes and

02:06.000 --> 02:14.000
hope as well. So what's happening right now? Right now, almost no one releases

02:14.000 --> 02:22.000
the data sets that have been used to train AI anymore. OpenAI doesn't do it,

02:22.000 --> 02:26.000
Meta doesn't do it, Google doesn't do it. We talk about open source models and so on —

02:26.000 --> 02:30.000
even DeepSeek, as you probably know, is talked about as being an open source model,

02:30.000 --> 02:35.000
but the data set that they used to train it is not revealed, and it's not

02:35.000 --> 02:41.000
even open in any way, as you'll see later, according to one of the definitions we have here.

02:41.000 --> 02:49.000
And yet the data, of course, is the fundamental element of AI. So how did that happen?

02:49.000 --> 02:53.000
In the beginning of the generative AI explosion, there were

02:53.000 --> 02:58.000
still some data sets being released, like for LLaMA 1 or T5 by Google. However, as you probably

02:58.000 --> 03:04.000
are aware, there was a copyright outrage, and copyright lawsuits followed, with a lot of people

03:05.000 --> 03:10.000
not being happy with data being scraped by the big companies and exploited to train

03:10.000 --> 03:17.000
hugely successful generative AI. These copyright lawsuits created a lot of legal risk and

03:17.000 --> 03:25.000
a lot of fear in the industry around releasing data sets. So most of these big companies

03:25.000 --> 03:30.000
stopped releasing the data sets, because the lawyers basically told them to just

03:30.000 --> 03:36.000
not be doing that. That, of course, comes in addition to the competitive pressure in the industry.

03:36.000 --> 03:41.000
And that pertains to the big companies as well as to small research labs.

03:41.000 --> 03:46.000
There's a lot of legal risk in releasing the data. However, of course, we know

03:46.000 --> 03:51.000
we are here in the open source community. There are a lot of advantages to open data sets.

03:51.000 --> 03:57.000
Open data sets bring about competition, so that smaller players can build on the data

03:57.000 --> 04:02.000
that is out there. It's about accountability and transparency. And of course, also

04:02.000 --> 04:10.000
research. So we can, and we have to, do better. So what we did at Mozilla: we organized

04:10.000 --> 04:15.000
this data set convening in June 2024 together with EleutherAI, where we invited

04:15.000 --> 04:20.000
people who are actually building open data sets, who struggle with that and who succeed

04:20.000 --> 04:26.000
in it. Around 30 leaders of the field from a range of organizations came;

04:26.000 --> 04:30.000
you can see them here: EleutherAI, Hugging Face, Gretel of synthetic data.

04:30.000 --> 04:35.000
Pleias, that was featured here before as part of the builders,

04:35.000 --> 04:38.000
Spawning, a really great organization as well. So a lot of people —

04:38.000 --> 04:41.000
I even reached out on LinkedIn to people, and everybody was very

04:41.000 --> 04:45.000
eager: yeah, let's meet and discuss it, because it's a real mess and we're

04:45.000 --> 04:51.000
trying our best, but it's hard. Although, as you can see, we were very happy at the end.

04:51.000 --> 04:56.000
And as background for that, we interviewed EleutherAI's

04:56.000 --> 05:02.000
Stella Biderman, who was leading the building of the open data sets at EleutherAI,

05:02.000 --> 05:07.000
and Pleias as well, the French organization, to understand what the challenges

05:07.000 --> 05:13.000
are and to have a common background for these discussions. Based on that, we created

05:13.000 --> 05:17.000
the research paper that you can access later on.

05:18.000 --> 05:22.000
It's made out of the community insights that we derived from that workshop,

05:22.000 --> 05:25.000
but later on also from asynchronous collaboration.

05:25.000 --> 05:29.000
And I'm going to give you a little overview of what's in that paper,

05:29.000 --> 05:32.000
of course, there's just not that much time and I'm scared already.

05:32.000 --> 05:42.000
So it's just a short overview. So first of all, that used to be a graph.

05:43.000 --> 05:51.000
So all right. All right, okay. Well, it appeared.

05:51.000 --> 05:57.000
So we tried to order the space for us: what do we mean when we talk about open data sets?

05:57.000 --> 06:02.000
There are these three tiers of openness in data sets for AI,

06:02.000 --> 06:06.000
starting with sufficient documentation, i.e. replicable data sets.

06:06.000 --> 06:10.000
That aligns with the Open Source Initiative's definition.

06:10.000 --> 06:16.000
All right. So you need to document the data sources and the processing steps so that somebody

06:16.000 --> 06:22.000
could replicate a substantially similar, equivalent data set based on that.

06:22.000 --> 06:25.000
These are data sets such as CR4 and C4.
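
A minimal sketch of what tier one implies: instead of shipping the data itself, you publish the sources plus the exact processing steps, so anyone can rebuild a substantially similar set. Every name, URL, and step below is hypothetical, purely for illustration.

```python
# Hypothetical "replication recipe": sources + ordered, deterministic steps.

def dedupe(docs):
    # Keep the first occurrence of each exact text.
    seen, out = set(), []
    for d in docs:
        if d not in seen:
            seen.add(d)
            out.append(d)
    return out

def min_length(docs, n=10):
    # Drop documents shorter than n characters.
    return [d for d in docs if len(d) >= n]

RECIPE = {
    "sources": ["https://example.org/crawl-2024-06"],  # made-up source URL
    "steps": [("dedupe", dedupe), ("min_length", min_length)],
}

def replicate(raw_docs, recipe=RECIPE):
    """Apply the documented steps in order; anyone with the sources and
    this recipe can rebuild an equivalent data set."""
    docs = list(raw_docs)
    for _name, step in recipe["steps"]:
        docs = step(docs)
    return docs
```

The point is that the recipe, not the corpus, is the released artifact.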

06:25.000 --> 06:29.000
The second tier is open access, so data availability.

06:29.000 --> 06:32.000
So the data set is out there for everyone to download,

06:32.000 --> 06:35.000
but that doesn't make any claim about the licensing of it.

06:36.000 --> 06:41.000
And the third one, what we call here fully open, is where all three elements are present,

06:41.000 --> 06:47.000
including the legal side of the usual open data definition that we know from the Open Knowledge

06:47.000 --> 06:51.000
Foundation, where you can reuse, share, and modify the data.

06:51.000 --> 06:55.000
And that pertains both to the licensing of the data set itself,

06:55.000 --> 07:02.000
as well as all the components that go into it — images, text, and so on.

07:02.000 --> 07:07.000
And then, of course, we didn't only talk about openness, because open data sets alone are not enough.

07:07.000 --> 07:14.000
There is also the aspect of what makes a data set fair, just, equitable, and ethical, and also

07:14.000 --> 07:20.000
compliant. And it is important to remember that these are different notions

07:20.000 --> 07:23.000
that are sometimes even in tension with each other.

07:23.000 --> 07:28.000
They intersect, but sometimes you have to make decisions as you build the data set.

07:28.000 --> 07:41.000
For example, offering an ongoing opt-out might go against wanting to have a stable version of the data set over time.

07:41.000 --> 07:47.000
So EleutherAI and Pleias told us a lot about what the challenges are right now

07:47.000 --> 07:54.000
if you try to put together an open data set that is just made out of openly licensed content and public domain content —

07:54.000 --> 07:59.000
and there are a lot of challenges, and they seem to grow every day.

07:59.000 --> 08:04.000
The first of them is that laws vary across jurisdictions and time.

08:04.000 --> 08:15.000
So if you try to assemble an open data set that will have global implications, you need to look at multiple jurisdictions and geographies,

08:15.000 --> 08:22.000
and they also change over time. That usually requires legal expertise from lawyers from across the globe,

08:22.000 --> 08:30.000
which is of course very expensive and time-intensive. There is also a very big challenge around the metadata being incomplete.

08:30.000 --> 08:37.000
What constitutes a work in copyright law doesn't always translate so neatly into the different components.

08:37.000 --> 08:45.000
For example, if you are crawling a website automatically, or Common Crawl crawls it, and you see that there is a CC license on the website,

08:46.000 --> 08:53.000
in an automated way it's not really possible to say whether that pertains to the whole website, to all the assets of the website, or just to an image or a text.

08:53.000 --> 08:58.000
So you can make many mistakes there and make yourself legally vulnerable.
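
To make the ambiguity concrete, here is a tiny sketch of what a crawler can actually see: it can spot a Creative Commons link in a page's HTML, but nothing in that signal says what the license covers. This is an illustration only, not a real detection pipeline.

```python
# Scan HTML for Creative Commons license links. Finding the URL is easy;
# knowing its scope (whole site? one image? one article?) is not encoded.
from html.parser import HTMLParser

class CCLicenseFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href", "")
        if "creativecommons.org/licenses/" in href:
            # We can record the license URL, but not what it applies to.
            self.licenses.append(href)

def find_cc_licenses(html: str):
    finder = CCLicenseFinder()
    finder.feed(html)
    return finder.licenses
```

Anything built on such a detector inherits that scope ambiguity, which is exactly the legal vulnerability described above.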

08:58.000 --> 09:05.000
The same problem exists with the public domain, where the status isn't always clear and you need to do a lot of manual digging,

09:05.000 --> 09:11.000
which I know Sebastian is really spending his days and months on.

09:11.000 --> 09:19.000
The other problem: just because a document is actually in the public domain doesn't mean that you can get a copy of it.

09:19.000 --> 09:27.000
A lot of material in cultural institutions and so on isn't digitized, or even if it is digitized — for example, the Google Books project —

09:27.000 --> 09:35.000
you can't always get full public access to it, because it requires different arrangements with Google and so on.

09:35.000 --> 09:43.000
The problem with PDFs, which I also learned about from Sebastian, is that extracting data from PDFs is extremely difficult.

09:43.000 --> 09:54.000
There are no real tools that help you do it in a scalable way, so it takes a lot of manual labor, which of course requires a lot of resources and people.
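
Real PDF extraction needs external tooling, but even after the raw text is out, cleanup remains. As one small illustration of the manual labor involved, here is a sketch of just one step: rejoining words hyphenated across line breaks and reflowing hard-wrapped lines into paragraphs.

```python
# Post-extraction cleanup sketch: blank lines separate paragraphs; a
# trailing hyphen means a word was split across a line break.
def reflow(raw: str) -> str:
    paragraphs, current = [], []
    for line in raw.splitlines():
        line = line.strip()
        if not line:                       # blank line ends a paragraph
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        if current and current[-1].endswith("-"):
            # Rejoin a split word: "extrac-" + "tion" -> "extraction"
            head, _, rest = line.partition(" ")
            current[-1] = current[-1][:-1] + head
            if rest:
                current.append(rest)
        else:
            current.append(line)
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```

Heuristics like this break on legitimate hyphens, tables, and multi-column layouts — which is why the work doesn't scale without better tools.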

09:54.000 --> 10:00.000
As you know, many open source projects are made out of goodwill and tears; they don't have a legal entity.

10:00.000 --> 10:05.000
Being exposed to legal risk requires a lawyer, a legal entity, and so on.

10:05.000 --> 10:15.000
That's another problem that arises in this AI context. And finally, there's something called the consent crisis right now, where people don't want

10:15.000 --> 10:26.000
— even for originally open data — the AI scrapers to scrape their data, because they are, of course, annoyed with data being pulled into the big data sets.

10:26.000 --> 10:38.000
And that is directed at the big companies, but at the same time — for example, with robots.txt and so on — it also blocks out researchers and independent developers.

10:38.000 --> 10:42.000
But of course, we don't want big tech to exploit the open data that is out there.

10:42.000 --> 10:52.000
That's the kind of problem that we might be able to solve with an AI commons that is set up in the right way.

10:52.000 --> 10:55.000
So we spent eight really intense hours together.

10:55.000 --> 10:58.000
I have to say the food wasn't really good.

10:58.000 --> 11:11.000
Unfortunately. But the people were amazing, and we had a lot of discussions around the pipeline of producing such open data sets, and a lot of exchange.

11:11.000 --> 11:21.000
And we came up with best practices that are really grounded in what people are actually doing — the different organizations building the different open data sets.

11:21.000 --> 11:28.000
I'm not going to go through all of them, because that would maybe be too much; I invite you to check out the paper in detail.

11:28.000 --> 11:40.000
But one thing that comes ahead of everything is encoding preferences in metadata. The problem of not being sure how something is licensed, or what the preference of the data owner is,

11:40.000 --> 11:47.000
means that, going forward, you can't build data sets in a very scalable way.
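
A sketch of what encoding preferences in per-item metadata could look like, so that assembly becomes a mechanical filter rather than case-by-case digging. The field names and values below are made up for illustration; no real metadata standard is implied.

```python
# Hypothetical per-item metadata: explicit license plus the owner's
# declared allowed uses. Unknown license means the item is unusable.
RECORDS = [
    {"id": "doc-1", "license": "CC-BY-4.0", "allowed_uses": {"search", "ai-training"}},
    {"id": "doc-2", "license": "CC-BY-4.0", "allowed_uses": {"search"}},
    {"id": "doc-3", "license": "unknown",   "allowed_uses": set()},
]

def usable_for(records, purpose):
    """Keep only items whose metadata clearly permits the given purpose."""
    return [r["id"] for r in records
            if r["license"] != "unknown" and purpose in r["allowed_uses"]]
```

With metadata like this, building a compliant data set is one pass over the records; without it, every item needs manual review.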

11:47.000 --> 11:56.000
Other than that, there's a lot around working with communities, around documentation, making the open data sets reproducible, and so on.

11:56.000 --> 11:59.000
There are also a lot of emerging examples.

11:59.000 --> 12:07.000
EleutherAI is building data sets, Pleias is building Common Corpus, and they are already training models on it.

12:07.000 --> 12:13.000
They also have a so-called toxicity classifier.

12:13.000 --> 12:18.000
So they open sourced the pipeline of how to identify toxic content.

12:18.000 --> 12:24.000
The same with Hugging Face and their FineWeb classifiers — also an open source pipeline.

12:24.000 --> 12:31.000
They open sourced the methodology of how to go through the data sets and remove harmful and toxic content,

12:31.000 --> 12:36.000
of course documenting how they defined harmful content.
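
A sketch in the spirit of those open pipelines: the point is that both the filter and its definition of "harmful" are published together. The tiny word-list scorer below is a stand-in for a real classifier; the term list and threshold are placeholder assumptions.

```python
# Published, inspectable definition of "harmful" (placeholder terms).
HARMFUL_TERMS = {"badword1", "badword2"}

def toxicity_score(text: str) -> float:
    """Fraction of tokens that match the published term list."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in HARMFUL_TERMS)
    return hits / len(tokens)

def filter_corpus(docs, threshold=0.1):
    """Drop documents whose score meets or exceeds the threshold.
    Both the scorer and the threshold are part of the documentation."""
    return [d for d in docs if toxicity_score(d) < threshold]
```

Because the scorer and threshold ship with the data set, anyone can audit or rerun the filtering, which is what distinguishes these releases from closed ones.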

12:36.000 --> 12:43.000
There are also experiments around data trusts, organizing governance around data for communities,

12:43.000 --> 12:47.000
So they have a say about what's happening with the data sets and so on.

12:47.000 --> 13:01.000
This is also an amazing organization working on data governance, letting people opt out of data sets and then creating an API for developers to run through the data set and remove that data from it.
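
The opt-out flow just described can be sketched roughly like this: people register content they want excluded, and developers query the registry to strip matching items from a data set. The hash-based matching and all the names here are illustrative assumptions, not any organization's actual API.

```python
# Minimal opt-out registry sketch: exact-match fingerprints of content.
import hashlib

def fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

class OptOutRegistry:
    def __init__(self):
        self._opted_out = set()

    def opt_out(self, text: str) -> None:
        """A rights holder registers content they want excluded."""
        self._opted_out.add(fingerprint(text))

    def apply(self, docs):
        """A data set builder removes every opted-out item."""
        return [d for d in docs if fingerprint(d) not in self._opted_out]
```

A real service would need fuzzy matching (crops, re-encodes, excerpts), which is a much harder problem than this exact-hash sketch suggests.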

13:01.000 --> 13:06.000
So, a lot of great examples — we have even more of them, all with links, in the paper.

13:06.000 --> 13:14.000
And finally, we also identified what needs to change in terms of policy and what needs to change in terms of tech investments.

13:14.000 --> 13:19.000
So that we can move forward as a field.

13:19.000 --> 13:23.000
And I identified here three main points, so to say.

13:23.000 --> 13:37.000
First, increasing open data availability: making it easier to reliably identify the status of public domain data, having maybe registries, working with cultural institutions,

13:37.000 --> 13:44.000
partnering with them to digitize data and establish the metadata for it.

13:44.000 --> 13:51.000
An AI data commons is something that's been discussed all the time — like, who can finally do it

13:51.000 --> 13:57.000
in a way that is reliable? And such concrete things as, as I mentioned, tools to extract

13:57.000 --> 14:01.000
openly licensed content from difficult formats like PDF.

14:01.000 --> 14:04.000
And of course, clarifying the legal status of the data.

14:04.000 --> 14:12.000
One proposition there was that policy makers could create a safe harbor, especially for smaller organizations such as EleutherAI,

14:12.000 --> 14:15.000
so that they can make some mistakes around licensing.

14:15.000 --> 14:18.000
But they don't need to fear that they will immediately be slapped with lawsuits.

14:18.000 --> 14:22.000
Of course, they don't have the resources to fight those lawsuits.

14:22.000 --> 14:31.000
So having a safe harbor for that would help a lot, as well as investing in global metadata standards to manage licensing and consent at scale.

14:31.000 --> 14:40.000
And here consent is meant as a nuanced consent that goes beyond the robots.txt crawl/no-crawl — like, for which purpose can this data actually be used.
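
To see the gap, compare what robots.txt can express with what nuanced consent would need. The robots.txt check below uses Python's standard parser; the "purposes" field layered on top is a hypothetical extension, not any existing standard.

```python
# robots.txt gives a binary fetch/no-fetch answer per path.
from urllib.robotparser import RobotFileParser

ROBOTS = ["User-agent: *", "Disallow: /private/"]

rp = RobotFileParser()
rp.parse(ROBOTS)

def may_fetch(path: str) -> bool:
    # Binary: crawl or don't. Says nothing about training vs. search.
    return rp.can_fetch("*", "https://example.org" + path)

# Hypothetical purpose signal layered on top (illustrative only):
PURPOSES = {"/blog/": {"search", "archival"}}   # note: no "ai-training"

def may_use(path: str, purpose: str) -> bool:
    """Nuanced consent: fetchable AND permitted for this purpose."""
    return may_fetch(path) and purpose in PURPOSES.get(path, set())
```

Under robots.txt alone, `/blog/` looks fully open; the purpose layer is what lets an owner say "index me, but don't train on me."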

14:40.000 --> 14:45.000
And finally, as always, the money, ensuring the sustainability of the ecosystem.

14:45.000 --> 14:52.000
So if we want the open datasets for AI to be out there, to be really meaningful, so that we can build open source AI on them,

14:52.000 --> 14:54.000
they need to be treated as public goods.

14:54.000 --> 15:02.000
And that means they also need to be financed as public goods with a long-term perspective of sustainable funding.

15:02.000 --> 15:06.000
And alternatively, also thinking about sustainable business models here.

15:06.000 --> 15:07.000
So what would that look like?

15:07.000 --> 15:12.000
That's always a problem, I guess, in the open source community because things are open to use.

15:12.000 --> 15:18.000
You can't really sell them, but there's Wikimedia experimenting with Wikimedia Enterprise, from what I know.

15:18.000 --> 15:33.000
And then there is also Spawning thinking about a freemium model, and thinking about how to give back to the communities that are giving out data and how to develop something that is more sustainable, so that we can all move forward.

15:33.000 --> 15:36.000
So that's very short.

15:36.000 --> 15:39.000
And thank you, this is the QR code.

15:39.000 --> 15:41.000
That would be too much to type.

15:41.000 --> 15:48.000
If you want to check out the research paper, or if you want to contact me — I'm actually leaving Mozilla,

15:48.000 --> 15:55.000
but I'm connected with the project and I can answer your questions or direct you to people at the company.

15:55.000 --> 15:56.000
Thank you very much.

15:56.000 --> 16:04.000
I made it.

16:04.000 --> 16:18.000
Do we have questions?

16:18.000 --> 16:25.000
How do you plan to handle the attribution requirement for open license content?

16:25.000 --> 16:35.000
Yeah, I have the microphone.

16:35.000 --> 16:36.000
Yeah, it's true.

16:36.000 --> 16:40.000
I've been thinking about it, maybe Sebastian also can give his opinion on that.

16:40.000 --> 16:54.000
But I've heard about people actually giving attribution en masse, listing everyone as part of the documentation,

16:54.000 --> 17:02.000
or linking to those kinds of repositories. But I don't know — I think it's one of the

17:02.000 --> 17:21.000
still maybe unresolved issues. But Sebastian, maybe you have an opinion on how to handle that at EleutherAI.

17:21.000 --> 17:24.000
It's one of the hardest problems, so let's talk about that.

17:24.000 --> 17:33.000
I think, you know, best practice right now when a data set is released is that attribution is effectively part of the Parquet file, when you have a data set or something.

17:33.000 --> 17:35.000
It's just another column, right?

17:35.000 --> 17:36.000
But is that enough?

17:36.000 --> 17:39.000
Is that sufficient? That's something the community really needs to talk about.

17:39.000 --> 17:46.000
But we really try hard to get attribution for every single item that is licensed under Creative Commons licenses.
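
The "attribution as just another column" idea can be sketched like this. In practice it would be a column in a Parquet file, as Sebastian says; plain dicts stand in for the table here so the sketch stays dependency-free, and all the row values are made up.

```python
# Each item carries its attribution alongside the text, like a table column.
ROWS = [
    {"text": "...", "source_url": "https://example.org/a",
     "license": "CC-BY-4.0", "author": "Ada"},
    {"text": "...", "source_url": "https://example.org/b",
     "license": "CC-BY-4.0", "author": "Grace"},
]

def attribution_manifest(rows):
    """Collect the per-item attribution into a deduplicated, sorted list,
    e.g. to render a human-readable credits file next to the data set."""
    return sorted({(r["author"], r["license"], r["source_url"]) for r in rows})
```

Whether a column plus a generated credits file satisfies CC attribution requirements is exactly the open question raised above.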

17:46.000 --> 17:48.000
Thank you.

17:48.000 --> 18:04.000
Do we have more questions?

18:04.000 --> 18:06.000
That was great, thank you very much.

18:06.000 --> 18:10.000
So quick question.

18:10.000 --> 18:19.000
Are you going to turn it — I've not read the report — into something machine readable, so that a machine could fully understand it?

18:19.000 --> 18:25.000
So that new providers, not the ones that just talked to you, can also implement it as well.

18:25.000 --> 18:27.000
Machine readable, you mean?

18:27.000 --> 18:34.000
So that everything that's in there — describing what the intentions are and stuff like that — is machine readable.

18:34.000 --> 18:36.000
Yeah, the metadata being machine readable.

18:36.000 --> 18:39.000
Yeah, so that a machine could also read it.

18:39.000 --> 18:47.000
Yeah, yeah — I think that's the whole crux of it: right now it requires manual labor, so the trick would be to make this machine readable.

18:47.000 --> 18:56.000
So you can pull the data that you need and be sure about the license, sure about the preference signals — so yes, it must be machine readable.

18:56.000 --> 19:02.000
Otherwise it's just going one by one and checking, which requires a lot of resources and time.

19:02.000 --> 19:10.000
And we'll never reach the same level as OpenAI and so on without that.

19:10.000 --> 19:13.000
Do we have more questions?

19:13.000 --> 19:15.000
If not, thank you very much.

19:15.000 --> 19:16.000
Thank you.

