WEBVTT

00:00.000 --> 00:13.000
Hello everyone, I would like to introduce Michał, Paweł and Igor to talk about Bielik AI, which is

00:13.000 --> 00:18.520
very close to my heart. As you can see, I have the badge here that I got from the guys.

00:18.520 --> 00:23.280
They are Polish, and they are talking about the Polish language model, how they trained

00:23.280 --> 00:28.000
it and made it possible to run on a Raspberry Pi.

00:28.000 --> 00:30.000
Isn't it?

00:30.000 --> 00:31.000
Cheers.

00:31.000 --> 00:33.000
Hi everyone.

00:33.000 --> 00:40.000
How did a small open data initiative become a national phenomenon in Poland?

00:40.000 --> 00:47.000
Well, basically by creating a domestic language model that's already being used by

00:47.000 --> 00:52.000
small and large institutions, enterprises and startups, all across Poland.

00:52.000 --> 00:56.000
So, my name is Michał; with me there are Paweł and Igor.

00:56.000 --> 01:03.000
We are a small part of the team, of the entire SpeakLeash community, which created Bielik AI,

01:03.000 --> 01:05.000
a Polish large language model.

01:05.000 --> 01:10.000
And today the guys, who are from the technical team, and I, who am more from the business part

01:10.000 --> 01:17.000
of this initiative, will tell you a bit about what's going on with Bielik.

01:17.000 --> 01:21.000
Thank you, Michał.

01:21.000 --> 01:24.000
So, let's begin.

01:24.000 --> 01:31.000
We will give you a little insight into what our definition of open source is,

01:31.000 --> 01:38.000
what we are creating, and also what open science is.

01:38.000 --> 01:44.000
And we will talk about data sets, training, and models, but training and models will be something

01:44.000 --> 01:47.000
we will tell you more about later.

01:47.000 --> 01:56.000
So, as you can see, there are only three of us, but we are a part of a bigger initiative, a bigger community.

01:56.000 --> 02:03.000
A few weeks ago, we reached 1,800 users, 1,800 members of our community

02:03.000 --> 02:12.000
at this particular moment. I remember when I joined this project, there were around 50 or 60 people

02:13.000 --> 02:16.000
collaborating on this project.

02:16.000 --> 02:22.000
But to reach this number, this open source community size,

02:22.000 --> 02:27.000
some action had to be taken to start it.

02:27.000 --> 02:31.000
So, there was a pretty interesting situation.

02:31.000 --> 02:40.000
After a big data science conference in Poland in 2022,

02:40.000 --> 02:46.000
our founder Sebastian Kondracki and our master of data sets, Adrian Gwoździej,

02:46.000 --> 02:55.000
were talking about the lack of a linguistic corpus consisting only of Polish text data sets.

02:55.000 --> 03:00.000
Of course, we are living in the big AI and LLM moment.

03:00.000 --> 03:08.000
So, we have to find good sources of such text data sets.

03:08.000 --> 03:18.000
And they managed to do it, to provide such high-quality, large language data sets.

03:18.000 --> 03:24.000
Of course, there were some critical voices saying that we wouldn't manage to do it.

03:24.000 --> 03:29.000
People were telling us that maybe 100 or 200 gigabytes of text data sets

03:29.000 --> 03:42.000
would be everything. But at this moment, we have reached the level of 2.8 terabytes of text data sets.

03:42.000 --> 03:55.000
It is open source, it is a source of text data sets in the Polish language.

03:55.000 --> 03:59.000
We are still providing more data to it.

03:59.000 --> 04:05.000
We are collecting data via web scraping, and also from some closed sources,

04:05.000 --> 04:09.000
as we are collaborating with other institutions.

04:09.000 --> 04:14.000
But it is for everyone. Everyone can join us to work on it together,

04:14.000 --> 04:19.000
and also everyone can download it by using our Python package,

04:19.000 --> 04:24.000
and it takes only three lines of code. Everything is also classified.
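For illustration, a minimal sketch of grabbing one data set with the community's `speakleash` Python package (installable via pip); the dataset name and accessor names below are assumptions based on the package's public examples, not a definitive recipe:

```python
# Hedged sketch: download and read one SpeakLeash dataset locally.
# Assumes `pip install speakleash`; the dataset name "plwiki" is illustrative.
import os
from speakleash import Speakleash

sl = Speakleash(os.path.join(os.getcwd(), "datasets"))  # directory the data gets replicated to
docs = sl.get("plwiki").data                            # downloads if missing, then yields documents
print(next(iter(docs))[:200])                           # peek at the first document
```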

04:24.000 --> 04:32.000
You will know what the data is, whether its quality category is low, medium, or high,

04:32.000 --> 04:38.000
whether it is medical or about cooking; everything is placed there.

04:38.000 --> 04:43.000
We have our dashboard. We are providing information about our data.

04:43.000 --> 04:51.000
We are providing metadata to inform you what you are dealing with.
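As a rough, hedged sketch of how that per-document metadata could be used to keep only higher-quality text; the `ext_data` accessor and the `quality` key are assumptions about the package's metadata layout:

```python
# Hedged sketch: filter documents by an assumed per-document quality label.
import os
from speakleash import Speakleash

sl = Speakleash(os.path.join(os.getcwd(), "datasets"))
high_quality = [
    text
    for text, meta in sl.get("plwiki").ext_data  # assumed (text, metadata) pairs
    if meta.get("quality") == "HIGH"             # assumed label values: HIGH / MEDIUM / LOW
]
print(f"kept {len(high_quality)} high-quality documents")
```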

04:51.000 --> 04:55.000
And everything is done by the community.

04:55.000 --> 05:01.000
And to reach such a number wasn't an easy task.

05:01.000 --> 05:09.000
But I also remember the time when we were about to reach a milestone of 1 terabyte of data.

05:09.000 --> 05:12.000
It was around December of 2023.

05:12.000 --> 05:15.000
And just before that,

05:15.000 --> 05:18.000
some folks reached out to us.

05:18.000 --> 05:23.000
They were from the Academic Computer Centre Cyfronet.

05:23.000 --> 05:29.000
They have been providing, since 1975, computing power

05:29.000 --> 05:35.000
and also IT infrastructure for science, for open projects.

05:35.000 --> 05:39.000
And they asked us, do you want to collaborate with us?

05:39.000 --> 05:43.000
You've got the data, you've got specialists, you're building a big community.

05:43.000 --> 05:47.000
We have the supercomputing power, so let's collaborate.

05:47.000 --> 05:52.000
It was the Christmas gift we were dreaming of,

05:52.000 --> 05:56.000
because we didn't have the computing power.

05:56.000 --> 06:00.000
We were gathering data, and that was our crème de la crème.

06:00.000 --> 06:05.000
But with the help of Cyfronet, we can now create large language models,

06:05.000 --> 06:08.000
which is our cherry on top.

06:08.000 --> 06:12.000
So we managed to work together with Cyfronet.

06:12.000 --> 06:20.000
And now we have become little Santa's helpers who are giving large language models to society,

06:20.000 --> 06:25.000
as open source, and it's always for free.

06:25.000 --> 06:28.000
And now I give the voice to Paweł.

06:29.000 --> 06:37.000
Well, it was in January last year that they started up the supercomputer.

06:37.000 --> 06:44.000
While they were warming up the GPUs,

06:44.000 --> 06:49.000
we had an opportunity to train the first version of the model;

06:49.000 --> 06:53.000
everything was kind of fuzzy at first, and we tried different models.

06:53.000 --> 06:58.000
We do special training and environment preparation.

06:58.000 --> 07:01.000
So the whole pipeline is kind of reproducible.

07:01.000 --> 07:02.000
We can rerun them.

07:02.000 --> 07:05.000
The data itself is available to all of you.

07:05.000 --> 07:08.000
You can pip install the package and download it.

07:08.000 --> 07:13.000
And it's ready for you after a couple of minutes, probably hours.

07:13.000 --> 07:18.000
And you can, of course, run the pipeline and try to train it on your own.

07:18.000 --> 07:20.000
Of course, it will take some days.

07:20.000 --> 07:27.000
And you probably don't have equipment like we had after the supercomputing centre reached out to us.

07:27.000 --> 07:30.000
Eventually, we ran a lot of experiments

07:30.000 --> 07:36.000
and tried to understand the issues with the new GPUs,

07:36.000 --> 07:42.000
because the HPC centre was using, as one of the first in the world,

07:42.000 --> 07:44.000
GH200 GPUs.

07:44.000 --> 07:48.000
So many of the libraries completely didn't compile at all.

07:48.000 --> 07:56.000
So we had to come up with fixes to the code ourselves, and a lot of work went into that.

07:56.000 --> 08:01.000
With model number two, we started to work on synthetic data,

08:01.000 --> 08:07.000
to get some experience and proper ideas around synthetic data,

08:07.000 --> 08:14.000
how to approach the parameters, how to get the heuristics around that topic,

08:14.000 --> 08:19.000
and how to cover most of the things that we see as valuable for the model.

08:19.000 --> 08:25.000
Because with the model itself, we try to cover different use cases, and we ask business people,

08:25.000 --> 08:32.000
education people, all kinds of grassroots movements, to give us instructions,

08:32.000 --> 08:37.000
to give us a clue about what they want to have within the model.

08:37.000 --> 08:40.000
And of course, there are users of that.

08:40.000 --> 08:48.000
Eventually, we also started to define what the second stage is,

08:48.000 --> 08:53.000
fine-tuning and DPO, and a third stage of the training itself.

08:53.000 --> 08:56.000
So we are doing pre-training, fine tuning, and DPO,

08:56.000 --> 09:02.000
and eventually we get an instruction model, with alignment as well.
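For readers who want a feel for what the DPO stage looks like in practice, here is a rough sketch using Hugging Face TRL's `DPOTrainer`; this is not the team's actual pipeline, and the model id, dataset file, and hyperparameters are placeholders:

```python
# Hedged sketch of a DPO (preference-tuning) stage with TRL; not the project's own code.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "speakleash/Bielik-7B-Instruct-v0.1"   # assumed Hugging Face repo id
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference data with "prompt", "chosen", "rejected" columns (placeholder file).
train_dataset = load_dataset("json", data_files="preferences.jsonl", split="train")

args = DPOConfig(output_dir="bielik-dpo", beta=0.1, per_device_train_batch_size=1)
trainer = DPOTrainer(model=model, args=args, train_dataset=train_dataset,
                     processing_class=tokenizer)  # older trl versions use tokenizer= instead
trainer.train()
```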

09:02.000 --> 09:06.000
For us right now, evaluation is also a key feature.

09:06.000 --> 09:10.000
So to ramp up, we had the first model, a 1B one, trained on a single node,

09:10.000 --> 09:15.000
then version number one, with a proper paper,

09:15.000 --> 09:20.000
you can go to arXiv and see some information about it.

09:20.000 --> 09:22.000
It was based, of course, on Mistral 7B,

09:22.000 --> 09:26.000
but we continued pre-training it for like a number of days,

09:26.000 --> 09:30.000
and then came model number two.

09:31.000 --> 09:37.000
It was, in fact, trained for more days on a larger amount of data,

09:37.000 --> 09:44.000
around 200 billion documents that we kind of filter out.

09:44.000 --> 09:51.000
And documents and tokens, right.

09:51.000 --> 09:54.000
So, as I said, about evaluation:

09:54.000 --> 09:58.000
we prepared our own MT-Bench, and it's not translated.

09:58.000 --> 10:02.000
We discourage you from just doing translation with Google; you have to localize it.

10:02.000 --> 10:05.000
Because, you know, there are questions asking:

10:05.000 --> 10:08.000
where are you going for holidays?

10:08.000 --> 10:09.000
Hawaii?

10:09.000 --> 10:11.000
Nobody in Poland would go to Hawaii.

10:11.000 --> 10:12.000
Sorry.

10:12.000 --> 10:14.000
It would be too costly and take too much time; of course,

10:14.000 --> 10:16.000
maybe some people do.

10:16.000 --> 10:18.000
But they go to the seaside instead.

10:18.000 --> 10:20.000
It's to some Kalushki or whatever.

10:20.000 --> 10:23.000
And so we prepared our own kind of,

10:23.000 --> 10:28.000
those specific parts in conjunction with what we do

10:28.000 --> 10:30.000
about Polish culture.

10:30.000 --> 10:34.000
And of course, for that, we opened a Polish LLM leaderboard,

10:34.000 --> 10:37.000
where we evaluate all possible models.

10:37.000 --> 10:40.000
Within one day of a model being published, we rerun it,

10:40.000 --> 10:45.000
and eventually we are having very good results with

10:45.000 --> 10:47.000
model number V2.
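For illustration only, one generic way to re-run an evaluation when a new model appears, using EleutherAI's lm-evaluation-harness Python API; this is not necessarily the leaderboard's own setup, and the task name is a placeholder:

```python
# Hedged sketch: evaluate a freshly published model with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=speakleash/Bielik-7B-Instruct-v0.1",  # assumed repo id
    tasks=["belebele_pol_Latn"],  # placeholder Polish task; the leaderboard defines its own set
    num_fewshot=0,
)
print(results["results"])
```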

10:47.000 --> 10:51.000
We do also have scripts and preparation for all the formats

10:51.000 --> 10:52.000
that you want to have.

10:52.000 --> 10:53.000
We are on Ollama.

10:53.000 --> 10:56.000
We are on Hugging Face.

10:56.000 --> 10:57.000
So you can approach it.

10:57.000 --> 10:59.000
We try to be open.

10:59.000 --> 11:02.000
So whatever code we are talking about,

11:02.000 --> 11:04.000
it is all on GitHub itself.

11:04.000 --> 11:05.000
And you can help with it.
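A minimal sketch of pulling one of the published instruct models from Hugging Face with `transformers`; the repository id is an assumption based on the speakleash organisation:

```python
# Hedged sketch: chat with a Bielik instruct model via the transformers pipeline.
import torch
from transformers import pipeline

chat = pipeline(
    "text-generation",
    model="speakleash/Bielik-11B-v2.3-Instruct",  # assumed repo id
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
messages = [{"role": "user", "content": "Opowiedz w jednym zdaniu, czym jest Bielik."}]
out = chat(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])  # last message is the model's reply
```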

11:05.000 --> 11:10.000
Those are the results of those models that are quantized.

11:10.000 --> 11:12.000
They're pretty good.

11:12.000 --> 11:16.000
I must say that the model with 8-bit

11:17.000 --> 11:23.000
quantization scored even higher.

11:23.000 --> 11:26.000
So it's better than the model without quantization,

11:26.000 --> 11:29.000
which surprises us a little bit.
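And a hedged sketch of running a Q8 quantized GGUF build locally with `llama-cpp-python`, which is also the kind of setup that fits on small devices; the file name is an assumption, so use whichever GGUF file the project actually publishes:

```python
# Hedged sketch: run an 8-bit quantized GGUF build with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="Bielik-11B-v2.3-Instruct.Q8_0.gguf",  # assumed local file name
    n_ctx=4096,
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Napisz jedno zdanie o Krakowie."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```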

11:29.000 --> 11:32.000
Right now we are still in progress,

11:32.000 --> 11:36.000
but we are training an even smaller model.

11:36.000 --> 11:38.000
It distills, in some form,

11:38.000 --> 11:41.000
the knowledge from bigger models,

11:41.000 --> 11:44.000
and it also has our own tokenizer

11:44.000 --> 11:48.000
that covers some specifics of the Polish language.

11:48.000 --> 11:53.000
And we have already tried it out on different devices.

11:53.000 --> 11:55.000
So this model runs on Raspberry Pi,

11:55.000 --> 11:58.000
runs on Android devices,

11:58.000 --> 12:00.000
and some other set-top boxes.

12:00.000 --> 12:05.000
There are a number of getting-started tutorials.

12:05.000 --> 12:08.000
So everyone, let's say this guy,

12:08.000 --> 12:11.000
an administrator of some school network,

12:11.000 --> 12:13.000
can go there to see how to install it,

12:13.000 --> 12:14.000
how to use it,

12:14.000 --> 12:18.000
People are doing workshops at schools and universities

12:18.000 --> 12:20.000
to set up the rack,

12:20.000 --> 12:21.000
to set up, of course,

12:21.000 --> 12:25.000
we also show some use cases for local communities,

12:25.000 --> 12:27.000
or companies,

12:27.000 --> 12:30.000
but we try to give them the instruments,

12:30.000 --> 12:33.000
and then they need to do their own work

12:33.000 --> 12:35.000
to get to know what is inside,

12:35.000 --> 12:36.000
what is AI,

12:36.000 --> 12:38.000
how to use the models,

12:38.000 --> 12:39.000
how to run the models,

12:39.000 --> 12:42.000
and how to evaluate them.

12:42.000 --> 12:48.000
We really want to say thank you to all the guys

12:48.000 --> 12:49.000
who are here;

12:49.000 --> 12:51.000
this is only part of the core team,

12:51.000 --> 12:54.000
and also special thanks to ACK Cyfronet AGH

12:54.000 --> 12:55.000
in Krakow,

12:55.000 --> 12:57.000
the supercomputing

12:57.000 --> 12:59.000
center in Poland.

12:59.000 --> 13:01.000
So thank you for listening,

13:01.000 --> 13:03.000
and if you have questions,

13:03.000 --> 13:04.000
then go ahead.

13:08.000 --> 13:11.000
Thank you.

