WEBVTT

00:00.000 --> 00:11.000
So for the first pitch of the search devroom, we will talk about language support in search engines,

00:11.000 --> 00:15.000
and we will take the example of Meilisearch, because I work at Meilisearch.

00:15.000 --> 00:18.000
So first of all, what is Meilisearch?

00:18.000 --> 00:21.000
Meilisearch is a search engine, an open source search engine.

00:21.000 --> 00:26.000
Our goal is to provide a fast and easy-to-use search engine.

00:27.000 --> 00:30.000
So those are our two mottos.

00:30.000 --> 00:35.000
The QR code here leads to the official website.

00:35.000 --> 00:41.000
A bit more about me: I have five years at Meilisearch

00:41.000 --> 00:44.000
as an engineer on the search engine side,

00:44.000 --> 00:49.000
so working on how we store the data and how we search in this data.

00:49.000 --> 00:53.000
And I work more specifically on the language support

00:53.000 --> 00:57.000
with the goal of supporting as many languages as possible.

00:57.000 --> 01:02.000
So what is the subject, the agenda?

01:02.000 --> 01:07.000
First, we will talk about the challenges of language support:

01:07.000 --> 01:13.000
what kind of problems we can encounter during our journey of supporting many languages.

01:13.000 --> 01:20.000
Then our choices: how we addressed, in Meilisearch,

01:20.000 --> 01:22.000
those challenges.

01:22.000 --> 01:26.000
And yes, the third one, sorry, is the conclusion,

01:26.000 --> 01:30.000
but I should have changed that.

01:30.000 --> 01:41.000
But first of all, I would like to give a small reminder of how language support works in a search engine,

01:41.000 --> 01:44.000
and, more precisely, why it's important.

01:44.000 --> 01:47.000
And here is a small example of a tiny search engine.

01:47.000 --> 01:53.000
Basically, what is a search engine? It is something where you put some documents

01:53.000 --> 01:57.000
and it will extract words using a tokenizer,

01:57.000 --> 02:00.000
which is the main part of our subject.

02:00.000 --> 02:05.000
And then on the search side, you will query using other words.

02:05.000 --> 02:11.000
You will extract words and try to retrieve these words in the documents

02:11.000 --> 02:13.000
contained in the database.

02:13.000 --> 02:19.000
And basically, what is the most important thing when we are talking about language support

02:19.000 --> 02:22.000
is to be able to extract these words,

02:22.000 --> 02:25.000
and manage them in order, for example,

02:25.000 --> 02:29.000
so that when you search "Hello" with a capital H,

02:29.000 --> 02:32.000
you are able to retrieve "Hello" with the capital H,

02:32.000 --> 02:36.000
but "hello" without the capital H as well.

02:36.000 --> 02:42.000
So in this kind of process, the two main parts of the tokenization

02:42.000 --> 02:44.000
are: one, the segmentation,

02:44.000 --> 02:50.000
where you have to take a text and split this text into several words,

02:50.000 --> 02:51.000
extracting the words.

02:51.000 --> 02:54.000
And then the second part is the normalization.

02:54.000 --> 02:56.000
The normalization is where you unify,

02:56.000 --> 03:02.000
words of the same kind together in order to be able to retrieve them

03:02.000 --> 03:05.000
with one specific writing or another.

03:05.000 --> 03:09.000
So for example, capital H or lowercase.
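
NOTE
A minimal sketch in Rust of the two-phase pipeline described above (toy code,
not Meilisearch's actual implementation): segmentation splits the text into
words, then normalization unifies variants such as "Hello" and "hello".
  /// Toy tokenizer: segment on whitespace, then normalize by lowercasing.
  fn tokenize(text: &str) -> Vec<String> {
      text.split_whitespace()              // 1. segmentation (English-centric)
          .map(|word| word.to_lowercase()) // 2. normalization
          .collect()
  }
  fn main() {
      // "Hello" and "hello" now produce the same token, so a query for one
      // retrieves documents containing the other.
      assert_eq!(tokenize("Hello world"), vec!["hello", "world"]);
  }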

03:10.000 --> 03:14.000
Now that we know how a basic search engine works,

03:14.000 --> 03:19.000
we can speak about the challenges of language support.

03:19.000 --> 03:23.000
The first challenge we have is the diversity.

03:23.000 --> 03:25.000
There are a lot of languages.

03:25.000 --> 03:29.000
Even if we have like 50% of the web that is written in English,

03:29.000 --> 03:34.000
we have the other half of the web that is written in other languages.

03:34.000 --> 03:42.000
And most of the other half is written in 15 other languages.

03:42.000 --> 03:47.000
But if we count all the possible languages we have in the world,

03:47.000 --> 03:51.000
we can count around 6,000 languages in the world.

03:51.000 --> 03:54.000
Obviously, they are not all on the internet,

03:54.000 --> 03:57.000
but maybe in the future we will have a lot more languages,

03:57.000 --> 04:01.000
and the ratio between English and the other languages

04:01.000 --> 04:05.000
will decrease over the years.

04:05.000 --> 04:08.000
And so when you start to have a community,

04:08.000 --> 04:11.000
where you have some users that are not English speakers

04:11.000 --> 04:15.000
and want to use your search engine in another language than English,

04:15.000 --> 04:18.000
you receive some feedback.

04:18.000 --> 04:22.000
And the feedback can be really diverse,

04:22.000 --> 04:24.000
obviously,

04:24.000 --> 04:27.000
because those are different languages.

04:27.000 --> 04:30.000
Different languages means different specific features.

04:30.000 --> 04:34.000
sometimes small things to support.

04:34.000 --> 04:37.000
So for example, you have some languages

04:37.000 --> 04:39.000
where the segmentation is wrong,

04:39.000 --> 04:42.000
because just splitting on whitespace works well in English

04:42.000 --> 04:44.000
but maybe not in other languages.

04:44.000 --> 04:48.000
And sometimes it's on the normalization side,

04:48.000 --> 04:52.000
where it's more difficult than just lowercasing characters

04:52.000 --> 04:55.000
to be able to unify the same kinds of forms.

04:55.000 --> 04:59.000
We will not dig deeper into each case,

04:59.000 --> 05:02.000
but I suggest you read them afterwards.

05:02.000 --> 05:05.000
It's really funny what you can encounter

05:05.000 --> 05:07.000
in terms of diversity of problems.

05:07.000 --> 05:09.000
So here, the issue is that,

05:09.000 --> 05:12.000
one, the problems are really diverse,

05:12.000 --> 05:16.000
and you must have the knowledge of every language,

05:16.000 --> 05:19.000
every specificity, to understand what you have to do

05:19.000 --> 05:23.000
to support each language, which is basically not possible.

05:23.000 --> 05:28.000
At Meilisearch, we are 25 people working on Meilisearch.

05:28.000 --> 05:33.000
Well, 25 people are not speaking 6,000 languages,

05:33.000 --> 05:36.000
obviously.

05:36.000 --> 05:39.000
So let's dig deeper in a specific case.

05:39.000 --> 05:43.000
So for example, Japanese segmentation:

05:43.000 --> 05:46.000
as you may know, Japanese is not space-separated.

05:46.000 --> 05:50.000
A full sentence is written as a continuous run of characters

05:50.000 --> 05:52.000
without any space inside.

05:53.000 --> 05:58.000
And the funny example here comes from a presentation

05:58.000 --> 06:02.000
by Minoru Osuka, who is a Japanese developer;

06:02.000 --> 06:04.000
he made a full presentation.

06:04.000 --> 06:07.000
So if you want to know more about how to support Japanese,

06:07.000 --> 06:09.000
the normalization, the segmentation, et cetera,

06:09.000 --> 06:13.000
it's a really good presentation; there is the QR code.

06:13.000 --> 06:18.000
But one of the slides talks about the segmentation

06:18.000 --> 06:22.000
and how difficult it can be to segment words in Japanese.

06:22.000 --> 06:25.000
And for example, for the same sentence there,

06:25.000 --> 06:29.000
we have two possible segmentations

06:29.000 --> 06:33.000
that mean different things depending on how

06:33.000 --> 06:35.000
you segment the sentence.

06:35.000 --> 06:39.000
And so that's how difficult it can be to segment

06:39.000 --> 06:43.000
the text of one specific language or another.

06:43.000 --> 06:46.000
And as you may know, we have the same kind of problem

06:46.000 --> 06:49.000
on the normalization side.

06:49.000 --> 06:54.000
The second issue we could have is the language detection.

06:54.000 --> 06:57.000
The language detection, unfortunately,

06:57.000 --> 07:01.000
is kind of an approximate detection.

07:01.000 --> 07:04.000
And even if it kind of works well

07:04.000 --> 07:07.000
when you have a set of documents like thousands of documents,

07:07.000 --> 07:09.000
it's easy to detect the language over

07:09.000 --> 07:11.000
these thousands of documents.

07:11.000 --> 07:13.000
On the other side, on the search side,

07:13.000 --> 07:16.000
because you have maybe one or two words,

07:16.000 --> 07:19.000
it can be really difficult and really

07:19.000 --> 07:22.000
approximate to detect the right language.

07:22.000 --> 07:26.000
And for example, if you are on maybe a Korean dataset,

07:26.000 --> 07:28.000
it's really easy to detect that it's Korean,

07:28.000 --> 07:31.000
because the Korean script is only used for Korean.

07:31.000 --> 07:35.000
But if you are on a Latin-script dataset,

07:35.000 --> 07:38.000
it's really hard to make the difference

07:38.000 --> 07:41.000
between Italian, French, English, et cetera,

07:41.000 --> 07:47.000
and you have to use approximate algorithms.

07:47.000 --> 07:49.000
And as you can see in the example,

07:49.000 --> 07:52.000
we have to wait until the 7th, yes,

07:52.000 --> 07:55.000
the 7th character to understand that it's not Chinese,

07:55.000 --> 07:57.000
but Japanese.

07:57.000 --> 08:00.000
So for queries, it can be really difficult

08:00.000 --> 08:01.000
to detect the right language.

08:01.000 --> 08:03.000
And if we detect the language wrongly,

08:03.000 --> 08:05.000
maybe we tokenize

08:05.000 --> 08:07.000
the query badly,

08:07.000 --> 08:09.000
and we don't retrieve the documents.

08:09.000 --> 08:13.000
So it really impacts the relevancy.

08:13.000 --> 08:16.000
And the last thing I want to talk about

08:16.000 --> 08:19.000
is about the question itself.

08:19.000 --> 08:21.000
Do you support this language?

08:21.000 --> 08:23.000
Is it easy to answer this question?

08:23.000 --> 08:25.000
Not really.

08:25.000 --> 08:28.000
It's not really a yes/no question.

08:28.000 --> 08:30.000
Because you could say: yes,

08:30.000 --> 08:33.000
I do support English, for example.

08:33.000 --> 08:36.000
But yes, obviously I support English,

08:36.000 --> 08:38.000
I split on whitespace

08:38.000 --> 08:40.000
and I lowercase every word.

08:40.000 --> 08:42.000
So what could go wrong?

08:42.000 --> 08:44.000
And just in this example,

08:44.000 --> 08:46.000
the user comes back and says:

08:46.000 --> 08:48.000
okay, but why, when I search,

08:48.000 --> 08:52.000
"can't", I cannot find any document containing "cannot"?

08:52.000 --> 08:55.000
They are kind of equivalent,

08:55.000 --> 08:57.000
and well,

08:57.000 --> 09:00.000
it's a bit more complicated than yes

09:00.000 --> 09:02.000
or no, I support this language.

09:02.000 --> 09:04.000
So, about the Meilisearch approach,

09:04.000 --> 09:06.000
what we did, I just want to check...

09:06.000 --> 09:08.000
yeah, okay, perfect.

09:08.000 --> 09:11.000
The first thing we did is

09:11.000 --> 09:13.000
relying on open source.

09:13.000 --> 09:15.000
So the first thing we did is

09:15.000 --> 09:16.000
extracting the tokenizer,

09:16.000 --> 09:18.000
the Meilisearch tokenizer,

09:18.000 --> 09:20.000
into a dedicated repo.

09:20.000 --> 09:22.000
And,

09:22.000 --> 09:25.000
and we worked on

09:25.000 --> 09:27.000
the ease of contribution

09:27.000 --> 09:29.000
to this repository.

09:29.000 --> 09:30.000
So we extracted the code;

09:30.000 --> 09:33.000
this way, the code base is way smaller

09:33.000 --> 09:35.000
than the full, the whole search engine.
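
NOTE
The extracted tokenizer is the Charabia crate. A rough usage sketch in Rust,
based on the crate's README (hedged: the exact API may differ between versions):
  use charabia::Tokenize;
  fn main() {
      let orig = "The quick brown fox can't jump 32.3 feet, right?";
      // `tokenize()` runs segmentation and normalization, yielding tokens.
      for token in orig.tokenize() {
          println!("{:?}", token.lemma()); // the normalized form of each word
      }
  }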

09:35.000 --> 09:37.000
And the second thing is,

09:37.000 --> 09:40.000
we focused on what kind of actions

09:40.000 --> 09:42.000
the contributor has to do

09:42.000 --> 09:45.000
to add a segmenter,

09:45.000 --> 09:47.000
add a normalizer,

09:47.000 --> 09:48.000
and, in the end, enhance

09:48.000 --> 09:50.000
the language support.

09:50.000 --> 09:52.000
And there are two main things

09:52.000 --> 09:53.000
for us.

09:53.000 --> 09:55.000
The first one is the segmenter:

09:55.000 --> 09:57.000
how do we segment the text,

09:57.000 --> 10:01.000
So text into a set of words.

10:01.000 --> 10:03.000
What the contributor has to do,

10:03.000 --> 10:04.000
has to know,

10:04.000 --> 10:05.000
is only how

10:05.000 --> 10:07.000
they should segment

10:07.000 --> 10:08.000
a Japanese text

10:08.000 --> 10:10.000
or an English text,

10:10.000 --> 10:11.000
into a list of words.

10:11.000 --> 10:13.000
That's all you have to know

10:13.000 --> 10:15.000
when you contribute to Charabia,

10:15.000 --> 10:17.000
the tokenizer of Meilisearch.

10:17.000 --> 10:19.000
And that's it,

10:19.000 --> 10:21.000
and you don't need to know

10:21.000 --> 10:22.000
how

10:22.000 --> 10:25.000
it will be integrated in Meilisearch

10:25.000 --> 10:28.000
or in the tokenizer pipeline.

10:28.000 --> 10:30.000
So just one function to implement:

10:31.000 --> 10:33.000
text into a set of words.
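
NOTE
A sketch of that single function, modeled on Charabia's `Segmenter` trait
(the exact trait path and signature are assumptions; check the repo):
  use charabia::Segmenter;
  struct MySegmenter;
  impl Segmenter for MySegmenter {
      // The only job: split a borrowed string into word slices.
      fn segment_str<'o>(&self, to_segment: &'o str) -> Box<dyn Iterator<Item = &'o str> + 'o> {
          // Toy rule: split on whitespace. A real segmenter for Japanese
          // would use a dictionary or a statistical model instead.
          Box::new(to_segment.split_whitespace())
      }
  }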

10:33.000 --> 10:34.000
And then,

10:34.000 --> 10:35.000
for the normalizer,

10:35.000 --> 10:37.000
it's basically the same.

10:37.000 --> 10:39.000
We took the same approach,

10:39.000 --> 10:41.000
but by character,

10:41.000 --> 10:43.000
because most of the normalization processes

10:43.000 --> 10:45.000
rely on characters,

10:45.000 --> 10:49.000
more than on complex algorithms.

10:49.000 --> 10:50.000
Sometimes, yes,

10:50.000 --> 10:51.000
so we have some,

10:51.000 --> 10:54.000
some alternative ways of

10:54.000 --> 10:57.000
coding a normalizer

10:57.000 --> 10:59.000
to be able to be a bit clever.

11:00.000 --> 11:02.000
But most of the time,

11:02.000 --> 11:03.000
the user says,

11:03.000 --> 11:04.000
okay,

11:04.000 --> 11:06.000
this character should be normalized

11:06.000 --> 11:08.000
like this in my language,

11:08.000 --> 11:10.000
so that's all.

11:10.000 --> 11:12.000
And so we provide the same kind of function.

11:12.000 --> 11:15.000
We give the input character,

11:15.000 --> 11:16.000
and we say,

11:16.000 --> 11:17.000
okay,

11:17.000 --> 11:18.000
how

11:18.000 --> 11:20.000
should this character be normalized,

11:20.000 --> 11:22.000
and that's all.
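
NOTE
The character-level counterpart, sketched after Charabia's `CharNormalizer`
trait (names and signatures hedged; see the repo for the current interface):
  use charabia::normalizer::{CharNormalizer, CharOrStr};
  use charabia::Token;
  struct MyNormalizer;
  impl CharNormalizer for MyNormalizer {
      // Answer one question: how should this character be normalized?
      fn normalize_char(&self, c: char) -> Option<CharOrStr> {
          match c {
              'é' | 'è' | 'ê' => Some('e'.into()), // unify accented variants
              _ => Some(c.into()),                 // leave everything else as-is
          }
      }
      // Skip tokens that cannot contain the characters we care about.
      fn should_normalize(&self, token: &Token) -> bool {
          token.lemma().chars().any(|c| matches!(c, 'é' | 'è' | 'ê'))
      }
  }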

11:22.000 --> 11:23.000
Then,

11:23.000 --> 11:27.000
for people that are not developers,

11:27.000 --> 11:29.000
because one of the issues,

11:29.000 --> 11:30.000
as I said,

11:30.000 --> 11:31.000
in the diversity,

11:31.000 --> 11:32.000
is that we,

11:32.000 --> 11:34.000
you can't know everything about every language,

11:34.000 --> 11:35.000
and sometimes,

11:35.000 --> 11:38.000
people that have the knowledge of the language

11:38.000 --> 11:40.000
are not developers,

11:40.000 --> 11:43.000
maybe they are linguists, or just native speakers

11:43.000 --> 11:45.000
that know a bit,

11:45.000 --> 11:46.000
about how

11:46.000 --> 11:48.000
the language works.

11:48.000 --> 11:49.000
And so,

11:49.000 --> 11:50.000
the idea was to create,

11:50.000 --> 11:53.000
open discussions on GitHub,

11:53.000 --> 11:55.000
and in this discussion,

11:56.000 --> 11:57.000
what we had,

11:57.000 --> 11:59.000
we had some,

11:59.000 --> 12:01.000
some users,

12:01.000 --> 12:03.000
or some native speakers,

12:03.000 --> 12:05.000
that were speaking about their language,

12:05.000 --> 12:06.000
how

12:06.000 --> 12:07.000
the language works,

12:07.000 --> 12:08.000
etc.

12:08.000 --> 12:09.000
And we had some contributors

12:09.000 --> 12:10.000
coming into these discussions

12:10.000 --> 12:12.000
that didn't know anything about the language,

12:12.000 --> 12:15.000
but that were available to develop

12:15.000 --> 12:16.000
or code,

12:16.000 --> 12:18.000
or implement a new segmenter,

12:18.000 --> 12:19.000
or a new normalizer.

12:19.000 --> 12:21.000
And in this discussion,

12:21.000 --> 12:23.000
we kind of

12:23.000 --> 12:24.000
had a

12:24.000 --> 12:26.000
co-creation space,

12:26.000 --> 12:28.000
where we discussed,

12:28.000 --> 12:31.000
the few improvements we should do

12:31.000 --> 12:33.000
to enhance the language support.

12:33.000 --> 12:37.000
A funny thing in this kind of discussion

12:37.000 --> 12:38.000
is that sometimes,

12:38.000 --> 12:40.000
native speakers

12:40.000 --> 12:42.000
are speaking together,

12:42.000 --> 12:43.000
in their native language.

12:43.000 --> 12:44.000
For example,

12:44.000 --> 12:46.000
in the Japanese language support discussion,

12:46.000 --> 12:48.000
we have a whole discussion in Japanese.

12:48.000 --> 12:50.000
I don't understand anything

12:50.000 --> 12:53.000
in this discussion,

12:53.000 --> 12:54.000
at the end of this discussion,

12:54.000 --> 12:55.000
sorry,

12:55.000 --> 12:56.000
but at the end,

12:56.000 --> 12:59.000
we managed to enhance the Japanese,

12:59.000 --> 13:00.000
Japanese tokenizer.

13:00.000 --> 13:02.000
So, it's a win for me.

13:02.000 --> 13:05.000
Yes,

13:05.000 --> 13:06.000
and another,

13:06.000 --> 13:07.000
fun fact about Charabia,

13:07.000 --> 13:10.000
is that some commits are in Korean,

13:10.000 --> 13:11.000
so.

13:15.000 --> 13:16.000
And then,

13:16.000 --> 13:17.000
our last subject,

13:17.000 --> 13:19.000
so our last subject is the language support.

13:20.000 --> 13:22.000
Sorry, the language detection.

13:22.000 --> 13:23.000
For us,

13:23.000 --> 13:24.000
unfortunately,

13:24.000 --> 13:26.000
the language detection is still an open issue.

13:26.000 --> 13:28.000
The fact is that it's an inexact,

13:28.000 --> 13:31.000
approximate way of,

13:31.000 --> 13:32.000
guessing,

13:32.000 --> 13:33.000
it's a guess,

13:33.000 --> 13:34.000
on the language,

13:34.000 --> 13:35.000
and because,

13:35.000 --> 13:37.000
if we are wrong,

13:37.000 --> 13:39.000
we choose the wrong tokenizer,

13:39.000 --> 13:41.000
and we don't retrieve anything

13:41.000 --> 13:43.000
with our,

13:43.000 --> 13:44.000
our tokenization chain.

13:44.000 --> 13:46.000
We tried several approaches.

13:46.000 --> 13:48.000
So, the first approach we tried,

13:48.000 --> 13:49.000
is like,

13:49.000 --> 13:50.000
okay,

13:50.000 --> 13:51.000
because we know the documents,

13:51.000 --> 13:52.000
what we could do is,

13:52.000 --> 13:53.000
kind of,

13:53.000 --> 13:54.000
guessing,

13:54.000 --> 13:55.000
a list,

13:55.000 --> 13:57.000
a subset of languages that can be detected,

13:57.000 --> 14:00.000
detected for this database,

14:00.000 --> 14:01.000
because we know that,

14:01.000 --> 14:03.000
the documents are only in Japanese,

14:03.000 --> 14:04.000
and in Chinese,

14:04.000 --> 14:05.000
for example,

14:05.000 --> 14:06.000
and that's all.

14:06.000 --> 14:07.000
And then,

14:07.000 --> 14:08.000
at query time,

14:08.000 --> 14:10.000
we just have to detect between these two languages.
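
NOTE
A sketch of this first strategy with the whatlang crate, which Charabia builds
on (treat the exact API as an assumption):
  use whatlang::{Detector, Lang};
  fn main() {
      // We know this dataset only contains Japanese and Chinese documents,
      // so restrict the detector to that subset instead of all languages.
      let detector = Detector::with_allowlist(vec![Lang::Jpn, Lang::Cmn]);
      if let Some(lang) = detector.detect_lang("東京に住んでいます") {
          println!("query language guessed as {lang:?}");
      }
  }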

14:10.000 --> 14:11.000
But even

14:11.000 --> 14:13.000
with this strategy,

14:13.000 --> 14:14.000
on a monolingual

14:15.000 --> 14:16.000
database,

14:16.000 --> 14:17.000
only English,

14:17.000 --> 14:18.000
or only Chinese,

14:18.000 --> 14:19.000
it works well.

14:19.000 --> 14:20.000
But,

14:20.000 --> 14:21.000
if it was,

14:21.000 --> 14:23.000
a configuration,

14:23.000 --> 14:24.000
a user configuration,

14:24.000 --> 14:25.000
it would work well,

14:25.000 --> 14:26.000
as well.

14:26.000 --> 14:28.000
But on a multilingual database,

14:28.000 --> 14:29.000
so, for example,

14:29.000 --> 14:30.000
if you mix Japanese and Chinese,

14:30.000 --> 14:32.000
we had the same issue.

14:32.000 --> 14:33.000
So sometimes,

14:33.000 --> 14:36.000
at query time,

14:36.000 --> 14:38.000
we misdetect Japanese as Chinese,

14:38.000 --> 14:41.000
and the inverse.

14:41.000 --> 14:42.000
So, at the end,

14:42.000 --> 14:43.000
we didn't retrieve the documents.

14:43.000 --> 14:45.000
So, the second approach,

14:46.000 --> 14:47.000
we had is kind of:

14:47.000 --> 14:49.000
getting rid of the language

14:49.000 --> 14:50.000
detection,

14:50.000 --> 14:53.000
and asking the user which language they are using,

14:53.000 --> 14:55.000
and we only rely on the script,

14:55.000 --> 14:58.000
because it's way easier to detect the script,

14:58.000 --> 15:01.000
than the language itself.
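
NOTE
Script detection is far more reliable than language detection on short
queries; a sketch with whatlang (API hedged):
  use whatlang::detect_script;
  fn main() {
      // Hangul is unambiguous, so Korean is easy; Latin covers many languages
      // but at least selects the right family of segmenters and normalizers.
      println!("{:?}", detect_script("안녕하세요")); // Some(Hangul)
      println!("{:?}", detect_script("bonjour"));    // Some(Latin)
  }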

15:01.000 --> 15:02.000
Yes,

15:02.000 --> 15:06.000
and,

15:06.000 --> 15:08.000
about the last question,

15:08.000 --> 15:09.000
which was,

15:09.000 --> 15:11.000
do you support English,

15:11.000 --> 15:14.000
do you support Korean, and do you support everything like that?

15:14.000 --> 15:15.000
As I said,

15:16.000 --> 15:17.000
it's not,

15:17.000 --> 15:19.000
an easy question to answer by yes

15:19.000 --> 15:20.000
or by no,

15:20.000 --> 15:21.000
but what you can do,

15:21.000 --> 15:23.000
is explaining,

15:23.000 --> 15:24.000
how well

15:24.000 --> 15:25.000
you

15:25.000 --> 15:26.000
support this

15:26.000 --> 15:27.000
specific language,

15:27.000 --> 15:28.000
and that's the,

15:28.000 --> 15:29.000
that's the goal,

15:30.000 --> 15:31.000
that's the thing we tried.

15:31.000 --> 15:32.000
So,

15:32.000 --> 15:33.000
it's not the best way to,

15:33.000 --> 15:34.000
to provide the information,

15:34.000 --> 15:35.000
but,

15:35.000 --> 15:36.000
it's the first,

15:36.000 --> 15:37.000
it's the first step.

15:37.000 --> 15:38.000
What we,

15:38.000 --> 15:39.000
we did is,

15:39.000 --> 15:40.000
instead of,

15:40.000 --> 15:41.000
saying yes or no,

15:41.000 --> 15:42.000
we say,

15:42.000 --> 15:43.000
okay,

15:44.000 --> 15:45.000
we have a specialized,

15:45.000 --> 15:46.000
segmenter,

15:46.000 --> 15:47.000
that does,

15:47.000 --> 15:48.000
this, and this, and this,

15:48.000 --> 15:50.000
and we have some specialized,

15:50.000 --> 15:51.000
normalizers,

15:51.000 --> 15:52.000
that,

15:52.000 --> 15:53.000
normalize,

15:53.000 --> 15:54.000
your characters,

15:54.000 --> 15:55.000
or your text,

15:55.000 --> 15:56.000
like that, like that.

16:12.000 --> 16:18.640
as well, and sometimes it's in the language discussions, and so some external

16:18.640 --> 16:24.680
contributors can come, contribute to the tokenizer, and improve the tokenizer.

16:24.680 --> 16:26.680
What time is it?

16:26.680 --> 16:29.680
Yes, perfect.

16:29.680 --> 16:32.560
So the conclusion.

16:32.560 --> 16:34.960
So the first thing is that we can't know everything.

16:34.960 --> 16:41.520
There are too many languages to know, and we have to find strategies to

16:41.520 --> 16:48.640
gather as much information as possible to support as many languages as possible.

16:48.640 --> 16:54.440
And what we did is relying on open source and open discussions to gather information

16:54.440 --> 16:59.560
and to implement, to co-implement, the language support.

16:59.560 --> 17:05.560
The second thing to talk about is that the language detection is not accurate, and

17:05.560 --> 17:10.320
unfortunately we don't have a good way of detecting the language.

17:10.320 --> 17:14.560
So what we did is relying more on the script instead of the language.

17:14.560 --> 17:22.560
So we have a more generalistic tokenizer, and if a user wants something really specific

17:22.560 --> 17:30.000
to their language, they have to say: okay, I'm in German, and there is only German, so turn on

17:30.000 --> 17:34.200
the specific features of German.

17:34.200 --> 17:41.200
Yes, language support is not a yes/no question, it's a gradient scale, so you don't

17:41.200 --> 17:45.680
say yes or no, but you can explain what you support, and what you don't support, for a

17:45.680 --> 17:48.680
specific language.

17:48.680 --> 17:58.880
And that's it. If you guys have questions: we don't have

17:58.880 --> 18:03.880
a mic to pass around, so speak loud, and he's going to repeat the question for the

18:03.960 --> 18:04.880
recording.

18:04.880 --> 18:12.200
This talk was very appropriate, thank you, but I have a question

18:12.200 --> 18:20.760
about the format of numbers and things; are you also thinking about that in the short

18:20.760 --> 18:21.760
term?

18:21.760 --> 18:27.560
So you think about internationalizing the data, in the sense of interpreting dates, numbers in

18:27.560 --> 18:33.840
different languages and different countries? What do you mean exactly?

18:33.840 --> 18:43.080
That is, do you interpret the formats, like the dates, the numbers, etc.? That is not

18:43.080 --> 18:50.960
only converting characters, but maybe reformatting or understanding the meaning of what's

18:50.960 --> 18:51.960
written.

18:51.960 --> 18:52.960
That's it.

18:52.960 --> 19:03.440
Yeah, okay, so unifying the dates, for example. Yes, it's a good question, so

19:03.480 --> 19:10.080
the quick question is: is it the job of the tokenizer to reformat or to change the meaning

19:10.080 --> 19:12.480
of something?

19:12.480 --> 19:13.960
It's a good question.

19:13.960 --> 19:22.440
I think, personally, I would rather add this as a consumer feature, so for example, if you

19:22.440 --> 19:28.680
have a search engine using the tokenizer, I would say that it's more the search engine's

19:28.720 --> 19:33.760
job to convert a date for a user, but that's a question that can be escalated.

19:33.760 --> 19:36.760
I have no real answer, sorry.

19:36.760 --> 19:41.360
Yeah, we have a question from the room; he's going to read it out loud and try to answer.

19:41.360 --> 19:42.360
Yep.

19:42.360 --> 19:48.000
Question from the chat: I've noticed that most monolingual datasets aren't really

19:48.000 --> 19:53.800
monolingual, as English or German words, not limited to those two, are everywhere. Does

19:53.800 --> 19:58.680
Meilisearch do cleaning of the dataset with the language detector, like

19:58.680 --> 20:04.680
whatlang or lingua? Sorry, I have

20:04.680 --> 20:09.720
to repeat it pretty loud, so people can get it on the video.

20:09.720 --> 20:19.400
Yeah, so: does Meilisearch clean the dataset based on the language detection? I'm

20:19.400 --> 20:30.760
not sure I understand the question, but what I'd say is, in Meilisearch, we suggest splitting

20:30.760 --> 20:35.920
the dataset by language, for example German and English,

20:35.920 --> 20:43.400
we split them. However, if we are talking about a specific German-English dataset, a mix

20:43.440 --> 20:51.840
of these two, considering that there is English in the German one, it's not really an issue, because

20:51.840 --> 21:00.640
for the German language, the most important thing is to split some of the German

21:00.640 --> 21:06.840
words, because German is an agglutinative language, where you can put several words together

21:06.920 --> 21:15.560
as a single word, and German people expect that you should be able to search each

21:15.560 --> 21:22.840
inner word of an agglutinated word. However, in English we don't have that, so it's not incompatible

21:24.280 --> 21:29.240
to handle them together, so it's kind of case by case. I don't know if I answered

21:29.320 --> 21:34.120
the question well. So the other thing is, we have a Matrix room here, and there's

21:34.120 --> 21:38.040
a second question. I'm not going to ask you to read it, but if you guys want to go into

21:38.040 --> 21:42.600
the Matrix room and answer the questions and engage with the people that are asking the questions,

21:42.600 --> 21:46.200
that would be great, okay? Okay. Now we're going to continue with a question from the room.

21:59.240 --> 22:25.600
Yes, so the question is: if we can't decide between two languages, why don't we

22:25.680 --> 22:33.280
run a search for each language? So for example, if you don't know if it's Chinese or

22:33.280 --> 22:40.480
English, you run a query for Chinese and a query for English. It's completely possible;

22:40.480 --> 22:47.360
we can do that in Meilisearch, in fact, using federated search, but the fact is that you

22:47.360 --> 22:53.040
will run a search for every language you have in your dataset, so this means that if you have

22:53.040 --> 23:01.280
like 20 languages, for each end-user search you will run 20 different searches, so that's a

23:01.280 --> 23:02.640
cost to pay.
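
NOTE
A sketch of "one search per language" with Meilisearch's federated
/multi-search endpoint (the index names are hypothetical; the payload shape
follows the documented API, but double-check your Meilisearch version):
  use serde_json::json;
  fn main() {
      // One sub-query per language-specific index; federation merges the
      // result lists into a single ranked response.
      let body = json!({
          "federation": {},
          "queries": [
              { "indexUid": "docs-zh", "q": "東京" },
              { "indexUid": "docs-en", "q": "東京" }
          ]
      });
      // POST this to http://localhost:7700/multi-search with your client.
      println!("{body}");
  }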

23:23.760 --> 23:30.960
Hopefully it's better today. Because the tokenization rules that you have to provide for the text going in

23:30.960 --> 23:36.000
are something that a user probably doesn't even want to do, and possibly something that is hardly

23:36.000 --> 23:40.000
feasible, as it could probably be for a company: you need to hire someone with the technical

23:40.880 --> 23:47.840
knowledge. So do you have to try to learn it, or do you have to try to pay for all of the things?

23:47.840 --> 23:49.360
This is a lot of people.

23:49.360 --> 23:59.600
So I tried to... yes, sorry, so the question was: did I try to learn how tokenizers work in the

23:59.600 --> 24:00.600
world?

24:00.600 --> 24:04.000
No, did you try to machine learn the tokenizer?

24:04.000 --> 24:09.520
So you would take the document to parse it in, and you emit, instead of the original

24:09.520 --> 24:13.600
sentence, a separator, or potentially multiple separators, which you might need for

24:13.600 --> 24:15.840
contractions like "can't".

24:15.840 --> 24:23.200
So, yes, yes, so did I try to machine-learn it: basically we take all the

24:23.200 --> 24:29.920
documents of the dataset, and we try to guess how we should split the query later, am I

24:29.920 --> 24:30.920
right, on this side.

24:30.920 --> 24:54.080
So, to be honest, I didn't try any machine learning on dynamic tokenization like that; we are

24:54.080 --> 25:05.440
kind of in a static way of tokenizing, so we know in advance how we should tokenize

25:05.440 --> 25:10.280
this sentence or that sentence, kind of, and we apply it for the documents and for the search

25:10.280 --> 25:18.720
as well. Maybe it's more a question for a machine-learning specialist, but the thing is that, in

25:18.720 --> 25:25.040
Meilisearch, for example, we don't have all the dataset directly, but we build

25:25.040 --> 25:30.280
the dataset with the user adding more and more documents, so how does it impact this

25:30.280 --> 25:36.080
dynamic, the dynamic tokenization? I don't know. And if we have to reindex the whole

25:36.080 --> 25:40.720
dataset, the whole, all the documents that are already in the database, each time you

25:40.720 --> 25:48.680
add a new one or a few documents, it can be problematic for us. But, to me, I didn't dig into

25:48.680 --> 25:54.680
that, I was more in the traditional way of tokenizing sentences.

25:54.680 --> 26:02.480
Thank you, thank you for the interesting talk. I'm just intrigued by the way

26:02.480 --> 26:07.800
you set things up: you do the segmentation and then the normalization, which is the other way

26:07.800 --> 26:12.000
round from how I've seen it done for 35 years, I just want to bring this up: was there

26:12.000 --> 26:18.240
a reason to choose that order of everything?

26:19.120 --> 26:26.640
Yes, so the question is: in the tokenization in Charabia, we are first running the segmentation,

26:26.640 --> 26:33.120
so we segment text into words, and then, for each word, we run the normalization; and

26:33.120 --> 26:41.360
why we do that and not normalization first and then segmentation: it's because of the

26:41.360 --> 26:48.960
highlighting. Meilisearch is a prefix search engine, so, potentially, in the query,

26:48.960 --> 26:56.160
you have only a part of the word, okay? And normalization is not only a character-

26:56.160 --> 27:02.320
to-character mapping, but maybe character to string; sometimes you convert one character to a

27:02.320 --> 27:09.920
full string, etc. And it's way easier for us to retrieve how to highlight the characters

27:10.080 --> 27:17.680
I should highlight in the document, which is the non-normalized version, when you apply

27:17.680 --> 27:24.320
first the segmentation and then the normalization, because you just have to keep some offsets,

27:24.320 --> 27:31.440
some relative offsets based on the start of the word; and if I had to do the same, but by

27:31.440 --> 27:37.200
applying the normalization first and then the segmentation, I would have to think

27:37.200 --> 27:44.400
in absolute offsets, which is way harder when it comes to highlighting a part of a word.
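
NOTE
A small illustration of the offset argument (toy code; the field names mirror
Charabia's Token struct, which is an assumption): segmentation runs on the
original text, so each token keeps its position there, and normalization only
changes the lemma, never the stored offsets.
  struct Token { lemma: String, char_start: usize, char_end: usize }
  fn main() {
      let original = "Thé Fox";
      // Segmenting first: "Fox" spans chars 4..7 of the ORIGINAL text, even
      // though its lemma was normalized to "fox".
      let token = Token { lemma: "fox".into(), char_start: 4, char_end: 7 };
      let chars: Vec<char> = original.chars().collect();
      let highlighted: String = chars[token.char_start..token.char_end].iter().collect();
      // Had we normalized "Thé" to "the" first, later offsets could shift and
      // would need an absolute remapping back to the original text.
      assert_eq!(highlighted, "Fox");
      let _ = token.lemma;
  }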

27:45.440 --> 27:52.160
So yeah. And don't you lose information when you are normalizing first, and then tokenizing;

27:52.240 --> 28:02.000
doesn't it break the tokenizer when you normalize first? So, does normalizing first

28:02.000 --> 28:08.480
break, potentially, the segmenter just after: it depends on the segmenter. I know that the

28:08.480 --> 28:16.720
Japanese segmenter would prefer to have a first phase of normalization, to be sure that all the

28:16.800 --> 28:23.760
characters are in the same format, and it's way easier, because in Japanese you have several

28:23.760 --> 28:29.200
different formats for the same character: you have the half-width characters and the full-width characters,

28:29.200 --> 28:37.360
and for example, the Japanese segmenter works better with only full-width characters, and if it

28:37.360 --> 28:44.880
encounters half-width characters, the segmentation is less accurate, but it's not really common.

28:56.880 --> 29:04.960
That's a good question. So, is that not possible? Yeah, sorry, so I gave the example of

29:05.040 --> 29:15.120
English, the English "can't" versus "cannot", and how we are handling that character

29:15.120 --> 29:23.360
by character in terms of normalization, how you come to it. But to support that, you have to mix

29:23.360 --> 29:30.800
it with the segmentation, so first you segment "can" and "'t" from "can't", and then you normalize,

29:31.120 --> 29:42.800
sorry, the tick and the "t", into "not", in order to unify them. And so, for this specific case,

29:42.800 --> 29:47.680
it's not a character normalizer but a token normalizer, which is a bit more complex to implement.

29:48.400 --> 29:56.320
In the Charabia code base, what we did is we have some interfaces, kind of, where the character

29:56.480 --> 30:07.760
normalizer is a token normalizer with fewer methods in it, so it's a subset of the token normalizer,

30:08.480 --> 30:12.960
but you can implement a token normalizer if you want, it's just a bit more complex.
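
NOTE
A sketch of that token-level interface, modeled on Charabia's `Normalizer`
trait (signatures hedged; see the repo for the real ones). The "can't" case
needs the whole token, not single characters:
  use charabia::normalizer::{Normalizer, NormalizerOption};
  use charabia::Token;
  use std::borrow::Cow;
  struct ContractionNormalizer;
  impl Normalizer for ContractionNormalizer {
      // Receives a whole token, so it can rewrite "n't" into "not".
      fn normalize<'o>(&self, mut token: Token<'o>, _options: &NormalizerOption) -> Token<'o> {
          if token.lemma() == "n't" {
              token.lemma = Cow::Borrowed("not");
          }
          token
      }
      fn should_normalize(&self, token: &Token) -> bool {
          token.lemma().contains('\'')
      }
  }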

