WEBVTT

00:00.000 --> 00:12.000
It was called "Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision."

00:12.000 --> 00:18.000
And when we read the paper, we kind of found out that the architecture that they presented

00:18.000 --> 00:19.000
is simple.

00:19.000 --> 00:25.000
It's basically just, like, two transformer models stitched together end-to-end.

00:25.000 --> 00:30.000
The samples that they released also sounded really, really good.

00:30.000 --> 00:34.000
Unfortunately, they only released the paper and the samples.

00:34.000 --> 00:39.000
And we thought, like, let's give it a shot to basically replicate the paper.

00:39.000 --> 00:47.000
We had no previous experience with TTS, speech, or even with training big transformer models.

00:47.000 --> 00:50.000
And so, yeah, this is basically our journey.

00:50.000 --> 00:52.000
Let's see if that works.

00:53.000 --> 00:56.000
Hello, FOSDEM. This is the first demo of WhisperSpeech.

00:56.000 --> 01:03.000
A fully open source text-to-speech model trained by Collabora and LAION on the JUWELS supercomputer.

01:03.000 --> 01:09.000
This is basically just, like, one example that WhisperSpeech is able to produce.

01:09.000 --> 01:19.000
And we think it's basically on par with the text-to-speech that you get from Amazon, Microsoft, or Google.

01:20.000 --> 01:27.000
Previously, I basically said that we built WhisperSpeech on top of the SPEAR-TTS paper.

01:27.000 --> 01:30.000
That's only half of the story.

01:30.000 --> 01:35.000
We actually built WhisperSpeech on top of some amazing open source projects.

01:35.000 --> 01:37.000
Mainly Whisper from OpenAI.

01:37.000 --> 01:40.000
That's also, like, where the name WhisperSpeech comes from.

01:40.000 --> 01:47.000
And EnCodec from Meta, which is the neural codec, and Vocos from Gemelo AI.

01:47.000 --> 01:51.000
We also, in the process of, like, implementing WhisperSpeech,

01:51.000 --> 01:56.000
read a bunch of papers from, like, different AI labs.

01:56.000 --> 01:59.000
Namely, of course, SPEAR-TTS from Google.

01:59.000 --> 02:07.000
MusicGen from Meta, and Tensor Programs V from Microsoft and OpenAI.

02:07.000 --> 02:13.000
So this basically gives you a rough overview of the architecture of WhisperSpeech.

02:13.000 --> 02:16.000
So on the left, of course, is the text-to-speech model.

02:16.000 --> 02:27.000
You have the input text, which is then fed into, like, the first transformer model, which kind of creates a phonetic representation of your

00:27.000 --> 00:28.000
input text.

02:28.000 --> 02:35.000
And since it's kind of difficult to figure out, like, emotions or prosody from the text itself,

02:35.000 --> 02:42.000
we also embed this into, like, the phonetic representation as part of the first transformer model.

02:42.000 --> 02:49.000
Then the phonetic representation is fed into the next one, which then actually creates the actual speech.

02:49.000 --> 02:52.000
It's still like compressed speech, but it's speech.

02:52.000 --> 03:00.000
And, since the text doesn't give you anything about, like, the speaker's voice, emotions, whatever,

03:00.000 --> 03:03.000
we also add speaker embeddings on top of it.

03:03.000 --> 03:09.000
And then we apply the Vocos vocoder to actually get the audio out of it.
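
NOTE
A minimal sketch of the data flow just described, to make it concrete.
Every name here is a hypothetical placeholder, not the actual WhisperSpeech API:
    def text_to_semantic(text: str) -> list[int]:
        # first transformer: text to quantized phonetic/semantic tokens,
        # with prosody and emotion implicitly embedded
        ...
    def semantic_to_acoustic(tokens: list[int], speaker) -> list[int]:
        # second transformer: semantic tokens + speaker embedding to
        # compressed speech (EnCodec-style acoustic tokens)
        ...
    def vocode(acoustic: list[int]) -> "Waveform":
        # Vocos vocoder: acoustic tokens to an audio waveform
        ...
    def tts(text: str, speaker) -> "Waveform":
        return vocode(semantic_to_acoustic(text_to_semantic(text), speaker))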

03:10.000 --> 03:14.000
Now, I mentioned that we kind of get like a phonetic representation.

03:14.000 --> 03:18.000
And the phonetic representation actually comes from Whisper itself.

03:18.000 --> 03:22.000
And Whisper is, like, a really simple, just encoder-decoder model.

03:22.000 --> 03:29.000
And if you just look at the right, the Whisper decoder just takes the Whisper features and creates the text tokens,

03:29.000 --> 03:34.000
with like varying speed because you don't really have an idea from the text itself,

03:34.000 --> 03:38.000
but from the audio sample, how fast the speaker is.

03:38.000 --> 03:43.000
And on the left, of course, you have the input speech, which is like an audio file,

03:43.000 --> 03:49.000
goes into the encoder, and this actually creates the phonetic representation.

03:49.000 --> 03:59.000
And what we did is we trained a quantizer in between the encoder and decoder to not only kind of reduce the number of features,

03:59.000 --> 04:06.000
but also to kind of force the encoder to focus on the phonetic representation.
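
NOTE
A sketch of such a quantization bottleneck, assuming a plain VQ layer with a
straight-through estimator; the actual WhisperSpeech quantizer may differ:
    import torch
    from torch import nn
    class Quantizer(nn.Module):
        def __init__(self, n_codes: int = 512, dim: int = 384):
            super().__init__()
            self.codebook = nn.Embedding(n_codes, dim)
        def forward(self, x):
            # x: (batch, time, dim) Whisper encoder features
            d = torch.cdist(x, self.codebook.weight.unsqueeze(0))
            idx = d.argmin(-1)       # nearest codebook entry per frame
            q = self.codebook(idx)   # quantized features
            # straight-through: gradients still flow back into the encoder
            return x + (q - x).detach(), idx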

04:07.000 --> 04:15.000
Now we basically have our architecture in place, and this was basically our plan for how we can train WhisperSpeech.

04:15.000 --> 04:20.000
Starting from the right here, we have the speech waveform, which is just audio files,

04:20.000 --> 04:24.000
and you can find a bunch of audio files like on the internet.

04:24.000 --> 04:30.000
The problem is, if you want to train a model, like a text-to-speech model, you also need the transcription.

04:31.000 --> 04:38.000
Luckily, we have, like, the OpenAI Whisper model that can take audio and create the transcription files.

04:38.000 --> 04:43.000
And then we just do the reverse, we have the transcriptions, and in the training process,

04:43.000 --> 04:46.000
we just use the audio files and the transcriptions to train it.
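
NOTE
A sketch of this pseudo-labelling step with the openai-whisper package;
the file name is a placeholder:
    import whisper
    model = whisper.load_model("small")
    result = model.transcribe("chapter.flac")
    print(result["text"])           # the full transcription
    for seg in result["segments"]:  # per-segment timestamps
        print(seg["start"], seg["end"], seg["text"])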

04:46.000 --> 04:53.000
But when we actually started the training process, we kind of figured out that it failed us on all fronts,

04:53.000 --> 04:58.000
and yeah, like let's just go through the different parts.

04:59.000 --> 05:02.000
Starting with the speech waveform.

05:02.000 --> 05:07.000
I mentioned that we just used audio data.

05:07.000 --> 05:11.000
Luckily for us, there's, like, this really great dataset out there.

05:11.000 --> 05:13.000
It's called Libri-Light.

05:13.000 --> 05:14.000
It was released by Meta.

05:14.000 --> 05:19.000
It has 60,000 hours of English speech in the public domain.

05:19.000 --> 05:25.000
But then, like the first problem that we encountered was, it comes as a single 3.6 terabyte archive,

05:25.000 --> 05:28.000
which is kind of dumb.

05:28.000 --> 05:33.000
Just if you want to download this, and if you can, like, saturate a 2-gigabit connection,

05:33.000 --> 05:34.000
It takes you like 5 hours.
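
NOTE
Rough arithmetic behind that figure, assuming a sustained 2 Gbit/s (about
250 MB/s): 3.6 TB / 250 MB/s is roughly 14,400 s, i.e. four to five hours
of pure transfer time per download.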

05:34.000 --> 05:37.000
But if you're like us, you want to try out different things.

05:37.000 --> 05:41.000
You don't really download it, like, once, you download it, like, multiple times.

05:41.000 --> 05:43.000
So it's a problem.

05:43.000 --> 05:49.000
If you want to unpack the dataset, it takes, like, over eight terabytes on your SSD.

05:49.000 --> 05:52.000
Just yeah, so you have the data available.

05:53.000 --> 05:56.000
The other problem is read amplification.

05:56.000 --> 06:01.000
So the 3.6 terabytes are actually 220,000 files.

06:01.000 --> 06:06.000
And the Libri-Light dataset, as I mentioned, is mainly audiobooks.

06:06.000 --> 06:09.000
So they're kind of, like, split up into chapters.

06:09.000 --> 06:14.000
And one chapter has around, like, 16 megabytes on average.

06:14.000 --> 06:19.000
And in machine learning, what you want to do is you kind of want to randomly access your data,

06:19.000 --> 06:21.000
you want to shuffle it.

06:21.000 --> 06:27.000
You might want to just extract like 30 seconds out of like a single chapter.

06:27.000 --> 06:32.000
So on average, you kind of have to read like eight megabytes, and the rest is just waste.

06:32.000 --> 06:39.000
And if you do the math: if you want to train on this, you kind of have, like, an SSD that can read, like, six gigabytes per second.

06:39.000 --> 06:43.000
You have, like, eight GPUs and use a batch size of, like, 32.

06:43.000 --> 06:45.000
You arrive at, like, three iterations per second.
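
NOTE
The arithmetic behind that figure, using the numbers above: 8 GPUs × a batch
of 32 × about 8 MB read per sample is roughly 2 GB per iteration, and a
6 GB/s SSD divided by 2 GB gives about 3 iterations per second, before any
overhead.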

06:45.000 --> 06:50.000
And in practice, what we have seen is more like 0.5 iterations per second.

06:50.000 --> 06:55.000
Because like you get like file system overhead and just like the system overhead in general.

06:55.000 --> 07:03.000
Which is really, really bad, because, like, you would wait, like, months to just finish, like, a single epoch.

07:03.000 --> 07:11.000
Luckily, there's this really cool project, WebDataset, from tmbdev, that's the GitHub user handle.

07:11.000 --> 07:20.000
And what WebDataset actually allows us to do is split up our 3.6 terabyte file into, like, shards.

07:20.000 --> 07:24.000
It's just, like, splitting it up into multiple tar files.

07:24.000 --> 07:31.000
And we split up the whole dataset into 623 5-gigabyte shard files.

07:31.000 --> 07:37.000
And WebDataset then allows us to read eight random shards concurrently and also shuffle.
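
NOTE
A sketch of both sides with the webdataset package; shard and file names are
placeholders:
    import io
    import torchaudio
    import webdataset as wds
    # writing: pack the chapters into roughly 5 GB tar shards, once
    with wds.ShardWriter("libri-light-%06d.tar", maxsize=5e9) as sink:
        sink.write({"__key__": "book0001/chap01",
                    "flac": open("chap01.flac", "rb").read()})
    # reading: stream shards sequentially, shuffling across a read buffer
    dataset = wds.WebDataset("libri-light-{000000..000622}.tar").shuffle(1000)
    for sample in dataset:
        audio, sr = torchaudio.load(io.BytesIO(sample["flac"]), format="flac")
        break  # one decoded chapter; in training this feeds the DataLoader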

07:37.000 --> 07:45.000
And if you do the math again, you basically arrive at 330 iterations per second on the same system.

07:45.000 --> 07:52.000
And that is something that you don't really see because like most people just maybe deal with like a gigabyte of data.

07:52.000 --> 07:53.000
And that's it.

07:53.000 --> 07:58.000
And it just fits in RAM and you don't really have a problem, like, no extra copies whatsoever.

07:58.000 --> 08:06.000
But if you deal with like a lot of data, large data, the disk actually becomes a bottleneck.

08:06.000 --> 08:13.000
So now we have a way to read our audio files, our audio books.

08:13.000 --> 08:18.000
Next thing is that we need the actual transcriptions, right?

08:18.000 --> 08:21.000
We need them to train the whole WhisperSpeech model.

08:21.000 --> 08:26.000
And luckily, as I said, we can just use OpenAI Whisper.

08:26.000 --> 08:29.000
It's, like, a great state-of-the-art model.

08:29.000 --> 08:35.000
But if we actually use just, like, the plain Whisper model that OpenAI released,

08:35.000 --> 08:38.000
it's like 50 times faster than real time.

08:38.000 --> 08:45.000
Which if you want to process like the 60,000 hours, it takes you 1,200 GPU hours, 50 days.

08:45.000 --> 08:49.000
I'm not waiting like over a month to kind of get the transcriptions.

08:49.000 --> 08:57.000
Especially if you mess something up along the way, you probably do this like multiple times.

08:58.000 --> 09:00.000
So make it faster.

09:00.000 --> 09:02.000
Yeah, you can use batches.

09:02.000 --> 09:06.000
You can also use the faster-whisper implementation, which is using the same model,

09:06.000 --> 09:08.000
but it's just like a faster implementation.

09:08.000 --> 09:16.000
You can also switch to a smaller model, which then basically puts it down to like 78 GPU hours,

09:16.000 --> 09:18.000
which is still like three days.

09:18.000 --> 09:26.000
But remember, we are using WebDataset, which allows us to parallelize this across, like, multiple GPUs.

09:26.000 --> 09:34.000
So if you do this across, like, 100 GPUs, for example, you can kind of process the dataset in around one hour.

09:34.000 --> 09:40.000
And that not only works for the transcription, but also for voice activity detection, speaker embeddings, and so on.
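
NOTE
A sketch of that parallelization, assuming the faster-whisper package and
one worker process per GPU; the paths and sharding scheme are placeholders:
    import sys
    from faster_whisper import WhisperModel
    rank, world = int(sys.argv[1]), int(sys.argv[2])
    shards = [f"libri-light-{i:06d}.tar" for i in range(623)][rank::world]
    model = WhisperModel("small.en", device="cuda", compute_type="float16")
    for shard in shards:
        # in practice you would iterate the shard via webdataset as above;
        # here a single placeholder file stands in for its contents
        segments, info = model.transcribe("chapter.flac")
        for seg in segments:
            print(seg.start, seg.end, seg.text)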

09:40.000 --> 09:43.000
Now we have the transcription.

09:43.000 --> 09:47.000
But yeah, Whisper is, like, kind of a state-of-the-art model, right?

09:47.000 --> 09:49.000
But we encountered several problems.

09:49.000 --> 09:53.000
One of them is, like, you also get timestamps from Whisper.

09:53.000 --> 09:57.000
But what we have seen is that they're off by several seconds.

09:57.000 --> 10:02.000
Luckily, it's kind of consistent, so we just applied a constant offset.
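
NOTE
A sketch of that workaround, continuing the transcription example from
earlier; the offset value here is hypothetical, the real one was measured:
    OFFSET = 2.0  # seconds, placeholder
    segments = [{"start": s["start"] + OFFSET,
                 "end": s["end"] + OFFSET,
                 "text": s["text"]}
                for s in result["segments"]]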

10:02.000 --> 10:08.000
What was even more puzzling was that in the transcription itself, there were some parts missing.

10:08.000 --> 10:11.000
And, as I said, we were using the Libri-Light dataset.

10:11.000 --> 10:16.000
And this is, like, an example: if I'm listening to one of the chapters,

10:16.000 --> 10:19.000
it says, like, "Chapter 5 of The Things in Our Garden by Arthur Ransome.

10:19.000 --> 10:22.000
This LibriVox recording is in the public domain," and so on.

10:22.000 --> 10:25.000
And what Whisper heard was basically this.

10:25.000 --> 10:28.000
So it ignored the first part completely.

10:28.000 --> 10:34.000
And that's kind of, like, a problem, because, like, you want to train something on top of it.

10:34.000 --> 10:37.000
And we have a really good idea like how this happened.

10:37.000 --> 10:42.000
So basically, OpenAI, it's very likely that they used the Libri-Light dataset.

10:42.000 --> 10:45.000
It's about like 10% of their whole data set.

10:45.000 --> 10:48.000
So it's not likely that they ignored it.

10:48.000 --> 10:51.000
But they also needed the transcriptions, right?

10:51.000 --> 10:55.000
But what they did was basically they used forced alignment.

10:55.000 --> 11:00.000
With the actual e-books from Project Gutenberg.

11:00.000 --> 11:03.000
So they aligned the text with the audio.

11:03.000 --> 11:06.000
But in the e-book, there's no, like, "Chapter 5" announcement.

11:06.000 --> 11:08.000
You don't really see this.

11:08.000 --> 11:10.000
And, like, the same goes for footnotes as well.

11:10.000 --> 11:15.000
So what we did to basically fix this is we just ignore the first 30 seconds.
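
NOTE
A sketch of that fix: trim the first 30 seconds of each chapter before
building training pairs, so the unread LibriVox preamble is never seen.
The file name is a placeholder:
    import torchaudio
    audio, sr = torchaudio.load("chapter.flac")
    audio = audio[:, 30 * sr:]  # drop the possibly mis-transcribed intro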

11:16.000 --> 11:19.000
Now, like, to the training itself.

11:19.000 --> 11:24.000
Just to set this up: for me, what's the most important thing when you need to solve a difficult problem?

11:24.000 --> 11:27.000
And it's basically iteration speed.

11:27.000 --> 11:30.000
I'm just skipping this because like I'm running out of time.

11:30.000 --> 11:33.000
But yeah, what does it mean in our case?

11:33.000 --> 11:40.000
So if we want to train like a large model on like 80,000 hours on a thousand speakers on like 96 GPUs,

11:40.000 --> 11:43.000
it takes, like, six hours to train.

11:43.000 --> 11:48.000
Which allows me to basically run, like, two experiments per day.

11:48.000 --> 11:51.000
One in the morning, one like after lunch.

11:51.000 --> 11:59.000
Which is not really great, because, like, if you're like us, you want to try out, like, a lot of different ideas, and, like, 99% of the ideas are garbage.

11:59.000 --> 12:03.000
So what you do is you just train it on a very small dataset.

12:03.000 --> 12:06.000
Single speaker, you can train it on a 4090 in 50 minutes.

12:06.000 --> 12:10.000
And that allows you to run basically 48 experiments per day.

12:10.000 --> 12:15.000
And then, like, you just, like, train the little model, add some more depth, and profit, right?

12:15.000 --> 12:17.000
Now the last part.

12:17.000 --> 12:20.000
I mentioned like oh like you need GPUs.

12:20.000 --> 12:22.000
96 GPUs, 100 GPUs.

12:22.000 --> 12:26.000
So during our journey, we basically explored, like, different companies.

12:26.000 --> 12:28.000
We started with, like, DataCrunch and Lambda Labs.

12:28.000 --> 12:31.000
Which kind of give you, like, the usual pricing.

12:31.000 --> 12:35.000
But the problem is that usually you can only get like one or two GPUs.

12:35.000 --> 12:37.000
So they're all booked.

12:37.000 --> 12:40.000
Another interesting platform is Vast.ai.

12:40.000 --> 12:45.000
Where basically users provide their own GPUs that they have under the desk.

12:45.000 --> 12:46.000
It's a lot cheaper.

12:46.000 --> 12:48.000
But the problem is like bandwidth.

12:48.000 --> 12:50.000
And they also don't support like network drives.

12:50.000 --> 12:53.000
And again, like, we had to download the dataset, like, multiple times.

12:53.000 --> 12:56.000
Like, when a machine crashes and all of this.

12:56.000 --> 13:01.000
And something that we figured out later was, if you do, like, open source work,

13:01.000 --> 13:03.000
you can talk to the LAION community.

13:03.000 --> 13:06.000
And they have access to, like, the JUWELS supercomputer.

13:06.000 --> 13:07.000
With a lot of GPUs.

13:07.000 --> 13:09.000
And then yeah, you basically profit.

13:09.000 --> 13:10.000
There you go.

13:10.000 --> 13:12.000
Give them a round of applause everybody.

13:12.000 --> 13:13.000
Good morning.

