WEBVTT

00:00.000 --> 00:07.000
Good morning everybody.

00:07.000 --> 00:20.000
I will show you how I programmed a bzip2 data compression encoder in Ada from scratch during

00:20.000 --> 00:24.000
two rainy weekends.

00:24.000 --> 00:33.000
So, my motivation was fun, first of all: a challenge, a bit of a warm-up for Advent of Code,

00:33.000 --> 00:35.000
the latest one.

00:35.000 --> 00:46.000
On a more serious note, I wanted to fill a gap in the Zip-Ada library, in which a bzip2 encoder

00:46.000 --> 00:47.000
was missing.

00:47.000 --> 00:52.000
There was a decoder already there, but there was no encoder.

00:52.000 --> 01:00.000
So absolutely, this gap needed to be filled.

01:00.000 --> 01:10.000
And my expectations were quite low, because this bzip2 scheme, from experience, compresses

01:10.000 --> 01:17.000
only a few kinds of files better than, for instance, LZMA.

01:18.000 --> 01:22.000
The algorithm is mostly mechanical.

01:22.000 --> 01:29.000
So, on most steps, you have only one way to encode the data.

01:29.000 --> 01:32.000
But not all steps.

01:32.000 --> 01:39.000
And bzip2 is, in a way, a weakened version of the first bzip.

01:40.000 --> 01:45.000
At the time, there were patents on arithmetic coding.

01:45.000 --> 01:50.000
And the author had to change to Huffman trees.

01:50.000 --> 01:58.000
But the results of my programming exercise were absolutely surprising.

01:58.000 --> 02:02.000
So, I got two very good surprises.

02:02.000 --> 02:07.000
Of course, I keep them for the end.

02:07.000 --> 02:14.000
So, I will explain to you the bzip2 compression scheme in 15 minutes.

02:14.000 --> 02:17.000
Some people are skeptical.

02:17.000 --> 02:23.000
Now, actually, bzip2 is completely unusual as a compression scheme.

02:23.000 --> 02:25.000
It's super simple.

02:25.000 --> 02:28.000
I promise you, it's simple.

02:28.000 --> 02:32.000
First of all, it's not a streaming, adaptive scheme.

02:32.000 --> 02:37.000
It doesn't use sophisticated techniques.

02:37.000 --> 02:45.000
You have run-length encoding twice, which is the most trivial way of compressing data:

02:45.000 --> 02:51.000
just compressing repeated sequences of exactly the same symbol.

02:51.000 --> 02:54.000
And a few other techniques.

02:54.000 --> 03:05.000
So, once you have processed your block through these steps, you output everything in one go.

03:05.000 --> 03:09.000
So, it's basically an offline algorithm.

03:09.000 --> 03:16.000
You see in the green box the Ada code; of course, part is missing, in the red ellipse:

03:16.000 --> 03:18.000
a few hundred lines.

03:18.000 --> 03:23.000
But basically, you have all the steps in this code.

03:23.000 --> 03:27.000
It's super simple.

03:27.000 --> 03:34.000
For instance, the first of the run-length encodings: it's very smart.

03:34.000 --> 03:40.000
You don't even have a special symbol to say: ah, now there is a run.

03:40.000 --> 03:48.000
Automatically, after four identical symbols, you have an extra byte, saying: ah, you have a run,

03:48.000 --> 03:51.000
and here is the extra length.

03:51.000 --> 03:59.000
So, it's very smart, and you have a gain from the sixth symbol.
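
As an illustration only, here is a sketch of that first pass in Python (not the actual Ada code from Zip-Ada; the bzip2 format caps a run at 255 bytes, which the sketch follows):

```python
def rle1_encode(data: bytes) -> bytes:
    """Sketch of bzip2's first run-length pass: a run of 4 to 255
    equal bytes becomes the 4 literal bytes plus one length byte
    counting the extras.  No escape symbol is needed: the decoder
    knows a length byte follows any 4 equal bytes."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and run < 255 and data[i + run] == data[i]:
            run += 1
        if run >= 4:
            out += data[i:i + 4]      # the four literal bytes
            out.append(run - 4)       # number of extra bytes, 0..251
        else:
            out += data[i:i + run]
        i += run
    return bytes(out)
```

A run of five breaks even (five bytes in, five bytes out); from the sixth identical symbol on, you gain.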

03:59.000 --> 04:08.000
Now, the core of the bzip2 algorithm is a mysterious transform called Burrows-Wheeler,

04:08.000 --> 04:14.000
which is a bit, yeah, the history is also funny.

04:14.000 --> 04:20.000
I won't go into details, but basically, you take the blue

04:20.000 --> 04:27.000
message on the top left, and you shift it as you see in the left box.

04:27.000 --> 04:35.000
You sort all the lines, the red box contains the results of the sorting,

04:35.000 --> 04:41.000
and you output the red column, the rightmost column.

04:41.000 --> 04:43.000
That's the output of this transform.

04:43.000 --> 04:47.000
That's as simple as it is.
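
The shift-sort-output procedure can be sketched in a few lines of Python (a naive illustration with quadratic memory, not how a real encoder does it):

```python
def bwt_forward(s: str) -> tuple[str, int]:
    """Naive Burrows-Wheeler transform: build every rotation of s,
    sort them, and output the last column together with the row
    index of the original string."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_column = "".join(r[-1] for r in rotations)
    return last_column, rotations.index(s)
```

For example, "banana" becomes "nnbaaa" with index 3.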

04:48.000 --> 04:52.000
The fascinating thing is that you can reverse this transform.

04:52.000 --> 04:57.000
So, this letter soup on the right can be reversed.

04:57.000 --> 05:04.000
If you know the index of the original sentence.

05:04.000 --> 05:08.000
So, basically, you don't compress anything; you even

05:08.000 --> 05:14.000
add an extra piece of information, which is an integer.
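
That reversal can be sketched too (again a wildly inefficient illustration): given the last column and the row index of the original string, prepend the column, re-sort, and repeat until the rotation table is rebuilt:

```python
def bwt_inverse(last_column: str, index: int) -> str:
    """Invert the Burrows-Wheeler transform: rebuild the sorted
    rotation table column by column, then read the row at the
    transmitted index, which is the original message."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    return table[index]
```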

05:14.000 --> 05:18.000
And you wonder: what does it have to do

05:18.000 --> 05:23.000
with data compression? The data on the right, the red thing.

05:23.000 --> 05:31.000
It doesn't look, in this example, more compressed or compressible or whatever,

05:31.000 --> 05:34.000
than the blue message.

05:34.000 --> 05:39.000
But let's have a look at a larger example.

05:39.000 --> 05:47.000
So, here I take the source code of the encoder itself as test data,

05:47.000 --> 05:54.000
and you see there are lots of repetitions created by this transform.

05:54.000 --> 05:59.000
And that's, of course, the reason you can do compression with that.

05:59.000 --> 06:06.000
You have lots of repetitions, so you can again do run-length encoding.

06:06.000 --> 06:19.000
And before the run-length encoding, you do a move-to-front transform,

06:19.000 --> 06:22.000
which is again super simple.

06:22.000 --> 06:27.000
Imagine you have a card deck, which is sorted at the beginning.

06:27.000 --> 06:33.000
And if you take card number six, you have index number six,

06:33.000 --> 06:35.000
because of the original sorting.

06:35.000 --> 06:38.000
And afterwards it becomes a bit more complicated,

06:38.000 --> 06:42.000
because the cards are progressively shuffled,

06:42.000 --> 06:48.000
but the interesting feature is: the cards that appear more often

06:48.000 --> 06:54.000
have a lower index, and that's very amenable to compression.

06:54.000 --> 06:58.000
For instance, card number six appears again,

06:58.000 --> 07:06.000
and after the third time, its index is lower.
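
The card-deck idea is exactly a move-to-front transform; a minimal Python sketch over bytes (illustration only, not the Zip-Ada code):

```python
def mtf_encode(data: bytes) -> list[int]:
    """Move-to-front: keep a recency list of all 256 byte values,
    emit each byte's current position in the list, then move that
    byte to the front.  Frequently seen bytes get small indices."""
    recent = list(range(256))
    out = []
    for b in data:
        i = recent.index(b)
        out.append(i)
        recent.pop(i)
        recent.insert(0, b)
    return out
```

After a Burrows-Wheeler transform, the long runs of equal bytes turn into runs of zeros, which compress very well.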

07:06.000 --> 07:11.000
And the final step, here it's a bit smarter,

07:11.000 --> 07:13.000
and it's more difficult.

07:13.000 --> 07:18.000
It's the entropy coding, so it's compression with Huffman trees.

07:18.000 --> 07:25.000
So, the author of this scheme had the bright idea of not having a single entropy

07:25.000 --> 07:31.000
code: you can choose between two to six Huffman trees,

07:31.000 --> 07:35.000
which are completely freely defined; of course,

07:35.000 --> 07:38.000
you try to define them efficiently.

07:38.000 --> 07:43.000
And also, for each group of 50 symbols,

07:43.000 --> 07:48.000
you can allocate one of these two to six trees.

07:49.000 --> 07:54.000
So here, you have lots of room for optimization.

07:54.000 --> 08:03.000
It's where the lemons are squeezed properly.
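
A toy sketch of that optimization in Python (the table contents below are invented for illustration; the real encoder also refines the tables over several passes):

```python
def pick_tables(groups, tables):
    """For each group of symbols (50 per group in bzip2), pick the
    Huffman table whose code lengths encode the group in the fewest
    total bits.  `tables` maps a table id to {symbol: bit length}."""
    best = []
    for group in groups:
        costs = {t: sum(lengths[s] for s in group)
                 for t, lengths in tables.items()}
        best.append(min(costs, key=costs.get))
    return best
```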

08:03.000 --> 08:07.000
So the results of my exercise,

08:07.000 --> 08:12.000
after the last rainy Sunday,

08:12.000 --> 08:19.000
to say it politely, I would say...

08:19.000 --> 08:23.000
what the... okay, but there's a camera, so, let's say:

08:23.000 --> 08:30.000
holy cow! So on the third bar, you see,

08:30.000 --> 08:34.000
the compression of the original bzip2 scheme,

08:34.000 --> 08:40.000
the one that you find everywhere, especially on Linux.

08:40.000 --> 08:46.000
So here, the bzip2 utility is using this algorithm,

08:46.000 --> 08:49.000
if you take the right switch.

08:49.000 --> 08:51.000
In the middle, you have 7-Zip.

08:51.000 --> 08:55.000
You know, everybody uses 7-Zip,

08:55.000 --> 08:57.000
so if you tell it,

08:57.000 --> 09:03.000
ah, use bzip2 encoding, here you go,

09:03.000 --> 09:07.000
and on the left, here is the compression

09:07.000 --> 09:12.000
with the new encoder that I just wrote.

09:12.000 --> 09:18.000
So I was totally surprised that it compresses better

09:18.000 --> 09:25.000
than these tools that have been developed for decades.

09:25.000 --> 09:27.000
Yeah.

09:27.000 --> 09:30.000
To be mentioned, again:

09:30.000 --> 09:35.000
bzip2 is very good for human-written text,

09:35.000 --> 09:39.000
like ebooks and also source code,

09:39.000 --> 09:42.000
of course not automatically generated source code,

09:42.000 --> 09:46.000
but human written source code.

09:46.000 --> 09:48.000
And less good for other things,

09:48.000 --> 09:50.000
but for these things,

09:50.000 --> 09:53.000
it's very good in general,

09:53.000 --> 09:59.000
and surprisingly, my implementation is better.

09:59.000 --> 10:04.000
And the second surprise,

10:04.000 --> 10:10.000
what happens if you have a mix of different data types?

10:10.000 --> 10:15.000
So you have many benchmark data sets,

10:15.000 --> 10:18.000
test data, called corpora,

10:18.000 --> 10:21.000
like the classic Canterbury corpus,

10:21.000 --> 10:27.000
where you have images and other kinds of things,

10:27.000 --> 10:31.000
and also source code and ebooks.

10:31.000 --> 10:36.000
And the second surprise was that even in this context,

10:36.000 --> 10:43.000
by using the Zip-Ada preselection option,

10:43.000 --> 10:48.000
which picks the right algorithm depending on the data type,

10:48.000 --> 10:54.000
the overall archive size is smaller

10:54.000 --> 10:58.000
not only than 7-Zip with the maximum compression

10:58.000 --> 11:00.000
for a zip archive,

11:00.000 --> 11:07.000
but also the second bar on the bottom box,

11:07.000 --> 11:11.000
with a 7z archive,

11:11.000 --> 11:15.000
which combines all the files together.

11:15.000 --> 11:22.000
So it was also a super rejoicing surprise there.

11:22.000 --> 11:29.000
And so, since we are in the Ada devroom:

11:29.000 --> 11:32.000
what are the benefits of Ada,

11:32.000 --> 11:38.000
except that you can make a winning encoder

11:38.000 --> 11:45.000
in two days, in three days, so quickly.

11:45.000 --> 11:49.000
And I would say, regarding data compression,

11:49.000 --> 11:53.000
it's very difficult to debug.

11:53.000 --> 11:58.000
If you have a bug, you almost have to start from scratch,

11:58.000 --> 12:03.000
because the output is random,

12:03.000 --> 12:08.000
so wrong random data doesn't look better

12:08.000 --> 12:14.000
or worse than the good random data.

12:14.000 --> 12:22.000
And really, Ada does its best to make you do things right the first time.

12:22.000 --> 12:29.000
It finds tons of dumb mistakes, programming mistakes.

12:29.000 --> 12:33.000
And if you use the customizable type system,

12:33.000 --> 12:37.000
you can create as many custom types as you want,

12:37.000 --> 12:43.000
so I've listed range types in the box here.

12:43.000 --> 12:49.000
And also, if you use these types,

12:49.000 --> 12:56.000
you cannot too easily mix up an index with an element,

12:56.000 --> 13:01.000
and this saves you sometimes.

13:01.000 --> 13:07.000
Plus, you can define scalar types with ranges

13:07.000 --> 13:15.000
which are very much tailor-made.

13:15.000 --> 13:21.000
And in compression, you always have very strictly

13:21.000 --> 13:27.000
delimited ranges, so it's perfect for that.

13:27.000 --> 13:34.000
And two of the types listed there are data-dependent,

13:34.000 --> 13:39.000
so they depend on the data you are compressing.

13:39.000 --> 13:46.000
So, in a way, you actually have dynamic types in Ada,

13:46.000 --> 13:54.000
as they depend on some features of the data.

13:55.000 --> 14:01.000
So that's it. Questions?

14:12.000 --> 14:18.000
So, yeah: runtime and memory usage.

14:18.000 --> 14:27.000
I didn't try to save too much memory.

14:27.000 --> 14:31.000
So you have a few megabytes of memory usage.

14:31.000 --> 14:38.000
You could squeeze it to perhaps one megabyte by reusing things.

14:38.000 --> 14:44.000
But nowadays, a few megabytes is not so much, yeah.

14:48.000 --> 14:55.000
So how does it scale?

14:55.000 --> 14:58.000
So how does it scale?

14:58.000 --> 15:09.000
This format is by design limited to blocks of 900 kilobytes.

15:09.000 --> 15:15.000
There is a reason: you remember

15:15.000 --> 15:20.000
the strings that were sorted; the cost of sorting these strings

15:20.000 --> 15:22.000
explodes with the size.

15:22.000 --> 15:29.000
So you have to slice your data into blocks of maximum 900 kilobytes.

15:29.000 --> 15:32.000
So it's very easy with memory.

15:32.000 --> 15:36.000
You can allocate as you want.

15:36.000 --> 15:41.000
You don't need to spare too much.

15:41.000 --> 15:45.000
But runtime, there's some improvements to do.

15:45.000 --> 15:50.000
I'm using the standard Ada sorting,

15:50.000 --> 15:54.000
and I should use a more specialized sort for that.

15:54.000 --> 15:56.000
But it's for later.

15:56.000 --> 16:03.000
It runs conveniently fast on most data I would say.

16:03.000 --> 16:06.000
Similar to 7-Zip.

16:06.000 --> 16:10.000
When you say 900 kilobytes, is it the window?

16:11.000 --> 16:17.000
No, it's not a window: you take the first 900 kilobytes.

16:17.000 --> 16:18.000
You process it.

16:18.000 --> 16:21.000
You take the next and so on.

16:21.000 --> 16:26.000
It's very special.

16:26.000 --> 16:31.000
It's not a streaming compression scheme.

16:31.000 --> 16:33.000
It's really you take a block.

16:33.000 --> 16:36.000
You process it offline.

16:36.000 --> 16:39.000
You output it and so on and so on.

16:40.000 --> 16:43.000
Yeah, question?

16:43.000 --> 17:10.000
[Partly inaudible question about the chart: the white bar, the first one? The size?]

17:10.000 --> 17:13.000
So this one, it's just bytes.

17:13.000 --> 17:15.000
Compressed bytes.

17:15.000 --> 17:20.000
Of course, it doesn't go from zero, but you cannot compress to zero either.

17:20.000 --> 17:29.000
Oh, one minute left.

17:29.000 --> 17:34.000
So I have to wrap it up.

17:34.000 --> 17:39.000
So we can start with the next one.

