WEBVTT

00:00.000 --> 00:12.000
I hope you just moved on.

00:12.000 --> 00:18.000
So many people contributed to this project. The ones on the bottom line are people from Software

00:18.000 --> 00:23.000
Heritage, and my students; the three main people you see are the ones who invested

00:23.000 --> 00:27.000
the most time in the project, probably.

00:27.000 --> 00:34.000
What we want to do is graph analytics at scale, on huge graphs. Huge means hundreds of billions

00:34.000 --> 00:39.000
of arcs, okay, hundreds of billions of edges, not millions or billions.

00:39.000 --> 00:44.000
Graphs are everywhere, okay. I assume you know what a graph is, or we won't go very far

00:44.000 --> 00:51.000
with this: web snapshots, social networks, biological graphs, you name it. Oh sorry,

00:51.000 --> 00:54.000
wrong key, I'll start over here.

00:55.000 --> 01:03.000
Okay, there is some delay between what I do and what my Mac does.

01:03.000 --> 01:06.000
Okay, sorry.

01:06.000 --> 01:10.000
Okay, the standard representations, like adjacency

01:10.000 --> 01:14.000
lists, will not work if you have hundreds of billions of edges.

01:14.000 --> 01:19.000
And you can do distributed approaches, which usually spend most of their time

01:19.000 --> 01:25.000
moving data around. If you have followed McSherry, there is the COST idea: cost over a single thread.

01:25.000 --> 01:32.000
You take any graph task, you run it on a single thread, and the COST is the number of

01:32.000 --> 01:37.000
MapReduce servers you need to replicate the speed of a laptop.

01:37.000 --> 01:40.000
And the COST of some algorithms is infinite.

01:40.000 --> 01:46.000
No matter how many servers you add to your MapReduce cluster, it's slower than a laptop.

01:46.000 --> 01:49.000
So that doesn't really work.

01:49.000 --> 01:53.000
What we've been doing for some time is compression, okay.

01:53.000 --> 01:59.000
So WebGraph is a framework for compressed representations of graphs.

01:59.000 --> 02:03.000
It's a very long-lived project, it's more than 20 years old now.

02:03.000 --> 02:09.000
And there are hundreds of publications in conferences and journals that have

02:09.000 --> 02:12.000
used WebGraph or the data we distribute.

02:12.000 --> 02:13.000
It's all open.

02:13.000 --> 02:18.000
I mean open source, open data, we gather web crawls, we gather data from, you know,

02:18.000 --> 02:23.000
the structure of Twitter or DBLP or Wikipedia and we make it public.

02:23.000 --> 02:26.000
In 2011, that was a long time ago.

02:26.000 --> 02:28.000
Come to think of it.

02:28.000 --> 02:33.000
there was news that Facebook had four degrees of separation. Wow,

02:33.000 --> 02:38.000
sorry, there is a long delay here. This was actually featured in the New York Times,

02:38.000 --> 02:42.000
and the actual number was three point seven four, not four.

02:42.000 --> 02:46.000
It's a long story, because degrees of separation and distances in graphs are off by one.

02:46.000 --> 02:48.000
And the New York Times got the off-by-one wrong.

02:48.000 --> 02:51.000
Okay, I don't want to talk about it, it's very sad.

02:51.000 --> 02:58.000
But at that time we measured that at Facebook, I mean, people at Facebook did the measurement with our software.

02:58.000 --> 03:04.000
And at that time Facebook had 700 million nodes and 70 billion links.

03:04.000 --> 03:09.000
It was represented in just 200 gigabytes.

03:09.000 --> 03:21.000
Like Common Crawl, if you know the nonprofit: they distribute web crawls and do all their internal analytics using WebGraph.

03:21.000 --> 03:28.000
So the motivation for this was the Software Heritage history graph.

03:28.000 --> 03:30.000
So what is this?

03:30.000 --> 03:35.000
It's the largest public archive of software with its full version control history.

03:35.000 --> 03:39.000
This is a UNESCO sponsored project to create a library of software.

03:39.000 --> 03:41.000
All the software ever created.

03:41.000 --> 03:45.000
And the data model is a Merkle directed acyclic graph.

03:45.000 --> 03:52.000
So basically you have like a gigantic git repository with all the development of all public code.

03:52.000 --> 03:57.000
Even very old code, I mean they gather code from a number of sources.

03:57.000 --> 04:02.000
And this is one of the largest graphs of human activity available.

04:02.000 --> 04:13.000
Like presently we are at 44 billion nodes and more than 700 billion arcs, which are basically commit dependencies,

04:13.000 --> 04:20.000
which presently WebGraph represents in 150 gigabytes instead of more than six terabytes,

04:20.000 --> 04:24.000
which makes it possible to do things in memory.

04:24.000 --> 04:36.000
Now, previously, Software Heritage was using a pipeline based on our historical, 20-year-old Java implementation of WebGraph.

04:36.000 --> 04:45.000
And being able to store the graph explicitly means you can do visits, you can run any kind of classical in-memory algorithm you have in mind,

04:45.000 --> 04:50.000
because you don't have to write something specific for a distributed implementation.

04:50.000 --> 04:55.000
But clearly, at this scale, Java was getting in the way.

04:55.000 --> 04:59.000
I was looking for an excuse to learn Rust.

04:59.000 --> 05:02.000
Tommaso was into Rust.

05:02.000 --> 05:09.000
Stefan wanted to do this, so essentially that was the time to move to Rust.

05:09.000 --> 05:12.000
Of course, it's high-performance and safe.

05:12.000 --> 05:15.000
There are zero-cost abstractions, you know this.

05:16.000 --> 05:25.000
Now this might look to you like a minor issue, but if you ever tried to manage something with hundreds of billions of items in Java,

05:25.000 --> 05:29.000
having at most two-billion-element arrays is a major problem, okay?

05:29.000 --> 05:33.000
Not even two billion, two billion minus one in theory.

05:33.000 --> 05:38.000
In reality, two billion minus eight. Don't get me into that, it's very sad.

05:38.000 --> 05:44.000
So I shouldn't complain because I used Java happily for at least a decade.

05:44.000 --> 05:52.000
It was just around 2010 that it became almost impossible to do this sensibly in Java.

05:52.000 --> 06:00.000
Another thing that might be obvious: lazy, truly lazy iterators.

06:00.000 --> 06:03.000
The whole framework is based on lazy iterators.

06:03.000 --> 06:08.000
If you ever wrote a lazy iterator in Java, you know what I'm talking about: a nightmare.

06:08.000 --> 06:13.000
In Rust, fortunately, we have actual lazy iterators.

06:13.000 --> 06:18.000
And moving to Rust required porting, or rethinking completely, several key ideas,

06:18.000 --> 06:21.000
which resulted in a number of crates.

06:21.000 --> 06:25.000
So what I'm trying to do today is to show you what we did,

06:25.000 --> 06:28.000
and maybe it can be useful for you too.

06:29.000 --> 06:34.000
And a couple of ideas that we came up with along the way.

06:34.000 --> 06:40.000
So, ε-serde is an epsilon-copy serialization and deserialization framework,

06:40.000 --> 06:45.000
and I will tell you what that means, because we invented the word for this thing.

06:45.000 --> 06:51.000
Then mem-dbg is a system for fast memory footprint analysis.

06:51.000 --> 06:54.000
Then sux is a crate with succinct data structures.

06:54.000 --> 06:57.000
Who knows what a succinct data structure is? Raise your hands.

06:57.000 --> 06:59.000
Nobody.

06:59.000 --> 07:03.000
Seriously? Okay, there will be a two-line introduction.

07:03.000 --> 07:06.000
Then dsi-bitstream, which is streams for instantaneous

07:06.000 --> 07:11.000
codes: gamma codes, compression, variable byte.

07:11.000 --> 07:15.000
Okay, now someone knows, I know.

07:15.000 --> 07:18.000
Well, and of course, WebGraph.

07:18.000 --> 07:23.000
Let's start with the first idea, which took a lot of time to work out.

07:23.000 --> 07:28.000
So, keep our purpose in mind, because it's called epsilon-copy, blah, blah, blah.

07:28.000 --> 07:34.000
The purpose is to memory map very large data structures directly from disk.

07:34.000 --> 07:38.000
You have data structures that are 300 gigabytes large.

07:38.000 --> 07:42.000
Loading them into memory takes half an hour, literally, half an hour.

07:42.000 --> 07:44.000
What can we do to have fast access?

07:44.000 --> 07:48.000
I mean, the thing they do at Facebook, Google, any data center you can name,

07:48.000 --> 07:52.000
is to have them in a specific format on disk, memory-map them,

07:52.000 --> 07:56.000
and access them directly from the memory mapping.

07:56.000 --> 07:59.000
Okay, so we wanted to do something like that in Rust.

07:59.000 --> 08:06.000
And essentially, to do that, what you do is something called zero-copy.

08:06.000 --> 08:10.000
Zero-copy means that instead of deserializing something,

08:10.000 --> 08:16.000
you load the bytes in memory, and then those bytes are already the object you want.

08:16.000 --> 08:20.000
More or less; so there is no deserialization happening.

08:20.000 --> 08:24.000
And in that case, you can even directly memory map the file.

08:24.000 --> 08:28.000
So instead of deserializing an object, you memory-map the file,

08:28.000 --> 08:30.000
and it's already the object you want.
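
To make this concrete, here is a minimal, std-only Rust sketch of the zero-copy idea (an illustration, not the actual code of any crate mentioned in the talk): once the file's bytes sit in suitably aligned memory, they are reinterpreted as a typed slice with no per-element decoding.

```rust
use std::mem::size_of;

/// "Zero-copy deserialization": reinterpret raw bytes as a typed slice.
/// No element-by-element decoding happens; the bytes already are the data.
/// Contract: the buffer must be 8-byte aligned, as a memory mapping
/// (which is page-aligned) always is.
fn as_u64_slice(bytes: &[u8]) -> &[u64] {
    let (prefix, slice, suffix) = unsafe { bytes.align_to::<u64>() };
    assert!(prefix.is_empty() && suffix.is_empty(), "misaligned buffer");
    slice
}

fn main() {
    // "Serialize": the values become raw bytes, like a file on disk.
    let original: Vec<u64> = vec![1, 2, 3, 42];
    let bytes: Vec<u8> = original.iter().flat_map(|x| x.to_ne_bytes()).collect();

    // Stand-in for a memory mapping: copy the file bytes into an 8-byte
    // aligned backing buffer (a real mmap is page-aligned, so this is free).
    let mut backing = vec![0u64; bytes.len() / size_of::<u64>()];
    unsafe { backing.align_to_mut::<u8>() }.1.copy_from_slice(&bytes);
    let mapped: &[u8] = unsafe { backing.align_to::<u8>() }.1;

    // The mapped bytes *are* the slice: no copy, no decoding loop.
    assert_eq!(as_u64_slice(mapped), &original[..]);
    println!("{:?}", as_u64_slice(mapped));
}
```

Real frameworks add headers, type hashes, and padding for alignment on top of this raw mechanism.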

08:30.000 --> 08:35.000
There are a few crates to do this, but they didn't do what we needed.

08:35.000 --> 08:38.000
One is Abomonation, from Frank McSherry.

08:38.000 --> 08:43.000
Abomonation rewrites all the references in memory to make it work,

08:43.000 --> 08:46.000
because you don't know where the object is.

08:46.000 --> 08:51.000
But we cannot do that, because we want to do memory mapping of an immutable

08:51.000 --> 08:53.000
data structure; we cannot write.

08:53.000 --> 08:56.000
zerovec is a popular crate to do that.

08:56.000 --> 09:00.000
It has a special type that you use instead of Vec,

09:00.000 --> 09:03.000
that essentially does relative indexing.

09:03.000 --> 09:05.000
Huge performance impact.

09:05.000 --> 09:09.000
We are shaving every cycle; not possible.

09:09.000 --> 09:13.000
Another one that is very popular is rkyv.

09:13.000 --> 09:17.000
rkyv has some impact, not very clear, on performance,

09:17.000 --> 09:23.000
but the main problem with rkyv is that what you deserialize, zero-copy deserialize,

09:23.000 --> 09:26.000
is not what you serialize; it's a different structure.

09:26.000 --> 09:31.000
So you have to re-implement every single method on the deserialized structure,

09:31.000 --> 09:35.000
and we have a lot of algorithmic things going on.

09:35.000 --> 09:37.000
That was not working for us.

09:37.000 --> 09:40.000
So we developed this idea.

09:40.000 --> 09:43.000
Let me show you now.

09:43.000 --> 09:48.000
The core of our idea is that you need some collaboration

09:48.000 --> 09:49.000
from the data structure.

09:49.000 --> 09:53.000
You cannot just slap a derive on it and it will work for you.

09:53.000 --> 09:58.000
In exchange, we get something that we didn't have from those crates.

09:58.000 --> 10:02.000
So there are two types of data: zero-copy and deep-copy.

10:02.000 --> 10:04.000
There are two traits.

10:04.000 --> 10:08.000
And what we do is, if you have any kind of sequence,

10:08.000 --> 10:11.000
vectors, boxes, slices of zero-copy types,

10:11.000 --> 10:14.000
when you deserialize, instead of the vector,

10:14.000 --> 10:16.000
you get a reference to the memory.

10:16.000 --> 10:19.000
So this is the zero copy part.

10:19.000 --> 10:22.000
But the point is that the rest of the structure,

10:22.000 --> 10:26.000
which is very small with respect to the large arrays,

10:26.000 --> 10:28.000
is allocated normally.

10:28.000 --> 10:32.000
So we copy a little bit, an epsilon part of the structure.

10:32.000 --> 10:35.000
It's not entirely zero copy.

10:35.000 --> 10:38.000
Zero copy means we don't allocate anything.

10:38.000 --> 10:41.000
The memory is your structure.

10:41.000 --> 10:45.000
We take the memory, we use like 99% of it,

10:45.000 --> 10:48.000
and then we allocate a little bit.

10:48.000 --> 10:52.000
And to do that, we use disjoint trait implementations

10:52.000 --> 10:55.000
based on an associated type that I'll show you later.

10:55.000 --> 10:58.000
How many people here have hit this problem

10:59.000 --> 11:02.000
when programming in Rust? Let's see.

11:02.000 --> 11:05.000
Okay, so there are some people who have probably met the problem.

11:05.000 --> 11:10.000
But again, the idea is that you get something that is right for us.

11:10.000 --> 11:12.000
Let me just skip ahead a second.

11:12.000 --> 11:14.000
Now I'll show you, first of all,

11:14.000 --> 11:17.000
how you do these disjoint implementations.

11:17.000 --> 11:22.000
So the idea is that you want to implement a trait differently,

11:22.000 --> 11:26.000
depending on another trait, which you cannot do, of course,

11:26.000 --> 11:27.000
because of the coherence rules.

11:27.000 --> 11:30.000
So the way you do it is like this:

11:30.000 --> 11:33.000
you create a selector, basically.

11:33.000 --> 11:36.000
So, a trait that has an associated type

11:36.000 --> 11:39.000
that takes, in this case, just two possible values.

11:39.000 --> 11:40.000
Phase one.

11:40.000 --> 11:44.000
Phase two: you parameterize your trait

11:44.000 --> 11:47.000
with a specific instance of the first trait.

11:47.000 --> 11:50.000
So ZeroCopy depends on being a CopyType

11:50.000 --> 11:52.000
with Copy equals Zero,

11:52.000 --> 11:54.000
and anything with CopyType,

11:54.000 --> 11:56.000
Copy equals Zero, will be ZeroCopy.

11:56.000 --> 11:59.000
So now ZeroCopy and the other trait,

11:59.000 --> 12:04.000
with the associated type set to Zero, are the same thing.
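
To make the trick concrete, here is a compilable sketch (hypothetical trait and type names, not the actual ε-serde code) of implementing a trait differently depending on another trait, via the selector's associated type:

```rust
// Type-level "enum": the two possible values of the associated type.
pub struct Zero;
pub struct Deep;

// Phase one: the selector trait with the associated type.
pub trait CopyType {
    type Copy;
}

// Phase two: a helper trait parameterized by the selector. The two blanket
// impls below implement *different* traits (KindHelper<Zero> is not
// KindHelper<Deep>), so the coherence rules are satisfied even though
// both range over all T.
pub trait KindHelper<S> {
    fn kind() -> &'static str;
}

impl<T: CopyType<Copy = Zero>> KindHelper<Zero> for T {
    fn kind() -> &'static str {
        "zero-copy"
    }
}

impl<T: CopyType<Copy = Deep>> KindHelper<Deep> for T {
    fn kind() -> &'static str {
        "deep-copy"
    }
}

// The user-facing trait dispatches through the selector's associated type.
pub trait Kind {
    fn kind() -> &'static str;
}

impl<T: CopyType + KindHelper<<T as CopyType>::Copy>> Kind for T {
    fn kind() -> &'static str {
        <T as KindHelper<<T as CopyType>::Copy>>::kind()
    }
}

// Plain data can be zero-copy; anything owning pointers must be deep-copy.
#[allow(dead_code)]
struct Point {
    x: u32,
    y: u32,
}
#[allow(dead_code)]
struct Tree(Vec<u32>);

impl CopyType for Point {
    type Copy = Zero;
}
impl CopyType for Tree {
    type Copy = Deep;
}

fn main() {
    assert_eq!(<Point as Kind>::kind(), "zero-copy");
    assert_eq!(<Tree as Kind>::kind(), "deep-copy");
    println!("ok");
}
```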

12:04.000 --> 12:09.000
Now, what we would like to write is this.

12:09.000 --> 12:14.000
Oh, this is, okay, my keyboard, okay.

12:14.000 --> 12:17.000
What we would like to write is this.

12:17.000 --> 12:21.000
I can write two different implementations of my trait,

12:21.000 --> 12:24.000
like this, for T, depending on whether T is zero-copy or deep-copy,

12:24.000 --> 12:27.000
and this will not compile.

12:27.000 --> 12:29.000
It used to, a while ago,

12:29.000 --> 12:30.000
but now it doesn't anymore;

12:30.000 --> 12:32.000
but with the small helper trait,

12:32.000 --> 12:33.000
you can make this work.

12:33.000 --> 12:35.000
So you can add disjoint implementations of a trait,

12:35.000 --> 12:37.000
depending on another trait.

12:37.000 --> 12:39.000
And this is how we manage the whole,

12:39.000 --> 12:43.000
something-is-deep-copy, something-is-zero-copy thing.

12:43.000 --> 12:45.000
Okay, if you are interested,

12:45.000 --> 12:49.000
there is a PR going on to make this work in the compiler,

12:49.000 --> 12:53.000
and the PR contains exactly the trait we're using here.

12:54.000 --> 12:56.000
So this is a very interesting technique,

12:56.000 --> 12:59.000
without which we would never have been able to write,

12:59.000 --> 13:01.000
for example, this.

13:01.000 --> 13:06.000
Okay, so what happens here is that you declare the structure

13:06.000 --> 13:09.000
with an ID, an integer, and some data.

13:09.000 --> 13:11.000
Okay, and you derive Epserde.

13:11.000 --> 13:14.000
Fortunately, there is a procedural macro.

13:14.000 --> 13:17.000
Now, you create a structure with a vector inside,

13:17.000 --> 13:20.000
and you serialize it somewhere.

13:20.000 --> 13:22.000
Okay, you store it, easy.

13:22.000 --> 13:26.000
Now, what you do is you load the serialized form

13:26.000 --> 13:28.000
into a buffer in memory.

13:28.000 --> 13:30.000
So now it's a piece of memory.

13:30.000 --> 13:33.000
Okay, and at that point, you can

13:33.000 --> 13:35.000
epsilon-deserialize the structure,

13:35.000 --> 13:38.000
and what you get is a structure that instead of a vector

13:38.000 --> 13:40.000
contains a reference to a slice,

13:40.000 --> 13:42.000
that lies in your serialized data.

13:42.000 --> 13:46.000
Okay, you can also do a full deserialization,

13:46.000 --> 13:48.000
like a normal serialization framework,

13:48.000 --> 13:50.000
and it will be very, very fast,

13:50.000 --> 13:53.000
because all zero-copy parts are just one read_exact.

13:53.000 --> 13:55.000
There's no looping,

13:55.000 --> 13:58.000
or you can even directly memory map it.

13:58.000 --> 14:01.000
And, okay, let me show what's happening here.

14:01.000 --> 14:04.000
You have your structure, phase one.

14:04.000 --> 14:08.000
The structure has an ID, okay.

14:08.000 --> 14:12.000
I don't know why there's like a two-second delay

14:12.000 --> 14:15.000
between me and the screen, okay, whatever.

14:15.000 --> 14:18.000
At construction time, you have your structure:

14:18.000 --> 14:21.000
there is an ID, a vector: a pointer,

14:21.000 --> 14:23.000
a length, a capacity, okay.

14:23.000 --> 14:26.000
I think this is pretty obvious.

14:26.000 --> 14:29.000
Then we serialize, and you just have the ID, the length

14:29.000 --> 14:32.000
and the data, plus a lot of headers, checks,

14:32.000 --> 14:34.000
some verification, okay.

14:34.000 --> 14:36.000
But this is the data we write.

14:36.000 --> 14:39.000
When you do the epsilon-deserialization,

14:39.000 --> 14:42.000
you get a structure that contains

14:42.000 --> 14:44.000
just a reference to the data.

14:44.000 --> 14:47.000
So you see, ID and len are effectively copied,

14:48.000 --> 14:50.000
because we allocate space for them.

14:50.000 --> 14:52.000
But the huge amount of data in the slice

14:52.000 --> 14:55.000
is not copied, it's just referenced.
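
A small, self-contained Rust sketch of this layout (illustrative only; `Doc`, `DocEps`, and the byte format are made up for the example, not ε-serde's actual format): the id and length (the epsilon part) are copied, and the payload is a reference into the serialized bytes.

```rust
use std::convert::TryInto;
use std::mem::size_of;

// The owned structure we serialize...
struct Doc {
    id: u64,
    data: Vec<u64>,
}

// ...and its epsilon-deserialized counterpart: the small fields are
// copied, while the big payload is a reference into the bytes.
struct DocEps<'a> {
    id: u64,
    data: &'a [u64],
}

// Serialize as: id (8 bytes), length (8 bytes), then the raw data.
fn serialize(doc: &Doc) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&doc.id.to_ne_bytes());
    out.extend_from_slice(&(doc.data.len() as u64).to_ne_bytes());
    for x in &doc.data {
        out.extend_from_slice(&x.to_ne_bytes());
    }
    out
}

// Copy id and len, then reinterpret the rest of the buffer in place.
fn deserialize_eps(bytes: &[u8]) -> DocEps<'_> {
    let id = u64::from_ne_bytes(bytes[0..8].try_into().unwrap());
    let len = u64::from_ne_bytes(bytes[8..16].try_into().unwrap()) as usize;
    let (pre, data, _) = unsafe { bytes[16..16 + 8 * len].align_to::<u64>() };
    assert!(pre.is_empty(), "the buffer must be 8-byte aligned");
    DocEps { id, data }
}

fn main() {
    let bytes = serialize(&Doc { id: 7, data: (0..1000).collect() });

    // Stand-in for a memory mapping (mmap memory is page-aligned anyway).
    let mut backing = vec![0u64; bytes.len() / size_of::<u64>()];
    unsafe { backing.align_to_mut::<u8>() }.1.copy_from_slice(&bytes);
    let mapped: &[u8] = unsafe { backing.align_to::<u8>() }.1;

    // Only 16 bytes are copied; the 8000-byte payload is just referenced.
    let eps = deserialize_eps(mapped);
    assert_eq!((eps.id, eps.data.len(), eps.data[999]), (7, 1000, 999));
    println!("ok");
}
```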

14:55.000 --> 14:56.000
Okay.

14:56.000 --> 14:59.000
This is basically what Facebook or Google does

14:59.000 --> 15:04.000
in C++, in a less safe manner, let's say.

15:04.000 --> 15:06.000
But for us, it worked very well,

15:06.000 --> 15:09.000
because now the serialized data is on disk,

15:09.000 --> 15:13.000
memory mapping takes nothing, it's just an mmap,

15:13.000 --> 15:15.000
and we only allocate that very small structure,

15:15.000 --> 15:18.000
the epsilon part of the structure.

15:18.000 --> 15:20.000
Okay.

15:20.000 --> 15:23.000
Of course, you have to write your methods,

15:23.000 --> 15:26.000
so they work both with a vector and a slice,

15:26.000 --> 15:28.000
but that's what everybody does,

15:28.000 --> 15:31.000
so that shouldn't be too difficult.

15:31.000 --> 15:33.000
And once you have things supporting

15:33.000 --> 15:36.000
ε-serde, you can put them as type parameters,

15:36.000 --> 15:40.000
and they will work flawlessly, recursively, up to the leaves.

15:40.000 --> 15:43.000
The only problem is that if you have a lot of fields,

15:43.000 --> 15:46.000
all of different types, and they all should support

15:46.000 --> 15:49.000
ε-serde, you will have a lot of

15:49.000 --> 15:51.000
type parameters.

15:51.000 --> 15:52.000
Sorry.

15:52.000 --> 15:53.000
Okay.

15:53.000 --> 15:54.000
So that's it.

15:54.000 --> 15:56.000
If you need to memory-map large structures,

15:56.000 --> 15:58.000
or deserialize things very quickly,

15:58.000 --> 15:59.000
have a look at ε-serde.

15:59.000 --> 16:01.000
I think we did a nice job.

16:01.000 --> 16:03.000
And notice that since the structure

16:03.000 --> 16:06.000
you deserialize is exactly the structure

16:06.000 --> 16:09.000
you serialize, you can basically

16:09.000 --> 16:11.000
exchange the two things without problems;

16:11.000 --> 16:13.000
it's not a different structure,

16:13.000 --> 16:16.000
and the impact on performance is zero,

16:16.000 --> 16:18.000
because it's exactly the same structure

16:18.000 --> 16:20.000
simply instead of a vector,

16:20.000 --> 16:22.000
you have a reference to a slice.

16:22.000 --> 16:24.000
Contrarily to the other crates.

16:24.000 --> 16:25.000
Okay.

16:25.000 --> 16:26.000
Very quickly,

16:26.000 --> 16:30.000
mem-dbg is a high-performance memory

16:30.000 --> 16:31.000
consumption detector,

16:31.000 --> 16:32.000
and the question is,

16:32.000 --> 16:35.000
how can a memory detector be slow?

16:35.000 --> 16:37.000
Very easy.

16:37.000 --> 16:40.000
These are some of the most common

16:40.000 --> 16:42.000
crates for memory detection.

16:42.000 --> 16:45.000
I mean, I want to know how much memory

16:45.000 --> 16:48.000
and object use is including the memory

16:48.000 --> 16:51.000
it references, not just its size_of.

16:51.000 --> 16:56.000
This is a very large, 2 GB HashMap being allocated,

16:56.000 --> 16:59.000
and the first three frameworks

16:59.000 --> 17:02.000
are what you can find now on crates.io.

17:02.000 --> 17:04.000
The point is that the measure will be

17:04.000 --> 17:05.000
off, for various reasons,

17:05.000 --> 17:07.000
but the point is that they take

17:07.000 --> 17:11.000
almost two-tenths of a second to establish the size,

17:11.000 --> 17:15.000
because they have to loop through two billion elements.

17:15.000 --> 17:18.000
mem-dbg leverages this disjoint-trait thing,

17:18.000 --> 17:20.000
and we'll give you the answer in,

17:20.000 --> 17:22.000
I mean, 200 nanoseconds, nothing.

17:22.000 --> 17:24.000
And if you start managing structures

17:24.000 --> 17:26.000
that are 100 gigabytes, waiting

17:26.000 --> 17:28.000
30 seconds to get an answer

17:28.000 --> 17:29.000
is not very nice.
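
The reason the answer can be instantaneous is worth a sketch (a toy trait, not mem-dbg's actual API): for containers of fixed-size elements, the total memory is a closed formula, so no loop is needed.

```rust
use std::mem::size_of;

// A toy version of the idea behind mem-dbg: report the total memory an
// object uses, *including* the memory it points to (unlike size_of).
trait MemSize {
    fn mem_size(&self) -> usize;
}

impl MemSize for u64 {
    fn mem_size(&self) -> usize {
        size_of::<u64>()
    }
}

// For a vector of fixed-size elements the answer is O(1): the stack part
// plus capacity times the element size. No loop over two billion entries.
impl MemSize for Vec<u64> {
    fn mem_size(&self) -> usize {
        size_of::<Self>() + self.capacity() * size_of::<u64>()
    }
}

// When the elements own memory themselves, there is no way around looping.
// The disjoint-trait technique is what lets the real crate pick the O(1)
// path automatically whenever the element type is zero-copy.
impl MemSize for Vec<Vec<u64>> {
    fn mem_size(&self) -> usize {
        size_of::<Self>() + self.iter().map(|v| v.mem_size()).sum::<usize>()
    }
}

fn main() {
    let v = vec![0u64; 1_000_000];
    // 24 bytes of (pointer, length, capacity) plus 8 MB of heap.
    assert_eq!(v.mem_size(), size_of::<Vec<u64>>() + v.capacity() * 8);
    println!("{} bytes", v.mem_size());
}
```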

17:29.000 --> 17:31.000
Also, we can print memory layouts,

17:31.000 --> 17:32.000
including padding,

17:32.000 --> 17:34.000
so you can have a complete breakdown

17:34.000 --> 17:36.000
of a data structure,

17:36.000 --> 17:39.000
knowing which part occupies more, which occupies less.

17:39.000 --> 17:42.000
We can even measure padding in enums, using

17:42.000 --> 17:45.000
the unstable offset_of feature for enums.

17:45.000 --> 17:48.000
So if you're interested in looking inside

17:48.000 --> 17:51.000
your data structures, this is what you use.

17:51.000 --> 17:53.000
Next, sux. I will skip a little bit on this,

17:53.000 --> 17:55.000
because nobody knows anything about

17:55.000 --> 17:56.000
succinct data structures,

17:56.000 --> 17:59.000
but let me just expose you to the idea.

17:59.000 --> 18:01.000
Succinct data structures:

18:01.000 --> 18:03.000
they use space close to the information-

18:03.000 --> 18:05.000
theoretical lower bound,

18:05.000 --> 18:08.000
but with operations that asymptotically

18:08.000 --> 18:10.000
cost the same as in standard structures.

18:10.000 --> 18:13.000
Let me just make an example.

18:13.000 --> 18:16.000
There are about four to the power of n binary trees

18:16.000 --> 18:18.000
on n nodes.

18:18.000 --> 18:21.000
So it should be possible to represent a

18:21.000 --> 18:25.000
binary tree using log of four to the n, that is, 2n bits.

18:25.000 --> 18:28.000
But we use 2n log n bits,

18:28.000 --> 18:30.000
because we have n nodes,

18:30.000 --> 18:32.000
and each node has two pointers.

18:32.000 --> 18:35.000
The two pointers are at least log n bits each.

18:35.000 --> 18:38.000
So Jacobson's representation

18:38.000 --> 18:40.000
uses 2n bits,

18:40.000 --> 18:43.000
And gives you parent and child in constant time.

18:43.000 --> 18:46.000
So you shave off a log n factor.
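
The counting argument can be checked numerically. Binary trees on n nodes are counted by the Catalan numbers, which grow like 4^n, so the information-theoretic bound is about log2(4^n) = 2n bits; a small sketch:

```rust
// Catalan numbers count binary trees on n nodes:
// C(0) = 1, C(m) = sum over i of C(i) * C(m - 1 - i).
fn catalan(n: usize) -> u128 {
    let mut c = vec![0u128; n + 1];
    c[0] = 1;
    for m in 1..=n {
        for i in 0..m {
            c[m] += c[i] * c[m - 1 - i];
        }
    }
    c[n]
}

fn main() {
    let n = 30;
    let trees = catalan(n);
    // The information-theoretic lower bound is log2(#trees) bits, which
    // approaches 2n, versus 2n * log2(n) bits for two pointers per node.
    let bits = 128 - trees.leading_zeros();
    assert!(bits as usize <= 2 * n);
    println!("{} trees on {} nodes: about {} bits (2n = {})", trees, n, bits, 2 * n);
}
```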

18:46.000 --> 18:49.000
This is very important because, oh,

18:49.000 --> 18:52.000
but, okay.

18:52.000 --> 18:54.000
Sorry, it's

18:54.000 --> 18:59.000
complete chaos: I was seeing things you were not seeing.

18:59.000 --> 19:02.000
So there is something really bad going on.

19:02.000 --> 19:06.000
I'm so sorry.

19:06.000 --> 19:08.000
Okay, now it's on the screen,

19:08.000 --> 19:10.000
but not on my screen.

19:10.000 --> 19:14.000
This never happened to me before with Keynote.

19:14.000 --> 19:17.000
I'll try to make it work. We represent everything

19:17.000 --> 19:19.000
internally using Elias-Fano.

19:19.000 --> 19:20.000
It indexes the graph.

19:20.000 --> 19:23.000
It's pervasive throughout the infrastructure.

19:23.000 --> 19:25.000
We also have other things, like sort of

19:25.000 --> 19:28.000
recompression, blah blah, but I'm going to skip them

19:28.000 --> 19:31.000
because I want to have some time for WebGraph.

19:31.000 --> 19:35.000
Okay, let me just skip this, because I don't think it's

19:35.000 --> 19:38.000
super interesting in this context.

19:38.000 --> 19:41.000
Let me just make one complaint.

19:41.000 --> 19:42.000
Okay.

19:42.000 --> 19:46.000
We have a comprehensive set of traits for indexed dictionaries.

19:46.000 --> 19:49.000
An indexed dictionary is a set of integers.

19:49.000 --> 19:52.000
You can ask for the i-th integer in increasing order.

19:52.000 --> 19:55.000
And you can do predecessor and successor.

19:55.000 --> 19:57.000
So least upper bound,

19:57.000 --> 20:00.000
greatest lower bound computations.

20:00.000 --> 20:03.000
Our main implementation is Elias-Fano.
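
As an illustration of the interface (plain binary search over a sorted vector, not the actual sux API or its Elias-Fano backing), the indexed-dictionary operations look like this:

```rust
// A sketch of an indexed dictionary over a plain sorted vector: get the
// i-th element, and predecessor/successor queries.
struct IndexedDict {
    sorted: Vec<u64>,
}

impl IndexedDict {
    // The i-th integer in increasing order.
    fn get(&self, i: usize) -> u64 {
        self.sorted[i]
    }

    // Largest element <= x (greatest lower bound), if any.
    fn pred(&self, x: u64) -> Option<u64> {
        match self.sorted.partition_point(|&v| v <= x) {
            0 => None,
            i => Some(self.sorted[i - 1]),
        }
    }

    // Smallest element >= x (least upper bound), if any.
    fn succ(&self, x: u64) -> Option<u64> {
        self.sorted
            .get(self.sorted.partition_point(|&v| v < x))
            .copied()
    }
}

fn main() {
    let d = IndexedDict { sorted: vec![3, 7, 20, 21, 100] };
    assert_eq!(d.get(2), 20);
    assert_eq!(d.pred(19), Some(7));
    assert_eq!(d.succ(22), Some(100));
    assert_eq!(d.pred(2), None);
    println!("ok");
}
```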

20:03.000 --> 20:09.000
The main problem we have is that there is no IndexGet in Rust.

20:09.000 --> 20:12.000
And there is a pr going on, but it's stalled.

20:12.000 --> 20:14.000
What I mean is that, as you know, it works like this:

20:14.000 --> 20:17.000
if you want to overload the square-bracket operator,

20:17.000 --> 20:19.000
you have to implement index.

20:19.000 --> 20:20.000
Okay.

20:20.000 --> 20:25.000
Index must return a reference.

20:25.000 --> 20:27.000
That's not good.

20:27.000 --> 20:30.000
Imagine you have to implement functionally a vector that

20:30.000 --> 20:33.000
returns i squared for index i.

20:33.000 --> 20:34.000
You can't.

20:34.000 --> 20:39.000
So in general, Rust and intensional representations do not work

20:39.000 --> 20:41.000
well together at the moment.

20:41.000 --> 20:44.000
If your representation of the data is extensional,

20:44.000 --> 20:46.000
so you actually have the data.

20:46.000 --> 20:51.000
Like in a binary tree of integers, the integers are in the tree.

20:51.000 --> 20:54.000
If you need to expose one of the integers,

20:54.000 --> 20:56.000
you can offer a reference.

20:56.000 --> 21:00.000
But anything that is implicit, succinct, compressed, anything

21:00.000 --> 21:04.000
that is intensional, in which the description of the data is

21:04.000 --> 21:06.000
in the data structure.

21:06.000 --> 21:08.000
But the data is not in the data structure.

21:08.000 --> 21:12.000
it's impossible to use any kind of overloading in Rust.

21:12.000 --> 21:15.000
We really hope someone at some point makes

21:15.000 --> 21:17.000
at least IndexGet work;

21:17.000 --> 21:21.000
that would be the ideal trait that overloads square brackets

21:21.000 --> 21:25.000
but returns a value, not a reference.
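
A sketch of what such a trait could look like (the name and shape here are hypothetical, loosely inspired by the stalled proposal; this is not a real std trait):

```rust
// Like std::ops::Index, but returning a *value*, so intensional containers
// (where the data is computed, not stored) could "overload" indexing too.
trait IndexGet<Idx> {
    type Output;
    fn index_get(&self, index: Idx) -> Self::Output;
}

// An intensional "vector": element i is i*i. Nothing is stored, so there
// is no stored usize that std::ops::Index could hand out a reference to.
struct Squares;

impl IndexGet<usize> for Squares {
    type Output = usize;
    fn index_get(&self, index: usize) -> usize {
        index * index
    }
}

fn main() {
    let s = Squares;
    assert_eq!(s.index_get(12), 144);
    println!("{}", s.index_get(12));
}
```

With the real `Index` trait this container is impossible to index, because `index` must return `&Self::Output`.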

21:25.000 --> 21:26.000
Okay.

21:26.000 --> 21:30.000
If you can push this through to anybody on the compiler team,

21:30.000 --> 21:32.000
that would be very nice for us.

21:32.000 --> 21:35.000
Or the other PR.

21:35.000 --> 21:37.000
Okay.

21:37.000 --> 21:40.000
Another component that might be useful is bit streams.

21:40.000 --> 21:44.000
So, bit streams let you read, instead of a stream of bytes,

21:44.000 --> 21:46.000
a stream of bits.

21:46.000 --> 21:48.000
Why would you want to do that?

21:48.000 --> 21:52.000
Well, because to do compression, a lot of compression happens

21:52.000 --> 21:56.000
in bit streams, even in JPEG and MPEG, whatever.

21:56.000 --> 22:00.000
Essentially, we have codes that let you write small integers with few bits

22:00.000 --> 22:03.000
and large integers with many bits.

22:03.000 --> 22:06.000
These are called instantaneous codes.

22:06.000 --> 22:11.000
And dsi-bitstream implements many of them.

22:11.000 --> 22:15.000
The main thing that is different from other crates: we read by word.

22:15.000 --> 22:18.000
This is like a major change in efficiency.

22:18.000 --> 22:21.000
So we don't read byte by byte and then decode.

22:21.000 --> 22:25.000
You can set your own read word,

22:25.000 --> 22:28.000
from 8 to 64 bits.

22:28.000 --> 22:31.000
And that speeds things up enormously.

22:31.000 --> 22:35.000
Decoding like this is very easy in Rust; in Java, impossible.

22:36.000 --> 22:41.000
We have a lot of instantaneous codes, and it's very easy to add yours.

22:41.000 --> 22:45.000
Like a gamma code, if you're familiar with that: we can read one in less than two

22:45.000 --> 22:50.000
nanoseconds, which is, believe me, pretty good.
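
For the unfamiliar, here is a minimal Elias gamma coder over a `Vec<bool>` used as a bit stream (a didactic sketch only: the real crate reads whole words and uses decoding tables, not one bit at a time):

```rust
// Elias gamma code for x >= 1: floor(log2(x)) zeros announce the length,
// followed by x written in binary (its leading bit is always 1).
fn gamma_encode(x: u64, out: &mut Vec<bool>) {
    assert!(x >= 1);
    let n = 63 - x.leading_zeros(); // floor(log2(x))
    for _ in 0..n {
        out.push(false); // unary length prefix
    }
    for i in (0..=n).rev() {
        out.push((x >> i) & 1 == 1); // binary part, MSB first
    }
}

fn gamma_decode(bits: &[bool], pos: &mut usize) -> u64 {
    // Count the leading zeros (the unary part)...
    let mut n = 0;
    while !bits[*pos] {
        n += 1;
        *pos += 1;
    }
    // ...then read n + 1 binary digits.
    let mut x = 0u64;
    for _ in 0..=n {
        x = (x << 1) | (bits[*pos] as u64);
        *pos += 1;
    }
    x
}

fn main() {
    let values = [1u64, 2, 3, 100, 1_000_000];
    let mut stream = Vec::new();
    for &v in &values {
        gamma_encode(v, &mut stream);
    }
    // 1 takes a single bit, 1_000_000 takes 39: small numbers are cheap.
    let mut pos = 0;
    let decoded: Vec<u64> = values.iter().map(|_| gamma_decode(&stream, &mut pos)).collect();
    assert_eq!(decoded, values);
    println!("{} values in {} bits", values.len(), stream.len());
}
```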

22:50.000 --> 22:55.000
You have two basic traits, BitRead and BitWrite, which, as you can imagine,

22:55.000 --> 22:59.000
can read bits and write bits, not surprisingly.

22:59.000 --> 23:02.000
And then we have extension traits for the codes.

23:02.000 --> 23:08.000
You can write your own extension traits for the codes you want, using the infrastructure.

23:08.000 --> 23:13.000
And we have implementations of a reader and a writer that are buffered,

23:13.000 --> 23:17.000
with a selectable read/write word.

23:17.000 --> 23:21.000
Okay, this is a little bit too much.

23:21.000 --> 23:27.000
But one observation: presently, the best read word is u32,

23:28.000 --> 23:32.000
and the best write word is u64, for implementation reasons.

23:32.000 --> 23:36.000
When you want to read bits and you want to have the codes in tables,

23:36.000 --> 23:39.000
you need to be able to peek bits in the stream.

23:39.000 --> 23:45.000
To do that, your internal bit buffer must be at least twice the word you're reading from the file.

23:45.000 --> 23:49.000
Usually, people read 8 bits at a time, so that is not an issue.

23:49.000 --> 23:52.000
But we are reading a word at a time.

23:52.000 --> 23:55.000
If we were to read 64 bit at a time,

23:55.000 --> 24:02.000
the internal buffer would be 128 bits, and 128-bit

24:02.000 --> 24:04.000
integers are very slow.

24:04.000 --> 24:09.000
Now, I guess in two or three years we'll be able to have a read word of u64,

24:09.000 --> 24:16.000
but for the time being, on Intel at least, 128-bit arithmetic is very slow.

24:16.000 --> 24:23.000
Okay, so if you're interested in bit streams, have a look at dsi-bitstream.

24:23.000 --> 24:26.000
Now, more interestingly, finally: WebGraph.

24:26.000 --> 24:29.000
What can it do for you?

24:29.000 --> 24:34.000
Okay, graphs are usually represented by bit streams.

24:34.000 --> 24:37.000
Obviously, otherwise you wouldn't need all those crates,

24:37.000 --> 24:40.000
and we use dsi-bitstream for the codes,

24:40.000 --> 24:43.000
and then we use sux for pointers into the bit stream.

24:43.000 --> 24:48.000
Because using Elias-Fano, we can store the pointers into the bit stream in very little space.

24:48.000 --> 24:51.000
That's a classical usage of Elias-Fano.
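
As an illustration of the idea (a didactic toy, not the actual sux code), Elias-Fano splits each sorted value into a few explicit low bits plus a high part stored in unary; real implementations add select structures on top for constant-time random access:

```rust
// A toy Elias-Fano representation of a sorted (monotone) list.
struct EliasFano {
    low_bits: u32,   // number of low bits kept per element
    low: Vec<u64>,   // the low halves (one u64 each, for simplicity)
    high: Vec<bool>, // the high halves, unary-coded
}

fn build(values: &[u64], universe: u64) -> EliasFano {
    let n = values.len() as u64;
    // Classic parameter choice: about log2(universe / n) low bits.
    let ratio = universe / n;
    let low_bits = if ratio >= 2 { 63 - ratio.leading_zeros() } else { 0 };
    let mut high = vec![false; (n + (universe >> low_bits) + 1) as usize];
    let mut low = Vec::new();
    for (i, &v) in values.iter().enumerate() {
        low.push(v & ((1u64 << low_bits) - 1));
        // The i-th element sets bit high(v) + i: unary-coded high parts.
        high[(v >> low_bits) as usize + i] = true;
    }
    EliasFano { low_bits, low, high }
}

fn decode(ef: &EliasFano) -> Vec<u64> {
    let mut out = Vec::new();
    let mut i = 0; // ones seen so far = index of the current element
    for (p, &bit) in ef.high.iter().enumerate() {
        if bit {
            out.push((((p - i) as u64) << ef.low_bits) | ef.low[i]);
            i += 1;
        }
    }
    out
}

fn main() {
    // Monotone "pointers" into a universe of 10_000 positions.
    let values: Vec<u64> = (0..100u64).map(|i| i * i).collect();
    let ef = build(&values, 10_000);
    assert_eq!(decode(&ef), values);
    // Space: n low halves of low_bits each, plus about 2n high bits, i.e.
    // roughly n * (2 + log2(universe / n)) bits in total.
    println!("low bits per element: {}", ef.low_bits);
}
```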

24:51.000 --> 24:57.000
Like, on an old Software Heritage graph with 34 billion nodes

24:57.000 --> 25:00.000
and more or less half a trillion arcs,

25:00.000 --> 25:02.000
a BFS completes in three hours.

25:02.000 --> 25:06.000
On standard hardware, I mean, half a terabyte of RAM, of course,

25:06.000 --> 25:08.000
but fairly standard hardware.

25:08.000 --> 25:11.000
And it's three times faster than it was in Java.

25:11.000 --> 25:15.000
We expected 50% more, maybe 100% more,

25:15.000 --> 25:18.000
three times faster was really nice.

25:18.000 --> 25:20.000
Consider that everything else is faster.

25:20.000 --> 25:23.000
I mean, in Java, for instance, when you have an array,

25:23.000 --> 25:26.000
with like 40 or 50 billion elements,

25:26.000 --> 25:29.000
you have to simulate it through many small arrays.

25:29.000 --> 25:31.000
That slows things down a lot.

25:31.000 --> 25:35.000
In Rust, all these things are fairly easy.

25:35.000 --> 25:41.000
What I also found, though it may be obvious: the ergonomics of the API

25:41.000 --> 25:44.000
are unbelievably better with respect to Java,

25:44.000 --> 25:47.000
and just to let you have a look,

25:47.000 --> 25:51.000
So: graphs have N nodes, numbered from zero to N, period.

25:51.000 --> 25:55.000
There is no specific data structure for nodes or arcs.

25:55.000 --> 25:57.000
Everything is done with integers.

25:57.000 --> 25:59.000
If you have your own index space,

25:59.000 --> 26:02.000
you have to bijectively map it into zero to N.

26:02.000 --> 26:03.000
It's your job.

26:03.000 --> 26:08.000
We work with indices in a compact space, zero to N.

26:08.000 --> 26:12.000
And when you want to access the graph structure,

26:12.000 --> 26:13.000
you get successors.

26:13.000 --> 26:17.000
So you take a node and you know the successors of the node.

26:17.000 --> 26:20.000
The other nodes that are connected to it by an arc.

26:20.000 --> 26:23.000
So there is no data structure,

26:23.000 --> 26:28.000
materializing a node or an edge, an arc, however you call it.

26:28.000 --> 26:30.000
Everything is done with usize.

26:30.000 --> 26:36.000
And this is what you do if you want to work with a graph with half a trillion edges.

26:36.000 --> 26:41.000
Otherwise, anything you use to represent sets of edges, sets of nodes,

26:41.000 --> 26:46.000
stacks, becomes immense, because these structures

26:46.000 --> 26:49.000
representing nodes or edges usually

26:49.000 --> 26:54.000
use much more space than just a usize.

26:54.000 --> 26:57.000
I should look there.

26:57.000 --> 27:00.000
So how does compression happen?

27:00.000 --> 27:02.000
These are complicated topics, but just as an idea,

27:02.000 --> 27:03.000
gap compression.

27:03.000 --> 27:05.000
So if you have a list of successors,

27:05.000 --> 27:08.000
you take the gaps, the differences.

27:08.000 --> 27:11.000
These are small numbers, because they are just gaps.

27:11.000 --> 27:13.000
And you use instantaneous codes.

27:13.000 --> 27:16.000
If you're familiar with inverted indices,

27:16.000 --> 27:20.000
I mean, these have been standard techniques for 30 years; this is how you do it.
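
A toy version of that pipeline, assuming Elias gamma as the instantaneous code (the actual dsi-bitstream codecs are far more optimized and write real bit streams, not Vec<bool>):

```rust
// Elias gamma-encode x >= 1: a unary length prefix, then the binary digits.
fn gamma_encode(bits: &mut Vec<bool>, x: u64) {
    let n = 64 - x.leading_zeros() as u64; // number of binary digits of x
    for _ in 1..n {
        bits.push(false); // n - 1 zeros announce the length
    }
    for i in (0..n).rev() {
        bits.push((x >> i) & 1 == 1); // the digits themselves, MSB first
    }
}

fn gamma_decode(bits: &[bool], pos: &mut usize) -> u64 {
    let mut n = 1;
    while !bits[*pos] {
        n += 1;
        *pos += 1;
    }
    let mut x = 0u64;
    for _ in 0..n {
        x = (x << 1) | bits[*pos] as u64;
        *pos += 1;
    }
    x
}

// Store a strictly increasing successor list as gamma-coded gaps
// (+1 because gamma cannot encode zero and the first gap may be zero).
fn encode_successors(succ: &[u64]) -> Vec<bool> {
    let mut bits = Vec::new();
    let mut prev = 0;
    for &s in succ {
        gamma_encode(&mut bits, s - prev + 1);
        prev = s;
    }
    bits
}

fn decode_successors(bits: &[bool], count: usize) -> Vec<u64> {
    let mut pos = 0;
    let mut prev = 0u64;
    let mut out = Vec::new();
    for _ in 0..count {
        prev += gamma_decode(bits, &mut pos) - 1;
        out.push(prev);
    }
    out
}
```

Because gaps in a sorted successor list are small, the gamma codes are short, and the list shrinks well below 64 bits per successor.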

27:20.000 --> 27:24.000
But with graphs, we can do a little bit more because these large graphs

27:24.000 --> 27:27.000
are often very redundant.

27:27.000 --> 27:29.000
Like, the successor list of a node

27:29.000 --> 27:32.000
is very similar to the successor list of another node.

27:32.000 --> 27:37.000
Imagine a web snapshot, two pages from the same level of a website.

27:37.000 --> 27:40.000
They have very similar links.

27:40.000 --> 27:44.000
So what we do is compress by reference.

27:44.000 --> 27:49.000
So instead of storing my successor list, my outlinks, entirely,

27:49.000 --> 27:52.000
First, I check if I can copy most of them from another node,

27:52.000 --> 27:56.000
that I already stored, and then I store the difference.
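
A minimal sketch of the copy-plus-difference idea, under the assumption of a simple copy mask over the reference list (WebGraph's actual on-disk format uses copy blocks and further tricks; the function names here are illustrative):

```rust
// Compress `list` against `reference`: a bool per reference successor
// saying "copy this one", plus the leftovers stored explicitly.
fn compress_by_reference(reference: &[u64], list: &[u64]) -> (Vec<bool>, Vec<u64>) {
    let mask: Vec<bool> = reference.iter().map(|x| list.contains(x)).collect();
    let extras: Vec<u64> = list
        .iter()
        .copied()
        .filter(|x| !reference.contains(x))
        .collect();
    (mask, extras)
}

// Rebuild the original list: copy the masked successors, add the extras.
fn decompress(reference: &[u64], mask: &[bool], extras: &[u64]) -> Vec<u64> {
    let mut out: Vec<u64> = reference
        .iter()
        .zip(mask)
        .filter(|(_, &m)| m)
        .map(|(&x, _)| x)
        .collect();
    out.extend_from_slice(extras);
    out.sort_unstable();
    out
}
```

When two lists share most successors, the mask plus a few extras is far smaller than re-encoding the whole list.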

27:56.000 --> 27:59.000
This is a major game changer

27:59.000 --> 28:02.000
in particular in web snapshots,

28:02.000 --> 28:06.000
where you can easily get to one, one and a half bits per link.

28:06.000 --> 28:08.000
Instead of 64.

28:08.000 --> 28:12.000
Because web snapshots are impressively redundant.

28:12.000 --> 28:14.000
You can also intervalize.

28:14.000 --> 28:16.000
So if you have consecutive successors,

28:16.000 --> 28:20.000
maybe you can store them as intervals.

28:20.000 --> 28:24.000
We have labels on edges.

28:24.000 --> 28:27.000
And the way we do it is very elegant.

28:27.000 --> 28:30.000
I can say that because it's Tommaso's idea, not mine.

28:30.000 --> 28:32.000
So I can say a lot of nice things about that.

28:32.000 --> 28:38.000
So a labeling is like a graph that, instead of returning successors,

28:38.000 --> 28:39.000
returns labels.

28:39.000 --> 28:42.000
And then we have a zip operator: like with iterators,

28:42.000 --> 28:44.000
you can zip your graph with your labeling,

28:44.000 --> 28:46.000
and now you have a labelled graph.

28:46.000 --> 28:49.000
And this is very nice, because if you have several sets of labels,

28:49.000 --> 28:51.000
you can zip them and create your own custom labels,

28:51.000 --> 28:56.000
zip it with a graph, and get your custom labeling without any programming.

28:56.000 --> 28:59.000
You just have to write zip this and this,

28:59.000 --> 29:01.000
and you will get a labelled graph.
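
The zip idea, sketched with plain iterators, assuming both sides enumerate a node's arcs in the same order (the hypothetical `labelled_successors` below is not the real WebGraph API, just the principle):

```rust
// Zip a node's successor list with a parallel list of labels:
// the result enumerates labelled arcs, with no new access code written.
fn labelled_successors<'a, L: Copy>(
    succ: &'a [u64],
    labels: &'a [L],
) -> impl Iterator<Item = (u64, L)> + 'a {
    succ.iter().copied().zip(labels.iter().copied())
}
```

Because zipping is generic in the label type, zipping a labeling with another labeling (to build a custom compound label) works the same way as zipping a labeling with a graph.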

29:01.000 --> 29:04.000
The whole thing is based on lenders.

29:04.000 --> 29:09.000
And these are the ten minutes of the talk that will be pure Rust problems.

29:09.000 --> 29:12.000
So it took us a while to get this working.

29:12.000 --> 29:16.000
So, do you know what a lender is in Rust,

29:16.000 --> 29:18.000
or a lending iterator?

29:18.000 --> 29:20.000
Anybody? Never heard of it?

29:20.000 --> 29:21.000
Okay.

29:21.000 --> 29:23.000
Okay, iterators give you things.

29:23.000 --> 29:25.000
You know what an iterator is, right?

29:25.000 --> 29:28.000
Okay, iterators give you things, right?

29:28.000 --> 29:32.000
And you can take two things from the iterator and do things

29:32.000 --> 29:35.000
with the two things you got from the iterator.

29:35.000 --> 29:37.000
With a lender, you can't.

29:37.000 --> 29:41.000
With a lender, you take a thing, you do something with the thing,

29:41.000 --> 29:45.000
you throw it away, and then you can take another thing.

29:45.000 --> 29:50.000
Okay, lenders are iterators that have relevant internal state.

29:50.000 --> 29:54.000
They cannot provide you with two items at the same time.

29:55.000 --> 29:58.000
For instance, because they go over a file,

29:58.000 --> 30:02.000
and they move the pointer as they return items.
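
A lending iterator can be sketched with a plain generic associated type (illustrative names, not the lender crate's actual definition; the talk explains later why this plain version runs into trouble):

```rust
// A lending iterator: each item borrows exclusively from the lender,
// so the previous item must be dropped before `next` can be called again.
trait LendingIterator {
    type Item<'a>
    where
        Self: 'a;
    fn next(&mut self) -> Option<Self::Item<'_>>;
}

// A lender that reuses one internal scratch buffer, like a reader
// moving a pointer over a file: two live items would alias the buffer.
struct LineLender {
    data: Vec<String>,
    i: usize,
    scratch: String,
}

impl LendingIterator for LineLender {
    type Item<'a>
        = &'a str
    where
        Self: 'a;

    fn next(&mut self) -> Option<&str> {
        if self.i >= self.data.len() {
            return None;
        }
        self.scratch.clear();
        self.scratch.push_str(&self.data[self.i]);
        self.i += 1;
        Some(&self.scratch) // borrows self exclusively until dropped
    }
}
```

Holding the result of one `next` call while calling `next` again is rejected by the borrow checker, which is exactly the guarantee a bit-stream-backed successor iterator needs.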

30:02.000 --> 30:04.000
Okay.

30:04.000 --> 30:06.000
Now imagine a situation.

30:06.000 --> 30:09.000
We are returning an iterator on the graph,

30:09.000 --> 30:12.000
and we are scanning a bit stream.

30:12.000 --> 30:14.000
We cannot use an iterator.

30:14.000 --> 30:17.000
I mean, we return a lender that we call an iterator,

30:17.000 --> 30:19.000
but it really is a lender.

30:20.000 --> 30:23.000
And these computations are complicated.

30:23.000 --> 30:25.000
Okay, let me just skip this.

30:25.000 --> 30:28.000
Let me just go to the traits,

30:28.000 --> 30:30.000
because we don't have much time.

30:30.000 --> 30:32.000
I thought I would have been faster.

30:32.000 --> 30:36.000
So the basic trait you access is the sequential labeling.

30:36.000 --> 30:39.000
So you have labels, and you have a lender

30:39.000 --> 30:43.000
of node labels; the lender is the thing I'm going to talk about later.

30:43.000 --> 30:47.000
It's something that gives you nodes and an iterator over their successors.

30:47.000 --> 30:50.000
So you iterate on a graph, and you get node zero,

30:50.000 --> 30:54.000
with its successors, node one with its successors, and so on.

30:54.000 --> 30:56.000
And these are labelings.

30:56.000 --> 30:59.000
A graph is just a labeling with usize labels.

30:59.000 --> 31:00.000
That's trait one.

31:00.000 --> 31:02.000
Trait two: random access.

31:02.000 --> 31:03.000
You want random access.

31:03.000 --> 31:06.000
So what you do is something like this.

31:06.000 --> 31:12.000
And you get a labels method that tells you what the labels are,

31:12.000 --> 31:14.000
and an out-degree method.

31:14.000 --> 31:19.000
And then a random-access graph is a random-access labeling with usize labels.

31:19.000 --> 31:24.000
If you have an actual labeling and you zip it, then you get a labelled graph.

31:24.000 --> 31:26.000
So everything is a labeling.

31:26.000 --> 31:29.000
If your labeling returns usize, you get a graph.

31:29.000 --> 31:30.000
You can zip labelings together.

31:30.000 --> 31:35.000
If you zip together a graph and a labeling, you have a labelled graph.
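
A hedged sketch of how those traits might look, with illustrative names and a boxed iterator where the real crate uses lenders (this is not the webgraph-rs API, only the "graph = labeling with usize labels" idea):

```rust
// A sequential labeling: for each node, in order, its list of labels.
trait SequentialLabeling {
    type Label;
    // Boxed for simplicity; the real crate returns lenders to avoid
    // allocating a Vec per node.
    fn iter(&self) -> Box<dyn Iterator<Item = (usize, Vec<Self::Label>)> + '_>;
}

// A graph is just a labeling whose labels are node indices.
struct VecGraph {
    succ: Vec<Vec<usize>>,
}

impl SequentialLabeling for VecGraph {
    type Label = usize;
    fn iter(&self) -> Box<dyn Iterator<Item = (usize, Vec<usize>)> + '_> {
        Box::new(self.succ.iter().cloned().enumerate())
    }
}
```

Anything generic over `SequentialLabeling` then works unchanged on graphs, label sets, and zipped combinations, which is the Lego-like composability described above.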

31:35.000 --> 31:38.000
But everything is composable, like Lego.

31:38.000 --> 31:43.000
Now, the main thing to make this work properly is the infamous lender.

31:43.000 --> 31:47.000
Okay, let me just explain the idea.

31:47.000 --> 31:52.000
GATs: do you know what a generic associated type is?

31:52.000 --> 31:55.000
Raise your hand if you know what a generic associated type is.

31:55.000 --> 31:56.000
Okay, someone knows.

31:56.000 --> 31:57.000
Okay.

31:57.000 --> 32:03.000
If you look back at when generic associated types were introduced,

32:03.000 --> 32:06.000
this example was everywhere.

32:06.000 --> 32:08.000
So this is a lender.

32:08.000 --> 32:12.000
So you see, the item has a lifetime.

32:12.000 --> 32:16.000
Sorry, this should be an 'a, that is, a typo in the slide.

32:16.000 --> 32:20.000
Now what happens is that when you get an element from the lender,

32:20.000 --> 32:23.000
it borrows exclusively from self.

32:23.000 --> 32:27.000
So until the item you got goes out of scope,

32:27.000 --> 32:30.000
you cannot call next, which is exactly what you want.

32:30.000 --> 32:33.000
You want an iterator that gives you things.

32:33.000 --> 32:37.000
But until you get rid of the thing, you cannot get another thing.

32:37.000 --> 32:38.000
Okay.

32:38.000 --> 32:41.000
That seems very nice, and it totally doesn't work.

32:41.000 --> 32:42.000
Why?

32:42.000 --> 32:49.000
Because as soon as you start to use it, you need to pass a lender to a function.

32:49.000 --> 32:56.000
And if you want to constrain the type of the item, you have to write a condition,

32:56.000 --> 32:59.000
a trait bound on the item, with a lifetime.

32:59.000 --> 33:04.000
The only way to do it is with a for<'a>, a higher-rank trait bound.

33:04.000 --> 33:08.000
And at that point that for<'a> can be instantiated with 'static,

33:08.000 --> 33:11.000
which means that the lender must be 'static.

33:11.000 --> 33:16.000
So the only way to use this is with a 'static lender.

33:16.000 --> 33:18.000
And of course, our lenders are not 'static.

33:18.000 --> 33:20.000
We need lenders that give you the graph,

33:20.000 --> 33:25.000
and they read the graph from another source to which they hold a reference.

33:25.000 --> 33:30.000
So Sabrina Jewson came up with this idea, which actually works.

33:30.000 --> 33:38.000
And the idea is, if I can make it work, because sorry guys,

33:38.000 --> 33:40.000
that's okay.

33:40.000 --> 33:44.000
I'll wait three seconds, and maybe this will happen.

33:44.000 --> 33:46.000
Okay.

33:46.000 --> 33:52.000
Okay, this is only for very interested people.

33:52.000 --> 33:56.000
So here what happens is that we have a lender trait.

33:56.000 --> 34:02.000
There is a type parameter with a default value.

34:02.000 --> 34:04.000
You will never write this.

34:04.000 --> 34:06.000
There is a reference to self.

34:06.000 --> 34:10.000
And it ties the thing you will lend, the item, to it, with a dependency on it.

34:10.000 --> 34:14.000
Now, what we write here to create the lender trait

34:14.000 --> 34:19.000
is this for<'lend> Lending bound, next, blah, blah, like before.

34:19.000 --> 34:23.000
What happens now, the interesting part is that the compiler,

34:23.000 --> 34:27.000
sees a higher-rank trait bound for<'a>,

34:27.000 --> 34:33.000
but because of the reference, there is an implicit where Self: 'a bound.

34:33.000 --> 34:34.000
Okay.

34:34.000 --> 34:39.000
So implicitly, we're saying that this lifetime is outlived by Self.

34:39.000 --> 34:42.000
And it would be very nice to have this syntax.

34:42.000 --> 34:47.000
One day, maybe we will have this syntax, and all this will be easier.

34:47.000 --> 34:51.000
But this is the way you create a higher-rank trait bound

34:51.000 --> 34:54.000
with a limitation on the lifetime.

34:54.000 --> 34:57.000
Once you do this, everything works.
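
A compilable miniature of the trick, with illustrative names (the real lender crate is more elaborate): the default type parameter `&'lend Self` smuggles in the implicit `Self: 'lend` bound, so the `for<'lend>` supertrait bound no longer forces `'static`.

```rust
// The default parameter `&'lend Self` carries an implicit `Self: 'lend`
// bound into every use of the trait, including higher-rank ones.
trait Lending<'lend, ImplicitBound = &'lend Self> {
    type Lend;
}

trait Lender: for<'lend> Lending<'lend> {
    fn next(&mut self) -> Option<<Self as Lending<'_>>::Lend>;
}

// A lender handing out overlapping windows into its own buffer:
// each returned slice borrows `self` exclusively.
struct Windows {
    buf: Vec<u32>,
    pos: usize,
}

impl<'lend> Lending<'lend> for Windows {
    type Lend = &'lend [u32];
}

impl Lender for Windows {
    fn next(&mut self) -> Option<&[u32]> {
        if self.pos + 2 > self.buf.len() {
            return None;
        }
        let w = &self.buf[self.pos..self.pos + 2];
        self.pos += 1;
        Some(w)
    }
}

// A generic function constraining the lent item with a lifetime:
// this is the bound that, with a plain GAT, would demand 'static.
fn sum_first<L>(mut l: L) -> u32
where
    L: Lender + for<'a> Lending<'a, Lend = &'a [u32]>,
{
    let mut s = 0;
    while let Some(w) = l.next() {
        s += w[0]; // w is dropped before the next call to next()
    }
    s
}
```

The `sum_first` bound is exactly the "condition on the type of the item with a lifetime" mentioned above, and with the implicit-bound trick it accepts non-'static lenders too.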

34:57.000 --> 35:02.000
And you can pass the lender to every function you want,

35:02.000 --> 35:05.000
just writing, yeah, perfect.

35:05.000 --> 35:08.000
Just writing that. Three minutes now?

35:08.000 --> 35:10.000
I'm over anyway.

35:10.000 --> 35:13.000
Anyway, this is, okay, this is an interesting technique.

35:13.000 --> 35:16.000
If you need to do lenders, have a look at Sabrina Jewson's blog,

35:16.000 --> 35:20.000
or at the lender crate, which we are maintaining now,

35:20.000 --> 35:24.000
because the original author is no longer interested.

35:24.000 --> 35:27.000
We have a slightly more complicated problem,

35:27.000 --> 35:32.000
because we need the iterator,

35:32.000 --> 35:35.000
returned by the lender to reference the data.

35:35.000 --> 35:37.000
So we need something horrible like this,

35:37.000 --> 35:39.000
that I'm not going to explain.

35:39.000 --> 35:42.000
But thanks to Quindot and the Rust language forum,

35:42.000 --> 35:44.000
we were able to write this, which is horrible.

35:44.000 --> 35:49.000
I never remember why it works, but it actually works and does the job.

35:49.000 --> 35:52.000
Okay, performance, amazing.

35:52.000 --> 35:56.000
These are various kinds of graphs, with their sizes.

35:56.000 --> 36:00.000
And you can see the speedup of Rust with respect to the Java implementation,

36:00.000 --> 36:02.000
in the order of two, three times.

36:02.000 --> 36:05.000
They are fairly small except the last one,

36:05.000 --> 36:06.000
because they are our graphs.

36:06.000 --> 36:10.000
We stopped doing crawls in 2015,

36:10.000 --> 36:12.000
because Common Crawl was taking the lead,

36:12.000 --> 36:15.000
and it was not really necessary to continue this activity.

36:15.000 --> 36:17.000
But anyway, conclusion.

36:18.000 --> 36:22.000
A fantastic journey, since April 2023.

36:22.000 --> 36:26.000
It was very surprising that we could do all this

36:26.000 --> 36:30.000
on an enormous codebase developed over 20 years.

36:30.000 --> 36:32.000
A graph traversal arriving.

36:32.000 --> 36:35.000
I would like to thank Valentin Lorentz

36:35.000 --> 36:38.000
at Software Heritage, who suggested a lot of improvements.

36:38.000 --> 36:42.000
Could and generally was our guinea pig for everything.

36:42.000 --> 36:44.000
The Rust language forum,

36:44.000 --> 36:46.000
without which none of this would have happened.

36:46.000 --> 36:48.000
I mean, we've got so many suggestions

36:48.000 --> 36:50.000
that we would never have thought about,

36:50.000 --> 36:52.000
that I cannot even count them.

36:52.000 --> 36:56.000
And my students, who also wrote a lot of code.

36:56.000 --> 36:58.000
And thank you, if you have any questions.

36:58.000 --> 36:59.000
The talk is all right.

37:00.000 --> 37:01.000
Thank you.

37:07.000 --> 37:09.000
Thirty seconds for questions.

37:15.000 --> 37:16.000
Hi.

37:16.000 --> 37:18.000
So you have this huge data set,

37:18.000 --> 37:20.000
221 gigabytes.

37:20.000 --> 37:22.000
Yeah, I know.

37:24.000 --> 37:27.000
Was your thing bit compatible

37:27.000 --> 37:30.000
with the Java version, or was it a different

37:30.000 --> 37:33.000
in-memory representation?

37:38.000 --> 37:40.000
I'm also probably a little bit dangerous.

37:42.000 --> 37:44.000
Oh, yeah.

37:44.000 --> 37:48.000
The Rust version is totally compatible with the Java one.

37:48.000 --> 37:51.000
It generates the same bit streams, bit by bit.

