WEBVTT

00:00.000 --> 00:05.000
That is...

00:05.000 --> 00:10.000
Yeah, it should be...

00:10.000 --> 00:15.000
Did you have a book in my book?

00:15.000 --> 00:17.000
Yeah, okay.

00:17.000 --> 00:19.000
I have a book in my book.

00:19.000 --> 00:21.000
I'm going to give you a book.

00:21.000 --> 00:24.000
It's fine.

00:24.000 --> 00:26.000
It's fine.

00:27.000 --> 00:31.000
Yeah, we can do downstairs.

00:35.000 --> 00:37.000
Awesome.

00:43.000 --> 00:44.000
Okay.

00:44.000 --> 00:46.000
I'm going to just...

00:46.000 --> 00:49.000
It's on my website, so I'll just grab it.

00:49.000 --> 00:50.000
I'll just grab it.

00:50.000 --> 00:51.000
It doesn't change much.

00:51.000 --> 00:53.000
Let's try it on this one.

00:54.000 --> 00:58.000
And the worst case is I can also just set things up.

00:58.000 --> 00:59.000
I don't know.

00:59.000 --> 01:00.000
Yeah.

01:00.000 --> 01:01.000
We'll get it.

01:01.000 --> 01:03.000
In my initial bit.

01:03.000 --> 01:04.000
It's the easiest.

01:04.000 --> 01:06.000
The easiest is the right book in just.

01:06.000 --> 01:08.000
I'll put it in the first book.

01:08.000 --> 01:09.000
No, I am the easiest.

01:09.000 --> 01:10.000
Yeah, he's up.

01:10.000 --> 01:12.000
We're in just need to figure out.

01:12.000 --> 01:13.000
Okay.

01:13.000 --> 01:16.000
I couldn't figure out.

01:16.000 --> 01:18.000
It turns out...

01:18.000 --> 01:20.000
It's very easy.

01:21.000 --> 01:23.000
It looks complicated.

01:23.000 --> 01:24.000
It looks complicated.

01:24.000 --> 01:25.000
Yes.

01:25.000 --> 01:26.000
Yes.

01:26.000 --> 01:27.000
What is this?

01:27.000 --> 01:28.000
Yeah.

01:32.000 --> 01:35.000
I was about to get to the best part of the presentation.

01:35.000 --> 01:36.000
And I run out.

01:36.000 --> 01:37.000
Yeah.

01:37.000 --> 01:38.000
Because I was going to...

01:38.000 --> 01:39.000
You know, create a namespace.

01:39.000 --> 01:41.000
And then it was going to un-mount the root.

01:41.000 --> 01:42.000
And everybody was going to like...

01:42.000 --> 01:43.000
What?

01:45.000 --> 01:46.000
Why is this a show?

01:46.000 --> 01:47.000
Wait, you can un-mount it.

01:47.000 --> 01:48.000
Root?

01:48.000 --> 01:49.000
Yes.

01:49.000 --> 01:50.000
It's going to work.

01:50.000 --> 01:51.000
That's the best thing about it.

01:51.000 --> 01:52.000
How?

01:52.000 --> 01:57.000
Because since you already have your binary directory mounted,

01:57.000 --> 02:00.000
then you can still issue commands.

02:00.000 --> 02:02.000
It's a thing about it.

02:02.000 --> 02:04.000
It's a different...

02:04.000 --> 02:05.000
Yeah.

02:05.000 --> 02:06.000
How does it help you out?

02:06.000 --> 02:07.000
I'll do that.

02:07.000 --> 02:08.000
I'll do that.

02:08.000 --> 02:09.000
I'll do that.

02:09.000 --> 02:10.000
I'll do that.

02:10.000 --> 02:11.000
I'll do that.

02:11.000 --> 02:12.000
I'll do that.

02:12.000 --> 02:13.000
I'll do that.

02:17.000 --> 02:18.000
Can I leave this with you?

02:18.000 --> 02:19.000
And you can just...

02:19.000 --> 02:20.000
Yeah.

02:20.000 --> 02:21.000
Absolutely.

02:21.000 --> 02:22.000
Just got to make sure...

02:24.000 --> 02:25.000
What did you do?

02:33.000 --> 02:34.000
It is.

02:34.000 --> 02:35.000
All right.

02:35.000 --> 02:36.000
Figure it out.

02:36.000 --> 02:37.000
Yeah.

02:37.000 --> 02:38.000
There we go.

02:38.000 --> 02:39.000
OK.

02:39.000 --> 02:40.000
I'll do that.

02:40.000 --> 02:41.000
Here you go.

02:41.000 --> 02:42.000
OK.

02:42.000 --> 02:43.000
All right.

02:43.000 --> 02:45.000
We'll have this on your belt.

02:45.000 --> 02:46.000
It's possible.

02:46.000 --> 02:47.000
It's here.

02:47.000 --> 02:48.000
Yeah.

02:48.000 --> 02:49.000
It's a...

02:49.000 --> 02:50.000
Post-M2026.

02:50.000 --> 02:51.000
Keep yourself.

02:51.000 --> 02:52.000
OK.

02:52.000 --> 02:53.000
OK.

02:53.000 --> 02:54.000
I'll do that.

02:54.000 --> 02:55.000
OK.

02:55.000 --> 02:56.000
OK.

02:56.000 --> 02:57.000
OK.

02:57.000 --> 02:58.000
OK.

02:58.000 --> 02:59.000
OK.

02:59.000 --> 03:00.000
OK.

03:00.000 --> 03:01.000
OK.

03:01.000 --> 03:02.000
OK.

03:02.000 --> 03:03.000
OK.

03:03.000 --> 03:04.000
OK.

03:04.000 --> 03:05.000
OK.

03:05.000 --> 03:06.000
OK.

03:06.000 --> 03:07.000
OK.

03:07.000 --> 03:08.000
OK.

03:08.000 --> 03:09.000
OK.

03:09.000 --> 03:10.000
OK.

03:10.000 --> 03:11.000
OK.

03:11.000 --> 03:12.000
OK.

03:12.000 --> 03:13.000
OK.

03:13.000 --> 03:14.000
OK.

03:14.000 --> 03:15.000
OK.

03:15.000 --> 03:16.000
OK.

03:17.000 --> 03:18.000
Yep.

03:18.000 --> 03:19.000
OK.

03:19.000 --> 03:20.000
Yeah.

03:20.000 --> 03:21.000
OK.

03:21.000 --> 03:22.000
OK.

03:22.000 --> 03:23.000
OK.

03:23.000 --> 03:24.000
OK.

03:24.000 --> 03:25.000
OK.

03:25.000 --> 03:26.000
OK.

03:26.000 --> 03:27.000
OK.

03:27.000 --> 03:28.000
OK.

03:28.000 --> 03:29.000
OK.

03:29.000 --> 03:30.000
OK.

03:30.000 --> 03:31.000
OK.

03:31.000 --> 03:32.000
OK.

03:32.000 --> 03:33.000
OK.

03:33.000 --> 03:34.000
OK.

03:34.000 --> 03:35.000
OK.

03:35.000 --> 03:36.000
OK.

03:36.000 --> 03:37.000
OK.

03:37.000 --> 03:38.000
OK.

03:38.000 --> 03:39.000
OK.

03:39.000 --> 03:40.000
OK.

03:40.000 --> 03:41.000
OK.

03:41.000 --> 03:43.000
Oh.

03:43.000 --> 03:44.000
OK.

03:44.000 --> 03:45.000
OK.

03:45.000 --> 03:52.000
I need to show you how I can get to my information.

03:52.000 --> 03:54.000
Okay.

03:56.000 --> 03:57.000
Oh.

04:04.000 --> 04:08.000
Yeah, you've missed him slash in the end off.

04:08.000 --> 04:09.000
Yeah.

04:10.000 --> 04:12.000
Fast in 2020, 60GZ.

04:12.000 --> 04:13.000
That's it.

04:13.000 --> 04:14.000
Just untara.

04:16.000 --> 04:18.000
The second file in the listing.

04:18.000 --> 04:20.000
Fast in 2020, 60GZ.

04:20.000 --> 04:22.000
That's the, yeah.

04:30.000 --> 04:31.000
Okay.

04:32.000 --> 04:33.000
You can handle it?

04:33.000 --> 04:34.000
I guess.

04:34.000 --> 04:35.000
Let's see.

04:35.000 --> 04:37.000
I'm going to just...

04:37.000 --> 04:38.000
Oh.

04:39.000 --> 04:52.000
Well, no, there is a...

04:52.000 --> 04:53.000
Yeah.

04:59.000 --> 05:00.000
I'm just going to do it.

05:00.000 --> 05:02.000
It's just a bunch of text files.

05:02.000 --> 05:03.000
It'll be fine.

05:09.000 --> 05:10.000
Right.

05:10.000 --> 05:11.000
Wait.

05:15.000 --> 05:18.000
That guy's very inferior.

05:18.000 --> 05:46.720
There's no matter what was the punch you had.

05:47.720 --> 05:50.720
You can try it with Suga.

05:50.720 --> 05:51.720
Okay.

05:53.720 --> 05:54.720
Ah, God damn it.

05:54.720 --> 05:57.720
Should have used acne with the Bart flag.

06:01.720 --> 06:02.720
Okay.

06:08.720 --> 06:10.720
A lib font.

06:16.720 --> 06:17.720
I'm sorry.

06:17.720 --> 06:18.720
I'm just going to change the font.

06:18.720 --> 06:20.720
I don't remember how to know which font I was supposed to use.

06:20.720 --> 06:22.720
I'm just going to change the font.

06:28.720 --> 06:30.720
I'm going to change the font.

06:36.720 --> 06:38.720
Oh.

06:38.720 --> 06:39.720
Ah.

06:41.720 --> 06:43.720
Which one is?

06:45.720 --> 06:47.720
Ah, Latin 32 right.

06:51.720 --> 06:53.720
Yeah, there we go.

07:03.720 --> 07:05.720
Sorry, this is...

07:12.720 --> 07:13.720
There we go.

07:16.720 --> 07:17.720
Okay.

07:20.720 --> 07:23.720
So, now we're ready to start.

07:23.720 --> 07:25.720
I apologize for the amount of time it took.

07:25.720 --> 07:33.720
But yeah, so I'm going to be talking about the new file system for P9 front, GES.

07:33.720 --> 07:39.720
It... these slides are basically a variation of...

07:39.720 --> 07:43.720
I talk a give a couple of years ago when I initially introduced it.

07:43.720 --> 07:49.720
So it used to say the great experimental file shredder as the acronym.

07:50.720 --> 07:52.720
It's been upgraded to the good enough file system.

07:52.720 --> 07:53.720
Is it great?

07:53.720 --> 07:55.720
It's good enough.

07:55.720 --> 07:57.720
And that's kind of the ethos I went with it.

07:57.720 --> 08:00.720
There's a lot of ways that you could optimize this file system.

08:00.720 --> 08:04.720
But my goal wasn't to be the most optimal, the most complicated.

08:04.720 --> 08:05.720
Squeezing out everything.

08:05.720 --> 08:08.720
My goal was to be simple and good enough.

08:10.720 --> 08:11.720
Okay.

08:11.720 --> 08:12.720
So...

08:14.720 --> 08:15.720
What?

08:15.720 --> 08:16.720
Damn it.

08:17.720 --> 08:18.720
Okay.

08:19.720 --> 08:25.720
So, I'm going to give you a quick summary of the current state of file systems on 9 front initially.

08:25.720 --> 08:30.720
The world today, there are a few file systems that we have.

08:30.720 --> 08:33.720
We've got CWFS, the cashworm file system.

08:33.720 --> 08:39.720
This is the file system you heard about just now,

08:39.720 --> 08:46.720
where the can actually wrote as the initial standalone file server

08:46.720 --> 08:51.720
that was that eventually got ported from its own kernel to user space.

08:51.720 --> 08:54.720
It has a...

08:54.720 --> 08:56.720
What do we do here?

08:59.720 --> 09:01.720
Okay, we just have to do the...

09:01.720 --> 09:02.720
Cool.

09:02.720 --> 09:03.720
I'll do it live.

09:03.720 --> 09:05.720
So here's the cashworm file system,

09:05.720 --> 09:12.720
which is the old Ken Thompson file system that was initially there.

09:12.720 --> 09:16.720
There's HJFS, which is one that was written for 9 front,

09:16.720 --> 09:21.720
as a kind of fairly boring file system.

09:21.720 --> 09:23.720
But it's...

09:23.720 --> 09:25.720
It's really simple.

09:25.720 --> 09:28.720
There is...

09:28.720 --> 09:30.720
See, there is...

09:30.720 --> 09:33.720
I'm wondering if we have chalk because I can just do the drawing if we...

09:33.720 --> 09:34.720
Perfect.

09:34.720 --> 09:35.720
Okay.

09:35.720 --> 09:36.720
Cool.

09:36.720 --> 09:37.720
So I'm going to...

09:37.720 --> 09:39.720
So we're winning it because...

09:39.720 --> 09:41.720
Well, we figure this out.

09:41.720 --> 09:42.720
So there's CWFS.

09:42.720 --> 09:45.720
There's HJFS.

09:45.720 --> 09:47.720
And then there's...

09:47.720 --> 09:50.720
Fossil, which is the classic Bell Labs file system

09:50.720 --> 09:53.720
that you get on the Labs distribution.

09:53.720 --> 09:54.720
The Labs...

09:54.720 --> 09:58.720
Nine front got rid of it because it kept crashing and losing data.

09:58.720 --> 10:00.720
And then there's CW...

10:00.720 --> 10:01.720
Or then then...

10:01.720 --> 10:05.720
Recently, we've added GFS, which ships by default.

10:05.720 --> 10:06.720
And...

10:06.720 --> 10:10.720
GFS still keeps on crashing and losing data.

10:11.720 --> 10:13.720
At least it seems to be getting better.

10:13.720 --> 10:14.720
And...

10:14.720 --> 10:16.720
And actually, I've been using it for...

10:16.720 --> 10:17.720
I'm joking mostly.

10:17.720 --> 10:19.720
I've been using it for a year.

10:19.720 --> 10:21.720
I haven't lost data with it yet.

10:21.720 --> 10:25.720
Other people have, and we've found and fixed those bugs,

10:25.720 --> 10:28.720
added a bunch of testing, which I'll talk about in a bit.

10:28.720 --> 10:29.720
And...

10:29.720 --> 10:32.720
Basically, made that work.

10:32.720 --> 10:36.720
So what I'm going to be doing here is talking about the general...

10:36.720 --> 10:39.720
structure of the file system, how it works internally,

10:39.720 --> 10:46.720
and what the kind of path forward is.

10:46.720 --> 10:49.720
So...

10:49.720 --> 10:56.720
As a quick review, all file systems on plan nine are network-based file systems

10:56.720 --> 10:58.720
that talk over nine p.

10:58.720 --> 11:02.720
Nine p is a protocol with, I think it's 13 operations.

11:02.720 --> 11:06.720
And most of them don't matter that much.

11:06.720 --> 11:11.720
I mean, they matter, but most of them aren't that relevant for the interesting parts.

11:11.720 --> 11:12.720
You've got...

11:12.720 --> 11:15.720
The important ones are...

11:15.720 --> 11:18.720
Read, which has the...

11:18.720 --> 11:20.720
T message...

11:20.720 --> 11:22.720
and they...

11:26.720 --> 11:28.720
and then the response.

11:33.720 --> 11:37.720
Right, and it's response...

11:37.720 --> 11:39.720
Walk...

11:39.720 --> 11:41.720
and it's response.

11:41.720 --> 11:43.720
And I think these are really the important...

11:43.720 --> 11:45.720
Oh, and open.

11:45.720 --> 11:49.720
Which...

11:49.720 --> 11:53.720
Basically, open is permission checks.

11:53.720 --> 11:56.720
You can think of Walk as basically taking a...

11:56.720 --> 12:01.720
what the protocol calls a FID, a file ID, and moving it around.

12:01.720 --> 12:04.720
So you start off with a FID when you attach to the...

12:04.720 --> 12:09.720
to the FES.

12:09.720 --> 12:11.720
So attach is basically where you start the whole...

12:11.720 --> 12:13.720
handshake process to authenticate and get...

12:13.720 --> 12:16.720
connection to a namespace.

12:16.720 --> 12:18.720
So attach...

12:18.720 --> 12:22.720
and then Walk will basically start...

12:22.720 --> 12:27.720
and attach gives back a FID for the root of your file system.

12:28.720 --> 12:29.720
Read...

12:29.720 --> 12:30.720
I'm sorry.

12:30.720 --> 12:31.720
Walk...

12:31.720 --> 12:33.720
It moves the FID around the file system.

12:33.720 --> 12:35.720
So you start with the root for...

12:35.720 --> 12:38.720
the FID for the root, clone it, and Walk to...

12:38.720 --> 12:40.720
home-or-e.

12:40.720 --> 12:43.720
And that gives you a new FID that you can use for open...

12:43.720 --> 12:45.720
read, right, et cetera.

12:45.720 --> 12:46.720
Okay.

12:46.720 --> 12:48.720
So that's the protocol.

12:48.720 --> 12:50.720
How do we implement a file system to do this?

12:50.720 --> 12:54.720
So the idea behind GFS is...

12:55.720 --> 12:57.720
Just...

12:57.720 --> 12:58.720
Email.

12:58.720 --> 12:59.720
Welcome.

13:01.720 --> 13:03.720
We don't have a eraser.

13:03.720 --> 13:04.720
Okay.

13:12.720 --> 13:14.720
Oh, wait, there isn't a eraser.

13:14.720 --> 13:15.720
Never do.

13:21.720 --> 13:23.720
I'm learning how to do this on the fly.

13:23.720 --> 13:27.720
Yeah.

13:27.720 --> 13:28.720
Okay.

13:28.720 --> 13:30.720
So...

13:30.720 --> 13:32.720
So how do we do this?

13:32.720 --> 13:36.720
I'm going to go through the layers one by one and...

13:36.720 --> 13:37.720
kind of talk through it.

13:37.720 --> 13:41.720
So GFS is basically built in three different layers.

13:41.720 --> 13:43.720
You've got the...

13:43.720 --> 13:48.720
the FFS layer, which sits on top of the key value layer,

13:48.720 --> 13:52.720
which sits on top of the...

13:52.720 --> 13:54.720
the block layer.

13:54.720 --> 13:57.720
And all of these just kind of stack nicely.

13:57.720 --> 14:01.720
So the file system layer is what takes things like Walk

14:01.720 --> 14:05.720
and turns them into basically...

14:05.720 --> 14:08.720
operations on a key value store.

14:08.720 --> 14:11.720
The key value store takes these operations in term...

14:11.720 --> 14:14.720
into IO on the block and cash layer.

14:14.720 --> 14:20.720
So what we end up doing here is...

14:20.720 --> 14:21.720
Okay.

14:21.720 --> 14:23.720
Let's take an example of...

14:23.720 --> 14:24.720
Thank you.

14:24.720 --> 14:28.720
So let's take an example of file system hierarchy.

14:28.720 --> 14:30.720
We have...

14:30.720 --> 14:33.720
Let's just say...

14:33.720 --> 14:34.720
All right.

14:34.720 --> 14:35.720
Let's give it a...

14:35.720 --> 14:37.720
About a plane 9 name.

14:42.720 --> 14:45.720
And the expile contains...

14:50.720 --> 14:51.720
Right?

14:51.720 --> 14:52.720
Okay.

14:52.720 --> 14:53.720
So we've got...

14:53.720 --> 14:56.720
User or EX, which says...

14:56.720 --> 14:57.720
Hello GFS.

14:57.720 --> 14:59.720
How do we map this into a hierarchy?

14:59.720 --> 15:00.720
So...

15:00.720 --> 15:05.720
What we would do is we would put in a key value pair that says...

15:05.720 --> 15:09.720
Let's give it ID...

15:09.720 --> 15:11.720
Negative 1...

15:11.720 --> 15:12.720
Empty name.

15:12.720 --> 15:15.720
And that's the root of your file system hierarchy.

15:15.720 --> 15:18.720
So at the root of the file system hierarchy...

15:19.720 --> 15:21.720
What do you need to do to look up files?

15:21.720 --> 15:24.720
You need to know the ID of the root.

15:24.720 --> 15:26.720
And the...

15:26.720 --> 15:32.720
So what I do is I have a map that...

15:32.720 --> 15:37.720
Let's say...

15:37.720 --> 15:38.720
One...

15:38.720 --> 15:39.720
Or...

15:39.720 --> 15:40.720
Yeah.

15:40.720 --> 15:43.720
Okay.

15:43.720 --> 15:44.720
Sorry.

15:44.720 --> 15:46.720
Here I would have the permissions.

15:46.720 --> 15:48.720
So ID 0...

15:48.720 --> 15:49.720
And...

15:49.720 --> 15:50.720
I don't know.

15:50.720 --> 15:52.720
Owned by or...

15:52.720 --> 15:55.720
File size is whatever...

15:55.720 --> 15:59.720
Basically everything you get out of staff goes into the value...

15:59.720 --> 16:01.720
Packed into a buffer.

16:01.720 --> 16:05.720
Now this file ID here is the idea of the...

16:05.720 --> 16:06.720
Of the directory.

16:06.720 --> 16:08.720
And let's just say I also have...

16:08.720 --> 16:10.720
User...

16:10.720 --> 16:12.720
I don't know...

16:12.720 --> 16:13.720
Rob.

16:14.720 --> 16:15.720
Because...

16:15.720 --> 16:16.720
You know...

16:16.720 --> 16:18.720
I always share my computer with Rob.

16:18.720 --> 16:19.720
Um...

16:19.720 --> 16:20.720
Okay.

16:20.720 --> 16:21.720
So I've got...

16:21.720 --> 16:24.720
So what I would have in the key value story is a file...

16:24.720 --> 16:25.720
Starting with...

16:25.720 --> 16:26.720
There's...

16:26.720 --> 16:27.720
Two files starting with the...

16:27.720 --> 16:28.720
Sorry.

16:28.720 --> 16:30.720
One file starting with zero.

16:30.720 --> 16:32.720
And name is...

16:32.720 --> 16:34.720
User...

16:34.720 --> 16:38.720
Which maps to file ID 1.

16:38.720 --> 16:44.720
File ID 1 would have...

16:44.720 --> 16:46.720
Or...

16:54.720 --> 16:55.720
Orian Rob.

16:55.720 --> 17:01.720
And both of these start with file ID 1 because their parent directory is user with file ID 1.

17:01.720 --> 17:04.720
So this is how you end up getting directories.

17:04.720 --> 17:05.720
It's all a flat listing.

17:05.720 --> 17:13.720
But the prefix of the directory in the flat listing gives you the directory hierarchy.

17:13.720 --> 17:19.720
So that means that every file that starts with ID 1 goes into the directory user.

17:19.720 --> 17:25.720
Everything with ID 0 goes into the directory flash.

17:25.720 --> 17:27.720
So this...

17:27.720 --> 17:30.720
This works inside of key values.

17:30.720 --> 17:33.720
This is how you map a tree to a key value start.

17:33.720 --> 17:35.720
And then how do you map the data for the file?

17:35.720 --> 17:37.720
Let's just say I've got...

17:37.720 --> 17:39.720
Sorry.

17:39.720 --> 17:40.720
In...

17:40.720 --> 17:42.720
I'll call this one two.

17:42.720 --> 17:46.720
So in two I have a file called X, which has ID 3.

17:46.720 --> 17:48.720
Right?

17:48.720 --> 17:50.720
So...

17:50.720 --> 17:54.720
What I end up doing is ID 3...

17:54.720 --> 17:56.720
Offset 0.

17:56.720 --> 18:00.720
I get a block ID here,

18:00.720 --> 18:03.720
which points to the data on to the block on disk.

18:03.720 --> 18:05.720
That contains the contents of the file.

18:05.720 --> 18:09.720
So even the file blocks going to the key value store.

18:09.720 --> 18:13.720
Which means that you have to make a pretty fast key value store.

18:13.720 --> 18:14.720
Okay.

18:14.720 --> 18:16.720
So how do we do that?

18:16.720 --> 18:18.720
So that's how we map the hierarchy.

18:18.720 --> 18:20.720
Let's talk about key value stores.

18:20.720 --> 18:23.720
So the...

18:23.720 --> 18:24.720
Got damn it.

18:24.720 --> 18:27.720
It would be really nice if I could just flip the...

18:27.720 --> 18:29.720
Flip to a new page.

18:29.720 --> 18:30.720
Okay.

18:30.720 --> 18:32.720
You might be able to press the person.

18:32.720 --> 18:33.720
Ooh.

18:33.720 --> 18:34.720
Cool.

18:34.720 --> 18:37.720
I will let you get to...

18:37.720 --> 18:39.720
Oh, we only have half the screen.

18:39.720 --> 18:40.720
Ah.

18:40.720 --> 18:41.720
Wonderful.

18:41.720 --> 18:46.720
Let's see.

18:46.720 --> 18:49.720
I'm just going to keep going until you've gotten...

18:49.720 --> 18:53.720
Flip through the slides and let me know when you get to the right place.

18:53.720 --> 18:55.720
Slide plus.

18:55.720 --> 18:56.720
Same as it.

18:56.720 --> 18:57.720
Okay.

18:58.720 --> 18:59.720
Okay.

18:59.720 --> 19:04.720
So the key value store is based on what's called a B-epsylantry.

19:04.720 --> 19:09.720
And a B-epsylantry is basically...

19:09.720 --> 19:14.720
How many people here are familiar with B-trees?

19:14.720 --> 19:15.720
Almost everyone.

19:15.720 --> 19:16.720
Cool.

19:16.720 --> 19:18.720
So I'm not going to spend much time on it.

19:18.720 --> 19:22.720
I'm going to say that B-trees are like binary trees but fat.

19:22.720 --> 19:26.720
So you have a B-tree which has a bunch of...

19:26.720 --> 19:29.720
A note with a bunch of children.

19:29.720 --> 19:31.720
And these children fan out.

19:31.720 --> 19:32.720
All right.

19:32.720 --> 19:33.720
And you get...

19:33.720 --> 19:34.720
Yeah.

19:34.720 --> 19:36.720
Other blocks that also have a bunch of children.

19:36.720 --> 19:37.720
Right.

19:37.720 --> 19:40.720
So that's a B-tree.

19:40.720 --> 19:42.720
How is it B-epsylantry different?

19:42.720 --> 19:44.720
It's actually really close.

19:44.720 --> 19:47.720
The difference is that all the internal nodes,

19:47.720 --> 19:49.720
the ones that have children,

19:49.720 --> 19:51.720
get split in half.

19:51.720 --> 19:54.720
So the first half of it is exactly like a B-tree.

19:54.720 --> 20:01.720
Where you fan out to all the other children child nodes.

20:01.720 --> 20:03.720
And they'll save.

20:03.720 --> 20:05.720
These are also internal nodes.

20:11.720 --> 20:12.720
Okay.

20:12.720 --> 20:13.720
So you get the fan out.

20:13.720 --> 20:15.720
And then you end up with a children here.

20:15.720 --> 20:18.720
So what happens with...

20:19.720 --> 20:21.720
What's the file?

20:21.720 --> 20:22.720
What's the file?

20:22.720 --> 20:23.720
What's the file?

20:23.720 --> 20:26.720
What do you mean what's the file?

20:26.720 --> 20:28.720
Just go to...

20:28.720 --> 20:30.720
Hit slide plus until you see...

20:30.720 --> 20:31.720
Ask Eric.

20:31.720 --> 20:33.720
Oh, that's going to be a problem.

20:33.720 --> 20:35.720
See each problem plus X file.

20:35.720 --> 20:36.720
The slide plus.

20:36.720 --> 20:38.720
And then it'll work.

20:38.720 --> 20:39.720
See it...

20:39.720 --> 20:41.720
Make it executable.

20:41.720 --> 20:42.720
Okay.

20:42.720 --> 20:47.720
So what we do here...

20:48.720 --> 20:49.720
So what...

20:49.720 --> 20:50.720
Okay.

20:50.720 --> 20:52.720
So what do we do with the other half of the node?

20:52.720 --> 20:56.720
What we do with the other half of the node is we put messages into it.

20:56.720 --> 21:01.720
And a message in this case is basically a deferred update to the leaf.

21:01.720 --> 21:04.720
So what we can do is, for example,

21:04.720 --> 21:07.720
let's just say I want to insert...

21:07.720 --> 21:20.720
I'll call it the operation is called insert.

21:20.720 --> 21:25.720
And I'll say file X to create X, right?

21:25.720 --> 21:32.720
And what I'll do is I'll put a little message here that says insert file X with these attributes.

21:32.720 --> 21:33.720
This is...

21:33.720 --> 21:34.720
Get...

21:34.720 --> 21:36.720
Insert it into the root node of the tree.

21:36.720 --> 21:37.720
And...

21:37.720 --> 21:39.720
Stay isn't it?

21:39.720 --> 21:42.720
When you read from the tree, you walk down to the leaf and say,

21:42.720 --> 21:45.720
Do I have an insertion for file X?

21:45.720 --> 21:49.720
Change the name of intro 1 to intro.

21:49.720 --> 21:51.720
Does not have it.

21:51.720 --> 21:54.720
Then next go to...

21:54.720 --> 21:56.720
Go up the node.

21:56.720 --> 21:58.720
Does not have anything about file X.

21:58.720 --> 21:59.720
And here we have a message.

21:59.720 --> 22:04.720
So we basically reconstruct the key value pair and we get the file.

22:04.720 --> 22:06.720
Now let's just say I want to touch the file.

22:06.720 --> 22:10.720
I don't actually need to read anything about it because all I'm doing is updating the

22:10.720 --> 22:12.720
Modification Time blindly.

22:12.720 --> 22:16.720
So what I do is I create a...

22:16.720 --> 22:18.720
Oh...

22:18.720 --> 22:24.720
WSTAT, which is...

22:24.720 --> 22:26.720
WSTAT is right STAT.

22:26.720 --> 22:31.720
And it will just create a new message here without actually even reading a damn thing.

22:31.720 --> 22:33.720
It will...

22:33.720 --> 22:38.720
And we can basically toss messages into the tree like that.

22:38.720 --> 22:41.720
So this is...

22:41.720 --> 22:43.720
And then what happens when this message buffer fills up?

22:43.720 --> 22:46.720
We basically flush down to another node.

22:46.720 --> 22:47.720
So this...

22:47.720 --> 22:53.720
So we get the messages kind of flowing down the tree slowly towards the leaves.

22:53.720 --> 22:58.720
Why do we want this data structure?

22:58.720 --> 23:04.720
Well, the rights become a lot cheaper because you're not actually doing...

23:04.720 --> 23:05.720
Okay.

23:05.720 --> 23:08.720
The rights become a lot cheaper for multiple reasons.

23:08.720 --> 23:13.720
The first one is that you're not actually necessarily reading any data to do the right.

23:13.720 --> 23:16.720
The second one, which is a little bit more subtle.

23:16.720 --> 23:20.720
For a file system that is crash safe, you really want to do copy on right.

23:21.720 --> 23:27.720
Now, in a classic B tree, if you do a copy on right, you're doing a copy of the root node,

23:27.720 --> 23:33.720
and a copy of the child, and a copy of the leaf.

23:33.720 --> 23:37.720
And now you've just done three rights to update one value.

23:37.720 --> 23:40.720
Or four rights because you have to actually update the value too.

23:40.720 --> 23:45.720
With a Bepsal entry, what you do is you take a copy of the root node,

23:45.720 --> 23:48.720
and you're done.

23:48.720 --> 23:51.720
You've updated the data structure.

23:51.720 --> 23:55.720
So now you end up with a lot less right amplification.

23:55.720 --> 24:01.720
ZFS uses a classic B tree, which put means that they have a lot of...

24:01.720 --> 24:03.720
The B tree itself is simpler.

24:03.720 --> 24:05.720
You don't have to reconstruct values on the way up.

24:05.720 --> 24:09.720
But to make it perform well, you have to do a lot of work on how do you batch transactions

24:09.720 --> 24:14.720
and avoid a bunch of copy and copy and write amplification,

24:14.720 --> 24:18.720
that I can just kind of go, whatever.

24:18.720 --> 24:21.720
So this is, and you might be asking yourself,

24:21.720 --> 24:24.720
well, if this is so great, why doesn't ZFS use it?

24:24.720 --> 24:26.720
And the answer is very simple.

24:26.720 --> 24:29.720
ZFS was released in 2003.

24:29.720 --> 24:37.720
Bepsal entries were described, at least, became well-knownish in 2015.

24:37.720 --> 24:42.720
So 12 years of time is the difference.

24:42.720 --> 24:49.720
Okay, so this is the key value store, and what we do for...

24:49.720 --> 24:52.720
So this gives us a lot of nice stuff.

24:52.720 --> 24:55.720
For example, if we want to take a snapshot,

24:55.720 --> 24:58.720
well, we need to keep a pointer to the old value,

24:58.720 --> 25:00.720
and we keep a pointer to the new value.

25:00.720 --> 25:02.720
And that's your snapshot.

25:02.720 --> 25:05.720
You just don't delete the old stuff.

25:05.720 --> 25:09.720
The way we do this is that GFS is actually a tree of trees.

25:09.720 --> 25:17.720
So we have a, yes, I can switch to opt-out from now.

25:17.720 --> 25:20.720
If it's working, okay.

25:20.720 --> 25:21.720
Cool.

25:21.720 --> 25:25.720
So this should actually let me move a little bit faster.

25:25.720 --> 25:28.720
Good to just move this down.

25:28.720 --> 25:31.720
Okay, good enough.

25:31.720 --> 25:34.720
Good enough presentation.

25:34.720 --> 25:36.720
Cool.

25:37.720 --> 25:40.720
So this is basically what we were talking about just now,

25:40.720 --> 25:45.720
with the file system layer, key value layer.

25:45.720 --> 25:50.720
So right optimized variant of a B3 is what the Bepsal entry is.

25:50.720 --> 25:52.720
I just kind of described it.

25:52.720 --> 25:54.720
It's complex and painful to implement,

25:54.720 --> 25:59.720
but everything else once you have it gets simpler.

25:59.720 --> 26:06.720
So the fan out, messages, updating inserts to the message buffer.

26:06.720 --> 26:10.720
And then when the message buffer fills, we flush to the children.

26:10.720 --> 26:15.720
We apply the, so basically the messages turn into a leaf node,

26:15.720 --> 26:19.720
or into key value pairs when it reaches the leaf.

26:19.720 --> 26:24.720
And I just explained why it works really well with count.

26:24.720 --> 26:28.720
So here's an example of what copy on right looks like with this.

26:28.720 --> 26:35.720
The old tree and your new tree.

26:35.720 --> 26:39.720
Look up is, again, a pin in the ass because once you walk down to the leaf,

26:39.720 --> 26:44.720
you have to kind of walk back up and get to a full key value pair

26:44.720 --> 26:47.720
that you can actually return to the querying layer.

26:47.720 --> 26:51.720
So your base, so every read ends up doing a little bit of reconstruction.

26:51.720 --> 26:57.720
This isn't too expensive, but it is a little bit of overhead over a regular B3.

26:57.720 --> 27:01.720
And then listing a directory scanning is even more complicated,

27:01.720 --> 27:03.720
but it's still fairly fast.

27:03.720 --> 27:09.720
I have limited time, so I'm going to just skip over the details of scanning.

27:09.720 --> 27:12.720
If you want to talk to me about how this works internally at any point,

27:12.720 --> 27:15.720
just feel free to grab me after this.

27:15.720 --> 27:20.720
But basically, listing a, the important thing to remember is just listing a directory

27:20.720 --> 27:23.720
is a prefix scan over the keys.

27:23.720 --> 27:27.720
You look up the idea of the directory you're listing,

27:27.720 --> 27:30.720
and then you say, give me every file that starts with it.

27:30.720 --> 27:34.720
And because it's a B3, all the keys are close together nicely sorted.

27:34.720 --> 27:37.720
So you get really good locality.

27:37.720 --> 27:41.720
In search, it's similar to the rules or similar to B3s,

27:41.720 --> 27:45.720
except there, again, there's a little bit more complexity because

27:45.720 --> 27:50.720
when you merge a node, well, in a B3, you only have one thing to consider

27:50.720 --> 27:52.720
when you're asking, can I merge it?

27:52.720 --> 27:55.720
Do I have room for the fan out nodes?

27:55.720 --> 27:59.720
So if you've got two nodes that are combined less than half full,

27:59.720 --> 28:01.720
you can just kind of merge them.

28:01.720 --> 28:04.720
Here, you've also got the message buffers to consider.

28:04.720 --> 28:05.720
And what do you do here?

28:05.720 --> 28:07.720
Do you flush them?

28:07.720 --> 28:09.720
Do you not?

28:09.720 --> 28:10.720
Do you?

28:10.720 --> 28:11.720
What do you do?

28:11.720 --> 28:14.720
I just kind of do the lazy thing and say,

28:14.720 --> 28:18.720
if the message buffers full, even if you could flush the tree,

28:18.720 --> 28:19.720
just don't.

28:20.720 --> 28:23.720
So there's room for optimization there,

28:23.720 --> 28:27.720
but it turns out that even with the relaxed merging,

28:27.720 --> 28:29.720
the tree nodes usually are pretty okay.

28:29.720 --> 28:31.720
And the balance isn't bad.

28:31.720 --> 28:33.720
The fill factor isn't bad.

28:33.720 --> 28:36.720
Occasionally, if you're running check on the file system,

28:36.720 --> 28:40.720
it will tell you, hey, you've got an under full pivot.

28:40.720 --> 28:44.720
And it's not perfect for performance,

28:44.720 --> 28:47.720
but it doesn't seem to be a problem.

28:47.720 --> 28:48.720
Okay.

28:48.720 --> 28:51.720
So I've talked a little bit about absurds already,

28:51.720 --> 28:56.720
so I was talking, I was, yeah.

28:56.720 --> 29:00.720
So what I do for the file for snapshots is basically,

29:00.720 --> 29:03.720
this is where I got into on the whiteboard.

29:03.720 --> 29:06.720
I have two trees or two layers of trees.

29:06.720 --> 29:08.720
I have the snapshot tree,

29:08.720 --> 29:15.720
which is also a Bepsalon tree that points to the roots of each snapshot.

29:16.720 --> 29:20.720
So your snapshot tree points to your main snapshot,

29:20.720 --> 29:27.720
which is basically the default name that you put your data into.

29:27.720 --> 29:30.720
Your admin snapshot, which is where your user database

29:30.720 --> 29:35.720
and a couple of other file system and many things go into.

29:35.720 --> 29:38.720
And the reason this is separate, by the way, is because,

29:38.720 --> 29:42.720
well, if I have different snapshots, different user lists,

29:43.720 --> 29:46.720
who has the permissions to do what?

29:46.720 --> 29:50.720
So the user database is stored separately from all the snapshots

29:50.720 --> 29:53.720
and is used across all of them.

29:53.720 --> 29:58.720
And then your dumps, which is basically your history.

29:58.720 --> 30:02.720
Okay, so that's basically how snapshots work.

30:02.720 --> 30:06.720
And by the way, I should have said this at the start of the talk,

30:06.720 --> 30:08.720
but if you have questions at any point,

30:08.720 --> 30:11.720
feel free to stop me and I'm happy to explain a little bit more.

30:11.720 --> 30:14.720
Yeah, I know this, do you have a support for the history of coal mine

30:14.720 --> 30:18.720
for GFS name or a support for the yesterday coal mine?

30:18.720 --> 30:21.720
Yeah, I will do that.

30:21.720 --> 30:24.720
Okay, so the block layer.

30:24.720 --> 30:34.720
This is kind of the, this is where we actually put the data on disk.

30:34.720 --> 30:37.720
This is where the rubber meets the road.

30:38.720 --> 30:43.720
And basically what we do here is also trying to be roughly,

30:43.720 --> 30:46.720
well, this is very ZFS inspired.

30:46.720 --> 30:51.720
So it's trying to be, but it also simplifies a lot of things.

30:51.720 --> 30:56.720
So what we do here is we take a disk and we split into a bunch of arenas.

30:56.720 --> 31:04.720
So you have your disk and I think we default to eight arenas,

31:04.720 --> 31:06.720
but the number doesn't actually matter.

31:06.720 --> 31:11.720
And what the arenas and each arena is a pool of fixed size blocks.

31:11.720 --> 31:15.720
So each block is the same size.

31:15.720 --> 31:18.720
And then if we want to grow the file system,

31:18.720 --> 31:19.720
we support that.

31:19.720 --> 31:25.720
We can, you just add more arenas to the end of your arena list.

31:25.720 --> 31:29.720
The slide plus.

31:29.720 --> 31:35.720
So what, and the way we keep track of the blocks is we just have a,

31:36.720 --> 31:40.720
in memory we have a avial tree.

31:40.720 --> 31:44.720
And in, on disk we have just a flat log of regions.

31:44.720 --> 31:48.720
So the regions basically say, you know,

31:48.720 --> 31:54.720
it's, this block is allocated from here to here is free.

31:54.720 --> 31:58.720
And, you know, when we want to allocate a block,

31:58.720 --> 32:03.720
will you just append to the log and say, well, this block is allocated.

32:04.720 --> 32:07.720
And this, this block has been freed and this block has been allocated.

32:07.720 --> 32:08.720
This block has been freed.

32:08.720 --> 32:13.720
So as you can imagine, this ends up becoming a fairly long log of,

32:13.720 --> 32:15.720
basically, every operation.

32:15.720 --> 32:18.720
So every now and then we basically take that log and compress it.

32:18.720 --> 32:22.720
So we take the, like, if you have block x that you allocate,

32:22.720 --> 32:24.720
and then you free block x later,

32:24.720 --> 32:29.720
then it just turns into one entry that records the final state of the block.

32:30.720 --> 32:33.720
And that's taken done by taking the avial tree,

32:33.720 --> 32:37.720
and writing out the set of ranges to disk.

32:37.720 --> 32:43.720
So the avial tree basically maintains, you know,

32:43.720 --> 32:50.720
maintains the state of the whole disk in memory of these blocks in memory,

32:50.720 --> 32:55.720
which can be, yeah, as a set of ranges,

32:55.720 --> 32:58.720
and is just kind of updated in place.

32:58.720 --> 33:03.720
Okay, so, yeah.

33:03.720 --> 33:08.720
The blocks themselves, we can have,

33:08.720 --> 33:11.720
every block pointer is a tuple of three values.

33:11.720 --> 33:13.720
It's a address.

33:13.720 --> 33:17.720
The hash of the contents of the block, which is how we detect corruption,

33:17.720 --> 33:21.720
and the generation, which is used for snapshotting and,

33:21.720 --> 33:23.720
and space reclamation.

33:23.720 --> 33:31.720
The, ah, oops.

33:31.720 --> 33:34.720
Yeah, one less per arena, which means that,

33:34.720 --> 33:36.720
and the reason we keep one less, the reason for arenas,

33:36.720 --> 33:39.720
which I forgot to mention, is because we,

33:39.720 --> 33:42.720
when we compress the log, we basically can't do anything,

33:42.720 --> 33:45.720
and compressing the log can take time.

33:45.720 --> 33:47.720
So if we split the disk into arenas,

33:47.720 --> 33:50.720
we can do log compression in the background,

33:50.720 --> 33:53.720
while we allocate from the other arenas.

33:53.720 --> 33:57.720
So this basically allows us to concurrently maintain the file system,

33:57.720 --> 34:03.720
and, ah, and actually, ah, use it.

34:03.720 --> 34:06.720
Ah, oops.

34:06.720 --> 34:09.720
I should have done one other thing, but, ah, okay.

34:09.720 --> 34:12.720
So I did shorten this presentation a little bit,

34:12.720 --> 34:16.720
because, no, I have limited time.

34:16.720 --> 34:18.720
I'm actually coming up on the end of it.

34:18.720 --> 34:20.720
So, okay.

34:20.720 --> 34:22.720
Last thing, how is this thing tested?

34:22.720 --> 34:24.720
So we've got a fuzzer in the file system,

34:24.720 --> 34:27.720
ah, which will generate random operations,

34:27.720 --> 34:30.720
ah, as quickly as possible, and just kind of absurd,

34:30.720 --> 34:33.720
and do consistency checks all the time.

34:33.720 --> 34:36.720
Ah, we've got, ah,

34:36.720 --> 34:38.720
tester, a virtual disk that,

34:38.720 --> 34:41.720
we can use to simulate block by block IO,

34:41.720 --> 34:44.720
so you do say a build of go,

34:44.720 --> 34:46.720
and then we do a replay.

34:46.720 --> 34:48.720
We start with an empty file system,

34:48.720 --> 34:51.720
and then do the first block IO operation,

34:51.720 --> 34:53.720
and check, is it consistent?

34:53.720 --> 34:54.720
Then do the second one and check.

34:54.720 --> 34:55.720
Is this consistent?

34:55.720 --> 34:58.720
And then do the third one and check if, if this is consistent.

34:58.720 --> 35:01.720
And then, ah, about 100,000 operations later,

35:01.720 --> 35:03.720
we've gone through the entire go build,

35:03.720 --> 35:04.720
and, ah,

