WEBVTT

00:00.000 --> 00:10.000
Background, I've mostly done file system work in recent years, but I've done some other

00:10.000 --> 00:13.640
kernel stuff before.

00:13.640 --> 00:20.760
So let's start by discussing why it is we're building this thing to begin with.

00:20.760 --> 00:25.160
The company I work for, Versity, provides a software stack that helps people manage

00:25.160 --> 00:29.360
some of the biggest archives in the world, and it's an interesting space to be in.

00:29.360 --> 00:33.580
We're going to focus on a POSIX file system here, but we need to talk about how these

00:33.580 --> 00:39.680
file systems are used to understand what NGNFS is trying to do, because these file systems

00:39.680 --> 00:41.680
are very weird.

00:41.680 --> 00:49.520
So the general idea is that they treat this whole software stack as sort of an archive that sits

00:49.520 --> 00:52.360
after their main large data stores, right?

00:52.360 --> 00:56.280
This is where all the files go to just sit idle mostly, but there's a lot of motion in and out.

00:56.280 --> 01:02.880
But I've tried to lay out the sort of workflow here that shows why these file systems become

01:02.880 --> 01:03.880
a challenge.

01:03.880 --> 01:10.360
So the general model is they'll use the POSIX file system as just a buffer for files that

01:10.360 --> 01:14.320
are going into and out of this archive media that's very irritating to work with.

01:14.320 --> 01:18.240
So they'll have a lot of files they want to archive, and they'll dump them all into the POSIX

01:18.240 --> 01:26.640
file system, which acts as sort of a cache, and they'll do this in these big batches.

01:26.640 --> 01:31.520
What's interesting is that the files accumulate in the POSIX file system, which is what we're going

01:31.520 --> 01:35.760
to be talking about, and they sit there, and there's a policy engine that

01:35.760 --> 01:39.520
will look at all those files and say, okay, these have all arrived, where do they go?

01:39.520 --> 01:43.520
We'll put them off on these tapes, we'll talk about tape, or we'll put them off on these

01:43.520 --> 01:49.600
object stores or whatever, but what's interesting for this file system is that once the

01:49.600 --> 01:54.160
file data has been put off on the archive, we truncate all the file contents.

01:54.160 --> 01:58.800
It has a slightly different meaning here, I've put this little offline flag here, if there

01:58.800 --> 02:02.960
are file system people here, when you have unwritten extents, there's no file data, but

02:02.960 --> 02:08.160
there's a logical block allocated, we do the same thing except that the marking on the extent

02:08.160 --> 02:12.320
of the data block says, I have this on an archive, but it's not here in this file system

02:12.400 --> 02:16.880
cache. All that's to say that for the data at rest, there's no file contents.
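A rough sketch of that idea, with made-up names rather than any real on-disk format: an offline extent looks a lot like an unwritten extent, except the flag means the data lives on the archive tier rather than never having been written.

#include <stdint.h>

/* Hypothetical extent record, for illustration only. */
enum extent_flags {
	EXT_UNWRITTEN = 1 << 0,	/* allocated but never written: reads as zeroes */
	EXT_OFFLINE   = 1 << 1,	/* was written, then staged to the archive and truncated here */
};

struct extent {
	uint64_t logical_start;	/* file offset, in blocks */
	uint64_t length;	/* number of blocks */
	uint32_t flags;		/* EXT_* above; no data blocks held when EXT_OFFLINE */
};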

02:17.920 --> 02:23.280
This is a file system with billions and billions of files with no data, it's just the metadata,

02:23.280 --> 02:29.440
and that's mostly what makes these systems so weird. So, to go through this archive flow,

02:29.440 --> 02:34.560
they'll dump all the files in this POSIX file system, you know, through rsync or

02:34.560 --> 02:38.960
NFS or Samba or whatever, or they have their own homegrown data motion scripts or whatever,

02:39.760 --> 02:44.160
our little archive agent will go and get all those files, read them again,

02:44.160 --> 02:48.320
write them off to the archive and now they're at rest, and then when they later want to recall

02:48.320 --> 02:53.680
them, they'll try and fetch them from the Pozix file system again via whatever means,

02:54.560 --> 02:59.120
and the kernel sees that the extents are offline and it goes and asks the archive agent to go

02:59.120 --> 03:03.280
get it from the archive and it brings it over, writes it into the file system, and then the

03:03.280 --> 03:08.720
reads are satisfied and off it goes out through the services. So the data has this flow in and out

03:08.720 --> 03:16.080
of the system, and we have a bunch of tasks reading and writing all the time and all these little

03:16.080 --> 03:21.600
files sometimes, and then we have our software agent that's behind the scenes also messing around with

03:21.600 --> 03:29.120
the files, and what we're to understand from all this is that we can have very weird access patterns

03:29.120 --> 03:33.760
that are, you know, defined by the site's access patterns, and then we also have our little archive

03:33.920 --> 03:39.280
agent walking all the metadata, so we have a lot of opportunities for contention and for reads and

03:39.280 --> 03:43.920
writes to land on each other and a lot of things, and that's what the talk is mostly going to be about:

03:43.920 --> 03:51.920
how NGNFS addresses that problem. After those two workflow topics here,

03:51.920 --> 03:57.120
we're describing the environment just to give you an idea, very big namespaces, that's another part

03:57.120 --> 04:03.200
of how these systems are interesting. The reason this thing is multi-node is because these

04:03.280 --> 04:07.120
archive tiers are big enough that they have to have a lot of nodes to get aggregate bandwidth

04:07.120 --> 04:12.720
to it. These two charming photos on the right. That's the same manufacturer, it's the same drive.

04:12.720 --> 04:16.720
The top one is an awesome glamorous shot, right? The bottom one is one of us walking into a

04:16.720 --> 04:21.760
data center taking a photo with a phone. It's the exact same thing. Those are racks and racks of tape

04:21.760 --> 04:28.240
drives and robots and actual cartridges. Often, these are a challenge because of that little third

04:28.400 --> 04:36.000
bullet there: drives are not uniformly accessible from any node. This gets to the contention

04:36.000 --> 04:42.240
problem again. We can have files arrive on one node for whatever reason, but they need to be read

04:42.240 --> 04:46.880
from a different node because they have to be written to a specific tape drive because the

04:46.880 --> 04:54.080
policy engine said it has to be in this set of tapes. This is where the

04:54.080 --> 04:59.040
multi-node kind of POSIX namespace comes from. The services need to be able to read and write

04:59.040 --> 05:03.120
from any node and our archive agent has to be able to read and write from any node because

05:04.320 --> 05:08.080
files have to flow through specific nodes to get to some archive media sometimes.

05:08.720 --> 05:13.840
I just threw in a little example there that's operational in the field. This class of machine,

05:13.840 --> 05:19.680
a bunch of network out the top, a bunch of, well, the storage fabric in this example is

05:20.560 --> 05:25.280
InfiniBand, but that's a thing, and then a bunch of Fibre Channel for the tape drives. This is

05:25.280 --> 05:31.840
what's deployed today to give you an idea of the scale of the thing. The billions of files in

05:31.840 --> 05:37.520
particular, sometimes you need to do operations on a lot of those files, and doing

05:37.520 --> 05:48.960
anything a billion times is really upsetting. So, in comes NGNFS. This is how

05:49.920 --> 05:56.320
I like to sort of understand the different design choices you can make to get to this coherent

05:56.320 --> 06:02.240
POSIX namespace and what we're doing differently in NGNFS. So, we can start with that column

06:02.240 --> 06:09.600
on the far left to sort of frame the rest of them. To me, this is like the minimal job of the file

06:09.600 --> 06:14.560
system is to go between processes that are calling system calls, the P's at the top, and the

06:14.560 --> 06:19.280
devices that are doing the actual IO. That's the stuff at the bottom. Anything in the middle is

06:19.440 --> 06:23.920
irritating and we wish we could get rid of it. But we have to have the processes doing the work

06:23.920 --> 06:28.640
and we have to have the devices actually working with persistence. So, in the local file system case,

06:29.360 --> 06:36.320
you construct a coherent POSIX namespace with a bunch of kernel software runtime constructs.

06:36.320 --> 06:41.200
You have caches and you have spin locks and you have mutexes and all that stuff. That's what

06:41.200 --> 06:47.440
makes it safe to have a bunch of processes working on one name space. If you try to have two tasks

06:47.440 --> 06:53.600
delete the same file, it'll be in the VFS in one of these locking constructs that makes that

06:53.600 --> 06:58.880
operation safe. That's where the consistency comes from. So, that's sort of how we start the

06:58.880 --> 07:04.800
conversation. That's in your head. That's how local file systems are. Ext3, ext4, XFS. All that stuff.

07:04.800 --> 07:10.400
That's the world there. But that's a single node. We want to be able to get this coherent

07:10.400 --> 07:16.320
name space across multiple nodes. How do we do that? The next column over is NFS, which is one of the

07:16.400 --> 07:22.400
first and easiest to understand. We have the exact same model where we have tasks calling into the

07:22.400 --> 07:28.960
VFS. But now, instead of what comes out being just raw block IO, what we do instead is we just package

07:28.960 --> 07:34.160
up those calls again. If you did an unlink of a file, it goes through the VFS. We just send the same

07:34.160 --> 07:38.960
message that says, go unlink that file, and we send it off to another VFS. And it's down in

07:38.960 --> 07:46.800
that other VFS in the server where the safety comes from. If you get two tasks on different nodes

07:46.800 --> 07:51.760
trying to unlink the same file, they don't know about each other at all. And it's in the server where

07:51.760 --> 07:57.760
those two things are resolved. So, that's where the safety lives in the NFS model. And that's a

07:57.760 --> 08:04.880
problem because it means safety is only resolved off at the server. If you want to unlink a whole

08:04.960 --> 08:10.000
bunch of files, you're sending a whole bunch of messages to that remote server. You can't make

08:10.000 --> 08:14.560
local cache decisions because you don't know if they're safe until the server has checked with

08:14.560 --> 08:21.520
everybody else. So, you get these per-file round-trip RPC costs to get safety. So, that's the sort

08:21.520 --> 08:26.240
of quote unquote network model. These little quoted terms at the top are just how I think of this

08:26.240 --> 08:32.720
in my head. The third column, what we'll call shared block, fellow dinosaurs might think of these

08:32.800 --> 08:38.560
as clustered file systems. That's my brain. The example I have here is ScoutFS, which is a system

08:38.560 --> 08:44.000
we have. There are others, GFS2, OCFS2, all this class of systems. There's a bunch of proprietary

08:44.000 --> 08:54.000
ones that work the same way. This is more of a hybrid between the local file system and the networked

08:54.000 --> 08:59.600
NFS a little bit. So, we start again with all the processes calling into the VFS. But in this

08:59.600 --> 09:04.560
case, what comes out of the bottom of the VFS is block IO again. It happens to go to a, you know,

09:04.560 --> 09:11.840
a RAID head, a shared cluster block storage thing. But what's interesting is the safety: even

09:11.840 --> 09:17.680
though the VFSs are emitting block IO, they don't know that it's safe unless they've talked to

09:17.680 --> 09:23.200
this external lock service. So, if we go all the way back to the one on the left, the local model,

09:24.400 --> 09:28.720
the way the kernel made that safe was you get a software mutex and you do your work and you unlock

09:28.880 --> 09:34.560
the mutex. In this shared block model, you sort of take that programming concept, but now instead

09:34.560 --> 09:40.000
of a local kernel software lock, you have a remote RPC lock and unlock. So, it's the same kind

09:40.000 --> 09:46.080
of system where in these shared block file systems, you have your metadata processing code path

09:46.080 --> 09:51.760
and instead of a mutex lock and unlock, you send these RPCs off to a remote thing, and there's

09:51.760 --> 09:57.360
caching and stuff, it's a little more complicated than that. But the basic idea is you're doing

09:57.360 --> 10:03.760
a lot of remote messages out of band to get these locks that make your operation safe. That's where

10:03.760 --> 10:10.720
the safety comes from. And again, kind of like the NFS RPC case, this is a lot of extra messaging.
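As a rough sketch of that shape (the names here are made up, not any particular file system's API): the in-kernel mutex of the local case turns into a pair of lock RPCs wrapped around the same block IO.

/* Sketch only: invented names, not a real cluster file system's API. */
enum lock_mode { LOCK_READ, LOCK_EXCLUSIVE };

struct cluster_lock;						/* handle held by the remote lock manager */
int  lock_rpc(struct cluster_lock *l, enum lock_mode m);	/* out-of-band network round trip */
void unlock_rpc(struct cluster_lock *l);			/* another round trip */
int  modify_dir_blocks(const char *name);			/* ordinary read/modify/write block IO */

/* The local pattern "mutex_lock(); modify; mutex_unlock();" becomes
 * remote messages to a lock service around the same IO. */
int clustered_unlink(struct cluster_lock *dir_lock, const char *name)
{
	int err = lock_rpc(dir_lock, LOCK_EXCLUSIVE);
	if (err)
		return err;

	err = modify_dir_blocks(name);
	unlock_rpc(dir_lock);
	return err;
}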

10:10.720 --> 10:16.720
Like, if you were to compare the two flows, the local file system compared to these shared block

10:16.720 --> 10:21.760
file systems, there'd be a lot of sort of latency hiccups in the shared block one where it's talking

10:21.840 --> 10:26.880
to its remote locks as it does all its work. And so, this is where NGNFS comes in.

10:26.880 --> 10:32.480
A brief little fun anecdote. The name came from me being so frustrated at naming and wanting to

10:32.480 --> 10:36.960
start messing around with prototyping. So, I just did NGN as next gen and then realized it could be

10:36.960 --> 10:46.160
pronounced engine. So, that's NGNFS. Never any good at naming. But in this model, we've made it look

10:46.240 --> 10:54.640
a little more like CPU cache coherency protocols. What's happening here is we have the same kind of

10:56.480 --> 11:03.440
model where the VFSs are emitting block IO again. So, they look a lot like the fast local

11:04.160 --> 11:12.640
file system case. But there's metadata along with the IOs that manages cache coherence. So,

11:12.640 --> 11:21.600
instead of for every operation sending a remote call to a server like we did in NFS, we do the same

11:21.600 --> 11:27.040
exact kind of IO as you'd see with ext4 or whatever. But we have these metadata tags flowing

11:27.040 --> 11:33.520
along with the IO. And these little devd boxes, the main design element here is that in

11:33.520 --> 11:40.720
NGNFS, there's a user space process sitting in front of every single device. And that user space

11:40.800 --> 11:46.240
process offers a network protocol that looks like an IO protocol. It has reads and writes and all that.

11:46.960 --> 11:52.560
But it has a few more messages for cache coherence. And in the reads and writes there's

11:53.360 --> 12:00.000
some more metadata to describe cache intent. And that's the biggest takeaway here. What we're doing is,

12:00.000 --> 12:05.120
instead of having a VFS talk to devices with just block IO, NVMe, SAS, whatever,

12:05.840 --> 12:11.760
we have the VFS doing a network protocol to little per-device servers that are then doing the IO to the

12:11.760 --> 12:19.280
devices. So, to make that make a little more sense, we have this little timeline that is,

12:20.640 --> 12:25.840
this is the heart of the cache coherence model. So, down that middle spine,

12:27.040 --> 12:32.000
this is all from the perspective of one block, the system is all block IO. The middle spine is the

12:32.000 --> 12:37.120
state of one of these device servers that's responding to all these network requests. On the

12:37.120 --> 12:43.360
left and right side, we have the VFS agents that are doing an operation. The operation can be darn

12:43.360 --> 12:51.200
near anything. But we start this timeline; time flows down. On the left, we have a VFS that wants to,

12:51.200 --> 12:55.360
wants to update a block. Whatever it is, it's going to touch an inode, it's going to remove a

12:55.360 --> 12:59.680
directory entry, who knows. It has to first get the current version of the block to be able to modify it.

13:00.640 --> 13:05.440
So, it does one of these network protocol messages to the device server to say, I'd like to read this

13:05.440 --> 13:11.360
block, please. But I'm reading this because I want to write. I send a little write intent tag that tells

13:11.360 --> 13:17.520
the metadata server, the devd, that I would like to get this block because I'm going to modify it.

13:17.520 --> 13:23.120
So, the moment the device server responds with the actual contents of the block,

13:24.320 --> 13:29.040
in its state (it has a little database for the state of all the blocks), it remembers that that block,

13:29.040 --> 13:36.400
someone's writing it. The moment it sends the actual current version of the block to the writer,

13:36.400 --> 13:42.480
when the writer, the column on the left, gets the block in memory, it can now consider that cached

13:42.480 --> 13:47.440
block writable. It can actually dirty it in memory. This is the usual file system path you'd

13:47.440 --> 13:53.760
see. And so, now, before we consider the messaging from the other VFS who's trying to read the

13:53.760 --> 14:01.200
block, we have a state where the device server remembered that it gave out a block for write,

14:01.200 --> 14:06.560
and we have a VFS agent who is currently writing it however it wants to. But for whatever reason,

14:06.560 --> 14:13.520
say another VFS wants to read this block. If the first VFS was, you know, removing an entry,

14:13.520 --> 14:18.480
say the other VFS wants to read the directory, right? For whatever reason, it wants to read this

14:18.480 --> 14:23.760
block now. So it sends the same kind of read message, but its little metadata says,

14:23.760 --> 14:28.800
I'm reading this block, actually because I want to read it, I don't want to write it. When the device

14:28.800 --> 14:35.280
server gets that read message, it needs to resolve that write state, right? It can't just give

14:35.280 --> 14:39.600
whatever version of the block it may have because that other node has written a more recent version,

14:39.600 --> 14:43.280
right? So it sends a message to that VFS on the left to say,

14:43.280 --> 14:47.920
hey, somebody's trying to read this thing, give me back the block, please force your cache state

14:47.920 --> 14:55.360
into read mode. So when the VFS on the left gets that message, it moves its version of its cached

14:55.360 --> 14:59.760
block into read-only mode, which, because it had it dirty before, means it has to write the current

14:59.760 --> 15:04.640
version out, right? So it sends that write command back to the devd to say, hey, I've

15:04.640 --> 15:08.720
modified this block, here it is. And as I send you this write message, I tell you,

15:09.360 --> 15:12.880
my cache is read-only, you don't have to worry about me modifying this thing anymore,

15:13.200 --> 15:17.920
I've made sure through my local software constructs that no one's going to modify that thing.

15:19.040 --> 15:25.200
In the middle, when the device server now gets that write command, it knows the block is readable

15:25.200 --> 15:29.680
everywhere. It doesn't have to worry about writers, so it can send the block contents back to the reader.
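As a sketch of that exchange with invented names (this isn't the actual NGNFS protocol or code), the devd side could look roughly like this: reads carry cache intent, and the only extra message is asking a current writer to drop to read mode.

/* Invented names, illustration only. */
enum ngnfs_msg {
	MSG_READ_FOR_READ,	/* "send me the block, I'll only read it" */
	MSG_READ_FOR_WRITE,	/* "send me the block, I intend to dirty it" */
	MSG_WRITE_BLOCK,	/* "here's my modified copy, my cache is read-only now" */
	MSG_FORCE_READ_MODE,	/* devd to writer: "someone wants to read, write back" */
};

enum block_mode { BLOCK_IDLE, BLOCK_READ_SHARED, BLOCK_WRITE_HELD };

struct block_state {
	enum block_mode mode;	/* what the devd remembers about this block */
	int		writer;	/* which node is currently allowed to dirty it */
};

void send_msg(int node, enum ngnfs_msg msg);	/* stubs for the network layer */
void send_block(int node);

/* Roughly what the devd does when a plain read request arrives. */
void devd_handle_read(struct block_state *st, int reader)
{
	if (st->mode == BLOCK_WRITE_HELD) {
		/* Can't hand out a possibly stale copy: ask the writer to flush
		 * and drop to read mode, and reply to the reader only after the
		 * MSG_WRITE_BLOCK comes back. */
		send_msg(st->writer, MSG_FORCE_READ_MODE);
		return;
	}
	st->mode = BLOCK_READ_SHARED;
	send_block(reader);
}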

15:31.040 --> 15:35.920
So, if people are familiar, this looks a lot like, you know,

15:36.560 --> 15:40.400
CPU cache protocols; it's peer-to-peer instead of snooping, all that stuff,

15:41.120 --> 15:46.320
but blocks are in states among different actors, right? Some people have it for read, some people have it

15:46.320 --> 15:51.920
for write, and there's that middle devd process that's in charge of doing the messaging to

15:51.920 --> 15:55.440
coordinate all this access. And what I think is the real

15:57.600 --> 16:00.320
win behind NGNFS is, if you look at this diagram,

16:00.720 --> 16:07.120
if there were no cache coherence here, no multiple actors, a lot of these messages would be there anyway,

16:07.120 --> 16:10.320
right? In fact, if you have to read a block and then you have to write it back out,

16:10.320 --> 16:14.960
all that's present here, the only thing that's extra is that forced read mode message,

16:15.600 --> 16:20.400
right? We don't have these external lock calls to make stuff safe. All we have is sort of

16:20.400 --> 16:25.040
annotating the blocks as we trade them around. So this is kind of the core of the thing.
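To round out the sketch above (still invented names, not real code), the writer side's handling of that one extra message is small, and the write-back it triggers is IO the writer would have issued eventually anyway.

/* Invented names, companion to the devd sketch above. */
enum cache_mode { CACHE_READ_ONLY, CACHE_WRITABLE };

struct cached_block {
	enum cache_mode mode;
	int		dirty;
	void	       *data;
};

void send_write_block(void *data);	/* stub: sends MSG_WRITE_BLOCK back to the devd */

/* Handle a forced read mode request from the devd. */
void handle_force_read_mode(struct cached_block *b)
{
	/* local locking (not shown) keeps new writers out from here on */
	if (b->dirty) {
		send_write_block(b->data);	/* flush the current contents */
		b->dirty = 0;
	}
	b->mode = CACHE_READ_ONLY;		/* promise: no more modifications */
}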

16:25.040 --> 16:29.840
And if anyone has questions throughout this, please holler. I like interaction. It's good.

16:31.200 --> 16:34.320
So this is the core of it. This is the heart of what makes NGNFS interesting.

16:34.320 --> 16:37.920
And this is what the talk is about, just this little cache coherency protocol.

16:38.720 --> 16:43.920
10 minutes, Jesus. Okay, but we can talk about, we can have a specific use case, as a file

16:43.920 --> 16:51.440
system designer, why is this interesting? That comment is in the C file that handles this stuff

16:51.440 --> 16:57.600
in the local VFS, and I think a few of you know who wrote that. So let's take this problem case,

16:57.600 --> 17:02.480
as kind of an interesting example of if we're one of these local VFSs, and we're implementing

17:02.480 --> 17:09.520
a POSIX file system with this read/write but cache-aware network protocol. What does it do for us?

17:09.520 --> 17:13.120
How does it save us work? We can look at this case because it's relatively easy to understand,

17:13.120 --> 17:18.640
but this pattern applies to a lot of shared structures and file systems. There's a lot of stuff

17:18.640 --> 17:22.720
under the hood, like allocators and orphan lists, and all sorts of nonsense that has the same

17:22.720 --> 17:26.880
sort of pattern, but this one's really easy to understand, I think. So we have a problem in a

17:26.880 --> 17:30.800
rename where you can't arbitrarily rename directories anywhere, because you can create loops.

17:31.520 --> 17:39.280
That's what these first two little blocks are showing us, right? We make a chain of directories,

17:39.280 --> 17:44.880
and then if we were to try and move that A directory to be a subdirectory of C, A would still have

17:44.880 --> 17:49.760
B as an entry, right, and you get a loop out of that, and if you try to do that, it yells at you.

17:49.760 --> 17:54.880
You cannot move a directory under itself. There's a lot of other cases like this, but this is the one

17:55.040 --> 18:00.880
that's easy to understand. The way it returns that error message is what we're talking about.

18:01.680 --> 18:05.600
If you look at it, it's not exactly obvious because, well, maybe I don't know.

18:06.320 --> 18:11.040
The rename call in the kernel only takes two arguments. It takes the two directories involved in the rename.

18:11.040 --> 18:15.520
So it doesn't really know that they're related to each other, and doesn't know that they're forming a cycle

18:16.400 --> 18:24.640
initially. So it has to figure that out, and what it does internally is it walks from both directory arguments,

18:24.640 --> 18:29.040
back up to the root, it just walks all the parent directories, and if it finds a relationship

18:29.040 --> 18:34.000
between them that it's not comfortable with, it returns this error. That's what happens in this case, right?

18:34.000 --> 18:37.680
It starts at C, and it's walking parents, and it sees A and says, hold on.

18:38.720 --> 18:43.440
I walked from one of my operands, and I hit the other one, and it's bad news. That would create

18:43.440 --> 18:48.720
the cycle, so it returns the error. But you have to think that while you're walking these parents,

18:49.680 --> 18:54.800
the tests you're doing only make any sense if they don't change while you're walking,

18:54.800 --> 18:59.520
right? And this is the heart of the problem. In the kernel, we do this with a per-namespace mutex,

18:59.520 --> 19:06.320
that's on the sb, the super-block. There's one mutex for the entire namespace. Any time you rename between

19:06.320 --> 19:11.040
directories, you're globally serializing all renames between directories to this day.
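The shape of that check, as a simplified sketch (not the kernel's actual code, and the names are made up): one mutex per mounted namespace, then a walk up the destination's ancestors looking for the source directory.

#include <errno.h>

struct dir { struct dir *parent; };		/* NULL parent at the root */
struct mutex;					/* stand-in for the kernel mutex */
struct sb { struct mutex *rename_mutex; };	/* one per mounted namespace */

void mutex_lock(struct mutex *m);
void mutex_unlock(struct mutex *m);
int  do_rename_blocks(struct dir *src, struct dir *dst_parent);	/* actually move the entries */

int rename_dir(struct sb *sb, struct dir *src, struct dir *dst_parent)
{
	struct dir *p;
	int err = 0;

	mutex_lock(sb->rename_mutex);		/* serializes every cross-directory rename */

	for (p = dst_parent; p != NULL; p = p->parent) {
		if (p == src) {			/* a/b/c, then moving a under c: a loop */
			err = -EINVAL;		/* the error it yells at you with */
			goto out;
		}
	}

	err = do_rename_blocks(src, dst_parent);
out:
	mutex_unlock(sb->rename_mutex);
	return err;
}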

19:11.920 --> 19:18.400
That's all well and good when you're on a laptop. When you're, you know, a bunch of nodes

19:18.400 --> 19:25.120
doing this weird service on behalf of this like archival stack, serializing across the entire cluster,

19:25.120 --> 19:31.680
across all of the machines, is a real problem. So it'd be nice to address this. And that's where

19:31.680 --> 19:36.080
this sort of test, this per-block cache state thing, comes in really powerfully, because

19:36.480 --> 19:43.280
by just implementing this naturally, we solve this problem, because as we saw in that sort of

19:44.880 --> 19:53.040
cache coherency message exchange chart, as we're reading, we're getting a cache state. As we're

19:53.040 --> 19:57.840
walking those parents, we can say, I'm walking these parent blocks to test, you know, these

19:57.840 --> 20:02.480
ancestor relationships. But I'm holding the blocks in read state, no one can write them until I've

20:02.480 --> 20:08.080
finished my search. So just by using the block IOs as you would, you solve this global

20:08.080 --> 20:13.040
rename problem and you don't have to have the global mutex. I've made three little cases here

20:13.040 --> 20:16.880
just to sort of demonstrate the different possibilities, right? The one we started with, where we

20:16.880 --> 20:21.360
return an error here, we only get read references to the blocks; it can be fully concurrent. There's

20:21.360 --> 20:28.080
no writers, there's no serialization. All nodes can be doing that at all times. But even if the

20:28.080 --> 20:34.400
rename does succeed, the points of contention where you'd have to wait for each other are

20:34.400 --> 20:39.200
only the directories being modified. Right, this example on the lower left is sort of a silly

20:39.200 --> 20:43.680
one where, say, you had a really deep directory and you moved it all the way back to the front.

20:43.680 --> 20:47.280
That's what this pattern would look like. It would have read references in the middle because

20:47.280 --> 20:51.200
it was walking the parents to find out it was safe. It was safe and it modified the

20:51.200 --> 20:55.920
directory at the very top and the directory at the very end. In the example in the lower right,

20:56.000 --> 21:01.520
which is more common, you rename between sub directories that aren't ancestors of each other at all.

21:03.280 --> 21:08.400
It'll walk the parents, so they'll get read references on all the blocks and then only modify

21:08.400 --> 21:12.560
the leaf directories. So you can have as many of those as you want concurrently, all sharing read

21:12.560 --> 21:17.600
references on the parents, and what I like about this, what's so interesting, is we didn't have to

21:17.600 --> 21:23.200
do any special engineering, other than writing a file system, to avoid this global serialization

21:23.280 --> 21:28.720
problem. Right, and that's what makes this idea so powerful. And rename is just one example.

21:28.720 --> 21:35.040
This happens with file reads and writes, all sorts of things, and it's why as a file system designer

21:35.040 --> 21:40.080
it's so interesting to work on this stuff because we don't have to spin up all these

21:40.080 --> 21:44.880
you know extra locking relationships to make everything safe like we did in the rename case.

21:46.080 --> 21:50.640
Just by doing your work by assembling the blocks you want to work with, modifying them or not,

21:50.640 --> 21:56.720
and then letting them go, you get concurrency that's safe. And that's it. That's the talk.

21:57.280 --> 22:01.920
It's just a little design overview of what makes this thing interesting. So if there's any questions,

22:01.920 --> 22:03.920
I'd be happy to answer them.

22:04.320 --> 22:14.320
We'll defer to the microphone here.

22:18.560 --> 22:25.120
Speak very close to the microphone. Hi, great talk. One question. Partition tolerance?

22:25.840 --> 22:30.720
What, what? Partition tolerance? I didn't understand. What do you do in the event of a network partition?

22:31.520 --> 22:38.080
Oh boy, that's a whole other layer. There's quorum and such. There's maps of things. There's, yeah, yeah,

22:38.080 --> 22:41.520
all sorts of stuff. Partition. Yeah, that's what I didn't hear.

22:43.360 --> 22:47.360
Need to somehow be able to throw it across the room, right? Yeah, we need to catch boxes like that.

22:47.360 --> 22:48.400
Yeah, yeah, yeah.

22:54.800 --> 22:56.960
What happens if a devd crashes?

22:57.840 --> 23:00.800
So there's redundancy between them. You do RAID on top of them.

23:03.120 --> 23:05.200
And that's a hot mess, but that's part of it.

23:09.440 --> 23:14.080
A little bit. There's some things inside POSIX file systems. There's recovery protocols.

23:14.080 --> 23:17.840
It's stuff like if you had a file descriptor open, there's a few tiny bits of memory that

23:17.840 --> 23:19.440
need to be known all over the place.

23:19.840 --> 23:28.160
What if two nodes have the same read cache over the same region? And then somebody else

23:28.160 --> 23:35.360
does a write on those blocks? Very similar to the forced read mode message we saw.

23:35.920 --> 23:41.360
The devd would send lots of messages that force them to be invalidated. So if a write request

23:41.360 --> 23:45.600
came in, it would send invalidations to all the readers. They'd all drop their reads. And then

23:49.840 --> 23:56.640
it keeps a list of all the addresses that are interested in these block numbers, yeah.

23:56.640 --> 24:17.600
Can you say a bit about the state of production readiness of NGNFS?

24:18.640 --> 24:24.720
It's early days. That's part of what's so interesting about this attempt. We've done some other file

24:25.200 --> 24:31.120
systems. This one we are developing in the open very early. A lot of the design is solid and we're

24:31.120 --> 24:35.280
going to get this thing going. But if you were to look at that git tree, you'd see that it's slim

24:35.280 --> 24:40.320
pickings, but you'll see a lot of activity. It's early still.

24:43.200 --> 24:50.080
A lot of it is done. Do you have an idea of when you would be willing to

24:50.080 --> 24:54.160
upstream this, I guess, right? Yeah, the client. What's interesting about a lot of this is

24:54.160 --> 24:59.520
it's all in user space, but there's a VFS client. And yeah, that's pretty cool.

25:02.160 --> 25:06.000
All right, thanks a lot. Thank you.

