WEBVTT

00:00.000 --> 00:17.600
Good evening and welcome to the talk. Nice to see some of you again. All right. This talk

00:17.600 --> 00:22.840
is about the CephFS file system experience in the Umbrella release, which is the upcoming

00:23.000 --> 00:28.520
Ceph release this year. My name is Venky, I work for IBM, and I'm the lead for the

00:28.520 --> 00:35.560
CephFS team. Okay, you probably would have seen this 10,000 times, but I'll just include

00:35.560 --> 00:41.800
this for brevity. CephFS is a POSIX-compliant, distributed file

00:41.800 --> 00:49.560
system, where the metadata servers and clients cooperate to maintain a set of distributed

00:49.560 --> 00:57.480
caches, including inodes and directories. So, clients cache a lot. The MDS hands out

00:57.480 --> 01:03.480
capabilities (caps) to clients, to delegate a certain part of the tree to them,

01:03.480 --> 01:07.080
so that the clients don't have to get in touch with the metadata server frequently,

01:07.080 --> 01:12.200
thereby improving performance and throughput. And for the file data, the clients actually talk

01:12.200 --> 01:17.880
directly to the OSDs. This is just a high-level view, nothing super interesting here.
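
NOTE
A minimal sketch of the client path described above, using the libcephfs
Python binding (assumptions: python3-cephfs is installed, a ceph.conf and
client keyring exist, and the file system is named "cephfs"):
    import cephfs
    fs = cephfs.LibCephFS(conffile='/etc/ceph/ceph.conf')
    fs.mount(filesystem_name=b'cephfs')
    # open() triggers a capability (cap) request to the MDS for the inode;
    # reads and writes of file data then go directly to the OSDs.
    fd = fs.open(b'/hello.txt', 'w', 0o644)
    fs.write(fd, b'hello cephfs', 0)
    fs.close(fd)
    fs.unmount()
    fs.shutdown()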

01:17.880 --> 01:27.000
Okay, before we go on into Umbrella and whatnot, this is a history of how Ceph and CephFS

01:27.000 --> 01:35.960
have come up. It started around 2006 as Sage's project, his PhD thesis: a scalable, high-performance,

01:35.960 --> 01:43.160
distributed file system. Somewhere around 2011, Inktank was formed, and around 2014,

01:43.160 --> 01:50.280
Red Hat acquired Inktank. For most of that time, CephFS was marked as tech preview,

01:50.280 --> 01:56.040
it was not really production ready, the reason being that recovery tools were still in development,

01:56.040 --> 02:00.200
and things like that. So, if you happened to lose a file system, there was no way to

02:00.200 --> 02:05.240
rebuild it. So, once we introduced those, we officially marked CephFS stable with a single

02:05.320 --> 02:14.680
MDS, somewhere around 2016-ish, the Jewel release, and a year down the line, we introduced multi-MDS

02:14.680 --> 02:21.560
as stable: multiple active MDSs, with directory fragmentation (dirfrags) and also subtree

02:21.560 --> 02:29.320
pinning. Somewhere around Nautilus, 2019, volumes management was introduced in

02:29.400 --> 02:34.840
the manager module, and the concept of the CephFS subvolume came into existence,

02:34.840 --> 02:39.960
and we will talk about that and what we are doing. Pacific was a big release, because we

02:39.960 --> 02:47.560
introduced lots of features: cephfs-top, snap-schedule, let's see, multi-MDS

02:47.560 --> 02:53.240
scrubbing, limited Windows support with ceph-dokan, that is a client for Windows,

02:53.240 --> 02:58.760
the cephfs-mirror daemon, that is the asynchronous snapshot replication for the file system,

02:58.760 --> 03:05.240
and also fscrypt. The kernel fscrypt support, I mean, here, and we will talk about fscrypt

03:05.240 --> 03:13.960
in the user space in upcoming releases. And somewhere around 2024, 2023-ish, the Squid release was also

03:13.960 --> 03:22.040
pretty decent in terms of features: we have quiesce, and improved MDS log trimming,

03:22.280 --> 03:29.560
which was pretty nice, because now you don't get the spurious "MDS lagging behind on trimming"

03:29.560 --> 03:35.960
warning, and we also disabled the automatic balancer, because it was misbehaving a lot.

03:35.960 --> 03:40.360
So that was more or less Squid. I will talk about Tentacle in the next slide, there wasn't

03:40.360 --> 03:44.440
enough space to fit it here, and finally we will go into the Umbrella release, and what's

03:44.440 --> 03:51.960
coming up. So Tentacle was released last year, so let me just do a pretty quick recap

03:51.960 --> 03:57.400
of what was introduced. Case-insensitive directory trees and subvolumes: this is something that

03:57.400 --> 04:02.120
my colleague Patrick worked on last year, to support Samba use cases with CephFS,

04:02.840 --> 04:08.680
because CephFS is a POSIX-compliant file system, file names are case sensitive, while Samba works

04:08.680 --> 04:13.160
the other way around, with case-insensitivity. To improve performance and bring it on par with other

04:13.160 --> 04:17.400
storage systems, we had to do some special handling on the MDS and the client side, so we

04:17.400 --> 04:23.640
introduced case-insensitivity knowledge in the MDS and the client for Samba use cases.

04:24.920 --> 04:30.040
There was also a nice feature for computing file-level differences between snapshots,

04:31.080 --> 04:37.240
so we introduced a blockdiff API, where you can ask the Ceph metadata server, give it a

04:37.240 --> 04:43.240
file name and two snapshots, and it tells you what file offsets and lengths, excuse me,

04:43.240 --> 04:48.040
have changed in that file between the two snapshots. And those snapshots need not be consecutive,

04:48.040 --> 04:53.320
they can be further apart, and the metadata server calculates all that has changed and gives it back to you.
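
NOTE
An illustrative sketch of consuming such a diff: given the changed
(offset, length) extents of a file between two snapshots, copy only those
byte ranges instead of the whole file. The extent list and helper are
hypothetical stand-ins, not the actual blockdiff API:
    # changed_extents: hypothetical diff output, e.g. [(0, 4096), (1048576, 8192)]
    def sync_changed_extents(src, dst, changed_extents):
        for offset, length in changed_extents:
            src.seek(offset)
            dst.seek(offset)
            dst.write(src.read(length))  # copy only the changed range
    with open('/mnt/src/.snap/s2/file', 'rb') as src, \
         open('/mnt/dst/file', 'r+b') as dst:
        sync_changed_extents(src, dst, [(0, 4096), (1048576, 8192)])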

04:55.240 --> 04:59.560
The blockdiff API, obviously, is already being used by the cephfs-mirror daemon,

04:59.560 --> 05:08.360
and now it's much faster. And we also had some requests from the community as part of

05:08.360 --> 05:14.280
Cephalocon 2024, two years back, regarding log replay times. There are a bunch of

05:14.280 --> 05:19.480
asynchronous jobs that run on the MDS, and there was no way to know how much time those would take,

05:19.480 --> 05:25.400
and there were a couple of features here: if you fail over a Ceph metadata server, and the

05:26.120 --> 05:32.920
failover MDS now takes more than 30 seconds to replay its log, you get warnings and also an estimated

05:32.920 --> 05:37.880
log replay completion time. That was pretty neat. Now let's talk about Umbrella.

05:39.080 --> 05:43.560
The main features. These are the main features; I have a couple of slides for these:

05:44.280 --> 05:49.240
CephFS per-subvolume metrics, which a colleague of mine has been actively working on,

05:49.240 --> 05:54.280
we'll talk about this; snapshot visibility, by a colleague as well, done recently;

05:55.400 --> 06:01.160
directory quarantining, you know, we'll talk about this, but basically it's like

06:01.160 --> 06:06.440
quarantining a path so that nobody has access to it, and you need a special key to access

06:06.520 --> 06:12.520
that particular directory. User-space encryption support: as I told, the kernel client already

06:12.520 --> 06:17.320
had the encryption support; all that is ported to the user space. The main users of this

06:17.320 --> 06:24.600
are NFS-Ganesha and SMB. Again, more on the usability side, estimated time of completion for

06:24.600 --> 06:30.840
the disaster recovery tools; you know, whoever has used a DR tool with CephFS knows how painful it is,

06:30.840 --> 06:35.160
because it doesn't tell you anything, absolutely nothing. It just keeps on running, if you have

06:35.240 --> 06:38.840
petabytes of data in CephFS, it probably will run for days, so you have no idea whether

06:38.840 --> 06:45.000
it's making progress or doing anything. So now, with this feature, you know,

06:45.960 --> 06:51.000
ceph status will actually show you the estimated amount of time it will take for a disaster recovery

06:51.000 --> 06:57.400
to finish. A couple of features here: the kernel driver support is anticipated

06:57.400 --> 07:02.120
somewhere down the line in the future. The kernel driver obviously doesn't follow the

07:02.760 --> 07:06.360
user-space timelines or the Ceph release line, so that works a bit differently.

07:08.600 --> 07:12.760
Command audit logging; my colleague and I spoke about this earlier today.

07:13.560 --> 07:18.040
That's also upcoming. Mirroring improvements, we'll talk about this:

07:18.040 --> 07:23.640
major improvements in reporting metrics, and a bit of the last low-hanging performance things.

07:24.280 --> 07:31.480
The MDS tracing framework, again, my colleague Igor has been working on this; and stabilization

07:31.560 --> 07:39.800
of a couple of APIs that are used by NFS and Samba; and also the long-standing QoS, quality of

07:39.800 --> 07:46.840
service, based on dmclock. That's going in as a tech preview, probably. So let's start talking about

07:46.840 --> 07:54.520
each of these features, so, subvolume metrics. So currently, the kernel and user-space drivers forward

07:54.520 --> 07:58.760
client metrics to the MDS. So the client, you know, when it's doing I/O, is trying to

07:58.760 --> 08:04.040
see how quickly a response from the MDS or the data pool is coming, and it aggregates this,

08:04.040 --> 08:10.600
and sends it to the MDS. These are per client, right? These are not per share. So if you've mounted

08:10.600 --> 08:15.480
a particular subvolume, like if NFS has mounted a particular share, which is actually a directory path,

08:16.520 --> 08:21.880
you don't get per-subvolume data, per-subvolume metrics. So that is changing now.

08:21.880 --> 08:30.680
Now, per-share, per-subvolume metrics will be reported for NFS monitoring. The way this is being handled

08:30.680 --> 08:37.800
is, as I told, the MDS has the concept of a cap, a capability for an inode, and we hook

08:37.800 --> 08:44.200
on to that. So whenever the MDS hands out a cap for an inode, it sends along the subvolume

08:44.200 --> 08:49.960
ID on which it lives. So that's basically the inode number, and the client has everything to

08:50.040 --> 08:56.680
track all the metrics for this particular inode number, right? Any I/O

08:56.680 --> 09:03.640
will start with a cap request. So essentially the client knows which subvolume the file or directory

09:03.640 --> 09:09.400
resides in, and it starts tracking those particular metrics. And the whole logic of forwarding

09:09.400 --> 09:15.720
it to the MDS still works. So now you get, like, per-share metrics. This is a sample, a dump of

09:15.720 --> 09:22.520
what it will look like. So these are available with perf dumps. And if you are running other things

09:22.520 --> 09:27.800
like node-exporter daemons, these are plumbed into Prometheus. But you know, perf dump now shows you

09:27.800 --> 09:31.000
the actual subvolume path and the counters for those paths.
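
NOTE
A hedged example of pulling such a dump: "ceph daemon <asok> perf dump"
is an existing command, but the per-subvolume section naming shown here
is an assumption about the new output, not confirmed keys:
    import json, subprocess
    # Query a client admin socket for its perf counters.
    out = subprocess.run(
        ['ceph', 'daemon', '/var/run/ceph/ceph-client.admin.asok',
         'perf', 'dump'],
        capture_output=True, check=True, text=True).stdout
    counters = json.loads(out)
    # Hypothetical: pick out sections keyed by a subvolume path.
    for section, values in counters.items():
        if '/volumes/' in section:
            print(section, values)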

09:34.600 --> 09:42.440
Okay, snapshot visibility. This has been mostly driven by IBM. The user for this is NFS,

09:43.080 --> 09:48.040
primarily where it was desired for the NFS clients to not be able to navigate snapshots.

09:50.040 --> 09:55.880
Snapshots can be created using, you know, every directory in CephFS has a special .snap

09:55.880 --> 10:01.000
directory, and you can create snapshots by just doing an mkdir under that special .snap.

10:01.000 --> 10:06.760
This .snap doesn't show up in listings, but it's traversable, so you can cd into it, it's just not listed.
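
NOTE
Creating a snapshot really is just an mkdir under the special .snap
directory; a minimal sketch, assuming CephFS is mounted at /mnt/cephfs
and snapshots are enabled:
    import os
    data_dir = '/mnt/cephfs/projects'
    # mkdir under .snap asks the MDS to snapshot this directory tree.
    os.mkdir(os.path.join(data_dir, '.snap', 'before-upgrade'))
    # .snap is hidden from readdir of the parent, but listable itself:
    print(os.listdir(os.path.join(data_dir, '.snap')))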

10:06.840 --> 10:13.480
With this snapshot visibility switch, you can restrict access to the .snap directory.

10:13.480 --> 10:19.160
So once you enable this, which is actually on a per-client and a per-subvolume basis, you know, the

10:19.160 --> 10:26.040
clients do not have access to .snap. With the auth caps restricted, the 's' cap, they can already

10:26.040 --> 10:31.320
not snapshot the directory itself, but when you switch this on, they can't even traverse .snap.
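
NOTE
For context: snapshot creation is already gated by the trailing 's' in
MDS auth caps, while the new visibility switch additionally hides .snap
traversal. A sketch of minting a client without the snapshot flag, via
the existing CLI (client and path names are illustrative):
    import subprocess
    # 'rw' without the trailing 's': the client can read and write but
    # cannot create or delete snapshots under the authorized path.
    subprocess.run(
        ['ceph', 'fs', 'authorize', 'cephfs', 'client.app',
         '/volumes/group/share', 'rw'],
        check=True)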

10:31.880 --> 10:37.400
Okay, this is the feature I was talking about which will go in as a tech preview: directory quarantine.

10:39.400 --> 10:43.400
It allows the operator to quarantine a CephFS subvolume, thereby restricting access.

10:43.960 --> 10:48.440
The use case for this is basically safeguarding data during security incidents such as ransomware.

10:49.480 --> 10:54.680
Once a directory is marked as quarantined, new mount attempts on the subvolume are denied.

10:55.640 --> 10:59.800
Existing mount points that mount the subvolume path will be evicted.

10:59.800 --> 11:02.680
They are blocklisted, so they will have to remount, which will be disallowed.

11:03.320 --> 11:08.840
For existing mounts of the parent directory, so if some client has mounted the root of the file

11:08.840 --> 11:14.680
system and you mark a subdirectory as quarantined, all I/O operations on files within the quarantined

11:14.680 --> 11:22.520
directory start returning errors. And the only way to have access to the quarantined directory is for

11:22.600 --> 11:31.160
a cluster operator to hand over a special auth key, which is like a keyring, to the user, which is also

11:31.160 --> 11:36.520
path restricted, and whoever has that keyring is the only one who can actually access it.

11:36.520 --> 11:39.800
So if you don't have that keyring, the directory remains quarantined for you.
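
NOTE
The quarantine-bypass key is a new, tech-preview concept, but
path-restricted keyrings already exist; a sketch of what handing one out
could look like (existing auth-cap syntax, illustrative names):
    import subprocess
    # Mint a keyring whose MDS cap is limited to the quarantined path.
    subprocess.run(
        ['ceph', 'auth', 'get-or-create', 'client.incident-responder',
         'mds', 'allow rw path=/volumes/group/quarantined-share',
         'mon', 'allow r',
         'osd', 'allow rw pool=cephfs_data'],
        check=True)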

11:41.160 --> 11:44.440
This will be a tech preview in the upcoming Umbrella release.

11:46.360 --> 11:51.160
User-space encryption support: as I said, the kernel driver has support for encrypted subvolumes from

11:51.480 --> 11:57.720
kernel 6.6. The entry names are encrypted using the fscrypt library; file data is encrypted with

11:57.720 --> 12:04.360
per-file keys stored in inode metadata. This is compatible with all the user-space tools.

12:05.000 --> 12:09.560
So the fscrypt tools, you know, they work with the ext4 and f2fs file systems.

12:09.560 --> 12:13.160
They transparently work with CephFS.

12:15.000 --> 12:20.280
It's essentially a port of the kernel client's implementation of fscrypt, and the same

12:20.360 --> 12:25.320
primitives and protocols are in use and cross-compatible with the kernel.

12:25.320 --> 12:30.680
So fscrypt-enabled clients have to present a key to unlock a locked directory.

12:31.960 --> 12:38.520
If you don't have the key, you can't unlock it. fscrypt-unaware clients will see the encrypted version

12:38.520 --> 12:44.840
of the files, with names making no sense. My colleague Chris has done an extensive talk

12:44.840 --> 12:48.760
on this a couple of years back at Cephalocon, and I'll link that at the end of the

12:49.720 --> 12:55.480
presentation; that will give you an extremely detailed view of how it's implemented.
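
NOTE
Per the talk, the stock fscrypt user-space tooling works transparently
here; a sketch of the usual fscrypt flow on a mounted path (standard
fscrypt CLI verbs; any extra CephFS-side setup is not covered in the
talk and may be needed):
    import subprocess
    def run(*cmd):
        subprocess.run(cmd, check=True)
    mnt = '/mnt/cephfs'
    run('fscrypt', 'setup', mnt)                 # one-time, per mount
    run('fscrypt', 'encrypt', f'{mnt}/secrets')  # protect a directory
    run('fscrypt', 'lock', f'{mnt}/secrets')     # drop the key
    run('fscrypt', 'unlock', f'{mnt}/secrets')   # present key to unlock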

12:56.920 --> 13:01.880
Okay, some usability improvements: disaster recovery ETC, estimated time of completion.

13:02.760 --> 13:08.040
So as I said, the offline disaster recovery tools offer no or limited progress reporting.

13:10.440 --> 13:15.080
There's a lack of estimated time of completion for tools such as cephfs-data-scan and the journal tool.

13:16.040 --> 13:22.280
You know, if you have ever run them you will know the pain, but now with Umbrella the progress

13:22.280 --> 13:27.720
and ETC, estimated time of completion, is reported by ceph status. So here's a sample.

13:27.720 --> 13:34.920
I've omitted the other HEALTH_OK things. So this is like a cephfs-data-scan scan_extents

13:34.920 --> 13:44.520
sub-task, or maybe let's call it a step, that's being run on

13:45.560 --> 13:49.160
the file system. Basically, scan_extents is going to scan the data pool,

13:50.920 --> 13:57.800
going to each object in the data pool, and start recovery of the metadata, and with this

13:57.800 --> 14:01.560
you know how much time it's going to take. I hope you never have to run this, but if you do,

14:02.600 --> 14:09.320
you actually get a lot of, you know, information on how much time it takes. So in this case, it takes about 27 minutes.
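
NOTE
A sketch of the workflow this helps with: kick off a recovery step and
poll ceph status for the new progress output (both commands exist; the
exact progress and ETC wording in the status output is the new part and
may differ):
    import subprocess, time
    fsname = 'cephfs'  # assumption: the file system name
    # Offline recovery step: rebuild metadata from the data pool.
    scan = subprocess.Popen(
        ['cephfs-data-scan', 'scan_extents', '--filesystem', fsname])
    while scan.poll() is None:
        subprocess.run(['ceph', 'status'])  # now shows progress and ETC
        time.sleep(60)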

14:09.640 --> 14:15.400
Okay, command audit logging. This was the presentation by me and my colleague

14:15.400 --> 14:21.960
very early on in the day. A lack of a historical record of what was run, you know, was making things

14:25.160 --> 14:31.080
complicated in cases where you don't know what exactly happened. So now, with the implementation

14:31.080 --> 14:38.120
of the command audit logging system, you'll get a structured journal, a logical order of what

14:38.200 --> 14:43.800
commands were run on the file system, management commands, you know, when was max_mds

14:43.800 --> 14:48.120
changed, when was a particular config switched on; every single bit of history is there for you

14:48.120 --> 14:52.120
to query, and for us to be able to see what exactly was done.
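
NOTE
No record format was shown in the talk; purely as an illustration, a
structured, queryable audit entry of the kind described might carry
fields like these (hypothetical schema):
    audit_record = {
        'seq': 1042,                      # logical position in the journal
        'time': '2025-06-01T10:42:00Z',
        'entity': 'client.admin',
        'command': 'fs set cephfs max_mds 2',
        'result': 0,
    }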

14:55.160 --> 15:03.960
Improved mirroring. Federico was talking about this in the morning, thank you, and this is going

15:04.040 --> 15:10.760
into the Umbrella release. The Tentacle release significantly improved the

15:10.760 --> 15:18.600
snapshot synchronization performance by using blockdiff. There have been some performance deficiencies

15:18.600 --> 15:23.240
for corner cases that will be addressed with Umbrella, and those

15:23.240 --> 15:31.480
will probably be backported to the Tentacle release. But the main focus here is, you know, metrics

15:31.560 --> 15:37.800
and feedback: how much time it is going to take for snapshots to get asynchronously replicated

15:37.800 --> 15:44.280
to a different cluster. So that was something that was entirely missing, and it has been

15:44.280 --> 15:49.240
actively worked on. Integration with the dashboard is also another thing that will happen.

15:50.040 --> 16:01.720
Tracing framework. So the design implements a hierarchical tracing framework. So, you know,

16:01.720 --> 16:07.960
OpenTelemetry is one of the popular tracing frameworks that is widely used,

16:08.520 --> 16:14.840
and for the MDS tracing framework implementation, you know, the goal

16:14.920 --> 16:21.560
is to integrate with OTel; however, for the Umbrella release we are not fully integrating

16:21.560 --> 16:27.880
with OpenTelemetry, but laying down the foundations for the future. So the way it will

16:27.880 --> 16:36.120
work is, each of those MDS client requests will be tracked inside the MDS, and each of those metrics

16:36.120 --> 16:42.920
will be exported via an MDS command. So this is what it will look like; this particular

16:43.240 --> 16:50.840
JSON, or the metadata, is OTel compatible. So when, in the future, we do an OpenTelemetry integration,

16:51.560 --> 16:57.480
all of this will be nicely viewable in the Jaeger UI or the OpenTelemetry UI. So in this

16:57.480 --> 17:07.640
case, if you look real quick, there's a request from the client that is being handled by the MDS, and these

17:07.720 --> 17:16.680
are the various stages of the request inside the MDS. And you can see the MDS is now tracking,

17:17.240 --> 17:23.640
at a very granular level, how much time each of those steps is taking. And some of these

17:25.480 --> 17:29.800
stages are asynchronous; even that is taken care of. Like, the journal wait is

17:29.800 --> 17:36.920
the MDS submitting a journal record and waiting, not really waiting for it, but, you know, waiting

17:36.920 --> 17:42.600
for a callback for the journal I/O to finish, and the MDS tracing takes care of all of it.
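
NOTE
Purely illustrative: an OTel-compatible trace for one client request
would be a parent span with child spans per MDS stage, roughly like this
(hypothetical field names, not the actual MDS command output):
    trace = {
        'trace_id': 'a1b2c3',
        'span': {'name': 'client_request:create', 'duration_us': 1800},
        'children': [
            {'name': 'acquire_locks', 'duration_us': 120},
            {'name': 'early_reply', 'duration_us': 90},
            # async stage: journal record submitted, completed on callback
            {'name': 'journal_wait', 'duration_us': 1500},
        ],
    }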

17:42.600 --> 17:52.280
Okay, the libcephfs asynchronous and zero-copy APIs: these were there in the

17:52.600 --> 17:58.360
CephFS user-space client, but they were not that stable. Main users of this are, again,

17:58.360 --> 18:09.000
NFS-Ganesha and SMB; the primary reason they were not being used was the

18:09.000 --> 18:17.560
lack of testing. So now, with the Umbrella release, at least workload testing has been started

18:17.560 --> 18:22.280
with Ganesha clients. So we test ceph-fuse, which uses this client, the kernel client, and now the

18:22.280 --> 18:30.120
Ganesha client, so all of the workloads will run with all three. And for NFS and Samba, I believe

18:30.120 --> 18:38.200
Samba too has the integration to use the asynchronous I/O interfaces with libcephfs, and those

18:38.200 --> 18:42.440
will probably also be marked as stable. NFS has the integration, but they have switched it off;

18:43.400 --> 18:53.320
once we mark it as stable, NFS will too mark it as usable. Okay, last but not the least, I guess,

18:54.040 --> 19:01.880
a long-standing request: CephFS MDS quality of service. This feature has been contributed by LINE

19:01.880 --> 19:09.400
Corporation; they spoke about this at Cephalocon 2023, and I believe the

19:10.120 --> 19:17.480
video is out there. They have been using it for long, and they did submit a pull request; however,

19:18.040 --> 19:25.240
we, you know, couldn't make time to review the pull request, but now we have taken efforts to do that,

19:25.240 --> 19:30.840
add tests and documentation, which were actually lacking in the pull request. However, it will be a

19:30.840 --> 19:36.120
tech preview again, so you have to, like, enable it; it will be disabled by default,

19:37.080 --> 19:43.320
and once we are, like, you know, happy with our tests and documentation, mostly tests, is when we

19:43.320 --> 19:50.120
will make it open for use. Okay, here are some links for some of the things I spoke about:

19:51.000 --> 19:58.360
the tracing changes, the QoS changes, documentation on how to extract subvolume metrics,

19:58.360 --> 20:04.200
and a detailed user-space fscrypt talk from Cephalocon a couple of years back.

20:06.200 --> 20:08.200
Yeah, I'll be happy to take questions now.

20:19.480 --> 20:23.320
So, I don't know if this is the right place for this question, but

20:24.680 --> 20:31.000
say I know how to administer Ceph, and I want to learn how to build on

20:31.000 --> 20:37.720
it toward something that's designed for AI workflows, like Lustre. I'd like to know

20:37.720 --> 20:44.680
how hackable CephFS is, how to tune it to get closer to something that can service

20:44.680 --> 20:54.040
a GPU cluster for high performance. Okay, so your question is basically how do I tune a

20:54.040 --> 21:02.680
CephFS file system to work well with, say, an AI workflow? Okay, so what typically would be the

21:02.680 --> 21:12.040
AI workflow you're looking at? Let's say you have a cluster with some GPUs and you want to be able to

21:13.080 --> 21:18.600
reach, like, certain IOPS and, you know, very high throughput,

21:19.240 --> 21:27.320
and, you know, I know it's a bit special, but with the setup we have

21:27.320 --> 21:34.120
in production, the throughput does not reach that, because of all the overhead

21:34.120 --> 21:43.160
of replication and, you know, the MDS overhead and so on. But, like, I've heard of certain

21:43.160 --> 21:48.840
numbers from other people who do use Lustre, and I was wondering if I can match them, like,

21:49.960 --> 21:55.720
build a custom CephFS with sort of tuning tweaks, just to, you know, get closer to them.

21:57.080 --> 22:03.080
Okay, I don't have an answer right now, frankly, but, you know, if you have specific cases

22:03.640 --> 22:12.440
we can talk about it. But supporting some kind of Lustre-like setup and

22:12.600 --> 22:18.200
achieving Lustre kind of throughput, because Lustre is pretty much, you know, everything is hardware-

22:18.200 --> 22:23.880
based and, you know, does quite special things; that is not what CephFS is. So

22:23.880 --> 22:29.000
that requires a special kind of setup, you know; you might want some special configurations,

22:29.640 --> 22:33.240
which we can talk about, but I don't have anything off the top of my head to give you as an answer.

22:35.640 --> 22:40.360
But AI workloads are nasty. I mean, I recently saw an AI workload that was trying to list

22:40.360 --> 22:47.640
a directory with one million files or 10 million files; that is not going to scale with any

22:47.640 --> 22:58.120
file system, leave alone CephFS, right? Not even Lustre, I think, maybe not even Lustre.

22:59.000 --> 23:06.280
Yeah, I mean, it's just management, you know, they don't want us to be the bottleneck for, you know,

23:07.480 --> 23:17.240
all these new AI workloads, and our customers, you know, tell us, "we have this from this provider,

23:17.240 --> 23:26.760
can you, you know, provide this kind of performance?" It's just, like, I'm trying to understand

23:26.760 --> 23:33.320
if it's, like, a technological limitation, or is it something else, like, do I need to

23:33.320 --> 23:41.800
think about hardware or different kinds of, you know... Well, it's not a technological

23:41.800 --> 23:46.920
limitation, but, at least, you know, sometimes the applications have to be written

23:46.920 --> 23:52.920
sensibly, right? Listing 10 million files in a directory is not going to lead to nice things,

23:53.080 --> 23:58.200
so sometimes you have to just write the application in a more sensible way.

23:59.240 --> 24:05.000
And I have seen in the past that folks try to run an application that was previously working well

24:05.000 --> 24:12.280
with a local file system and expect it to work the same way on a distributed file system; that might

24:12.280 --> 24:19.000
not be an achievable target, you know. I mean, you look at it and everything seems fine,

24:19.000 --> 24:25.160
but behind the scenes things are very difficult; there are things that local file systems can

24:25.160 --> 24:32.040
do, because they're local, that distributed file systems cannot. I'm going to come across as

24:32.360 --> 24:45.640
kind of a layman, and maybe I am, but I've heard a coined word called GPUDirect, like, something

24:45.640 --> 24:54.200
that basically bypasses the CPU to read from the disk, which is, you know, the storage engine,

24:54.840 --> 25:03.880
straight into the GPUs without bothering the CPUs; is it something you've heard of? No, I haven't, like...

25:08.120 --> 25:12.120
Yeah, like, I haven't, but not really, I mean...

25:24.360 --> 25:28.360
right

25:36.360 --> 25:43.880
Yeah, pretty much, yes. Oh, sorry, your question is whether CephFS [inaudible]... yeah, pretty much.

25:44.360 --> 25:49.320
so

25:51.320 --> 25:55.320
thank you

