WEBVTT

00:00.000 --> 00:09.360
Hey folks, welcome to this talk. Good afternoon.

00:09.360 --> 00:10.360
My name is Venky.

00:10.360 --> 00:16.240
I am the tech lead for the file system team, and I have Dhairya, my colleague.

00:16.240 --> 00:22.480
We both work at IBM on the file system team, and this talk is about a command auditing

00:22.480 --> 00:25.720
framework for quicker cluster rescue.

00:25.720 --> 00:27.320
Let's see what this is.

00:27.320 --> 00:34.080
Okay, so anybody who has operated a Ceph cluster knows there are debugging challenges.

00:34.080 --> 00:39.760
As long as things are working fine, it's good, but as soon as you hit a roadblock, cluster warnings

00:39.760 --> 00:45.280
and things like that, that's the time that you want to see what's going on with

00:45.280 --> 00:46.280
the cluster.

00:46.280 --> 00:50.440
You want to be able to debug it, you want to know what's going on.

00:50.440 --> 00:52.800
And Ceph debugging is not straightforward.

00:53.040 --> 00:55.560
It presents significant challenges.

00:55.560 --> 01:00.520
Sometimes it requires just going through the docs, sometimes not.

01:00.520 --> 01:07.080
At times, when the issues are complex, you would want to get in touch with the community

01:07.080 --> 01:10.080
and the actual developers and see what's going on.

01:10.080 --> 01:14.320
So it's not straightforward, right?

01:14.320 --> 01:19.920
And on top of that, coming from a file systems background, the CephFS-specific

01:19.920 --> 01:24.400
challenges are even more demanding.

01:24.400 --> 01:28.920
There are lots of ways to change cluster state, to play around with different bits like

01:28.920 --> 01:33.080
the MDS map and different file system components.

01:33.080 --> 01:40.480
There is not a handful, but a kitchen sink of commands, I would say.

01:40.480 --> 01:47.600
And each command has arguments, many arguments, and if you mistakenly swap them,

01:47.600 --> 01:49.880
you don't know what you did.

01:49.960 --> 01:56.240
We have put safeguards in place in commands which are riskier to operate, something

01:56.240 --> 02:01.520
like "yes, I really, really mean it", and that deliberately makes the user or the cluster

02:01.520 --> 02:06.200
operator stop and think: this is dangerous, should I still execute it?

02:06.200 --> 02:08.320
But it's still abused.

02:08.320 --> 02:14.520
We have seen reports from folks who just copy-paste commands,

02:14.520 --> 02:17.320
and it happens all the time.

02:17.720 --> 02:21.680
The other CephFS challenge is that we have a set of disaster recovery tools.

02:21.680 --> 02:27.040
So if for some reason you lost a PG in Ceph, you know, a placement group, and that

02:27.040 --> 02:32.760
placement group was holding some of the MDS metadata, you're toast,

02:32.760 --> 02:38.560
because you recover the lost PGs, and then your MDS metadata is gone,

02:38.560 --> 02:40.120
your file system is gone, right?

02:40.120 --> 02:44.080
So you have to now run a set of disaster recovery tools to rebuild them.

02:44.080 --> 02:50.040
Unfortunately, those do not leave any trace of what's going on, so it

02:50.040 --> 02:57.560
becomes extremely challenging. Even with the disaster recovery tools, the order of execution

02:57.560 --> 02:58.560
matters.

02:58.560 --> 03:04.880
So our docs are well laid out, yes, but sometimes folks

03:04.880 --> 03:08.600
just execute things out of order, and that leads to a lot of complexity.

03:09.520 --> 03:14.680
Okay, so let's see a scenario: a Ceph file system issue is reported

03:14.680 --> 03:16.320
on the users list.

03:16.320 --> 03:17.320
How does one go about it?

03:17.320 --> 03:18.320
How would I go about it?

03:18.320 --> 03:22.880
When I see an issue on the mailing list, some of the reports

03:22.880 --> 03:27.480
are detailed, some are not. Unfortunately, we can't expect everyone to provide

03:27.480 --> 03:32.600
the most meticulous details on the mailing list, or in

03:32.600 --> 03:34.520
Slack, or IRC, whatever.

03:34.520 --> 03:41.240
So we have to go back and forth with users to see what was done, right, what

03:41.240 --> 03:46.440
commands were run, or whether the hardware is fine. And not to blame the user, not blaming

03:46.440 --> 03:52.800
anyone, but sometimes the detail we get back is very, very

03:52.800 --> 04:00.080
terse, so we have to work with that. So once we have all that information,

04:00.080 --> 04:04.840
we do some kind of a post-mortem, yes, and we try to help the user.

04:04.840 --> 04:09.720
So for that, what I would look for, as a Ceph developer, would be the exact

04:09.720 --> 04:12.960
set of commands that were run and their completion state.

04:12.960 --> 04:17.800
The completion state is very important, because if a particular command was run, say

04:17.800 --> 04:24.120
manipulating the MDS size, right, the number of MDSs, and that didn't run to completion,

04:24.120 --> 04:29.920
I want to know that, right, because my debugging is based

04:29.920 --> 04:30.920
on that context.

04:30.920 --> 04:37.020
And at times, I also want to know the MDS map. The MDS map is basically

04:37.020 --> 04:41.520
metadata about which MDSs are in the cluster, which one is handling the file system, and

04:41.520 --> 04:46.080
what the state of each MDS is, and many other bits, but I also want to know

04:46.080 --> 04:47.080
that, right?

04:47.080 --> 04:51.080
So now I'm building up the context for debugging.

04:51.080 --> 04:57.200
Most importantly, as I mentioned, I want to know if disaster recovery was

04:57.200 --> 04:59.880
ever executed on the file system.

04:59.880 --> 05:06.080
There have been cases where someone got a broken Ceph

05:06.080 --> 05:12.800
cluster because of a PG loss, ran disaster recovery, recovered it, and possibly a couple

05:12.800 --> 05:14.280
of months down the line,

05:14.280 --> 05:17.160
some issue happened that was related to disaster recovery.

05:17.160 --> 05:21.400
So I want to know that, right, and who remembers, a couple of months down

05:21.400 --> 05:22.960
the line, what they did?

05:22.960 --> 05:23.960
Okay.

05:23.960 --> 05:29.040
So, especially for performance issues, I also want to have a periodic

05:29.040 --> 05:30.040
set of perf dumps.

05:30.040 --> 05:34.040
So this is what I'm looking at, in a broader sense.

05:34.040 --> 05:41.480
So the question is, how much history am I looking at, or is anyone looking at?

05:41.480 --> 05:45.760
Cluster commands: whatever commands were recently executed, right?

05:45.760 --> 05:51.160
MDS map: I want it for a bounded time frame, maybe from when a particular

05:51.160 --> 05:54.200
issue started showing up, I want it from then.

05:54.200 --> 05:59.440
Perf dumps: again, I only want to know the perf dumps from when

05:59.440 --> 06:01.960
certain performance issues started appearing.

06:01.960 --> 06:04.840
Disaster recovery: I want to know it as far back as possible.

06:04.840 --> 06:09.120
I want to know if it was ever run, maybe five years ago, maybe ten years ago, but I

06:09.120 --> 06:12.240
want to know that, because that changes my entire debugging context.

06:12.240 --> 06:13.240
Okay.

06:13.240 --> 06:17.640
So we have some kind of audit logging in Ceph,

06:17.680 --> 06:19.320
and it's implemented in the Ceph monitor.

06:19.320 --> 06:24.640
So let's talk a bit about that to see if it's enough or not.

06:24.640 --> 06:29.640
So this component in the Ceph monitor is called the

06:29.640 --> 06:33.720
monitor logs, and this is what it does, or provides.

06:33.720 --> 06:38.560
It provides a service where Ceph daemons can send custom logs to the monitor.

06:38.560 --> 06:40.120
These are like audit logs.

06:40.120 --> 06:46.080
They can be small strings sent to the monitor for recording.

06:46.080 --> 06:50.120
All the daemons in Ceph make use of the monitor client to talk to the monitor,

06:50.120 --> 06:52.520
to authenticate; that's the very first step.

06:52.520 --> 06:56.000
So the same monitor client can be used to send these audit logs.

06:56.000 --> 07:01.240
When the monitor receives these audit logs, they are stored in Paxos, so the Paxos

07:01.240 --> 07:06.600
service in the monitor that deals with these audit logs is called the log monitor.

07:06.600 --> 07:09.280
And these audit logs are stored in the monitor DB store.

07:09.280 --> 07:11.080
This is local storage, right?

07:12.040 --> 07:14.160
And the crucial part is these are periodically trimmed.

07:14.160 --> 07:19.560
If the monitor did not trim its store, it would keep

07:19.560 --> 07:23.920
growing forever, consume disk space, and eventually run out of disk space. We

07:23.920 --> 07:24.920
don't want that.

07:24.920 --> 07:30.640
So there is periodic trimming of the audit logs in the monitor.

07:30.640 --> 07:32.360
So, these are the downsides, right.

07:32.360 --> 07:36.120
The use of local storage for storing these audit logs is counterproductive.

07:36.120 --> 07:40.320
If I lose it, I lose all the audit logs.

07:41.040 --> 07:45.840
Logs are periodically truncated to mitigate unbounded growth, as I said; you can't

07:45.840 --> 07:47.080
leave it to unbounded growth.

07:47.080 --> 07:50.760
So we have to trim it periodically.

07:50.760 --> 07:56.480
So if, say, disaster recovery tools started to send these audit logs to the Ceph monitor, it

07:56.480 --> 07:57.480
would be recorded, yes.

07:57.480 --> 08:01.680
But I won't know if disaster recovery was ever executed 10 years ago, because

08:01.680 --> 08:03.320
the monitor would have trimmed it.

08:03.320 --> 08:07.080
That is crucial for my debugging, but we don't have that information.

08:07.840 --> 08:11.960
Another idea that came up was why doesn't the monitor just store the audit logs in

08:11.960 --> 08:13.000
RADOS, right?

08:13.000 --> 08:19.480
That is a possibility; however, the monitor cannot really talk to RADOS.

08:19.480 --> 08:26.680
That would lead to very bad feedback loops, where the monitor, to

08:26.680 --> 08:31.880
dump something in RADOS, has to talk to itself, and again talk to RADOS.

08:31.880 --> 08:36.440
So this is a cyclical dependency, and that really cannot

08:37.000 --> 08:38.520
work.

08:38.520 --> 08:43.960
And regarding the audit log itself, these can be small strings; they are acceptable.

08:43.960 --> 08:48.080
But I don't just need small strings of the commands.

08:48.080 --> 08:51.840
I want to know the MDS map periodically, I want to know the perf dumps.

08:51.840 --> 08:57.400
I want to be able to store nice JSON dumps, so that I can have access to them.

08:57.400 --> 09:02.560
And even if I was able to do that, I don't have a mechanism to read it

09:02.640 --> 09:03.680
back nicely.

09:03.680 --> 09:09.200
There are commands that can pull these logs from the monitor,

09:09.200 --> 09:10.560
but they are not flexible.

09:10.560 --> 09:16.240
I can't say, give me the logs from last month to the middle of this month.

09:16.240 --> 09:17.360
No, I can't say that.

09:17.360 --> 09:18.400
So it's very crude.

09:18.400 --> 09:20.880
It is there, but it's very crude.

09:20.880 --> 09:27.760
So these are the challenges, right? The monitor has the log store,

09:27.760 --> 09:31.440
but commands may be distributed between the monitor and the manager, right?

09:31.440 --> 09:35.440
We have the Ceph manager component; some commands execute in the monitors, some commands in the

09:35.440 --> 09:36.720
Ceph manager.

09:36.720 --> 09:41.040
So the challenges: were any of the commands unsuccessful?

09:41.040 --> 09:42.240
I don't know that.

09:42.240 --> 09:47.600
I want to know the precise arguments, I want to know the disaster recovery state, and I also want

09:47.600 --> 09:53.920
to somehow correlate a chronological sequence of what was executed on a cluster.

09:53.920 --> 09:55.160
So these are the challenges.

09:55.400 --> 10:05.160
Okay, the proposal, and this is a feature that we have been working on, is decentralized

10:05.160 --> 10:06.200
audit logging.

10:06.200 --> 10:12.440
So these are audit logs that are stored in RADOS, in one of the RADOS pools,

10:12.440 --> 10:16.120
and that is done by the Ceph manager.

10:16.120 --> 10:23.240
And this will provide a structured chronological order of command history.

10:23.640 --> 10:30.440
With this, it's possible to track the precise sequence of disaster recovery steps and

10:30.440 --> 10:35.640
monitor the execution of commands, you know, how much time they take. We also

10:35.640 --> 10:39.880
record every single piece of information, like what flags were used, whether the

10:39.880 --> 10:45.640
command ran to completion, and whether a dangerous flag like "yes, I really, really mean it" was used.

10:46.680 --> 10:50.280
And also a query mechanism to pull out commands which failed,

10:50.840 --> 10:56.680
timestamps, structured entries. And with all this information,

10:56.680 --> 11:01.320
you can imagine debugging becomes much easier.

11:01.880 --> 11:08.520
All I have to do, when a user reports an issue, is

11:08.520 --> 11:14.040
give the user a command to pull out, say, the last one month of command log history,

11:14.040 --> 11:18.280
and probably some other steps, but then I have all the information for debugging.

11:20.920 --> 11:27.160
So, really quickly, the log monitor: I already spoke about this.

11:27.160 --> 11:31.720
The audit command thing is there in the monitor, but we can't make

11:31.720 --> 11:36.520
use of it. We have all the things in the log monitor, like who is sending, which

11:36.520 --> 11:41.560
daemon is sending, which sequence, which channel, what time, for every single thing.

11:42.920 --> 11:50.200
Okay. So now the idea here is: the monitor cannot dump into RADOS. So

11:50.840 --> 11:54.680
there has to be a way for the monitor to relay commands to the Ceph manager, so that the

11:54.680 --> 12:00.520
Ceph manager can go and record them in RADOS. So there is something called audit

12:00.520 --> 12:05.080
log relay. The log monitor supports subscribing to a log channel. It is already there. Okay.

12:05.640 --> 12:11.000
And the subscribers can request a one-time digest of logs, and after that, once you request

12:11.000 --> 12:18.120
the digest of logs, the monitor will keep on sending you incremental audit logs, after a

12:18.200 --> 12:24.680
Paxos proposal is made. So the simplest thing to do is for the Ceph manager to subscribe

12:24.680 --> 12:30.440
to the monitor log. So there's a new manager module for it. It's called

12:30.440 --> 12:36.600
auditman, and there's a new pool, the .audit pool. When auditman starts up the very first time,

12:36.600 --> 12:43.560
it subscribes itself to the log monitor on the audit channel, and once it subscribes,

12:43.640 --> 12:49.560
the monitor will keep on notifying it of the audit logs, and the auditman module,

12:49.560 --> 12:55.800
all it does is talk to RADOS via SQLite, because all these audit logs are

12:55.800 --> 13:01.800
going to end up in the pool, in a database, and the database is SQLite,

13:01.800 --> 13:06.840
for which the backing store is RADOS. So the manager

13:06.840 --> 13:12.120
will be changed to use this particular framework, and we start recording it in SQLite.

13:13.560 --> 13:26.520
And I'm going to hand it over to Dhairya for the database snippets.

13:44.360 --> 13:53.800
Am I audible? Thanks. So, to store the audit logs, we would be

13:53.800 --> 14:04.280
making use of the libcephsqlite loadable extension. It provides a way to create a

14:04.280 --> 14:11.800
SQLite VFS, which would be interfacing with the SQLite databases

14:11.880 --> 14:18.360
which are stored in RADOS. And one of the reasons to use

14:18.360 --> 14:25.480
libcephsqlite-managed SQLite databases is that it

14:25.480 --> 14:32.360
allows multiple clients to access the database in a serial fashion,

14:33.160 --> 14:38.600
which is managed by the RADOS locks provided by the libcephsqlite

14:38.680 --> 14:48.440
VFS. The libcephsqlite link is down below. So, let us discuss the core part of the

14:48.440 --> 14:54.200
audit logging framework, which is the flow of commands from the daemons to the databases.

14:54.760 --> 15:00.280
The framework would be making use of a new pool called the audit pool to store the databases and

15:00.280 --> 15:07.720
the audit logs. All the Ceph CLI commands, whether it's a monitor command or a

15:07.800 --> 15:12.760
manager command, would be routed via the manager, the reason being that the monitors cannot

15:12.760 --> 15:18.920
directly communicate with RADOS, right? So from the manager, the audit logs would be

15:18.920 --> 15:24.680
reaching the respective databases. The manager would be

15:24.680 --> 15:31.720
invoking the SQLite APIs from the library to execute the SQL queries and record the commands

15:31.880 --> 15:39.480
into the respective databases in the audit pool. So here's what it would look like.

15:40.120 --> 15:47.560
We start from the Ceph CLI; from there, you have the command. If it's an MGR command,

15:47.560 --> 15:52.760
the monitor will hand it over to the Ceph manager daemon,

15:52.760 --> 16:00.760
and from the manager daemon, it invokes the SQLite 3 library and it routes via the Ceph SQLite

16:00.760 --> 16:08.200
VFS, and then the audit log is committed to the underlying MGR audit DB, which is persisted in the

16:08.200 --> 16:14.280
audit pool. If it's a monitor command or any command coming from the other entities,

16:14.280 --> 16:21.080
it's stored in the Paxos store, and on a demand basis, the manager would

16:21.080 --> 16:26.200
be requesting the audit logs, making use of the log monitor, which would be fetching the

16:26.280 --> 16:32.840
log summary from the Paxos store, and the log monitor would process it to fetch the log entries

16:32.840 --> 16:39.800
and return the audit logs to the manager, and then the flow is the same as for the MGR audit logs.

16:40.600 --> 16:47.880
So, this is how it would be. As Venky discussed earlier,

16:49.240 --> 16:55.240
persisting the disaster recovery usage across the cluster is very

16:55.320 --> 17:01.800
important. But the disaster recovery tools are quite different from the daemon commands;

17:02.600 --> 17:09.880
they need to be invoked directly via the binaries. So we are going to handle this with

17:09.880 --> 17:16.200
a separate database, which would be reserved for the recovery tools. All the recovery

17:16.200 --> 17:20.440
tools, whether it's the journal tool or the data scan tool, would be making use of the same

17:20.600 --> 17:25.960
offline tools audit DB, and all the tools will be linked to

17:25.960 --> 17:31.800
libcephsqlite to facilitate this. The link to the disaster recovery tools is down below.

17:32.840 --> 17:41.480
This is how it would be for the offline tools. Start from the offline tool, the

17:41.480 --> 17:47.800
invocation of the offline tool. The offline tool would try to access the underlying

17:48.680 --> 17:54.280
databases via the SQLite 3 library and the Ceph SQLite VFS layer,

17:55.720 --> 18:01.080
and once the connection has been established, we would start writing the log entries,

18:01.080 --> 18:07.160
which could be the cephfs-journal-tool journal or header or any event, into the underlying

18:07.160 --> 18:17.480
audit pool. And once we are done writing it, the tools cannot delete or close or lock

18:17.560 --> 18:22.280
the DB; they can just release the memory buffers, which would be the data, the handles and some

18:22.280 --> 18:31.080
other stuff, the in-memory structures that they would be using. And so, yeah, this is how

18:31.080 --> 18:40.520
it would be for the offline tools. So, the current DB schema: the current audit log

18:40.600 --> 18:47.640
would consist of six fields. The first one would be the sequence number,

18:47.640 --> 18:53.960
which would be fetched from the monitor for the manager or the offline tool logs.

18:53.960 --> 18:58.840
Then there's the command which is executed, the initiation time and the completion time, which would

18:58.840 --> 19:04.920
be timestamps. Then there's the status of the command, whether it has passed or

19:05.000 --> 19:11.080
failed or is pending. And then there's the return value, which is the retval from

19:11.080 --> 19:22.360
the command execution. Auditman to the rescue. So auditman is going to be a

19:22.360 --> 19:29.960
Ceph manager plugin, and it would be used to access the audit logs. So, the key features

19:30.120 --> 19:36.600
include a rich and extensive set of commands that would be able to fetch and retrieve

19:36.600 --> 19:43.080
audit logs in different formats and different ways. It is capable of querying

19:44.280 --> 19:51.560
audit databases for the manager, the monitor, or the offline tool DB.

19:51.560 --> 19:58.040
And there is also a retention policy in place, which would keep only the last K logs,

19:59.000 --> 20:05.240
in order to not breach the limit and overgrow the size of the databases. It can be set

20:05.240 --> 20:11.800
via this command, ceph fs audit retention, then the DB and then the log count. Say

20:12.680 --> 20:20.600
you just want to retain the last 100K logs, you can do that.
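
NOTE A minimal sketch, not the project's code, of the six-field audit record
and the keep-last-K retention policy described above. It uses Python's sqlite3
with an in-memory database standing in for the RADOS-backed one. The table
name, column names and helper functions are assumptions for illustration.
```python
import sqlite3
# Hypothetical table and column names; the talk only lists the six fields.
SCHEMA = """
CREATE TABLE IF NOT EXISTS audit_log (
    seq        INTEGER PRIMARY KEY,  -- sequence number
    command    TEXT NOT NULL,        -- command as executed, with arguments
    init_time  REAL NOT NULL,        -- initiation timestamp
    done_time  REAL,                 -- completion timestamp (NULL while pending)
    status     TEXT NOT NULL CHECK (status IN ('passed', 'failed', 'pending')),
    retval     INTEGER               -- return value of the command
)
"""
def open_db():
    db = sqlite3.connect(":memory:")  # stand-in for the RADOS-backed database
    db.execute(SCHEMA)
    return db
def record(db, seq, command, init_time, done_time, status, retval):
    db.execute("INSERT INTO audit_log VALUES (?, ?, ?, ?, ?, ?)",
               (seq, command, init_time, done_time, status, retval))
def apply_retention(db, keep_last):
    # Keep only the newest `keep_last` entries by sequence number.
    db.execute("DELETE FROM audit_log WHERE seq NOT IN "
               "(SELECT seq FROM audit_log ORDER BY seq DESC LIMIT ?)", (keep_last,))
def sequences(db):
    return [row[0] for row in db.execute("SELECT seq FROM audit_log ORDER BY seq")]
```
Calling apply_retention(db, 3) on a table holding sequence numbers 1 to 5 leaves
only 3, 4 and 5, which mirrors "keep only the last K logs".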

20:20.920 --> 20:28.120
This is how it would be: if you start from the Ceph CLI, the command,

20:28.120 --> 20:36.040
ceph fs, MGR or whatever, reaches the Ceph MGR via the monitor, and once that is done,

20:36.040 --> 20:42.760
the audit plugin would be constructing the URI, because we need to determine which database we need

20:42.760 --> 20:49.320
to access, whether it is monitor or manager or disaster recovery. And once that is done,

20:49.400 --> 20:55.400
we need to append some stuff like the .audit pool, and the URI would look something like this:

20:56.760 --> 21:02.520
file, triple slash, the audit DB, and then the VFS, which is going to be the Ceph SQLite

21:02.520 --> 21:09.640
VFS. And then it would make use of a read-only handle to communicate and establish the

21:09.640 --> 21:15.880
connection with the audit pool, and once that is done, we would be executing the

21:16.840 --> 21:22.760
SELECT query to read from the respective database, and then the audit plugin would be formatting

21:23.640 --> 21:32.200
the retrieved data and then send it to the CLI. So, these are some of the commands.
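
NOTE A small sketch of the URI-building step just described. It assumes the
libcephsqlite URI form file:///POOL:NAMESPACE/DB?vfs=ceph and SQLite's standard
mode=ro URI parameter for a read-only handle. The .audit pool name comes from
the talk; the database file names and the helper itself are hypothetical.
```python
from urllib.parse import urlencode
# Hypothetical mapping from an audit source to its database file name.
AUDIT_DBS = {"mgr": "mgr_audit.db", "mon": "mon_audit.db", "offline": "offline_tools_audit.db"}
def audit_db_uri(source, pool=".audit", namespace=""):
    """Return a read-only SQLite URI for the audit database of `source`."""
    if source not in AUDIT_DBS:
        raise ValueError("unknown audit source: " + source)
    query = urlencode({"vfs": "ceph", "mode": "ro"})
    return "file:///{}:{}/{}?{}".format(pool, namespace, AUDIT_DBS[source], query)
```
Opening such a URI needs the libcephsqlite extension loaded into SQLite first;
in plain Python a URI is opened with sqlite3.connect(uri, uri=True).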

21:33.000 --> 21:39.560
um, the command by default only fetches the last 100 records, obviously for sanity, uh, by default,

21:39.640 --> 21:45.880
it would be just ceph fs audit mgr. The mgr over here is the daemon, sorry,

21:45.880 --> 21:51.720
it is a placeholder for the DB that you want to use. Here we are trying to fetch the audit

21:51.720 --> 22:00.200
logs for the MGR daemon. It would retrieve all six fields and show them to you in

22:00.200 --> 22:08.200
JSON or a blob. And then there are various ways to retrieve commands. Say you want to

22:08.920 --> 22:15.320
fetch the last 1000 entries, but you only want those after a particular sequence number or before

22:15.960 --> 22:21.800
a sequence number. You also have the count, which would return the number

22:21.800 --> 22:27.160
of times a command was executed. Then there's brief: just the ID, the sequence number,

22:27.160 --> 22:34.680
and the command. There's also recent N, which means the last 100, 1000 or 1 lakh logs,

22:34.760 --> 22:39.160
and then if you want to fetch only particular fields, like the sequence number, the command,

22:39.160 --> 22:45.160
the completion time, you have that, and you can also audit based on time, whether you want to do it

22:45.160 --> 22:50.920
from a particular date or to a particular date, you can do that. You also have the range,

22:50.920 --> 22:58.120
from this day to that day, from D1 to D2. You can also order the logs

22:58.120 --> 23:03.560
via order-by, with a field and an order, ascending or descending.

23:04.280 --> 23:10.360
If you want to fetch maybe the last day's logs, or maybe the last year's logs, you can do that,

23:10.360 --> 23:16.280
and we can also fetch the audit logs by a

23:16.280 --> 23:20.920
particular field, or maybe all the passed commands, or all the pending commands,

23:21.720 --> 23:29.160
and we can fetch the audit logs as pretty-printed JSON.
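
NOTE A hedged sketch of how the query options above (status filter, from/to
time range, newest-first with a default limit of 100, pretty JSON output) could
map onto a parameterised SELECT. It is not the auditman implementation; the
audit_log table, its columns and the function names are assumptions.
```python
import json
import sqlite3
FIELDS = ("seq", "command", "init_time", "status")
def fetch(db, status=None, since=None, until=None, limit=100):
    """Return the newest `limit` entries matching the optional filters."""
    clauses, params = [], []
    for column, op, value in (("status", "=", status),
                              ("init_time", ">=", since),
                              ("init_time", "<=", until)):
        if value is not None:
            clauses.append("{} {} ?".format(column, op))
            params.append(value)
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    sql = "SELECT {} FROM audit_log{} ORDER BY seq DESC LIMIT ?".format(", ".join(FIELDS), where)
    return [dict(zip(FIELDS, row)) for row in db.execute(sql, params + [limit])]
def demo_db():
    # Tiny in-memory table with made-up rows, only for trying the query out.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE audit_log (seq INTEGER PRIMARY KEY, command TEXT, init_time REAL, status TEXT)")
    db.executemany("INSERT INTO audit_log VALUES (?, ?, ?, ?)",
                   [(1, "cmd-a", 100.0, "passed"), (2, "cmd-b", 200.0, "failed"), (3, "cmd-c", 300.0, "passed")])
    return db
def as_pretty_json(rows):
    return json.dumps(rows, indent=2)
```
For example, fetch(demo_db(), status="failed") returns only the cmd-b entry,
and as_pretty_json() renders the result as indented JSON.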

23:29.240 --> 23:37.880
There are commands which might need correlation. Auditman makes use of the log monitor to

23:37.880 --> 23:44.680
fetch the monitor logs; the offline tools make direct use of the SQLite

23:44.680 --> 23:50.600
DB. If you want to correlate commands from various sources, auditman can reconstruct

23:50.600 --> 23:56.440
the event sequence by matching the records, often with a criterion, maybe, say, initiation time,

23:56.520 --> 24:03.400
which usually is the case. This is done by the compare command. Say you want to

24:03.400 --> 24:09.080
compare the MGR and the monitor logs based on the initiation time,

24:09.080 --> 24:14.440
within the range from 1/10 to 30/10, with the delta being 1 second,

24:15.160 --> 24:20.840
you'd be getting something like this. So say there's a scrub command which is executed,

24:20.920 --> 24:29.160
and which is handled by the MGR daemon, you would have something like this.
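
NOTE An illustrative sketch of the correlation idea: pair MGR and monitor
entries whose initiation times fall inside a range and differ by at most a
delta. The record layout (dicts with init_time and command keys) and the
function name are assumptions; the actual compare command may work differently.
```python
def correlate(mgr_rows, mon_rows, start, end, delta=1.0):
    """Pair MGR and MON entries whose initiation times differ by at most `delta` seconds."""
    def in_range(rows):
        picked = [r for r in rows if start <= r["init_time"] <= end]
        return sorted(picked, key=lambda r: r["init_time"])
    mon = in_range(mon_rows)
    pairs = []
    for m in in_range(mgr_rows):
        for n in mon:
            if abs(m["init_time"] - n["init_time"]) <= delta:
                pairs.append((m["command"], n["command"]))
    return pairs
```
With an MGR entry at time 10.0 and a monitor entry at 10.5, a delta of 1 second
pairs them, while entries further apart or outside the range are left out.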

24:29.160 --> 24:35.880
Extending the entire audit logging framework beyond just commands: right now

24:35.880 --> 24:44.120
it only stores commands in the databases, but daemons like the MDS could make use

24:44.120 --> 24:49.880
of this by dumping the MDS state when the file system encounters degradation or

24:50.840 --> 24:58.760
something like that. Perf dumps can also be added by daemons if any OSD or placement group

24:58.760 --> 25:05.880
degradation occurs. This can help us not only for CephFS auditing, but for

25:05.880 --> 25:12.680
Ceph-wide auditing. But yeah, this starts with CephFS, and in future we'll be trying to

25:12.760 --> 25:19.640
implement it for the other daemons as well. So, yeah, that's it. Any questions?

25:28.600 --> 25:32.600
Yep. What's the timeline for this?

25:36.040 --> 25:42.200
So this is currently, yeah, I think the question is what's the timeline,

25:42.200 --> 25:46.920
when this would be part of a release, right? So right now it's a work in progress,

25:46.920 --> 25:52.040
but we're aiming for Umbrella, I guess. Yeah, so there is currently a work-in-progress

25:52.040 --> 26:00.360
PR that I have; it's a draft PR. I don't have the link handy, but yeah, so Umbrella, yep.

26:01.320 --> 26:09.240
Any other questions? Yeah. Sure. Very cool, I like the idea.

26:09.240 --> 26:14.360
Thank you. Maybe you said it already, could you maybe repeat whether there is

26:14.360 --> 26:20.200
any way I can customize what should be logged, and what would that look like?

26:21.160 --> 26:26.680
So, you want to customize the fields that we store in the database, right?

26:26.680 --> 26:41.160
So right now it is a fixed set; it is just

26:41.160 --> 26:46.240
six fields: the sequence number, the command, the initiation time, the completion time,

26:46.240 --> 26:51.840
the status and the retval. But in future, maybe in V2, we might incorporate this.

26:51.840 --> 26:56.320
But thanks, that is nice advice. Thank you. Thank you.

26:56.320 --> 27:01.840
Will this module and the whole thing be enabled by default,

27:01.840 --> 27:07.200
like an always-on module, for example, in the manager configuration?

27:07.200 --> 27:12.640
Is it the intention to enable this, like the logging of all commands?

27:12.640 --> 27:19.360
So, the question is whether the auditing would be enabled by default. We would like

27:19.440 --> 27:26.000
it to stay enabled by default, but I guess we can make it opt-in as well.

27:26.000 --> 27:31.360
The thing is, it needs a deliberate switch-on, because there is disk space

27:31.360 --> 27:38.400
involved, right? There is consumption of cluster space, although we expect it not to be huge.

27:39.280 --> 27:45.360
Maybe you can use erasure coding for it, because, you know, speed is not

27:45.600 --> 27:56.080
a concern. So it needs to be deliberately switched on. But if we see that the usage of this feature

27:56.080 --> 28:01.760
is high enough, then we might make it always-on, like the one you were talking about.

28:03.760 --> 28:09.280
I remember Patrick mentioned that the next step might be a self-repairing MDS, getting

28:09.280 --> 28:14.480
all the knowledge from possibly this database and doing auto-repair at the same time.

28:15.600 --> 28:20.160
So, yeah, that's something that I've been talking for on my own, where you can ask for my hand and just

28:26.800 --> 28:32.640
Yeah. So, Dan did mention that one of the things that would be nice for the MDS to do is to

28:34.640 --> 28:38.000
execute some of the disaster recovery steps itself. So, if there is some failure,

28:38.080 --> 28:43.600
it knows what the type of failure is, and instead of having the user run it by themselves, the MDS can

28:43.600 --> 28:48.240
auto-recover itself. It's something we have been thinking about, but it's not yet planned.

28:49.120 --> 28:54.080
With this, the MDS has all the required information about what could have

28:54.080 --> 29:00.080
gone wrong. But it will only do it if it's 100% sure. If it's even

29:00.080 --> 29:04.160
99%, if it thinks it knows the problem but it's still not sure,

29:05.120 --> 29:12.720
it should still not go into it. So that's the thing. But all this makes it easier for the MDS

29:12.720 --> 29:20.960
and also for us as developers and support folks to reduce the time to bring

29:20.960 --> 29:28.320
a file system back online, because there's lots of back and forth. I know, yeah,

29:28.400 --> 29:34.240
I'm sure there's somebody who ran a command with "yes, I really mean it", but never told us.

29:35.520 --> 29:40.640
Yeah, not deliberate, but it happens.

29:44.000 --> 29:48.000
Thanks for watching.

