WEBVTT

00:00.000 --> 00:12.320
The idea is to speed up live updates of your kernel and still be able to keep, for instance, a

00:12.320 --> 00:17.080
virtual machine alive while you're doing the live update of the kernel on the host.

00:17.080 --> 00:24.720
There may be an in-memory file system whose contents you want to keep; for instance,

00:24.720 --> 00:31.000
if it's a tmpfs with lots of data, you don't want to lose that data when

00:31.000 --> 00:39.240
you do the update of your kernel. You want to keep the PCI device configuration; there is no

00:39.240 --> 00:46.960
reason to re-enumerate the PCI config space, because you didn't really reboot and nothing changed.

00:46.960 --> 00:56.240
There are VFIO devices attached to virtual machines that could continue doing DMA

00:56.240 --> 01:04.120
during the live update of the kernel, but when the kernel reboots, their state gets cleared

01:04.120 --> 01:13.360
and they restart from the very beginning, so VMs get disconnected from their IO.

01:13.360 --> 01:16.360
And there may be different use cases as well.

01:16.360 --> 01:24.080
For example, the first driving use case for persisting memory across kexec was PRAM, suggested

01:24.080 --> 01:36.760
by Virtuozzo: they wanted to create a version of tmpfs that would be persisted

01:36.840 --> 01:43.560
across kexec and would survive reboots, so that when you use CRIU to checkpoint

01:43.560 --> 01:51.560
and restore your applications, you could store the checkpoints in that tmpfs, and the

01:51.560 --> 01:58.440
files would just persist over reboots, and then restoring the application contents would be

01:58.440 --> 02:03.960
much faster than doing a cold reboot and then going to the disk and so on.

02:03.960 --> 02:09.400
So, over the years, starting from 2017, there were a couple of proposals

02:09.400 --> 02:14.440
how to preserve memory contents. There was PRAM,

02:14.440 --> 02:23.200
there was PKRAM, and there were different proposals for memory allocators that would be

02:23.200 --> 02:31.120
aware of state persistence, and would allow you to somehow pass metadata from one kernel

02:31.120 --> 02:37.760
to another, so that the memory allocated from those allocators would survive

02:37.760 --> 02:52.240
kexec. The latest proposals for memory persistence over kexec were the Kexec HandOver mechanism,

02:52.240 --> 02:58.520
which I will expand on in this presentation, and guestmemfs.

02:58.520 --> 03:03.400
So, I'll post the slides, and these links lead to the discussions on the Linux kernel

03:03.400 --> 03:06.640
mailing list.

03:06.640 --> 03:11.640
So what is Kexec HandOver, or KHO?

03:11.640 --> 03:23.480
It's a framework for serialization and deserialization of state. Kexec users can register

03:23.480 --> 03:31.000
their callbacks with KHO, and when KHO calls these callbacks, the users decide

03:31.000 --> 03:37.080
what data they want to serialize, and on the boot of the next kernel, they decide what data

03:37.080 --> 03:48.600
they want to deserialize. KHO provides the ability to save arbitrary properties, and also

03:48.600 --> 03:59.640
the ability to save arbitrary memory ranges that are guaranteed, in some way, to survive

03:59.640 --> 04:08.960
kexec. In a sense, this is similar to the Xen live update approach implemented with Xen

04:08.960 --> 04:23.520
breadcrumbs. KHO consists essentially of three major parts. First is the KHO

04:23.520 --> 04:31.720
FDT: it's a data structure based on the flattened device tree that already exists

04:31.720 --> 04:42.840
in the kernel, and KHO uses the FDT library. The FDT contains the properties that users

04:42.840 --> 04:51.800
want to serialize and deserialize, and one of the properties that is more or

04:51.800 --> 04:58.520
less predefined is the memory ranges: a KHO user can say, okay, I have this memory range,

04:58.520 --> 05:07.240
please pass it onwards so I will be able to read it after kexec finishes. Another part is integration

05:07.240 --> 05:19.800
with kexec boot and the boot setup data structures; currently it's implemented

05:19.800 --> 05:29.560
only for x86 and ARM64. On x86, KHO is appended to the setup_data structure with pointers

05:29.560 --> 05:37.800
to the physical memory locations where it stores the device tree and where it stores the memory

05:37.800 --> 05:48.800
required for the next boot. For ARM64, KHO is added to the device tree created

05:48.800 --> 05:59.600
by the ARM kexec code as a chosen node. The third component is what we call scratch memory: these

05:59.600 --> 06:08.400
are memory regions that KHO allocates early in the boot of the first kernel, and then these

06:08.400 --> 06:18.480
regions are declared as CMA, so that they can be used by the system

06:19.440 --> 06:28.240
for movable allocations. These regions are then passed to the

06:28.240 --> 06:36.320
next kernel, and they are the only memory that the next kernel will use for its bootstrapping.

06:36.320 --> 06:48.960
Now, the KHO device tree. As I said, it's a flattened device tree; it is the same format

06:48.960 --> 06:57.600
as the device tree that firmware uses to describe the hardware, but in the case of KHO it is used

06:57.600 --> 07:08.400
to describe properties and memory ranges that we want to preserve across kexec. The device tree

07:08.400 --> 07:20.160
is created at some point of the kexec process, and every driver that wants to persist its data

07:20.160 --> 07:37.440
has to provide its own nodes in the FDT and fill in the node data. I'm trying to understand

07:37.440 --> 07:48.200
what Alex wrote here: since FDT is widely used, it already provides tools for verification and

07:48.200 --> 07:58.680
some standardization. It gives driver writers the ability to understand whether the previous version

07:58.680 --> 08:10.120
of what was serialized is compatible with the next version of what is serialized. For instance,

08:10.120 --> 08:19.160
if we are doing kexec from some more recent kernel version back to one of the previous

08:19.160 --> 08:27.320
kernel versions, there is a way to tell whether the properties saved in the existing data are

08:27.320 --> 08:37.640
compatible or not with what the driver can handle. So this is an example of

08:37.640 --> 08:47.560
how the FDT may look. This example saves ftrace buffers and events, so that ftrace can be reconstructed

08:47.560 --> 08:58.360
from the saved data. There is an events bitmask that says which events were enabled, and there are

08:58.360 --> 09:09.160
nodes with the memory ranges that the ftrace buffers used; CPU 0 in this example has four

09:09.160 --> 09:22.120
pretty small ranges. In reality it becomes much larger, but you get the gist.

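The ftrace example just described might look roughly like this as device tree source. This is a hypothetical reconstruction from the description above, not a copy from the actual KHO patches: the node names, property names, and addresses are all made up for illustration.

```dts
/dts-v1/;

/ {
    ftrace {
        /* bitmask of the trace events that were enabled (hypothetical encoding) */
        enabled-events = <0x00000001 0x00000040>;

        cpu0 {
            /* four (address, size) pairs describing the persisted ring buffers */
            mem = <0x80000000 0x1000
                   0x80002000 0x1000
                   0x80004000 0x1000
                   0x80006000 0x1000>;
        };
    };
};
```

On a real system there would be one such node per CPU, and the range lists would be much longer.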
09:22.200 --> 09:32.520
Another important thing is what happens with the memory during the kexec transitions. When the first

09:32.520 --> 09:44.520
kernel boots, very early during boot it reserves some contiguous memory areas, which are the blue ones here.

09:44.600 --> 09:57.480
Then, as the system continues to work, the orange, yellow, I don't know, I'm a bit color blind,

09:59.080 --> 10:07.240
the orange-yellow areas are what we want to persist across the kexec, and these areas should

10:07.240 --> 10:18.120
be designated as such by the users. When kexec actually happens, the only memory that the second kernel

10:18.120 --> 10:26.360
will use is the green area, the scratch memory, which was CMA for the lifetime of the first kernel;

10:26.920 --> 10:33.640
now it becomes the only usable memory for early memory allocations, pretty much until

10:33.640 --> 10:43.080
the page allocator is ready. This way we can be sure that we can preserve everything that we want to

10:43.160 --> 10:57.320
persist, and once the system knows that the persisted ranges are secure and safe, the scratch areas

10:57.320 --> 11:10.440
again become CMA, so they can again be reused for allocations of movable data, and this

11:10.440 --> 11:28.120
can continue for years, hopefully. There is a bunch of sysfs controls; they're not set in

11:28.120 --> 11:36.920
stone yet, but we think these are important. First of all, there are scratch_phys and

11:37.000 --> 11:48.200
scratch_len, which define the physical addresses and lengths of the scratch areas; they can be used by kexec

11:48.200 --> 11:58.760
userspace tooling to understand where the kernel image and the other parts of the kexec data

11:58.760 --> 12:10.600
should be stored; that part is still not implemented yet. Then there is an active control that tells

12:10.600 --> 12:23.400
the user whether KHO is active or not and lets us enable or disable KHO, and there are controls

12:23.400 --> 12:35.640
for the device tree configuration: dt_max says how much memory the device tree can use,

12:38.200 --> 12:46.120
because you cannot grow an FDT indefinitely; you have to specify its maximal size when you first create it.

12:47.080 --> 12:56.600
Then /sys/kernel/kho/dt is the representation of the device tree in its binary form;

12:58.600 --> 13:05.720
it appears after KHO is activated and it contains the device tree created up to that point.

13:05.720 --> 13:17.000
So, as we have seen here, if you use dtc -I dtb or whatever and look at kho/dt, you will see

13:17.000 --> 13:31.080
something like this. And there is /sys/firmware/kho/dt: this is the device tree that was passed

13:31.080 --> 13:39.320
from the first kernel to the next kernel, and from there we can see what was persisted,

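Put together, the flow through these controls can be sketched like this. It is only a sketch: the sysfs paths and control names (dt_max, active, dt) are assumed from the KHO patches under discussion and may change before merge, and the commands need root on a KHO-enabled kernel.

```shell
KHO=${KHO:-/sys/kernel/kho}

if [ -w "$KHO/active" ]; then
    echo $((1024 * 1024)) > "$KHO/dt_max"   # cap the serialized FDT at 1 MiB
    echo 1 > "$KHO/active"                  # run all registered serialization callbacks
    dtc -I dtb -O dts "$KHO/dt"             # decompile the blob into readable source
    kho_status="activated"
else
    kho_status="unavailable"
fi
echo "KHO: $kho_status"
```

On a kernel without the KHO patches, the guard simply reports that the interface is unavailable.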
13:39.320 --> 13:49.640
or at least what we tried to persist, and how. So, how does KHO work, more or less?

13:51.240 --> 13:58.440
I think I'm repeating myself in a way, but still: during very early boot, KHO reserves

13:58.440 --> 14:07.880
the scratch areas, then at some point later in time it declares these areas CMA, because

14:08.920 --> 14:16.120
you have to reserve large chunks of contiguous memory before you have the free

14:16.120 --> 14:23.960
lists and the page allocator, and you can't set the pageblock types at that point, so it has to be separated

14:23.960 --> 14:34.680
into different stages. Then the system lives on, and at some point users decide

14:34.680 --> 14:43.080
that they want to do a kernel live update, so they enable KHO by echoing 1 into

14:43.080 --> 14:54.600
/sys/kernel/kho/active. KHO then calls all of its users' callbacks to perform serialization

14:54.600 --> 15:06.120
of the data they want to persist, and the KHO data structures are appended to the kexec image along with the

15:06.760 --> 15:22.280
kernel image, initrd, and whatever else kexec stores. This pretty much happens at kexec

15:22.360 --> 15:35.560
load time. There are two additions to the kexec image that KHO creates: one is the physical addresses

15:35.560 --> 15:44.680
of the scratch areas, and the other is the physical address of the FDT

15:44.680 --> 15:55.080
block, so that when kexec moves pages around, it will know where to place the FDT

15:55.080 --> 16:07.240
and how to pass the scratch area addresses to the next kernel. When the kexec reboot happens,

16:08.120 --> 16:17.160
there is some logic in kexec that knows how to move pages from place to place; it involves a lot

16:17.160 --> 16:27.320
of copying, unfortunately, but that's something we're going to improve, I hope. Then, when the second

16:27.320 --> 16:42.680
kernel starts, setup_arch on x86-64 parses the KHO data that was stored in the setup

16:43.640 --> 16:55.640
data structures. Very early in boot, it recreates the scratch areas and

16:55.640 --> 17:04.200
makes sure that early memory allocations work only within those scratch areas. Then KHO

17:04.200 --> 17:17.720
parses the device tree, and from that point it's possible for KHO users to deserialize their state,

17:18.840 --> 17:28.280
depending on when the driver initializes; it might be at initcall time, it might be at

17:28.280 --> 17:34.600
module init time, but the state is there; we make sure it's preserved, it stays in memory, and

17:34.600 --> 17:44.200
nobody will touch it. Now, there was a recent discussion about this whole thing on the linux-mm

17:44.200 --> 17:51.240
mailing list about how we want all of this to happen. There are a lot of opinions, obviously,

17:52.120 --> 18:04.520
and we still have a lot of open questions and we need to address some of them related to how we

18:04.520 --> 18:14.600
manage scratch areas. For instance, is it possible to predict in some way how much scratch

18:14.680 --> 18:22.920
memory the next kernel will need? We know how much memory the first kernel used for its boot

18:22.920 --> 18:31.880
time allocations, so we can do some math and hope that the next kernel is not too different

18:31.880 --> 18:42.440
from the first kernel, and calculate the scratch sizing based on the current kernel's use and, I don't

18:42.760 --> 18:54.600
know, give it 150% or maybe a bit more.

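As a toy sketch of that heuristic: the 150% factor is the back-of-the-envelope math described above, not an implemented policy, and the boot-time allocation figure is a made-up sample value.

```shell
# Suppose the current kernel made this much boot-time (memblock) allocation,
# in KiB; 786432 KiB = 768 MiB is just a sample value for illustration.
boot_alloc_kib=786432

# Give the next kernel ~150% of that as scratch, per the heuristic above.
scratch_kib=$((boot_alloc_kib * 3 / 2))

echo "scratch size: ${scratch_kib} KiB ($((scratch_kib / 1024)) MiB)"
```

For these sample numbers the next kernel would get 1152 MiB of scratch.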
18:54.600 --> 19:04.040
There is also an opinion that it should be set on the kernel command line, and that the administrator knows better; it remains to be seen whether the administrator really knows

19:04.040 --> 19:17.080
better. Another interesting thing is what happens if the administrator thinks they know better

19:17.080 --> 19:24.360
and they want to resize the KHO scratch after the boot of one of the next kernels, and say, okay, the fourth

19:24.360 --> 19:34.200
kernel will boot with different scratch sizes. We still don't know how to address this, to be honest,

19:34.840 --> 19:42.600
because if it's done late in the system lifetime there is no guarantee we'll be able to

19:42.600 --> 19:51.560
allocate physically contiguous memory chunks. It would be easy if the administrator

19:51.560 --> 20:05.320
wanted to reduce them, but increasing will be difficult. Another open question is what we do

20:05.320 --> 20:13.160
if there is not enough memory in the scratch area when boot-time allocations happen: do we panic,

20:13.240 --> 20:23.640
or do we just disable KHO and continue with a normal reboot? Another thing that was

20:25.080 --> 20:33.160
discussed on the mailing list is what exact data format we want. Some people say that

20:33.160 --> 20:41.160
this is all too strict and rigid and doesn't allow for parallelization of drivers serializing

20:41.160 --> 20:51.240
their data; on the other hand, it gives some standardization. So there was a big proposal

20:51.240 --> 21:01.640
about having some intermediate data structures for persisting device properties, driver properties,

21:01.640 --> 21:09.480
and memory ranges, and then at some point converting them into FDT; and somebody suggested that we just

21:09.480 --> 21:17.320
invent yet another format and use that. So we'll continue to talk about this, I

21:17.400 --> 21:34.280
presume, until we enjoy the merge upstream. Another interesting point is what states we want

21:35.960 --> 21:43.160
to have. Right now what we have is activate, then kexec load, and then kexec boot, essentially.

21:44.120 --> 21:51.800
There are requests to make it possible to activate KHO after kexec load,

21:52.760 --> 22:01.400
and it's a bit complicated because of how kexec works; it's not impossible, but

22:03.400 --> 22:10.600
still, it would be quite intrusive. Another thing that was raised is that

22:10.600 --> 22:19.160
complex device drivers, like for example modern NICs, will require many more states

22:19.160 --> 22:29.880
in the middle, like prepare, prepare again, and then save state, restore state, and the basic model

22:29.880 --> 22:36.600
of activate and load doesn't work for them. That's also something we are going to look into

22:36.600 --> 22:43.800
in the near future. So that's all I had; I think I made it in time.

22:54.760 --> 22:56.120
[inaudible]

22:56.200 --> 23:06.200
Right, so you're using CMA because it's contiguous, right? You need the scratch space to be...

23:08.200 --> 23:13.880
Essentially, I reserve memblock because it's contiguous, and then I tell the page allocator,

23:13.880 --> 23:20.520
okay, it's CMA, you can use it for movable allocations, but it's CMA, guaranteed to be available for this.

23:21.480 --> 23:27.160
What do you do if there's some driver that needs to persist state that isn't in your set of...

23:27.160 --> 23:37.320
So, right now we can't persist movable ranges. If it's CMA, it's

23:37.320 --> 23:43.800
going to be movable, and if some driver did a page alloc of something, it would be unmovable;

23:44.760 --> 23:49.640
so we can persist unmovable memory, but we can't guarantee persistence of movable memory.

23:51.480 --> 23:57.080
Another quick question: I mean, this all needs to be very fast, right,

23:57.080 --> 24:04.680
to keep the VMs and the devices alive, so what's roughly your time budget for doing the

24:04.680 --> 24:16.040
kexec? The time budget is always as low as you can get, right? The few interesting tricks we've done here

24:16.040 --> 24:20.600
are the two-phase approach, where you activate and only then kexec, which allows you to move a lot of

24:20.600 --> 24:26.200
the complexity, like the serialization, which can be expensive, into a part where you can still

24:26.200 --> 24:30.680
continue running your system, so it doesn't actually bite into the kexec downtime. The only part that

24:30.680 --> 24:38.840
probably really inflicts pain in the downtime is the deserialization, which is reasonably fast,

24:38.840 --> 24:41.000
at least since you have a well-structured data format.

24:41.000 --> 24:52.440
[inaudible]

24:52.440 --> 24:56.440
[inaudible]

24:56.440 --> 25:01.440
Who was it there?

25:01.440 --> 25:04.440
Hi.

25:04.440 --> 25:07.440
One of the slides mentioned that the data are

25:07.440 --> 25:10.440
immutable after serialization.

25:10.440 --> 25:12.440
So this immutability, is it

25:12.440 --> 25:17.440
ensured generally, or must each user of the API

25:17.440 --> 25:20.440
ensure that the data are immutable after they're serialized?

25:21.440 --> 25:24.440
The user must know what they serialized,

25:24.440 --> 25:28.440
and that the data they serialized cannot change.

25:28.440 --> 25:32.440
KHO doesn't provide any guarantees about it.

25:32.440 --> 25:35.440
It's up to the drivers.

25:35.440 --> 25:36.440
All right.

