WEBVTT

00:00.000 --> 00:15.520
OK. Hello, everyone. I'm Stefano Garzarella. Today, we will try to answer this question:

00:15.520 --> 00:22.480
can we run QEMU and vhost-user on an operating system other than Linux, where vhost-user

00:22.480 --> 00:29.720
was developed? Before going into the details, when we talk about vhost-user, as Mattias already

00:29.720 --> 00:35.000
mentioned, we are talking about virtio devices. So, virtio is a specification that, I think,

00:35.000 --> 00:42.680
most of you already know. It is a specification to create standard devices for virtual machines.

00:42.680 --> 00:50.920
I put here some links to the latest spec, and also to the GitHub repo where the source of the spec

00:50.920 --> 00:57.560
is. And essentially, the specification defines the core components of every device: the

00:57.560 --> 01:04.040
initialization steps that guest and host need to follow in order to create the communication channel.

01:04.040 --> 01:10.920
The transports that virtio supports, which are PCI, MMIO, and Channel I/O; and then there is a big

01:10.920 --> 01:16.680
section for device types, so all the devices that are defined in the virtio spec.

01:17.480 --> 01:22.440
When we talk about virtio, of course, we are talking about devices and drivers. Usually, the

01:22.520 --> 01:29.800
virtio device runs in the host,

01:29.800 --> 01:36.200
and the guest will have a driver, a virtio driver, to communicate with the device. The spec,

01:36.200 --> 01:41.160
and of course every implementation, defines

01:42.760 --> 01:49.000
three paths: the control path, the data path, and the notification mechanism

01:49.880 --> 01:55.720
to wake up both sides. So, essentially, the control path is used for feature negotiation,

01:55.720 --> 02:02.440
so the driver and the device agree on what they can use. There is a configuration space

02:02.440 --> 02:10.280
to expose some information from both sides, like the MAC address for a net device or the context ID for

02:10.280 --> 02:15.720
vsock, things like that. And then, of course, the control path

02:15.720 --> 02:23.400
is used to set up the data path. So, the driver allocates all of the components for the

02:23.400 --> 02:28.920
data path, like the virtqueues, and then, through the control path, tells the device where they

02:28.920 --> 02:38.040
are in memory. The data path itself is mainly built on top of the virtqueue,

02:38.040 --> 02:44.840
which we can see as a ring buffer. The spec defines multiple formats. There is the split format,

02:44.920 --> 02:51.320
which is the initial one, and then they defined another one, the packed

02:51.320 --> 02:59.160
virtqueue, which is more optimized to reduce bus transactions. And as we will see, this was done

02:59.160 --> 03:04.920
in order to move the device also into hardware. And then, there is a notification system.

03:04.920 --> 03:10.840
Usually, we call "kick" the notification coming from the guest to the host, and "call"

03:10.840 --> 03:17.960
or interrupt the other way around. Now, the question is: where is the device emulated? Of course,

03:17.960 --> 03:23.080
in the host, but in which component? The common scenario is in the VMM. In our case,

03:23.080 --> 03:29.080
we will talk about QEMU, which we can consider the reference implementation of the virtio

03:29.080 --> 03:37.160
specification; most of the virtio devices are supported by QEMU. And QEMU can

03:37.160 --> 03:41.960
easily intercept the control path, can also easily handle the data path, since it has complete

03:41.960 --> 03:49.400
access to the guest memory, and can use special mechanisms from KVM to implement

03:49.400 --> 03:55.720
the notifications, like eventfd to inject IRQs or to get notified when some registers are

03:55.720 --> 04:01.880
written by the driver. Now, we also have other ways to emulate the device in the host.

04:01.880 --> 04:07.960
The other way is vhost. Essentially, it was introduced initially to improve the performance of

04:07.960 --> 04:15.880
virtio-net. The idea is to move the emulation of the device from the VMM to the host kernel. And

04:17.160 --> 04:22.360
in this way, the control path is still intercepted by QEMU, and then needs to be propagated to

04:22.360 --> 04:29.960
the host through an ioctl API. And the data path will be completely handled by the kernel,

04:30.040 --> 04:37.560
by the device implemented in the kernel, because the kernel can access the address space of the VMM.

04:38.760 --> 04:46.760
Linux currently supports three devices: vhost-net, vhost-scsi, and vhost-vsock. And as we mentioned,

04:46.760 --> 04:54.040
the main advantage is performance, because we can skip a lot of syscalls. Think about a

04:54.120 --> 05:01.320
virtio-net device, which usually is attached to a tap device. It has to do a lot of system

05:01.320 --> 05:07.960
calls: for example, when the driver puts a packet into the virtqueue, it needs to notify the VMM, so

05:07.960 --> 05:15.080
it does a VM exit, goes into the host kernel, and then the kernel will notify the VMM. And then

05:15.080 --> 05:20.520
the VMM needs to communicate with the tap, so that is a system call, and then it has to, for example,

05:20.520 --> 05:26.680
put the request into the used ring, and issue another syscall to inject the

05:26.680 --> 05:31.160
interrupt. So we have a lot of syscalls. If we move everything into the kernel, of course,

05:31.160 --> 05:36.760
we reduce the number of syscalls per request, so the latency and also the throughput will improve.

05:36.760 --> 05:43.160
Another advantage, which we use for vsock, for example, is that since we are in the host

05:43.160 --> 05:49.400
kernel, it is easy to interface with kernel stacks, like the vsock address family,

05:50.600 --> 05:57.560
so we can easily communicate with the vsock stack and put the packets there. Of course, we have

05:57.560 --> 06:04.600
drawbacks. Safety: the device is in the kernel, so if it crashes, you can have serious issues.

06:05.480 --> 06:11.560
And it is really Linux specific, because all the devices are provided by the Linux kernel; they

06:11.560 --> 06:15.880
live in the Linux kernel, so you cannot use them without the Linux kernel.

06:16.520 --> 06:21.640
And if you want to update the device, you need to update your host kernel. Maybe for fixes

06:21.640 --> 06:30.360
that is okay, but for new features it can take time. So another option, which was inspired by vhost,

06:30.360 --> 06:37.400
is vhost-user. Mattias already talked about it. The control path in this case is

06:38.920 --> 06:44.680
done over a Unix domain socket. So essentially, the device in this case is also moved out of the VMM,

06:44.680 --> 06:50.520
but instead of the kernel, it is moved to another user-space process. And the control path,

06:50.520 --> 06:56.760
as I mentioned, is a Unix socket, but it is really similar to the ioctls: the vhost ioctls and

06:56.760 --> 07:04.040
vhost-user define very similar messages. And the data path, in this case, is implemented through shared

07:04.040 --> 07:10.520
memory. So QEMU needs to allocate the guest RAM in a special way, in order to be able to

07:10.520 --> 07:15.800
share that memory through a file descriptor. So the memory should be addressable by a file descriptor,

07:15.800 --> 07:21.560
which is then passed over a Unix domain socket to the other process. The main advantage is safety:

07:21.560 --> 07:26.360
it is an external process, external even to the VMM. So if it crashes, we can easily restart it,

07:26.920 --> 07:34.040
and everything should be fine. Device updates: again, it is a user-space process. We can easily

07:34.040 --> 07:41.080
hot-unplug it, start a new version, and hot-plug it live. And you can write it in

07:41.080 --> 07:45.800
different languages. As Mattias mentioned, we have a lot of them implemented in Rust,

07:46.680 --> 07:51.640
because it is a completely external application. And of course, more isolation. You can

07:51.640 --> 08:00.120
confine it: you can put it in a jail or a container or whatever, or a special cgroup;

08:00.840 --> 08:07.560
it just needs a Unix domain socket to the VMM. Drawbacks? Yeah, performance

08:07.560 --> 08:13.160
again: the device is moved to user space, so again we have to do syscalls, but there are

08:13.160 --> 08:18.600
other techniques we can use, like DPDK, SPDK, or io_uring, in order to reduce the

08:18.600 --> 08:25.000
syscalls. And we need a bit more coordination, because that application needs to be spawned before the

08:25.000 --> 08:30.600
VMM, but thanks to management layers, like libvirt, this can be completely hidden from the user.

08:31.160 --> 08:36.440
And it's Linux specific. I mean, it was Linux specific. That is a spoiler of the

08:36.440 --> 08:42.680
talk, but yeah, it is not really Linux specific. I just want to mention vDPA, which is another

08:42.680 --> 08:47.480
framework we are developing, where in this case the device is moved into the hardware. Now

08:47.480 --> 08:53.800
we have a lot of SmartNICs and DPUs, so thanks to this framework, virtio can also be implemented

08:53.880 --> 08:58.200
directly in hardware. If you're interested, we already did a lot of talks about it.

08:58.760 --> 09:04.920
I will not go into the details. So let's go back to vhost-user. I put the link to the

09:04.920 --> 09:11.480
specification here. The specification is maintained by the QEMU community, and it

09:11.480 --> 09:18.040
defines essentially a control plane to share the virtqueues with a user-space

09:18.120 --> 09:23.560
process on the same host. We use the terms front-end and back-end: the front-end is the application

09:23.560 --> 09:29.480
that shares the virtqueues, in our case QEMU, so the VMM, and the back-end is the consumer of the virtqueues.

09:29.480 --> 09:33.960
So usually the back-end is the application that emulates the virtio device.

09:34.840 --> 09:42.280
And the key components of the specification are the Unix domain socket,

09:42.280 --> 09:47.080
which we already mentioned, and the ancillary data support of Unix domain sockets. Essentially,

09:47.480 --> 09:55.080
that support allows applications to share file descriptors and so give access to

09:55.080 --> 10:00.760
those resources. That file descriptor could be, in our case, a shared memory region, which is another

10:00.760 --> 10:07.800
key component. So we need a way to allocate memory that can be shared by passing a file descriptor

10:07.800 --> 10:15.400
to the other process, so that the other process can easily map that memory. And then the notifications.

10:15.400 --> 10:22.760
So vhost-user was based on eventfd, which is, as we will see, Linux specific. But QEMU, for example,

10:22.760 --> 10:28.920
automatically supports a fallback to pipe or pipe2 to implement the notifications,

10:28.920 --> 10:34.600
I mean, the interrupt injection and the kicks coming from the driver.

10:35.640 --> 10:45.000
So, vhost-user on POSIX. This was the main question: can we use vhost-user on systems other than Linux?

10:45.000 --> 10:52.200
The answer is yes on POSIX. Is it true on Windows? I don't know. But for other

10:52.200 --> 10:59.240
systems that are POSIX compliant, the answer is yes. And here I put the links to the main

11:00.440 --> 11:05.960
syscalls defined by POSIX. So let's look at

11:05.960 --> 11:12.440
the key components. We already talked about Unix domain sockets: yes, they are in the POSIX

11:12.440 --> 11:18.120
specification. And the ancillary data support? Also defined in the spec: it

11:18.120 --> 11:24.120
defines the cmsghdr structure that is used to pass file descriptors, with SCM_RIGHTS.

11:24.680 --> 11:32.680
Then, shared memory: yes, POSIX specifies the shm_open syscall, which is exactly

11:32.680 --> 11:38.200
defined as a way to create a connection between a shared memory object and a file descriptor. That is

11:38.200 --> 11:43.480
exactly what we want. And then notifications. As I mentioned, eventfd, which we use on Linux,

11:43.480 --> 11:50.920
is Linux specific. Maybe FreeBSD also supports it now, but it is not POSIX. However, we have pipe and

11:50.920 --> 11:56.920
pipe2, and QEMU and other VMMs already support the fallback to pipe and pipe2 for the notifications.

11:56.920 --> 12:03.880
So yes, we can. QEMU already supported most of them: the Unix domain socket was there, the

12:03.880 --> 12:11.800
ancillary data, the notifications. The only thing we missed was a way to allocate shared

12:11.800 --> 12:19.160
memory with the shm_open syscall. Now, let's take a look at how QEMU handles the guest memory,

12:19.160 --> 12:24.120
the allocation of the guest memory. Usually we identify that with the QEMU memory backends.

12:24.840 --> 12:30.440
When you create a VM, you can use the simple -m option to specify just the size of

12:30.520 --> 12:37.960
the memory. But you can also do something with the machine option. I mean, you can specify,

12:37.960 --> 12:43.400
for your machine, which memory backend to use. And this is a more advanced way to define your

12:43.400 --> 12:49.560
memory, because some of the memory backends support an option, the share option,

12:49.560 --> 12:54.680
that is exactly what we want. If that option is turned on, it means that QEMU will try to allocate

12:54.680 --> 12:59.880
that memory in a way that can be shared with an external application. Now, let's

13:00.520 --> 13:06.520
take a look at the QEMU memory backends. The first one I want to mention is the RAM memory

13:06.520 --> 13:14.200
backend, which is the simplest one. Essentially it is the same one used when you specify just the

13:14.200 --> 13:20.360
memory size. But you can add more options, like pre-allocation: you can ask QEMU to pre-allocate

13:20.360 --> 13:27.720
everything, and other options. The next one I want to mention is the file memory backend,

13:27.720 --> 13:35.080
which is one of the memory backends that support the share option. In this case,

13:35.080 --> 13:41.720
the main parameter is a path in the file system. That could be a simple file, but also a file system

13:41.720 --> 13:49.720
specialized for shared memory, or a huge-page file system. And as I mentioned, this memory backend can be

13:49.720 --> 13:56.360
used to share the memory with another application. But the most interesting one is the memfd

13:56.360 --> 14:04.520
memory backend. It allocates an anonymous object. By anonymous, I mean

14:04.520 --> 14:10.840
it is not backed by any file in the file system; it is just backed by a file descriptor

14:10.840 --> 14:16.360
that can be shared through the Unix domain socket. And this is exactly what we want for

14:16.680 --> 14:22.440
vhost-user. So when you use vhost-user on Linux, you can easily use memfd, where the share option

14:22.440 --> 14:28.840
is true by default, because that is exactly its use case. But it is Linux only, since it uses

14:28.840 --> 14:34.280
memfd_create, which is not in POSIX. So what we did: we added a new memory backend, which looks similar

14:34.280 --> 14:41.720
to it. It is called memory-backend-shm, and it does something similar, but using the shm_open

14:41.720 --> 14:48.360
syscall, which is POSIX, so we can use it on FreeBSD, macOS, or any system that supports that POSIX syscall.

14:48.360 --> 14:55.160
It is available from QEMU 9.1. And then there are other backends, but

14:55.160 --> 15:04.520
we don't care about them here. So this is the main series where we implemented these things. Most of

15:04.520 --> 15:11.320
the patches are upstream from 9.1. The red ones are not; we will see in the next slide why.

15:11.480 --> 15:19.480
So the first part was just fixes, because when we ran QEMU and the vhost-user applications on

15:19.480 --> 15:25.400
other systems, we discovered some assumptions that were true on Linux but not on other operating

15:25.400 --> 15:31.640
systems. Then we enabled it: the main thing was the new memory backend, and then we enabled some tests

15:31.640 --> 15:40.440
that essentially highlighted some issues that we still have not fixed. For this reason, we did not

15:40.440 --> 15:47.080
merge the three patches in red here. Those patches I am now maintaining on my fork,

15:47.960 --> 15:54.440
which I recently rebased. Essentially, a test is failing on FreeBSD and macOS;

15:54.440 --> 16:01.160
it is a qtest, and the main one is the vhost-user reconnect test. So essentially,

16:01.160 --> 16:08.600
that test was never run on those platforms. So we don't know if it is an issue in the test,

16:08.600 --> 16:15.800
or something we didn't fix in the vhost-user implementation. But on FreeBSD x86 hosts,

16:15.800 --> 16:23.640
you can easily reproduce it running the ppc64 target using TCG, so using binary translation:

16:23.720 --> 16:32.440
it hits the issue every time. On macOS, on arm64, running the arm64 tests,

16:32.440 --> 16:40.440
you can hit the issue frequently. So it was hard to debug, and we didn't have time to go into the

16:40.440 --> 16:48.520
details, but yeah, this is something we need to work on. If you want to try, you can use this fork.

16:49.080 --> 16:55.640
In this case, I put an example where I essentially share the root file system

16:56.920 --> 17:03.880
through a vhost-user block device with the VMM. So the root file system is, in this case,

17:04.920 --> 17:11.000
a vhost-user device. The QEMU repository contains two applications that can

17:11.000 --> 17:18.600
expose a file as a vhost-user block device. The simple one is vhost-user-blk,

17:18.600 --> 17:24.520
which supports only a raw file. So you can expose a raw file as a vhost-user block device. Or you can

17:24.520 --> 17:30.280
use the qemu-storage-daemon, which is a more advanced application, because it supports the

17:30.280 --> 17:36.600
entire QEMU block layer. So you can share every format that QEMU supports as a block device.

17:36.600 --> 17:45.400
Like a qcow2 image, or even a remote disk. You can share it as a vhost-user block device, or use other

17:45.400 --> 17:52.200
exports like NBD and FUSE. If you are interested, we talked about it with

17:52.200 --> 17:59.160
Kevin Wolf at KVM Forum some years ago. And then you can start QEMU in this way.

17:59.160 --> 18:09.080
I mean, I put in bold what we need for the vhost-user block part. The rest really depends on

18:09.080 --> 18:16.360
the system, because on Linux we can use the KVM accelerator, and on macOS we can use the HVF accelerator.

18:16.920 --> 18:22.920
But essentially, in order to attach the vhost-user device, you can use exactly the same options.

18:22.920 --> 18:28.600
So first of all, the memory backend. In this case, we are using shm because it is POSIX.

18:28.600 --> 18:33.880
So you can use it on any of these systems. And then you need the vhost-user-blk device, because,

18:33.880 --> 18:38.680
as we mentioned, the control path is still intercepted by QEMU, so we need this small device

18:40.680 --> 18:46.840
in the VMM. And then the last one is the Unix domain socket,

18:46.840 --> 18:51.320
where the device is exposed by the qemu-storage-daemon that we started before.

18:51.720 --> 18:58.520
Okay, we talked about QEMU, now let's talk about the devices. As I mentioned, some of them are in QEMU,

18:58.520 --> 19:04.200
but we also have a lot of them outside of QEMU, in this case written in Rust. And most of them are

19:04.200 --> 19:09.800
based on the Rust VMM components. Rust VMM is a community which provides the building blocks

19:11.000 --> 19:21.240
for building VMMs and hypervisors. Some of the hypervisors that use our components are, for example,

19:21.240 --> 19:26.920
Cloud Hypervisor, Firecracker, and libkrun; all of them use our components. And we are

19:26.920 --> 19:33.080
collaborating upstream in order to create these building blocks. And I put some links to the community

19:33.080 --> 19:40.520
channels. And of course, we also support vhost, both for the VMM and also for the devices.

19:40.520 --> 19:44.680
So we have a couple of crates: vhost, which is a pure library that exposes the

19:44.840 --> 19:51.000
message structures and related definitions for vhost, vhost-user, and vDPA. And then we have the

19:51.000 --> 19:57.720
vhost-user-backend crate, which essentially allows you to easily create a vhost-user device application.

19:59.080 --> 20:03.400
Other than that, we also have a lot of devices implemented and maintained by the

20:03.400 --> 20:09.800
Rust VMM community in the rust-vmm vhost-device repo on GitHub; you can find them there. And we have a lot

20:09.800 --> 20:16.280
of devices supported, like CAN, console, SCSI, vsock, and sound, and some of them in progress,

20:16.280 --> 20:21.480
like GPU and video. We also have other devices outside of Rust VMM, but they are

20:21.480 --> 20:29.320
based on our crates, like vhost and vhost-user-backend. Can we run the Rust VMM crates on

20:29.320 --> 20:35.960
POSIX? We still need to do a lot of work there, because most of our crates are based on the

20:35.960 --> 20:43.400
vmm-sys-util crate. Linux is supported, of course; other systems only partially, and we use a

20:43.400 --> 20:48.840
lot of Linux-specific stuff, like epoll and eventfd. So we need to replace them: for example,

20:48.840 --> 20:54.040
for epoll, we can use the smol polling crate, and for eventfd, maybe we can implement an

20:55.240 --> 21:00.920
automatic fallback, like QEMU does with pipe/pipe2. I also put here some open issues,

21:00.920 --> 21:05.800
because the community is interested in supporting other operating systems, like macOS.

21:06.600 --> 21:11.800
And so, what are the next steps? As I mentioned, first of all, in QEMU, we need to understand why

21:11.800 --> 21:18.680
that test is failing. And then we have a lot of things to do in Rust VMM, to improve POSIX

21:18.680 --> 21:26.840
support in the vmm-sys-util crate: we need to replace the Linux syscalls, as I mentioned, and for every

21:26.840 --> 21:31.480
device, maybe we need to adapt the code a bit. For example, for virtiofs, the file system, of course,

21:31.480 --> 21:37.960
is different on macOS, so we need some specific work on it. So if you are interested,

21:37.960 --> 21:41.080
please reach out; yeah, we have a lot of things to do. It's a side

21:41.080 --> 21:46.040
project for me, so I don't have much time, but yeah. Thank you. Any questions?

21:56.840 --> 22:18.520
Sorry, the question was: how many people use QEMU on FreeBSD or macOS? Something like that, yeah.

22:19.480 --> 22:28.600
Yeah. On FreeBSD, honestly, I don't know. On macOS, most of them are using it, for example,

22:28.600 --> 22:39.480
with Podman, in order to start a Linux VM to run containers. And since QEMU does not provide

22:39.480 --> 22:45.560
some devices, like virtiofs, which is a cool one, it is not implemented in QEMU; it is implemented

22:45.560 --> 22:51.320
only as a vhost-user device. This could be cool to support, in order to use that device also on

22:51.320 --> 22:58.920
macOS to share a directory between the host and the VM that runs the containers.

23:15.640 --> 23:21.320
Essentially, QEMU is really widely used in order to do binary translation. So if you want to

23:21.320 --> 23:29.000
emulate ARM on x86, and then you want to use a vhost-user device, it could be nice to have.

23:33.320 --> 23:34.520
Okay. Thank you.

