WEBVTT

00:00.000 --> 00:18.960
Hello everybody, thanks for being here. I'm Simone Tiraboschi. I'm working at Red Hat on the

00:18.960 --> 00:21.360
KubeVirt project.

00:22.240 --> 00:25.840
Okay, can you hear me now better? Thank you.

00:27.600 --> 00:33.280
So, today we are going to talk about scheduling. Scheduling is the process of matching

00:33.280 --> 00:42.800
workloads to nodes. By default, KubeVirt is using the standard Kubernetes scheduler, the kube-scheduler.

00:44.720 --> 00:50.720
As you know, KubeVirt is about virtual machines, but a virtual machine, at the end, is

00:50.800 --> 00:58.560
executed in a pod, and the pod has to be scheduled. The pod is scheduled by the standard

00:58.560 --> 01:11.680
Kubernetes scheduler. Let's take a brief digression about how we define the

01:12.640 --> 01:20.480
resource needs of our workloads. On Kubernetes, on a pod, and so on a virtual machine,

01:21.360 --> 01:28.640
we have requests and limits. Requests are the amount of resources

01:28.640 --> 01:39.280
allowed to be used, with a strong guarantee of availability. You can request CPU and you can

01:39.280 --> 01:48.000
request memory. The scheduler is not going to overcommit on a request: you ask for a certain amount

01:48.000 --> 01:54.560
of resources and you are going to get them. A limit is the maximum amount of resources that can be

01:54.560 --> 02:02.800
used, without any guarantee. The scheduler completely ignores limits. What's

02:02.880 --> 02:15.040
the implication of this? The request is typically going to be less

02:15.040 --> 02:22.800
than the limit. This means that you are overcommitting a bit, hoping the resources will be available on the node.

02:23.760 --> 02:31.440
If the usage goes over the limit: in the case of CPU, you are going to be throttled, you are not going to

02:31.520 --> 02:39.520
go over the limit; in the case of memory, you can eventually be killed by the out-of-memory killer.
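
As a sketch of the requests and limits just described (an illustrative pod, not taken from the talk; names and values are hypothetical):

```yaml
# Illustrative pod: the scheduler only looks at requests; limits cap usage.
apiVersion: v1
kind: Pod
metadata:
  name: example-app               # hypothetical name
spec:
  containers:
  - name: app
    image: quay.io/example/app    # hypothetical image
    resources:
      requests:
        cpu: "500m"               # guaranteed: the scheduler finds a node with this free
        memory: "512Mi"
      limits:
        cpu: "1"                  # above this, the CPU is throttled
        memory: "1Gi"             # above this, the OOM killer can kill the container
```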

02:41.760 --> 02:48.560
In KubeVirt, we manage virtual machines. You define what you need for your virtual machine: you define

02:48.560 --> 02:54.400
that you need a certain number of CPU cores, you define that you need a certain amount of memory

02:54.400 --> 03:03.040
for your virtual machine. The KubeVirt controller translates that into the resources of a pod.

03:03.040 --> 03:15.200
In general, we are not setting limits. We are not setting limits on memory because we don't want

03:15.200 --> 03:21.680
our virtual machines to be killed by the out-of-memory killer. We are not setting limits

03:21.680 --> 03:27.680
on CPU because we want to take full advantage of the resources available on that node.

03:28.560 --> 03:36.640
Normally, KubeVirt is not overcommitting in terms of memory. If you request two gigs of

03:36.640 --> 03:43.520
memory for your virtual machine, KubeVirt is going to render a pod with two gigs of memory as

03:43.600 --> 03:51.840
a request, plus something more as a safety threshold for the ancillary services within the

03:51.840 --> 04:00.720
virt-launcher pod. On CPU, by default, KubeVirt is overcommitting by a factor of 10. It means that

04:00.720 --> 04:06.960
if you are requesting 4 CPU cores, the KubeVirt controller is going to create the pod configured

04:07.040 --> 04:16.720
to request 0.4 cores, or 400 millicores. This means that, from the scheduler's point of view,

04:16.720 --> 04:23.520
a virtual machine with 4 cores is just a pod with 0.4 cores to be scheduled.
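
To make the 10x CPU overcommit concrete, a sketch (VM name is hypothetical; the 1:10 ratio corresponds to KubeVirt's default CPU allocation ratio):

```yaml
# A VM asking for 4 cores and 2Gi of memory...
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: my-vm            # hypothetical name
spec:
  template:
    spec:
      domain:
        cpu:
          cores: 4       # 4 guest cores
        memory:
          guest: 2Gi     # memory is NOT overcommitted by default
# ...is rendered by the KubeVirt controller into a virt-launcher pod
# that only *requests* 4 / 10 = 0.4 cores (400m) plus the full 2Gi
# (with a small overhead), so for the kube-scheduler this is just
# a 400m pod.
```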

04:25.600 --> 04:33.200
Scheduling is the process of binding a virtual machine, and so a pod, to a node.

04:33.520 --> 04:43.040
The scheduler should find a node that is capable of executing that workload,

04:43.040 --> 04:51.520
ensuring that on the node there is a fair amount of resources to match at least the resources

04:51.520 --> 04:59.920
that we require. The process of scheduling is a bit more complex. We have predicates.

04:59.920 --> 05:09.040
A predicate, for the scheduler, is a condition that must hold on the node for it to be able

05:09.040 --> 05:16.640
to handle the workload that is going to be scheduled. The scheduler takes the definition of

05:16.640 --> 05:24.000
the pod and starts looking at all the nodes in the system, scanning all of them. The first step

05:24.000 --> 05:31.040
is filtering: there is a set of filters, and nodes that are not going to match the predicates

05:31.040 --> 05:36.480
are going to be filtered out; they are not good candidates. Then there is a

05:36.480 --> 05:42.000
scoring mechanism, which is not mandatory: it is just a priority, just a weight.

05:42.000 --> 05:46.000
Nodes are weighted and the scheduler is going to choose the best-fitting node.

05:46.000 --> 05:55.440
Now, we have virtual machines and we have pods. But virtual machines and pods are different.

05:55.440 --> 06:01.600
On a virtual machine, you have a full operating system. It usually requires more

06:01.600 --> 06:08.880
resources than a pod: you probably need to allocate a few cores, probably you are going to

06:08.880 --> 06:16.880
allocate a few gigs of RAM. The boot and startup time is slow, because you need

06:16.880 --> 06:24.560
to boot a full operating system. Usually, a virtual machine is stateful: we have data on a volume

06:24.560 --> 06:31.040
that is going to be saved; we have persistent volumes. Virtual machines can be

06:31.040 --> 06:38.080
live migrated between nodes without downtime. You cannot do that for pods. On the other side,

06:38.240 --> 06:44.560
you can easily restart a pod on a different node, because its starting time is short.

06:46.640 --> 06:51.440
In order to scale out a virtual machine, probably you need to reconfigure it.

06:53.440 --> 07:01.120
The user expectations are different, because the user is not supposed to see his virtual machine

07:01.120 --> 07:09.680
continuously rebooting. When you define a virtual machine, you will find it to be

07:09.680 --> 07:16.960
as close as possible to Kubernetes. So, basically, you can use the same semantics, using

07:16.960 --> 07:28.320
node selectors or affinity and anti-affinity. We tried to keep it as close as possible, basically one-to-one.
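
The one-to-one semantics means a VM can carry the same placement constraints as a pod; a sketch (labels and values are hypothetical):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: constrained-vm                        # hypothetical name
spec:
  template:
    spec:
      # same semantics as on a pod: both are honored by the kube-scheduler
      nodeSelector:
        topology.kubernetes.io/zone: zone-a   # hypothetical zone
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cpu-tier                 # hypothetical node label
                operator: In
                values: ["fast"]
```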

07:31.200 --> 07:40.560
But what do users really want? I worked on virtualization before working on Kubernetes. We know that

07:40.560 --> 07:49.280
when cluster admins migrate a virtual machine, they expect to be able to select the node

07:49.280 --> 07:54.800
where they want to have the virtual machine landing. In traditional virtualization systems

07:54.960 --> 08:02.640
that is the semantics; you expect something like that. On the other side,

08:02.640 --> 08:10.000
on Kubernetes, a live migration is just an instance of the virtual machine

08:10.000 --> 08:17.680
instance migration object. It's a namespaced object, so a namespace owner can create it.

08:18.640 --> 08:25.040
In the spec of this object, you can only specify the name of the virtual machine that you want to

08:25.040 --> 08:31.360
migrate. This means that you have no control at all. It's up to the scheduler to select

08:31.360 --> 08:39.120
a node that is going to fit this virtual machine. It's quite a departure from

08:39.120 --> 08:46.080
what the users of traditional, old-school virtualization systems are used to.
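
The migration object mentioned here really is minimal; the whole spec is just the VMI name (object and namespace names are hypothetical):

```yaml
# A namespaced object: a namespace owner can create it, but the
# target node is entirely up to the scheduler.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migration-job        # hypothetical name
  namespace: my-tenant       # hypothetical namespace
spec:
  vmiName: my-vm             # the only thing you can specify
```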

08:48.640 --> 08:55.920
This has been debated a lot in the community. We had a few proposals in the past, we had

08:55.920 --> 09:03.120
a lot of discussions here and there. It's not a new idea. We know that users are expecting

09:03.120 --> 09:11.280
something different. Up to now, we have still not been able to converge on this. It's controversial.

09:11.280 --> 09:21.680
Let me try to explain it. We know from experience that admin users are used to controlling

09:21.680 --> 09:26.560
where their traditional workloads are going to be moved to, just because they are

09:26.560 --> 09:33.200
used to doing that, maybe because they are relying on existing patterns or automation,

09:33.200 --> 09:39.520
or because they are planning maintenance on one set of nodes after the other. Whatever the reason,

09:40.480 --> 09:45.920
it is a fair expectation: they know what they want to do. On the other side, as a virtual machine

09:45.920 --> 09:52.640
owner, you don't want to see your object updated or amended just because

09:52.640 --> 09:59.760
a cluster admin needs to schedule that virtual machine on a different node. The goal of what we are

09:59.760 --> 10:06.720
talking about is to allow a cluster admin to trigger a live migration of a virtual machine,

10:06.880 --> 10:17.040
limiting the set of candidate nodes to valid ones. The target node that is explicitly

10:17.040 --> 10:23.920
required for the actual live migration should not stay there: it should not influence the future

10:23.920 --> 10:35.120
of the virtual machine. It's just for a one-off migration attempt, and it should not bypass constraints

10:35.120 --> 10:42.880
that are already set on the virtual machine: you should not be able to bypass what is there

10:42.880 --> 10:52.080
by overriding it. We are proposing now a really simple design. The idea is that, directly on the

10:52.080 --> 11:01.360
virtual machine instance migration object, we can add an additional node selector that is going to

11:01.360 --> 11:09.280
be merged with all the node selectors and all the affinities that are already set on the

11:09.280 --> 11:17.200
virtual machine. From a CLI point of view, it's just about passing an additional parameter.
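
Under the proposal being discussed, the migration object would carry a one-off selector; a sketch (the `addedNodeSelector` field name follows the design proposal and is not an accepted API yet):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: targeted-migration           # hypothetical name
spec:
  vmiName: my-vm                     # hypothetical VM
  # Proposed (not yet accepted): merged with, and only further
  # restricting, the selectors and affinities already on the VM.
  # It applies to this migration attempt only.
  addedNodeSelector:
    kubernetes.io/hostname: node-b   # hypothetical target node
```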

11:17.200 --> 11:28.640
At that point, you can inject, on the fly, a set of additional constraints that will be

11:29.600 --> 11:38.160
respected for that migration attempt. The proposal is quite simple, but it got a lot of criticism.

11:40.880 --> 11:47.440
The first one: we are on Kubernetes, it's a cloud-native solution, Kubernetes has the scheduler,

11:47.440 --> 11:52.880
the user should not interfere, should not take it over. Okay,

11:53.760 --> 12:02.240
but they are the users, I mean. Then: on Kubernetes, we cannot live migrate a

12:02.240 --> 12:07.200
pod; the concept is simply not there, and of course we cannot live migrate a

12:07.200 --> 12:12.480
pod in plain Kubernetes. But without KubeVirt we don't have a virtual machine at all,

12:12.480 --> 12:18.080
and now, with KubeVirt, we have a virtual machine, so we can somehow handle this.

12:18.800 --> 12:26.080
Then, of course, on Kubernetes we have native paradigms to individually address some,

12:26.080 --> 12:33.440
if not all, of the use cases that I presented before: adding taints and tolerations,

12:33.440 --> 12:42.160
cordoning and uncordoning nodes; there is a native way of doing that. Probably it is not as intuitive

12:42.160 --> 12:47.600
as it should be for experienced cluster admins that simply want to live

12:47.600 --> 12:56.720
migrate a virtual machine from this node to that one, for whatever reason. Another criticism

12:56.720 --> 13:02.080
that was raised in the community is that live migrations are resource-expensive

13:02.080 --> 13:10.800
operations. We know that they consume a lot of bandwidth within the cluster, so they

13:10.800 --> 13:19.280
are capped at a certain number. The concern is that, if we allow users to freely manage

13:21.520 --> 13:27.840
live migrations, they could start abusing that, introducing too much load on the cluster.
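
The caps mentioned here are configurable on the KubeVirt CR; a sketch using the documented migration configuration fields (values are illustrative):

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    migrations:
      parallelMigrationsPerCluster: 5       # cluster-wide cap on concurrent migrations
      parallelOutboundMigrationsPerNode: 2  # per-node cap
      bandwidthPerMigration: 64Mi           # throttle network usage per migration
```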

13:29.200 --> 13:39.840
We solved this; let me try to quickly explain it. On Kubernetes, we have two different

13:40.480 --> 13:48.720
admin roles: the cluster-admin role, which is allowed to do whatever on any resource in the cluster,

13:48.720 --> 13:56.960
and the admin cluster role, which is normally supposed to be bound

13:56.960 --> 14:06.240
inside a single namespace, to a user. The KubeVirt admin role is aggregated to the

14:06.400 --> 14:15.120
admin role. Normally, I think it's a common practice to grant the admin role to

14:15.120 --> 14:21.520
selected users inside a namespace: the namespace owner. You are a kind of tenant, you own the

14:21.520 --> 14:29.200
resources there, and right now, by default, you are allowed to create and delete virtual machines

14:29.200 --> 14:38.240
and virtual machine instance migration objects. And the Kubernetes RBAC model is purely

14:38.240 --> 14:44.080
additive: you cannot deny anything, you can only add. So, since this is granted by default

14:44.080 --> 14:50.800
installations, it means that all of your namespace owners can trigger live migrations for the

14:50.800 --> 14:56.400
virtual machines in their namespace, of course. What's the issue? We have only a single migration

14:56.800 --> 15:05.760
queue. It means that you can affect cluster-critical operations, like node drains or upgrades,

15:05.760 --> 15:15.280
because your live migration requests land on the same queue. So, in the next version

15:15.280 --> 15:23.840
of KubeVirt, we decided that the KubeVirt admin role, by default, is not going to be

15:23.840 --> 15:30.640
allowed anymore to create and delete virtual machine instance migration objects; this is going to be granted

15:30.640 --> 15:37.920
only with an additional role, named kubevirt.io:migrate. As a cluster admin, you will be able to

15:37.920 --> 15:46.720
grant it individually to selected users, or eventually label it to be aggregated,

15:46.720 --> 15:53.280
as in the past, to the admin cluster role, to get back to the previous behavior. It's not an

15:53.280 --> 16:06.880
API change, it's purely additive. Then, back to our initial problem: the motto that guided us over

16:06.880 --> 16:14.560
the years is the KubeVirt razor: if something is useful for pods, we should not implement it

16:14.640 --> 16:22.080
only for virtual machines. The point is that here we are talking about live migrations, and

16:22.080 --> 16:28.080
live migration is something that is not relevant for pods. So, this is a virtual machine specific

16:28.080 --> 16:38.720
topic; we should address it in KubeVirt. In the proposal, we also listed a few alternatives.

16:39.600 --> 16:48.640
One of them is something that you can already do today, without any change in the

16:48.640 --> 16:57.280
KubeVirt code. You can set a temporary node selector or node affinity on the virtual machine, wait for

16:57.280 --> 17:04.160
it to be propagated down to the virtual machine instance (if you configure KubeVirt

17:04.160 --> 17:11.760
with the LiveUpdate rollout strategy). Only at that point can you trigger

17:11.760 --> 17:18.720
a live migration with the existing API, nothing special here; wait for the migration to complete,

17:19.360 --> 17:25.200
and now, only now, you can remove the additional constraints that you set on the virtual machine object.

17:25.920 --> 17:30.640
Why don't we like it, or at least, why don't I like it? It's an imperative flow on a declarative

17:30.880 --> 17:35.920
platform. It still has to be somehow orchestrated, and that will be completely up to the user.

17:36.960 --> 17:44.000
It can clash with DevOps and infrastructure-as-code tools that are managing the

17:44.000 --> 17:52.640
virtual machines on your behalf. Another possible option: in Kubernetes, you can configure more

17:52.640 --> 18:03.600
than one scheduler. You can add a secondary scheduler. We know that there are also load-aware scheduling

18:03.600 --> 18:09.200
plugins, so you can configure a secondary scheduler that is load-aware: it's going to take

18:11.040 --> 18:19.360
into consideration the actual resource consumption of virtual machines. But still,

18:19.920 --> 18:25.040
you have to configure the secondary scheduler, and then each individual

18:25.040 --> 18:29.200
virtual machine has to be configured to be scheduled by that secondary scheduler.
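
Opting a single VM into a secondary scheduler looks roughly like this (the scheduler deployment name is hypothetical; `schedulerName` is set per VM template):

```yaml
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: load-aware-vm                        # hypothetical name
spec:
  template:
    spec:
      # Each VM must opt in individually; every other VM still
      # goes through the default kube-scheduler.
      schedulerName: my-load-aware-scheduler  # hypothetical secondary scheduler
      domain:
        cpu:
          cores: 4
```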

18:31.840 --> 18:39.040
And still, even if the scheduler is load-aware and knows the actual CPU consumption on the nodes,

18:39.840 --> 18:47.200
it's going to schedule according to the static reservation that we set on

18:47.200 --> 18:56.160
our virtual machine. And, if you remember, we are setting by default one-tenth of the allocated

18:56.160 --> 19:02.240
cores. It means that, if you need four cores, the scheduler is not aware of that, and it's

19:02.240 --> 19:11.520
still going to schedule for only 0.4. And this is also only going to affect the scheduling: it's

19:11.600 --> 19:18.880
not going to watch the actual consumption on your cluster over time and react by rebalancing

19:18.880 --> 19:27.520
the cluster. It's still up to you to monitor the cluster and eventually inject migration objects

19:28.320 --> 19:37.200
just to get the scheduler to do something. Another option is to use the descheduler

19:37.280 --> 19:42.640
for automatic workload balancing, eventually combining it with a load-aware scheduler.

19:45.680 --> 19:56.000
The descheduler is, as the name suggests, the opposite of the scheduler: it's a tool that is monitoring

19:57.120 --> 20:02.960
your nodes and can decide to deschedule something. Since two months ago, the descheduler is aware of virtual

20:02.960 --> 20:10.560
machines: it means that KubeVirt is going to react and live migrate them automatically.

20:13.200 --> 20:21.040
Since November, the descheduler is also load-aware. It's a new feature: it can

20:21.120 --> 20:32.320
deschedule virtual machines according to the actual CPU consumption. It's a really good option.

20:32.320 --> 20:39.600
It's a really good way to continuously balance your cluster. On the other side, this is just about

20:39.600 --> 20:48.400
descheduling: it is not affecting how the migration is going to complete,

20:48.400 --> 20:54.080
because the descheduler will trigger a live migration, but then it will be

20:54.080 --> 21:01.200
up to the scheduler to select the node, and it is still only going to account for what it knows.

21:05.440 --> 21:14.080
But this is an interesting approach, and we are continuously working on it. One more

21:14.160 --> 21:23.440
thing: we are trying to enhance it with pressure stall information. PSI is a metric that has been

21:23.440 --> 21:31.360
supported by the Linux kernel since version 4.20, so it's not even that new. It's supported at

21:31.360 --> 21:41.440
the node and the cgroup slice level. It's not a metric about CPU utilization; it

21:41.440 --> 21:48.640
measures exactly the actual productivity loss caused by the scarcity of resources. And we have it

21:48.640 --> 21:56.480
for memory, CPU and I/O. The kernel measures the amount of time in which your cgroup slice

21:56.480 --> 22:04.000
is stuck because it's waiting for a CPU that is not available at the moment, and the kernel

22:04.080 --> 22:10.240
reports it. We did some experiments and the results are really convincing.
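
For reference, PSI as exposed by the kernel, for example when reading `/proc/pressure/cpu` (numbers illustrative): the `avg` fields are the percentage of time over the last 10/60/300 seconds in which at least one task was stalled waiting for CPU, and `total` is the cumulative stalled time in microseconds.

```text
some avg10=2.35 avg60=1.10 avg300=0.40 total=157622
```

The same format is available per cgroup in the cgroup v2 hierarchy (e.g. the `cpu.pressure`, `memory.pressure` and `io.pressure` files).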

22:11.280 --> 22:17.680
Unfortunately, the PSI metrics are still not exported by cAdvisor,

22:17.680 --> 22:23.760
which is the tool that reports the metrics to the kubelet. There is an open PR; it's pretty recent,

22:23.760 --> 22:29.760
from about one week ago. We are going to have it in the future. This is going to be a really

22:29.760 --> 22:41.280
interesting way to automatically balance the cluster. So, we presented a few options. We have

22:41.280 --> 22:49.360
a design proposal. The design proposal is still not accepted. We are a community. We have users.

22:50.560 --> 22:57.200
So please make your voice heard, if you think that you need this feature or even another feature.

22:57.200 --> 23:05.200
If you think that you need something, please speak up. I think that, as developers, we have a vision

23:05.200 --> 23:11.360
of the cluster, we have a vision of the user needs. Maybe we are right, maybe we are wrong. We also want

23:11.360 --> 23:17.200
to get your feedback. Thank you.

23:17.200 --> 23:27.360
Okay, first question. The question is that there is a problem in what I said, because if I set a node selector to

23:27.360 --> 23:33.760
migrate from node A to node B, and then the machine gets rescheduled and goes back to

23:33.760 --> 23:41.440
node C, that's completely unexpected; so the state must be stored somewhere, to stay at node C or B or whatever.

23:42.400 --> 23:51.120
Okay. So he said that, if we simply add the additional constraints to

23:51.920 --> 23:57.280
the migration object, they are not going to stay on the virtual machine. Yes,

23:57.280 --> 24:03.200
it's true, and it's absolutely expected. If you want it to be a persistent change, please set it on

24:03.200 --> 24:09.280
the virtual machine object. But that's a problem again: if I have the permission

24:09.360 --> 24:16.800
to migrate, do I also have the permission to control the migration? Either I should be

24:16.800 --> 24:29.920
allowed to do both, or neither. So, the constraints are set on an object, and if you are the

24:29.920 --> 24:36.320
owner of the virtual machine object, you are allowed to edit that object. It's up to your

24:36.400 --> 24:45.520
cluster admin to decide if you are entitled to trigger a virtual machine instance migration now.

24:45.520 --> 24:53.760
If not, you can only specify where you want to have your virtual machine and sooner or later

24:53.760 --> 25:02.000
it will happen, but it's not up to you to force it. And we can talk later.

25:06.800 --> 25:14.800
So, he's asking about the use of pod disruption budgets in KubeVirt:

25:14.800 --> 25:19.840
yes, we are using them. We are using them to protect the virtual machine, to be sure that it's

25:19.840 --> 25:29.120
not going to be killed. And we are using a second PDB also to protect the target pod of the

25:29.200 --> 25:33.760
live migration. So yes, we are using them. Next question.

25:47.040 --> 25:53.840
So, the disk of the virtual machine is stored on system storage or on external storage,

25:54.480 --> 25:58.960
depending on how you configure the virtual machine. It could eventually be automatically

25:58.960 --> 26:05.280
restarted: if you say that the virtual machine should be automatically restarted, you can configure that.

26:06.160 --> 26:12.320
We also have additional operators that continuously monitor the nodes

26:13.440 --> 26:17.040
to speed up the recovery process, if you need high availability

26:17.040 --> 26:23.520
for your virtual machines. Next question: what if the node loses the connection to the cluster,

26:23.520 --> 26:27.360
but not to the persistent storage? I would have two machines touching the same disk.

26:27.360 --> 26:31.120
That's why we have those additional operators. Normally, we wait.

26:34.640 --> 26:40.720
So, he's asking what is going to happen if the node that was

26:40.720 --> 26:46.560
hosting the virtual machine lost the network connectivity, but it's still able to

26:46.560 --> 26:51.760
write to the disk: potentially, it could corrupt the virtual machine disk. We have locking mechanisms,

26:51.760 --> 27:00.160
and we have additional operators that are going to use fencing mechanisms to be sure that the node is

27:00.160 --> 27:06.400
really dead, if you need them. Normally, we have a long timeout to be on the safe side.

27:09.360 --> 27:10.160
Thank you very much.

27:16.560 --> 27:18.560
Thank you so much.

