WEBVTT

00:00.000 --> 00:24.320
Hello, everyone. I'm Vitaly. I'm from Red Hat and I normally work on things like

00:24.320 --> 00:33.960
KVM; you can meet me in the Linux kernel and other, mostly lower-level, projects.

00:33.960 --> 00:41.280
And for the last couple of years, I've been involved in making sure Linux runs well on confidential

00:41.280 --> 00:48.960
VMs. So I'd like to continue discussing the topic that was started earlier. And basically,

00:49.040 --> 00:56.400
I want to have like a more generic talk on why we still have some trust issues when we use

00:56.400 --> 01:06.160
confidential VMs, and whether these issues are going away anytime soon. So, going back to basics: what is

01:06.160 --> 01:11.440
a confidential VM? I mean, what's the main promise of a confidential VM? Why would you be

01:11.440 --> 01:18.560
interested? The main promise is that it gives you data confidentiality, right? So nobody

01:18.560 --> 01:27.520
but you as the owner of this VM, no matter where it runs, can get to the data which this VM has

01:27.520 --> 01:37.760
to process or otherwise handle, right? So that's the main difference. And where can you get these VMs?

01:37.760 --> 01:45.200
Well, they're becoming pretty ubiquitous. And over the last year, we've seen that some cloud

01:45.200 --> 01:52.960
providers, for example Google, just moved to general availability back in the fall, I think

01:52.960 --> 01:59.920
like October and November, both with SEV-SNP and TDX. Azure has been doing SEV-SNP for

01:59.920 --> 02:06.640
quite a while. They have Intel TDX in public preview. I hope we're going to see it in

02:06.640 --> 02:14.240
GA soon. And in AWS, you can get SEV-SNP feature enablement for certain instance types, also

02:14.240 --> 02:24.480
in GA. So at least the major hyperscalers support this already. Then the question is,

02:24.480 --> 02:30.480
can you run it on premise? Because for quite a while, it was kind of interesting because

02:31.440 --> 02:36.560
you look at the patches and they're still being discussed upstream and so on, but then you go to

02:36.560 --> 02:40.880
the cloud and they're like, oh, we're KVM-based and we already run confidential VMs. How do you guys

02:40.880 --> 02:50.880
do that without the upstream enablement? Anyway, so last year we finally got patches for SEV-SNP

02:51.840 --> 02:59.600
landed in KVM, QEMU, I guess libvirt, and OVMF. So basically you don't need to

03:00.960 --> 03:08.000
do anything special; you can take upstream bits and run SEV-SNP VMs now. Intel TDX is a long story

03:08.880 --> 03:16.080
because of how it's done, but it's coming. I guess that we will finally have KVM enablement in the

03:16.080 --> 03:24.480
next upstream kernel, and QEMU will land shortly after. For now, there is an effort in CentOS:

03:24.480 --> 03:30.080
there is a CentOS SIG which builds these not yet fully upstream bits. So if you are interested

03:30.080 --> 03:42.160
in running Intel TDX on premise, check out this SIG. One particularly hard part of TDX enablement is

03:42.160 --> 03:52.400
that you need to have this DCAP enclave for making TDX quotes, which is basically

03:52.400 --> 04:00.640
an SGX enclave. The way Intel packages it makes it really hard to put it in a distribution,

04:00.640 --> 04:07.200
but I've heard good news that it's finally coming into Fedora soon, at least, and hopefully

04:07.200 --> 04:16.320
other distributions. Okay, so you see that we can already get a confidential VM either on premise

04:16.320 --> 04:22.080
or we can get it in the cloud, right? And as I said, the main promise is that it's confidential,

04:22.080 --> 04:28.800
right? So are we good, right? Like, all done, nothing to talk about? Oh, not quite, I guess.

04:30.640 --> 04:35.520
A very simple question: you don't want to depend on the infrastructure,

04:35.600 --> 04:41.840
but you still need it to run your VM. So you go to your cloud provider's

04:41.840 --> 04:48.000
web portal, click, click, click. Here is your confidential VM, connect to this IP address.

04:49.520 --> 04:53.520
Why do you think it's confidential? Why do you think it keeps any of the promises?

04:55.040 --> 05:01.440
Yes, so if you go into this confidential domain, people will tell you one

05:01.440 --> 05:07.600
magic word about this: attestation. You need to attest. And then you start reading more about

05:07.600 --> 05:13.280
what attestation means, and these people are really smart. They use their own language, and you will

05:13.280 --> 05:18.320
very quickly get lost like I do, right? They're like, oh, we have this RATS protocol and everything.

05:19.040 --> 05:26.400
So they discuss this protocol and everything, and it's easy to get lost because they tell you,

05:26.400 --> 05:31.280
oh, we just gather some evidence from the instance, and then we do magic, right?

05:31.440 --> 05:38.880
And it shows us the green checkbox: you're all good. But what does this actually prove?

05:38.880 --> 05:46.000
Attestation: what do we need to check when we get this confidential VM? So I think that

05:46.000 --> 05:53.520
attestation is very much an umbrella term, a big umbrella. So my take on this would be that

05:54.240 --> 05:59.440
first you need to make sure that you're actually running on the appropriate hardware, right?

05:59.600 --> 06:06.400
It's not some sort of emulation of the hardware, because otherwise you cannot really

06:06.400 --> 06:14.240
trust it unless you wrote this emulation yourself or something. Then the VM gets built, right?

06:14.240 --> 06:20.480
You need to make sure that it was built in the right state: the initial set of CPU registers,

06:20.480 --> 06:27.280
where you can set them, like on SEV-SNP, and the initial memory, right? So that you get something

06:27.360 --> 06:33.520
which is expected. Then your VM probably boots somehow, right? So you have a chain of artifacts

06:34.720 --> 06:40.160
from firmware to some bootloader, maybe another bootloader, and then to your kernel,

06:41.360 --> 06:51.920
and you kind of care what's in this chain. Next, if your VM has any storage, and normally it does,

06:52.000 --> 06:57.920
right, it even boots from storage, then you may want to ensure that nobody has

06:57.920 --> 07:06.240
tampered with the storage. And then it comes to the last point: making sure, when you have the VM running,

07:06.240 --> 07:11.280
that nobody else but you can actually access it, nobody can inject anything in this VM. And in

07:11.280 --> 07:17.600
particular, some provisioning agents are known for providing unlimited access to your VMs.

07:18.480 --> 07:29.120
Anyway, hardware. So, well, we are interested in these confidential VMs, which are

07:29.120 --> 07:39.120
rooted in hardware, so you trust the CPU. And these CVM technologies normally can provide you

07:39.200 --> 07:46.240
some sort of, think about it like a signed document saying: this hardware is good.

07:47.680 --> 07:53.440
How do you know that this report is coming from the hardware where you run, right? You can add some

07:55.120 --> 08:03.360
data there, like for example, your SSH host key, right? So it's signed together with your

08:03.440 --> 08:08.720
other information. And then you can look at the report and say: okay, I'm connecting to a real

08:09.680 --> 08:19.040
VM. But not all VM implementations will give you raw access to this mechanism, which can

08:19.040 --> 08:26.000
give you a report. Namely, on Azure they use a paravisor, so you can still get the

08:26.000 --> 08:33.040
report, but from a vTPM, which means that now you cannot do this challenge, right? You

08:33.040 --> 08:40.960
can't put your own data in there. You have to trust someone who got this report for you, right? So your

08:40.960 --> 08:47.440
vTPM becomes your new root of trust. And I'm going to talk a bit later about whether you

08:47.520 --> 08:57.360
can or cannot trust it, really. Okay, fast forward: initial VM state. Basically, you will hear, okay,

08:57.360 --> 09:02.000
we have some launch measurement, which is a hash describing the initial state of

09:02.000 --> 09:10.000
registers and memory, which is great, but do you know what the hash is? Can you

09:10.000 --> 09:20.560
pre-calculate it? So yes, if you run your own CVM on premise, there are tools where you

09:20.560 --> 09:26.560
can give your firmware and then describe the configuration of the VM, like how many vCPUs you will

09:26.560 --> 09:32.960
have and so on, and you will get some hash, which presumably you will later see in the VM.

09:33.600 --> 09:40.320
What if you are on a cloud provider? Well, some cloud providers are nice, and they give you something

09:41.520 --> 09:49.440
like a reproducible build of their firmware. Basically: here is the repo of OVMF with our patches,

09:50.480 --> 09:59.760
build it. AWS does that. Thank you, guys. You have to use the Nix build system to build it,

09:59.840 --> 10:05.600
which is funny, but the instructions are there, you can try it, and hopefully you will get the

10:05.600 --> 10:15.040
hash which you see in AWS's instances. Then, a nice idea: if you can bring your own firmware,

10:15.040 --> 10:21.680
you can also plug it in, presumably. We don't have it available on any platform yet,

10:21.680 --> 10:31.600
but eventually we may get these predictable hashes. And what if none of this is available to you?

10:31.600 --> 10:38.880
Well, then the only thing you can basically tell is that this hash never changes. Every time you run

10:38.880 --> 10:45.520
your CVM on a public cloud, you get the same hash. So if you don't think that this infrastructure

10:45.520 --> 10:51.920
is inherently evil and was evil from the very beginning, if you only care about

10:51.920 --> 10:57.520
attacks like, oh, some intruder will get access to some host which runs my CVM, is going to change

10:57.520 --> 11:03.760
some bits there, right, and hack into my CVM, then you may consider yourself protected against such an

11:03.760 --> 11:10.080
attack, right, just because you observe the same hash over and over again. Okay,

11:10.800 --> 11:16.080
boot chain, right? You're likely going to run firmware; the firmware was measured

11:16.080 --> 11:22.800
and everything. You start booting, and how do you ensure that what you boot is valid? Because normally

11:22.800 --> 11:27.440
you load it from storage, right, so it's not covered by any launch measurements.

11:29.680 --> 11:36.080
Normally, two mechanisms are used: Secure Boot and measured boot. And again, the name makes it sound

11:36.080 --> 11:42.640
like as soon as you enable secure boot you are secure, but that's not always the case,

11:42.640 --> 11:50.720
and the thing is what Secure Boot really is: there is some database of, say, good certificates,

11:50.720 --> 11:57.040
and you check that whatever you boot was signed by one of the certificates from this database.

11:58.480 --> 12:04.080
If you take the defaults, for example, there are Microsoft certificates there; they signed

12:04.240 --> 12:13.440
various shim versions, which load pretty much everything out there. So just by the fact that you

12:13.440 --> 12:20.640
have Secure Boot enabled, you are not going to gain that much, unless you also measure, basically

12:20.640 --> 12:27.040
remember, which certificates you used in the boot chain. This is what we call measured boot,

12:28.000 --> 12:34.640
and then you need a trusted device for that. You are going to record, say,

12:34.640 --> 12:40.320
hashes of the certificates which you used in the chain, or all the binaries which you used in

12:40.320 --> 12:46.480
the boot chain; you are going to record them somewhere, so you are able to check them later, using them as

12:46.480 --> 12:55.360
evidence to prove that you are good. Where can this place be? Normally you can use a

12:55.360 --> 13:02.480
TPM device, like on hardware, but you're in a VM, so it's likely going to be a virtual TPM.

13:02.480 --> 13:07.360
There is another mechanism which I wanted to mention: TDX offers you these similar

13:07.360 --> 13:16.960
registers, called RTMRs; you have only four of them, right? But the thing is, they're not that

13:16.960 --> 13:25.120
well supported across the stack, and they have very limited use, so I'm not

13:25.760 --> 13:31.920
convinced that this is going to get a wide adoption when TDX finally rolls out.

13:33.280 --> 13:40.880
So, vTPM: you measure something in a vTPM, and then you trust it. Well, in that case,

13:41.840 --> 13:49.680
you want to be sure that the vTPM is trusted, right? Because if it's controlled by, say,

13:49.760 --> 13:55.760
fully controlled by the host, well, then the host can change this measurement, right? You measure

13:55.760 --> 14:04.800
something in it, and it will just swap this measurement with something else for you. The idea is for

14:04.800 --> 14:12.960
SEV-SNP, for AMD, that they have these privilege levels, and you can implement your vTPM

14:12.960 --> 14:25.760
as part of firmware. You don't always know how clouds do it. Some of them

14:27.520 --> 14:36.880
claim that they implement this, in particular Microsoft. If you run on premise, there is

14:37.680 --> 14:46.080
the SVSM project, which implements a vTPM as part of the firmware inside the guest, and the

14:46.080 --> 14:50.400
beauty of it is that it gets into the launch measurement, right? If it's part of the firmware, then you measure it.

14:52.000 --> 14:58.320
For TDX, as I remember, two architectures were suggested. Initially,

14:58.320 --> 15:04.080
since they don't have these VMPLs, they suggested that you can run the vTPM in a separate

15:05.040 --> 15:11.200
TD, basically a separate VM, and connect them somehow, right? Which creates its own set of

15:11.200 --> 15:20.640
complexities. The newer architecture is called TD partitioning, which is basically: let's run

15:20.640 --> 15:26.160
a nested hypervisor, right? We create an L1, which is going to be privileged;

15:26.160 --> 15:35.920
it will contain your vTPM, then you will put your Linux in L2. You will see, as I said,

15:35.920 --> 15:47.920
TDX enablement is just landing at the cloud providers which enable it. For Azure, I haven't seen

15:47.920 --> 15:56.080
any publicly available claim on how they do the TPM on their TDX instances,

15:56.560 --> 16:02.800
but they have it in preview, so maybe by the time they go to GA, they will say something

16:02.800 --> 16:10.960
about it. Google, as far as I understand, never claimed that their TPM is isolated from the host.

16:11.920 --> 16:18.800
So basically, you have to trust it.

16:18.800 --> 16:29.760
Yeah, yesterday in the CoCo devroom, there was a talk by Microsoft, where they described how they

16:30.320 --> 16:41.520
do this vTPM-based booting for their Azure Linux. I did a similar talk in this devroom

16:41.600 --> 16:47.600
a year ago, which was more technical than this one, so I encourage you to

16:48.400 --> 16:59.360
watch it if you are interested. Okay, so you have your boot chain validated, right, and then it

16:59.360 --> 17:13.360
comes to storage. The thing is that, again, storage is not covered by any launch measurements.

17:14.640 --> 17:24.800
You may have parts of storage which are, say, persistent. Think about an image-based

17:25.760 --> 17:31.520
operating system, where you don't really want to hide it from the whole world. For example,

17:31.520 --> 17:34.880
it's like an open source operating system. There is nothing to hide, you may say,

17:34.880 --> 17:40.400
but you still care that the host, or somebody else, an attacker, doesn't

17:40.400 --> 17:48.800
tamper with these static parts of your storage. You may use something like

17:49.760 --> 17:57.680
dm-verity. You can, say, sign the hash which you produce. You can inject the key which

17:58.720 --> 18:08.000
signed the hash into your Secure Boot db, and then you can check that you are using it. Normally,

18:08.000 --> 18:16.240
this works, right? But again, you will have to make sure that it's all measured, and again,

18:16.240 --> 18:24.240
we get to the point where your vTPM must be trusted. With volatile storage, where you want to

18:24.240 --> 18:32.400
save data, you have to encrypt, right? There is no other protection. The question is,

18:34.000 --> 18:41.280
if you have some encrypted storage, how can you get the key? Again, you don't trust your infrastructure,

18:41.280 --> 18:48.080
so going to, I don't know, a serial console and entering a password on boot is not an option,

18:48.080 --> 18:52.720
because the serial console is controlled by the infrastructure. You have to do it automatically, and also

18:52.720 --> 18:57.680
it doesn't scale, of course. But even without scaling, it can't be done in a secure manner.

18:59.840 --> 19:07.200
Normally, people suggest two approaches for how you can do it. First, you can use

19:07.520 --> 19:15.200
a policy in your TPM device. In the policy in the TPM device, you basically say:

19:15.200 --> 19:21.200
hey, my PCR registers should look like this, and then you can unlock this password and give it

19:21.200 --> 19:26.720
to the operating system, and you can, for example, use hashes, again, of the certificates

19:26.720 --> 19:33.760
which you used in the boot chain previously, or something. This normally works to the point where

19:33.760 --> 19:41.520
you trust your TPM device. If you think that it's implemented the right way, or you trust it for some

19:41.520 --> 19:48.720
other reason, then yes, you can leave the evaluation of the policy to the TPM and use it.

19:50.720 --> 19:56.320
Another idea is to do remote attestation: basically,

19:56.320 --> 20:01.280
gather all the evidence we've got so far, starting from launch measurements, boot chain,

20:01.280 --> 20:11.120
and so on, give it to someone who holds our key, and receive the key which decrypts your storage.

20:12.320 --> 20:28.800
Which is, yeah, just good. Only that, that is from the TPM, right?

20:31.840 --> 20:42.800
Sorry, yeah. So, yes. So, what I wanted to say is that if you want to do it, then your

20:42.800 --> 20:53.280
vTPM must be stateful. By stateful, I mean that every time you boot, the vTPM has the same

20:53.280 --> 21:00.080
keys in it. Think of a vTPM as some key storage and a very simple processor around it,

21:00.160 --> 21:05.520
which can do some simple actions, like decrypt, seal and unseal, verify something.

21:08.240 --> 21:14.160
And if you boot on public clouds today (again, I'm using hyperscalers as an example),

21:15.040 --> 21:21.120
they will all give you something which initially looks like a stateful vTPM. It preserves the keys,

21:21.200 --> 21:30.880
it's all there. You may wonder where this comes from. As far as I know, only Azure claims that

21:32.800 --> 21:40.640
they do some sort of under-the-hood attestation, early in firmware, to decrypt the state.

21:40.640 --> 21:48.320
So it's basically stored as a file, which the host can access, but the file is encrypted.

21:48.320 --> 21:53.920
And to decrypt it, early on, via attestation, they connect to their own Microsoft server, prove that

21:53.920 --> 22:02.720
this is a genuine CVM, and decrypt the state. You can use it, but again, then you trust

22:02.720 --> 22:09.920
this other part of Microsoft's infrastructure if you do so. Is it good or bad? Well, it depends on

22:09.920 --> 22:15.680
what threat model you use, what you care about. If you don't want to trust your cloud

22:15.680 --> 22:22.800
provider at all, this just shifts your trust from one place to the other inside the cloud provider.

22:22.800 --> 22:30.080
You're not gaining much, right? You may want to say that, okay, I don't want to protect myself

22:30.080 --> 22:36.320
against a completely evil cloud provider. What I care about is somebody hacking into some

22:36.320 --> 22:44.000
host which runs my CVM and stealing something from it. In that case, you do get isolation from

22:44.000 --> 22:53.360
the host where you run. This procedure allows for that. What if you run your own infrastructure

22:53.360 --> 23:00.560
and want to build something similar? In the SVSM project, there is some work going on, and somebody

23:00.560 --> 23:12.240
was telling us about it yesterday in the CoCo devroom. It is quite complex because you either need to

23:12.240 --> 23:18.640
have, say, networking early, very early in the boot process, so that firmware is going to talk

23:18.640 --> 23:26.000
to some server way before even your bootloader starts, right? Which is challenging. Or you need to

23:26.000 --> 23:31.360
go through the host and have some sort of a proxy there, but this comes with its own complexities,

23:31.360 --> 23:37.760
especially when you think about real-world deployments, right? Because your host and

23:37.760 --> 23:43.280
your guest may not even be connected to the same network, right? Your host sits in some network,

23:43.280 --> 23:48.720
like a management network, and your guests are connected with virtual functions to another network.

23:48.720 --> 23:53.280
So, if you go through the host, you cannot reach a server which the guest can also access.

23:54.320 --> 23:59.920
It's still ongoing, but we will have some sort of a solution soon, I guess.

24:00.640 --> 24:10.080
Okay, so last but not least is provisioning agents. If you run on something like

24:10.080 --> 24:19.040
a public cloud, you are probably used to things like cloud-init, and the thing about cloud-init

24:19.040 --> 24:26.160
is that it's fully controlled by the infrastructure. The data source normally can do anything to

24:26.160 --> 24:32.880
your VM, run any script there, inject any key. If you use it, then you trust it.

24:32.880 --> 24:40.320
It's not isolated from the provider or anything, you cannot sign it yourself, or at least

24:40.320 --> 24:48.720
there are no publicly available solutions for that. I know that this has been looked at by some cloud providers

24:49.680 --> 24:59.040
who want to provide some sort of trusted data source, but so far I haven't seen any implementations.

25:00.160 --> 25:06.240
So, what you have to do is basically avoid any provisioning agents, which is hard, because

25:07.520 --> 25:14.160
even if you don't want, for example, your cloud-init to inject your SSH key, you still may want to,

25:14.480 --> 25:24.960
say, let it configure your networking. So, something like a policy-based mechanism would be

25:24.960 --> 25:37.440
nice to have, it's just that we don't have one. Okay, so that was basically the state of the art as I

25:37.520 --> 25:44.080
see it currently. As you can see, in some places we have some fairly good solutions, in

25:44.080 --> 25:53.200
other places, not so much, and the work is ongoing. So, thank you, any questions?

25:59.840 --> 26:05.360
Go ahead. I have two questions, actually. Is it possible to embed the encryption key

26:06.240 --> 26:13.280
for the storage in the firmware itself? The question is whether it's possible to embed the key

26:13.280 --> 26:21.360
to the storage in the firmware. The firmware is visible to the host. You don't want to give your

26:21.360 --> 26:26.800
storage key to your host; in no way do you want to do that, right, because the host will be able to

26:27.520 --> 26:31.360
then do something with your storage. So, no, you cannot do that.

26:35.520 --> 26:40.800
No, even if you are bringing your own firmware, you are giving away this firmware

26:40.800 --> 26:46.880
to the host. The host is going to use it to build your VM. It sees it completely.

26:48.880 --> 26:54.960
And my other question is: what if you want to do confidential dedicated servers

26:54.960 --> 27:02.560
or confidential containers, is it different? The question is, if we want to do

27:02.720 --> 27:13.120
confidential containers, whether it's different or not. It's slightly different, but

27:13.120 --> 27:19.680
normally, if you want to do it on, say again, public cloud infrastructure,

27:20.560 --> 27:27.120
the underlying VM which runs these containers must also be trusted by you.

27:27.120 --> 27:36.800
Because think about this, if your cloud provider can swap your kernel in your VM with something

27:36.800 --> 27:43.680
different, and you use your kernel for your container isolation, can you guarantee anything about

27:43.680 --> 27:51.520
the confidentiality of these containers while they're processing data? Probably not, right? So, it's a

27:51.520 --> 27:58.800
building block, right? Then you come to getting this container in securely, getting data for it,

27:58.800 --> 28:04.880
you know, making sure that only these trusted, attested VMs can do it. Yes.

28:08.480 --> 28:15.760
Dedicated servers, I mean, CVM technology is better than bare metal in this regard, right?

28:15.840 --> 28:25.600
Because if you have your server somewhere, right, then normally somebody can have like physical

28:25.600 --> 28:31.920
access, for example, to the server, can inspect your memory, for example, or something. CVM technology

28:31.920 --> 28:41.600
kind of protects you against that, right? So, well, we're out of time. If you have more questions,

28:41.600 --> 28:48.880
please talk to us outside the room. Yeah, thank you again, I'm around.