WEBVTT

00:00.000 --> 00:20.400
All right, thank you, appreciate everybody taking the time to come to this session about

00:20.400 --> 00:28.800
freeing your Kubernetes network. My name is Doug Smith and I am joined by truly the master

00:28.800 --> 00:38.320
mind behind this work, Miguel, and I'm lucky enough to get to work with him. What we're

00:38.320 --> 00:45.760
going to talk about today is a feature that we often refer to as user-defined networking

00:45.760 --> 00:53.440
or UDN, which is kind of a way of changing the paradigm about how Kubernetes networking works

00:53.520 --> 01:00.080
at its core, and especially related to virtual machine use cases. So what I'm going to do is kind

01:00.080 --> 01:07.040
of introduce some of the motivations, some of the problems, and then Miguel's going to get into the

01:07.040 --> 01:14.800
nitty gritty about the details of how to use it, and if we've got time run through a demo as well.

01:15.040 --> 01:26.000
There is a large stack of open source technologies that are involved in this particular project

01:26.000 --> 01:32.240
we're already building on top of Kubernetes, and we're using virtualization on top of Kubernetes

01:32.240 --> 01:38.640
as well, and really these two projects right here at the front, OVN-Kubernetes, which sometimes

01:38.640 --> 01:45.200
we call OVN-K, and KubeVirt, are really at the top of this, but all of these technologies are

01:45.200 --> 01:56.240
open source. You can pull the hood open and get into all of them. Whoops. So let's talk a little bit

01:56.240 --> 02:03.760
about kind of the motivation here, and I kind of see it as sort of two sides of the same coin.

02:04.320 --> 02:11.600
First, you've got really your traditional virtualization user, and in a traditional virtualization

02:11.600 --> 02:17.600
world, like say if you're using OpenStack and you're using Neutron, you're going to make an

02:17.600 --> 02:25.360
assumption that you're going to get layer 2 isolation across your virtual machines in order to

02:25.360 --> 02:33.200
segregate the traffic properly and to create a tenancy scenario. And if you're a Kubernetes

02:33.200 --> 02:40.640
savvy user, you're going to probably have an assumption that you're going to have a bit more of a

02:40.640 --> 02:48.240
like managed experience, a bit more of an opinionated way to do something. So if you are putting together

02:48.240 --> 02:55.520
something like this today, you're probably, well, I should say today, prior to user-defined

02:55.520 --> 03:01.600
networking, for example, you're going to have to kind of like baby step all of the things that

03:01.600 --> 03:07.680
may be involved. So maybe you're going to have to figure out your own IP address management solution,

03:07.680 --> 03:15.600
and you're going to need DHCP in order to handle that. So you'll have to set up these things yourself.

03:15.920 --> 03:23.840
And this is kind of underpinned by something that is truly different between the virtual machine

03:23.840 --> 03:30.480
world, and the like purely containerized pod world, which is in a virtual machine scenario,

03:31.200 --> 03:37.120
you are often going to assume that the life cycle is different. If you have a virtual machine that

03:37.120 --> 03:41.920
goes down, it's not really going to go down. It's going to be migrating, and what are you going

03:41.920 --> 03:48.320
to need in that scenario? Probably persistent IP addressing. So once you get that migration,

03:48.320 --> 03:54.960
your traffic is still going to go to just the right thing. It's not like a pod life cycle where

03:54.960 --> 04:00.560
you kind of build your applications assuming that they can arbitrarily die, and they'll come back

04:00.560 --> 04:07.280
up, they'll have arbitrary IP addressing. So that's quite a bit different. So really the motivation

04:07.280 --> 04:12.400
is to try to merge these worlds and make something that fits this a little bit better because

04:12.400 --> 04:19.440
the bottom line is Kubernetes is very opinionated. It's designed for web scale, and it's going to have one

04:19.440 --> 04:26.160
single network where everything is assumed to have connectivity between all of the elements in

04:26.160 --> 04:34.720
that cluster. So if you wanted to make something that had this type of functionality, you'd probably

04:34.720 --> 04:41.680
have to create that kind of micro segmentation, we'd say, using a network policy. And the thing

04:41.680 --> 04:48.320
with that is that it's expensive on two fronts. First of which is it's going to be computationally

04:48.320 --> 04:53.760
expensive because you're going to have to make a lot of different rules in order to get that

04:53.760 --> 05:01.440
kind of segmentation. And then it's also going to be procedurally expensive for the humans that are

05:01.520 --> 05:05.440
involved because they're going to have to create it. And some of the things may be like

05:05.440 --> 05:11.120
really challenging to do that way. So these are the kind of things that we're like I said trying to

05:11.120 --> 05:19.600
weave these two worlds together. So let's look a little bit at the use cases for this. So one thing

05:19.600 --> 05:27.840
that user-defined networking helps to provide is native namespace isolation. So instead of having

05:27.840 --> 05:33.840
that single Kubernetes network where everything is connected, what you'll be able to do right

05:33.840 --> 05:40.960
out of the box with user-defined networking is to create network isolation per namespace. So you could

05:40.960 --> 05:46.560
have it so just the pods in the orange namespace can only speak to the pods in the orange namespace.

05:46.560 --> 05:52.240
Same with the red, green, and blue, there's isolation between those. In Kubernetes typically,

05:52.240 --> 06:02.160
you would have connectivity between all of those namespaces. There's also a contract for a cluster

06:02.160 --> 06:10.160
wide network isolation, CUDN, cluster user-defined network, which allows you to logically

06:10.160 --> 06:18.320
group together those namespaces and get that isolation per group as well. And looking at the goals,

06:18.880 --> 06:26.880
we want to have that workload and tenant isolation just like I was diagramming there before.

06:26.880 --> 06:34.560
We want to have it so that you can manage that isolation between those namespaces or groups

06:34.560 --> 06:41.440
of namespaces. Another thing that we really wanted to do is make sure that you are able to

06:41.440 --> 06:49.680
handle overlapping pod IP addresses. So let's say you go and set up a situation that has

06:49.680 --> 07:00.160
whatever, a typical /24 subnet, say it's 192.0.2.0/24, and then you're going to

07:00.160 --> 07:06.960
make yet another group of these pods with isolation. Instead of having to go and tweak it and change

07:07.040 --> 07:14.720
that to have a bunch of namespaces so that they each take a unique subnet, you can just kind of copy

07:14.720 --> 07:21.920
and paste and just use that again and again and again, which will make it much easier. Additionally,

07:22.800 --> 07:32.160
we want to have that isolation, but we also want to have the kind of Kubernetes APIs that you

07:32.160 --> 07:39.600
would expect to have, such as services, network policies, admin network policies, we still want

07:39.600 --> 07:44.800
to be able to use that kind of stuff as well. So that is still available and accounted for.

07:46.160 --> 07:51.600
We want to have that stable IPAM configuration like I was saying. We want to have those persistent IPs

07:51.600 --> 07:58.000
when we get that, when we have those migration scenarios, we want to have that persistent IP,

07:58.080 --> 08:03.440
when that VM is migrated and comes up on another machine, we want to make sure that that works.

08:04.720 --> 08:11.280
Another thing that we want to make sure that works for you well is to have your public cloud support

08:12.160 --> 08:18.000
so that you can make sure that your traffic is going to get to the place where it needs to go on a

08:18.000 --> 08:25.280
public cloud. So in a public cloud scenario, your provider likely is going to have some constraints

08:25.280 --> 08:31.760
around how network traffic moves across their network, which may mean that you're going to need to

08:32.640 --> 08:39.040
let that public cloud provider know about the IP addresses that you're going to use and you don't

08:39.040 --> 08:44.640
want to have to make a phone call, actually probably no one has done that in years,

08:44.640 --> 08:49.840
or submit a ticket, or provision it via their web interface

08:49.840 --> 08:55.600
to like keep adding IP addresses kind of a thing. We want to make sure that that is handled

08:55.600 --> 09:03.280
by these technologies themselves. And last but not least, while we still have that

09:03.280 --> 09:11.920
network isolation, if you are doing functionality in your virtual machines that accesses the Kubernetes

09:11.920 --> 09:18.480
API, accesses a controller, for example, or you need something out of KubeDNS, we don't want to

09:18.480 --> 09:25.520
deny you from using those services either. So we make sure that there is connectivity to those

09:25.520 --> 09:33.840
kind of like core Kubernetes services. So let's dig into the API and I'm going to let Miguel take

09:33.840 --> 09:43.840
the stage here.

09:47.840 --> 09:55.040
Okay, this is not good. I'll let you do this. Okay, now no, it's okay, better. It's okay, it's okay.

09:56.000 --> 10:03.440
I have to do it. Okay, so the first thing we're going to be seeing here is what the API looks like,

10:03.440 --> 10:09.360
how you actually use and configure the feature. So let's start with an example because

10:09.360 --> 10:15.440
it makes it easier to understand. So if you want to configure namespace isolation for a namespace,

10:15.440 --> 10:23.680
let's say it's green, you would provision this CRD called UserDefinedNetwork. Again,

10:23.760 --> 10:29.120
you have to name the namespace in which you're going to be isolating your network.

10:29.120 --> 10:35.440
Because again, the idea is for you to kind of override the default cluster network. You want

10:35.440 --> 10:40.880
something that is not the default cluster network. And you want to do that per namespace; you do not need to

10:40.880 --> 10:47.840
do that for the entire cluster. Now, what does the configuration look like? Well, it's quite simple.

10:47.840 --> 10:53.840
You have to decide the topology. For virtualization, you need to specify layer 2;

10:53.840 --> 10:59.120
that's the only topology that will work for virtualization. There are more topologies, but for

10:59.120 --> 11:05.440
virt, you have to stick to layer 2 and then you configure your layer 2. So you have

11:05.440 --> 11:12.400
to specify that the role is primary. So this is a primary user-defined network. So it knows

11:12.400 --> 11:17.760
how to override the default cluster network. You specify the subnet. It can be

11:17.760 --> 11:24.480
like dual stack. So you can put here a single stack IPv4, single stack IPv6 or dual stack.

11:25.280 --> 11:31.040
You cannot do more than that. And you have to specify that the IPAM life cycle is persistent,

11:31.040 --> 11:39.440
So that the IP address allocation knows to stick to the virtual machine life cycle instead of

11:39.440 --> 11:47.280
the pod life cycle. For the other use case in which we want to interconnect two different namespaces.
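Before moving on, a sketch of how that single-namespace UserDefinedNetwork might look. The apiVersion, the exact field spellings, and all values here are assumptions based on the upstream OVN-Kubernetes UDN API and may differ between versions:

```yaml
apiVersion: k8s.ovn.org/v1
kind: UserDefinedNetwork
metadata:
  name: isolated-net
  namespace: green            # the namespace whose default network this overrides
spec:
  topology: Layer2            # the only topology that works for virtualization
  layer2:
    role: Primary             # a primary UDN overrides the default cluster network
    subnets:
      - 192.0.2.0/24          # single-stack IPv4 here; dual stack is also allowed
    ipam:
      lifecycle: Persistent   # IPs stick to the VM life cycle, not the pod's
```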

11:47.280 --> 11:54.160
So in this scenario, you have the red namespace and the blue namespace, and they are interconnected

11:54.160 --> 12:01.120
by the happy network. What we would do is to use this other CRD called cluster user-defined

12:01.120 --> 12:07.840
network. As Doug explained before, notice in the metadata that this is no longer namespaced.

12:07.840 --> 12:15.840
So it's a cluster-wide resource. And if you focus on the bottom, on the network stanza,

12:15.840 --> 12:20.880
you'll see that it's exactly what we've seen before. You specify the topology and then you give

12:20.880 --> 12:27.680
configuration of what the topology looks like. So role primary, IPAM life cycle persistent, so that the

12:27.680 --> 12:35.040
IP addresses will be persistent, and you define a subnet. The big difference comes from expressing

12:35.040 --> 12:40.000
which namespaces you want to interconnect. For that, you use the namespace selector.

12:40.000 --> 12:46.240
So pretty standard Kubernetes model. You define which namespaces you want.

12:46.240 --> 12:50.560
You just put here, for instance, the values red namespace and blue namespace, and you'd be

12:50.560 --> 12:55.920
interconnecting those two. And everything else would not be able to communicate with them.
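A sketch of that ClusterUserDefinedNetwork follows. Again, the apiVersion, field names, and values are assumptions from the upstream OVN-Kubernetes API; the selector values match the red/blue example:

```yaml
apiVersion: k8s.ovn.org/v1
kind: ClusterUserDefinedNetwork
metadata:
  name: happy-network          # cluster-scoped: no namespace in the metadata
spec:
  namespaceSelector:           # standard Kubernetes label selector
    matchExpressions:
      - key: kubernetes.io/metadata.name
        operator: In
        values: [red, blue]    # interconnect just these two namespaces
  network:                     # same shape as the namespaced example
    topology: Layer2
    layer2:
      role: Primary
      subnets:
        - 192.0.2.0/24
      ipam:
        lifecycle: Persistent
```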

12:58.960 --> 13:06.960
Okay, so we've seen what the API looks like and how you configure the feature and enable it on

13:07.040 --> 13:12.400
your namespaces, the ones you're interested in. Now, what does this look like from the

13:12.400 --> 13:18.320
OVN perspective and how does this actually work? So for every UDN that you create,

13:19.840 --> 13:25.840
OVN-Kubernetes will create an OVN layer 2 switch. And that's why these things are isolated.

13:25.840 --> 13:32.080
So your red namespace is using one layer 2 switch that is totally disconnected from the rest of

13:32.080 --> 13:38.400
the network. Your yellow namespace is using an entirely different layer 2 switch. And all these entities

13:38.400 --> 13:43.760
that you see here are replicated for each UDN. The workloads that are in the bottom,

13:43.760 --> 13:49.680
so you see pod 1, pod 2, which happens to be a VM, and pod 3, are just connected to this layer

13:49.680 --> 13:56.880
2 switch. And then we have a gateway router in each of the nodes for it to be able to

13:56.880 --> 14:04.240
egress to the internet and to implement more features than that. This gateway router is also connected

14:04.240 --> 14:11.360
to an external switch in each individual node. And this external switch has a particular

14:11.360 --> 14:16.640
special type of port that connects it to an OVS bridge. And then we have a physical interface on the

14:16.640 --> 14:27.040
nodes for the traffic to leave the cluster. Okay, so the question now becomes how do we

14:27.040 --> 14:33.360
manage IPAM for the VMs? Because we need to do that in a different way than we do for pods.

14:35.280 --> 14:41.520
So what OVN-Kubernetes does is, every time it sees a pod being created, it will

14:41.520 --> 14:47.520
introspect the pod and figure out, okay, this pod is a pod, I'll do nothing. If it figures out

14:47.520 --> 14:55.440
that, or if it understands that a pod is for a virtual machine, it will provision an object

14:55.440 --> 15:01.440
in OVN called DHCP_Options, and it will put in the following information: IP address, gateway,

15:01.440 --> 15:05.440
DNS, hostname, and MTU. This only happens for the VMs.
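A small sketch of that split in plain Python rather than the real OVN-Kubernetes code. The label used to spot a VM pod ("kubevirt.io/domain") and all of the addresses are made up for illustration:

```python
from typing import Optional

# Options shared by every workload on the user-defined network.
NETWORK_WIDE = {
    "router": "192.0.2.1",      # gateway
    "dns_server": "10.96.0.10",
    "mtu": "1400",
}

def dhcp_options_for(pod: dict) -> Optional[dict]:
    """Return the DHCP options to provision for a VM pod, or None for a
    plain pod ("this pod is a pod, I'll do nothing")."""
    labels = pod.get("labels", {})
    if "kubevirt.io/domain" not in labels:       # hypothetical VM-detection label
        return None
    return {
        **NETWORK_WIDE,                          # gateway, DNS, and MTU are common
        "ip": pod["ip"],                         # per VM
        "hostname": labels["kubevirt.io/domain"],  # per VM
    }
```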

15:05.520 --> 15:14.400
It's quite easy to understand that the gateway, DNS configuration, and MTU are common for

15:14.400 --> 15:22.560
all the objects in the network, but the IP address and hostname are per VM. Okay, we need to move

15:22.560 --> 15:28.960
along a little bit faster. So how do we enable this for cloud platforms? Well, the solution is

15:29.920 --> 15:34.960
well, obviously, you have to SNAT from the pod IP address to the node IP address. Otherwise,

15:34.960 --> 15:39.520
your cloud provider will just look at your traffic and drop it.

15:40.960 --> 15:46.320
And if you remember something Doug said before in the objectives, we want to have overlapping

15:46.320 --> 15:52.640
subnets. This is where the plot thickens a little bit, because in order to implement

15:52.640 --> 15:59.120
that scenario, you need to add an additional layer of NATting. So let's consider this scenario.

15:59.120 --> 16:05.200
You have two different UDNs, primary UDNs actually. So we have the blue on the left and the red

16:05.200 --> 16:13.600
on the right. For each of these primary UDNs, OVN-Kubernetes decides

16:13.600 --> 16:19.520
something which we call the masquerade IP. So if you look, the blue network has the

16:19.600 --> 16:26.960
.12 IP address in the last octet, and the red has .13. So there's a specific

16:26.960 --> 16:33.520
per-network masquerade IP. And on the gateway router, we do a SNAT from the pod IP to this

16:34.640 --> 16:40.320
per-network masquerade IP. And when the traffic hits the OVS bridge on the node,

16:40.320 --> 16:47.040
we do another layer of NATting, and we NAT from this per-network masquerade IP to the node IP, and

16:47.120 --> 16:52.320
the traffic leaves outside the cluster. Once it comes back, we do the whole

16:52.320 --> 16:59.040
thing the other way around. We un-SNAT back to the per-network masquerade IP.
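To tie the hops together, here is a toy model of the two NAT layers just described. All addresses are invented; the real masquerade IPs are chosen by OVN-Kubernetes:

```python
# Per-network masquerade IPs, one per primary UDN.
MASQUERADE = {"blue": "169.254.0.12", "red": "169.254.0.13"}
NODE_IP = "203.0.113.10"

def egress(network: str, pod_ip: str) -> list:
    """Source address at each hop as traffic leaves the cluster:
    pod IP -> per-network masquerade IP (gateway router SNAT)
           -> node IP (OVS bridge SNAT)."""
    return [pod_ip, MASQUERADE[network], NODE_IP]

def ingress(network: str, pod_ip: str) -> list:
    """Return traffic is un-SNATed twice, reversing the egress path."""
    return list(reversed(egress(network, pod_ip)))

# Two UDNs with the same (overlapping) pod subnet still get distinct
# masquerade IPs, so their return traffic can be told apart.
```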

16:59.040 --> 17:05.920
And on the gateway router, we un-SNAT it to the actual workload IP address. So yeah, it comes with

17:05.920 --> 17:14.960
a cost, but it's a very cool feature. So how do we actually do stable IPAM? It's actually

17:15.040 --> 17:20.320
quite simple. The idea is mainly like if your virtual machine migrates to a different node,

17:20.320 --> 17:25.840
we just want the IP address to stay the same. And there's another thing: we want the gateway

17:25.840 --> 17:31.440
configuration to be consistent. So IP address and gateway configuration have to be consistent.

17:31.440 --> 17:36.880
Because if they're not, your existing TCP connections on east-west traffic are broken.

17:36.880 --> 17:42.880
And we don't want that. And on the north-south kind of traffic, if your gateway configuration is

17:42.960 --> 17:48.880
not consistent, they also break, and we also don't want that. Your workloads probably will

17:48.880 --> 17:52.880
not tolerate it. So the trick is quite simple.

