WEBVTT

00:00.000 --> 00:10.260
Okay, good morning everyone. My name is Miguel Duarte, and I'm here with Federico

00:10.260 --> 00:15.440
Paolinelli, who is essentially the blood, the sweat and the tears behind the project that will

00:15.440 --> 00:21.760
enable the things we are about to show, and we're here to present a talk titled "Weaving

00:21.760 --> 00:28.640
the Fabric", which is about EVPN overlays for multi-cluster KubeVirt deployments.

00:28.640 --> 00:34.500
The agenda: we will begin with the motivation, then go through what EVPN is, which

00:34.500 --> 00:39.680
essentially is, let's say, the enabling technology for this. Federico will

00:39.680 --> 00:43.920
explain to us everything about the implementation of this.

00:43.920 --> 00:47.880
Then we'll show a few demos and finalize with the conclusions in which we'll pretty much

00:47.880 --> 00:51.920
be telling you a summary of everything we just told you.

00:51.920 --> 00:56.560
Now, first thing, why should you care about any of this?

00:56.560 --> 01:02.360
So, the use cases for multi-cluster could be kind of intuitive but let's go through

01:02.360 --> 01:03.360
them nevertheless.

01:03.360 --> 01:09.120
So, let's say that you have a single bare-metal cluster located in a building,

01:09.120 --> 01:13.360
and, for whatever reason, everything goes down: you lose power and, after a while, everything

01:13.360 --> 01:19.880
goes down; the blast radius is huge. Something happens, and your entire set of applications

01:19.880 --> 01:20.880
goes down.

01:20.880 --> 01:25.720
If you have multiple clusters and your application is spread across them, well, it will

01:25.720 --> 01:26.720
keep working.

01:26.720 --> 01:27.720
So, that's one thing.

01:27.720 --> 01:32.120
We want to have resiliency for your applications.

01:32.120 --> 01:40.120
Things like scaling, hybrid cloud, also require you to have your multiple clusters interconnected.

01:40.120 --> 01:44.640
And the third thing which is very useful is legacy applications.

01:44.640 --> 01:50.120
So, I'm guessing if you were running VMs, you have probably seen a virtual machine whose

01:50.120 --> 01:52.920
identity is pretty much an IP address.

01:52.920 --> 01:58.440
So, that means that throughout the life of the VM, probably 20 years, you cannot change

01:58.440 --> 02:04.080
its IP, it needs to connect to the same network, it needs to reach the same neighbor throughout

02:04.080 --> 02:05.080
history.

02:05.080 --> 02:10.720
So, you need it to be connected to the same subnet and you probably have changed the platform

02:10.720 --> 02:12.320
many times.

02:12.320 --> 02:18.880
So, these are the reasons why you would want to have multiple interconnected clusters,

02:18.880 --> 02:24.160
and we will be doing that using a stretched layer 2 network.

02:24.160 --> 02:27.000
So, as I said, the goal is twofold.

02:27.000 --> 02:33.360
First one, we want to have a stretched layer 2 network and this will be very useful to

02:33.360 --> 02:38.840
provide us these two features: cross-cluster live migration and resource pooling.

02:38.840 --> 02:45.800
So, in a way, you have multiple clusters but you'll be providing a unified front of resources

02:45.800 --> 02:47.040
to the users.

02:47.040 --> 02:51.800
The second use case is routing between these different networks using EVPN.

02:51.800 --> 02:56.520
Here you'll be able to provide traffic segregation and also direct

02:56.520 --> 03:00.240
routed ingress to the virtual machines.

03:00.240 --> 03:01.760
What does this mean?

03:01.760 --> 03:10.200
It means that (remember, this is KubeVirt, so your VMs actually run in pods) this means

03:10.200 --> 03:16.920
that you no longer need to expose services or use NAT.

03:16.920 --> 03:22.200
So, EVPN will be our enabling technology of sorts.

03:22.200 --> 03:25.880
It will provide us the ability of having a stretched layer 2 network.

03:25.880 --> 03:30.400
It will also provide us with the other objective we have which is to route between these

03:30.400 --> 03:36.760
different networks. And it has some characteristics, the first of which, and the most important,

03:36.760 --> 03:44.600
believe me, is that it has reduced broadcast, unknown-unicast and multicast (BUM) traffic.

03:44.600 --> 03:51.240
So let's say you have 100 clusters that are interconnected, each of them is running thousands

03:51.240 --> 03:54.360
of pods and hundreds of VMs.

03:54.360 --> 04:00.120
Imagine the amount of neighbor discovery messages that are happening throughout these

04:00.120 --> 04:04.440
scattered clusters, or ARPs; that is just too much.

04:04.440 --> 04:09.880
We don't want to have that; it needs to be more efficient, and EVPN provides all of that.

04:09.880 --> 04:13.600
And remember, we're here to see like seamless workload mobility.

04:13.600 --> 04:14.600
That's what we want to have.

04:14.600 --> 04:20.640
We want to see a VM migrate from one cluster to another without impacting the applications

04:20.640 --> 04:23.160
running in it.

04:23.160 --> 04:31.520
So, EVPN stands for Ethernet VPN, and we can see it as a twofold protocol.

04:31.520 --> 04:38.360
So in the control plane it runs multiprotocol BGP, and the data plane can have multiple

04:38.360 --> 04:43.360
implementations, like VXLAN, MPLS, SRv6.

04:43.360 --> 04:50.640
In this presentation, the components we're using pretty much have a single implementation,

04:50.640 --> 04:56.720
which is VXLAN, which is why it is the one involved. And let me begin with the control plane.

04:56.720 --> 05:04.840
So as I said before, this is using multi-protocol BGP and essentially it provides two

05:04.840 --> 05:11.160
things on top of BGP, which are a type 2 route (we'll see what this is in a minute) and a

05:11.160 --> 05:12.160
type 5 route.

05:12.160 --> 05:14.960
We'll see what this is in probably two minutes.

05:14.960 --> 05:22.920
So, a type 2 route: whenever a VM connects to the network, this will create

05:22.960 --> 05:28.120
a type 2 route which is essentially a means for you to register to the entire network,

05:28.120 --> 05:36.120
scattered throughout these clusters, that the way for you to reach this MAC and IP address

05:36.120 --> 05:38.200
is via this next hop.

05:38.200 --> 05:44.080
So you're essentially broadcasting to the network that in order to reach this VM, you need

05:44.080 --> 05:48.440
to use this particular next hop.

05:48.440 --> 05:53.000
A type 5 route, on the other hand, is the same concept, but not for a MAC and IP; you're

05:53.000 --> 05:54.920
just advertising an entire prefix.

05:54.920 --> 06:00.800
So this entire subnet can be reachable via this next hop, you're essentially advertising

06:00.800 --> 06:08.200
how to reach either a MAC and IP pair, a MAC address, or an entire prefix.
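For concreteness, this is roughly what turning on EVPN route exchange looks like in FRR, a routing suite that can act as the multiprotocol BGP speaker here (the ASN and neighbor address are made up for illustration):

```
router bgp 64512
 neighbor 192.0.2.1 remote-as 64512
 !
 address-family l2vpn evpn
  neighbor 192.0.2.1 activate
  ! originate type-2 (MAC/IP) and type-5 (prefix) routes for local VNIs
  advertise-all-vni
 exit-address-family
```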

06:08.200 --> 06:18.400
And now I'm going to hand this over to Federico, and he will explain this in more detail.

06:18.400 --> 06:23.400
Okay, so now let's have a look at how we tried to implement all this. To keep

06:23.400 --> 06:28.520
it sane, we tried to follow some design principles.

06:28.520 --> 06:34.720
One was to try to integrate existing solutions, not to reinvent the wheel, to have something

06:34.720 --> 06:41.960
that is composable and pluggable with existing building blocks, or to reuse as many building

06:41.960 --> 06:43.600
blocks as we could.

06:43.600 --> 06:49.440
Another thing is we wanted to have something that is primary-CNI agnostic and also

06:49.440 --> 06:59.280
can work with secondary network plugins, and this is what drove our whole architecture.

06:59.280 --> 07:05.000
So the building blocks are, obviously, KubeVirt; we need that to run our virtual machines

07:05.000 --> 07:08.840
on Kubernetes clusters.

07:08.840 --> 07:14.840
KubeVirt can work with any bridge-based CNI; the two most famous are OVS CNI and bridge

07:14.840 --> 07:20.960
CNI. They are used to connect your secondary networks either to an OVS bridge or

07:20.960 --> 07:26.360
to a Linux bridge. And there is this other thing in the middle, which is OpenPERouter. It

07:26.360 --> 07:32.080
is a new project that we've been working on more or less for the past year, and it's about

07:32.080 --> 07:38.040
running a router on your Kubernetes cluster.

07:38.040 --> 07:42.960
So the idea of the project (and if you're interested in this, we have a full-fledged

07:42.960 --> 07:50.120
talk in the networking dev room later today), the idea is to have a router as a Kubernetes

07:50.120 --> 07:57.240
component that you can deploy on your cluster with Helm charts or manifests, and basically

07:57.240 --> 08:03.600
it will handle all the complexity of being a router, exposing its interfaces to the host,

08:03.600 --> 08:07.440
as a real router would, as veth pairs.

08:07.440 --> 08:12.760
It allows you to integrate with BGP-speaking components such as Calico and MetalLB,

08:12.760 --> 08:18.920
but also, and this is the focus of this talk, you have this kind of dangling wire on the host

08:18.920 --> 08:26.720
that allows you to enter a broad layer 2 overlay spread across multiple nodes.

08:26.760 --> 08:30.240
This is more or less how the architecture is laid out.

08:30.240 --> 08:36.440
So OpenPERouter allows you to expose this leg on the host.

08:36.440 --> 08:43.040
You can attach it to a Linux bridge on the host or an OVS bridge, and you also have facilities

08:43.040 --> 08:50.720
to have this bridge auto-created for you, and all the rest is handled by Multus.

08:50.720 --> 08:57.360
So you just use a secondary network with the bridge CNI, you connect your VM's secondary network

08:57.360 --> 09:01.080
to the bridge, and the magic happens, hopefully.

09:01.080 --> 09:08.840
This supports a distributed anycast gateway, so your local gateway is going to be the same wherever

09:08.840 --> 09:16.920
the virtual machine runs, and this enables some interesting features, as we are going to demo later

09:16.920 --> 09:17.920
on.

09:18.320 --> 09:25.760
In words it is simple; the reality is that in order to enable EVPN on a Linux host, you have to

09:25.760 --> 09:27.760
do a lot of things.

09:27.760 --> 09:33.600
You need to configure Linux bridges, a Linux VRF, a VXLAN interface,

09:33.600 --> 09:39.800
its veth pairs, and doing it manually or with some automation is a lot.

09:39.800 --> 09:44.840
That's why we tried to put together something that handles this complexity for us behind

09:44.840 --> 09:46.720
a cloud native approach.

09:46.720 --> 09:51.320
So you can scale, you have a thousand nodes, you've crafted your nice or not so nice

09:51.320 --> 09:58.160
YAML, you throw it at the cluster and boom, you have EVPN.

09:58.160 --> 10:04.760
So this is what OpenPERouter is: it's just a Kubernetes controller. We don't focus

10:04.760 --> 10:11.960
on a particular CNI plugin, so it can work with your primary CNI of choice, and it allows

10:11.960 --> 10:18.360
us to scale our deployment in an efficient manner.

10:18.360 --> 10:23.280
So what do we need in order to have EVPN configured on the cluster?

10:23.280 --> 10:28.840
You need virtual tunnel endpoints (VTEPs), which are the ends (or the starts) of the tunnel,

10:28.840 --> 10:30.160
and you need the tunnel.

10:30.160 --> 10:36.120
That's the basic thing that you have to do as a starter.

10:36.120 --> 10:42.920
In order to do that we have this CRD; it's called Underlay. Basically, it holds the details

10:42.920 --> 10:47.720
of the BGP session between your node and the top-of-rack, which now becomes a neighbor, because all the

10:47.720 --> 10:54.360
EVPN logic is done inside the node. And then you need some extra things, like the interface

10:54.360 --> 11:02.560
that you want to use to connect to the top-of-rack, and a CIDR, because, as we want

11:02.560 --> 11:07.720
to show on the next slide, we want the VTEP on each node to be different, and so

11:07.720 --> 11:16.560
we leverage the Kubernetes way to do things, we synchronize, and each node gets

11:16.560 --> 11:22.320
a different IP. And of course the data plane is VXLAN, so what really happens is that

11:22.320 --> 11:28.040
the traffic coming from the host gets encapsulated in a UDP message with the local

11:28.040 --> 11:33.800
VTEP as the source IP and the destination VTEP as the destination IP.
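A rough sketch of what such an Underlay resource could look like; the API group, field names and addresses below are illustrative guesses, so check the OpenPERouter documentation for the real schema:

```yaml
apiVersion: openpe.openperouter.github.io/v1alpha1   # illustrative
kind: Underlay
metadata:
  name: underlay
  namespace: openperouter-system
spec:
  asn: 64514                  # local ASN of the per-node router
  nics:
    - eth1                    # interface facing the top-of-rack
  vtepcidr: 100.65.0.0/24     # each node carves its own VTEP IP out of this
  neighbors:
    - asn: 64512
      address: 192.168.11.2   # the ToR, now just a plain BGP neighbor
```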

11:33.800 --> 11:38.160
How do I know where I need to send the packet?

11:38.160 --> 11:43.560
This is the type 2 route that Miguel was telling us about.

11:43.560 --> 11:52.880
So the BGP fabric knows how to announce the locality of each MAC and each IP around

11:52.880 --> 12:06.040
the fabric.
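To make the data-plane encapsulation concrete, here is a minimal Python sketch of the 8-byte VXLAN header (RFC 7348) that gets prepended to the inner Ethernet frame before it is carried as UDP payload between VTEPs; the VNI value is arbitrary:

```python
import struct

def vxlan_encap(vni: int, inner_frame: bytes) -> bytes:
    """Prepend the 8-byte VXLAN header (RFC 7348) to an inner Ethernet frame.

    Layout: 1 flags byte (the I bit, 0x08, marks a valid VNI), 3 reserved
    bytes, then the 24-bit VNI and 1 more reserved byte. The result is what
    travels as the UDP payload (destination port 4789) between two VTEPs.
    """
    if not 0 <= vni < 2 ** 24:
        raise ValueError("the VNI is a 24-bit value")
    header = struct.pack("!II", 0x08000000, vni << 8)
    return header + inner_frame

# A dummy 14-byte Ethernet header standing in for the VM's frame.
packet = vxlan_encap(110, b"\x00" * 14)
assert len(packet) == 8 + 14
assert int.from_bytes(packet[4:8], "big") >> 8 == 110
```

The outer IP/UDP headers (local VTEP as source, remote VTEP as destination, as described above) are then added by the kernel's VXLAN device.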

12:06.040 --> 12:14.120
Okay, so now, how do you create the control plane for this? Again:

12:14.120 --> 12:19.120
You throw a YAML at the cluster, and OpenPERouter will just

12:19.120 --> 12:24.880
bend over backwards: create the VRFs, the bridges, put everything together for it to just

12:24.880 --> 12:30.040
work, which essentially will get you a stretched layer 2 network, an overlay across the

12:30.040 --> 12:35.440
entire fabric. By fabric I actually mean a bunch of connected routers, which could be like

12:35.440 --> 12:41.920
the internet itself. And this is what an L2VNI looks like. There are three very important

12:41.920 --> 12:49.000
things here. First, the VNI: this is a VXLAN concept, akin to a VLAN ID.

12:49.000 --> 12:56.520
Then the VRF, which is pretty much the identity of this, and we'll see later on how to integrate

12:56.520 --> 13:04.040
this, or expose this network, to the outside world. And finally the hostmaster, which is

13:04.040 --> 13:12.440
how you provide connectivity from the router to the host itself, or to the Kubernetes

13:12.440 --> 13:14.560
node it runs on.
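So an L2VNI resource might look roughly like this; as with the underlay, the API group and field names are illustrative sketches, not the project's authoritative schema:

```yaml
apiVersion: openpe.openperouter.github.io/v1alpha1   # illustrative
kind: L2VNI
metadata:
  name: red
spec:
  vni: 110            # VXLAN network identifier, akin to a VLAN ID
  vrf: red            # the identity; reused later to stitch in an L3VNI
  hostmaster:
    type: bridge
    autocreate: true  # let OpenPERouter create the host bridge for you
```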

13:15.440 --> 13:21.200
Essentially this is what it looks like: you have your Kubernetes node, it has the router running on it,

13:21.200 --> 13:29.520
the blue thing is the network namespace of the router, and for every VNI that you have created

13:29.520 --> 13:36.160
you'll have a bridge like that one, and then you'll have a dangling veth on the host.

13:36.160 --> 13:40.800
And depending on the configuration (if you remember, you've seen here that we are using a

13:40.800 --> 13:47.040
hostmaster of type Linux bridge with auto-create) it will create a bridge on the host for you. What

13:47.040 --> 13:53.520
do you do with this bridge? Now, this bridge is essentially what your workloads will be using to

13:53.520 --> 13:59.200
attach to the network. So, if you're used to Multus, you will just provision a network

13:59.200 --> 14:06.000
attachment definition with the configuration pointing to the bridge with the proper name, and you'll

14:06.000 --> 14:12.880
be attaching to that bridge. Which essentially is this: we have our KubeVirt VM running there, attached

14:12.880 --> 14:20.880
to that bridge, which is using that veth there, attaching to OpenPERouter, and then it will

14:20.880 --> 14:25.600
use the VXLAN tunnel Federico told us everything about a little bit ago.
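For reference, attaching a VM's secondary network to such a host bridge with Multus uses a standard NetworkAttachmentDefinition with the bridge CNI; the bridge name below is a hypothetical example that would have to match the auto-created bridge on the node:

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: red-network
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "red-network",
      "type": "bridge",
      "bridge": "br-red"
    }
```

The VM then references `red-network` as a Multus network on its secondary interface, which lands it on the stretched layer 2 segment.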

14:26.800 --> 14:34.800
Now I have this part here to explain how the type 2 routes work. So whenever

14:34.800 --> 14:44.400
a workload attaches to this bridge, it will add an entry saying this MAC address is available on

14:44.400 --> 14:55.200
this Linux bridge port, right? This will be reflected in the FDB table, and FRR's zebra will

14:55.200 --> 15:02.160
provision a type 2 route. And this is it: whenever the other nodes, which could be in different

15:03.120 --> 15:09.600
clusters, see this type 2 route, what will happen is they will provision an entry in their FDB table,

15:09.600 --> 15:15.920
so they will know this MAC address is available on this VTEP, which is somewhere else in the world.

15:17.040 --> 15:25.680
This is for a type 2 route that has just a MAC address, as you see in the bottom; sorry, right here.

15:26.640 --> 15:33.360
Now, in order for you to jump into another network, you will also need to have

15:33.360 --> 15:39.440
an IP address, right? So there are type 2 routes that have MAC and IP addresses. Pretty much the

15:39.440 --> 15:45.040
same concept: whenever the workload attaches to the bridge, it will send either a neighbor discovery

15:45.040 --> 15:52.000
message or an ARP request or reply. This will appear in the neighbor table, FRR's zebra will see

15:52.000 --> 15:57.840
this thing and will provision a type 2 route with the MAC and IP address that you see there in the bottom,

15:58.640 --> 16:04.400
and this will be advertised in the fabric. Whenever a different node sees a type 2 route with

16:04.400 --> 16:10.400
a MAC and IP address, it will pretty much just provision a static entry in its neighbor table,

16:10.400 --> 16:16.320
and everybody will know this VM is available via this VTEP, which could be somewhere else in the world.
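The whole FDB/neighbor-table dance above can be summarized with a toy model. This is an illustration of the bookkeeping only, not the real zebra/BGP code; the class and field names are invented for clarity:

```python
# Toy bookkeeping model of EVPN type-2 (MAC/IP) route propagation.
class Node:
    def __init__(self, name: str, vtep_ip: str, fabric: list):
        self.name = name
        self.vtep_ip = vtep_ip
        self.fdb = {}     # MAC -> ("local", port) or ("vtep", remote VTEP IP)
        self.neigh = {}   # IP -> MAC (the neighbor/ARP table)
        self.fabric = fabric
        fabric.append(self)

    def local_attach(self, mac: str, ip: str, port: str) -> None:
        # A workload attaches to the local bridge: FDB and neighbor entries
        # appear, zebra notices, and BGP advertises a type-2 route.
        self.fdb[mac] = ("local", port)
        self.neigh[ip] = mac
        for peer in self.fabric:
            if peer is not self:
                peer.on_type2(mac, ip, self.vtep_ip)

    def on_type2(self, mac: str, ip: str, remote_vtep: str) -> None:
        # Remote nodes install static FDB/neighbor entries, so traffic is
        # tunneled straight to the right VTEP with no flooding.
        self.fdb[mac] = ("vtep", remote_vtep)
        self.neigh[ip] = mac

fabric = []
a = Node("cluster-a", "100.65.0.1", fabric)
b = Node("cluster-b", "100.65.0.2", fabric)
a.local_attach("52:54:00:aa:bb:cc", "10.0.0.10", "veth-vm1")
assert b.fdb["52:54:00:aa:bb:cc"] == ("vtep", "100.65.0.1")
assert b.neigh["10.0.0.10"] == "52:54:00:aa:bb:cc"
```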

16:17.280 --> 16:25.520
Beyond layer 2: well, we have the ability of connecting different routes using these

16:25.520 --> 16:30.160
type 5 routes. I do not have a sequence diagram for that, but it's the same concept; instead

16:30.160 --> 16:37.520
of the FDB tables or ARP/neighbor tables, it will be the route table, and this will allow you

16:37.520 --> 16:43.840
to access services in these other networks, and to provide direct routed ingress into the VMs.

16:43.920 --> 16:49.920
This means that you will be able to access a VM using its IP address directly, so

16:49.920 --> 16:55.440
no need for any of that: no services, nothing like that. By services I mean no Kubernetes services, again.

16:56.640 --> 17:02.160
And this is how it looks. The key thing here is the VNI must be different from the one you

17:02.160 --> 17:10.400
specified for your L2VNI, but the VRF must be the same. That is how you connect a layer 3 VNI to

17:10.640 --> 17:16.960
a layer 2 VNI, so essentially how you expose your stretched layer 2 network to the outside world.

17:18.640 --> 17:24.800
Essentially this just turns out to be having an IP address on your gateway, on the bridge's

17:24.800 --> 17:33.680
management interface and a VRF encapsulating all those routes oh boy could you get me like the

17:34.640 --> 17:41.920
thing in there? There's, uh, there's a thing. So, okay, let's skip this; let's go to the demo right now.

17:41.920 --> 17:45.920
So what we will see right now: we will have two different kind clusters.

17:47.200 --> 17:51.120
These kind clusters will have, uh, workloads running in them.

17:54.000 --> 17:54.960
we're going to check it

18:04.480 --> 18:11.840
Ah, it's done. Thank you, huge thanks, you saved us, I owe you a beer. So again: VMs running in different

18:11.840 --> 18:16.240
kind clusters. Everything is in this laptop, but this is just to showcase the concept; this could have

18:16.240 --> 18:22.400
been, like, throughout the internet. And then we will have a host on a separate network that will

18:22.400 --> 18:27.600
try to access them. We will see two things. First, we will see a stretched layer 2 network, so these two

18:27.680 --> 18:34.720
guys will be in the same layer 2 overlay. And we will see another interesting thing: we will see

18:34.720 --> 18:40.880
live migration. We will have this host pinging this VM right here, and then we will trigger a migration

18:40.880 --> 18:47.600
to this other cluster while this ping is working, and we will see how the traffic behaves; hopefully

18:47.600 --> 18:53.520
not many packets will be dropped. You can scan this for a link to the asciinema demo; there's

18:53.520 --> 19:01.920
a link to the script that runs this demo for you, so you can do this at your home. Okay, now, this

19:01.920 --> 19:07.360
is the recording; unfortunately this won't be a live demo. I hope the font is big enough.

19:09.280 --> 19:21.520
But let's see. So, okay, oops. So in the upper left I have the VM running in cluster A;

19:22.400 --> 19:29.280
no, in the upper right I have cluster B, so it has a different VM running in it,

19:29.280 --> 19:35.360
and this is where VM 1 is going to be migrated to. That is why, if you look there, there's also a

19:35.360 --> 19:41.360
VM 1, but it's waiting for sync right there, and then we'll in a couple of minutes explain to

19:41.360 --> 19:46.240
us everything about cross-cluster live migration. But for now keep this in mind: VM 1 running in

19:46.240 --> 19:51.920
cluster A, VM 2 running in cluster B; they will communicate with one another, and at the bottom we have the script.

19:53.600 --> 20:06.080
So, okay, details: the IP address and MAC address of our VM 1, the IP address of our VM 2, which is running

20:06.080 --> 20:12.480
on a different cluster, and the receiver thing, which is the copy of the first VM.

20:13.440 --> 20:21.920
Okay, this is running. Now we're going to ping from the client, so the thing that is outside the clusters.

20:21.920 --> 20:27.600
It reached VM 1, it reached VM 2, so we have connectivity from a different network to these

20:27.600 --> 20:37.600
two VMs running in different clusters. And I'm just going to show that this client essentially is in a

20:37.600 --> 20:46.160
different network: if you see, the VMs are in the 170 IP range and this one is in the 192.168 range.

20:48.000 --> 20:55.680
So I'm going to console into the VM and I'm going to ping; this was too fast. So from VM 1 I'm going to

20:55.680 --> 21:01.840
ping the other VM, which is the .31. So you have a stretched layer 2 between two different

21:01.840 --> 21:10.160
Kubernetes clusters. That was the first demo; the second demo we will see now. So here I'm going to

21:10.160 --> 21:16.240
show you the type 2 routes. So we see here a type 2 route for this MAC address, and it's telling

21:16.240 --> 21:21.760
you that to reach this MAC, this is the next hop you're going to be using. Also interesting here:

21:21.760 --> 21:27.680
MM:2, this is MAC mobility; we'll be seeing this in a while. And then the same information for the

21:27.680 --> 21:34.320
other virtual machine. "Managed to ping" pretty much means that we have

21:34.320 --> 21:41.680
started to ping from our client into our VM 1. I'm just putting here another ping so you see

21:41.680 --> 21:48.080
that I'm not lying. I've provisioned the migration in the destination cluster; that is why

21:48.080 --> 21:54.640
VM 1 did something weird there. And I will now trigger the migration in the source; that is why

21:54.640 --> 22:00.960
the virtual machine went to scheduling on the destination cluster. So I want you to look at this: the

22:00.960 --> 22:08.160
ping is working. And I want you to look at this here, saying virtual machine 1 is available via

22:08.160 --> 22:14.560
this next hop, meaning cluster A. And traffic is working, and during the migration the VM is pretty much being

22:14.560 --> 22:23.440
copied; it'll take a little bit of time. Still cluster A... it gets scheduled, it knows to which node

22:23.440 --> 22:30.880
it'll go, and it moved to cluster B right here. So this virtual machine, meaning this type 2 route,

22:30.880 --> 22:37.360
this MAC and IP address, is now available in cluster B, which has a different next-hop IP

22:37.520 --> 22:46.960
address. And as you see, the ping is going at it like nothing happened. Well, it's not exactly like nothing

22:46.960 --> 22:52.720
happened; we'll see here that we've lost four packets during the migration. We are using 100

22:52.720 --> 22:59.200
millisecond intervals, so this totals to probably 400 milliseconds of dropped traffic.

22:59.920 --> 23:07.200
It's not insanely good, but it is quite good. And I think this concludes the demo, and I

23:07.200 --> 23:19.520
can leave this now to Federico to walk us through the conclusions. Okay, wrapping up: we saw how we can

23:20.320 --> 23:28.880
leverage existing building blocks to implement a broad layer 2 overlay across different clusters.

23:28.880 --> 23:34.400
We showed how you can wrap that layer 2 domain into a layer 3 domain, having

23:34.400 --> 23:42.320
routable IPs from outside. And I want to stress this: it is the very same way you would interact

23:42.320 --> 23:49.040
with a real router, with VLANs. So you have your VMs today with the macvlan

23:49.040 --> 23:55.440
CNI plugin as a secondary interface, you have your VLAN coming from the router, you

23:55.440 --> 24:00.880
have your tunneling mechanism implemented in the router; nothing changes from the VM's point of view, it's the

24:00.880 --> 24:06.320
same, but with a component that is running on the Kubernetes cluster, which is, I think, the

24:06.320 --> 24:11.280
nicest bit of this implementation: the fact that we are not reinventing the wheel but

24:11.280 --> 24:17.600
reusing existing mechanisms to have a Kubernetes-native way to do this. And of course

24:17.600 --> 24:26.240
live migration is the cherry on the cake. Yeah, these are more or less

24:26.240 --> 24:35.360
some resources. We try to keep the documentation aligned and nice to use. There are a couple

24:35.360 --> 24:44.240
of blog posts on the KubeVirt official blog. In the OpenPERouter documentation and repository

24:44.240 --> 24:49.760
we are trying to keep a lot of examples of how you would integrate the project with existing

24:49.920 --> 24:58.160
components, with scripts and automation that you can run on your laptop. And time

24:58.160 --> 25:04.960
is up (FRR is the engine for the routes, by the way), and that's it. If you have any questions we are going to

25:04.960 --> 25:06.960
be around

