WEBVTT

00:00.000 --> 00:20.160
Thank you all for coming. So, my name is Mike Bullard, and today I'm going to be giving

00:20.160 --> 00:26.880
a talk about the netkit device. Okay, the netkit device is a new networking device

00:26.960 --> 00:34.160
that was introduced in the Linux kernel that allows isolated networks in containers without any

00:34.160 --> 00:40.160
performance penalties, so I'll get into what that means in just a minute. I'm going to try to

00:40.160 --> 00:44.800
go through this pretty quick, I wanted to say thanks everybody for coming out today and especially

00:44.800 --> 00:49.840
thanks to the volunteers for helping to put this together and for staying all day in the room today.

00:50.800 --> 00:58.000
So some background, kind of why do we want this netkit device? So I'm a software engineer at

00:58.000 --> 01:04.800
Meta and at Meta we tend to run all of our services in containers on a host and so in this

01:04.800 --> 01:11.280
diagram we have a pretty typical setup, the host is connected to a physical ethernet card which

01:11.280 --> 01:16.560
is connected to a network switch and then we could have containers running, dividing up the

01:16.560 --> 01:22.160
resources on that host, more and more we want to run multiple services on the same host.

01:22.160 --> 01:28.000
This is more efficient: if we have, like, CPU that container A is not using,

01:28.000 --> 01:32.240
we can load up container B which might have like a batch processing job or something to use

01:32.240 --> 01:38.560
that resource. The issue that we can run into here is: what if container A tries to bind to

01:38.560 --> 01:44.640
this IP and port, and container B wants to bind to the same IP and port? That's going to cause

01:44.640 --> 01:51.520
a conflict. So ideally we wouldn't have to deal with this at all and this is where network

01:51.520 --> 01:59.200
namespaces really help us out. So a network namespace basically gives a container its own view

02:00.480 --> 02:07.920
of the network: basically its own network interfaces, its own set of ports, and its own iptables, so it kind

02:07.920 --> 02:13.920
of has its own network connection completely separate from the other containers and the other

02:13.920 --> 02:19.840
network namespaces on that host. And so here's what that diagram looks like now with network

02:19.840 --> 02:27.040
namespaces implemented. So container A is running inside network namespace A, container B is running

02:27.040 --> 02:32.800
inside network namespace B, and the way that we kind of hook these up so that they can see the

02:32.800 --> 02:40.640
world is: we create a virtual ethernet device in the network namespace, and we create a matching one

02:40.640 --> 02:45.760
on the host and then we basically run like a virtual ethernet cable between those so we wire those

02:45.760 --> 02:55.200
together, and that gives us network connectivity to the container. And now this works: we can bind

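The veth wiring described above can be sketched with iproute2 commands along these lines (a minimal sketch, not the exact setup used at Meta; the names ns-a, veth-a, and veth-a-host are made up for illustration, and the commands need root):

```shell
# Create a network namespace for container A (all names here are illustrative)
ip netns add ns-a

# Create a veth pair; the two ends are joined by the "virtual ethernet cable"
ip link add veth-a-host type veth peer name veth-a

# Move one end into the namespace; the other stays on the host
ip link set veth-a netns ns-a

# Bring both ends up; the address is a placeholder
ip link set veth-a-host up
ip netns exec ns-a ip link set veth-a up
ip netns exec ns-a ip addr add 192.0.2.10/24 dev veth-a
```
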
02:55.200 --> 02:59.840
to any port here, any port over there; they don't need to know what the other container is

02:59.840 --> 03:07.280
doing, which is great. But like many things in engineering, we've solved one problem but we've

03:07.360 --> 03:14.480
created a new problem, which is: there's a performance implication here. So if you kind of

03:14.480 --> 03:22.560
walk through the path of a network packet coming out of this container: first it's sent, you know,

03:22.560 --> 03:29.920
the job sends some network packet out; it's got to transmit from that virtual ethernet adapter,

03:29.920 --> 03:36.480
it's going to go through that virtual ethernet cable, it's got to land on the host-side virtual ethernet,

03:36.480 --> 03:43.440
gets received in that veth, then that gets handled by a softirq, and then finally it gets

03:43.440 --> 03:48.640
transmitted out of the host so there's many extra hops here compared to when we didn't have

03:48.640 --> 03:56.160
network namespaces and we're just using the ethernet adapter directly. And so what we see in practice

03:56.160 --> 04:03.040
is, when we enable network namespaces, we're not able to utilize the full network performance of the

04:03.120 --> 04:09.280
host. And I'll get into the specifics a little more in a bit, but this has caused, you know, issues

04:09.280 --> 04:15.360
with jobs that are very network intensive; of course, they need to use the full, you know,

04:15.360 --> 04:21.760
the full available network performance. So, enter the netkit device: this

04:21.760 --> 04:28.720
was introduced in, I think, November of 2023, and it was really designed around solving this

04:28.720 --> 04:35.680
exact problem. So this is a new type of virtual network device that is built with

04:28.720 --> 04:35.680
containerization in mind, and what you see here is: when you create a netkit device in this host network

04:43.280 --> 04:50.800
namespace, it gets a kind of built-in peer netkit device; you can almost think of it like

04:50.800 --> 04:56.880
a backpack attached to that primary device. Those are linked together, you know, behind the scenes,

04:57.840 --> 05:05.360
and it kind of eliminates that extra virtual ethernet cable. And something that's kind of

05:05.360 --> 05:10.960
important to note here is: whenever we manage this netkit device, it's done from the host side.

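As a rough sketch, creating a netkit pair with a recent iproute2 looks something like this (hedged: exact option spellings vary across iproute2 versions, and nk0, nk1, and ns-a are made-up names):

```shell
# Create a netkit device on the host; its built-in peer is created at the
# same time and can be placed straight into the container's namespace
ip link add nk0 type netkit peer name nk1 netns ns-a

# Both devices are managed from the host side
ip link set nk0 up
ip netns exec ns-a ip link set nk1 up
```
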
05:11.600 --> 05:17.040
That's a big part of the reason why it works; you know, it's all being managed as a device

05:17.040 --> 05:21.840
on the host, but it's able to kind of peer into this network namespace and interact with the

05:21.840 --> 05:27.760
traffic over there. And of course, the other thing that's important and relevant to this talk is:

05:28.560 --> 05:35.760
this is all done using BPF programs, which we attach to this netkit device. So there is a

05:35.760 --> 05:41.920
BPF attach point at the peer side and a BPF attach point at the primary

05:41.920 --> 05:48.880
side, and by attaching BPF programs we can direct that network traffic and tell it where to go.

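To give a flavor of what such a program looks like, here is a minimal sketch of a netkit egress program in C. This is illustrative rather than Meta's production code: the SEC("netkit/peer") section name comes from libbpf, bpf_redirect is a standard BPF helper, and the ifindex value is a made-up placeholder.

```c
// SPDX-License-Identifier: GPL-2.0
/* Sketch of a netkit program: redirect every packet leaving the
 * container straight to the host's physical ethernet device. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define PHYS_IFINDEX 4 /* hypothetical ifindex of the physical NIC */

SEC("netkit/peer")
int nk_egress(struct __sk_buff *skb)
{
	/* Queue the packet directly on the physical device, skipping the
	 * extra hops through the host-side virtual ethernet path. */
	return bpf_redirect(PHYS_IFINDEX, 0);
}

char LICENSE[] SEC("license") = "GPL";
```

A real program would inspect the packet first and only redirect traffic that is actually leaving the host.
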
05:49.840 --> 05:56.400
And this is where it's going to save us some performance. So here's kind of what

05:56.400 --> 06:01.440
that looks like now: the packet gets sent from the container; again, the job just transmits the

06:01.440 --> 06:09.040
network packet. It lands on the netkit TX side; we have a BPF program running there which can inspect

06:09.040 --> 06:15.040
the packet, and if it determines that the packet is going to be routed somewhere outside of the host, it

06:15.040 --> 06:22.960
can directly just place it into the transmit queue for the physical Ethernet device so it just removes

06:22.960 --> 06:29.440
a lot of extra hops in there. And the way that we see it show up is in the softirq usage:

06:30.320 --> 06:36.160
with veth, what we see is the softirq usage kind of gets overloaded for jobs that use a lot of

06:36.160 --> 06:42.480
network traffic, but with netkit that's greatly reduced and much more network traffic can get through.

06:44.160 --> 06:50.400
In addition to the performance benefits, we also see a lot of benefits from using BPF. A big one is

06:50.400 --> 06:58.000
just that it's much faster to develop and debug these BPF programs. It really gives us a low-level

06:58.000 --> 07:03.200
view of what's happening to that network traffic we can set up you know detailed logs and monitoring

07:03.200 --> 07:08.240
in that BPF program and kind of have, I think, an x-ray view of what's going on

07:08.240 --> 07:13.280
in the driver and then we can identify if there's any issues or if there's any performance improvements

07:13.280 --> 07:19.680
you want to make; it's very quick to write a new BPF program, load it, test it out, and then

07:19.680 --> 07:26.480
deploy it. So this is a huge benefit over, you know, the past, where we had to write a kernel driver

07:26.480 --> 07:32.400
and then recompile the kernel and reboot into it; this is much faster and allows us to implement

07:32.560 --> 07:40.160
new features and fix any issues much faster. So I mentioned before the performance implications

07:40.160 --> 07:45.840
so this slide here is showing some, you know, experimental benchmarking that was done.

07:46.880 --> 07:53.600
So, credit to Isovalent: in this blog post they used the iperf tool to just send as much

07:53.600 --> 07:59.840
network traffic as they could over these different types of network adapters. Here the first

08:00.000 --> 08:08.160
chart here is the veth, which was only able to get about 60 gigabits per second compared to the

08:08.160 --> 08:15.120
baseline, which is, you know, around 100 gigabits per second; so really we can only get about 60% of the

08:15.120 --> 08:23.440
host performance. We can also use BPF with veth and apply some of the optimizations there;

08:24.240 --> 08:30.720
even with that, we can at best only get up to around 90% of the performance of the host.

08:31.760 --> 08:38.320
But with netkit and BPF we actually can match the host performance. So this was proved out in this

08:38.320 --> 08:46.720
experimental testing, and we've also seen this happen now in real-world jobs. So this was one service

08:46.720 --> 08:51.520
at Meta that was experimenting with this, you know, stacking multiple tasks on a single host.

08:52.480 --> 08:57.680
and so this graph represents the you know latency of the requests coming out of that host

08:58.960 --> 09:04.880
With veth, the white line here, they were seeing around 12 seconds of latency,

09:05.680 --> 09:12.000
compared to the baseline; what we would expect is around 100 milliseconds. And when we looked into it,

09:12.000 --> 09:20.240
we found this was caused by that high softirq usage. So essentially the softirq queue was

09:20.240 --> 09:25.120
filled up, and it just wasn't able to get any more network traffic through. When we switched it

09:25.120 --> 09:31.760
over to netkit, we saw a return to the baseline that we expect, 100 milliseconds. So this is kind

09:31.760 --> 09:41.040
of a real-world demonstration of the benefits of netkit. So, you know, in conclusion, we're in the

09:41.040 --> 09:48.880
process of rolling this out. I wanted to give some shout-outs: of course, to Isovalent, who were very

09:48.960 --> 09:54.960
key in getting this authored and merged into the kernel and also sharing all of that knowledge

09:54.960 --> 09:58.880
so there's no way we could have put this together without the help of this open source

09:58.880 --> 10:04.480
community and having all of the code available to read through; and there are so many, you know, blog posts

10:04.480 --> 10:10.880
and information out there that we can refer to. And within Meta, Martin and other colleagues are, you know,

10:10.880 --> 10:14.560
key members of this networking team that really helped me work through this.

10:15.040 --> 10:25.600
And so, yeah, in conclusion: network namespaces are really a key feature that we want to enable

10:25.600 --> 10:30.160
in order to be able to isolate tasks so these tasks don't need to know anything about the

10:30.160 --> 10:35.520
environment they're running on; they can use the network as they want to. And with netkit we can

10:35.520 --> 10:42.640
turn on that network namespaces feature without incurring any performance penalty. And finally,

10:42.720 --> 10:48.240
BPF has really allowed us to move much faster and implement new features

10:49.280 --> 10:56.640
as we, you know, work through implementing them. So this is the conclusion of my talk here,

10:57.520 --> 11:02.800
and I just wanted to open up for any questions. I put up this quote, which supposedly was said by

11:02.800 --> 11:08.880
Beethoven; I couldn't confirm it, but I thought it was pretty good. So with that, that's all I had.

11:09.520 --> 11:13.520
I don't know if you'd like to ask any questions.

11:29.200 --> 11:34.160
So, for the receive path: we didn't have the performance issues, you know, mainly because we can

11:34.160 --> 11:39.200
we can pick up that packet right away as soon as it comes into the host physical adapter

11:39.200 --> 11:44.080
and we already had BPF programs running that would redirect it to the correct container;

11:44.800 --> 11:49.440
you know so in that case like the host is aware of all of the containers and can redirect it properly

11:49.440 --> 11:55.600
But for the traffic coming out, it was that veth in the container that didn't really have, you know,

11:55.600 --> 12:01.520
knowledge of where that packet was going, so it had to go through that softirq path.

12:02.320 --> 12:08.000
And yeah, just to repeat: the question was, do we have similar performance impacts on the

12:08.000 --> 12:10.160
ingress side for traffic coming into those containers.

12:11.440 --> 12:12.160
other person

12:12.160 --> 12:18.400
Can the BPF code that you have for netkit be extended to inspect packets for a purpose?

12:19.680 --> 12:24.320
Yes, yes. I mean, this has been a big benefit for us: when teams come to us

12:25.040 --> 12:29.920
with you know a specific use case like they have this specific type of network traffic

12:29.920 --> 12:34.480
that they want routed in a particular way we can work with them to you know more quickly

12:34.480 --> 12:40.080
get that implemented, you know, through these BPF programs. And one thing that we do is,

12:40.080 --> 12:45.600
you know, for specific packets we can redirect them directly to that

12:45.600 --> 12:52.400
network switch, so kind of bypassing all of the, you know, internal code that's running on

12:52.400 --> 12:56.320
that host that may be affecting that packet; we can send it directly to the

12:56.320 --> 13:06.880
rack switch. So yeah, the question was: with the netkit device, can we implement

13:06.880 --> 13:13.760
business logic in the BPF programs? And yeah, the answer is yes. So the netkit device has new

13:13.760 --> 13:19.120
attach points, but they're very similar to XDP attach points; it's just a different place

13:19.120 --> 13:23.760
in the driver where it's attaching. So kind of any logic that you could do in an XDP program,

13:24.480 --> 13:30.320
you can very easily change it to be a netkit redirect program and then attach it just like

13:30.320 --> 13:33.360
a normal BPF program.

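The attach step mentioned in the answer above can be sketched with libbpf's netkit attach API (hedged: bpf_program__attach_netkit is the libbpf API for this, while the object file, program name, and device name below are made up):

```c
/* Sketch of a userspace loader that attaches a compiled netkit program
 * to a netkit device via libbpf. */
#include <bpf/libbpf.h>
#include <net/if.h>
#include <stdio.h>

int main(void)
{
	/* "nk_prog.o" and "nk_egress" are hypothetical names */
	struct bpf_object *obj = bpf_object__open_file("nk_prog.o", NULL);
	if (!obj || bpf_object__load(obj)) {
		fprintf(stderr, "failed to open/load object\n");
		return 1;
	}

	struct bpf_program *prog =
		bpf_object__find_program_by_name(obj, "nk_egress");
	int ifindex = if_nametoindex("nk0"); /* hypothetical device name */
	if (!prog || !ifindex) {
		fprintf(stderr, "program or device not found\n");
		return 1;
	}

	struct bpf_link *link = bpf_program__attach_netkit(prog, ifindex, NULL);
	if (!link) {
		fprintf(stderr, "attach failed\n");
		return 1;
	}

	/* Pin the link so the program stays attached after exit */
	bpf_link__pin(link, "/sys/fs/bpf/nk_egress_link");
	return 0;
}
```
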
13:41.760 --> 13:51.120
So, not that we've seen. Oh yeah, so the question was: were there any impacts to the container

13:51.120 --> 13:57.600
tasks, like, did they have to be aware of this? And the basic answer is no. So we have made

13:57.600 --> 14:03.120
netkit now the default choice for tasks that enable network namespaces, and we didn't have

14:03.120 --> 14:09.520
to work with the teams to, you know, make that switch; it's basically a

14:09.520 --> 14:15.360
transparent change for them to just swap out that device. I would say the one team that has

14:15.360 --> 14:20.560
come to us had a specific debugging thing they were trying to do, where they just

14:20.560 --> 14:26.720
knew more about veth, because it looks like any other, you know, network adapter; they specifically

14:26.720 --> 14:32.480
wanted to go back to veth because they were more familiar with how to attach BPF programs to it

14:32.480 --> 14:37.120
and how to debug things but that's really been the only time that people have come to us

14:37.120 --> 14:42.080
with requests to go back, and I think that will change, you know, as there's more knowledge

14:42.080 --> 14:47.280
and more internal tooling for netkit, because again, with netkit we actually can

14:47.280 --> 15:08.240
add more monitoring and more debugging. So the next question was: is there any

15:08.240 --> 15:15.600
impact on traffic going between containers? And the answer is mostly no. I'll say there's one

15:15.600 --> 15:21.040
exception, not due to netkit but due to network namespaces, which is: if tasks were

15:21.680 --> 15:27.040
communicating over localhost, like if container A in my example wanted to communicate with

15:27.040 --> 15:32.080
container B specifically because they were related somehow. In the past, teams have gotten used to

15:32.080 --> 15:36.800
using localhost for that communication, but we can't do that with a network namespace, because

15:36.800 --> 15:41.440
localhost is its own localhost in the network namespace. So that's been the only thing that we've

15:41.440 --> 15:50.240
had to, you know, move teams off of. So yeah, performance-wise it's very similar; it has the same

15:50.240 --> 15:54.560
we have the same kind of redirection happening that we had before where we'll detect if something

15:54.560 --> 16:00.960
is going inside the container, and it will get routed the same way it was before. Now, for the

16:00.960 --> 16:06.960
performance benefit: is it because of the BPF program or the netkit device? Like, what brings

16:06.960 --> 16:11.840
the performance in the results you have? So the question was about where does the

16:11.840 --> 16:18.400
performance benefit come from, is it the BPF program versus the netkit device. So there's

16:18.400 --> 16:22.560
many people that could speak to this more qualified than me; my understanding of it is,

16:22.560 --> 16:28.000
it has to do with where the BPF program attaches. So we're able to pick up the traffic at an

16:28.000 --> 16:35.200
earlier point in the transmit process. And so with existing BPF attach points, like

16:35.200 --> 16:40.560
attaching to the TC attach point, the packet has kind of already gone through some of the

16:40.560 --> 16:44.880
network processing on that veth, whereas now we can kind of pluck it out of the transmit path:

16:44.880 --> 16:50.720
we hook into the actual driver's transmit call, pick up the packet, and put it onto the host.

16:51.600 --> 16:55.600
so

17:02.400 --> 17:07.600
I think it was, like: what are the requirements to enable netkit?

17:10.080 --> 17:14.240
So yeah, I'd say that, you know, the requirements to enable it, the big one, is

17:14.240 --> 17:20.880
having the kernel support; that's really the only kind of hard requirement. You

17:20.880 --> 17:26.320
would also need to have the BPF programs written. So one difference with netkit is, if you don't have

17:26.320 --> 17:30.640
a BPF program attached, by default

17:30.640 --> 17:35.920
it will just drop all traffic you know the idea is to prevent traffic from accidentally going out

17:35.920 --> 17:42.160
before the device is configured. So you would need to have a BPF program set up to do the

17:42.160 --> 17:48.000
forwarding of the traffic; without that BPF program, by default it would just drop all

17:48.000 --> 17:55.120
of the packets. So other than that, there's really no technical requirements; there is, like,

17:55.120 --> 17:59.280
you have to make sure the ip tool has been updated to a newer version that has

17:59.280 --> 18:04.560
netkit support, and likewise bpftool; those tools have all been more recently updated with

18:04.560 --> 18:09.200
netkit support. So to run a lot of those command line tools, you just need to make sure that you

18:09.200 --> 18:28.160
have the most updated version. So the question was: did we encounter any limitations

18:28.720 --> 18:36.240
using netkit compared to, like, XDP? So far I would say no; you know, we're still kind of in the early

18:36.240 --> 18:41.200
stages, like, we may discover more as we go on. The only kind of gotcha that we found to look out

18:41.200 --> 18:48.240
for is making sure that if there are additional BPF programs that were in a chain

18:48.240 --> 18:54.720
on the TC side, attached to the veth, those may not get run if you have the netkit

18:54.720 --> 19:03.040
program, like, forwarding things off. So that's the only one.

