WEBVTT

00:00.000 --> 00:10.480
All right, so the next talk is Jiri Olsa presenting the latest developments for

00:10.480 --> 00:15.000
uprobes and kprobes. [inaudible]

00:15.000 --> 00:16.000
Thank you.

00:16.000 --> 00:17.000
Okay, hello.

00:17.000 --> 00:18.500
So yeah, my name is Jiri Olsa.

00:18.500 --> 00:24.320
I work at Isovalent at Cisco, mostly working on BPF and the Linux kernel.

00:24.320 --> 00:31.920
This presentation is basically to give an update about the latest way to attach

00:31.920 --> 00:38.760
kprobe programs, that's the session attachment, to make people more aware of it.

00:38.760 --> 00:44.920
And the other part is about uprobes and about the effort to speed them up.

00:44.920 --> 00:53.700
So let's start with the session. If you have a kprobe program, you attach it to the

00:53.700 --> 01:00.020
kprobe, either to the entry of some kernel function or to the exit.

01:00.020 --> 01:04.340
And what you can do is use either the legacy interface, or you can use the

01:04.340 --> 01:12.180
perf link, or the kprobe multi interface, and you attach the kprobe BPF program to the entry or the

01:12.180 --> 01:13.180
exit.

01:13.180 --> 01:19.780
Now, there is a session attachment where you actually have just a single program that is

01:19.820 --> 01:25.660
attached to the entry and to the exit. It's always this one program that is called

01:25.660 --> 01:30.860
at the entry and the exit, and there are actually some benefits that I'll mention

01:30.860 --> 01:36.980
on the other slide, but that's the overview of where the session attachment actually belongs

01:36.980 --> 01:42.660
to. So it's just another way to attach the kprobe program, with some benefits.

01:42.660 --> 01:49.700
I'll get to them. It's the same, or very similar, on the uprobe side. Again, you have a

01:49.700 --> 01:55.380
BPF program that you want to attach to the uprobe. You have a function, you attach

01:55.380 --> 02:03.620
to the entry and to the exit. Again, you can use the legacy interface or uprobe multi;

02:03.620 --> 02:09.580
you always have separate BPF programs to attach. With the uprobe session, now you can

02:09.580 --> 02:14.980
actually have just one program that attaches to the entry and to the exit, the same

02:14.980 --> 02:17.740
as for the kprobes.

02:17.740 --> 02:24.700
So what are the benefits? There's just one program, which is sort of a benefit already: you

02:24.700 --> 02:33.020
have just one program, so maybe less duplicated code, and just one file descriptor.

02:33.020 --> 02:41.220
The other benefit is that the entry program can actually decide if the return program is executed.

02:41.220 --> 02:48.860
That's actually one of the reasons we did it, because in Tetragon a very common use case

02:48.860 --> 02:57.900
is that we attach to the entry and to the exit, and in the entry we do a lot of filtering,

02:57.900 --> 03:06.900
and based on the result of the filtering, we want the return probe to do some work. But

03:06.900 --> 03:11.180
before session attachment, we actually needed to execute the return probe, and it needed to

03:11.180 --> 03:13.780
find out: do I have work to do or not?

03:13.780 --> 03:18.520
Now from the entry program, we can actually say, okay, skip the execution of the return

03:18.520 --> 03:23.840
probe, and all the machinery to actually execute the return probe will not be triggered, so we

03:23.840 --> 03:30.800
save some cycles there. As for the entry and exit programs, another nice feature is that they can

03:30.800 --> 03:32.800
actually share data.

03:32.800 --> 03:39.800
It's not an arbitrary amount of data, it's just 8 bytes, but that's big enough to store

03:39.800 --> 03:46.960
any kind of ID, and the entry and exit programs can use it for lookups in maps.

03:46.960 --> 03:52.520
There are two special kfuncs for the session attachment. There's bpf_session_is_return,

03:52.520 --> 03:57.080
because we have just one program being executed in both contexts, and you need to find out

03:57.080 --> 04:02.200
whether you are the entry program or the exit one, so that's the kfunc for that, and bpf_session_cookie,

04:02.200 --> 04:09.920
which gives you the pointer to the shared data. So normally the

04:09.920 --> 04:19.000
entry probe stores some data and the return probe just reads it. OK, yeah. Well, it

04:19.000 --> 04:26.000
is supported in libbpf, obviously. The Tetragon support hasn't been merged yet, but

04:26.000 --> 04:32.680
the kprobe session support is already in the cilium/ebpf Go library.

04:32.680 --> 04:37.880
The thing with the cilium/ebpf library is that we need to wait for the stable release, for

04:37.880 --> 04:42.780
the feature to get into some stable release, so we can actually update the interface of the cilium/ebpf

04:42.780 --> 04:53.300
library. So the uprobe session is not there yet, but it's coming. As for bpftrace, I don't

04:53.300 --> 04:59.540
think it's in bpftrace yet, but actually the proof of concept for this feature was written

04:59.540 --> 05:07.140
in bpftrace, so it's doable, and I guess it will happen at some point.

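NOTE
A minimal user-space C model of the session semantics described above; it is not kernel or BPF code.
The names session, prog and run_session are hypothetical. Treating a non-zero entry return value
as the "skip the return probe" signal is an assumption; the talk does not spell out the mechanism.
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
/* Hypothetical per-call session state, standing in for what the kernel keeps. */
struct session {
    bool is_return;   /* what bpf_session_is_return() would report */
    uint64_t cookie;  /* the 8 bytes that bpf_session_cookie() would point to */
};
typedef int (*session_prog)(struct session *s, uint64_t arg);
/* One program for both contexts: the entry side filters and stores an ID. */
int prog(struct session *s, uint64_t arg)
{
    if (!s->is_return) {
        if (arg % 2)             /* filter: odd arguments are not interesting */
            return 1;            /* assumed convention: non-zero skips the exit run */
        s->cookie = arg * 10;    /* ID shared with the exit side */
        return 0;
    }
    return (int)s->cookie;       /* exit side reads what the entry side stored */
}
/* Hypothetical driver: returns -1 if the exit run was skipped, else the exit result. */
int run_session(session_prog p, uint64_t arg)
{
    struct session s = { .is_return = false, .cookie = 0 };
    assert(p);
    if (p(&s, arg) != 0)
        return -1;
    s.is_return = true;
    return p(&s, arg);
}
```
In this model the exit side never runs for filtered-out calls, which is where the saved cycles come from.
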
05:07.140 --> 05:13.940
So that was the session attachment. Hopefully people will start to use it and find

05:13.940 --> 05:25.940
it useful. Now, faster uprobes. There's been development on uprobes to make them faster

05:25.940 --> 05:34.060
because they're slow. There are sort of two ways to make uprobes faster. There has been

05:34.060 --> 05:41.060
a bunch of generic uprobes fixes made by Andrii and other folks to make

05:41.060 --> 05:49.260
uprobes more scalable, so when you have many, many uprobes, the throughput

05:49.260 --> 05:54.420
of the system, the throughput of uprobes going through the kernel, is much,

05:54.420 --> 06:04.020
much better now. The other way of speeding up uprobes is actually related to the x86

06:04.100 --> 06:12.460
architecture, on 64 bits, and the idea is to replace the breakpoint with a syscall.

06:12.460 --> 06:21.140
As I'll show on the next slide, uprobes are based on the breakpoint, and the idea is to replace it

06:21.140 --> 06:29.260
with the syscall, because a syscall is roughly three times faster than executing a breakpoint.

06:29.260 --> 06:37.340
So just a quick, very high-level overview of what it looks like when a uprobe gets executed.

06:37.340 --> 06:42.900
When you decide you have a place in a binary that needs to execute the uprobe, you

06:42.900 --> 06:50.060
overwrite the instruction there with the breakpoint. Then the application gets executed, hits the breakpoint,

06:50.060 --> 06:57.100
goes to the kernel, runs the BPF program or any other uprobe work that needs to be done, then

06:57.180 --> 07:02.700
it executes the original instruction, because the breakpoint overwrote that instruction, so it needs

07:02.700 --> 07:10.940
to get executed, and it jumps back to the application to continue the execution.

07:12.140 --> 07:17.980
Now, the next slide just explains the idea. The use case on this slide can

07:17.980 --> 07:26.780
not actually work in the real world, but it's simple enough and ideal to explain

07:26.780 --> 07:27.780
the idea.

07:27.780 --> 07:34.780
So the idea is to replace the breakpoint with the syscall, but you cannot do it right

07:34.780 --> 07:40.140
away, because the syscall needs some preparation: you need to store some data in registers

07:40.140 --> 07:46.460
to execute the proper syscall, and to save some registers. So the way you would like to do

07:46.620 --> 07:53.340
it, and we do it at some point, is that instead of the breakpoint we use a call instruction

07:53.340 --> 08:00.060
that jumps to the user space trampoline, and the trampoline itself executes the syscall.

08:00.060 --> 08:06.060
The syscall goes to the kernel, does whatever the breakpoint would do, and jumps back to the trampoline,

08:06.060 --> 08:10.460
which ideally executes the original instructions and returns to normal execution.

08:11.340 --> 08:18.220
This very common use case for uprobes cannot actually work, because we cannot just

08:18.220 --> 08:23.420
overwrite those five bytes and get away with that. We are not in control of the user

08:23.420 --> 08:28.700
space application, so you have actually no idea what's happening in those five bytes that you

08:28.700 --> 08:34.460
need for the call instruction. Any thread can take any other path through the application

08:34.460 --> 08:39.900
and go through that and jump into the middle of that instruction, so that's not going to work.

08:39.900 --> 08:46.460
Fortunately, there are some use cases that can work. One of them is the uretprobe.

08:48.620 --> 08:55.020
I won't go into all the details of how the uretprobe is set up, but in the end, when there is a

08:57.020 --> 09:03.340
uretprobe installed on the function, there is a ret instruction being executed, and when there is a uretprobe

09:03.340 --> 09:07.660
installed, it doesn't actually go back to the original point where the function was called,

09:07.660 --> 09:13.820
but it goes to a special trampoline that executes the breakpoint, and again the breakpoint goes

09:13.820 --> 09:19.100
to the kernel, runs the BPF program and jumps back. So this is how a normal uretprobe works.

09:21.660 --> 09:26.780
Here we got really lucky, because we already have the trampoline, and we just had to

09:27.740 --> 09:34.460
replace the breakpoint with the syscall. That area, fortunately, was big enough to actually

09:35.260 --> 09:42.620
store all the instructions that we needed to execute the syscall. So now the ret goes

09:44.380 --> 09:49.420
to the same trampoline, but the trampoline contains the call to the syscall. The syscall goes to the kernel,

09:50.300 --> 09:55.100
and actually, that's wrong on the slide, there's no execution of the original instruction,

09:56.060 --> 10:03.420
because it goes right back to where the breakpoint version would go. So here we could actually

10:04.060 --> 10:12.140
speed up the uretprobe with this idea. There's one more use case that it seems we can speed up,

10:13.020 --> 10:19.660
and it is the USDT probe. A USDT probe is sort of like

10:20.460 --> 10:26.140
tracepoints in the kernel: the programmer decides where the best place to put the probe is,

10:27.180 --> 10:34.700
and then with perf or any other tool you have the USDT probe that you can connect to.

10:34.700 --> 10:40.780
It installs the uprobe, and you can execute your program on top of that. The way it works

10:40.780 --> 10:45.740
is that you normally have some kind of macro, probably the SystemTap macro is the most

10:45.740 --> 10:52.780
used one, and that just emits a nop instruction, which is just one byte, and that serves to actually

10:52.780 --> 10:59.660
write the breakpoint there, and the uprobe execution that I explained on the previous slide will

10:59.660 --> 11:08.140
happen. The idea here is to replace that one-byte nop with the five-byte nop,

11:08.220 --> 11:16.300
which is just big enough to store the call instruction to the user space trampoline.

11:17.340 --> 11:26.220
With that we can just go to the trampoline, execute the syscall, and, yeah, there's no original

11:26.220 --> 11:35.260
instruction here either, just run the BPF program and jump back. With this approach we actually

11:35.260 --> 11:41.340
know that the nop5 instruction is there and nobody is going to jump into the middle of it,

11:41.340 --> 11:49.740
so that's where we can do that. Of course there are problems, there always are.

11:53.580 --> 12:01.100
The first problem is how we actually update the instructions. It's kind of easy, well, not easy but

12:01.100 --> 12:08.700
easier, to write just one byte. When the kernel is installing the uprobe and writing

12:08.700 --> 12:13.660
the breakpoint instruction, it has to write just one byte.

12:13.660 --> 12:21.020
Now we have to write five bytes, and it's kind of tricky to do that in an atomic way,

12:21.020 --> 12:29.340
so we need some sort of procedure using the breakpoint and writing just the

12:30.220 --> 12:37.660
half of the instruction and then writing the rest of the instruction. So that's one problem.

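NOTE
A single-threaded C sketch of the kind of breakpoint-based update procedure mentioned above.
The talk gives no exact steps, so the three-step order here is an assumption modelled on
common int3-based code patching; patch5 is a hypothetical name, and a real kernel would also
have to synchronise all CPUs between the steps.
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
enum { INSN_LEN = 5, INT3 = 0xCC };
/* Replace the 5 bytes at text with insn so that a thread fetching the first
 * byte at any step sees the old instruction, an int3, or the new instruction. */
void patch5(uint8_t *text, const uint8_t insn[INSN_LEN])
{
    assert(text && insn);
    text[0] = INT3;                            /* 1: trap anyone who arrives now */
    memcpy(text + 1, insn + 1, INSN_LEN - 1);  /* 2: write the remaining bytes */
    text[0] = insn[0];                         /* 3: publish the first byte */
}
```
Applied to a nop5 (0f 1f 44 00 00), this leaves the five bytes of the new call instruction in place.
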
12:38.460 --> 12:43.820
Another problem is that, as a compromise, we started to use the five-byte nop and the five-byte

12:43.820 --> 12:50.700
call instruction, which has one opcode byte and the rest is the offset,

12:50.700 --> 12:59.180
so we have four bytes for the offset, but it's signed, so just half of that range in each direction, and of course with that

12:59.180 --> 13:07.100
you cannot cover the whole 64-bit address space. As a result of that, we actually have to

13:08.540 --> 13:15.180
map the user space trampoline close enough to the places where you actually want to do the

13:16.620 --> 13:23.900
breakpoint, to install the uprobe. It's still one page, so we don't waste memory, but we waste,

13:23.900 --> 13:30.860
well, in 64 bits it's not a huge waste, but we waste ranges in that address space.

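NOTE
A small C sketch of why the trampoline has to sit close to the probe site. It encodes the
x86 five-byte "call rel32" (opcode E8 followed by a signed 32-bit displacement counted from
the next instruction) and refuses targets outside that reach. encode_call is a hypothetical
helper written for illustration, not a kernel function.
```c
#include <assert.h>
#include <stdint.h>
/* Encode a call at address from that targets address to.
 * Returns 0 on success, -1 if the target is outside the signed 32-bit reach. */
int encode_call(uint64_t from, uint64_t to, uint8_t out[5])
{
    int64_t disp = (int64_t)(to - (from + 5));  /* relative to the next instruction */
    assert(out);
    if (disp < INT32_MIN || disp > INT32_MAX)
        return -1;                              /* trampoline is too far away */
    uint32_t rel = (uint32_t)(int32_t)disp;
    out[0] = 0xE8;                              /* call rel32 opcode */
    for (int i = 0; i < 4; i++)
        out[1 + i] = (uint8_t)(rel >> (8 * i)); /* little-endian displacement */
    return 0;
}
```
A target roughly 2 GiB or more away in either direction cannot be encoded, hence the nearby mapping.
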
13:32.700 --> 13:40.220
Another problem: backward compatibility. The feature that we add should not affect

13:41.660 --> 13:49.340
older kernels running new applications, and with using nop5 instead of the nop

13:49.340 --> 13:57.980
instruction, nop5 is not emulated in older kernels, so we cannot just switch

13:57.980 --> 14:05.420
the macro to use nop5. There needs to be some work on that so we don't

14:05.420 --> 14:14.300
slow down the older kernels. And the latest fun that we had is seccomp. When I said that we actually

14:14.300 --> 14:21.100
execute the syscall, we had to add a new syscall for that, and for the user space application that's kind of

14:21.100 --> 14:28.300
seamless: the user space application never really executes this new syscall by itself, but

14:28.300 --> 14:32.380
the kernel does it for the application. It's in the application context, it looks like the application

14:32.380 --> 14:40.060
executed it, but it was installed by the kernel. But seccomp doesn't know about that; for seccomp it's just

14:40.380 --> 14:49.660
another syscall, and of course there are plenty of configurations that will just kill the

14:49.660 --> 14:57.420
application because of the unknown syscall. It looks like the seccomp people will allow us to have an

14:57.420 --> 15:06.940
extra filter for this syscall inside seccomp. So yeah, that's the current problem

15:07.500 --> 15:15.980
that will likely get solved soon. And yeah, with that, that's it. Do you have any questions?

15:26.780 --> 15:32.540
Sure. OK, thanks for the talk. About the session you mentioned in the beginning:

15:32.540 --> 15:37.420
I think the helpers you mentioned are kfuncs. Is there a reason that those are not

15:37.420 --> 15:43.900
regular helpers? Ah, so the question is why the kfuncs are not helpers. Helpers

15:43.900 --> 15:51.980
are frozen: you cannot add any more helpers. Anything that you would need a helper for

15:52.060 --> 15:55.260
now needs to be a kfunc. That's the reason.

16:17.500 --> 16:19.980
OK, the question is about the

16:19.980 --> 16:29.500
backward compatibility of the USDT probes. So there's a record for the USDT in

16:29.500 --> 16:37.020
the ELF note, right? For each USDT there's a record in the ELF note, and

16:38.700 --> 16:44.460
I think as far as the note goes, there's just the offset of the nop, so if there's one byte

16:45.100 --> 16:49.580
replaced with the five-byte instruction, I don't think this actually matters, but I haven't

16:54.460 --> 17:01.340
So for the issue that I mentioned, we will maybe have to make some changes to the ELF note,

17:03.260 --> 17:10.300
but to be honest we haven't got there yet. We were busier with the problems that I described

17:10.300 --> 17:20.380
on the previous slide. [inaudible] Is there any path forward for optimizing entry

17:20.380 --> 17:29.420
uprobes as well as uretprobes? So the question is whether there's a path forward to optimize the entry

17:29.420 --> 17:35.820
probes. I don't think at the moment there are any thoughts on how this can be done.

17:36.060 --> 17:44.460
All this might eventually be sped up with the new Intel CPU feature whose name I just forgot...

17:45.340 --> 17:55.500
FRED. That's not the name of the CPU, it's the feature, which claims to speed up breakpoints,

17:55.500 --> 18:03.100
the traps, so that might eventually happen. But to speed up just the entry probe,

18:03.180 --> 18:09.660
yeah, there's this problem that I mentioned, and we don't see any workaround for that.

18:12.380 --> 18:18.860
[partly inaudible] ... with Go programs and the uretprobe problem?

18:22.540 --> 18:30.700
OK, so the question is whether the session will work in a Go program

18:31.660 --> 18:40.060
with regard to the problem with return uprobes, where Go manages the stack by itself.

18:40.780 --> 18:50.060
So that will work as long as a normal uprobe works. The session doesn't change the attachment at all,

18:50.620 --> 18:55.420
it just changes that you have this one program that's now shared between the entry and the exit.

18:55.420 --> 19:03.740
But if you attach it to a Go program that manages its own stack, you will be screwed again in the same way.

19:25.660 --> 19:34.220
uh if you modify that or not the duration right is there are there in your slides you will actually

19:34.220 --> 19:39.500
allow this behavior even if you don't know if you don't modify the system of the creation

19:41.420 --> 19:46.060
I'm actually not sure I understand what you mean by the kernel configuration for the kprobes.

19:56.380 --> 20:07.180
Ah, so that's like your kernel is missing... So the question is... OK, that's it. Please find me in the hallway.

20:08.300 --> 20:11.420
Thank you. Thank you.