WEBVTT

00:00.000 --> 00:13.000
All right, well, welcome everybody, my name is Samuel, I'm a production engineer at Meta,

00:13.000 --> 00:19.000
and today I'll present a bpftrace profiler that I wrote. And by the way, good ordering

00:19.000 --> 00:25.000
on the talks, because now I don't need to explain anything about bpftrace, so thank you.

00:25.000 --> 00:32.000
So yeah, just a quick agenda for today. First, we'll frame the problem we were facing,

00:32.000 --> 00:38.000
we'll walk through some implementation details, and look at results and

00:38.000 --> 00:45.000
share some learnings. So the team I work on at Meta, it's called Scuba, and it's

00:45.000 --> 00:51.000
basically a big log aggregation system. We have a real-time data ingestion pipeline,

00:51.000 --> 00:57.000
and we have a query-side component. I'm sure many of you have used similar systems.

00:57.000 --> 01:05.000
You can think of Elasticsearch, Splunk, or Amazon CloudWatch. And visually,

01:05.000 --> 01:11.000
it looks something like this. So if you trace a request through the system, it goes through the proxy,

01:11.000 --> 01:16.000
and then it traverses the map-reduce architecture all the way to the leaves.

01:16.000 --> 01:23.000
And this is the part of the system we'll be interested in today. The leaves, you can think of them as

01:23.000 --> 01:31.000
basically distributed database servers, and as with any good database servers, we have memory issues.

01:31.000 --> 01:39.000
So we have too many OOMs. Now, the thing is, we already have protection in place:

01:39.000 --> 01:46.000
in the query engine, we have a whole admission control layer. But really, even though we have protection,

01:46.000 --> 01:52.000
we're still OOMing. And so the goal was: can we improve OOM observability?

01:52.000 --> 02:00.000
And so we want to identify expensive use cases. And a use case, you can uniquely identify it with

02:00.000 --> 02:07.000
this triplet that I put here: basically table, source, SQL. And then eventually,

02:07.000 --> 02:14.000
if you're able to identify these expensive use cases, then as a first responder, when you're facing

02:14.000 --> 02:20.000
waves of OOMs, you can go and potentially block them.

02:20.000 --> 02:26.000
So, let's see if we can do a proof of concept. The first version,

02:26.000 --> 02:32.000
it's not going to be the final result, but naively,

02:32.000 --> 02:39.000
we'll try to count these triplets, which I'll also refer to as aggregations.

02:39.000 --> 02:45.000
We want a data set that is highly correlated with memory pressure and OOMs.

02:45.000 --> 02:51.000
So we'll collect data only when we're facing memory pressure, and we'll log only after an OOM.

02:51.000 --> 02:57.000
And then the final criterion is we want to be memory-lightweight. And that's because the system is already

02:57.000 --> 03:05.000
running out of memory, so our tracing shouldn't be adding fuel to the fire.

03:05.000 --> 03:10.000
So, when I was discussing this project with teammates, they were like, well, why do you want to use

03:10.000 --> 03:16.000
bpftrace for this? Why not just do it in user space? So, turns out, in our environment,

03:16.000 --> 03:23.000
we have this thing called oomd. And this thing is responsible for killing processes that use

03:23.000 --> 03:30.000
too much memory. So it will act before the kernel OOM killer. And what it does is, it sends a SIGKILL.

03:30.000 --> 03:39.000
SIGKILL, you cannot intercept it or ignore it in user space. And so, this is why it's useful to use bpftrace.

03:39.000 --> 03:48.000
And what we do here, we have this bpftrace snippet. We intercept the tracepoint for signals.

03:48.000 --> 03:59.000
And then we just check our conditions. Was it sent to our Scuba process? Was it signal nine? And did it come from oomd?
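
NOTE
A minimal sketch of the filter described above. It assumes signal 9 (SIGKILL), that the server's PID was saved earlier into a @server_pid map, and that oomd shows up as the sending process; names are illustrative, not the exact production script.
tracepoint:signal:signal_generate
/args->sig == 9 && comm == "oomd" && args->pid == @server_pid/
{
	// SIGKILL sent by oomd and aimed at our server: remember that we OOMed
	@oomed = 1;
}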

03:59.000 --> 04:06.000
Yeah, a bit more on oomd. It's open source, and it's deployed on every machine as a systemd service.

04:06.000 --> 04:11.000
And the other reason for its existence was to have better control over what gets killed,

04:11.000 --> 04:20.000
and mostly to avoid killing those golden Facebook processes whose death would render the

04:20.000 --> 04:28.000
host basically unusable. So, moving on with our goals, let's try to collect these aggregations.

04:28.000 --> 04:35.000
So, the first thing we do is we go into our binary. And here we use a macro from folly.

04:35.000 --> 04:43.000
folly is an open-source C++ library. And with this macro, we add two probes.

04:43.000 --> 04:49.000
So, first the entry probe. And you see, what we do is we add the query ID, and then our triplet:

04:49.000 --> 04:57.000
table, source, SQL. And then we also add the return probe, which is going to run at the end of the function.

04:58.000 --> 05:09.000
And we add the query ID there. So, it's important to have a common piece of information between the entry probe and the return probe, so you can connect the dots.

05:09.000 --> 05:14.000
And that will be important later.
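
NOTE
Once the binary is instrumented, the new probes can be listed with bpftrace's -l flag; the binary path here is illustrative.
bpftrace -l 'usdt:/usr/local/bin/rockford:*'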

05:14.000 --> 05:23.000
So, now we've instrumented our binary, and we can go into bpftrace and actually attach to these new USDT tracepoints.

05:23.000 --> 05:32.000
So here, this is the entry probe with the query SQL. And we have this map, and we just count our triplets.

05:32.000 --> 05:41.000
And we also save the PID, because in this probe's context, the PID refers to our server.
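
NOTE
A minimal sketch of this entry probe, assuming arg0 carries the query ID and arg1-arg3 carry the table/source/SQL triplet; the probe name and binary path are illustrative.
usdt:/usr/local/bin/rockford:query_entry
{
	// count each (table, source, sql) triplet in the current bucket
	@current[str(arg1), str(arg2), str(arg3)]++;
	// in this probe's context, pid is our server; the signal probe filters on it
	@server_pid = pid;
}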

05:41.000 --> 05:52.000
And we needed that to do the filtering. Okay, another thing we add, and this is just for completeness: sometimes oomd is not perfect.

05:52.000 --> 06:03.000
So, sometimes the kernel will actually kill our process. But we can detect this by tracing another tracepoint, this time from the kernel.

06:03.000 --> 06:10.000
And we can just filter on the PID. So, this is another way to detect if we've been killed.
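
NOTE
A hedged sketch of that kernel-side detection, assuming the oom:mark_victim tracepoint, which fires when the kernel OOM killer picks a victim.
tracepoint:oom:mark_victim
/args->pid == @server_pid/
{
	// the kernel OOM killer chose our server
	@oomed = 1;
}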

06:10.000 --> 06:21.000
And finally, yeah, our last criterion. Remember, we didn't want to use too much memory. There are a few ways to do this, but the way I chose was bucket rotation over time.

06:21.000 --> 06:31.000
So, what we do is, we collect aggregations, and we always write into the current bucket. And then, every now and then, we rotate the buckets.

06:31.000 --> 06:39.000
And so, when the duration has elapsed, everything that was in current gets put in previous, and we just trash everything that was in previous.

06:39.000 --> 06:44.000
And so, you can see we rotate things, which bounds the memory we use:

06:44.000 --> 06:54.000
our data retention period is at most twice the bucket duration. And in bpftrace, it looks like this.

06:54.000 --> 07:06.000
So, you can do things on an interval, here every 10 seconds. We look at how long it's been since the bucket started.

07:06.000 --> 07:13.000
Are we over the bucket duration? Is it time to rotate? If so, well, we have two for loops.

07:13.000 --> 07:20.000
We delete everything in previous, and we put everything from current into previous, just like we showed in the diagram.
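
NOTE
A minimal sketch of the rotation, assuming a recent bpftrace with map for-loops; the 60-second bucket duration and map names are illustrative.
interval:s:10
{
	// has the bucket duration (60s here) elapsed since the bucket started?
	if ((nsecs - @bucket_start) >= 60000000000) {
		// first for-loop: trash everything in the previous bucket
		for ($kv : @previous) { delete(@previous, $kv.0); }
		// second for-loop: move current into previous
		for ($kv : @current) { @previous[$kv.0] = $kv.1; delete(@current, $kv.0); }
		// reset the timer
		@bucket_start = nsecs;
	}
}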

07:20.000 --> 07:32.000
And then we reset the timer. So, that's not a perfect implementation, mostly because traffic patterns over time are very variable.

07:32.000 --> 07:41.000
Maybe during one bucket duration we get a ton of queries, and in another we get almost nothing. So it's not the best way to do things.

07:41.000 --> 07:52.000
It would be much better with an LRU hashmap. And as Victor mentioned, there's active discussion and a thread about supporting this.

07:52.000 --> 08:04.000
And the other thing: you cannot assign a hashmap to another hashmap. So, I wanted to just do "previous equals current", but it turns out you cannot do this.

08:04.000 --> 08:17.000
And then the other issue was, you can clear a hashmap, but only asynchronously. So that created some data inconsistencies.

08:17.000 --> 08:30.000
But, okay. So, now that we've achieved everything we wanted, let's ship this to production. The way this works: at Meta, we have this central service called

08:30.000 --> 08:41.000
Strobelight, and Strobelight takes two things: the system state and triggering rules. So, for us, the triggering rule is on memory pressure.

08:41.000 --> 08:52.000
So, what we do is, we tell Strobelight: if you see memory pressure, start our bpftrace sidecar, and then it will attach to all the probes.

08:52.000 --> 09:06.000
And then if we OOM, we dump the data, and the data is parsed by Strobelight, which sends it back to Scuba. So, yes, we are monitoring Scuba with Scuba.

09:06.000 --> 09:21.000
And, okay, here are the results. So, yeah. Unfortunately, I do have to censor the prod data, despite what you've heard about our censorship policy.

09:21.000 --> 09:32.000
But I still put the counts there. So, you see, we have our triplets, the table, source, SQL, and we are able to order by the counts.

09:32.000 --> 09:44.000
And it works. It's a proof of concept, it's OOM-tailored data, but it turns out it's not really useful. And the reason for that is a lot of this thing:

09:44.000 --> 09:59.000
we have, on our platform, these types of use cases which are basically health checks. And so they're all going to be at the top of the list, which makes the data really noisy and not really useful.

09:59.000 --> 10:08.000
So, okay. We let this run for a month or two, we have some discussions, collect feedback, and we come up with new ideas.

10:08.000 --> 10:15.000
So, let's try to one-up our implementation. And what we're going to do is, well, first we'll do continuous logging.

10:15.000 --> 10:22.000
So, in the previous version, we had this bucket thing: we do current and previous, and we would trash the data.

10:22.000 --> 10:36.000
But it's perfectly good data; we can still log it. What we do, though, because we want to retain the ability to say whether the data was from after an OOM,

10:36.000 --> 10:43.000
is we just add a filter. Then, we'll track one query ID per aggregation.

10:43.000 --> 10:50.000
That's mostly because, for us, a query ID is a gateway to many other data sets, to get more information.

10:50.000 --> 10:57.000
And then, instead of doing counts, we'll try to collect memory and CPU cost.

10:58.000 --> 11:05.000
So, the way this looks is, we go back into bpftrace, and we refactor the entry probe.

11:05.000 --> 11:12.000
And now we need to trace all this new information, so that means we need to add a bunch of maps.

11:12.000 --> 11:19.000
And we still want to be memory-lightweight. So, we add this thing: the concept of an aggregation ID.

11:19.000 --> 11:24.000
And really, it's just because we only want to store the triplet once. So, we store it once.

11:24.000 --> 11:32.000
We create an ID, it's just a self-incrementing ID, and that's what we use to refer to the aggregation.

11:32.000 --> 11:40.000
Number two here, we have a query IDs map. It's just a map from aggregation ID to query ID.

11:40.000 --> 11:51.000
And finally, we add a temporary map, and this is just so we can connect the dots between the entry probe and the return probe.
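
NOTE
A minimal sketch of the refactored entry probe; argument positions, probe name, and map names are illustrative.
usdt:/usr/local/bin/rockford:query_entry
{
	$triplet = (str(arg1), str(arg2), str(arg3));
	// (1) store the triplet once, behind a small self-incrementing aggregation ID
	if (@agg_ids[$triplet] == 0) {
		@next_id++;
		@agg_ids[$triplet] = @next_id;
	}
	$id = @agg_ids[$triplet];
	@counts[$id]++;
	// (2) one representative query ID per aggregation
	@query_ids[$id] = arg0;
	// (3) temporary map keyed by query ID so the return probe can connect the dots
	@inflight[arg0] = $id;
}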

11:51.000 --> 11:59.000
So, this is important in the next slide. And it's mostly so we can track memory and CPU for the aggregation.

11:59.000 --> 12:07.000
So, we use our temporary map, we retrieve the aggregation ID, and then we use the built-in from bpftrace

12:07.000 --> 12:14.000
which computes a rolling average. So, we do that for both memory and CPU.
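
NOTE
A minimal sketch of the return probe, assuming it also receives the memory and CPU consumed as arguments; avg() is the bpftrace rolling-average builtin mentioned here.
usdt:/usr/local/bin/rockford:query_return
{
	// reconnect with the entry probe through the shared query ID
	$id = @inflight[arg0];
	delete(@inflight, arg0);
	// rolling averages per aggregation ID
	@mem[$id] = avg(arg1);
	@cpu[$id] = avg(arg2);
}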

12:14.000 --> 12:21.000
So, okay. Oh yeah, and finally, sorry, this is a bit of a big slide.

12:21.000 --> 12:28.000
But the last thing is continuous logging. So we refactor; this is the code where we were previously rotating buckets.

12:28.000 --> 12:36.000
Instead of rotating buckets, now what we do is, we iterate through the aggregations, we get the count,

12:36.000 --> 12:45.000
we get the total cost in terms of memory and CPU, which is the average built-in multiplied by the counts,

12:45.000 --> 12:57.000
and then we log it. But then we add our filter; the filter is important so we can detect whether it was after-OOM data or not.
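
NOTE
A minimal sketch of the continuous-logging loop; here the maps are just printed, and the average-times-count multiplication is left to the consumer. Map names and output format are illustrative.
interval:s:10
{
	// tag the dump so post-OOM data can be filtered later
	if (@oomed == 1) { printf("post-OOM\n"); }
	print(@counts);
	print(@mem);
	print(@cpu);
}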

12:57.000 --> 13:04.000
And then, we get more results and more censorship, but we have two new columns.

13:04.000 --> 13:11.000
We have memory and CPU, and here you see it's ordered by memory, and it's arguably much better data,

13:11.000 --> 13:21.000
because it's not such a naive way to estimate the cost. But we still have a bunch of open questions,

13:21.000 --> 13:28.000
because, actually, I shipped this thing two days ago. So, I didn't really have time to look at the data.

13:28.000 --> 13:33.000
So, these are the follow-ups: how to best use this data set.

13:33.000 --> 13:40.000
If you find like a top suspect, how do you confirm that it's actually a problematic use case?

13:40.000 --> 13:48.000
Some people said to just ingest it and replay it and see if you crash prod, but I'm not sure that's the best idea.

13:48.000 --> 14:01.000
But yeah, we have to play with this, find out how to best use the data set, and eventually maybe we can build some automation to do the policing.

14:01.000 --> 14:14.000
But yeah, this was my first bpftrace programming experience, and it's really interesting to work with this technology; it's basically coding with hashmaps, I would say.

14:14.000 --> 14:23.000
And I just want to say a big thank you, also, to Daniel and Jordan. They really helped me, answering all my questions and troubleshooting.

14:23.000 --> 14:30.000
So, thank you a lot. And yeah, I mean, this is probably the best crowd to preach bpftrace to.

14:30.000 --> 14:41.000
I'm sure you're all already interested, but if you're a systems engineer, I think you should really spend time learning this technology and adding it to your toolbox.

14:41.000 --> 14:52.000
So yeah, thank you.

14:52.000 --> 15:01.000
We have time for questions.

15:01.000 --> 15:08.000
Cool. A question? Oh, wait.

15:08.000 --> 15:13.000
The Rockford service?

15:13.000 --> 15:16.000
Yeah, yeah.

15:16.000 --> 15:20.000
Yeah.

15:20.000 --> 15:21.000
Yeah, exactly.

15:21.000 --> 15:26.000
It's just this one function. We added the two USDT probes to it.

15:26.000 --> 15:29.000
Rockford is just a legacy name for Scuba.

15:29.000 --> 15:40.000
That's just the name of the binary.

15:40.000 --> 15:42.000
No, okay.

