WEBVTT

00:00.000 --> 00:11.680
It's better if we're seated, if we're not, I'm going to get in trouble with the staff

00:11.680 --> 00:14.360
from FOSDEM.

00:14.360 --> 00:17.680
And there's plenty of space, so good.

00:17.680 --> 00:23.720
Our next speaker is going to tell you how we manage crashes in Firefox probably.

00:23.720 --> 00:24.720
Enjoy.

00:24.720 --> 00:33.280
Hello, I'm Gian-Carlo, I'm a local, I live a little bit outside of Belgium, and I

00:33.280 --> 00:37.400
manage the OS integration team in Firefox.

00:37.400 --> 00:41.480
And today I'll be talking to you about your probably least favorite Firefox feature,

00:41.480 --> 00:43.760
and that's what happens if you have a crash.

00:43.760 --> 00:47.520
I'll skip through the agenda because we're a bit short on time.

00:47.520 --> 00:51.360
Straight to definitions.

00:51.360 --> 00:52.360
What's crash rate?

00:52.440 --> 00:54.640
Say there are 100 people in the room.

00:54.640 --> 00:58.000
Alex here in the first row has Firefox crash five times.

00:58.000 --> 01:01.720
We say, OK, there's a 5% crash rate today.

01:01.720 --> 01:04.560
The other metric we use is crash incidence.

01:04.560 --> 01:08.800
Say Alex and Sylvestre both have Firefox crash three times.

01:08.800 --> 01:12.160
So we say we have a crash incidence of 2%, because that's

01:12.160 --> 01:14.120
two users that had a crash.
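
NOTE
A minimal sketch of the two metrics just defined, assuming each user's crash
count for the day is known. The function and variable names are hypothetical
and do not come from the talk; only the worked numbers follow its example.
```python
# Hypothetical helper names; the definitions follow the example in the talk.
def crash_rate(crashes_per_user: dict[str, int], total_users: int) -> float:
    """Total crashes per 100 users, counting every crash."""
    return 100.0 * sum(crashes_per_user.values()) / total_users
def crash_incidence(crashes_per_user: dict[str, int], total_users: int) -> float:
    """Percentage of users who had at least one crash."""
    affected = sum(1 for n in crashes_per_user.values() if n > 0)
    return 100.0 * affected / total_users
# 100 people in the room, one user crashes five times: 5% rate, 1% incidence.
one_user = {"alex": 5}
# Two users crash three times each: 6% rate, 2% incidence.
two_users = {"alex": 3, "sylvestre": 3}
```
Crash rate is dominated by a few users who crash repeatedly, while crash
incidence counts each affected user once.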

01:14.120 --> 01:15.560
Which one of those is more important?

01:15.560 --> 01:16.840
It's a bit arguable.

01:16.840 --> 01:20.680
But for various reasons, some of which will become clear during the talk,

01:20.680 --> 01:23.600
most of my data here is going to be about crash incidence.

01:23.600 --> 01:26.720
And that's the number we track most closely.

01:26.720 --> 01:29.120
What can cause Firefox to crash?

01:29.120 --> 01:32.000
Well, the first and most obvious one is just Firefox bugs, programming

01:32.000 --> 01:32.600
errors.

01:32.600 --> 01:35.160
That's probably what you thought this talk was going to be about,

01:35.160 --> 01:40.040
but plot twist, it's largely not; it's about other causes.

01:40.040 --> 01:44.120
An important reason for Firefox crashes is running out of memory.

01:44.120 --> 01:46.920
The browser has to execute the JavaScript on the web page,

01:46.920 --> 01:49.760
and do whatever the web page says, so it's somewhat limited in its

01:49.760 --> 01:53.440
freedom, and this can cause it to run out of memory.

01:53.440 --> 01:55.520
There's various machine configurations, which

01:55.520 --> 01:59.560
can make this worse, like running on a 32-bit OS.

01:59.560 --> 02:02.200
And the last category we have is "hardware" problems.

02:02.200 --> 02:05.960
And I use quotation marks because I see this very broadly.

02:05.960 --> 02:10.560
I'm not talking only about bad memory or overclocked machines,

02:10.560 --> 02:13.040
but also, for example, a Windows installation

02:13.040 --> 02:15.600
that has had three different antivirus products and a bunch of

02:15.600 --> 02:18.600
rootkits and malware on it. From my perspective,

02:18.600 --> 02:21.960
these are all hardware problems.

02:21.960 --> 02:24.920
So who is crashing and why?

02:24.920 --> 02:27.840
We wanted to dive a bit deeper to see how those three categories

02:27.840 --> 02:29.840
are divided among our users.

02:29.840 --> 02:32.400
And one of the most interesting ones I wanted to look at was,

02:32.400 --> 02:36.040
like, what percentage of our failures are hardware related?

02:36.040 --> 02:40.880
So one way we thought about this is we know the crash rate

02:40.880 --> 02:43.080
goes up as the machines age.

02:43.120 --> 02:47.720
So there's a bit of a question: can we estimate the age of a machine?

02:47.720 --> 02:50.720
I think probably the most reliable thing we can get access to

02:50.720 --> 02:53.720
is the BIOS date, if it's accessible,

02:53.720 --> 02:55.920
but we currently don't collect such telemetry,

02:55.920 --> 02:59.000
and I didn't have it historically, so it wasn't really useful.

02:59.000 --> 03:01.760
But what I do have is the exact CPU model.

03:01.760 --> 03:04.320
Now, if you look at the release cadence of the CPUs,

03:04.320 --> 03:07.800
you see that Intel, for example, has a fairly good cadence,

03:07.800 --> 03:11.040
with, like, most CPUs only on sale for a two-year window,

03:11.040 --> 03:13.440
for AMD it's slightly larger.

03:13.440 --> 03:15.600
And we say, OK, we just estimate that the machine

03:15.600 --> 03:17.440
was made in the middle of this window, which

03:17.440 --> 03:21.400
gives us about one year accuracy for the machine age.
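
NOTE
A rough sketch of the age estimate described here, assuming a lookup table
from CPU model to the years it was on sale. The table contents, model names
and function names are hypothetical placeholders, not real telemetry data.
```python
# Hypothetical sale windows (first year, last year) per CPU model; the model
# names and years are made up for illustration, not real telemetry data.
SALE_WINDOWS = {
    "example-cpu-a": (2015, 2017),
    "example-cpu-b": (2019, 2021),
}
def estimated_build_year(cpu_model: str) -> float:
    """Assume the machine was built in the middle of the CPU's sale window."""
    first, last = SALE_WINDOWS[cpu_model]
    return (first + last) / 2
def estimated_age(cpu_model: str, current_year: int) -> float:
    """Estimated machine age in years; off by up to half the window width."""
    return current_year - estimated_build_year(cpu_model)
```
With a two-year sale window the midpoint is at most one year away from the
true build date, which matches the accuracy quoted in the talk.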

03:21.400 --> 03:23.720
That's obviously a rough estimate, but it's good enough

03:23.720 --> 03:27.000
for what I'm going to show you.

03:27.000 --> 03:29.520
Now, what we can do is we can take a certain set of hardware,

03:29.520 --> 03:32.800
say, Windows release machines with a gigabyte of memory,

03:32.800 --> 03:35.840
and plot the crash rate over time.

03:35.840 --> 03:37.720
And this gives us some idea of, like, how

03:37.720 --> 03:39.920
the crash rate evolves over time.

03:39.920 --> 03:42.600
And then we need a baseline, which is the amount of failures

03:42.600 --> 03:45.000
that are actually caused by software problems.

03:45.000 --> 03:47.000
This we can't really know.

03:47.000 --> 03:48.920
But again, I made an assumption that the machine

03:48.920 --> 03:52.000
from one to three years old, so not an entirely new one,

03:52.000 --> 03:54.640
never crashes due to hardware problems.

03:54.640 --> 03:56.640
So that's obviously going to be overestimating

03:56.640 --> 03:58.440
our own crash rate, but it's fine.

03:58.440 --> 04:01.600
Because what I really want to get is this graph.

04:01.600 --> 04:03.600
The black line, that's the overall crash rate;

04:03.600 --> 04:05.480
you can sort of see the bathtub curve,

04:05.480 --> 04:07.080
as it's called, so the new machines actually

04:07.080 --> 04:08.920
crash a little bit more, and then it starts

04:08.920 --> 04:12.240
going up as the machines get very old.

04:12.240 --> 04:14.920
Don't read anything into this graph that's not there,

04:14.920 --> 04:18.720
like you see suddenly in terms of user distribution.

04:18.720 --> 04:21.040
There's a few gaps here, but that's because of the CPU

04:21.040 --> 04:22.880
bucketing I just explained.

04:22.880 --> 04:24.960
Most interesting part for me is this.

04:24.960 --> 04:29.000
So this blue line is like our estimate for hardware failures,

04:29.000 --> 04:30.280
out of memory problems.

04:30.280 --> 04:32.040
Also goes up as the machines get older,

04:32.040 --> 04:34.760
because older machines have less memory.

04:34.760 --> 04:39.080
But here, we are at 0.3% crash incidence,

04:39.080 --> 04:42.520
which is what we estimate our own software failure level to be.

04:42.520 --> 04:45.320
But the total crash level, at around six years of age,

04:45.320 --> 04:48.320
also starts to go above 0.6.

04:48.320 --> 04:51.560
So what this means is that if we get a crash from a machine

04:51.560 --> 04:55.360
that's six years old, it is as likely to be a crash

04:55.360 --> 04:57.680
from a cause external to Firefox.
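
NOTE
The arithmetic behind this claim, as a small sketch. It assumes, as the talk
does, that the software baseline is constant across machine ages; the
function name is hypothetical.
```python
def external_share(total_incidence: float, software_baseline: float) -> float:
    """Fraction of crashes attributed to causes outside Firefox."""
    if total_incidence <= 0:
        raise ValueError("total incidence must be positive")
    # Anything above the software baseline is assumed to be external.
    external = max(total_incidence - software_baseline, 0.0)
    return external / total_incidence
```
With a 0.3% baseline and a 0.6% total, half of the crashes are attributed to
external causes; at or below the baseline the external share is zero.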

04:57.680 --> 04:59.320
And if you now look at the same graph,

04:59.320 --> 05:02.280
there's actually a pretty long tail here of our user base

05:02.280 --> 05:04.360
that are on such old hardware.

05:04.920 --> 05:07.000
And this confirms something we had been seeing

05:07.000 --> 05:10.240
for a while is that if a new crash arrives from the field

05:10.240 --> 05:12.520
and we put an engineer to look at it,

05:12.520 --> 05:15.400
sometimes they spend one or two days really digging into the crash

05:15.400 --> 05:18.480
and they come back and they say, I don't understand

05:18.480 --> 05:20.600
what's happening, this crash is impossible.

05:20.600 --> 05:23.160
And now we sort of know why. It's starting to be more

05:23.160 --> 05:25.000
important to filter out those issues,

05:25.000 --> 05:29.040
because we cannot fix those failures.

05:29.040 --> 05:30.480
How do we get to learn about crashes?

05:30.480 --> 05:33.720
We have both crash reports and telemetry.

05:33.720 --> 05:34.960
If the main process crashes

05:34.960 --> 05:38.160
like all of Firefox shuts down, which is a bit the worst case,

05:38.160 --> 05:40.960
you get this window, which is an external program,

05:40.960 --> 05:44.320
we recently rewrote in Rust to have modern translation

05:44.320 --> 05:49.320
support, HiDPI, that kind of thing, basically.

05:49.320 --> 05:52.800
If you get this dialogue, we see about 70% of our users

05:52.800 --> 05:56.800
replying positively, so submitting the crash report.

05:56.800 --> 05:59.800
If a tab crashes, but Firefox itself stays running,

05:59.800 --> 06:03.000
you see this in your tab. This report is done

06:03.000 --> 06:05.520
by Firefox itself, and here we see already

06:05.520 --> 06:07.920
there's quite a drop only about 30% of the people

06:07.920 --> 06:10.560
who get this submit it. I presume the rest just closes

06:10.560 --> 06:12.960
it quickly and goes on working.

06:12.960 --> 06:17.240
And now the final category is a utility process crash.

06:17.240 --> 06:18.760
I think if you saw an exam or stop,

06:18.760 --> 06:20.840
you have some idea that there's a lot of utility

06:20.840 --> 06:24.320
processes in Firefox, and these are invisible crashes.

06:24.320 --> 06:26.920
I'm going to see if I can make the invisible crashes

06:26.920 --> 06:28.360
visible.

06:28.360 --> 06:29.360
Yeah?

06:34.080 --> 06:35.600
So did you notice it?

06:35.600 --> 06:37.080
It's pretty subtle.

06:37.080 --> 06:38.920
So if I didn't tell you this, you might have just

06:38.920 --> 06:40.640
thought this was like a network glitch, but this

06:40.640 --> 06:43.120
was actually Firefox's entire media decoding stack

06:43.120 --> 06:47.480
crashing and immediately restarting.

06:47.480 --> 06:51.200
We see basically nobody submits these kind of crashes,

06:51.200 --> 06:58.320
and that's obvious, because I lost my clicker.

06:58.360 --> 07:00.560
No, you have to focus on the video.

07:00.560 --> 07:01.400
Ah, is that it?

07:04.960 --> 07:06.040
Yeah.

07:06.040 --> 07:08.960
If we would report such a crash, it would be rather confusing,

07:08.960 --> 07:10.560
because it might have been something that happened

07:10.560 --> 07:12.760
in the background, and suddenly you have this

07:12.760 --> 07:16.560
pop-up asking you to submit that.

07:16.560 --> 07:19.560
It might be more annoying than the crash itself,

07:19.560 --> 07:22.040
which just looked like a small network glitch.

07:22.040 --> 07:24.840
So like from a UX perspective, we should never try to surface

07:24.840 --> 07:27.480
those, which is why we don't get reports for them.

07:27.480 --> 07:30.320
So how do we get to 0.1%?

07:30.320 --> 07:33.440
Some people, and I love you all, have this very deeply

07:33.440 --> 07:35.520
hidden setting enabled, which automatically

07:35.520 --> 07:37.640
sends all crash reports.

07:37.640 --> 07:39.880
So we do get to know about some of these.

07:39.880 --> 07:42.880
But that's a bit of an issue, like 0.1% is not enough

07:42.880 --> 07:46.880
to debug, to give an example, media decoding or GPU crashes.

07:46.880 --> 07:49.560
So how do we deal with those?

07:49.560 --> 07:52.920
So a crash report is basically a capture of the process

07:52.920 --> 07:55.000
memory state.

07:55.000 --> 07:58.480
You can add your URL or any comments.

07:58.480 --> 08:00.280
But the memory of your process might

08:00.280 --> 08:03.000
have passwords or rather private information in it,

08:03.000 --> 08:06.400
so we can only collect those if you explicitly agree to it.

08:06.400 --> 08:07.880
But we do also have telemetry.

08:07.880 --> 08:10.520
And what we can do with telemetry is add to the telemetry

08:10.520 --> 08:13.680
thing, basically saying, OK, this Firefox has had a crash,

08:13.680 --> 08:16.760
and it was in this Firefox function with this call stack,

08:16.760 --> 08:18.520
which doesn't really tell anyone anything,

08:18.520 --> 08:21.280
because I hope most Firefox functions have more than one

08:21.280 --> 08:23.560
user using them.

08:23.560 --> 08:25.160
And it's not perfect for debugging,

08:25.160 --> 08:26.960
because it only tells you where we crashed.

08:26.960 --> 08:29.760
But often, it's already quite something to go on.

08:29.760 --> 08:32.960
It's better than nothing for sure.

08:32.960 --> 08:36.520
A bit of history: this is a graph of Firefox crash

08:36.520 --> 08:40.880
rates, or rather crash incidence, over the last two years.

08:40.880 --> 08:42.680
There's a very, very, very big spike.

08:42.680 --> 08:43.800
This is called Foxstuck.

08:43.800 --> 08:44.520
You can Google it.

08:44.520 --> 08:47.240
We did a post-mortem on it.

08:47.240 --> 08:48.560
This is an interesting one.

08:48.560 --> 08:50.760
Somebody asked me about this.

08:50.760 --> 08:55.200
I think somewhere in 2022, in October or November,

08:55.200 --> 08:58.240
you had the biggest ever hit of solar gamma radiation

08:58.240 --> 09:01.120
on Earth, like 50 times bigger than ever before.

09:01.120 --> 09:04.040
And they asked, can you see this in Firefox crashes?

09:04.040 --> 09:06.840
I was like, yeah, I can see it.

09:06.840 --> 09:08.920
And I looked into what the crashes were,

09:08.920 --> 09:11.520
and, like, a major Windows antivirus

09:11.520 --> 09:13.560
vendor released an update on the same day.

09:13.560 --> 09:16.440
So unfortunately, I have to conclude

09:16.440 --> 09:18.800
that galactic gamma rays are not as bad

09:18.800 --> 09:21.360
as antivirus software.

09:21.360 --> 09:22.640
And there was a little bit of a dip here.

09:22.640 --> 09:24.560
We shipped a new crash reporter.

09:24.560 --> 09:26.640
We didn't get it right first time, but we fixed it.

09:30.360 --> 09:32.360
OK, what are we working on right now?

09:32.360 --> 09:35.600
The new crash reporter GUI, I already talked about that.

09:35.600 --> 09:38.920
We can detect some instances of hardware issues.

09:38.920 --> 09:41.400
If a crash doesn't make sense, but flipping a single bit,

09:41.400 --> 09:43.920
would make it make sense, we tag those crashes,

09:43.920 --> 09:46.440
because it's an indication it could be hardware.

09:46.440 --> 09:49.920
If we crash with an access violation read,

09:49.920 --> 09:52.680
but the instruction was a write, we also tag that,

09:52.680 --> 09:55.840
because that's completely impossible.
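
NOTE
A sketch of the two heuristics just described. The talk does not say how the
checks are implemented, so this assumes the analysis has a crash address, a
list of addresses that would have been plausible, and the access type of the
faulting instruction. All names here are hypothetical.
```python
def differs_by_one_bit(observed: int, expected: int) -> bool:
    """True if the two values differ in exactly one bit position."""
    diff = observed ^ expected
    # A non-zero value with a single set bit is a power of two.
    return diff != 0 and (diff & (diff - 1)) == 0
def looks_like_bit_flip(crash_address: int, plausible_addresses: list[int]) -> bool:
    """True if flipping one bit of the crash address gives a plausible one."""
    return any(differs_by_one_bit(crash_address, a) for a in plausible_addresses)
def access_mismatch(reported_access: str, instruction_access: str) -> bool:
    """True if the fault type contradicts what the instruction does."""
    return reported_access != instruction_access
```
A match is only an indication of bad hardware, not proof, so crashes would be
tagged rather than discarded.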

09:55.840 --> 09:58.920
We changed our telemetry to work on a new backend.

09:58.920 --> 10:01.320
Basically, this means we support it on Android as well.

10:01.320 --> 10:03.360
And there's a nice new dashboard if you want to look at it,

10:03.360 --> 10:05.400
which is public.

10:05.400 --> 10:09.400
And in the future, not only the GUI, but also the backend

10:09.400 --> 10:12.440
and the crash reporter, right now it runs in the same process

10:12.440 --> 10:15.320
that has just crashed, which can be rather interesting

10:15.320 --> 10:17.720
if that process has scribbled all over memory,

10:17.720 --> 10:20.280
and sometimes you don't manage to recover from this.

10:20.280 --> 10:22.760
So this is moving out of process.

10:22.760 --> 10:24.760
We're going to try something that's a bit of a gamble.

10:24.760 --> 10:27.880
The moment we show the crash reporter dialog,

10:27.880 --> 10:29.480
we can do a quick memory test,

10:29.480 --> 10:32.440
and maybe detect some more machines that are seriously broken,

10:32.440 --> 10:35.560
so we don't waste our time analyzing those.

10:35.560 --> 10:41.800
This should land this week or next week, so we'll see if it works.

10:41.800 --> 10:44.920
And remote crash collection, I'll show this,

10:44.920 --> 10:46.920
because it's a bit more illustrative.

10:46.920 --> 10:50.280
So in case we see from telemetry that people are crashing,

10:50.280 --> 10:53.400
but nobody has actually sent in a crash report.

10:53.400 --> 10:55.640
We want to prompt like a small number of users,

10:55.640 --> 10:59.720
a bit more explicitly if they want to send in that crash report.

10:59.720 --> 11:02.040
And then hopefully we can investigate it better.

11:02.040 --> 11:06.200
I hope we can even extend this to literally include the bug you're seeing,

11:06.200 --> 11:08.600
which I think would alleviate like the concerns

11:08.600 --> 11:12.120
that this is maybe a phishing site or something weird that's going on.

11:12.120 --> 11:13.480
And I hope if we do it this way,

11:13.480 --> 11:15.560
that also people will submit those crashes,

11:15.560 --> 11:18.280
and we can debug them.

11:18.280 --> 11:20.840
All right, that was all.

11:20.840 --> 11:22.360
Thank you for crashing, and.

11:22.360 --> 11:25.480
Thank you for your time.

11:26.680 --> 11:27.880
Thank you.

11:27.880 --> 11:29.080
Thank you.

11:29.080 --> 11:32.840
So I really think the key takeaway is please submit your crashes.

11:32.840 --> 11:33.160
Yeah.

11:33.160 --> 11:34.120
They don't have enough work.

11:34.120 --> 11:35.960
They want more.

11:35.960 --> 11:37.960
Do we have questions?

11:38.040 --> 11:42.680
Great.

11:42.680 --> 11:46.920
From my perspective.

11:46.920 --> 11:50.840
Have you seen a substantial decrease in the number of crashes

11:50.840 --> 11:54.440
when you introduced Rust into Mozilla overall?

11:54.440 --> 11:55.800
I would say those kind of things.

11:55.800 --> 11:57.880
I should actually be able to go back to this graph,

11:57.880 --> 11:59.880
because I've skipped over a lot of details here.

12:05.000 --> 12:07.640
So this looks kind of flat, but it's only because this spike

12:07.640 --> 12:09.720
was so ludicrously huge.

12:09.720 --> 12:12.040
But you can see it's actually a very spiky signal.

12:12.040 --> 12:14.200
A lot of these are like third party vendors,

12:14.200 --> 12:16.440
or third party software that does broken updates

12:16.440 --> 12:17.960
and messes with Firefox.

12:17.960 --> 12:19.240
Here it starts going down.

12:19.240 --> 12:20.600
That's not so much because of Rust.

12:20.600 --> 12:25.720
We did a lot of work on improving out-of-memory situations.

12:25.720 --> 12:30.520
Here it went down again, because we moved Windows 7 to ESR,

12:30.520 --> 12:32.840
so the ESR is not in this data.

12:32.840 --> 12:34.360
And here it starts going up again,

12:34.360 --> 12:36.920
very slightly, because a lot of Linux distros

12:36.920 --> 12:39.800
switched to Wayland, and the Wayland ecosystem is not as

12:39.800 --> 12:42.200
mature as everything else.

12:42.200 --> 12:44.760
But from Rust, I mean, the effect would be so gradual

12:44.760 --> 12:48.120
that I wouldn't be able to visualize it here.

12:48.120 --> 12:50.200
Yes.

12:50.200 --> 12:53.320
We actually get some benefit from Rust in our crash reporting

12:53.320 --> 12:56.680
tooling, because we have some updates, which

12:56.680 --> 12:59.080
catch every error condition in the crash reporting,

12:59.080 --> 13:02.040
and then surface them to us, to close a few black holes

13:02.040 --> 13:04.600
we had. And for that, Rust was incredibly useful,

13:04.600 --> 13:08.280
because it's very hard to skip on the error handling.

13:08.280 --> 13:09.480
A question about privacy.

13:09.480 --> 13:11.160
Is it safe to send a crash report?

13:11.160 --> 13:14.760
Is it safe to send a crash report when Firefox crashed

13:14.760 --> 13:19.320
when I was filling in a bank form or a government form?

13:19.320 --> 13:20.440
Very easy.

13:20.440 --> 13:22.520
That's why we have this distinction between the crash reports

13:22.520 --> 13:23.880
and the crash telemetry?

13:23.880 --> 13:27.240
The reports do contain, like, a dump of your memory,

13:27.240 --> 13:29.880
so there could be confidential data in there.

13:29.880 --> 13:32.120
Within Mozilla, that's extremely restricted.

13:32.120 --> 13:33.960
Like, you need to take an actual course

13:33.960 --> 13:36.280
before you can get access to that data.

13:36.280 --> 13:39.000
But if you send, if you opt into sending the crash report,

13:39.000 --> 13:40.520
you do send that data to us.

13:40.520 --> 13:44.040
So there's a bit of a decision for you, whether you're okay sharing that or not.

13:44.040 --> 13:48.520
Also, I can only confirm that we do apply the highest standards for that kind of data,

13:48.520 --> 13:50.760
and that's definitely not publicly accessible.

13:50.760 --> 13:54.760
Whereas the telemetry, yes, because that's just like a stack of Firefox functions.

13:54.760 --> 13:57.560
Can you say how many people have access to the crash reports?

13:57.880 --> 14:00.280
Not so many, right?

14:00.280 --> 14:02.280
Not so many, you know.

14:02.280 --> 14:06.360
I would guess it's 50 people or something, maybe one hundred.

14:06.360 --> 14:11.000
And it's something where the process to get access is complicated.

14:11.000 --> 14:13.960
Okay, other questions?

14:13.960 --> 14:14.960
No.

14:14.960 --> 14:16.680
Thank you very much.

14:16.680 --> 14:17.680
All right.

14:17.680 --> 14:19.680
Thank you.

14:19.680 --> 14:20.680
Thank you.

14:20.680 --> 14:21.680
What?

14:21.680 --> 14:24.040
Take a minute for a few more.

14:24.040 --> 14:26.040
So, thank you for joining the Mozilla devroom.

