WEBVTT

00:00.000 --> 00:16.560
All right, for those of you who may know me, and I know that there are a few in the audience

00:16.560 --> 00:20.960
who do you know that I am, in fact, not Alfonso.

00:20.960 --> 00:25.240
Unfortunately, Alfonso couldn't make it, so I'm jumping in for him.

00:25.240 --> 00:30.520
So today, we're going to be talking about random seats and state machines, and basically

00:30.520 --> 00:34.880
I'm going to show you how we employ in something called deterministic simulation testing

00:34.880 --> 00:37.120
at polar signals in rust.

00:37.120 --> 00:39.160
We've actually done this before and go as well.

00:39.160 --> 00:42.640
If you're interested in that, you can come and talk to me afterwards.

00:42.640 --> 00:48.120
But long story short, we ended up building a new database.

00:48.120 --> 00:52.640
And if you're interested in that, you can check out Tours Talk from yesterday in the database

00:52.640 --> 00:54.760
room where he talked about, you know.

00:54.760 --> 00:58.800
The decisions that went into that, the technologies, and so on.

00:58.800 --> 01:02.760
If you don't know at all what deterministic simulation testing is, that's okay.

01:02.760 --> 01:06.360
We'll go over everything.

01:06.360 --> 01:11.080
But before we dive into the details, I want to kind of set the setting why this is interesting,

01:11.080 --> 01:13.200
why this is useful.

01:13.200 --> 01:17.800
And for that, we do actually have Alfonso joining us.

01:17.800 --> 01:23.760
So basically, a couple of years ago, Alfonso worked at a different database company.

01:23.760 --> 01:27.600
We're not a database company, but we happen to build one.

01:27.600 --> 01:36.080
And there is a bug that he had been assigned to that was incredibly hard to reproduce.

01:36.080 --> 01:40.920
It only ever produced and very, very large machines over a very long period of time.

01:40.920 --> 01:47.920
So, you know, we probably all have been in this kind of situation in our careers that we had

01:47.920 --> 01:51.200
bug reports that were very, very hard to reproduce.

01:51.240 --> 01:56.080
So, that's ultimately kind of the setting of all of this.

01:56.080 --> 02:03.680
And Alfonso was trying to reproduce this with really expensive machines.

02:03.680 --> 02:10.200
And he thought, you know, I can just add a little bit of latency here and there and I'll

02:10.200 --> 02:11.800
tickle out the bug.

02:11.800 --> 02:15.840
And he thought he reproduced it, but it turns out that was actually not it.

02:15.840 --> 02:20.760
It was just a fluke, and it could not really reproduce it.

02:20.760 --> 02:25.200
And so, you know, time goes by, you can reproduce this bug.

02:25.200 --> 02:28.800
And, you know, other priorities come up.

02:28.800 --> 02:33.600
You can still not reproduce this, so eventually, you know, you just kind of kind of give up.

02:33.600 --> 02:36.760
And your organization says, nobody can reproduce this.

02:36.760 --> 02:40.480
We have other work here, so I'd rather have you work on this other thing.

02:40.480 --> 02:44.680
So, ultimately, you know, the way we can think of this is, there's this bug down here,

02:44.680 --> 02:49.880
the red dot, and there's this, you know, very high dimensional state space that can

02:49.880 --> 02:54.960
possibly reach this bug, or possibly not, right?

02:54.960 --> 03:03.240
And so, there are a gazillion combinations of things that may cost this bug, but ultimately

03:03.240 --> 03:10.120
what we want is one reproducible path that will always deterministically lead to this reproduction,

03:10.120 --> 03:11.120
right?

03:11.120 --> 03:15.480
That is the wish that we all have that we can actually reproduce these really difficult

03:15.480 --> 03:18.240
bugs reliably.

03:18.240 --> 03:23.200
And so, the true problem about all of this is randomness.

03:23.200 --> 03:24.600
The world is random, right?

03:24.600 --> 03:27.200
Like hardware fails randomly.

03:27.200 --> 03:30.440
Users behave in random ways.

03:30.440 --> 03:36.640
There's a bunch of things that is completely out of our control, but it is a reality that

03:36.640 --> 03:38.200
we have to work with.

03:38.200 --> 03:44.280
And so, this is the reason why these things are so freaking hard to reproduce, right?

03:44.280 --> 03:50.400
And then even if we do reproduce, sorry, if we do, if we do manage to find the fix for

03:50.400 --> 03:58.120
this bug, we still can definitively say that we have actually, in fact, this bug, right?

03:58.120 --> 04:03.040
If we can't actually reproduce it, how can we ever say for sure that we have, in fact,

04:03.040 --> 04:04.480
fixed it.

04:04.480 --> 04:07.320
And these bugs can be really expensive, right?

04:07.320 --> 04:12.680
We saw a funzer earlier, he spun up, you know, 16 of these really expensive machines,

04:12.680 --> 04:17.200
and the company happened to allow him to do this because it was a very high priority bug

04:17.200 --> 04:18.200
at the time.

04:18.200 --> 04:23.320
And so, companies and, you know, individuals end up spending a lot of money on bugs as

04:23.320 --> 04:26.480
well, potentially unnecessarily.

04:26.480 --> 04:31.800
And eventually, this can cause, you know, unhappy customers, unhappy users of your software,

04:31.800 --> 04:38.040
if you simply cannot, can never fix these bugs because you can reproduce them, and ultimately

04:38.040 --> 04:39.040
fix them.

04:39.600 --> 04:48.400
So, deterministic simulation testing is kind of the, one of the answers that we believe

04:48.400 --> 04:51.960
can help with these problems.

04:51.960 --> 04:57.280
And this was originally kind of popularized by FoundationDB, and the way that FoundationDB

04:57.280 --> 05:03.080
kind of went about this is they built this actor framework in C++, where they, and they built

05:03.160 --> 05:09.480
this framework specifically such that they built and accompanying simulator that they could

05:09.480 --> 05:14.760
simulate all of these failures and, and, and ran them way, it's, right?

05:14.760 --> 05:21.760
And so, that spark kind of, a number of companies to try and go up and do this themselves,

05:21.760 --> 05:27.000
Tiger Beatles, fairly popular for doing this, but there are a number of companies that maybe

05:27.000 --> 05:31.960
we haven't even heard of the, the storage team at Dropbox actually did this in the early 2010s

05:32.040 --> 05:33.360
already as well.

05:33.360 --> 05:39.760
A bunch of other companies have started started to do this, and a quick shout out to

05:39.760 --> 05:41.200
businesses as well.

05:41.200 --> 05:46.360
They basically, when one step further, they're the folks that originally did this with

05:46.360 --> 05:50.240
FoundationDB, and they've built a deterministic hypervisor so that you don't actually

05:50.240 --> 05:56.160
have to care about any of this in your application, and you can kind of try and explore

05:56.240 --> 06:01.800
this state space, the deterministically using their hypervisor and then also replay it

06:01.800 --> 06:02.800
deterministically.

06:02.800 --> 06:06.040
It's mind-blowing technology.

06:06.040 --> 06:13.400
Anyways, we're far too poor to pay for, for, for businesses, so we thought, you know, if

06:13.400 --> 06:20.840
we're already building something from scratch, we think we can build a simulator for ourselves.

06:20.840 --> 06:26.920
And so ultimately, there are four ingredients of randomness that we end up needing to control.

06:26.920 --> 06:32.800
The first one is event ordering, and ultimately what this means for us, and this is also

06:32.800 --> 06:38.960
one of the reasons why we ended up choosing Rust for our new database, is we need to be

06:38.960 --> 06:48.040
able to deterministically genuinely, deterministically execute an order of events in our system.

06:48.040 --> 06:54.840
And what that means is concretely on hardware is that for our simulator and our simulation

06:54.840 --> 06:59.320
test suite, they only to run on a single thread, because the moment that we go multi-threaded,

06:59.320 --> 07:03.120
we're already not deterministic anymore.

07:03.120 --> 07:07.240
And so, you know, in the very concretely what this means for our Rust application is that

07:07.240 --> 07:17.600
we use the like single-threaded Tokyo executor to make that happen.

07:17.600 --> 07:22.360
And then randomness, you know, very concretely, this can be random number generators,

07:22.360 --> 07:25.440
but also I'll go into this a little bit later.

07:25.440 --> 07:29.560
They're like implicit things that, you know, cause randomness as well.

07:29.560 --> 07:35.720
Time is a really difficult one, and this is one that we found was particularly difficult

07:35.720 --> 07:40.960
to control if it's not something that you think about from day one of your application,

07:40.960 --> 07:49.120
because, you know, if you end up calling, you know, time, time now somewhere in your program,

07:49.120 --> 07:50.560
you've kind of already lost, right?

07:50.560 --> 07:55.480
Like you need to be able to kind of see the time throughout the entire program.

07:55.480 --> 08:00.520
And then the very last thing, and this is ultimately how you end up really tickling out

08:00.520 --> 08:07.720
all the bugs of your code, you can randomly inject failures into in this simulator at any

08:07.720 --> 08:09.520
point in time.

08:09.520 --> 08:14.640
So once we can control all of these, we can actually build a test suite that we run

08:14.640 --> 08:19.520
against our system, we check some invariants about our system, and we can, you know, randomly

08:19.520 --> 08:21.960
control what happens.

08:21.960 --> 08:25.920
And so how do we structure our code to make that happen?

08:25.920 --> 08:31.320
So ultimately what we do is we have to relatively simple trade, the state machine, that

08:31.320 --> 08:36.080
implements the receive and the tick function.

08:36.080 --> 08:41.840
And ultimately, receive can be, you know, a response to any request that any other state

08:41.840 --> 08:49.800
machine may have produced, or a response to a request that a state machine has done.

08:49.800 --> 08:54.600
So, you know, let's say the state machine wants to do a request to some external system

08:54.600 --> 08:57.240
or something like that.

08:57.240 --> 09:05.480
So that way we can kind of control everything that happens external to the particular component.

09:05.480 --> 09:10.960
And tick, you know, either is there for periodic work generation.

09:10.960 --> 09:16.640
So, like, you know, some component does something on a schedule, but also, you can use it

09:16.640 --> 09:20.240
to kind of trigger timeouts and stuff like that.

09:20.240 --> 09:25.280
And the reason why we want this to be kind of separate is so that we can actually control

09:25.280 --> 09:27.200
this time component, right?

09:27.200 --> 09:29.680
And we need to always think about that these are state machines.

09:29.680 --> 09:33.680
So, like, every time that we call any of these functions, we mutate the state machine in

09:33.680 --> 09:38.760
some way, and output some messages to be consumed by others.

09:38.760 --> 09:47.040
And so that is actually something that Tyler Nealy of this led project, originally kind

09:47.040 --> 09:52.040
of came up with this trait for sled as well.

09:52.040 --> 09:57.280
And so we kind of just tried this out and happened to work quite well for us as well.

09:57.280 --> 10:00.720
So again, kind of taking a quick step back, why do we do this?

10:00.720 --> 10:07.920
So, we had two talks, earlier this day, today, in the rest room, we basically build a

10:07.920 --> 10:13.720
profiler and we built this database as a back end storage for this profiler.

10:13.720 --> 10:16.720
And then yesterday, we also had a talk in the database room, which happened to be in this

10:16.720 --> 10:21.960
room as well, that, you know, talks a little bit more about the architecture.

10:21.960 --> 10:25.120
But this talk is really just about how do we test this database, right?

10:25.120 --> 10:29.560
So, I just wanted to give you this in case you're interested in why do we do all of this.

10:29.560 --> 10:35.440
But on a very high level, the way that our system looks is that we have clients, clients

10:35.440 --> 10:40.280
push data, we have a component that we call the ingestor, that kind of buffers up data,

10:40.280 --> 10:43.280
and then ultimately flushes it to object storage.

10:43.280 --> 10:49.320
And then synchronously, there is a compaction component that regularly compacts data.

10:49.320 --> 10:53.760
Again, this isn't like the real, they're like a handful more components in the real system,

10:53.760 --> 10:58.480
but conceptually, this is how the system works, right?

10:58.480 --> 11:07.080
And so, the way that this end works in production and in the simulator is as follows.

11:07.080 --> 11:09.840
So, the simulator is essentially just a message bus.

11:09.840 --> 11:18.920
We saw that with our state machine interface, everything consumes and produces messages, right?

11:18.920 --> 11:26.080
And so, these are just messages that then go onto this, are cute into this message bus.

11:26.080 --> 11:30.000
And all of the components that we see here are just state machines.

11:30.000 --> 11:37.200
And in production, we have a separate driver for these state machines that actually take

11:37.200 --> 11:43.760
on a real interval, actually work with real networks, et cetera.

11:43.760 --> 11:50.680
And what's kind of nice about this is also that we can use a state machine to represent

11:50.680 --> 11:51.680
external components.

11:51.680 --> 11:57.760
So, obviously, we're not in writing the code for a GCS or S3 or something like that,

11:57.760 --> 12:05.040
but we can write a state machine that kind of acts as if we did, and we can control

12:05.040 --> 12:12.680
various failures and behaviors of those components, even if we don't fully control them.

12:12.680 --> 12:20.560
And so, like I said, work is produced on a tick and then transformed on receive.

12:20.560 --> 12:26.440
So, let's do this, I'll give you a real example of what this would look like in our testing

12:26.440 --> 12:27.440
harness.

12:27.440 --> 12:34.640
So, here we can see we've kind of initialized our queue with a bunch of ticks of our client.

12:34.640 --> 12:38.080
And again, our client is also just a state machine.

12:38.080 --> 12:45.880
And so, we tick and the client does its mutation of the state machine and ends up producing

12:45.880 --> 12:47.760
a new message onto the message queue.

12:47.760 --> 12:53.160
So, now the client says, I want the ingester to receive a write.

12:53.160 --> 12:59.920
And so, then the other components actually go about and receive this message, process it,

12:59.920 --> 13:00.920
and so on.

13:00.920 --> 13:10.080
And so, that's how, ultimately, we can produce a set of events in our system.

13:10.080 --> 13:14.680
And we can already see kind of, since we control this entire thing, I did this quite

13:14.720 --> 13:20.920
quickly, but like we can also, because we control all of this, we can reorder events randomly,

13:20.920 --> 13:22.440
we can drop events, right?

13:22.440 --> 13:28.040
Like we can do all sorts of interesting things that would be incredibly difficult to reproduce

13:28.040 --> 13:29.800
in a real environment, right?

13:29.800 --> 13:34.680
Potentially, potentially, we're testing even things that maybe are completely impossible, right?

13:34.680 --> 13:40.200
But maybe it's this one thing where something needs to be shut down for three years, and

13:40.200 --> 13:44.640
then come back online, and we can test all of those things.

13:45.400 --> 13:51.360
So, yeah, randomness, like I said, is something we need to control quite tightly, and something

13:51.360 --> 13:57.400
like this, and more generally, we need to write our entire system to be deterministic, right?

13:57.400 --> 14:03.520
So, this is something that's quite tricky to do, and so, obviously, there are kind

14:03.520 --> 14:09.960
of two kinds of randomness that would need to broadly care about quite deeply.

14:09.960 --> 14:14.120
One is, you know, I think the really obvious one, random number generators that are

14:14.120 --> 14:20.320
actually used by our software, that, you know, choose what next thing to execute or something

14:20.320 --> 14:21.320
like that.

14:21.320 --> 14:26.280
I'm obviously making something up, but, and the other thing is things that are implicit,

14:26.280 --> 14:31.720
so like hash map iterations, maybe different the next time that we run the same code,

14:31.720 --> 14:32.720
then it was the last time.

14:32.760 --> 14:41.160
So, we need to take a lot of care that, you know, every execution is actually deterministic,

14:41.160 --> 14:47.280
and every execution of the same input events will actually produce the same outputs.

14:47.280 --> 14:55.280
And so, one kind of neat trick that we found to ensure this is that we actually can test

14:55.280 --> 15:00.080
determinism, because what I've just said is, for the same inputs, we always must be getting

15:00.080 --> 15:06.960
the same outputs, and so, what we just do is we run the test harness with the same seed

15:06.960 --> 15:14.320
multiple times, and if that ends up generating a different set of events produced by those

15:14.320 --> 15:20.520
state machines again, then we've just proven that our system is in fact not deterministic,

15:20.520 --> 15:25.240
and then we can go and try to fix that determinism, right, so that we can make sure that

15:25.280 --> 15:30.720
when we do get a failure, it is actually deterministically reproducible.

15:30.720 --> 15:38.280
So, like I said, some of the, some things that we can do with deterministic simulation

15:38.280 --> 15:44.840
testing, because we control all of those four variables, including time, is we can say,

15:44.840 --> 15:50.280
you know, an ingester is supposed to tick, but it's actually only going to tick in six

15:50.280 --> 15:54.840
years, or I guess in five years, right, something that would be incredibly hard to do

15:54.840 --> 16:00.440
in real life, right, like how, who in their right mind would wait for actual five years of

16:00.440 --> 16:06.440
a production machine being online to test something, right, whereas for us it's just saying

16:06.440 --> 16:12.760
this tick has happened in five years in the future, and so this is really cool, because all

16:12.760 --> 16:22.120
the sudden we're no longer constrained by, like, real time, right, we're only constrained

16:22.120 --> 16:28.040
by how many CPU cycles can we actually run on a machine, right, and so what that means is we

16:28.040 --> 16:41.480
can simulate, you know, tens or hundreds of years of production sequences of events, every

16:41.480 --> 16:47.960
single day, right, and so this allows us to then find if there are any sequences that are, you

16:48.280 --> 16:57.160
potentially harmful for our system. So once again, let's look at an example of what could be

16:57.160 --> 17:01.720
interesting to simulate here, or what are the interesting things that are simulator might

17:01.720 --> 17:09.720
end up simulating, because, again, we originally randomly seed our simulator to try and explore

17:09.720 --> 17:16.360
random sequences of events, right, so one of the things that we could have is that, you know,

17:16.440 --> 17:24.680
we have on our message buzz, this message that our object storage is supposed to persist something,

17:24.680 --> 17:30.920
right, so one of the things that we can do is actually just say, this never happens, so ultimately

17:30.920 --> 17:36.440
what that means is the ingester should eventually retry, right, but another interesting that could

17:36.440 --> 17:45.880
happen is the right actually succeeds, but the acknowledgement is dropped, right, so all of these

17:46.200 --> 17:51.320
things, at any point in time, our simulator can decide, you know, now I'm not going to deliver

17:51.320 --> 17:55.880
this message, or now I'm going to restart this component, it's going to lose all of its state,

17:57.400 --> 18:02.440
and our system needs to be able to ultimately recover from this, and at all points in time,

18:02.440 --> 18:08.040
the invariance of our system might help must hold true, like snapshot isolation and what have

18:08.040 --> 18:18.200
you that we want to make sure are always true about our system, so I've talked a lot now about

18:18.200 --> 18:25.800
kind of the result or the ingredients that go into it, so what are some of the things that we found

18:25.800 --> 18:32.680
with this, so we've found some pretty devastating bugs using this methodology, so we found

18:33.000 --> 18:39.400
duplicate data being persisted in our system, we found cases of data loss, we found partial

18:39.400 --> 18:45.480
commits, we found random crashes of our system that shouldn't have happened, right, just about everything

18:45.480 --> 18:51.480
that you could possibly think of, and just so that you can actually visualize what this looks like

18:51.480 --> 18:58.200
in our CI, all that happens is that we generate a random seed at the start, we run our test

18:58.200 --> 19:05.560
suite with this random seed, it originally seeds our message queue with a number of actions,

19:05.560 --> 19:10.760
and then we just kind of let it go and do a thing, and like I said, we test a bunch of

19:12.920 --> 19:18.200
invariance about it the entire time, and if ever one of those variants don't hold true,

19:18.200 --> 19:24.920
we've found a bug, and one of the things that has been really interesting about this is that we

19:24.920 --> 19:31.000
actually have this true sequence of events, right, but it turns out this can be a lot of things,

19:31.000 --> 19:36.520
right, like we originally seeded with 64 events, but you know potentially hundreds of things can happen,

19:36.520 --> 19:43.240
and it can be quite difficult to understand what is the thing that has actually led to this bug,

19:43.240 --> 19:47.880
right, like yes we can now we're able to reproduce it, but still understanding what led to the

19:47.880 --> 19:53.640
bug is yet another thing, so one of the really cool things that we found to be really useful

19:54.200 --> 20:00.440
is that we have a sequence of events, right, so we can actually visualize this for you know

20:00.440 --> 20:08.360
our small tool humans to actually understand what has happened, so I'm just have a random example here,

20:08.360 --> 20:12.440
where we can see you know we have random sets of clients, random sets of ingestors,

20:13.160 --> 20:17.080
that all end up talking to each other, we have a random restart in the middle that happened

20:17.640 --> 20:24.200
and so yeah, this is just a random example of what you know a sequence like that may

20:24.200 --> 20:31.080
may look like, I'm going to have to go relatively quick, but like I said we've kind of tried this

20:31.080 --> 20:38.040
before and go, and with go what we had to do we kind of had to do you know unspeakable things

20:38.040 --> 20:45.960
to the runtime to make it deterministic, and that that did work, and it is kind of easier to do

20:46.040 --> 20:51.960
actually because once you make the runtime the deterministic you actually sort of get all the benefits

20:51.960 --> 21:00.360
out of that quite easily, however debugging with these state machines ends up being a little bit

21:00.360 --> 21:08.680
simpler, but one thing that definitely can be understated is changing your mental model about

21:08.760 --> 21:15.880
your system to always have to be within state machines is very different from the kind of you know

21:15.880 --> 21:23.080
suffering engineering that we've been exposed to before. I personally do think that this is a good

21:23.080 --> 21:28.120
thing because we always do think about the ultimate mutations in our system, but it can be quite

21:28.120 --> 21:34.680
difficult to kind of map what we want to do onto these state machines, and the simulator itself

21:34.680 --> 21:41.320
is actually really hard work, like it is software that needs as much care as any other test

21:41.320 --> 21:47.960
suite, any other software, so like if we start to neglect this it you know starts to start to show

21:47.960 --> 21:54.040
just like any other software. So yeah, the ST is not magic we need to put a lot of thought into it

21:54.840 --> 21:59.720
we thought about you know is there something better that we can do than just random seed that we start

22:00.200 --> 22:05.480
start with maybe we can start at some interesting state and then start to explore from there

22:05.480 --> 22:09.320
or we've previously determined this state is interesting and everything from there could be

22:09.320 --> 22:14.440
interesting to explore, right? There are a lot of things that can definitely be done. I said

22:14.440 --> 22:19.240
that I already said that we sometimes have to deal with a huge number of events and figure out

22:19.240 --> 22:26.280
what was the actual sequence, well we actually control the simulator, right? So we can try to shrink

22:27.000 --> 22:32.760
to the minimum amount of events that need to happen to actually reproduce the same

22:32.760 --> 22:39.800
same state, we call this the shrinking. So yeah with all of these ingredients and a lot of hard work

22:40.680 --> 22:48.520
you know we try to get rid of any unpredictable bugs and try to not have you know very critical

22:48.600 --> 22:52.920
bugs like this ever hit production through this testing. Thank you.

23:01.000 --> 23:03.720
Questions? We have a couple of questions.

23:03.720 --> 23:19.400
Thank you for the talk. I have a question when you have a specific seed. Can you speak up? I can't

23:19.400 --> 23:27.640
hear. If you have a seed that triggers a specific issue and you fix your state motions

23:27.720 --> 23:34.200
or can you ensure that the fix is correct because the seed will not match the new state

23:34.200 --> 23:42.200
machines to trigger the bug? Yeah great question. So basically you're saying how do we make sure

23:42.200 --> 23:49.800
that when we change the code that we're actually testing the same state space? So I mean we can

23:49.800 --> 23:54.520
test that by making sure that the log is the same, right? So if we actually do have the same

23:54.520 --> 24:03.080
the same sequence of events then we know that it is in fact the same thing. Obviously only

24:03.080 --> 24:11.000
leading up until the thing where we've proven that the invariant was violated, right? Does that make sense?

24:14.040 --> 24:21.800
I think you can have to talk outside because it's very interesting. Thank you very much.

24:24.520 --> 24:26.520
Thank you very much.

