WEBVTT

00:00.000 --> 00:12.000
In a previous life, I used to write enterprise firmware for HDDs and SSDs, and now it's my

00:12.000 --> 00:17.500
life mission to remove those from all systems I touch.

00:17.500 --> 00:21.420
This is my obligatory dog tax. Stick around, there'll be more later. This is my dog.

00:21.420 --> 00:29.040
So today, I'm going to talk about why our company is building a database. What

00:29.040 --> 00:34.200
wasn't working with the first database we built, how we decided to re-architect it,

00:34.200 --> 00:38.400
and then the open source projects that we built it upon.

00:38.400 --> 00:41.880
So first of all, why are we building a database? There's a lot of really good databases

00:41.880 --> 00:46.600
out there, so why don't we just use one of those? Well, there's a lot of reasons. One of

00:46.600 --> 00:51.080
them being we didn't want to be tied to certain licenses that might change out from under

00:51.080 --> 00:57.360
us, or billing that might change; it became a business decision. But we also thought that,

00:57.360 --> 01:02.440
because what we're doing is a little unique, not all of the off-the-shelf databases would fit our

01:02.440 --> 01:10.280
needs. We're building a database for continuous profiling. So continuous profiling

01:10.280 --> 01:15.520
(sorry, too close) is the process of capturing stack traces off the CPU, storing those in

01:15.520 --> 01:19.800
a database, and then allowing users to query back that data over time. This allows them

01:19.800 --> 01:24.680
to answer questions like, why is my application slow? What is the most expensive function

01:24.680 --> 01:29.400
across my entire infrastructure, or why does this user have a worse experience on our system

01:29.400 --> 01:34.920
than any of the other users? So the goals of our database were: it needed to be easy to

01:34.920 --> 01:39.840
operate; we're a really small company, we don't have a huge team, so it needed to be simple.

01:39.840 --> 01:43.400
It needed to be cost-effective at scale; we are trying to build a business, so we needed

01:43.400 --> 01:48.520
it to scale cost-wise with customers. And then we really wanted our storage and compute

01:48.520 --> 01:55.520
to be decoupled, that way reads don't affect our write path, and vice versa. So this

01:55.520 --> 02:01.480
is our first attempt at a profiling database. It was a database for stacks. So what this means

02:01.480 --> 02:05.920
is we take the stack trace that's on the CPU, we record all the functions we see on that

02:05.920 --> 02:12.200
stack trace, and we record the timestamp and the values that we see for that. So this gives

02:12.200 --> 02:17.400
us a really specific data shape. The data is repeated, we see the same functions called

02:17.400 --> 02:22.560
over and over, they're deeply nested, so function A calls function B, function B calls function

02:22.560 --> 02:28.280
C and D, and so on. The data is also immutable, once we've recorded a stack trace, it doesn't

02:28.280 --> 02:32.520
really make sense for a user to ask to update it, because it's just a snapshot of what happened

02:32.520 --> 02:38.920
on your CPU. It's temporal, so we're just capturing the data at a specific time. It's

02:38.920 --> 02:44.920
also non-relational; one stack trace isn't necessarily related to another. And importantly,

02:44.920 --> 02:50.120
it also has dynamic columns. You might be more familiar with these as labels

02:50.120 --> 02:55.320
in the metrics or logs world, where you can label your data with where it came from, what

02:55.320 --> 03:01.200
runtime it was using, what user ID is attached to the stack trace, things like that.
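
NOTE
A minimal Rust sketch of the data shape just described; the field names are illustrative, not FrostDB's actual schema.
```rust
use std::collections::BTreeMap;
/// One profiling sample: a stack trace observed on-CPU at a point in time.
struct Sample {
    /// Deeply nested call chain, e.g. ["A", "B", "C"]; values repeat a lot.
    stacktrace: Vec<String>,
    /// When the stack was captured; samples are immutable once recorded.
    timestamp_ms: i64,
    /// The observed value for this stack at this time (e.g. CPU samples).
    value: i64,
    /// Dynamic columns: label keys aren't known ahead of time,
    /// e.g. {"runtime": "go", "user_id": "42"}.
    labels: BTreeMap<String, String>,
}
```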

03:01.200 --> 03:06.440
And the last thing is not necessarily a data shape, but how the data comes in. This type

03:06.440 --> 03:11.320
of data is really write-dominated. You might have a cluster of 1,000 nodes shipping profiling

03:11.400 --> 03:16.440
data to you, but you probably only query that data once or twice a day. It's user

03:16.440 --> 03:23.560
driven, so it's just going to be naturally more write-dominated. So this lends itself really

03:23.560 --> 03:29.240
well to columnar storage, and thank you, Matt Topol, for pointing out Apache Arrow. That's

03:29.240 --> 03:33.800
typically half the story. Apache Arrow is really good as the transfer format, the in-memory

03:33.800 --> 03:39.880
format. But then there's its companion piece, the file storage format, the de facto standard

03:39.880 --> 03:46.120
being Apache Parquet. This stores all the data in columnar format, and we get really good

03:46.120 --> 03:51.080
compression out of it, which is really important for our repeated nested immutable data.

03:53.160 --> 03:57.960
So this is where we came up with FrostDB. It's written in Go. It's backed by Arrow and Parquet.

03:59.400 --> 04:05.800
It was built on top of an LSM-ish tree. I say LSM-ish because, again, our database is immutable,

04:05.800 --> 04:10.840
so we don't actually need to deal with updates or deletes. So it basically just looks like we

04:10.840 --> 04:15.400
have a bunch of Aero records in memory as we receive them. Once we accumulate enough of those,

04:15.400 --> 04:20.040
we compact them into Parquet files. Once we accumulate enough of those, we compact those Parquet files into

04:20.040 --> 04:26.760
one just bigger, chunkier Parquet file and ship it off to object storage. So great. We had a database.
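
NOTE
A condensed sketch of that LSM-ish first level using the arrow and parquet crates; the row threshold and file name are made up, and the later merge-and-upload pass is left as a comment.
```rust
use std::{fs::File, sync::Arc};
use arrow::{datatypes::Schema, record_batch::RecordBatch};
use parquet::arrow::ArrowWriter;
/// Arrow records accumulate in memory as they're received.
struct MemBuffer {
    schema: Arc<Schema>,
    batches: Vec<RecordBatch>,
    rows: usize,
}
impl MemBuffer {
    fn insert(&mut self, batch: RecordBatch) -> parquet::errors::Result<()> {
        self.rows += batch.num_rows();
        self.batches.push(batch);
        if self.rows >= 1_000_000 {
            // Compact the in-memory Arrow records into one Parquet file;
            // a later pass merges these into bigger, chunkier files and
            // ships them off to object storage.
            let file = File::create("level0.parquet")?;
            let mut w = ArrowWriter::try_new(file, self.schema.clone(), None)?;
            for b in self.batches.drain(..) {
                w.write(&b)?;
            }
            w.close()?;
            self.rows = 0;
        }
        Ok(())
    }
}
```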

04:27.800 --> 04:32.840
We built it out for enterprise scale. We basically sharded out a bunch of FrostDB instances.

04:32.840 --> 04:37.240
Writes came in, they'd get split across a bunch of instances. Reads would just fan in from those

04:37.240 --> 04:43.400
instances, and then we used preemptible instances to keep our costs really low. So this is great.

04:43.400 --> 04:50.760
We had a database that was easy to manage. It scaled, and it was working. And then we had customers,

04:50.760 --> 04:58.440
and they ruined it. So our customers ended up growing bigger. They started using more and more of

04:58.440 --> 05:03.560
our database, which was good. But we got to a point where we were unable to write out our memory

05:03.560 --> 05:08.040
in time for preemption. So we had preemptible instances. We'd get notified it was going to be shut down.

05:08.040 --> 05:11.880
We tried to write it out, but we had so much data in memory that we couldn't get it there in time.

05:12.680 --> 05:16.840
We could have solved this by scaling out the number of instances further, but now suddenly our cloud

05:16.840 --> 05:22.200
bill just skyrocketed and we're not accomplishing our goals. So we reached into every database

05:22.280 --> 05:30.440
developer's toolkit, and we grabbed a write-ahead log for durability. So now writes are written to the

05:30.440 --> 05:34.840
write-ahead log on a persistent disk, and then they can be acknowledged to the clients. It's great.

05:35.400 --> 05:39.000
We no longer need to write out our data when we're preempted. We can just shut down, come back

05:39.000 --> 05:44.440
up, replay the WAL, carry on from where we left off. Everything's working again. Now we've coupled

05:44.440 --> 05:51.400
compute and storage a little bit, but it's not too bad. And those damn customers.
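
NOTE
A minimal write-ahead log sketch in Rust matching the flow just described: append and fsync before acknowledging the client, replay on startup. Real WALs use length-prefixed framing; newline framing here just keeps the sketch short.
```rust
use std::fs::{File, OpenOptions};
use std::io::{BufRead, BufReader, Result, Write};
struct Wal {
    file: File,
}
impl Wal {
    fn open(path: &str) -> Result<Self> {
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Self { file })
    }
    /// Durably append one record; only after this returns do we ack the client.
    fn append(&mut self, record: &[u8]) -> Result<()> {
        self.file.write_all(record)?;
        self.file.write_all(b"\n")?;
        self.file.sync_data()
    }
    /// On restart: replay every record back into memory, then carry on.
    fn replay(path: &str, mut apply: impl FnMut(Vec<u8>)) -> Result<()> {
        for record in BufReader::new(File::open(path)?).split(b'\n') {
            apply(record?);
        }
        Ok(())
    }
}
```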

05:52.760 --> 05:56.840
So because we're replaying a WAL, our customers noticed that while that was happening,

05:56.840 --> 06:01.240
their data was no longer available. We couldn't answer a query because we hadn't replayed it from

06:01.240 --> 06:07.960
the write-ahead log. All right, well, fair enough, we had to go back and start replicating the writes.

06:08.840 --> 06:13.000
So now each of our FrostDB instances sees a copy of the write. Technically we're

06:13.000 --> 06:18.040
triplicating the writes, but now when we're replaying the WAL, customers can read their data because

06:18.040 --> 06:21.320
they'll just get it from a different node that isn't currently replaying. Great.
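
NOTE
A sketch of that triplicated write path; the Replica type is hypothetical and stands in for whatever RPC client talks to a FrostDB instance.
```rust
use std::io::Result;
/// Hypothetical handle to one FrostDB instance.
struct Replica;
impl Replica {
    async fn send(&self, _record: &[u8]) -> Result<()> {
        Ok(()) // imagine an RPC here
    }
}
/// Ack the client only once every replica has the write, so a node that's
/// busy replaying its WAL can be skipped at read time without losing data.
async fn replicate(replicas: &[Replica; 3], record: &[u8]) -> Result<()> {
    for r in replicas {
        r.send(record).await?; // real code would retry or handle partial failure
    }
    Ok(())
}
```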

06:21.320 --> 06:27.720
Happy customers again. We have availability. We're not losing data. Okay, but we did just increase

06:27.720 --> 06:33.000
our cost a little bit because now we have three times the data footprint. All right. Well, it's working.

06:34.040 --> 06:40.760
Except for those damn customers. Now we have long replays, because our customers'

06:40.760 --> 06:45.880
footprints grew so big on the nodes that the nodes just become really out of date after a replay.

06:45.880 --> 06:50.840
So if a replay took 10 minutes, that node wasn't receiving writes for 10 minutes. And so now it's

06:50.920 --> 06:57.160
10 minutes behind the other nodes and can't answer queries again. So yet again, we have to make changes

06:57.160 --> 07:03.480
to our system. So we added replay coordination. Now when a node comes up, it replays its WAL.

07:03.480 --> 07:07.720
It talks to the other nodes. It says, hey, how far behind am I? They say, these are the writes

07:07.720 --> 07:12.840
you're missing. You're this far behind. They send them over. Great. The node replays those writes

07:12.840 --> 07:16.840
from the other nodes. And now it's caught back up and is a fully functioning node in the system.
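
NOTE
A sketch of that replay coordination, with hypothetical types: after replaying its own WAL, a node asks a peer for everything past its last-known write offset.
```rust
/// Hypothetical peer handle; imagine an RPC client underneath.
struct Peer;
impl Peer {
    /// "These are the writes you're missing."
    async fn writes_since(&self, _offset: u64) -> Vec<Vec<u8>> {
        Vec::new()
    }
}
/// "Hey, how far behind am I?" Replay the missing writes from a peer,
/// after which this node is fully caught up and can answer queries.
async fn catch_up(my_offset: u64, peer: &Peer, mut apply: impl FnMut(Vec<u8>)) {
    for record in peer.writes_since(my_offset).await {
        apply(record);
    }
}
```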

07:17.560 --> 07:22.920
All right. We're back to a working system. But if you haven't gotten the gist of my talk,

07:22.920 --> 07:29.720
it's those damn customers again. So some of those customers got so big that they created really

07:29.720 --> 07:35.560
complex failure scenarios. So one of the things we saw was we'd have a node that would

07:35.560 --> 07:41.560
go down, come back up, replay its WAL, request a bunch of data. It took so long to replay its WAL

07:41.560 --> 07:46.120
that it got really far behind. The data it requested was too much for the node to handle. So it got

07:46.120 --> 07:51.960
OOM-killed. Kubernetes reschedules it. It comes back. Now it's further behind. And you can see where

07:51.960 --> 07:57.400
this goes: it requests more data, crashes, and cycles again. It's a vicious cycle. Often what would

07:57.400 --> 08:00.680
happen is that node would be doing that for so long that another node would go down, and then it would just

08:00.680 --> 08:06.920
create this cascading failure to the point where the whole database was down. So one of my favorite

08:06.920 --> 08:12.120
systems quotes is every system is perfectly designed to get the results that it does. So we had

08:12.200 --> 08:17.480
perfectly designed our system to be challenging to operate, expensive at any scale, and we had

08:17.480 --> 08:24.360
hard dependencies between storage and compute. We're awesome. So clearly none of our architecture

08:24.360 --> 08:30.440
goals were being met. It was time for a new architecture. So this is where Apache DataFusion comes in.
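
NOTE
A taste of what reaching for DataFusion looks like, as a hedged sketch: the table and column names are invented for illustration, but SessionContext, register_parquet, and sql are real DataFusion APIs.
```rust
use datafusion::prelude::*;
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Hypothetical profiling table stored as Parquet.
    ctx.register_parquet("profiles", "profiles.parquet", ParquetReadOptions::default())
        .await?;
    // "What is the most expensive function across my infrastructure?"
    let df = ctx
        .sql("SELECT function_name, SUM(value) AS cpu FROM profiles \
              GROUP BY function_name ORDER BY cpu DESC LIMIT 10")
        .await?;
    df.show().await?;
    Ok(())
}
```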

08:30.440 --> 08:35.560
This is the first thing we reached for. We learned when we were making FrostDB that creating

08:35.560 --> 08:41.160
your own query engine is really hard to get right. It's really hard to get fast. And we

08:41.160 --> 08:45.080
really didn't want to invest all of our time again in building that. So we reached for something

08:45.080 --> 08:50.280
open source. DataFusion is really fast. It's written in Rust. It's columnar by default. It's

08:50.280 --> 08:54.760
very extensible and it works with many table formats out of the box. So it didn't really lock us

08:54.760 --> 08:57.960
into what the rest of our database could look like, but it was going to give us a lot of free

08:57.960 --> 09:01.880
performance. So we were really excited about that. So we decided to move forward with

09:01.880 --> 09:07.320
DataFusion and writing our new database in Rust. The second thing we reached for, on the write path,

09:07.320 --> 09:13.240
was Delta Lake. Delta Lake was donated by Databricks to the open source community. It's an

09:13.240 --> 09:19.320
open table format, much like Apache Iceberg. It provides ACID transactions, really fast writes,

09:19.320 --> 09:23.320
quick reads through file pruning and schema evolution, which is again really important for our

09:23.320 --> 09:28.680
database, because we don't know the schema of the labels we'll see beforehand. So great. So our

09:28.680 --> 09:34.280
database looks like this right now. We have clients writing Arrow. The Arrow gets written to Parquet

09:34.280 --> 09:39.480
in object storage and then committed to the Delta log. The only problem with this is that this is a

09:39.480 --> 09:43.960
rather expensive architecture. Remember, you might have thousands of nodes, each of those representing

09:43.960 --> 09:49.640
a client, all writing to the tables. This creates, you know, one Parquet file per write and then a

09:49.640 --> 09:54.760
commit per write. So there's huge write amplification, which becomes really expensive. Most cloud providers bill you for every

09:54.760 --> 10:02.920
write you do. So it's kind of an expensive database to operate. So we inject a buffering writer. Because

10:02.920 --> 10:09.240
it's an observability database, usually you don't need to see your data within sub-second intervals.

10:09.240 --> 10:14.200
So basically what we did is we no longer acknowledge a write to the client until it's buffered

10:14.200 --> 10:19.480
in this buffering writer and then flushed to object storage. Once it's fully committed to the log,

10:19.480 --> 10:23.960
then we'll acknowledge the client. So it's however long it takes to fill a pre-configured

10:23.960 --> 10:28.840
buffer size, or one second, whichever comes first. So there's up to a one-second delay in being able to read

10:28.840 --> 10:34.200
the data you write, but clients don't care that they can't see the last sub-second of data.
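
NOTE
A hedged sketch of that buffered flush using the deltalake (delta-rs) crate; the exact API surface varies by version, so treat this as the shape of the idea rather than copy-paste code.
```rust
use deltalake::arrow::record_batch::RecordBatch;
use deltalake::{open_table, DeltaOps, DeltaTableError};
/// Flush up to one second's worth of buffered Arrow batches as a single
/// Parquet file plus a single Delta commit, then ack all buffered clients.
async fn flush(uri: &str, buffered: Vec<RecordBatch>) -> Result<(), DeltaTableError> {
    let table = open_table(uri).await?; // e.g. "s3://bucket/primary"
    DeltaOps(table).write(buffered).await?; // far fewer object-store writes
    Ok(()) // only now do we acknowledge the buffered clients
}
```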

10:34.200 --> 10:39.560
So great. This is a lot less expensive to operate because now we have larger files and fewer

10:39.560 --> 10:47.400
writes to the system. The problem with this is if you notice we have lots of clients from different

10:47.400 --> 10:52.840
customers all being shoved into a single table. So now if we have a really big client, it can

10:52.840 --> 10:57.080
affect the reads for other smaller clients because you have to first sift through that big

10:57.080 --> 11:02.600
client's data to get at the data you want. So now we have this weird paradigm where one customer

11:02.600 --> 11:09.080
can kind of ruin the query experience for other customers. So this is where we add a compactor

11:09.080 --> 11:15.640
and tenant tables. So it's just Delta Lake again, but now once a customer hits a pre-configured

11:15.640 --> 11:22.040
size in the primary table, we basically slurp all that data up, write a new, bigger Parquet file in

11:22.040 --> 11:26.920
a different Delta table that's just for that tenant and we keep really good performance. So now

11:26.920 --> 11:30.920
when a read happens, they just have to get their reads from the primary table which is kept

11:30.920 --> 11:37.800
rather small and then their reads from their tenant table. So great. This is working really well.
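
NOTE
A sketch of the compactor's decision loop; the helpers are hypothetical stand-ins for Delta Lake table operations, and the threshold is illustrative.
```rust
use arrow::record_batch::RecordBatch;
// Hypothetical helpers standing in for Delta Lake reads and writes.
async fn read_tenant_rows(_table: &str, _tenant: &str) -> Vec<RecordBatch> { Vec::new() }
async fn write_tenant_table(_table: &str, _rows: Vec<RecordBatch>) {}
async fn delete_tenant_rows(_table: &str, _tenant: &str) {}
/// Once a tenant's footprint in the shared primary table passes a
/// pre-configured size, slurp its rows up into one bigger file in a
/// per-tenant table so it stops slowing down everyone else's reads.
async fn maybe_compact(tenant: &str, bytes_in_primary: u64) {
    const THRESHOLD: u64 = 512 * 1024 * 1024; // illustrative
    if bytes_in_primary >= THRESHOLD {
        let rows = read_tenant_rows("primary", tenant).await;
        write_tenant_table(&format!("tenants/{tenant}"), rows).await;
        delete_tenant_rows("primary", tenant).await;
    }
}
```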

11:37.800 --> 11:42.680
It's really simple to operate. You'll notice we have very few moving pieces. We have no disks.

11:42.680 --> 11:47.080
Thank goodness. Everything's working great, but one thing we noticed and one of the cool things

11:47.080 --> 11:52.600
about building a profiling database is you get to profile your own database. So as we were doing this,

11:52.600 --> 11:58.520
we learned, hey, we're spending an inordinate amount of time converting Parquet to Arrow every

11:58.520 --> 12:03.640
time we need to move data or read data. There's a huge serialization cost. We also noticed

12:03.640 --> 12:09.160
with Delta Lake that the normal commit log is all written in JSON, and sure, JSON serialization can

12:09.160 --> 12:15.960
be fast, but when you do it a ton, that ends up being a huge footprint for your CPU. So shout out

12:15.960 --> 12:23.160
to our friends at SpiralDB for their open file format called Vortex. Vortex offers zero-copy

12:23.160 --> 12:27.640
reads for Apache Arrow. So basically we no longer have to serialize the data when we write all our

12:27.640 --> 12:32.760
Arrow records to files, and we don't have to deserialize it when we read it back into memory. So we actually

12:32.760 --> 12:39.000
replaced all the JSON files as well as all of our Parquet files with Vortex. This gave us faster queries,

12:39.000 --> 12:43.640
zero-copy reads, which gave us lower memory overhead. And then one thing we're really excited about

12:43.640 --> 12:49.320
is it's an extensible file format so we can bring our own encoding schemes. We haven't done this yet,

12:49.320 --> 12:53.960
but there's a lot of room to grow there in terms of query performance for profiling data.
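
NOTE
Vortex's own API isn't shown here; this is just a sketch of the seam that made the swap possible, with the file format hidden behind one trait so Parquet could be replaced without touching the rest of the database.
```rust
use arrow::record_batch::RecordBatch;
/// Everything above this trait thinks in Arrow; everything below it is a
/// file-format detail. Swapping Parquet for Vortex means swapping one impl.
trait FileFormat {
    /// Persist Arrow batches. With Parquet this pays a serialization cost;
    /// Vortex's layout keeps reads and writes (near) zero-copy for Arrow.
    fn write(&self, batches: &[RecordBatch]) -> std::io::Result<Vec<u8>>;
    /// Load back into Arrow for the query engine.
    fn read(&self, bytes: &[u8]) -> std::io::Result<Vec<RecordBatch>>;
}
```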

12:56.200 --> 13:01.720
So now we have an easy-to-operate database. It's really cost-effective at scale. The storage

13:01.720 --> 13:08.040
and compute are completely decoupled. Overall, from our FrostDB database to this one, this has gotten

13:08.040 --> 13:14.520
us a 50% cost reduction. We've decreased query latency by over 200%, and we no longer

13:14.520 --> 13:21.960
have data unavailability. So big shout out to all the open source projects that made this possible.

13:21.960 --> 13:26.440
We literally just built on top of all these things which is pretty cool to go from zero to a database

13:26.440 --> 13:34.680
in just a year's time with these awesome projects. Last dog tax. All right. Any questions?

13:35.640 --> 14:01.400
Yeah, exactly. So the question was, did we consider other table formats?

14:01.400 --> 14:08.520
So yeah, we looked at basically the big three, which is, you know, Iceberg, Delta Lake,

14:08.520 --> 14:13.560
and that third weird one I can't remember the name of. Hudi, thank you. What a weird name.

14:13.560 --> 14:19.240
Okay. Yeah. So we looked at that and one of the things that really stood out for Delta Lake

14:19.240 --> 14:24.360
was the way it handled writes, because our workload is so write-dominated and we need to get really

14:24.360 --> 14:30.040
fast writes in there. And because we pay our cloud provider for every write we do to object storage,

14:30.040 --> 14:35.320
we wanted something that has the smallest amount of write amplification and that can do writes

14:35.320 --> 14:40.200
the fastest. A typical Delta Lake write only requires you to write a single commit file.

14:40.200 --> 14:45.880
So it's a single write, potentially with a checkpoint. But it's a single write for most writes.

14:45.880 --> 14:51.240
So that was really advantageous for our type of database. Whereas Apache Iceberg is usually like

14:51.240 --> 14:56.280
two or three different writes, because you have to update manifests and then the commit version and stuff.

14:56.280 --> 15:08.760
So that was why we picked that. Um, say that again? Yeah, I think it does make a difference

15:08.760 --> 15:13.720
because, sorry, yes. So the question was, even with the buffered writes in front of

15:13.720 --> 15:19.480
the database, does the choice of open table format still matter? And we think it still makes a

15:19.480 --> 15:23.240
big difference, because it's still one write versus three every time we flush that buffer.
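
NOTE
Back-of-envelope, under assumed numbers: flushing once per second is 86,400 flushes per day per table. At a typical $0.005 per 1,000 PUT requests, one write per flush is about $0.43/day, while three writes per flush is about $1.30/day, before multiplying across tables and replicas.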

15:24.120 --> 15:28.840
Any other questions? All right, in the back, you got to be loud.

15:35.560 --> 15:41.000
Yeah, yeah. So there's a lot of cost factors in our old database. So one of them is we had to

15:41.000 --> 15:45.880
triplicate the data. So now we have all of these extra nodes holding on to this data.

15:45.880 --> 15:50.440
We had to add disks to all of those nodes because of the write-ahead log, which just costs money.

15:50.920 --> 15:51.880
So we had to deal with that.

15:54.440 --> 15:58.520
Those were kind of our big primary factors: our compute and our disk storage. I think I'm probably

15:58.520 --> 16:01.960
missing some other things. I think our networking was also a cost because we were triplicating the

16:01.960 --> 16:07.640
data across regions. So our network bandwidth actually ended up being quite expensive for that as well.

16:10.840 --> 16:15.640
Yeah, so the speedup has kind of two really big components. One of them being that

16:16.360 --> 16:22.360
DataFusion was better than our Go implementation of a query engine.

16:23.000 --> 16:27.880
Maybe if we had continued to spend more effort, we could have gotten there, but it was just faster.

16:27.880 --> 16:32.040
And then the other decision, this isn't necessarily strictly related to the database change,

16:32.040 --> 16:37.160
was replacing Parquet with Vortex, which gave us huge query performance gains because we're

16:37.160 --> 16:40.760
no longer spending any time serializing to Arrow. So those are kind of our biggest gains.

16:43.320 --> 16:44.200
All right, thanks a lot.

16:45.640 --> 16:47.720
Thank you.

