WEBVTT

00:00.000 --> 00:12.120
Okay, so we have our next talk. Alexey is going to tell us about Rust in ClickHouse, so please

00:12.120 --> 00:21.440
give him a warm welcome.

00:21.440 --> 00:27.560
My presentation today is not about how to rewrite everything in Rust.

00:27.560 --> 00:35.120
It's more about how to not waste years in rewriting, but do something more sane.

00:35.120 --> 00:40.520
It's about our approach to Rust in ClickHouse.

00:40.520 --> 00:42.000
So what is ClickHouse?

00:42.000 --> 00:50.600
It's an open source project, it's pretty big, almost 2,000 contributors, 43,000 stars.

00:50.600 --> 00:57.160
It's a C++ code base, mostly C++, I would say 99%, maybe 95%.

00:58.160 --> 01:04.960
One and a half million lines of code. It has existed since 2009, and today it's the most popular

01:04.960 --> 01:07.960
open source analytical database.

01:07.960 --> 01:17.560
And I can say C++ is not the nicest language, but in 2009 neither was Rust.

01:17.560 --> 01:25.480
So we started with C++, and it was quite a good choice, back then, and it's still

01:25.480 --> 01:31.600
quite popular in databases, database management systems, in graphics applications,

01:31.600 --> 01:42.600
video games, computer-aided design, operating systems, drivers, scientific data analysis.

01:42.600 --> 01:50.960
So it still has its place, but the question is, if we started today, should we write ClickHouse

01:50.960 --> 01:53.040
in Rust, not in C++?

01:54.000 --> 01:58.480
Let's take a look.

01:58.480 --> 02:07.440
And the first question: is C++ a pain? Yes, it is.

02:07.440 --> 02:16.160
In big projects, people try to solve this pain a little by adding even more pain.

02:16.160 --> 02:20.960
I mean many, many ways to test the code base.

02:20.960 --> 02:28.500
So they have segmentation faults, they have data races, and they have to use all types of

02:28.500 --> 02:34.520
sanitizers: address sanitizer, thread sanitizer, memory and undefined behavior sanitizers.

02:34.520 --> 02:41.960
They have to use fuzzing, and we have to run all types of tests, including randomized

02:42.040 --> 02:47.720
tests, stress tests, performance tests, functional integration tests, compatibility tests,

02:47.720 --> 02:54.080
logic tests, Jepsen tests for the correctness of distributed applications.

02:54.080 --> 03:01.640
And so on. We run about 10 to 30 million test invocations each day.

03:01.640 --> 03:08.640
So it's quite a big pain, and you know what? We still have segfaults in production.

03:09.640 --> 03:13.640
So will it be different with Rust?

03:13.640 --> 03:19.640
There are quite a lot of reasons for using Rust today.

03:19.640 --> 03:26.640
The obvious one is memory and thread safety, but it is the most obvious.

03:26.640 --> 03:34.640
Another reason is that many modern libraries, many modern projects, only exist in Rust.

03:34.640 --> 03:42.640
For example, libraries for data lakes. Like for Iceberg, for a long, long time

03:42.640 --> 03:45.640
the only library was in Java.

03:45.640 --> 03:54.640
The same was true for Delta Lake, but then, for Delta Lake, Databricks implemented a Rust library.

03:54.640 --> 03:59.640
But a C++ library, a good C++ library, still does not exist.

03:59.640 --> 04:07.640
If we don't use either Java or Rust, how can we use Delta Lake?

04:07.640 --> 04:11.640
And another reason is there is a lot of hype around Rust.

04:11.640 --> 04:17.640
I see this full audience, and I would say, yes, this is true.

04:17.640 --> 04:25.640
If this talk was about some boring C++ stuff, I would probably have just a few people in the audience.

04:25.640 --> 04:29.640
Maybe not here. But there are a few arguments against rewriting in Rust.

04:29.640 --> 04:34.640
And the main argument: if we do a full rewrite, it will take years.

04:34.640 --> 04:40.640
A mature database, one and a half million lines of code: how do you do a full rewrite?

04:40.640 --> 04:45.640
We cannot just stop and allocate a year for doing this.

04:45.640 --> 04:48.640
And it will take not a year, it will take many years.

04:48.640 --> 04:58.640
Another reason is that using Rust is simple, but using C++ is slightly less so.

04:58.640 --> 05:04.640
But if you use both C++ and Rust, you will get pain from both of the languages.

05:04.640 --> 05:12.640
So, maybe it's not a good choice to use two languages at the same time.

05:13.640 --> 05:18.640
And there is a lot of accumulated knowledge about C++, so why should we throw away this knowledge?

05:18.640 --> 05:19.640
Just to rewrite.

05:19.640 --> 05:23.640
There is too much drama about Rust.

05:23.640 --> 05:33.640
What if we write a database and someone will troll us just because we use unsafe in one file,

05:33.640 --> 05:38.640
and we use unwrap in another file, and people will hate us.

05:38.640 --> 05:41.640
But I want something boring.

05:41.640 --> 05:45.640
I want something sane.

05:45.640 --> 05:46.640
Okay.

05:46.640 --> 05:50.640
By the way, performance and efficiency are not the deciding factor.

05:50.640 --> 05:54.640
You can write equally performant code, both in C++ and Rust.

05:54.640 --> 05:59.640
Sometimes it will be easier in Rust, because you can do quicker iterations.

05:59.640 --> 06:01.640
You can optimize quickly sometimes.

06:01.640 --> 06:07.640
You can do it faster in C++ just by avoiding complications with the borrow checker.

06:07.640 --> 06:10.640
Okay.

06:10.640 --> 06:16.640
So, the approach is not to do a full rewrite, but to do iterative development.

06:16.640 --> 06:25.640
To find some library, some small feature that we don't really care about.

06:25.640 --> 06:32.640
And just to test, can we use a library in Rust integrated in C++,

06:32.640 --> 06:37.640
and use it as a gateway for our Rust development.

06:37.640 --> 06:41.640
If it succeeds, we will add more and more libraries,

06:41.640 --> 06:44.640
and maybe we will attract more engineers.

06:44.640 --> 06:49.640
Maybe we will gradually rewrite our code.

06:49.640 --> 06:55.640
We use the CMake build system, and CMake has some way for integration.

06:55.640 --> 07:02.640
In 2022, it was a library named Corrosion.

07:02.640 --> 07:04.640
So, we selected the library.

07:04.640 --> 07:13.640
Then we found one student, just because we did not want our full-time employees to lose their sanity

07:13.640 --> 07:18.640
if we asked them to drop C++ and write in Rust.

07:18.640 --> 07:23.640
And we found one function that we didn't really need.

07:23.640 --> 07:27.640
There was a Rust library for BLAKE3.

07:27.640 --> 07:35.640
Then we decided: why not add this function to SQL and see what happens?

07:35.640 --> 07:37.640
And actually, it succeeded.

07:37.640 --> 07:44.640
We integrated this library and even wrote an article about how BLAKE3 in Rust is faster.

07:44.640 --> 07:48.640
By the way, then we replaced it with an implementation from LLVM.

07:48.640 --> 07:52.640
That is in C++, so it did not really matter, but.

07:52.640 --> 08:03.640
The point is, this was the first thing we integrated with Rust code.

08:03.640 --> 08:08.640
Here is a pull request from this guy.

08:08.640 --> 08:11.640
And what was the second?

08:11.640 --> 08:14.640
I heard that Rust is good for terminal applications.

08:14.640 --> 08:19.640
Sometimes I think that the only thing that people do in Rust is writing

08:19.640 --> 08:23.640
and rewriting terminal applications.

08:23.640 --> 08:26.640
So, I decided we have a terminal application.

08:26.640 --> 08:27.640
ClickHouse client.

08:27.640 --> 08:31.640
Why not improve it with Rust?

08:31.640 --> 08:37.640
There is a nice library for history search.

08:37.640 --> 08:42.640
And we decided to try.

08:42.640 --> 08:46.640
And it was also made by an external developer.

08:46.640 --> 08:57.640
Now he works for ClickHouse, and he writes in C++.

08:57.640 --> 09:04.640
There were some problems: we integrated it, and our terminal application crashed.

09:04.640 --> 09:12.640
Because the library had a panic, and we had to patch this library just to avoid this crash.

09:13.640 --> 09:19.640
Eventually, the usability improved, so it was worth it.

09:19.640 --> 09:22.640
Actually, it was not so easy.

09:22.640 --> 09:27.640
So, you can see the history, we added the library,

09:27.640 --> 09:34.640
then we improved the CMake build, then we reverted the library due to a crash,

09:34.640 --> 09:39.640
then we changed the build system, and so on.

09:39.640 --> 09:41.640
But anyway, it worked.

09:41.640 --> 09:47.640
The next step was to add an entirely new language to ClickHouse.

09:47.640 --> 09:51.640
So, ClickHouse is a SQL database; you write queries in SQL.

09:51.640 --> 09:56.640
But there is an alternative database language, named

09:56.640 --> 10:00.640
PRQL, Pipelined Relational Query Language.

10:00.640 --> 10:04.640
You can see how it looks. To the left, we have C++.

10:04.640 --> 10:08.640
Sorry, to the left, we have SQL with the ClickHouse dialect;

10:08.640 --> 10:12.640
to the right, we have PRQL.

10:12.640 --> 10:15.640
To be honest, I like SQL more.

10:15.640 --> 10:22.640
But anyway, PRQL was very fashionable, very hyped.

10:22.640 --> 10:26.640
And I decided: why not add it, just in case?

10:26.640 --> 10:31.640
Maybe people will prefer to write queries in PRQL.

10:31.640 --> 10:36.640
So, again, we found one student who did not mind

10:36.640 --> 10:39.640
taking it as coursework.

10:39.640 --> 10:43.640
And the point is, this is not a small library.

10:43.640 --> 10:48.640
It's a full transpiler from PRQL to SQL.

10:48.640 --> 10:53.640
And maybe we would find many details in the build system

10:53.640 --> 11:01.640
that we would have to fix just to integrate this library.

11:01.640 --> 11:06.640
And what happened? Actually, it was integrated.

11:06.640 --> 11:11.640
No one used PRQL, but anyway.

11:11.640 --> 11:14.640
The next step was to make something practical.

11:14.640 --> 11:18.640
We already integrated three libraries.

11:18.640 --> 11:21.640
And we did not really need them.

11:21.640 --> 11:25.640
Now, something that we actually needed.

11:25.640 --> 11:31.640
A library to support the Delta format for data lakes.

11:31.640 --> 11:33.640
What is a data lake?

11:33.640 --> 11:39.640
It's a way to represent a database in a disaggregated form.

11:39.640 --> 11:45.640
Like data format is one thing, query language is another thing.

11:45.640 --> 11:49.640
And you can use different, sorry, query engine is another thing.

11:49.640 --> 11:53.640
And you can use different query engines like ClickHouse, DataFusion,

11:54.640 --> 11:57.640
Databricks on the same data format.

11:57.640 --> 12:01.640
And there are a few data lake formats,

12:01.640 --> 12:06.640
Iceberg and Delta Lake.

12:06.640 --> 12:13.640
And as I said, there were no implementations in C++, only in Java.

12:13.640 --> 12:16.640
And when the first library in Rust appeared,

12:16.640 --> 12:18.640
it was published by Databricks.

12:18.640 --> 12:22.640
We decided let's try to integrate it.

12:22.640 --> 12:26.640
And we integrated this library.

12:26.640 --> 12:30.640
And now we have support for Delta Lake.

12:30.640 --> 12:31.640
OK.

12:31.640 --> 12:38.640
But during all the integration we faced just a few problems.

12:38.640 --> 12:41.640
Let me go through these problems.

12:41.640 --> 12:43.640
What is wrong with Rust?

12:43.640 --> 12:45.640
It's such a nice programming language.

12:45.640 --> 12:48.640
Everyone loves it.

12:48.640 --> 12:52.640
Rust might be a perfect language.

12:52.640 --> 12:55.640
But the problem is when you,

12:55.640 --> 13:02.640
like ClickHouse, integrate Rust and C++ together.

13:02.640 --> 13:07.640
And the first problem is how to get reproducible builds.

13:07.640 --> 13:12.640
Things like making sure that all dependencies are fixed

13:12.640 --> 13:15.640
and all dependencies are in the source code.

13:15.640 --> 13:22.640
Everything is vulnerable to supply chain attacks.

13:22.640 --> 13:26.640
How to avoid things like when the build system

13:26.640 --> 13:30.640
downloads something from the internet, from untrusted sources.

13:30.640 --> 13:34.640
And it is not easy to solve in C++,

13:34.640 --> 13:38.640
but we solved it a long time ago.

13:38.640 --> 13:42.640
And in Rust it is typically not a problem.

13:42.640 --> 13:50.640
But how to ensure that when you integrate it in CMake,

13:50.640 --> 13:52.640
it does not download crates.

13:52.640 --> 13:54.640
It vendors all the crates.

13:54.640 --> 13:56.640
It was not trivial.

14:02.640 --> 14:06.640
Another problem is when you combine two languages,

14:06.640 --> 14:08.640
you have to write wrappers.

14:08.640 --> 14:13.640
To call C++ code from Rust, and Rust from C++.

14:13.640 --> 14:15.640
You have to figure out the interface,

14:15.640 --> 14:18.640
you have to figure out who allocates memory,

14:18.640 --> 14:20.640
who deallocates memory.

14:20.640 --> 14:23.640
And when we tried the first time,

14:23.640 --> 14:30.640
immediately our test system found that we did it wrong.

14:30.640 --> 14:32.640
There were crashes and so on.

14:32.640 --> 14:36.640
It was really error-prone and not really safe.

14:38.640 --> 14:41.640
Fortunately, we already had fuzzing in

14:41.640 --> 14:44.640
the continuous integration system, so it saved us.

14:48.640 --> 14:56.640
Another problem is how errors are handled in C++ and in Rust.

14:56.640 --> 15:00.640
And in C++ we use exceptions.

15:00.640 --> 15:02.640
Actually, I like exceptions.

15:02.640 --> 15:06.640
Maybe you have prepared some, I don't know, rotten tomatoes for me.

15:06.640 --> 15:08.640
I use exceptions.

15:12.640 --> 15:16.640
In Rust, people typically don't use exceptions.

15:16.640 --> 15:21.640
You can get something close to exceptions in Rust,

15:21.640 --> 15:23.640
but it will not be easy.

15:26.640 --> 15:33.640
And sometimes, instead of error handling, people just use panic.

15:33.640 --> 15:39.640
And it is okay for applications that do something like batch processing

15:39.640 --> 15:45.640
for applications that are invoked once, do the stuff,

15:45.640 --> 15:48.640
and go away.

15:48.640 --> 15:52.640
For server applications, it is quite controversial.

15:53.640 --> 15:58.640
You don't want some third party library to just terminate your server.

15:58.640 --> 16:01.640
So you have to fix all these libraries.

16:05.640 --> 16:10.640
And I would like to say that yes, panic is memory safe,

16:10.640 --> 16:15.640
but it is memory safe in the same way as abort or

16:15.640 --> 16:19.640
std::terminate in C++, or even a null pointer dereference.

16:19.640 --> 16:23.640
It is memory safe in the same way as panic.

16:23.640 --> 16:28.640
But typically panics are used to indicate some bugs,

16:28.640 --> 16:31.640
failed assertions.

16:31.640 --> 16:36.640
And when you go seriously with fuzzing,

16:36.640 --> 16:39.640
you almost certainly will find some corner cases

16:39.640 --> 16:46.640
and will uncover some bugs in Rust libraries.

16:46.640 --> 16:51.640
And in this way, the fact that we write code in C++

16:51.640 --> 16:56.640
and we have to pay for all this testing,

16:56.640 --> 17:02.640
it helps us with Rust as well.

17:02.640 --> 17:06.640
One example is in PRQL,

17:06.640 --> 17:11.640
so immediately we found that if you write a query something like X,

17:11.640 --> 17:14.640
or Y, it will crash.

17:14.640 --> 17:16.640
And we had to fix it.

17:16.640 --> 17:21.640
Not a big problem, but okay.

17:21.640 --> 17:25.640
Another thing is sanitizers.

17:25.640 --> 17:28.640
Maybe you want to say that you don't need sanitizers

17:28.640 --> 17:31.640
in Rust, because it is so safe;

17:31.640 --> 17:34.640
why do you need address sanitizer in Rust?

17:34.640 --> 17:36.640
Why do you need memory sanitizer?

17:37.640 --> 17:41.640
But we build all our code with sanitizers,

17:41.640 --> 17:48.640
and we want all our builds to continue to be tested with sanitizers.

17:48.640 --> 17:51.640
So all the code must be sanitized.

17:51.640 --> 17:57.640
For memory sanitizer, it's important that all code that writes to

17:57.640 --> 18:02.640
or reads from memory is sanitized.

18:02.640 --> 18:09.640
And initially we had to switch to the nightly toolchain

18:09.640 --> 18:12.640
for Rust just to get memory sanitizer.

18:12.640 --> 18:17.640
For some reason, we still have some problems with it.

18:17.640 --> 18:22.640
Some Rust libraries are disabled with memory sanitizer,

18:22.640 --> 18:26.640
just because they don't provide some symbols that are required

18:26.640 --> 18:29.640
to compile them this way.

18:29.640 --> 18:37.640
But today it is mostly not a problem.

18:37.640 --> 18:43.640
What about cross compilation?

18:43.640 --> 18:47.640
And again, I can say that cross compilation in Rust

18:47.640 --> 18:50.640
is much better than in C++.

18:50.640 --> 18:54.640
The only problem is that, again,

18:54.640 --> 18:58.640
we spent a huge amount of effort to make it work with C++.

18:59.640 --> 19:05.640
We had to provide custom toolchains,

19:05.640 --> 19:12.640
custom headers from libc for every system.

19:12.640 --> 19:17.640
But now we have to solve this problem again.

19:17.640 --> 19:24.640
And it was again not easy.

19:25.640 --> 19:29.640
What about dependencies?

19:29.640 --> 19:35.640
For example, we prefer to link everything statically.

19:35.640 --> 19:38.640
We statically link OpenSSL.

19:38.640 --> 19:42.640
And we started to link Delta Kernel Rust from Rust.

19:42.640 --> 19:46.640
And it depends on another library named reqwest.

19:46.640 --> 19:50.640
And reqwest also depends on OpenSSL.

19:50.640 --> 19:54.640
And for some reason, now we have two different versions

19:54.640 --> 19:56.640
of OpenSSL in the binary.

19:56.640 --> 19:59.640
And actually not just in the binary.

19:59.640 --> 20:05.640
One was statically linked and another was dynamically linked at runtime.

20:05.640 --> 20:10.640
And it broke our hermetic builds.

20:10.640 --> 20:14.640
We found a configuration option just to switch reqwest

20:14.640 --> 20:17.640
to use rustls.

20:18.640 --> 20:22.640
But the problem was that rustls was not FIPS compliant.

20:22.640 --> 20:24.640
And we had to switch it back to OpenSSL.

20:24.640 --> 20:27.640
And ensure that it used the same OpenSSL

20:27.640 --> 20:30.640
that is FIPS compliant.

20:30.640 --> 20:31.640
Okay, problem solved.

20:31.640 --> 20:34.640
What is next?

20:34.640 --> 20:37.640
Composability of the code.

20:37.640 --> 20:43.640
The question is: what are the conventions that we use for the libraries?

20:43.640 --> 20:45.640
How should each library allocate memory?

20:45.640 --> 20:48.640
How should it spawn threads?

20:48.640 --> 20:52.640
How should it maintain connection pools or caches?

20:52.640 --> 20:57.640
If it does HTTP requests, how does it manage retries?

20:57.640 --> 21:00.640
And if we do it one way in our C++ code,

21:00.640 --> 21:04.640
how to ensure that other libraries do it in the same way?

21:04.640 --> 21:07.640
And the answer: we cannot ensure it.

21:07.640 --> 21:10.640
Either we just patch these libraries.

21:11.640 --> 21:17.640
Or we get away with different ways of managing these things.

21:21.640 --> 21:23.640
Small surprises.

21:23.640 --> 21:31.640
Like when we added PRQL, we found that some symbols in the binary

21:31.640 --> 21:35.640
now take 50 kilobytes just for the name.

21:35.640 --> 21:39.640
And this is one of the names of these symbols.

21:39.640 --> 21:44.640
I see Chumsky repeated like 20 times.

21:44.640 --> 21:46.640
The legs are broken.

21:46.640 --> 21:50.640
It's some kind of monomorphization like C++ templates.

21:50.640 --> 21:52.640
I was not expecting that.

21:52.640 --> 22:00.640
So I would say, no, this is not better than C++ templates.

22:00.640 --> 22:03.640
What about dependencies?

22:03.640 --> 22:06.640
Like software composition analysis.

22:06.640 --> 22:11.640
If we list our C++ libraries that we depend on,

22:11.640 --> 22:15.640
it will be like 20 to maybe 30 libraries.

22:15.640 --> 22:26.640
If we list the Rust libraries, there will be 156

22:26.640 --> 22:31.640
direct dependencies and 672 indirect dependencies.

22:31.640 --> 22:34.640
I would say it's not that bad.

22:34.640 --> 22:39.640
It's not as bad as it is in, say, Node.js.

22:39.640 --> 22:45.640
If you use Node.js, with npm, maybe you will have thousands of dependencies.

22:45.640 --> 22:47.640
Not 672.

22:47.640 --> 22:56.640
But it is not as boring as in C++,

22:56.640 --> 22:58.640
where you cannot just add a library.

22:58.640 --> 23:01.640
You have to integrate it into the build system.

23:01.640 --> 23:08.640
For this reason, you cannot have too many libraries.

23:08.640 --> 23:11.640
OK, actually, all problems have been solved.

23:11.640 --> 23:14.640
And now we have just a bit of Rust code in C++.

23:14.640 --> 23:17.640
We did not rewrite it in Rust yet.

23:17.640 --> 23:19.640
Maybe there is a chance.

23:19.640 --> 23:25.640
It depends on the enthusiasm of our engineers.

23:25.640 --> 23:29.640
I don't see a lot of enthusiasm by the way.

23:29.640 --> 23:32.640
But there is still a chance.

23:32.640 --> 23:34.640
So, what are the takeaways?

23:34.640 --> 23:37.640
Rust is actually a great language.

23:37.640 --> 23:43.640
And you can write in C++ and Rust in the same project.

23:43.640 --> 23:49.640
And if you like Rust, you are welcome to become a ClickHouse contributor.

23:49.640 --> 23:51.640
Thank you.

23:59.640 --> 24:01.640
Thank you.

