WEBVTT

00:00.000 --> 00:13.000
It's a super fast filtering and aggregation operators to JSON pass.

00:13.000 --> 00:18.680
So that's very tempting to do, but it comes with three challenges, which I like to explain.

00:18.680 --> 00:23.320
Challenge one is when documents share no common pass.

00:23.320 --> 00:28.120
So let's look at the orange, the red, and the green box.

00:28.120 --> 00:31.320
These are three JSON documents, but they shall not come in pass, right?

00:31.320 --> 00:36.720
So we don't have two or three pass, we have seven pass in total.

00:36.720 --> 00:42.920
There's a general risk that the number of pass explodes and we'll get an avalanche of columns.

00:42.920 --> 00:49.720
Challenge two is when we actually have shared pass, but they do have different types.

00:49.720 --> 00:56.320
So if you look at the green box and the blue box, these two documents have the same structure.

00:56.320 --> 01:02.120
But the pass heads go, and one document is an integer, and the other document is a string.

01:02.120 --> 01:08.120
And that's a problem, because in relational databases, every column needs to be of exactly one type.

01:08.120 --> 01:13.520
And finally, challenge three is that the pass can have very different frequencies across JSON documents.

01:13.520 --> 01:19.320
So imagine a pass that occurs in just one document.

01:19.320 --> 01:24.720
Its column will be virtually empty except for that single one value, right?

01:24.720 --> 01:33.520
We call this pass column, and that's not really great for storage, because it contains literally your hot air.

01:33.520 --> 01:42.720
Okay, now to solve these three challenges, we designed two new data types in Clickhouse as building blocks.

01:42.720 --> 01:49.320
And eventually we added a dedicated JSON type on top of these two building blocks.

01:49.320 --> 01:57.920
Both building blocks can be used standalone and outside of a JSON context, and that would be a totally valid use case.

01:57.920 --> 02:02.120
First building block here is the variant data type.

02:02.120 --> 02:06.320
You can imagine variant as a union of multiple sub types.

02:06.320 --> 02:12.920
So basically, every value has one of the sub types, or it is null.

02:12.920 --> 02:14.320
How does it work?

02:14.320 --> 02:19.120
Let's start with the yellow example column on the left side.

02:19.120 --> 02:23.120
You can see it contains various values of different types.

02:23.120 --> 02:28.120
For example, integer strings, areas, there's also a null there.

02:28.120 --> 02:35.120
And the type of the column is variant of in 64 string and area of in 64.

02:35.120 --> 02:44.120
That's a little bit hard to read. It's in light gray on top of the yellow box in the middle.

02:44.120 --> 02:54.120
Okay, now the big black box on the right shows how their variant column is internally represented.

02:54.120 --> 02:58.120
And there are really two important data structures here.

02:58.120 --> 03:03.120
The first one is the one in light blue, and we call it discriminator column.

03:03.120 --> 03:07.120
It encodes for every value the value is type.

03:07.120 --> 03:13.120
So in the example of one in the first row of the discriminator column, means that the value is in integer.

03:14.120 --> 03:20.120
Two in the second row means it's a string, and zero in the third row means it's an area.

03:20.120 --> 03:26.120
Now you may ask, how do the sub types map to these type integers?

03:26.120 --> 03:29.120
And the mapping is actually implicit.

03:29.120 --> 03:35.120
So what happens is that we sort all the names to use a specified in the variant type.

03:35.120 --> 03:42.120
We sort them, and then the relative position, that's the type integer.

03:43.120 --> 03:55.120
The other structure here is an offset column, and that's the light gray box right next to the blue box.

03:55.120 --> 04:02.120
So this one represents for every value it's position in a sub type column.

04:02.120 --> 04:10.120
So if you look at the bottom right, you see three yellow boxes, and there's a box for every sub type of the variant,

04:10.120 --> 04:15.120
and every box stores the values of a sub type consecutively.

04:15.120 --> 04:21.120
So for example, the leftmost yellow box on the bottom right stores all the integers.

04:21.120 --> 04:24.120
And that's 42, 43, and 44.

04:24.120 --> 04:27.120
And it stores them one after another.

04:27.120 --> 04:33.120
Now the offset column is used to reconstruct the original positions of the sub type values.

04:33.120 --> 04:39.120
And it stores for every value in the original column, the position of the value in its sub type column.

04:39.120 --> 04:44.120
It's not a little complicated, but long story short, the variant data type in click house.

04:44.120 --> 04:48.120
Sort the second and the third challenge that we had on the previous slide.

04:48.120 --> 04:54.120
The second challenge, if you remember, was that JSON pass may contain different types.

04:54.120 --> 05:01.120
And the variant columns in the variant columns you can have values of different types in the same column.

05:01.120 --> 05:06.120
And finally the third challenge was that was about sparse columns.

05:06.120 --> 05:13.120
The variant column, the data is always stored in a dense fashion.

05:13.120 --> 05:21.120
Okay, now the second building block is the dynamic data type, and again it can be used standalone outside JSON context.

05:21.120 --> 05:26.120
The dynamic data type is exactly the same as the variant data type.

05:26.120 --> 05:33.120
The difference is that it does not require you to specify all the possible sub types upfront.

05:33.120 --> 05:39.120
The dynamic type can store whatever data type you throw at it at runtime.

05:39.120 --> 05:43.120
And if you look at the yellow box again, that's the original column.

05:43.120 --> 05:49.120
You see that it has now a data type dynamic, but like the sub types are gone.

05:49.120 --> 05:52.120
They are not specified explicitly.

05:52.120 --> 06:00.120
Because we no longer specify the possible sub types in the type itself, we still need to remember them somehow.

06:00.120 --> 06:04.120
And that is done by an additional file, which is called dynamic structure bin.

06:04.120 --> 06:07.120
And that's highlighted in red on the right.

06:07.120 --> 06:13.120
It contains the contained sub types, and it stores some additional statistics.

06:13.120 --> 06:19.120
So in this example, we store an area of N64, N64, and string a sub types.

06:19.120 --> 06:26.120
And besides that, as I mentioned, there's some more statistics which are needed to optimize the access later on.

06:27.120 --> 06:35.120
Now, remember there was this first challenge, right, where which was about exploding paths and column counts.

06:35.120 --> 06:37.120
And we will now address this.

06:37.120 --> 06:46.120
And the way that it's done is that we allow to specify an optional upper limit on the number of sub types in a dynamic column.

06:46.120 --> 06:54.120
If more types are, if more types than the limit are added, they are not rejected,

06:54.120 --> 07:02.120
but they are stored in a fallback column, which is going to be a little bit slower to access.

07:02.120 --> 07:09.120
So to be more precise, all values of the types which exceed the limit are stored together in a single column.

07:09.120 --> 07:17.120
And we use a very simple pair of data type and value to represent them.

07:18.120 --> 07:25.120
And this is how it looks like. Let's make an example in the yellow box on the right. You see that we used a dynamic data type.

07:25.120 --> 07:36.120
But this time, we also specify the parameter max types S3, which means that at most three types should be stored in dedicated columns.

07:36.120 --> 07:43.120
And the values of the force, the fifth, the sixth, and so on, types should go into a fallback column.

07:44.120 --> 07:53.120
Unfortunately, this column contains more than three types. So all the excess types are stored in the green box on the bottom right.

07:53.120 --> 07:58.120
Each value in the green box is a pair of data type and value.

07:58.120 --> 08:10.120
So in the example, the first value there is a pair of data and 2020-11 followed by a pair of pool and falls and so on.

08:10.120 --> 08:18.120
Now that I introduced the dynamic type and the variant type, I can finally present the JSON data type.

08:18.120 --> 08:26.120
Generally speaking, each pass across all the stored JSON documents go into a dynamic column.

08:26.120 --> 08:32.120
And this is how you formally declare the JSON type in Clickhouse for your data.

08:33.120 --> 08:42.120
There are five parameters in green, max dynamic pass. That specifies the maximum number of pass, which are stored in sub columns.

08:42.120 --> 08:47.120
In the interest of time, I will not explain this mechanism detail.

08:47.120 --> 08:56.120
And then there is in brown, max dynamic types, which specifies the number of types for a single pass, which are stored as sub columns.

08:56.120 --> 09:04.120
If the limit is exceeded, new types are stored together in a fallback column, and that's what you saw on the previous slide.

09:04.120 --> 09:14.120
And then in blue, there is some pass type name and that states which pass should never be stored or should not be stored in a dynamic column.

09:14.120 --> 09:17.120
And there is instead stored as regular columns.

09:18.120 --> 09:27.120
That's quite nice. That's, again, like a special case, and it gives you performance guarantees for certain paths that you want to access very quickly.

09:27.120 --> 09:30.120
Like that, that's without compression, more or less.

09:30.120 --> 09:41.120
And then in purple, skip and skip records, these refer to JSON paths, which should be completely ignored on other words, they will never be stored at all.

09:41.120 --> 09:45.120
Okay, now there's this intent example for the JSON data type.

09:45.120 --> 09:53.120
I will probably not explain it in detail, because the font is quite small, and you will throw things at me if I go too much into the data.

09:53.120 --> 10:10.120
The just of this figure is that there are six different JSON documents, and five pass, and we store every pass by default in a dynamic column, or in a dedicated column, if the user requested that via a parameter in the JSON type declaration.

10:11.120 --> 10:21.120
If you'd like to know more, you can see the nice bookpost about JSON support and kickhouse that we published recently, it's shown at the bottom of the slide.

10:21.120 --> 10:27.120
The last thing I like to show is a benchmark for JSON documents, which we did recently.

10:27.120 --> 10:39.120
This benchmark compares native JSON support in analytics databases, and it runs queries over a data set with one billion blue sky events modeled as JSON documents.

10:39.120 --> 10:50.120
It's total compressed size is 125 gigabyte. Everyone's free to grab the data from our public as free bucket and reproduce the measurements.

10:50.120 --> 11:06.120
So we compared the query performance of the new JSON type in clickhouse versus elastic search, ductDB, MongoDB and Postgres, and you can see clickhouse runs the queries by far as the fastest database ahead of all other systems.

11:07.120 --> 11:13.120
Okay, to sum up, clickhouse now speaks JSON, the implementation is easy to use, it's flexible, it's extremely fast.

11:13.120 --> 11:20.120
Chasing supports currently better and we plan to make it GA during 2025.

11:21.120 --> 11:30.120
And now really the last thing we're going to have a dinner, so a clickhouse dinner tonight with free drinks and free food.

11:30.120 --> 11:37.120
And you're invited to head over and we can talk about any topics about clickhouse database in general.

11:37.120 --> 11:41.120
I would be very happy to meet you there. Thank you.

11:51.120 --> 11:54.120
Any questions?

11:54.120 --> 11:56.120
Please.

11:56.120 --> 12:04.120
Yeah, so the question was about data compression and elastic search.

12:04.120 --> 12:23.120
To be honest, I don't know exactly like from the top of my mind.

12:23.120 --> 12:34.120
Like the exact steps are available in the GitHub repository, you can check it out and see what settings specifically used.

12:34.120 --> 12:36.120
Please.

12:53.120 --> 13:08.120
That's a good question.

13:08.120 --> 13:16.120
I guess I guess there is not as far as I know there are not too many benchmarks about JSON,

13:16.120 --> 13:24.120
it's important data basis for now and it's like still like an emerging thing.

13:24.120 --> 13:29.120
Yeah, but of course every benchmark has weaknesses.

13:29.120 --> 13:36.120
It's always informed by the launch of the benchmark in a balance, right?

13:36.120 --> 13:37.120
Okay, thank you.

13:37.120 --> 13:38.120
Still 30 of you, thank you.

13:38.120 --> 13:39.120
I'm on this.

13:53.120 --> 13:55.120
Yeah, I guess here.

13:59.120 --> 14:02.120
Thank you.

14:13.120 --> 14:15.120
Lovely. Thank you.

14:15.120 --> 14:17.120
Yeah.

14:17.120 --> 14:20.120
I could be.

14:20.120 --> 14:24.120
I could be.

14:24.120 --> 14:29.120
If any of you have a lot of columns of files, they've been at a lot of the properties.

14:29.120 --> 14:30.120
We've done a few things.

14:30.120 --> 14:31.120
Oh, yeah.

14:31.120 --> 14:32.120
Where are you?

14:32.120 --> 14:33.120
You need to pass them.

14:33.120 --> 14:35.120
They only have a single one.

14:35.120 --> 14:37.120
So it's possible to implement.

14:37.120 --> 14:38.120
Right?

14:38.120 --> 14:44.120
And with this dense solution, you mentioned that now is possible from the site.

14:44.120 --> 14:46.120
Would you just reach out and quiet?

14:46.120 --> 14:47.120
Yeah.

14:47.120 --> 14:49.120
You don't have to solve this at your Nicole.

14:49.120 --> 14:51.120
It's very interesting.

14:51.120 --> 14:53.120
And I didn't touch on that because I was telling you all the time.

14:53.120 --> 14:54.120
Okay.

14:54.120 --> 14:55.120
Okay.

14:55.120 --> 14:56.120
Okay.

14:56.120 --> 15:03.120
That was the first parameter in the chase in the declaration that I said,

15:03.120 --> 15:04.120
No, I marked it.

15:04.120 --> 15:05.120
Yeah, yes.

15:05.120 --> 15:07.120
Yes, another for that mechanism.

15:07.120 --> 15:08.120
Okay.

15:08.120 --> 15:14.120
Which, like, compresses, type of files which go beyond this limit.

15:14.120 --> 15:17.120
And they are stored consecutively in the same place.

15:17.120 --> 15:18.120
They're empty.

15:18.120 --> 15:21.120
But it's, but it's all described in the blog.

15:22.120 --> 15:26.120
They have to somehow, in the sense that if I have, I don't know,

15:26.120 --> 15:30.120
I don't know, one hundred parts, like a file and a column.

15:30.120 --> 15:33.120
And then a lot of parts, but then you need.

15:33.120 --> 15:35.120
Do you have any guarantees, though?

15:35.120 --> 15:37.120
At least they have spot you in the course storage.

15:37.120 --> 15:38.120
Yeah, yes.

15:38.120 --> 15:39.120
Yeah.

15:39.120 --> 15:43.120
So yes, it's, like, an upper limit on the number of files.

15:43.120 --> 15:45.120
And then an upper limit on the number of files.

15:45.120 --> 15:47.120
And it's the, but it's the same.

15:47.120 --> 15:48.120
Yeah.

15:48.120 --> 15:49.120
Yeah.

15:49.120 --> 15:50.120
Yeah.

15:50.120 --> 15:51.120
Yeah.

15:51.120 --> 15:52.120
Yeah.

15:52.120 --> 15:53.120
Okay.

15:53.120 --> 15:55.120
So I might have been like it, but I have a, yeah.

15:55.120 --> 15:56.120
Excuse me.

15:56.120 --> 15:57.120
Yes.

15:57.120 --> 15:58.120
Okay.

15:58.120 --> 15:59.120
Thank you.

15:59.120 --> 16:04.120
And this event after the post office is going to be just a form of thing.

16:04.120 --> 16:05.120
Very important.

16:05.120 --> 16:07.120
Gonna have to be here.

16:07.120 --> 16:08.120
Okay.

16:08.120 --> 16:09.120
Thank you.

16:09.120 --> 16:10.120
Yeah.

16:10.120 --> 16:11.120
And yes.

16:11.120 --> 16:14.120
I put that on the backstage room.

16:14.120 --> 16:15.120
Yeah.

16:15.120 --> 16:16.120
Yeah.

16:16.120 --> 16:17.120
Yeah.

16:17.120 --> 16:18.120
Yeah.

16:18.120 --> 16:19.120
Yeah.

16:19.120 --> 16:20.120
Yeah.

16:20.120 --> 16:21.120
Yeah.

16:21.120 --> 16:22.120
Yeah.

16:22.120 --> 16:23.120
Yeah.

16:23.120 --> 16:24.120
Yeah.

16:24.120 --> 16:25.120
Yeah.

16:25.120 --> 16:26.120
Yeah.

16:26.120 --> 16:27.120
Yeah.

16:27.120 --> 16:28.120
Yeah.

16:28.120 --> 16:29.120
Yeah.

16:29.120 --> 16:30.120
Yeah.

16:30.120 --> 16:31.120
Yeah.

16:31.120 --> 16:32.120
Yeah.

16:32.120 --> 16:33.120
Yeah.

16:33.120 --> 16:34.120
Yeah.

16:34.120 --> 16:35.120
Yeah.

16:35.120 --> 16:36.120
Yeah.

16:36.120 --> 16:37.120
Yeah.

16:37.120 --> 16:38.120
Yeah.

16:38.120 --> 16:39.120
Yeah.

16:39.120 --> 16:40.120
Yeah.

16:40.120 --> 16:41.120
Yeah.

16:41.120 --> 16:42.120
Yeah.

16:42.120 --> 16:43.120
Yeah.

16:43.120 --> 16:44.120
Yeah.

16:44.120 --> 16:46.120
He's.

16:46.120 --> 16:47.120
Yeah.

16:47.120 --> 16:49.120
He's a big team there too.

16:49.120 --> 16:50.120
No, yeah.

16:50.120 --> 16:51.120
Yeah.

16:51.120 --> 16:52.120
Yeah.

16:52.120 --> 17:06.120
We're going to have a fight.

17:06.120 --> 17:08.120
So we should be our team here.

17:08.120 --> 17:09.120
Wow.

17:09.120 --> 17:10.120
We know.

17:10.120 --> 17:11.120
We do.

17:11.120 --> 17:41.120
I don't know, I don't know, whatever is working here, I don't know, I don't know, whatever is working here, I don't know, I don't know, I don't know, whatever is working here, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't

17:41.120 --> 17:50.280
Well it's for audio, right?

17:50.280 --> 18:05.860
or if audio, right?

18:05.860 --> 18:15.860
Or if you've got audio, right? You can have this on if you want or you can use that.

18:15.860 --> 18:20.860
But this is only for the string. It doesn't work because you have to speak loud.

18:20.860 --> 18:25.860
If it's too loud, maybe the string will be...

18:25.860 --> 18:30.860
So if you think you're going to speak loud, put it on your shirt on your shirt.

18:30.860 --> 18:33.860
So then that will compensate for it.

18:33.860 --> 18:36.860
I can check the notes anyway.

18:36.860 --> 18:40.860
So when you start talking on, you'll make sure the levels are okay.

18:40.860 --> 18:42.860
Not sure if it's fine, but...

18:42.860 --> 18:43.860
Yes, confusing it.

18:43.860 --> 18:45.860
I always find the right.

18:48.860 --> 18:50.860
Is it perfect? Perfect.

18:50.860 --> 18:51.860
Yeah, I think I'm...

18:51.860 --> 18:53.860
Good, good to wear a shirt.

18:53.860 --> 18:54.860
Yeah, that's that's that.

18:54.860 --> 18:55.860
I mean, actually the hoodie is...

18:55.860 --> 18:56.860
This is...

18:56.860 --> 18:57.860
Yeah, that's that.

18:58.860 --> 18:59.860
Yeah.

18:59.860 --> 19:01.860
I just want to go to the market.

19:01.860 --> 19:03.860
If you're looking for something to do,

19:03.860 --> 19:07.860
there's something to do with the price.

19:07.860 --> 19:11.860
Yeah, that's what I do.

19:11.860 --> 19:13.860
I just want to go first.

19:13.860 --> 19:15.860
Yeah, that's what I do.

19:15.860 --> 19:17.860
I don't want to think about this at the end.

19:17.860 --> 19:19.860
Yeah, that's what I do.

19:19.860 --> 19:20.860
Oh, that's what I do.

19:20.860 --> 19:22.860
Oh, that's what I do.

19:22.860 --> 19:23.860
Yeah, that's what I do.

19:23.860 --> 19:24.860
Yeah, that's what I do.

19:24.860 --> 19:26.860
And it's a good set for me.

19:26.860 --> 19:27.860
I don't know what it kind of...

19:27.860 --> 19:29.860
I think country's more traditional...

19:29.860 --> 19:32.860
I think country's more traditional, I think...

19:32.860 --> 19:33.860
Yeah, yeah, yeah.

19:33.860 --> 19:34.860
I'm sure we can go.

19:34.860 --> 19:38.860
Yeah, that's what I do.

19:38.860 --> 19:40.860
Yeah, I'm sure that.

19:40.860 --> 19:41.860
Thank you for that.

19:41.860 --> 19:42.860
Yes, thank you for that.

19:42.860 --> 19:43.860
Thank you.

19:45.860 --> 19:48.860
I think, I don't know about the price.

19:48.860 --> 19:49.860
I think that's what I do.

19:49.860 --> 19:50.860
I think that's what I do.

19:50.860 --> 19:51.860
I want to ask.

19:51.860 --> 19:52.860
Thank you.

19:52.860 --> 19:53.860
It's all right.

19:53.860 --> 19:54.860
Thank you.

20:24.860 --> 20:30.420
And I'm not mistaken.

20:30.420 --> 20:52.960
Can I ask you what...

20:52.960 --> 21:22.960
I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't know, I don't

21:22.960 --> 21:26.400
Jerry glitter around the Jeremy will give you that, he will give you your five minutes at the beginning,

21:26.400 --> 21:32.560
You see me, you don't trouble because I come in right at the end so I'll be outside making balls

21:32.560 --> 21:35.840
The moment you appear that's your time. You should be literally wrap up

21:36.480 --> 21:40.160
Okay, okay, so if you see me because I will I will switch to

21:41.520 --> 21:46.320
That's okay, but so you'll fit him he'll give you the five minute mark if you go to late and you see me

21:46.320 --> 21:48.320
You know you've got a finish

21:49.200 --> 21:53.200
Okay, okay, yeah, yeah, because I have more content than that

21:53.200 --> 21:56.080
Oh, yeah, absolutely, but I will tell you I will tell you

21:56.080 --> 22:00.560
Yeah, I will start waving. Oh, sorry. Oh, you know what? We're going to start a fling the earlier

22:00.560 --> 22:03.680
Yeah, we can do that if you want to start five minutes early. We can do that as well

22:04.080 --> 22:06.080
Do you give you five minutes more

22:06.720 --> 22:08.720
Okay, yeah

22:09.600 --> 22:11.600
So just just to not

22:12.240 --> 22:16.480
So we'll start at 20 past so that's the six minutes time. We'll start at six minutes

22:17.360 --> 22:19.360
Okay, yeah, I think

22:20.000 --> 22:22.400
Yeah, so if there is no problem with a synchronization

22:22.400 --> 22:25.120
No, because it was the newscales like a lot of people

22:26.320 --> 22:31.280
People yeah people either I'm going to be here already or if I'm going to be female it's like because I'm not at all

22:31.280 --> 22:33.280
I'm going to be female it's like anyway

22:33.840 --> 22:36.880
So this window talk with no other the basis

22:41.520 --> 22:43.520
It's the kind of the time they still see you

22:47.120 --> 22:49.120
Thanks

22:49.120 --> 22:51.280
Senior

