WEBVTT

00:00.000 --> 00:08.000
We're actually going to be reviewing a lot of the basics first.

00:08.000 --> 00:13.000
We're also going to get into a lot of things that a lot of intros might miss.

00:13.000 --> 00:17.000
We'll get to all of that soon.

00:17.000 --> 00:19.000
For those of you who don't know me,

00:19.000 --> 00:21.000
or haven't come across my name before,

00:21.000 --> 00:25.000
I'm a developer advocate, and I exclusively focus on

00:25.000 --> 00:28.000
Apache Iceberg, so I'm glad to be here today

00:28.000 --> 00:31.000
to share Iceberg with all of you.

00:31.000 --> 00:34.000
So to kick things off: what is Iceberg,

00:34.000 --> 00:36.000
and why should you actually care?

00:36.000 --> 00:38.000
You've probably heard of Iceberg, right?

00:38.000 --> 00:40.000
It's the main way to do this.

00:40.000 --> 00:43.000
It is an open table format.

00:43.000 --> 00:44.000
What about that?

00:44.000 --> 00:47.000
It was purpose-built to handle large-scale

00:47.000 --> 00:50.000
datasets sitting in the data lake.

00:50.000 --> 00:55.000
When you think about how we got here, we have to look at the past.

00:55.000 --> 00:56.000
We've got data lakes now.

00:56.000 --> 01:00.000
But going back, in the past we had our wonderful monoliths,

01:00.000 --> 01:02.000
our data warehouses, right?

01:02.000 --> 01:05.000
The data lakes were kind of a response to a lot of the things

01:05.000 --> 01:07.000
that weren't quite scaling well.

01:07.000 --> 01:10.000
We had data warehouses, mostly, in the past,

01:10.000 --> 01:12.000
those big monoliths.

01:12.000 --> 01:15.000
But there were some things about data warehouses

01:15.000 --> 01:17.000
that we didn't like.

01:17.000 --> 01:21.000
What Iceberg does is address the concerns

01:21.000 --> 01:24.000
of the data lake and the concerns of data warehouses.

01:24.000 --> 01:28.000
And we're combining the best of those into a single concept,

01:28.000 --> 01:30.000
the data lakehouse.

01:30.000 --> 01:33.000
So what is Iceberg, really, right?

01:33.000 --> 01:36.000
And so what it is, what Iceberg is,

01:36.000 --> 01:39.000
is an extra layer of metadata on top of the files

01:39.000 --> 01:42.000
that we already want to be storing in our data lake.

01:42.000 --> 01:46.000
And by doing so, it allows us to collect that data into

01:46.000 --> 01:48.000
an organized and more structured form.

01:48.000 --> 01:51.000
So effectively we're adding some more structure

01:51.000 --> 01:54.000
that everyone can rely on, on top of the data lake.

01:54.000 --> 01:56.000
So we get some benefits from that.

01:56.000 --> 01:59.000
I mean, it's one system we're building on top of the lake.

01:59.000 --> 02:01.000
So we still get some of the benefits

02:01.000 --> 02:03.000
of it being a single source of truth,

02:03.000 --> 02:05.000
like we were used to with data warehouses.

02:05.000 --> 02:09.000
We get the low cost, in most cases, of the storage.

02:09.000 --> 02:12.000
And then we get the flexibility to use whatever compute engine

02:12.000 --> 02:14.000
that we actually want for our data.

02:14.000 --> 02:16.000
But we get governance and additional structure,

02:16.000 --> 02:18.000
as we keep evolving that data.

02:18.000 --> 02:22.000
So it also means that, you know, there isn't the vendor

02:22.000 --> 02:24.000
lock-in that you would effectively get with

02:24.000 --> 02:27.000
the warehouses. You can take your Iceberg table

02:27.000 --> 02:29.000
wherever you want, use whatever tool

02:29.000 --> 02:31.000
suits you better, right?

02:31.000 --> 02:33.000
Because nothing here is locking things up.

02:33.000 --> 02:35.000
So it's really easy to move around.

02:35.000 --> 02:38.000
So Iceberg isn't the only technology

02:38.000 --> 02:42.000
out there trying to unlock the power of a data lake;

02:42.000 --> 02:44.000
you may have heard of Apache Hudi

02:44.000 --> 02:46.000
or Delta Lake.

02:46.000 --> 02:47.000
But we're going to be focusing on Iceberg

02:47.000 --> 02:49.000
in this talk, the tooling and the things that you can get

02:49.000 --> 02:51.000
from the benefits of Iceberg.

02:51.000 --> 02:53.000
And we'll dig even more into these features

02:53.000 --> 02:55.000
as we go through more of this talk.

02:55.000 --> 02:57.000
But I want to be able to set the stage

02:57.000 --> 02:59.000
for what we expect it to be able to do.

02:59.000 --> 03:00.000
Okay.

03:00.000 --> 03:03.000
So for that, we're going to focus on these aspects of Iceberg.

03:03.000 --> 03:06.000
The very basics. Thank you to the two people

03:06.000 --> 03:08.000
who laughed a little bit.

03:08.000 --> 03:10.000
So, starting with some of these,

03:10.000 --> 03:14.000
the very basic things you need to know about Iceberg tables.

03:14.000 --> 03:16.000
And for those of you who were here this morning,

03:16.000 --> 03:18.000
you've got that crash course on the architecture

03:18.000 --> 03:20.000
where we'll take it a little closer to the sun

03:20.000 --> 03:23.000
and actually take a look at what actually

03:23.000 --> 03:25.000
makes up an Iceberg table.

03:25.000 --> 03:28.000
So again, an Iceberg table is the layer

03:28.000 --> 03:30.000
of metadata on top of the files we want

03:30.000 --> 03:32.000
to be storing in our data lake.

03:32.000 --> 03:33.000
Okay.

03:33.000 --> 03:35.000
So we have some data files.

03:35.000 --> 03:38.000
And what we're going to do is we're going to track them

03:38.000 --> 03:39.000
with metadata.

03:39.000 --> 03:42.000
So we're going to work from right to left here.

03:42.000 --> 03:44.000
So first of all, we have our manifest files.

03:44.000 --> 03:48.000
So manifest files are going to point to the data files

03:48.000 --> 03:50.000
and also, as we write more files,

03:50.000 --> 03:53.000
we're probably going to keep adding more

03:53.000 --> 03:54.000
entries here.

03:54.000 --> 03:58.000
And more importantly, manifest files are going

03:58.000 --> 04:01.000
to have statistics about the data files

04:01.000 --> 04:02.000
that they're tracking.

04:02.000 --> 04:05.000
And the important thing to note here is that a manifest file

04:05.000 --> 04:10.000
is only going to keep track of a subset of the data files.

04:10.000 --> 04:11.000
All right.

04:11.000 --> 04:13.000
So there are going to be a number of manifest files

04:13.000 --> 04:17.000
for a data set, each keeping track of a group of similar data files.

04:17.000 --> 04:20.000
And so by doing this, by having the manifest keep

04:20.000 --> 04:23.000
the bounds of these columns that are stored in our data files,

04:23.000 --> 04:26.000
you're able to plan queries very, very efficiently.

04:26.000 --> 04:28.000
Effectively, it's going to prune off data that's not relevant.

04:28.000 --> 04:29.000
All right.

04:29.000 --> 04:32.000
So with the min and max of the data, we can skip files we don't

04:32.000 --> 04:33.000
actually need, okay.

04:33.000 --> 04:37.000
But we'll get into that a little more with our queries later on.

04:37.000 --> 04:39.000
Next up, we have the manifest list.

04:39.000 --> 04:43.000
A manifest list is, as the name implies, a list of manifest files.

04:43.000 --> 04:47.000
But more importantly, each manifest list

04:47.000 --> 04:51.000
is going to list out the manifest files that we want to use.

04:51.000 --> 04:54.000
Think of it as a snapshot of the table

04:54.000 --> 04:56.000
at a given point in time.

04:56.000 --> 04:58.000
And then look at your device.

04:58.000 --> 04:59.000
What's your name?

04:59.000 --> 05:02.000
I think none of the microphones are working today.

05:02.000 --> 05:05.000
So this is not a surprise to me.

05:05.000 --> 05:07.000
We're turning it off.

05:07.000 --> 05:08.000
Okay.

05:08.000 --> 05:10.000
Okay.

05:10.000 --> 05:14.000
No idea what just happened, but hopefully it's good.

05:14.000 --> 05:15.000
All right.

05:15.000 --> 05:18.000
And then from there, we have metadata files.

05:18.000 --> 05:21.000
So the metadata files are going to track the snapshots,

05:21.000 --> 05:24.000
which then point to their associated manifest files.

05:24.000 --> 05:27.000
So we'll be keeping track of multiple snapshots of this table

05:27.000 --> 05:29.000
as it evolves over time.

05:29.000 --> 05:32.000
And then we're also going to keep track of some non-data information

05:32.000 --> 05:35.000
here, like the partitioning scheme that we're using,

05:35.000 --> 05:38.000
the actual schema of the data that we're storing as well.

05:38.000 --> 05:39.000
Okay.

05:39.000 --> 05:40.000
But that's not all.

05:40.000 --> 05:42.000
So that's the entire metadata layer.

05:42.000 --> 05:44.000
And how it interacts with the data layer.

05:44.000 --> 05:46.000
But there's one more key component that's needed to make this

05:46.000 --> 05:48.000
a little more database-esque.

05:48.000 --> 05:49.000
And that's the catalog.

05:49.000 --> 05:52.000
So the catalog here is going to keep track of the individual

05:52.000 --> 05:54.000
tables that we are storing.

05:54.000 --> 05:56.000
And specifically, they're going to maintain a reference

05:56.000 --> 06:00.000
to, for that given table, where is the latest metadata file

06:00.000 --> 06:02.000
that we need to be aware of?

06:02.000 --> 06:05.000
Where does it exist so that we can then access this later on?

06:05.000 --> 06:09.000
And the catalog is also going to coordinate updating that pointer

06:09.000 --> 06:13.000
to a new metadata file every time we want to evolve the table

06:13.000 --> 06:15.000
and add more data to it.

06:15.000 --> 06:16.000
Okay.
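
NOTE
One way to see the layers just described for yourself: Iceberg exposes them as
queryable metadata tables. A minimal Spark SQL sketch, assuming an Iceberg
catalog named demo and an illustrative table db.bookings (both names are
assumptions, not from the talk):
  -- snapshots recorded in the current metadata file
  SELECT snapshot_id, committed_at, operation FROM demo.db.bookings.snapshots;
  -- manifest files, and the data files they track
  SELECT * FROM demo.db.bookings.manifests;
  SELECT file_path, record_count FROM demo.db.bookings.files;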

06:16.000 --> 06:19.000
So speaking of catalogs, they're actually pretty important.

06:19.000 --> 06:20.000
Okay.

06:20.000 --> 06:22.000
It might sound like an easy component to add on there.

06:22.000 --> 06:25.000
But it really is worth thinking more about iceberg catalogs

06:25.000 --> 06:27.000
as a whole.

06:27.000 --> 06:31.000
So they're keeping track of, again, that most recent metadata pointer.

06:31.000 --> 06:34.000
But they serve a purpose beyond that.

06:34.000 --> 06:37.000
So at the most basic level, they're there to make sure that we can find

06:37.000 --> 06:40.000
and interact with all the iceberg tables that we're storing.

06:40.000 --> 06:44.000
And more importantly, it ensures consistency with the multiple

06:44.000 --> 06:47.000
engines that might be trying to work with those iceberg tables

06:47.000 --> 06:48.000
at the same time.

06:48.000 --> 06:49.000
Okay.

06:49.000 --> 06:52.000
And they do that with atomic operations to actually update that metadata

06:52.000 --> 06:56.000
pointer every time we want to update the tables.

06:56.000 --> 06:59.000
So it's simple, but they're incredibly important to make sure that

06:59.000 --> 07:04.000
our tables are updating in the right way and functioning properly over time.

07:04.000 --> 07:07.000
And an important thing to note here is that the iceberg project

07:07.000 --> 07:10.000
doesn't actually ship with an implementation of a catalog.

07:10.000 --> 07:12.000
That might sound problematic.

07:12.000 --> 07:16.000
But they do provide an interface, thankfully.

07:16.000 --> 07:17.000
Okay.

07:17.000 --> 07:21.000
So along with several implementations of clients that can talk to catalogs

07:21.000 --> 07:23.000
like Hive, JDBC, and more.

07:23.000 --> 07:26.000
So that's why you'll see a number of options out there.

07:26.000 --> 07:29.000
Some of which are listed on the right of different catalog

07:29.000 --> 07:30.000
implementations.

07:30.000 --> 07:33.000
So as per the interface, the only real requirement

07:33.000 --> 07:36.000
of an iceberg catalog is that, as we already discussed,

07:36.000 --> 07:40.000
it provides a way for you to give back that current metadata file.

07:40.000 --> 07:42.000
So we know where that is.

07:42.000 --> 07:45.000
And then provide a means for us to interact with the tables,

07:45.000 --> 07:49.000
like listing them out, creating them, dropping them, and more.
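
NOTE
A hedged sketch of what interacting with tables through a catalog looks like
from Spark SQL, assuming a catalog configured under the illustrative name demo:
  SHOW TABLES IN demo.db;
  CREATE TABLE demo.db.example (id BIGINT, data STRING) USING iceberg;
  DROP TABLE demo.db.example;
Every table reference is qualified by the catalog, which is what resolves the
current metadata file behind the scenes.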

07:49.000 --> 07:52.000
And I don't know if we're taking, are we taking questions in the middle

07:52.000 --> 07:53.000
or at the end?

07:53.000 --> 07:54.000
Does it matter?

07:54.000 --> 07:55.000
Up to me.

07:55.000 --> 07:57.000
What is your question, sir?

07:57.000 --> 07:58.000
On this slide here,

07:58.000 --> 08:00.000
we know the other things are like files,

08:00.000 --> 08:03.000
but is the catalog a software component?

08:03.000 --> 08:04.000
Okay.

08:04.000 --> 08:05.000
Yes, that would be a service.

08:05.000 --> 08:06.000
Oh, I'm sorry.

08:06.000 --> 08:08.000
The question was on this slide here.

08:08.000 --> 08:11.000
All of these, the components on the right and the metadata

08:11.000 --> 08:13.000
and the data, those indeed are files.

08:13.000 --> 08:17.000
And the catalog is an application, a separate application.

08:17.000 --> 08:21.000
It would be okay to summarize the function of the catalog as being

08:21.000 --> 08:24.000
a kind of database for your metadata files.

08:24.000 --> 08:29.000
It is not storing any of these metadata files,

08:29.000 --> 08:30.000
effectively.

08:30.000 --> 08:32.000
It is providing pointers to those files.

08:32.000 --> 08:33.000
Yes.

08:33.000 --> 08:36.000
All right.

08:36.000 --> 08:38.000
So, back to where we were on catalogs.

08:38.000 --> 08:41.000
They are a necessary part of iceberg as a whole,

08:41.000 --> 08:43.000
but iceberg doesn't ship with one.

08:43.000 --> 08:45.000
So that might sound a little problematic.

08:45.000 --> 08:48.000
So you might want to choose one of the other catalogs out there.

08:48.000 --> 08:50.000
And how do you choose?

08:50.000 --> 08:51.000
That is a great question.

08:51.000 --> 08:54.000
As with most problems in life and software,

08:54.000 --> 08:55.000
it depends.

08:55.000 --> 08:58.000
But I've conveniently put together a couple questions to ask yourself,

08:58.000 --> 09:02.000
so that you can help narrow down which catalog to actually choose.

09:02.000 --> 09:04.000
So the biggest consideration,

09:04.000 --> 09:06.000
the biggest question that you should ask yourself is,

09:06.000 --> 09:07.000
how lazy are you feeling?

09:07.000 --> 09:12.000
And by that, I mean, do you want to commit to running a catalog

09:12.000 --> 09:14.000
that requires additional software or external resources?

09:14.000 --> 09:17.000
Or if it does require external services,

09:17.000 --> 09:19.000
is there a managed version?

09:19.000 --> 09:22.000
So how involved do you want to be in actually operating this catalog

09:22.000 --> 09:24.000
in the long term?

09:24.000 --> 09:27.000
The next question is, speaking of other software,

09:27.000 --> 09:29.000
the next thing you should consider is,

09:29.000 --> 09:33.000
how are the catalogs integrating with other tooling that you want to be using?

09:33.000 --> 09:37.000
Other environments, compute engines that we'll get into shortly?

09:37.000 --> 09:43.000
Using certain catalogs can corner you into using certain cloud environments,

09:43.000 --> 09:45.000
for example, or compute engines.

09:45.000 --> 09:49.000
So you might want to consider that when you're choosing one.

09:49.000 --> 09:52.000
The next question to consider is,

09:52.000 --> 09:54.000
does it handle the basics?

09:54.000 --> 09:57.000
Hopefully your answer is yes.

09:57.000 --> 10:00.000
Does it do the thing that you want it to do?

10:00.000 --> 10:02.000
This might come as a surprise to some of you,

10:02.000 --> 10:05.000
but the atomic transactions that we like about iceberg

10:05.000 --> 10:08.000
aren't a given for each of the catalogs.

10:08.000 --> 10:11.000
Some catalogs don't handle multi-table statements very well,

10:11.000 --> 10:15.000
or multi-statement transactions.

10:15.000 --> 10:18.000
And finally, where are you running this?

10:18.000 --> 10:20.000
Meaning is this a toy use case?

10:20.000 --> 10:21.000
Is this a thing you're playing around with?

10:21.000 --> 10:23.000
Or do you want to actually ship this out to production?

10:23.000 --> 10:26.000
Some catalogs, due to the aforementioned limitations,

10:26.000 --> 10:29.000
aren't really recommended for production use,

10:29.000 --> 10:33.000
or they come with a big asterisk with them.

10:33.000 --> 10:36.000
So, now that we've seen at a high level,

10:36.000 --> 10:38.000
what the key components of iceberg are,

10:38.000 --> 10:40.000
our metadata, and our catalog.

10:40.000 --> 10:44.000
The next thing we could focus on is how to actually work with our data,

10:44.000 --> 10:46.000
and we're going to do that with some sort of query engine.

10:46.000 --> 10:49.000
So, conveniently, the Apache iceberg project

10:49.000 --> 10:53.000
maintains integrations with open source compute engines,

10:53.000 --> 10:56.000
with built-in support for Apache Flink and Apache Spark.

10:56.000 --> 10:58.000
Both are great options out of the box,

10:58.000 --> 11:00.000
if you just want to hit the ground running with iceberg.

11:00.000 --> 11:04.000
But what you choose would ultimately depend on what your tech stack currently looks like,

11:04.000 --> 11:07.000
what your familiarity with either of these technologies actually is.

11:08.000 --> 11:11.000
But as we'll see later, there's a lot of really great features,

11:11.000 --> 11:15.000
especially with Spark, with the iceberg actions that are built in there.

11:15.000 --> 11:19.000
Beyond that, the iceberg ecosystem is very healthy.

11:19.000 --> 11:21.000
There's a number of other query engines,

11:21.000 --> 11:24.000
and tools available to you to actually interact with your iceberg tables.

11:24.000 --> 11:28.000
This is not an exhaustive list, and it's ever, ever growing.

11:28.000 --> 11:29.000
All right.

11:29.000 --> 11:34.000
So, that's sort of the basics of setting the stage for iceberg.

11:34.000 --> 11:36.000
But you might want to actually get started with it,

11:36.000 --> 11:39.000
and you do so by using queries.

11:39.000 --> 11:42.000
It's as simple as writing a query to interact with your iceberg tables.

11:42.000 --> 11:44.000
And for a motivating example,

11:44.000 --> 11:47.000
because we all love those, we're going to say that we're keeping track of data,

11:47.000 --> 11:49.000
related to Arctic cruise bookings.

11:49.000 --> 11:50.000
Isn't that fun?

11:50.000 --> 11:52.000
See, that's so fun.

11:52.000 --> 11:55.000
So, our create-table statement would look like this,

11:55.000 --> 11:57.000
and it's going to feel very familiar, right?

11:57.000 --> 11:58.000
It's SQL.

11:58.000 --> 12:01.000
Note that we're going to be using the syntax for the Spark SQL

12:01.000 --> 12:03.000
that ships with iceberg.

12:03.000 --> 12:05.000
So, we'll keep tabs on here some standard information

12:05.000 --> 12:07.000
for our Arctic cruise booking.

12:07.000 --> 12:10.000
The booking ID, the passenger, how much it actually cost,

12:10.000 --> 12:14.000
and the timestamp at which we're creating this booking.

12:14.000 --> 12:17.000
And an interesting thing to call out here is

12:17.000 --> 12:20.000
the partitioning clause at the end of the statement here.

12:20.000 --> 12:23.000
You might recall from earlier that I mentioned hidden partitioning

12:23.000 --> 12:25.000
as a cool feature of iceberg.

12:25.000 --> 12:28.000
So, what this means is that iceberg can use a transformation

12:28.000 --> 12:32.000
on top of an existing column as the partitioning field.

12:32.000 --> 12:36.000
So, you'll see here we are extracting the hour of that booking time stamp,

12:36.000 --> 12:38.000
and we're going to partition the data based on that.

12:38.000 --> 12:41.000
And it's going to come in handy later when we talk a little bit more

12:41.000 --> 12:43.000
about selecting the data.
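
NOTE
A minimal sketch of the kind of create-table statement being described, in the
Spark SQL syntax that ships with Iceberg; the catalog, column names, and types
are illustrative assumptions:
  CREATE TABLE demo.db.bookings (
    booking_id     BIGINT,
    passenger_name STRING,
    price          DECIMAL(10, 2),
    booking_ts     TIMESTAMP
  ) USING iceberg
  PARTITIONED BY (hours(booking_ts));
The hours(...) transform is the hidden partitioning piece: the table is
partitioned by the hour of booking_ts without storing a separate hour column.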

12:43.000 --> 12:47.000
So, if we have a table, we should then insert some information.

12:47.000 --> 12:50.000
And again, this is sort of a standard insert statement.

12:50.000 --> 12:52.000
And then finally, what we got to go up,

12:52.000 --> 12:56.000
want to go and select information from that iceberg table.

12:57.000 --> 12:59.000
We're going to make things a little interesting,

12:59.000 --> 13:02.000
and we're going to filter by a two week period.

13:02.000 --> 13:04.000
We want all of the results, all the rows from that two week period.

13:04.000 --> 13:07.000
But you notice here that we're filtering by days,

13:07.000 --> 13:09.000
on that booking timestamp.
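
NOTE
A sketch of the two-week filter being described, against the illustrative
table; note that the predicate is on booking_ts itself, not on a partition column:
  SELECT *
  FROM demo.db.bookings
  WHERE booking_ts >= TIMESTAMP '2024-01-01 00:00:00'
    AND booking_ts <  TIMESTAMP '2024-01-15 00:00:00';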

13:09.000 --> 13:13.000
And so, the cool thing about partitioning in iceberg

13:13.000 --> 13:16.000
is that because we are partitioning by a transformation

13:16.000 --> 13:19.000
on that booking timestamp, and we're partitioning by that,

13:19.000 --> 13:22.000
when we actually want to go and select that data later on,

13:22.000 --> 13:24.000
we can take advantage of that partitioning,

13:24.000 --> 13:27.000
that transform on top of an existing field.

13:27.000 --> 13:29.000
We're not creating a separate field,

13:29.000 --> 13:32.000
and the compute engines know to make that transformation

13:32.000 --> 13:36.000
when you're applying any filtering on top of that timestamp.

13:36.000 --> 13:39.000
So, you can take advantage of that.

13:39.000 --> 13:42.000
So, those were the very basics of iceberg.

13:42.000 --> 13:45.000
You can leave now if you want, but I hope you stay.

13:45.000 --> 13:48.000
Because I want to go a little bit beneath the surface,

13:48.000 --> 13:52.000
and scratch beneath it, see how these things actually work,

13:52.000 --> 13:55.000
how these queries are operating when it comes to,

13:55.000 --> 13:58.000
under the hood, with some of the files that we saw earlier.

13:58.000 --> 14:01.000
So, when we look at our create table query.

14:01.000 --> 14:04.000
So, for this example, we're going to imagine that we have a fresh instance.

14:04.000 --> 14:08.000
We have no other files hanging out and no other data files.

14:08.000 --> 14:10.000
So, when we issue that create table statement,

14:10.000 --> 14:12.000
there's nothing here, right?

14:12.000 --> 14:13.000
We just have our running catalog instance,

14:13.000 --> 14:16.000
and we're running some engine of our choice.

14:16.000 --> 14:18.000
In this case, we say it's Spark SQL.

14:18.000 --> 14:21.000
So, when we issue the create table statement to the engine,

14:21.000 --> 14:24.000
we're going to first parse that query, obviously.

14:24.000 --> 14:27.000
And next, the engine is going to reach out to the catalog,

14:27.000 --> 14:30.000
to fetch the metadata for that table.

14:30.000 --> 14:32.000
That table doesn't exist, right?

14:32.000 --> 14:36.000
So, we're not going to actually get any files here, any pointer.

14:36.000 --> 14:40.000
And now, we can start to actually go through and create that first metadata file,

14:40.000 --> 14:42.000
related to that table.

14:42.000 --> 14:45.000
And remember, the metadata file keeps track of the partitioning,

14:45.000 --> 14:48.000
spec that we're actually using, the schema,

14:48.000 --> 14:52.000
that we're specifying based on those columns in that create table query.

14:52.000 --> 14:54.000
And we're going to store that information in the metadata file.

14:54.000 --> 14:58.000
We're also going to create a unique identifier for the table.

14:58.000 --> 15:01.000
And yes, so now we have a metadata file.

15:01.000 --> 15:04.000
We have no additional data files, right?

15:04.000 --> 15:06.000
We haven't actually added any data to the table.

15:06.000 --> 15:10.000
So, we have no manifest file or manifest list to actually track those data files.

15:10.000 --> 15:13.000
We just have our metadata file here.

15:13.000 --> 15:17.000
And then, finally, the most important step is to update that pointer.

15:17.000 --> 15:19.000
So, we have created this table.

15:19.000 --> 15:22.000
And so, we want to reflect that in our catalog.

15:22.000 --> 15:25.000
So, for our bookings table, we're pointing to our metadata file.

15:25.000 --> 15:26.000
Cool.

15:26.000 --> 15:29.000
So, when we go ahead and add data for an insert,

15:29.000 --> 15:31.000
we're picking back up where we left off.

15:31.000 --> 15:36.000
We have this table already created and referenced in the catalog.

15:36.000 --> 15:41.000
So, we're going to submit the query to actually add the data to the table.

15:41.000 --> 15:43.000
First step, we reach out to the catalog.

15:43.000 --> 15:46.000
Find the latest, you know, where this file actually exists.

15:46.000 --> 15:49.000
The metadata file.

15:49.000 --> 15:51.000
And then, with the location of that file on hand,

15:51.000 --> 15:54.000
the engine can go ahead and read the metadata file.

15:54.000 --> 15:57.000
And I know what you're thinking here is that we are adding data files, right?

15:57.000 --> 16:00.000
Why do we have to think about the metadata file that currently exists?

16:00.000 --> 16:03.000
Well, we want to be able to keep track of what this schema is,

16:03.000 --> 16:05.000
based on that create table statement,

16:05.000 --> 16:09.000
and make sure that if we're changing that, we can evolve it properly over time.

16:09.000 --> 16:13.000
And also see what that partitioning spec is for that existing table.

16:13.000 --> 16:17.000
So, we do want to make sure we read the metadata file to get that information.

16:17.000 --> 16:18.000
Okay?

16:18.000 --> 16:22.000
And then from there, we can go ahead and write our data files.

16:22.000 --> 16:25.000
So, this will be written as parquet by default,

16:25.000 --> 16:27.000
although you have flexibility to change that.

16:27.000 --> 16:32.000
And this writing will be done according to the hourly partition spec, right?

16:32.000 --> 16:38.000
So, if we were inserting a number of rows across a number of timestamps,

16:38.000 --> 16:42.000
we would be breaking it up into those partitions and writing them to separate data files.

16:42.000 --> 16:48.000
And at this point, we could also define a local sort order for the data

16:48.000 --> 16:51.000
files that we want to be writing as well.
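
NOTE
A sketch of declaring a write sort order, assuming the Iceberg Spark SQL
extensions are enabled (table and column names remain illustrative):
  ALTER TABLE demo.db.bookings WRITE ORDERED BY booking_ts;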

16:51.000 --> 16:52.000
All right.

16:52.000 --> 16:56.000
From there, we are going to work backwards and create that manifest file,

16:56.000 --> 16:58.000
to track the new data file that we created.

16:58.000 --> 17:01.000
And at this point, again, if we were only adding one row here,

17:01.000 --> 17:03.000
but if we were adding many, many rows,

17:03.000 --> 17:07.000
we would also compute the statistics on top of the columns that we have inserted.

17:07.000 --> 17:10.000
So, we have the minimum and maximum bounds of all those columns,

17:10.000 --> 17:12.000
and we'd be storing that in the manifest file.

17:12.000 --> 17:15.000
And of course, working backwards again, we have our manifest list

17:15.000 --> 17:18.000
that then just points to this one manifest file,

17:18.000 --> 17:23.000
and we roll up those statistics into the manifest list as well.

17:23.000 --> 17:28.000
And they also keep track of more statistics on how many rows were added

17:28.000 --> 17:31.000
or removed during this operation as well.

17:31.000 --> 17:37.000
And then from there, we're not going to touch the metadata file that already exists,

17:37.000 --> 17:40.000
but now we're going to create a new metadata file

17:40.000 --> 17:43.000
that has the same information as the partition spec,

17:43.000 --> 17:44.000
our schema as well.

17:44.000 --> 17:48.000
But here, we're going to keep track of this first snapshot,

17:48.000 --> 17:51.000
because we now have data that's actually stored in our table,

17:51.000 --> 17:55.000
and this snapshot refers to that manifest list.

17:55.000 --> 17:58.000
And then finally, most important step,

17:58.000 --> 18:01.000
we're going to atomically update that pointer in our catalog,

18:01.000 --> 18:06.000
so the booking's table now points to this specific metadata file.

18:07.000 --> 18:09.000
So, next up, our select.

18:09.000 --> 18:14.000
So, similar sort of process, we submit our query to the engine.

18:14.000 --> 18:18.000
We're going to reach out to the catalog to see where this metadata file exists.

18:18.000 --> 18:21.000
We know it is this one at the bottom with that snapshot,

18:21.000 --> 18:24.000
so we're going to go read that latest metadata file.

18:24.000 --> 18:26.000
And when we're reading the metadata file again,

18:26.000 --> 18:28.000
we're going to figure out the schema,

18:28.000 --> 18:32.000
so that we can prepare to actually send back data according to that schema on our engine.

18:32.000 --> 18:36.000
The partitioning scheme, so we can start planning out the query,

18:36.000 --> 18:38.000
and actually accessing the proper data files,

18:38.000 --> 18:40.000
and then the latest snapshot, right,

18:40.000 --> 18:42.000
where are we actually reading from,

18:42.000 --> 18:45.000
or what version of this data that's available to we want to read from.

18:45.000 --> 18:47.000
So, knowing the latest snapshot,

18:47.000 --> 18:48.000
we only have one here,

18:48.000 --> 18:50.000
and it's pointing to this manifest list,

18:50.000 --> 18:53.000
so we can go ahead and read from this manifest list specifically.

18:53.000 --> 18:55.000
And so now from there,

18:55.000 --> 18:57.000
if we had more data in this table,

18:57.000 --> 19:01.000
we'd understand all of the manifest files that are belonging to this snapshot.

19:01.000 --> 19:06.000
And then, and also at that point,

19:06.000 --> 19:09.000
we can scan as part of the query planning,

19:09.000 --> 19:13.000
understand which data files we actually want to be accessing

19:13.000 --> 19:16.000
to serve the query with the proper filtering, right?

19:16.000 --> 19:19.000
If we're filtering here on that two week period,

19:19.000 --> 19:21.000
and if there were more data outside of that,

19:21.000 --> 19:25.000
we could prune off some of the data files, okay?

19:25.000 --> 19:28.000
And then we can finally access the relevant data files here,

19:28.000 --> 19:30.000
scan the data and select the correct rows,

19:30.000 --> 19:35.000
and send that back to the engine for their computation.

19:35.000 --> 19:39.000
And so, since we just saw it actually in action now,

19:39.000 --> 19:42.000
so this is where the hidden partitioning comes into play with that select query,

19:42.000 --> 19:43.000
right?

19:43.000 --> 19:44.000
We're filtering for that two week period,

19:44.000 --> 19:48.000
but we have actually partitioned the data based on hour.

19:48.000 --> 19:50.000
So, just a little bit more on it here,

19:50.000 --> 19:54.000
is that the engine is aware of how the data is partitioned,

19:54.000 --> 19:56.000
and so it can conduct that transform

19:56.000 --> 19:58.000
when we're actually doing that filtering,

19:58.000 --> 20:00.000
and then that is passed along.

20:00.000 --> 20:03.000
So, it's hidden partitioning in that it is hidden from the select

20:03.000 --> 20:07.000
that we don't actually have to be aware of that, right?

20:07.000 --> 20:09.000
So, I wanted to go a little bit deeper into hidden partitioning

20:09.000 --> 20:11.000
and what it actually means for the users,

20:11.000 --> 20:14.000
so if you think about how you might partition data

20:14.000 --> 20:17.000
in the past, a lot of the partitions fields

20:17.000 --> 20:19.000
are going to be derived from an existing field.

20:19.000 --> 20:22.000
So, if we wanted to do an hour partitioning,

20:22.000 --> 20:25.000
based on a timestamp, we would probably actually do that transform,

20:26.000 --> 20:28.000
and store it in a different column,

20:28.000 --> 20:32.000
then partition on that and have both of those columns kind of co-exist.

20:32.000 --> 20:36.000
But when it comes time for the person to actually select that data

20:36.000 --> 20:38.000
and make use of it, the analyst who doesn't care

20:38.000 --> 20:41.000
how you partition that entire, that table,

20:41.000 --> 20:44.000
they're not going to know that it would be way more efficient

20:44.000 --> 20:47.000
for their queries to use, you know,

20:47.000 --> 20:49.000
one of those fields over the other, right?

20:49.000 --> 20:51.000
And so what happens is that they ignore that,

20:51.000 --> 20:53.000
or don't know, or aren't aware of it,

20:53.000 --> 20:55.000
and they're not getting any of the gains that they might get from

20:55.000 --> 20:57.000
knowing of the partitioning.

20:57.000 --> 20:59.000
So, iceberg's hidden partitioning

20:59.000 --> 21:01.000
eliminates that issue,

21:01.000 --> 21:04.000
and also brings some additional functionality.

21:04.000 --> 21:08.000
And I want to note that iceberg can do this

21:08.000 --> 21:12.000
because the partitioning isn't tied to any physical structure

21:12.000 --> 21:14.000
of how the data is stored.

21:14.000 --> 21:17.000
In other words, it's not relying on that directory structure.

21:17.000 --> 21:19.000
So, what do we actually

21:19.000 --> 21:22.000
get out of this hidden partitioning?

21:22.000 --> 21:23.000
There's a lot of things.

21:23.000 --> 21:26.000
The biggest thing is that we can partition the data

21:26.000 --> 21:29.000
based on a transformation of these existing fields.

21:29.000 --> 21:32.000
And again, they're going to be applied by the engines during

21:32.000 --> 21:33.000
query planning.

21:33.000 --> 21:35.000
And so what this means is that also,

21:35.000 --> 21:38.000
we're going to save space when we're actually storing the data

21:38.000 --> 21:40.000
because we don't have to have a separate column

21:40.000 --> 21:42.000
that stores this transform data.

21:42.000 --> 21:44.000
It's done on the fly.

21:44.000 --> 21:47.000
And so, we're reducing what we're actually storing in those.

21:47.000 --> 21:50.000
Another win is that we're protected from another class of bugs

21:51.000 --> 21:54.000
where the derived value isn't written properly or updated properly

21:54.000 --> 21:56.000
or stored properly.

21:56.000 --> 21:58.000
And finally, the business users don't have to play

21:58.000 --> 22:01.000
guessing games on what they actually have to,

22:01.000 --> 22:03.000
what fields they have to query in order to gain the benefits

22:03.000 --> 22:06.000
of that partition pruning in their queries.

22:06.000 --> 22:07.000
But that's not all.

22:07.000 --> 22:10.000
We have another partitioning trick up our sleeves

22:10.000 --> 22:12.000
with partition evolution.

22:12.000 --> 22:14.000
So this means that if you want to change

22:14.000 --> 22:17.000
how you're actually partitioning your data over time,

22:17.000 --> 22:18.000
you can.

22:18.000 --> 22:21.000
We don't just store the current partitioning spec.

22:21.000 --> 22:24.000
We're actually tracking all of the history of the partitioning

22:24.000 --> 22:27.000
specs within our metadata files.

22:27.000 --> 22:30.000
And so multiple partitioning specs

22:30.000 --> 22:33.000
can coexist within one table.

22:33.000 --> 22:35.000
So in the case of multiple partitioning specs,

22:35.000 --> 22:38.000
the filters, the predicates,

22:38.000 --> 22:40.000
will be applied differently based on how those files

22:40.000 --> 22:42.000
are actually partitioned.

22:42.000 --> 22:43.000
And again, this is hidden from the user.

22:43.000 --> 22:45.000
You don't actually have to think about that

22:45.000 --> 22:47.000
when you're querying the data.
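
NOTE
A sketch of evolving the partition spec with the Iceberg Spark SQL extensions;
existing data files keep the old spec, new writes use the new one (table and
column names are still the illustrative ones from earlier):
  -- switch from hourly to daily partitioning
  ALTER TABLE demo.db.bookings DROP PARTITION FIELD hours(booking_ts);
  ALTER TABLE demo.db.bookings ADD PARTITION FIELD days(booking_ts);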

22:47.000 --> 22:49.000
So before we wrap up queries,

22:49.000 --> 22:52.000
another cool feature of iceberg is time travel capabilities.

22:52.000 --> 22:56.000
Meaning that we can query the table as of a certain time

22:56.000 --> 22:57.000
or snapshot.

22:57.000 --> 22:59.000
Because in those metadata files,

22:59.000 --> 23:03.000
as we actually evolve the table over time,

23:03.000 --> 23:05.000
we're keeping track of the old snapshots.

23:05.000 --> 23:08.000
So you can actually go back and see what that table

23:08.000 --> 23:10.000
looked like at a certain point.

23:10.000 --> 23:13.000
And so the first thing you have to do

23:13.000 --> 23:16.000
is actually see which snapshots are available.

23:16.000 --> 23:19.000
And you can see that by querying the system metadata table.

23:19.000 --> 23:22.000
And so you'll see a lot of information on the snapshots

23:22.000 --> 23:24.000
when they were last made current.

23:24.000 --> 23:27.000
And then from there, you can actually select

23:27.000 --> 23:31.000
either of these things when you actually want to query the data.

23:31.000 --> 23:35.000
So as of that timestamp or as of that specific snapshot,

23:35.000 --> 23:36.000
ID.
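
NOTE
A sketch of the time-travel queries being described, in Spark SQL syntax; the
snapshot ID shown is a made-up placeholder:
  -- list available snapshots via the metadata table
  SELECT snapshot_id, committed_at FROM demo.db.bookings.snapshots;
  -- query the table as of a point in time, or as of a specific snapshot
  SELECT * FROM demo.db.bookings TIMESTAMP AS OF '2024-01-10 00:00:00';
  SELECT * FROM demo.db.bookings VERSION AS OF 1234567890123456789;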

23:36.000 --> 23:38.000
So how this happens under the hood and here,

23:38.000 --> 23:41.000
we can assume that we have a couple more snapshots

23:41.000 --> 23:43.000
that are relevant to us.

23:43.000 --> 23:46.000
So if we are going to use the snapshot ID to query,

23:46.000 --> 23:48.000
again, that's going to be parsed by the engine.

23:48.000 --> 23:49.000
We read from the catalog.

23:49.000 --> 23:51.000
And we know that this is our latest metadata file.

23:51.000 --> 23:57.000
But suppose we're actually accessing our zero snapshot.

23:57.000 --> 23:58.000
Okay.

23:58.000 --> 24:01.000
So from there, we know that we are going instead of our S1

24:01.000 --> 24:04.000
manifest list, we go back to the other manifest list

24:04.000 --> 24:07.000
and read from there and collect the correct data files

24:07.000 --> 24:10.000
that are then sent back to the engine.

24:10.000 --> 24:13.000
And I promise we're rounding out queries now.

24:13.000 --> 24:14.000
Bonus one for you.

24:14.000 --> 24:17.000
So now that we know how selects work and how inserts work,

24:17.000 --> 24:19.000
we can look at merge and upsert functionality.

24:19.000 --> 24:22.000
And the traditional example here is that we have a staging table

24:22.000 --> 24:24.000
for our bookings, our Arctic cruise bookings.

24:24.000 --> 24:28.000
We want to periodically merge those into our bookings table.

24:28.000 --> 24:32.000
And so when you think about this sort of functionality,

24:32.000 --> 24:35.000
this query is effectively a few in one, right?

24:35.000 --> 24:38.000
We have our select to see what data we actually have

24:38.000 --> 24:42.000
in our bookings stage table, then it's an update for the relevant rows.

24:42.000 --> 24:46.000
And then it's an insert for the other ones, right?
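
NOTE
A sketch of that "few in one" query as a Spark SQL MERGE, assuming an
illustrative staging table db.bookings_stage with the same schema:
  MERGE INTO demo.db.bookings AS b
  USING demo.db.bookings_stage AS s
    ON b.booking_id = s.booking_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *;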

24:46.000 --> 24:50.000
And so the select and insert are handled exactly as we discussed earlier.

24:50.000 --> 24:52.000
But the update is a little more interesting.

24:52.000 --> 24:56.000
For updates, we can either update things in place, more or less,

24:56.000 --> 24:59.000
or we can issue delete and then reinsert the data.

24:59.000 --> 25:03.000
So these are effectively your options in iceberg and which you choose

25:03.000 --> 25:07.000
will affect how efficiently your read and write operations

25:07.000 --> 25:09.000
will actually perform later on.

25:09.000 --> 25:11.000
So this is going to take us a little bit deeper

25:11.000 --> 25:15.000
and into some additional considerations that you might need to think about.

25:15.000 --> 25:18.000
And so an important part of how your iceberg tables behave over time

25:18.000 --> 25:21.000
is handled with copy on write and merge on read.

25:21.000 --> 25:25.000
So at a high level, these define how your iceberg tables handle updating

25:25.000 --> 25:27.000
and how it deletes data.

25:27.000 --> 25:30.000
And so data files in iceberg, they're immutable.

25:30.000 --> 25:32.000
There's no surprise there.

25:32.000 --> 25:35.000
So when we need to make an update, we can't just access that file

25:35.000 --> 25:39.000
open it and add a new row or change an existing row.

25:39.000 --> 25:42.000
We instead need to make a new file.

25:42.000 --> 25:44.000
Some sort of new file.

25:44.000 --> 25:46.000
And we can handle that in multiple ways.

25:46.000 --> 25:48.000
So the first is with copy on write.

25:48.000 --> 25:49.000
It's what it sounds like.

25:49.000 --> 25:53.000
During the write process, if we're updating rows in a file or adding new rows in a file.

25:53.000 --> 25:57.000
During that write process, we are going to make those updates

25:57.000 --> 26:00.000
and copy the entire file over to a new one.

26:00.000 --> 26:02.000
Okay?

26:02.000 --> 26:08.000
If that sounds inefficient, that's because it could be.

26:08.000 --> 26:10.000
It could be. That's the question.

26:10.000 --> 26:18.000
We're going to copy everything and update only the rows that need to be updated.

26:18.000 --> 26:22.000
If we want to do some sort of update or delete information.

26:22.000 --> 26:25.000
So you might be rewriting a 500 megabyte file.

26:25.000 --> 26:29.000
Just update one row, which, well, that doesn't sound very efficient.

26:29.000 --> 26:33.000
But, you know, for updates and deletes, it can be pretty inefficient.

26:33.000 --> 26:36.000
But for reads, then that's really efficient, right?

26:36.000 --> 26:39.000
We just have to read that new file and we're done, okay?

26:39.000 --> 26:42.000
So copy-on-write is good if you're not handling

26:42.000 --> 26:47.000
updates and deletes often, and you want your files to be nice and neat for your readers later on.

26:47.000 --> 26:50.000
Okay? If you want to do more efficient sparse deletes,

26:50.000 --> 26:53.000
this is going to be handled with merge on read.

26:53.000 --> 26:56.000
So here, we're going to add delete files into the mix,

26:56.000 --> 26:59.000
which tells us which rows to ignore.

26:59.000 --> 27:03.000
And so with this mode, we have a two-step process for our updates.

27:03.000 --> 27:07.000
We first issue a delete in a separate delete file to say ignore that row,

27:07.000 --> 27:10.000
and then we create a new data file that says,

27:10.000 --> 27:13.000
here's the, we're inserting that new row, okay?

27:13.000 --> 27:15.000
With the, with the updates to it, okay?

27:15.000 --> 27:19.000
And then when it comes time to read, well, now it's a little more complicated,

27:19.000 --> 27:22.000
because we have three files to resolve against each other.

27:22.000 --> 27:26.000
One was the new update, the delete and then the existing file, okay?

27:26.000 --> 27:29.000
So it's faster for our rights, but then we have the trade-off

27:29.000 --> 27:33.000
is that we have to put in a little more effort for our read time, okay?
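
NOTE
A sketch of choosing between the two modes via table properties; the settings
apply per operation type and assume a format-version 2 table (table name is
still the illustrative one):
  ALTER TABLE demo.db.bookings SET TBLPROPERTIES (
    'write.delete.mode' = 'merge-on-read',
    'write.update.mode' = 'merge-on-read',
    'write.merge.mode'  = 'copy-on-write'
  );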

27:33.000 --> 27:37.000
And so the other thing that we have to think about here

27:37.000 --> 27:41.000
is that now we're inserting a lot of small files, right?

27:41.000 --> 27:46.000
And so file IO is the biggest killer of your query performance in iceberg.

27:46.000 --> 27:50.000
And so small files, we don't really want them around, okay?

27:50.000 --> 27:55.000
So I wanted to cover here just a little bit of the considerations

27:55.000 --> 27:57.000
around small files.

27:57.000 --> 27:59.000
We just saw that there's a potential downside of that,

27:59.000 --> 28:01.000
especially when we're considering deleting data.

28:01.000 --> 28:05.000
And so, but it's just a natural thing in iceberg,

28:05.000 --> 28:07.000
we're going to have small files at some point,

28:07.000 --> 28:09.000
but we want to be able to reduce them.

28:09.000 --> 28:13.000
It's not just a problem with merge-on-read, okay?

28:13.000 --> 28:16.000
But generally, we want to reduce these files,

28:16.000 --> 28:19.000
and so the solution here is: have fewer small files.

28:19.000 --> 28:22.000
Great, it's simple, just do that.

28:22.000 --> 28:24.000
And so how do we actually do that?

28:24.000 --> 28:27.000
Well, you can use copy-on-write, generally,

28:27.000 --> 28:30.000
that's going to help reduce the number of small files you have floating around.

28:30.000 --> 28:34.000
Wonderful, but that's going to kill the performance of your writes, right?

28:34.000 --> 28:38.000
So to reduce the number of small data files that are going to be created from the get-go,

28:38.000 --> 28:41.000
we could also try to be smarter about our partitioning, right?

28:41.000 --> 28:43.000
Because if you have a lot of partitions,

28:43.000 --> 28:47.000
you could potentially be creating that many data files every time you insert

28:47.000 --> 28:50.000
or do a bulk insert on your iceberg tables.

28:50.000 --> 28:53.000
So you might want to be smarter about your partitions

28:53.000 --> 28:55.000
and reduce your partition size,

28:55.000 --> 28:58.000
but again, there's trade-offs there.

28:58.000 --> 28:59.000
Okay?

28:59.000 --> 29:04.000
So we can actually reduce the number of small files that are floating around by doing compaction.

29:04.000 --> 29:05.000
Okay?

29:05.000 --> 29:09.000
And this is an important part of just regular maintenance on your iceberg tables.

29:09.000 --> 29:13.000
So the thing with compaction is that we're going to read in a bunch of these small data,

29:13.000 --> 29:17.000
these data files, and then we're going to write them to essentially larger files,

29:17.000 --> 29:19.000
again, just from a single partition.

29:19.000 --> 29:20.000
Okay?

29:20.000 --> 29:26.000
So this compaction mechanism that ships with iceberg has a lot of configurations available to you

29:26.000 --> 29:30.000
for what the size of the output files should be, how the tasks are actually operating,

29:30.000 --> 29:32.000
what compaction strategy you use.

29:32.000 --> 29:35.000
So there's a lot of different levers that you can pull here,

29:35.000 --> 29:40.000
just to make the resulting iceberg tables just a little more efficient for your reads and writes.

29:40.000 --> 29:41.000
Okay?

29:42.000 --> 29:44.000
I need to say this is generally something that you should be doing regularly,

29:44.000 --> 29:47.000
as part of your maintaining your iceberg tables.

29:47.000 --> 29:51.000
And then also we have a rewrite manifest files, sort of compaction,

29:51.000 --> 29:55.000
just to reduce the size of those manifest files as well.
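
NOTE
A sketch of the maintenance procedures being described, run through Spark SQL
with the Iceberg extensions; the 512 MB target size is just an example value:
  -- compact small data files into larger ones
  CALL demo.system.rewrite_data_files(
    table => 'db.bookings',
    options => map('target-file-size-bytes', '536870912'));
  -- compact/rewrite the manifest files as well
  CALL demo.system.rewrite_manifests('db.bookings');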

29:55.000 --> 29:56.000
Okay?

29:56.000 --> 29:58.000
So I know I probably have like 10 seconds left,

29:58.000 --> 30:00.000
and so this is the end bit I've got to get into.

30:00.000 --> 30:03.000
So I have a QR code that links to my link tree,

30:03.000 --> 30:10.000
and then there I have a bunch of resources related to iceberg and just my blogging and where I am.

30:11.000 --> 30:16.000
But the most important thing on there that I have linked to my link tree is iceberg summit.

30:16.000 --> 30:19.000
Okay? If you want to learn more about iceberg, this is your chance.

30:19.000 --> 30:26.000
The event is on April 8th and 9th: April 8th in San Francisco, and on April 9th

30:26.000 --> 30:27.000
it's virtual.

30:27.000 --> 30:30.000
So the virtual conference is free to attend.

30:30.000 --> 30:33.000
So if you really want to broaden your knowledge about Apache iceberg,

30:33.000 --> 30:35.000
I would encourage you to register for that,

30:35.000 --> 30:39.000
and join us at the very least on the 9th if not in person on the 8th.

30:39.000 --> 30:45.000
And then the other thing to make note of is that the call for papers for iceberg summit is still open.

30:45.000 --> 30:50.000
So if your company is doing something interesting with iceberg and you want to talk about it,

30:50.000 --> 30:52.000
or if you know someone who should be talking about it,

30:52.000 --> 30:55.000
please consider submitting to the call for papers.

30:55.000 --> 30:58.000
It is open until February 9th.

30:58.000 --> 31:00.000
I mean it, no extensions.

31:00.000 --> 31:03.000
So yeah, so I hope to see some of you submit there.

31:03.000 --> 31:07.000
And at the very least, I hope to see many of you join us virtually on April 9th.

31:07.000 --> 31:10.000
So with that, I am done. I promise. Thank you so much.

31:10.000 --> 31:13.000
Thank you so much.

31:16.000 --> 31:19.000
The next session is going to start in 10 minutes.

31:19.000 --> 31:21.000
I'm going to have you out of your mind.

31:21.000 --> 31:26.000
We're going to be doing a session for accelerating questions.

