WEBVTT

00:00.000 --> 00:17.000
Thank you.

00:18.000 --> 00:26.000
Good afternoon, thank you so much for joining us today.

00:26.000 --> 00:33.000
So we're going to talk about dbt-score, and more generally about dbt metadata.

00:33.000 --> 00:37.000
Before we introduce ourselves, let's start with an observation.

00:37.000 --> 00:44.000
We've been measuring a metric of our Git repository for the past three years.

00:44.000 --> 00:48.000
As you can see, there's a sharp increase starting two years ago.

00:48.000 --> 00:54.000
And what we've been measuring is the quantity of YAML code in this repository.

00:54.000 --> 01:01.000
Now as you might know, YAML kind of became a programming language for data engineers, for some reason.

01:01.000 --> 01:09.000
And there's a strong correlation there with the introduction of dbt in that repository.

01:09.000 --> 01:15.000
Now to reach 150,000 lines of YAML code is quite a feat.

01:15.000 --> 01:23.000
And in this presentation, we're going to talk about why this happened, how it can be problematic, and how we solved it.

01:24.000 --> 01:28.000
I'm Matthieu, and if you didn't notice yet, I'm French.

01:28.000 --> 01:32.000
I live in Amsterdam, together with Jochem.

01:32.000 --> 01:39.000
We don't live together, but we both live there.

01:39.000 --> 01:40.000
All right.

01:40.000 --> 01:42.000
Those are our coding assistants.

01:42.000 --> 01:45.000
And we both work at Picnic Technologies,

01:45.000 --> 01:49.000
an online supermarket, based in Amsterdam.

01:49.000 --> 01:56.000
So, in the spirit of free and open source software, I'm going to give you some takeaways for free already.

01:56.000 --> 02:02.000
We do encourage you to use declarative model properties.

02:02.000 --> 02:09.000
We do want to enforce the consistency of those properties through automated linting.

02:09.000 --> 02:18.000
And finally, we do want to be able to customize those linting rules for dbt metadata.

02:18.000 --> 02:22.000
All with the simplest user experience that we can build.

02:22.000 --> 02:25.000
Now, this might sound a bit abstract.

02:25.000 --> 02:32.000
But we will go over this outline: first introducing dbt, then the need for linting dbt metadata.

02:32.000 --> 02:36.000
Finally, our solution for that.

02:36.000 --> 02:38.000
Let's get started with dbt.

02:38.000 --> 02:40.000
Anyone familiar with that?

02:40.000 --> 02:42.000
Any user of dbt?

02:42.000 --> 02:44.000
All right. Nice, a few.

02:44.000 --> 02:48.000
For the rest of you, let me explain a bit.

02:48.000 --> 02:52.000
Let's build the data platform together.

02:52.000 --> 02:56.000
Starting with a place to store the data: database, data warehouse, data lake.

02:56.000 --> 02:58.000
You name it.

02:58.000 --> 03:00.000
We have some data in there.

03:00.000 --> 03:02.000
We do want to make it usable for analytics.

03:02.000 --> 03:05.000
So, we are throwing SQL at it.

03:05.000 --> 03:08.000
Building new tables, transforming data.

03:08.000 --> 03:12.000
We're building some pizza analytics data warehouse.

03:12.000 --> 03:14.000
So, let's create one table.

03:14.000 --> 03:16.000
Let's create a second one, transforming more data.

03:16.000 --> 03:17.000
Making it usable.

03:17.000 --> 03:18.000
A third one.

03:18.000 --> 03:21.000
And with that, it starts to be pretty usable already.

03:21.000 --> 03:25.000
We can look at pizzas, pizzerias, daily sales.

03:25.000 --> 03:30.000
Once we have it, we are adding some components to this data platform.

03:30.000 --> 03:34.000
Some notebook feature, for example, for machine learning experiments.

03:34.000 --> 03:37.000
Some dashboarding tool to build reports.

03:37.000 --> 03:40.000
Data catalog to find all the data in there.

03:41.000 --> 03:45.000
Some pipelines for ingestion and extraction.

03:45.000 --> 03:49.000
And a data app, maybe, for some interactivity.

03:49.000 --> 03:54.000
This is looking pretty good until issues appear.

03:54.000 --> 03:58.000
For example, data changes and we need to evolve the schema.

03:58.000 --> 04:02.000
Now we have a new column somewhere on the table.

04:02.000 --> 04:07.000
So, someone wrote some SQL query to add this column and backfill the data.

04:08.000 --> 04:12.000
Oh, one day there was, for some reason, no data.

04:12.000 --> 04:14.000
So, someone inserted it manually.

04:14.000 --> 04:15.000
Sure.

04:15.000 --> 04:17.000
And another day, a bad query was run.

04:17.000 --> 04:20.000
So, someone fixed the data.

04:20.000 --> 04:24.000
And this is problematic because it's hard to keep track of these changes.

04:24.000 --> 04:31.000
if they are done manually, as they often are in large data platforms run by many people.

04:31.000 --> 04:36.000
And this is only the beginning of some issues we can have.

04:36.000 --> 04:40.000
Other issues include, and are not limited to: orchestration.

04:40.000 --> 04:42.000
A hard problem to solve.

04:42.000 --> 04:46.000
Out-of-date data, lost data, manual input, as we saw just before.

04:46.000 --> 04:49.000
Quality checks, or the lack of quality checks.

04:49.000 --> 04:54.000
Personal data that should certainly be protected at all costs.

04:54.000 --> 04:57.000
If you nail all of this, then you have performance issues.

04:57.000 --> 05:00.000
Because it's a lot to deal with.

05:00.000 --> 05:03.000
And this is where dbt checks in.

05:03.000 --> 05:06.000
So, dbt does not pretend to solve all of this.

05:06.000 --> 05:10.000
However, it helps a lot to organize ourselves.

05:10.000 --> 05:12.000
dbt means data build tool.

05:12.000 --> 05:19.000
It's a framework written in Python, which aims at giving data engineers and data users

05:19.000 --> 05:26.000
some organization in how they build and transform data in data warehouses.

05:26.000 --> 05:32.000
And in order to do this, the primary observation that dbt makes is to look at this.

05:32.000 --> 05:35.000
Transformation SQL query.

05:35.000 --> 05:39.000
It is really made of two very different parts.

05:39.000 --> 05:42.000
The first part is some metadata.

05:42.000 --> 05:47.000
Basically, this is creating a given table with its schema.

05:47.000 --> 05:49.000
List of columns and their types.

05:49.000 --> 05:55.000
And maybe some additional metadata, such as a primary key constraint.

05:55.000 --> 05:57.000
The second part is logic.

05:57.000 --> 06:00.000
Pure, declarative SQL transformation logic.

06:01.000 --> 06:07.000
And what dbt proposes is to split these two parts into two different files.

06:07.000 --> 06:10.000
An SQL file that contains only the logic.

06:10.000 --> 06:12.000
So only a select statement.

06:12.000 --> 06:14.000
And a YAML file.

06:14.000 --> 06:19.000
There we go, YAML, which contains exactly the same information.

06:19.000 --> 06:25.000
Except organized in a more machine-readable structure.

06:25.000 --> 06:29.000
And what dbt will do with that information is to reconstruct the SQL query.

06:29.000 --> 06:32.000
The SQL query we saw before.

06:32.000 --> 06:40.000
by templating over the pieces of metadata we have in the YAML file.

06:40.000 --> 06:42.000
So far, so good.

06:42.000 --> 06:47.000
Nothing magic going on there, just some templating mechanism.

06:47.000 --> 06:52.000
And well, if dbt was limited to this, it wouldn't be very useful.

06:52.000 --> 06:59.000
However, once we have a YAML file to store metadata, we can add anything else we want in there.

06:59.000 --> 07:06.000
For example, some description of our columns and tables, in order to make them readable for humans.

07:06.000 --> 07:11.000
This description can then be forwarded into the data catalog, automatically.

07:11.000 --> 07:13.000
We can have some additional configuration.

07:13.000 --> 07:21.000
We can define who has access to this data and build an SQL template that will use this information in order to grant access to a given table.

07:21.000 --> 07:24.000
Maybe we can define an owner.

07:24.000 --> 07:26.000
We can define some alerting system.

07:26.000 --> 07:32.000
We could connect a dashboarding system to this table and define links.

07:32.000 --> 07:34.000
Links to this dashboard.

07:34.000 --> 07:37.000
And that dashboard would automatically retrieve the data.

07:37.000 --> 07:40.000
Just because it was declared here.

07:40.000 --> 07:42.000
Columns can also be enhanced.

07:42.000 --> 07:44.000
We can add additional metadata.

07:44.000 --> 07:45.000
We can add tests.

07:45.000 --> 07:49.000
We can say that, hey, this given column,

07:49.000 --> 07:54.000
It should contain unique values and so on and so on.
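
The kind of YAML metadata described above might look like this; the model name, the `meta` keys, and the column names here are illustrative, not taken from the talk's actual project:

```yaml
# models/pizzas.yml -- the metadata side of the split; the paired
# pizzas.sql file holds only a SELECT statement.
# The custom keys under "meta" are illustrative user-defined metadata.
version: 2
models:
  - name: pizzas
    description: "Daily pizza sales, one row per pizza per day."
    meta:
      owner: data-team@example.com
      grant_read_access: ["analysts"]
    columns:
      - name: id
        description: "Primary key."
        data_tests:
          - unique
          - not_null
      - name: daily_sales
        description: "Number of pizzas sold that day."
```

Everything beyond `name`, `description`, `columns`, and `data_tests` is free-form here: dbt passes `meta` through untouched, which is what lets custom automation pick it up.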

07:54.000 --> 07:59.000
What I'm illustrating here is that data models are first class citizens.

07:59.000 --> 08:08.000
In the sense that dbt allows us to program these models and enhance them with whatever automation we need.

08:08.000 --> 08:11.000
So this is dbt to us.

08:11.000 --> 08:17.000
It is a bit more, but this idea of programming models is pretty neat.

08:17.000 --> 08:23.000
By default, dbt comes bundled with features to build data.

08:23.000 --> 08:27.000
To generate the CREATE TABLE AS statements for the logic.

08:27.000 --> 08:33.000
To automatically generate the SQL to run tests as defined in the YAML file.

08:33.000 --> 08:39.000
To generate a catalogue, a human-readable catalogue, to navigate the data schemas.

08:39.000 --> 08:44.000
And with this idea, we can also build our own automation on top of this metadata.

08:44.000 --> 08:46.000
Really for anything we want.

08:47.000 --> 08:54.000
Let's jump over now to the need for linting this metadata.

08:54.000 --> 09:00.000
And let's look at these two models, pizzas and pizzerias, that we saw before.

09:00.000 --> 09:02.000
We are forgetting the SQL for now.

09:02.000 --> 09:05.000
Just looking at the metadata, the YAML file.

09:05.000 --> 09:13.000
And to the untrained eye, it's pretty hard to spot mistakes in those YAML files.

09:13.000 --> 09:17.000
However, if I point them out, they become a bit obvious.

09:17.000 --> 09:21.000
For example, the one on the right is missing capitalization in its documentation.

09:21.000 --> 09:26.000
Of course, I'm nitpicking here, but it does not look good in the catalogue.

09:26.000 --> 09:29.000
This one is a bit more important. There's a data leak.

09:29.000 --> 09:33.000
We are granting read access to this data to all.

09:33.000 --> 09:36.000
And that's probably an issue.

09:36.000 --> 09:40.000
There's an invalid owner; null is no good.

09:41.000 --> 09:45.000
There's no dashboard linked to it, so is this data even useful?

09:45.000 --> 09:48.000
Oh, there's also no primary key.

09:48.000 --> 09:53.000
Bad data modeling practice. And missing tests.

09:53.000 --> 09:56.000
Zero quality tests on this data.

09:56.000 --> 10:01.000
So as you can see, we have already spotted six of these little details.

10:01.000 --> 10:06.000
Little mistakes in this metadata, out of 20 lines of YAML.

10:06.000 --> 10:11.000
Remember those 150,000 lines of YAML in one repository.

10:11.000 --> 10:15.000
That's the potential for lots of errors.

10:15.000 --> 10:21.000
And this is basically our problem statement.

10:21.000 --> 10:22.000
Quick summary.

10:22.000 --> 10:27.000
dbt offers us the ability to have first-class citizen data models.

10:27.000 --> 10:30.000
And metadata management.

10:30.000 --> 10:34.000
However, at scale with thousands of new models,

10:35.000 --> 10:38.000
we need to enforce the consistency of the metadata.

10:38.000 --> 10:43.000
We need to run quality checks on this metadata for various

10:43.000 --> 10:46.000
applications including data security,

10:46.000 --> 10:49.000
which is probably one of the most important ones.

10:49.000 --> 10:51.000
Now in order to tackle this issue,

10:51.000 --> 10:54.000
I will leave the mic to Jochem.

10:54.000 --> 10:56.000
Here you go.

10:56.000 --> 10:58.000
Thank you.

10:58.000 --> 11:02.000
Yeah, that brings us to the main topic of this talk,

11:02.000 --> 11:04.000
which is dbt-score.

11:04.000 --> 11:06.000
So Matthieu explained it very well.

11:06.000 --> 11:08.000
We are avid users of dbt,

11:08.000 --> 11:10.000
and we've been using it for quite some time now.

11:10.000 --> 11:14.000
And at scale, the metadata becomes a mess.

11:14.000 --> 11:17.000
You might have noticed this yourself already,

11:17.000 --> 11:20.000
but we wanted to solve this issue,

11:20.000 --> 11:24.000
and it doesn't work having to review each PR,

11:24.000 --> 11:28.000
like the YAML files, with a magnifying glass.

11:28.000 --> 11:31.000
So we built a linter called dbt-score.

11:31.000 --> 11:33.000
You can find us on GitHub,

11:33.000 --> 11:35.000
at PicnicSupermarket/dbt-score,

11:35.000 --> 11:37.000
so feel free to have a look.

11:37.000 --> 11:40.000
It's 100% written in Python,

11:40.000 --> 11:42.000
which is great, I think.

11:42.000 --> 11:43.000
You might be asking why,

11:43.000 --> 11:45.000
because maybe you're a Java developer,

11:45.000 --> 11:46.000
and you don't care.

11:46.000 --> 11:49.000
But chances are you are running dbt,

11:49.000 --> 11:50.000
which is a Python environment,

11:50.000 --> 11:54.000
so this is also easily installable in your Python environment.

11:54.000 --> 11:57.000
The idea is that you install dbt-score

11:57.000 --> 11:59.000
in your dbt project,

11:59.000 --> 12:02.000
and then you can really easily run it.

12:02.000 --> 12:03.000
So it looks like this.

12:03.000 --> 12:05.000
pip install dbt-score, it's that easy.

12:05.000 --> 12:08.000
You can also use a package manager or something,

12:08.000 --> 12:10.000
Poetry, uv, I don't mind.

12:10.000 --> 12:12.000
You can easily install it,

12:12.000 --> 12:15.000
and then it gives you a command-line tool.

12:15.000 --> 12:17.000
So you can run it from the command line,

12:17.000 --> 12:18.000
and it looks like this.

12:18.000 --> 12:21.000
So currently, two main commands: lint and list.

12:21.000 --> 12:24.000
Lint is, I think, obvious, that's the goal here,

12:24.000 --> 12:26.000
and you can also list all the rules

12:26.000 --> 12:29.000
that are in your dbt-score project.

12:29.000 --> 12:32.000
I will go over that later.

12:32.000 --> 12:36.000
So let me demonstrate real quick what it does.

12:36.000 --> 12:40.000
Once you run dbt-score lint in your dbt project,

12:40.000 --> 12:43.000
what it will do in the background is it will gather all the models

12:43.000 --> 12:45.000
and sources in your dbt project,

12:45.000 --> 12:47.000
and it will iterate over them.

12:47.000 --> 12:50.000
Then in your, kind of, dbt-score project,

12:50.000 --> 12:54.000
so not your dbt project, you have configured a set of rules,

12:54.000 --> 12:59.000
and these rules are applied to each model and/or source.

12:59.000 --> 13:02.000
So it is a linter.

13:02.000 --> 13:04.000
What do you expect from a linter?

13:04.000 --> 13:05.000
Of course.

13:05.000 --> 13:08.000
But then after applying all those rules,

13:08.000 --> 13:10.000
there is a score calculated for each model.

13:10.000 --> 13:13.000
So in the case of pizzerias,

13:13.000 --> 13:16.000
that was the ugly model that Matthieu was talking about,

13:16.000 --> 13:19.000
you can see that two rules have failed.

13:19.000 --> 13:23.000
So the first one is about columns not having a description.

13:23.000 --> 13:26.000
As you can see, the ID column doesn't have a description,

13:26.000 --> 13:28.000
and it is shown in the output.

13:28.000 --> 13:30.000
And the other one is that the model is missing an owner.

13:30.000 --> 13:33.000
It was null, I think, in the example.

13:33.000 --> 13:36.000
So that is basically how it works: a score is calculated,

13:36.000 --> 13:39.000
Based on the rules that have passed and failed,

13:39.000 --> 13:42.000
then based on the score, a badge,

13:42.000 --> 13:46.000
in this case the silver medal emoji,

13:46.000 --> 13:49.000
is awarded, hopefully to kind of gamify

13:49.000 --> 13:54.000
getting your metadata a little bit into a good state,

13:54.000 --> 13:58.000
because nobody loves fixing their metadata, of course.

13:58.000 --> 14:04.000
And in the end, your project is also scored by an average of models.
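
The scoring-and-badge mechanism just described can be sketched like this; the severity weights and badge thresholds below are invented for illustration, not dbt-score's actual numbers:

```python
from __future__ import annotations

# Illustrative severity weights -- not dbt-score's real values.
SEVERITY_WEIGHT = {"low": 1, "medium": 2, "high": 4}

def model_score(results: list[tuple[str, bool]]) -> float:
    """Severity-weighted 0-10 score from (severity, passed) rule results."""
    total = sum(SEVERITY_WEIGHT[sev] for sev, _ in results)
    earned = sum(SEVERITY_WEIGHT[sev] for sev, passed in results if passed)
    return 10.0 if total == 0 else 10.0 * earned / total

def badge(score: float) -> str:
    """Map a score to a medal; thresholds are illustrative."""
    if score >= 10.0:
        return "gold"
    if score >= 8.0:
        return "silver"
    if score >= 6.0:
        return "bronze"
    return "work in progress"
```

The project score would then be an average over the per-model scores, as described in the talk.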

14:04.000 --> 14:08.000
By the way, below is kind of the result of the lint command.

14:08.000 --> 14:11.000
In this case, it is an error, because it is below the threshold

14:11.000 --> 14:13.000
that we set in our project.

14:13.000 --> 14:15.000
I will go over it later.

14:15.000 --> 14:18.000
If you look closely, you can see that the project score is a nine,

14:18.000 --> 14:20.000
while there is only one model, with an eight,

14:20.000 --> 14:22.000
so that doesn't make any sense.

14:22.000 --> 14:24.000
But that means there is more happening in the background,

14:24.000 --> 14:27.000
but we don't want to overwhelm the user.

14:27.000 --> 14:30.000
So this is what it looks like with two models,

14:30.000 --> 14:33.000
and as you can see, there are way more rules being applied

14:33.000 --> 14:35.000
than was previously shown,

14:35.000 --> 14:38.000
because we don't want to bother everyone with the OK statements.

14:38.000 --> 14:40.000
They don't need fixing,

14:40.000 --> 14:43.000
but you can see what is happening in the background.

14:43.000 --> 14:46.000
The pizza's model scored a 10.0,

14:46.000 --> 14:49.000
and was awarded the gold medal.

14:49.000 --> 14:54.000
So this is like the holy grail for every dbt developer.

14:54.000 --> 15:01.000
So that's great, and it gives a good insight on what needs to be fixed.

15:01.000 --> 15:05.000
We think that dbt-score should be highly configurable.

15:05.000 --> 15:09.000
Namely, every data platform is different.

15:09.000 --> 15:12.000
Everybody has different rules, ways of doing stuff.

15:12.000 --> 15:16.000
So it's kind of, with other linters,

15:16.000 --> 15:18.000
you can define a set of rules,

15:18.000 --> 15:20.000
and then apply it to every project.

15:20.000 --> 15:23.000
But for data projects, it's a little bit different in our opinion.

15:23.000 --> 15:27.000
So this tool was designed to kind of give the power to the user,

15:27.000 --> 15:30.000
and to set it up themselves.

15:30.000 --> 15:32.000
So it's highly configurable with a,

15:32.000 --> 15:34.000
well, we have some configuration in pyproject.toml.

15:34.000 --> 15:38.000
So you can kind of tell dbt-score where to get the rules from.

15:38.000 --> 15:41.000
You can also disable rules that are already packaged

15:41.000 --> 15:46.000
in dbt-score, but you can also import a package of rules,

15:46.000 --> 15:48.000
and then disable some of them.

15:48.000 --> 15:50.000
The sky's the limit.

15:50.000 --> 15:55.000
Then there are the fail-under parameters or configurations,

15:55.000 --> 16:00.000
and that's basically to tell dbt-score when it needs to fail with a non-zero exit code.

16:00.000 --> 16:04.000
So this is very useful when you want to run it in CI, for example,

16:04.000 --> 16:06.000
which is what we do.

16:06.000 --> 16:09.000
There's more: the badges. So we are using medals,

16:09.000 --> 16:11.000
but you can also use a carrot and stick.

16:11.000 --> 16:15.000
Whatever you want, I don't judge.

16:15.000 --> 16:19.000
You can use anything, but yeah, there are three medals,

16:19.000 --> 16:24.000
and then the last one is kind of, well, work in progress.
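
Pulled together in pyproject.toml, such a configuration could look roughly like this; the option names follow my reading of the dbt-score documentation, so treat this as a sketch and verify them against the docs:

```toml
# Sketch of a dbt-score configuration in pyproject.toml.
# Option names should be double-checked against the dbt-score docs.
[tool.dbt-score]
rule_namespaces = ["dbt_score.rules", "my_project.rules"]  # where rules load from
disabled_rules = ["dbt_score.rules.generic.columns_have_description"]
fail_project_under = 7.5    # non-zero exit code if the project scores below this
fail_any_item_under = 6.0   # ... or if any single model/source scores below this

[tool.dbt-score.badges]
third.threshold = 6.0
third.icon = "🥉"           # swap in a carrot, a stick, whatever you like
```

This is the CI-friendly piece: the fail-under thresholds are what turn a low score into a failing pipeline.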

16:24.000 --> 16:29.000
Oh, that was weird. Okay, looks good.

16:29.000 --> 16:34.000
You can also configure rules, so each rule has a severity.

16:34.000 --> 16:37.000
It's just kind of to calculate the score.

16:37.000 --> 16:39.000
It's kind of the weight of the rule.

16:39.000 --> 16:40.000
So that can be configured.

16:40.000 --> 16:44.000
You can also configure some rules that have extra input parameters,

16:44.000 --> 16:45.000
if you have a rule.

16:45.000 --> 16:46.000
Like that.

16:46.000 --> 16:50.000
For example, let's say you want to have a rule,

16:50.000 --> 16:52.000
and it's used in multiple projects,

16:52.000 --> 16:57.000
then what can happen is maybe in Project A,

16:57.000 --> 17:03.000
you want to have a check for like a 300 max lines in the SQL code,

17:03.000 --> 17:06.000
but in Project B you allow it to have 500 or something.

17:06.000 --> 17:09.000
So it's more to kind of configure it more freely.

17:09.000 --> 17:12.000
Last but not least, you can also skip rules based on model metadata.

17:12.000 --> 17:17.000
So what we encountered a lot is that people said to us like,

17:17.000 --> 17:20.000
yeah, but I have all these rules, but they don't apply

17:20.000 --> 17:22.000
to this schema or something like this.

17:22.000 --> 17:25.000
So you can very easily configure your rules in a way

17:25.000 --> 17:28.000
that it can skip certain schemas or whatever models

17:28.000 --> 17:32.000
that start with a Z. The sky's the limit.

17:33.000 --> 17:36.000
So let's go back to the example of Matthieu.

17:36.000 --> 17:39.000
The pizzerias one doesn't look good,

17:39.000 --> 17:43.000
and mainly the goal is to help the creator of pizzerias

17:43.000 --> 17:46.000
to fix their model easily.

17:46.000 --> 17:49.000
So let's try to create some new rules,

17:49.000 --> 17:53.000
because I think this is the most important part of dbt-score.

17:53.000 --> 17:56.000
Creating rules is very simple,

17:56.000 --> 18:00.000
because we think that every project has a different set of rules.

18:00.000 --> 18:03.000
Probably depending on the company or whatever.

18:03.000 --> 18:07.000
So let's try to fix two of them by creating new rules.

18:07.000 --> 18:09.000
So this is a rule.

18:09.000 --> 18:16.000
It's just a function, that's it, with a rule decorator,

18:16.000 --> 18:19.000
and that's basically all there is to it.

18:19.000 --> 18:22.000
So the rule decorator does all the magic,

18:22.000 --> 18:25.000
so you don't have to care about how it is implemented.

18:25.000 --> 18:29.000
It tells dbt-score: this is a function that should be a rule,

18:29.000 --> 18:34.000
and you need to apply it to all models and or sources.

18:34.000 --> 18:38.000
So the name of the function will be the name of the rule.

18:38.000 --> 18:41.000
The model needs to be the input parameter,

18:41.000 --> 18:45.000
because you make assertions on the model metadata,

18:45.000 --> 18:47.000
and it always returns a rule violation or not.

18:47.000 --> 18:52.000
So those are, kind of, the restrictions of creating a rule.

18:52.000 --> 18:56.000
In this case, we are looking to create a rule to check if the description

18:56.000 --> 19:01.000
is capitalized. Yeah, the description.

19:01.000 --> 19:05.000
So that's line number three, that should be a capital letter.

19:05.000 --> 19:08.000
So what we do is basically we check the description of the model.

19:08.000 --> 19:14.000
It's a property of the model, and we check if model.description is title-cased,

19:14.000 --> 19:17.000
which is basically, does it start with a capital letter.

19:17.000 --> 19:19.000
If so, then do nothing.

19:19.000 --> 19:22.000
If it doesn't start with a capital letter,

19:22.000 --> 19:26.000
we return a rule violation with a sensible error message.

19:26.000 --> 19:31.000
So this is basically what shows up to the user when linting.
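
As a self-contained sketch of that rule, with minimal stand-ins for the Model, RuleViolation, and rule-decorator pieces that in reality come from the dbt-score library:

```python
from __future__ import annotations
from dataclasses import dataclass

# Minimal stand-ins for dbt-score's types, for illustration only.
@dataclass
class Model:
    name: str
    description: str

@dataclass
class RuleViolation:
    message: str

def rule(func):
    """Stand-in for dbt-score's @rule decorator: marks func as a rule."""
    func.is_rule = True
    return func

@rule
def model_description_is_capitalized(model: Model) -> RuleViolation | None:
    """The model description should start with a capital letter."""
    if model.description and not model.description[0].isupper():
        return RuleViolation(message="Description should start with a capital letter.")
    return None  # no violation: the rule passes
```

Evaluating it against a model whose description starts lowercase returns a RuleViolation with that message; returning None counts as a pass.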

19:31.000 --> 19:34.000
So that was easy, this is a little bit more complicated rule,

19:34.000 --> 19:37.000
but still, business logic is quite easy.

19:37.000 --> 19:41.000
As you can see, same concept, we still have the rule decorator,

19:41.000 --> 19:44.000
but this time we configure the severity.

19:44.000 --> 19:49.000
So this is a low severity rule, which means it's less important,

19:49.000 --> 19:52.000
which means it has less impact on the score,

19:52.000 --> 19:56.000
once a model fails this rule, if that makes sense.

19:56.000 --> 20:00.000
Once again, model as input parameter, rule violation,

20:00.000 --> 20:03.000
or none as a result.

20:03.000 --> 20:07.000
So what happens here is that we kind of want to gather all the columns

20:07.000 --> 20:10.000
that are missing a test.

20:10.000 --> 20:13.000
So what we do is we loop over the columns, then we check,

20:13.000 --> 20:16.000
does this column have a data test property.

20:16.000 --> 20:19.000
You are all very familiar with this property, of course,

20:19.000 --> 20:22.000
because you write a lot of tests in dbt.

20:22.000 --> 20:26.000
Then if the column does not have this data test property,

20:26.000 --> 20:30.000
you add it to the list of missing columns,

20:30.000 --> 20:34.000
and then in the end, you kind of join them into one string,

20:34.000 --> 20:37.000
so you can return to the user what columns were violating this rule.

20:37.000 --> 20:40.000
So the user can immediately see, like, ah, okay, yeah,

20:40.000 --> 20:43.000
those were the columns, and I need to fix that.
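
A self-contained sketch of that second rule, again with stand-in types in place of dbt-score's own:

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Minimal stand-ins for dbt-score's types, for illustration only.
@dataclass
class Column:
    name: str
    data_tests: list[str] = field(default_factory=list)

@dataclass
class Model:
    name: str
    columns: list[Column] = field(default_factory=list)

@dataclass
class RuleViolation:
    message: str

def columns_have_tests(model: Model) -> RuleViolation | None:
    """Low-severity rule sketch: every column should carry at least one test."""
    missing = [col.name for col in model.columns if not col.data_tests]
    if missing:
        # Report all offending columns in one message, as in the talk.
        return RuleViolation(message=f"Columns without tests: {', '.join(missing)}")
    return None
```

In dbt-score itself the severity would be set on the decorator rather than inside the function; here only the gather-and-report logic is sketched.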

20:43.000 --> 20:45.000
So I guess creating rules is very simple,

20:45.000 --> 20:48.000
and the idea is that if you use dbt-score,

20:48.000 --> 20:52.000
that you create a set of rules in your own project.

20:52.000 --> 20:57.000
So it looks like this after we run dbt-score again.

20:57.000 --> 20:59.000
As you can see, there are now more rules,

20:59.000 --> 21:02.000
and pizzerias scores even lower, because, well,

21:02.000 --> 21:04.000
the metadata wasn't fixed yet.

21:04.000 --> 21:09.000
So it only shows now that pizzerias is not doing well.

21:09.000 --> 21:12.000
So creating a rule just

21:12.000 --> 21:15.000
makes it easy to find the problems,

21:15.000 --> 21:17.000
but now it's up to the user to fix it themselves.

21:17.000 --> 21:19.000
I'm not going to show you how to do that.

21:19.000 --> 21:22.000
I think you will be able to do that yourself.

21:22.000 --> 21:25.000
But now pizzerias also scored a seven,

21:25.000 --> 21:27.000
because now it has more failing rules.

21:27.000 --> 21:32.000
It was rewarded with a bronze medal, which is not too cool.

21:32.000 --> 21:35.000
So yeah, I think the owner of pizzerias has some work to do,

21:35.000 --> 21:38.000
although we don't know who it is, because, yeah,

21:38.000 --> 21:41.000
the model is missing an owner.

21:41.000 --> 21:45.000
And in the end, the project also scored an 8.5.

21:45.000 --> 21:47.000
Yeah.

21:47.000 --> 21:50.000
Yeah, so this is basically how a user interacts with the tool,

21:50.000 --> 21:53.000
but we also think it should be automatable.

21:53.000 --> 21:57.000
So you want to integrate it with whatever system you have running,

21:57.000 --> 21:59.000
or anywhere else.

21:59.000 --> 22:02.000
In our case, we like to use it in CI,

22:02.000 --> 22:07.000
because we actually run dbt-score as part of the pull request builds and stuff,

22:07.000 --> 22:11.000
so we don't get shitty metadata in prod.

22:11.000 --> 22:14.000
So we use this to kind of run dbt-score,

22:14.000 --> 22:16.000
then output it into a JSON file,

22:16.000 --> 22:18.000
and then we can parse this JSON file easily.

22:18.000 --> 22:24.000
We know what it looks like, and then we can just kind of check if all models are okay.
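
That CI check can be sketched as below; the JSON shape assumed here, a "models" mapping with per-model "score" values, is an illustration, not dbt-score's guaranteed output format, so adapt it to the actual JSON your version emits:

```python
import json

# Sketch of a CI gate over dbt-score's JSON output.
# The report structure assumed here is illustrative.

def failing_models(report: dict, threshold: float = 6.0) -> list:
    """Names of models scoring below the threshold."""
    return [name for name, info in report.get("models", {}).items()
            if info.get("score", 0.0) < threshold]

def gate(report_path: str, threshold: float = 6.0) -> int:
    """Exit-code style result: 1 if any model falls below the threshold, else 0."""
    with open(report_path) as f:
        report = json.load(f)
    return 1 if failing_models(report, threshold) else 0
```

In practice dbt-score's own fail-under configuration can do this directly; parsing the JSON yourself is useful when you want custom reporting in the pull request.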

22:24.000 --> 22:28.000
You can think of more ways to integrate it.

22:28.000 --> 22:31.000
For example, using it in a data catalog,

22:31.000 --> 22:36.000
is anybody using dbt's built-in data catalog?

22:36.000 --> 22:39.000
Yes, some people, okay. Yeah, we as well.

22:39.000 --> 22:41.000
And yeah, we put the badge in there.

22:41.000 --> 22:45.000
We think it's fun to have data practitioners build something,

22:45.000 --> 22:50.000
and show, like, the score that they got in the data catalog.

22:50.000 --> 22:53.000
So that's really cool, I think.

22:53.000 --> 22:57.000
Last but not least, we have an amazing documentation website,

22:57.000 --> 22:59.000
so if you want to get started on this,

22:59.000 --> 23:03.000
Literally everything I said today

23:03.000 --> 23:06.000
is also there, but then in written format, and even more.

23:06.000 --> 23:09.000
So feel free to have a look.

23:09.000 --> 23:14.000
You can find the links on this slide, if the black screen goes away.

23:14.000 --> 23:18.000
So dbt-score.picnic.tech is the documentation website,

23:18.000 --> 23:20.000
and you can find us on GitHub.

23:20.000 --> 23:24.000
Last but not least, we have a group of contributors already,

23:24.000 --> 23:26.000
and just so all of you know.

23:26.000 --> 23:29.000
We love working together with others on this project,

23:29.000 --> 23:32.000
so if you like Python, if you don't like Python,

23:32.000 --> 23:37.000
you can also join, but maybe less fun for all of us.

23:37.000 --> 23:41.000
Yeah, feel free to contribute with us.

23:41.000 --> 23:44.000
We like working together, open an issue,

23:44.000 --> 23:47.000
pull requests, whatever, we're happy to think along.

23:47.000 --> 23:49.000
Thanks for listening, everyone.

23:49.000 --> 23:51.000
Thank you.

23:56.000 --> 23:58.000
And I don't know if there's time for questions,

23:58.000 --> 24:00.000
but we have six minutes.

24:00.000 --> 24:02.000
So if any.

24:02.000 --> 24:05.000
You talked about linting YAML files.

24:05.000 --> 24:08.000
How do you handle linting in the SQL code,

24:08.000 --> 24:11.000
as, I think, in our platform,

24:11.000 --> 24:14.000
on the platform, we have to connect to the database,

24:14.000 --> 24:17.000
and then sometimes the code breaks,

24:17.000 --> 24:20.000
[inaudible]

24:20.000 --> 24:23.000
[inaudible]

24:23.000 --> 24:25.000
Okay.

24:25.000 --> 24:27.000
Yeah, so the.

24:28.000 --> 24:30.000
Sorry.

24:30.000 --> 24:33.000
Repeat the question.

24:33.000 --> 24:34.000
Yeah.

24:34.000 --> 24:37.000
So, how do we handle SQL linting?

24:37.000 --> 24:41.000
Actually, dbt-score is not meant for SQL linting.

24:41.000 --> 24:44.000
We, however, use SQL.

24:44.000 --> 24:46.000
Fluff.

24:46.000 --> 24:49.000
as our SQL linter of choice, but that's, yeah,

24:49.000 --> 24:50.000
kind of a different topic.

24:50.000 --> 24:54.000
We don't expect you to lint SQL with this.

24:54.000 --> 24:56.000
But then, static analysis of SQL is complex.

24:56.000 --> 24:58.000
And SQL, especially with the many different flavors,

24:58.000 --> 25:02.000
you can find; some of them are not even free software.

25:02.000 --> 25:04.000
Like Snowflake SQL, no doubt.

25:04.000 --> 25:06.000
So it's, it's tricky.

25:06.000 --> 25:07.000
Yeah.

25:07.000 --> 25:08.000
Yes.

25:08.000 --> 25:09.000
Yeah.

25:09.000 --> 25:10.000
Thanks for the presentation.

25:10.000 --> 25:12.000
Any questions?

25:12.000 --> 25:14.000
[inaudible]

25:14.000 --> 25:16.000
What was the question?

25:16.000 --> 25:17.000
Of course, you cannot ask that.

25:17.000 --> 25:19.000
But sometimes it's very straightforward.

25:19.000 --> 25:20.000
Yeah.

25:20.000 --> 25:22.000
[inaudible]

25:23.000 --> 25:24.000
Yeah.

25:24.000 --> 25:25.000
That's a good question.

25:25.000 --> 25:29.000
So the question is, if we plan to add the, well,

25:29.000 --> 25:32.000
dbt-score fix, the short answer is no.

25:32.000 --> 25:35.000
The long answer is that we use the manifest.json

25:35.000 --> 25:37.000
that is generated by dbt parse,

25:37.000 --> 25:41.000
and that it's very hard to map back to the YAML files.

25:41.000 --> 25:43.000
So we kind of let dbt do the,

25:43.000 --> 25:46.000
yeah, compilation from YAML to JSON, or, yeah,

25:46.000 --> 25:48.000
JSON format.

25:48.000 --> 25:51.000
So we don't really know exactly anymore,

25:51.000 --> 25:52.000
where the,

25:52.000 --> 25:55.000
for example, the description came from.

25:55.000 --> 25:57.000
So we're not using the real YAML files.

25:57.000 --> 25:59.000
We're using the compiled manifest.
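
The manifest-based approach described above can be sketched as follows. The `manifest.json` here is a hypothetical, heavily trimmed stand-in for what `dbt parse` emits; the point is that it carries only the merged metadata, not the YAML file a description originally came from, which is why writing fixes back to YAML is hard.

```python
import json

# A heavily trimmed, hypothetical stand-in for the manifest.json that
# `dbt parse` writes; real manifests contain far more metadata.
manifest = json.loads("""
{
  "nodes": {
    "model.my_project.customers": {
      "resource_type": "model",
      "name": "customers",
      "description": "One row per customer."
    }
  }
}
""")

def model_descriptions(manifest: dict) -> dict:
    """Map each model's name to its merged description.

    The manifest no longer records which YAML file a description
    originally came from, so a linter reading it cannot easily
    point back to (or rewrite) the source YAML.
    """
    return {
        node["name"]: node.get("description", "")
        for node in manifest["nodes"].values()
        if node["resource_type"] == "model"
    }

print(model_descriptions(manifest))  # {'customers': 'One row per customer.'}
```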

25:59.000 --> 26:00.000
[inaudible]

26:00.000 --> 26:01.000
Yeah?

26:01.000 --> 26:02.000
Thank you.

26:02.000 --> 26:03.000
Hmm.

26:03.000 --> 26:08.000
Now, you mentioned you have 150,000 lines of YAML.

26:08.000 --> 26:11.000
So I'm just, I'm wondering,

26:11.000 --> 26:14.000
how much time it takes to lint every single model

26:14.000 --> 26:16.000
with every single rule.

26:16.000 --> 26:17.000
There you go.

26:17.000 --> 26:20.000
What takes the longest is for dbt to parse the project.

26:20.000 --> 26:24.000
And for this particular 150,000 lines of YAML,

26:24.000 --> 26:27.000
I think it's 30 seconds to 1 minute for dbt to parse.

26:27.000 --> 26:31.000
And then dbt-score runs in seconds on top of it.

26:31.000 --> 26:33.000
Yeah, dbt-score is really fast.

26:33.000 --> 26:35.000
It's mostly dbt being the bottleneck there.

26:35.000 --> 26:39.000
We have not hit any performance problems with dbt-score.

26:39.000 --> 26:40.000
Not yet.

26:40.000 --> 26:45.000
Any other questions?

26:45.000 --> 26:47.000
Go ahead.

26:47.000 --> 26:50.000
So this one is more of a question on dbt-score.

26:50.000 --> 26:53.000
Why do you need it in the example?

26:53.000 --> 26:55.000
I should try it out.

26:55.000 --> 26:57.000
It's one of the tools that you have.

26:57.000 --> 26:58.000
Yeah?

26:58.000 --> 27:01.000
I don't quite know what to do.

27:01.000 --> 27:02.000
[inaudible]

27:02.000 --> 27:06.000
Why do we need a particular rule that every model must have

27:06.000 --> 27:09.000
an example SQL query in the model documentation?

27:09.000 --> 27:11.000
The answer is we want to enforce good documentation

27:11.000 --> 27:14.000
practices for the catalog.

27:14.000 --> 27:16.000
By having an example SQL query,

27:16.000 --> 27:20.000
we give data users, essentially, an entry point

27:20.000 --> 27:23.000
to a given table and how they can start playing with this data.

27:23.000 --> 27:26.000
Yeah, so we expect, in the data catalog of dbt,

27:26.000 --> 27:29.000
that we want all our models to have an example

27:29.000 --> 27:33.000
SQL query to kind of get you started with the model

27:33.000 --> 27:35.000
that you're looking at.

27:35.000 --> 27:38.000
So actually, there in the description should have been an example

27:38.000 --> 27:40.000
SQL query; at least, that's our convention.
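
In dbt-score itself, custom rules are written as Python functions; the check below is only a self-contained sketch of the heuristic described above, not dbt-score's actual API or rule. The SELECT/FROM heuristic is an assumption about how an example query would appear in a description.

```python
def has_example_sql(description: str) -> bool:
    """Heuristic check that a model description embeds an example query.

    Assumption (not dbt-score's real rule): an example query will
    contain a SELECT ... FROM pair somewhere in the description text.
    """
    text = description.lower()
    return "select" in text and "from" in text

documented = "One row per customer. Example: select * from customers limit 10;"
undocumented = "One row per customer."

print(has_example_sql(documented))    # True
print(has_example_sql(undocumented))  # False
```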

27:40.000 --> 27:41.000
That's a good idea.

27:41.000 --> 27:43.000
It shows that you can disable or only enable

27:43.000 --> 27:48.000
for specific models depending on your constraints.
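
Rule selection of this kind is configured for dbt-score in `pyproject.toml`. The rule name below is hypothetical, and the exact keys are best checked against the dbt-score documentation:

```toml
[tool.dbt-score]
# Disable a rule globally (hypothetical rule name, for illustration only).
disabled_rules = ["dbt_score.rules.generic.sql_has_example_query"]
```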

27:48.000 --> 27:49.000
Thank you.

27:49.000 --> 27:50.000
Yes.

27:50.000 --> 27:51.000
Thank you.

28:00.000 --> 28:02.000
Thank you as well.

28:02.000 --> 28:03.000
Let's see.

28:03.000 --> 28:05.000
What's the follow-up question?

28:05.000 --> 28:07.000
I hope you get a follow-up question outside.

28:07.000 --> 28:08.000
Yeah.

28:08.000 --> 28:09.000
Maybe.

28:09.000 --> 28:10.000
Is yours?

28:10.000 --> 28:11.000
Yes.

28:11.000 --> 28:14.000
Thank you.

28:14.000 --> 28:16.000
All right.

28:16.000 --> 28:18.000
All right.

28:18.000 --> 28:21.000
We're going to the next.

28:21.000 --> 28:24.000
[inaudible]

28:24.000 --> 28:27.000
Okay.

28:27.000 --> 28:29.000
All right.

28:29.000 --> 28:30.000
Come on.

28:30.000 --> 28:31.000
All right.

28:31.000 --> 28:32.000
All right.

28:38.000 --> 28:40.000
It's alright.

