WEBVTT

00:00.000 --> 00:22.000
Okay, my name is Joelson. I'm going to talk about a new open source data engineering framework called Data Prep Kit.

00:23.000 --> 00:31.000
Okay, so this Data Prep Kit was released by IBM under the Apache 2.0 license last week.

00:31.000 --> 00:38.000
It was used internally by IBM to prepare their Granite LLM family.

00:38.000 --> 00:42.000
So it was a data engineering tool that they used. They released it as open source last week.

00:42.000 --> 00:46.000
There's three value props with this framework.

00:47.000 --> 00:53.000
Data engineering built on Kubeflow Pipelines. So it's workflow based, which means you can define flows made up of steps.

00:53.000 --> 00:58.000
You take the output of one step and you send it into another depending on what the output is.

00:58.000 --> 01:05.000
So it's a little easier to use and a little more flexible than just raw Python.

01:05.000 --> 01:15.000
The second value prop is scalable compute. So you can build and test your workflows locally and then easily migrate them up to much larger cloud clusters.

01:16.000 --> 01:20.000
And then the third value prop is community.

01:20.000 --> 01:30.000
We can collaborate since we're workflow based. We can collaborate and solve complex data engineering problems facing GenAI, such as determining licensing and copyrights.

01:30.000 --> 01:36.000
Compliance, GDPR, identifying personal information, hate speech, and bias.

01:36.000 --> 01:39.000
So we're just shoveling all our data into our LLM.

01:39.000 --> 01:45.000
We can figure out and design pipelines and try to detect these things before they go into the LLM.

01:46.000 --> 01:49.000
The potential user base for this.

01:49.000 --> 01:58.000
GenAI value creators. So if you don't want to get bogged down in data engineering, you're looking simply to maybe set up a RAG with a bunch of documents in it.

01:58.000 --> 02:04.000
This would be a perfect tool to do this. There's many examples and tutorials for processing that kind of data.

02:04.000 --> 02:08.000
Maybe you're a professional data engineer and you are bogged down in data engineering.

02:08.000 --> 02:13.000
Develop workflows in Python locally, then easily migrate them up to Spark, Ray, or Kubeflow Pipelines.

02:13.000 --> 02:23.000
There's a catalog of existing transforms that operate on big data. So you can come out of the gates with a bunch of stuff you can already leverage without having to rewrite it yourself.

02:23.000 --> 02:27.000
And then the third potential user of this is the AI researchers.

02:27.000 --> 02:32.000
And if you want to collaborate with other researchers on solving some of these data problems, this is a good framework to do it.

02:32.000 --> 02:38.000
And I'll just give a quick shout out to the AI Alliance. The AI Alliance was started in December of 2023.

02:38.000 --> 02:49.000
It has about 100 members right now. It's a nonprofit organization that is promoting open, trusted ML/AI data engineering.

02:49.000 --> 02:51.000
So thank you.

02:51.000 --> 02:53.000
Thank you.