WEBVTT

00:00.000 --> 00:11.260
who's going to be talking for the next segment around ODBC Takes an Arrow to the Knee, not

00:11.260 --> 00:12.260
probably so far.

00:12.260 --> 00:14.260
So, I have a team up here.

00:14.260 --> 00:15.260
Thank you very much.

00:15.260 --> 00:16.260
I agree.

00:16.260 --> 00:17.260
Greetings everybody.

00:17.260 --> 00:18.260
Hi.

00:18.260 --> 00:20.760
And anyone who's watching on the stream.

00:20.760 --> 00:27.620
So, like he said, this is ODBC Takes an Arrow to the Knee. I'm going to be giving a talk,

00:27.620 --> 00:33.740
talking about utilizing Apache Arrow in the context of database connectivity.

00:33.740 --> 00:38.080
Before we get going, just a little bit about who I am and why you should care at all what I

00:38.080 --> 00:40.640
have to say.

00:40.640 --> 00:43.120
So, I'm Matt Topol.

00:43.120 --> 00:51.000
I wrote the book on Apache Arrow, right here. Actually, the publisher sends me copies whenever

00:51.000 --> 00:52.000
I go to conferences.

00:52.000 --> 00:56.920
So if you can find me throughout the conference, I have one copy in my backpack every

00:56.920 --> 01:01.920
day that I will give to someone and sign it whatever because I can.

01:01.920 --> 01:05.560
I'm also a member of the Apache Arrow PMC.

01:05.560 --> 01:14.760
I primarily work a lot on the Go implementation, but also contribute to PyArrow, C++, ADBC itself,

01:14.760 --> 01:20.040
and I'm also a committer on Iceberg, developing the Go-language implementation of Iceberg.

01:20.040 --> 01:26.840
So if you're not familiar with Apache Arrow, quick primer: Apache Arrow is an in-memory

01:26.840 --> 01:32.400
column-oriented data format, it's a spec.

01:32.400 --> 01:38.580
The idea is that you have this in-memory, performant, column-oriented data format that

01:38.580 --> 01:45.180
can be identical on the wire as it is in memory, which enables you to transport data between

01:45.180 --> 01:52.280
different systems with no serialization/deserialization overhead, or very little overhead.

01:52.280 --> 01:58.800
It also has implementations in a truly ridiculous number of languages.

01:58.800 --> 02:05.400
C++, Go, C#, Julia, MATLAB, JavaScript, all have implementations of the Arrow spec,

02:05.400 --> 02:10.720
and many more languages that I didn't mention.

02:10.720 --> 02:16.040
If you're not familiar with the term columnar, or column-oriented, it's very simple to understand:

02:16.040 --> 02:22.240
typically, we're talking about representing the data of a struct or a table in memory.

02:22.240 --> 02:27.920
So the common way of doing that is row-oriented, where the data for each individual column

02:27.920 --> 02:33.480
for a given row is laid out collectively, in order, in contiguous memory.

02:33.480 --> 02:38.400
In the column-oriented layout, the data for each column is contiguous.

02:38.400 --> 02:45.960
Now, the reason for this is that it makes things much more efficient and friendly for vectorization

02:45.960 --> 02:48.800
and SIMD operations.

02:48.800 --> 02:53.360
If you're not familiar with vectorization and SIMD, the idea

02:53.360 --> 03:00.280
being that to do vectorized operations, the data has to already be contiguous in memory.

03:00.280 --> 03:08.640
So by having your general in-memory representation of your data already be contiguous by column,

03:08.640 --> 03:14.280
you can perform vectorized operations without transforming or manipulating the data to make

03:14.280 --> 03:15.920
it work.
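
To make the layout difference concrete, here is a minimal sketch in plain Python. The stdlib `array` module gives contiguous typed buffers, loosely standing in for Arrow's column buffers; the data is purely illustrative.

```python
# Row-oriented vs. column-oriented layout, sketched with the stdlib
# `array` module (contiguous typed buffers, loosely like Arrow's
# column buffers).
from array import array

# Row-oriented: each row's fields sit together, so summing one field
# strides over every row's other fields too.
rows = [(1, 10.0), (2, 20.0), (3, 30.0)]
total_from_rows = sum(price for _id, price in rows)

# Column-oriented: each column is one contiguous buffer, so an
# aggregation touches only the bytes it needs (and a real engine
# can run it with SIMD, no transformation required).
ids = array("q", [1, 2, 3])              # int64 column
prices = array("d", [10.0, 20.0, 30.0])  # float64 column
total_from_columns = sum(prices)

assert total_from_rows == total_from_columns == 60.0
```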

03:15.920 --> 03:22.240
So if you want to do something, say, reading data, you want to read only one or two columns,

03:22.240 --> 03:23.480
not all of the columns.

03:23.480 --> 03:28.280
Well, because the data is contiguous and column-oriented, you don't have to read the extra

03:28.280 --> 03:29.280
columns.

03:29.280 --> 03:33.160
This is one of the reasons why Parquet is very efficient at the I/O level, because Parquet is

03:33.160 --> 03:37.440
column-oriented so you can read just the columns you need.

03:37.440 --> 03:45.600
If you are performing operations, doing aggregations or other computations,

03:45.600 --> 03:52.640
you get a lower memory footprint, because again, you only need

03:52.640 --> 03:54.880
the columns that you're operating on directly.

03:54.880 --> 03:59.440
You can reduce the I/O of what you have to read in and out, and so on.

03:59.440 --> 04:09.040
So now that we've established what Arrow is and column-oriented and the whole idea is there,

04:09.040 --> 04:13.360
we're all on the same page, hopefully on Arrow, at least enough that you understand the

04:13.360 --> 04:15.040
rest of the talk.

04:15.040 --> 04:20.960
So now let's focus on the actual bulk of the talk itself. The most common way that

04:20.960 --> 04:28.320
most applications and systems interact with databases nowadays is ODBC and JDBC, right?

04:28.320 --> 04:34.960
And you know, you have your client API and you have swappable drivers.

04:34.960 --> 04:40.320
So it kind of looks a little bit like this: you know, you have your application, which submits

04:40.320 --> 04:45.040
its query via the JDBC or ODBC interface.

04:45.040 --> 04:51.280
That goes to the API, which you've loaded a driver for. The driver then will translate

04:51.280 --> 04:58.000
that query into whatever form the database needs it to be for whatever specific protocol

04:58.000 --> 05:05.120
the database is using. The database then executes the query, and the results are

05:05.200 --> 05:10.560
sent back in whatever database-specific format it returns.

05:10.560 --> 05:16.880
The driver then translates that into the format that ODBC and JDBC use for their APIs

05:17.760 --> 05:22.320
and then the application can now iterate those results that came back from the driver translating it.

05:24.000 --> 05:30.320
Standard protocols, very straightforward; most of you are probably familiar with all this

05:30.320 --> 05:31.200
and it makes sense.

05:31.280 --> 05:42.000
So there's pros and cons, JDBC and ODBC have been around for decades, they're not going anywhere,

05:42.000 --> 05:45.760
they are standards, they make sense and they're ubiquitous, everyone is using them.

05:47.760 --> 05:53.040
For almost any database system there's probably an ODBC or JDBC driver that exists.

05:54.080 --> 05:58.560
So they're nice and stable; the problem is the conversion cost.

05:58.720 --> 06:08.960
The majority of analytical systems nowadays are all column-oriented, which means that we're going

06:08.960 --> 06:16.960
from column-oriented compute to transposing all the data into rows for ODBC and JDBC,

06:16.960 --> 06:24.400
because they're row-oriented APIs. Yes, ODBC does have a column-oriented interface, but it's

06:24.400 --> 06:29.760
not great, and you have to know what the schema is beforehand, which isn't as good for

06:29.760 --> 06:36.400
single queries, but stay with me here. So we have our analytical data systems,

06:36.400 --> 06:43.200
all column-oriented, and we transpose all that data into rows to pass it along ODBC and JDBC,

06:44.080 --> 06:48.000
and then half the time the application is going to transpose it back into columns for your

06:48.080 --> 06:56.880
visualizations, your BI tools and other compute. That transposition from columns to rows

06:56.880 --> 07:04.880
is expensive. It's a copy, you're copying all of the data and like I said, nearly all

07:06.000 --> 07:11.040
analytical data systems nowadays are column-oriented, whether you're using Snowflake, Druid,

07:11.200 --> 07:17.680
Trino, BigQuery, DuckDB, all of them are column-oriented internally.
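
The transposition being described can be sketched in a few lines; each direction is a full copy of every value (an illustrative stand-in, not any particular driver's code):

```python
# The column -> row transpose a row-oriented API forces, and the
# row -> column transpose the application often does right after.
# Each step visits and copies every single value.
columns = {
    "id": [0, 1, 2, 3, 4],
    "value": [0.0, 1.5, 3.0, 4.5, 6.0],
}

# Column-oriented engine -> row-oriented API (think JDBC ResultSet):
rows = list(zip(columns["id"], columns["value"]))

# ...and back to columns for the BI / visualization layer:
ids, values = (list(col) for col in zip(*rows))

assert ids == columns["id"] and values == columns["value"]
```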

07:19.120 --> 07:23.440
So if you want to avoid that cost, that conversion, you end up having to integrate with specific

07:23.440 --> 07:30.160
SDKs for each and every system. You lose the benefits of the abstraction that ODBC and JDBC are

07:30.240 --> 07:38.240
giving you. So you can see where I'm going with this, given the nature of the talk: there's another way,

07:43.280 --> 07:53.280
ADBC, Arrow Database Connectivity. Conceptually the same idea as ODBC and JDBC: a single client API,

07:54.320 --> 07:59.280
swap out the drivers for whatever individual data system you want to communicate with. The difference

07:59.360 --> 08:04.720
being that all the data communication through the system and through the API is built around the

08:04.720 --> 08:09.360
Apache Arrow and its column-oriented nature, which means that your column-oriented data system

08:09.920 --> 08:15.680
can now just pass through the data and not have that conversion cost. But it also means that

08:15.680 --> 08:20.560
other systems that are not currently column-oriented, like Postgres or MySQL or whatever,

08:22.160 --> 08:28.400
you can have a driver, so that at the application level, your application only deals with

08:28.400 --> 08:37.360
Arrow column-oriented data. We have bindings for it in a host of languages: Java, Ruby,

08:37.360 --> 08:44.960
Rust, Python, Go all have bindings, and the drivers themselves are written in multiple different languages,

08:44.960 --> 08:51.600
because at heart it's just a C API. So let's take a look at that workflow again, but this time

08:51.600 --> 09:00.560
let's think of it using ADBC. Once again, application submits a SQL query to whatever system

09:00.560 --> 09:09.280
using the ADBC API, you know, execute query. The driver will then use whatever protocol it needs

09:09.280 --> 09:16.560
for the database to submit the query to the database itself. The database executes it and ideally

09:16.640 --> 09:24.320
returns the data already as Arrow data. If the database returns it as Arrow data, then step 4 is

09:24.320 --> 09:32.800
just pass it through. There's no copy, no translation, no transformation. You take what the database

09:32.800 --> 09:39.680
gave you and just literally pass the pointers to the data across. If the database did not return

09:39.680 --> 09:44.640
it in Arrow format, then you convert it to Arrow format and pass it to the API, and now you're

09:44.640 --> 09:50.160
still getting an Arrow record batch stream of data out the other end. And so everything from the

09:50.160 --> 09:58.720
application perspective is always Arrow, no matter what your underlying data source is. And like I said,

10:00.000 --> 10:08.240
ADBC is essentially a header file. It's a C API, which means that anything that can call a C function

10:09.200 --> 10:13.840
can use it and then we wrote the bindings to make it just easier for each individual language.

10:15.040 --> 10:22.000
It also means that anything that can expose a C interface can be used to write a driver,

10:22.000 --> 10:27.200
meaning whatever language you want. And because it's utilizing Arrow, Arrow has what's

10:27.200 --> 10:35.600
called the C Data Interface. It is a way for systems to pass data across a C boundary within the

10:35.600 --> 10:44.480
same process with zero copy. You're passing a little bit of metadata and just the raw pointers.

10:44.480 --> 10:51.680
to the data buffers across the C boundary. You know, super simple, straightforward,

10:51.680 --> 10:58.240
and Arrow can support all the data types you normally need, and also supports complex

10:58.240 --> 11:02.560
data types, validity, nulls and the whole nine yards, all the complexity you need.

11:04.080 --> 11:11.600
Now, we have drivers that are already in the ADBC repo for Snowflake, BigQuery, Postgres,

11:12.240 --> 11:18.320
SQLite; those all already have drivers. DuckDB implements the interface. You can load

11:18.320 --> 11:28.240
libduckdb.so as an ADBC driver. And we're still building and working on more drivers.

11:28.240 --> 11:34.080
And it's, you know, semantically versioned, so major/minor versioning, the whole nine yards you'd expect.

11:34.720 --> 11:41.440
And we maintain that it's going to be ABI-compatible across releases. So adopting the standard is something

11:41.520 --> 11:45.200
that you can be sure of. It's not going to get ripped out from underneath you. And remember,

11:45.200 --> 11:50.800
it's part of the Apache Arrow ecosystem. It's a subproject of the Arrow project itself, which means

11:50.800 --> 12:00.480
it's all Apache 2.0 licensed. So where does this fit in the general area of applications?

12:02.320 --> 12:10.160
ADBC is not looking to completely replace ODBC and JDBC in all use cases. It's not going to,

12:10.160 --> 12:16.480
and we don't want it to. If you have a use case that is transactional

12:16.480 --> 12:23.120
and truly row-oriented, or small amounts of data, you're probably better off just using ODBC

12:23.120 --> 12:29.120
and JDBC, unless you're already using Arrow elsewhere in the stack. Because if you're already

12:29.120 --> 12:36.080
using Arrow elsewhere in the stack, then avoiding the conversion cost to and from Arrow is going to be

12:36.160 --> 12:41.760
a benefit of using ADBC. If you're not already using Arrow elsewhere in the stack

12:41.760 --> 12:49.200
and you have a truly row-oriented workflow, then you might not get much benefit. It's not going

12:49.200 --> 12:55.360
to be better or worse; there's just not going to be much benefit for the work to convert. But if you're

12:55.360 --> 13:04.800
doing bulk analytics, if you're coming from a column-oriented source, if you're doing large amounts

13:04.880 --> 13:14.720
of data transfer, that's where ADBC will shine. It's vendor-neutral and it can use all the different

13:14.720 --> 13:21.600
protocols, and because it works with pluggable drivers, it can ultimately use any protocol that's necessary.

13:24.880 --> 13:30.080
Now, one thing that some people do bring up is they've heard of Arrow Flight and Arrow Flight

13:30.080 --> 13:36.640
SQL, and they wonder what ADBC is in that context. It's the difference between a protocol and an

13:36.640 --> 13:45.920
API. ADBC is a client API. It's how the application will interface with the data system.

13:46.800 --> 13:54.560
Arrow Flight SQL is a protocol, the network protocol that will be used by, say, the Arrow Flight

13:54.560 --> 14:01.600
SQL driver for ADBC to do the actual network traffic and transport to the data system. What

14:01.600 --> 14:05.760
that means is that if you're building the database, you're building the data system, an API,

14:05.760 --> 14:11.760
whatever. If you expose an Arrow Flight SQL interface as your interface, or as an

14:11.760 --> 14:18.320
available interface on your system, that means you get all the client-side stuff for free.

14:19.280 --> 14:25.600
There's already an ADBC Arrow Flight SQL driver. There are already Arrow Flight SQL clients

14:25.600 --> 14:32.320
in many languages, and so on. So if you are exposing an Arrow Flight SQL interface,

14:33.280 --> 14:38.000
you don't have to do any extra work for your clients to be able to access your system,

14:38.000 --> 14:44.080
they can just use all the existing open source stuff. So for example, InfluxDB and Dremio

14:45.040 --> 14:53.200
and Apache Doris all expose Arrow Flight SQL interfaces. You can use the ADBC Arrow Flight SQL driver right now

14:54.000 --> 15:00.800
and use that to connect, execute queries and get your data results from all those systems

15:01.920 --> 15:08.640
super fast, like significantly faster than leveraging ODBC and JDBC. That said, there are Arrow Flight

15:08.720 --> 15:17.600
SQL drivers for JDBC, and currently an open-source one for ODBC is being finished. So if you

15:17.600 --> 15:21.760
expose the Arrow Flight SQL interface, not only are you going to get the Flight SQL clients

15:21.760 --> 15:28.160
and the ADBC driver for free, you also let anyone who wants to use ODBC and JDBC access

15:28.160 --> 15:32.640
it for free too and you don't have to do the extra work. Instead, you can just help contribute

15:32.640 --> 15:43.200
to the projects themselves. So, an example: I mentioned that DuckDB implements the ADBC API.

15:46.480 --> 15:53.280
So DuckDB, a while back when they first did that, did a test. They just did a simple benchmark:

15:54.000 --> 16:03.360
insert a million rows and then select * on a million rows of TPC-H scale factor 1.

16:05.280 --> 16:13.200
They used the ODBC driver to do it and then they used ADBC to do it. Now, what's really cool

16:13.200 --> 16:18.800
about this benchmark, this test, is that because DuckDB is an in-process system,

16:19.520 --> 16:24.800
there's no network overhead or network instability that's interfering here.

16:25.520 --> 16:33.520
This is effectively showing what the cost of that transposition is for that amount of data,

16:35.360 --> 16:43.120
because really the only difference here is the conversion from Arrow data to ODBC versus

16:43.120 --> 16:52.720
passing some pointers through. So, some code examples. A nice, straightforward, simple way

16:52.720 --> 16:59.840
you can use ADBC right away: Python. We have ADBC drivers that are exposed as

16:59.840 --> 17:07.680
installable libraries. We utilize the DBAPI 2.0 interface, and we add fetch_arrow_table

17:07.680 --> 17:13.600
and fetch_record_batch functions to it so that you can retrieve Arrow data directly from it.

17:13.600 --> 17:18.880
But you can also use any of the other DBAPI methods and it will convert them as necessary

17:18.880 --> 17:23.920
if that's what you need. We already have, you know, Flight SQL, Snowflake, Postgres, SQLite, all

17:23.920 --> 17:31.120
right there. A nice, simple connection interface. There's also the lower-level driver interface we expose there.

17:31.120 --> 17:37.280
With the driver manager, even if we don't already have a packaged ADBC driver for Python,

17:37.840 --> 17:44.960
you can use the driver manager to load any arbitrary ADBC-compatible shared object that you want,

17:44.960 --> 17:52.080
which is what DuckDB did. If you pip install duckdb, you can then import the ADBC driver manager

17:52.880 --> 18:03.440
and DuckDB's DBAPI, and it'll work. Because the driver manager is a simple wrapper that allows you to

18:03.440 --> 18:10.240
load arbitrary shared objects, which means that you can share these drivers across languages.

18:10.240 --> 18:17.120
In fact, many of the drivers are written in Go, because it really is easy to compile Go to a statically

18:17.120 --> 18:24.000
linked shared object that has almost no dependencies other than libc. It makes it really,

18:24.000 --> 18:29.280
really great for distribution and passing around. So the Snowflake, BigQuery, and Flight SQL drivers

18:29.360 --> 18:35.600
are all written in Go and then expose a C interface and are loaded and packaged with Python,

18:35.600 --> 18:41.840
R, Ruby, and so on. And then we have the bindings for all the individual languages.

18:42.880 --> 18:47.840
With the Go library, if you don't want to use Arrow directly,

18:48.880 --> 18:56.560
we have a wrapper for the standard database/sql package that wraps around ADBC,

18:56.640 --> 18:59.680
and you can use it the same way as you would, you know, the standard library's Scan interface.

19:02.000 --> 19:07.520
You can use it in R, nice, simple and straightforward. It enables systems that have an ADBC driver

19:08.240 --> 19:14.800
to be utilized, for free, in other languages that we have the bindings for, without having to make

19:14.800 --> 19:23.600
specific SDKs for the individual languages. And then conceptually there's a lot of overlap

19:23.680 --> 19:28.880
between the systems, you know, whether you're using ADBC connections or Flight SQL, so it's

19:28.880 --> 19:34.080
easier to reason about. If you're used to reasoning about things at that level, the concepts are

19:34.080 --> 19:43.600
pretty much the same. So, what would an ideal scenario be? Well, ADBC also supports bulk ingestion.

19:44.400 --> 19:51.920
If you have an Arrow source coming in, you can ferry it into ADBC to do inserts,

19:52.880 --> 20:00.480
which also means that you can then take your ADBC driver for Snowflake, get a stream of Arrow

20:00.480 --> 20:09.680
record batches, feed that stream to ADBC ingest for DuckDB, and literally stream data from Snowflake

20:09.680 --> 20:15.360
into a DuckDB instance, and you don't have to materialize everything locally. It's streaming

20:15.360 --> 20:19.680
it; you have a low memory footprint because you're producing an Arrow stream and consuming an

20:19.680 --> 20:26.480
Arrow stream on both ends. And that will work for anything that has ADBC drivers. It also supports

20:26.480 --> 20:33.280
partitioned result sets. Flight SQL allows you to do partitioning of result sets so that you

20:33.280 --> 20:41.200
can have separate endpoints on how you fetch your streams of data. ADBC exposes an API to allow for that.

20:41.200 --> 20:47.920
So, if your database system puts data on different nodes so that you can

20:47.920 --> 20:54.240
parallelize the fetching, you can expose that through ADBC's API. It's there, it's available.

20:55.280 --> 20:59.920
You know, we also support transactional, you know, the standard database transaction

20:59.920 --> 21:06.880
workflows, and ADBC even supports using Substrait plans. If you're utilizing things at that level,

21:06.880 --> 21:12.480
you can send Substrait plans through ADBC, if your driver supports that and

21:12.560 --> 21:21.680
if your data system supports executing a Substrait plan. So, that's pretty much the talk there,

21:21.680 --> 21:26.960
that's ADBC for you. If you want more information, you want to learn more about Apache

21:26.960 --> 21:32.640
Arrow or ADBC, or any of those things, you can go to the websites, you know,

21:32.640 --> 21:39.280
arrow.apache.org and arrow.apache.org/adbc. The documentation is all there, the bindings,

21:39.360 --> 21:43.920
the information, the GitHub repos, everything you need to get started and to learn more.

21:46.080 --> 21:51.520
And then, you know, a not-so-shameless plug: I recently had the second edition of my

21:51.520 --> 21:58.240
Arrow book come out, which includes an entire new chapter that is on ADBC. Because when the first

21:58.240 --> 22:04.640
edition was written, ADBC had just begun as a project. It was just starting. So, it only

22:04.960 --> 22:11.840
got one paragraph. When I went to do the new edition of the book, I expanded that into a full chapter

22:11.840 --> 22:20.800
with code examples, insights, and usage examples in C++, Go and Python. So, lots of

22:20.800 --> 22:25.600
information, not just about ADBC, but also everything in the Arrow ecosystem in general.

22:26.560 --> 22:30.560
And, yep, that is it.

22:37.280 --> 22:39.280
And there are also lots of big stickers.

