WEBVTT

00:00.000 --> 00:09.000
All right, I'm Vlad, and I'm going to be talking about binary dependencies.

00:09.000 --> 00:10.000
Just doing it.

00:10.000 --> 00:11.000
Just doing it.

00:11.000 --> 00:12.000
All right.

00:12.000 --> 00:18.000
So we want to be able to identify the dependency graphs of all packages, ideally, but specifically

00:18.000 --> 00:19.000
Keystone packages.

00:19.000 --> 00:21.000
What are Keystone packages?

00:21.000 --> 00:27.000
Nadia, in her book and in her report, has this great definition, which comes from sort of this

00:28.000 --> 00:29.000
biology definition.

00:29.000 --> 00:34.000
And a keystone package, we want to say, is a package that has a widespread effect on its ecosystem

00:34.000 --> 00:38.000
or something that is disproportionately useful or depended upon by other things or whatever.

00:38.000 --> 00:42.000
So especially for the most important packages, we really want to know their dependency graphs.

00:42.000 --> 00:45.000
There's two reasons for this that I'm going to talk about.

00:45.000 --> 00:47.000
The first reason is open source financial sustainability.

00:47.000 --> 00:52.000
So obviously open source maintainers don't get paid enough or at all,

00:52.000 --> 00:54.000
burnout is a big issue. When you have burnout,

00:54.000 --> 00:57.000
packages don't get maintained, you know the deal.

00:57.000 --> 00:59.000
And open source supply chain security.

00:59.000 --> 01:04.000
So if we don't know any particular package's dependency graph, we can't know what security issues

01:04.000 --> 01:06.000
it might be vulnerable to.

01:06.000 --> 01:10.000
So the first thing, open source is not financially sustainable.

01:10.000 --> 01:12.000
Companies charge money for their products.

01:12.000 --> 01:14.000
Their products incorporate open source software.

01:14.000 --> 01:17.000
The money goes to the companies; it doesn't flow to the open source product.

01:17.000 --> 01:18.000
The open source packages.

01:18.000 --> 01:20.000
And so maintainers can't sustain them.

01:20.000 --> 01:21.000
Their packages.

01:21.000 --> 01:23.000
This is a problem.

01:23.000 --> 01:25.000
Technically, this is according to the rules.

01:25.000 --> 01:31.000
Yes, open source licenses allow this, but also this can still be bad even if the rules say that it's fine.

01:31.000 --> 01:33.000
Both things can be true.

01:33.000 --> 01:36.000
I'm not going to go into a lot of detail about that.

01:36.000 --> 01:37.000
The heat is doing a great talk tomorrow

01:37.000 --> 01:40.000
in the community track about burnout and open source.

01:40.000 --> 01:42.000
It has really powerful points about that.

01:42.000 --> 01:44.000
And then you can also read Chad Whitacre's

01:44.000 --> 01:47.000
blog post about the open source sustainability crisis.

01:47.000 --> 01:50.000
So here's an example from my work.

01:50.000 --> 01:51.000
On the open source pledge.

01:51.000 --> 01:55.000
This is an initiative where we try to get companies to pay the maintainers whose work

01:55.000 --> 01:56.000
they depend on.

01:56.000 --> 02:00.000
We ask companies to pay $2,000 per full-time equivalent developer that they

02:00.000 --> 02:03.000
employ per year to any open source maintainers really.

02:03.000 --> 02:07.000
But particularly to the maintainers of the software

02:07.000 --> 02:10.000
they depend on. The money goes directly to maintainers.

02:10.000 --> 02:11.000
We don't handle any funds.

02:11.000 --> 02:15.000
We had our one-year anniversary last year.

02:15.000 --> 02:18.000
And so far, our members have raised $6.1 million for maintainers.

02:18.000 --> 02:21.000
We're very happy about it.

02:21.000 --> 02:22.000
But.

02:22.000 --> 02:23.000
Thank you.

02:23.000 --> 02:27.000
Thank you.

02:27.000 --> 02:29.000
But where should the money go?

02:29.000 --> 02:34.000
So obviously companies don't understand 100% of their dependency tree.

02:34.000 --> 02:38.000
And so don't know necessarily what the best way is to allocate all of that money

02:38.000 --> 02:39.000
that they're paying.

02:39.000 --> 02:43.000
And like I said, we struggle to identify, especially the most important packages that

02:43.000 --> 02:45.000
keep our global infrastructure running.

02:45.000 --> 02:46.000
So that's one problem.

02:46.000 --> 02:50.000
Another problem is open source supply chain security.

02:50.000 --> 02:53.000
So let's say you have a thing and it depends on foo-lib.

02:53.000 --> 02:55.000
It depends on some other lib and some other thing.

02:55.000 --> 02:58.000
And there's a security problem in that last one.

02:58.000 --> 03:03.000
You might want to know because that means that there might be a security problem in your project.

03:03.000 --> 03:08.000
But if you don't know about that package, you can't do anything about it.

03:08.000 --> 03:11.000
So that's bad and we want to solve that.

03:12.000 --> 03:16.000
Yeah, so we want to be able to identify packages' dependency graphs.

03:16.000 --> 03:19.000
Now you'd think: look in the manifest, right?

03:19.000 --> 03:23.000
Because it says in package.json what things your thing depends on.

03:23.000 --> 03:26.000
The problem is some things are not in the manifest.

03:26.000 --> 03:32.000
The dreaded phantom dependencies, which are sort of described in this really cool post.

03:32.000 --> 03:38.000
Yeah, so if the dependency is not in your manifest, you're not going to know you depend on it.

03:38.000 --> 03:42.000
And then you can't fund the people you depend on if you wanted to do that.

03:42.000 --> 03:46.000
And also you can't spot security issues from the stuff that you depend on.

03:46.000 --> 03:47.000
So that's a problem.

03:47.000 --> 03:49.000
Now, there are different kinds of phantom dependencies.

03:49.000 --> 03:52.000
I'm going to talk about one kind, which is binary dependencies.

03:52.000 --> 03:56.000
So this is when you have something and then it calls into this dynamic library,

03:56.000 --> 03:58.000
which is usually written in C.

03:58.000 --> 04:03.000
And this kind of dependency is usually not recorded.

04:03.000 --> 04:07.000
So for example, you might have some Python code and your Python code depends on NumPy.

04:07.000 --> 04:09.000
But NumPy depends on OpenBLAS.

04:09.000 --> 04:11.000
But it doesn't say that anywhere.

04:11.000 --> 04:12.000
So we don't know that.

04:12.000 --> 04:13.000
So that's bad.

04:13.000 --> 04:15.000
It's not in pyproject.toml.

04:15.000 --> 04:22.000
So you can't support the developers of OpenBLAS or any of the other sort of binary dependencies that you rely on.

04:22.000 --> 04:25.000
And you can't track security issues in those packages.

04:25.000 --> 04:26.000
So that's not good.

04:26.000 --> 04:28.000
Now, how do binary dependencies work?

04:28.000 --> 04:29.000
I wrote a post.

04:29.000 --> 04:30.000
It's on vlad.website.

04:30.000 --> 04:31.000
You can read it.

04:31.000 --> 04:32.000
I'm not going to get into it.

04:32.000 --> 04:34.000
But I will get into this specific thing.

04:34.000 --> 04:40.000
So, how you usually call into C code in a lot of different ecosystems.

04:40.000 --> 04:43.000
So this is true for Python and Ruby and JavaScript.

04:43.000 --> 04:44.000
The details are a bit different.

04:44.000 --> 04:50.000
But basically, you want to be able to have some kind of code that you write that calls into the C library.

04:50.000 --> 04:56.000
But you also want to be able to manage type conversions in between the C types and you know,

04:56.000 --> 05:01.000
your Python data structures in a way that is, you know, maximally flexible.

05:01.000 --> 05:05.000
And so what a lot of ecosystems do is you have this thing called an extension module,

05:05.000 --> 05:07.000
which is a bunch of C code.

05:07.000 --> 05:08.000
You write a little bit.

05:08.000 --> 05:10.000
You write a little bit of C code.

05:10.000 --> 05:15.000
And then that C code includes some headers, like Python.h or whatever.

05:15.000 --> 05:18.000
And those headers tell it about the types in Python.

05:18.000 --> 05:22.000
And so that means that in your extension module,

05:22.000 --> 05:27.000
you can convert things, you know, back and forth whenever is best for you.

05:27.000 --> 05:29.000
And you know, it's a performance decision.

05:29.000 --> 05:37.000
But anyway, the point is your Python code and your extension module are going to communicate in whatever way is defined by your ecosystem.

05:37.000 --> 05:42.000
But then your extension module itself gets compiled to a dynamic library.

05:42.000 --> 05:44.000
This is the important sort of point, right?

05:44.000 --> 05:48.000
And your dynamic library is going to get linked dynamically to the C thing you're calling.

05:48.000 --> 05:51.000
And so that means that your compiled extension module,

05:51.000 --> 05:56.000
if we can get a hold of that, is going to say in it what libraries it needs, right?

05:56.000 --> 06:00.000
It's cool because then we can figure out what binary dependencies you depend on.

06:00.000 --> 06:05.000
So, Python wheels, right, include vendored binaries.

06:05.000 --> 06:09.000
So, you know, if you depend on OpenBLAS, it's going to include the dynamic library,

06:09.000 --> 06:12.000
the compiled version of OpenBLAS, in your wheel.

06:12.000 --> 06:13.000
So that's good.

06:13.000 --> 06:14.000
And we can detect that.

06:14.000 --> 06:15.000
So let's investigate.

06:15.000 --> 06:19.000
I looked at the 15,000 most downloaded Python packages.

06:19.000 --> 06:23.000
I could only download 13,074 of them for various reasons.

06:23.000 --> 06:29.000
And of those wheels, 1,531 wheels contain dynamic libraries.

06:29.000 --> 06:31.000
So that's already interesting.

06:31.000 --> 06:35.000
And I found a total of 12,137 .so files.
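
NOTE
Editor's sketch, not the speaker's tooling: a wheel is a zip archive, so the
shared objects bundled in it can be listed with Python's standard zipfile module.
The function name shared_objects_in_wheel and the file-name heuristic (a ".so"
suffix or a versioned ".so.N" name) are assumptions made for illustration.
```python
import io
import zipfile
def shared_objects_in_wheel(wheel_bytes: bytes) -> list[str]:
    """List archive members whose file names look like Linux shared objects."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as wheel:
        found = []
        for name in wheel.namelist():
            base = name.rsplit("/", 1)[-1]  # file name without directories
            if base.endswith(".so") or ".so." in base:
                found.append(name)
        return sorted(found)
```
The result mixes extension modules and bundled libraries; telling them apart
needs a further step, such as inspecting each file's dynamic section.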

06:35.000 --> 06:40.000
Now, those .so files, some of them are bundled binary dependencies.

06:40.000 --> 06:41.000
So stuff like OpenBLAS.

06:41.000 --> 06:45.000
But some of them are just those extension modules that we wrote, you know,

06:45.000 --> 06:47.000
that we just looked at.

06:47.000 --> 06:51.000
So, you know, it's either the extension modules or the phantom dependencies.

06:51.000 --> 06:53.000
So we want to find the phantom dependencies.

06:53.000 --> 06:55.000
Okay, how do we find it?

06:55.000 --> 06:59.000
Your extension module, on Linux at least, is an ELF file.

06:59.000 --> 07:01.000
Your ELF file has a section, the dynamic section.

07:01.000 --> 07:03.000
And in there, it has a bunch of information.

07:03.000 --> 07:05.000
But some of that information has a tag.

07:05.000 --> 07:06.000
And the tag is DT_NEEDED.

07:06.000 --> 07:10.000
So it describes a library that your extension module needs.

07:10.000 --> 07:15.000
And in the name, it's going to tell you the name of that dynamic library.

07:15.000 --> 07:20.000
And it's that name that the runtime linker uses to find that thing on the computer.

07:20.000 --> 07:23.000
And so in theory, the name is sufficient for us to find it.

07:23.000 --> 07:24.000
Right?

07:24.000 --> 07:26.000
So we can look inside the dynamic.

07:26.000 --> 07:29.000
We can look inside those dynamic libraries and find those names.
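
NOTE
Editor's sketch under stated assumptions, not the speaker's code: reading the
DT_NEEDED names from a 64-bit little-endian ELF image with Python's struct module.
It finds the dynamic section through the section header table, which stripped
files may lack; real tooling would more likely rely on readelf -d or a library
such as pyelftools. The name needed_libraries is hypothetical.
```python
import struct
SHT_DYNAMIC = 6  # section type of the dynamic section
DT_NULL, DT_NEEDED = 0, 1  # dynamic tags from the ELF specification
def needed_libraries(data: bytes) -> list[str]:
    """Return the DT_NEEDED library names of a 64-bit little-endian ELF image."""
    if data[:4] != b"\x7fELF" or data[4] != 2 or data[5] != 1:
        raise ValueError("not a 64-bit little-endian ELF file")
    (shoff,) = struct.unpack_from("<Q", data, 0x28)  # e_shoff
    shentsize, shnum = struct.unpack_from("<HH", data, 0x3A)  # e_shentsize, e_shnum
    # Each section header: name, type, flags, addr, offset, size, link, info, align, entsize
    sections = [struct.unpack_from("<IIQQQQIIQQ", data, shoff + i * shentsize) for i in range(shnum)]
    names = []
    for sec in sections:
        if sec[1] != SHT_DYNAMIC:
            continue
        dyn_off, dyn_size, link = sec[4], sec[5], sec[6]
        str_off = sections[link][4]  # sh_link points at the string table section
        for pos in range(dyn_off, dyn_off + dyn_size, 16):
            tag, val = struct.unpack_from("<qQ", data, pos)
            if tag == DT_NULL:  # end of the dynamic array
                break
            if tag == DT_NEEDED:  # val is an offset into the string table
                end = data.index(b"\x00", str_off + val)
                names.append(data[str_off + val:end].decode())
    return names
```
For a real file, pass the bytes read from the .so, for example
needed_libraries(open(path, "rb").read()), where path is a placeholder.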

07:29.000 --> 07:36.000
And so I decided to figure out what the most common dynamic libraries are in our 13,000 odd wheels.

07:36.000 --> 07:38.000
Some of them are obvious.

07:38.000 --> 07:42.000
I think, you know, libm, whatever.

07:42.000 --> 07:44.000
So that's fine.

07:44.000 --> 07:47.000
But some of them, I don't, I mean, maybe you all know all of these things.

07:47.000 --> 07:49.000
In which case, that's super cool.

07:49.000 --> 07:50.000
I don't.

07:50.000 --> 07:53.000
I mean, I know like a few of them, but also wasn't expecting them to be this common.

07:53.000 --> 07:55.000
Other ones, I just don't know what they are at all.

07:55.000 --> 07:59.000
So anyway, I don't sort of have a concrete breakdown.

07:59.000 --> 08:03.000
But it's interesting because there's a bunch of stuff that's really dependent upon that we don't know.

08:03.000 --> 08:05.000
I don't know what it is.

08:05.000 --> 08:10.000
And so I wanted to come up with a number of how common binary dependencies are in the Python ecosystem

08:10.000 --> 08:12.000
generally. It's going to be a very inaccurate number.

08:12.000 --> 08:13.000
But let's give it a try.

08:13.000 --> 08:15.000
So we have 13,000 odd packages.

08:15.000 --> 08:19.000
We have 1,531 wheels that contain .so files.

08:19.000 --> 08:24.000
We have 1,303 other wheels that depend on those first wheels, right?

08:24.000 --> 08:27.000
So that gives us 2,834 wheels.
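
NOTE
Editor's sketch of the arithmetic, using only figures stated in the talk. The
count of dependent wheels, 1,303, is inferred from the stated total
(2,834 - 1,531); the variable names are illustrative.
```python
downloaded = 13_074    # packages that could be downloaded
with_so_files = 1_531  # wheels that contain .so files
dependents = 1_303     # other wheels that depend on those wheels
affected = with_so_files + dependents
share = affected / downloaded
print(f"{affected} wheels, {share:.1%} of the sample")  # 2834 wheels, 21.7% of the sample
```
This is the share within the downloaded sample only, which is in line with the
rough figure of around 20% quoted in the talk.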

08:27.000 --> 08:32.000
Now, when we generalize this, we're going to keep in mind that this is obviously a small sample,

08:32.000 --> 08:34.000
even though it is the most popular packages.

08:34.000 --> 08:37.000
And also, this is only one way of doing binary dependencies, right?

08:37.000 --> 08:39.000
There's other ways.

08:39.000 --> 08:42.000
But, you know, taking that into account, if we were to average that out,

08:42.000 --> 08:46.000
we can say that around 20% of Python packages have phantom binary dependencies.

08:46.000 --> 08:51.000
Which means that for those packages, which is a lot of packages, we don't have the full picture of the dependencies,

08:51.000 --> 08:54.000
especially since, frequently,

08:54.000 --> 08:59.000
you know, you're not going to have a binary dependency for something that's like a left-pad type situation, I hope.

08:59.000 --> 09:02.000
So it's going to be a thing that you really rely on, right?

09:02.000 --> 09:03.000
So we want to know about those things.

09:03.000 --> 09:05.000
And this affects pretty much everyone.

09:05.000 --> 09:07.000
It affects users of the open source software.

09:07.000 --> 09:10.000
And it also harms maintainers because the maintainers can't get supported.

09:10.000 --> 09:13.000
Yeah, so I'd like to continue working on this.

09:13.000 --> 09:14.000
This was just for Python.

09:14.000 --> 09:16.000
We can do this in more detail for Python.

09:16.000 --> 09:18.000
We can do it for all of these other ecosystems.

09:18.000 --> 09:22.000
We could look at the symbols inside the dynamic libraries,

09:22.000 --> 09:26.000
in case we have some trouble identifying what library a file belongs to.

09:26.000 --> 09:31.000
We can look at the symbols and build a database of, you know, which symbols are provided by what package.

09:31.000 --> 09:37.000
I would like to publish some kind of tool that makes it easy to explore these binary dependencies for any given package.

09:37.000 --> 09:40.000
And integrating it into services like ecosyste.ms would be great.

09:40.000 --> 09:44.000
I'd hope to get some funding to work on this. If you know of anyone that might be willing to fund this,

09:44.000 --> 09:45.000
let me know.

09:45.000 --> 09:46.000
And that's it.

09:46.000 --> 09:47.000
Thank you very much.

09:47.000 --> 09:49.000
There's more details on my website.

09:57.000 --> 09:58.000
All right, let's do some questions.

09:58.000 --> 09:59.000
I'll fund you.

09:59.000 --> 10:00.000
Amazing.

10:00.000 --> 10:01.000
Easy.

10:01.000 --> 10:02.000
Thank you very much.

10:02.000 --> 10:05.000
It's not a question, but I welcome it.

10:05.000 --> 10:07.000
I'm kind of a question.

10:07.000 --> 10:08.000
Sorry.

10:08.000 --> 10:10.000
The answer is yes to your question.

10:10.000 --> 10:11.000
Let's do a different question.

10:11.000 --> 10:13.000
We have a question over there.

10:13.000 --> 10:20.000
Do you think package registries should just forbid these kinds of binary dependencies?

10:20.000 --> 10:24.000
I don't.

10:24.000 --> 10:25.000
Yes.

10:25.000 --> 10:30.000
So do I think that package manager or sort of package managers or package managers?

10:30.000 --> 10:31.000
Yeah.

10:31.000 --> 10:32.000
Yeah.

10:32.000 --> 10:35.000
Registries should not allow these kinds of binary dependencies?

10:35.000 --> 10:36.000
No.

10:36.000 --> 10:37.000
I don't think so.

10:37.000 --> 10:42.000
I think if anything, we need to get better at handling them because there's, you know, a lot of people

10:42.000 --> 10:43.000
use them for very good reason.

10:43.000 --> 10:47.000
If anything, we don't have the infrastructure right now to do that in a reliable way.

10:47.000 --> 10:48.000
Yeah.

10:48.000 --> 10:49.000
Just one word.

10:49.000 --> 10:51.000
There's a button on this thing.

10:51.000 --> 10:53.000
We'll close it.

10:53.000 --> 10:54.000
Yeah.

10:54.000 --> 10:55.000
Two.

10:55.000 --> 10:56.000
Four.

10:56.000 --> 10:57.000
Four.

10:57.000 --> 10:58.000
Girls.

10:58.000 --> 10:59.000
And we have the offer right there.

10:59.000 --> 11:01.000
Raise your hand.

11:01.000 --> 11:02.000
Hello.

11:02.000 --> 11:03.000
The truck.

11:03.000 --> 11:04.000
Bineries.

11:04.000 --> 11:06.000
That's our vendor.

11:06.000 --> 11:07.000
There's another thing.

11:07.000 --> 11:09.000
Not my son, please.

11:09.000 --> 11:10.000
Raise your hand.

11:10.000 --> 11:12.000
So main things of package.

11:13.000 --> 11:14.000
Test truck.

11:14.000 --> 11:16.000
S-S-T-R-E.

11:16.000 --> 11:18.000
Which instruments.

11:18.000 --> 11:21.000
G-C-C and L-D-A-B-E-L.

11:21.000 --> 11:25.000
To inject all of the dependencies in an ELF section.

11:25.000 --> 11:27.000
So there are people who believe that you said there's two.

11:27.000 --> 11:28.000
So there's a PEP.

11:28.000 --> 11:29.000
Haven't you haven't happened then?

11:29.000 --> 11:34.000
You know that the dependencies because they were bound up at the up time.

11:34.000 --> 11:35.000
So.

11:35.000 --> 11:36.000
We isn't as bound.

11:36.000 --> 11:37.000
For the time we spread.

11:37.000 --> 11:38.000
Yes.

11:38.000 --> 11:39.000
Yes.

11:39.000 --> 11:40.000
So just to repeat this.

11:41.000 --> 11:42.000
Yeah.

11:42.000 --> 11:43.000
You said there's PEP

11:43.000 --> 11:44.000
Seven.

11:44.000 --> 11:45.000
Seven.

11:45.000 --> 11:46.000
Twenty-five.

11:46.000 --> 11:47.000
Okay.

11:47.000 --> 11:48.000
So that's I will definitely look into that.

11:48.000 --> 11:49.000
Hello.

11:49.000 --> 11:50.000
And then the other thing was called.

11:50.000 --> 11:52.000
S-T-R-E.

11:52.000 --> 11:53.000
S-T-R-E.

11:53.000 --> 11:54.000
Okay.

11:54.000 --> 11:55.000
Cool.

11:55.000 --> 11:56.000
So I'll look at those things.

11:56.000 --> 11:57.000
Thank you very much.

11:57.000 --> 11:59.000
You can go to github.com slash sunny.

11:59.000 --> 12:00.000
Yeah.

12:00.000 --> 12:01.000
Yes.

12:01.000 --> 12:02.000
Amazing.

12:02.000 --> 12:03.000
Thank you.

12:03.000 --> 12:04.000
Any.

12:04.000 --> 12:05.000
Raise your hand.

12:05.000 --> 12:06.000
I may sound.

12:06.000 --> 12:08.000
You can go to this guy there.

12:08.000 --> 12:09.000
Amazing.

12:10.000 --> 12:11.000
We have one question over there.

12:11.000 --> 12:12.000
If I still have time for one question.

12:12.000 --> 12:13.000
Yes.

12:13.000 --> 12:14.000
I'll get a curse.

12:14.000 --> 12:15.000
It's not so much a question.

12:15.000 --> 12:19.000
It occurs to me that if you said this is called phrase or run.

12:19.000 --> 12:20.000
You will discover every day.

12:20.000 --> 12:21.000
Yeah.

12:21.000 --> 12:22.000
Yeah.

12:22.000 --> 12:23.000
I've built a blue.

12:23.000 --> 12:24.000
Exactly.

12:24.000 --> 12:25.000
For that.

12:25.000 --> 12:26.000
He's a trace.

12:26.000 --> 12:27.000
It's painful.

12:27.000 --> 12:28.000
Because.

12:28.000 --> 12:29.000
Yeah.

12:29.000 --> 12:30.000
Interesting.

12:30.000 --> 12:31.000
And doesn't know.

12:31.000 --> 12:33.000
Because you don't find exactly this.

12:33.000 --> 12:34.000
Okay.

12:34.000 --> 12:35.000
Then you don't know.

12:35.000 --> 12:36.000
All right.

12:36.000 --> 12:37.000
All right.

12:37.000 --> 12:38.000
Thank you very much.

12:39.000 --> 12:40.000
Good job.

12:40.000 --> 12:41.000
Thank you.

12:41.000 --> 12:42.000
Thank you.

12:42.000 --> 12:43.000
Good job.

12:43.000 --> 12:44.000
Good job.

