WEBVTT

00:00.000 --> 00:18.000
Okay, so we are on time, so we can welcome Philippe Ombredanne. And please, if you still see some space next to each other, try to move to the middle so more people can come in.

00:18.000 --> 00:26.000
So, welcome. I'm going to talk today about a favorite topic, which is AI and LLMs and GenAI.

00:26.000 --> 00:32.000
And excuse me, speak up.

00:32.000 --> 00:34.000
Okay.

00:34.000 --> 00:36.000
Okay, now it's better.

00:36.000 --> 00:38.000
You hear me all right?

00:38.000 --> 00:41.000
In the back, it doesn't saturate? Better like that?

00:41.000 --> 00:49.000
So, I'm going to talk about our favorite topic, which is AI and GenAI, that we all love to hate.

00:49.000 --> 00:57.000
Me in particular, actually. I'm not so much of a Luddite, but I have a "No AI" badge on the back of my laptop.

00:57.000 --> 01:01.000
And none of this was generated with AI, by the way.

01:01.000 --> 01:05.000
So, I'm Philippe, and I'm the lead of an open source project.

01:05.000 --> 01:08.000
It's actually now a small foundation, called AboutCode.

01:08.000 --> 01:13.000
And everything we do is free and open source: code, data, and standards.

01:13.000 --> 01:16.000
To help people use more open source.

01:16.000 --> 01:20.000
So, figure out where the code comes from, what the licenses or the security issues are.

01:20.000 --> 01:23.000
And I'm sure we can all benefit from that.

01:23.000 --> 01:26.000
I dabble a bit in standards.

01:26.000 --> 01:30.000
I'm a co-founder of something called SPDX, for SBOMs.

01:30.000 --> 01:38.000
And I'm also a co-contributor to CycloneDX, because with standards, you need more of them so you can pick which one to support.

01:38.000 --> 01:48.000
And I'm also behind a small standard called Package URL, which happens to be, since mid-December, an Ecma standard.

01:48.000 --> 01:55.000
And it's on the way to ISO. It's a very stupid string to identify packages in SBOMs, SCA tools, and vulnerability databases.
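
To make that concrete, here is a minimal sketch using the packageurl-python library; the package names are just illustrative:

```python
# A purl is a short string like "pkg:npm/lodash@4.17.21".
# Sketch using packageurl-python (pip install packageurl-python).
from packageurl import PackageURL

purl = PackageURL.from_string("pkg:npm/lodash@4.17.21")
print(purl.type, purl.name, purl.version)  # npm lodash 4.17.21

# Building one from parts round-trips to the same string form.
print(PackageURL(type="pypi", name="scancode-toolkit", version="32.0.0").to_string())
# pkg:pypi/scancode-toolkit@32.0.0
```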

01:55.000 --> 02:01.000
You will hear about it if you haven't yet. Who has ever heard about purl, or Package URL, in the room?

02:01.000 --> 02:04.000
Good, so it's pretty cool.

02:04.000 --> 02:12.000
You will have to deal with it whether you like it or not if you're doing application security at some level.

02:12.000 --> 02:19.000
So, we're building tools, data, standards, and everything is free and open source.

02:19.000 --> 02:21.000
That's important.

02:21.000 --> 02:24.000
The origin of all this: of course, AI is a big thing in the world.

02:24.000 --> 02:31.000
That's a serious issue: how can we make sure we use AI responsibly?

02:31.000 --> 02:34.000
And there are two core issues that I will bring up.

02:34.000 --> 02:38.000
One is an issue of license.

02:38.000 --> 02:44.000
Because if you think about open source, open source is defined by license.

02:45.000 --> 02:48.000
Code that's generated by a bot.

02:48.000 --> 02:53.000
The jury is still out, but it is most likely non-copyrightable.

02:53.000 --> 02:54.000
So, it's not open source anymore.

02:54.000 --> 03:02.000
There's also the related problem, which is that LLMs are massively trained on open source code.

03:02.000 --> 03:04.000
The code we write.

03:04.000 --> 03:10.000
And they can really regurgitate, spit back this code very efficiently, because they memorize it.

03:10.000 --> 03:14.000
Which brings not only copyright issues, but also security issues.

03:14.000 --> 03:17.000
You know: garbage in, garbage out.

03:17.000 --> 03:31.000
So, you have a wonderful way to eventually propagate security bugs from the training code to the code that's being generated.

03:31.000 --> 03:36.000
So, who has not heard about AI and LLMs?

03:37.000 --> 03:41.000
So, somebody was living in a cave for the last three years.

03:41.000 --> 03:43.000
Wonderful.

03:43.000 --> 03:44.000
That's great.

03:44.000 --> 03:46.000
That's a rare thing.

03:46.000 --> 03:48.000
We need to put you on a pedestal.

03:48.000 --> 03:51.000
Because you will save humanity.

03:51.000 --> 03:52.000
Eventually, you're the last one of us.

03:52.000 --> 03:55.000
That's not so far from the truth.

03:55.000 --> 03:57.000
Okay.

03:57.000 --> 04:02.000
We're talking about the risks of AI-generated code.

04:03.000 --> 04:08.000
And also another topic, which is AI used as a feature in code.

04:08.000 --> 04:14.000
It's not only generated code, but also using an LLM,

04:14.000 --> 04:18.000
a chat feature or whatever, especially in the world of cyber security.

04:18.000 --> 04:23.000
It can have really, really funky outcomes.

04:23.000 --> 04:30.000
And eventually, having the ability to identify when code is AI-generated is really important.

04:30.000 --> 04:34.000
Whatever I'm going to show you today, and I'll do a short demonstration in a bit:

04:34.000 --> 04:37.000
I'm not able to detect AI-generated code as such.

04:37.000 --> 04:42.000
There's no em dash in code to spot AI-generated code.

04:42.000 --> 04:47.000
What I can do instead is detect code

04:47.000 --> 04:53.000
which is strikingly similar to the open source code that was used to feed the beast.

04:53.000 --> 04:58.000
And that's what we're focusing on, at least for part of what I'm talking about.

04:59.000 --> 05:04.000
Now, you can at least do funny stuff, and even useful stuff, with AI.

05:04.000 --> 05:06.000
One of them is to generate poems.

05:06.000 --> 05:16.000
This one is actually AI-generated, and it describes how you would do AI-generated code detection.

05:16.000 --> 05:20.000
I'll make sure I put the link to the slides in my talk.

05:20.000 --> 05:23.000
I didn't do it yet, but you'll have the full deck downloadable

05:23.000 --> 05:26.000
as a LibreOffice file and PDF.

05:26.000 --> 05:28.000
I'll make sure it's on FOSDEM.

05:28.000 --> 05:32.000
So, there's another problem: there's a lot of open-washing in the space.

05:32.000 --> 05:34.000
You know, OpenAI is not open.

05:34.000 --> 05:39.000
And most of "open" AI is not open source itself.

05:39.000 --> 05:46.000
There is also a lot of open source which is essential,

05:46.000 --> 05:50.000
without which any kind of AI would not be happening:

05:50.000 --> 05:53.000
TensorFlow, PyTorch, to name some of them.

05:53.000 --> 06:02.000
The good thing, at least for now, is that the code that's AI-generated very often doesn't compile yet.

06:02.000 --> 06:05.000
It's a good thing in the sense that it's an easy way to spot.

06:05.000 --> 06:10.000
It doesn't compile, doesn't pass the tests: good chance it's AI-generated.

06:10.000 --> 06:15.000
We have, as I said, not only a problem of open-washing,

06:16.000 --> 06:21.000
but also a problem which is a lot of innovation.

06:21.000 --> 06:25.000
And I don't like this kind of innovation when it comes to licensing.

06:25.000 --> 06:31.000
We have a lot of new, what I call, open-washing licenses,

06:31.000 --> 06:34.000
which look like open source, but are not.

06:34.000 --> 06:39.000
And that's a threat, not so much on security, but on open source at large,

06:39.000 --> 06:45.000
because it's very easy to get fooled if you're not really savvy about spotting these.

06:45.000 --> 06:50.000
And sometimes they take an MIT, a BSD, or an Apache license, and insert a few things.

06:50.000 --> 06:55.000
We're seeing at least one or two new licenses a week,

06:55.000 --> 06:57.000
which are funky, non-open source licenses.

06:57.000 --> 07:02.000
They all come from AI projects or AI-related projects.

07:02.000 --> 07:09.000
Case in point, for instance: if you're dabbling with a source-available

07:09.000 --> 07:15.000
or downloadable model from Meta, Llama 4,

07:15.000 --> 07:18.000
good luck to you if you're based in the EU: you cannot touch it.

07:18.000 --> 07:23.000
The license plainly says you cannot use this model if you're based in the EU.

07:23.000 --> 07:27.000
That's wonderful. What can go wrong with that?

07:27.000 --> 07:34.000
So, I think that we have to treat AI with care, and accept that sometimes it's a wonderful

07:34.000 --> 07:38.000
productivity booster; sometimes it even compiles and runs.

07:38.000 --> 07:40.000
It's rare, but that can help.

07:40.000 --> 07:45.000
But again, we need to ensure we understand what we're dealing with.

07:45.000 --> 07:50.000
If you imagine taking all the code from the GNU project under the GPL,

07:50.000 --> 07:54.000
and creating a small language model out of that,

07:54.000 --> 08:01.000
I'm not a lawyer, but I cannot imagine a way where the output of generating code from that model

08:01.000 --> 08:03.000
would not be also under the GPL.

08:03.000 --> 08:08.000
It's just derived in a weird, funky way, with weights and math behind it,

08:08.000 --> 08:12.000
but still eventually directly derived from that.

08:12.000 --> 08:15.000
It's also interesting that, in some cases,

08:15.000 --> 08:20.000
and I think it's probably going too far, you have corporations which are

08:20.000 --> 08:24.000
prohibiting AI-generated code, or prohibiting AI use.

08:24.000 --> 08:30.000
You also have stupid corporations, and maybe some of you have been subjected to this abuse,

08:30.000 --> 08:33.000
where you have managers that ask you every other day,

08:33.000 --> 08:38.000
have you been using more AI?

08:38.000 --> 08:42.000
Which AI have you been using? Why are you not using more AI?

08:42.000 --> 08:46.000
Which I think is absolutely terrible as a metric for performance.

08:46.000 --> 08:51.000
Let's see: how many of you are subjected to this kind of abuse in your corporation or business?

08:51.000 --> 08:55.000
Oh, man, more than the people that know purl. That's terrible.

08:55.000 --> 09:00.000
Oh, we're really living in a weird world.

09:00.000 --> 09:04.000
So: the AI-generated code search project.

09:04.000 --> 09:09.000
The test base is a sample of code training data,

09:09.000 --> 09:15.000
where we've indexed about 260,000 open source projects' source code.

09:16.000 --> 09:24.000
And what we did is ask ChatGPT to generate code similar to a Package URL.

09:24.000 --> 09:28.000
So, essentially generate code similar to this package.

09:28.000 --> 09:31.000
It's probably not a prompt that you would use in real life,

09:31.000 --> 09:38.000
but it's a plausible prompt, and we saved each output in a code file.

09:38.000 --> 09:46.000
And then we used our tool, rebuilt for that, to scan these files, these 100 files,

09:46.000 --> 09:49.000
based on the most common JavaScript projects,

09:49.000 --> 09:58.000
and basically ran the code matching between the index of 260,000 projects and these 100 generated files.

09:58.000 --> 10:02.000
What we found, and it's not scientific, is,

10:02.000 --> 10:06.000
in at least 20% of the cases, strikingly similar code.
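
As a rough reconstruction of that experiment (the model name, prompt wording, and package list are assumptions, not the exact setup used):

```python
from pathlib import Path
from openai import OpenAI  # any LLM client would do

client = OpenAI()
# Placeholder purls standing in for the most common JavaScript packages.
packages = ["pkg:npm/lodash", "pkg:npm/base64-js"]
outdir = Path("generated")
outdir.mkdir(exist_ok=True)

for i, purl in enumerate(packages):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{"role": "user",
                   "content": f"Generate JavaScript code similar to {purl}"}],
    )
    # Save each answer as a code file, ready to be scanned for matches.
    (outdir / f"sample_{i}.js").write_text(resp.choices[0].message.content)
```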

10:06.000 --> 10:10.000
So, that's what the setup looks like.

10:10.000 --> 10:13.000
A lot of matching, and everything's open source;

10:13.000 --> 10:17.000
the whole point is you essentially collect checksums for your index,

10:17.000 --> 10:18.000
and you match them back.

10:18.000 --> 10:21.000
Except the checksums are not really checksums; that would be too brittle.

10:21.000 --> 10:23.000
We wouldn't find anything.

10:23.000 --> 10:25.000
In particular, when you generate code,

10:25.000 --> 10:30.000
and depending on a parameter called the model temperature,

10:30.000 --> 10:33.000
You will have essentially the same control flow,

10:33.000 --> 10:36.000
but completely different names for the variables, the functions, and so on.

10:36.000 --> 10:39.000
So, you need to adjust for this kind of thing.
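
A toy example of why exact checksums are too brittle (hypothetical snippets): renaming a single identifier changes the hash completely:

```python
import hashlib

a = "function add(a, b) { return a + b; }"
b = "function sum(x, y) { return x + y; }"

# Same control flow, different names: the exact hashes no longer match,
# so a plain checksum index would find nothing in generated code.
print(hashlib.sha256(a.encode()).hexdigest())
print(hashlib.sha256(b.encode()).hexdigest())
```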

10:39.000 --> 10:45.000
An initial approach was to say: let's use AI to detect

10:45.000 --> 10:51.000
if there are similarities between existing code and the AI-generated code.

10:51.000 --> 10:56.000
It worked okay, actually, except it was extremely expensive.

10:56.000 --> 11:02.000
I just spent all my free token budget with all the AI companies out there.

11:02.000 --> 11:09.000
The focus of this project is to say: let's find strikingly similar code fragments

11:09.000 --> 11:11.000
that may come from another project.

11:11.000 --> 11:15.000
The problem is it doesn't work well with traditional techniques.

11:15.000 --> 11:18.000
If you think about just an inverted index,

11:18.000 --> 11:22.000
there's too much content to match and too much detail to take in.

11:22.000 --> 11:25.000
You get too much noise very quickly.

11:25.000 --> 11:31.000
There's also an approach we didn't try, but we researched it extensively,

11:31.000 --> 11:35.000
which is used in a search engine called Bing from Microsoft.

11:35.000 --> 11:40.000
It's a project called BitFunnel, which is an alternative to inverted indexes,

11:40.000 --> 11:45.000
which is interesting, and it's probably something we want to consider in the future.

11:45.000 --> 11:48.000
And you have a few companies in the space,

11:48.000 --> 11:52.000
which are doing what I call traditional code fragment matching.

11:52.000 --> 11:54.000
Commercial companies like Black Duck,

11:55.000 --> 11:59.000
a semi-commercial company like SCANOSS,

11:59.000 --> 12:03.000
proprietary companies like FOSSA or FossID.

12:03.000 --> 12:06.000
They all use the exact same algorithm,

12:06.000 --> 12:13.000
which was devised by a guy in Berkeley in the late 1990s, early 2000s.

12:13.000 --> 12:16.000
And that doesn't really work well,

12:16.000 --> 12:22.000
because it's not able to detect the wide variations we have when we use AI.

12:22.000 --> 12:27.000
You can prompt the same model twice with the same prompt.

12:27.000 --> 12:32.000
You will eventually get slightly different results, and that's the problem.

12:32.000 --> 12:38.000
So the approach is: first, we break the code into chunks.

12:38.000 --> 12:42.000
That means we literally parse the code into tokens,

12:42.000 --> 12:46.000
and we detect boundaries using a content-defined chunking algorithm

12:46.000 --> 12:50.000
to get chunks which are roughly the same size.
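
A minimal sketch of content-defined chunking over a token stream; the hash and the size bounds are illustrative, not the project's actual parameters:

```python
import zlib

def chunk_tokens(tokens, mask=0x3F, min_len=16, max_len=128):
    # Cut a chunk when a running hash over the tokens hits a fixed bit
    # pattern; boundaries then depend on content, not position, so an
    # insertion early in a file does not shift every later chunk.
    chunks, current, h = [], [], 0
    for tok in tokens:
        current.append(tok)
        h = zlib.crc32(tok.encode(), h)  # deterministic running hash
        if (len(current) >= min_len and (h & mask) == 0) or len(current) >= max_len:
            chunks.append(current)
            current, h = [], 0
    if current:
        chunks.append(current)
    return chunks
```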

12:50.000 --> 12:54.000
And then we compute what's called a fuzzy hash.

12:54.000 --> 12:58.000
Some of you may be aware of things like SSDEEP,

12:58.000 --> 13:02.000
which is a tool to find approximately matching files,

13:02.000 --> 13:04.000
widely used in security.

13:04.000 --> 13:08.000
We're not using SSDEEP, but the principles are similar,

13:08.000 --> 13:13.000
meaning you abstract the code fragment to bit string,

13:13.000 --> 13:16.000
and you have way to match approximately validating distance,

13:16.000 --> 13:20.000
in having distance if two code fragments are the same.

13:20.000 --> 13:24.000
The thing in our case is what's interesting is that the precision

13:24.000 --> 13:27.000
of the matching can be tuned at correct time,

13:27.000 --> 13:32.000
depending on how large the bit string you want to be.

13:32.000 --> 13:37.000
The best way to understand how this works is this.

13:37.000 --> 13:39.000
Imagine two cat pictures.

13:39.000 --> 13:43.000
A brown one and a gray one, with a slightly different tail, different eyes.

13:43.000 --> 13:50.000
If you resize the images down to, say, 32 pixels by 32 pixels,

13:50.000 --> 13:53.000
they look exactly the same.
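
The analogy as code, sketched with Pillow (file names hypothetical): shrink both images to a tiny grid and compare the resulting bits:

```python
from PIL import Image  # pip install Pillow

def tiny_hash(path, size=8):
    # Shrink to a size x size grayscale grid; each pixel becomes one bit
    # depending on whether it is brighter than the average.
    img = Image.open(path).convert("L").resize((size, size))
    pixels = list(img.getdata())
    avg = sum(pixels) / len(pixels)
    return sum(1 << i for i, p in enumerate(pixels) if p > avg)

# Two near-identical cats collapse to (almost) the same bit string.
diff = tiny_hash("brown_cat.jpg") ^ tiny_hash("gray_cat.jpg")
print(bin(diff).count("1"), "differing bits")
```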

13:53.000 --> 13:58.000
That's essentially what we're doing here with our approach,

13:58.000 --> 14:01.000
except we're sizing down code instead of cat pictures,

14:01.000 --> 14:05.000
which are always, of course, a security favorite.

14:05.000 --> 14:11.000
So the initial project plan was to go through

14:11.000 --> 14:15.000
each of these steps; I won't go through that in detail.

14:15.000 --> 14:18.000
But again, this didn't work well.

14:18.000 --> 14:21.000
We were not able to detect a lot of the code,

14:21.000 --> 14:24.000
because of these variations in the literals.

14:24.000 --> 14:26.000
That's one problem.

14:26.000 --> 14:29.000
And you know, good stuff dies hard.

14:29.000 --> 14:32.000
We found an algorithm devised by a guy,

14:32.000 --> 14:36.000
who happened to be the author of a venerable version

14:36.000 --> 14:39.000
control system called CVS,

14:39.000 --> 14:43.000
which was based on RCS,

14:43.000 --> 14:45.000
which then led to Subversion.

14:45.000 --> 14:49.000
It eventually ended up, with a few segues, in Bazaar

14:49.000 --> 14:52.000
and other version control systems.

14:52.000 --> 14:55.000
And the guy, I don't know if he's still around or not,

14:55.000 --> 14:58.000
but he's retired.

14:58.000 --> 15:02.000
At the minimum he's retired; he was working at a Dutch university.

15:02.000 --> 15:06.000
And he wrote a piece of code to actually

15:06.000 --> 15:09.000
transform a stream of code tokens

15:09.000 --> 15:12.000
into something that's generic and makes sense.

15:12.000 --> 15:15.000
It's simple, it's obvious, it's time-tested;

15:15.000 --> 15:19.000
it's been around, what, almost 40 years.

15:19.000 --> 15:23.000
And we call that code stemming, because that looks cool.

15:23.000 --> 15:26.000
But the essence is: the same way you stem

15:26.000 --> 15:31.000
language when you index it for information retrieval,

15:31.000 --> 15:34.000
where two words which share the same stem

15:34.000 --> 15:37.000
will just be abstracted to that stem.

15:37.000 --> 15:41.000
Here we're doing the same with code.
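
A sketch of the idea on Python source, using the standard tokenize module (the real implementation is language-agnostic): identifiers and literals are abstracted away so renamed copies produce identical token streams:

```python
import io
import keyword
import tokenize

def stem(source):
    # Keep keywords and operators (the structure); abstract identifiers
    # and literals (the naming) to generic placeholders.
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            out.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")
        elif tok.type == tokenize.OP:
            out.append(tok.string)
    return out

# Same code, different names: identical stemmed streams.
assert stem("total = price * qty") == stem("s = a * b")
```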

15:41.000 --> 15:45.000
We have another problem which I won't dive into too much,

15:45.000 --> 15:48.000
which is how you eventually distribute the fingerprints

15:48.000 --> 15:53.000
widely, so we don't create this massive centralized database,

15:53.000 --> 15:55.000
which is a lock-in mechanism.

15:55.000 --> 15:58.000
We have some simple federation code to help with that

15:58.000 --> 16:00.000
that we're progressively deploying,

16:00.000 --> 16:04.000
which means that eventually you have access to this

16:04.000 --> 16:08.000
to run on-prem without having any of us in the picture.

16:08.000 --> 16:12.000
It's important to liberate the data.

16:12.000 --> 16:16.000
So, current status: we have something which works.

16:16.000 --> 16:19.000
It's based on boring technology; I love boring.

16:19.000 --> 16:22.000
It's proven; there's nothing funky.

16:22.000 --> 16:26.000
We use old code, and we resist when we have

16:26.000 --> 16:30.000
requests from newbies that say, hey, you know, why don't you use these new tools,

16:30.000 --> 16:31.000
these new things?

16:31.000 --> 16:35.000
No, we use boring, working things.

16:35.000 --> 16:41.000
And we made sure we could also extract low-level libraries

16:41.000 --> 16:45.000
that could be reused for other purposes at the same time.

16:45.000 --> 16:52.000
So, before going there, I'm going to go and do the mandatory live demo,

16:52.000 --> 16:57.000
which for sure will not work, because we're live.

16:57.000 --> 17:00.000
And if I can find...

17:04.000 --> 17:07.000
Scheiße.

17:07.000 --> 17:10.000
I see there's a few Germans in the room, right?

17:10.000 --> 17:14.000
Normally, you're supposed to use French to swear.

17:14.000 --> 17:19.000
But here, I'm using German because that's less recognizable.

17:19.000 --> 17:21.000
Okay.

17:21.000 --> 17:24.000
I'm just going to go there, we'll find it in a moment;

17:24.000 --> 17:28.000
that's a test instance.

17:28.000 --> 17:31.000
And I'm going to search for a test project.

17:31.000 --> 17:34.000
I'm sure we're going to have one.

17:34.000 --> 17:37.000
So, this is the tool we use at the foundation, called ScanCode.io.

17:37.000 --> 17:41.000
You can just look for it, download and run.

17:41.000 --> 17:48.000
It can scan code for origin, like AI-generated code,

17:48.000 --> 17:50.000
but also a lot of things.

17:50.000 --> 17:53.000
Scan containers, and lots of other stuff.

17:53.000 --> 17:58.000
But to match code: there we go.

17:58.000 --> 18:03.000
And this looks like a good demo project.

18:03.000 --> 18:09.000
So, what we did here is scan some code that was AI-generated.

18:10.000 --> 18:13.000
If I recall correctly, the prompt was:

18:13.000 --> 18:17.000
generate some code similar to this JavaScript library that does

18:17.000 --> 18:20.000
Base64 encoding.

18:20.000 --> 18:23.000
We ran these three pipelines in sequence;

18:23.000 --> 18:25.000
what's interesting is to see the results here.

18:25.000 --> 18:28.000
There was one package that was detected,

18:28.000 --> 18:35.000
which is eventually the project that we asked to generate code for.

18:35.000 --> 18:38.000
So, it sounds a bit like a tautology,

18:38.000 --> 18:42.000
but the interesting thing is: ask, with a prompt, an

18:42.000 --> 18:48.000
LLM to generate code about a certain package, and you'll get it back out.

18:48.000 --> 18:53.000
And if we dive a bit into this,

18:53.000 --> 18:56.000
we see three matched resources,

18:56.000 --> 18:59.000
and if we dive into one of them,

18:59.000 --> 19:02.000
and look at the code view here,

19:02.000 --> 19:05.000
we can see the matched fragments

19:05.000 --> 19:09.000
that are essentially the same as what's seen upstream.

19:09.000 --> 19:12.000
And you could dive into the details.

19:12.000 --> 19:15.000
If you look, you'll see some sections

19:15.000 --> 19:17.000
which are not highlighted.

19:17.000 --> 19:19.000
It's just a side effect of the algorithm.

19:19.000 --> 19:20.000
But you look at this code, and you say:

19:20.000 --> 19:25.000
Yes, there's no question that this has been obviously derived

19:25.000 --> 19:27.000
from this upstream project.

19:27.000 --> 19:30.000
And, except for a few non-matched regions,

19:30.000 --> 19:32.000
it's the same code; and again,

19:32.000 --> 19:35.000
this is literally the exact verbatim code

19:35.000 --> 19:38.000
with a few modifications.

19:38.000 --> 19:42.000
So, that's the proof that we're not boasting there.

19:42.000 --> 19:45.000
So, you don't have to trust me.

19:45.000 --> 19:51.000
Next up, the key thing also is being able to detect

19:51.000 --> 19:56.000
the case where you have AI that's used as a feature.

19:56.000 --> 19:59.000
So, that's what we're working on next.

19:59.000 --> 20:05.000
Together with that, we're helping people that build LLMs using code.

20:05.000 --> 20:09.000
In particular, there's a project at Hugging Face,

20:09.000 --> 20:11.000
which they call BigCode,

20:11.000 --> 20:14.000
and they're building a dataset called The Stack.

20:14.000 --> 20:17.000
We're eventually helping them to run

20:17.000 --> 20:19.000
our tool ScanCode at scale,

20:19.000 --> 20:23.000
to ensure that the provenance and license of the code they index

20:23.000 --> 20:25.000
is actually known,

20:25.000 --> 20:29.000
which is a good thing because at least you can trace

20:29.000 --> 20:33.000
when you have potentially generated code from their models,

20:33.000 --> 20:35.000
where this came from accurately.

20:35.000 --> 20:41.000
The next frontier is really to treat models

20:41.000 --> 20:44.000
as software components.

20:44.000 --> 20:49.000
Meaning, there's some companies that probably don't want

20:49.000 --> 20:51.000
to use Chinese models.

20:51.000 --> 20:53.000
It's the case in Germany,

20:53.000 --> 20:57.000
where I think it's been prohibited by the German government,

20:57.000 --> 20:58.000
in some case.

20:58.000 --> 21:00.000
I think DeepSeek has been prohibited.

21:00.000 --> 21:02.000
It's the case in some US corporations

21:02.000 --> 21:05.000
doing business with the US federal government.

21:05.000 --> 21:07.000
I don't care about the reason.

21:07.000 --> 21:10.000
I think frankly, it's overblown and bullshit,

21:10.000 --> 21:13.000
but in any case it's interesting

21:13.000 --> 21:14.000
from a technical point of view to say,

21:14.000 --> 21:20.000
can we detect when a model is based on DeepSeek?

21:20.000 --> 21:22.000
Or when it would have been fine-tuned, for instance, on DeepSeek?

21:23.000 --> 21:28.000
It turns out from the early things we've been looking at

21:28.000 --> 21:34.000
that if you treat the sequence of weights in a model

21:34.000 --> 21:37.000
as a subject of fingerprinting,

21:37.000 --> 21:40.000
we can actually find striking similarities

21:40.000 --> 21:44.000
between a model and its fine-tuned versions.

21:44.000 --> 21:46.000
In many cases,

21:46.000 --> 21:48.000
so when you have quantization,

21:48.000 --> 21:50.000
it's harder and it doesn't work.

21:50.000 --> 21:52.000
But when you don't quantize,

21:52.000 --> 21:56.000
you have essentially a small number of the weights

21:56.000 --> 22:00.000
which are being updated at each generation of fine tuning,

22:00.000 --> 22:01.000
and the rest is truly the same,

22:01.000 --> 22:04.000
and we find these very efficiently.
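
A toy illustration of that observation (not the actual fingerprinting code), assuming two un-quantized checkpoints loaded as NumPy arrays:

```python
import numpy as np

def shared_weight_fraction(a: np.ndarray, b: np.ndarray) -> float:
    # Fraction of bit-for-bit identical weights between two tensors.
    # An un-quantized fine-tune leaves most weights untouched, so a
    # high fraction hints at a base-model relationship.
    a, b = a.ravel(), b.ravel()
    n = min(a.size, b.size)
    return float(np.mean(a[:n] == b[:n]))

base = np.random.rand(1000)
tuned = base.copy()
tuned[::50] += 0.01  # fine-tuning only touches a few weights
print(shared_weight_fraction(base, tuned))  # ~0.98
```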

22:04.000 --> 22:07.000
The last point is to detect

22:07.000 --> 22:12.000
when you have AI,

22:12.000 --> 22:15.000
used through APIs and libraries.

22:15.000 --> 22:16.000
So it's easier,

22:16.000 --> 22:18.000
we already have the code to detect

22:18.000 --> 22:21.000
code similarities to detect libraries.

22:21.000 --> 22:23.000
It's going to be just about tagging them,

22:23.000 --> 22:26.000
so you have LangChain,

22:26.000 --> 22:27.000
a well-known library in Python;

22:27.000 --> 22:29.000
for instance, you want to make sure that you know that you

22:29.000 --> 22:30.000
are using LangChain,

22:30.000 --> 22:32.000
which means you're likely using AI

22:32.000 --> 22:34.000
as a feature in your code.

22:34.000 --> 22:36.000
And API imports:

22:36.000 --> 22:37.000
something that's, yes, as

22:37.000 --> 22:41.000
simple as just doing a grep on URLs.
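
A naive sketch of that tagging; the library and endpoint lists here are small assumptions, not a curated dataset:

```python
import re
from pathlib import Path

# Flag files that import well-known AI libraries or call known AI API hosts.
AI_IMPORTS = re.compile(r"^\s*(?:import|from)\s+(?:langchain|openai|anthropic)\b", re.M)
AI_URLS = re.compile(r"https://api\.(?:openai|anthropic)\.com")

def flag_ai_usage(root="."):
    for path in Path(root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        if AI_IMPORTS.search(text) or AI_URLS.search(text):
            print(f"{path}: AI library or API usage detected")
```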

22:41.000 --> 22:43.000
And that's it.

22:43.000 --> 22:45.000
This was funded in part by

22:45.000 --> 22:47.000
the EU program called NGI Search,

22:47.000 --> 22:49.000
so if you're based in the EU, you paid in part

22:49.000 --> 22:51.000
for this with your taxes;

22:51.000 --> 22:53.000
thank you very much.

22:53.000 --> 22:54.000
The code is yours,

22:54.000 --> 22:57.000
it's not ours, it's for you to use.

22:57.000 --> 23:01.000
And if you have questions,

23:01.000 --> 23:04.000
I'm taking some questions there.

23:04.000 --> 23:06.000
Go ahead.

23:06.000 --> 23:13.000
Thank you very much.

23:13.000 --> 23:16.000
Thank you very much.

23:16.000 --> 23:20.000
Yes, I have a question about

23:20.000 --> 23:22.000
how you transform the code.

23:22.000 --> 23:24.000
Do you use the actual representation,

23:24.000 --> 23:26.000
the textual representation of source code,

23:26.000 --> 23:28.000
or do you transform it into an intermediate representation,

23:28.000 --> 23:31.000
like a kind of abstract syntax tree,

23:31.000 --> 23:33.000
and then analyze that tree

23:33.000 --> 23:36.000
to find matches in the flow of the code

23:36.000 --> 23:39.000
and not the exact words and constructions used?

23:39.000 --> 23:42.000
So the question is,

23:42.000 --> 23:46.000
do we use some kind of intermediate representation of the code

23:46.000 --> 23:48.000
when we're processing?

23:48.000 --> 23:51.000
So the answer is yes and no.

23:51.000 --> 23:54.000
We transform the code,

23:54.000 --> 23:55.000
we parse it,

23:55.000 --> 23:58.000
with a library from GitHub called tree-sitter.

23:58.000 --> 24:01.000
And we basically have streams of tokens

24:01.000 --> 24:05.000
that we then generalize with this code stemming algorithm.

24:05.000 --> 24:09.000
So it's really more of a syntax-based approach.
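
A sketch of that token-stream extraction, assuming a recent py-tree-sitter (0.23+) and the tree-sitter-javascript grammar wheel:

```python
import tree_sitter_javascript
from tree_sitter import Language, Parser

parser = Parser(Language(tree_sitter_javascript.language()))
tree = parser.parse(b"const total = price * qty;")

def leaf_tokens(node):
    # The leaves of the parse tree are the token stream that is then
    # generalized (identifiers abstracted) before hashing.
    if node.child_count == 0:
        yield node.text.decode()
    for child in node.children:
        yield from leaf_tokens(child)

print(list(leaf_tokens(tree.root_node)))
# ['const', 'total', '=', 'price', '*', 'qty', ';']
```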

24:09.000 --> 24:11.000
We don't deal with abstract syntax trees,

24:11.000 --> 24:13.000
control flow and all that.

24:13.000 --> 24:15.000
It's expensive to do.

24:15.000 --> 24:16.000
But it works very well.

24:16.000 --> 24:18.000
The problem is at scale,

24:18.000 --> 24:22.000
doing anything that deals with abstract syntax trees.

24:22.000 --> 24:24.000
There's a guy here, working also with the

24:24.000 --> 24:25.000
AboutCode project,

24:26.000 --> 24:31.000
which does incredibly sophisticated

24:31.000 --> 24:35.000
static analysis to find actually reachable,

24:35.000 --> 24:36.000
vulnerable code.

24:36.000 --> 24:39.000
That's very expensive in terms of compute.

24:39.000 --> 24:42.000
Here we're trying really to have massive matching

24:42.000 --> 24:45.000
and we need to be able to churn through this very fast.

24:45.000 --> 24:47.000
So we're doing some compromises,

24:47.000 --> 24:49.000
but in practice it works pretty well.

24:49.000 --> 24:51.000
Any other last question?

24:51.000 --> 24:54.000
Okay, well, thank you very much.

24:54.000 --> 24:57.000
Thank you.

