WEBVTT

00:00.000 --> 00:04.580
Get there, I'm going to go.

00:04.580 --> 00:06.580
Good.

00:06.580 --> 00:08.580
Now it works.

00:34.580 --> 00:36.580
Good.

01:04.580 --> 01:06.580
Good.

01:35.580 --> 01:36.580
Good.

01:36.580 --> 01:37.580
Now we'll move.

01:39.580 --> 01:41.580
Again?

01:41.580 --> 01:43.580
Now.

01:43.580 --> 01:45.580
Oh, that is too loud.

01:45.580 --> 01:48.580
Is it better now?

01:48.580 --> 01:51.580
Thank you very much.

01:51.580 --> 01:54.580
So.

01:54.580 --> 02:02.580
We were coming from this idea:

02:02.580 --> 02:08.980
we have a bunch of LLVM IR files that we feed into ORC, and ORC would figure out which of them

02:08.980 --> 02:14.420
we need in order to run the code for the symbol that we were asking for.

02:14.420 --> 02:19.460
And it would take these materialization units and compile them to native objects with the

02:19.460 --> 02:26.660
LLVM backend for the target that we need.

02:26.660 --> 02:33.460
And then it links them in memory and loads them for execution, so that eventually we get back

02:33.460 --> 02:39.380
the address of the symbol that we asked for.
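The flow described above can be sketched in C with a mocked-up module table. Everything here (the `Module` struct, `orc_lookup`, the symbol names) is invented for illustration; it models the idea of lookup-then-materialize, not the actual ORC API.

```c
#include <stddef.h>
#include <string.h>

/* Toy model of the ORC flow: a set of "modules", each defining one
 * symbol. Lookup finds the defining module, "compiles" it (mocked by
 * a flag), and returns the symbol's address. All names are invented. */
typedef struct {
    const char *symbol;
    int (*impl)(void);  /* stands in for the compiled native code */
    int compiled;       /* set once the module is materialized */
} Module;

static int forty_two(void) { return 42; }

static Module modules[] = {
    { "main", forty_two, 0 },
};

static void *orc_lookup(const char *name) {
    for (size_t i = 0; i < sizeof modules / sizeof modules[0]; i++) {
        if (strcmp(modules[i].symbol, name) == 0) {
            modules[i].compiled = 1;  /* compile + link in memory */
            return (void *)modules[i].impl;
        }
    }
    return NULL;  /* symbol not defined in any module */
}
```

In the real system the "compile" step runs the LLVM backend on the materialization unit and the JIT linker maps the result into memory; here it is just a flag flip.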

02:39.380 --> 02:41.380
And how does that work in practice?

02:41.380 --> 02:46.500
If we have a hello world example like this with the main function and the print statement,

02:46.500 --> 02:54.180
we can compile to LLVM IR, with Clang in this case, and feed it into lli, the LLVM

02:54.180 --> 02:56.180
interpreter.

02:56.180 --> 03:02.740
lli is a program that uses ORC JIT under the hood, and if we run it like this, it looks

03:02.740 --> 03:07.700
right; it seems to work, as long as we look at small examples.

03:07.700 --> 03:12.580
So if we take something bigger, like bzip2 for example: it's a simple compression

03:12.580 --> 03:17.220
program that is often used for benchmarks.

03:17.220 --> 03:23.220
It consists of 13 C files and one Makefile, and the Makefile describes the build process

03:23.220 --> 03:27.220
for it, around 200 lines of code.

03:27.220 --> 03:34.820
Usually we would just run make, get an executable, and run that executable.

03:34.820 --> 03:42.820
But if we want to get the IR code for the C files and run it through lli, we need to do something

03:42.820 --> 03:43.820
here, right?

03:43.820 --> 03:51.660
We could say: well, let's tell all the C compile commands to emit LLVM IR instead of binary objects.

03:52.620 --> 03:55.260
And that works pretty well for the first seven targets.

03:56.140 --> 04:02.860
But then we want to link a static archive. That's a bit of a problem, because we did not

04:02.860 --> 04:11.180
emit the object files, so there's nothing to link. And even if we tried, LLVM IR has no counterpart

04:11.260 --> 04:18.700
for linking archives or anything like that. So there's not much we can do, except: well, we already

04:18.700 --> 04:25.980
have almost all the bitcode files we need, so we could just compile the driver code manually,

04:26.780 --> 04:32.780
and now we have all the code for the main program and we can run it in lli.

04:32.780 --> 04:37.100
It gets a bit more complicated, but it still works if we execute it like this.

04:37.260 --> 04:44.300
Great, that could be a good approach, right? Or is it really so nice?

04:45.420 --> 04:51.260
Well, there are a few issues here. Let's start with one that is maybe not so obvious.

04:53.260 --> 04:59.260
One of the problems is that we load all code at startup. Why is that?

05:00.220 --> 05:07.420
Well, before we enter main, we have to get all the symbols somehow, because we need to figure out where

05:07.420 --> 05:14.620
main is and where the code for it is. Also, we have to allocate memory for global variables,

05:15.260 --> 05:21.580
because main will expect them to be accessible, right? And we also have to initialize all these

05:21.580 --> 05:26.940
global variables, because we come from a world where everything used to be static and compiled

05:27.020 --> 05:35.900
ahead of time. There is a way to get the first two from the ThinLTO summaries that we can get from

05:35.900 --> 05:43.900
the compiler, and I showed that in a talk a while ago at an LLVM Developers' Meeting, but the third one

05:43.900 --> 05:49.420
is really a problem, and it won't go away, because initialization can call arbitrary code in

05:49.420 --> 05:56.620
our executable, so we can't just generate it ahead of time. So the status quo really is: we load all

05:56.620 --> 06:02.540
code at startup. We'll come back to this in a second; let's first go to the previous slide again

06:02.540 --> 06:09.820
and see what maybe another problem here was. Well, of course, emitting only LLVM IR was the big problem, right?

06:10.380 --> 06:16.620
We can't just do that for real-world projects: we have static archives, we have dynamic libraries,

06:16.700 --> 06:24.700
we want to pass linker flags. All of this would not work together with emitting only LLVM IR. So,

06:25.420 --> 06:32.940
all in all, the observation, not only from this slide but also from this one, is: we need the build

06:32.940 --> 06:38.460
process, and it's quite complicated, because our platforms are a bit messy, and the build process is

06:38.460 --> 06:46.060
what holds everything together. When we look at other projects and how they build their code, for

06:46.060 --> 06:53.820
example Julia or other language implementations in this class, they are built on the dynamic

06:53.820 --> 06:59.900
execution model, and they bring everything included that needs to be done to build

06:59.900 --> 07:08.700
the code; they have first-class JIT support. But C, C++, Rust, Swift: they don't have that, and most of all,

07:08.700 --> 07:16.460
we are missing build system integration. So what could we do? Well, the idea is that we could build

07:16.460 --> 07:24.700
an LLVM plugin to somehow connect these two worlds, because plugins work very well with LLVM-based

07:24.700 --> 07:34.540
compilers, and they're easy to use with existing build systems. The project is on GitHub, and

07:34.540 --> 07:40.940
the code is here on the slide. And the idea is this: we hook into the compiler pipeline very early,

07:41.660 --> 07:49.020
early as in: before we do a lot of work on the code, we cut out all the code and put it somewhere,

07:49.020 --> 07:56.780
like on disk maybe, and then we inject a JIT loader in its place, and otherwise we just compile

07:56.780 --> 08:06.460
everything as usual, into a kind of shallow binary. If we do that for our bzip2 example,

08:07.340 --> 08:12.780
it looks a bit like this: we pass the plugin in the CFLAGS here, and this will obviously

08:12.780 --> 08:18.140
cause no problem as long as we use a compiler like Clang, because it will not break our build.

08:18.300 --> 08:27.180
And this works not only with make, but with CMake, Cargo... If you want to see a few samples, there is

08:28.220 --> 08:40.380
a CI with GitHub Actions, with one example for each project. And that should be... yes, everything here.

08:40.540 --> 08:50.780
So, back to the idea. That was the first part; what about the next steps? We cut out code and

08:50.780 --> 08:57.260
inject the JIT loader. How would those look? Well, let's say we have a function like this: bzip2's

08:57.260 --> 09:03.340
compressBlock looks like that. It's a lot bigger, but it didn't fit on the slide.

09:04.060 --> 09:11.900
Now we would replace that function body with a JIT loader, so that would maybe look

09:11.900 --> 09:18.380
something like this: we have three blocks, an entry, a materialize, and a call block. In the entry we ask:

09:18.380 --> 09:24.220
did we already materialize the code for this function? If so, we go to call and just invoke it.

09:24.940 --> 09:31.420
Otherwise, we go to the materialize block and call into the runtime. And instead of

09:32.140 --> 09:37.820
requesting a function by name, we request a number here, and build an ID system around that.

09:37.820 --> 09:46.780
That's basically what ThinLTO summaries have: globally unique IDs that identify functions

09:46.780 --> 09:52.780
instead of names, because that's a lot easier, and we can just store all the information

09:53.580 --> 10:00.380
we need to identify which file it is in and what code we need to load, in the static binary.
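As a rough illustration, here is such a stub written in C rather than IR. The function names, the ID value, and the runtime hook are all invented, and the extracted body is mocked; the real loader replaces the body at the IR level with exactly these three blocks.

```c
#include <stddef.h>
#include <stdio.h>

/* The extracted function body; in reality this lives in a bitcode
 * file on disk and is compiled on demand. */
static int real_body(int x) { return x + 1; }

/* Mock runtime hook: materialize the function with the given global
 * ID and return its address. Invented name and signature. */
static void *rt_materialize(unsigned id) {
    printf("materializing function %u\n", id);
    return (void *)&real_body;
}

static int (*cached)(int) = NULL;  /* filled in on first call */

static int stub(int x) {
    if (cached == NULL)                           /* entry block */
        cached = (int (*)(int))rt_materialize(7); /* materialize block */
    return cached(x);                             /* call block */
}
```

The cached pointer means the runtime is hit only on the first call; every later call goes straight to the materialized code.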

10:02.380 --> 10:08.380
Since this code doesn't live in there anymore, we also have to inject some more information,

10:08.380 --> 10:15.420
for example the bitcode file that we store the code in. And we append a global static

10:15.420 --> 10:21.580
initializer that will register this file with the runtime, so that it knows where it is.
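A minimal sketch of that registration in C, with a mocked runtime hook and an invented file path; the real plugin emits an equivalent global static initializer in IR for each translation unit.

```c
#include <stddef.h>

/* Mock of the runtime's registration hook; in the real system this
 * would record where the translation unit's bitcode file lives. */
static const char *registered_path = NULL;

static void rt_register_bitcode(const char *path) {
    registered_path = path;
}

/* The plugin appends a constructor like this to each translation
 * unit, so registration happens automatically at program startup.
 * The path is invented for illustration. */
__attribute__((constructor))
static void register_this_unit(void) {
    rt_register_bitcode("compress.bc");
}
```

Because it is a constructor, it runs before main, so by the time any stub asks for code, the runtime already knows which bitcode file to open.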

10:22.300 --> 10:33.740
Regarding the last point: what do I mean by a shallow binary? How would that look?

10:35.660 --> 10:43.020
If we look at the call graph of our bzip2 example, it looks about like this: we have a main function

10:43.020 --> 10:49.500
over here, and all the other things are functions that are reachable. That's a regular executable;

10:49.580 --> 10:58.060
boxes are functions and arrows are calls. Now imagine we replace all function bodies

10:58.060 --> 11:05.100
with a JIT loader. Well, then we also remove all the calls, right? And it would look like that.

11:06.860 --> 11:13.900
Almost, because we also run dead code elimination, of course, and then it would look like this.

11:14.860 --> 11:23.820
What remains are the exported entry points of our translation units; that's like the

11:23.820 --> 11:33.580
public functions in our C files. And this is great for the ID system, because it's a lot fewer

11:33.580 --> 11:40.300
entries than before. It's also great for ABI compatibility, because our translation units, from

11:40.380 --> 11:47.260
the outside, stay as they are. They don't differ from regular compilation units, and we can mix

11:47.260 --> 11:53.900
and match these with regular ones. So if something doesn't work yet, we can just say: okay, let's not

11:53.980 --> 11:59.180
auto-JIT this object file; we just keep the original one in this case.

12:06.060 --> 12:12.700
Exactly, and all the remaining functions are basically really just symbols, with a little

12:12.700 --> 12:18.700
stub attached that will load the actual code. And that is something we see in the binary size:

12:19.420 --> 12:25.820
the original binary is 190 kilobytes, on the left side, and the auto-JIT-compiled one is 43

12:26.860 --> 12:34.220
only. It's important to note that the size reduction is really

12:34.220 --> 12:40.060
only in the text section, or maybe exception data or something like that, but not in the data section,

12:40.060 --> 12:48.220
because this is the other idea in the concept: we just leave all the data, all the global variables,

12:48.300 --> 12:59.500
in there. That means we can solve our status quo problem: we don't have to build these things

12:59.500 --> 13:04.060
at startup anymore; we can implement real laziness, because all the things that we need

13:04.860 --> 13:09.020
upfront at startup are already there, in the static executable.

13:09.420 --> 13:21.020
In the project, there's a benchmark mode, some scripts that run benchmarks; it

13:22.380 --> 13:27.820
outputs things like this: it shows binary sizes, it shows average compile times.

13:28.380 --> 13:35.900
Compilation doesn't get much faster yet, because LLVM compile times are not such a big part when you compile

13:35.980 --> 13:43.820
your C++. And runtimes are a bit slower, of course, because we have to materialize the code

13:43.820 --> 13:51.980
on the fly, but it's not that big of a difference if we look at the runtimes. There is a proof of

13:51.980 --> 13:57.580
concept release. This is really super early stage; don't expect anything to work.

13:59.100 --> 14:03.900
But there are two types of runtimes. One is a static out-of-process runtime, where

14:04.860 --> 14:11.020
you can link a static archive that implements the runtime functions, materialize and

14:11.020 --> 14:18.300
register, and that would talk to a daemon process to do all the JIT compilation, because

14:18.300 --> 14:23.020
this is pretty heavy; the runtime is pretty big, it brings a whole LLVM compiler.

14:24.220 --> 14:30.860
And the dynamic in-process runtime is just a shared library, but that causes issues sometimes:

14:30.860 --> 14:36.300
for example, Rust developers don't want to link a bunch of C++ into their code, and things like that.

14:36.300 --> 14:40.460
There are also issues when we compile LLVM binaries with it, because it confuses symbols.

14:41.820 --> 14:47.020
And last but not least, there is a vast backlog of to-dos. With that, I would say

14:48.140 --> 14:54.300
this is kind of a novel approach for running ORC JIT at scale, because you can in principle build

14:54.300 --> 15:00.300
projects of any size with it, and not just small examples. At least, I don't know of any other

15:00.300 --> 15:07.500
project that does that so far. Yes, we reduce binary size, and compile time ideally.

15:09.260 --> 15:14.860
One idea is that incremental builds might be able to avoid relinks, because if you only change

15:14.860 --> 15:19.020
the implementation of a function, then the actual object file will not change, and the

15:19.020 --> 15:23.660
linker doesn't need to do anything. Actually, we just load a different bitcode file from disk.

15:25.020 --> 15:29.340
And we can mix and match objects, so we don't have to do this for the whole program. We could say:

15:30.300 --> 15:33.340
there's something that doesn't work? Okay, then let's not do it here.

15:35.580 --> 15:42.380
And the real core part is the freedom to leave global variables in the static executable,

15:43.100 --> 15:52.460
so that we can support real laziness. Here's the project, and if you have questions, we have five minutes.

15:52.700 --> 15:59.180
yes

16:00.460 --> 16:04.460
How can you say that you reduce the binary size, when you have all the

16:04.460 --> 16:10.460
bitcode on your file system and you just load it into the application? So I don't know if this is really

16:10.460 --> 16:17.740
something for production, actually. But the point here is more that if you reduce the binary

16:17.900 --> 16:23.500
size, then you also reduce the link time, and if you link huge binaries, like Chromium or something,

16:23.500 --> 16:30.700
then link times are really huge, and if you relink only a tiny bit of it, then the

16:30.700 --> 16:36.060
edit-compile-test cycle gets much shorter, because if you only link a few megabytes instead of a

16:36.060 --> 16:38.700
few gigabytes, that might be a benefit.

16:38.780 --> 16:45.420
If you're postponing it to the runtime, then when main starts, it starts to get compiled by the runtime?

16:45.420 --> 16:52.620
Exactly. The idea is that the static executable has none of the actual code; it's just a frame

16:53.180 --> 16:59.100
that is executable and knows where all the code it needs is. Then, when it starts, it will

16:59.100 --> 17:05.580
first of all initialize global variables and call all these constructors, and for that it

17:05.660 --> 17:13.180
will start loading code, and then it will enter main and go down the regular code path.

17:15.020 --> 17:25.580
Just to continue: this seems like it concerns cross-platform compilation as well? So, for example, if I compile

17:26.540 --> 17:35.980
and basically I have as output the emitted LLVM IR, are we able to use it for any platform,

17:35.980 --> 17:43.260
since all that remains is to compile it? I know LLVM IR is not platform-agnostic; it's very specific to the

17:43.260 --> 17:51.580
platform. Yeah, yeah. Right, the question was whether this would be a cross-platform solution,

17:51.580 --> 18:00.220
but it is not, because the IR is very specific to the platform. Yes? If I heard correctly, you're now running

18:00.220 --> 18:07.180
the global constructor functions delayed, lazily, instead of at startup?

18:08.060 --> 18:13.660
Yes. The question is that the global constructors need to be run. Yes, that's true; that

18:13.660 --> 18:20.220
doesn't work at compile time, and yes, this is just like every other code that runs.

18:21.580 --> 18:28.380
Do they still run before main executes? Yes, yes, global constructors need to be run before main

18:28.380 --> 18:34.700
executes, because main expects everything to be initialized. But in this concept, global constructors

18:34.700 --> 18:39.260
are no different from any other function calls; it's just calls into the runtime, and it will

18:39.260 --> 18:47.180
find the code, materialize it, and run it. You initially mentioned Julia and

18:47.180 --> 18:53.100
other languages that have built-in JIT support. Is that a fair comparison? Because

18:53.100 --> 18:58.220
a language that has first-class JIT support typically does tracing or some sort of

18:58.220 --> 19:04.380
runtime optimizations, and can also invalidate code. So I think it's a different

19:05.740 --> 19:12.220
idea than what you're following here. Is that a correct assumption? Yes, the question is

19:12.940 --> 19:18.460
if it's fair or reasonable to compare dynamic languages with statically compiled languages, and

19:18.460 --> 19:26.460
yes, it's not. It was more meant as a comparison of what other build systems are out there for

19:26.460 --> 19:41.260
languages that use objects so far. Yes? Yes, yes. The question was: what is the overhead of the dynamic

19:41.260 --> 19:49.100
loading? And it depends: it depends on whether you have endless small functions, for example, or

19:49.100 --> 19:53.820
reasonably sized functions. There's no threshold implemented here; it could all be

19:53.820 --> 20:00.220
done, it's just an experiment so far. But this is real code, bzip2 is real code,

20:00.220 --> 20:06.620
and here the runtime difference is not very big. But there's also a trick here, actually:

20:06.620 --> 20:12.540
it's bigger because this is using the TPDE backend; for all the others, that is very fast,

20:13.260 --> 20:17.660
but anyway, it's good as a baseline JIT.

20:27.180 --> 20:33.500
Can you say that again? How does LLVM auto-JIT affect the runtime performance of larger projects?

20:36.620 --> 20:44.700
Probably not so much, probably like this. As I said, the difference is whether the code is in many

20:44.700 --> 20:52.460
big functions or only these small functions everywhere. Thank you, that's everything;

20:52.460 --> 20:55.180
you can ask me more questions in the hallway. Thank you very much.

