WEBVTT

00:00.000 --> 00:16.720
Hello everyone, I'm Nick, thank you for being here.

00:16.720 --> 00:24.440
So I'm going to talk about bare-metal programming on RISC-V and some cool stuff, so

00:24.440 --> 00:25.960
how it all started for us.

00:25.960 --> 00:31.960
I work at FORTH and we are involved with lots of European projects and we bring up RISC-V

00:31.960 --> 00:32.960
prototypes.

00:32.960 --> 00:38.880
So we get a lot of experimental stuff, we have ASICs, we have bitstreams,

00:38.880 --> 00:45.280
we have a lot of different components and we do integration and we also do bring-up,

00:45.280 --> 00:51.120
so our hardware team gets all the different parts, puts them together and the software

00:51.120 --> 00:53.320
team needs to make this thing boot.

00:53.320 --> 01:00.640
So usually, in the hardware team, when we get an IP, we expect it to be verified,

01:00.640 --> 01:06.440
to have passed all the compliance suites, and we expect a RISC-V core to follow the spec.

01:06.440 --> 01:10.280
This doesn't always happen and even when it does, there are still bugs in there.

01:10.280 --> 01:14.240
There are a few horror stories I can share with you, like okay, compressed instructions,

01:14.240 --> 01:21.320
kind of don't work because of the alignment, the MMU is not really working, the branch

01:21.320 --> 01:29.600
predictor, we have seen crazy things, and the point is that you need a lot of testing

01:29.600 --> 01:34.840
and the standard tests, I mean the simple tests that are out there just for compliance, are

01:34.840 --> 01:38.760
not going to cover all of this, so you need to develop your own tests.

01:38.760 --> 01:45.760
Both for testing the IPs, like the harts, the RISC-V harts, and then you need to slowly expand

01:45.760 --> 01:50.040
your test coverage because it's not just the RISC-V harts, you also have an interrupt controller

01:50.080 --> 01:55.120
there, you may have an IOMMU, you may have other stuff on the system that also

01:55.120 --> 02:01.640
needs validation, and so, for example, when you go further out, we have had cases, for example,

02:01.640 --> 02:12.320
where things were messed up in the bus, like we had a case where the NVMe was

02:12.320 --> 02:16.800
looking to the outside, or another case where our DRAM was messed up and we didn't have memory

02:16.800 --> 02:22.760
because the DRAM traces were wrong. In any case, testing is really important and the sooner

02:22.760 --> 02:29.160
we catch the bugs, the better, because when you miss something and it goes to the next stage,

02:29.160 --> 02:36.000
it is harder to fix, and once you go to tape-out it's basically game over, like it becomes

02:36.000 --> 02:41.840
super expensive to fix the bugs when you move on, so the earlier you catch them the better.

02:41.840 --> 02:48.840
So what we started with was to develop our own tests, platform-level tests, mostly

02:48.840 --> 02:56.160
like litmus tests, for example, for the memory subsystem to verify cache coherency with

02:56.160 --> 03:05.600
other devices, or other tests like interrupts, the timers and stuff, and all

03:05.600 --> 03:11.240
these simple tests became a bit of a mess. So we said, maybe we should create

03:11.240 --> 03:19.480
a simple framework so that we can write bare metal tests for our stuff instead of starting

03:19.480 --> 03:31.480
from scratch. So that's how the bare metal framework started and we wanted to make it a

03:31.480 --> 03:39.160
bit more flexible so that we could grow the test coverage even more so that we could have

03:39.160 --> 03:43.880
more tests, being able to write simple tests but also expand it to write more complicated

03:43.880 --> 03:53.520
tests. And then we figured out that, you know what, basically now

03:53.520 --> 04:00.320
we have a framework and kind of an SDK to run bare metal applications and let's use it

04:00.320 --> 04:05.280
for other stuff as well. Let's use it for debugging, for profiling or benchmarks, because this

04:05.280 --> 04:09.000
is cool stuff, like when you're running benchmarks on bare metal, you don't have any

04:09.000 --> 04:12.960
noise from the operating system, you can get very accurate measurements and you can

04:12.960 --> 04:21.240
profile your hardware very efficiently. And then when we cleaned it up and it became

04:21.240 --> 04:26.880
readable and presentable, it also turned out that this was a good tool for education, because

04:26.880 --> 04:31.720
we are in a research center, we work closely with our university and it seemed like a good

04:31.720 --> 04:40.720
way to demonstrate to students how things work, because when they first take

04:40.720 --> 04:46.560
an operating system class, they end up checking out Linux or some very complicated OS and

04:46.560 --> 04:52.320
they get lost in all this mess, like how the timing subsystem works, how do we get,

04:52.320 --> 04:56.920
how the scheduler gets interrupts, how do I have the privilege modes, how do I switch from

04:56.920 --> 05:02.880
one to another, how do I do system calls, they don't see that, so we figured out that if

05:02.880 --> 05:06.720
you have a good starting point, something very simple and clean, it's also a good

05:06.720 --> 05:11.680
tool for education. So we started this, we joined the RISC-V Foundation

05:11.680 --> 05:17.880
around 2018, and we were already involved with this kind of project, and by 2019

05:17.880 --> 05:26.320
this started to become a thing, and it evolved, and it's now at a stage where

05:26.320 --> 05:36.520
it's okay, I guess, to release, so it supports the standard RISC-V extensions, it supports

05:36.520 --> 05:43.680
AIA, and it has support for the previous generation of interrupt controllers, like the CLINT,

05:43.680 --> 05:50.240
the PLIC, and then the CLINT was split into, you have the MTIMER and the device for sending

05:50.240 --> 05:55.680
IPIs, this is the ACLINT, and it also has the PLIC and APLIC and IMSIC, so you have

05:55.680 --> 06:02.640
it all, you can support everything from the RISC-V systems that had only CLINT and PLIC to more

06:02.640 --> 06:08.160
recent ones that have IMSIC and APLIC, and it also supports MSIs. It has SMP support,

06:08.160 --> 06:12.400
we added this because we wanted to play with the memory subsystem also, so we had to support

06:12.400 --> 06:19.520
multiple harts so that we were able to do, you know, concurrency tests between them, and

06:19.520 --> 06:25.280
complex memory layouts, I'm going to come to that later, and I also wanted to make this

06:25.280 --> 06:30.800
simple for hardware people to play with. [...] a piece of code that runs, and this is the job of this step,

06:30.800 --> 06:38.000
it's to prepare for the next boot stages, fetch your first-stage boot loader, your firmware,

06:38.000 --> 06:44.560
jump there and then the next boot stage will fetch the kernel and move forward, so this thing

06:44.560 --> 06:49.760
needs to fetch these images from somewhere, in a normal system you get them from a flash,

06:49.760 --> 06:58.640
in our case you may not have a flash, when you have an FPGA there you usually have an RJ45,

06:58.640 --> 07:05.280
you have connectors, you can plug into a network, into a switch, but if you want flash you actually

07:05.280 --> 07:10.720
need to get a flash chip and put it on the FPGA, you have it like a separate module, so it's more

07:10.720 --> 07:16.320
frequent to be able to have network there than being able to put a flash also, if you have network

07:16.560 --> 07:24.160
you also have the whole stack, but this is about the netboot thing I'm going to talk about later,

07:25.360 --> 07:30.400
the thing is that when you write a boot ROM you have to deal with the memory layout, you may

07:30.400 --> 07:35.760
not have DRAM yet, okay, and even if you have DRAM, your DRAM address space may be very far away from

07:35.760 --> 07:42.160
ROM, or your RAM, your SRAM for example or your scratchpad or whatever, may be very far,

07:42.240 --> 07:47.120
in terms of physical addressing, very far from your code. Now when you're in a regular

07:47.120 --> 07:51.760
application the operating system loads you and the memory layout is virtual memory, so everything

07:51.760 --> 07:59.440
is next to each other you don't have this issue, but when you're restricted by the memory layout

08:01.040 --> 08:06.400
if your text, if your code is very far away from your data, the linker will not be able to resolve

08:06.400 --> 08:11.840
the symbols, because we have two ways basically to resolve symbols, one is PC-relative, so basically

08:11.920 --> 08:16.400
you have the program counter and you can find any symbol that's before or after it, within

08:16.400 --> 08:23.680
two gigabytes from your program counter, so if your RAM region is like four gigabytes away from

08:23.680 --> 08:29.520
text you won't be able to put any symbols in there, then you have GP-relative, so with GP-relative

08:29.520 --> 08:36.480
basically you have an absolute address somewhere, you have the GP register, the global

08:36.480 --> 08:42.000
pointer, you put it there and now you can resolve symbols that are plus or minus two kilobytes,

08:42.000 --> 08:47.760
I think, from the global pointer, so you can have four kilobytes of symbols

08:47.760 --> 08:52.640
anywhere in your physical memory, but you're restricted to only four

08:52.640 --> 09:02.320
kilobytes. So there is a large memory model that tries to solve this, but how do
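
NOTE
Editor's sketch, not code from the talk: a minimal C illustration of the two reach limits described above. The function names are hypothetical. The PC-relative range is simplified to plus or minus 2 GiB (the exact range of an auipc-based sequence is shifted by 2 KiB), and the GP-relative range is the signed 12-bit immediate.
```c
#include <stdbool.h>
#include <stdint.h>
/* Hypothetical helpers, for illustration only. */
/* Simplified PC-relative reach: symbol within 2 GiB before or after pc. */
bool reachable_pc_relative(uint64_t pc, uint64_t sym)
{
    uint64_t dist = sym >= pc ? sym - pc : pc - sym;
    return dist < (UINT64_C(1) << 31);
}
/* GP-relative reach: the offset must fit a signed 12-bit immediate,
 * which gives a window of 4 KiB around the global pointer. */
bool reachable_gp_relative(uint64_t gp, uint64_t sym)
{
    int64_t off = (int64_t)(sym - gp);
    return off >= -2048 && off <= 2047;
}
```
A RAM region 4 GiB away from the text fails the first check, which is the situation the speaker describes.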

09:02.640 --> 09:07.680
you solve this? Basically, instead of having the linker resolve all those symbols that are very far away,

09:07.680 --> 09:13.200
you grab the addresses, you put them as values somewhere, let's say in rodata or in text,

09:13.200 --> 09:19.200
you store them and when you want to find them you load the address from rodata that's close to your

09:19.200 --> 09:27.200
text, so basically instead of resolving those addresses as symbols, from the linker's table,

09:27.280 --> 09:32.160
you grab them as values, you store their addresses in the binary,

09:32.160 --> 09:37.360
so you resolve them at runtime instead of resolving them at link time. Now, if you follow the

09:37.360 --> 09:42.960
large memory model, which is not ratified yet, it adds addresses for every symbol in there,

09:42.960 --> 09:49.440
I mean from last time I checked, so it adds them for functions, for all symbols, so it grows a lot

09:49.440 --> 09:56.240
and a boot ROM needs to be very small because we are constrained in size, and also, last time I checked,

09:56.240 --> 10:00.320
it added those symbols in the text segment and not in rodata which to me is a problem because

10:00.320 --> 10:06.800
I want to be able to have clear separation between code and data, so basically what I did was

10:09.360 --> 10:18.000
do it manually and just make sure that I don't exceed those four kilobytes, so I have the data

10:18.000 --> 10:23.920
and BSS and I can put them anywhere in physical memory, but it has to be

10:23.920 --> 10:31.040
less than 4K, which is doable. So you have to introduce a malloc there, if you want, for example,

10:31.040 --> 10:35.680
when you do networking or other stuff and you need large buffers, instead of having them as global

10:35.680 --> 10:40.960
variables, in which case they end up in data and BSS and take up that space, you use malloc

10:41.520 --> 10:51.360
and you have them allocated at runtime, which works without a problem, so that's one of the
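
NOTE
Editor's sketch, not code from the talk: the talk does not say how its malloc is implemented, so a bump allocator is swapped in here as the simplest stand-in. The names and the arena size are hypothetical. It shows how large buffers can be taken from a separate region at runtime instead of being placed in data or BSS.
```c
#include <stddef.h>
#include <stdint.h>
/* Hypothetical arena. On real hardware this would be a RAM region
 * described by the linker script, not a static array (a static array
 * would land in BSS, which is exactly what the allocator avoids). */
#define ARENA_SIZE 4096u
static _Alignas(8) uint8_t arena[ARENA_SIZE];
static size_t arena_used;
/* Bump allocation: round the current offset up to 8 bytes, hand out
 * the next chunk, never free. Returns NULL when the arena is full. */
void *bump_alloc(size_t size)
{
    size_t start = (arena_used + 7u) & ~(size_t)7u;
    if (start > ARENA_SIZE || size > ARENA_SIZE - start)
        return NULL;
    arena_used = start + size;
    return &arena[start];
}
```
A network receive buffer, for example, would be requested once with bump_alloc() during initialization rather than declared as a global array.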

10:51.440 --> 10:57.520
problems that we had to solve, the other is flexibility, okay, prototypes are still work in progress,

10:58.400 --> 11:04.720
so even if you're doing an FPGA design you still need to put this in the bitstream and it will take

11:04.720 --> 11:09.040
area there, so it cannot be too much, it has to be like a few kilobytes,

11:10.080 --> 11:15.360
the restriction I got from my hardware team was like 30 kilobytes, you have this hard limit,

11:15.520 --> 11:21.520
anything you do, you cannot exceed 32 kilobytes, so that's the pain, how about the joy,

11:21.520 --> 11:28.880
the joy is that you have full control, okay, bare metal, nothing else, like no OS, you do everything

11:28.880 --> 11:34.160
from scratch, you don't need to deal with weird abstraction layers, I mean if you work in

11:34.160 --> 11:38.240
the Linux kernel, you need to write a simple driver there and you

11:38.320 --> 11:51.840
need to attach to all these interfaces and stuff, no, this is dead simple, sorry, yep, so it's

11:51.840 --> 11:59.200
cleaner, right, you have full control and it's also cleaner, and another cool thing is that because the

12:00.000 --> 12:04.400
RISC-V specs define an ecosystem and they are consistent with each other,

12:06.400 --> 12:14.560
the RISC-V spec is simple to follow, I mean we don't have, like, overly complicated

12:14.560 --> 12:20.320
stuff, check out the interrupt controllers, check out the trap handling, okay, it's not that

12:20.320 --> 12:24.880
hard, I mean you can implement all of that, you can support the whole RISC-V set of

12:24.960 --> 12:34.560
specifications without too much complicated code, so it's easy to follow, and for example,

12:34.560 --> 12:39.680
if you have to integrate this into a more generic operating system, because of all these abstractions,

12:39.680 --> 12:45.680
you miss this simplicity, okay, because you have a very simple interrupt controller,

12:45.680 --> 12:52.160
but now this driver needs to be part of something that can also handle some really complicated

12:52.160 --> 12:58.480
stuff, so you're missing this simplicity, in this case, because things are simple, you can keep them

12:58.480 --> 13:05.120
simple, it's also a great opportunity to have fun, to stretch your skills and experiment,

13:05.120 --> 13:10.800
because you have, like, full access, full control and full visibility, and you don't have any noise

13:10.800 --> 13:16.320
from the rest of, like, other applications running there or the operating system,

13:16.320 --> 13:20.480
so if you want to do profiling, for example, you can get very accurate measurements, when you're

13:20.480 --> 13:28.640
doing multiprocessing, you can explore very, very interesting things, and again, in case you work with

13:28.640 --> 13:35.760
students, really, just working on bare metal, just exploring, this is a great teaching tool,

13:36.640 --> 13:44.160
so, a use case for that, and it's pretty mature, is that in an FPGA you may have network,

13:44.160 --> 13:49.280
but you don't usually have flash, because flash is an external device, you need to actually

13:49.360 --> 13:54.320
attach flash there, because, you know, it has to be persistent, the bitstream is not persistent,

13:54.320 --> 13:58.880
but you can have a network card in there, and with your network card, you can connect to some persistent

13:58.880 --> 14:04.080
storage, and actually, if you connect to the network, you can get everything, you can fetch your

14:04.080 --> 14:08.720
boot images, you can fetch your kernel, then you can mount your rootfs over NFS, you can

14:08.720 --> 14:14.160
boot a whole Linux, you can also have internet access, you don't need storage, right, with the network

14:14.240 --> 14:19.760
you have pretty much everything, and actually, networking is usually faster than flash,

14:19.760 --> 14:25.120
so, if you just go out, you can get an SFP connector there,

14:25.120 --> 14:29.840
you can fetch your boot images just like this, if you go through SPI to fetch them, it's

14:29.840 --> 14:36.880
much slower, even with logic in there, so it's much more flexible, and I think if you're

14:36.880 --> 14:41.120
developing stuff, if you're playing with embedded boards and SBCs, you've probably ended up in

14:41.120 --> 14:45.600
this situation where you have to plug and unplug SD cards all the time, it's a pain,

14:45.600 --> 14:49.520
or you have to wait for the programmer to reflash them, and it becomes a mess, in this situation,

14:49.520 --> 14:54.400
you just compile your stuff, you boot it, and it will come and fetch it,

14:54.960 --> 15:02.800
you don't need to reprogram any flash or wait for it. So, I started working on this boot loader,

15:02.800 --> 15:07.440
on this boot ROM that has network capabilities, because we didn't

15:07.520 --> 15:14.720
have any storage and it would be useful, so I added support for EmacLite, because it was the

15:14.720 --> 15:19.040
simplest Ethernet interface we could put in there to play with, it's only up to 100 megabits,

15:19.040 --> 15:24.880
but it was something to start with, I added support for our custom one-gigabit

15:26.000 --> 15:34.160
NIC, which is based on AXI Ethernet from Xilinx, and later on I added virtio,

15:34.160 --> 15:39.760
so that I can run this whole thing in QEMU and support virtio-net. So, what this thing does,

15:40.640 --> 15:46.560
it negotiates a different block size when you're talking to the TFTP server, the window size allows you, instead

15:46.560 --> 15:51.920
of acknowledging every packet, to acknowledge a group of packets, and tsize is an extension

15:51.920 --> 15:57.280
where you ask the TFTP server to tell you the size of the file, this is optional, the block number

15:57.280 --> 16:01.840
rollover, I have handled that there as well, because there is a gap in the TFTP spec, they have this

16:01.840 --> 16:07.360
two-byte counter for the block number, and they didn't really define how

16:07.360 --> 16:14.000
this will wrap around. Anyway, so this is about the networking stuff, this is what gets you
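
NOTE
Editor's sketch, not code from the talk: one way a client could cope with the 16-bit TFTP block counter wrapping. The function names are hypothetical. Since the base spec leaves the rollover undefined, some servers continue from 0 and others from 1, so the policy is passed in by the caller here.
```c
#include <stdbool.h>
#include <stdint.h>
/* Hypothetical helper: next expected block number after 'cur'.
 * The 16-bit counter overflows after 65535, and the caller decides
 * whether the sequence continues from 0 or from 1. */
uint16_t tftp_next_block(uint16_t cur, bool wrap_to_one)
{
    if (cur == UINT16_MAX)
        return wrap_to_one ? 1 : 0;
    return (uint16_t)(cur + 1);
}
/* Bytes received so far, computed from a wide count of completed
 * blocks so the total does not depend on the wrapped 16-bit value. */
uint64_t tftp_bytes_received(uint64_t blocks_done, uint32_t block_size)
{
    return blocks_done * block_size;
}
```
With a block size of 1468 bytes, for example, the counter wraps after roughly 96 megabytes, so a 5 megabyte image never hits it, but larger images would.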

16:14.000 --> 16:20.240
the image, and now when you get the image, you have to parse it, the reason is that when you want to

16:20.240 --> 16:24.400
boot a system, you probably want to boot Linux, for example, you need, let's say, OpenSBI,

16:24.400 --> 16:28.240
but you also need the device tree, you probably need more than one image, right,

16:28.240 --> 16:33.120
you can compile things in, but then it's not flexible, because you need to

16:33.120 --> 16:37.760
rebuild the whole thing. So, I created an image container format that can have many

16:37.760 --> 16:43.280
partitions, and I have an image parser there, and it also has compression, so the image

16:43.280 --> 16:48.880
can be compressed, which makes sense, because if you add OpenSBI and the kernel

16:48.880 --> 16:53.840
and all, it might be like 17, 20 megabytes. Just with LZ4, which is a very simple

16:53.920 --> 17:00.320
decompressor, it's like 90 lines of code or something, you can get down to 5 megabytes,

17:01.600 --> 17:07.360
and I have some integrity checks there, and I'm also preparing this to support secure boot,

17:08.880 --> 17:14.320
and as I mentioned, I'm still under this limitation of 32K, right? So, here,

17:14.320 --> 17:23.360
I have a breakdown of how large each part is, where I am right now. So, I am with everything

17:23.360 --> 17:30.000
in there, so this thing connects to the network, it has DHCP, TFTP, it can figure out the

17:30.000 --> 17:39.040
file name through DHCP, it will get the image, you know, unpack it, unzip it, verify it, and all of that,

17:39.040 --> 17:43.760
and it's still less than 32 kilobytes, right? Now, it's 32 kilobytes with debugging information

17:43.760 --> 17:49.280
and colors and everything. So, if I remove that, and I add -Os and deal with some of the

17:49.280 --> 17:57.120
compiler mess, because when you have -Os and LTO, beautiful things happen, it can get to like 20, 20

17:57.120 --> 18:04.800
something kilobytes, and I have space there, I have like 8 to 10 kilobytes, and the goal is to

18:04.800 --> 18:09.760
do secure boot with that remaining space, and of course, I don't have enough space to actually

18:09.760 --> 18:16.400
write the crypto in there, but I have enough space to write a parser, to write a proxy to

18:16.400 --> 18:20.640
Caliptra. Caliptra is a root of trust module, an open-source module, it has a mailbox,

18:20.640 --> 18:26.320
and one of the mailbox commands is to verify a signature, so I can just

18:26.320 --> 18:31.360
hash the image, give the hash and the signature and the public key to Caliptra, and Caliptra

18:31.360 --> 18:36.480
will do the verification for me, and I believe I can still fit that in there, and I also have

18:36.480 --> 18:41.600
some space left to also support flash, so that I can get things from flash if you don't want to

18:41.600 --> 18:45.840
fetch your images from the network, and I will use the same image parser, so I will still be able

18:45.840 --> 18:54.320
to do secure boot with that, so the goal is to finish all that, so have secure boot

18:54.320 --> 19:01.840
in there using Caliptra, and still keep it less than 32K, and as a backup, when we do not

19:01.840 --> 19:07.920
have Caliptra, this is for my own fun, I'll try to add the crypto there to do verification myself,

19:07.920 --> 19:16.720
and keep it under 64k, but I'm not there yet, so that's it, feel free to grab this, have fun,

19:17.600 --> 19:23.040
any bugs please report them, and I hope this is also useful to you and others,

19:23.040 --> 19:29.520
if you're playing with this kind of stuff, I believe it would be useful, I have a demo if you

19:29.520 --> 19:47.360
want, we have a little time left, yeah, right, how do I do that, that's good, better, kind of,

19:59.760 --> 20:11.440
you have unimplemented stuff here, so that's because I support multiple extensions,

20:11.440 --> 20:14.960
some of the harts may not have all of them, you get illegal instructions, I have to

20:14.960 --> 20:22.240
skip them, because it's normal, this is the malloc, allocating stuff, and DHCP starts,

20:22.240 --> 20:30.800
it gets an IP, I get the server address from DHCP, I ask for the, I got the boot

20:30.800 --> 20:36.000
image from DHCP, so DHCP gave me that file name, if I try to open it and it fails, I'll try

20:36.000 --> 20:45.120
another myself, and negotiate the block size, blah, blah, receive it, and jump, so, so I received five

20:45.120 --> 20:49.280
megabytes, the whole thing is like 20 something megabytes, it's five because it's compressed,

20:49.840 --> 20:56.880
and this is the kernel, and OpenSBI, which also contains the kernel as a payload, and the kernel contains

20:56.880 --> 21:02.720
an initramfs as a payload, so that's why you see these boots here, and that's it,

21:04.720 --> 21:06.720
so, any questions?

21:19.920 --> 21:28.000
Why not use FIT as an image format? Because if you see the parser, the FIT parser, I, again,

21:28.000 --> 21:39.680
30 kilobytes, I have to fit everything in there, my image parser is less than 300 lines of code,

21:39.680 --> 21:46.080
and it has CRC32, on every header, it has separators, I support multiple units, multiple
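
NOTE
Editor's sketch, not code from the talk: a bitwise CRC-32 of the kind that could protect a small header. The talk does not say which CRC-32 variant the image format uses, so the common reflected polynomial 0xEDB88320 (the one used by zlib and Ethernet) is an assumption, and the function name is hypothetical.
```c
#include <stddef.h>
#include <stdint.h>
/* Bitwise CRC-32, reflected polynomial 0xEDB88320 (assumed variant).
 * Small in code size and table-free, at the cost of speed. */
uint32_t crc32_bitwise(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}
```
A table-free loop like this trades throughput for size, which may suit a boot ROM with a hard size limit.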

21:46.080 --> 21:54.320
partitions, I have, I mean, it's better, and it also has, like, it is ready for secure boot,

21:54.320 --> 22:01.280
so I also have, you know, flags for crypto functions, like sizes and stuff, so, I mean, it's

22:01.280 --> 22:05.920
much better, and actually the header format is like 8 bytes, every header is 8 bytes,

22:05.920 --> 22:10.080
everything is 8-byte aligned, so I can simplify the parser, not a problem there,

22:10.080 --> 22:15.200
it's just that the compressor is configured so that it works per byte, so that I can align things,

22:15.280 --> 22:20.000
anyway, I've done some hacks in there to simplify the code and really, it worked,

22:22.000 --> 22:26.240
so yeah, no FIT, custom stuff, but I have a Python script that will generate

22:26.240 --> 22:36.880
the image for you, Claude wrote it for me. Yep, I couldn't find anything that was that small,

22:38.480 --> 22:43.520
and after all, I used to do networking stuff, ages ago, so I enjoy writing network stuff,

22:45.280 --> 22:49.360
it's very simple, I mean, it's just IP plus UDP plus ARP, it's like, again,

22:51.360 --> 22:57.600
430 lines of code, and you can get it in less, because you can skip

22:57.600 --> 23:03.680
CRCs, and, like, all of this is because I'm a bit paranoid with

23:03.680 --> 23:08.960
being strict with the spec, like the DHCP, it could be like a hundred lines, a lot

23:09.040 --> 23:15.600
less if I just ignored validation on the client side, I'm a bit paranoid, so, you could save space,

23:18.880 --> 23:27.600
no TCP, you don't need TCP, TFTP and DHCP are just UDP, so, anything else?

23:27.680 --> 23:44.160
Okay, well, thank you, this is, these are the projects we are working on, and you are all

23:44.160 --> 23:49.360
welcome to the EU summit, I'm working hard on the program committee.

