WEBVTT

00:00.000 --> 00:13.080
Hello, everyone. I'm Matthias. I'm in this presentation. I'm going to talk about some

00:13.080 --> 00:20.360
bugs that I found in the betaian implementations that they were at the end relation of

00:20.360 --> 00:25.960
the specifications. So I was starting to research these sort of bugs and trying to find

00:25.960 --> 00:32.320
out some automatic methods to try to find the simulations of the specification. So I'm going

00:32.320 --> 00:38.440
to talk about this method, but this is working progress. So don't speak any, I mean,

00:38.440 --> 00:44.640
I'll speak more questions than answers I will say to you. I present this also in LPC last

00:44.640 --> 00:52.160
year. So this is an upgrade presentation on this. So first, I would like to introduce myself.

00:52.160 --> 00:59.160
I'm a software engineer, I'd like to enjoy working on operating systems and

00:59.160 --> 01:04.880
utilization. I'm also trying to figure out how I can use format to prove the way that we

01:04.880 --> 01:11.920
code. I also participate in the Mitaio comment manual is reviewing patches to the spec.

01:11.920 --> 01:18.120
And as a reviewer, as I coach her too. So I had all the other tasks too. This is my

01:18.120 --> 01:24.640
need to have repository. You want to check part of my work. So I think you are more or less

01:24.640 --> 01:30.440
familiar with the built IOS specification. So I will just briefly introduce this, just for

01:30.440 --> 01:35.760
those that they don't know it. Well, the built IOS specifications contains the specifications

01:35.760 --> 01:43.240
of drivers and devices that are mostly used in built utilization. So in general, the devices

01:43.240 --> 01:50.520
are implemented in the built machine monitor. And they are meant to emulate a physical device.

01:50.520 --> 01:56.640
In other hand, the drivers are implemented in the operating systems. One, the operating

01:56.640 --> 02:03.160
system is deployed as built machines. So what we have now is different implementation of

02:03.160 --> 02:07.920
the drivers, like depending on the operating system. And then, for example, Linux, Windows,

02:08.000 --> 02:12.680
and one, and then we have different implementation of the devices, depending on the built

02:12.680 --> 02:18.400
machine monitors. So for example, for example, if I crack your in-game, and ask me a man,

02:18.400 --> 02:27.720
leave care on each of them, implement their own version of the devices. In particular, in

02:27.720 --> 02:34.360
this presentation, I'm going to focus to the RASBMM project. So I will give a brief introduction

02:34.360 --> 02:43.480
about how these devices, built IOS devices are implemented in the RASBMM. So these are what

02:43.480 --> 02:50.360
they call, we call, because user devices and implementation is mainly split into two components.

02:50.360 --> 02:56.360
Let's say, we have the front end that is implemented in KIMU, and we have the backend that

02:56.360 --> 03:02.800
is implemented in another process. Like, for example, in this case in the RASBMM project,

03:02.800 --> 03:10.800
under the BIHOS user, yes, right. And these two components talk each other using the BIHOS

03:10.800 --> 03:17.840
user protocol, which is which relies on the unique socket domain. So just to have a more

03:17.840 --> 03:26.080
concrete idea about this, particularly implementation. So for example, when we have a driver,

03:26.080 --> 03:30.880
and then the driver is going to trigger, they get feature command, for example, this is going

03:30.960 --> 03:37.200
to the front end, and it's going to be translated to a message. Like, for example, the BIHOS

03:37.200 --> 03:44.640
get feature command, and then it's going to be get by the backend. And this is using what I

03:44.640 --> 03:54.000
highlight in red is using unit socket. And this is the controlled path, let's say, in red.

03:54.000 --> 03:58.320
That is not only used for the message, but also the front end, as some point, is going to

03:58.320 --> 04:05.280
share a file descriptor that then is used by the backend, which is now in blue, to map

04:05.280 --> 04:10.880
some version of the guess memory, and then access to the build cues, for example. So we have

04:10.880 --> 04:17.120
two paths, the control paths, and the data paths. And just to have an idea about the devices I

04:17.120 --> 04:26.720
are going to talk later in the presentation. So coming back to the specification, as I said before,

04:26.720 --> 04:31.600
we have different implementation with different technologies, but all have to conform to the

04:31.600 --> 04:38.720
BIHOS specification. Not conforming to the specification, can or cannot, or may or may not

04:38.720 --> 04:47.920
lead to a bug, right? So the idea of the presentation is to talk about two particular paths that

04:49.760 --> 04:54.240
the first one we found in the last year, when we implemented some BIHOS sound device,

04:55.200 --> 05:00.080
and the other one is a bug that was found from our colleague, but it's also interesting to

05:00.080 --> 05:06.960
talk, I think. So you have the discussions in the manual list there. I'm going to give you

05:06.960 --> 05:13.680
a brief description of what happened. The first one is regarding the live dial sound driver.

05:15.520 --> 05:19.360
I would try to explain the picture. We have three components here, we have the driver,

05:20.000 --> 05:26.160
and in the other side we have the device. For example, a Raspberry MM device,

05:27.280 --> 05:31.440
and what I did in the middle was the transmission queue. So for example, it's the driver

05:31.440 --> 05:38.000
want to play back some sound in the HOS. It's going to send data to the device and the device

05:38.000 --> 05:45.040
went to play that, right? So the way that is done for the built-in of sound device is we have

05:45.040 --> 05:53.760
a transmission queue, which the driver put data and then the device get it. So if we see how it is done,

05:54.640 --> 06:00.880
how was done at that moment before we fix the problem, is that we have this function, the built-in

06:00.880 --> 06:07.040
or send PCN message send, that is going to put buffers in the other arena of the transmission queue.

06:08.640 --> 06:12.960
Then the device at some point is going to consume that data, so it's going to play something

06:12.960 --> 06:18.080
in the HOS, like music or something like this, and then it's going to put the buffer in the

06:18.080 --> 06:28.560
user arena once has been consumed. What we are in this particular driver was that when the device

06:28.560 --> 06:34.560
has consumed the buffer, this will trigger an interruption, and then the driver is going to

06:34.560 --> 06:41.440
immediately move the buffer from the user arena to the other arena. This is done by the built-in

06:41.440 --> 06:46.240
IO sound PCN message complete function. This is how the driver was working at that time.

06:49.040 --> 06:55.440
If we see in more detail how these buffers are made, well the driver creates a M vibration

06:55.440 --> 07:02.240
that it shares with the user. This is split in smaller buffers, and then it used these buffers

07:02.800 --> 07:11.200
to fill the script arena, and then it used the other arena to expose those buffers to the

07:11.200 --> 07:18.880
device that is going to be consumed. In this picture in particular, we can imagine that the device

07:18.880 --> 07:25.920
at this moment is consuming the buffer number zero, just to illustrate. And one thing that I

07:25.920 --> 07:32.160
forgot to mention was that when the use of application wants to know where it is, where it can

07:32.160 --> 07:38.320
bright some data is going to trigger the built-in IO sound PCN pointer, IO CTLs to get exactly

07:38.320 --> 07:47.440
where to bright the next chunk, it is going to be reproduced. But the problem that we found was that

07:48.000 --> 07:56.480
in this picture, now it is like the continuation of the previous picture, I mentioned that before,

07:56.480 --> 08:02.800
we have the device consumed the buffer number zero, and then in the next steps, it is consumed

08:02.800 --> 08:07.680
the buffer number one, we can think like this, but then at some point the user is going to

08:07.680 --> 08:13.520
query this position, and it is going to start to fill again the buffer number zero to put some

08:13.520 --> 08:19.280
data. But the problem was that at this point, the buffer was already in the other arena,

08:19.920 --> 08:28.960
and means that it was already exposed to the device. What we know from the specification is that

08:28.960 --> 08:36.080
in the moment that the buffer is exposed to the device, the device could take it at any moment,

08:36.080 --> 08:43.520
and this is what happened before. When we were building the device, we were reading the data immediately

08:43.520 --> 08:49.040
when the data was exposed, and what happened is that if you read in that moment, you're going to

08:49.040 --> 08:55.680
get all data because at that point the user didn't read them right, yet this value has been already

08:55.760 --> 09:01.600
exposed, so this is a violation of this, but we found that this is a violation of the specification,

09:01.600 --> 09:12.320
and we fixed it actually. So if we see the traces to try to identify what was going on,

09:12.320 --> 09:17.680
this is a made up trace, so you're not going to find these sort of traces, I did in myself.

09:17.680 --> 09:23.600
So what you see here is that we see the driver put in buffers in the other arena,

09:24.800 --> 09:30.560
and then at some point the user query some position, and it starts to fill the buffer number one,

09:30.560 --> 09:38.720
for example, and then at some point the device consumed. What I try to illustrate in this trace is that

09:38.720 --> 09:45.360
this behavior, this sequence is not valid. In the sense that once the buffer, like the number one,

09:45.440 --> 09:52.880
is exposed to the device, you cannot write it, so this is wrong. I mean, what made me think is that

09:52.880 --> 09:59.680
if I could identify the sequence that is delayed in the specification, I could early know

09:59.680 --> 10:09.840
that something wrong was going on. I don't have water on me. So the second bug I want to talk about

10:09.920 --> 10:15.440
is about the because user crate, in which there is also a relation of the specification.

10:18.640 --> 10:25.520
Well, in this case, what we find out was that when the built-in test driver was initiated,

10:25.520 --> 10:31.760
for example, after the driver okay, common, it's going to end you request to the to the built

10:31.760 --> 10:36.560
queue, right? It's just immediately after the driver okay, common. The driver okay,

10:36.560 --> 10:41.280
common is part of the initialization of the device. So it's happened at the end, and after that,

10:41.280 --> 10:46.480
the specification tells us that you're going to start filling requests, just at that moment.

10:48.480 --> 10:52.880
This is the driver okay, common is then translated to the because user said,

10:52.880 --> 10:56.640
veering enable, for example, and I will mean I will talk about that later.

10:58.560 --> 11:04.480
So this man says in particular, tell the device, well, we got the driver okay,

11:04.480 --> 11:10.320
enable the built queue. So from now, you can't set requests, it's saying something like this.

11:10.320 --> 11:17.280
But what happened was that this man such was asynchronous. So the guess was not getting any,

11:18.880 --> 11:27.280
I mean, the guess was not waiting to finish, sorry, the device was not waiting to respond this one,

11:27.280 --> 11:31.680
and the guess was just moving to the to start into an queue request.

11:32.640 --> 11:39.760
So the situation that we have was that, for example, since this man such was asynchronous,

11:40.560 --> 11:45.760
we got the notification that someday that was in the build queue before this one.

11:46.320 --> 11:52.960
And then the device say, well, we did an an aven, the veering, we cannot accept the, the,

11:56.480 --> 12:00.160
the request and then the driver hand waiting for that.

12:01.920 --> 12:08.720
And the fix was just to add that moment, the fix was just to fill out that the man such was asynchronous

12:08.720 --> 12:14.320
and just making synchronous. So we stop here, the driver to move. We got some,

12:15.760 --> 12:20.240
we know that the driver okay, common was successful and then we start to thank you.

12:20.880 --> 12:22.880
And that was the fix at the end. Thank you.

12:23.040 --> 12:34.080
So again, this is a violation of the specification.

12:36.480 --> 12:42.080
And if we see again the traces, we can, we can say that this is an invalid trace,

12:42.080 --> 12:47.520
because we have first the request to the, let's imagine that this is a request to the build queue,

12:47.600 --> 12:54.400
and this is the end, the driver, okay, common. So this is an invalid trace,

12:55.120 --> 13:01.200
because this is happened before all this. And this is a valid trace, because the request can arrive only

13:01.200 --> 13:06.960
if after those common has been successful. So this is the, I mean, it shows to fill out what

13:06.960 --> 13:17.200
was going on, could be handled to have the valid trace. So yeah, well, and since I have

13:18.160 --> 13:22.880
observating that there were some traces that there were, not conforming the spec,

13:22.880 --> 13:29.440
I started to play with two tools to try to figure out from this implementations.

13:30.160 --> 13:35.360
When we have implement, I mean, when we come on something in implementation and this

13:35.360 --> 13:41.040
building the spec. So I started to play with Kenny, and then another tool time, just tracing

13:41.120 --> 13:49.040
formal, which Kenny is a tool that he has been made by Amazon, and it is a commonly used

13:49.040 --> 13:54.160
in firecracker. So what I started to do in the last six months or less, is try to see what they

13:54.160 --> 13:59.600
have been doing. So they have an implementation of your tire. I'm moving that to the Rust BMM

13:59.600 --> 14:10.160
crate, and so I'm going to talk about this this tool now, and I have to know that I still

14:10.160 --> 14:16.560
not super familiar with the tools I'm still learning, and yeah, so I have also many questions.

14:16.560 --> 14:25.040
The second tool which is tracing formal is to instrument trace, instrument traces in the code,

14:25.040 --> 14:31.520
but not only instrument in the traces, but also tailing when this trace has to happen.

14:31.520 --> 14:36.480
So we don't now, we don't have only trace points. I'm going to show that later, but since I

14:36.560 --> 14:43.520
have to minute I'm talking more, but not only the trace points, but also how this trace points

14:43.520 --> 14:50.480
have to happen. So first I would like to talk about Kenny, and then I would talk about trace point,

14:50.480 --> 14:59.440
about tracing format. So Kenny, what we specify together with the implementation is a proof.

15:00.400 --> 15:06.960
In this particular, this is an example in which I try to bribe a proof of the

15:06.960 --> 15:13.040
notification suppression mechanism that is already in the specification. So I take the specification

15:13.040 --> 15:20.320
and try to bribe this proof, and then what I try to test is our implementation of beer cues.

15:21.520 --> 15:28.720
Something what I highlighted in the code, this is a draft anyway, but I highlighted in the code,

15:28.720 --> 15:36.320
that is different to a unit testing. What you do is you use the Kenny any type to say,

15:36.320 --> 15:44.480
I will feed my proof with any potential value of my beer cue. So the idea is that you check

15:44.480 --> 15:51.280
for all the possible instance of a beer cue, let's say. And then if they prove pass it,

15:51.280 --> 15:56.720
you are sure that no value exists, that will violate the specification.

15:57.680 --> 16:06.640
And if you want to see more about this, I create a PR to the BMB tile project in which I add in this

16:06.640 --> 16:13.280
proof to the code. So you want to see the discussion there, feel free. It's still a draft,

16:13.280 --> 16:21.840
because I miss in some stuff, but yeah, it's ongoing, ongoing work. Something interested

16:21.840 --> 16:27.920
I found was, well, maybe it is not interested, but it was for me, at least, if you see the

16:27.920 --> 16:34.320
implementation of the need notification method, we will find something like this in the code.

16:34.320 --> 16:40.880
And at some point, you may be tested, attempted to try to simplify the expression to something

16:40.880 --> 16:49.440
like this, for example, which is wrong. By using Kenny, you can quickly spot this problem,

16:49.440 --> 16:57.360
and then Kenny is going to tell you exactly what is the unit test that is going to fail for that case.

16:57.360 --> 17:01.920
So it's quite interesting. It takes some time sometimes, but it's interesting too.

17:02.400 --> 17:11.280
And the second tool that I want to mention, which I tried to use also for chat conformity,

17:11.280 --> 17:16.000
is the tracing format, which is completely different, and I something that I made myself rely on

17:16.000 --> 17:23.680
the tracing instrumentation that the drafts have. So for example, remember that I was talking about

17:23.680 --> 17:31.200
this setbearing enabled at hand and FM back in the Beehos crate. Well, if I want to spot when

17:31.200 --> 17:38.080
this relation happened, I can first, for example, set some event in the definition of the functions.

17:38.080 --> 17:45.280
So you use event here. And for example, I say, well, here is the setbearing enabled. I set up

17:45.360 --> 17:55.120
event type, and here I change the name for sure. I can also not only set up this sort of

17:55.120 --> 18:00.160
FM, but I can't do it in the, I didn't need that slide, but you can also use instruments here.

18:00.880 --> 18:05.440
And the event is going to be triggered, each time the function is called it. So you don't need to

18:05.440 --> 18:12.000
put this manually, you can do it systematically in each definition of the function, for example.

18:12.640 --> 18:22.080
And so once you set up this, then you have to tell the tracer, when, when, which sequence is

18:22.080 --> 18:29.760
going to trigger a warning, right? So what you do is, you do something like this. Well, here is maybe

18:29.760 --> 18:37.440
because I'd be tricky, let's say. You could use, for example, stay machine and say, well,

18:37.520 --> 18:43.760
the set, some sequence, some equivalency of these two events are going to trigger a warning,

18:43.760 --> 18:51.520
or something like this. I use another formalist for that. And what I, the thing I want to say here

18:51.520 --> 18:59.840
is that, together with the instantiation of the, of the tracer, you fill it with, you fill it with

18:59.840 --> 19:03.760
the relationship between the events. So here, for example, you have,

19:04.720 --> 19:11.520
with this, yes, the alternate vector, which is going here when you instantiate the tracer.

19:11.520 --> 19:18.880
So you, you have the tracer, and you have also together for a specification that tells you

19:18.880 --> 19:25.360
when this event can happen. This relationship between the two events is something that comes

19:25.360 --> 19:29.600
from the specification, the specification is telling us when that could happen.

19:29.760 --> 19:39.600
And then what happened is when this efficacy acute, if, for example, the handle event happened

19:39.600 --> 19:44.240
before they said the ringing able, this is going to trigger the warning in the equation.

19:44.240 --> 19:48.880
So maybe at that point we can think, maybe this is something wrong.

19:49.360 --> 20:00.160
I think that's all from my side. I don't know if you have any questions.

20:01.680 --> 20:07.360
This is a still an experimental word, and I'm working on the PR too. So yeah,

20:09.040 --> 20:14.080
it's not really clear for me yet, it's the tracer informal, it's going to be useful.

20:14.320 --> 20:20.080
I mean, I find interesting to play with, but then if it is useful, the other thing, but yeah,

20:20.080 --> 20:21.760
I don't know if you have any questions.

20:30.960 --> 20:37.200
Yes, I mean, the question was if I have another example in which I use the tracing format,

20:37.200 --> 20:39.520
which seems useful, no, I don't.

20:44.560 --> 20:48.960
I mean, I'm having playing with something similar, but the tracing the kernel, I don't know if you

20:48.960 --> 20:54.960
heard about that, it is called the room time observator or room time verification something

20:54.960 --> 20:59.200
that they already are in activity in the kernel, and people that are doing well at the same

20:59.200 --> 21:05.120
with the same machine, but in the kernel. Well, maybe you know Daniel have been working on that,

21:05.120 --> 21:14.000
but yeah, you can check that also, I think they are examples there.

21:25.840 --> 21:32.480
The question was if I already found any bug in the specification itself, that was a question, right?

21:32.560 --> 21:37.680
Well, I am reviewing the spec, for example, I have a re-invirtiocan device.

21:39.280 --> 21:45.920
I don't know if it was a bug, but what you find is that what the outer want to say is not what

21:45.920 --> 21:52.960
the people implement, for example. So, I mean, the fact that it is in English, and it's not always

21:52.960 --> 21:59.760
a precise language made that there is no bug, but it's not even a precise tool. So, you know,

22:00.560 --> 22:06.480
but no, I can, I can not remember now a precise bug now, but this is something related that will

22:06.480 --> 22:14.960
sell. Yeah, thank you.

