WEBVTT

00:00.000 --> 00:27.640
We are going to go to a couple of bugs we found while working on this

00:27.640 --> 00:38.120
5, how we went through looking into the bugs, how we fix them.

00:38.120 --> 00:43.720
Some of the techniques we use aren't specifically risked 5, but we thought that would be relevant

00:43.720 --> 00:49.080
to the topic, so this is a bit about me, so I'm a senior engineer and further work code

00:49.080 --> 00:59.680
to think since 2012, I work on mostly Linux, QMU type things, so nice of a context,

00:59.680 --> 01:06.440
so there's an introduction, this is going to be a tale of two bugs and I sort of threw

01:06.440 --> 01:13.240
a bonus one in, which is similar to one of the other ones, while I said in the description

01:13.320 --> 01:18.920
we were going to do is out the others, we are going to use the bugger for one and I promise

01:18.920 --> 01:27.400
no life denies because we know how all those go.

01:27.400 --> 01:34.200
We also learn, hopefully, a bit about the risk of our architecture, so the first one, I think

01:34.200 --> 01:42.200
this is chronologically the first one we found, is a kernel user access bug, so we'll start with

01:42.200 --> 01:50.120
the bug report, so if you know that there is a testing system called XYZ box that goes through

01:50.120 --> 01:56.280
a number of different architectures, it's an automated system, which looks for bugs, so we've

01:56.280 --> 02:05.480
like things like stress testing, and we got this bug report on the mailing list about a user space

02:05.480 --> 02:12.520
process causing a kernel option, obviously user space process shouldn't be causing kernel

02:12.520 --> 02:19.640
option, so this is a kernel stress test, I have a little bit of the first part of the bug

02:19.640 --> 02:27.720
report of the ups there, and I'll show some more about this, so the kernel actually provides

02:27.720 --> 02:32.680
quite a lot of information about what happens when things go wrong, this is not the old days

02:32.680 --> 02:40.200
where you'd be lucky to get the registers and then possibly some weird guru meditation fault

02:40.200 --> 02:45.720
and you really didn't know what was going on, at least have now some good inbuilt the next

02:48.120 --> 02:55.160
debugging, so this is output, so it starts with at the top, well it's telling you what happened,

02:55.160 --> 03:00.600
so, and this is a very specific error report, so it's not like it tried to access the memory

03:00.600 --> 03:06.120
that didn't exist, this is the kernel trying to get access to user memory in properly,

03:07.240 --> 03:13.560
we get the information about what CPU it was on, which process was running,

03:14.760 --> 03:20.120
we get some information about what hardware, so this information is better easily passed around to

03:20.120 --> 03:27.560
people to say, if you're looking into what was going on here, we get an information about what the build

03:27.560 --> 03:34.280
was, and then after that we can see information about where in the kernel the code was running,

03:34.280 --> 03:42.040
so the EPC is basically where the process of the process of core was executing, so we know it's

03:42.040 --> 03:47.000
in schedule tail, there's a return address that tells you where it came from, so you can look at

03:47.000 --> 03:53.000
how it was called, and then we get registered dump which I've sort of curtailed for space reasons,

03:53.000 --> 03:57.720
and we've got some very nice projectors here, but even that I think people will be strained

03:57.720 --> 04:05.320
to feed anything lower, and we have information about the process of core status, so that tells

04:05.320 --> 04:14.040
you a bit about what the hardware was reporting, such as status bits, the actual bad address,

04:15.160 --> 04:20.840
and a cause info, so you can go and buy it into the hardware, I say manual and go,

04:20.840 --> 04:27.720
this is actually what it meant, if you would like to, you get a call trace, assuming that information

04:27.720 --> 04:35.160
wasn't manual, because occasionally bugs, if you call into bad code, not always, you would

04:35.160 --> 04:43.960
get that info, so you get quite a lot of information, so actually can we go from this information

04:44.040 --> 04:52.280
to actually go through and work out what happened, so the basics of this, well this is a very

04:52.280 --> 04:59.160
specific type of bug, so if people know written device drivers, the curve is not meant to go

04:59.160 --> 05:05.400
randomly wondering through the user memory, this is obviously usually a sign of some problem happening,

05:05.400 --> 05:12.760
so viscify like some other architectures has protection for that, so the kernel has specific

05:13.480 --> 05:20.280
copying functions to access user memory, there's checks in to check that that's properly used,

05:21.880 --> 05:27.480
so for example the kernel has a copy to use a user get user, those are where the kernel should be

05:27.480 --> 05:33.480
getting its memory, so one thing we can think about for this bug is that somebody used a bad

05:33.480 --> 05:41.320
pointer somewhere, so for example just to go into this feature a bit more each CPU call has a

05:42.040 --> 05:49.800
status control status register with a flag called SUM in it, SUM stands for permission to

05:49.800 --> 05:56.760
provide a user memory, when this flag is set, obviously it allows it, if it isn't set then you get

05:56.760 --> 06:03.560
a specific fault, so we can also say that this fault isn't looking at a bad memory page,

06:03.560 --> 06:08.520
because that memory page must exist to be able to notice a user page, so it's not a

06:08.520 --> 06:17.560
suddenly passed a pad pointer in and the kernel can see it at all, so we have the basics, we know

06:17.560 --> 06:26.280
that how this bug is happening, the report also gives us the kernel getref and config used to build

06:27.240 --> 06:33.720
it, so we can go away, build the same binary, so we can either look into the binary code,

06:34.920 --> 06:41.320
make a test system around and rerun it, because one thing we haven't really known here is

06:41.320 --> 06:46.360
whether this is actually reproducible on real hardware, could it be a bug in QM, you it's self,

06:46.360 --> 06:54.360
I mean we haven't really said looked into whether that could be the case, however we were going

06:54.360 --> 07:03.000
to do a reproduction case on say I think it's going to be a Visc5, one of the sci-fi freedom

07:03.000 --> 07:08.760
machines, but I don't think we ever actually got around to it, because when we started looking into

07:08.760 --> 07:16.440
the code, we decided to look into the code, we can read the assembly output, it's not always

07:16.440 --> 07:23.400
possible to reproduce exactly the same binary, but we can fairly hopefully get something fairly close,

07:23.400 --> 07:35.240
so we being of old school type of debuggers, think that you know what, that's not even try and

07:35.240 --> 07:41.640
on this case the bug was so intimate, which is another thing that we can say is information into

07:41.640 --> 07:47.320
our bug hunting, is that we're not really going to get a lot out of attaching a debug at this,

07:47.320 --> 07:54.120
because it's happening every few hours, we're going to be looking for a lot of data, so we're

07:54.120 --> 08:03.320
going to go firstly, we can disassemble the code because we obviously love our Visc5 assembly stuff

08:04.040 --> 08:12.440
and we can go and look at the code and say actually what's happened here, so we can take the bug

08:12.440 --> 08:18.840
with port and go back to the code, so we're going to use the new option, we can dump the code out,

08:18.840 --> 08:26.840
look at roughly where it was, so we've used the output from the upst to go and look at the

08:26.840 --> 08:32.600
schedule tail function here, obviously there's a lot of information we move just to make this clear,

08:33.240 --> 08:41.800
so I've highlighted in the bold area the UDAC access code, just as no the K config allows you to add

08:41.800 --> 08:48.040
a lot more debug information into the objects, so if I default I don't think it turns that on,

08:48.040 --> 08:54.360
but it allows you to do things like cross reference, these instructions to roughly where they were

08:54.360 --> 09:03.720
produced in the source file, so in here we have a schedule tail is very simple function and the only

09:03.720 --> 09:10.600
place it actually looks to go into the user is if this current pointer is set, so at this point of

09:10.600 --> 09:18.440
set it's doing something when it schedules and on the code here we can show that that's the area

09:18.440 --> 09:28.600
of code roughly relating to the put user, so we've read through that and go that's fairly simple,

09:31.640 --> 09:37.560
so actually what is this code doing, so I'll go through a little bit of actually what the code is doing,

09:37.560 --> 09:42.280
so we can actually work out whether the code is right or not and actually then

09:42.840 --> 09:51.640
okay is where we crashed, where we think it is, so first thing we load the s2 register with

09:51.640 --> 09:57.480
the sum bit which we've got very, very, very far from the iSeries correct, I mean if this wasn't correct

09:58.680 --> 10:03.880
really we wouldn't have a working system at all, well so going back if the putman get user code

10:03.880 --> 10:08.520
was not correct in the first place it would be very unlikely we ever get a working system,

10:08.920 --> 10:16.600
so much of the kernel relies on this that you'd be lucky to even get to a in-it process,

10:16.600 --> 10:25.000
so we'll go through a little bit about this segment, so as to the sum flag the CSRS instruction

10:25.000 --> 10:31.400
we're just going to set into the status register to ensure this bit is set, so we know fairly

10:31.400 --> 10:40.040
much this bit should have been set when the code happens and this at store word is where it crashed

10:40.040 --> 10:46.440
now, obviously that correlates with it's storing something, we know that it was storing a value,

10:46.440 --> 10:53.320
doesn't really matter what the value was because s1 is the point to where it's going to store and we

10:53.320 --> 10:59.560
can always we can verify that as well with the crash saying that's what we look like and then at the end

10:59.640 --> 11:05.240
it's clearing it's clearing that flag again, wax fairly simple it should have worked,

11:06.280 --> 11:10.920
if s1 was wrong we should have did it got on a different page fault, the page fault we're getting

11:10.920 --> 11:17.160
is an access problem, so but that shouldn't have happened because that bit should have been set,

11:18.120 --> 11:29.000
but what's the bit in the middle? Now I didn't put the actual code mapping in here but there is

11:29.000 --> 11:36.040
definitely something else going on here, this is not a simple set a bit, store a value, clear a bit

11:36.040 --> 11:43.000
that really should have happened for the put user, so what we got here, well actually what seems

11:43.000 --> 11:52.120
to be happening is that the compiler has showed our input, our put user code, the call to it actually

11:52.120 --> 12:01.000
get the task process ID, so it's the call to task pit number and s, which you think be fairly

12:01.000 --> 12:07.160
simple, right? So that shouldn't mean a problem, but there's clearly this is the only opportunity

12:07.160 --> 12:15.160
we have unless our call has magically gone wrong that we could add a bug, so we can read through that code

12:15.160 --> 12:20.680
and I'm not going to go through that code, but it turns out it actually does quite a bit and we already

12:20.680 --> 12:27.400
discounted the bug for being the set childhood, and we have only the task pit of v in our

12:27.400 --> 12:33.720
comment call, which calls into the kernel's pit code to go what is the process ID of this

12:33.720 --> 12:41.000
task pointer because the kernel has these task pointers that say this is each representation of a thread,

12:42.200 --> 12:48.760
so it needs the actual number to give to user space, but there's no sign in there that there should

12:48.760 --> 12:57.560
have been a change to this flag, in fact there's no, no call to that, so this way it reflects

12:57.560 --> 13:06.360
so what we could do next, well the fairly much easiest thing, we'll be just just wrap this call

13:06.360 --> 13:16.600
in something that says on exit to this call is the SUM flag still set and you know what occasionally

13:16.680 --> 13:27.560
it's not, so what's going on here, well we did two things firstly, really the only thing that

13:27.560 --> 13:33.000
we can probably assume it's happening here is that it's a task it might get scheduled out,

13:33.720 --> 13:41.320
scheduling, it doesn't happen on, so with the preemptible kernel and various things, scheduling

13:42.280 --> 13:50.760
places, so we can make a very good bet that it's task which, but task which you shouldn't

13:50.760 --> 13:58.520
un-set that sort of stuff really, so actually the first thing which was really easy to do,

13:58.520 --> 14:05.640
this was about 40 lines of code, was we started another kernel thread which went, I'm going to just

14:05.640 --> 14:13.960
sit here, sleep a little bit, check for some flag being set because it shouldn't have been and

14:13.960 --> 14:21.800
every so often you get this little thread would inherit some flag from some way, so we've gone through

14:21.800 --> 14:31.560
we've written a test case or two, we've worked out this, so we're fairly sure now that the problem

14:32.280 --> 14:41.080
is with task switch, so the risk 5 switching from one thread to another somehow has completely failed

14:41.080 --> 14:49.000
to consider that this flag ever needed to be saved, now we're fairly small kernels, you don't have

14:49.000 --> 14:58.680
a lot of code for these segments of say doing a process look up, but things like the S-wise

14:58.680 --> 15:07.000
edbot use kernels that run with a lot of checking involved, it turns out the more things you add

15:07.000 --> 15:14.440
into the say the RCU etc, parts of the kernel, the bigger the code becomes and therefore

15:15.800 --> 15:21.240
the highly likely that it will get scheduled, so this was something that hadn't been seen

15:21.240 --> 15:30.760
hadn't been looked at, so how actually would we go about to fix this because really it's not

15:30.760 --> 15:35.800
going to be very interesting if we just leave this bug around, firstly this is stopping tests running

15:35.800 --> 15:42.440
because obviously if your test is getting with an ups that's not very good test and it then fails,

15:43.400 --> 15:51.320
so first idea, well thus switch code is available, just save the sum flag into the task info,

15:52.600 --> 15:59.480
that was actually I think the best way of sorting this out, turns out everything that's arguing

15:59.480 --> 16:05.400
about this sort of code and it got sort of sideline and I think it still hasn't been fixed,

16:06.040 --> 16:10.360
this could actually be a problem down the line but no be has run into it yet,

16:10.920 --> 16:19.960
there are places in the kernel which use quite wide access to users, so they say explicitly

16:19.960 --> 16:25.800
I'm going to do a lot of movement of memory to use a space or from enable it, do things

16:25.800 --> 16:33.880
disable it explicitly, whereas in these macros, put user is really just enable user access,

16:34.520 --> 16:40.920
put the thing which is probably going to be a word, byte or whatever, disable the access,

16:42.280 --> 16:51.320
so the quickest fix here is just to reorder the macro, so this is the patch that actually got

16:51.320 --> 16:59.960
accepted for this, it's fragment of it, so the first thing you do is move explicitly say a temporary

16:59.960 --> 17:10.280
code vowel, it could be value but it's just as good as the x being the argument, really people,

17:10.920 --> 17:17.800
anyway that's forced it to be evaluated before and then we use that value in the actual

17:18.600 --> 17:23.640
putting a user and we can verify that by going to look at the code again, I'm going to show you

17:23.640 --> 17:31.720
the code because it reordered it, it moved that outside the area of checking, this actually

17:31.720 --> 17:38.360
generally fix the problem pretty much because we now only have a very small window and occasionally

17:38.360 --> 17:46.680
these code get called with the preemption disabled so it can't switch anyway, it's also pretty good

17:46.680 --> 17:52.280
for everywhere else because we could have also fixed a number of other issues where code code

17:52.360 --> 17:57.320
could have been called without the user space access being called because who knows where else this

17:57.320 --> 18:04.680
was, I was too lazy to go and find out any of the places but obviously if we find one there may

18:04.680 --> 18:11.560
or may not be others, this is sort of these are sort of bugs that happen because people don't

18:11.560 --> 18:17.720
think exactly how things get ordered and the compiler still may well be within its rights to be

18:17.880 --> 18:24.840
ordered this, I'm not a compiler expert and having had attempts or to looking into things like

18:24.840 --> 18:32.440
GCC that hurts my head and I would rather other people sort that out for me thank you very much

18:35.240 --> 18:42.200
so a quite little bonus side investigation I did and this is a project we're doing at the moment

18:43.080 --> 18:49.320
because we're extremely strange in weird people we've been trying running risk five big

18:49.320 --> 18:54.280
end-ion I submitted that thought but wasn't accepted so I'm gonna get it in here

18:56.280 --> 19:02.040
so we actually have a project on our website I think we're doing a blog post about it

19:02.680 --> 19:08.120
we have pretty much a working system running risk five big end-ion but as part of that we went

19:08.120 --> 19:13.960
for a few issues we weren't talking too much more about it but the in it code was failing for

19:13.960 --> 19:20.200
certain things were failing we've stopped frames going weird now we could have gone for this

19:20.200 --> 19:26.200
idea bugger again it wasn't too early in the system but we can again go through the code we know

19:26.200 --> 19:31.800
that the thing is failing it takes a bit more time on this one so I haven't included the code

19:31.960 --> 19:39.480
like two or three pages so what was happening is because risk five runs instructions always

19:39.480 --> 19:45.160
little end-ion because who cares what instruction end-ion is you have you're only really

19:45.160 --> 19:53.720
carrying out data it turns out people like hand-assembling instructions because who doesn't so apparently

19:53.720 --> 20:00.120
the pause instruction went by swapped by data it turns out to be a valid stack of

20:00.440 --> 20:08.920
movement wow that was not fun anyway that was a complete matter side quest on things

20:10.680 --> 20:16.440
as you can see the sometimes you say you can go do things with or without a debugger

20:17.400 --> 20:24.120
we will talk a little bit about tracing but again know that's sort of things that are really useful

20:24.120 --> 20:37.720
for long-term problems so our second one a user in legal instruction so I'm told FDSDA is actually

20:37.720 --> 20:45.160
meant to be called free desktop SDK we put in some effort on free desktop SDK help things like

20:45.160 --> 20:53.160
build we have some runners build fails with sigil now this is repeatable so

20:54.760 --> 21:00.440
this is a case where actually you know what let's go with debugger gdb is actually sort of something

21:00.440 --> 21:06.760
especially when you can get a favor of people problem so I've seen the log got the log

21:06.760 --> 21:14.120
last bits of log from the bug report you say bin bash well why bin bash is a dying but illegal

21:14.120 --> 21:22.600
instruction gets we run help build of stops repeatable on most architectures however some of

21:22.600 --> 21:29.960
the runners are not showing this problem but we can repeat it we know we can repeat it and we've

21:29.960 --> 21:40.040
repeated it on our own hardware so we run that job under gdb backtrace is at least useful in this

21:40.040 --> 21:48.280
case we actually get a backtrace we know what the problem is sigil okay so it's in something called

21:48.280 --> 21:58.360
ABSL which is live absile we can dump the really intensely unuseful function name thank you for

21:58.360 --> 22:07.080
C++ for mangling the hell out of that but it's actually a speed cycle that's an instruction

22:07.960 --> 22:15.080
that gets into the register A0 the number of cycles the core has run since I think it was reset

22:17.080 --> 22:24.440
obviously if it can disassemble the instruction and it's known to have been good

22:26.120 --> 22:33.320
why are we getting the illegal instruction from it well this is where my colleague would let me

22:33.320 --> 22:39.400
ping that for you this is why I don't try and talk to him very much but your favourite search engine is

22:39.400 --> 22:44.840
actually one of your friends here because actually this is an ownership searching for it

22:46.120 --> 22:54.280
actually finds a really good bug report as well so the problem here is that read cycle

22:54.840 --> 23:00.920
like on a few other architectures where information like this is a security problem well

23:01.560 --> 23:08.920
since it's 6.6 this has been disabled now security reasons I won't go into all the

23:08.920 --> 23:15.960
sort of security problem leaks that happened from this sort of thing but it's been disabled we can't

23:15.960 --> 23:27.320
use it now why live ABSL is using it to read the time is another matter the number of comments in

23:27.320 --> 23:34.840
the thread about the bug report may show a few slightly less favorable feelings about this

23:35.960 --> 23:42.040
but we have get time of day and we have plot get time they're fairly good these have been officially

23:42.040 --> 23:51.240
available since probably the time of Linux 4 via the kernels quick fiscal interface so

23:52.200 --> 23:58.280
why have the baby settled does this is another question anyway new in the baby settled stop this

23:58.280 --> 24:05.320
because obviously they've realised this was fairly silly thing to do I mean I'm all for efficient coding

24:05.320 --> 24:17.320
but that isn't in great example of it so this is a really easy thing to fix there's two things

24:17.320 --> 24:24.600
we can do here and since F3 desktop SDK we can control the build because it all declaratively built

24:25.160 --> 24:31.240
just replace the tank thing use the new version if we couldn't have done this we can use the

24:31.240 --> 24:38.840
SPI PMU driver that has assist us control which was not intuitive so I didn't actually have to work

24:38.840 --> 24:44.520
to be unable to use a space because if you don't like security and you can control your machines who cares

24:45.320 --> 24:50.840
right so we fixed that good everything runs again until it crashes

24:54.760 --> 25:00.520
right so further discussions on this problem go what we must have another bug

25:02.360 --> 25:06.040
suddenly posted this is what my runner showed in the demessage

25:06.920 --> 25:14.840
okay so that's Warhex so as humans I don't like Warhex very much

25:14.840 --> 25:22.280
and this it basically is the program data so the program instruction data showing hex the current

25:22.280 --> 25:29.560
program counter it was there I can't be asked to run the two-hour job to find out what the hell is

25:29.560 --> 25:37.320
going on you can turn this into some nice little text and because I don't like decoding code

25:37.320 --> 25:44.120
from hex and we think we have better things to do just feed that into canoes so we can use

25:44.120 --> 25:56.440
use the same upstum before and it's there again so I want my back with have it it was the same bug

25:57.160 --> 26:04.520
I don't think I've got any pain for it but no it's the same bug in the same library

26:06.360 --> 26:13.320
because GRPC includes it's own copy of liberally yourself she's really guys

26:16.040 --> 26:21.640
anyway the fix there was to go and change the build again to tell GRPC to use the user

26:21.640 --> 26:30.040
the commonly local copy so after a day of shouting at people we actually got the builds going

26:30.040 --> 26:37.320
again which is great so we had some builds obviously the actual course of the issue was upgrading from

26:37.320 --> 26:43.480
the runners from 6.6 at some point some of the runners may not have ever been updated properly

26:43.480 --> 26:48.920
I failed to check what exactly all of them were maybe some of our architectures don't

26:49.000 --> 26:53.080
obviously one of the disabled mental read cycle not sure

26:54.600 --> 27:02.680
however because we are absolutely insane people I went you know what this should be too

27:02.680 --> 27:15.720
difficult to fix the hard way because we're in slightly weird we can actually add a check into the

27:15.800 --> 27:22.200
illegal instruction handler so the kernel the low level code when it gets the exception calls into

27:22.200 --> 27:29.880
some C code that obviously in visc5 kernel is called due trap in illegal it gets past some

27:29.880 --> 27:37.240
registers about the state of the code and obviously this is code either goes through and goes

27:37.960 --> 27:44.680
what you did was incredibly silly and prints your instruction dump and oops or

27:45.640 --> 27:51.160
delivers a signal or whatever to your process and but we know what we could actually check

27:51.160 --> 28:00.280
whether that bug was read time and emulate it only issue nothing natural is being done this before

28:00.680 --> 28:05.720
so we'll do a little bit of a quick run through what I would do it also seem to be a bit of a hack

28:05.800 --> 28:12.840
so I did actually put this on the Linux viscv kernel list and people were either shocked or

28:12.840 --> 28:21.960
went what an earth if you'd done that for that is probably a reasonable response so if you don't

28:21.960 --> 28:29.320
know how instructions are specified risk life like most of us has some sort of way of packing

28:29.960 --> 28:39.480
bits out say these are bits certain bits do this so each subset of bits say you have some sort of

28:39.480 --> 28:46.840
operation code and then what that operation is doing for a data obviously people generally don't

28:46.840 --> 28:54.840
do much for this it's usually compilers for you and so just as a done we we're looking at the

28:54.920 --> 29:02.520
system group which is the said ICR extension which gives you a whole pile of ways of getting

29:02.520 --> 29:09.000
up control the status registers which reads cycle is one of them as you can do this CSR

29:09.000 --> 29:19.320
read gets your register so if anybody's interested this is what I did roughly there's a bit

29:19.320 --> 29:24.360
more because there are things that we were missing but it goes well is this instruction that

29:24.360 --> 29:30.200
was fetched a system instruction if so we can then go and decode some more information from it

29:31.000 --> 29:36.200
so there's helpers in header files because obviously some people need to look at instructions

29:36.200 --> 29:47.400
for other things like if you're trapped so you can go through extract we can say is this our CSR

29:47.400 --> 29:55.640
read and set if so we can look for access to the cycle register and because the instruction

29:55.640 --> 30:03.560
the instruction traps a design to allow you to restart if it had to fix something up we can just

30:03.560 --> 30:12.760
basically say into the registers that we were passed from our task set Rd from the CSR time obviously

30:12.760 --> 30:18.360
we don't want to use we use CSR cycle because whilst we could get that from the kernel that

30:18.360 --> 30:22.520
would be just leaking the instruction like leaking information we didn't want to lead

30:23.080 --> 30:30.840
CSR time is obviously apparently a better way so we can just move the EPC register on to where

30:31.560 --> 30:40.360
it was return so just say this we ended up introducing this a bit of a support maybe someone

30:40.680 --> 30:46.600
will get merged there was no actual set register there was a get there was no a way of extracting

30:46.600 --> 30:53.160
the one of the subcoats and that and obviously we had to do it a bit of messing about with instruction

30:56.280 --> 31:01.880
so but it worked we did like some test code for this as well there was obviously you would like to test

31:02.200 --> 31:09.800
so we do a whole pile of test code the result was as I said we sent that to the list

31:10.600 --> 31:18.120
nobody liked it now that wasn't entirely unsurprising it was more of a proof that it could be done

31:19.320 --> 31:25.480
I was bored and it would be you maybe useful for somebody in the future who might actually need to

31:25.560 --> 31:31.560
do this I bet to fix up some legacy code maybe something in the future need sorting out

31:34.040 --> 31:39.720
anyway it's going to be left out there people could have looked at if they like it is entirely

31:40.440 --> 31:45.000
to the thing it was about I think with the test code there's a couple of days of work

31:45.880 --> 32:01.480
so the conclusions right so did say about that so in really in conclusion some of this

32:02.280 --> 32:08.680
great five minutes yeah padding this out of the extra bug we got that like put

32:09.560 --> 32:15.640
so conclusions this may not be as easy as just using a debugger but for cases where

32:15.640 --> 32:21.640
you have some really difficult bugs it can help as a different way it's very old school this is

32:21.640 --> 32:29.320
how I would do it back in the ages when say the old acorn stuff let me go

32:31.960 --> 32:36.680
but it's really useful to look at sometimes it's just useful to look at the code and I'm trying

32:36.680 --> 32:42.200
to understand what it's doing in this case in the case of the user access bug it was very clear

32:42.200 --> 32:50.040
that the code looked fine it was just that it was ordering wrong and the kernel also had a lot of

32:50.040 --> 32:58.520
useful logging systems nowadays for instance if your bug isn't causing a crash but is definitely

32:58.520 --> 33:04.280
repeatable there's the there's a lot of kernel logging you can use it to efficiently export data

33:04.360 --> 33:10.280
from the kernel and docket elsewhere for instance I had a problem with a take rate

33:10.280 --> 33:17.000
DMA controller now every few hours you'd get a blip in the audio no you could quite understand

33:17.000 --> 33:23.960
why so really just add a couple of trace points the DMA driver where it was looking at things like

33:23.960 --> 33:30.520
moving on to buffers reporting position I think things out in the DMA driver that were probably

33:30.520 --> 33:36.200
where it was able to be export that data via the trace and just export it off the system you

33:36.200 --> 33:41.400
could look at it later and analyze it so running it overnight just showed that there was a bug

33:41.400 --> 33:48.040
in the positional parameters when there was an occasional buffer swap and parameters were not

33:48.040 --> 33:54.120
often updated in the way we expected so there's basically just a check needed and you could then

33:54.120 --> 33:59.560
do that so let's say this is probably old school but hopefully was interesting to people

34:01.320 --> 34:07.880
I haven't got ran to releasing the slides I didn't get time to get these formatted out

34:09.000 --> 34:13.640
I've designed it decided to change a lot of our formatting or last minute so

34:14.680 --> 34:20.680
yeah hopefully we'll be putting them up I don't know if you know be somewhere we can put a url for that

34:24.440 --> 34:29.640
anyway hard question I'll make a blog post about this the same but we have a couple of slides

34:29.640 --> 34:39.000
of references these are like the issues some kernel code testing blog posts we had a number of

34:39.000 --> 34:48.920
amazing discussions it's always nice to have people weigh in on stuff and then go well you know

34:48.920 --> 34:58.840
what actually you might have did I start talking about this so thank you and hopefully

34:59.880 --> 35:03.160
it might have been at least something interesting and may have given you an idea about

35:04.120 --> 35:10.760
alternative ways obviously there are many ways to look at bugs and sometimes just talking to your

35:10.760 --> 35:17.640
rubber duck might help and if you could explain it to a rubber duck you probably got to the point

35:17.720 --> 35:25.080
where you actually understand it so there's a few minutes left don't know if anybody has questions

35:27.240 --> 35:34.920
idea I meant thank you really enjoy your presentation thank you so much can you describe

35:35.640 --> 35:42.760
how your kernel testing works so the question about how the kernel testing works so this is not

35:42.760 --> 35:49.800
our kernel testing this is the the s boys edbot project has I'm not sure if it's still running

35:49.800 --> 35:56.440
at the moment but that has information I think online somewhere google will probably give you an

35:56.440 --> 36:02.680
idea it's found a number of issues if you search through the kernel logs there's a number of things

36:02.680 --> 36:08.280
it's reported and people have gone it's testing around a wide range of kernels and architectures so

36:09.080 --> 36:14.680
hopefully that answers your bug your question that's not bug

36:16.840 --> 36:26.120
any further so great thank you for attending thank you for I should say thank you to co-think

36:26.200 --> 36:38.040
so oh there's one more quick one yeah what motivated you to do big mian risk five

36:38.040 --> 36:45.160
uh what motivated us to do risk and big nian risk five well I was just about to say that my company

36:45.800 --> 36:52.200
thank you for my co-think to sponsor me to be here so I haven't had to pay for my own transport

36:53.160 --> 36:58.440
what motivated us to risk so I did an amort of big nian about 10 years ago

36:59.160 --> 37:06.760
because obviously we're consulting group we do things for people suddenly when that will cost you

37:06.760 --> 37:12.680
five million dollars we'll have it done next year and I went um I'll do you for do you that for about

37:12.680 --> 37:19.640
half a million and we've already two months now and obviously they went we'll take you we'll do it

37:19.640 --> 37:25.880
great so we did that we got actually quite a bit done got the customer system going and then they

37:25.880 --> 37:34.520
went hmm now not good do that but next thing is that this five haven't announced that big nian

37:34.520 --> 37:41.880
data is that it's going to be available in new course um I needed something to do I need to

37:41.880 --> 37:48.440
talk about because things to talk about don't come along too often and more importantly

37:49.080 --> 37:54.920
no you'd done it before so we went and did it and went you know what this is going to be just a

37:54.920 --> 38:00.840
quick project we'll see if it can be done at the worst the company might get some dog posts and

38:00.840 --> 38:06.600
some talks out of it and at the best we can try and go to people like Ryzen go

38:09.640 --> 38:15.560
my wallet is looking for a bit to finish the moment would you like to buy some consultancy services

38:16.520 --> 38:24.760
but yeah yeah it was more of a let's try just try doing this it will be we'll spend about

38:24.760 --> 38:31.960
four or five weeks of engineering time on it and um it turns out it's actually quite doable

38:32.520 --> 38:38.920
however risk anybody actually takes on who knows but we'll we'll try and get some talks

38:39.880 --> 38:45.640
so that's done so the one if we did the big engine and I think time is about up

38:47.880 --> 38:49.880
great thank you

