WEBVTT

00:00.000 --> 00:10.760
So, yeah, I want to give a short talk about the Linux kernel, because actually in this

00:10.760 --> 00:17.600
Devroom, it's quite rare that we talk about the Linux kernel at all.

00:17.600 --> 00:21.960
So first of all, I'm not a kernel developer, I'm not even a software developer in C, I'm

00:21.960 --> 00:26.240
just a sysadmin of an HPC cluster here at the university, so that's the perspective I'm going

00:26.240 --> 00:27.240
to give you.

00:27.720 --> 00:34.680
And as I said, in the HPC community we put a lot of care into user space: we compile

00:34.680 --> 00:40.520
the software of our researchers with a lot of care so that it uses all the features of the hardware

00:40.520 --> 00:48.360
underneath, the best instruction set of the CPU; we also do these things with modern

00:48.360 --> 00:53.800
compilers, modern libraries, so that we can squeeze all the performance out of the software of the

00:53.800 --> 00:58.360
researcher. But on the other side we have the kernel, and we don't even look at it; we just

00:58.360 --> 01:04.440
usually install the kernel that comes from your Linux distribution of choice, and there's

01:04.440 --> 01:09.880
a reason for that: you want stability and compatibility in kernel space.

01:09.880 --> 01:15.640
The Linux kernel that is shipped by the Linux distributions has a lot of features that we like,

01:15.640 --> 01:20.600
and the people at the Linux distributions put a lot of effort

01:20.600 --> 01:28.680
into making sure that the kernel works, is stable, is secure; they fix vulnerabilities, you will get

01:28.680 --> 01:36.200
updates, and you have a kind of standardized environment where, for instance, vendors can provide you

01:36.200 --> 01:43.880
with kernel modules that will work, and there's a good understanding of what you can expect from your

01:43.960 --> 01:52.280
system. But we would also like to see whether you can improve the performance of your

01:52.280 --> 02:00.360
cluster by optimizing the kernel. Because the kernel is shipped in a very stable fashion,

02:01.160 --> 02:05.720
it's very portable as well, but that means that the instruction set that it uses is not optimized

02:05.720 --> 02:11.640
for the hardware of your system; it uses a generic instruction set for x86 CPUs so that it runs

02:11.720 --> 02:18.360
on all x86 CPUs. But if your CPU has other features that are specific to that model,

02:18.360 --> 02:25.160
they will not be used by the kernel. And we want to do this without losing all the stability

02:25.160 --> 02:30.360
and compatibility I talked about: we don't want to break anything that is already running

02:30.360 --> 02:34.920
on the system, we don't want to change any of the features of the kernel itself, and we want

02:34.920 --> 02:40.760
all the drivers that we have from vendors to still work. So basically what we want is a drop-in replacement

02:40.760 --> 02:47.400
of the kernel that magically makes the system go faster: no patching of the source code

02:48.040 --> 02:53.960
and no changing of the kernel config at all, and also using the same package manager as your

02:53.960 --> 02:58.680
Linux distribution, in our case that's RPM, because we use a Red Hat-based Linux distribution,

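As a hedged sketch of the kind of rebuild the speaker is describing — fetching a distribution's own kernel source RPM and rebuilding it unchanged with rpmbuild (package names and spec options assume a Rocky/RHEL-style distribution and may differ on yours):

```shell
# Fetch and rebuild the distribution's own kernel, unchanged.
# (Rocky/RHEL-style package names; adjust for your distribution.)
dnf download --source kernel                 # fetch kernel-*.src.rpm
rpm -i kernel-*.src.rpm                      # unpack into ~/rpmbuild
dnf builddep ~/rpmbuild/SPECS/kernel.spec    # install build dependencies
rpmbuild -bb --without debug --without debuginfo \
    ~/rpmbuild/SPECS/kernel.spec             # produce installable kernel RPMs
```

The point of going through rpmbuild rather than a bare `make` is that the result is a normal package the distribution's package manager can install, upgrade, and roll back.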
02:59.560 --> 03:05.800
so we'll use rpmbuild to actually play with the kernel. What we did is we devised some

03:05.800 --> 03:14.920
benchmarks, typical HPC benchmarks, and we used EasyBuild and ReFrame; those are the

03:14.920 --> 03:21.160
benchmarks that we systematically run on our cluster, for which we have historical data

03:21.160 --> 03:27.640
and which we know very well. I also picked some benchmarks from the Phoronix Test Suite, which is very

03:27.640 --> 03:34.920
commonly used by the press, and it has a suite of tests targeting HPC tools as well.

03:35.560 --> 03:41.880
So what I did is very simple: run the benchmarks with the

03:41.880 --> 03:50.360
stock distribution Linux kernel, then recompile the kernel from the distribution sources myself without making

03:50.360 --> 03:55.800
any changes, and then start playing with configurations and optimizations of the kernel

03:55.800 --> 04:02.120
and see what happens with all these benchmarks. The system is a production system in our

04:02.120 --> 04:08.840
cluster, a Skylake node, and it has AVX-512, so that's something that we want to actually

04:08.840 --> 04:17.560
test whether it's useful or not, and we are currently using Rocky Linux 8.10,

04:17.560 --> 04:22.680
which is not the newest, I know, but that's what we have in production, and my goal is to not create

04:23.960 --> 04:29.400
an exotic system to play with; at the end of the day what I want is to actually apply

04:29.880 --> 04:35.080
any improvements that I might get to the cluster that we have today, so I'm going to try to

04:35.080 --> 04:41.800
test as close as possible to the system that we have in production. So what kind of

04:41.800 --> 04:46.600
optimizations are we talking about, then? Basically, what we can do is recompile the kernel with the

04:46.600 --> 04:51.720
same source code as the Linux distribution, with the same configuration, and then play with

04:51.720 --> 04:56.920
optimizations at the compiler level. The compiler, as I said, can generate assembly with an

04:56.920 --> 05:00.920
instruction set that is more or less adapted to the features of your CPU, like, for instance,

05:00.920 --> 05:08.760
using SSE instructions or AVX for vectorization, and we can also enable optimizations of the

05:08.760 --> 05:12.520
assembly code, because the compiler can actually do a lot of magic with your code. For instance,

05:12.520 --> 05:18.680
it can inline functions: you might have multiple functions, but if the compiler

05:19.640 --> 05:26.680
determines that they run more efficiently embedded one into the other, what it will do

05:26.680 --> 05:33.240
is just copy the code of the function into the caller, so that there's no call

05:33.240 --> 05:41.480
between functions that might lie very far apart. And this is what we tested first, so here you

05:42.120 --> 05:51.800
have the results for the different CPU instruction sets: in blue is AVX2, in red is AVX-512,

05:51.800 --> 05:58.760
and what you see here is a percentage, the performance gain. For those tests where the performance

05:58.760 --> 06:06.040
gain is measured in time, where less is better, I inverted the sign, so positive is good,

06:06.120 --> 06:12.120
negative is bad, for all the tests. As you can see, it's messy: in some cases you gain,

06:12.120 --> 06:17.800
so for instance we have FFTW, the fast Fourier transform, with floats and SSE,

06:17.800 --> 06:21.880
which actually benefits from AVX-512, which makes sense, because it uses floats and SSE,

06:23.400 --> 06:28.360
but on the other hand the same test with stock settings doesn't gain that much, and other

06:28.360 --> 06:35.800
applications are affected negatively by this change. The overall result is

06:35.880 --> 06:41.560
negligible: we gain on average zero-point-something percent of improvement.

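The instruction-set experiment described here amounts to rebuilding the same kernel source with different compiler target flags. A minimal sketch, assuming the kernel's standard `KCFLAGS` hook for passing extra compiler flags (the plumbing for doing this through a distro spec file is omitted, and `-march=native` on a kernel build is exactly the kind of thing that needs testing, not a guaranteed win):

```shell
# Stock rebuild: the generic x86-64 baseline the distribution targets.
make olddefconfig
make -j"$(nproc)"

# Same source, same .config, but let GCC target this exact CPU:
# -march=native enables AVX2/AVX-512 etc. if the build host supports them.
make clean
make KCFLAGS="-march=native" -j"$(nproc)"
```

Note that the profiled kernel must then run on the same CPU model it was built on, which is usually fine for a homogeneous HPC partition.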
06:43.480 --> 06:48.600
What is also interesting is that of all the tests that you see here, the first ones are those

06:48.600 --> 06:55.000
that we run with EasyBuild and ReFrame, and those are software modules that are very well optimized

06:55.000 --> 07:00.840
and already very well adapted to the hardware of our cluster, and those see even less change

07:00.920 --> 07:08.600
than those from the Phoronix Test Suite. So it seems not really interesting.

07:08.600 --> 07:16.200
We also compared using GCC and LLVM to see if there's any change; it's even less,

07:16.200 --> 07:22.280
which is expected: there shouldn't be much change between GCC and LLVM with default settings,

07:22.520 --> 07:30.600
without making any other changes. And then we compared O2 versus O3:

07:31.560 --> 07:36.360
by default the kernel is compiled with O2, and then we tried O3, which enables

07:36.360 --> 07:41.080
that kind of magic I talked about, the compiler inlining functions and unrolling loops and all that

07:41.080 --> 07:49.960
kind of stuff. And it's the same story: it's messy, there's no general trend; you might

07:49.960 --> 07:54.280
improve some applications, but on others you might have a negative impact. What we also

07:54.280 --> 07:59.480
want is that we don't deploy something and then one of the users, one of the research groups,

08:00.280 --> 08:07.960
finds out the day after that their jobs are running at half the speed they did before. And on average

08:07.960 --> 08:14.840
the improvement is totally negligible, so there's not much to be said here. But then there's the other

08:14.840 --> 08:20.280
optimization level, so to speak, which is the profile-guided optimizations that you can do on the kernel,

08:21.000 --> 08:28.760
and those work in a different way. You can have your kernel compiled with some instrumentation,

08:30.040 --> 08:37.400
which is GCOV, a coverage tool that will analyze and generate a profile

08:38.040 --> 08:43.880
of which parts of the kernel are more used or less used while your software is running.

08:43.960 --> 08:48.600
So you compile your code with this instrumentation, you reboot your system into this special

08:48.600 --> 08:55.160
kernel with the instrumentation, you run whatever workload you want to run, it generates this data,

08:55.160 --> 08:59.960
and then you recompile your kernel a second time, using this data, this profile,

08:59.960 --> 09:03.560
and it will annotate the source code with the probabilities of the different branches,

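The two-stage workflow just described can be sketched roughly as follows. This is a hedged sketch: `GCOV_KERNEL` and `GCOV_PROFILE_ALL` are the kernel's GCOV config options, but the exact steps for feeding the profile back into the second build vary by kernel version and toolchain:

```shell
# Stage 1: build an instrumented kernel, boot it, run the workload.
./scripts/config --enable GCOV_KERNEL --enable GCOV_PROFILE_ALL
make olddefconfig
make -j"$(nproc)"
# ... install this kernel, reboot into it, run the HPC workload ...

# Stage 2: the instrumented kernel exports coverage data under debugfs;
# collect it, then use it as the profile for a second, optimized build.
cp -a /sys/kernel/debug/gcov/ /tmp/kernel-profile
# (feed the collected .gcda files back into the profile-guided rebuild)
```

The cost is operational rather than technical: every change to the target workload means another instrumented boot and rebuild cycle.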
09:03.960 --> 09:08.920
and it can also estimate the values of expressions and then compile the code

09:08.920 --> 09:14.840
to assembly based on that information, so it generates a binary that is more optimized for

09:14.840 --> 09:23.320
that specific workload. So we tested that with NumPy, and in red you have the kernel that

09:23.320 --> 09:30.040
is optimized for the NumPy test, and you can see that NumPy here performs 3% better than

09:30.040 --> 09:34.520
the rest, and this is the best value that we ever got in all these tests that I've run,

09:34.520 --> 09:41.320
with O3, O2, and all that stuff. What we also did is run the same with profiling

09:41.320 --> 09:46.360
for all the tests: we took all five tests that are here and profiled them, and you can

09:46.360 --> 09:51.560
see that NumPy then performs worse. But what is also nice is that in both cases you get

09:51.560 --> 09:57.800
on average more or less the same improvement, and this is just 1%, or 3% in the case of NumPy, which is

09:57.880 --> 10:03.560
very small. But the profile-guided optimization that I did is the basic one,

10:03.560 --> 10:08.120
just branch probabilities; there are newer ones coming in kernel 6.13,

10:08.120 --> 10:17.160
which was released just in January, and there's also BOLT, developed by Meta, which will be

10:17.160 --> 10:26.520
available soon, but these are very, very new, so they need more work. So, for the conclusions:

10:27.240 --> 10:33.480
there is a little bit of improvement, but it might be very complex to deploy, because as I said

10:33.480 --> 10:41.000
you need to target some specific workload, and in rich HPC environments that may not even be

10:41.000 --> 10:45.000
applicable, because if you have users doing very different things, it's going to be complicated to

10:45.080 --> 10:51.960
actually have this sort of optimizations deployed on your cluster. So that's it.

10:57.960 --> 11:01.160
All right, time for one question. Before we start moving, let's do the question first.

11:01.160 --> 11:13.400
Can you go back to slide 9? There you have FFTW float with SSE — what do you mean by SSE?

11:13.400 --> 11:19.720
Is that SSE, and then you run it with AVX-512 — what does that mean? Well, SSE is because

11:19.720 --> 11:25.880
this code will use SSE instructions, so it will use SIMD, the single-instruction,

11:25.880 --> 11:33.640
multiple-data instructions; those are not enabled by default in the distribution kernel,

11:33.640 --> 11:38.840
because that's generic, compiled with the generic flag, so there's no SSE in there. With AVX-512,

11:38.840 --> 11:45.240
well, I actually used native; native is AVX-512 plus SSE4 plus some other stuff, because this is a

11:45.240 --> 11:51.720
Skylake CPU, so all these instructions are supported by the hardware of the CPU,

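As an aside to this answer: you can ask the compiler what `-march=native` actually resolves to on a given machine. A small sketch using standard GCC options (the exact flag list printed depends on the host CPU):

```shell
# Show the target options -march=native implies on this host;
# on a Skylake-SP machine this includes AVX-512 and SSE4 variants.
gcc -march=native -Q --help=target | grep -E 'march=|mavx512|msse4'

# The CPU's own view of its feature flags, for comparison:
grep -m1 flags /proc/cpuinfo | tr ' ' '\n' | grep -E 'avx512|sse4' | sort -u
```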
11:51.800 --> 11:58.040
so now this code can actually use hardware paths, transistors in the CPU, that were not used

11:58.040 --> 12:02.920
by the generic Linux kernel version, which doesn't use SSE by default. And then when you use

12:02.920 --> 12:08.680
AVX-512 and the others, you don't use SSE at all — the stock code doesn't

12:08.680 --> 12:13.800
use SSE at all, even if you enable it, because the source code of this test does not have those

12:13.800 --> 12:19.960
instructions in it. All right, thank you, Alex. My pleasure.

