WEBVTT

00:00.000 --> 00:20.000
Thank you very much for your patience. First of all, I've already asked who's heard of MSF,

00:20.000 --> 00:26.680
but for the people who haven't heard of MSF, this is us: an international, independent medical humanitarian

00:26.680 --> 00:34.200
organisation. The independence is really important; that is what enables us to speak out when

00:34.200 --> 00:39.360
we witness, for example, violations of humanitarian law, war crimes, and that kind of thing,

00:39.360 --> 00:44.600
and I'll talk a bit more about that in a minute. I'm going to talk very briefly about who we are,

00:44.600 --> 00:49.080
what we do, where we use NixOS, why and how, and then if I move really quickly there'll be

00:49.080 --> 00:58.600
time for questions. I'm Ian, by the way, this is Shaw, and Ramses is over here. So, we're quite

00:58.600 --> 01:05.320
proud of what we do, we work all over the world, and almost all of our funding comes from individual

01:05.320 --> 01:11.480
donors, and that's what allows us to be independent. So we don't have to worry about governments

01:11.480 --> 01:16.440
cutting off our funding because they don't like us denouncing their migration policy or enabling

01:16.440 --> 01:20.280
or committing war crimes, for example, which you might have read about in the news recently,

01:20.280 --> 01:27.080
and that gets more and more important. So, Shaw and myself, we work for MSF; our payroll is up there

01:28.200 --> 01:33.640
in the general administration section. So, yeah, we do a lot of stuff. You might be wondering,

01:33.640 --> 01:38.280
okay, fine, vaccinations, but what does NixOS have to do with that? The answer is that you'll find

01:38.280 --> 01:43.720
NixOS in our systems deployed in the cloud, and also deployed in the healthcare facilities

01:44.680 --> 01:51.240
all over the world, where we work. So, for example, in an ITFC, that's an intensive therapeutic

01:51.240 --> 01:56.840
feeding center, which is used for feeding people who are either very small children or who are

01:56.840 --> 02:03.000
too malnourished, and I'll show you some photos of them in a minute. This is a picture of two

02:03.000 --> 02:08.920
ambulances in Haiti. The reason I'm showing you these is because there's a story behind them. At Christmas

02:08.920 --> 02:15.960
2023, we were forced to stop activity in Haiti because our patients were being killed in transit

02:15.960 --> 02:21.880
between hospitals. They'd be dragged out of the ambulance and shot. So, while we couldn't guarantee

02:21.880 --> 02:27.320
the safety of our patients or our staff, we had to stop work. While we stopped work, and while

02:27.320 --> 02:33.560
our hospitals were empty, we went in there with two of our guys, so Carl and Daniel, our X-ray and

02:33.560 --> 02:38.200
network technicians. We upgraded the network, and then we installed a virtual machine running NixOS,

02:38.200 --> 02:44.520
and on top of that, an X-ray image archiving system. And when the hospital came back, it came back

02:44.520 --> 02:53.080
better. The surgeons were better able to look at the X-rays for these kinds of operations. MSF is a

02:53.080 --> 03:02.680
big organization. Think about Fortune 500 company size, but without the fortune. Shaw and I work

03:02.680 --> 03:09.400
for a part of MSF called the Operational Center Brussels, which in total is about 15% of the

03:09.400 --> 03:17.960
global staff. And IT inside the HQ is 505 people. Five people work on NixOS. So what I'm trying to

03:17.960 --> 03:26.040
convey here is that we do not have a choice. We have to work at scale. Right. Next slide. This is

03:26.040 --> 03:31.720
some of the cases where we work with centralized data systems. I'm going to explain to you one of

03:31.880 --> 03:37.560
the systems now. So this is the Maiduguri intensive therapeutic feeding center. You can see the solar panels

03:37.560 --> 03:43.480
on the roofs. Those are new. The deployment of solar panels across all our hospitals

03:43.480 --> 03:48.920
has been enabled by data-driven decisions, which are enabled by a platform called ThingsBoard,

03:48.920 --> 03:54.600
which is an Apache-licensed MQTT dashboard. That's deployed on top of NixOS, and we

03:54.600 --> 03:59.800
centralize power consumption information from all over the world on that platform. And for

03:59.800 --> 04:06.280
example, the solar panels here have cut generator diesel consumption by more than half. There's no grid.

04:06.280 --> 04:15.000
It's diesel, solar, or nothing. Next. So inside that hospital, you'll see kit like this.

04:16.680 --> 04:22.520
That's a ruggedized field kit. It's a kind of mini rack setup with a UPS and this kind of stuff.

04:23.080 --> 04:27.560
They're very, very rugged. I only know about one which broke, and the reason it broke is because

04:27.560 --> 04:34.280
a snake crawled inside the power supply. The snake died. The power supply died. Very bad.

04:35.160 --> 04:43.560
And then on top of that, we also deploy smaller pieces of kit. So this is a NUC. It's a small industrial

04:43.560 --> 04:50.360
kind of Intel PC. They're very powerful for their size, and they fit in a backpack. So you can ship

04:50.360 --> 04:55.800
them very easily to wherever you have to go. And we use them, for example, when you want to isolate

04:55.800 --> 04:59.480
patient data, and you can also do things like using them for active-passive replication.

05:02.120 --> 05:07.800
This is an example of this inside the hospital. Here, for example, you would see

05:10.040 --> 05:16.120
patients, nurses and health care workers walking around with tablets. Inside the therapeutic

05:16.120 --> 05:21.000
feeding centers, it's really important to keep track of who you've fed and what and how.

05:21.800 --> 05:27.160
The tablets themselves are linked to an application called DHIS2, which is running on a pair of NUCs.

05:29.000 --> 05:36.120
And that's inside that. These are two of our favorite pictures in MSF IT. On the left

05:36.120 --> 05:42.440
is Alex. He helped design the field network kit. On the right here is John. He's working on a vaccination

05:42.440 --> 05:48.600
campaign in RDC against measles, where a quarter of a million people every year are still infected

05:48.600 --> 05:55.640
and more than 5,000 people a year still die. This is the platform. So we have the platform and then

05:55.640 --> 06:00.520
we have the applications. You should recognize most of these logos, especially the first one.

06:01.640 --> 06:07.640
But you might be wondering what Ansible is doing here. And the answer is that we started using

06:07.640 --> 06:14.440
NixOS before sops-nix became available. So we use Ansible to keep our secrets encrypted.

06:15.400 --> 06:23.320
So those are the platform components. Now the applications that we deploy. So DHIS2 is a public health

06:23.320 --> 06:29.720
management information system. You can use it to track patient data. You can use it also to do things

06:29.720 --> 06:35.720
like record information about epidemics and this kind of thing. Orthanc is a piece of software

06:35.720 --> 06:39.720
for managing X-ray imagery and this kind of thing. It's also open source, and originally

06:39.720 --> 06:46.440
Belgian. And then Bahmni is an electronic medical record system. It's used to run hospitals. And

06:46.440 --> 06:52.120
with that I'm going to introduce Shaw, an expert in Bahmni. He's deployed medical record systems

06:52.120 --> 06:57.000
throughout Bangladesh before he worked for MSF. And then when he came to MSF, he started deploying

06:57.000 --> 07:03.480
DHIS2 all over the world without NixOS, which is what makes him ideal to explain just how much time

07:03.480 --> 07:13.480
it saves. Over to you. Any time you do the mic once, it's okay.

07:13.480 --> 07:43.400
Okay, thank you. Good morning. So, why do we use NixOS? So, MSF is a big organisation.

07:43.480 --> 07:52.920
So, we have to manage complex IT operations. And these complex IT operations need a system that

07:52.920 --> 08:01.160
is robust and resilient. And we wanted a system that can evolve quickly by itself with minimum

08:01.160 --> 08:10.520
maintenance. And we also wanted the same system to run both for our field

08:10.520 --> 08:17.320
operations management and headquarters operations management. So, we selected NixOS

08:18.360 --> 08:25.080
for its support for declarative configuration management, as you all know, and

08:25.080 --> 08:33.560
infrastructure as code. So, now I will show how NixOS helps us overcome our many

08:33.560 --> 08:39.400
operational challenges. So, when we started this setup, we initially

08:39.400 --> 08:46.680
defined what we wanted from our platform, as everyone does. So, we wanted our

08:46.680 --> 08:55.000
configuration to be centrally managed, secure, and declaratively managed.

08:56.120 --> 09:02.040
Then, we wanted changes to our configuration to be automatically tested

09:02.040 --> 09:11.480
and then deployed automatically. Then, regular security patch updates. Then,

09:12.840 --> 09:18.520
we wanted to manage all operations with very minimal effort, with very minimal maintenance

09:18.520 --> 09:26.520
effort. Even access control to the servers we wanted to manage automatically and declaratively.

09:27.080 --> 09:36.680
And as our servers are distributed across different places, we wanted easy access to our

09:36.680 --> 09:45.640
servers with minimal dependency on the network, and we wanted to deploy containerized applications

09:45.640 --> 09:52.520
to the NixOS servers. So, we will see how, using NixOS, we have achieved the majority of

09:52.520 --> 09:59.640
our design goals that we set initially. So, let us go through a brief history of our use of NixOS.

09:59.640 --> 10:08.200
So, in 2018, we first started to deploy our custom NixOS platform to manage a fleet of

10:08.200 --> 10:15.000
Linux servers, and since then we have written our machine definitions in Nix code,

10:15.880 --> 10:21.800
along with the application configuration and application secrets together, and saved it in a

10:22.680 --> 10:30.040
GitHub repository. So, all this is possible because NixOS supports declarative

10:30.040 --> 10:38.120
configuration management. And then the central repository we use is connected with all

10:38.120 --> 10:47.320
the servers we use in different areas. So, the servers run on a schedule and pull the

10:47.320 --> 10:56.680
Nix code, build it themselves, and get the updated configuration. So, here is a

10:56.680 --> 11:03.240
code snippet showing how we define our servers in Nix code. So, everything about the server,

11:03.240 --> 11:10.040
the server time zone, other settings, even the disk partitions, boot mode, and the services we

11:10.040 --> 11:21.800
want to run inside the server, is all defined declaratively. So, here is an example:

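NOTE
Editor's sketch, not the code shown on the slide: a minimal NixOS module illustrating the kind of
declarative host definition the speaker describes (time zone, boot mode, disk layout, services).
The host name, time zone, disk labels and state version are hypothetical placeholders.
```nix
# Hypothetical host definition; every value below is an illustrative assumption.
{ ... }:
{
  networking.hostName = "field-server-01";   # assumed host name
  time.timeZone = "Asia/Dhaka";               # assumed time zone
  # Boot mode: UEFI with systemd-boot.
  boot.loader.systemd-boot.enable = true;
  boot.loader.efi.canTouchEfiVariables = true;
  # Disk layout: an ext4 root and a vfat EFI system partition, found by label.
  fileSystems."/" = {
    device = "/dev/disk/by-label/nixos";
    fsType = "ext4";
  };
  fileSystems."/boot" = {
    device = "/dev/disk/by-label/ESP";
    fsType = "vfat";
  };
  # Services that should run on the machine.
  services.openssh.enable = true;
  system.stateVersion = "24.11";
}
```
Because the whole machine is one expression, a change to any of these values is an ordinary, reviewable commit.
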
11:23.160 --> 11:30.760
here is a configuration snippet where, for example, to all those servers we want to

11:30.760 --> 11:39.880
distribute this content as configuration, and this part actually defines that, under the name of

11:39.880 --> 11:45.960
this file, this content will be distributed across all those servers. So, this is how we define our

11:45.960 --> 11:54.680
configuration in YAML. Similarly, we define our secrets in a YAML file, but that is encrypted

11:54.760 --> 12:03.720
with Ansible Vault. So, before moving any further, I just wanted to show you this picture.

12:03.720 --> 12:11.320
This is where, basically, our servers, our field servers, run. So, it is a picture of the Jamtoli

12:11.320 --> 12:17.880
refugee camp in Cox's Bazar, Bangladesh. So, it is a very remote area where sometimes, to run a

12:17.960 --> 12:25.400
server, even the very essential needs, like power and internet supply, are very difficult. So, we need to

12:25.400 --> 12:34.040
run those servers under those extreme constraints, and we will see how, using NixOS, we overcome those

12:34.040 --> 12:42.680
challenges. So, as we have everything in code, this gives us significant advantages, right?

12:43.080 --> 12:57.480
So, we can take advantage of, for example, every change being tracked. Then, for

13:01.320 --> 13:08.200
the deployment, we can take advantage of Git automation for deployment and for testing,

13:09.000 --> 13:19.480
and we can run GitOps operations on top of it. So, now I am going to talk about how we regularly

13:20.360 --> 13:31.880
upgrade and get security patch updates for our servers. So, as our servers are connected to our central

13:31.960 --> 13:41.080
configuration repository, each one pulls the updates from that central repository, rebuilds itself, and

13:41.080 --> 13:50.200
upgrades itself. We also use, in our platform, Nix flakes, and every week

13:50.200 --> 13:59.160
we do the flake lock bump, which helps us pull in the security updates from upstream, and when

13:59.240 --> 14:05.960
the NixOS version is upgraded, twice a year, we also do the platform upgrade twice a year.

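NOTE
Editor's sketch: the talk does not show how the scheduled pull is implemented, so this uses the
stock NixOS system.autoUpgrade module as a stand-in for a similar pull-and-rebuild loop. The
repository URL and the schedule are hypothetical. The weekly lock bump described above corresponds
to running `nix flake update` in the configuration repository and committing the new flake.lock.
```nix
{ ... }:
{
  system.autoUpgrade = {
    enable = true;
    # Hypothetical central repository; nixos-rebuild picks the entry matching the host name.
    flake = "github:example-org/nixos-config";
    dates = "daily";               # systemd calendar expression for the schedule
    randomizedDelaySec = "45min";  # spread the load so the fleet does not pull at once
    allowReboot = false;           # assumption: field servers are never rebooted unattended
  };
}
```
With this shape, hosts only ever receive what has been committed to the central repository, so the lock bump is the single point where upstream changes enter the fleet.
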
14:06.600 --> 14:13.960
But we run our upgrade operation in three steps, or three waves: first wave,

14:13.960 --> 14:18.280
middle wave and final wave. So, in the first wave we actually run the

14:19.240 --> 14:26.280
operation on our relays and dev machines; the middle wave on UAT and test servers and low-SLA production

14:26.280 --> 14:33.640
servers; and the final wave we run on the servers hosting mission-

14:33.640 --> 14:42.600
critical applications. So, why does this process give us a lot of advantages? Because we minimally

14:42.680 --> 14:51.800
want to disrupt our operations in the field, and if an application inherits any faults, we want

14:51.800 --> 15:00.520
that to surface in the first wave and not impact the final-wave servers, the critical-application

15:00.520 --> 15:09.960
servers. So, this is where the waves start: our first-wave

15:10.920 --> 15:20.360
nodes are just next to my desk, and the final wave ends up executing on the field servers, where

15:20.360 --> 15:25.720
basically our mission-critical applications run. As you see, the doctor is making

15:25.720 --> 15:33.560
preparations for a surgery, and we want him to be minimally disturbed. So, that is our ultimate goal

15:33.560 --> 15:39.320
every time, whatever we do. We do not want to show off technical excellence; we want to

15:39.400 --> 15:50.120
very minimally disturb our field operations. So, how do we test our Nix code? Similarly,

15:50.120 --> 15:56.520
the count to two. So, as everything is in code, for every change we make in our

15:56.520 --> 16:03.480
NixOS configuration, we run build tests, validation tests, and integration tests.

16:03.560 --> 16:09.640
Recently we started to use VM-based tests, and we want to run our tests as close as possible to

16:09.640 --> 16:15.480
a production environment. The goal is ultimately the same: we want no

16:15.480 --> 16:23.160
surprises after deployment. We also manage our code using a staging and a main branch.

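NOTE
Editor's sketch of a VM-based test using the NixOS test framework from nixpkgs
(pkgs.testers.runNixOSTest). The test name and the service being checked are assumptions made for
illustration; the talk does not show MSF's actual tests.
```nix
{ pkgs ? import <nixpkgs> { } }:
pkgs.testers.runNixOSTest {
  name = "ssh-smoke-test";  # hypothetical test name
  # The VM under test; in practice this would import the same modules as a real host.
  nodes.machine = { ... }: {
    services.openssh.enable = true;
  };
  # Python script run by the NixOS test driver against the booted VM.
  testScript = ''
    machine.wait_for_unit("sshd.service")
    machine.wait_for_open_port(22)
  '';
}
```
Building this derivation boots the VM and fails the build if any step in the script fails, which is what lets a CI job gate a merge on it.
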
16:24.520 --> 16:31.080
In particular, the major critical changes in Nix code progress through the

16:31.080 --> 16:38.840
staging branch, and the regular operational changes, like changes in a configuration, which are less

16:38.840 --> 16:47.160
impactful, progress through the main branch. So, we follow this Git workflow to decouple

16:47.160 --> 16:56.600
development from operations. Because, for example, when I am making some critical changes

16:56.680 --> 17:02.360
while my colleague needs to deploy some small changes to the configuration,

17:02.360 --> 17:06.360
I do not want to block him, because my changes would be more

17:06.360 --> 17:11.080
impactful, which means they need to be tested, so they go through the staging

17:11.080 --> 17:20.760
branch, and the normal changes go through the main branch. So, how do we manage remote access

17:20.840 --> 17:26.840
to our servers? We manage remote access to our servers using relays. Because even this simple task

17:26.840 --> 17:32.040
is not simple for us, because the servers are distributed across many places, not in a single

17:32.040 --> 17:37.400
network but in different networks. Some are inside a field network, some are inside a

17:37.400 --> 17:45.320
scenario we use; some are, for example, in a hybrid cloud; some are in a cloud. So, with

17:45.400 --> 17:52.760
relays, we overcome these challenges, and by becoming less dependent on the network, we can ensure

17:52.760 --> 18:03.400
access to our servers. So, this is how we declaratively manage user access control.

18:05.160 --> 18:11.720
So, in a simple file we define the users, the user roles, and each role's access to the servers.

18:11.800 --> 18:17.240
So, this is transformed by Nix code. The Nix code transforms the

18:17.240 --> 18:27.080
JSON file into Nix options. Then, when it basically gets built, we get the users inside our

18:27.080 --> 18:33.640
servers. So, which means we can also declaratively manage our access control.

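NOTE
Editor's sketch of the JSON-to-options step described above. The file name access.json and its
shape are hypothetical, since the talk does not show the real schema. Assumed shape:
{ "roles": { "admin": { "servers": ["field-server-01"] } },
  "users": { "alice": { "role": "admin", "keys": ["ssh-ed25519 AAAA..."] } } }
```nix
{ config, lib, ... }:
let
  # Hypothetical access definition, read and parsed at evaluation time.
  acl = builtins.fromJSON (builtins.readFile ./access.json);
  host = config.networking.hostName;
  # Keep only the users whose role grants access to this host.
  allowed = lib.filterAttrs
    (_name: user: builtins.elem host acl.roles.${user.role}.servers)
    acl.users;
in
{
  # Turn each allowed JSON entry into a declarative NixOS user with SSH keys.
  users.users = lib.mapAttrs
    (_name: user: {
      isNormalUser = true;
      openssh.authorizedKeys.keys = user.keys;
    })
    allowed;
}
```
Removing a user from the JSON file removes the account on the next rebuild, which is how this style of definition guards against drift.
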
18:33.640 --> 18:43.640
This gives us a couple of advantages. First of all, we can prevent configuration

18:43.640 --> 18:53.240
drift, and then, for example, other applications, like IAM, can also use the same

18:53.240 --> 18:58.920
definition, which is defined declaratively. This is how we define our containerized applications

18:59.240 --> 19:06.600
on NixOS servers. So, we declare the application in a Docker container. We intentionally chose

19:06.600 --> 19:13.080
this so that, for example, if I do not have any experience with Nix or Nix code, if I do not know

19:13.080 --> 19:18.680
anything about it, I can still deploy my application inside those

19:19.560 --> 19:25.560
NixOS servers, as the deployment process is automated.

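NOTE
Editor's sketch: MSF's own deployment service is not shown in the transcript, so this uses the stock
NixOS virtualisation.oci-containers module as a stand-in for declaring a Docker container on a
NixOS host. The image name, port, paths and container name are hypothetical.
```nix
{ ... }:
{
  virtualisation.oci-containers = {
    backend = "docker";
    containers.example-app = {
      # Hypothetical image; pin a tag so rebuilds are reproducible.
      image = "ghcr.io/example-org/example-app:1.0.0";
      ports = [ "8080:8080" ];                      # host:container
      volumes = [ "/var/lib/example-app:/data" ];   # persistent data on the host
      # Secrets stay out of the Nix store; this path is an assumption.
      environmentFiles = [ "/run/secrets/example-app.env" ];
    };
  };
}
```
The application team only has to supply a container image; the host side stays a few lines of declarative configuration.
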
19:28.840 --> 19:35.080
So, this is how we define our deployment service. As you can see here, I just need to say

19:35.080 --> 19:42.440
which application I want to deploy. So, I just write the repo name, then the branch name,

19:42.520 --> 19:50.840
and then the target machine where I want to enable this option. So,

19:50.840 --> 19:57.960
that is it. Then the automated deployment process just checks out the YAML files and then

19:57.960 --> 20:08.760
deploys them to the correct target machine. And then, after deployment, we will get ready

20:08.760 --> 20:19.240
our application to be used. How do we manage installation on a new machine? We basically use

20:19.240 --> 20:24.920
nixos-anywhere and disko. We composed these two into an installation script. As long as

20:24.920 --> 20:34.360
the server is reachable by SSH, we can install NixOS on it. We encrypt our data partition

20:34.360 --> 20:42.040
with LUKS, as people usually do, because our sensitive patient files are inside it. So, now I am

20:42.040 --> 20:51.240
handing over to Ramses, who is basically the main architect behind this system, and

20:51.240 --> 20:54.520
he will talk more about our next improvement plans.

20:54.520 --> 20:56.520
Yeah, go ahead.

21:01.320 --> 21:06.520
Yeah, maybe just very quickly, because we are kind of running out of time. So, there are a couple

21:06.520 --> 21:12.360
of improvements that we see in the system. So, as we mentioned, we started in 2018. So, the whole

21:12.360 --> 21:18.120
ecosystem was a lot smaller back then. Things like sops-nix and such didn't exist.

21:18.120 --> 21:22.680
nixos-anywhere didn't exist. So, it is a bit weird, maybe, that we are using Ansible Vault for

21:22.760 --> 21:27.800
secrets, but that is because we kind of had to roll our own system at the time. So, at some

21:27.800 --> 21:35.080
point, we should improve that. That is also linked with the whole encryption key thing, because,

21:35.080 --> 21:41.400
yeah, it is a bit weird that we use the same keys for, like, secrets and SSH. There are a couple

21:41.400 --> 21:46.360
of things that we should probably migrate to, like the systemd initrd. We tried it once.

21:46.440 --> 21:52.120
It didn't work. We should probably try it again. We would really like to use

21:52.120 --> 21:57.240
verified boot and measured boot to secure our servers in the field, which we haven't

21:57.240 --> 22:01.960
managed so far. I think there have been quite a few advancements in the ecosystem so far,

22:01.960 --> 22:09.800
but it is still not the most straightforward thing to do. We have been writing more VM tests,

22:10.680 --> 22:17.000
which are very helpful to avoid issues. And then finally, one thing that we still do is we

22:17.000 --> 22:22.680
eval and build everything on the actual machines, which has a couple of advantages for us,

22:23.880 --> 22:25.720
but eventually we probably want to move away from that.

22:29.080 --> 22:33.480
Okay, I'm going to skip very quickly past our next slides because I think we are completely out of time.

22:33.480 --> 22:39.160
So, I just want to say, first of all, we have had some challenges, but some of the books have

22:39.160 --> 22:44.760
been helpful. I want to say, first of all, NixOS is an amazing technology and it's a good

22:44.760 --> 22:50.520
community. You all deserve a big round of applause for the products that you've managed to create here.

22:50.520 --> 22:55.000
It's really, really good technology. If I had to choose one thing, and I'm going to duck here,

22:55.000 --> 23:02.120
it's: stabilise flakes. Yes, that's, yeah. And I'd also like to give a big thanks to Ramses and

23:02.200 --> 23:07.400
Numtide for their help. Ramses was the original architect of the system. We're

23:07.400 --> 23:12.440
operationally independent there, but he's been great for helping us with our evolution. And that's it.

