WEBVTT

00:00.000 --> 00:11.000
For the next session, we're coming up to the end of the day now. I'm delighted to have

00:11.000 --> 00:17.400
Ben Tullis and Balthazar Rouberol, fantastic names. They're going to be talking about enhancing

00:17.400 --> 00:25.400
Wikimedia's global analytics, data engineering and ML platform. So take it away, guys. Thank you.

00:26.400 --> 00:32.400
Before we start, I just want to make sure that you can hear me at the back and on the stream.

00:32.400 --> 00:36.400
Okay, can you hear me at the back if I speak this loudly?

00:36.400 --> 00:38.400
Okay, so we'll speak loudly.

00:38.400 --> 00:46.400
Okay, thanks very much. So yeah, just to introduce ourselves: I'm Ben, this is Balthazar, and we're both SREs

00:46.400 --> 00:51.400
at the Wikimedia Foundation and we're in the data platform engineering team.

00:51.400 --> 00:58.400
And yeah, so this presentation is an implementation story about Airflow, really.

00:58.400 --> 01:04.400
And it should hopefully be interesting because of the fact that it's the Wikimedia Foundation,

01:04.400 --> 01:10.400
it means that all of the code that's there, all the DAGs, all the Puppet manifests,

01:10.400 --> 01:16.400
everything else, all the tickets around implementation, they're all open. So it should be an interesting sort of reference case

01:16.400 --> 01:19.400
and see where we're up to at the moment.

01:20.400 --> 01:26.400
Yeah, so, the Wikimedia Foundation, I don't know how familiar you all are with it,

01:26.400 --> 01:32.400
but there are some interesting aspects of working at Wikimedia Foundation.

01:32.400 --> 01:37.400
So firstly, we're all on-prem, all on-premises in physical data centers.

01:37.400 --> 01:45.400
So no cloud services, no AWS, no commercial stuff, we're all fully open source.

01:45.400 --> 01:54.400
And for policy reasons, we don't allow the running of any container images that are made by third-party sources.

01:54.400 --> 01:59.400
So everything has to be on Debian, everything has to be built on a gold-standard image that we build ourselves.

01:59.400 --> 02:04.400
So that just means that when you're implementing software like this or systems like this,

02:04.400 --> 02:13.400
there are additional steps, because we have to take all of the code from source and build it ourselves.

02:14.400 --> 02:20.400
Okay, this is the sort of raison d'être of the data platform.

02:20.400 --> 02:22.400
So there's sort of a mission statement there.

02:22.400 --> 02:31.400
Effectively, we provide a data platform which has various storage and transformation and serving layers.

02:31.400 --> 02:40.400
And we have various teams that we support, so from data engineering and analytics.

02:40.400 --> 02:45.400
Things to do with the experimentation platform that we're building at the moment, the search platform.

02:45.400 --> 02:51.400
So that's everything to do with searching on MediaWiki projects and also query services.

02:51.400 --> 02:58.400
So all of the data query services, this kind of thing, they all fall under the data platform.

02:58.400 --> 03:03.400
Okay, this is the best sort of reference we have at the moment for what the whole data platform looks like.

03:03.400 --> 03:08.400
I'm not going to go through it in any detail, but we've got like the event producer and things.

03:08.400 --> 03:12.400
So everything like the web request logs that go into it.

03:12.400 --> 03:24.400
Then we have various sort of storage streaming data lake kind of areas here and then various serving systems serving layers out here.

03:24.400 --> 03:27.400
And you find a nice,

03:28.400 --> 03:31.400
Superset dashboard here.

03:31.400 --> 03:35.400
It's linked up to Presto and Druid and various things.

03:35.400 --> 03:37.400
But we're talking at the moment here.

03:37.400 --> 03:43.400
So we're talking about batch job scheduling for workloads that are generally on the Hadoop cluster.

03:43.400 --> 03:45.400
Let's move on.

03:45.400 --> 03:51.400
So we've been using Airflow for quite a long time, since it was an incubator project.

03:51.400 --> 04:01.400
And we've built it up from a sort of MVP stage to where it was at the beginning of 2024.

04:01.400 --> 04:03.400
And we use it for all sorts of things.

04:03.400 --> 04:10.400
So we've actually moved our Hadoop workloads from Oozie into Airflow.

04:10.400 --> 04:18.400
And by sometime in 2023, maybe, we had finished moving all of the Oozie workloads into Airflow.

04:18.400 --> 04:28.400
And these are our sort of primary use cases: for researchers, for the search product teams, analytics and so on.

04:28.400 --> 04:37.400
Okay, and yeah, if we, this is a sort of screenshot of what a typical data pipeline in Airflow looks like.

04:37.400 --> 04:42.400
And most of the time what we're doing is we're using sensors to say, you know,

04:42.400 --> 04:46.400
when data arrives in HDFS or in various locations.

04:46.400 --> 04:54.400
And then when it's there, we then run Spark jobs, using Spark on YARN on the Hadoop cluster.

04:54.400 --> 05:03.400
So Airflow workers themselves were never intended to do the heavy lifting for analytics and for most of the other things.

05:03.400 --> 05:07.400
Airflow workers launch YARN jobs on the Hadoop cluster.

05:07.400 --> 05:14.400
So that's where we are. And Airflow is effectively orchestrating these Spark jobs on the cluster.
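That sensor-then-submit pattern can be sketched in a few lines of plain Python. This is only an illustration of the shape of such a pipeline, not the Foundation's actual code: the paths, table name and script name are hypothetical, and in a real Airflow DAG the pieces would be a file sensor followed by a Spark submit operator.

```python
from datetime import datetime

def partition_path(table: str, ts: datetime) -> str:
    """Build the Hive-style partition directory a sensor would poll for.

    The base path and layout are invented, for illustration only.
    """
    return (
        f"/wmf/data/{table}"
        f"/year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/hour={ts.hour:02d}"
    )

def spark_submit_args(app: str, partition: str) -> list[str]:
    """Assemble a spark-submit invocation targeting YARN, as described in the talk."""
    return [
        "spark-submit",
        "--master", "yarn",        # the job runs on the Hadoop cluster,
        "--deploy-mode", "cluster",  # not on the Airflow worker itself
        app,
        "--partition", partition,
    ]

# One scheduler tick, roughly: wait until the partition exists in HDFS,
# then launch the Spark job on the Hadoop cluster.
path = partition_path("webrequest", datetime(2024, 1, 1, 6))
cmd = spark_submit_args("transform.py", path)
```

The point is that the Airflow side stays lightweight: it only waits and submits, while YARN does the heavy lifting.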

05:14.400 --> 05:21.400
This was the infrastructure as we came to it in sort of beginning of 2024.

05:21.400 --> 05:28.400
So each of these instances for teams or for functions, we had an Airflow instance.

05:28.400 --> 05:35.400
And they were running on either a bare metal host or individual virtual machines with a local web server, local scheduler.

05:35.400 --> 05:40.400
And all of the Airflow jobs running as sub-processes of the Airflow scheduler here.

05:40.400 --> 05:50.400
And they were connecting to a shared Postgres metadata server with a replica, but no sort of automatic high-availability

05:50.400 --> 05:52.400
failover mechanism.

05:52.400 --> 05:56.400
Wikimedia in general is a MariaDB house.

05:56.400 --> 06:04.400
So most of the effort that we have around high availability and, you know, database administration is in the MariaDB space.

06:05.400 --> 06:09.400
Airflow doesn't work with MariaDB, so we've had to migrate to Postgres.

06:09.400 --> 06:13.400
And we didn't have, you know, the huge amount of Postgres experience.

06:13.400 --> 06:20.400
So we had this model, and it had a number of shortcomings in terms of scalability.

06:20.400 --> 06:22.400
I think that's the next one.

06:22.400 --> 06:25.400
Yeah, I think we have pretty much covered this.

06:25.400 --> 06:31.400
We're saying that it's all Kerberos-authenticated: Hadoop, YARN, Hive and Spark.

06:32.400 --> 06:36.400
Yeah, that's pretty much it. Also we had the local executor and local task logging.

06:36.400 --> 06:45.400
So as the number of tasks in Airflow ramped up, the number of logs that we had on the individual machines was getting to be in the millions.

06:45.400 --> 06:50.400
Okay, so this is then talking about some of the problems that we had with this infrastructure.

06:50.400 --> 06:57.400
And it was really around what happened when we had to do maintenance of the metadata database,

06:58.400 --> 07:01.400
rebooting, restarting that service.

07:01.400 --> 07:05.400
We didn't have any high availability in Postgres.

07:05.400 --> 07:12.400
That causes downtime for the scheduler, which then causes problems for the jobs that are running as sub-processes.

07:12.400 --> 07:17.400
Sometimes they would be okay, but sometimes they would just sort of disappear.

07:17.400 --> 07:18.400
So not great.

07:18.400 --> 07:24.400
And yeah, we had a high number of connections to Postgres as well coming from there.

07:24.400 --> 07:30.400
Especially as we got into dynamic DAGs that were doing more and more database querying.

07:30.400 --> 07:32.400
And the next one, please. Yep.

07:32.400 --> 07:40.400
So yeah, we then talked about scaling: how we go about scaling our Airflow instances effectively.

07:40.400 --> 07:45.400
Because we were limited to a single machine for each instance.

07:45.400 --> 07:53.400
It got pretty difficult to scale. We couldn't use the local executor in a way that was easy to scale.

07:53.400 --> 07:58.400
Okay. Also, there were a few things around about the user experience that were pretty poor.

07:58.400 --> 08:04.400
Engineers would have to use SSH tunnels in order to gain access to the Airflow web server.

08:04.400 --> 08:12.400
So that's not very nice as a user experience, to tell people you've got to have a terminal window with an SSH connection

08:12.400 --> 08:14.400
So you can log into your Airflow.

08:14.400 --> 08:20.400
And also, there were people who were not necessarily technical, but management, who wanted to see how things were going.

08:20.400 --> 08:28.400
They would still have to have an SSH shell, a POSIX user account, with all the various overhead that entailed.

08:28.400 --> 08:34.400
And then, because we had that, there was no authorization layer.

08:34.400 --> 08:40.400
We just bypassed all of the Airflow authorization, the RBAC levels and this kind of thing.

08:40.400 --> 08:43.400
Because everyone was already authenticated with SSH.

08:43.400 --> 08:48.400
So these are some of the things that were not ideal about our current setup.

08:49.400 --> 08:52.400
So we came up with a plan to make things better.

08:52.400 --> 08:57.400
We've got the ticket links here.

08:57.400 --> 09:01.400
It's all open, so you can all look at it and at how we managed the project.

09:01.400 --> 09:06.400
We wanted to make sure that we could eliminate all the single points of failure.

09:06.400 --> 09:09.400
Make sure that the Airflow services are resilient.

09:09.400 --> 09:16.400
And a key thing really is that we, as the SREs, although we're changing the deployment model,

09:16.400 --> 09:21.400
we didn't want to have to force the DAG authors to change the way that they were doing their DAGs.

09:21.400 --> 09:25.400
So we didn't want to have to have a big bang migration for users.

09:25.400 --> 09:33.400
It was still going to support Spark jobs on YARN.

09:33.400 --> 09:39.400
Okay, so in terms of the infrastructure now, the project infrastructure,

09:39.400 --> 09:44.400
we have a Kubernetes cluster for the data platform engineering team.

09:44.400 --> 09:48.400
So we built this over the last couple of years.

09:48.400 --> 09:52.400
It's lightly used at the moment, so it's a good deployment target for the compute.

09:52.400 --> 09:56.400
So we've also got a couple of these machines here, these worker nodes.

09:56.400 --> 09:59.400
This one and this one have got GPUs in.

09:59.400 --> 10:03.400
So older generation and newer generation GPUs.

10:03.400 --> 10:06.400
This is going to be useful for the research and ML team.

10:06.400 --> 10:09.400
We're going to be onboarding them to do model training work.

10:09.400 --> 10:12.400
And you go to the next one, please, yep.

10:12.400 --> 10:20.400
We've also invested in Ceph storage in the data platform team, alongside the Hadoop cluster.

10:20.400 --> 10:30.400
So this is something that we really wanted to be able to use for this project, to help us achieve these availability targets and things.

10:30.400 --> 10:33.400
So we have five nodes in the Ceph cluster.

10:34.400 --> 10:41.400
We have a large tier backed by spinning disks with a cache, and then we have a smaller high-performance SSD tier.

10:41.400 --> 10:47.400
So this cluster, again, was up and running, ready for use in this Airflow project.

10:47.400 --> 10:58.400
Okay, so the architecture that we have come up with is that we wanted to move from the local executor to the Kubernetes executor, running tasks as pods.

10:58.400 --> 11:09.400
So I don't know if you are familiar with airflow, but every time a task runs with the Kubernetes executor it will run as a Kubernetes pod, run to completion.

11:09.400 --> 11:14.400
So it will no longer be running as sub-processes of the scheduler.

11:14.400 --> 11:22.400
We also wanted to make sure that Airflow stores its logs remotely, rather than locally on the scheduler and workers as previously.
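For reference, remote task logging in Airflow is driven by a handful of configuration options. The `AIRFLOW__LOGGING__*` keys below are Airflow's standard environment-variable form of those options; the bucket name and connection ID are made-up placeholders, not the Foundation's real values:

```python
# Sketch of Airflow remote-logging settings, expressed as environment variables.
# The keys are real Airflow config options; the values are hypothetical.
remote_logging_env = {
    "AIRFLOW__LOGGING__REMOTE_LOGGING": "True",
    # Task logs go to an S3 bucket (here, one served by Ceph's S3 interface).
    "AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER": "s3://airflow-logs-example",
    # Airflow connection holding the S3 endpoint and credentials.
    "AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID": "s3_logs",
}
```

With something like this in place, the millions of per-task log files no longer accumulate on the scheduler machine.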

11:22.400 --> 11:30.400
And a key part of it is that we wanted high-availability Postgres, built in or in some way.

11:30.400 --> 11:41.400
So on the Postgres side, we took the decision to move Postgres into Kubernetes, backed by the Ceph cluster.

11:41.400 --> 11:46.400
And we have a Kubernetes operator for Postgres clusters.

11:46.400 --> 11:56.400
That enables automatic leader failover, sorting out replication and Postgres poolers and this kind of thing.

11:56.400 --> 12:01.400
And we wanted to make sure that we have got continuous Postgres backups at the same time.

12:01.400 --> 12:12.400
And on the Ceph cluster we have got the block device interface, RBD, the CephFS POSIX file system interface, and the S3 interface.

12:12.400 --> 12:22.400
I don't know if everyone knows Ceph well, but effectively we are using the three main interfaces of the Ceph cluster in this project.

12:22.400 --> 12:28.400
Okay, yep. This is then the architecture. Do you want to take over from here?

12:28.400 --> 12:33.400
So this is the visual representation of what Ben was just talking about.

12:33.400 --> 12:36.400
The blue boxes here are the Airflow components.

12:36.400 --> 12:41.400
Like we mentioned before, we're using Kerberos, so we need some way to renew our tokens regularly.

12:41.400 --> 12:44.400
So this is the Kerberos sidecar that's running.
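Conceptually, that renewal component is just a loop around `kinit` with a keytab, refreshing a ticket cache that the other Airflow components read. A minimal sketch, with an entirely hypothetical keytab path, principal, and cache location:

```python
# Build the kinit invocation such a renewal loop would run periodically.
# All paths and the principal name below are invented for illustration.
def kinit_command(keytab: str, principal: str, cache: str) -> list[str]:
    """kinit -kt obtains a fresh ticket from the keytab, written to `cache`."""
    return ["kinit", "-kt", keytab, "-c", cache, principal]

cmd = kinit_command(
    keytab="/etc/airflow/airflow.keytab",    # hypothetical keytab path
    principal="airflow/example@EXAMPLE.ORG",  # hypothetical principal
    cache="/var/run/airflow/krb5cc",          # shared with the other components
)

# The sidecar would then do, roughly:
#   while True: subprocess.run(cmd, check=True); time.sleep(3600)
# i.e. re-obtain the ticket well before it expires.
```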

12:44.400 --> 12:50.400
We have the webserver, obviously the scheduler that itself manages the pods running on Kubernetes.

12:50.400 --> 13:00.400
And the way that we've designed the PG cluster is that Airflow automatically connects to PgBouncer and not to PostgreSQL itself.

13:00.400 --> 13:05.400
We actually set up some network policies preventing anyone from connecting to PostgreSQL directly.

13:05.400 --> 13:15.400
We enforce traffic to go through PgBouncer to avoid overwhelming PostgreSQL itself with connections, especially when we run thousands of mapped tasks.
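The idea behind forcing everything through a pooler can be shown in a few lines: many short-lived clients share a small, fixed set of server connections, so the database never sees thousands of simultaneous sessions. This toy illustrates only the principle, not how PgBouncer is actually implemented:

```python
import threading

class TinyPool:
    """Toy connection pool: at most `size` callers hold a 'connection' at once."""
    def __init__(self, size: int):
        self._sem = threading.BoundedSemaphore(size)
        self._lock = threading.Lock()
        self.in_use = 0
        self.peak = 0  # highest concurrent usage observed

    def run(self, work):
        with self._sem:  # blocks when all "connections" are busy
            with self._lock:
                self.in_use += 1
                self.peak = max(self.peak, self.in_use)
            try:
                work()
            finally:
                with self._lock:
                    self.in_use -= 1

# 100 client threads, but the "database" only ever sees 5 sessions at once.
pool = TinyPool(size=5)
threads = [threading.Thread(target=pool.run, args=(lambda: None,)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```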

13:15.400 --> 13:20.400
So it can get pretty hairy. And like Ben said, all of that is backed by Ceph.

13:20.400 --> 13:29.400
We had to expose different interfaces. So S3 is used to store the logs, and the WALs and the base backups for PostgreSQL.

13:29.400 --> 13:34.400
We also needed a block device, which actually stores the actual PG data.

13:34.400 --> 13:41.400
And we also needed some distributed file system, in which we store the DAGs themselves, which we mount on the scheduler.

13:41.400 --> 13:49.400
And we also store the Kerberos tokens there, which we regularly renew and store on every single Airflow component that needs to talk to Kerberos.

13:49.400 --> 13:55.400
So the first thing that we kind of set out to do, and I say the first thing we actually worked on this in parallel, but it makes for a better story,

13:55.400 --> 14:02.400
is that we integrated Ceph with Kubernetes. So if you're not familiar with Kubernetes, the way that you do it,

14:02.400 --> 14:07.400
the way that you integrate storage with it, is through what is called a CSI, a Container Storage Interface.

14:07.400 --> 14:15.400
And so this is the man in the middle, so to speak, that will see that you are requesting some volume to be created in your storage mechanism,

14:15.400 --> 14:20.400
the storage software, and will then actually provision the volume for you.

14:20.400 --> 14:27.400
And so for us, we actually had to install two CSIs: one for the block device, RBD, and one for CephFS.

14:27.400 --> 14:37.400
And we also expose an anycast endpoint for S3; basically, it allows any client who wishes to talk to S3

14:37.400 --> 14:45.400
to talk to the Ceph server that is available in the same network rack, so that we minimize the number of network hops

14:45.400 --> 14:52.400
and we maximize bandwidth. And so what you're seeing here in the bottom screenshot is basically just the persistent volume claims,

14:52.400 --> 14:58.400
and so you can see the Airflow DAGs volume that is stored on CephFS, the Kerberos volume on CephFS as well,

14:58.400 --> 15:06.400
and then the PG data on our block device, and you can see in the names that all of these data are actually backed by the SSD pool that we talked about,

15:06.400 --> 15:11.400
because these are actually latency critical.
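A persistent volume claim like the ones in that screenshot looks roughly like this, written here as a Python dict rather than YAML. The claim name and storage-class names are hypothetical, but the shape is a standard Kubernetes PersistentVolumeClaim:

```python
# Sketch of a PVC such as the one holding the Airflow DAGs. Names are invented;
# an RBD-backed class would be used for the PostgreSQL data instead.
dags_pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "airflow-dags"},
    "spec": {
        # A CephFS-backed, SSD-pool class for shared POSIX-style volumes.
        "storageClassName": "cephfs-ssd",
        # ReadWriteMany, because scheduler and task pods mount it together.
        "accessModes": ["ReadWriteMany"],
        "resources": {"requests": {"storage": "10Gi"}},
    },
}
```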

15:11.400 --> 15:17.400
We investigated multiple operators to run, to run PostgreSQL in Kubernetes.

15:18.400 --> 15:25.400
I have a background of running data stores in Kubernetes, and so I know that this is actually pretty hairy, and it can be pretty dangerous.

15:25.400 --> 15:30.400
And so we have the ticket, if you want to have a look at that; we investigated maybe five of them,

15:30.400 --> 15:38.400
and we settled on one called CloudNativePG. It had basically the features that we wanted: it had support for pooling, automatic backups

15:38.400 --> 15:46.400
and failover, and data imports, and it also exposed all of the PG metrics to Prometheus.

15:46.400 --> 15:53.400
Some of the operators we investigated had a proprietary metrics interface, and so that just couldn't fly.

15:53.400 --> 15:59.400
And so, the way that we designed the cluster: we have two PG servers and three PgBouncers. The write-ahead logs

15:59.400 --> 16:05.400
and the base backups, we export to the S3 interface of Ceph.

16:05.400 --> 16:12.400
If you think about it, it's kind of dangerous, because the PG data is stored in Ceph, and then we export the backups to Ceph as well.

16:12.400 --> 16:19.400
So we also extract them outside of Ceph every night, so that we don't store every single egg in the same basket.

16:19.400 --> 16:29.400
As you can see on that little screenshot there, we can use kubectl to inquire about the state of the cluster, because it's a custom resource that is integrated.

16:29.400 --> 16:34.400
I don't know exactly why they named it Cluster, because it's not super explicit, but there you go.
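A CloudNativePG `Cluster` resource along the lines just described looks roughly like this, again written as a Python dict rather than YAML. The metadata name, storage class, sizes and bucket are hypothetical, and the field layout follows the operator's API as we understand it:

```python
# Sketch of a CloudNativePG Cluster custom resource; names and sizes invented.
pg_cluster = {
    "apiVersion": "postgresql.cnpg.io/v1",
    "kind": "Cluster",  # the not-so-explicit resource name
    "metadata": {"name": "airflow-analytics-db"},
    "spec": {
        "instances": 2,  # one primary plus one replica, automatic failover
        "storage": {"storageClass": "rbd-ssd", "size": "50Gi"},
        "backup": {
            # WALs and base backups shipped to Ceph's S3 interface.
            "barmanObjectStore": {"destinationPath": "s3://pg-backups-example"},
        },
    },
}
```

Because this is a custom resource, `kubectl get cluster airflow-analytics-db` can then report the cluster's state directly.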

16:35.400 --> 16:41.400
When it comes to security and RBAC, what we've done is: we wanted to have a public UI.

16:41.400 --> 16:48.400
We didn't want anyone to have to go through SSH tunneling, and we wanted that UI to be authenticated.

16:48.400 --> 16:57.400
Using OIDC, and basically what that meant is that when you logged in, your LDAP membership would be mapped to roles and thus to permissions.

16:57.400 --> 17:03.400
So for example, Ben and myself belong to the SRE group, so we're admins on every single Airflow instance.

17:03.400 --> 17:10.400
But Gabriele here is a member of the search team, and so he gets some search-related permissions, but doesn't have permissions on other Airflow instances.

17:10.400 --> 17:15.400
So we have some flexibility that wasn't really possible for us before.
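The group-to-role mapping just described can be sketched as a plain function. The group and role names below are invented; in practice this kind of mapping lives in the web server's OIDC / Flask-AppBuilder configuration rather than in hand-written code:

```python
# Hypothetical mapping from LDAP groups to Airflow roles on one instance.
ROLE_MAP = {
    "sre": "Admin",   # SREs are admins everywhere
    "search": "Op",   # the search team can operate its own instance
}

def roles_for(ldap_groups: list[str], default: str = "Viewer") -> set[str]:
    """Map a user's LDAP groups to Airflow roles, defaulting to read-only."""
    roles = {ROLE_MAP[g] for g in ldap_groups if g in ROLE_MAP}
    return roles or {default}

# An SRE gets Admin; a user with no mapped groups can, at most, view.
```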

17:15.400 --> 17:24.400
And we also wanted to have the API Kerberos-authenticated, which turned out to be a pain in the neck.

17:24.400 --> 17:28.400
And the migration plan was simple, and the reality wasn't.

17:28.400 --> 17:32.400
First, we would migrate the web server.

17:32.400 --> 17:44.400
The reason for the web server migration would be that we actually provide value to our customers, so to speak, by having a public UI, a public URL, and RBAC permissions.

17:44.400 --> 17:58.400
Then we also created an Airflow instance for ourselves, and we would test every single kind of operator we could find in our Airflow DAGs repository, until we were confident that we could accommodate every DAG that we had in our codebase.

17:58.400 --> 18:05.400
So we thought, then we would migrate one instance's database to the new Kubernetes Postgres.

18:05.400 --> 18:11.400
Then we would migrate the scheduler, and then everything would explode, and then we would fix that, and things would be improved.

18:11.400 --> 18:17.400
And so, when we applied to tell this story at FOSDEM, we hoped that it would be done.

18:17.400 --> 18:21.400
We are mid-migration, and it's still a bit of an issue right now.

18:21.400 --> 18:23.400
But we'll get there.

18:23.400 --> 18:27.400
Some things did hold us back.

18:27.400 --> 18:37.400
We found that, out of the box, the Kerberos API authentication was broken, and it took a little bit of fiddling to fix it.

18:37.400 --> 18:47.400
We fixed it in that pull request, and the fix has been released in the FAB provider, 1.5.1.

18:47.400 --> 18:59.400
The fun part of it is that I basically found some traces from an ex-colleague of mine, from a previous company, so I was able to reach out and we kind of talked about that.

18:59.400 --> 19:06.400
But then, by default, even if you had a valid Kerberos token, the API would just respond 401.

19:06.400 --> 19:11.400
I also found out that the automatically generated API client doesn't support Kerberos authentication.

19:11.400 --> 19:16.400
So if you use Kerberos authentication, you're kind of on your own if you want to use the client to talk to the API.

19:16.400 --> 19:20.400
So we had to patch it in the ugliest way possible.

19:20.400 --> 19:27.400
And the thing about working at the WMF is that my ugly hacks are public, so you can check them out.

19:27.400 --> 19:35.400
And I think it will take a little bit of talking about it upstream, so that we can release that as a proper patch, because right now this is shameful.

19:36.400 --> 19:40.400
And like I said, migrating an instance with a lot of DAGs was tricky.

19:40.400 --> 19:46.400
Our biggest instance has about 150 DAGs, and some of them have many, many tasks.

19:46.400 --> 19:55.400
And so we found out that some of them actually needed to be rewritten just a little bit to account for the specific environments in which they were running.

19:55.400 --> 20:02.400
And the one thing that was the trickiest to get right was the networking.

20:02.400 --> 20:06.400
Because, who here is familiar with Kerberos networking?

20:06.400 --> 20:09.400
You were lucky, you were lucky, yes.

20:09.400 --> 20:12.400
It's not a great place to be, and it's complicated.

20:12.400 --> 20:20.400
And we found out a lot of discrepancies between the previous environments in which we were running and the new one.

20:20.400 --> 20:27.400
So we also run a service mesh that allows you to talk to localhost on certain ports

20:27.400 --> 20:32.400
and be redirected to a service within the internal set of services that we keep at Wikimedia.

20:32.400 --> 20:37.400
And integrating with that in a way that worked, took us some time as well.

20:37.400 --> 20:40.400
So all of these little things were paper cuts.

20:40.400 --> 20:45.400
And I think we're getting to the bottom of them, but you know, well you never know.

20:45.400 --> 20:47.400
What did we get out of it today?

20:47.400 --> 20:49.400
I think we've fixed most of the issues.

20:49.400 --> 20:53.400
If not all of the issues that we mentioned, we have a better reliability story.

20:53.400 --> 20:58.400
We can do a rolling restart of the Kubernetes workers,

20:58.400 --> 21:03.400
and the Airflow tasks will not be impacted, because they run in pods.

21:03.400 --> 21:07.400
And the worst thing that could happen is that they get rescheduled.

21:07.400 --> 21:11.400
We can restart the schedulers without killing the pods.

21:11.400 --> 21:16.400
We can, you know, we can do a lot of things that weren't possible before.

21:16.400 --> 21:21.400
The PostgreSQL cluster can be rolling-restarted.

21:21.400 --> 21:25.400
We have automatic failover, where it used to be a manual process.

21:25.400 --> 21:29.400
Every Airflow instance has its own PG cluster.

21:29.400 --> 21:35.400
So there's no more noisy neighbor effect that we actually experienced mid migration at some point.

21:35.400 --> 21:40.400
And we couldn't migrate, because one Airflow instance was hammering the PG cluster.

21:40.400 --> 21:42.400
And so that was a little bit of a fun story.

21:42.400 --> 21:44.400
We have an improved security story.

21:44.400 --> 21:48.400
We have permissions, RBAC, and logging.

21:48.400 --> 21:52.400
And not everyone needs to be admin anymore, which is great.

21:52.400 --> 21:56.400
The API isn't a free-for-all anymore.

21:56.400 --> 21:58.400
It's Kerberos-authenticated.

21:58.400 --> 22:01.400
The scalability story is actually just easier because of Kubernetes.

22:01.400 --> 22:06.400
We can scale with the number of Kubernetes workers that we have.

22:06.400 --> 22:10.400
We are not constrained by the size of one machine.

22:10.400 --> 22:15.400
That's actually a little bit of a lie, in the sense that a lot of the Airflow instances

22:15.400 --> 22:16.400
were running on VMs.

22:16.400 --> 22:18.400
So we could resize them on the fly.

22:18.400 --> 22:23.400
But for the bare metal instance that we have that was running the most number of jobs.

22:23.400 --> 22:27.400
That was just the machine that we had for five years.

22:27.400 --> 22:29.400
The UX is better.

22:29.400 --> 22:34.400
When you received an alert email from Airflow, it would take you to localhost.

22:34.400 --> 22:39.400
It was kind of a pain, because you needed to open an SSH tunnel to get to your Airflow.

22:39.400 --> 22:42.400
So now you can click on the link and you get to airflow-analytics,

22:42.400 --> 22:44.400
dot wikimedia dot org, or something like this.

22:44.400 --> 22:47.400
And it's just a better UX story.

22:47.400 --> 22:50.400
And I think what I'm the most excited about.

22:50.400 --> 22:55.400
And I think Ben is as well is that we were able to create and provide reusable building blocks.

22:55.400 --> 22:59.400
We have some Flink pipelines that Gabriele is actually working on.

22:59.400 --> 23:01.400
That are storing data in S3.

23:01.400 --> 23:07.400
We have teams that are interested in getting a PG cluster, which wasn't particularly easy to get before at the

23:08.400 --> 23:09.400
WMF.

23:09.400 --> 23:12.400
So this is getting easier now.

23:12.400 --> 23:16.400
And the adoption, the increased adoption story is interesting.

23:16.400 --> 23:19.400
We actually didn't know how to write DAGs before.

23:19.400 --> 23:22.400
And so we ended up creating our own instance.

23:22.400 --> 23:26.400
And we kind of trained ourselves at Airflow DAG authoring.

23:26.400 --> 23:33.400
And then we ended up writing DAGs to administer it, to handle operations on that instance.

23:34.400 --> 23:39.400
So we have a DAG that would just remove all files from S3 or clean up the database.

23:39.400 --> 23:40.400
So things like this.

23:40.400 --> 23:45.400
And it got us into a position where we are actually more equipped to talk to the data engineering teams.

23:45.400 --> 23:48.400
Because we share more of their vocabulary, so to speak.

23:48.400 --> 23:54.400
And the final slide is: now that we're there,

23:54.400 --> 23:55.400
so to speak.

23:55.400 --> 23:57.400
Oh, do you have a question, sir?

23:57.400 --> 23:58.400
Not yet? Okay, yeah.

23:58.400 --> 24:07.400
What will we invest in, in the near future, when we're done with the migration?

24:07.400 --> 24:14.400
We have weird mechanisms at the moment, allowing DAG authors to run ad hoc jobs.

24:14.400 --> 24:19.400
And we can unravel, we can remove all of these weird things.

24:19.400 --> 24:24.400
It also relies on a piece of software that is no longer maintained, which we'd like to get rid of.

24:24.400 --> 24:33.400
And so instead of bundling that ad hoc environment, we can allow DAG authors to just run pods themselves.

24:33.400 --> 24:44.400
So they would run a task that itself would talk to Kubernetes to create an ad hoc job running in Kubernetes itself using the Kubernetes pod operator.
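With the KubernetesPodOperator, an ad hoc job is a task that simply describes a pod. A sketch of the kind of task definition involved; the namespace, image, and command are hypothetical, and a real DAG would pass these arguments to `KubernetesPodOperator` from Airflow's CNCF Kubernetes provider:

```python
# Collect the arguments such an operator would need for one ad hoc pod.
# Purely illustrative; every value here is made up.
def adhoc_pod_task(task_id: str, image: str, command: list[str]) -> dict:
    return {
        "task_id": task_id,
        "namespace": "airflow-adhoc",     # hypothetical namespace
        "image": image,                    # must be a self-built image, per policy
        "cmds": command,
        "on_finish_action": "delete_pod",  # run to completion, then clean up
    }

job = adhoc_pod_task(
    "rebuild-dataset",
    "example-registry/spark:latest",
    ["python", "job.py"],
)
```

The appeal is that the pod runs to completion under Kubernetes, independently of the scheduler process, with Airflow providing the retries and observability.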

24:44.400 --> 24:49.400
I don't know if any of you is familiar with the Wikipedia dumps?

24:49.400 --> 24:58.400
But it's a regularly updated data set that you can download yourself, directly or peer-to-peer.

24:58.400 --> 25:03.400
And it's basically the entire content of Wikipedia, or the different wikis.

25:03.400 --> 25:07.400
And we have a nightmareish system.

25:07.400 --> 25:09.400
Let's call it this way.

25:09.400 --> 25:11.400
That does that.

25:11.400 --> 25:14.400
And we are on the hook for it.

25:14.400 --> 25:22.400
And we are trying to migrate it to Airflow, using this Kubernetes pod operator.

25:22.400 --> 25:32.400
Because the fact that it would run on airflow would give us some more observability and re-try mechanisms to actually make it easier to manage.

25:32.400 --> 25:39.400
And finally, we have, I think I've reworked the slides, but then I might have not saved them.

25:39.400 --> 25:46.400
We don't only have the machine learning team, who is interested in running Airflow,

25:46.400 --> 25:53.400
but we also heard from other teams who would like to run quite complex operations.

25:54.400 --> 26:08.400
The easier we make it to get a new instance, and to run it and to manage it, the more we found that teams were asking us about Airflow.

26:08.400 --> 26:12.400
Could I get an instance, and how can I get running, and things like this?

26:12.400 --> 26:19.400
And so it seems to us that it's becoming part of the lingua franca of Wikimedia.

26:20.400 --> 26:24.400
And I think on that, thank you for your time.

26:24.400 --> 26:27.400
And if you have any questions, I think we have just a little bit of time.

26:27.400 --> 26:28.400
Four minutes.

26:28.400 --> 26:31.400
And there was a question over there.

26:44.400 --> 26:46.400
That's right, so I'll repeat the question.

26:46.400 --> 26:52.400
The question was, for the postgres, is the data actually stored on the network drives,

26:52.400 --> 26:55.400
or is it local to the Kubernetes workers?

26:55.400 --> 27:02.400
At the moment, it is stored on Ceph, in RBD, using RBD replication.

27:02.400 --> 27:09.400
So they are block devices from Ceph that are provided to the pods, so that the pods can

27:10.400 --> 27:13.400
read and write at very low latency speeds.

27:13.400 --> 27:18.400
We are aware that there could be a write amplification problem.

27:18.400 --> 27:23.400
So if we write a lot of stuff to the postgres database, it's writing it to two database instances,

27:23.400 --> 27:26.400
which themselves have three replicas of the copy.
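The amplification is easy to put numbers on. Assuming, as described, two Postgres instances each writing into a Ceph pool with three-way replication, one logical write becomes six physical writes, before even counting WAL and backup traffic:

```python
def physical_writes(logical_units: int, pg_instances: int = 2, ceph_replicas: int = 3) -> int:
    """Write amplification: the data is replicated by Postgres, and then each
    Postgres copy is replicated again by Ceph."""
    return logical_units * pg_instances * ceph_replicas

# 1 MiB written by a task turns into 6 MiB across the cluster.
assert physical_writes(1) == 6
```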

27:26.400 --> 27:33.400
So, you know, we need to make sure that what we don't do is overuse the network resource

27:33.400 --> 27:38.400
to try to give us better postgres availability.

27:38.400 --> 27:41.400
But we're aware of that situation, so we're watching it.

27:41.400 --> 27:48.400
And if that happens in future, then we may move those postgres databases to the Kubernetes

27:48.400 --> 27:54.400
workers themselves, for example.

27:54.400 --> 27:58.400
So the question was, have we considered migrating Spark to Kubernetes?

27:58.400 --> 27:59.400
Yes.

27:59.400 --> 28:03.400
I actually had it on the slides before, and I removed it.

28:03.400 --> 28:04.400
But it's a fair point.

28:04.400 --> 28:11.400
A lot of what Airflow is doing is, like we mentioned, is orchestrating Spark jobs.

28:11.400 --> 28:19.400
Right now, we have an enormous yarn cluster and a small, pretty small Kubernetes cluster.

28:19.400 --> 28:24.400
The more we can migrate the Hadoop workers to be Kubernetes workers,

28:24.400 --> 28:32.400
the better it will be, because we will only have a single orchestration layer.

28:32.400 --> 28:38.400
And this is something that we can do now that we couldn't do before.

28:38.400 --> 28:45.400
We could use the Spark Kubernetes executor, so that instead of running the Spark driver on YARN

28:45.400 --> 28:51.400
and the Spark application on YARN, we could actually just run the Spark driver

28:51.400 --> 28:54.400
and the application on Kubernetes itself using pods.
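That option mostly comes down to changing the master URL in the submit command and pointing Spark at a (self-built) container image. A hedged sketch, with placeholder API-server host and image names; the flags themselves are standard Spark-on-Kubernetes ones:

```python
def spark_on_k8s_args(app: str, api_server: str, image: str) -> list[str]:
    """spark-submit arguments for running the driver and executors as pods.

    `api_server` and `image` are placeholders, not real endpoints.
    """
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",  # Kubernetes instead of YARN
        "--deploy-mode", "cluster",          # the driver runs in a pod too
        "--conf", f"spark.kubernetes.container.image={image}",
        app,
    ]

cmd = spark_on_k8s_args(
    "transform.py",
    "https://k8s.example:6443",
    "example-registry/spark:3.5",
)
```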

28:54.400 --> 29:00.400
We haven't really tested it yet, but it's something that is on our minds.

29:00.400 --> 29:08.400
Yeah, just to say on that point: we have, at the moment, a much larger YARN cluster,

29:08.400 --> 29:12.400
and we have Spark running with the YARN shuffler.

29:12.400 --> 29:18.400
So the jobs themselves still run on yarn and make use of the fact that we have the data locality.

29:18.400 --> 29:23.400
If we move Spark itself into Kubernetes, which we are now able to do,

29:23.400 --> 29:25.400
we have the Spark operator.

29:25.400 --> 29:29.400
We would lose that data locality benefit.

29:29.400 --> 29:37.400
So we're trying to get to the point where it's more flexible that certain Spark jobs

29:37.400 --> 29:42.400
could run in Kubernetes, and certain ones could run on yarn.

29:42.400 --> 29:47.400
And we're trying to make sure that Docker containers can be used throughout for that.

29:47.400 --> 29:49.400
Yes.

29:49.400 --> 29:50.400
Thanks for your time.

29:50.400 --> 29:53.400
Are there any insights for us

29:53.400 --> 29:59.400
regarding upgrading Airflow, because it also has migrations and stuff, I know.

29:59.400 --> 30:04.400
Okay, so the question was, do we have any insights on how to upgrade Airflow,

30:04.400 --> 30:07.400
because there are certain migrations and things?

30:07.400 --> 30:13.400
As it's got a Flask core and then SQLAlchemy,

30:13.400 --> 30:15.400
the, er, Alembic.

30:15.400 --> 30:16.400
Alembic, yes.

30:16.400 --> 30:20.400
The database migrations have not been a big problem for us.

30:20.400 --> 30:24.400
I think we are pretty aggressive about upgrading.

30:24.400 --> 30:27.400
I tend to, because we have a testing instance,

30:27.400 --> 30:30.400
we can see pretty quickly whether it fails or not,

30:30.400 --> 30:35.400
but whenever I see an announcement on the Airflow Slack channel,

30:35.400 --> 30:37.400
I try to upgrade to the new version.

30:37.400 --> 30:41.400
My experience is that it's been pretty successful,

30:41.400 --> 30:44.400
but your mileage may vary.

30:44.400 --> 30:49.400
And as I say, we are still in the throes of this big migration.

30:49.400 --> 30:53.400
We had hoped that we have been able to come here and just talk about all the successes,

30:53.400 --> 30:56.400
but we are still halfway through.

30:56.400 --> 31:00.400
So we have one environment where we build Airflow as a Debian package,

31:00.400 --> 31:03.400
and we have another environment where we build it in a container.

31:03.400 --> 31:07.400
We sometimes have problems keeping those two environments in sync,

31:07.400 --> 31:13.400
and also we need to make sure that DAG authors have an Airflow development environment

31:13.400 --> 31:17.400
that they can use and run their unit tests and this kind of thing.

31:17.400 --> 31:27.400
So yeah, there are pitfalls, but in terms of the actual production service upgrades,

31:27.400 --> 31:31.400
we have not run into serious problems with any database migration.

31:31.400 --> 31:33.400
I think we are out of time.

31:33.400 --> 31:34.400
Thank you.

31:34.400 --> 31:35.400
Thank you.

