WEBVTT

00:16.600 --> 00:38.600
Okay, hello everyone. Thanks for being here. It's my first time here, so I'm very happy to be here as well.

00:38.600 --> 00:49.600
So my name is Antony Chazapis. I'm Greek, from Greece. My institution is a research institute called FORTH.

00:49.600 --> 00:56.600
And today I'm going to talk about the work we are doing on running Kubernetes workloads on HPC.

00:56.600 --> 01:02.600
Hopefully I'll finish on time and we can have a discussion at the end.

01:02.600 --> 01:13.600
So this is a picture of hardware. The question is what do you run on this hardware? Do you run your cloud workloads or your HPC workloads?

01:13.600 --> 01:21.600
And what is happening now most of the time is that you have partitions and you can run some stuff on one part, some stuff on the other.

01:22.600 --> 01:47.600
Okay, I won't go into the details; most of you know that the workloads are pretty different, right? So the guys on the left run batch jobs, they run tightly parallelized code, they run binaries. The guys on the right package their code in containers, they use frameworks, and they mostly run webs of microservices and that kind of stuff.

01:48.600 --> 02:00.600
But what we want, okay, is not to make them all run the same stuff; we want to kind of embrace the diversity and run hybrid workloads.

02:00.600 --> 02:13.600
So here it is. The idea is that you have a big pipeline, and some parts are like legacy code, important code, and they run via Slurm.

02:13.600 --> 02:27.600
For some parts you may have found some, you know, Kubernetes tool that does the computation, or you can have Spark or, I don't know, PyTorch, whatever, and you want to run it all on the same machine.

02:27.600 --> 02:36.600
So the common approach is just to bridge the two environments: you run one part in one environment and the other part in the other. There are many tools that do this.

02:37.600 --> 02:55.600
I have an extensive related work section in the paper, which I'll show you at the end, but the problems with this bridging are clear: you have to move your data around, and there are many restrictions in both environments.

02:55.600 --> 03:05.600
So, for example, the HPC environments are mostly, you know, maybe air-gapped; they have very strict restrictions for access, moving data, and things like that.

03:05.600 --> 03:15.600
In another project, we tried to run everything on Kubernetes, so this is kind of interesting.

03:15.600 --> 03:27.600
We ran containers in Kubernetes that simulated an HPC environment, so you had Slurm inside the containers, but that required the Slurm scheduler to coordinate with the Kubernetes scheduler.

03:27.600 --> 03:34.600
We tried that, it's pretty awful, but it kind of worked.

03:34.600 --> 03:49.600
So then we tried what I'm going to show you today, which is a way to run cloud software on HPC; actually, we run a mini cloud as a Slurm job.

03:49.600 --> 04:01.600
It may be that some of you are not familiar with what Kubernetes is. I won't go into much technical detail, but I have to talk a bit about it, so we can go through the rest of the slides.

04:01.600 --> 04:21.600
So in the cloud ecosystem, everyone has their code packaged in containers, so we've solved the deployment problem, and now we have to solve the runtime problem: when you have hundreds or thousands of machines, someone has to run your containers.

04:21.600 --> 04:27.600
So you need kind of an operating system. I like to call it an operating system, because it does what operating systems do.

04:27.600 --> 04:46.600
It takes care of your resources, it runs your jobs, it handles networking and failures. So you just describe your containers, and then you hand them over to some API, and it runs them on whatever resources are available, even if those resources are distributed.

04:46.600 --> 05:01.600
So Kubernetes provides an abstraction; it has a lot of primitives and conventions for how to write these descriptions, and it fits the DevOps, let's say, culture, because DevOps is not a job description, it's a culture.

05:01.600 --> 05:12.600
So you can use the same environment for development; you write these nice descriptions, and then you just fire them up to the cloud.

05:12.600 --> 05:30.600
And there are many, many different packagings of Kubernetes, so you can run it on a Raspberry Pi, you can run it on Amazon, and there are many, many extensions and third-party tools. This is very interesting, because with this work we can leverage those tools in the HPC context.

05:30.600 --> 05:47.600
Now, in Kubernetes, you declare what you want to do, so you give, like, the recipe, and it is run at some point. You don't type in commands like run this or run that; you say, I want these three things running.

05:47.600 --> 06:16.600
I said that there is an API and you have abstractions. So you run pods, which are collections of containers; these are organized in deployments; and there are also services, jobs, etc. The structure is a typical distributed system: you have one head node, or one, let's say, master, that takes the commands, and then you have agents or workers that execute the containers.
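
As a side note, not from the talk itself: a minimal sketch of such a pod description, written here as a plain Python dict rather than YAML. The image name and resource values are made up for illustration.

```python
# A minimal Kubernetes-style pod description as a Python dict.
# Field names follow the Kubernetes API conventions; the image and
# resource values are hypothetical examples.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "hello-pod"},
    "spec": {
        "containers": [
            {
                "name": "main",
                "image": "busybox:latest",
                "command": ["echo", "hello"],
                "resources": {"requests": {"cpu": "2", "memory": "4Gi"}},
            }
        ]
    },
}

# The declarative model: you state what should run; the control plane
# (scheduler, kubelet) decides where and how.
print(pod["metadata"]["name"])
```

This is the "recipe" handed to the API server; the scheduler and the kubelets then make it happen.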

06:17.600 --> 06:40.600
For the image on the right, I will go into, not a deep dive, but a shallow dive on what these components do. Basically, the API server is the point where all the commands enter, and it's the point where all the different components coordinate. That's very interesting, because in Kubernetes you can replace any component with another component, as long as it speaks the same language to the API server.

06:40.600 --> 07:09.600
The API server needs some persistent storage, which it gets from the key-value store etcd. And then you have various components; we call them controllers, because they just talk to the API, but they do specific jobs. So you have this guy that implements all the different abstractions, you have the scheduler, which selects the nodes, you have a DNS for discovery of services, and many different storage controllers.

07:09.600 --> 07:38.600
Then there are the things that run on the actual machines. You have the kubelet, which is the most important in today's discussion; it's the agent that runs on the node and implements the lifecycle of pods. And then the other stuff, let's not talk much about it; it's mostly for networking. There are two different networks: you have the actual pod network, and then you have the service network, which is like pointers to the services.

07:39.600 --> 07:55.600
So to run a job, the user gives the description to the API server, this propagates to the different controllers, and at some point the pod is sent to a kubelet to be run in the runtime.

07:55.600 --> 08:14.600
Now in HPK, we want to change this flow. We want to run all of this as a user, as a user job, without root permissions, which is kind of tricky.

08:14.600 --> 08:23.600
We would like all the abstractions to be available, as many as possible, at least those that don't depend on hardware features.

08:23.600 --> 08:37.600
But what's most important, we want to delegate all the resource management to Slurm. We want this to run as a job in Slurm, and this respects the organization's policies and complies with existing resource accounting.

08:37.600 --> 08:51.600
So you know what your users run. Most organizations that have Slurm have heavily invested in all the utilities around Slurm, and it scales across all the nodes.

08:51.600 --> 09:10.600
And we want to use Singularity as the container runtime, because it's kind of an established practice to have it installed in HPC. And of course, we want to make it easy for everyone, both administrators and users, meaning we want it to be as unintrusive as possible for the admins.

09:10.600 --> 09:19.600
So what can we do? Actually, we can take all the things that are in green, the boxes, and just run them as processes.

09:19.600 --> 09:25.600
They don't depend on anything; they're just user-level services, so we can just run them as processes.

09:25.600 --> 09:34.600
But these are the things we have to change: the scheduler cannot make decisions anymore, it has to pass the decisions through to Slurm.

09:34.600 --> 09:52.600
And most of the things that have to do with networking have to be changed. But the most important thing is the kubelet, because that's the actual worker node in the Kubernetes software.

09:52.600 --> 10:00.600
So the worker has to be some custom worker that does a different type of job.

10:00.600 --> 10:06.600
We can discuss more technical details, if you want, at some other point.

10:06.600 --> 10:20.600
So what we have done is this: this is your cluster on the right-hand side, this is your cluster manager, these are your cluster nodes, and this is the software that we package.

10:20.600 --> 10:32.600
So we package all the standard, let's say, Kubernetes components inside a container, and this is run as a Slurm job.

10:32.600 --> 10:49.600
Actually, everything is run as a Slurm job, but this is, let's say, one binary that runs. It includes some changes, but we don't change anything in the Kubernetes code; we don't patch anything, we just run the standard binaries.

10:49.600 --> 10:54.600
And we have a very custom kubelet, so that's like the secret sauce.

10:54.600 --> 11:18.600
So this guy here actually doesn't differ from a normal, let's say, Kubernetes, but it has a very peculiar worker node, this guy, that kind of represents a whole cluster.

11:18.600 --> 11:37.600
So when a command comes in to run a workload, it goes to the API server and propagates through all the different components as it would. But then our scheduler just selects this node, and this node, which represents a whole cluster, creates a Slurm job to run the workload on the cluster.

11:37.600 --> 11:47.600
So it has to convert the containers that should run into a Slurm job and give it to the cluster to run.

11:47.600 --> 11:57.600
So practically, you can think of HPK as a filter, as a translator, between Kubernetes jobs and Slurm scripts.

11:57.600 --> 12:14.600
So somebody that runs on a cluster can ideally give either Slurm scripts directly or YAMLs through HPK, and both would be converted at the end for the job scheduler.

12:14.600 --> 12:35.600
So pods and the other YAMLs enter on one side, and Slurm scripts exit from the other side. There is a small technical detail here: pods, which are collections of containers, have some intricacies, like a pod has to have a unique IP address, et cetera.

12:35.600 --> 12:47.600
So actually, these Slurm scripts do not just run containers; they run a hierarchy of containers. They run a master container that represents the pod, and inside it run the other containers, which represent the pod's containers.

12:47.600 --> 13:03.600
And we also try to keep the resource requirements. So if you say this container of this pod needs, let's say, two CPUs and four gigabytes of memory, this resource allocation will go all the way down to Slurm, as your job description.
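
To make the translation idea concrete, here is a rough sketch, my own and not HPK's actual code, of how a simplified pod description could be mapped to a Slurm batch script; the function name and the details are hypothetical, and the pod sandbox and networking setup are omitted.

```python
def pod_to_sbatch(pod: dict, job_name: str = "hpk-pod") -> str:
    """Translate a simplified pod description into a Slurm batch script.

    Illustrative sketch only: the real HPK translator also sets up the
    pod sandbox, networking, and environment, which are omitted here.
    """
    # Sum the CPU and memory requests of all containers in the pod.
    cpus, mem_mb = 0, 0
    for c in pod["spec"]["containers"]:
        req = c.get("resources", {}).get("requests", {})
        cpus += int(req.get("cpu", "1"))
        mem = req.get("memory", "0Mi")
        if mem.endswith("Gi"):
            mem_mb += int(mem[:-2]) * 1024
        elif mem.endswith("Mi"):
            mem_mb += int(mem[:-2])
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_mb}M",
    ]
    # One Singularity invocation per container; in HPK these would run
    # inside a parent container that provides the pod's unique IP.
    for c in pod["spec"]["containers"]:
        lines.append(f"singularity run docker://{c['image']}")
    return "\n".join(lines)
```

A pod requesting two CPUs and four gigabytes would thus produce `#SBATCH --cpus-per-task=2` and `#SBATCH --mem=4096M` directives, so the Kubernetes-level request really does reach Slurm's accounting.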

13:03.600 --> 13:10.600
HPK runs as a user process, so our users can run their workloads either way.

13:10.600 --> 13:22.600
HPK itself is a job, but it doesn't need a special allocation; I mean, it's very lightweight. And little support is needed from the environment.

13:22.600 --> 13:30.600
Basically, it's some Singularity configuration, because we need Singularity to give unique IPs to these pods.

13:30.600 --> 13:38.600
This is something we tried to do ourselves with user-level networking, but it was too difficult and too complicated.

13:38.600 --> 13:50.600
So we delegate the job to Singularity; there are established projects that do that, and it's easier to convince an admin to install someone else's code that has a big community behind it.

13:50.600 --> 13:55.600
Rather than something that's just a research project.

13:55.600 --> 14:02.600
So an example of what you can run: this is a very popular tool in the cloud-native world.

14:02.600 --> 14:12.600
It's called Argo. It's a workflow language, in general, where you can run workflows, and it has a nice UI where you can monitor your workflows.

14:12.600 --> 14:24.600
This is actually an image from a workflow that does a DNA analysis: it breaks down the different strands and runs a branch for each of the different strands.

14:24.600 --> 14:28.600
It has all the features you would expect from such a language.

14:28.600 --> 14:32.600
So assume you have this kind of workload, right?

14:32.600 --> 14:39.600
Somebody has written it in this format, and you can now run it on your Slurm cluster.

14:40.600 --> 14:44.600
And we also did something that's kind of a hack.

14:44.600 --> 14:54.600
We added some extensions, some patches: you can give your job in a YAML and pass flags through, down to Slurm.

14:54.600 --> 15:04.600
This was kind of a neat extension, so you could say to Argo, I want you to run this job, which is an MPI job,

15:04.600 --> 15:09.600
but pass these flags to Slurm directly when it gets to run.

15:09.600 --> 15:17.600
So you could use Argo to run your MPI stuff integrated with your other stuff as well.
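
As a hedged sketch of how such a pass-through could look, the extra flags could ride along in the pod's metadata. The annotation key `hpk/slurm-flags` and the function below are my invention for illustration, not necessarily HPK's actual interface.

```python
def slurm_flags_from_annotations(pod: dict) -> list:
    """Extract pass-through Slurm flags from a pod's annotations.

    Hypothetical sketch: the annotation key is made up for illustration.
    """
    ann = pod.get("metadata", {}).get("annotations", {})
    raw = ann.get("hpk/slurm-flags", "")
    return raw.split()

# An MPI step, as it might be written in an Argo workflow, asking for
# 4 nodes and 64 tasks directly from Slurm.
mpi_pod = {
    "metadata": {
        "name": "mpi-step",
        "annotations": {"hpk/slurm-flags": "--nodes=4 --ntasks=64"},
    },
    "spec": {"containers": [{"name": "mpi", "image": "my-mpi-app"}]},
}

print(slurm_flags_from_annotations(mpi_pod))
```

The translator would then append these flags to the generated batch script, so an Argo workflow can drive MPI steps with native Slurm placement.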

15:17.600 --> 15:30.600
Now, in the paper, and this is what the paper describes, we have reached the point where we can run more complicated stuff.

15:30.600 --> 15:37.600
So we fire up HPK, and then we install several things using Helm.

15:37.600 --> 15:46.600
If you know it. So we install Jupyter, we install different controllers, MinIO; this is all standard cloud-native software.

15:46.600 --> 15:57.600
And then we fire up the Jupyter notebook and we run some kind of AI tuning workflow, let's say, script.

15:57.600 --> 16:01.600
And this runs inside Slurm.

16:01.600 --> 16:08.600
Also, there was an interesting collaboration with Jülich, as part of the DEEP project, which has ended.

16:08.600 --> 16:22.600
These are EuroHPC JU projects, as described by the previous speaker, where we ran some Spark jobs. This was just a proof of concept,

16:22.600 --> 16:31.600
that we can run it somewhere else. The most difficult part was to prepare the testbed and coordinate with the administration team.

16:31.600 --> 16:45.600
We had to disable the exclusive-node policy in Slurm, because this policy means that even if you ask Slurm for one CPU, it gives you a whole node with 256 CPUs.

16:45.600 --> 16:53.600
So there are some things, but it has been running in an environment outside FORTH.

16:53.600 --> 16:59.600
So the vision is that the cloud users can exploit the HPC hardware.

16:59.600 --> 17:12.600
And as, again, the previous speaker showed you, there's a big movement in Europe to enlarge this HPC ecosystem; we're buying more machines.

17:12.600 --> 17:27.600
There's a big machine coming in Greece as well. So we expect that there will be users that have never touched HPC and have never seen it, and we will now have all this hardware; how can they use it, right?

17:27.600 --> 17:35.600
So from an HPC user point of view, you have all this cloud software that does things very, very easily, right?

17:35.600 --> 17:50.600
You can run this operator and do AI training or whatever; it seems easy. Or you can use a lot of software, a lot of frameworks like Argo or other frameworks already available.

17:50.600 --> 17:55.600
And of course Jupyter, everybody loves Jupyter. So it's one way to run it.

17:55.600 --> 18:16.600
And from the HPC center point of view, it's a new way to run your workloads, somehow. I don't know if this will avoid the second partition, but at some point, maybe, things get more standardized, at least from the administration point of view.

18:16.600 --> 18:40.600
Okay, so that was my talk. I was five minutes short. The code is available and is actively being worked on; you can find it on GitHub. This is also a good opportunity to explore the other projects we have in our lab.

18:40.600 --> 18:53.600
And we have also a paper online, which you can download for free and read. And I want to thank our sponsors, which are the EuroHPC JU projects here.

18:53.600 --> 18:55.600
Thank you.

18:55.600 --> 19:23.600
Thank you. Great job. It's super interesting. I was curious: you mentioned you installed other controllers. Could you install, like, a GitHub CI controller on the Kubernetes side and then run the actual pods through Slurm on the HPC system?

19:23.600 --> 19:26.600
Would that work?

19:26.600 --> 19:40.600
Yeah. So the question was, what if I have some other software to install? Actually, we don't care what software you install, so it should work. There will be bugs.

19:40.600 --> 19:50.600
It's not mature, let's say, software, but we have installed operators from the Kubeflow project. We have installed, I don't know.

19:50.600 --> 19:59.600
We have installed other things. And at the end, all of this stuff that gets installed is translated to containers that need to run.

19:59.600 --> 20:15.600
And the kubelet just does that, right? It gets an internal description of what containers should be run, and runs them. So why not? Probably it will run.

20:15.600 --> 20:27.600
Thank you for that. Does HPK allow the Kubernetes cluster to span over several physical nodes?

20:27.600 --> 20:36.600
And once it spans, will the pods run on different machines?

20:36.600 --> 20:44.600
Okay. So the question was, does HPK take advantage of multiple nodes, and what's the speed of communication between the nodes?

20:44.600 --> 20:57.600
Right. So, from the Kubernetes point of view, you have one node that spans the whole cluster.

20:57.600 --> 21:09.600
So your jobs in Kubernetes think that they run on one node that has, like, 14,000 GPUs and 50,000 CPUs, I don't know what.

21:09.600 --> 21:23.600
So it spans; you can just tell large jobs to run. It takes advantage of all the nodes, and the communication at the end uses what the nodes themselves use, right?
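
The idea of one giant virtual node can be sketched like this; my own illustration, with arbitrary resource numbers: the HPK kubelet would advertise the sum of the cluster's resources as the capacity of a single Kubernetes node.

```python
def aggregate_capacity(nodes):
    """Sum per-node resources into one virtual-node capacity, the way a
    single 'cluster node' would be advertised to Kubernetes.
    Field names and numbers are illustrative, not HPK's actual schema.
    """
    total = {"cpu": 0, "gpu": 0}
    for n in nodes:
        total["cpu"] += n["cpu"]
        total["gpu"] += n["gpu"]
    return total

# A hypothetical 100-node cluster of 256-CPU, 4-GPU machines appears
# to Kubernetes as one node with 25600 CPUs and 400 GPUs.
cluster = [{"cpu": 256, "gpu": 4} for _ in range(100)]
print(aggregate_capacity(cluster))
```

Scheduling within that capacity is then Slurm's problem, which is exactly the delegation the talk describes.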

21:23.600 --> 21:34.600
So it doesn't interfere with the runtime of the jobs, with the runtime environment. I mean, where they run and how they communicate, it doesn't interfere with that at all.

21:34.600 --> 21:42.600
So ideally, it takes advantage of your whole cluster.

21:42.600 --> 21:53.600
The speed of the network, we don't change that.

21:53.600 --> 22:09.600
Probably, when I said we need this plugin in Singularity to hand out IPs, if these are placed on the fast network, it will use the fast network.

22:09.600 --> 22:10.600
Yes?

22:10.600 --> 22:16.600
One thing that, I don't even know how you would do it, is BinderHub.

22:16.600 --> 22:26.600
The method I can think of for one to run it on HPK, would that work?

22:26.600 --> 22:33.600
Has that been thought of, and trying to get into the container registry, could it be done?

22:34.600 --> 22:42.600
So the question was, how could we run BinderHub? I'm not aware of BinderHub; I'll check it out.

22:42.600 --> 22:48.600
But we have run, I think, a lot of different complicated things.

22:48.600 --> 22:53.600
I don't see any reason why not. If it runs in Kubernetes, it should run on HPK.

22:53.600 --> 22:55.600
So that's possible.

22:56.600 --> 23:11.600
There are some issues at this point, practically. When you run something in Slurm, you normally have time limits or things like that.

23:11.600 --> 23:24.600
So you can't run a container registry forever. You usually have to package everything and run it as part of one job, one thing.

23:24.600 --> 23:31.600
So it's not for running Apache forever; they won't allow you that from the HPC center, right?

23:31.600 --> 23:34.600
Or a container registry forever.

23:54.600 --> 24:02.600
I didn't get the question.

24:02.600 --> 24:13.600
Oh, okay. Well, practically, yeah. So the question, if I heard it correctly, was can I run multiple such HPK instances,

24:13.600 --> 24:16.600
multiple Kubernetes clusters, right?

24:16.600 --> 24:17.600
I don't know.

24:17.600 --> 24:21.600
Multiple instances of the control plane.

24:21.600 --> 24:26.600
What would that look like?

24:26.600 --> 24:27.600
You'd have multiple instances of the control plane in your space.

24:27.600 --> 24:30.600
Multiple instances of HPK.

24:30.600 --> 24:33.600
Okay, the control plane.

24:33.600 --> 24:40.600
Well, I, you could. I don't know. I don't see why you would need to do that.

24:40.600 --> 24:45.600
This whole thing is run for one user, right?

24:45.600 --> 24:48.600
And one user can run it multiple times as well.

24:48.600 --> 24:50.600
That's fine.

24:50.600 --> 24:52.600
Thank you.

24:52.600 --> 24:56.600
Thank you.

24:56.600 --> 24:57.600
Yes.

24:57.600 --> 25:07.600
Thank you.

