WEBVTT

00:00.000 --> 00:12.400
Hello everyone, my name is Hendrick. I work at CSC, which is the Finnish National Supercomputing

00:12.400 --> 00:19.920
Center. And I'm going to speak to you about the EuroHPC Federation Platform, where I currently

00:19.920 --> 00:26.600
now function as the technical coordinator for the whole solution. Now for those who are

00:26.600 --> 00:31.880
not very aware of the ecosystem, I'll start by stating what the EuroHPC JU is and what we are

00:31.880 --> 00:37.880
actually federating. I'm going to give a brief overview of the challenges the users are facing, which

00:39.080 --> 00:45.000
then drive the design of the service. I'm going to talk about the architecture, the components,

00:45.000 --> 00:52.920
and highlight the open source aspects of the project and service. So for those who do not know,

00:52.920 --> 00:59.480
the EuroHPC JU is a joint initiative between the EU, European countries and some other actors.

00:59.480 --> 01:06.760
But essentially they are the entity which then co-funds really large machines in collaboration with

01:07.720 --> 01:13.480
European countries. The most important part is whoever is managing the infrastructure, the

01:13.480 --> 01:17.800
supercomputer, that's a hosting entity. I'm going to talk about them a lot and you're going to see

01:17.880 --> 01:26.040
the abbreviation HE multiple times in the presentation. And so for example on the right we have

01:26.040 --> 01:34.680
Leonardo in Italy, where the hosting entity would be CINECA, it has many GPUs, and BSC to the left

01:34.680 --> 01:43.720
with MareNostrum 5. The users of the systems, as with most national supercomputing infrastructure,

01:43.800 --> 01:52.040
are academic researchers, research institutes, public authorities, and industry. This is kind of a

01:53.400 --> 01:59.000
maybe a more novel point on the HPC side, where at a lot of universities it's just for research:

01:59.000 --> 02:06.920
about 20% of the compute capacity for the systems is kind of reserved for SME usage.

02:07.880 --> 02:13.560
Okay, so the actual systems we are targeting with this federation, which are going to be accessible

02:13.560 --> 02:21.240
via the platform. So we have around 12 classic supercomputers: big Linux boxes, large parallel

02:21.240 --> 02:27.880
file systems, loads of local users; a nightmare for anyone who gets the hiccups when you say

02:27.880 --> 02:35.480
that we have 3,000 untrusted users with shell access on my machine and that's my status quo.

02:36.440 --> 02:43.640
Then after the classic supercomputers, there are these AI factories, which are essentially just

02:43.640 --> 02:51.320
AI-optimized supercomputers with a bunch of added functionality. They do not exist yet.

02:52.920 --> 03:02.200
About seven were announced at the end of 2024; probably a lot of GPUs, but then more features

03:02.200 --> 03:08.520
around data management, where the previous talk illustrated some of the issues, and then more cloud-like

03:08.520 --> 03:14.520
features, let's see what actually emerges. Then for more experimental scientific usage,

03:14.520 --> 03:20.760
which we will also federate, there are the various procured quantum computers for European researchers

03:20.760 --> 03:27.320
to test out different architectures and technologies to see which one is going to come out on top.

03:27.800 --> 03:33.800
If you want more information, you can go to the EuroHPC JU's page, where they have nice info on

03:33.800 --> 03:40.840
what's up and running and what's up and coming. Okay, so the current issue if you are a researcher

03:40.840 --> 03:48.680
or industry or something and you want to use EuroHPC JU resources is that each of these now currently

03:48.680 --> 03:57.080
12 different HPC sites have completely different identities, projects, and onboarding methods.

03:57.400 --> 04:01.800
So this means that if you're granted compute access for the computer in Italy,

04:01.800 --> 04:08.040
okay, you register an account, you send a copy of your passport, you fill in a bunch of stuff,

04:08.040 --> 04:12.840
you get a project, you add people there. Okay, then you need to do some other research. Now,

04:12.840 --> 04:21.320
you got resources on LUMI, which is in Finland. Again, register, completely different process.

04:21.320 --> 04:28.200
It does not know anything about you. So then when people start using multiple of these machines,

04:28.200 --> 04:35.320
it's really a hassle to keep redoing this and then especially managing large projects.

04:36.440 --> 04:42.920
For example, some of the largest projects on LUMI, they have 150 people in the project,

04:42.920 --> 04:48.120
then you might have 50 more people in Italy and you don't have a central way of managing that.

04:49.000 --> 05:00.040
We're seeing an increasing number of new user groups, which are not familiar with classical HPC,

05:00.040 --> 05:06.440
the Linux command line, or then are accustomed to a different level of abstraction for their compute.

05:07.160 --> 05:13.240
And of course, there is a great interest then to make sure they can effectively utilize the compute power.

05:13.320 --> 05:19.000
Not just dump them in the shell and say, have a nice day, because that's not how Europe is going to win,

05:19.000 --> 05:27.240
or at least participate in the AI race. Growing heterogeneity of the compute is a headache for

05:27.240 --> 05:32.760
everyone who is not in love with the hardware. They just want to do the science, but then we

05:32.760 --> 05:38.920
as a center keep switching it out: okay, now we bought AMD, yep, sorry, the software is way worse,

05:39.000 --> 05:43.400
but you're going to have to now do three months of extra work to continue your analysis.

05:45.000 --> 05:53.640
And again, to the topic of the previous talk, compute is way easier to move than data.

05:54.200 --> 06:01.320
So a lot of the time, one of the challenges when you want to utilize more resources on another system,

06:01.320 --> 06:05.960
or you're just allocated resources on a different system, it's not porting your code.

06:06.520 --> 06:12.520
It's that you need to move your data there, and for some science areas this is more of a problem:

06:12.520 --> 06:17.480
climate science, high energy physics, or something. If you suddenly want to move a petabyte,

06:17.480 --> 06:24.440
or something, it's not just a snap of the fingers. Okay, so to the actual solution, now that you know

06:24.440 --> 06:30.360
what's going on in the ecosystem. The Federation platform has a few key functionalities.

06:30.920 --> 06:39.160
The first one is one federated identity with single sign-on. This in practice would mean that an

06:39.160 --> 06:49.320
end user would use authentication provided by their home institution, probably their university, to then

06:49.320 --> 06:58.200
identify themselves to all the services provided by all the machines. Then a single point,

06:58.280 --> 07:04.760
a one-stop shop for resource allocation, management, and monitoring. So if you're a professor

07:04.760 --> 07:11.160
and you have multiple projects, multiple allocations on multiple machines, you have a single point of

07:11.160 --> 07:19.480
doing that. Direct access using SSH certificates. I will not cover these in detail,

07:19.480 --> 07:26.120
but essentially again, making sure there is the same way of accessing all the systems. So you don't

07:26.120 --> 07:34.280
need to teach an end user the different ways that SSH keys are managed on each particular system.

07:34.280 --> 07:38.040
It's more like: okay, I want to compute there. Okay, it's going to be the exact same, just

07:38.040 --> 07:46.520
tell me where you want to go. And then for the kind of less low-level compute oriented user

07:46.520 --> 07:52.280
communities, high-level features, like web-based access for Jupyter notebooks, following up

07:52.280 --> 07:57.480
what your simulation is doing, browsing some stuff, or just simple shell access.

07:58.520 --> 08:05.960
A federated software catalog, which is based on EESSI, more on that later, but leveling the

08:05.960 --> 08:12.680
field so that you can expect some sort of basic catalog of software on all the different systems. Again,

08:12.680 --> 08:19.880
not having to start from scratch every time. And then advanced features for workflows and data

08:19.880 --> 08:25.000
transfers. So again, for people not very familiar with the ecosystem: okay, I want to compute

08:25.000 --> 08:30.920
on that and that, and I'm going to need that data. Please go and execute my code for me. For the timeline,

08:30.920 --> 08:38.760
so we started work now in January a few weeks ago. And the plan is that this will be ready on

08:38.760 --> 08:46.600
nine systems by Q1 2026. So you can look forward to that, unless I mess up my job.

08:47.480 --> 08:53.800
And then just as information, it's a consortium of five partners, where CSC, where I work, is coordinating;

08:53.800 --> 09:00.120
other partners are NORDUnet, GÉANT, IT4I, Tartu, and the University of Ghent.

09:03.000 --> 09:09.800
Okay, now I've discussed the features. The high-level, really high-level picture you can keep in

09:09.800 --> 09:17.880
your head is that there is a platform, which we build, which provides all the fancy features and services,

09:17.880 --> 09:24.360
which I just described to you. And now the end user then either interactively clickety-clicks

09:24.360 --> 09:30.760
in their web browser on that platform, or then fetches API keys to do the more fancy stuff,

09:30.760 --> 09:38.040
the more programmatic stuff, or is allowed then to fetch SSH credentials to go via the command

09:38.360 --> 09:42.760
line. We're not removing any of the low-level stuff; that's still going to work, just be more streamlined.

09:43.880 --> 09:48.760
Now the secret sauce of the whole solution is of course the integration

09:49.560 --> 09:55.880
to the actual hosting sites. This talk does not cover that. I hope I can give that talk at some

09:55.880 --> 10:02.440
other point. But essentially, there will be a varying amount of glue making sure that data

10:03.240 --> 10:09.400
about users and allocations goes in the correct direction, so that the user then actually has something

10:09.400 --> 10:14.760
to touch on the system. Not drawn out, but of course, I mean data transfer is not going to take

10:14.760 --> 10:20.760
a round trip via our platform, which is probably going to be physically located in Northern Europe.

10:20.760 --> 10:26.040
That would be very slow. Data transfer and other things which do not need to go via the platform

10:26.120 --> 10:30.200
go directly from system to system, or then from the user to the system.

10:35.800 --> 10:40.440
So the architecture, if we're looking now, this is the central platform. So the thing in the

10:40.440 --> 10:47.160
middle in the previous picture. So essentially, it's a very modular and flexible architecture

10:47.240 --> 10:55.640
where separate components are tightly glued together with the SSO and single identity,

10:55.640 --> 10:59.480
but they're responsible for some particular feature of the system.

11:00.280 --> 11:04.920
And this is, I mean, we're heavily leveraging a lot of open source components, which I'm going

11:04.920 --> 11:11.880
to show next, and then to be able to accommodate all the future systems. Now we know how nine systems

11:12.200 --> 11:17.320
look, but there are three systems coming online, seven AI factories, where we do not know what the

11:17.320 --> 11:22.280
tech stack will look like, and eight quantum computers where people haven't even decided what

11:22.280 --> 11:28.920
the interfaces are. So we're going to need a high degree of being able then to adapt to whatever

11:28.920 --> 11:34.920
somebody decides to build to be able to plug it into this. So there's the core platform, which we

11:34.920 --> 11:40.840
manage, and then the, again, glue components, the hosting entity components, which are responsible

11:40.920 --> 11:52.360
for connecting our various components. Perfect. Okay, so now I'm going to flash a bunch of components,

11:52.360 --> 11:59.000
some of these are maybe familiar to you, some are not, but I've left the links there. So the AAI

11:59.000 --> 12:06.040
is going to leverage GÉANT's MyAccessID, and then a few other components, which are still being built.

12:06.840 --> 12:13.000
Allocation is based on Waldur, or then you might have heard the term Puhuri, which is a kind of

12:13.560 --> 12:20.920
deployment of the Waldur software. Open OnDemand is probably the HPC technology which is most

12:20.920 --> 12:25.400
broadly used out of these ones. It comes from the States, and that would be responsible for the

12:25.400 --> 12:33.160
interactive clickety-click. Workflow is based on LEXIS and HEAppE. Reporting is your

12:33.240 --> 12:37.400
basic standard tech stack. There's Grafana, there's, say, OpenSearch.

12:39.640 --> 12:43.640
We have a help desk, because, again, we're not just building a software product,

12:43.640 --> 12:48.600
we're actually producing a service, which means we'll have users, who will need help.

12:49.720 --> 12:55.240
And then the software catalog, as I briefly mentioned, is going to be based on EasyBuild and EESSI

12:55.240 --> 13:02.520
via CernVM-FS. So most of this is open source, you can go look this up, you can go

13:02.520 --> 13:07.960
download them. Some of them are fairly complex to run, so you can't just do an install and

13:07.960 --> 13:13.000
expect them to be up and running, but if you're willing to put in the time, most of

13:13.000 --> 13:19.240
these things you can run yourself. A lot of them are already running separately on various systems.

13:21.800 --> 13:30.440
Good. Okay, I'm going to give an extremely brief overview of the main components.

13:30.680 --> 13:38.120
Okay, so the AAI, as I said, utilizes MyAccessID. For those not in the know,

13:38.120 --> 13:45.160
that is essentially, super simplified, an AAI proxy with a discovery service and an account

13:45.160 --> 13:51.960
registry. What that means is that when a user logs in, they can search for their home institution,

13:51.960 --> 13:57.240
for example, there's about 3,000 hits for something named University of Blah,

13:57.880 --> 14:03.320
then they are presented with, for example, that university's login screen, and okay,

14:03.320 --> 14:10.760
now they are registered. Now MyAccessID knows that there is some unique ID which identifies them with

14:10.760 --> 14:17.080
this identity provider, then there's fancy things for linking if you have multiple identity providers

14:17.080 --> 14:25.080
mapping to the same actual natural person and so on. For those who are doing the technical integration,

14:25.160 --> 14:34.360
essentially SAML or OIDC works out of the box. And this is going to provide an SSH certificate

14:34.360 --> 14:43.480
authority for the access. So users log in, get a short-lived SSH certificate, which is then

14:43.480 --> 14:51.240
trusted at the target sites, you can log in. 24 hours later, you have to reauthenticate and get

14:51.240 --> 15:02.600
another key to continue working. Perfect. Okay, let's have to speed up. Okay, allocations,

15:02.600 --> 15:09.800
this is based on Waldur, as I said, a unified place where you can view your projects, your project

15:09.800 --> 15:16.600
memberships, and to what systems the project grants you access and how many compute hours you have left.

15:17.160 --> 15:23.960
The actual granting of the project is not directly part of our platform, that's the responsibility

15:23.960 --> 15:32.600
of another component being procured, the EuroHPC peer review platform, which should be online somewhere

15:32.600 --> 15:42.840
around summer, if I recall correctly. Workflow is kind of your entry point for doing easy to use

15:42.840 --> 15:49.320
managed workflows, for selecting, like: I have an application, it needs this input. I can now

15:49.320 --> 15:56.440
graphically connect these and it will take care of running it then on a single system or then

15:56.440 --> 16:02.920
over multiple systems. It also supports then this kind of data staging aspect that, okay, if your

16:02.920 --> 16:11.640
data is on system X, we'll move it somehow to system Y, depending on how much functionality the

16:11.720 --> 16:17.240
various hosting entities provide for moving data. Do they have iRODS for staging? Perfect,

16:17.240 --> 16:23.560
this is going to be really quick. Do you only provide SCP through a login node? It'll work,

16:23.560 --> 16:27.960
but you'll have to spend a bit more time in the coffee room waiting for us to transfer the data.

16:28.680 --> 16:34.760
And this is able then to target, I mean, Slurm, other batch schedulers, or then Kubernetes,

16:34.760 --> 16:38.440
which is most likely going to be more relevant with the AI factories.

16:42.200 --> 16:50.840
Okay, Open OnDemand, interactive: you can get desktops, Jupyter notebooks, a shell. For those

16:50.840 --> 16:56.120
who have used Open OnDemand, we're not going to use the vanilla Open OnDemand. We're going to

16:56.120 --> 17:01.160
develop functionality to move it further away from the system. Currently Open OnDemand will

17:01.160 --> 17:07.160
essentially sit in your HPC center. We don't want to do that. That's too invasive. So we're going to

17:08.120 --> 17:13.400
do several contributions to be able to move it further away from the system, which might also

17:13.400 --> 17:20.440
be useful elsewhere. There will be a help desk, and reporting will be able to show you how many

17:20.440 --> 17:26.040
resources you are consuming and hopefully how many resources one of your postdocs burnt on

17:26.040 --> 17:32.920
something stupid, and then you can go and knock on their door. The software catalog: there are multiple

17:33.000 --> 17:38.520
people here in the room who know EESSI way better than me, but essentially we're going to use EESSI

17:38.520 --> 17:47.560
to provide a pseudo-unified stack with at least some of the software and compilers, so that

17:47.560 --> 17:52.760
the user does not need to care specifically what architecture they're running on, are they running

17:52.760 --> 17:59.240
on AMD or are they running on NVIDIA? It's just: okay, I want to use PyTorch, module load PyTorch.

17:59.640 --> 18:11.400
EESSI fixes that. Okay, so last two minutes on the open source aspect: this would not be possible

18:11.400 --> 18:19.080
without open source. If you consider the scope of the functionality and the implicit understanding

18:19.080 --> 18:25.560
the implemented technologies have of user requirements, starting from scratch and compiling that

18:25.560 --> 18:33.080
into something sensible would be an extremely large job. Federation, of course, requires a degree

18:33.080 --> 18:39.560
of trust that, hey, our stuff works. With open source it's: okay, somebody else is running it,

18:39.560 --> 18:45.240
you can have a look at it, you can go and check it out. It's much easier than here is my binary

18:45.240 --> 18:50.840
blob, please install it so that stuff works. And then also on the promised features, because

18:51.720 --> 18:56.760
now I've shown you a bunch of stuff. I'm promising you very many things which can be done.

18:57.320 --> 19:01.160
If this was a closed-source solution, which wasn't used anywhere and you can't

19:02.280 --> 19:08.200
look into it, you would probably think that he's trying to, I mean bullshit us. There's no way,

19:08.200 --> 19:13.560
but now you know kind of, yes, that is an open source component. It works there. I understand why it

19:13.560 --> 19:22.360
works. This could also work. And users would probably rather integrate with fewer things

19:22.360 --> 19:28.360
than more things. And then of course, we need to do heavy modifications to be able to integrate.

19:28.360 --> 19:33.400
I mean, glue everything together. If this was closed source, not a chance.

19:34.280 --> 19:39.720
And thirty seconds for the last thing. Yes. So hopefully the benefit for the rest of the community

19:39.720 --> 19:46.360
would be that we try to upstream what makes sense. Everything does not make sense, but we have

19:47.960 --> 19:53.720
fortunately the ability to continuously work with the upstream and propose features,

19:53.720 --> 19:58.280
instead of doing this involved, which is sometimes the case with these kind of projects,

19:58.280 --> 20:04.440
when there are some IPR-related restrictions. For a lot of projects, the integration

20:04.760 --> 20:11.880
is the hard part. So I'm hoping that when we work out the integration with 20 plus systems,

20:12.520 --> 20:19.240
we're able to upstream those changes, or at least document that these are things you can encounter

20:19.240 --> 20:24.680
during the integration. And then the final last point, of course, is that it's a large user base

20:24.680 --> 20:29.960
for the projects. You get bug testing and feedback. For certain projects, it's also a very good

20:29.960 --> 20:34.920
measure of impact. Hey, we built such a good project that they've adopted it and opened it

20:34.920 --> 20:40.280
to 3,000 European scientists. Okay, thanks. Questions?

20:47.480 --> 20:54.840
Thank you, Henry. Any questions? Yes. All right. Is this something that's only meant for the

20:54.840 --> 21:03.240
big players in the HPC ecosystem, or can we expect to have small clusters, university clusters,

21:03.800 --> 21:10.680
also let into this federation? Yes. So the question was whether this is intended only for the big players.

21:11.400 --> 21:18.840
In the long run, this is meant to also facilitate that: if you want to connect smaller,

21:18.920 --> 21:23.640
national or regional things to this, that will be possible.

21:28.520 --> 21:33.400
You're mentioning that you're going to use certificate authorities, because everybody said

21:33.400 --> 21:38.360
you have to use MFA even for SSH, you see people on it. But the thing that keeps on popping up

21:38.360 --> 21:43.640
on our side is that there are a lot of people actually SSHing into the cluster with scripts,

21:44.280 --> 21:49.480
and the short-lived certificates are an issue. You already have a solution to that, and I'm going

21:49.480 --> 21:56.760
to try to protect that network connection framework. So the question was how the certificate

21:56.760 --> 22:03.480
lifetime of the SSH connections affects certain user workloads. So yes, some of the things you

22:03.480 --> 22:09.080
will be able to do with the workflow stuff, and the hope is that we would get rid of these

22:09.080 --> 22:16.680
infinite-lifetime unsecured keys somewhere. Then, the Federation platform tries to be very

22:16.680 --> 22:25.160
agnostic of site policy. So then it's more up to the hosting entity: how long do you want

22:25.160 --> 22:32.680
your certificate to be. So if somebody decides we only want to give our HPC users 48 hours,

22:32.680 --> 22:38.920
fine, we'll dump out 48 hours, or then if, yeah, we're fine with three months, good.

22:39.880 --> 22:46.120
Personally, I would like to see a kind of dynamic shift where the longer your SSH certificate lifetime

22:46.120 --> 22:51.800
is, the more restrictions I add to it. So okay, you can get six months, but I'm going to

22:51.800 --> 22:57.960
restrict the IP, I'm going to turn off port forwarding, X11 forwarding. You want everything, all the

22:57.960 --> 23:02.760
fancy stuff. Okay, I'm just going to give you 48 hours. That's what I would like to see us

23:02.760 --> 23:05.640
going towards. It's a good compromise in my opinion.
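
NOTE
A minimal sketch of the lifetime-versus-restrictions trade-off described above. It is not the platform's implementation: the 48-hour threshold, the 180-day cap and every name here are assumptions made for illustration only.
```python
from dataclasses import dataclass
from datetime import timedelta
# Hypothetical thresholds, chosen only to mirror the numbers mentioned in the talk.
FULL_FEATURE_LIMIT = timedelta(hours=48)
MAX_LIFETIME = timedelta(days=180)
@dataclass(frozen=True)
class CertPolicy:
    lifetime: timedelta
    restrict_source_ip: bool
    allow_port_forwarding: bool
    allow_x11_forwarding: bool
def policy_for(requested):
    """Short-lived certificates keep every feature; long-lived ones are locked down."""
    if requested <= timedelta(0):
        raise ValueError("requested lifetime must be positive")
    if requested <= FULL_FEATURE_LIMIT:
        return CertPolicy(requested, False, True, True)
    # Longer than the limit: cap the lifetime and switch the extras off.
    return CertPolicy(min(requested, MAX_LIFETIME), True, False, False)
```
With OpenSSH, restrictions like these could presumably be attached at signing time through certificate options such as source-address, no-port-forwarding and no-x11-forwarding; whether a hosting entity would do it that way is an assumption.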

23:07.880 --> 23:08.520
More questions?

23:11.960 --> 23:18.920
Yeah. So is there also a plan of having scheduling in this layer? So we can maybe decide

23:18.920 --> 23:29.880
to send my code to the cheapest or most convenient site and copy the data,

23:29.880 --> 23:36.040
if you're previous to when the copy of those is our convenient way to do it.

23:36.040 --> 23:41.560
Yes, so that's an explicit, sorry. Yeah, the question was, is there going to be any kind of

23:41.560 --> 23:47.560
automated scheduling features built into the Federation platform? That's an explicit

23:47.880 --> 23:52.040
requirement for the Federation platform. It flashed by on one of the slides, I had it very, very

23:52.040 --> 23:58.280
small, but yes, there will be a smart scheduler, both available in the workflow and available as a service,

23:58.280 --> 24:04.360
which is: okay, please tell me what system I should run on if I want time to solution or energy

24:04.360 --> 24:10.520
to solution or you have some other specific requirements. Usually, probably: please tell me which

24:10.600 --> 24:14.200
system is the least used, because that's probably going to get you the quickest answer.
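
NOTE
A rough sketch of what asking such a scheduler service might boil down to: pick the system that minimises the objective you care about. The class, the field names and the three objectives are hypothetical and only mirror the wording of the answer above.
```python
from dataclasses import dataclass
@dataclass(frozen=True)
class SystemEstimate:
    name: str
    queue_hours: float   # expected wait in the batch queue
    run_hours: float     # expected runtime on this system
    energy_kwh: float    # expected energy for the whole job
    utilisation: float   # current load, 0.0 (idle) to 1.0 (full)
# Each objective maps an estimate to a cost; lower is better.
OBJECTIVES = {
    "time_to_solution": lambda s: s.queue_hours + s.run_hours,
    "energy_to_solution": lambda s: s.energy_kwh,
    "least_used": lambda s: s.utilisation,
}
def pick_system(estimates, objective="least_used"):
    """Return the estimate with the lowest cost under the chosen objective."""
    if objective not in OBJECTIVES:
        raise ValueError(f"unknown objective: {objective}")
    if not estimates:
        raise ValueError("no candidate systems")
    return min(estimates, key=OBJECTIVES[objective])
```
Where the estimates come from (queue state, benchmarks, energy models) is the hard part and is left out of this sketch.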

24:32.200 --> 24:38.520
So the question was how we manage the scheduling and the staging in of data.

24:39.480 --> 24:45.560
This is a layer sitting on top of all the clusters and Slurm and everything. So most likely it

24:45.560 --> 24:52.280
would be that, okay, we stage in the data and when that's done, then you would start scheduling

24:52.280 --> 24:58.520
the jobs. If you have very, if the cost of failure when data is not available is small,

24:58.520 --> 25:04.360
then you could try to cheat it out that, okay, data is probably going to, I mean, be available in an

25:04.360 --> 25:09.480
hour and the queue is at least two hours, let's take a small risk. But then if you're scheduling

25:09.480 --> 25:13.080
half the machine, then you're probably going to take the conservative approach.
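
NOTE
The decision described above can be sketched as a small rule: queue the job early only when the data should arrive before the job could start and the job is small enough that a failed start is cheap. The function name, the parameters and the 10% cut-off are assumptions for illustration, not platform behaviour.
```python
def submit_before_staging_done(transfer_eta_hours, min_queue_wait_hours, machine_fraction, max_risky_fraction=0.1):
    """Return True if the job may be queued while its input data is still in transit."""
    if not 0.0 < machine_fraction <= 1.0:
        raise ValueError("machine_fraction must be in (0, 1]")
    # Large jobs: a start without data wastes too much, so wait for staging to finish.
    if machine_fraction > max_risky_fraction:
        return False
    # Small jobs: gamble only if the data should land before the job can start.
    return transfer_eta_hours < min_queue_wait_hours
```
For the example from the talk, data expected in one hour and a queue of at least two hours, a small job would be submitted early, while a job taking half the machine would wait.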

25:13.080 --> 25:38.200
So the question is how we ensure that the SSH certificate maps to a local user. How that essentially works

25:38.200 --> 25:44.680
is that the platform, when there is a new user, will exchange information with the target site.

25:44.680 --> 25:50.920
That, hey, there is a user with the following attributes. Please create necessary local accounts

25:51.480 --> 25:57.400
and then inform us of the Unix account. At that point, the certificate will contain

25:57.400 --> 26:05.080
the unique MyAccessID identifier, which the hosting entity would have in their principals mapping,

26:05.080 --> 26:10.840
saying that this unique identifier is allowed for this Unix user. And then they know that, so then they are

26:10.840 --> 26:15.240
let in.
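
NOTE
A toy model of the principals check described above: the site keeps, per local Unix account, the set of federated identifiers allowed to log in as it, and a certificate is accepted if it carries one of them. The account names and identifiers are made up, and the table layout is an assumption.
```python
# Hypothetical site-side table: local Unix account to the federated IDs allowed to use it.
ALLOWED_PRINCIPALS = {
    "u1001": {"id-0001@federation.example"},
    "u1002": {"id-0002@federation.example", "id-0003@federation.example"},
}
def may_log_in(cert_principals, unix_user, mapping=ALLOWED_PRINCIPALS):
    """Return True if any principal in the certificate is allowed for unix_user."""
    allowed = mapping.get(unix_user, set())
    return any(p in allowed for p in cert_principals)
```
OpenSSH offers this kind of per-user check through the AuthorizedPrincipalsFile option in sshd_config; whether each hosting entity uses exactly that mechanism is not stated in the talk.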

26:15.240 --> 26:21.800
George, yes. On data transfer: have you pinned down a technology yet for data transfers?

26:21.800 --> 26:27.880
So, yeah, the question was the technology for data transfers, the answer is whatever is available

26:27.880 --> 26:32.920
at the sites. So we're not introducing or imposing any additional ones, we work with what's

26:33.720 --> 26:39.560
available at the various sites: some have iRODS, some have S3, some have just SCP.

26:39.560 --> 26:45.480
All right, to wrap up here, one final thing: CSC, as well as Ghent

26:45.480 --> 26:49.320
University and all of that, several of these partners will be hiring. So if you want to work

26:49.320 --> 26:54.360
on open-source software, on the biggest supercomputers in Europe, this could be very interesting for you,

26:54.360 --> 27:00.520
if you're looking for a new job. I'm hiring.