WEBVTT

00:00.000 --> 00:12.000
All right, the second to last talk in the HPC devroom here at FOSDEM is by Edgar

00:12.000 --> 00:14.000
on mpibind.

00:14.000 --> 00:18.000
We saved the best for last.

00:18.000 --> 00:21.000
Edgar León, Lawrence Livermore National Laboratory.

00:21.000 --> 00:27.000
I work for Livermore Computing, which is the organization that fields

00:27.000 --> 00:32.000
the supercomputers at the laboratory, and we are currently hosting the fastest

00:32.000 --> 00:34.000
supercomputer in the world.

00:34.000 --> 00:42.000
I will talk about mpibind, a library that maps applications onto supercomputers.

00:42.000 --> 00:47.000
To begin, HPC users face complex architectures.

00:47.000 --> 00:51.000
On one hand, you have applications with different requirements.

00:51.000 --> 00:55.000
So they need portability, at least at the laboratory and other institutions.

00:55.000 --> 01:01.000
We need applications that run not only on one architecture, but on multiple architectures.

01:01.000 --> 01:07.000
And we want our application developers to be productive, particularly when they move across architectures,

01:07.000 --> 01:09.000
across vendors.

01:09.000 --> 01:17.000
And we want them to be able to extract a significant percentage of the peak performance of the architecture.

01:17.000 --> 01:25.000
And so here on the screen, I have several recent architectures, including the AMD

01:25.000 --> 01:29.000
MI300A, which I will go into in a little bit more detail.

01:29.000 --> 01:34.000
But there are also the NVIDIA architectures and others.

01:34.000 --> 01:40.000
So, to give you a closer look at what these architectures look like:

01:40.000 --> 01:51.000
This is the node topology of the AMD MI300A, which is the heart of the El Capitan supercomputer

01:51.000 --> 01:53.000
at Lawrence Livermore.

01:53.000 --> 01:57.000
And this is the top supercomputer in the world at the moment.

01:57.000 --> 02:04.000
On the right-hand side, I have a picture of the MI300A chip.

02:04.000 --> 02:08.000
And you can see that it is composed of multiple chiplets.

02:08.000 --> 02:11.000
So you have six accelerator chiplets.

02:11.000 --> 02:13.000
Those are the ones that I highlighted.

02:13.000 --> 02:18.000
And then you have the three CPU chiplets, which are not highlighted here.

02:18.000 --> 02:21.000
And then you have HBM memory on the sides.

02:21.000 --> 02:27.000
One of the key characteristics of this chip is that you have a single memory.

02:27.000 --> 02:34.000
So you have this memory that is shared, physically shared between the CPU and the GPU.

02:34.000 --> 02:39.000
So this is different than other architectures where you have CPU memory.

02:39.000 --> 02:40.000
And then you have GPU memory.

02:40.000 --> 02:43.000
And then you have to be moving data back and forth, right?

02:43.000 --> 02:47.000
So that's a key distinguishing characteristic there.

02:47.000 --> 02:51.000
So on a compute node of El Capitan, we have four of these, right?

02:51.000 --> 02:54.000
And so you have the four quadrants here.

02:54.000 --> 02:57.000
And the three CPU chiplets show up like this.

02:57.000 --> 03:01.000
So you have eight cores per CPU chiplet.

03:01.000 --> 03:05.000
And you have two hardware threads per core.

03:05.000 --> 03:10.000
And as I said, each APU has its own memory.

03:10.000 --> 03:14.000
And here is the GPU.

03:14.000 --> 03:18.000
And it has a network interface controller as well.

03:18.000 --> 03:25.000
And again, just to highlight that this memory right here is shared between the CPU and the GPU.

03:25.000 --> 03:32.000
And so when you think about mapping applications onto this architecture, it's not easy.

03:32.000 --> 03:35.000
I'm going to give you another example.

03:35.000 --> 03:42.000
So here we have the node architecture of the Frontier supercomputer at Oak Ridge National Lab,

03:42.000 --> 03:46.000
which should be, I guess, number two on the TOP500 at the moment.

03:46.000 --> 03:50.000
And so you know, again, it's not easy here.

03:50.000 --> 03:55.000
So here, for example, the GPUs are local to an L3 cache,

03:55.000 --> 04:02.000
as opposed to the MI300A, where the GPUs are attached to a NUMA domain.

04:02.000 --> 04:04.000
And what could go wrong on this type of architecture

04:04.000 --> 04:09.000
when you're trying to port your applications onto these machines?

04:09.000 --> 04:13.000
Well, one of them, if you have multiple hardware threads per core,

04:13.000 --> 04:19.000
you could be mapping your processes onto each hardware thread rather than to a full core.

04:19.000 --> 04:23.000
So you would be getting like half of the performance that you're expecting.

04:23.000 --> 04:27.000
You could be launching multiple GPU kernels,

04:27.000 --> 04:30.000
and they are being mapped to a single GPU, right?

04:30.000 --> 04:33.000
As opposed to multiple GPUs.

04:33.000 --> 04:37.000
You could have, let's say, that I have tasks on this chiplet right here,

04:37.000 --> 04:42.000
and instead of launching to my local GPU, I could be launching to a remote GPU.

04:42.000 --> 04:47.000
If you notice, the local GPU here is not GPU 0; this is GPU 4, right?

04:47.000 --> 04:50.000
So if you think that you're going to be using GPU 0 here,

04:50.000 --> 04:52.000
well, then you're using this one over here,

04:52.000 --> 04:56.000
and so it will be much more expensive to use that GPU.

04:56.000 --> 05:01.000
So again, it's not easy to map these applications,

05:01.000 --> 05:04.000
When you have MPI processes,

05:04.000 --> 05:07.000
when you have OpenMP threads, when you have GPU kernels,

05:07.000 --> 05:12.000
then you have to be very careful about how you map things onto the architecture.
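
NOTE
Since these placement mistakes are easy to miss, here is a minimal,
hedged C sketch (not part of mpibind) that prints the CPUs a Linux
process is actually allowed to run on, using sched_getaffinity:
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
int main(void) {
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return 1;                      /* affinity query failed */
    printf("Allowed CPUs:");
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))      /* CPU is in this task's mask */
            printf(" %d", cpu);
    printf("\n");
    return 0;
}
Running one copy per MPI task shows whether tasks landed on full
cores or doubled up on hardware threads.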

05:13.000 --> 05:17.000
So with that in mind, we created mpibind,

05:17.000 --> 05:22.000
which is a mapping algorithm to map your application onto the hardware.

05:22.000 --> 05:26.000
It does the mapping automatically for a user,

05:26.000 --> 05:30.000
so that they don't have to think about all these things that I talked about.

05:30.000 --> 05:34.000
And the first of the three design principles of mpibind

05:34.000 --> 05:36.000
is that it focuses on the memory system.

05:36.000 --> 05:39.000
I think that was key, particularly at the beginning,

05:39.000 --> 05:42.000
that most of the mapping and affinity algorithms,

05:42.000 --> 05:44.000
they were looking at the compute elements,

05:44.000 --> 05:47.000
regardless of the memory system.

05:47.000 --> 05:50.000
I think now they're coming up to speed, like Slurm and so on,

05:50.000 --> 05:53.000
but particularly at the beginning, mpibind was the only one

05:53.000 --> 05:56.000
that was really looking at the memory system.

05:56.000 --> 05:58.000
It focuses on locality, right?

05:58.000 --> 06:00.000
So it's going to give you local resources.

06:00.000 --> 06:04.000
It won't give you a remote GPU,

06:04.000 --> 06:06.000
and I'll tell you why that is.

06:06.000 --> 06:10.000
And again, we look for portability across architectures,

06:10.000 --> 06:14.000
across MPI libraries, across resource managers,

06:14.000 --> 06:16.000
and vendors as well.

06:16.000 --> 06:18.000
So for the remainder of the talk,

06:18.000 --> 06:21.000
I will just focus on a few characteristics of mpibind.

06:21.000 --> 06:24.000
I don't have time to go over all of them,

06:24.000 --> 06:28.000
but you'll have links to further resources if you are interested.

06:28.000 --> 06:33.000
So let's begin with mpibind and sort of how it works,

06:33.000 --> 06:37.000
as I said, the key characteristic is to follow the memory system.

06:37.000 --> 06:39.000
And so for doing that,

06:39.000 --> 06:45.000
we first create an abstract tree of the memory system of your computer,

06:45.000 --> 06:46.000
right?

06:46.000 --> 06:49.000
So you have NUMA domains, you know, memory domains,

06:49.000 --> 06:52.000
and then you have different memories,

06:52.000 --> 06:55.000
the L3 caches, the L2 caches, the L1 caches, and so on.

06:55.000 --> 06:58.000
So you first have that hierarchy.

06:58.000 --> 07:00.000
Once you have the memory tree,

07:00.000 --> 07:05.000
then you start attaching the compute elements to those memory elements,

07:05.000 --> 07:07.000
right? So for example, L1 caches

07:07.000 --> 07:10.000
will have cores associated with them.

07:10.000 --> 07:13.000
NUMA domains may have GPUs associated with them,

07:13.000 --> 07:16.000
and network interface controllers and so on.
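
NOTE
mpibind builds this tree with hwloc; here is a minimal sketch of the
kind of discovery involved (plain hwloc 2.x calls, not mpibind's
actual code):
#include <hwloc.h>
#include <stdio.h>
int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);        /* allocate a topology object   */
    hwloc_topology_load(topo);         /* detect this node's hardware  */
    printf("NUMA domains: %d\n",
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE));
    printf("L3 caches:    %d\n",
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE));
    printf("Cores:        %d\n",
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE));
    hwloc_topology_destroy(topo);
    return 0;
}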

07:16.000 --> 07:19.000
And so now when we get an application,

07:19.000 --> 07:24.000
then we're going to try to map it to the memory tree first.

07:24.000 --> 07:27.000
And then once we do that mapping to the memory tree,

07:27.000 --> 07:35.000
each task gets the devices associated with those memory resources.

07:35.000 --> 07:40.000
So let's say that I have 32 tasks,

07:40.000 --> 07:45.000
and I want to map that onto this tree.

07:45.000 --> 07:49.000
So first of all, I'm going to distribute my tasks over the NUMA domains,

07:49.000 --> 07:54.000
so that I will have 16 tasks on this NUMA domain.

07:55.000 --> 07:58.000
And then what I'm going to try to do is,

07:58.000 --> 08:02.000
so I have 16 workers that I have to map.

08:02.000 --> 08:05.000
So I'm going to start traversing the tree,

08:05.000 --> 08:08.000
from the top down, right?

08:08.000 --> 08:10.000
And so I'm going to start looking at this level,

08:10.000 --> 08:12.000
which is the L3 level,

08:12.000 --> 08:14.000
but I only have 10 L3s here,

08:14.000 --> 08:16.000
and I have 16 workers.

08:16.000 --> 08:18.000
So I'm going to keep going down the hierarchy,

08:18.000 --> 08:21.000
until I can match that request,

08:22.000 --> 08:26.000
which happens to be at level four,

08:26.000 --> 08:29.000
where I have 20 L1 caches,

08:29.000 --> 08:34.000
so then I can satisfy that request of the 16 workers that I need.

08:34.000 --> 08:36.000
And so I will map it at that level,

08:36.000 --> 08:40.000
and of course once I do the assignment of the L1 caches,

08:40.000 --> 08:42.000
they will have cores associated with them,

08:42.000 --> 08:45.000
they will have GPUs associated with them, and so on.

08:45.000 --> 08:47.000
So everything will be local.

08:47.000 --> 08:50.000
So that's basically how the algorithm works.
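
NOTE
A hedged sketch of the level-selection idea just described; the level
widths are taken from the talk's example, the L2 count is an
assumption, and this is an illustration, not mpibind's implementation:
#include <stdio.h>
int main(void) {
    const char *level[] = { "NUMA", "L3", "L2", "L1" };
    const int  width[]  = { 1, 10, 10, 20 }; /* elements at each level */
    int workers = 16;            /* tasks assigned to this NUMA domain */
    for (int i = 0; i < 4; i++) {
        if (width[i] >= workers) {  /* first level wide enough wins */
            printf("Map %d workers at the %s level\n", workers, level[i]);
            return 0;
        }
    }
    printf("Map at the hardware-thread level\n");
    return 0;
}
Each worker then inherits the cores and GPUs local to its assigned
L1 caches, which is why everything ends up local.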

08:50.000 --> 08:55.000
Now I'm going to go over certain properties of mpibind

08:55.000 --> 09:00.000
to tell you a little bit more about how it works.

09:00.000 --> 09:02.000
So in this example,

09:02.000 --> 09:05.000
I have a node with two NUMA domains,

09:05.000 --> 09:08.000
40 cores and 4 GPUs.

09:08.000 --> 09:10.000
And let's say that I want to do something simple,

09:10.000 --> 09:11.000
right?

09:11.000 --> 09:14.000
I have an application with 8 MPI tasks.

09:14.000 --> 09:18.000
So what I want to do is to run

09:18.000 --> 09:21.000
2 tasks per GPU,

09:21.000 --> 09:22.000
right?

09:22.000 --> 09:25.000
I have 4 GPUs and I have 8 tasks.

09:25.000 --> 09:28.000
So if I want to do a reasonable mapping

09:28.000 --> 09:31.000
on IBM's JSM,

09:31.000 --> 09:34.000
which is like the Slurm of IBM,

09:34.000 --> 09:36.000
if you will,

09:36.000 --> 09:38.000
I have to do something like this

09:38.000 --> 09:40.000
to be able to do a good mapping,

09:40.000 --> 09:41.000
right?

09:41.000 --> 09:42.000
And so for a user,

09:42.000 --> 09:45.000
just trying to understand what all of this means

09:45.000 --> 09:47.000
is not easy, right?

09:47.000 --> 09:51.000
And so at least for our users,

09:51.000 --> 09:54.000
they would be doing something like this instead.

09:54.000 --> 09:55.000
Right?

09:55.000 --> 09:56.000
So with mpibind,

09:56.000 --> 09:58.000
the only thing that you need to pass

09:58.000 --> 10:01.000
is really the number of tasks that you're going to be running.

10:01.000 --> 10:02.000
Right?

10:02.000 --> 10:04.000
And if mpibind is on by default,

10:04.000 --> 10:06.000
you don't have to say anything else.

10:06.000 --> 10:07.000
If it's not on by default,

10:07.000 --> 10:09.000
then of course you have to say,

10:09.000 --> 10:12.000
you know, I want to be using mpibind.

10:12.000 --> 10:15.000
And it has a number of options that you can pass

10:16.000 --> 10:21.000
to mpibind through this --mpibind parameter.

10:21.000 --> 10:24.000
The only required parameter for mpibind

10:24.000 --> 10:27.000
is the number of tasks.

10:27.000 --> 10:29.000
The rest, you know, are options

10:29.000 --> 10:31.000
and I will cover some of these options.

10:31.000 --> 10:33.000
If you're interested in all of them,

10:33.000 --> 10:34.000
of course, you know,

10:34.000 --> 10:38.000
take a look at the links that I will share later.
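
NOTE
As a concrete rendering of the contrast above: the JSM resource-set
syntax takes several options to express two tasks per GPU, whereas
the mpibind launch described in the talk reduces to roughly
  srun -n8 --mpibind=on ./app
(the option spelling here follows the talk; check the mpibind plugin
documentation for the exact syntax on your system).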

10:38.000 --> 10:41.000
So let's talk a little bit about portability

10:41.000 --> 10:44.000
and how mpibind achieves that.

10:44.000 --> 10:47.000
But first, let me give you a counter example.

10:47.000 --> 10:52.000
And I suggest please do not use the MPI constructs

10:52.000 --> 10:54.000
for affinity and binding,

10:54.000 --> 10:56.000
because if you do that,

10:56.000 --> 10:57.000
you're not going to be

10:57.000 --> 10:58.000
portable.

10:58.000 --> 11:00.000
So let's say that you're on a system

11:00.000 --> 11:02.000
that is using Intel MPI,

11:02.000 --> 11:05.000
and you're going to another system that uses Open MPI.

11:05.000 --> 11:07.000
Or even within that system,

11:07.000 --> 11:09.000
if you just want to move to a different MPI

11:09.000 --> 11:11.000
that perhaps performs better,

11:11.000 --> 11:15.000
then you have to change your constructs that you're using, right?

11:15.000 --> 11:17.000
So it's much better to do it

11:17.000 --> 11:21.000
at the resource manager level in that case.

11:21.000 --> 11:26.000
So how does mpibind manage portability across systems?

11:26.000 --> 11:32.000
As I said, it relies only on this abstract memory-and-compute tree.

11:32.000 --> 11:35.000
As long as you can build that compute tree,

11:35.000 --> 11:38.000
mpibind will work the same on every architecture.

11:38.000 --> 11:43.000
And the way that it constructs the memory tree

11:43.000 --> 11:45.000
is using hwloc.

11:45.000 --> 11:48.000
So at the end, if you can run hwloc,

11:48.000 --> 11:49.000
you can run mpibind.

11:49.000 --> 11:52.000
So that's the only requirement that you have.

11:52.000 --> 11:58.000
Now, it's important to highlight that there is a separation

11:58.000 --> 12:03.000
between the mapping algorithm and actually binding your processes

12:03.000 --> 12:05.000
and your tasks and your threads and so on.

12:06.000 --> 12:11.000
And so mpibind does this in two different pieces.

12:11.000 --> 12:16.000
So for the first one, we have an interface to get the mapping, right?

12:16.000 --> 12:19.000
And then for every resource manager that you have,

12:19.000 --> 12:24.000
we created a plugin that actually does the binding, right?

12:24.000 --> 12:28.000
So we have Slurm plugins and Flux plugins.

12:28.000 --> 12:31.000
Flux is a resource manager that we use at Lawrence Livermore;

12:31.000 --> 12:33.000
it's getting more common now.

12:33.000 --> 12:36.000
But so the plugins, the way that they work,

12:36.000 --> 12:40.000
is they get the job information from either slorn or flux.

12:40.000 --> 12:43.000
They will call mpibind to get the mapping.

12:43.000 --> 12:46.000
And then they will do the binding of the tasks

12:46.000 --> 12:52.000
or setting the environment variables that need to be set to do affinity.
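
NOTE
A hedged C sketch of the per-task binding step such a plugin performs;
the CPU and GPU choices are made-up examples standing in for mpibind's
computed mapping, but the calls and variables themselves are standard:
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
void bind_one_task(void) {
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    for (int c = 0; c < 8; c++)        /* pretend mpibind chose CPUs 0-7 */
        CPU_SET(c, &cpus);
    sched_setaffinity(0, sizeof(cpus), &cpus);  /* bind this task        */
    setenv("OMP_NUM_THREADS", "8", 1);          /* size the thread team  */
    setenv("ROCR_VISIBLE_DEVICES", "0", 1);     /* expose AMD GPU 0 only */
}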

12:52.000 --> 12:57.000
Okay, so mpibind tries to minimize remote memory access

12:57.000 --> 12:59.000
as much as possible.

12:59.000 --> 13:03.000
And this is important; this is the STREAM benchmark.

13:03.000 --> 13:07.000
If you look at your local memory bandwidth

13:07.000 --> 13:09.000
versus remote memory bandwidth,

13:09.000 --> 13:12.000
you can get like a 50% penalty, for example.

13:12.000 --> 13:15.000
So what mpibind does is that, for every task,

13:15.000 --> 13:18.000
it will keep it contained within a NUMA domain

13:18.000 --> 13:22.000
to make sure that all of the threads of that process

13:22.000 --> 13:24.000
access just local memory.

13:24.000 --> 13:27.000
But of course, if you need more memory,

13:27.000 --> 13:32.000
then it can spill over to other NUMA domains.
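
NOTE
A minimal sketch of this containment with hwloc (assuming the task was
assigned NUMA domain 0); leaving out HWLOC_MEMBIND_STRICT lets the OS
spill to other domains when local memory runs out, as described:
#include <hwloc.h>
int bind_memory_to_numa0(hwloc_topology_t topo) {
    hwloc_obj_t numa =
        hwloc_get_obj_by_type(topo, HWLOC_OBJ_NUMANODE, 0);
    return hwloc_set_membind(topo, numa->nodeset,
                             HWLOC_MEMBIND_BIND,       /* allocate here    */
                             HWLOC_MEMBIND_BYNODESET); /* set is a nodeset */
}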

13:32.000 --> 13:36.000
Another characteristic that is important for mpibind

13:36.000 --> 13:41.000
that I mentioned is that you want to be using local resources.

13:41.000 --> 13:44.000
So if mpibind gives you a GPU,

13:44.000 --> 13:47.000
it will be associated with local CPUs.

13:47.000 --> 13:52.000
If you get local CPUs, they will be associated with local GPUs.

13:52.000 --> 13:59.000
And so here are some example performance numbers for moving data

13:59.000 --> 14:06.000
between the CPU memory and a local GPU versus moving data

14:06.000 --> 14:09.000
from a CPU memory to a remote GPU.

14:09.000 --> 14:13.000
And you can get a significant performance degradation

14:13.000 --> 14:16.000
when you're using a remote GPU.

14:16.000 --> 14:17.000
So it does matter.

14:17.000 --> 14:20.000
It depends on the architecture and the topology

14:20.000 --> 14:23.000
of course, whether you're using NVLink versus PCIe and so on,

14:23.000 --> 14:28.000
but it can matter in many cases.

14:28.000 --> 14:32.000
Another thing that I want to talk about is that mpibind

14:32.000 --> 14:35.000
has been an important component to address

14:35.000 --> 14:39.000
system noise on high performance computing systems.

14:39.000 --> 14:42.000
System noise refers to any activity

14:42.000 --> 14:45.000
that runs alongside the application and affects

14:45.000 --> 14:47.000
the performance of the application.

14:47.000 --> 14:49.000
So think of the operating system.

14:49.000 --> 14:52.000
The operating system has to run on the same compute node

14:52.000 --> 14:56.000
and it may affect the performance of applications

14:56.000 --> 15:00.000
if it's running on the same core or on the same hardware thread.

15:00.000 --> 15:03.000
So one common approach to deal with system noise

15:03.000 --> 15:05.000
is to specialize your CPUs.

15:05.000 --> 15:09.000
So you will be using some CPUs for the operating system

15:09.000 --> 15:12.000
and some CPUs for your application to provide

15:12.000 --> 15:15.000
as much isolation from the operating system as possible.

15:15.000 --> 15:18.000
So you have application CPUs and system CPUs.

15:18.000 --> 15:23.000
And with mpibind it's very easy to control the placement

15:23.000 --> 15:25.000
on your architecture, right?

15:25.000 --> 15:30.000
So for example, you could say,

15:30.000 --> 15:35.000
I want to use the bottom hardware threads of every core

15:35.000 --> 15:39.000
because the top hardware threads of every core are going to be used

15:39.000 --> 15:41.000
for system processing.

15:41.000 --> 15:45.000
So you would just say mpibind restrict and you pass

15:45.000 --> 15:48.000
the system CPUs and then you're good.

15:48.000 --> 15:51.000
So the system administrators don't have to

15:51.000 --> 15:54.000
change the Slurm configuration or the Flux configuration;

15:54.000 --> 15:56.000
they don't have to do any of that.

15:56.000 --> 15:58.000
And just to mention, you can also use

15:58.000 --> 16:02.000
mpibind restrict not only on CPUs but also on memory domains.

16:02.000 --> 16:05.000
You can say, I want to stay within memory domains

16:05.000 --> 16:14.000
4 to 7, and you will stay within that part of the compute node.
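
NOTE
For example, the restriction described here would look something like
  MPIBIND_RESTRICT=4-7 srun -n8 ./app
to stay within memory domains 4 to 7; the variable spelling here is
inferred from the talk, so check the mpibind documentation for the
exact name and the CPU-versus-memory-domain syntax.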

16:14.000 --> 16:17.000
We've done multiple studies of system noise.

16:17.000 --> 16:19.000
And I just want to highlight here

16:19.000 --> 16:24.000
that mpibind enables this type of study very easily

16:24.000 --> 16:27.000
just by changing that environment variable.

16:27.000 --> 16:30.000
Here I have two configurations where we use

16:30.000 --> 16:34.000
the top hardware threads for the application or the bottom

16:34.000 --> 16:36.000
hardware threads for the application.

16:36.000 --> 16:39.000
Plus, some other kernel settings.

16:39.000 --> 16:43.000
And here we use a microbenchmark that is just a compute loop

16:43.000 --> 16:45.000
that runs over and over and over again.

16:45.000 --> 16:48.000
And then you measure how long it takes every iteration

16:48.000 --> 16:51.000
and then you plot those execution times.
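
NOTE
A minimal sketch of such a fixed-work noise probe (my own illustration,
not the benchmark used in the study): each iteration does the same
amount of work, so variation in the timings is attributable to noise.
#include <stdio.h>
#include <time.h>
int main(void) {
    volatile double x = 1.0;
    struct timespec t0, t1;
    for (int iter = 0; iter < 1000; iter++) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < 10000000L; i++)
            x = x * 1.0000001 + 1e-9;   /* fixed compute quantum */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("%d %.1f\n", iter,       /* iteration, microseconds */
               (t1.tv_sec - t0.tv_sec) * 1e6 +
               (t1.tv_nsec - t0.tv_nsec) * 1e-3);
    }
    return 0;
}
On a quiet hardware thread the samples cluster tightly; system noise
shows up as outliers in the plot.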

16:51.000 --> 16:54.000
And you can see there is a significant performance

16:54.000 --> 16:56.000
improvement on this side.

16:56.000 --> 16:59.000
And not necessarily to compare the two configurations,

16:59.000 --> 17:02.000
but just to tell you that with mpibind it's very easy

17:02.000 --> 17:06.000
to do this type of study, as long as you have the sysadmins

17:06.000 --> 17:10.000
doing the placement of the operating system processes

17:10.000 --> 17:11.000
on the other side, right?

17:11.000 --> 17:14.000
But at least on the user side it's very easy to change

17:14.000 --> 17:16.000
if you're using mpibind.

17:16.000 --> 17:21.000
You can also use this mpibind restrict variable

17:21.000 --> 17:27.000
to place your jobs in whatever custom way

17:27.000 --> 17:30.000
you want. Let's say that you have a traditional

17:30.000 --> 17:32.000
simulation application.

17:32.000 --> 17:35.000
And then you also have a machine learning component.

17:35.000 --> 17:38.000
It could be a surrogate model for example, right?

17:38.000 --> 17:41.000
And you want to place them differently on the node

17:41.000 --> 17:45.000
to say take advantage of the GPUs or the CPUs and so on.

17:45.000 --> 17:47.000
So you can do that pretty easily.

17:47.000 --> 17:52.000
So in this case, I have job one on the memory

17:52.000 --> 17:54.000
domains 4 and 5.

17:54.000 --> 17:58.000
And I have job two on memory domain 6 and 7.

17:58.000 --> 18:01.000
And then the third job is using the bottom hardware

18:01.000 --> 18:05.000
threads of the first half of the CPU.

18:05.000 --> 18:13.000
So you can do that as it is shown on this slide.

18:13.000 --> 18:17.000
So we have been using mpibind at LLNL in production

18:17.000 --> 18:22.000
since 2015, so I guess it's going to be 10 years next year.

18:22.000 --> 18:26.000
It has changed over time to account for the different

18:26.000 --> 18:28.000
architectures that we see.

18:28.000 --> 18:32.000
But the important thing for us is that our application developers

18:32.000 --> 18:35.000
we want them, on day one when they get to a new machine,

18:35.000 --> 18:38.000
to not worry about affinity and binding; we give them

18:38.000 --> 18:41.000
a reasonable binding to begin with.

18:41.000 --> 18:44.000
And of course, if that is not enough, they can explore more.

18:44.000 --> 18:47.000
But they don't have to think about it when they

18:47.000 --> 18:50.000
start running on a new system.

18:50.000 --> 18:53.000
You can build it very easily with GNU Autotools.

18:53.000 --> 18:56.000
You can use Spack.

18:56.000 --> 18:59.000
We have, as I said, the Slurm and the Flux plugins.

18:59.000 --> 19:03.000
Enabling it on Slurm is very easy.

19:03.000 --> 19:07.000
You basically change one line of this configuration file

19:07.000 --> 19:13.000
in Slurm and then you're good to go.
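
NOTE
A hedged sketch of what that one-line change typically looks like when
a site plugin of this kind is a Slurm SPANK plugin (the path and file
name here are assumptions for illustration):
  optional /usr/lib64/slurm/mpibind_slurm.so
added to Slurm's plugstack.conf. And for the build,
  spack install mpibind
is the kind of Spack invocation the speaker refers to; check the
package name in your Spack instance.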

19:13.000 --> 19:17.000
So mpibind is an excellent initial policy.

19:17.000 --> 19:21.000
But if you have an application that requires different

19:21.000 --> 19:26.000
mappings over time, then this is not going to do it

19:26.000 --> 19:28.000
because mpibind is static.

19:28.000 --> 19:32.000
On the other hand, there are no modifications to your application

19:32.000 --> 19:33.000
in order to use mpibind.

19:33.000 --> 19:35.000
So that's powerful.

19:35.000 --> 19:39.000
We're developing a new library for applications that are more

19:39.000 --> 19:43.000
complex that require different dynamic mappings throughout the

19:43.000 --> 19:46.000
life of the application.

19:46.000 --> 19:49.000
We have a lot of documentation and articles available.

19:49.000 --> 19:51.000
So feel free to use those.

19:51.000 --> 19:56.000
These slides are on the web page so you can get those links.

19:56.000 --> 20:00.000
And to conclude, I just want to say that the

20:00.000 --> 20:04.000
goal of mpibind is to make application developers more

20:04.000 --> 20:05.000
productive.

20:05.000 --> 20:08.000
And we focus on sort of three areas: performance,

20:08.000 --> 20:11.000
productivity and portability.

20:11.000 --> 20:14.000
And with that, I will take questions.

20:14.000 --> 20:19.000
Thank you.

20:19.000 --> 20:22.000
Any questions?

20:22.000 --> 20:24.000
Several questions, okay.

20:24.000 --> 20:26.000
Let's talk about it.

20:26.000 --> 20:30.000
Just a question on the relation between mpibind and

20:30.000 --> 20:34.000
the job scheduler, because you showed you can run

20:34.000 --> 20:37.000
srun and you just pass the number of tasks.

20:37.000 --> 20:40.000
And then mpibind will do the magic.

20:41.000 --> 20:44.000
But how does mpibind know the hardware on the remote

20:44.000 --> 20:45.000
nodes?

20:45.000 --> 20:49.000
Does it actually figure out the resources there?

20:49.000 --> 20:54.000
Okay, I want to take that.

20:54.000 --> 20:55.000
Yeah.

20:55.000 --> 20:59.000
So the question, and please correct me if I

20:59.000 --> 21:00.000
didn't get it right.

21:00.000 --> 21:03.000
But so he's talking about the integration with

21:03.000 --> 21:08.000
Slurm and how it would deal with remote nodes, for example.

21:08.000 --> 21:13.000
And so mpibind is running on every node, right?

21:13.000 --> 21:16.000
So the decisions are actually made locally.

21:16.000 --> 21:22.000
It doesn't have to worry about making decisions on a global scale

21:22.000 --> 21:25.000
because from the resource manager, I will get, say,

21:25.000 --> 21:29.000
how many workers per node are going to be on that node.

21:29.000 --> 21:33.000
And it will take that information and then do the work.

21:33.000 --> 21:38.000
So again, all the decisions are local to a node

21:38.000 --> 21:41.000
because I will get that information from Slurm or

21:41.000 --> 21:42.000
Flux.

21:42.000 --> 21:43.000
Yeah.

21:43.000 --> 21:46.000
But if I execute srun on the login node?

21:46.000 --> 21:47.000
Yeah.

21:47.000 --> 21:50.000
Does it kind of know that?

21:50.000 --> 21:53.000
So it might be that I have different compute nodes with

21:53.000 --> 21:54.000
different hardware.

21:54.000 --> 22:00.000
How does mpibind know which one is best for my application?

22:00.000 --> 22:05.000
So I guess you make that decision manually before that.

22:05.000 --> 22:06.000
Yeah.

22:06.000 --> 22:10.000
So yeah.

22:10.000 --> 22:15.000
That's what I'm trying to gauge.

22:15.000 --> 22:18.000
How could mpibind work on the login node?

22:18.000 --> 22:21.000
Is that an okay way to interpret your question?

22:21.000 --> 22:25.000
Or what if the login node is different from the compute node?

22:25.000 --> 22:26.000
That is fine.

22:26.000 --> 22:29.000
So what is the difference?

22:29.000 --> 22:30.000
Yeah.

22:30.000 --> 22:34.000
What if I have two nodes that are different architectures?

22:34.000 --> 22:41.000
So mpibind will call hwloc locally on every node once.

22:41.000 --> 22:43.000
And it will get the topology.

22:43.000 --> 22:46.000
And the mapping will be done based on that local topology.

22:46.000 --> 22:47.000
Okay.

22:47.000 --> 22:49.000
So you can do mixed topologies?

22:49.000 --> 22:51.000
Absolutely.

22:51.000 --> 22:58.000
If you have a cluster with AMD GPUs and then NVIDIA GPUs

22:59.000 --> 23:01.000
and nodes without GPUs, that's fine.

23:01.000 --> 23:06.000
Because there will be one call per node to get the topology.

23:06.000 --> 23:08.000
And then it will make the decision based on that.

23:08.000 --> 23:09.000
Yes.

23:09.000 --> 23:10.000
Okay.

23:10.000 --> 23:11.000
All right.

23:11.000 --> 23:12.000
Thank you.

23:12.000 --> 23:13.000
Yeah.

23:13.000 --> 23:14.000
Yeah.

23:14.000 --> 23:17.000
So when there's no perfect assignment of the resources,

23:17.000 --> 23:22.000
let's say you have to choose either a bad NIC or a bad GPU

23:22.000 --> 23:23.000
in a given situation.

23:23.000 --> 23:27.000
How do you score them and decide which one is better or less bad?

23:27.000 --> 23:28.000
Yeah.

23:28.000 --> 23:31.000
So the question is:

23:31.000 --> 23:35.000
How do we determine what's the best mapping?

23:35.000 --> 23:41.000
Because in some cases, choosing a GPU over a CPU may give you a different mapping.

23:41.000 --> 23:43.000
And so we certainly run into this.

23:43.000 --> 23:47.000
And so what we have is, at the beginning of this slide deck,

23:47.000 --> 23:50.000
I showed some of the options to mpibind.

23:50.000 --> 23:54.000
And you can say, my application is really

23:54.000 --> 23:56.000
CPU-heavy, it has a CPU focus.

23:56.000 --> 23:59.000
So optimize the binding for the CPUs.

23:59.000 --> 24:02.000
You can say optimize the binding for the GPUs.

24:02.000 --> 24:04.000
And it will do that.

24:04.000 --> 24:09.000
Because yeah, there is no single ideal binding for all cases.

24:09.000 --> 24:10.000
Right.

24:10.000 --> 24:12.000
So then we have to pass hints to mpibind.

24:12.000 --> 24:17.000
So you would do --mpibind=gpu-optimized.

24:17.000 --> 24:22.000
And it will optimize the binding for the GPUs in that case.

24:22.000 --> 24:23.000
So yeah.

24:23.000 --> 24:24.000
Absolutely.

24:25.000 --> 24:30.000
Are there any plans to support Intel GPUs?

24:30.000 --> 24:37.000
So, I haven't tried hwloc on Intel GPUs;

24:37.000 --> 24:40.000
if hwloc is working on Intel GPUs,

24:40.000 --> 24:43.000
then you don't have to do anything.

24:43.000 --> 24:45.000
That's the only constraint.

24:45.000 --> 24:50.000
But I don't have any Intel GPUs that I can try things on.

24:50.000 --> 24:51.000
And so I haven't tried.

24:51.000 --> 24:52.000
I'm sorry.

24:53.000 --> 24:54.000
Oh, okay.

24:54.000 --> 24:55.000
So there you go.

25:08.000 --> 25:09.000
I'm sorry.

25:09.000 --> 25:10.000
Please tell me your question again.

25:10.000 --> 25:12.000
Does the binding also happen

25:12.000 --> 25:13.000
at the level of the threads

25:13.000 --> 25:16.000
that will be spawned by an MPI task?

25:16.000 --> 25:18.000
And is it

25:18.000 --> 25:21.000
affected by the environment?

25:21.000 --> 25:22.000
Yes.

25:22.000 --> 25:23.000
So the question is,

25:23.000 --> 25:27.000
do we do any thread affinity

25:27.000 --> 25:30.000
if processes are launching,

25:30.000 --> 25:32.000
you know, multiple threads.

25:32.000 --> 25:35.000
So we take care of OpenMP binding.

25:35.000 --> 25:37.000
So for OpenMP,

25:37.000 --> 25:39.000
all of the threads will also be bound

25:39.000 --> 25:41.000
to the right resources.

25:41.000 --> 25:44.000
If you're launching POSIX threads,

25:44.000 --> 25:47.000
they will just be confined to the process.

25:47.000 --> 25:50.000
The process mapping or the process,

25:50.000 --> 25:53.000
the resources that have been assigned to that process.

25:53.000 --> 25:57.000
So it won't be as fine-grained as for OpenMP,

25:57.000 --> 26:00.000
but you will be constrained within that process.

26:00.000 --> 26:02.000
So you wouldn't be able to say,

26:02.000 --> 26:04.000
do remote memory accesses

26:04.000 --> 26:07.000
because the process is confined to a NUMA domain.

26:07.000 --> 26:08.000
For example.
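
NOTE
The OpenMP thread binding described above is typically conveyed
through the standard OpenMP environment variables, e.g.
  OMP_PLACES=cores OMP_PROC_BIND=close ./app
whether set by hand or by a tool on the user's behalf; POSIX threads,
by contrast, only inherit the process-wide CPU mask, as the speaker
notes.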

26:09.000 --> 26:11.000
So the question is,

26:11.000 --> 26:14.000
do you support,

26:14.000 --> 26:15.000
like,

26:15.000 --> 26:17.000
the CU masking aspect of the device,

26:17.000 --> 26:18.000
where you actually,

26:18.000 --> 26:19.000
like,

26:19.000 --> 26:20.000
partition,

26:20.000 --> 26:22.000
which threads would be assigned to

26:22.000 --> 26:24.000
which CUs on the device?

26:24.000 --> 26:28.000
So the question is,

26:28.000 --> 26:29.000
do I support,

26:29.000 --> 26:33.000
CU masking on GPUs, is that right?

26:33.000 --> 26:35.000
And I guess the answer is no,

26:35.000 --> 26:37.000
because I'm not familiar with that.

26:37.000 --> 26:39.000
Yeah, yeah,

26:39.000 --> 26:42.000
I'd love to talk to you about that.

26:42.000 --> 26:44.000
Okay.

26:44.000 --> 26:46.000
Thank you very much.

26:46.000 --> 26:48.000
Thank you very much.

