WEBVTT

00:00.000 --> 00:07.000
Well, so I'm Felix, and this is Vasileios.

00:07.000 --> 00:13.000
We're going to talk about the work we've done in ReFrame with open-source tools to support

00:13.000 --> 00:15.000
basic performance testing.

00:15.000 --> 00:18.000
And so, what do we do at NVIDIA?

00:18.000 --> 00:22.000
We are both in a team called Applied Systems at NVIDIA.

00:22.000 --> 00:29.000
And what we do is we build internal supercomputers with the latest GPUs and CPUs

00:29.000 --> 00:31.000
and Mellanox hardware.

00:31.000 --> 00:34.000
And we have created internal clusters like EOS,

00:34.000 --> 00:38.000
which was number nine on the TOP500 in November 2023.

00:38.000 --> 00:42.000
And we do that to enable internal users so that they can run

00:42.000 --> 00:45.000
their deep learning workloads or HPC workloads.

00:45.000 --> 00:49.000
And as such, we also help our customers

00:49.000 --> 00:52.000
to build their own clusters at a later point.

00:52.000 --> 00:55.000
And our internal clusters run benchmarks,

00:55.000 --> 00:58.000
such as HPL and MLPerf Training,

00:58.000 --> 01:00.000
if you're familiar with that.

01:00.000 --> 01:02.000
And also deep learning research,

01:02.000 --> 01:06.000
such as training the Megatron model on the Selene supercomputer,

01:06.000 --> 01:09.000
which was the generation before EOS.

01:09.000 --> 01:13.000
And so that's the kind of things we are doing in our team.

01:13.000 --> 01:18.000
And we've been using ReFrame for a few years now

01:18.000 --> 01:22.000
basically, if you're familiar with ReFrame, it's very useful for

01:22.000 --> 01:26.000
performance validation of your cluster and regression testing.

01:26.000 --> 01:28.000
So ReFrame is an open-source project,

01:28.000 --> 01:33.000
developed initially at CSCS by some people around here.

01:33.000 --> 01:37.000
And you express your test in a declarative manner,

01:37.000 --> 01:39.000
you express your test in Python,

01:39.000 --> 01:43.000
and you can express dependencies, constraints on your test,

01:43.000 --> 01:48.000
and you define what you're expecting the output to be,

01:48.000 --> 01:52.000
and you define performance targets for each test.

01:52.000 --> 01:54.000
And ReFrame handles generating the scripts,

01:54.000 --> 01:56.000
launching jobs using Slurm automatically,

01:56.000 --> 01:57.000
you don't need to care about Slurm,

01:57.000 --> 01:59.000
and it will gather the results,

01:59.000 --> 02:01.000
execute everything concurrently as much as it can,

02:01.000 --> 02:02.000
respect the dependencies,

02:02.000 --> 02:05.000
and then you will get the results: green,

02:05.000 --> 02:07.000
it's passing; red, it's failing, obviously.

02:07.000 --> 02:11.000
And yes, it's open source, it has great documentation,

02:11.000 --> 02:12.000
you can take a look.

02:12.000 --> 02:15.000
So performance testing: what does it look like in ReFrame?

02:15.000 --> 02:18.000
Here is a very simple test where we run the STREAM benchmark,

02:18.000 --> 02:22.000
and you declare methods with the decorator

02:22.000 --> 02:24.000
@performance_function, and you say,

02:24.000 --> 02:27.000
I have a metric called copy bandwidth for STREAM.

02:27.000 --> 02:29.000
So I'm going to execute stream.x,

02:29.000 --> 02:30.000
the binary called stream.x,

02:30.000 --> 02:32.000
and I'm declaring a regex here,

02:32.000 --> 02:34.000
and the first matching group,

02:34.000 --> 02:36.000
I'm saying:

02:36.000 --> 02:38.000
basically cast that to float,

02:38.000 --> 02:40.000
and that's going to be my copy bandwidth,

02:40.000 --> 02:42.000
and same for triad bandwidth,

02:42.000 --> 02:45.000
because STREAM has multiple kernels.
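
As an aside, the regex-plus-cast mechanism described here can be sketched in plain Python. The STREAM output lines and the helper below are illustrative only, not ReFrame's actual API; in ReFrame the same idea lives inside a `@performance_function` method:

```python
import re

# Hypothetical STREAM output (the column layout follows the standard benchmark).
stream_output = """\
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           23400.1     0.006870     0.006838     0.006907
Triad:          25100.7     0.009566     0.009560     0.009570
"""

def extract_bandwidth(output: str, kernel: str) -> float:
    # The first capture group holds the bandwidth; cast it to float,
    # exactly as a ReFrame performance function would.
    match = re.search(rf'{kernel}:\s+(\S+)', output)
    if match is None:
        raise ValueError(f'no {kernel} result found')
    return float(match.group(1))

copy_bw = extract_bandwidth(stream_output, 'Copy')
triad_bw = extract_bandwidth(stream_output, 'Triad')
```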

02:45.000 --> 02:47.000
And then for the performance targets,

02:47.000 --> 02:50.000
you define a dictionary of dictionaries,

02:50.000 --> 02:52.000
and you say for the system,

02:52.000 --> 02:53.000
for the default system,

02:53.000 --> 02:55.000
you can set different targets per system.

02:55.000 --> 03:00.000
I want my copy bandwidth to be around 23,000 megabytes per second,

03:00.000 --> 03:04.000
minus 10%, plus 13%, so you give percentage bounds

03:04.000 --> 03:07.000
around this middle value.
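
To make the threshold semantics concrete, here is a minimal sketch in plain Python. The tuple layout mirrors ReFrame's `(target, lower_frac, upper_frac, unit)` reference convention, but the helper itself is hypothetical:

```python
def within_bounds(value: float, target: float,
                  lower_frac: float, upper_frac: float) -> bool:
    """Check a measured value against percentage bounds around a target.

    lower_frac is negative (e.g. -0.10 for -10%), upper_frac positive.
    """
    return target * (1 + lower_frac) <= value <= target * (1 + upper_frac)

# Hypothetical reference: 23000 MB/s, -10% / +10%
target, lo, hi, unit = (23000.0, -0.10, 0.10, 'MB/s')

ok = within_bounds(22500.0, target, lo, hi)          # inside the bounds
regression = within_bounds(20000.0, target, lo, hi)  # below -10%: fails
```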

03:07.000 --> 03:11.000
And that's what you usually do for performance testing in reframe,

03:11.000 --> 03:14.000
and when you execute reframe, it looks like this.

03:14.000 --> 03:18.000
That's in a terminal that doesn't actually render the colors.

03:18.000 --> 03:19.000
It looks like this.

03:19.000 --> 03:21.000
You have the name of the test.

03:21.000 --> 03:23.000
ReFrame tells you what it's doing.

03:23.000 --> 03:24.000
It's executing this.

03:24.000 --> 03:26.000
As I said, the green means

03:26.000 --> 03:28.000
it passes, and then you have this summary again.

03:28.000 --> 03:30.000
And where ReFrame is very useful

03:30.000 --> 03:32.000
is that you really have the choice

03:32.000 --> 03:34.000
of where you send the logs.

03:34.000 --> 03:37.000
You can send the logs to Elasticsearch,

03:37.000 --> 03:38.000
to Graylog,

03:38.000 --> 03:41.000
and you can also configure exactly the format

03:41.000 --> 03:42.000
of what you output. So here,

03:42.000 --> 03:44.000
We're defining a format for the output,

03:44.000 --> 03:46.000
which is CSV.

03:46.000 --> 03:49.000
I mean, separated by pipe, but still CSV.

03:49.000 --> 03:51.000
And your output file here,

03:51.000 --> 03:53.000
written by ReFrame after executing this,

03:53.000 --> 03:56.000
will contain the timestamp,

03:56.000 --> 03:58.000
the name of the test,

03:58.000 --> 04:00.000
the metric we see,

04:00.000 --> 04:03.000
like triad or read or write.

04:03.000 --> 04:04.000
I had to redact some of that,

04:04.000 --> 04:06.000
but there was the actual value right here.

04:06.000 --> 04:09.000
And that was the performance we were targeting,

04:09.000 --> 04:12.000
and minus 3% plus 3% is the performance

04:12.000 --> 04:16.000
bounds around the value we were targeting.
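
The pipe-separated perflog line described here is easy to post-process; a minimal sketch in plain Python (the field order and values below are illustrative, since the format is user-configurable):

```python
# A hypothetical perflog line in the pipe-separated format described above:
# timestamp|test name|metric|value|reference|lower|upper|unit
line = '2024-03-01T12:00:00|stream_test|triad|25100.7|25000|-0.03|0.03|MB/s'

fields = line.split('|')
record = {
    'timestamp': fields[0],
    'test': fields[1],
    'metric': fields[2],
    'value': float(fields[3]),      # measured performance
    'reference': float(fields[4]),  # target value
    'lower': float(fields[5]),      # -3% bound
    'upper': float(fields[6]),      # +3% bound
    'unit': fields[7],
}
```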

04:16.000 --> 04:20.000
And how do we define those performance bounds

04:20.000 --> 04:22.000
usually in reframe?

04:22.000 --> 04:25.000
They have to be fixed per system,

04:25.000 --> 04:27.000
so basically we're going to run

04:27.000 --> 04:28.000
on a few machines.

04:28.000 --> 04:30.000
We're going to see what the performance looks like.

04:30.000 --> 04:31.000
We're going to measure the mean and the deviation,

04:31.000 --> 04:32.000
and we're going to see,

04:32.000 --> 04:34.000
okay, how do I define my bounds?

04:34.000 --> 04:35.000
And the problem we're facing,

04:35.000 --> 04:37.000
if your bounds are too narrow,

04:37.000 --> 04:39.000
you might get spurious failures,

04:39.000 --> 04:41.000
because you didn't really test the full

04:41.000 --> 04:42.000
population initially,

04:42.000 --> 04:44.000
and you're going to get false positives,

04:44.000 --> 04:46.000
and admins do not like that,

04:46.000 --> 04:47.000
so they will tell you,

04:47.000 --> 04:48.000
can you please increase the bounds?

04:48.000 --> 04:50.000
It's failing by a very, very small margin.

04:50.000 --> 04:52.000
We need it to pass,

04:52.000 --> 04:53.000
it's not really a performance problem.

04:53.000 --> 04:54.000
If it's too large,

04:54.000 --> 04:55.000
the problem is this:

04:55.000 --> 04:56.000
well,

04:56.000 --> 04:58.000
here I show a normal distribution,

04:58.000 --> 05:00.000
because it's often actually a normal distribution

05:00.000 --> 05:02.000
for performance results across the population,

05:02.000 --> 05:03.000
if it's too large,

05:03.000 --> 05:04.000
like three sigma,

05:04.000 --> 05:05.000
three standard deviations,

05:05.000 --> 05:06.000
you can get,

05:06.000 --> 05:07.000
obviously, a regression

05:07.000 --> 05:09.000
that is still within your bounds,

05:09.000 --> 05:10.000
and that happens,

05:10.000 --> 05:11.000
you know,

05:11.000 --> 05:12.000
actually quite a lot.
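
The "measure a fleet, then derive bounds" workflow just described can be sketched with the standard library; the sample values and the three-sigma choice are illustrative, not our actual procedure:

```python
import statistics

# Hypothetical triad results (MB/s) from an initial fleet of nodes.
samples = [24900.0, 25150.0, 25050.0, 24980.0, 25210.0, 24870.0]

mean = statistics.mean(samples)
sigma = statistics.stdev(samples)

# Three-sigma bounds, expressed as the fractional thresholds a
# reference tuple expects: (mean, lower_frac, upper_frac, 'MB/s').
lower_frac = -3 * sigma / mean
upper_frac = 3 * sigma / mean
```

The tension described in the talk is visible here: tighter multipliers than 3-sigma produce spurious failures on the tail of the distribution, while wider ones let small real regressions pass.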

05:14.000 --> 05:15.000
Thank you.

05:15.000 --> 05:17.000
And so,

05:17.000 --> 05:18.000
a common problem we,

05:18.000 --> 05:19.000
of course,

05:19.000 --> 05:21.000
we face is that we need to validate

05:21.000 --> 05:22.000
our clusters,

05:22.000 --> 05:24.000
the performance compared to

05:24.000 --> 05:25.000
one week ago,

05:25.000 --> 05:26.000
two weeks ago,

05:26.000 --> 05:28.000
or if we install a new software,

05:28.000 --> 05:29.000
we want to have confidence

05:29.000 --> 05:31.000
that we have the same performance

05:31.000 --> 05:32.000
as before,

05:32.000 --> 05:34.000
otherwise users will report

05:34.000 --> 05:35.000
problems,

05:35.000 --> 05:36.000
and it's going to take us a lot of time,

05:36.000 --> 05:37.000
because they're going to report

05:37.000 --> 05:38.000
problems,

05:38.000 --> 05:39.000
in very complex applications,

05:39.000 --> 05:40.000
and it's going to be much harder

05:40.000 --> 05:41.000
to find what's going on.

05:41.000 --> 05:43.000
So, we need a very robust test suite,

05:43.000 --> 05:45.000
and performance bounds are,

05:45.000 --> 05:46.000
honestly, great.

05:46.000 --> 05:48.000
We found a lot of bugs with ReFrame

05:48.000 --> 05:49.000
and performance bounds,

05:49.000 --> 05:51.000
but for tests that have a wide,

05:51.000 --> 05:52.000
variation,

05:52.000 --> 05:54.000
you tend to have wide ranges,

05:54.000 --> 05:56.000
and you're going to miss some regressions.

05:56.000 --> 05:57.000
So,

05:57.000 --> 06:00.000
people usually build their own tools

06:00.000 --> 06:01.000
on top of ReFrame;

06:01.000 --> 06:02.000
they were developing their own

06:02.000 --> 06:03.000
pandas scripts,

06:03.000 --> 06:04.000
like we did,

06:04.000 --> 06:05.000
or they were sending things to

06:05.000 --> 06:06.000
Splunk,

06:06.000 --> 06:07.000
or Elasticsearch,

06:07.000 --> 06:08.000
and build tools on top of this.

06:08.000 --> 06:09.000
And we wanted to see,

06:09.000 --> 06:10.000
okay,

06:10.000 --> 06:11.000
but that's not very portable,

06:11.000 --> 06:13.000
that really depends on your architecture,

06:13.000 --> 06:14.000
on your cluster.

06:14.000 --> 06:15.000
What can we do in

06:15.000 --> 06:16.000
ReFrame

06:16.000 --> 06:17.000
to have that built in,

06:17.000 --> 06:18.000
baked in?

06:18.000 --> 06:20.000
And Vasileios is going to take over

06:20.000 --> 06:22.000
and talk to you about this.

06:28.000 --> 06:29.000
Can you hear me?

06:29.000 --> 06:33.000
So, yeah, as Felix said,

06:33.000 --> 06:35.000
yeah, people did,

06:35.000 --> 06:37.000
oh, okay, sorry.

06:37.000 --> 06:39.000
As Felix said,

06:41.000 --> 06:42.000
yeah,

06:42.000 --> 06:44.000
it's been that

06:44.000 --> 06:46.000
users of ReFrame

06:46.000 --> 06:48.000
have been using

06:48.000 --> 06:49.000
ad hoc solutions,

06:49.000 --> 06:50.000
and one need that often arises is to

06:50.000 --> 06:52.000
do historical analysis.

06:52.000 --> 06:53.000
So,

06:53.000 --> 06:54.000
I,

06:54.000 --> 06:55.000
we thought that could be useful

06:55.000 --> 06:57.000
for other users in the community as well,

06:57.000 --> 07:00.000
to be able to compare past results,

07:00.000 --> 07:01.000
inspect past,

07:01.000 --> 07:02.000
test results,

07:02.000 --> 07:05.000
get performance metrics,

07:05.000 --> 07:06.000
aggregate performance across

07:06.000 --> 07:07.000
different characteristics,

07:07.000 --> 07:08.000
like,

07:08.000 --> 07:09.000
node lists,

07:09.000 --> 07:10.000
test parameters,

07:10.000 --> 07:11.000
time periods,

07:11.000 --> 07:13.000
also be able to compare

07:13.000 --> 07:15.000
performance between runs,

07:15.000 --> 07:17.000
between different configurations,

07:17.000 --> 07:19.000
between the current run versus

07:19.000 --> 07:21.000
historical data.

07:21.000 --> 07:23.000
And, for example,

07:23.000 --> 07:25.000
also different time periods.

07:26.000 --> 07:27.000
And,

07:27.000 --> 07:29.000
we also wanted to,

07:29.000 --> 07:30.000
as a key goal,

07:30.000 --> 07:32.000
to store as much test information as we can,

07:32.000 --> 07:33.000
because,

07:33.000 --> 07:35.000
experience shows that,

07:35.000 --> 07:38.000
you later regret the information you haven't collected.

07:38.000 --> 07:39.000
So,

07:39.000 --> 07:42.000
it's better if you have the information already,

07:42.000 --> 07:44.000
all the test information that you can have.

07:44.000 --> 07:46.000
We still want to,

07:46.000 --> 07:48.000
want to allow external post-processing,

07:48.000 --> 07:49.000
because we will never

07:49.000 --> 07:53.000
do all the post-processing that everybody else would like to do.

07:53.000 --> 07:54.000
So,

07:54.000 --> 07:56.000
the idea is to give a basic,

07:56.000 --> 07:57.000
analytics,

07:57.000 --> 07:58.000
let's say,

07:58.000 --> 07:59.000
layer.

07:59.000 --> 08:00.000
Also be backward compatible,

08:00.000 --> 08:01.000
we didn't want,

08:01.000 --> 08:02.000
like,

08:02.000 --> 08:04.000
users of ReFrame to come back,

08:04.000 --> 08:05.000
complaining that,

08:05.000 --> 08:06.000
you changed this option,

08:06.000 --> 08:07.000
you broke that interface,

08:07.000 --> 08:08.000
you broke my test,

08:08.000 --> 08:10.000
so we want backward compatibility.

08:10.000 --> 08:11.000
And,

08:11.000 --> 08:12.000
also,

08:12.000 --> 08:13.000
we want to provide an easy,

08:13.000 --> 08:15.000
command line interface,

08:15.000 --> 08:16.000
intuitive,

08:16.000 --> 08:19.000
to be able to do some basic analytics.

08:19.000 --> 08:22.000
So, we consider two options.

08:23.000 --> 08:24.000
As Felix said,

08:24.000 --> 08:28.000
one way of storing performance data is what we call,

08:28.000 --> 08:29.000
in ReFrame, perflogs,

08:29.000 --> 08:32.000
which are usually those CSV files

08:32.000 --> 08:34.000
that contain the

08:34.000 --> 08:37.000
essential performance data of tests.

08:37.000 --> 08:38.000
But there is,

08:38.000 --> 08:40.000
although they are compact,

08:40.000 --> 08:43.000
there are two disadvantages to that.

08:43.000 --> 08:45.000
Important test information may be lost,

08:45.000 --> 08:46.000
because,

08:46.000 --> 08:47.000
yeah,

08:47.000 --> 08:49.000
they don't carry the whole information.

08:49.000 --> 08:52.000
And the information is really bound

08:52.000 --> 08:56.000
to the log format that the user defined,

08:56.000 --> 09:00.000
which basically selects what information is important.

09:00.000 --> 09:01.000
So,

09:01.000 --> 09:03.000
then the second option,

09:03.000 --> 09:05.000
which ReFrame does internally,

09:05.000 --> 09:08.000
is to store the full test case information

09:08.000 --> 09:10.000
in a JSON report,

09:10.000 --> 09:12.000
which it can then dump to a file.

09:12.000 --> 09:14.000
The advantage of this is that

09:14.000 --> 09:16.000
it contains the whole test information,

09:17.000 --> 09:19.000
the test parameters, test variables,

09:19.000 --> 09:21.000
where they ran, and so on.

09:21.000 --> 09:22.000
On the other hand,

09:22.000 --> 09:24.000
it's quite verbose,

09:24.000 --> 09:26.000
and it's also unstructured data,

09:26.000 --> 09:29.000
because every test may have different variables.

09:29.000 --> 09:32.000
For those of you that have used ReFrame,

09:32.000 --> 09:34.000
each test can define its own parameters,

09:34.000 --> 09:36.000
its own new variables,

09:36.000 --> 09:37.000
so,

09:37.000 --> 09:40.000
which could be important to,

09:40.000 --> 09:43.000
they are usually important to the performance you get.

09:43.000 --> 09:45.000
Nonetheless, we selected option two,

09:45.000 --> 09:47.000
which is more complete.

09:47.000 --> 09:48.000
And this is,

09:48.000 --> 09:51.000
I'm going to now describe briefly,

09:51.000 --> 09:54.000
a bit of the design and architecture of this feature.

09:54.000 --> 09:55.000
So, essentially,

09:55.000 --> 09:59.000
it's layered, with interfaces between each layer,

09:59.000 --> 10:03.000
so that we can choose and plug in different

10:03.000 --> 10:05.000
implementations for each layer.

10:05.000 --> 10:06.000
So, on the top level,

10:06.000 --> 10:08.000
there is a new CLI interface,

10:08.000 --> 10:12.000
where we added some new command line options.

10:13.000 --> 10:16.000
There's a couple: --list-stored-testcases and --list-stored-sessions,

10:16.000 --> 10:19.000
which will list data of previous sessions

10:19.000 --> 10:21.000
or specific tests,

10:21.000 --> 10:24.000
in a tabular form.

10:24.000 --> 10:26.000
There is its counterpart,

10:26.000 --> 10:29.000
--describe-stored-testcases and --describe-stored-sessions,

10:29.000 --> 10:31.000
which return raw data,

10:31.000 --> 10:32.000
in JSON,

10:32.000 --> 10:35.000
which then you can ingest elsewhere,

10:35.000 --> 10:40.000
and post-process yourself.

10:40.000 --> 10:41.000
There is a new option,

10:41.000 --> 10:42.000
--performance-compare,

10:42.000 --> 10:44.000
that compares past results,

10:44.000 --> 10:47.000
and there are also two other utility options:

10:47.000 --> 10:51.000
to attach new information to the session,

10:51.000 --> 10:53.000
with --session-extras,

10:53.000 --> 10:54.000
or control the

10:54.000 --> 10:55.000
table format.

10:55.000 --> 10:57.000
The analytics layer does,

10:57.000 --> 10:58.000
essentially,

10:58.000 --> 10:59.000
the test case grouping,

10:59.000 --> 11:01.000
the performance aggregations,

11:01.000 --> 11:03.000
and the performance differences,

11:03.000 --> 11:06.000
and returns either tabular data

11:06.000 --> 11:08.000
or JSON data to the layer above,

11:08.000 --> 11:09.000
and at the bottom,

11:09.000 --> 11:11.000
there is a storage layer,

11:11.000 --> 11:12.000
which stores the results

11:12.000 --> 11:13.000
in the database,

11:13.000 --> 11:15.000
and is also responsible for

11:15.000 --> 11:16.000
retrieving the raw results

11:16.000 --> 11:17.000
out of the database,

11:17.000 --> 11:19.000
and doing also the filtering,

11:19.000 --> 11:21.000
based on the various criteria,

11:21.000 --> 11:24.000
and then gives the upper layer

11:24.000 --> 11:26.000
some JSON data.
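
The analytics layer's grouping-and-aggregation step can be sketched in a few lines of plain Python; the field names and values below are made up for illustration and are not ReFrame's internal data model:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical decoded test cases, as the storage layer might hand
# them to the analytics layer.
testcases = [
    {'name': 'stream', 'system': 'eos', 'pvar': 'triad', 'value': 25100.0},
    {'name': 'stream', 'system': 'eos', 'pvar': 'triad', 'value': 24900.0},
    {'name': 'stream', 'system': 'eos', 'pvar': 'copy',  'value': 23400.0},
]

# Group by (name, system, performance variable) and aggregate with mean,
# mimicking the default grouping described above.
groups = defaultdict(list)
for tc in testcases:
    groups[(tc['name'], tc['system'], tc['pvar'])].append(tc['value'])

aggregated = {key: mean(vals) for key, vals in groups.items()}
```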

11:26.000 --> 11:28.000
Now,

11:28.000 --> 11:30.000
some of the implementation details,

11:30.000 --> 11:31.000
so in ReFrame,

11:31.000 --> 11:34.000
there's already the ReFrame report, which

11:34.000 --> 11:37.000
is a big JSON file that ReFrame

11:37.000 --> 11:39.000
produces with all the details,

11:39.000 --> 11:42.000
and this is its structure.

11:42.000 --> 11:44.000
So the structure is,

11:44.000 --> 11:46.000
it's a bit hierarchical,

11:46.000 --> 11:47.000
so you start with the session,

11:47.000 --> 11:49.000
which is essentially a ReFrame

11:49.000 --> 11:51.000
-r (run) invocation,

11:51.000 --> 11:53.000
and it has a session info.

11:53.000 --> 11:54.000
Now, the session info,

11:54.000 --> 11:57.000
it has a unique identifier,

11:57.000 --> 12:00.000
plus information about the session,

12:00.000 --> 12:02.000
which also includes information

12:02.000 --> 12:04.000
now passed with --session-extras,

12:04.000 --> 12:06.000
and then the session contains runs.

12:06.000 --> 12:08.000
Now, if you run,

12:08.000 --> 12:10.000
if you run ReFrame,

12:10.000 --> 12:12.000
your tests may run multiple times,

12:12.000 --> 12:14.000
and that depends on the exact options

12:14.000 --> 12:15.000
that you pass.

12:15.000 --> 12:17.000
For example, if you have --max-retries,

12:17.000 --> 12:18.000
your failing,

12:18.000 --> 12:20.000
your failing tests will be retried a couple of times,

12:20.000 --> 12:23.000
or if you want to just rerun the test multiple times,

12:23.000 --> 12:25.000
that's why there are multiple runs within a session.

12:25.000 --> 12:26.000
Now,

12:26.000 --> 12:27.000
within a run,

12:27.000 --> 12:29.000
there is a set of test cases,

12:29.000 --> 12:31.000
which are the actual tests that have run,

12:31.000 --> 12:33.000
with all the information that your test has,

12:33.000 --> 12:35.000
the variables,

12:36.000 --> 12:37.000
performance references,

12:37.000 --> 12:38.000
thresholds,

12:38.000 --> 12:39.000
actual performance,

12:39.000 --> 12:40.000
that you got,

12:40.000 --> 12:41.000
and so on.

12:41.000 --> 12:44.000
This is like the information we need.

12:44.000 --> 12:45.000
Now,

12:45.000 --> 12:47.000
we store the results in

12:47.000 --> 12:48.000
SQLite,

12:48.000 --> 12:49.000
so we started using it

12:49.000 --> 12:50.000
as an uncomplicated database,

12:50.000 --> 12:52.000
but the deal with the layers is that

12:52.000 --> 12:54.000
if the need comes up in the future,

12:54.000 --> 12:56.000
this could be easily replaced.

12:56.000 --> 12:58.000
So, essentially,

12:58.000 --> 13:03.000
we do index test cases and sessions.

13:04.000 --> 13:06.000
So, practically,

13:06.000 --> 13:07.000
we store,

13:07.000 --> 13:08.000
in the database,

13:08.000 --> 13:11.000
the full JSON blob of the,

13:11.000 --> 13:13.000
of the report.

13:13.000 --> 13:15.000
And then, we index the session,

13:15.000 --> 13:16.000
also, by their UUID,

13:16.000 --> 13:17.000
and their time.

13:17.000 --> 13:18.000
So, then,

13:18.000 --> 13:20.000
we can easily do time-based queries.

13:20.000 --> 13:22.000
And also, the test cases themselves,

13:22.000 --> 13:24.000
are indexed,

13:24.000 --> 13:25.000
first,

13:25.000 --> 13:27.000
by their completion time,

13:27.000 --> 13:29.000
and also by a pseudo-

13:29.000 --> 13:30.000
UID,

13:30.000 --> 13:31.000
let's say:

13:31.000 --> 13:32.000
the session's

13:32.000 --> 13:33.000
own

13:33.000 --> 13:34.000
unique identifier,

13:34.000 --> 13:35.000
the run index,

13:35.000 --> 13:36.000
and the test index,

13:36.000 --> 13:37.000
inside the session.

13:37.000 --> 13:39.000
So, then you have the unique coordinates,

13:39.000 --> 13:41.000
in a specific ReFrame run,

13:41.000 --> 13:43.000
and you can retrieve all the test case information,

13:43.000 --> 13:46.000
and then you can apply filtering stuff.

13:46.000 --> 13:49.000
So, time-based queries,

13:49.000 --> 13:51.000
use this index to retrieve,

13:51.000 --> 13:52.000
the sessions of interest,

13:52.000 --> 13:55.000
then the test cases are decoded,

13:55.000 --> 13:56.000
and then filtered,

13:56.000 --> 13:58.000
and then returned to the upper layer,

13:58.000 --> 14:00.000
for analytics processing,

14:01.000 --> 14:02.000
and similarly,

14:02.000 --> 14:03.000
for sessions,

14:03.000 --> 14:06.000
where only the information of the session is decoded,

14:06.000 --> 14:07.000
to save space,

14:07.000 --> 14:10.000
because you don't want the whole session information.
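
The storage scheme just described — full JSON blobs, indexed by UUID and time so only sessions in the window of interest are decoded — can be sketched with the standard library. The table and column names here are made up for illustration; they are not ReFrame's actual schema:

```python
import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE sessions (
                    uuid TEXT PRIMARY KEY,
                    time_start REAL,
                    blob TEXT)''')
# Index on start time enables cheap time-based queries.
conn.execute('CREATE INDEX idx_time ON sessions(time_start)')

report = {'session_info': {'uuid': 'abc-123'}, 'runs': []}
conn.execute('INSERT INTO sessions VALUES (?, ?, ?)',
             ('abc-123', 1700000000.0, json.dumps(report)))

# Time-based query: decode only the sessions in the window of interest.
rows = conn.execute('SELECT blob FROM sessions WHERE time_start '
                    'BETWEEN ? AND ?', (1699999000.0, 1700001000.0))
decoded = [json.loads(blob) for (blob,) in rows]
```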

14:10.000 --> 14:14.000
So, going on a bit with the syntax,

14:14.000 --> 14:16.000
the general syntax in all those options,

14:16.000 --> 14:18.000
has three parts:

14:18.000 --> 14:19.000
select,

14:19.000 --> 14:20.000
spec,

14:20.000 --> 14:21.000
an aggregation spec,

14:21.000 --> 14:22.000
and the columns,

14:22.000 --> 14:23.000
spec,

14:23.000 --> 14:24.000
which is like a presentation

14:24.000 --> 14:25.000
spec, somehow.

14:25.000 --> 14:26.000
So, this select,

14:26.000 --> 14:28.000
spec defines which results,

14:29.000 --> 14:30.000
we want to select.

14:30.000 --> 14:32.000
So, it can have different forms.

14:32.000 --> 14:33.000
One is like timestamps.

14:33.000 --> 14:34.000
So, you can say here,

14:34.000 --> 14:36.000
from the 25th of January

14:36.000 --> 14:38.000
to the 31st of January,

14:38.000 --> 14:39.000
give me all the results.

14:39.000 --> 14:41.000
Or with abbreviation,

14:41.000 --> 14:43.000
there is also the last seven days,

14:43.000 --> 14:44.000
till now,

14:44.000 --> 14:45.000
or by UUID.

14:45.000 --> 14:46.000
So, you just say,

14:46.000 --> 14:49.000
give me the results from that specific session,

14:49.000 --> 14:52.000
or you can go through session properties,

14:52.000 --> 14:54.000
which usually you set with --session-extras.

14:54.000 --> 14:55.000
So, here,

14:55.000 --> 14:56.000
it says,

14:56.000 --> 14:58.000
all the tests that have

14:58.000 --> 15:00.000
run with a driver version,

15:00.000 --> 15:02.000
576,

15:02.000 --> 15:05.000
on that host name.

15:05.000 --> 15:07.000
Then the aggregation spec,

15:07.000 --> 15:09.000
defines how we want to group

15:09.000 --> 15:12.000
and aggregate the performance results.

15:12.000 --> 15:14.000
Oh, yeah.

15:14.000 --> 15:17.000
By default,

15:17.000 --> 15:19.000
there is a grouping by test name, system,

15:19.000 --> 15:20.000
partition, environment,

15:20.000 --> 15:23.000
and the performance variables and units.

15:24.000 --> 15:27.000
And we can use custom groupings,

15:27.000 --> 15:31.000
and there's a set of available aggregations that you can use.

15:31.000 --> 15:32.000
Then there is the column specs,

15:32.000 --> 15:33.000
where we define what to show;

15:33.000 --> 15:36.000
by default those are the fields that

15:36.000 --> 15:39.000
we have grouped our results by,

15:39.000 --> 15:41.000
but you can add additional fields,

15:41.000 --> 15:45.000
or you can completely use custom columns.

15:45.000 --> 15:46.000
One nice thing, I think,

15:46.000 --> 15:49.000
is that some common filtering options from ReFrame,

15:49.000 --> 15:51.000
like -n or -E,

15:52.000 --> 15:54.000
they can be

15:54.000 --> 15:57.000
reused when you do

15:57.000 --> 15:59.000
an analytics query.

15:59.000 --> 16:00.000
So here,

16:00.000 --> 16:01.000
I have some examples.

16:01.000 --> 16:02.000
So, for example,

16:02.000 --> 16:03.000
here,

16:03.000 --> 16:06.000
it's: list the mean performance of a specific benchmark,

16:06.000 --> 16:08.000
like STREAM, for the last seven days;

16:08.000 --> 16:10.000
this is how you can do that.

16:10.000 --> 16:12.000
Then imagine you have a parameterized test,

16:12.000 --> 16:14.000
where your test has a mode,

16:14.000 --> 16:15.000
different modes,

16:15.000 --> 16:18.000
and is also parameterized over the GPUs on the node,

16:18.000 --> 16:19.000
and you say,

16:19.000 --> 16:21.000
give me the mean across all GPUs on the node,

16:21.000 --> 16:25.000
and I want all nodes that I have tested,

16:25.000 --> 16:27.000
and for all modes.

16:27.000 --> 16:29.000
So here is a query,

16:29.000 --> 16:30.000
where we can get, like,

16:30.000 --> 16:32.000
for a specific driver version,

16:32.000 --> 16:34.000
we can get the information we want.

16:34.000 --> 16:35.000
Then, okay,

16:35.000 --> 16:37.000
I want to compare all the benchmark data that you have

16:37.000 --> 16:41.000
between two driver versions,

16:41.000 --> 16:43.000
and yeah.

16:43.000 --> 16:44.000
And then,

16:44.000 --> 16:46.000
there are also some examples here,

16:46.000 --> 16:47.000
I'm going to skip them

16:47.000 --> 16:49.000
of getting like some information.

16:49.000 --> 16:52.000
And if you want to get like the raw JSON report,

16:52.000 --> 16:53.000
yeah,

16:53.000 --> 16:55.000
you can still get it with --describe-stored-sessions,

16:55.000 --> 16:58.000
and then you can post-process the way you want.

16:58.000 --> 17:00.000
Or just list it as CSV,

17:00.000 --> 17:02.000
and just get the information that you need.

17:02.000 --> 17:04.000
And here is an example.

17:04.000 --> 17:07.000
So this feature is available in ReFrame 4.7,

17:07.000 --> 17:10.000
which is the latest version.

17:10.000 --> 17:12.000
It's disabled by default,

17:12.000 --> 17:13.000
so you have to enable it,

17:13.000 --> 17:16.000
and then you can also customize where you want the results

17:16.000 --> 17:17.000
to be stored.

17:17.000 --> 17:19.000
And here is an actual query,

17:19.000 --> 17:23.000
and you see how it shows up

17:23.000 --> 17:26.000
with a performance table.

17:26.000 --> 17:29.000
So you have the value of

17:29.000 --> 17:30.000
your first set,

17:30.000 --> 17:31.000
your second set,

17:31.000 --> 17:33.000
and also the difference between the two.

17:33.000 --> 17:35.000
So you can easily spot,

17:35.000 --> 17:36.000
you know,

17:36.000 --> 17:39.000
regressions that are smaller than the thresholds,

17:39.000 --> 17:42.000
that is, still within the thresholds.
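
Spotting a within-threshold regression from such a comparison table boils down to computing a relative difference; a trivial sketch with hypothetical numbers:

```python
def percent_diff(value_a: float, value_b: float) -> float:
    """Relative change of B with respect to A, in percent."""
    return (value_b - value_a) / value_a * 100.0

# Hypothetical means of the same metric under two driver versions:
# a -2% change sits inside a +/-3% threshold band, yet is clearly
# visible as a difference column in the comparison table.
diff = percent_diff(25000.0, 24500.0)
```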

17:42.000 --> 17:44.000
And we have like two or three minutes,

17:44.000 --> 17:46.000
if we are very quick.

17:46.000 --> 17:49.000
I'm going to really quickly describe how we use ReFrame and

17:49.000 --> 17:50.000
how we use this feature.

17:50.000 --> 17:52.000
So as I said, it's very important for us

17:52.000 --> 17:55.000
to check each hardware component,

17:55.000 --> 17:57.000
because if one of them is behaving anomalously,

17:57.000 --> 17:59.000
it can slow down your whole HPC

17:59.000 --> 18:00.000
or deep learning training.

18:00.000 --> 18:02.000
So we need ReFrame tests,

18:02.000 --> 18:03.000
basically,

18:03.000 --> 18:04.000
running on each GPU,

18:04.000 --> 18:05.000
each HCA,

18:05.000 --> 18:07.000
GPU memory,

18:07.000 --> 18:08.000
every SSD.

18:08.000 --> 18:09.000
So every box,

18:09.000 --> 18:12.000
basically on the DGX H100 diagram,

18:12.000 --> 18:15.000
needs to be properly checked for performance,

18:15.000 --> 18:16.000
and stability,

18:16.000 --> 18:18.000
and that's what we're doing.

18:18.000 --> 18:19.000
That's why we're using ReFrame.

18:19.000 --> 18:22.000
So we're using Slurm, through ReFrame's Slurm support,

18:22.000 --> 18:25.000
and we have our own container runtime

18:25.000 --> 18:26.000
that's also open source, called Enroot,

18:26.000 --> 18:28.000
and Pyxis is the Slurm integration

18:28.000 --> 18:29.000
for this container runtime.

18:29.000 --> 18:31.000
And we use a lot of open source projects

18:31.000 --> 18:33.000
for the testing,

18:33.000 --> 18:34.000
like NCCL tests,

18:34.000 --> 18:37.000
and then bandwidth tests for GPU memory,

18:37.000 --> 18:39.000
or MPI tests,

18:39.000 --> 18:40.000
or the famous STREAM benchmark,

18:40.000 --> 18:41.000
FIO for disks,

18:41.000 --> 18:43.000
and we have single-node tests

18:43.000 --> 18:44.000
that we are going to test,

18:44.000 --> 18:45.000
as I said, each component,

18:45.000 --> 18:47.000
and we have kind of higher-level tests

18:47.000 --> 18:48.000
that are closer,

18:48.000 --> 18:50.000
maybe to what users are running,

18:50.000 --> 18:52.000
but are very important for performance

18:52.000 --> 18:53.000
prediction,

18:53.000 --> 18:54.000
and also,

18:54.000 --> 18:56.000
things that are multi-node,

18:56.000 --> 18:57.000
because you cannot test multi-node things,

18:57.000 --> 18:58.000
obviously the network,

18:58.000 --> 18:59.000
on just one node.

18:59.000 --> 19:00.000
So we have two types of tests,

19:00.000 --> 19:03.000
and actually the only people

19:03.000 --> 19:05.000
in our team that use the ReFrame CLI,

19:05.000 --> 19:07.000
well, the ReFrame CLI, is me

19:07.000 --> 19:08.000
and Vasileios,

19:08.000 --> 19:11.000
and our users actually use GitLab CI,

19:11.000 --> 19:12.000
and they say,

19:12.000 --> 19:14.000
I want to run on this cluster ABC,

19:14.000 --> 19:17.000
I want to run the single node flavor,

19:17.000 --> 19:18.000
and I want the short version,

19:18.000 --> 19:19.000
so like 30 minutes,

19:19.000 --> 19:21.000
and they click run pipeline in GitLab CI,

19:21.000 --> 19:22.000
and boom,

19:22.000 --> 19:24.000
they get the per-node runs,

19:24.000 --> 19:25.000
and if a node fails,

19:25.000 --> 19:27.000
they can go to it,

19:27.000 --> 19:28.000
and click on it,

19:28.000 --> 19:29.000
and look at the ReFrame log.

19:29.000 --> 19:33.000
So that's the way we integrated ReFrame, a CLI tool,

19:33.000 --> 19:35.000
into something that our admins can use,

19:35.000 --> 19:37.000
without adding them to no reframing.

19:38.000 --> 19:41.000
And ReFrame even supports a JUnit export,

19:41.000 --> 19:43.000
and GitLab CI also supports JUnit import.

19:43.000 --> 19:45.000
So you can click on the node,

19:45.000 --> 19:46.000
and see,

19:46.000 --> 19:48.000
oh, actually this is the ReFrame log directly,

19:48.000 --> 19:52.000
the ReFrame log is shown directly to the user right here.
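
The glue between ReFrame and GitLab CI can be a single parameterized job; below is a hypothetical sketch (the job name, variable names, and tag names are invented, not our actual pipeline), wiring ReFrame's `--report-junit` output into GitLab's JUnit report support:

```yaml
# Hypothetical .gitlab-ci.yml fragment; CLUSTER/FLAVOR/DURATION and the
# tag names are illustrative, not the actual pipeline.
run-benchmarks:
  variables:
    CLUSTER: "abc"          # which cluster configuration to load
    FLAVOR: "single-node"   # single-node vs multi-node test set
    DURATION: "short"       # short (~30 min) vs long run
  script:
    - >
      reframe -C config/${CLUSTER}.py -c checks/
      -t ${FLAVOR} -t ${DURATION}
      --report-junit=report.xml -r
  artifacts:
    when: always
    reports:
      junit: report.xml     # GitLab renders per-test pass/fail in the UI
```

With the JUnit artifact in place, GitLab shows each failed check in the pipeline view, so admins never have to read raw ReFrame output unless they want to.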

19:52.000 --> 19:55.000
And I think that's the last slide,

19:55.000 --> 19:57.000
and right on time.

19:57.000 --> 20:00.000
And I think this is great,

20:00.000 --> 20:03.000
because that allows us to have more insights,

20:03.000 --> 20:05.000
and we can,

20:05.000 --> 20:07.000
when we run in GitLab CI,

20:07.000 --> 20:08.000
we populate the database,

20:08.000 --> 20:11.000
and then when people ask us in our team,

20:11.000 --> 20:14.000
hey, can you compare between this NVIDIA driver version,

20:14.000 --> 20:17.000
and that driver version, to verify everything is fine?

20:17.000 --> 20:19.000
We just run one command,

20:19.000 --> 20:22.000
and we give them the table in ASCII-art format,

20:22.000 --> 20:23.000
but obviously there are

20:23.000 --> 20:26.000
a lot of next steps: to get more insight into the statistics,

20:26.000 --> 20:29.000
to do more comparisons on the statistics,

20:29.000 --> 20:31.000
and also a big open question is how

20:31.000 --> 20:33.000
we make that more accessible to users,

20:33.000 --> 20:35.000
like we did for GitLab CI,

20:35.000 --> 20:37.000
and also, on the query side,

20:37.000 --> 20:39.000
the latency of the queries is still a bit slow,

20:39.000 --> 20:41.000
so that's something that,

20:41.000 --> 20:43.000
Vasileios will work on.

20:43.000 --> 20:45.000
Thank you.

20:46.000 --> 20:48.000
Thank you.

20:54.000 --> 20:56.000
Any questions for Felix and Vasileios?

20:56.000 --> 20:57.000
Yeah.

20:57.000 --> 20:58.000
Do you think,

20:58.000 --> 20:59.000
can ReFrame cope with the case

20:59.000 --> 21:02.000
where your CI might not have identical nodes,

21:02.000 --> 21:05.000
so it might be a worse CPU on one,

21:05.000 --> 21:06.000
like,

21:06.000 --> 21:07.000
like,

21:07.000 --> 21:08.000
[inaudible]

21:08.000 --> 21:11.000
you know,

21:11.000 --> 21:12.000
in this case?

21:12.000 --> 21:13.000
So, the question is,

21:13.000 --> 21:14.000
oh yeah,

21:14.000 --> 21:15.000
do we support,

21:15.000 --> 21:16.000
can we support,

21:16.000 --> 21:19.000
in ReFrame the case where we have multiple types of nodes?

21:19.000 --> 21:20.000
Yeah.

21:20.000 --> 21:22.000
Yes, we do support this use case;

21:22.000 --> 21:24.000
in ReFrame you can have multiple performance targets already,

21:24.000 --> 21:26.000
saying if the node is this type,

21:26.000 --> 21:27.000
you get this performance target,

21:27.000 --> 21:28.000
and if it's that type,

21:28.000 --> 21:29.000
that performance target,

21:29.000 --> 21:30.000
and for this feature,

21:30.000 --> 21:32.000
you can add,

21:32.000 --> 21:34.000
arbitrary metadata,

21:34.000 --> 21:35.000
to the database,

21:35.000 --> 21:36.000
saying,

21:36.000 --> 21:38.000
I want to run this workload,

21:38.000 --> 21:41.000
and I'm going to add an arbitrary tag called,

21:41.000 --> 21:42.000
RWA1,

21:42.000 --> 21:43.000
and,

21:43.000 --> 21:44.000
and then if you have a different node,

21:44.000 --> 21:45.000
you'll say,

21:45.000 --> 21:46.000
I want to run with RWA2,

21:46.000 --> 21:47.000
and then you can ask,

21:47.000 --> 21:48.000
this feature,

21:48.000 --> 21:49.000
you can say,

21:49.000 --> 21:52.000
compare all the results just on RWA1.
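
In ReFrame terms, the per-node-type targets live in a test's `reference` dictionary, keyed by `system:partition`. A plain-Python sketch of how such a reference is looked up and checked (system names and numbers are made up; the `(target, lower, upper, unit)` tuple with relative thresholds follows ReFrame's convention):

```python
# Hypothetical reference table in ReFrame's style:
# {'system:partition': {'metric': (target, lower_thres, upper_thres, unit)}}
reference = {
    "clusterA:gpu": {"triad_bw": (400000, -0.05, None, "MB/s")},
    "clusterB:gpu": {"triad_bw": (300000, -0.05, None, "MB/s")},
}

def within_reference(system: str, metric: str, value: float) -> bool:
    """Check a measured value against (target, lower, upper) thresholds,
    where thresholds are relative deviations from the target and None
    means unbounded on that side."""
    target, lower, upper, _unit = reference[system][metric]
    lo = target * (1 + lower) if lower is not None else float("-inf")
    hi = target * (1 + upper) if upper is not None else float("inf")
    return lo <= value <= hi

print(within_reference("clusterA:gpu", "triad_bw", 395000))  # → True
print(within_reference("clusterB:gpu", "triad_bw", 275000))  # → False
```

So the same test can carry a different target per node type, and a measurement only fails against the target that applies to the partition it ran on.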

21:52.000 --> 21:54.000
So, yeah,

21:54.000 --> 21:55.000
we can do that.

21:55.000 --> 21:56.000
Anything to add?

21:56.000 --> 21:57.000
No,

21:57.000 --> 21:58.000
from,

21:58.000 --> 22:00.000
also from the ReFrame test side,

22:00.000 --> 22:03.000
you can support like multiple clusters at the same time,

22:03.000 --> 22:05.000
and then in your test,

22:05.000 --> 22:08.000
you can have constraints for your test,

22:08.000 --> 22:09.000
say for example,

22:09.000 --> 22:10.000
this test is for GPU,

22:10.000 --> 22:12.000
and then ReFrame automatically,

22:12.000 --> 22:14.000
will only select your test

22:14.000 --> 22:16.000
for a configuration that has,

22:16.000 --> 22:17.000
for example,

22:17.000 --> 22:18.000
GPU.
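
The constraint mechanism described here is ReFrame's `valid_systems` attribute, where an entry like `'+gpu'` selects only partitions that declare a `gpu` feature in the site configuration. A plain-Python sketch of that matching (the partition names and feature sets are made up):

```python
# Hypothetical partitions with their declared features, as they would
# appear in a ReFrame site configuration
partitions = {
    "clusterA:gpu":   {"gpu", "nvlink"},
    "clusterA:cpu":   {"cpu"},
    "clusterB:login": set(),
}

def eligible(valid_systems: list[str], features: set[str]) -> bool:
    """Mimic ReFrame's '+feature' constraint: every '+x' entry must be
    present in the partition's feature set."""
    required = {s[1:] for s in valid_systems if s.startswith("+")}
    return required <= features

# A GPU-only test (valid_systems = ['+gpu']) is selected only where the
# 'gpu' feature is declared.
gpu_parts = [p for p, feats in partitions.items()
             if eligible(["+gpu"], feats)]
print(gpu_parts)  # → ['clusterA:gpu']
```

This is why the same test suite can target multiple clusters at once: each check self-selects the partitions it is valid for.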

22:18.000 --> 22:20.000
So, yeah,

22:20.000 --> 22:21.000
that's it.

22:21.000 --> 22:23.000
Are there other questions?

22:23.000 --> 22:25.000
Yeah, any more questions?

22:29.000 --> 22:30.000
No one?

22:30.000 --> 22:31.000
OK,

22:31.000 --> 22:32.000
well,

22:32.000 --> 22:33.000
then,

22:33.000 --> 22:35.000
thank you very much.

22:35.000 --> 22:37.000
Thank you.

22:37.000 --> 22:39.000
Thank you.

