WEBVTT

00:00.000 --> 00:11.240
Hello, everyone. It's my pleasure to be here and speak in front of you. I'm going

00:11.240 --> 00:17.520
to talk about reduced precision in scientific computing, and the main concern here is the

00:17.520 --> 00:25.360
accuracy, while keeping the speed-up. Many scientific applications

00:25.360 --> 00:32.600
are highly dependent on linear algebra computations, and we often consider

00:32.600 --> 00:38.800
FP64 as the accurate baseline. But nowadays many GPUs are providing

00:38.800 --> 00:45.760
reduced precision with a huge number of floating-point operations per second. So we are interested

00:45.760 --> 00:53.120
in using that while still providing good accuracy. The first thing that comes

00:53.120 --> 01:01.240
to mind about reduced precision is to start the computation by finding a first solution

01:01.240 --> 01:08.600
of reasonably good quality and then start refining it. With that in mind, I was reading

01:08.600 --> 01:14.160
a paper and thought: maybe I can replicate that experiment and see what

01:14.160 --> 01:19.560
happens if we forget the initial solution provided by reduced precision and just

01:19.560 --> 01:31.080
use random numbers as the initial point. I did that and got some interesting results.

01:31.080 --> 01:35.960
The accuracy was almost the same. The forward error means how close we are to the correct

01:35.960 --> 01:41.160
solution, and the backward error means how much modification of the input data we need

01:41.160 --> 01:48.680
to reach the computed solution. The two cases were very similar. You can see them here.
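The experiment can be sketched roughly as follows. This is a NumPy stand-in, not the talk's MATLAB code: float32 simulates the reduced-precision first solve, a full-precision solve stands in for the refinement step, and the function names are illustrative.

```python
import numpy as np

def refine(A, b, x0, n_iter=10):
    """Iteratively refine an initial guess x0 for Ax = b.

    For simplicity the correction uses a full-precision solve; in the
    talk this role is played by the reduced-precision LU factors.
    """
    x = x0.copy()
    for _ in range(n_iter):
        r = b - A @ x                  # residual in working precision
        x = x + np.linalg.solve(A, r)  # apply the correction
    return x

rng = np.random.default_rng(0)
n = 200
A = rng.standard_normal((n, n))
x_true = rng.standard_normal(n)
b = A @ x_true

# Initial guess 1: a reduced-precision solve (simulated with float32).
x_low = np.linalg.solve(A.astype(np.float32),
                        b.astype(np.float32)).astype(np.float64)
# Initial guess 2: plain random numbers.
x_rand = rng.standard_normal(n)

for x0 in (x_low, x_rand):
    x = refine(A, b, x0)
    # Forward error: how close we are to the correct solution.
    fwd = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
    # Backward error: how much modification of the data we would need.
    bwd = np.linalg.norm(b - A @ x) / (
        np.linalg.norm(A) * np.linalg.norm(x) + np.linalg.norm(b))
    print(f"forward error {fwd:.2e}, backward error {bwd:.2e}")
```

In this toy setting both starting points converge to errors of the same order, which is the effect the slide illustrates.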

01:48.680 --> 01:55.320
They have a similar order of magnitude, and in that part even random initialization is

01:55.320 --> 02:03.320
better. At that point the backward error is a little different, but I think we can

02:03.320 --> 02:10.400
consider them very close to each other. As for timing, I implemented this

02:10.400 --> 02:17.000
idea in MATLAB, so it is a simulation, not real timing, but we still have to pay for it.

02:17.000 --> 02:26.600
Finding a first solution in reduced precision is costly; it is not free. So I am going back

02:26.600 --> 02:34.040
to this idea: we have this capability in the GPU, so how are we able to use it? The

02:34.040 --> 02:39.440
geometric perspective is that finding the intersection between hyperplanes gives the solution

02:39.440 --> 02:44.640
of the linear system of equations. And we implement that in the classical

02:44.640 --> 02:51.280
way that we see in the LAPACK package. First we have to do the panel factorization, and

02:51.280 --> 03:00.720
based on that we use DTRSM and DGEMM. Those two parallelize really well, but the other

03:00.720 --> 03:05.520
one, the panel factorization, does not, because to keep the accuracy every time we have

03:05.520 --> 03:11.760
to scan the column for the maximum value and reorder the rows. But the rest of them

03:11.760 --> 03:18.000
is GEMM, and GEMM is implemented very well on the GPU. So the idea is: okay, send the

03:18.000 --> 03:24.800
panel factorization to the CPU, and while performing the trailing update you can compute

03:24.800 --> 03:31.840
the next panel. So somehow you go in parallel and get the solution as fast as possible.
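The blocked pattern described here can be sketched as follows. This is a minimal NumPy illustration of the classical right-looking LU loop, not the actual GPU code; pivoting is omitted for clarity, so the test matrix is made diagonally dominant to keep the elimination safe.

```python
import numpy as np

def blocked_lu(A, nb=4):
    """Right-looking blocked LU without pivoting (for clarity only).

    Each step mirrors the LAPACK pattern:
      1. factor the current panel (GETRF-like, the sequential part),
      2. triangular solve on the block row (TRSM),
      3. rank-nb update of the trailing matrix (GEMM).
    """
    A = A.astype(np.float64).copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # 1. Unblocked elimination of the panel A[k:, k:e].
        for j in range(k, e):
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:e] -= np.outer(A[j+1:, j], A[j, j+1:e])
        if e < n:
            # 2. TRSM: solve L11 @ X = A[k:e, e:] with unit lower L11.
            L11 = np.tril(A[k:e, k:e], -1) + np.eye(e - k)
            A[k:e, e:] = np.linalg.solve(L11, A[k:e, e:])
            # 3. GEMM: trailing-matrix update.
            A[e:, e:] -= A[e:, k:e] @ A[k:e, e:]
    return A

rng = np.random.default_rng(1)
n = 12
M = rng.standard_normal((n, n)) + n * np.eye(n)  # safe without pivoting
F = blocked_lu(M)
L = np.tril(F, -1) + np.eye(n)
U = np.triu(F)
print(np.allclose(L @ U, M))   # the factors reproduce M
```

Steps 2 and 3 are the GEMM-heavy work that stays on the GPU; step 1 is the part the talk proposes offloading or restructuring.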

03:34.000 --> 03:40.320
That idea is very sensitive to the architecture you are using. If you have

03:40.400 --> 03:46.880
a super-fast GPU, a slow CPU, and a problematic connection between them, you lose

03:46.880 --> 03:55.520
time and the collaboration does not work. So it is better to keep everything inside the

03:55.520 --> 04:07.840
GPU, and here is a test where you can see up to a 4x improvement. So now, how are we

04:08.640 --> 04:14.640
able to make this part faster? I am going to talk about the short life and the normal life.

04:14.640 --> 04:22.160
Imagine you are able to live twice. The first time you see the consequences of your decisions,

04:22.160 --> 04:30.000
and based on that, in your normal life you can stop overthinking and just live in an easy way.

04:30.000 --> 04:37.280
We are going to use that concept for the computation: instead of doing it just once,

04:37.360 --> 04:42.800
we do it twice. It is more costly, with more data movement. But since we are changing the type of the

04:42.800 --> 04:54.640
algorithm, we are able to get more benefit and end up with something faster. First we find the

04:54.640 --> 05:01.680
pivot list, and then we perform the algorithm without pivoting; because of that concept

05:01.680 --> 05:07.040
we are not searching for the pivot anymore. So we can just perform the LU factorization

05:07.200 --> 05:14.880
of the top square part of the panel and use TRSM, which behaves somewhat like a GEMM, for the rest

05:14.880 --> 05:21.040
of it. So we increase the opportunity to run most of the work in parallel.
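This two-pass "short life / normal life" idea can be sketched as follows, assuming NumPy. The float32 first pass is an illustrative stand-in for reduced precision, and the function names are mine, not the talk's implementation.

```python
import numpy as np

def pivot_order_low_precision(A):
    """Pass 1 (the 'short life'): partial-pivoted elimination on a
    float32 copy, kept only to record the row order; the numerical
    result itself is discarded."""
    B = A.astype(np.float32)           # reduced-precision working copy
    n = B.shape[0]
    perm = np.arange(n)
    for j in range(n - 1):
        p = j + np.argmax(np.abs(B[j:, j]))   # pivot search
        B[[j, p]] = B[[p, j]]                 # row swap
        perm[[j, p]] = perm[[p, j]]           # record the swap
        B[j+1:, j] /= B[j, j]
        B[j+1:, j+1:] -= np.outer(B[j+1:, j], B[j, j+1:])
    return perm

def lu_no_pivot(A):
    """Pass 2 (the 'normal life'): unpivoted LU in full precision on
    the pre-ordered rows -- no pivot search, no row swaps."""
    A = A.astype(np.float64).copy()
    n = A.shape[0]
    for j in range(n - 1):
        A[j+1:, j] /= A[j, j]
        A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])
    return np.tril(A, -1) + np.eye(n), np.triu(A)

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 50))
perm = pivot_order_low_precision(A)   # cheap reduced-precision pass
L, U = lu_no_pivot(A[perm])           # fast pivot-free pass
print(np.allclose(L @ U, A[perm]))
```

Because the rows arrive pre-ordered, the second pass has no column scan or data-dependent reordering, which is what makes it friendly to parallel hardware.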

05:22.800 --> 05:31.760
And by reducing to FP16 we are able to load four times more data into the GPU and find the pivots

05:31.760 --> 05:40.560
for larger matrices. We introduce several algorithms. One of them is XPRP, which uses

05:40.560 --> 05:48.400
a mix of FP32 and FP16. Another is a completely half-precision version, HPRP, and the other

05:48.400 --> 05:56.960
one is MPF. Those first two versions apply the pivot list to the whole matrix, but this one

05:57.520 --> 06:04.000
works on the panel and just finds the pivots of the panel. So let's see

06:05.360 --> 06:12.960
how accurate they are. If we use PRP for the whole matrix, after some dimension we can see

06:12.960 --> 06:21.920
that we lose accuracy: for HPRP after 4K; for XPRP, which is mixed but applied to the

06:21.920 --> 06:29.520
whole matrix, after 20K. But for MPF the accuracy we obtain is identical to

06:29.520 --> 06:39.920
double precision. Now let's look at the timing. Depending on the architecture,

06:40.560 --> 06:46.720
in some cases we are able to improve and in some cases not. For example, this one, which is

06:46.720 --> 06:54.400
from the workstation GPU: we get an improvement for small-sized matrices. That one is

06:54.400 --> 07:00.640
for the A100; for larger dimensions we get an improvement, because the cache and the

07:00.640 --> 07:08.320
architecture are completely different. And that one is for the RTX 3090; the behavior is

07:08.400 --> 07:15.280
similar to the Volta. And that's it. Thanks for listening.

07:21.520 --> 07:27.120
All right. You actually finished a bit early, so there's at least time for one question.

07:28.320 --> 07:31.920
Do we have any questions? Yes, over there.

07:38.400 --> 07:49.280
I think we can apply that. You are asking what will happen, or how we are able to apply,

07:49.280 --> 07:55.920
that method to sparse matrices. Sparse matrices come from real-world applications, and

07:55.920 --> 08:02.560
you can see that some elements are very large and some very small. So you should consider

08:02.560 --> 08:11.760
normalizing the data before starting, to avoid very large values, because in FP16

08:11.760 --> 08:16.720
we are not able to represent very large numbers. That is one consideration, and another is that

08:19.920 --> 08:27.520
I think we are able to apply this method to multifrontal matrices, because we can consider

08:27.520 --> 08:32.720
each part of the matrix as a panel and apply the MPF method to it.
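The normalization mentioned in the answer can be sketched as a two-sided diagonal scaling. This is an illustrative NumPy sketch, not the speaker's code; powers of two are used so the scaling introduces no rounding error of its own.

```python
import numpy as np

rng = np.random.default_rng(3)
# A matrix whose entries span many orders of magnitude -- far wider a
# range than FP16 (maximum ~65504) can hold without scaling.
A = rng.standard_normal((6, 6)) * 10.0 ** rng.integers(-6, 8, size=(6, 6))

# Two-sided diagonal scaling: divide each row, then each column, by a
# power of two near its largest magnitude, so all entries land in [-2, 2].
r = 2.0 ** np.floor(np.log2(np.max(np.abs(A), axis=1)))
As = A / r[:, None]
c = 2.0 ** np.floor(np.log2(np.max(np.abs(As), axis=0)))
As = As / c[None, :]

A16 = As.astype(np.float16)    # now safe to cast down
print(np.isfinite(A16).all())  # no overflow to inf
# A is recovered (up to float16 rounding, and underflow of the very
# smallest entries) as diag(r) @ A16 @ diag(c).
```

The trade-off is that entries many orders of magnitude below the row/column maxima may underflow to zero in FP16, which is exactly why this step needs care for sparse matrices from real applications.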

