WEBVTT

03:54.000 --> 03:58.000
Thank you. [inaudible]

03:58.000 --> 03:59.800
Perhaps, okay, good.

03:59.800 --> 04:02.080
So let's start from the beginning then.

04:02.080 --> 04:04.480
So thank you very much.

04:04.480 --> 04:09.800
So this is about a form of multi-precision arithmetic.

04:09.800 --> 04:13.600
And the problem with multi-precision arithmetic

04:13.600 --> 04:17.040
is that it is more cost-intensive.

04:17.040 --> 04:18.960
You have a cost overhead.

04:18.960 --> 04:22.520
So that's where the parallel computing comes in.

04:22.520 --> 04:27.160
So multiple double arithmetic allows you

04:27.160 --> 04:29.920
to work more precisely, but you still

04:29.920 --> 04:32.000
work with floating-point arithmetic.

04:32.000 --> 04:36.840
So we are looking into making the numerical algorithms

04:36.840 --> 04:40.520
give them a broader range.

04:40.520 --> 04:43.600
Now to make it efficient and parallel,

04:43.600 --> 04:47.400
I'm trying to vectorize the multiple double arithmetic.

04:47.400 --> 04:51.400
And I'm going to explain how you can do that.

04:51.400 --> 04:53.640
There are two types of parallel computations.

04:53.640 --> 04:58.400
So the first one is at a high level, very straightforward.

04:58.400 --> 05:01.640
The harder one is actually not really straightforward.

05:01.640 --> 05:05.840
And I will just give some hints at that.

05:05.840 --> 05:10.040
Okay, so I'm teaching also computational geometry this semester.

05:10.040 --> 05:14.080
So the definition of a robust algorithm,

05:14.080 --> 05:17.520
a robust algorithm actually performs in the same way

05:17.520 --> 05:22.560
if you start to tweak the input a little bit.

05:22.560 --> 05:27.520
So typically we work with 64 bit doubles,

05:27.520 --> 05:28.960
but we can extend it.

05:28.960 --> 05:34.440
So the idea to extend 64-bit float

05:34.440 --> 05:38.520
arithmetic using floating-point arithmetic goes back

05:38.520 --> 05:41.520
to the late 60s, early 70s.

05:41.520 --> 05:42.880
At some point,

05:42.880 --> 05:46.280
they were still forced to work with 32 bits.

05:46.280 --> 05:51.960
But they could actually figure out how to create more precision.

05:51.960 --> 05:56.360
So I was using the QD library a lot,

05:56.360 --> 06:00.240
and there is also the CAMPARY library.

06:00.240 --> 06:04.400
My goal is actually to go to GPU acceleration.

06:04.400 --> 06:08.880
So I'm presenting this on a gaming laptop,

06:08.880 --> 06:13.120
which is actually already about as powerful as the graphics cards

06:13.120 --> 06:18.120
and the big servers that I was using 10, 12 years ago.

06:18.120 --> 06:22.600
So computers were actually too fast.

06:22.600 --> 06:24.520
For users, it's not a problem.

06:24.520 --> 06:26.160
But if you are a software developer

06:26.160 --> 06:29.160
and you really want to exploit your computer,

06:29.160 --> 06:31.000
it's a big problem.

06:31.000 --> 06:33.800
Okay, what is a multiple double?

06:33.800 --> 06:37.960
It's simply a sequence of several doubles.

06:37.960 --> 06:40.200
So if you work with complex arithmetic,

06:40.200 --> 06:43.920
you have a real part and you have an imaginary part,

06:43.920 --> 06:44.840
two doubles.

06:44.840 --> 06:48.040
Well, double double is actually very much the same.

06:48.040 --> 06:52.800
Now, if your result can be represented exactly in one double,

06:52.800 --> 06:55.600
like you want to compute the number eight here,

06:55.600 --> 06:59.080
then your second double is going to give you

06:59.080 --> 07:02.960
the accuracy at which you compute it.

07:02.960 --> 07:06.520
So with double double, you have about 32 decimal places,

07:06.520 --> 07:07.520
then you double again,

07:07.520 --> 07:12.480
and then you have about 64 decimal places, and so on.

07:12.480 --> 07:13.680
So that's the good thing.

07:13.680 --> 07:16.000
So you can still do numerical analysis.

07:16.000 --> 07:18.960
You kind of have an error estimate.

07:18.960 --> 07:21.680
That is, if your result

07:21.680 --> 07:25.560
can be exactly represented by a 64-bit double.

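[Editor's note: a minimal C++ sketch, not the QD library's actual API, of the error-free addition behind a double double: the low double captures exactly what rounding drops from the high double.]

```cpp
#include <cstdio>

// Knuth's two-sum: a + b == s + e exactly, for any two doubles.
void two_sum(double a, double b, double &s, double &e) {
    s = a + b;
    double bv = s - a;        // the part of b that made it into s
    double av = s - bv;       // the part of a that made it into s
    e = (a - av) + (b - bv);  // exactly what rounding dropped
}

int main() {
    double s, e;
    two_sum(1.0, 1e-30, s, e);
    // s is the rounded sum (exactly 1.0 here); e recovers the 1e-30
    // that a single double had to discard: the "error estimate".
    std::printf("high = %g  low = %g\n", s, e);
}
```
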
07:25.560 --> 07:28.920
The drawback is that you're not going to notice it

07:28.920 --> 07:31.680
all that much with double double arithmetic.

07:31.680 --> 07:36.000
Actually, your problems are going to be compute bound.

07:36.000 --> 07:39.520
So your processor is going to work harder.

07:39.520 --> 07:42.080
But when you get to quad double,

07:42.080 --> 07:47.680
actually, you may want to do some parallelism.

07:47.680 --> 07:50.960
And then, of course, you can double it up.

07:50.960 --> 07:54.200
So you see the average number of floating point

07:54.200 --> 07:56.920
operations that you will have to do

07:56.920 --> 08:01.760
to do an addition, a multiplication, and a division.

08:01.760 --> 08:05.320
So here is why I am doing this.

08:05.320 --> 08:07.560
I'm working with Taylor series.

08:07.560 --> 08:10.640
So you may remember Taylor series from Calculus.

08:10.640 --> 08:13.480
So here is a very nice series.

08:13.480 --> 08:18.640
So the exponential starts with the leading coefficient one,

08:18.640 --> 08:20.640
but then the coefficients actually

08:20.640 --> 08:24.760
they go down very, very fast as a factorial.

08:24.760 --> 08:28.520
If you only need eight terms, then double precision will do.

08:28.520 --> 08:32.400
But if you need more, so for example, if you need 32

08:32.400 --> 08:35.040
terms in your Taylor series

08:35.040 --> 08:39.280
to represent the exponential correctly.

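[Editor's note: a small demo of the factorial decay just described. Relative to the leading coefficient 1, the Taylor coefficients 1/k! of the exponential fall below the roughly 1e-16 relative precision of a double, which is why long truncated series need multiple double arithmetic.]

```cpp
#include <cstdio>

int main() {
    double coeff = 1.0;  // 1/0!
    for (int k = 0; k <= 32; ++k) {
        std::printf("1/%2d! = %9.3e%s\n", k, coeff,
                    coeff < 1e-16 ? "  <- below double precision" : "");
        coeff /= (k + 1);  // next coefficient, 1/(k+1)!
    }
}
```
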
08:39.280 --> 08:42.200
So to also have these tiny coefficients,

08:42.200 --> 08:45.760
you actually will need quad double arithmetic.

08:45.760 --> 08:50.320
And actually 32 is a good number if you think about the warp size

08:50.320 --> 08:51.520
in a GPU.

08:51.520 --> 08:57.160
So typically, we would like to work with 64 or 128.

08:57.160 --> 08:59.680
OK, so here is the idea.

08:59.680 --> 09:05.400
So what happens is that when you do an operation with a double

09:05.400 --> 09:08.120
double, it's going to do the operation,

09:08.120 --> 09:11.600
and then it's going to renormalize it again.

09:11.600 --> 09:15.160
So if you do a quad double, it's going to also renormalize.

09:15.160 --> 09:20.360
Make sure that all the doubles that represent your quad double

09:20.360 --> 09:24.240
are actually going to be non-overlapping.

09:24.240 --> 09:29.760
So that is going to have a lot of tests, ifs, and things like that.

09:29.760 --> 09:32.560
The idea is if you vectorize, you actually

09:32.560 --> 09:37.880
would like to postpone the renormalization.

09:37.880 --> 09:41.000
So here is if you want to add a sequence,

09:41.000 --> 09:45.840
so you have 52 bits of precision in your fraction,

09:45.840 --> 09:50.960
you simply cut your 52 bits in half and you insert 0s.

09:50.960 --> 09:53.800
So these 0s are going to fill up.

09:53.800 --> 09:56.000
So you could also think about if you want

09:56.000 --> 09:58.800
to do exact integer arithmetic, you

09:58.800 --> 10:02.320
would split your 64 bit into 30 bits,

10:02.320 --> 10:05.360
and you would add 0s up front.

10:05.360 --> 10:09.440
And these numbers will actually fill up.

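[Editor's note: a minimal sketch of this splitting step. The classic Dekker/Veltkamp split cuts the 53-bit significand of a double into two halves of at most 26 significant bits each, so that products of halves fit exactly in a double; the inserted zeros the talk mentions are the freed-up low bits.]

```cpp
// Veltkamp split: x == hi + lo exactly, each half with at most
// ~26 significant bits (overflow for huge x is ignored here).
void split(double x, double &hi, double &lo) {
    const double splitter = 134217729.0;  // 2^27 + 1
    double t = splitter * x;
    hi = t - (t - x);  // the top ~26 bits of x
    lo = x - hi;       // the remaining low bits, exactly
}
```
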
10:09.440 --> 10:14.600
OK, so double double arithmetic is also

10:14.600 --> 10:19.960
referred to as error-free operations.

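[Editor's note: in the literature these building blocks are usually called error-free transformations. The addition was sketched above; here is the product, assuming hardware FMA so that std::fma rounds only once.]

```cpp
#include <cmath>

// Error-free product: a * b == p + e exactly, no tests or branches.
void two_prod(double a, double b, double &p, double &e) {
    p = a * b;
    e = std::fma(a, b, -p);  // the exact remainder of the product
}
```
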
10:19.960 --> 10:23.120
I want to do inner products.

10:23.120 --> 10:26.880
So I want to multiply matrices, for example.

10:26.880 --> 10:31.200
The reference code is going to compute just

10:31.200 --> 10:36.120
a sequence of sums of products.

10:36.120 --> 10:40.600
So for products, if you multiply two, say, 13-bit numbers,

10:40.600 --> 10:44.920
you can have a 26-bit number.

10:44.920 --> 10:48.480
So that's why a double double, which consists of a high double

10:48.480 --> 10:53.920
and a low double, is going to be split into 8 parts.

10:53.920 --> 10:58.520
The 8 parts are all going to have 13 non-zero bits,

10:58.520 --> 11:01.880
and are going to be filled up with 0s.

11:01.880 --> 11:04.040
So it sounds very complicated.

11:04.040 --> 11:08.480
So what we're doing here is that one inner product

11:08.480 --> 11:12.600
in double double arithmetic is going to be replaced

11:12.600 --> 11:17.400
by 8 inner products, but the 8 inner products

11:17.400 --> 11:21.720
are going to happen in double precision arithmetic.

11:21.720 --> 11:26.040
So computers are very good at doing matrix matrix multiplication,

11:26.040 --> 11:29.160
with double precision arithmetic.

11:29.160 --> 11:33.320
Multiple double precision inner products are replaced

11:33.320 --> 11:37.280
by inner products in double arithmetic.

11:37.280 --> 11:41.160
So that's what I mean by vectorizing.

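[Editor's note: a toy sketch of this factoring idea, not the talk's exact eight-part scheme. Each double (assumed scaled into [-1, 1], with the fixed exponents mentioned later in the talk) is cut into 13-bit parts, so products of parts are exact; one plain-double accumulator per combined weight level plays the role of one of the replacement inner products, and renormalization is postponed to the very end. All names here are illustrative.]

```cpp
#include <cmath>

// Cut x (assumed in [-1, 1]) into parts of at most ~13 significant
// bits at the fixed weights 2^-13, 2^-26, 2^-39, ...
void parts13(double x, double p[4]) {
    double r = x;
    for (int k = 0; k < 3; ++k) {
        double s = std::ldexp(1.0, 13 * (k + 1));  // 2^(13(k+1))
        p[k] = std::trunc(r * s) / s;              // next 13-bit chunk
        r -= p[k];                                 // exact remainder
    }
    p[3] = r;                                      // whatever is left
}

// One plain-double accumulator per combined weight level: products
// of 13-bit parts are exact, so long sums can be accumulated before
// anything is renormalized.
double dot_factored(const double *x, const double *y, int n) {
    double bucket[7] = {0.0};
    for (int i = 0; i < n; ++i) {
        double px[4], py[4];
        parts13(x[i], px);
        parts13(y[i], py);
        for (int a = 0; a < 4; ++a)
            for (int b = 0; b < 4; ++b)
                bucket[a + b] += px[a] * py[b];    // exact products
    }
    double sum = 0.0;      // a real version would combine the buckets
    for (int k = 6; k >= 0; --k)
        sum += bucket[k];  // with two_sum into a multiple double
    return sum;
}
```
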
11:41.160 --> 11:44.520
So here is the setup.

11:44.520 --> 11:47.680
It's important that this is still floating point,

11:47.680 --> 11:51.000
and the numbers can still float.

11:51.000 --> 11:55.560
You have to prevent this by kind of re-normalizing

11:55.560 --> 11:56.760
intermittently.

11:56.760 --> 12:01.000
So sometimes people say with a 64-bit double,

12:01.000 --> 12:03.280
you have 53 bits of precision,

12:03.280 --> 12:05.800
because you have that normalized bit.

12:05.800 --> 12:09.720
Now when we split a double in four pieces,

12:09.720 --> 12:12.600
we also have to make sure that these exponents

12:12.600 --> 12:18.880
are actually 0, negative 13, negative 26, negative 39.

12:18.880 --> 12:24.840
OK, so here is the computational result.

12:24.840 --> 12:28.400
And in some sense, this is already a parallel computation,

12:28.400 --> 12:32.320
because inside the computer, there are pipelines happening.

12:32.320 --> 12:36.440
If you do a double precision floating point operation,

12:36.440 --> 12:38.840
there's going to be the denormalization,

12:38.840 --> 12:42.120
making sure that the exponents are the same,

12:42.120 --> 12:45.440
adding the fractions, and then again storing.

12:45.440 --> 12:51.280
So by rewriting that first big inner product

12:51.280 --> 12:54.560
as many inner products of doubles,

12:54.600 --> 12:57.360
we actually get a lot more speed up,

12:57.360 --> 13:02.680
and a lot more gradual progression.

13:02.680 --> 13:06.040
So if you look at the left column there,

13:06.040 --> 13:10.040
so the ordinary, what I call the ordinary normal,

13:10.040 --> 13:12.080
double double arithmetic,

13:12.080 --> 13:15.440
you have immediately a penalty hit of a factor of 13,

13:15.440 --> 13:19.280
and then if you go to quad double, you have again a factor of 12.

13:19.280 --> 13:24.360
But if you vectorize it, it goes way, way more gradually.

13:24.360 --> 13:31.800
And this was, by the way, done on a very recent computer,

13:31.800 --> 13:38.160
with a recent chip, compiled with full optimization.

13:38.160 --> 13:42.520
OK, now two levels of parallelism.

13:42.520 --> 13:48.360
So the big point is that going from, say, 12 milliseconds to nine

13:48.360 --> 13:51.560
seconds, you're going to notice.

13:51.560 --> 13:53.440
So that's actually the whole big problem

13:53.440 --> 13:57.000
of using multiple precision arithmetic.

13:57.000 --> 14:00.120
Things that start to take milliseconds,

14:00.120 --> 14:03.800
suddenly become seconds, seconds become minutes,

14:03.800 --> 14:05.600
and minutes become hours.

14:05.600 --> 14:09.080
So the nine seconds, you can actually,

14:09.080 --> 14:11.720
if you do this in a very straightforward way,

14:11.720 --> 14:15.600
so I had to do 1,000 inner products.

14:15.600 --> 14:19.880
I can actually reduce this by using all the cores

14:19.880 --> 14:26.000
on that CPU: I can reduce it

14:26.000 --> 14:33.520
from nine seconds to 298 milliseconds.

14:33.520 --> 14:37.680
I like speed up, but we also like quality up.

14:37.680 --> 14:41.480
So in some sense, if you use all the course in that computer,

14:41.480 --> 14:45.960
you actually can do the same computation,

14:45.960 --> 14:52.360
four times more accurate within the same time.

14:52.360 --> 14:55.400
OK, so that's the nice type of parallelism here

14:55.400 --> 14:56.880
as a reference.

14:56.880 --> 15:00.480
So this is what I call the work crew model.

15:00.480 --> 15:05.880
Every thread just picks up an inner product,

15:05.880 --> 15:09.720
and they work through a job queue.

15:09.720 --> 15:12.840
And the only thing you need to be careful about

15:12.840 --> 15:16.560
is that when you increment your job counter,

15:16.560 --> 15:20.080
so every thread, when it's done, asks for the next job,

15:20.080 --> 15:24.720
they can only do this one at a time.

15:24.720 --> 15:27.520
They have to do this in a critical section.

15:27.520 --> 15:29.520
And this can be very nicely implemented.

15:29.520 --> 15:32.400
So I think this is classic Ada.

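[Editor's note: a minimal sketch of this work crew model, here in C++ threads rather than the Ada of the talk; the shared job counter is the only synchronized piece, with an atomic fetch-and-add standing in for the critical section.]

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each thread of the crew repeatedly grabs the next job index and
// runs it, e.g. one inner product per job, until the queue is empty.
void work_crew(int njobs, int nthreads, void (*do_job)(int)) {
    std::atomic<int> next{0};
    std::vector<std::thread> crew;
    for (int t = 0; t < nthreads; ++t)
        crew.emplace_back([&] {
            for (;;) {
                int job = next.fetch_add(1);  // the guarded increment
                if (job >= njobs) break;
                do_job(job);
            }
        });
    for (auto &th : crew) th.join();
}
```
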
15:32.400 --> 15:36.760
OK, now comes the hard part.

15:36.760 --> 15:47.400
So with GPUs, you can actually do the multiple double arithmetic,

15:47.400 --> 15:51.640
that's fine, but then there are the tensor cores.

15:51.640 --> 15:55.760
So the computer that I mentioned actually

15:55.760 --> 15:59.760
has a high-end Ampere graphics card.

15:59.760 --> 16:05.880
And what this experiment shows is that these convolutions,

16:05.880 --> 16:12.840
they can actually be reformulated as matrix-matrix products.

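[Editor's note: a small sketch of that reformulation for one pair of truncated power series. The coefficient convolution c[k] = sum of a[i]*b[k-i] is a lower-triangular Toeplitz matrix, built from a, applied to b; batching many series turns this into the matrix-matrix products that tensor cores are built for.]

```cpp
#include <cstdio>

// Convolution of series coefficients staged as a matrix-vector
// product: row k of the implicit matrix T holds a[k], ..., a[0].
void convolve_as_matvec(const double *a, const double *b, double *c, int n) {
    for (int row = 0; row < n; ++row) {
        double s = 0.0;
        for (int col = 0; col <= row; ++col)
            s += a[row - col] * b[col];  // T(row, col) * b[col]
        c[row] = s;
    }
}

int main() {
    double a[4] = {1, 1, 1, 1}, b[4] = {1, 2, 3, 4}, c[4];
    convolve_as_matvec(a, b, c, 4);
    for (int k = 0; k < 4; ++k)
        std::printf("c[%d] = %g\n", k, c[k]);  // prints 1, 3, 6, 10
}
```
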
16:12.840 --> 16:17.000
So this is what we call data staging.

16:17.000 --> 16:22.760
So we have to make sure that the data is all properly

16:22.760 --> 16:27.000
arranged in vectors, and that very carefully

16:27.000 --> 16:30.520
we know where the beginning is, where the end is,

16:30.520 --> 16:32.720
and where we are going to write.

16:32.720 --> 16:36.080
So GPU code is typically very short.

16:36.080 --> 16:40.120
You just launch some kernel, but the preparation

16:40.120 --> 16:42.720
for the code takes a very long time.

16:42.720 --> 16:45.160
And actually what I've presented to you here

16:45.160 --> 16:50.560
is kind of a first step in making the tensor cores,

16:50.560 --> 16:54.480
making them useful for higher precision,

16:54.480 --> 16:59.280
matrix-matrix multiplications.

16:59.280 --> 17:00.600
Thank you for your attention.

17:00.600 --> 17:01.600
On time.

17:01.600 --> 17:10.600
Thank you.

17:10.600 --> 17:13.600
So questions and answers.

17:13.600 --> 17:22.600
You said you were using a 96-core CPU.

17:22.600 --> 17:23.600
Yes.

17:23.600 --> 17:28.400
Then you mentioned that you had only a six-times speed-up or something

17:28.400 --> 17:29.400
like that.

17:29.400 --> 17:39.280
Yeah, so these speed-ups are actually only done on one core.

17:39.280 --> 17:40.960
This is, okay, I'm sorry.

17:40.960 --> 17:46.800
So the question is, I'm using 96 cores, and how come these are

17:46.800 --> 17:49.120
the speed-ups here?

17:49.120 --> 17:56.080
In this slide, I'm actually only using one core, and I'm using the 96 cores

17:56.080 --> 17:59.280
to reduce the 9 seconds over there.

17:59.280 --> 18:05.640
So this is kind of here the highest time, and I want to reduce this,

18:05.640 --> 18:08.480
compared to the 318.

18:08.480 --> 18:14.640
So that's, with quad doubles, about 10 times slower.

18:14.640 --> 18:21.320
What I can do here is, if I have 1,000 inner products to do,

18:21.320 --> 18:27.880
they are distributed among the 96 cores in that computer.

18:27.880 --> 18:37.240
And that reduces the time to 318, I'm sorry, to 298 milliseconds.

18:37.240 --> 19:01.240
Yeah, so the question is, I'm using semaphores here, have I considered using atomic

19:01.240 --> 19:04.000
variables, and I think I do.

19:04.000 --> 19:11.400
So there are these types that I can use, yes.

19:11.400 --> 19:12.400
Yes?

19:12.400 --> 19:13.400
Yes.

19:13.400 --> 19:30.800
So the question is, are the exponents useful?

19:30.800 --> 19:39.720
So the exponents, so in some sense, so here is how you can see it.

19:39.720 --> 19:45.040
So that's also a drawback, actually, of multiple double arithmetic.

19:45.040 --> 19:52.440
You still remain 11 bits of in your exponent.

19:52.440 --> 19:56.400
So at some point you're going to have overflow or underflow.

19:56.400 --> 19:59.920
So you have to do still numerical analysis.

19:59.920 --> 20:08.640
So the question is if these exponents are useful, here they are not, if you are just normalizing,

20:08.640 --> 20:11.040
because they all stay fixed.

20:11.040 --> 20:14.320
And that's actually what you want.

20:14.320 --> 20:20.600
So in some sense, for your numbers: here I generated all the constant

20:20.600 --> 20:26.480
numbers in a very nice range, but it's actually this exponent that is going to matter.

20:26.480 --> 20:33.760
And all the other numbers are going to somehow carefully need to be rebalanced.

20:33.760 --> 20:34.760
Yes?

20:34.760 --> 20:35.760
Yes.

20:35.760 --> 20:56.160
[Audience question, largely inaudible: about the load on the processor and the impact of its temperature.]

20:56.160 --> 21:00.400
So your question is about the load balancing?

21:00.400 --> 21:01.400
Yes.

21:01.400 --> 21:17.560
[Audience, largely inaudible: whether high temperature throttles the cores and affects how the process is going to run.]

21:17.560 --> 21:18.560
Yes.

21:18.560 --> 21:28.760
So the question is about the temperature. So, I don't know, but the machine

21:28.760 --> 21:35.200
starts to get a lot more noisy when I run it, and the fans are really starting to blow

21:35.200 --> 21:37.680
very hard.

21:37.680 --> 21:43.320
So that's how I see that it is working or I hear that it's working.

21:43.320 --> 21:48.360
But yeah, I have no sensors, I mean, there could be sensors there, but I don't monitor

21:48.360 --> 21:49.360
the temperature.

21:49.360 --> 21:50.360
Okay.

21:50.360 --> 22:13.360
[Audience, largely inaudible: a suggestion about fan and cooling settings, running the fans at max speed to reduce the heating of the processor.]

22:13.360 --> 22:15.360
No, I never used that.

22:15.360 --> 22:17.360
Yeah.

22:17.360 --> 22:18.360
Okay, thank you.

22:18.360 --> 22:19.360
Thank you.

