WEBVTT

00:00.000 --> 00:12.000
Okay. Good morning, everyone. My name is David. I'm a research data scientist

00:12.000 --> 00:17.360
at the Alan Turing Institute, and today I'll be talking about explainable forecasting

00:17.360 --> 00:23.160
from big weather data: rapid and sustainable solutions.

00:23.160 --> 00:31.360
So, data-driven weather prediction has progressed substantially over the past few years.

00:31.360 --> 00:36.920
And the reason is essentially that traditional physics-based forecasting is accurate but

00:36.920 --> 00:42.880
is actually computationally very expensive. And a lot of open source data has been made

00:42.880 --> 00:48.440
available over the past few years and, as a result, many purely data-driven

00:48.440 --> 00:54.120
models have come out. And some of these data-driven models can actually outperform these

00:54.120 --> 01:03.640
physics-based models for many standard forecast scores.

01:03.640 --> 01:09.760
Most of these data-driven models use a deep learning architecture and they employ different

01:09.760 --> 01:15.720
approaches. For example, vision transformers, neural operators or graph neural networks

01:15.720 --> 01:25.920
are typical approaches. For example, Google DeepMind's GraphCast uses graph neural networks.

01:25.920 --> 01:31.120
And a lot of these models are trained on the publicly available ERA5 dataset, which

01:31.120 --> 01:39.320
is the European Centre for Medium-Range Weather Forecasts' fifth atmospheric reanalysis. And these

01:39.320 --> 01:44.200
are estimates of the global climate covering the period from the mid-20th century until

01:44.200 --> 01:50.840
today. And they contain hourly estimates of many variables, such as temperature, wind

01:50.840 --> 01:56.040
speed, precipitation, etc., on a lat-long grid at multiple pressure levels, so at

01:56.040 --> 02:02.320
multiple altitudes. And it's a huge dataset. The whole thing is hundreds of terabytes,

02:02.320 --> 02:09.400
but just to give you an idea, a slice at a single pressure level of 500 hectopascals on

02:09.400 --> 02:16.040
the finest resolution of 0.25 degrees, for a single variable, temperature, and for a time

02:16.040 --> 02:22.800
range of 20 years, has a size of 730 gigabytes. And here I'm showing a slice of temperature

02:22.800 --> 02:29.000
at that pressure level, 500 hectopascals, on the first of January 2020. So this is roughly what the data

02:29.000 --> 02:31.960
set looks like.
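The 730-gigabyte figure quoted here is easy to sanity-check with back-of-the-envelope arithmetic; a rough sketch, assuming the standard 0.25-degree grid of 721 x 1440 latitude/longitude points and float32 storage:

```python
# Approximate size of one ERA5 slice: one variable at one pressure level,
# on the 0.25-degree grid, hourly, for 20 years, as 32-bit floats.
lat_points = 721            # 90N..90S at 0.25 degrees (assumed grid shape)
lon_points = 1440           # 0..360 at 0.25 degrees
hours = 24 * 365.25 * 20    # 20 years of hourly snapshots
bytes_per_value = 4         # float32

size_gb = lat_points * lon_points * hours * bytes_per_value / 1e9
# roughly 730 GB, matching the figure quoted in the talk
```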

02:31.960 --> 02:37.800
So this was actually a side project that we did. We're developing a graph neural network

02:37.800 --> 02:42.120
model with the UK's Met Office as well, but we had a pause in the project and we decided

02:42.120 --> 02:49.000
to try a different approach. And we wanted to develop an inexpensive data-driven forecast model

02:49.000 --> 02:54.360
that could serve as a baseline for comparison, so sort of a benchmark, having a similar

02:54.360 --> 03:00.200
role to persistence forecasting and climatology. So these are two metrics that are used to

03:00.200 --> 03:04.920
compare data-driven models against, and which are very simple. Persistence forecasting essentially

03:04.920 --> 03:09.880
assumes that the weather remains static for the whole of the forecast period.

03:09.880 --> 03:16.280
And climatology essentially assumes that the forecast is the same as the historical average

03:16.280 --> 03:23.000
for that particular date in time. So those are very simple models

03:23.000 --> 03:27.880
that are used as benchmarks to compare data-driven models against. And we also wanted to develop

03:27.880 --> 03:33.560
a method that could deepen our understanding of the underlying physics of the problem. And that's

03:33.560 --> 03:38.280
why we came across dynamic mode decomposition. So dynamic mode decomposition is a purely

03:38.280 --> 03:45.880
data-driven model. It's computationally efficient. It's explainable because the modes and eigenvalues,

03:45.880 --> 03:52.280
I'll explain a bit more about that later, that come out of this are interpretable and they explain

03:52.280 --> 03:57.320
the underlying physics of the dataset that you're trying to model.

03:58.520 --> 04:02.760
And it can also approximate a nonlinear dynamical system through a linear approximation.

04:03.560 --> 04:13.640
So the team: most of us were based at the Alan Turing Institute. We also had a very good collaboration

04:13.640 --> 04:18.920
with Nathan Kutz from the University of Washington, who has been working on DMD for many years now,

04:19.640 --> 04:24.440
and with a PhD candidate from the Barcelona Supercomputing Center, Bennett, who's working on

04:24.440 --> 04:33.320
low-order modeling for fluid dynamics. So, a bit of background about DMD. This looks a bit

04:33.320 --> 04:39.240
scary, but it's actually very simple. So actually DMD seeks the leading spectral decomposition,

04:39.240 --> 04:45.080
so the eigenvalues and eigenvectors, of the best-fit linear operator A that relates two snapshot

04:45.080 --> 04:52.520
matrices in time. So imagine here my matrix X was my ERA5 slice. And along the rows,

04:52.520 --> 04:58.120
I would have different spatial locations on the lat-long grid, and along the columns, I would have

04:58.200 --> 05:03.000
different snapshots organized in time. And then I make a matrix X-prime, which is just

05:03.000 --> 05:09.080
this matrix shifted in time by one snapshot. And my assumption here is that X-prime can be expressed

05:09.080 --> 05:16.680
as a linear function of X. So X-prime equals A times X. And A is this best fit linear operator.

05:18.280 --> 05:26.440
And DMD essentially seeks an eigenvalue decomposition of A, and finds this

05:26.440 --> 05:32.920
decomposition by expressing X as a multiplication of the eigenvectors by the amplitudes and

05:32.920 --> 05:39.240
the matrix of eigenvalue exponentials, which inform about the time evolution of the problem.

05:39.240 --> 05:45.320
So this is essentially DMD in a nutshell. It's assuming a linear relationship and

05:45.320 --> 05:50.600
solving a linear regression problem, and then calculating the eigenvalue decomposition of A,

05:50.600 --> 05:57.080
and expressing X as a rank-r reconstruction through this equation at the bottom of this

05:57.080 --> 06:01.480
slide. And this has connections with Koopman theory, which essentially says that you can

06:02.360 --> 06:08.840
express a nonlinear dynamical system through an infinite-dimensional linear system.
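As a rough numpy sketch of the procedure just described (illustrative toy data, not the project's code): build the two snapshot matrices, fit the best-fit linear operator A, take its eigendecomposition, and reconstruct the data from modes, eigenvalues, and amplitudes:

```python
import numpy as np

# Toy field: two travelling waves, i.e. an exactly linear dynamical system
# of rank 4 (two spatial patterns per oscillation frequency).
n_space, n_time, dt = 64, 200, 0.1
x = np.linspace(0, 2 * np.pi, n_space)
t = np.arange(n_time) * dt
data = (np.outer(np.sin(x), np.cos(2 * t)) + np.outer(np.cos(x), np.sin(2 * t))
        + np.outer(np.sin(3 * x), np.cos(5 * t)) + np.outer(np.cos(3 * x), np.sin(5 * t)))

# Snapshot matrices: X' is X shifted forward by one snapshot in time.
X, Xp = data[:, :-1], data[:, 1:]

# Best-fit linear operator A with X' ~ A X (a linear regression solved via the
# pseudoinverse; rcond truncates the numerically zero singular values).
A = Xp @ np.linalg.pinv(X, rcond=1e-10)

# Eigendecomposition of A: eigenvectors are the DMD modes, eigenvalues encode
# each mode's oscillation frequency and growth/decay per time step.
lam, Phi = np.linalg.eig(A)

# Keep the dynamically meaningful modes (largest |eigenvalue|), fit the
# amplitudes b to the first snapshot, then reconstruct/forecast in time.
keep = np.argsort(-np.abs(lam))[:4]
lam, Phi = lam[keep], Phi[:, keep]
b = np.linalg.lstsq(Phi, data[:, 0], rcond=None)[0]
recon = np.real(Phi @ (b[:, None] * lam[:, None] ** np.arange(n_time)))

# The recovered continuous-time frequencies are the 2 and 5 rad/s that
# generated the data.
freqs = np.sort(np.abs(np.angle(lam))) / dt
```

Projecting the data onto times beyond the training window (larger exponents of the eigenvalues) gives the forecast mentioned later in the talk.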

06:10.280 --> 06:17.400
Here A would be our approximation of the Koopman operator, which in this case would be a finite

06:17.480 --> 06:27.640
dimensional operator. And DMD has the advantage of connecting the SVD, so singular value decomposition,

06:27.640 --> 06:34.360
for spatial dimensionality reduction, as well as the fast Fourier transform for temporal frequency

06:34.360 --> 06:41.640
identification. So it's doing both things at the same time. So it's reducing the spatial dimensionality

06:41.880 --> 06:51.560
through the SVD, and finding the dominant temporal frequencies through the FFT. And here you can see a

06:51.560 --> 06:57.000
typical example of DMD being applied to a classical problem in fluid dynamics, which is a cylinder

06:57.000 --> 07:02.120
in a cross flow. So imagine I had different snapshots of the flow field behind the cylinder.

07:03.080 --> 07:09.320
And I applied DMD to this. I would solve this linear regression problem. And then if I did the eigenvalue

07:09.480 --> 07:15.000
decomposition of A, what I would get are these dynamic modes, which inform about the spatial behavior

07:15.000 --> 07:20.840
of these vibrations that compose the problem. And these dynamic modes, spatial modes,

07:20.840 --> 07:27.960
would also be accompanied by these temporal oscillations. So I could use these results

07:27.960 --> 07:33.080
to better understand the problem, but also for forecasting, because I could just project

07:33.880 --> 07:39.560
the data onto a new time vector into the future. And I could predict the future state of the problem.

07:42.840 --> 07:47.880
Now DMD is very good, but it's actually very sensitive to the presence of noise, which is

07:47.880 --> 07:52.840
always the case in real world data sets. So the approach we're using is actually optimized DMD,

07:52.840 --> 07:56.360
which is based on the algorithm proposed by Askham and Kutz in 2018.

07:57.000 --> 08:03.480
And essentially, optimized DMD is a post-processing step of the exact DMD algorithm. It seeks

08:04.680 --> 08:12.120
a nonlinear optimization of DMD through a variable projection algorithm. So actually,

08:12.120 --> 08:17.000
I don't know if you can see my cursor here. Yes, so it's actually solving,

08:17.000 --> 08:22.600
optimized DMD is solving this exponential fitting problem directly. And by doing this,

08:23.320 --> 08:29.160
it's much less sensitive to the presence of noise and avoids much of the bias of exact DMD.

08:30.600 --> 08:37.640
So here, the approach to solving optimized DMD is that you first compute the exact DMD,

08:38.200 --> 08:43.960
and then you solve this equation in a post-processing step. This is actually quite computationally intensive,

08:43.960 --> 08:50.280
especially if X is very large. So what you can actually do is compute the SVD of X to produce

08:50.280 --> 08:56.280
a rank R reconstruction of X, and solve this optimization in a low rank approximation of X.

08:56.280 --> 09:02.600
So this is what we're doing, because it's much easier to solve. So this is essentially called

09:02.600 --> 09:08.040
approximate opt-DMD, because I'm solving the optimized DMD algorithm on the rank-r approximation of my matrix.
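A minimal numpy sketch of why this rank-r trick is so much cheaper (illustrative, not the project's code): the eigenvalues of the small r x r projected operator match the dynamically meaningful eigenvalues of the full operator, without ever forming the large matrix:

```python
import numpy as np

# A tall snapshot matrix (many grid points, fewer snapshots) whose dynamics
# are exactly rank 6: three oscillation frequencies, two patterns each.
n_space, n_time, dt = 400, 121, 0.1
x = np.linspace(0, 2 * np.pi, n_space)
t = np.arange(n_time) * dt
data = sum(np.outer(np.sin(k * x), np.cos(w * t)) + np.outer(np.cos(k * x), np.sin(w * t))
           for k, w in [(1, 1.3), (2, 2.7), (3, 4.1)])
X, Xp = data[:, :-1], data[:, 1:]

# Expensive route: eigendecomposition of the full n_space x n_space operator.
A = Xp @ np.linalg.pinv(X, rcond=1e-10)
lam_full = np.linalg.eig(A)[0]
lam_full = lam_full[np.argsort(-np.abs(lam_full))[:6]]  # drop numerical zeros

# Cheap route: project onto a rank-6 SVD basis and decompose a 6 x 6 operator.
r = 6
U, S, Vh = np.linalg.svd(X, full_matrices=False)
U, S, Vh = U[:, :r], S[:r], Vh[:r]
A_tilde = U.conj().T @ Xp @ Vh.conj().T * (1.0 / S)
lam_red = np.linalg.eig(A_tilde)[0]

# Both routes recover the same six eigenvalues; only the cheap one scales
# to something the size of an ERA5 slice.
```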

09:09.640 --> 09:15.400
Right, so we've been using the PyDMD package, which is a publicly available Python package

09:15.400 --> 09:21.320
for performing DMD, and this optimized DMD algorithm is implemented in the BOPDMD class.

09:22.120 --> 09:28.200
Now, in order to make this work for ERA5, we've had to modify the algorithm slightly,

09:28.200 --> 09:34.040
and we've incorporated a new method called fit_econ, for "fit economy", essentially,

09:35.000 --> 09:41.320
which allows you to do a much cheaper DMD fit. So just to give you an idea,

09:41.400 --> 09:46.920
this is the old fit method that was in the algorithm, and this becomes intractable if X is very

09:46.920 --> 09:51.880
large, which was our case. So we were getting out-of-memory kills on an HPC trying to do this.

09:53.080 --> 09:57.800
However, with this new fit_econ method that we've implemented, we can actually solve this problem

09:57.800 --> 10:05.240
on a laptop. By using this trick of solving the problem in the low rank approximation of X that I showed

10:05.240 --> 10:12.920
in the previous slide. Okay, so yes, we can apply opt-DMD to the rank-r approximation of X,

10:13.800 --> 10:19.000
but we still need to compute the SVD of X, and actually computing the SVD of a very large matrix

10:19.000 --> 10:27.480
is very computationally intensive. So how about building separate DMD models for much smaller

10:27.480 --> 10:34.520
subsamples of X, and combining these DMD models to produce a single forecast. So instead of trying

10:34.520 --> 10:40.920
to solve everything at once by doing the SVD on X, we produce different subsamples, and in each

10:40.920 --> 10:48.760
subsample, we capture a range of the time frequencies and spatial frequencies that are present in

10:48.760 --> 10:54.760
the dataset. So with the DMD model applied to each subsample, we only capture some of the dynamics,

10:55.400 --> 11:00.120
but then we can combine all of these models together to produce a single forecast. So this is the

11:00.120 --> 11:05.160
approach we've gone for. So here I could summarize X by producing three different subsamples.
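As a toy numpy sketch of this subsampling-and-combining idea (noiseless data, hypothetical frequencies and sampling rates, not the project's code): one subsample with a fine time step over a short window resolves the fast dynamics, one with a coarse step over a long window resolves the slow dynamics, and cherry-picked modes from each combine into one forecast:

```python
import numpy as np

def dmd(D, dt, r):
    """Rank-r exact DMD: returns modes, continuous-time eigenvalues, amplitudes."""
    X, Xp = D[:, :-1], D[:, 1:]
    U, S, Vh = np.linalg.svd(X, full_matrices=False)
    U, S, Vh = U[:, :r], S[:r], Vh[:r]
    lam, W = np.linalg.eig(U.conj().T @ Xp @ Vh.conj().T * (1.0 / S))
    Phi = U @ W                        # DMD modes in the full space
    omega = np.log(lam) / dt           # continuous-time eigenvalues
    b = np.linalg.lstsq(Phi, D[:, 0], rcond=None)[0]   # amplitudes at t = 0
    return Phi, omega, b

# A 1-D "rope" with a fast (10 rad/s) and a slow (0.1 rad/s) oscillation.
x = np.linspace(0, 2 * np.pi, 50)
field = lambda tv: (np.outer(np.sin(x), np.cos(10 * tv)) + np.outer(np.cos(x), np.sin(10 * tv))
                    + np.outer(np.sin(2 * x), np.cos(0.1 * tv)) + np.outer(np.cos(2 * x), np.sin(0.1 * tv)))

# Subsample 1: fine time step, short window (resolves the fast oscillation).
# Subsample 2: coarse time step, long window (fast mode aliased, slow resolved).
Phi1, om1, b1 = dmd(field(np.arange(200) * 0.05), 0.05, 4)
Phi2, om2, b2 = dmd(field(np.arange(400) * 1.0), 1.0, 4)

# Cherry-pick: fast modes from model 1, slow modes from model 2.
f1 = np.abs(om1.imag) > 1.0
f2 = np.abs(om2.imag) < 1.0
Phi = np.hstack([Phi1[:, f1], Phi2[:, f2]])
om = np.concatenate([om1[f1], om2[f2]])
b = np.concatenate([b1[f1], b2[f2]])

# Combined forecast on new times, compared against the ground truth.
t_new = np.linspace(0, 500, 400)
forecast = np.real(Phi @ (b[:, None] * np.exp(np.outer(om, t_new))))
err = np.max(np.abs(forecast - field(t_new)))
```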

11:05.160 --> 11:10.920
So in X1 for instance, my snapshots would be one hour apart from each other, and the whole duration

11:10.920 --> 11:16.200
of the observation would be one month. I could then produce a matrix X2, with a time delta of

11:16.200 --> 11:21.640
one day between snapshots and an observation duration of two years, and a matrix X3

11:21.640 --> 11:28.280
with a delta time of 15 days between snapshots and a duration of 20 years. So none of these three

11:28.280 --> 11:32.600
subsamples resolves the full range of the dynamics, but together they do, that's the trick here,

11:32.600 --> 11:39.960
and X1, X2, and X3 are much smaller than X. So we tried this on a toy dataset. This is a

11:39.960 --> 11:47.880
1-D problem, so imagine this was a rope oscillating over time. There are three

11:47.880 --> 11:55.000
different vibrations embedded on top of each other with some Gaussian noise. So this is my

11:55.000 --> 12:01.640
problem, essentially, and I produce three different subsamples with different observation

12:01.640 --> 12:08.840
durations and delta periods between snapshots. So again, none of these three subsamples is

12:08.840 --> 12:15.080
able to capture the full dynamics, but hopefully by applying DMD to each one of these separately

12:15.080 --> 12:20.920
and combining those DMD models I can produce an accurate forecast. And this actually

12:20.920 --> 12:27.880
works. So if I produce a DMD model for X1, X2, and X3 separately, I can then cherry pick the DMD

12:27.880 --> 12:32.200
modes from each of these models and produce a single model for forecasting. And as you can see,

12:32.200 --> 12:37.480
I can get a very accurate forecast, which is very similar to the ground truth. You can check

12:37.480 --> 12:43.880
this out on GitHub if you're interested; it's in a public repo. Right, so what are our preliminary

12:43.960 --> 12:49.800
results? I should say that we've literally finished the code base very recently and we've just

12:49.800 --> 12:57.880
tried playing with the ERA5 dataset a few days ago. So here I'm showing temperature

12:57.880 --> 13:05.400
root mean square error for different pressure levels. So trying to predict temperature over a period

13:05.400 --> 13:14.760
of 10 days, the first 10 days of January 2020, with a DMD model trained on December 2019 only

13:14.760 --> 13:21.560
with a delta time of one hour. So as you can see, for a pressure level of 100 hectopascals, so that's

13:21.560 --> 13:28.680
quite high up in the atmosphere, that's around 16 kilometers above sea level. And DMD

13:28.680 --> 13:35.000
is doing okay, it's definitely doing better than persistence, especially at higher lead times,

13:35.560 --> 13:40.280
and it's sort of doing the same as climatology, sometimes doing better, sometimes doing worse.
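The two baselines being compared against here can be sketched in a few lines (a synthetic hourly series with a daily cycle; all values are illustrative):

```python
import numpy as np

# Training data: an hourly series with a clean daily cycle.
hours_per_day, n_days = 24, 30
t = np.arange(hours_per_day * n_days)
series = 15 + 5 * np.sin(2 * np.pi * t / hours_per_day)

lead = 48  # forecast 48 hours ahead of the training window

# Persistence: the last observed value is repeated for the whole lead time.
persistence = np.full(lead, series[-1])

# Climatology: the forecast for each hour is the historical average for that
# hour of the day.
hourly_mean = series.reshape(n_days, hours_per_day).mean(axis=0)
climatology = np.tile(hourly_mean, lead // hours_per_day)
```

On this noiseless cyclic series climatology is nearly exact while persistence drifts out of phase, which is why both are useful, cheap yardsticks for a data-driven model.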

13:42.120 --> 13:48.200
Now when I move to a pressure level of 500 hectopascals, my DMD prediction becomes

13:48.200 --> 13:56.360
quite rubbish. The reason is that, as you move closer to the Earth's surface,

13:57.400 --> 14:02.840
the field becomes much more turbulent. So there are many more scales to resolve and DMD is not doing

14:02.920 --> 14:10.600
such a good job here, either because I'm not using a high-enough-rank approximation of ERA

14:10.600 --> 14:19.320
5, or because I'm not using enough data. For instance, maybe December 2019 is not enough

14:19.320 --> 14:25.800
in order to capture the first 10 days of January. So what I tried here is this approach that I mentioned

14:25.800 --> 14:30.440
in one of my previous slides, which is combining 2 DMD models to produce a single forecast.

14:30.520 --> 14:36.200
So on the left-hand side, I have the DMD forecast trained on data from December 2019 only.

14:36.760 --> 14:41.720
And here I've combined two different DMD models together, one trained in December 2019 with a delta

14:41.720 --> 14:46.680
time of one hour, and another DMD model trained on two years worth of data with a delta time of one

14:46.680 --> 14:50.680
day between snapshots. And as you can see, my prediction becomes considerably better

14:51.800 --> 14:58.760
using this approach. Again, this is super fast to compute because the SVDs of these matrices are

14:59.400 --> 15:06.600
such that you can do them on an HPC very quickly. And it's a very cheap method, essentially. And here I'm

15:06.600 --> 15:13.960
still doing worse than climatology, but again I haven't included data from further back, essentially.

15:13.960 --> 15:19.880
I could have included, for instance, data for the past 20 years with a delta time of 15 days

15:19.880 --> 15:28.600
between snapshots to try and improve this model further. Right. The next thing that DMD is

15:28.600 --> 15:34.680
good at, apart from forecasting, is explainability. So the results that come out of DMD are

15:34.680 --> 15:39.320
actually explainable. You can understand them. You can plot them. So here on the left hand side,

15:39.320 --> 15:46.680
I'm showing the DMD spatial mode number four. And this is what it looks like if you plot it on a

15:46.680 --> 15:52.920
lat-long grid, and it shows you regions where, in this case, the temperature field is linearly correlated,

15:53.720 --> 15:59.480
and how these patterns evolve over space. And on the right-hand side, I'm plotting the temporal

15:59.480 --> 16:04.680
evolution of this mode. So I'm not a meteorologist and I'm not very good at interpreting what this

16:04.680 --> 16:13.720
means, but essentially this would hopefully summarize a particular weather pattern that could help you

16:13.720 --> 16:18.680
better understand the underlying physics, and could also help you if, for instance, you're trying

16:18.680 --> 16:22.680
to produce a deep learning model: you could use this as an input feature to your model.
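For instance, a DMD eigenvalue converts directly into an interpretable oscillation period; a small sketch with a hypothetical eigenvalue, assuming hourly snapshots:

```python
import numpy as np

# A hypothetical DMD eigenvalue from a model trained on hourly snapshots
# (delta_t = 1 hour): this one completes a cycle every 24 steps.
delta_t_hours = 1.0
lam = np.exp(2j * np.pi / 24)

# Continuous-time frequency and oscillation period of the mode.
omega = np.log(lam).imag / delta_t_hours   # rad per hour
period_hours = 2 * np.pi / abs(omega)      # 24 hours: a daily cycle
growth = np.abs(lam)                       # 1.0 means neutrally stable
```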

16:25.560 --> 16:30.520
Right. And this is my last slide. This is the package we've developed in Python. It's called

16:30.520 --> 16:38.920
DynamoDera. And it's publicly available on GitHub. Its current capabilities are ERA5

16:38.920 --> 16:44.120
downloading and slicing from the cloud, data pre-processing, the singular value decomposition,

16:44.120 --> 16:49.080
either standard or through a randomized algorithm. We also have integrated data version control,

16:49.080 --> 16:53.640
which we found to be very useful because there are various hyper parameters that you can tune.

16:53.640 --> 16:57.960
For example, the low-rank reconstruction, how many DMD modes you're using, how many pressure

16:57.960 --> 17:04.120
levels you're testing. So actually using data version control has been an amazing addition.

17:05.720 --> 17:12.040
We're currently working on the integrated application of DMD using PyDMD. But the key thing here is

17:12.120 --> 17:16.600
that actually you can just do the SVD on the HPC. And with this new fit_econ method that we've developed,

17:16.600 --> 17:20.600
you can actually run the DMD step on your laptop. It's very quick.

17:22.040 --> 17:28.280
Future work includes adding post-processing capabilities and parallelized SVD using

17:29.160 --> 17:36.280
pyLOM, Python Low-Order Modeling, which is a Python package on GitHub

17:37.240 --> 17:42.360
that Bennett from the Barcelona Supercomputing Center is working on, which can actually

17:42.360 --> 17:48.600
do parallelized SVD on huge matrices in a very short time. And of course, contributions are welcome.

17:49.400 --> 17:50.600
That's all. Thank you very much.

18:01.240 --> 18:02.200
Questions for Dali?

18:02.200 --> 18:08.200
[Inaudible audience question.]

18:20.200 --> 18:23.880
Yeah, that's a very good question. It's not straightforward to do that.

18:24.280 --> 18:32.840
Oh, yeah, sorry. So the question was how do you choose how to subsample the data in order to produce

18:32.840 --> 18:40.600
these separate DMD models that you can then combine to produce a single

18:40.600 --> 18:45.960
forecast from those DMD models. So it's not straightforward to choose that subsampling and actually

18:45.960 --> 18:52.040
it's a lot of trial and error. You sort of have to ensure that you capture the full range of dynamics.

18:54.200 --> 18:59.720
Well, yeah. So if there's overlap in the dynamics, that's not a problem

19:00.280 --> 19:07.400
because essentially what you would get is a repeated DMD mode on separate subsamples. So if that's

19:07.400 --> 19:13.000
the case, you only choose one DMD mode if it appears twice across two different subsamples.

19:14.440 --> 19:19.320
But it's better to be cautious than not to be cautious in this sense. So you choose a subsampling

19:19.320 --> 19:23.160
where you know there's going to be overlap of the dynamics across the subsamples.

19:29.320 --> 19:35.960
More questions? Yeah, in the back. Thank you very much for a very good tool.

19:37.560 --> 19:42.680
Are you seeing any comparisons against variables other than temperature? Yeah.

19:50.120 --> 20:01.720
Right. So the question was, how does DMD perform for other variables, such

20:01.720 --> 20:08.360
as precipitation for instance? I haven't tried on precipitation yet. That's the answer.

20:08.360 --> 20:15.160
I've only tried it on temperature and wind speed, and I got similar performance for temperature

20:15.240 --> 20:18.360
and wind speed. I don't know about precipitation.

20:24.360 --> 20:26.680
More questions? Yeah.

20:28.360 --> 20:32.520
I'm not very familiar with this kind of method, but since you are basically putting the data

20:32.520 --> 20:40.440
in a vector, is the locality in your grid somehow encoded in that, or are you just matching

20:40.520 --> 20:46.120
arbitrary data without knowing any of the locality of the data?

20:47.720 --> 20:56.600
So the question is, is there locality information in the matrix X on which I do DMD, essentially?

20:57.640 --> 21:02.680
So as long as the location remains constant across the different columns,

21:03.560 --> 21:11.640
when you do DMD you're essentially doing SVD across space. So as long as the spatial location

21:11.640 --> 21:16.440
remains the same across different columns, you're capturing the linear correlation that exists

21:16.440 --> 21:28.440
between spatial locations. How much data do you need to drive this problem?

21:28.840 --> 21:33.000
I was thinking, well, for large-scale weather forecasting, would a few months of data be enough?

21:33.720 --> 21:39.400
So that's a good question again. Sorry. So the question was how much data do I need in order to

21:39.400 --> 21:47.960
develop a good DMD model? And that's still to be answered. So ERA5 goes all the way

21:47.960 --> 21:56.440
back to the 50s. So I guess the more data you use the better, although that might not always be the case,

21:56.600 --> 22:04.600
because you're making it harder for the optimizer to match your X matrix. So my gut feeling is

22:04.600 --> 22:12.040
that 20 years worth of data is enough. Now we also want to capture things such as global warming.

22:12.760 --> 22:19.400
And for that, DMD might actually not be the best possible model out there because that's more

22:19.480 --> 22:25.160
of a moving average. And DMD is not great at capturing trends. It does capture trends

22:26.200 --> 22:31.160
by fitting a very slow sinusoid, but it's essentially trying to fit sinusoids to everything.

22:36.680 --> 22:37.480
More questions?

22:49.720 --> 22:55.560
Right, that's also a very good question that I have to test out. I've tried to, sorry, sorry.

22:55.560 --> 23:02.120
Yeah, the question is how far into the future can I forecast before my forecast just becomes rubbish?

23:03.560 --> 23:12.120
So that's something that I need to test out at scale. I've tried two months and it behaves

23:12.120 --> 23:18.120
relatively the same for those two months. And the reason is that it's capturing these oscillations,

23:18.120 --> 23:24.280
these periodic oscillations that repeat themselves constantly. So if you compare this against

23:24.280 --> 23:31.960
for instance, GraphCast, which is designed to essentially forecast the first few days. For instance,

23:31.960 --> 23:37.960
in the plot that I showed of January 2020, GraphCast would probably do well in the first 15 days of

23:37.960 --> 23:45.400
January. And then it would become rubbish. Whereas DMD does consistently well over a longer,

23:45.480 --> 23:51.160
much longer period. It doesn't do as well as GraphCast, of course, in those first few days of the

23:51.160 --> 23:56.760
forecast, but it roughly stays the same. The quality of the forecast roughly stays the same.

23:56.760 --> 24:22.200
Right. So the question is how would it deal with non-linear weather events? And I would say non-periodic weather events, such as storms,

24:22.360 --> 24:30.200
that are sort of stochastic: it would probably not do well on those. Because again, if there is no periodicity,

24:30.200 --> 24:35.000
if it's not a repeating pattern, DMD is not great at that.

24:52.200 --> 24:58.600
Could it be used for really long-range forecasting with this methodology, like for the rest of the year,

25:00.520 --> 25:08.360
or for a few years into the future?

25:09.400 --> 25:15.080
Yeah. So the question was whether DMD could be used for weather forecasting, long-range weather forecasting,

25:15.080 --> 25:19.480
essentially. And I think the answer is similar to what I gave before.

25:19.640 --> 25:27.240
So yes, because it can capture these oscillations that happen periodically in the global climate,

25:28.760 --> 25:33.480
it's stable at producing those, and it would produce a better quality forecast in that sense,

25:33.480 --> 25:38.280
than graph neural network models that are essentially trained to predict

25:39.800 --> 25:43.720
the next 10 days, for instance, of a particular variable. Yeah.

25:43.720 --> 25:52.520
Okay, that's enough up here. Thank you. Thank you.

