WEBVTT

00:00.000 --> 00:10.400
Hi everyone, first of all I want to thank the organizers and the HPC

00:10.400 --> 00:14.600
Dev Room for having me, and thanks to all of you for being here.

00:14.600 --> 00:20.080
Today I want to present my talk, which is "Partly Cloudy with a

00:20.080 --> 00:25.800
Chance of Zarr", a virtualized approach to Zarr stores for ECMWF's Fields DataBase. I'm

00:25.800 --> 00:32.440
a research software engineer at ECMWF and a former cloud back

00:32.440 --> 00:39.800
end engineer, and as of this day I have been working for ECMWF for exactly two years. I'm also part of the

00:39.800 --> 00:46.240
WarmWorld Easier project, collaborating with colleagues from JSC, DKRZ, MPI, GERICS, KIT

00:46.240 --> 00:52.080
and the University of Cologne. At ECMWF I'm working mainly on the MARS ecosystem;

00:52.320 --> 00:59.120
MARS is an abbreviation for Meteorological Archival and Retrieval System, storing all of our

00:59.120 --> 01:05.760
forecast data, so just a huge archive, if you want to put it like that. And I'm mainly working

01:05.760 --> 01:12.400
on the Fields DataBase, abbreviated FDB, a caching layer on top of the actual archive,

01:12.400 --> 01:18.240
and I'm also working on interfaces to this FDB, mainly the Python interface and the Zarr interface,

01:18.240 --> 01:24.240
which we want to have a look at today. What is ECMWF? It's an intergovernmental organization

01:24.240 --> 01:31.600
established in 1975; it consists of 23 member states and 12 cooperating states, and we just recently

01:31.600 --> 01:41.200
crossed the 500 headcount mark for employees. We have 24/7 operational services doing

01:41.280 --> 01:48.400
operations in numerical weather prediction, with four forecasts a day, supporting national

01:48.400 --> 01:55.200
weather services with coupled models and also businesses downstream. On top of that, we are a research

01:55.200 --> 02:02.800
institution, doing experiments to continuously improve our models, also doing reforecasts and climate

02:02.800 --> 02:07.440
analysis. On the right-hand side you can see the three sites, and I'm stationed in Bonn.

02:07.760 --> 02:15.200
So, why are we actually talking about this topic today? Well, recently

02:15.200 --> 02:25.440
we saw a huge explosion in scientific data. At ECMWF we currently produce 360 terabytes

02:25.440 --> 02:32.640
of forecast data a day and there is a projection for that to cross the petabyte threshold

02:32.720 --> 02:37.680
in 2027. On the right-hand side you can see the image; what the individual lines are doesn't

02:37.680 --> 02:43.920
really matter right now, but the blue one is basically the overall archive, and you can see in 2025

02:43.920 --> 02:50.080
we crossed the exabyte-scale threshold for the entire archive. On top of that, we're seeing

02:50.080 --> 02:54.720
an increasing pool of target users coming from all different domains, for example social scientists and

02:54.720 --> 03:04.240
geologists, not only meteorological or climate users, and on top of that we're seeing

03:04.240 --> 03:09.680
more and more users rushing in from the Python side doing for example machine learning or AI.

03:11.440 --> 03:17.440
Our normal HPC use cases are mainly driven by the forecast, right, so we're doing

03:17.440 --> 03:23.520
numerical weather prediction workflows, writing to and reading from parallel file systems, and of course

03:23.520 --> 03:29.760
also analysis of forecasts in weather and climate. So if we take a look at what the

03:29.760 --> 03:33.840
users are actually keen to do with the data, we've seen a shift in recent years.

03:34.720 --> 03:41.680
There are many hybrid workflows, combining HPC with actual cloud workflows or pipelines,

03:42.400 --> 03:47.120
and people are in general really interested in having semantic data storage in the cloud.

03:48.080 --> 03:54.720
Also doing interactive analysis, explorative work or analytics on large datasets; in general

03:54.720 --> 03:59.840
one would say people are really keen on having analysis-ready, cloud-optimized datasets.

04:00.560 --> 04:07.360
On the right-hand side is a broader example of such a use case: this is the ERA Explorer, where you can explore

04:07.360 --> 04:13.680
climatological data, and in this case the user was asking for time series data of precipitation

04:13.680 --> 04:22.560
in millimetres per month. In the HPC use cases we normally write the entire state of

04:22.560 --> 04:29.200
the atmosphere for each individual simulation step of the model to our file systems and you can see

04:29.200 --> 04:34.640
that the use case, in the case of this user's request, is completely orthogonal to that

04:34.640 --> 04:38.800
access pattern because he or she is requesting a time series right.
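
NOTE
To make the access-pattern mismatch concrete, here is a toy sketch (hypothetical names, not ECMWF code): forecast output is stored as one object per (step, parameter) field, so extracting a time series for a single grid point has to touch one object per forecast step.
```python
# Toy model of forecast output: one stored object per (step, parameter) field.
# A field is just a list of grid-point values here; all names are illustrative.
N_STEPS, N_POINTS = 10, 4
store = {
    (step, "2t"): [20.0 + step + p for p in range(N_POINTS)]
    for step in range(N_STEPS)
}

def time_series(store, param, point):
    """Read a single grid point across all steps: touches one object per step."""
    touched = 0
    series = []
    for (step, p), field in sorted(store.items()):
        if p == param:
            touched += 1
            series.append(field[point])
    return series, touched

series, touched = time_series(store, "2t", 0)
print(touched)  # 10 -- one object read per forecast step
```
The write path is cheap (one append per step), but the orthogonal read has to visit every stored object.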

04:39.360 --> 04:44.720
So the question which initially arises is: can we actually give flexible access to the data,

04:44.720 --> 04:50.160
and do that so that even non-domain people have a nice time.

04:51.440 --> 04:56.240
So the goal of this talk is actually to show how Zarr, something which is quite

04:56.240 --> 05:04.000
familiar and quite popular in the Python ecosystem, fits this HPC use case,

05:04.000 --> 05:08.320
and also how this fits together with the already implemented open source solution we have in place.

05:09.200 --> 05:14.400
And we want to bridge the gap between the classic HPC approaches and more modern cloud-based solutions

05:14.400 --> 05:19.600
right? Or, if you want to take this diagram: how do we actually fill in this question mark right there?

05:20.480 --> 05:25.360
So to be able to do this we need to talk about the individual pieces of the chain right so

05:25.360 --> 05:30.960
Starting on the right-hand side, we're talking about the FDB. The FDB is a domain-specific,

05:31.040 --> 05:38.880
high-performance object store; it is transactional, has no explicit synchronization,

05:38.880 --> 05:42.720
does all of its synchronization on a file-system level, and there's no MPI involved.

05:42.720 --> 05:48.160
In general you can think of it as a key-object store, where the key is some meteorological

05:48.160 --> 05:53.920
metadata and the values are just binary data, which can be GRIB files or ODB files.
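
NOTE
As a mental model (a toy sketch, not the real FDB API), the key is a set of metadata key-value pairs and the value is an opaque binary blob:
```python
# Toy key-object store in the spirit of FDB: metadata dict -> opaque bytes.
# Key names and values here are illustrative, not real MARS vocabulary.
def freeze(metadata):
    """Make a metadata dict usable as a dictionary key (order-independent)."""
    return frozenset(metadata.items())

archive = {}

def archive_field(metadata, blob):
    archive[freeze(metadata)] = blob

def retrieve_field(metadata):
    return archive[freeze(metadata)]

archive_field({"class": "od", "param": "2t", "step": "0"}, b"GRIB...binary payload")
blob = retrieve_field({"step": "0", "param": "2t", "class": "od"})  # key order doesn't matter
print(blob[:4])  # b'GRIB'
```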

05:55.120 --> 06:00.320
An example of such a key in this object store is written on the left-hand side,

06:00.400 --> 06:07.120
down below on the slide, and you can see these are just different meteorological variables

06:07.120 --> 06:13.840
and values describing the actual data saved there. On the right-hand side you can see the operational

06:14.640 --> 06:20.560
setup: we have different numerical weather prediction models running,

06:20.560 --> 06:24.800
which are outputting their data into the FDB, and the FDB is then distributing this data over

06:24.880 --> 06:31.120
different backends: could be DAOS, Ceph/Rados or a Lustre file system. And the models are

06:31.120 --> 06:38.400
outputting the MARS metadata and the data itself, and this is typically done in the GRIB file format,

06:38.400 --> 06:44.000
so we need to talk about GRIB a bit. What is GRIB? GRIB is an abbreviation for General

06:44.000 --> 06:50.480
Regularly-distributed Information in Binary form (a short one, right?), and in general it's

06:50.480 --> 06:57.120
just storing gridded fields, really, and a field is a scalar or vector field, could be temperature

06:57.120 --> 07:02.240
or pressure. On the left-hand side there's a picture depicting a normal GRIB field; so this

07:02.240 --> 07:08.560
is really, in that example, two-metre temperature on the surface of the globe, but it can

07:08.560 --> 07:14.800
also be, as I said, pressure on different height levels. GRIB is a standard defined by the WMO for

07:14.800 --> 07:20.960
meteorological data exchange; it's a binary format, compact, so it has certain capabilities in compression,

07:20.960 --> 07:27.120
and yeah, it's efficient for a large number of weather prediction datasets. It's optimized for

07:27.120 --> 07:32.320
archival workflows, or operational workflows in general; it's really good at sequential access,

07:32.880 --> 07:38.160
it has very minimal overhead for all the things we're doing in our operational pipelines,

07:38.160 --> 07:42.400
and it's self-describing, meaning, as I said on the last slide as well, that the metadata and the

07:42.400 --> 07:47.120
actual values are stored in the same file, right next to each other on disk. Well, there are

07:47.120 --> 07:51.680
also some downsides: for example, it's a rather complex format, you have some tables on top

07:51.680 --> 07:55.840
describing the metadata, and in case you want to read the GRIB file you need to have those

07:55.840 --> 08:02.240
tables to be able to restore the metadata completely. And also, because it was initially designed

08:02.240 --> 08:09.520
in 1985, it obviously predates cloud-native use cases. In the lower part of this slide you can see
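
NOTE
For a feel of the binary layout: a GRIB edition-2 message starts with a 16-byte indicator section ("GRIB" magic, discipline, edition number, 8-byte big-endian total length). Below is a minimal sketch that builds and parses such a synthetic header; reading real GRIB files requires a decoder such as ecCodes together with its tables.
```python
import struct

# GRIB2 indicator section (section 0), 16 octets:
#   bytes 0-3  : magic "GRIB"
#   bytes 4-5  : reserved
#   byte  6    : discipline (0 = meteorological)
#   byte  7    : edition number (2)
#   bytes 8-15 : total message length, unsigned 64-bit big-endian
def make_indicator(total_length, discipline=0, edition=2):
    return struct.pack(">4sHBBQ", b"GRIB", 0, discipline, edition, total_length)

def parse_indicator(buf):
    magic, _res, discipline, edition, length = struct.unpack(">4sHBBQ", buf[:16])
    assert magic == b"GRIB", "not a GRIB message"
    return {"discipline": discipline, "edition": edition, "length": length}

header = make_indicator(total_length=1234)
print(parse_indicator(header))  # {'discipline': 0, 'edition': 2, 'length': 1234}
```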

08:09.520 --> 08:14.000
the GRIB file structure on disk: there are different sections for the metadata, and section

08:14.000 --> 08:19.280
7 is typically where the data is stored; all the other sections hold certain subsets and aspects

08:19.280 --> 08:27.280
of the metadata. Finally, we also need to talk about Zarr, the last piece of this chain. What

08:27.280 --> 08:32.960
is Zarr, actually? It's two things really: it's a library and it's a format, but you could also say

08:32.960 --> 08:38.160
it's an API and a behavioural contract. In short, it's a chunked, compressed, N-dimensional

08:38.160 --> 08:44.400
data format; to the user it looks somewhat like a NumPy array, although they introduced

08:44.400 --> 08:49.680
chunking on top. Chunking is just a fancy way of saying that instead of querying the entire array

08:49.680 --> 08:54.800
all the time, you query subsections of the array, and those subsections are then returned to

08:54.800 --> 08:59.760
the user. The key features of Zarr are that it's storage-agnostic, so it doesn't really

08:59.760 --> 09:04.240
matter, or appear, to the user where the actual source is located: could be on a local file system,

09:04.320 --> 09:10.640
could be in an S3 bucket. It has great language interoperability: there are many different

09:10.640 --> 09:16.800
programming languages implementing clients for the

09:16.800 --> 09:23.600
Zarr library. Chunking I already mentioned; in general it's a hierarchical format, mapping groups,

09:23.600 --> 09:29.040
arrays and chunks to the corresponding file-system objects, right: so groups are mapped to

09:29.120 --> 09:35.360
directories, arrays are mapped to sets of files, and each chunk is one file on the file system.
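
NOTE
The mapping from an array region to chunk files can be sketched as follows (Zarr v2-style "i.j" chunk keys; a simplification of what the library actually does):
```python
import itertools

def chunks_for_slice(shape, chunk_shape, starts, stops):
    """Return the chunk keys a hyper-rectangular selection touches."""
    ranges = []
    for size, csize, a, b in zip(shape, chunk_shape, starts, stops):
        first, last = a // csize, (b - 1) // csize  # first and last chunk index hit
        ranges.append(range(first, last + 1))
    return [".".join(map(str, idx)) for idx in itertools.product(*ranges)]

# A 100x100 array with 10x10 chunks: reading rows 5..24, columns 0..9
keys = chunks_for_slice((100, 100), (10, 10), (5, 0), (25, 10))
print(keys)  # ['0.0', '1.0', '2.0'] -- three chunk files are fetched
```
Each returned key corresponds to one file on disk (or one object in a bucket), which is exactly why partial reads are cheap.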

09:37.600 --> 09:44.080
All right, so to answer the question of how we actually bridge the gap between GRIB and

09:44.080 --> 09:50.560
Zarr, I want to talk about the local use case first, where the FDB is running locally and

09:50.880 --> 09:58.160
the user is doing only local requests. We came up with a solution for

09:59.360 --> 10:06.640
bridging this gap as follows: the user first of all inserts the metadata we've seen earlier,

10:08.880 --> 10:15.520
and this can be arbitrary metadata, really whatever data the user is interested in, and it's

10:15.520 --> 10:23.360
handed to this FDB layer, which knows exactly how to query that sort of metadata.

10:23.360 --> 10:30.080
For this request, the user just queries it and asks the FDB for this metadata;

10:31.120 --> 10:37.760
the FDB then returns this information as what we call a virtual view to the

10:37.760 --> 10:42.720
user. This is just a Zarr store, but the Zarr store doesn't contain any data yet, so everything in

10:42.720 --> 10:49.440
there is lazy. There can be groups and arrays within the store, really an arbitrary structure, but

10:49.440 --> 10:56.160
nothing is loaded; it's just a virtual description of what could be in there. If we

10:56.160 --> 11:03.760
condense this image now to this, we have our Zarr store the user can interact with, but we also

11:03.760 --> 11:08.880
need to be able to query something from it. As soon as the user tries to access an array,

11:08.880 --> 11:15.120
I mentioned earlier already that certain chunks then get loaded, right? So what the user is doing

11:15.120 --> 11:22.000
is: he or she is accessing the array, and then the chunks underneath get queried from the Zarr

11:22.000 --> 11:26.800
group and array, and because the Zarr group or array knows exactly which store it's part of,

11:28.320 --> 11:36.800
our implementation is then able to map this query to a MARS request and send it to the

11:36.800 --> 11:45.200
FDB. The FDB then returns those bytes of the GRIB file back to the array, and the chunk

11:45.200 --> 11:51.360
actually gets delivered to the user. This is the high-level explanation of what is going on:

11:51.360 --> 11:56.800
this two-step approach between first defining the actual view and then querying the data

11:56.800 --> 12:03.760
from the actual database. To give you an overview of how this looks, here is the code; it's more or less readable,
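
NOTE
The two-step approach just described (define a lazy view first, fetch chunks only on access) can be sketched as a toy store; all names here are hypothetical, and `fetch_from_fdb` stands in for the real retrieval.
```python
class LazyView:
    """Toy virtual view: chunk keys are known up front, bytes are fetched on access."""
    def __init__(self, chunk_requests, fetch):
        self.chunk_requests = chunk_requests  # chunk key -> metadata request
        self.fetch = fetch                    # callback standing in for the FDB
        self.fetch_count = 0

    def __getitem__(self, chunk_key):
        self.fetch_count += 1
        return self.fetch(self.chunk_requests[chunk_key])

def fetch_from_fdb(request):
    # Stand-in for the real FDB retrieval: returns fake bytes for the request.
    return f"bytes-for-{request['param']}-{request['step']}".encode()

# Step 1: define the view -- nothing is loaded yet.
view = LazyView(
    {"0.0": {"param": "2t", "step": 0}, "1.0": {"param": "2t", "step": 6}},
    fetch_from_fdb,
)
print(view.fetch_count)  # 0

# Step 2: accessing a chunk triggers exactly one fetch.
print(view["1.0"], view.fetch_count)  # b'bytes-for-2t-6' 1
```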

12:04.720 --> 12:08.800
if not it's not a big deal because what I just wanted to show with that piece of code is

12:08.800 --> 12:15.120
how concisely one can define such a virtual store, or such a virtual view. There's actually, in that

12:15.120 --> 12:21.920
example, a little bit more going on: there are two parts which we define, and each part is

12:21.920 --> 12:29.120
a request for some meteorological data, and those are virtual arrays; they are not really

12:29.120 --> 12:36.720
existing in the Zarr, as I just described; they are just the mapping which has to be done when

12:36.720 --> 12:43.920
accessing the files from the FDB, right? And those are, in that case, concatenated together, and the

12:43.920 --> 12:49.680
initial requests are four-dimensional, but they are mapped with the axis definitions to an actual

12:49.680 --> 12:57.680
two-dimensional array in the Zarr world. This is also shown in the diagram on the lower part of

12:57.680 --> 13:02.640
the slide: there are two lazy views, they are concatenated together, and they end up in a Zarr

13:02.640 --> 13:10.400
array, and the dimensions you can see right below. The first dimension of this

13:10.400 --> 13:16.080
actual mapped array is the date/time dimension, where the date and the time of the initial

13:17.120 --> 13:21.920
metadata are mapped together into one dimension, and then the second dimension is the parameter/

13:22.000 --> 13:27.440
levelist dimension I mentioned, where the same thing happens with the parameter and the levelist. And the third

13:27.440 --> 13:32.480
dimension, although I just said that it's a two-dimensional array, the third dimension is implicit,

13:32.480 --> 13:41.120
and it stores all the values of the GRIB fields we query from the FDB. Okay, now we want to talk about
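
NOTE
The flattening of the four-dimensional request onto two Zarr dimensions can be sketched like this (the key names follow MARS conventions, but the mapping itself is purely illustrative):
```python
import itertools

# Request axes: (date, time) collapse into dimension 0,
# (param, levelist) collapse into dimension 1.
dates, times = ["20240101", "20240102"], ["00", "12"]
params, levelists = ["t", "u"], ["500", "850"]

dim0 = list(itertools.product(dates, times))       # date/time axis, 4 entries
dim1 = list(itertools.product(params, levelists))  # param/levelist axis, 4 entries

def chunk_to_request(i, j):
    """Map a 2-D chunk index back to the metadata request for the FDB."""
    date, time = dim0[i]
    param, levelist = dim1[j]
    return {"date": date, "time": time, "param": param, "levelist": levelist}

print(len(dim0), len(dim1))  # 4 4
print(chunk_to_request(1, 2))
# {'date': '20240101', 'time': '12', 'param': 'u', 'levelist': '500'}
```
The implicit third dimension then simply holds the decoded field values of the GRIB message that this request returns.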

13:41.120 --> 13:48.880
the remote case actually right because we want to bridge the gap between the HPC and the remote

13:48.960 --> 13:56.160
user, and Zarr has a lot of capabilities in that regard already. If you are a Zarr user, you probably

13:56.160 --> 14:03.680
are aware that there is an HTTP store and you can just hand it a certain URL to a data file

14:03.680 --> 14:09.920
somewhere located on a server, and it's just able to open it, and the user is able to access the

14:09.920 --> 14:16.320
content. This is what is depicted on the left-hand side with the Aqua dataset. In our case, because

14:16.320 --> 14:21.840
we had to implement this two-step approach of first virtually defining the view and then

14:21.840 --> 14:28.240
querying the data, we needed a slightly beefier implementation of the client and the server. All right,

14:28.240 --> 14:34.880
so the client and the server, as shown in this picture, take care of this two-step approach,

14:34.880 --> 14:40.880
and they are communicating over compressed HTTPS, but once we're on the server side, on the HPC,

14:40.880 --> 14:46.080
everything happens exactly like I just described for the local case so how would that look

14:46.080 --> 14:54.800
like? The first step would be: the client triggers the creation of a virtual view; the server then

14:54.800 --> 15:01.040
creates this virtual view of the FDB data, as described earlier, and returns the URL to the user,

15:03.040 --> 15:09.600
or to the client, and the client is then able to open the virtual view via an fsspec store and

15:09.600 --> 15:17.840
send the chunk requests back to the server. Once the server gets a request, it pulls the chunk

15:17.840 --> 15:24.880
data from the FDB, because it knows how to map the request, and sends the bytes, once it has

15:24.880 --> 15:31.280
them from the FDB, back to the client, and to the client it looks like the actual data was stored locally.
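
NOTE
The server side of that exchange can be sketched as a plain routing function. The URL scheme and all names below are invented for illustration; the real service differs.
```python
# Toy server-side routing: a chunk URL is mapped back to a view and a chunk key,
# then to the bytes for that chunk.
views = {
    "view-42": {  # view id -> chunk key -> fake stored bytes
        "0.0": b"chunk-00",
        "0.1": b"chunk-01",
    }
}

def parse_chunk_url(path):
    """'/views/<view_id>/chunks/<key>' -> (view_id, key)."""
    _, kind, view_id, kind2, key = path.split("/")
    assert kind == "views" and kind2 == "chunks", "unexpected URL shape"
    return view_id, key

def serve_chunk(path):
    view_id, key = parse_chunk_url(path)
    # A real server would map the key to a MARS request and ask the FDB here.
    return views[view_id][key]

print(serve_chunk("/views/view-42/chunks/0.1"))  # b'chunk-01'
```
On the client side, an fsspec-style store only needs to turn chunk keys into such URLs, which is why the data appears local.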

15:31.440 --> 15:40.320
And this is the entire process. So, to summarize what we have seen, I think

15:40.320 --> 15:46.000
the easiest way is to have a look at the diagram: basically, we were able to extract data from

15:46.000 --> 15:52.400
the FDB directly by creating virtual views, which can be concatenated; that is not necessary,

15:54.320 --> 15:58.640
it is an optional step, right, but can be done by the user, and then it's handed to the

15:58.960 --> 16:06.480
user, and really, the user only pulls the data once they're hitting a certain chunk. That's

16:06.480 --> 16:14.000
everything I have so far. Thanks a lot for your attention, and I want to thank the BMBF for sponsoring

16:14.000 --> 16:20.480
the project and also my colleagues who collaborated on the topic with me and yeah I'm looking forward

16:20.480 --> 16:39.040
to your questions. [Audience, partly inaudible] ...so I'm wondering how

16:39.040 --> 16:45.200
this interplays with netCDF files, in a way: if our scientists have a

16:45.200 --> 16:52.320
collection of netCDF files, do they have to convert them to Zarr data storage, or is this

16:52.320 --> 16:59.760
more of a cache, like the data is used on the fly? So, the question was:

16:59.760 --> 17:06.560
there are certain files, which may be HDF, netCDF and so on, and whether they need to be converted to

17:06.640 --> 17:12.640
Zarr to be usable. There is a certain library you can use, which is called Virtuali

17:12.640 --> 17:18.160
Zarr (we have nothing to do with it, it is part of the Zarr ecosystem), and it is basically

17:18.160 --> 17:26.080
able to open netCDF, HDF and, I think to some extent, GRIB files which are not stored in the FDB, just

17:26.080 --> 17:30.800
on a random file system, and expose them as Zarr, very close to this virtual approach

17:30.880 --> 17:37.280
we implemented, but a bit more general, working with other file systems as well. So that would be

17:37.280 --> 17:45.440
the option you have there; so there's no need for converting, but you could use this library.
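
NOTE
The idea behind that kind of virtualization can be sketched as a manifest of byte ranges into existing files, so nothing is copied or converted. This is a toy illustration of the concept only, not VirtualiZarr's actual API.
```python
import os, tempfile

# Write a fake "existing" data file that stays in its original format.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"HEADER--" + b"AAAABBBB" + b"TRAILER")

# Virtual manifest: chunk key -> (path, offset, length) into the original file.
manifest = {
    "0": (path, 8, 4),   # bytes "AAAA"
    "1": (path, 12, 4),  # bytes "BBBB"
}

def read_chunk(manifest, key):
    """Serve a chunk by reading the referenced byte range -- no conversion."""
    p, offset, length = manifest[key]
    with open(p, "rb") as f:
        f.seek(offset)
        return f.read(length)

chunk1 = read_chunk(manifest, "1")
print(chunk1)  # b'BBBB'
os.remove(path)
```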

17:45.520 --> 18:01.360
Okay, I actually brought a link collection as well. Oh yes, sorry, so the question was

18:01.360 --> 18:06.880
whether the API is already available. The local use case is already living in the FDB repository;

18:06.880 --> 18:14.560
the client-server thing I showed is under development, but it's going to hit the repository rather soon as well.

18:16.400 --> 18:22.400
thanks yeah

18:32.160 --> 18:38.080
Yeah, that's a very good question, actually. So the question was whether

18:38.080 --> 18:43.360
there's state saved on the server which persists the view, and this is actually done:

18:43.360 --> 18:51.920
currently we are just holding the view a certain user created, hashed, in memory, and this is

18:51.920 --> 18:58.640
reused once another user queries the same view. There are some open questions in that

18:58.640 --> 19:04.480
regard, for example if you have a lot of overlap between requests, what do you do? But this is more or

19:04.480 --> 19:27.040
less open. So the question was whether VirtualiZarr has some state, whether it

19:27.040 --> 19:31.840
has such state. I'm not entirely sure how VirtualiZarr works under the hood; I'm not

19:31.840 --> 19:38.720
affiliated with the project or anything. I think the two-step approach is hidden in the Virtuali

19:38.720 --> 19:42.400
Zarr stuff, because there has to be a mapping there as well, but I'm not entirely sure.

19:50.400 --> 19:57.680
Alright, if there are no further questions, thanks a lot.

