WEBVTT

00:00.000 --> 00:18.000
Okay, the second talk is Christian's: efficient histogramming for high-performance computing in C++ with YODA.

00:18.000 --> 00:19.000
Thank you, Christian.

00:19.000 --> 00:20.000
Thank you very much.

00:20.000 --> 00:21.000
Good morning.

00:21.000 --> 00:24.000
Um, thanks very much to the organizers for having me.

00:24.000 --> 00:31.000
My name is Christian. I'm a research software engineer at the Centre for Advanced Research Computing at University College London.

00:31.000 --> 00:36.000
And I'm also a particle physicist, analyzing collision data from the Large Hadron Collider or LHC.

00:36.000 --> 00:44.000
And today I want to talk to you about a tool called Yoda, which is a statistics library that we use in particle physics.

00:44.000 --> 00:49.000
But before we get into the software, let me just set the stage a little bit better.

00:49.000 --> 00:53.000
The LHC is the world's largest particle accelerator.

00:53.000 --> 00:57.000
It's housed in a 27-kilometer tunnel beneath the Swiss-French border.

00:57.000 --> 01:00.000
You can see this indicated in this aerial view.

01:00.000 --> 01:04.000
And if you squint your eyes, you can probably make out the city of Geneva and the Alps in the background.

01:04.000 --> 01:08.000
Inside this tunnel, we accelerate particles to nearly the speed of light.

01:08.000 --> 01:14.000
And then we smash them together at four interaction points, which is where the LHC experiments are located.

01:14.000 --> 01:20.000
We have ALICE, CMS, LHCb, and ATLAS, which is the one I'm working on.

01:20.000 --> 01:24.000
So the LHC generates petabytes of data annually.

01:24.000 --> 01:32.000
And just last year in July, the ATLAS and CMS experiments each surpassed the one-exabyte threshold.

01:32.000 --> 01:39.000
Which is the equivalent of about 3,000 years of uninterrupted Netflix streaming, or so ChatGPT tells me.

01:39.000 --> 01:45.000
Um, but even that is only about 10% of the total data set that we expect to record over the lifetime of the LHC.

01:45.000 --> 01:47.000
So it's a big data set.

01:47.000 --> 01:50.000
Um, this data isn't just the experimental results.

01:50.000 --> 01:52.000
It also includes simulated data.

01:52.000 --> 01:59.000
Coming from Monte Carlo event generators, which are used to model the processes that we expect to observe in nature.

01:59.000 --> 02:05.000
Simulated data allows us to compare our experimental measurements against theoretical predictions.

02:05.000 --> 02:13.000
And it plays a crucial role in understanding and interpreting our data.

02:13.000 --> 02:26.000
Yeah, at the same time, the scale of these simulated event samples presents a major challenge for efficient processing and analysis.

02:26.000 --> 02:38.000
If you've ever created a histogram with NumPy in Python, or even with Excel, then you might expect a workflow where you hold all of the data in memory and create a histogram from it.

02:38.000 --> 02:42.000
But this simply does not scale to data sets at the scale of the LHC.

02:42.000 --> 02:54.000
What we need is a tool that allows fast, in-loop analysis, summarizing statistical information while keeping memory usage minimal.

02:54.000 --> 03:10.000
And the solution there is updateable, constant-size objects that keep track of summary statistics, such as the mean and variance,

03:10.000 --> 03:18.000
with minimal memory usage.
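
[Editor's note] Such a constant-size accumulator can be sketched in a few lines. This is an illustrative sketch of the idea only, not the actual YODA API; all names here are made up:

```python
# Illustrative sketch (not the YODA API): a constant-size accumulator
# that tracks weighted first- and second-order moments, so the mean and
# variance can be computed without ever storing the individual samples.
class RunningMoments:
    def __init__(self):
        self.sumw = 0.0    # sum of weights
        self.sumw2 = 0.0   # sum of squared weights
        self.sumwx = 0.0   # first-order moment: sum of w*x
        self.sumwx2 = 0.0  # second-order moment: sum of w*x^2

    def fill(self, x, w=1.0):
        self.sumw += w
        self.sumw2 += w * w
        self.sumwx += w * x
        self.sumwx2 += w * x * x

    def mean(self):
        return self.sumwx / self.sumw

    def variance(self):
        return self.sumwx2 / self.sumw - self.mean() ** 2

acc = RunningMoments()
for x in [1.0, 2.0, 3.0, 4.0]:
    acc.fill(x)
print(acc.mean())      # 2.5
print(acc.variance())  # 1.25
```

However many samples are filled, the memory footprint stays at four doubles per accumulator.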

03:18.000 --> 03:24.000
And that's where YODA comes in. So YODA stands for Yet more Objects for Data Analysis.

03:24.000 --> 03:32.000
And it's designed around memory efficiency and high performance, and it's been used in particle physics for over a decade now.

03:32.000 --> 03:54.000
It's written in C++, but it also provides Python bindings, as well as a whole suite of command-line tools for output inspection, manipulation, format conversion, and things like this.

03:54.000 --> 04:06.000
Although YODA originated from a background of Monte Carlo event-generator analysis, it's designed to be really general purpose, and it's not tied to any specific application.

04:06.000 --> 04:16.000
YODA was designed around several key principles. Differential consistency is required, because histograms don't just count occurrences.

04:16.000 --> 04:26.000
They also represent probability densities at some level. And so you want to make sure that these objects scale correctly if the binning changes.
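
[Editor's note] To illustrate differential consistency with a hypothetical sketch (not YODA code): if the raw weighted counts are stored per bin, the density follows from dividing by the bin width, and merging bins leaves the integral unchanged:

```python
# Illustrative sketch: storing raw weighted counts per bin and deriving
# the displayed height (density) by dividing by the bin width keeps the
# histogram's integral invariant when adjacent bins are merged.
edges = [0.0, 1.0, 3.0, 4.0]   # 3 bins of unequal width
sumw = [2.0, 6.0, 2.0]         # weighted counts per bin

def heights(edges, sumw):
    return [s / (hi - lo) for s, lo, hi in zip(sumw, edges[:-1], edges[1:])]

print(heights(edges, sumw))    # [2.0, 3.0, 2.0]

# Merge the first two bins: the raw counts simply add,
# and the density is recomputed from the new widths.
merged_edges = [0.0, 3.0, 4.0]
merged_sumw = [2.0 + 6.0, 2.0]
print(heights(merged_edges, merged_sumw))  # first bin height is 8/3

# The integral (sum of height * width) is 10.0 in both cases.
```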

04:26.000 --> 04:41.000
Continuous aggregation allows us to process the data incrementally, as it comes in, rather than waiting until all the data is there and analyzing it later.

04:41.000 --> 04:50.000
Weighted statistical moments are stored, which allows us to also calculate uncertainties correctly in the case of weighted data sets.

04:50.000 --> 04:59.000
And for essentially all of the modern Monte Carlo event generators, weighted data sets are the norm.
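
[Editor's note] A hypothetical sketch (not YODA code) of why the squared weights need to be tracked: with weighted fills, the statistical uncertainty on a bin's content is the square root of the sum of squared weights, which reduces to the familiar sqrt(N) only when all weights are 1:

```python
import math

# Illustrative sketch: for weighted fills, the uncertainty on a bin's
# content is sqrt(sum of w^2), which is why the accumulators carry
# sumw2 alongside sumw.
weights = [0.5, 2.0, 1.0, 0.5]
content = sum(weights)                        # sum(w)  = 4.0
err = math.sqrt(sum(w * w for w in weights))  # sqrt(5.5), roughly 2.35
print(content, err)
```

With unit weights the same four fills would give an uncertainty of sqrt(4) = 2, so the weighted formula genuinely changes the result.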

04:59.000 --> 05:10.000
And integral consistency allows us to project higher-dimensional histograms into lower-dimensional histograms without introducing bias.
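
[Editor's note] A minimal sketch of such a projection, assuming (as above) that the stored per-bin quantities are additive weighted sums; this is illustrative only, not the YODA implementation:

```python
# Illustrative sketch: projecting a 2D histogram onto its x-axis by
# summing raw weighted counts over y. Because the stored quantities are
# additive sums (not densities), the projection introduces no bias and
# the total integral is preserved.
sumw_2d = [
    [1.0, 2.0],  # x-bin 0: counts in y-bins 0 and 1
    [3.0, 4.0],  # x-bin 1
]
proj_x = [sum(row) for row in sumw_2d]
print(proj_x)  # [3.0, 7.0]
assert sum(proj_x) == sum(map(sum, sumw_2d))  # integral preserved
```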

05:10.000 --> 05:25.000
There are a couple more principles to do with usability and robustness. We want to separate style from substance, meaning that the statistical data should remain invariant

05:25.000 --> 05:35.000
regardless of how it's plotted, so your choice of visualization shouldn't influence the design of your underlying statistical framework.

05:35.000 --> 05:39.000
We also want to separate the binning from the bin content.

05:39.000 --> 05:49.000
And that allows us to have a clear distinction between live and inert objects, meaning active statistical tracking and finalized representations.

05:49.000 --> 05:53.000
I'll show you an example in a couple of slides of what exactly I mean by that.

05:53.000 --> 06:01.000
Yeah, and finally, user friendliness. YODA has a clean and intuitive API.

06:01.000 --> 06:10.000
And the core library has zero dependencies, in fact.

06:10.000 --> 06:15.000
We've got a couple of optional dependencies, mainly to do with desired output formats.

06:15.000 --> 06:21.000
Think HDF5, but we'll get back to that when we talk about IO.

06:21.000 --> 06:28.000
So let's talk a bit more about some of the more fun features that we introduced in the back end to make this all happen.

06:28.000 --> 06:39.000
We've introduced a new, flexible axis class in YODA 2, the latest major version, which was released at the end of 2023.

06:39.000 --> 06:44.000
And this can handle both continuous and discrete axes.

06:44.000 --> 06:47.000
A continuous axis is by far the most common type.

06:47.000 --> 06:50.000
This is where we have n bins defined by n+1 edges.

06:50.000 --> 06:53.000
We've got an underflow and an overflow bin, one at each end.

06:53.000 --> 07:02.000
A discrete axis, if you want to contrast that, is really useful for categorical data, such as counting particle multiplicities.

07:02.000 --> 07:10.000
In this case we have one explicit bin per category, and then one "otherflow" bin that catches all the outliers.

07:10.000 --> 07:16.000
And here one of the main differences is that discrete bins don't have a bin width associated with them.
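
[Editor's note] A minimal sketch of continuous-axis lookup with n+1 edges plus underflow/overflow, illustrative only and not the YODA implementation:

```python
import bisect

# Illustrative sketch: a continuous axis with n bins defined by n+1
# edges, plus underflow/overflow; bisect finds the in-range bin index.
edges = [0.0, 1.0, 2.0, 5.0]  # 3 in-range bins

def bin_index(x, edges):
    """Return -1 for underflow, len(edges)-1 for overflow,
    else the 0-based in-range bin index."""
    if x < edges[0]:
        return -1                # underflow
    if x >= edges[-1]:
        return len(edges) - 1    # overflow
    return bisect.bisect_right(edges, x) - 1

print(bin_index(-0.5, edges))  # -1 (underflow)
print(bin_index(1.5, edges))   # 1
print(bin_index(7.0, edges))   # 3 (overflow)
```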

07:17.000 --> 07:23.000
Additionally, we introduced a binning class, which manages one or more of these axes.

07:23.000 --> 07:38.000
It efficiently maps between the global bin index and the tuple of local indices along these axes, supporting slicing and marginalization.
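
[Editor's note] The global-to-local index mapping can be sketched as a standard row-major conversion; again this is an illustrative sketch, not the YODA code:

```python
# Illustrative sketch: mapping between a global bin index and the tuple
# of per-axis local indices for a multi-dimensional binning (row-major).
shape = (4, 3)  # 4 bins along x, 3 along y

def to_global(local, shape):
    g = 0
    for idx, n in zip(local, shape):
        g = g * n + idx
    return g

def to_local(g, shape):
    local = []
    for n in reversed(shape):
        local.append(g % n)
        g //= n
    return tuple(reversed(local))

g = to_global((2, 1), shape)
print(g)                   # 7
print(to_local(g, shape))  # (2, 1)
```

Slicing and marginalization then amount to iterating over one local index while holding the others fixed.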

07:38.000 --> 07:43.000
The bin content in YODA is templated, which means it's quite flexible.

07:43.000 --> 07:53.000
Nevertheless, we provide two primary content types to handle raw data collection and finalized statistical representations.

07:53.000 --> 08:04.000
For the live bin content, we have this Dbn class, or distribution class, which tracks first- and second-order statistical moments dynamically.

08:08.000 --> 08:14.000
Anything else I wanted to say about this? No, I don't think so.

08:14.000 --> 08:18.000
And this was already available in YODA 1; in YODA 2,

08:18.000 --> 08:22.000
we've just generalized this to higher dimensions.

08:22.000 --> 08:35.000
Contrast that with the inert content type: we introduced an estimate class, and that really represents a finalized value with an optional error breakdown.

08:35.000 --> 08:50.000
It supports both correlated and uncorrelated uncertainty components, as well as their respective treatment in arithmetic operations.

08:50.000 --> 08:55.000
We have a bin wrapper class that takes the templated content and links it to bin properties.

08:55.000 --> 09:03.000
And all of that is bundled at the back end into a bin storage class, which is essentially

09:03.000 --> 09:08.000
the main histogramming object.

09:08.000 --> 09:11.000
And that can handle arbitrary data types, including nested structures.

09:11.000 --> 09:18.000
So you can play around with storage of storages, which can be quite fun.

09:18.000 --> 09:23.000
Um, we support both index-based lookups and coordinate-based lookups.

09:23.000 --> 09:30.000
And for the latter, we also optimize between linear and logarithmic lookups where it seems appropriate.
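
[Editor's note] One way such an optimization can work, sketched hypothetically and not taken from the YODA sources: for uniformly spaced edges the bin index follows from O(1) arithmetic instead of a binary search, and the same trick works in log-space for logarithmically spaced edges:

```python
import math

# Illustrative sketch: for log-uniform edges (here 1, 10, 100, 1000),
# the bin index can be computed arithmetically in log-space rather than
# by binary-searching the edge array.
lo, hi, nbins = 1.0, 1000.0, 3

def log_bin_index(x):
    step = (math.log(hi) - math.log(lo)) / nbins
    return int((math.log(x) - math.log(lo)) // step)

print(log_bin_index(5.0))    # 0
print(log_bin_index(50.0))   # 1
print(log_bin_index(500.0))  # 2
```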

09:31.000 --> 09:40.000
And finally, we also support bin masking, which allows us to disable specific bins, and that can be useful to create visual gaps in distributions as well.

09:40.000 --> 09:53.000
One of the main features in YODA is its ability to handle on-the-fly updates to the binning, and for that we introduced this object called the fillable storage, which is a polymorphic abstraction layer.

09:53.000 --> 09:58.000
It uses a fill adapter to manage the bin updates efficiently.

09:58.000 --> 10:04.000
The fill operation will return the index of the bin that has been filled, and that allows

10:04.000 --> 10:06.000
tracking of active regions.

10:06.000 --> 10:13.000
And that's a feature that can be useful for, for example, machine learning applications,

10:13.000 --> 10:20.000
where adaptive binning might become, uh, necessary.

10:20.000 --> 10:26.000
When the analysis is done, YODA allows reductions from live to inert objects.

10:26.000 --> 10:38.000
And we have specialized zero-dimensional cases if you want to just store key statistical objects without creating a dummy binning around them,

10:38.000 --> 10:43.000
such as simple counters and standalone estimates.

10:44.000 --> 10:54.000
You can also reduce both the live and the inert objects to simple scatter objects, which is very useful for plotting.

10:54.000 --> 11:02.000
And finally, all of the user-facing types that you see here also inherit from YODA's analysis object base class,

11:02.000 --> 11:12.000
and that introduces metadata storage as well, to provide additional context about the object.

11:13.000 --> 11:23.000
All of the analysis objects in YODA can also be serialized into a vector of doubles, and that makes it possible to distribute them efficiently across MPI ranks.

11:23.000 --> 11:38.000
And at the same time, it also enables efficient on-the-fly, or rather in-memory, merging of different histograms across different MPI ranks, without having to rely on

11:38.000 --> 11:42.000
expensive operations.
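
[Editor's note] A hypothetical sketch of why this merging is cheap, under the assumption from earlier that the live objects store only additive sums; plain Python stands in for MPI here, and this is not YODA or mpi4py code:

```python
# Illustrative sketch: because the live objects store only additive sums
# (sumw, sumw2, sumwx, ...), two histograms filled on different ranks
# can be merged by element-wise addition of their serialized double
# vectors, e.g. via an MPI reduction with MPI_SUM.
rank0 = [4.0, 4.0, 10.0, 30.0]  # sumw, sumw2, sumwx, sumwx2 on rank 0
rank1 = [2.0, 2.0, 8.0, 34.0]   # same layout on rank 1

merged = [a + b for a, b in zip(rank0, rank1)]
print(merged)                 # [6.0, 6.0, 18.0, 64.0]
print(merged[2] / merged[0])  # combined mean: 3.0
```

No per-event data ever has to cross rank boundaries; only the fixed-size moment vectors do.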

11:42.000 --> 11:48.000
And this design also makes it interesting for machine learning applications.

11:48.000 --> 11:53.000
In terms of IO, YODA supports multiple output formats.

11:53.000 --> 12:00.000
We have a compressed ASCII output format, which is metadata-friendly and human-readable.

12:00.000 --> 12:03.000
I put a little snapshot here for some dummy objects.

12:03.000 --> 12:09.000
Essentially, this is what it would look like in your editor: two histograms, with some metadata at the top,

12:09.000 --> 12:16.000
and then you've got different columns corresponding to different parts of the summary statistics, one row per bin.

12:16.000 --> 12:27.000
Alternatively, we also support an HDF5-based output format, which is clearly more suitable for HPC applications.

12:27.000 --> 12:39.000
We also provide a Cython-based Python API for scripting and workflow integration, and we also have a built-in plotting mechanism, which writes out

12:39.000 --> 12:48.000
matplotlib-based Python scripts that are completely standalone, in the sense that you don't even need a YODA installation to execute them.

12:48.000 --> 12:56.000
They contain all of the numerical data, and the plotting interface is fully customizable without changing the underlying data structures.

12:56.000 --> 13:11.000
And this ensures consistency and reproducibility, and makes them quite nice to share with your collaborators.

13:11.000 --> 13:25.000
So just to summarize: YODA is a small but powerful statistics tool. It's built around a couple of key principles, which are

13:25.000 --> 13:38.000
a clean, intuitive API, self-consistent plotting, minimal dependencies, and robust and scalable statistical data analysis capabilities.

13:38.000 --> 13:45.000
YODA 2, the latest major version, is a significant evolution in terms of usability, flexibility, and performance,

13:45.000 --> 13:52.000
and if you're working with large data sets and you need robust statistical analysis, YODA is a great tool to have.

13:52.000 --> 13:59.000
Thanks very much for your attention, and if you've got any questions, hit me.

14:22.000 --> 14:36.000
Okay, thank you for the talk, that was great. I have two questions. So one is: I understand that this is more of a library rather than an application,

14:36.000 --> 14:45.000
and the other question is: can you offload to GPUs, or can you make use of parallel STL that could be offloaded to GPUs?

14:46.000 --> 14:59.000
We haven't had a request to offload to GPUs, I don't think. In principle we probably could; we've never tried, but it could be done.

14:59.000 --> 15:02.000
If there's interest, I think we'd be happy to look into it.

15:02.000 --> 15:13.000
And the main thing in particle physics, where this originated, is that we need something that we can do in-loop, and the way that this is now designed

15:13.000 --> 15:19.000
solves this problem at some level. And so if you look at typical applications where this is used,

15:19.000 --> 15:25.000
histogramming doesn't even show up in the profiling plots anymore, because this essentially solves the memory issue.

15:25.000 --> 15:37.000
And if it were becoming a problem that you need to handle with GPUs, we'd adapt and look into this for sure.

15:37.000 --> 15:39.000
Thank you very much for the talk.

15:39.000 --> 15:50.000
What I wonder is, do you provide some NumPy-compliant API to use YODA, basically, or start using YODA before diving into the details?

15:50.000 --> 15:53.000
Yeah, so, I should repeat the question.

15:53.000 --> 15:58.000
The question was whether we provide some NumPy-compliant API. We do.

15:58.000 --> 16:08.000
The Cython-based Python API uses the C++ code underneath the hood, but if you, say, ask for an array of all the,

16:08.000 --> 16:14.000
we'll check whether you have NumPy available in your environment; if so, we'll return it to you as a NumPy array, and if not,

16:14.000 --> 16:20.000
we'll fall back to the default Python types. So yes.

16:20.000 --> 16:24.000
Any more questions?

16:24.000 --> 16:39.000
Yeah, so the other question that just came to my mind is: in the beginning, you mentioned you have these super large datasets.

16:39.000 --> 16:46.000
And then you also mentioned that this basically becomes no longer identifiable in the profiles.

16:46.000 --> 16:51.000
Is that because it's so fast, or because the other stuff is so slow?

16:51.000 --> 16:57.000
And how large is the data that this actually deals with?

16:57.000 --> 17:06.000
Typically it's the other stuff: the event reconstruction that you have to do,

17:06.000 --> 17:14.000
and calculating the observables that are going into the histograms. That takes a lot more compute than actually maintaining the objects.

17:14.000 --> 17:20.000
Obviously, you can stretch that and you can go for very large objects with very fine binning.

17:20.000 --> 17:27.000
But typically the kind of histograms that we tend to measure and publish don't have that many bins altogether.

17:27.000 --> 17:31.000
What is "that many" in your use case?

17:31.000 --> 17:33.000
Like, because I have no idea.

17:34.000 --> 17:37.000
It depends, it depends.

17:37.000 --> 17:47.000
But let's say if you book a histogram with 50 bins, that's not much really in the grand scheme of things.

17:47.000 --> 17:58.000
If you push this to a thousand bins or so, sure, that's a different thing. Thank you.

17:58.000 --> 18:02.000
Any more questions?

18:02.000 --> 18:09.000
Don't fall asleep.

18:09.000 --> 18:12.000
Okay. Thank you.

