WEBVTT

00:00.000 --> 00:16.240
Thank you very much for inviting me and I'm going to talk not so much about COVID but

00:16.240 --> 00:22.880
more about building artificial intelligence models for biomedical research, or machine learning

00:22.880 --> 00:25.160
models, as we actually call them.

00:25.160 --> 00:29.920
First of all, let me introduce myself. I'm a professor for Drug Bioinformatics at Saarland

00:29.920 --> 00:34.960
University. I hold a Master's in Mathematics and a PhD in Molecular Biology, and according

00:34.960 --> 00:38.360
to German law, I'm allowed to teach bioinformatics.

00:38.360 --> 00:46.800
I also have a semi-active Bluesky account, as it became the platform of choice these days

00:46.800 --> 00:47.800
in the community.

00:47.800 --> 00:52.720
First of all, let me tell you a little bit about where I am.

00:52.720 --> 00:59.200
Saarland is a little state in Germany on the French border, and geographically it's actually

00:59.200 --> 01:02.320
closer to Brussels or to Paris than to Berlin.

01:02.320 --> 01:11.440
In Saarland we have nice nature, we have nice food, we have some not-so-nice food, and we have

01:11.440 --> 01:15.600
the Helmholtz Institute for Pharmaceutical Research and the University with the Center for

01:15.600 --> 01:18.600
Bioinformatics which are my home institutions.

01:18.600 --> 01:25.840
So I will take one minute to do some advertising for the Center for Bioinformatics: we have

01:25.920 --> 01:31.440
a full study program in Bioinformatics in Saarbrücken, so if you know someone who wants

01:31.440 --> 01:39.080
to study Bioinformatics, consider sending them there. The Bachelor is taught in German, but

01:39.080 --> 01:44.240
the Master is fully in English, and 90% of the students there are international.

01:44.240 --> 01:51.040
We have five full professorships and two junior professor groups and the spread of topics

01:51.120 --> 01:58.160
is very broad, covering everything from algorithms to clinical bioinformatics to AI-driven

01:58.160 --> 01:59.160
drug discovery.

01:59.160 --> 02:04.000
Enough of that, let's talk about machine learning and biomedical research.

02:04.000 --> 02:11.200
I stole this figure from a paper; it actually reports models that are used in medical

02:11.200 --> 02:16.640
imaging, but it doesn't matter because the trend is the same in every field.

02:16.720 --> 02:24.480
Approximately from the mid-2010s, the number of papers reporting some machine learning or AI

02:24.480 --> 02:31.680
model in some biomedical domain exploded, and there is no stopping it, so I think it's growing

02:31.680 --> 02:36.160
and growing uncontrollably.

02:36.160 --> 02:41.840
But as we know, AI models are black-box models, and probably some people have seen this picture

02:41.840 --> 02:45.760
before. Raise your hand if you have. Yes, of course.

02:45.760 --> 02:51.440
So you know what the trick is: we don't really know how the model learns what it learns.

02:51.440 --> 03:00.880
So here we have an image classifier, huskies versus wolves, and the classifier seems

03:00.880 --> 03:07.360
to be very good at first glance, but if you look closer at what the model actually uses,

03:07.360 --> 03:15.200
what information it uses to make predictions, you figure out that the dog is not in the image at all,

03:15.200 --> 03:22.960
and what you actually trained is a perfect snow detector. The trick is: yes, dogs are usually photographed

03:22.960 --> 03:30.400
on some neutral background, while wolves are predominantly photographed in winter against a snowy background,

03:30.400 --> 03:39.120
and yeah, that's how it worked. It's all very funny until we actually enter some biomedical

03:39.120 --> 03:46.560
domain. And here COVID comes into play: when the epidemic started, of course, the number of

03:47.360 --> 03:54.640
machine learning tools to classify CT scans into COVID and non-COVID also exploded, and after

03:56.640 --> 04:02.640
some time into the pandemic, some people actually went to the trouble of analyzing some of these

04:02.640 --> 04:10.000
tools, and what they discovered is that the models, just like the husky-wolf classifier, didn't

04:10.000 --> 04:18.480
really learn from relevant information. What they learned was something auxiliary, in this case, for example, labels

04:19.120 --> 04:27.840
on the radiographs. The reason for this was that all COVID samples, so images of COVID

04:27.840 --> 04:36.720
CT scans, came from one hospital, and the images of healthy lung scans came from a different hospital,

04:36.720 --> 04:44.240
which was labeling its scans differently. So basically, what the model learned was to detect

04:44.240 --> 04:50.560
these labels, and unless you really do the analysis, in this case it was a feature importance

04:50.560 --> 04:56.640
analysis, you never know. And yes, what they showed here very nicely: if you swap the labels,

04:57.520 --> 05:06.640
the prediction swaps. So the model didn't learn anything. So if we are thinking about how our models work

05:06.640 --> 05:16.880
in general terms, we as developers have control over the blue area in this image but we don't

05:16.880 --> 05:24.480
have much control over the red area, where the model is going to be used, where it will be applied

05:24.480 --> 05:30.240
on some real data. So during the development cycle we have training, validation,

05:30.240 --> 05:39.840
and test data, and at deployment time we have some new data, the inference data. So what does

05:39.840 --> 05:46.240
information leakage have to do with this? Let's first define what information leakage is. By definition,

05:46.240 --> 05:52.000
it is the use of information during the model training process that would not be available during inference.
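One classic way this definition gets violated in practice is computing preprocessing statistics, such as a normalization mean, on the full data set before splitting. A minimal sketch with made-up toy numbers (not from the talk):

```python
# Toy illustration of information leakage through preprocessing:
# normalization statistics computed on the FULL data set use the test
# samples, which would not be available at inference time.
train = [1.0, 2.0, 3.0]
test = [10.0, 11.0]  # distributed differently from the training data

mean_full = sum(train + test) / len(train + test)  # leaky: sees the test set
mean_train = sum(train) / len(train)               # legitimate choice

print(mean_full)   # 5.4 -- shifted towards the test samples
print(mean_train)  # 2.0
```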

05:53.120 --> 05:59.760
You can think of your training data as a table. You have features in columns and you have samples

05:59.760 --> 06:06.000
in rows, and you can have information leakage in both directions. Feature leakage is very easy.

06:06.000 --> 06:10.160
You have a feature that is highly correlated with the variable that you're trying to predict. For

06:10.160 --> 06:15.920
example, you are trying to predict the yearly salary of some employees, and you have a column that

06:15.920 --> 06:22.480
reports their monthly salary. Very useful, but not really what you want your model to learn.
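A quick way to catch such a feature is to screen correlations with the target before training. A hedged sketch; the data, feature names, and the 0.99 threshold are invented for illustration:

```python
import statistics

# Sketch: flag features suspiciously correlated with the target,
# e.g. monthly salary when predicting yearly salary. Toy data only.

def pearson(xs, ys):
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

yearly = [36000, 48000, 60000, 54000]  # prediction target
features = {
    "monthly_salary": [3000, 4000, 5000, 4500],  # leaky: yearly / 12
    "age":            [25, 31, 45, 38],
}

for name, values in features.items():
    r = pearson(values, yearly)
    if abs(r) > 0.99:  # arbitrary threshold for "too good to be true"
        print(f"suspicious feature: {name} (r={r:.3f})")
```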

06:23.840 --> 06:30.960
Another type of leakage is across samples where for example you have samples in the training

06:30.960 --> 06:36.000
set and in the test set that are just identical. You have not cleaned your data properly. You

06:36.560 --> 06:41.440
have duplicates, and accidentally they end up in both the training and the test data.
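This cross-split duplication can be caught with a simple set intersection before training; a minimal sketch, with samples represented as tuples of feature values (toy data):

```python
# Sketch: catching identical samples that ended up in both the training
# and the test set.
train = [(0.1, 5.0), (0.2, 3.0), (0.9, 1.0)]
test  = [(0.2, 3.0), (0.5, 2.0)]

overlap = set(train) & set(test)  # samples present in both splits
if overlap:
    print(f"{len(overlap)} duplicate sample(s) shared between train and test")
    test = [s for s in test if s not in overlap]  # one way to fix: drop them
```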

06:42.720 --> 06:50.080
This can in theory be taken care of relatively easily. But there are other, more subtle ways

06:50.080 --> 06:57.680
in which information from the training data can leak into your test data. For example,

06:57.680 --> 07:03.520
through normalization with parameters fitted on all the data, or through non-identically distributed data, when your positive and negative

07:03.680 --> 07:10.640
samples are distributed differently. So what you can try to do, since you don't have control over

07:10.640 --> 07:18.720
deployment. You don't know what will happen at inference time. You can pay a little attention

07:18.720 --> 07:32.080
to your training process to try to stop leakage from happening. We published a paper recently where

07:32.080 --> 07:37.760
we have formulated seven questions that every prudent researcher should ask themselves

07:38.640 --> 07:46.320
when training AI models. But I'm going to focus on two of them that are related to a specific type

07:46.320 --> 07:53.840
of data leakage: sample similarity. So let's talk about splitting the data when you're training your

07:54.080 --> 08:01.360
model. You have a data set and you need to partition it, typically into training,

08:01.360 --> 08:07.200
validation, and test sets. Here they are shown in different colors, and this is one way of doing it,

08:07.200 --> 08:14.880
and this is a different way of doing it. And you see that they are different, but you probably

08:14.880 --> 08:25.120
don't know why. So the idea is that if your partitioning puts the training and test data far apart,

08:26.320 --> 08:33.520
this makes the life of the model kind of hard. You train on one part of the data space and you test

08:33.520 --> 08:43.360
on a different part of the data space, and in this way you challenge your model. And hopefully,

08:43.360 --> 08:53.680
when it comes to inference, the data that will be used for inference will most likely

08:53.680 --> 08:59.040
come from a different part of the data space. Hopefully, once you have challenged your model with

08:59.040 --> 09:03.680
this part of the data space, it will also do well on a different part. That's the

09:03.680 --> 09:13.120
hope. But what can you do without really knowing the inference data? So if we

09:13.120 --> 09:20.320
consider typical biological data sets: they can be one-dimensional, suppose you have a list of

09:20.320 --> 09:25.760
proteins that you want to predict something for, or they can be two-dimensional, if you have

09:25.760 --> 09:33.360
protein targets and drugs that can interact with them, and then you can split these data sets

09:33.360 --> 09:42.960
from the easiest splits to the hardest splits. So for a one-dimensional data set, what you can do to make the life

09:42.960 --> 09:51.760
of the model harder is to look for similarities in your data, and then try to put clusters of similar

09:51.760 --> 09:58.320
data points into one split, so either the training, the validation, or the test split. For two-dimensional

09:58.640 --> 10:05.840
data it's harder, because you have similarities across two dimensions. So the first thing that you

10:05.840 --> 10:10.880
can do, to make the life of the model a little more difficult than a random split with its misleading

10:10.880 --> 10:17.360
metrics, is to split by column or by row. In this way some drugs, for example, will never be seen

10:17.360 --> 10:22.880
by the model during training, and then at test time you will see how well it performs on unseen drugs,

10:22.880 --> 10:31.440
or, similarly, on unseen targets. And then you can also try to account for similarities and create a split

10:31.440 --> 10:40.080
that will first of all put all similar drugs and targets together. In this way you will inevitably

10:40.080 --> 10:45.600
lose some data, because some points you will not be able to assign to any split. So what we are presenting is

10:45.600 --> 10:54.720
a tool for controlling data leakage that, instead of allowing models to

10:54.720 --> 11:07.600
memorize their data, makes them generalize better on unseen data. So this tool takes your

11:07.600 --> 11:15.360
data set, and if it is a normal molecular data set, it automatically creates all possible

11:15.360 --> 11:21.600
splits of different types. So now if we consider again this drug-target interaction example, where

11:21.600 --> 11:26.960
you have some targets and some drugs and you want to predict whether they bind to each other,

11:28.000 --> 11:35.440
we can create all these different splits. And here are some metrics that I'm not going to go

11:35.440 --> 11:41.920
into very much detail, but what our tool can do is measure the amount of leakage based on

11:41.920 --> 11:48.560
the similarity between the data points. We have trained several standard machine learning

11:48.560 --> 11:58.240
models on these splits, and what we can show is that the less leakage between the splits, the

11:58.640 --> 12:07.840
worse the performance of the models. We could also show that our tool makes the life of models harder

12:07.840 --> 12:15.040
than any other tool that was on the market at the time of publication. So yeah, to conclude,

12:15.040 --> 12:21.680
that data leakage is a problem in predictive models and it should be addressed before you even start

12:21.680 --> 12:28.000
training your model. It should be addressed when you are looking at your data set and

12:28.720 --> 12:36.640
what is also a bit counterintuitive is that better models, models that can generalize better

12:37.680 --> 12:45.280
to unseen data, often perform worse, at least in the reported benchmarks. Of course, in an independent

12:45.280 --> 12:50.960
benchmark the truth will come out, because you would probably challenge models with some hard splits,

12:51.040 --> 13:00.640
but if you read papers, you often see benchmarks on random splits, and this is a bit

13:00.640 --> 13:07.840
disappointing for people who try to develop really good models. So, that's all from my side

13:07.840 --> 13:11.520
for today, thank you very much.

13:21.040 --> 13:32.080
Thank you so much for listening. The question was: is there a way to visualize

13:34.160 --> 13:41.280
information leakage? We can't think of a way to visualize it,

13:41.280 --> 13:49.280
but we can measure it and we have developed a formula that measures similarity between all data

13:49.280 --> 13:54.480
points in the data sets and between the data sets, and in a way this is the objective function

13:54.480 --> 13:58.480
of the optimization problem that we are solving in our tool.
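The talk does not spell the formula out, so the following is only a sketch of the general idea: score a split by each test point's maximum similarity to any training point, averaged over the test set. The similarity function and data here are made up for illustration; the tool's actual formula may differ.

```python
# Illustrative leakage score: for every test sample, find its most similar
# training sample, then average. Higher score = more leakage-prone split.

def similarity(a, b):
    # toy similarity for 1-D points; real data would use e.g. sequence
    # identity or fingerprint similarity
    return 1.0 / (1.0 + abs(a - b))

def leakage_score(train, test):
    return sum(max(similarity(t, s) for s in train) for t in test) / len(test)

train = [0.0, 1.0, 2.0]
print(leakage_score(train, [0.1, 1.9]))    # near-duplicates: high score
print(leakage_score(train, [10.0, 12.0]))  # far from training data: low score
```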

14:09.760 --> 14:14.080
So the question was how many GPUs we need to train the models.

14:15.040 --> 14:22.320
The point of the tool that I was talking about and all the analysis that I was talking about

14:22.320 --> 14:28.160
is that it was not training the models; it came before training. It was data analysis, and splitting the

14:28.160 --> 14:37.760
data is also a hard task. So what we did was to formulate this as an optimization problem,

14:38.080 --> 14:43.760
solving it using integer linear programming. Now we are doing this with

14:43.760 --> 14:52.080
standard solvers, so for bigger data sets it's not so much an issue of time, time is usually in the range of hours,

14:52.080 --> 14:59.120
but it's often an issue of memory. So one of the next problems that we need to solve is to develop

14:59.440 --> 15:03.120
a specialized solver for this.
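For intuition only, the optimization described here can be mimicked at toy scale by exhaustively searching for the train/test split with the least cross-split similarity. This brute force is not the ILP implementation from the talk (which scales far better); all numbers and the similarity function are illustrative:

```python
from itertools import combinations

# Toy stand-in for split optimization: among all ways to hold out 2 of 5
# points as a test set, pick the split with the least train/test similarity.
# The real tool formulates this as an integer linear program; brute force
# is only feasible for tiny data sets.
points = [0.0, 0.2, 5.0, 5.1, 9.0]

def sim(a, b):
    # made-up similarity for 1-D points
    return 1.0 / (1.0 + abs(a - b))

def cross_similarity(train, test):
    return sum(sim(a, b) for a in train for b in test)

best_test = min(
    combinations(points, 2),
    key=lambda t: cross_similarity([p for p in points if p not in t], t),
)
print(best_test)  # the near-duplicate cluster (0.0, 0.2) ends up together
```

Note how the optimum keeps the two similar points in the same split, exactly the behavior described for the tool.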

15:11.280 --> 15:15.040
Thank you very much Olga.

