WEBVTT

00:00.000 --> 00:07.000
Thank you very much for the kind introduction.

00:07.000 --> 00:12.000
Again, I'm Dio. I work at TikTok in the US, and I work on this project.

00:12.000 --> 00:17.000
So, this project is closer to the application

00:17.000 --> 00:21.000
side, whereas other talks have focused on how you enable TEEs in

00:21.000 --> 00:23.000
VMs and so on.

00:23.000 --> 00:26.000
So it's a more high-level, application-focused talk:

00:26.000 --> 00:33.000
how we're going to use confidential computing in real work.

00:33.000 --> 00:36.000
So let me get started.

00:36.000 --> 00:40.000
So this is the outline. I'm going to start with why we need

00:40.000 --> 00:42.000
private data analytics and why it's hard.

00:42.000 --> 00:47.000
Then I'll talk about our project and also include a demo.

00:47.000 --> 00:50.000
So private data analytics.

00:50.000 --> 00:52.000
So why is private data so important?

00:52.000 --> 00:55.000
Everybody will agree that it's very useful for value extraction:

00:55.000 --> 00:59.000
personalized content, or

00:59.000 --> 01:03.000
training recommendation models for better recommendations, and so on.

01:03.000 --> 01:06.000
But not many people talk about public interest.

01:06.000 --> 01:12.000
So I'm going to focus on the public interest case.

01:12.000 --> 01:20.000
Sharing private data is actually really important for the public interest,

01:20.000 --> 01:22.000
especially public health.

01:22.000 --> 01:28.000
Some researchers want to use medical data; personal medical data is very

01:28.000 --> 01:34.000
strictly protected, as is personal health data collected by personal medical devices.

01:34.000 --> 01:41.000
Or for public safety: even strictly protected private data, PII like a personal address or

01:41.000 --> 01:47.000
phone number, can be used to identify public safety issues,

01:47.000 --> 01:51.000
like whether they're associated with crimes or illegal activities.

01:51.000 --> 01:56.000
In education, personal academic performance, like scores or attendance,

01:56.000 --> 02:01.000
and engagement information,

02:01.000 --> 02:06.000
in combination with some private data like address,

02:06.000 --> 02:11.000
can be used to find correlations between academic performance

02:12.000 --> 02:17.000
and where people live, their background, and so on.

02:17.000 --> 02:21.000
Or civic engagement: personal beliefs or social activities,

02:21.000 --> 02:27.000
how these personal beliefs affect a person's social activities, and so on.

02:27.000 --> 02:31.000
These are not the only examples.

02:31.000 --> 02:38.000
To give a very concrete example, one piece of research published at CCS last year

02:38.000 --> 02:43.000
aims to understand illicit drug promotion by using cross-platform data.

02:43.000 --> 02:47.000
What the researchers found was that

02:47.000 --> 02:53.000
there's a pattern in how illicit drug promoters operate.

02:53.000 --> 03:01.000
The promoters basically use cross-platform referral traffic

03:01.000 --> 03:08.000
to draw people into their drug promotion

03:08.000 --> 03:12.000
without getting detected by moderation.

03:12.000 --> 03:20.000
It's very hard to detect this kind of cross-platform activity,

03:20.000 --> 03:26.000
because YouTube and Instagram each only have their own data,

03:26.000 --> 03:29.000
and each has to make determinations based on its own data.

03:29.000 --> 03:37.000
In this case, the researchers were able to identify those cases by leveraging data

03:37.000 --> 03:40.000
from different organizations.

03:40.000 --> 03:46.000
Another example: there's a very big initiative in the UK called HDR UK.

03:46.000 --> 03:49.000
What they're trying to build is a trusted research environment,

03:49.000 --> 03:55.000
where they want to combine all the medical data from health providers

03:55.000 --> 04:02.000
to allow public researchers to get insights from those data.

04:02.000 --> 04:13.000
These are examples of efforts to provide access to private data for the public interest.

04:13.000 --> 04:15.000
But why is it hard?

04:15.000 --> 04:18.000
The first challenge, of course, is data privacy risk.

04:18.000 --> 04:23.000
There's a trust issue, because a lot of entities might have conflicts of interest;

04:23.000 --> 04:27.000
YouTube and Instagram, for example, might not want to share their data with each other.

04:27.000 --> 04:30.000
The same goes for data misuse or fabrication.

04:30.000 --> 04:35.000
Even if a researcher claims to be benign, it's still possible

04:35.000 --> 04:43.000
that the researcher does things with the private data that they never promised to do,

04:43.000 --> 04:49.000
like extracting private information from the data, and so on.

04:49.000 --> 04:54.000
And there's another issue: different trust domains.

04:54.000 --> 05:02.000
A lot of the time, data is processed in a place

05:02.000 --> 05:06.000
that is not owned or controlled by the data owner.

05:06.000 --> 05:12.000
This is especially true when you deal with multiple trust domains,

05:13.000 --> 05:18.000
and there's a big compliance issue.

05:18.000 --> 05:25.000
Apart from the security issues, you still have compliance in addition to security.

05:25.000 --> 05:28.000
You definitely need to protect the data.

05:28.000 --> 05:32.000
In addition to that, you have to keep all the privacy policies enforced,

05:32.000 --> 05:37.000
such as data retention or purpose limitation.

05:37.000 --> 05:45.000
And providing the raw data might not be legally allowed in some countries or areas.

05:45.000 --> 05:53.000
And also, changing the geolocation of the data could be legally restricted, and so on.

05:53.000 --> 06:01.000
The second challenge is that nowadays the data is distributed across multiple places,

06:01.000 --> 06:04.000
even for a single organization.

06:05.000 --> 06:10.000
In the old days, if an organization owned the data, they would have it on their own servers,

06:10.000 --> 06:12.000
which they managed and controlled.

06:12.000 --> 06:19.000
But these days, this is not true, because they often delegate the data to a third-party

06:19.000 --> 06:22.000
data warehouse, like Snowflake or Databricks,

06:22.000 --> 06:28.000
or they may even store it in cloud provider resources like storage buckets.

06:28.000 --> 06:34.000
And the compute also exists not only on the organization's servers,

06:34.000 --> 06:42.000
but also with the cloud provider, like running workloads in GCP or Azure, and so on.

06:42.000 --> 06:46.000
So this raises challenges around accountability and transparency.

06:46.000 --> 06:51.000
When things go wrong, let's say a data breach happens,

06:51.000 --> 06:56.000
who's going to take responsibility?

06:56.000 --> 07:02.000
It's very hard to determine what caused the breach and who's responsible;

07:02.000 --> 07:04.000
it's all about accountability.

07:04.000 --> 07:12.000
It's also very important to make it possible to verify every single data transfer

07:12.000 --> 07:18.000
and every single processing step on the compute nodes.

07:18.000 --> 07:22.000
That's what we need.

07:22.000 --> 07:30.000
So we concluded that we need a standard way to provide strong privacy protection mechanisms

07:30.000 --> 07:32.000
using various PETs.

07:32.000 --> 07:37.000
And it's not enough that these mechanisms exist;

07:37.000 --> 07:42.000
we also need to enforce them technically.

07:42.000 --> 07:46.000
Terms and conditions for the researchers are not enough;

07:46.000 --> 07:55.000
they cannot prevent them from abusing the data or violating the privacy policies.

07:55.000 --> 07:59.000
And we also need to have accountability and transparency,

07:59.000 --> 08:03.000
so we need to be able to provide a tool to the data owners

08:03.000 --> 08:10.000
with which they can confidently audit or verify what's happening with the data.

08:10.000 --> 08:13.000
And finally, there's usability.

08:13.000 --> 08:19.000
With all these guarantees, you should not have to sacrifice the results.

08:19.000 --> 08:26.000
Some PET technologies sacrifice the accuracy of the results for the sake of privacy,

08:26.000 --> 08:32.000
but that was not acceptable in our requirements when we designed this system.

08:32.000 --> 08:38.000
And in addition to that, we wanted it to be very, very easy to deploy,

08:38.000 --> 08:44.000
and very, very easy to use, and also very easy to customize.

08:44.000 --> 08:50.000
There were existing solutions already when we explored these problems.

08:50.000 --> 08:52.000
So one is the data clean room.

08:52.000 --> 08:57.000
In the industry, people are already using this kind of

08:57.000 --> 08:59.000
framework called a data clean room,

08:59.000 --> 09:06.000
where you basically define a policy on every single SQL statement,

09:06.000 --> 09:13.000
to define who can access which table, who can run which kind of query, and so on.

09:13.000 --> 09:21.000
And it's operated by some third party that has no conflict of interest with anyone

09:21.000 --> 09:23.000
who owns the data.

09:23.000 --> 09:27.000
The second option we thought about was differential privacy.

09:27.000 --> 09:31.000
Differential privacy pre-processes the data,

09:31.000 --> 09:38.000
or adds noise to the result of an aggregate SQL query,

09:38.000 --> 09:41.000
to theoretically limit the information leakage.
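
NOTE
A minimal sketch of the output-perturbation idea just described, assuming the Laplace mechanism; the query, sensitivity, and epsilon below are illustrative assumptions, not the speaker's implementation.
import numpy as np
def dp_count(true_count: int, epsilon: float = 1.0, sensitivity: float = 1.0) -> float:
    # Laplace mechanism: noise scale = sensitivity / epsilon
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
# e.g., a noisy answer to an aggregate query such as SELECT COUNT(*) ...
noisy_answer = dp_count(true_count=1042, epsilon=0.5)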

09:41.000 --> 09:45.000
The final option was trusted execution environments.

09:45.000 --> 09:58.000
So we assessed the pros and cons of each of these techniques.

09:58.000 --> 10:01.000
First of all, SQL-based data clean rooms:

10:01.000 --> 10:05.000
although they provide very good usability,

10:05.000 --> 10:09.000
allowing users to query anything on the data,

10:09.000 --> 10:12.000
and very high accuracy,

10:12.000 --> 10:17.000
they lack features to technically enforce the privacy policies.

10:17.000 --> 10:23.000
They also often lack privacy protection in general.

10:23.000 --> 10:34.000
Differential privacy, on the other hand, provides a very well-defined technical guarantee on privacy,

10:34.000 --> 10:38.000
but it kind of sacrifices accuracy.

10:38.000 --> 10:41.000
Trusted execution environments, on the other hand:

10:41.000 --> 10:46.000
they not only provide technical enforcement and high accuracy,

10:46.000 --> 10:52.000
but can also provide the transparency and accountability I mentioned before.

10:52.000 --> 10:58.000
One issue is usability; as I said, it could be better.

10:58.000 --> 11:04.000
By that I mean, for this type of data analytics,

11:04.000 --> 11:09.000
we figured that trusted execution environments are very hard to use,

11:09.000 --> 11:15.000
because the way analysts work with data doesn't match the model

11:15.000 --> 11:21.000
in which trusted execution environments deal with workloads.

11:21.000 --> 11:24.000
Let me talk a little bit more about that later.

11:24.000 --> 11:28.000
That's why we decided to build this project, ManaTEE.

11:28.000 --> 11:36.000
To this end, we built the framework with the following goals.

11:36.000 --> 11:40.000
First, technical enforcement of the privacy

11:40.000 --> 11:43.000
policy via various PET technologies.

11:43.000 --> 11:46.000
Secondly, we wanted it to be usable,

11:46.000 --> 11:50.000
so we wanted to provide an interactive tool to utilize the data.

11:50.000 --> 11:55.000
Third, accuracy: we should not sacrifice accuracy

11:55.000 --> 11:58.000
for the sake of anything else.

11:58.000 --> 12:02.000
And then, finally, transparency and accountability.

12:02.000 --> 12:05.000
Oh, actually, the last thing is deployment.

12:05.000 --> 12:09.000
We wanted to make it easy to deploy into the cloud.

12:09.000 --> 12:12.000
So these are our design goals.

12:12.000 --> 12:17.000
One of the observations we made is that

12:18.000 --> 12:21.000
the data analytics actually happens in two stages.

12:21.000 --> 12:25.000
One is the programming stage and the other is the execution stage.

12:25.000 --> 12:30.000
And each stage has very different requirements.

12:30.000 --> 12:32.000
So in the programming stage, usually,

12:32.000 --> 12:36.000
you only need a very small data set and a small amount of compute.

12:36.000 --> 12:40.000
You don't need like a thousand GPUs or something.

12:40.000 --> 12:44.000
And it's better for it to be very interactive,

12:44.000 --> 12:51.000
because when you program, you usually try some code with your data

12:51.000 --> 12:55.000
and play with the data to get some initial insights

12:55.000 --> 13:00.000
before you do the full analysis.

13:00.000 --> 13:03.000
But because of that, it's very hard to control the data.

13:03.000 --> 13:07.000
Researchers or users can do anything with the data,

13:07.000 --> 13:10.000
and it's very hard to control.

13:10.000 --> 13:13.000
So it has a very high privacy risk.

13:13.000 --> 13:17.000
On the other hand, once they're done with the programming,

13:17.000 --> 13:22.000
they run one very large batch job

13:22.000 --> 13:25.000
on the larger data set with more compute.

13:25.000 --> 13:29.000
And it only happens once, after you've programmed everything

13:29.000 --> 13:31.000
and made sure that it works.

13:31.000 --> 13:36.000
You just have to run it once to get the final output.

13:36.000 --> 13:40.000
And in this stage it's actually easier to control the data.

13:41.000 --> 13:44.000
It also has a lower privacy risk because of that.

13:44.000 --> 13:47.000
So the approach we took is:

13:47.000 --> 13:51.000
why don't we separate these stages and focus on

13:51.000 --> 13:54.000
different problems in each of the stages?

13:54.000 --> 13:59.000
For protection in the execution stage, where you run

13:59.000 --> 14:04.000
the workload as a large batch on the actual data,

14:04.000 --> 14:08.000
we can use confidential computing.

14:08.000 --> 14:12.000
And for the programming stage, we can use other PET technologies

14:12.000 --> 14:16.000
that have different trade-offs.

14:16.000 --> 14:19.000
Synthetic data is one example: you can use

14:19.000 --> 14:24.000
differentially private synthetic data to mock the actual data,

14:24.000 --> 14:29.000
so it has the same statistical characteristics

14:29.000 --> 14:34.000
but carries no risk of privacy leakage, for example.

14:34.000 --> 14:36.000
So that's the basic idea.
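
NOTE
A minimal sketch of the differentially private synthetic data idea, assuming a simple noisy-histogram generator for one numeric column; production systems use far more sophisticated DP synthesizers, so this is only to illustrate the trade-off.
import numpy as np
def dp_synthetic_column(values, bins=20, epsilon=1.0, n_samples=1000):
    # Histogram the real column, then perturb each bin count with Laplace noise.
    counts, edges = np.histogram(values, bins=bins)
    noisy = np.clip(counts + np.random.laplace(0.0, 1.0 / epsilon, size=bins), 0, None)
    probs = noisy / noisy.sum()
    # Sample synthetic values from the noisy distribution, uniform within a bin.
    idx = np.random.choice(bins, size=n_samples, p=probs)
    return np.random.uniform(edges[idx], edges[idx + 1])
# A synthetic stand-in that roughly preserves the column's distribution.
fake_ages = dp_synthetic_column(np.random.normal(40, 10, 5000))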

14:36.000 --> 14:39.000
And the benefit of this is that

14:39.000 --> 14:42.000
what you're actually doing is separating the data

14:42.000 --> 14:44.000
policy and the code policy.

14:44.000 --> 14:48.000
So you can flexibly choose

14:48.000 --> 14:51.000
the data policy in the programming stage.

14:51.000 --> 14:55.000
You can use LDP perturbation, sampled data,

14:55.000 --> 14:57.000
or DP synthetic data,

14:57.000 --> 15:01.000
however the data owner wants to protect the data's privacy,

15:01.000 --> 15:04.000
with whatever privacy budget they want,

15:04.000 --> 15:08.000
while still allowing

15:08.000 --> 15:12.000
users to get accurate results in the execution stage.

15:12.000 --> 15:18.000
And you can enforce the code policy in the execution stage.

15:18.000 --> 15:22.000
You'll get accurate results there

15:22.000 --> 15:26.000
because it will run on the full data set,

15:26.000 --> 15:30.000
which is securely enabled by confidential computing.

15:30.000 --> 15:34.000
Specifically, why is confidential computing so useful here?

15:34.000 --> 15:37.000
First, it provides a transition

15:37.000 --> 15:41.000
of trust, making it work with various trust models.

15:41.000 --> 15:45.000
Take cross-organization data providers as an example:

15:45.000 --> 15:49.000
not all the data providers

15:49.000 --> 15:51.000
may trust each other.

15:51.000 --> 15:56.000
In this case, you can move the execution to a cloud provider

15:56.000 --> 16:00.000
that has no conflict of interest

16:00.000 --> 16:05.000
and run the workloads there without needing to complicate the trust model.

16:05.000 --> 16:08.000
And in particular, the integrity of the execution, of course,

16:08.000 --> 16:13.000
is guaranteed by remote attestation plus the trusted execution environment.

16:13.000 --> 16:16.000
And also one of the very interesting things that we found

16:16.000 --> 16:19.000
is that the attestation report can also be used

16:19.000 --> 16:24.000
to prove that it was executed in a legitimate environment.

16:24.000 --> 16:29.000
Why this is very useful in our use case is that

16:29.000 --> 16:33.000
scientific research and evaluation results

16:33.000 --> 16:36.000
often need to be reproducible.

16:36.000 --> 16:42.000
But instead of having someone reproduce the entire study,

16:42.000 --> 16:44.000
you can just provide an attestation report and say,

16:44.000 --> 16:48.000
okay, this is the script and this is the output produced by this

16:48.000 --> 16:51.000
script in a certain environment.

16:51.000 --> 16:56.000
And this can serve as proof of the experiment

16:56.000 --> 17:01.000
and of the integrity of the evaluation, of the research.

17:01.000 --> 17:04.000
This is the ManaTEE data and code pipeline.

17:04.000 --> 17:09.000
So we use JupyterHub to provide a

17:09.000 --> 17:11.000
JupyterLab interface to the user,

17:11.000 --> 17:16.000
and the user can interact with the API using a

17:16.000 --> 17:20.000
JupyterLab extension; data access goes

17:20.000 --> 17:24.000
through the data SDK, which will access different

17:24.000 --> 17:26.000
data at different stages.

17:26.000 --> 17:29.000
When the API server gets a job,

17:29.000 --> 17:35.000
it will schedule the container as an executor

17:35.000 --> 17:38.000
on a TEE backend, and we made it flexible

17:38.000 --> 17:43.000
such that you can choose a different TEE backend

17:43.000 --> 17:47.000
depending on your needs.

17:47.000 --> 17:50.000
It's very easy to deploy the platform;

17:50.000 --> 17:53.000
it deploys into a Kubernetes cluster,

17:53.000 --> 17:56.000
either in GCP or in Minikube,

17:56.000 --> 18:01.000
and it will leverage some of the cloud resources if necessary.

18:01.000 --> 18:03.000
So here's the use case at TikTok.

18:03.000 --> 18:07.000
At TikTok we have the exact same problem,

18:07.000 --> 18:10.000
because we have to provide the data

18:10.000 --> 18:14.000
to public researchers to provide transparency.

18:14.000 --> 18:19.000
And TikTok has launched a product called

18:19.000 --> 18:22.000
VCE, based on this solution,

18:22.000 --> 18:26.000
and it was built on top of the open-source project.

18:26.000 --> 18:29.000
There are other potential use cases, obviously.

18:29.000 --> 18:34.000
We're exploring those. Now I'm going to quickly show you the demo.

18:34.000 --> 18:39.000
So the demo uses the insurance charges

18:39.000 --> 18:42.000
data set from Kaggle; it's an open data set.

18:42.000 --> 18:46.000
And the task is, we want to train a model that predicts

18:46.000 --> 18:51.000
insurance charges based on the data.

18:51.000 --> 18:54.000
And then we use differentially

18:54.000 --> 18:56.000
private synthetic data in the first stage.

18:56.000 --> 18:58.000
We provisioned that.

18:58.000 --> 19:00.000
Let me show the video.

19:00.000 --> 19:02.000
So in the JupyterLab interface,

19:02.000 --> 19:07.000
you can create a notebook.

19:07.000 --> 19:11.000
And then you initialize the environment.

19:11.000 --> 19:14.000
You can import the data SDK

19:14.000 --> 19:16.000
and initialize it.

19:16.000 --> 19:21.000
Then with the data SDK, you can access the stage-one data,

19:21.000 --> 19:23.000
which stands in for the raw data.

19:23.000 --> 19:27.000
And then you can actually explore the synthetic data

19:27.000 --> 19:31.000
by printing out a correlation heatmap,

19:31.000 --> 19:34.000
and also other things.

19:34.000 --> 19:38.000
And once you're ready, you can submit this job to the second stage.
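
NOTE
A minimal sketch of what this notebook flow could look like; the data SDK calls are illustrative assumptions, since the exact ManaTEE SDK interface isn't shown in the talk, so a plain CSV read stands in for them.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Hypothetical data SDK usage (names assumed):
#   from manatee import data_sdk
#   df = data_sdk.get_dataset("insurance")  # returns DP synthetic data in stage one
df = pd.read_csv("insurance_synthetic.csv")  # stand-in for the SDK call
# Explore the stage-one (synthetic) data, e.g., with a correlation heatmap.
sns.heatmap(df.corr(numeric_only=True), annot=True)
plt.show()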

19:38.000 --> 19:41.000
And then it will go to the API,

19:41.000 --> 19:43.000
and you can see that the image is building.

19:43.000 --> 19:46.000
The backend will build the image,

19:46.000 --> 19:49.000
the container image, and once the image is built,

19:49.000 --> 19:52.000
it will schedule it to the container,

19:52.000 --> 19:54.000
sorry, to the TEE backend.

19:54.000 --> 19:57.000
The TEE backend we're using here is Confidential Space.

19:57.000 --> 20:00.000
So you can see once the VM finishes,

20:00.000 --> 20:05.000
you can download the output,

20:05.000 --> 20:09.000
and in the output you can see the results from the real data.

20:09.000 --> 20:12.000
So here, the output privacy is not guaranteed,

20:12.000 --> 20:18.000
but you can add an additional step before the output download

20:18.000 --> 20:24.000
to make sure that nothing private goes out in the output.

20:24.000 --> 20:27.000
This is about the code policy.

20:27.000 --> 20:30.000
And let me skip this part.

20:30.000 --> 20:33.000
So, sorry.

20:33.000 --> 20:37.000
The later part will run XGBoost

20:37.000 --> 20:40.000
to train the model.
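
NOTE
A minimal sketch of the stage-two training job, assuming XGBoost regression on the Kaggle insurance charges data set; the column names follow the public Kaggle schema and the hyperparameters are illustrative.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
df = pd.read_csv("insurance.csv")  # the real data, available only inside the TEE
X = pd.get_dummies(df.drop(columns=["charges"]))  # one-hot encode categoricals
y = df["charges"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = xgb.XGBRegressor(n_estimators=200, max_depth=4)
model.fit(X_tr, y_tr)
print("R^2 on held-out data:", model.score(X_te, y_te))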

20:40.000 --> 20:43.000
And then the attestation report: you can download the attestation report

20:43.000 --> 20:48.000
and see that this is the attestation report

20:48.000 --> 20:50.000
from Google Confidential Space,

20:50.000 --> 20:52.000
and you can actually verify the signature,

20:52.000 --> 20:55.000
as well as compare the output hash

20:55.000 --> 20:59.000
that is in the attestation report to

20:59.000 --> 21:04.000
prove that this output was generated by this script

21:04.000 --> 21:07.000
with a certain hash, in a

21:07.000 --> 21:10.000
legitimate Confidential Space environment

21:10.000 --> 21:13.000
with SEV enabled.
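
NOTE
A minimal sketch of verifying the result offline; Confidential Space issues an OIDC-style attestation token, but the JWKS endpoint and the claim carrying the output hash below are assumptions that should be checked against Google's documentation.
import hashlib
import jwt  # PyJWT
def verify(report_token: str, output_path: str) -> None:
    # 1. Verify the token signature against the issuer's published keys.
    jwks_url = "https://confidentialcomputing.googleapis.com/.well-known/jwks"  # assumed endpoint
    key = jwt.PyJWKClient(jwks_url).get_signing_key_from_jwt(report_token)
    claims = jwt.decode(report_token, key.key, algorithms=["RS256"], options={"verify_aud": False})
    # 2. Recompute the output hash and compare it with the hash bound into the report.
    with open(output_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    assert claims.get("output_hash") == digest, "output does not match attestation"  # claim name assumed
    print("output verified against the attestation report")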

21:13.000 --> 21:18.000
So, yeah, that's pretty much it.

21:18.000 --> 21:24.000
I only have one minute left, so, yeah.

21:25.000 --> 21:28.000
Actually, you can try this out.

21:28.000 --> 21:32.000
It's fully open source; you can deploy it locally with Minikube

21:32.000 --> 21:35.000
and try it out. Although the Minikube version

21:35.000 --> 21:39.000
doesn't really use a TEE, you can still try the interface.

21:39.000 --> 21:43.000
And you can actually follow the tutorial to reproduce

21:43.000 --> 21:46.000
what I've shown here in GCP.

21:46.000 --> 21:49.000
If you have a GCP account, you can try that.

21:49.000 --> 21:53.000
And we're collaborating with Google on this project;

21:53.000 --> 21:58.000
you're always welcome to join us for more collaboration.

21:58.000 --> 22:01.000
Yes, that's it.

22:01.000 --> 22:06.000
Thank you.

22:06.000 --> 22:11.000
So, people can move in and out, but we can have some quick Q&A.

22:11.000 --> 22:14.000
Please speak up a little bit so we can hear you while a lot of people

22:14.000 --> 22:16.000
move in and out.

22:16.000 --> 22:18.000
Yeah.

22:19.000 --> 22:23.000
So normally, in a TEE,

22:23.000 --> 22:27.000
we have to run constant-time code,

22:27.000 --> 22:32.000
or essentially code whose branching pattern

22:32.000 --> 22:36.000
cannot depend on the data you're processing; otherwise you leak

22:36.000 --> 22:41.000
the data via, for example, a timing side channel.

22:41.000 --> 22:44.000
And it seems that you are running the code here just

22:45.000 --> 22:48.000
as it is, without those mitigations.

22:48.000 --> 22:53.000
So, what do you think about that?

22:53.000 --> 22:54.000
Okay.

22:54.000 --> 22:57.000
To repeat, the question was:

22:57.000 --> 23:00.000
it seems that we're not protecting against side channels,

23:00.000 --> 23:04.000
because if the execution time is not constant,

23:04.000 --> 23:08.000
it may be susceptible to a timing-channel attack, right?

23:08.000 --> 23:09.000
Yeah, that's a good question.

23:09.000 --> 23:11.000
I think it's a separate

23:12.000 --> 23:17.000
issue that can be addressed by some additional techniques.

23:17.000 --> 23:21.000
But what we're trying to solve here is not side channels;

23:21.000 --> 23:25.000
that's within the scope of the TEE itself.

23:25.000 --> 23:28.000
What we're doing is building

23:28.000 --> 23:34.000
this general private data analytics platform

23:34.000 --> 23:36.000
using existing TEEs.

23:36.000 --> 23:38.000
That's what we focused on in this work.

23:38.000 --> 23:42.000
But of course, if the workload were very susceptible

23:42.000 --> 23:45.000
to timing-channel or any other side-channel attacks,

23:45.000 --> 23:48.000
I think it should be addressed case by case.

23:49.000 --> 23:52.000
I have a question.

23:52.000 --> 23:55.000
You said that, for research,

23:55.000 --> 23:58.000
how can the researchers trust the results?

23:58.000 --> 24:00.000
Because you don't give them the data,

24:00.000 --> 24:03.000
you just give them the aggregate statistics.

24:03.000 --> 24:06.000
So how can you make sure that it's actually accurate,

24:06.000 --> 24:08.000
if they've only run this stage once?

24:08.000 --> 24:11.000
Yeah, actually,

24:11.000 --> 24:15.000
sorry, what's the question?

24:16.000 --> 24:20.000
What's the difference between

24:20.000 --> 24:23.000
using this and just giving them the data?

24:23.000 --> 24:26.000
And how can they still have

24:26.000 --> 24:29.000
confidence in the insights without having the raw data,

24:29.000 --> 24:31.000
when what they're given

24:31.000 --> 24:33.000
is only the output,

24:33.000 --> 24:35.000
after going through these stages?

24:35.000 --> 24:37.000
Yeah, I think the question is,

24:37.000 --> 24:41.000
what is the difference between just giving the data to the researchers

24:41.000 --> 24:44.000
and doing this approach,

24:45.000 --> 24:49.000
making sure that they're getting the correct insights for themselves.

24:49.000 --> 24:52.000
So, I think, in the second stage,

24:52.000 --> 24:55.000
you will get the full

24:55.000 --> 24:58.000
output using the real data,

24:58.000 --> 25:00.000
so you'll get the insight.

25:00.000 --> 25:01.000
The problem is

25:01.000 --> 25:03.000
the first stage, the programming stage,

25:03.000 --> 25:05.000
where you basically

25:05.000 --> 25:08.000
sacrifice some accuracy

25:08.000 --> 25:10.000
using some data protection techniques, right?

25:11.000 --> 25:13.000
And I think our argument is that

25:13.000 --> 25:16.000
you can run this job multiple times

25:16.000 --> 25:19.000
before you actually produce the final result.

25:25.000 --> 25:27.000
Yeah, yeah, yeah.

25:27.000 --> 25:29.000
Yeah, yeah.

25:29.000 --> 25:31.000
That's the model.

25:31.000 --> 25:33.000
Yeah, thank you very much.

25:33.000 --> 25:35.000
All right.

25:35.000 --> 25:36.000
Sorry.

25:36.000 --> 25:37.000
Sorry.

25:37.000 --> 25:39.000
Yeah, we can talk offline.

25:40.000 --> 25:42.000
Yeah, yeah.

