WEBVTT

00:00.000 --> 00:07.440
So, hey, and I guess you get to introduce yourself.

00:07.440 --> 00:08.440
Absolutely, hey, everybody.

00:08.440 --> 00:09.440
My name is Ben Busby.

00:09.440 --> 00:13.120
I'm going to try to cram three lightning talks into a talk.

00:13.120 --> 00:16.920
First, I'm going to tell you about some cool stuff NVIDIA is building in the bioinformatics

00:16.920 --> 00:17.920
space.

00:17.920 --> 00:22.560
Second, I'm going to talk about what would happen if you wanted to sequence and distribute

00:22.560 --> 00:24.440
millions of genomes.

00:24.440 --> 00:28.960
And then third, if any of you are interested in knowledge graphs or RAG, I'm going to talk

00:28.960 --> 00:31.160
about cramming knowledge graphs and RAG together.

00:31.160 --> 00:32.160
Everybody good with that?

00:32.160 --> 00:33.640
All right, let's roll.

00:33.640 --> 00:38.760
So yeah, I'm Ben Busby, here are some disclosures, but I will say that a lot of us work

00:38.760 --> 00:42.960
with a bunch of academic institutions and stuff, but a lot of us at NVIDIA are really from the open

00:42.960 --> 00:43.960
source community.

00:43.960 --> 00:49.640
So, for my part, I remembered my Galaxy t-shirt, but I did forget my nf-core socks.

00:49.640 --> 00:55.540
Anyway, so yeah, at NVIDIA we work with pretty much everyone in the sort of

00:55.540 --> 01:01.060
industrial bioinformatics ecosystem, and also hundreds and hundreds of academic institutions,

01:01.060 --> 01:06.100
really trying to help speed up algorithms in bioinformatics.

01:06.100 --> 01:11.380
It's one of the main things we do, and really across the entire sort of biological sciences

01:11.380 --> 01:15.140
ecosystem, it's not shown here, but including agriculture.

01:15.140 --> 01:19.020
I'll talk about a few of these things, but today I'm not going to talk about robots, if

01:19.020 --> 01:22.100
you want to talk about robots after we can do that.

01:22.100 --> 01:28.980
Anyway, so yeah, Nvidia builds all kinds of stuff and also accelerates all kinds of open

01:28.980 --> 01:33.500
source algorithms so you can use them faster and faster, and then crams them together

01:33.500 --> 01:35.100
in convenient ways.

01:35.100 --> 01:39.700
So that's something to think about, but for those of you who are bioinformaticians,

01:39.700 --> 01:46.540
there's about 20 massively sped up regular bioinformatics algorithms.

01:46.540 --> 01:51.340
Here are about 10 of them, and if you're interested, check out Parabricks.

01:51.340 --> 01:57.060
By the way, some of them are in nf-core already, and others are in Galaxy.

01:57.060 --> 02:01.340
So that's a really nice thing. I'll talk a little bit about some things we're doing

02:01.340 --> 02:03.540
in single cell, as well as models in a minute.

02:03.540 --> 02:08.780
But if you're a bioinformatician and you want to go faster, Google for Parabricks.

02:08.780 --> 02:10.860
What do I mean by massively faster?

02:10.860 --> 02:19.580
So on an RTX PRO 6000, BWA is about 135 times faster than it is on a relatively comparable

02:19.580 --> 02:25.020
CPU, so there's a lot of speedup, which actually turns into cost savings if you're

02:25.020 --> 02:29.260
sequencing or mapping a lot of genomes.

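[Editor's note: as a sanity check on how a speedup becomes a cost saving, here is a minimal back-of-the-envelope sketch. Only the 135x factor comes from the talk; the 27-hour CPU runtime and both hourly rates are invented placeholders, not NVIDIA, Parabricks, or cloud pricing.]

```python
# Back-of-the-envelope: how a 135x speedup can turn into a cost saving.
# All runtimes and hourly rates below are hypothetical placeholders.
cpu_hours_per_genome = 27.0                 # assumed CPU wall-clock for BWA on one genome
speedup = 135.0                             # BWA speedup figure quoted in the talk
gpu_hours_per_genome = cpu_hours_per_genome / speedup  # 0.2 hours

cpu_rate = 2.00                             # $/hour for a hypothetical CPU node
gpu_rate = 12.00                            # $/hour for a hypothetical GPU node

cpu_cost = cpu_hours_per_genome * cpu_rate  # cost per genome on CPU
gpu_cost = gpu_hours_per_genome * gpu_rate  # cost per genome on GPU
print(cpu_cost, gpu_cost)
```

Even with the GPU node priced several times higher per hour, a two-orders-of-magnitude speedup leaves the per-genome cost far lower, which is the point being made about large sequencing cohorts.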
02:29.260 --> 02:34.280
So for single cell, things are even faster; the average single-cell workflow is about

02:34.280 --> 02:41.640
200 times faster on GPU, you can check it out, and to me that's really, really

02:41.640 --> 02:42.640
cool.

02:42.640 --> 02:45.760
So if you're interested in millions of cells or particularly hundreds of millions of

02:45.800 --> 02:49.680
cells, GPU really just makes sense.

02:49.680 --> 02:52.840
One of the really cool things about this is this all works in Jupyter notebooks too,

02:52.840 --> 02:56.240
so you don't have to know CUDA or anything like that, you can just run it.

02:56.240 --> 03:00.080
And then also a lot of people don't know that there's accelerated data science.

03:00.080 --> 03:06.880
If you Google rapids.ai, there's a pandas, there's a NumPy, and I think most importantly,

03:06.880 --> 03:10.720
there's a scikit-learn version called cuML.

03:39.200 --> 03:43.600
I think I'm more or less on time, but the vision is that, you know, in five years for people

03:43.600 --> 03:49.760
with very common etiologies of disease, we can really move towards hitting them pharmacologically,

03:49.760 --> 03:55.120
more or less straight off the sequencer. But I think something we ignore all the time, we think

03:55.120 --> 03:59.120
about variants, this variant, that variant, variant-disease, blah, blah, blah, but, you know, I mean,

03:59.120 --> 04:05.520
about 8% of variants have single-disease, single-variant associations, according to NHGRI

04:05.520 --> 04:11.280
and other places; the vast majority of diseases are multi-factorial.

04:11.280 --> 04:16.080
Variant annotation is very common; there are 300 variant annotators in this tool called

04:16.080 --> 04:20.880
OpenCRAVAT, but still we're missing a lot, we have a lot of variants of unknown significance, why,

04:20.880 --> 04:25.680
probably because of a lack of contextualization, because these are actually polygenic.

04:25.680 --> 04:32.000
So here's a bunch of reasons that diseases are multi-factorial, but why don't we think about this

04:32.000 --> 04:36.720
very well? Well, it turns out humans don't like to think about multi-factorial causation.

04:36.720 --> 04:42.400
Deep learning, fortunately, does not have this problem. But anyway, we decided to try to

04:42.400 --> 04:47.920
attack this problem head on in a very, very simplistic way. Some of these things, thank you.

04:47.920 --> 04:52.160
Some of these things seem quite complicated, but we thought we could do it simply,

04:52.160 --> 04:56.480
and by me, I mean me and a bunch of really clever graduate students at CMU,

04:57.440 --> 05:02.720
we downloaded the 1000 Genomes data, split them by haplotype blocks, or sorry,

05:02.720 --> 05:08.720
at recombination sites into haplotype blocks, and clustered them. What we saw is that in Puerto

05:08.720 --> 05:14.640
Ricans, British, and Han Chinese, around TNF, you get 9, 8, and 3 clusters of

05:14.640 --> 05:20.800
haplotype blocks. Looking at basic ancestry with HLA, you get 17, 15, 13. What does this mean?

05:20.800 --> 05:26.240
This means that the genome is discretizable. So basically what we can do is we can go from a linear

05:26.240 --> 05:32.320
genome to a pan genome, but we can make this pan genome without making millions and millions

05:32.320 --> 05:37.360
of alignments, which I think is really important, because that is actually computationally heavy,

05:37.360 --> 05:42.720
even on a GPU. So basically what we're doing is approximate hash encoding for very large genomics

05:42.720 --> 05:48.960
by using these haploblocks. So we go basically from genome graphs to approximate hash encoding,

05:48.960 --> 05:54.560
then what we can do is label these haploblock clusters with SNPs, and then

05:55.200 --> 06:00.320
convert them to binary strings, and then use these to come up with feature weights for different

06:00.320 --> 06:05.440
parts of the genome in a particular disease or phenotype. So that's something we're particularly

06:05.440 --> 06:11.120
excited about at this particular point. And I would like to note that a bunch of this was actually

06:11.120 --> 06:19.200
built at the ELIXIR BioHackathon in Berlin in November. So these are a bunch of things that

06:19.200 --> 06:23.680
Nvidia does, but I want to talk about actually knowledge graphs, and that's going to be sort of

06:23.840 --> 06:30.400
the third thing we talk about. So really thinking about the phenotype side of clustering of disease

06:30.400 --> 06:34.800
subtypes, it would be really nice if we could do disease clustering. Why? Because we've known

06:34.800 --> 06:39.520
all the etiologies of breast cancer for 30 years, or the four major etiologies, I should say.

06:40.560 --> 06:45.040
But this is what a SNOMED graph looks like right now. This is what a graph could look like.

06:45.040 --> 06:50.800
So there is a future there. And I think something I call PyG RAG could be part of the future.

06:51.440 --> 06:56.400
In full disclosure, I didn't do any of the engineering on this project. I just made this graphic,

06:56.400 --> 07:04.960
and I did that with Google Gemini. So not all that much. But anyway, so yeah, I mean, most of you know

07:04.960 --> 07:10.720
what GraphRAG is, anybody not know what GraphRAG is? It's written up there. So anyway,

07:12.000 --> 07:18.640
Great. So basically what we've done here is Retrieval Augmented Generation for an LLM.

07:19.520 --> 07:25.120
A lot of people do that now. But basically what we've done is put a PyTorch Geometric

07:25.120 --> 07:31.280
GNN in front of the LLM. And you can retrieve a subgraph. You can see who's made what contribution,

07:31.280 --> 07:37.600
et cetera, et cetera. And then you can concatenate the embeddings from a GNN and from an LLM.

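[Editor's note: the embedding concatenation just described can be sketched in a few lines. The vectors, names, and dimensions below are invented toy stand-ins for real GNN and LLM embeddings, not the project's actual code.]

```python
# Toy sketch: fuse a graph-side embedding with a text-side embedding by
# concatenation, so downstream retrieval can use both structured (graph)
# and unstructured (document) signals. Real embeddings would come from
# trained models; these fixed vectors are placeholders.
def concat_embeddings(gnn_emb, llm_emb):
    # Plain concatenation keeps both views; a learned projection layer
    # would usually follow in a trained system.
    return list(gnn_emb) + list(llm_emb)

gnn_emb = [0.1, 0.7, 0.2]        # e.g. from a GNN over a retrieved subgraph
llm_emb = [0.5, 0.3, 0.9, 0.4]   # e.g. from an LLM text encoder over a document
fused = concat_embeddings(gnn_emb, llm_emb)
print(len(fused))  # 7-dimensional fused embedding
```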
07:37.600 --> 07:43.600
Why is that important? Because then you can also do document retrieval. So you can go ahead and

07:43.600 --> 07:48.400
merge unstructured data with structured data. But the really cool thing here is you can pull

07:48.400 --> 07:53.120
in another knowledge graph, so you can actually do validation as well as asking natural language

07:53.120 --> 07:58.720
queries. We're doing a lot of experiments where we look at, you know, one-hop to five-hop prompting

07:58.720 --> 08:04.480
and bin it, and so on and so forth. Anyway, you do need to do some fine-tuning of the

08:04.480 --> 08:11.920
decoder. But we tested this out with Neo4j. You can check out this blog. And this was in the

08:12.000 --> 08:19.520
STaRK-Prime dataset. And you can actually double the accuracy of retrieval

08:19.520 --> 08:27.200
augmented generation by putting a GNN in front of the LLM. So that is all I wanted to tell you today.

08:27.200 --> 08:35.360
I think I managed to get all of this in nine minutes. They told me I had 10. So we should have five

08:35.360 --> 08:41.520
minutes for questions. Hopefully you guys have a bunch of questions. But so this is kind of my vision

08:41.520 --> 08:46.000
of the future: that we can discretize the human genome, and actually we're trying this with peanut

08:46.000 --> 08:53.120
and sorghum as well. Human and agricultural genomes, we can use those to develop models for

08:53.120 --> 09:01.600
phenotypes that we can actually represent in knowledge graphs to treat people. So treat people faster.

09:02.320 --> 09:04.640
So that's it. And happy to take any questions.

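[Editor's note: the haploblock discretization described in the talk, clustering blocks and then encoding the cluster labels as binary strings, can be sketched very loosely as below. The block names, cluster counts, and individual's labels are invented for illustration; only the idea of small per-region cluster counts (e.g. 3 around TNF, 13 at HLA) comes from the talk.]

```python
# Loose sketch of the discretization idea: represent a genome not as a
# linear sequence but as per-haploblock cluster labels, then one-hot
# encode those labels into one binary string usable as ML features.
# All data below is invented for illustration.

# Suppose clustering assigned each haploblock region a small number of
# possible cluster labels (e.g. 3 clusters around TNF, 13 at HLA).
clusters_per_block = {"TNF": 3, "HLA": 13}

def encode_genome(block_labels, clusters_per_block):
    """Turn per-block cluster assignments into one concatenated bit string."""
    bits = []
    for block, n_clusters in clusters_per_block.items():
        one_hot = ["0"] * n_clusters
        one_hot[block_labels[block]] = "1"   # one-hot encode this block's cluster
        bits.append("".join(one_hot))
    return "".join(bits)

# One (invented) individual: cluster 1 at TNF, cluster 4 at HLA.
genome_bits = encode_genome({"TNF": 1, "HLA": 4}, clusters_per_block)
print(genome_bits)  # "010" + "0000100000000"
```

Strings like these can then be compared or weighted per block for a disease or phenotype without computing millions of pairwise alignments, which is the computational saving the talk emphasizes.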
09:11.520 --> 09:25.120
All right. So somebody asked me when CUDA will be released as open source. So there's a full open

09:25.120 --> 09:32.320
source stack and I will say I have no idea. I work in a completely, I work in a very

09:33.280 --> 09:42.720
distant corner of NVIDIA where we really focus a lot on genomes. I certainly could ask

09:43.760 --> 09:48.800
whether they're going to tell me, I don't know. But one thing I will say is that I work a lot

09:48.800 --> 09:53.360
with the open source community, and there are so many tools developed around CUDA that I think what we

09:53.360 --> 09:58.240
see is open source developers will develop first around CUDA, just because it's so much easier,

09:58.240 --> 10:04.240
and then move to other sorts of GPU support systems. Yeah.

10:28.640 --> 10:37.920
So I don't know how to answer a question based around low-level stuff, but we can discuss around

10:37.920 --> 10:44.720
beers, I think. But that said, for all of the Parabricks modules, we spend a lot of time on

10:45.520 --> 10:51.680
reproducibility and matching to the original algorithms. And so people can check all of that out.

10:51.680 --> 10:57.200
And I mean, for example, we work a lot with the original development team. So for example,

10:57.200 --> 11:03.600
with DeepVariant, DeepSomatic, those sorts of things, to get matching between the GPU processes and

11:03.600 --> 11:08.400
CPU, we also do a lot of work with algorithm developers to try to get similar matching.

11:09.280 --> 11:11.760
And actually optimization as well. Yes.

11:27.200 --> 11:41.600
Yeah. So that's a great question. So somebody is saying, is there

11:41.600 --> 11:46.800
an attestation server? One thing I would say is, I mean, in genomics, oh yeah, yeah.

11:46.800 --> 11:56.560
I'm stepping back and repeating the question. So if I am remembering the question correctly,

11:56.560 --> 12:04.720
it's, so are we allowing people to go ahead and verify that

12:36.000 --> 12:40.800
Are there signed attestation artifacts? And I do not know the answer to that question.

12:40.800 --> 12:44.800
So, yep. Anybody else?

