WEBVTT

00:00.000 --> 00:15.160
Hello, everyone. We'll start the next talk. Robert will talk more about the work after Ivan

00:15.160 --> 00:22.080
has introduced quantization. Now he's working with the result of it. And the funny thing,

00:22.080 --> 00:26.960
the fun thing about being here at the devroom is that it's the first time the guys

00:27.040 --> 00:32.240
actually met, and they will talk to each other. So this is a great place to meet. Robert, it's

00:32.240 --> 00:39.680
all in your hands. All right. Thank you. Hey, everybody. I'm Robert Collins. So let me just give

00:39.680 --> 00:47.360
you a bit about myself, just some background. And I have to read from my notes, I apologize.

00:47.360 --> 00:53.520
I'm not used to public speaking. But I have a varied background in visual arts and in front-end

00:53.520 --> 00:59.760
and full-stack development and enterprise software development, and lots of things that are not

00:59.760 --> 01:07.520
related to AI. But I was interested when I went back and completed my master's in 2019,

01:07.520 --> 01:14.480
I did a capstone project, creating an ecosystem for scheduling notifications for e-learning apps.

01:14.480 --> 01:20.640
And I built an LSTM model to do that scheduling. And that was a lot of fun and very interesting

01:20.640 --> 01:29.760
to me. So I was inspired to pursue it further. I completed a postgraduate diploma

01:29.760 --> 01:38.880
in applied AI/ML. And I was sick for a couple of years with a bone infection, and it took me

01:38.880 --> 01:45.520
out of practice for a long time. And so I found it difficult to continue in the work that I was doing

01:45.520 --> 01:54.320
in web development. So I decided to apply myself to software in AI. So I took a position in

01:54.320 --> 02:00.160
reinforcement learning. And I'm generally interested in AI and how AI can make immediate and

02:00.160 --> 02:05.360
small positive impacts in people's lives. And I feel like that actually is starting to happen

02:05.360 --> 02:11.360
in large language models, which is the thing that came right after the LSTM model that I built

02:11.360 --> 02:21.600
in 2019. And I did actually get to see, while I was in university, early code-generating

02:21.600 --> 02:31.520
LLMs. So this project is llama-gguf-optimize. It's up on GitHub. It's a toolkit for optimizing

02:31.520 --> 02:39.280
multi-lingual model quantization through iterative refinement of importance matrices. So you'll see

02:39.280 --> 02:42.960
the word multilingual used in relation to this project, and that really has to do with the

02:42.960 --> 02:49.040
background of the project. Really, what importance matrices allow you to do is to ensure that domain

02:49.040 --> 03:00.640
competencies are preserved in the model. So for example, in the community, code competency

03:01.520 --> 03:04.640
has been something that has been of interest to people. They've been building importance

03:04.640 --> 03:13.520
matrices that ensure that the model performs as well after quantization as it can in regards to

03:13.520 --> 03:22.560
code competency. So, for the remainder of the background about the

03:22.560 --> 03:29.520
project. I have a friend who is a language instructor. She's Bolivian, and she works with students

03:29.520 --> 03:38.080
who speak Quechua or come from families that speak Quechua. And I kind of speak a little bit of

03:38.080 --> 03:45.440
Spanish and also Portuguese, emphasis on "kind of", but I actually hope to

03:45.440 --> 03:52.400
retire in Brazil. And so I have an interest in language. Maybe that's why I was interested in

03:52.400 --> 04:00.480
this Salamandra model from the Barcelona Supercomputing Center. And that model encompasses over 30

04:00.480 --> 04:06.160
European languages. And I wanted to quantize that model for Ollama, because it's a popular runtime

04:06.880 --> 04:14.480
and it didn't have it. Ollama uses llama.cpp, and I noticed that llama.cpp offers importance

04:14.480 --> 04:20.080
matrices to produce better models. But the data sets that I found just weren't really appropriate for that.

04:20.640 --> 04:26.400
So, what I mean by that is that the data sets that were available to generate

04:26.400 --> 04:32.400
importance matrices, or at least the easiest ones to find, were focused just on English and

04:32.400 --> 04:38.400
or code. So I worked through making my own for the model and using it and doing my research

04:38.400 --> 04:43.520
with Perplexity AI. And I realized, when I went back to write about my experience and show what

04:43.520 --> 04:49.040
I had done, that I had made some assumptions that, given the type of research that we get with

04:49.040 --> 04:56.640
Perplexity AI, were just not actually well grounded. So I wanted to actually ground it better

04:56.640 --> 05:02.640
than that. And I went about doing that and that led to building this project. So let me see.

05:05.360 --> 05:11.120
So to give you some of the background that leads to the use of the importance matrices. The first

05:11.120 --> 05:18.000
paper here shows that the preservation of domain competency is orthogonal to the minimization of

05:18.000 --> 05:23.920
loss in quantization. What this means, to the authors, is that you should use calibration samples

05:23.920 --> 05:29.360
for the preservation of domain competencies. The second paper is an addendum about how to build

05:29.360 --> 05:35.440
your data sets for importance matrices. And it's not showing anything of use specifically to my

05:35.440 --> 05:41.920
project; it's not really about importance matrices or llama.cpp, it's more about calibration data.

05:41.920 --> 05:48.480
But the idea, I think, provides a default assumption that balanced data sets are more appropriate

05:48.480 --> 05:54.800
for calibration. And then what that means is, if you have a model

05:55.520 --> 06:03.120
that's designed to interact in English, Spanish, and Portuguese, you might be tempted to just

06:03.120 --> 06:09.600
lazily rely on a wealth of English language data in your calibration. This would be informed

06:09.600 --> 06:13.360
by an understanding that in pre-training, whenever you're building one of these large language

06:13.360 --> 06:17.840
models, you use so much data that you provide information about other languages in the process

06:17.840 --> 06:22.960
because languages are related. So just training something on English, you expect that it will

06:22.960 --> 06:30.160
learn essentially various European languages pretty well out of the box. And that kind of is true

06:30.160 --> 06:36.160
with just a sprinkling of other languages. But if you're really interested in preserving a wealth

06:36.240 --> 06:42.080
of expression in other languages, if these other languages really are important to you, maybe

06:42.080 --> 06:50.240
you shouldn't do that. And then the final two articles here really make the same claim as the

06:50.240 --> 06:57.040
first article, at least for me, and how I'm going to use them. But they tie it specifically

06:57.040 --> 07:10.880
to language competency. So the tools I'm using, obviously, are llama.cpp, because it provides an

07:10.880 --> 07:16.640
importance matrix, which is generated from a data set. The data set I'm using is a

07:16.640 --> 07:22.000
subsample of pre-training data that you can get from models whose teams are gracious enough to provide

07:22.080 --> 07:28.960
it. The data can be chosen to reflect competencies of import to the user.

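The flow he describes, building an importance matrix from calibration data and then quantizing with it, can be sketched as command lines for the llama.cpp tools. This is a sketch under my assumptions: the model and file names are illustrative, and the `llama-imatrix`/`llama-quantize` flags reflect the llama.cpp CLI to the best of my knowledge.

```python
# Sketch: assembling the llama.cpp commands for imatrix-guided quantization.
# File names and the quant type are illustrative placeholders.

def imatrix_cmd(model, calib_data, out):
    """Command line for llama-imatrix: computes importance statistics
    for the model's weights over the calibration text."""
    return ["llama-imatrix", "-m", model, "-f", calib_data, "-o", out]

def quantize_cmd(model, imatrix, out, qtype="Q4_K_M"):
    """Command line for llama-quantize, guided by the importance matrix."""
    return ["llama-quantize", "--imatrix", imatrix, model, out, qtype]

if __name__ == "__main__":
    print(imatrix_cmd("salamandra-7b.gguf", "calibration.txt", "salamandra.imatrix"))
    print(quantize_cmd("salamandra-7b.gguf", "salamandra.imatrix", "salamandra-Q4_K_M.gguf"))
```

These could be passed to `subprocess.run` in an orchestration script; returning the argv list keeps the sketch testable without the binaries installed.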
07:28.960 --> 07:35.280
And I'm also using llama.cpp because Ollama uses it, and it's a widespread and accessible

07:35.280 --> 07:44.080
runtime. Other runtimes or systems can use it as well, like LM Studio. And all of that can change.

07:44.080 --> 07:49.280
I was going to say, I wrote in my notes that there's a rumor that Ollama is

07:49.280 --> 07:54.800
possibly working on its own runtime, or perhaps a pluggable system. Actually, a draft pull request

07:54.800 --> 08:01.520
came in the day before this event started, where they set up a pluggable system for MLX. So they are

08:01.520 --> 08:08.560
in fact heading in that direction. It's worth noting also that Unsloth recently built out internal

08:08.560 --> 08:14.400
tools for dynamically leaving some weights unquantized, for multimodal and for text-only models.

08:15.280 --> 08:20.080
And they're not saying that they won't open source it at some point. Conversationally,

08:20.080 --> 08:25.360
fully unofficially, just to me, and not in any way in a binding sense. Like, I don't want to

08:26.560 --> 08:32.640
prejudice what they would do or anything. But it just sounded like in a very casual conversation

08:32.640 --> 08:37.280
that they would be open to open sourcing at least the text-only variants of that.

08:38.320 --> 08:42.160
And I think the changes that are required to make this system work with theirs,

08:42.960 --> 08:47.920
the project that I'm going to show you here, are not too big. And that's because this system

08:47.920 --> 08:54.560
largely orchestrates the llama.cpp CLI; it provides a framework for using that system

08:54.560 --> 09:01.120
in an analytical fashion. The second library: there are places where I prefer

09:01.120 --> 09:08.720
to use Python and integrate directly rather than using the CLI, and so I use llama-cpp-python.

09:08.720 --> 09:16.720
So just very quickly, we've heard a better demonstration or description of this at the

09:16.720 --> 09:23.040
beginning of the last presentation. But an importance matrix is a mechanism for identifying which parts

09:23.040 --> 09:30.080
of a neural network are most crucial for its function. So now you have a sense of what the problem

09:30.080 --> 09:36.800
kind of is, and the approach that llama.cpp provides to deal with that, how it kind of

09:37.680 --> 09:45.920
uses an approach that's similar to calibration data in quantization. What this project does is

09:45.920 --> 09:52.240
addresses the challenges of quantization loss in language diversity and in models

09:52.240 --> 09:58.080
with specific domain competency. So the community, again, has largely been focused on producing

09:58.080 --> 10:02.880
a single dataset or like a small number of them that reflect standard domain competencies

10:02.880 --> 10:09.040
like mostly English and code. And there's been some additional focus on French and Chinese and

10:09.040 --> 10:16.000
I'm sure you could find others if you really looked. But I feel like we could do more; the tools

10:16.000 --> 10:23.200
are there to conform it to whatever competencies the model was designed for. So this is

10:23.200 --> 10:31.680
a visualization you can produce from the work that you do with the toolkit that I'm presenting

10:31.760 --> 10:39.760
here. Let me describe this. From left to right, the z-axis shows what are called KL divergence

10:39.760 --> 10:44.960
values. Some of you may be familiar with that; if not, there's something called perplexity,

10:45.680 --> 10:50.480
which measures, like, the loss in the output. This comes before that: at

10:50.480 --> 10:54.640
the logits level, we're looking at the individual logits that are used to produce output.

10:55.120 --> 11:02.960
And then the height here, the density, describes the distribution basically.

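The quantity being plotted, per-token KL divergence between the full-precision model's output distribution and the quantized model's, can be computed from the two models' logits along these lines. This is a sketch, not the project's actual code; the array shapes are my assumption.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over the vocabulary axis.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def token_kl(logits_base, logits_quant):
    """Per-token KL divergence D(P || Q), where P is the full-precision
    model's next-token distribution and Q is the quantized model's.
    Inputs: (n_tokens, vocab_size) arrays of raw logits."""
    logp = log_softmax(np.asarray(logits_base, dtype=np.float64))
    logq = log_softmax(np.asarray(logits_quant, dtype=np.float64))
    # KL = sum_v P(v) * (log P(v) - log Q(v)), per token position.
    return (np.exp(logp) * (logp - logq)).sum(axis=-1)
```

When most tokens have near-identical distributions, most KL values cluster near zero, which is why the density plot has that tall yellow ridge.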
11:04.080 --> 11:08.480
So like all of the values that are yellow, that are very high there, that's because almost all

11:08.480 --> 11:13.360
the values are the same. And so each one that you're looking at here, in depth, is not really

11:13.440 --> 11:26.720
a manifold; it's like a series of individual quantizations. I forget what I was about to say there,

11:26.720 --> 11:36.800
I apologize. But yeah, so there is a metric that's provided with each of these

11:36.800 --> 11:41.600
that converts that distribution into a single number. And that's just based on community input,

11:41.600 --> 11:47.360
the community discussed it, and described using the KL divergence mean, and also that the

11:48.080 --> 11:54.160
90th, 95th, and 99th percentiles, the larger percentile values, were more important. And so I use

11:54.160 --> 12:01.120
like a weighted mean of those values. And it could be a different thing. It doesn't have to be

12:01.120 --> 12:05.760
those values. I actually think, in retrospect, I would like to use the

12:05.760 --> 12:12.800
Kuiper test instead, which is a different single-value expression of that distribution that

12:12.800 --> 12:20.080
pays attention to the outliers. Maybe that would be a great addition to the project: to allow

12:20.080 --> 12:25.920
the modification of the metric used in producing these graphs, as opposed to having to modify the

12:25.920 --> 12:32.160
source code for the production of this graph. But the important thing is, if you look at this metric,

12:32.160 --> 12:40.720
there's diminishing returns. It goes from roughly 0.39 to 0.34, 0.295, 0.294. And at a thousand

12:40.720 --> 12:46.080
you'll see it's different. And what this is, is the Salamandra model I was describing

12:46.080 --> 12:51.760
earlier. So on the first one, the one that's closest to us, is the unquantized model versus

12:51.760 --> 12:58.320
quantization with no importance matrix. And then it's 250 samples per language, over the 30

12:58.320 --> 13:13.680
some languages. And then 500 and 750. So yeah, that was from the initial run that I was

13:13.680 --> 13:19.440
using to go back and verify. And the way that I approached it was slightly different. It's been

13:19.440 --> 13:25.520
demonstrated that if you select from a data set roughly the top 10% of chunks

13:25.520 --> 13:31.680
in terms of populations of outliers or something like the KLD values at the 95th or 99th percentile,

13:31.680 --> 13:37.520
that you can get better results. So for example, what was shown in the discussion was that if you take

13:37.520 --> 13:42.080
500,000 samples and reduce down to 45,000 samples, it's only a

13:42.080 --> 13:46.640
couple of percent better, but you can get a better result out of it. If you

13:46.640 --> 13:52.720
look at the metric here, there's an elbow obviously at around 500 samples. So it might make sense

13:52.720 --> 13:59.040
if you really wanted to maximize your results to go up to 5,000 and reduce down to 500 samples per

13:59.040 --> 14:13.920
language. This is the same chart, just showing chunk-by-chunk metrics. So llama.cpp provides an iterative approach,

14:13.920 --> 14:21.360
and this toolkit provides the same. The llama-imatrix tool, for example, allows you to compose

14:21.360 --> 14:26.400
an imatrix and then add to it in that fashion. You can build a data set with a certain amount of

14:26.400 --> 14:31.600
text data, and then you can get more text data, refer to your old data set, and produce a new data set

14:31.600 --> 14:36.240
that's even larger. In fact, I created a small pull request so that you could build one data set,

14:36.240 --> 14:41.760
build another and combine them. This toolkit is designed with the same sense of visibility.

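That combining step can be sketched as another command-line builder. Note the hedge: passing prior matrices with repeated `--in-file` reflects my understanding of the `llama-imatrix` CLI, and the file names are illustrative.

```python
def merge_imatrix_cmd(in_files, out_file):
    """Command line for combining previously computed importance matrices
    with llama-imatrix. Each prior matrix is passed with --in-file (flag
    name is my assumption about the CLI); the result is written to -o."""
    cmd = ["llama-imatrix"]
    for f in in_files:
        cmd += ["--in-file", f]
    return cmd + ["-o", out_file]

if __name__ == "__main__":
    print(merge_imatrix_cmd(["run1.imatrix", "run2.imatrix"], "combined.imatrix"))
```

This supports the iterative workflow described above: build a matrix from one batch of text, build another later, and merge them rather than recomputing from scratch.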
14:42.800 --> 14:50.960
There's an emphasis also on storage space for people who have concerns in that fashion.

14:51.760 --> 14:57.520
And there's a usage guide. There's even a podcast in the README for this that I did in

14:57.520 --> 15:03.440
NotebookLM, to help make it approachable. I'm going to rush through for lack of time.

15:05.200 --> 15:11.200
The imatrix data set tool and KLD bench are the two core tools that you'll actually use. The other

15:11.200 --> 15:15.680
two tools are the ones that actually do the work, but hopefully you won't have to use them directly very much.

15:15.680 --> 15:20.400
The top one is an interface to download and, if you so desire, randomly shuffle

15:21.360 --> 15:26.560
data sources for making an importance matrix with your text data. It's a pluggable system;

15:26.560 --> 15:31.600
it provides default access to the OSCAR corpus, as it so happens, but you can build it for

15:31.600 --> 15:37.600
whatever. In fact, I started working on another model recently, called Sailor2, and some of

15:37.600 --> 15:45.200
their data was only available unlabeled and randomly shuffled, and I just needed a couple of

15:45.200 --> 15:50.480
languages from it. So I modified my plugin to use a library to identify the language,

15:50.480 --> 15:55.440
and if the confidence was high, add it into the data set, and that made it work very

15:55.440 --> 16:00.640
simply. KLD bench accepts a model, or logits that are already generated, and it orchestrates

16:00.640 --> 16:07.200
the other two scripts. It compares logits chunk by chunk, and all of this is resumable.

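The single-number summary mentioned earlier, a weighted mean of the KLD mean and its upper percentiles, might look like this in outline. The specific weights here are illustrative assumptions, not the project's actual values.

```python
import numpy as np

def kld_summary(kld_values, weights=(0.4, 0.2, 0.2, 0.2)):
    """Collapse per-token KL divergence values into one score: a weighted
    mean of the overall mean and the 90th/95th/99th percentiles, so that
    tail behavior (the worst-affected tokens) pulls the score up.
    The weight values are illustrative, not the project's."""
    kld = np.asarray(kld_values, dtype=np.float64)
    stats = np.array([kld.mean(),
                      np.percentile(kld, 90),
                      np.percentile(kld, 95),
                      np.percentile(kld, 99)])
    w = np.asarray(weights, dtype=np.float64)
    return float((stats * w).sum() / w.sum())
```

Because the metric is just a function over the distribution, swapping in a different summary (such as a Kuiper-style tail-sensitive statistic) only means replacing this one function.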
16:08.000 --> 16:13.040
It only keeps two logits files for the data on disk at a time. I think that's important to highlight,

16:13.120 --> 16:19.200
unless you generate them in advance. If you have terabytes of disk space, then maybe you don't

16:19.200 --> 16:22.720
want to regenerate them over and over again whenever you're doing the baseline because you'll be

16:22.720 --> 16:29.440
regenerating the same data, so you could run the logits generation just once. These are the visualization

16:29.440 --> 16:36.320
tools that are provided. I have a whole thing on early stopping but as I'm running low on time,

16:36.320 --> 16:40.720
I'm going to skip that. The first thing that you saw was the composite comparison. There's also

16:40.720 --> 16:48.080
a tool that provides a similar manifold looking at chunk output, the KL divergence values,

16:48.080 --> 16:53.440
for a single run on the model. And then the KLD benchmark reader is a plain-text tool.

16:56.880 --> 17:03.760
This is the early stopping that I kind of skipped, but it just provides a way

17:03.760 --> 17:08.720
to segregate out a section of data that you're going to use for testing, for comparison purposes,

17:08.800 --> 17:15.120
that is not going to go in later, whenever you generate quantizations with importance

17:15.120 --> 17:24.160
matrices. I have to wrap up now, so I apologize. I just wanted to show you that there's excellent

17:24.160 --> 17:32.960
documentation as well and yeah that's just the final single run visualization. Thank you very much.

17:33.920 --> 17:43.840
And of course you can always talk to Robert after the talk, there's plenty of time. And tomorrow

17:43.840 --> 17:48.880
we have the Plumbers Conference, if you didn't know, so we can meet.

