WEBVTT

00:00.000 --> 00:27.000
So, hello again, everybody. My name is Cedric, and I'm joined today by my good friend Carol Chen. Today we would like to present, and actually do some demos on, the concept of synthetic data and synthetic data generation. This has been quite a hot topic recently, and you might have heard about it in the news, right?

00:27.000 --> 00:48.000
If you are on YouTube as much as I am, you might have seen interviews with Sam Altman or the Zuck, as I like to call him, talking about synthetic data and model distillation. Or, for example, you might have seen in new model cards that a lot of the frontier and state-of-the-art models are now being trained with synthetic data, right?

00:48.000 --> 01:00.000
You start thinking about that: AI is making content for AI, right? Are we going to have model collapse? Or, in another light, are we going to soon run out of training data?

01:00.000 --> 01:09.000
If you take a look at this chart, there's essentially a prediction of about 2027, 2028, where we might run out of available tokens to use for model training.

01:09.000 --> 01:20.000
It brings us to this general idea of synthetic data and for our talk today how synthetic data is powering the next evolution of language models.

01:20.000 --> 01:27.000
Gartner here reports that by 2030 synthetic data might overshadow real data in language models.

01:27.000 --> 01:35.000
All things to think about, but we'll walk through this today. We'll talk about challenges of AI in production,

01:35.000 --> 01:53.000
how synthetic data can help, and where it fits in the LLM pipeline of pre-training, post-training, and model evaluation, and do some cool demos, and show how, say, for example, if you're using a model that's maybe been tuned with synthetic data for a particular use case, you could have 1,000

01:53.000 --> 02:04.000
times less cost, for example, than using a standard ChatGPT. So saving money, saving the environment, these are all great things that synthetic data can do,

02:04.000 --> 02:10.000
feel free to check out the slides, and we'll do a quick introduction. Carol, would you like to go first?

02:10.000 --> 02:22.000
Thanks, Cedric. So I'm Carol Chen. I'm also from Red Hat. I do a lot of support for upstream communities, and some of them are listed here, including InstructLab, which utilizes synthetic data.

02:22.000 --> 02:29.000
So if you want to find out more information, it's instructlab.ai, but you can also find me as cybette on many different networks.

02:29.000 --> 02:38.000
Thank you. Yes, Carol's awesome. My name again is Cedric. I'm an engineer and advocate, and I organize KCD in New York City.

02:38.000 --> 02:49.000
And I make YouTube videos, so if you want to learn about AI topics on those whiteboard things, feel free to check that out. So, Carol, I'll pass it back to you.

02:49.000 --> 03:00.000
All right, thanks, Cedric. So before we dive into synthetic data, let's quickly look at some of the challenges of AI adoption. First of all, there's cost.

03:00.000 --> 03:09.000
So the cost of everything, from inferencing at scale to training. However, this is changing.

03:09.000 --> 03:16.000
I'm sure most, if not all, of us here have heard about DeepSeek. Well, there are some questions about, you know, some of their claims.

03:16.000 --> 03:26.000
The reality is that innovation, especially open source innovation, can impact costs and efficiency both at the model level and beyond.

03:26.000 --> 03:36.000
In addition to saving energy and other resources, performance is also increased in these new open models, and DeepSeek is definitely not the only one.

03:36.000 --> 03:43.000
For example, there are the Granite models, if you've heard of them. And this is a good trend, and open source is fueling it.

03:43.000 --> 03:47.000
We're looking at several orders of magnitude here, from billions to millions.

03:47.000 --> 03:54.000
Although I've just come across some articles debunking the, you know, $6 million claim by DeepSeek.

03:54.000 --> 04:05.000
But anyway, according to a recent Gartner report, more than 30% of GenAI projects will be abandoned, with cost being one of the major factors.

04:06.000 --> 04:12.000
Next, we look at complexity. Let's say we want to customize our LLMs with our own data.

04:12.000 --> 04:18.000
The curation of data sets is usually non-trivial and very complex for non-data scientists like myself.

04:18.000 --> 04:26.000
So whether it's business-centric data or, you know, domain-specific data, we need to incorporate this additional knowledge,

04:26.000 --> 04:34.000
because some of the limitations are due to those models being too generic.

04:34.000 --> 04:44.000
And not only that, more and more companies and organizations need to align models with private and sensitive data, such as in healthcare.

04:44.000 --> 04:49.000
These complex training and tuning processes, you know, you have to deal with that.

04:49.000 --> 04:55.000
And not to mention the lack of quality data, you know, due to privacy reasons, or, you know, very few specialized

04:55.000 --> 05:01.000
domain experts that you can get the data from. So, like we said, the scarcity of data.

05:02.000 --> 05:22.000
Finally, in many cases we require the flexibility of deploying on hybrid clouds and using different types of models, such as, you know, smaller-scale models on the edge complementing larger language models.

05:22.000 --> 05:38.000
So the control of these can also help with dealing with tricky regulations, such as data privacy laws, handling data anonymization, and so on.

05:38.000 --> 05:48.000
So this flexibility allows organizations to take advantage of innovation while keeping private data secure.

05:48.000 --> 06:01.000
Now that we've covered some of these common AI challenges, what is the role synthetic data can play in addressing some of them? I'll hand it over to Cedric to share with you about that.

06:01.000 --> 06:17.000
Thanks, Carol. Lots of good points on why synthetic data generation is important. And it kind of boils down to this idea that data curation is difficult, right? Hiring data scientists in order to do the collection and

06:17.000 --> 06:29.000
other types of engineering you need to process, filter, and annotate data, and essentially refine it into an end result that you could tune a model with, is expensive.

06:29.000 --> 06:46.000
It's prohibitive, and for organizations that want a domain-specific model, it's just difficult. So synthetic data has been kind of a solution for this issue. You know, you could create domain-specific data for healthcare, for enterprises, for

06:46.000 --> 06:59.000
industries where you might not have it, where there might just be data scarcity, and for regulations like HIPAA or the EU regulations where you just can't work with medical data, for example, in certain types of ways.

06:59.000 --> 07:06.000
It allows you to substitute and have a complement for that real data that you can then test and train with.

07:06.000 --> 07:16.000
And for cost and efficiency, LLMs have been proven to be great annotators. And that's why you already see synthetic data being used in these state-of-the-art models.

07:16.000 --> 07:27.000
Because they're annotated perfectly, they're high quality, and LLMs, when they actually go through this process, as we'll see in some examples later, tend to explain things better than we do as humans.

07:27.000 --> 07:37.000
When we explain things, we, how do I say, kind of relate ideas in a different way than a model would, for example.

07:37.000 --> 07:48.000
And so by using models as annotators to explain and refine ideas, you can get better results, as we've seen with the ChatGPTs that have been trained on synthetic data.

07:48.000 --> 07:57.000
And, for example, DeepSeek itself was trained with synthetic data generation. And this brings us to the idea of pre-training for foundation models.

07:57.000 --> 08:07.000
So there's tons of different examples of refining, classifying and working with the available tokens out there on the internet in order to create and curate better data.

08:07.000 --> 08:27.000
So some of these might be, for example, FineWeb, which you might have heard of, which is a collection of kind of the crème de la crème of the web that is out there, right, for education and just for general purpose. Hugging Face did a cool experiment called Cosmopedia, where they generated a pretty large data set in order to train their small models.

08:27.000 --> 08:37.000
And they kind of took webpages, re-prompted them, and used that to fine-tune a model. And then even NVIDIA rewrote almost two trillion tokens to remove low-quality data, right?

08:37.000 --> 08:47.000
And what we're seeing is this trend for synthetic data generation of continually having more efficient and refined data. So it's more of a quality over quantity type of thing.

08:47.000 --> 08:55.000
Where instead of taking the whole internet and throwing it at a model for months and months, we're taking a specific, refined portion.

08:55.000 --> 09:00.000
In addition to pre-training, there's post-training and the fine-tuning process that we're going to look at today.

09:00.000 --> 09:10.000
For example, at Red Hat and IBM, we've got this project called InstructLab, which uses this student-teacher approach in order to teach a model the way that you and I would learn something.

09:10.000 --> 09:30.000
There's also Microsoft's AgentInstruct, which is a similar process of taking in all of your raw source materials, transforming them, and then having them as seed instructions in a taxonomy, which is a great way to organize data for training, to make sure that there are no biases or gaps in the model's capabilities.

09:30.000 --> 09:46.000
And there's also the stage of model evaluation where you can see synthetic data being used. So, for example, in various benchmarks that are out there, but also for replicating RAG data, so making sure that you don't have limitations or blind spots.

09:46.000 --> 10:06.000
So, for example, LLMs have been proven to be pretty good at annotating that data, and we're going to see that here in two different examples: essentially going through and being able to label existing data, and creating new data from seed examples.

10:06.000 --> 10:09.000
So who's ready for a live demo?

10:10.000 --> 10:14.000
Awesome, we've got four minutes here, so I'll try to speed run this.

10:15.000 --> 10:24.000
So the first example is, oh yeah, I thought our talk was 15 minutes, sorry about that. Cool, so that's exciting.

10:25.000 --> 10:29.000
Our first demo is going to be a BERT-style model.

10:30.000 --> 10:36.000
Yes, it's perfect, because I'll explain the demo real quick. So BERT was like the OG ChatGPT, right?

10:36.000 --> 10:48.000
And it's a little bit different of an architecture, because instead of going left to right, decoder-style, it's an encoder, so it takes in the entirety of the text that you're processing with it.

10:48.000 --> 10:56.000
So we're going to use it for an example to take a large corpus of text, let's say it's like an investment banking company and they have a portfolio.

10:56.000 --> 11:01.000
They want to know, hey, are the analysts liking the market or not liking the market?

11:01.000 --> 11:12.000
So what this is going to do at the end of the day is provide over a thousand times cheaper model costs than, say, taking a general-purpose large LLM.

11:12.000 --> 11:26.000
It's going to reduce the amount of carbon dioxide that's being produced by the model, and it's going to reduce the latency for responses, and be able to perform on par with one of these state-of-the-art models, simply by doing synthetic data generation.

11:26.000 --> 11:31.000
So we'll go ahead and hop out of here and I want to show you.

11:34.000 --> 11:36.000
My apologies, Colab.

11:37.000 --> 11:43.000
Well, let me, sorry, I have to hop in it.

11:43.000 --> 11:46.000
I thought I had the notebook ready.

11:46.000 --> 11:48.000
Oh, you're joking, okay.

11:48.000 --> 11:51.000
All right, so here we go.

11:51.000 --> 11:53.000
I'm going to head over to the notebook.

11:53.000 --> 11:57.000
So essentially what's happening here and I hope it's big enough for everybody.

11:57.000 --> 12:05.000
Let me make sure. So we're importing some requirements, logging into Hugging Face, and we're going to use an open-source

12:05.000 --> 12:09.000
model to be our annotator for this specific example.

12:09.000 --> 12:13.000
From Hugging Face, we're importing the Financial PhraseBank dataset.

12:13.000 --> 12:23.000
So these are examples from analysts who are actually in this domain, who are saying, hey, this piece of news is in a positive light or a negative light.

12:23.000 --> 12:32.000
And so essentially we're dividing this into the different fields of the data set, so the sentence and the label that we want to provide.
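
A rough sketch of that loading step (the notebook's exact code isn't shown on screen; the dataset and column names here follow the financial_phrasebank release on Hugging Face):

```python
from datasets import load_dataset

# Financial PhraseBank: analyst-annotated sentences labeled negative / neutral / positive.
# The "sentences_allagree" config keeps only sentences where every annotator agreed.
dataset = load_dataset("financial_phrasebank", "sentences_allagree", split="train")

label_names = dataset.features["label"].names   # ['negative', 'neutral', 'positive']
sample = dataset[0]
print(sample["sentence"], "->", label_names[sample["label"]])
```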

12:32.000 --> 12:35.000
And we're providing kind of a chain of thought template here.

12:35.000 --> 12:38.000
So we want the model to explain its thinking.

12:38.000 --> 12:45.000
We're providing examples, few-shot prompting, with, hey, if the operating profit increased, this is a positive example.

12:45.000 --> 12:51.000
But if a decreased profit was the result, then it's negative.
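
The exact prompt text isn't visible in the recording, but a few-shot, chain-of-thought annotation prompt of the kind being described might look roughly like this (the wording is illustrative, not the notebook's):

```python
# Illustrative few-shot, chain-of-thought prompt for the annotator model.
ANNOTATION_PROMPT = """Classify the sentiment of the financial news sentence as positive,
negative, or neutral. Explain your reasoning step by step before giving the label.

Sentence: "Operating profit increased to EUR 13.1 million from EUR 8.7 million."
Reasoning: Profit went up year over year, which analysts read as good news.
Sentiment: positive

Sentence: "Operating profit decreased compared to the previous year."
Reasoning: A drop in profit is read as bad news for the company.
Sentiment: negative

Sentence: "{sentence}"
Reasoning:"""

prompt = ANNOTATION_PROMPT.format(sentence="Net sales remained flat at EUR 22.3 million.")
```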

12:51.000 --> 12:53.000
Oh, shoot, okay.

12:53.000 --> 12:55.000
All right, never mind, one minute.

12:55.000 --> 12:59.000
Okay, anyways, all we end up doing is going through the data set.

12:59.000 --> 13:04.000
So it's already been labeled, but we want to see, hey, what would the annotator do?

13:04.000 --> 13:12.000
And be able to, at the end of my notebook here, label the sentences here as positive or negative.
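
A minimal sketch of that annotation loop, continuing from the snippets above (the specific annotator model used in the talk isn't clear from the audio, so Mixtral via the Hugging Face Inference API is assumed here; substitute whichever instruct model you have access to):

```python
from huggingface_hub import InferenceClient

# Assumes you've already logged into Hugging Face, as mentioned earlier in the demo,
# and that `dataset` and ANNOTATION_PROMPT come from the previous sketches.
client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")  # assumed annotator model

synthetic_rows = []
for row in dataset.select(range(100)):  # annotate a small slice as a smoke test
    completion = client.text_generation(
        ANNOTATION_PROMPT.format(sentence=row["sentence"]),
        max_new_tokens=200,
    )
    # Rough parsing: look for the sentiment word after the final "Sentiment:" marker.
    tail = completion.lower().rsplit("sentiment:", 1)[-1]
    label = next((w for w in ("positive", "negative", "neutral") if w in tail), "neutral")
    synthetic_rows.append({"sentence": row["sentence"], "label": label})
```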

13:12.000 --> 13:20.000
And once we actually take this data set that would be extracted from a larger corpus of text, then we would do fine tuning on that model.

13:20.000 --> 13:26.000
Say, for an hour or two, with maybe some compute that you could rent from Hugging Face or AWS.

13:27.000 --> 13:30.000
And have a model that could actually do that.
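
And a hedged sketch of what that fine-tuning step could look like with the synthetic labels from above (the base model and hyperparameters here are placeholders, not what the talk used):

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Build a training set from the synthetic annotations and fine-tune a small BERT-style model.
label2id = {"negative": 0, "neutral": 1, "positive": 2}
train_ds = Dataset.from_list(
    [{"text": r["sentence"], "label": label2id[r["label"]]} for r in synthetic_rows]
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # placeholder model
train_ds = train_ds.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finance-sentiment", num_train_epochs=3),
    train_dataset=train_ds,
    tokenizer=tokenizer,  # provides the default padding collator
)
trainer.train()
```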

13:30.000 --> 13:37.000
But the other example I wanted to provide really quickly is InstructLab, which is another way to do this, for a conversational element.

13:37.000 --> 13:40.000
So the first thing was classification with text.

13:40.000 --> 13:46.000
This is for an actual conversational element where you can represent the questions and answers you want the model to provide.

13:46.000 --> 13:51.000
And then with a locally running model as we have here, be able to generate a data set.

13:51.000 --> 13:59.000
For example, here, like a JSONL file that has the questions and answers that stem from the seed data.
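
For reference, a quick way to peek at that generated file (the output path and field names vary across InstructLab versions, so both are placeholders here; in recent versions the generation step is the `ilab data generate` command run against the taxonomy's qna.yaml seed examples):

```python
import json

# Illustrative only: inspect a few of the synthetic Q&A records generated from the seeds.
# The path and record schema are placeholders, not guaranteed InstructLab output names.
with open("generated/train_gen.jsonl") as f:
    for line in list(f)[:3]:
        print(json.loads(line))
```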

13:59.000 --> 14:05.000
So I know we're out of time, so sorry, but that is our talk today.

14:05.000 --> 14:14.000
That's also another way to save costs and create a domain-specific model, either from classification or conversational elements.

14:14.000 --> 14:16.000
And on behalf of me and Carol, thank you.

14:16.000 --> 14:17.000
Sorry.

14:21.000 --> 14:23.000
Thank you.

