WEBVTT

00:00.000 --> 00:29.900
Okay, thank you. So... so hello everyone. My name is Răzvan. I work on this

00:29.900 --> 00:37.180
together with Tudor here and Saúl there; even though he's not mentioned in the slides,

00:37.180 --> 00:44.780
he's overseeing this as kind of a guardian angel of sorts, sometimes, and we're going to

00:44.780 --> 00:52.620
talk to you about AI for meetings. So, first and foremost, we're currently living in an AI

00:52.620 --> 01:00.620
gold rush; obviously, whoever sells the shovels gets the most money, and also there's maybe

01:00.620 --> 01:07.660
a danger of trying to find use cases which are not very useful, so lots of people try to

01:07.660 --> 01:17.100
invent stuff to do with AI. We identified like the use cases, the most used, the most useful

01:17.100 --> 01:23.500
things, let's put it this way, for us, for meetings, in our opinion, it's this, and obviously

01:23.500 --> 01:29.660
for having them, you need transcriptions. Otherwise, you won't be able to do summaries or

01:29.660 --> 01:37.660
extract action points, do sentiment analysis, or whatever. We've been using

01:37.660 --> 01:46.940
the Google APIs since 2017 to do captions, live captions, but we were unhappy with the quality of

01:46.940 --> 01:55.420
the results, so the accuracy was pretty crappy. So, OpenAI released

01:55.420 --> 02:05.660
Whisper in 2022, and in early 2023 we also created the PoC, making it, making

02:05.660 --> 02:13.820
Whisper perform in real time. What were the challenges here? We needed Whisper to

02:14.060 --> 02:20.540
transcribe a stream, not a file, because Whisper was built to transcribe an entire file. It needed

02:20.540 --> 02:25.900
to do it in real time, and it also needed to have good accuracy. How did we accomplish this? I'm

02:25.900 --> 02:32.220
going to try to speed up, because we have quite a lot of ground to cover. We have a client who joins

02:32.220 --> 02:36.860
the meeting, opens a WebSocket connection to our server, and starts pushing audio chunks.

02:36.860 --> 02:43.500
You see it in the upper left corner. Then we have Silero VAD, which is a voice activity detector,

02:43.580 --> 02:52.060
which drops all the audio chunks which do not contain any voice, and then all these chunks get

02:52.060 --> 02:58.300
added to a buffer. At one point, we add a header to that buffer, transforming it into a normal

02:58.300 --> 03:04.060
WAV file, and then feed it to Whisper for transcription.
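
NOTE
A minimal sketch of the flow just described, assuming the silero-vad and openai-whisper Python packages; the sample rate, file path and function names are illustrative, not the actual Skynet code.
import wave
import numpy as np
import torch
import whisper
from silero_vad import load_silero_vad, get_speech_timestamps
SAMPLE_RATE = 16000
vad = load_silero_vad()            # voice activity detector
asr = whisper.load_model("base")   # Whisper model used for the captions
buffer = bytearray()               # voiced PCM accumulated so far
def on_audio_chunk(pcm16: bytes) -> None:
    # Called for every chunk the client pushes; keep it only if it contains voice.
    audio = np.frombuffer(pcm16, dtype=np.int16).astype(np.float32) / 32768.0
    if get_speech_timestamps(torch.from_numpy(audio), vad, sampling_rate=SAMPLE_RATE):
        buffer.extend(pcm16)
def transcribe_buffer() -> dict:
    # Prepend a WAV header to the raw buffer and hand the result to Whisper.
    with wave.open("/tmp/segment.wav", "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(bytes(buffer))
    return asr.transcribe("/tmp/segment.wav", word_timestamps=True)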

03:04.060 --> 03:11.500
This obviously has a problem, because the buffer keeps growing and growing, and you need to cut the buffer at some point.

03:14.380 --> 03:20.140
When do we do it? Basically, we return two types of transcriptions back from the server: an interim,

03:20.140 --> 03:27.900
an intermediary one, which hasn't solidified yet, which is still prone to changing, and the final

03:27.900 --> 03:36.060
transcription, where we are almost sure that it is the final one. How do we do that? We do that by analyzing

03:36.140 --> 03:44.540
the returned results from Whisper, which you can see on the lower part of the slide. Basically,

03:44.540 --> 03:51.180
the line above is the buffer, and the one below is the returned transcription, and we search for the biggest

03:51.180 --> 03:58.380
gap between words, and we cut there. This kind of works well, but it still falls short on one of the challenges

03:58.380 --> 04:04.300
mentioned earlier, the accuracy part, because obviously if you cut the buffer, then you will lack context again.
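
NOTE
A sketch of that cut, assuming Whisper was run with word_timestamps=True; the field names match openai-whisper's output, but the split logic itself is illustrative, not the actual Skynet code. The audio before the cut point can then be dropped from the buffer so it stops growing.
def split_at_biggest_gap(result: dict) -> tuple[str, str]:
    # Flatten the per-segment word timings returned by whisper's transcribe().
    words = [w for seg in result["segments"] for w in seg["words"]]
    if len(words) < 2:
        return "", result["text"]            # nothing to solidify yet, keep it all interim
    # Find the largest pause between consecutive words and cut there.
    gaps = [(words[i + 1]["start"] - words[i]["end"], i) for i in range(len(words) - 1)]
    _, cut = max(gaps)
    final = "".join(w["word"] for w in words[:cut + 1])    # solidified, returned as "final"
    interim = "".join(w["word"] for w in words[cut + 1:])  # still prone to change
    return final.strip(), interim.strip()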

04:04.300 --> 04:09.980
So, I don't know if you know it, but Whisper allows you to provide an initial prompt. You could

04:09.980 --> 04:17.580
use that to put some acronyms in there, if you want it to guess the proper terms,

04:17.580 --> 04:26.220
and it will help you, but you can also feed it the previous final transcriptions, so it knows what

04:26.940 --> 04:31.340
and how to transcribe, or you can encourage it to transcribe in a certain way, and this is what we do.

04:31.420 --> 04:37.580
Each time there's a final for a meeting, so the finals from all the participants, we feed those

04:37.580 --> 04:43.820
finals into the initial prompt, and this is also limited to, I think, 4,000 and something tokens,

04:45.020 --> 04:52.140
and it gives it the context for the transcription. And that's it, that's pretty much it.
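
NOTE
A sketch of that priming, using whisper's initial_prompt parameter; the acronym list and the trimming strategy are assumptions, not the actual Skynet values.
ACRONYMS = "Jitsi, WebRTC, vLLM, SFU"    # hypothetical domain terms Whisper tends to misspell
previous_finals: list[str] = []          # final transcriptions gathered from all participants so far
def transcribe_with_context(asr, wav_path: str) -> dict:
    prompt = ACRONYMS + " " + " ".join(previous_finals)
    # The initial prompt is itself token-limited, so keep only its most recent tail.
    return asr.transcribe(wav_path, initial_prompt=prompt[-2000:], word_timestamps=True)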

04:52.140 --> 04:58.620
And those are the captions. I still have, like, a minute left, I guess, right,

05:01.340 --> 05:09.260
and maybe, yeah, I don't know if you can see it, but, you know, here it is.

05:11.340 --> 05:19.900
This is how it works. So, as you can see, this is pretty fast. It could be faster, but we're

05:19.980 --> 05:27.100
pleased with it. So, yeah, and then this gets... let me start the presentation again,

05:28.860 --> 05:36.620
and close the other stuff. Oh, there you go. And everything you see here,

05:38.380 --> 05:43.820
gets fed to an LLM, and here is Tudor, who is going to talk to you about this. Thank you very much.

05:50.060 --> 06:00.140
Yeah, that's okay, just a second. Yeah, you should close that, maybe.

06:01.740 --> 06:02.540
This closes the meeting.

06:10.860 --> 06:12.540
Oh, it's in a browser, sorry.

06:12.540 --> 06:21.540
Yeah, it's gone. Yeah, it's gone.

06:24.540 --> 06:27.540
There's a, it's closed, it's closed.

06:27.540 --> 06:31.580
Nice, I'm closing it. Oh, it's, it's the, oh,

06:33.820 --> 06:37.340
the other one was showing here, so yeah, that confused it a bit.

06:38.940 --> 06:39.980
All right.

06:42.860 --> 06:53.900
All right. So, yeah, we're going to chat a bit about summarization. And two years ago,

06:53.900 --> 07:02.940
when we started to develop SkyNet, we had to find ways to enhance our meeting-related products,

07:02.940 --> 07:10.860
which were, most of all, based on Jitsi Meet. And we had two main considerations:

07:11.820 --> 07:16.460
one of those was the fact that, as I mentioned, we had the transcriptions at our disposal,

07:16.460 --> 07:22.300
which were already a core feature in Jitsi Meet. And the other one was the fact that we were

07:22.300 --> 07:28.620
in an experimental stage, where we were experimenting, as we weren't sure where we

07:28.620 --> 07:35.180
would be going, and the resources that we had allocated were very low. So, basically, all the testing

07:35.260 --> 07:45.420
that we did was on our Macs. And this is why we felt that providing summarization

07:45.420 --> 07:52.300
would be a good, efficient, low-cost feature. And this is also because of the fact that

07:53.820 --> 07:59.260
you don't really need a big model to summarize things correctly. You can get away

07:59.260 --> 08:05.740
with trying, for example, an 8-billion-parameter Llama model, and you would get decent results.

08:05.740 --> 08:11.100
And in the end, even today, we still run an 8-billion-parameter Llama model. We didn't feel

08:11.100 --> 08:22.300
the need to change it. So, just a quick overview of how summarization can be achieved

08:22.300 --> 08:28.460
and how we implemented it. So, the first method would be the basic prompt. So,

08:29.020 --> 08:34.060
in this case, you just take the whole text that you want to summarize, you add a prompt

08:34.060 --> 08:38.700
like "summarize this", send everything to the LLM, and you get a result back.
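
NOTE
A sketch of the basic-prompt approach against an OpenAI-compatible endpoint; the base_url, model name and prompt wording are assumptions (a self-hosted vLLM server exposes the same API), not the actual Skynet code.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # e.g. a self-hosted server
def summarize(transcript: str) -> str:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",   # whatever model the server is running
        messages=[
            {"role": "system", "content": "Summarize the following meeting transcript."},
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content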

08:39.500 --> 08:45.740
Now, most of the time this is going to work, but once you're dealing with larger

08:45.740 --> 08:51.900
payloads, and if your model maybe does not have a large enough context window,

08:51.900 --> 08:58.780
you run into this issue where the whole payload will exceed the context window of the model.

08:58.780 --> 09:05.100
So, in that case, you kind of need to take a secondary approach, which is what we've

09:05.100 --> 09:11.420
implemented: the map-reduce method. And this basically works by splitting the whole initial

09:11.420 --> 09:18.780
text into equal chunks, and sending each of those chunks to the LLM to be summarized separately,

09:18.780 --> 09:23.420
and at the end, you're just going to have intermediate summaries, which you mesh together

09:23.420 --> 09:29.180
and send for a final summarization step to the LLM. Well, this might not be very efficient. There's a third way,

09:29.900 --> 09:34.540
but the presentation should be shorter than I initially thought. So, I'm going to skip that.
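
NOTE
A sketch of the map-reduce approach just described, reusing the summarize() helper sketched earlier; the chunk size is an arbitrary assumption.
def summarize_long(transcript: str, chunk_chars: int = 20_000) -> str:
    # "Map": split into equal chunks and summarize each one separately.
    chunks = [transcript[i:i + chunk_chars] for i in range(0, len(transcript), chunk_chars)]
    partials = [summarize(c) for c in chunks]
    if len(partials) == 1:
        return partials[0]
    # "Reduce": mesh the intermediate summaries together and summarize once more.
    return summarize("\n".join(partials))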

09:35.100 --> 09:42.620
So, onto some architecture considerations. So, the way we implemented this is:

09:42.620 --> 09:49.260
whenever you're requesting a summary, or action items for that matter, you're going to get back a

09:49.260 --> 09:55.180
job ID. So, this job ID is due to the fact that, behind the scenes, we keep

09:55.180 --> 10:06.220
a Redis queue. And so, we keep this Redis queue, and we basically rely upon the clients to

10:06.220 --> 10:12.460
implement some sort of long-polling mechanism in order to poll for the results.
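
NOTE
A sketch of that job-id-plus-polling shape; FastAPI and the in-memory structures are illustrative stand-ins, not the Skynet code (the real service would use a shared queue, e.g. Redis, so any node can pick jobs up, and failed jobs can simply be re-queued).
import asyncio
import uuid
from fastapi import FastAPI
app = FastAPI()
jobs: dict[str, dict] = {}               # job_id -> {"status": ..., "result": ...}
pending: asyncio.Queue = asyncio.Queue() # work waiting for a free worker
@app.post("/summaries")
async def submit(payload: dict) -> dict:
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending", "result": None}
    await pending.put((job_id, payload["text"]))
    return {"job_id": job_id}            # the client long-polls GET /summaries/{job_id} until done
@app.get("/summaries/{job_id}")
async def poll(job_id: str) -> dict:
    return jobs[job_id]
async def worker() -> None:
    # Started as a background task; drains the queue one job at a time.
    while True:
        job_id, text = await pending.get()
        result = await asyncio.to_thread(summarize, text)   # summarize() as sketched earlier
        jobs[job_id] = {"status": "done", "result": result}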

10:12.460 --> 10:19.340
There are several reasons why we did this. The most important is the fact that, as I was mentioning, we initially

10:19.340 --> 10:24.860
started to run on restricted resources. I mean, even now, we're still running on two

10:24.860 --> 10:29.820
A10 GPUs, actually three A10 GPUs. So, they're not like top-of-the-line GPUs.

10:30.540 --> 10:41.180
So, running on restricted resources, as you may know, the lower the specs of the

10:41.180 --> 10:47.180
hardware you have, the lower the number of requests you can process. So,

10:47.900 --> 10:54.860
this, of course, leads to incoming requests being delayed in being executed. And this is just

10:54.860 --> 11:02.460
a problem that gets aggravated whenever you're encountering larger payloads. And you would be

11:02.460 --> 11:12.140
running into client timeouts, and you'd be facing bottlenecks on your side. There's also the aspect

11:12.140 --> 11:19.180
of redundancy. So, whenever you're dealing with this stuff, you kind of need to think about

11:19.180 --> 11:26.460
node failures, which is something that has often happened in our case. So, this might even be

11:26.460 --> 11:34.300
due to the model hallucinating and going on infinitely, and you would have to stop it forcibly at some

11:34.300 --> 11:40.140
point. Or maybe you've just introduced a bug, or maybe the library that you're running the

11:40.700 --> 11:47.100
model on is just crashing in some place. And this is why you're going to need

11:47.900 --> 11:55.420
some way of being able to rerun the jobs that have failed, or the jobs that

11:55.420 --> 12:02.220
have not been processed but were taken for processing on the failing nodes, and to avoid client

12:02.220 --> 12:10.220
retry mechanisms. And we also had recurrent big spikes of usage due to rush hours on meetings.

12:10.220 --> 12:15.500
You can see it in this graph. There are about three to four hundred requests per

12:15.500 --> 12:24.060
minute during these spikes, and this was also something that we had to consider when

12:24.140 --> 12:31.900
we designed this. Now, onto a quick comparison of llama.cpp versus vLLM.

12:33.100 --> 12:38.380
We initially started working with llama.cpp because we were mainly experimenting on our local

12:38.380 --> 12:48.300
hardware, on our Macs. And this was a very useful tool, because it was very optimized to work on

12:48.380 --> 12:55.580
various architectures, various CPU architectures. And it also provides a lot of

12:55.580 --> 13:01.900
quantizations. It can also provide some quantizations that allow you to run your models on

13:01.900 --> 13:09.900
lower-end hardware like a phone or a tablet. On the other hand, vLLM does the opposite. So,

13:09.900 --> 13:16.460
it does heavy optimization for the GPU case. So, if you want to get

13:16.460 --> 13:21.900
serious about this stuff, you're going to have to switch to vLLM, which is what we did.
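
NOTE
A sketch of serving a quantized Llama with vLLM's offline Python API; the checkpoint name and sampling settings are assumptions, not the actual production configuration.
from vllm import LLM, SamplingParams
# A pre-quantized (AWQ) 8B checkpoint fits comfortably on a single mid-range GPU.
llm = LLM(model="hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4", max_model_len=8192)
params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(["Summarize the following meeting transcript: ..."], params)
print(outputs[0].outputs[0].text)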

13:23.900 --> 13:31.500
So, one second. Okay. And besides that, there's also the major aspect of dynamic batching,

13:31.500 --> 13:38.300
which vLLM offers and llama.cpp doesn't. And what this means is the fact that

13:38.300 --> 13:43.660
you're going to take advantage of the full context window of the model, because you're basically

13:43.740 --> 13:50.380
running your requests in parallel, as long as they fit within the whole context window of the model.
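
NOTE
A sketch of what that means on the client side: firing many requests at once and letting the server batch them dynamically on the GPU; the endpoint and model name are the same assumptions as in the earlier summarization sketch.
import asyncio
from openai import AsyncOpenAI
aclient = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")
async def summarize_async(text: str) -> str:
    resp = await aclient.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": "Summarize this meeting transcript:\n" + text}],
    )
    return resp.choices[0].message.content
async def main(transcripts: list[str]) -> list[str]:
    # All requests are in flight at the same time; vLLM batches them as capacity allows.
    return await asyncio.gather(*(summarize_async(t) for t in transcripts))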

13:51.020 --> 13:59.660
And you can see this in action here. Yeah. So, on the left side, we have SkyNet, which runs

14:00.620 --> 14:08.220
10 requests in parallel. And after 25 seconds, it gets the results back. This is emulating the

14:08.220 --> 14:13.260
long-polling use case, but it only does it once. It's just doing a request at the end

14:13.980 --> 14:21.740
in order to check for the results. And most of the requests are, I mean, five of these already

14:21.740 --> 14:27.900
have about 50K characters, so they're not like small payloads. And you're still getting all the

14:27.900 --> 14:33.980
results in time. Now, if you were to do this sequentially, you wouldn't be able to achieve this timing,

14:34.060 --> 14:40.300
at least not on this setup. So, as I mentioned, we were running on an A10 GPU with a quantized

14:40.300 --> 14:50.540
INT8 Llama 3.1 8B, deployed on vLLM. And some numbers: we've switched recently to Grafana from

14:50.540 --> 14:57.660
Wavefront, so we only have numbers starting September. And we've processed since then about 1.5

14:57.740 --> 15:05.260
million summaries and action items. So, this is it. If you want to check out our project,

15:05.260 --> 15:09.980
this is the URL. Thank you, and, if you have any questions...

15:14.220 --> 15:20.220
So, this project here has both the transcriber and the LLM part. They're residing in one place;

15:20.220 --> 15:26.380
it's modular, so you can enable either one or the other

15:26.380 --> 15:33.740
functionality or both at the same time. So, yeah, please check it out, and we have a question.

15:35.020 --> 15:43.180
You know, all the processing is local. No, nothing is going to leave. No, everything is

15:43.180 --> 15:49.420
happening locally, of course. I mean, you can configure it to go to OpenAI or Azure, but this is something

15:49.420 --> 16:00.460
that you're not going to do by default. It's not going to happen. As I was saying, you can configure it to

16:00.460 --> 16:06.220
run either your local, self-hosted model, or you can configure it to run through

16:06.300 --> 16:11.900
OpenAI or Azure, if you would like. Anything else?

16:14.940 --> 16:16.380
All right, thank you.

