WEBVTT

00:00.000 --> 00:07.840
Hello.

00:07.840 --> 00:12.240
So our next speaker is Marco Castelluccio.

00:12.240 --> 00:18.080
He describes himself as both a hacker and an engineer.

00:18.080 --> 00:25.120
He is also an engineering manager at Mozilla, where he is leading the CI and Quality Tools,

00:25.120 --> 00:27.760
PDF, and Translations teams.

00:27.760 --> 00:33.200
Today, he'll be talking to us about how Mozilla is using machine learning and large language

00:33.200 --> 00:41.360
models to develop an experimental tool to make the code review experience more pleasant and easier

00:41.360 --> 00:42.800
for developers.

00:42.800 --> 00:47.680
He will also be talking about the architecture and the iterative development process of

00:47.680 --> 00:51.200
the tool and the future plans for this tool.

00:51.200 --> 00:55.680
Welcome, Marco.

00:55.680 --> 00:57.600
Thank you for the introduction.

00:57.600 --> 00:59.280
Thanks.

00:59.280 --> 01:05.440
So I will skip the introduction since — I missed your name —

01:05.440 --> 01:08.720
Danica did it for me.

01:08.720 --> 01:14.240
So the slides are not visible.

01:14.240 --> 01:31.120
Where is the... I forgot to, of course.

02:14.240 --> 02:26.560
Perfect.

02:26.560 --> 02:28.800
Sorry.

02:28.800 --> 02:36.960
So, yes, I'm Marco Castelluccio, as Danica said, and as you can guess from the slides, I'm working

02:36.960 --> 02:39.120
at Mozilla.

02:39.120 --> 02:46.000
I'm an engineering manager managing the CI and Quality Tools team, the Translations team, and the PDF team.

02:46.000 --> 02:52.640
In particular, this talk is more related to CI and Quality Tools, where we use both classical

02:52.640 --> 02:59.520
tools like static analysis and linting, and also machine-learning-based tools.

02:59.520 --> 03:03.440
We've been using them for a few years since before they were cool.

03:03.840 --> 03:13.920
Now, we're continuing to use them, along with additional tools like LLMs.

03:13.920 --> 03:22.480
Before we dive in the topic, I want to first give you an idea of the Firefox scale.

03:22.480 --> 03:29.760
We have millions of bug reports — actually, they are not all defects.

03:29.760 --> 03:35.600
We use bug reports for everything — tasks, announcements, feature requests — so two million

03:35.600 --> 03:44.320
bug reports are not actually two million defects, but still, we celebrate when we arrive at

03:44.320 --> 03:46.160
these kinds of numbers.

03:46.160 --> 03:53.600
We release every four weeks, with thousands of changes; we had 12 major releases in 2024.

03:54.480 --> 04:00.880
Firefox is one of the biggest and most complex pieces of software; we recently celebrated its

04:00.880 --> 04:08.640
20th birthday in November. Netscape was open-sourced 27 years ago, but the project

04:09.200 --> 04:13.680
started even before then, so there's a lot of legacy and tech debt.

04:14.240 --> 04:24.960
We have 950,000 commits made by 10,000 contributors with 32 million lines of code,

04:26.000 --> 04:35.680
46,000 commits in the last year alone, with 1,100 unique contributors every year — and this number keeps increasing,

04:35.680 --> 04:36.320
which is nice.

04:36.960 --> 04:48.240
We use lots of languages, starting from C++ and JavaScript, with Rust taking more and more space over time.

04:51.200 --> 04:58.640
So the idea we had was not to automate reviews, it was not even to make reviews faster.

04:59.280 --> 05:06.080
It was to try to identify more issues and improve code quality through code review,

05:06.080 --> 05:11.120
by supporting developers. We don't want to automate the code review process.

05:11.120 --> 05:14.320
We want to support developers in this process.

05:16.480 --> 05:24.080
This was a collaboration between us, Ubisoft, and two universities — Queen's School of Computing

05:24.080 --> 05:28.720
and the University of Montreal; we all worked together on the open-source project.

05:30.080 --> 05:37.280
We didn't have access to the Ubisoft game source code, but they could work together with us

05:37.280 --> 05:41.600
on the open-source project, and we're working on a paper to show the results.

05:43.600 --> 05:49.040
So let's dive in. First of all, a trip down prompt engineering lane.

05:49.600 --> 05:55.520
So we started with a simple prompt. We just asked the LLM:

05:55.520 --> 06:00.480
please find issues in the patch, and we showed the patch to the LLM.

06:01.520 --> 06:10.640
Of course, this generated lots of weird comments, comments that usually developers don't really

06:10.640 --> 06:20.320
want to see. So we added human examples. Initially, these were comments that we had identified as good

06:20.320 --> 06:27.200
by default. Then, later on, we tried to use code similarity to find comments that had been made

06:27.200 --> 06:36.400
previously by humans on code that is similar to the patch. Then we had to fight against the

06:36.480 --> 06:43.360
urge of the LLM to be nice. As you might have noticed, it always wants to make you happy.

06:44.640 --> 06:52.320
So we saw comments like, this is a good addition to the code. Of course, reviewers don't really care

06:52.320 --> 07:01.120
about these kinds of comments. So we had to explicitly say: please don't add praising comments.

07:02.080 --> 07:09.840
Then we had to ask it to focus. It was commenting on everything. So sometimes it was generating

07:09.840 --> 07:15.120
comments that were related to things that the patch was already doing. So it was suggesting to

07:15.120 --> 07:22.560
do things that the patch was already doing. So we had to ask it to focus on additions only — on the added lines.
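
A minimal sketch of how these instructions might be assembled into a prompt — illustrative only; the exact wording and structure Mozilla uses are not shown in the talk:

```python
# Sketch of assembling the review prompt with the constraints described so
# far. Illustrative only: not Mozilla's actual prompt.

def build_review_prompt(patch: str, good_examples: list[str]) -> str:
    parts = [
        "You are an experienced reviewer for a large C++/JavaScript/Rust project.",
        "Find genuine issues in the patch below.",
        # fighting the urge to be nice:
        "Do not add praising comments such as 'this is a good addition'.",
        # keeping it focused:
        "Comment only on the added lines, not on what the patch already does.",
        # one of the many filters:
        "Do not comment merely to check that a function or variable exists.",
    ]
    # few-shot slot: past human comments found on similar code
    for example in good_examples:
        parts.append(f"Example of a useful review comment: {example}")
    parts.append(f"Patch:\n{patch}")
    return "\n\n".join(parts)

prompt = build_review_prompt(
    patch="+ int len = a + b;",
    good_examples=["Consider checking for integer overflow here."],
)
```

The few-shot slot is where the previously made human comments, retrieved by code similarity, would go.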

07:24.560 --> 07:30.720
Then we had to add more and more and more filters. For example, there were many comments

07:31.040 --> 07:38.320
focusing on making sure that something exists. But of course, if a developer is writing a patch,

07:39.440 --> 07:46.240
they are using functions that exist. They will not use things that do not exist in the source code.

07:46.240 --> 07:53.520
So these kinds of comments are kind of useless. This already gave us some initial interesting results,

07:54.160 --> 08:03.040
but we needed to do a bit more. We had to try to elicit reasoning from the LLM. So, initially,

08:03.040 --> 08:09.520
before asking it to review the patch, we asked the LLM to summarize the patch, so it could first

08:09.520 --> 08:18.720
understand what the changes were and then review based on the summary. Then we noticed it was still

08:18.720 --> 08:26.480
confused about usage of variables. For example, if it couldn't see where the variable was defined,

08:26.480 --> 08:35.440
as in patches that, for example, change a function: the variable definition is way earlier,

08:35.440 --> 08:42.480
and you cannot see it unless you go back to the initial part of the function. So basically,

08:42.480 --> 08:49.120
we gave the LLM a tool to see more context around the changes, if it wanted more

08:49.120 --> 08:57.520
context. Then we also gave it a tool to search code using our Searchfox tool, which is a code

08:57.520 --> 09:04.240
search tool for the Firefox source code. So we gave it the ability — if it wanted to see

09:04.240 --> 09:11.920
the functions that were being used in the patch — to search for those

09:11.920 --> 09:19.040
functions and see their implementation in the source code. This was particularly useful for

09:19.040 --> 09:25.120
Ubisoft, for example, because the LLMs are already trained on the Firefox source code, while for Ubisoft

09:25.840 --> 09:31.360
their code is closed source. So they really needed a way for the LLM to see what

09:32.320 --> 09:41.920
the functions being called were actually doing. We also added bad examples to

09:43.760 --> 09:52.240
teach the LLM not to generate them. For example, still-praising comments like

09:52.240 --> 09:58.960
"Overall, the patch seems to be well implemented with no major concerns." And again, here,

09:58.960 --> 10:08.720
just like with the human examples, initially we used predefined examples that we identified ourselves,

10:09.360 --> 10:14.560
and then later on we started using data from the field. I will talk about that later.
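
The two tools described a moment ago — expanding context around the changed lines, and Searchfox-style code search — could be exposed to the model roughly like this (a sketch; the file contents and the function names `expand_context` and `search_code` are made up):

```python
# Sketch of the two tools the talk describes giving to the LLM: one to see
# more context around a change, one to search the tree for a symbol
# (Searchfox-style). Names and data here are illustrative.

SOURCE = {
    "gfx/image.cpp": [
        "int Scale(int w, int h) {",
        "  int area = w * h;",
        "  return area / h;",  # the line the patch touches
        "}",
    ],
}

def expand_context(path: str, line: int, radius: int = 2) -> str:
    """Return `radius` lines around `line` (1-based), so the model can see
    e.g. where a variable was defined earlier in the function."""
    lines = SOURCE[path]
    lo, hi = max(0, line - 1 - radius), min(len(lines), line + radius)
    return "\n".join(lines[lo:hi])

def search_code(symbol: str) -> list[str]:
    """Tiny stand-in for a code-search service: find lines mentioning
    `symbol` anywhere in the tree."""
    return [f"{path}:{i + 1}: {text}"
            for path, lines in SOURCE.items()
            for i, text in enumerate(lines) if symbol in text]

# The model would emit a tool call; the harness dispatches it:
TOOLS = {"expand_context": expand_context, "search_code": search_code}
result = TOOLS["search_code"]("Scale")
```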

10:16.480 --> 10:23.520
So it seems linear. It actually wasn't this linear. It went more like, okay, go back,

10:24.400 --> 10:31.920
go forward, go back, refine, refine, refine until we got something that we were more or less happy with.

10:32.720 --> 10:43.360
And we were satisfied enough to offer it to a set of developers. So, what does it look like?

10:45.120 --> 10:50.960
This is an example before we deployed the tool. So there was already a review comment.

10:50.960 --> 10:57.200
The LLM didn't have access to it, and it generated a comment which said more or less the

10:57.200 --> 11:04.400
same thing in other words. So it was very promising. And you can see the interface here. The developers

11:04.400 --> 11:10.560
had the ability to approve comments, mark them as invalid, or mark them as trivial; the UI

11:10.560 --> 11:19.280
changed a little later, but I will show you an example later on. This is another example, from when

11:19.360 --> 11:22.880
the tool was actually deployed. It was actually maybe from two days ago.

11:24.320 --> 11:32.480
It was nice, so I added it. The LLM identified a potential overflow in this code,

11:33.200 --> 11:37.360
and so the developer fixed it by adding a check for the overflow.

11:39.760 --> 11:46.160
This is another interesting example. In this case, the LLM identified a possible division by zero,

11:47.120 --> 11:55.520
and it was actually true. But it wasn't clear how to fix it because there were tests that were

11:55.520 --> 12:01.760
actually using images with size zero. So the developer didn't exactly know how to fix it.

12:01.760 --> 12:07.200
The patch landed as it is and the developer will investigate more to figure out what to do exactly.

12:10.160 --> 12:14.960
And this is an example with JavaScript. You can see actually the first part of the patch

12:15.600 --> 12:22.240
is at one end of the patch — at lines 15 and 16 — and the second part of the patch is

12:22.240 --> 12:29.360
much later. And the LLM was able to identify a problem with the accessing of this variable.

12:31.760 --> 12:36.400
It's a different language — before, it was C++; now it's JavaScript.

12:37.040 --> 12:42.080
Of course, it's able to identify issues in both languages, and it can identify issues

12:43.040 --> 12:48.800
in different parts of the patch, since it has the full context, basically.

12:50.480 --> 12:58.160
And this is the UI. When users approve a comment, it looks like this.

12:59.360 --> 13:05.840
The contents of the comment are shown, and then we say: this comment was generated

13:05.840 --> 13:12.640
automatically and was approved by this person. Or if they added the comment as I will show later

13:12.640 --> 13:21.120
it will say it was generated automatically and was edited by that person. We want to make sure that developers

13:21.120 --> 13:24.800
that receive these kinds of comments know that they were generated automatically.

13:27.040 --> 13:35.280
This is the interface — this one is actually from the Ubisoft side — for what happens when you try to reject comments:

13:35.840 --> 13:44.080
users could select the reason for rejection. In most cases, it was "trivial comment"

13:44.080 --> 13:50.160
when they rejected. But there were also many cases of "useful in the development phase,

13:51.040 --> 13:55.840
not in the review phase" or "useful as a tip for reviewers, not as a review comment".

13:56.240 --> 14:07.120
So do people like it? How many comments are accepted? 8% of comments are accepted

14:07.120 --> 14:13.600
at Mozilla, 7% at Ubisoft. It's not a huge number, but it's also very promising, because

14:13.600 --> 14:21.920
it means it is finding highly valuable cases. There is still a lot of noise that we need

14:21.920 --> 14:28.000
to address but still it's very promising as you saw in the examples before.
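
As a sketch, the acceptance rate — and the broader "appreciated" rate discussed in a moment, which also counts comments rejected as useful tips — could be computed from the collected feedback like this (the field names and reason strings here are illustrative, not Mozilla's actual schema):

```python
# Sketch of computing the acceptance and "appreciated" rates from reviewer
# feedback records; record fields are made up for illustration.

USEFUL_REJECTIONS = {
    "useful as a tip for reviewers, not a review comment",
    "useful in the development phase, not in the review phase",
}

def rates(feedback: list[dict]) -> tuple[float, float]:
    accepted = sum(1 for f in feedback if f["action"] == "accepted")
    # "appreciated" = accepted + rejected-but-still-useful
    appreciated = accepted + sum(
        1 for f in feedback
        if f["action"] == "rejected" and f.get("reason") in USEFUL_REJECTIONS
    )
    n = len(feedback)
    return accepted / n, appreciated / n

feedback = [
    {"action": "accepted"},
    {"action": "rejected", "reason": "trivial comment"},
    {"action": "rejected",
     "reason": "useful as a tip for reviewers, not a review comment"},
    {"action": "rejected", "reason": "invalid"},
]
acc, appr = rates(feedback)  # 1/4 accepted, 2/4 appreciated
```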

14:30.000 --> 14:36.960
Some developers said it detects perfectly possible flaws in the code, but most of the suggestions

14:36.960 --> 14:40.880
are more in the development realm than the review realm and that's why we added those two

14:41.760 --> 14:46.560
cases: "useful as a tip for reviewers, not as a review comment"; "useful in the development phase, not in the

14:46.640 --> 14:56.480
review phase". These are not counted here in the 8% or the 7%, because they were not accepted —

14:56.480 --> 15:04.720
they were rejected — but they were still useful in some way. So we had another metric, "appreciated",

15:04.720 --> 15:10.640
which contains both the accepted ones which were published and the ones that were considered

15:10.640 --> 15:18.080
useful, even though not as comments in the review. It was more or less a fourth of the

15:18.080 --> 15:26.880
comments at Mozilla, and almost a third of the comments at Ubisoft. It also depended on the seniority of the

15:26.880 --> 15:35.520
developers: some more junior developers appreciated these review tips more than more senior developers,

15:35.600 --> 15:45.120
for example. The other thing we were wondering was whether reviews would be slower or faster with

15:45.120 --> 15:51.600
this kind of tool. As I said at the beginning, we were not aiming to make reviews faster,

15:51.600 --> 15:58.720
but we also didn't want to make them slower. And developers said things like: "Often, looking at

15:58.720 --> 16:04.000
the suggestion helped me think more about the code and reflect a bit more on what is done and why

16:04.080 --> 16:10.320
it is done. Overall I think it is a bit slower to do reviews but my reviews are actually better

16:10.320 --> 16:17.040
now that I use the tool." We didn't track the review time at Mozilla, but Ubisoft

16:17.040 --> 16:24.720
tracked the review time and we saw that the additional time for this kind of comments was not that

16:24.720 --> 16:33.440
long: for accepted comments it was around 28 seconds; for rejected ones, 16 seconds. Probably

16:33.520 --> 16:39.920
it was obvious that they were bad comments and in general it took 42 seconds more per review.

16:39.920 --> 16:48.000
So it didn't add too much review time. Of course there are some cases where it took way longer

16:48.880 --> 16:56.560
up towards 150 seconds, and this is probably because sometimes the LLM could hallucinate

16:56.560 --> 17:02.720
problems that do not actually exist but then if you're not familiar enough with the code you

17:02.720 --> 17:08.000
need to actually go and try to find out if it is actually the case or not. It actually happened

17:08.000 --> 17:14.640
to me once. I was not familiar with the API that the developer was using so I had to search for

17:14.640 --> 17:22.560
documentation, try to figure out if it was real or not etc etc and I think it turned out to be real

17:22.560 --> 17:32.080
but still, it was a long rabbit hole. Are comments edited? We were wondering whether they were

17:32.160 --> 17:40.240
edited or not, and what happens when they are edited. We found that most comments are just accepted

17:40.240 --> 17:47.760
as they are, or they are shortened, and this is because the LLM is often too verbose — so in the most

17:47.760 --> 17:56.960
recent prompt we are asking it to be super concise. There were very few examples of lengthened

17:57.280 --> 18:02.080
comments; at the time we wrote the paper, there was only one. Now there are a few more.

18:03.520 --> 18:10.800
You can see, in the first case, it was just because the reviewer wanted to be nice to the developer

18:10.800 --> 18:19.840
and added a suggestion on how to fix the issue. In the second case, instead, the reviewer

18:19.920 --> 18:28.720
was not sure about the comment, but thought it could be valuable, so they said: this was suggested

18:28.720 --> 18:34.000
by the review helper add-on; however, maybe this should handle errors, but please make sure.

18:37.680 --> 18:42.240
I guess this is a case where you could consider it might have been a waste of time for the reviewer,

18:42.240 --> 18:47.280
if it turned out not to be true — we don't know; I need to look into it.

18:49.120 --> 18:54.720
The other thing we wanted to see is: are accepted comments as valuable

18:55.600 --> 19:02.720
as human comments? So we set out to see if they led to the same number of changes, and it turned out

19:02.720 --> 19:11.440
that yes: more or less 74% of LLM comments led to changes, and 73% of human comments led to

19:11.440 --> 19:15.360
changes — so they lead to changes at more or less the same rate.

19:18.240 --> 19:26.960
Next steps. So, we saw a few promising results; at the same time, we saw there is noise in the results,

19:26.960 --> 19:34.880
so we want to improve them. The first step: improve the bug detection capabilities.

19:35.840 --> 19:43.040
So right now, as I said in the prompt engineering phase, we're using past human

19:43.040 --> 19:50.400
comments plus the knowledge of the LLM from pre-training, and we're thinking of building a dataset

19:50.400 --> 20:00.160
of synthetic comments. So, basically, on Bugzilla a few years ago we were smart enough to think

20:00.160 --> 20:13.920
of adding a new field called "regressed by", where developers can say which patch caused a bug. So

20:13.920 --> 20:20.880
this has been used for metadata purposes — to know how many regressions there are

20:20.880 --> 20:26.560
over time — but it can also be used, in this case, for building a dataset of

20:26.560 --> 20:35.520
fake comments to teach the LLM about very highly valuable cases. Because, basically, we have the original

20:35.520 --> 20:40.880
bug, we have the patch associated with the original bug, we know this patch caused another bug, which

20:40.880 --> 20:47.760
is the regression, and we have the patch that fixes the regression. So, by giving to the LLM the original

20:47.760 --> 20:54.960
patch where the bug was not caught by the reviewer and the fixed patch which fixes the regression

20:55.040 --> 21:01.920
caused by the original patch we can generate fake comments that would have prevented these

21:01.920 --> 21:07.920
regressions. Of course, it's not always possible, because sometimes the regression might be due

21:07.920 --> 21:15.200
to external reasons, and it would not have been possible for a reviewer to find it. But there are

21:15.200 --> 21:21.200
many cases where it is possible, and these are extremely valuable, because they were missed

21:21.280 --> 21:30.320
during the review by humans. And this is to improve the acceptance rate — but we also want to

21:30.320 --> 21:36.880
reduce the rejection rate due to noise. Because the LLM — even though we added a lot of

21:36.880 --> 21:44.880
filtering, etc., to remove praise comments and stuff like that — is still making some of them.

21:44.880 --> 21:52.160
So we want to reduce trivial and unwanted comments. First of all, we want to allow customization

21:52.160 --> 21:58.960
by type: as I said, there are some developers who like to see the review tips; some other developers

21:58.960 --> 22:05.920
don't want to see review tips. So we want to categorize them, and give the option to the reviewers to

22:05.920 --> 22:14.720
see some of them, and other reviewers to see only a subset. And we want to introduce more filtering

22:14.720 --> 22:21.680
using past-trajectory examples. So, with the implementation of the tool, as I showed, we collect

22:22.640 --> 22:28.800
feedback from developers and we have a lot of accepted and rejected comments we can use

22:28.800 --> 22:36.160
the rejected comments and ingest them into a vector database, so whenever we generate new comments

22:36.160 --> 22:44.080
with the LLM, we can look for similar-looking comments in the rejected-comments vector database

22:44.080 --> 22:48.960
by similarity and we can add them into the prompt and ask the LLM not to generate

22:50.400 --> 22:55.920
things that look like them. We're still working on it, so I don't know exactly how

22:55.920 --> 23:11.840
the solution will look.
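
One way the similarity check could work, as a sketch: a real system would use an embedding model and a vector database, but plain bag-of-words cosine similarity is enough to show the idea.

```python
# Sketch of filtering newly generated comments against past rejected ones.
# A real system would use an embedding model and a vector database; here a
# bag-of-words cosine similarity stands in for both.
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

REJECTED = [
    "Overall the patch seems to be well implemented with no major concerns.",
    "Please make sure this function exists before calling it.",
]

def looks_rejected(comment: str, threshold: float = 0.6) -> bool:
    """True if the comment resembles a previously rejected one; such
    near-duplicates would be fed back into the prompt as bad examples."""
    return any(cosine(comment, r) >= threshold for r in REJECTED)

noisy = looks_rejected("Overall the patch seems well implemented, no concerns.")
useful = looks_rejected("Possible division by zero when the image size is 0.")
```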

24:08.960 --> 24:25.280
Oh, it's working — sorry about that. So, yeah, we don't know exactly how the solution will look,

24:25.280 --> 24:32.800
but we have a big enough dataset of rejected comments from developers,

24:32.800 --> 24:39.600
which we can use for clustering filtering and so on and so forth and we hope this will greatly

24:39.600 --> 24:47.920
reduce the number of noisy comments. The other thing we want to do is autofixing comments, a bit

24:47.920 --> 24:54.400
like the reviewer did earlier to help the developer. This is an example:

24:55.040 --> 25:01.600
here on the left there is a comment by a reviewer. Of course, we only want to do this in simple cases,

25:02.560 --> 25:07.680
because probably the LLM won't be able to address very complex cases at least for now

25:08.400 --> 25:12.640
but for very simple cases where developers can just accept without having to

25:13.920 --> 25:19.520
edit the code locally and push the patch again ask for review again etc etc we could generate

25:19.520 --> 25:25.360
suggestions directly in the review system, and they could just accept them and land the patch.

25:26.000 --> 25:31.200
So there's an example here where the reviewer said "variables" is actually required,

25:31.920 --> 25:38.400
pointed to some documentation, but it can be an empty dict; and so the LLM just added a "variables"

25:39.280 --> 25:40.800
attribute with an empty dict.

25:43.440 --> 25:50.320
There are also many developers who said: this is very useful,

25:51.120 --> 25:56.720
but I would like to have it while I'm working on the patch and not after I submit the patch for

25:56.720 --> 26:03.440
review. So we are thinking of building an IDE integration, so that they can use it before,

26:04.160 --> 26:10.480
so that for these kinds of issues — especially the tips, for example, useful in the development phase —

26:11.440 --> 26:16.640
they can actually act on them and avoid wasting the reviewers' time. In this case, we could actually

26:16.640 --> 26:23.520
speed up the review. We wouldn't see it, because the initial review by the LLM would happen before

26:23.520 --> 26:34.160
submission, but it would still be a win overall. All right — this is more or less it when it comes to

26:34.160 --> 26:41.680
using LLMs for code review but I also wanted to show a bit what we're doing in addition to code

26:41.680 --> 26:50.560
review to still support developers with various tasks. First of all: QA test case generation. So, LLMs

26:50.560 --> 26:57.760
can often have hallucinations so you could say that they're very creative so we thought

26:58.480 --> 27:03.600
maybe we could use them for test case generation where you need a lot of creativity to break

27:04.560 --> 27:14.400
the software. So we are giving the LLM the role of an expert quality assurance engineer,

27:15.440 --> 27:21.120
we're giving examples of test cases, and we specify the feature that the LLM should test,

27:22.480 --> 27:27.680
with the details of the feature the scope of the testing and so on and so forth and then we

27:27.680 --> 27:35.760
ask it to generate test cases. And we tested it on eight Firefox

27:35.760 --> 27:50.320
features, which were both UI features and more backend features. It went very well in general:

27:51.040 --> 28:01.600
it generated 54 novel cases — so 27% of the test cases were novel, meaning no QA engineer had thought of them;

28:02.240 --> 28:08.400
101 already existed, so 50%, which means they were good but already existed;

28:09.600 --> 28:18.000
29 were invalid, and 16 had other reasons for rejection. But in general, it was very valuable for the QA teams.

28:18.720 --> 28:28.320
You can see that in some cases, where the feature is more obvious and there are fewer chances for

28:28.320 --> 28:34.160
corner cases — like Reader View, for example — there was only one novel case, but in other

28:34.720 --> 28:42.720
features which could lead to more corner cases — like the image gen-AI in PDFs — there were way more

28:43.440 --> 28:52.080
novel cases found. In the case of the back-button intervention, which was mostly backend stuff,

28:52.080 --> 29:02.320
there was no novel test case found. Another aspect is smart documentation search. So, we're building

29:02.320 --> 29:10.800
a RAG architecture to ingest the Firefox source code and the bugs from Bugzilla — the history of

29:10.800 --> 29:20.480
development, basically — and then building a chatbot for developers. This was mainly thought of for

29:20.480 --> 29:27.840
new contributors who might come in, or new employees, etc. And we started testing it with the PDF.js team

29:28.720 --> 29:38.160
by taking questions and answers from Matrix — from the PDF.js Matrix channel — and comparing the

29:39.280 --> 29:48.400
human answers with the LLM answers so we have an example here which was good where the

29:50.000 --> 29:57.600
person was asking how to download things in parallel, etc. The human answer — it was not our answer;

29:58.080 --> 30:03.120
I guess it was somebody else from the channel — just said to download it.

30:03.120 --> 30:13.600
Well, the LLM answer was: this is already handled automatically by the browser

30:13.600 --> 30:21.040
network stack. And we had a developer from the PDF team who helped evaluate the cases, and

30:21.040 --> 30:28.800
he preferred the second — the LLM answer — to the human answer. In the second case, instead, it was

30:28.800 --> 30:37.760
wrong. It was a question related to offsets of the page, etc. The human answer was very short but to the

30:37.760 --> 30:47.040
point; the LLM answer was wrong, because it got the offsets wrong. But then we actually found that the

30:47.120 --> 30:53.600
problem was in the documentation because the documentation was wrong there are many other cases

30:53.600 --> 30:58.080
where the LLM is actually wrong, but in this specific case, the documentation was wrong.

31:00.720 --> 31:07.040
Another interesting thing, I think for us but also for other open source

31:08.880 --> 31:14.000
projects: we have lots of commits, and it's very hard to identify what is important

31:14.960 --> 31:21.520
for end users or what is important for other developers and so on and so forth so we're experimenting

31:21.520 --> 31:30.400
with using an LLM to ingest the commit history of the software and identify what is important

31:30.400 --> 31:37.040
for various audiences. So, in this case, we're starting with the user release notes; we're focusing —

31:37.040 --> 31:43.920
we're asking the LLM to focus on user-facing changes. But in the future we're also planning to do the same

31:43.920 --> 31:49.440
for web developers, so we could identify commits which are changing APIs in ways that

31:50.320 --> 31:57.040
web developers care about, and the same for identifying valuable contributions from new

31:57.040 --> 32:07.040
contributors, for example. Another aspect which is very, very important for Mozilla is web

32:07.040 --> 32:15.440
compatibility issue detection. So, we have many issues due to differences between browsers — in some cases

32:16.160 --> 32:23.680
due to websites which only test in one browser and not in the other — and so there are many cases

32:23.680 --> 32:32.160
where a website might be working well in Chrome, for example, but not in Firefox. This is an

32:32.160 --> 32:39.760
example here where you can see the graphs on the Chrome side and in Firefox the content is almost

32:39.760 --> 32:50.240
empty, basically. And we're using a multimodal model to identify potential issues. We're still

32:50.240 --> 32:55.840
starting this project now but it's already showing some promising results of course there are

32:55.840 --> 33:00.240
lots of difficulties in this problem: it's not just a matter of comparing screenshots, because you might

33:00.240 --> 33:07.760
have differences just because of, for example, a different advertisement when you reload the page; or, if you have

33:07.760 --> 33:13.440
a carousel for example if you take a screenshot now it's not going to be the same in two seconds because

33:13.440 --> 33:20.480
the carousel will move. So you need to identify things that are actually important versus things that

33:20.480 --> 33:30.080
are just part of the website behaving normally. We also want to expand the autofixing that

33:30.080 --> 33:36.720
I talked about before regarding comments — we want to expand it to additional cases. For example, when

33:36.720 --> 33:47.920
we get crashes from users, we can identify potential causes for the crash and suggest a

33:47.920 --> 33:56.800
potential fix. The other thing we would like to try is analyzing build failure logs for a given patch

33:57.680 --> 34:05.680
to try to determine and resolve the build failure and both of these in theory could have a feedback

34:05.680 --> 34:11.760
cycle, so that, for example, in the build case, we can build, get the error log,

34:12.960 --> 34:18.640
try to fix it through the LLM; if it still fails, give it the new error log, and so on and so forth —
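
That feedback cycle could be sketched like this, with a stubbed build step and a stubbed LLM standing in for the real compiler and model:

```python
# Sketch of the build-failure feedback cycle: build, feed the error log to
# an LLM, apply its suggested fix, rebuild, and retry a few times.
# `fake_build` and `fake_llm_fix` are stubs, not real components.

def fake_build(source: str) -> tuple[bool, str]:
    """Pretend compiler: fails if a semicolon is missing."""
    if not source.rstrip().endswith(";"):
        return False, "error: expected ';' at end of statement"
    return True, ""

def fake_llm_fix(source: str, error_log: str) -> str:
    """Pretend LLM: appends the missing semicolon when the log mentions it."""
    if "expected ';'" in error_log:
        return source.rstrip() + ";"
    return source

def fix_build(source: str, max_attempts: int = 3) -> tuple[bool, str]:
    for _ in range(max_attempts):
        ok, log = fake_build(source)
        if ok:
            return True, source              # build succeeded
        source = fake_llm_fix(source, log)   # retry with the new error log
    return fake_build(source)[0], source

ok, fixed = fix_build("int x = 1")
```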

34:18.640 --> 34:30.640
just try a few times until the issue is fixed. And this gets us to the end. If you're interested in

34:30.640 --> 34:37.040
experimenting with our stuff — any of the things that I mentioned — and seeing where it could go in

34:37.040 --> 34:43.200
Firefox, or if you want to extend support for these kinds of tools to other review systems — right now

34:43.200 --> 34:48.320
we only support our review system, plus the Ubisoft one, which is maybe only used by them,

34:48.960 --> 34:55.760
and ours, which is maybe only used by us — or, for other projects, if you want to adopt

34:55.760 --> 35:03.760
our tools in your project, come talk to us, either later or in our devroom tomorrow —

35:04.000 --> 35:08.640
the Mozilla devroom. Thank you.

35:24.640 --> 35:31.840
Thanks, great talk. Maybe I missed it, but is Mozilla hosting the LLM, or are you using an external service?

35:32.800 --> 35:38.720
We're using an external service. We're actually running multiple LLMs at the same time

35:41.680 --> 35:48.480
to compare results across models, both in size and in type.

35:49.440 --> 35:57.040
Have you thought about hosting it yourself? Yes — we haven't done it yet, but the plan would actually be to

35:57.120 --> 36:03.920
use an open source model fine tune it with the data that we are collecting so that we can

36:03.920 --> 36:09.360
host it ourselves and it can be an open source model

36:13.760 --> 36:20.640
What are the costs for running those services? I haven't looked at the costs recently; they were not

36:21.280 --> 36:28.800
super high, but if you want, I can share the numbers with you later. Okay — and especially the

36:28.800 --> 36:35.840
environmental costs, please. The environmental costs, yeah.

36:39.280 --> 36:45.680
Thanks for this great talk. So, I was interested — I suppose you selected Ubisoft as a partner

36:45.680 --> 36:49.760
because you have common points with them; so, for instance, they code in C++.

36:49.760 --> 36:56.080
Without divulging any secrets, of course, I would like to ask you how the general public

36:56.080 --> 37:00.480
could make use of your research. Do they have to be programming in C++ specifically,

37:00.480 --> 37:07.120
or is there something that can be used generally? Sorry, I missed the last part of the question.

37:07.120 --> 37:13.520
The question is: how reusable is your work for people who work with C++ code bases?

37:15.680 --> 37:22.880
I think it should be pretty reusable. The collaboration with Ubisoft was because we were already collaborating

37:22.880 --> 37:28.240
with them. We have a lot of touchpoints, like you mentioned, because they also have large

37:28.240 --> 37:37.120
C++ code bases and legacy code bases, and the collaboration was both because we had similar

37:37.120 --> 37:43.360
problems, but also because we wanted to try to make the tool generic enough so that it could be applied

37:43.360 --> 37:57.040
also by other projects. Hello, when generating the prompt, you know, the steps of

37:57.040 --> 38:03.920
generating the context and creating the whole prompt, is there any limitation you faced, like

38:03.920 --> 38:12.960
the size of the pull request, for example? When you have multiple issues

38:13.040 --> 38:19.360
correlated with that patch, how would you decide which context to inject or not, and what was

38:19.360 --> 38:27.520
the limitation there? Yeah, when we started we had way more limitations because of the

38:27.520 --> 38:36.720
context size, so over time we kept increasing and increasing the limitation. So we

38:36.800 --> 38:43.760
have a limitation on the patch size: if it is too large, we don't review it. And this limitation was

38:44.800 --> 38:50.240
pretty small at the beginning, and then we kept increasing it and increasing it over time,

38:50.880 --> 38:59.120
because of the larger context sizes that new models support. It's certainly important, especially

38:59.120 --> 39:04.640
as we build more and more tools. Right now we have only two tools, which are asking for more

39:04.640 --> 39:10.400
context around the changed lines, and searching for code. In the future, if we add more and

39:10.400 --> 39:18.880
more tools, for example searching for documentation, running tests, or whatever, we will have to choose

39:18.880 --> 39:26.480
what is most valuable, because we will have limitations on the context anyway at some point.
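The patch-size gate described in the answer could be sketched roughly like this. All names, numbers, and the token heuristic are illustrative assumptions, not Mozilla's actual implementation:

```python
# Hypothetical sketch of the patch-size gate: skip review when the diff
# would not fit the model's context window, and raise the threshold as
# newer models support larger contexts.

# Assumed per-model context budgets (tokens), minus room reserved for the
# prompt scaffolding and the generated answer.
MODEL_CONTEXT_TOKENS = {"model-a": 16_000, "model-b": 128_000}
PROMPT_OVERHEAD_TOKENS = 4_000

def rough_token_count(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token."""
    return len(text) // 4

def should_review(patch: str, model: str) -> bool:
    """Review only patches that leave room in the context window."""
    budget = MODEL_CONTEXT_TOKENS[model] - PROMPT_OVERHEAD_TOKENS
    return rough_token_count(patch) <= budget

small_patch = "-old line\n+new line\n" * 50
huge_patch = "+x = 1\n" * 30_000

assert should_review(small_patch, "model-a")      # fits easily
assert not should_review(huge_patch, "model-a")   # too large: skipped
assert should_review(huge_patch, "model-b")       # larger context: reviewed
```

Raising the limit then amounts to swapping in a model with a larger context budget, which matches the "kept increasing it over time" behavior mentioned above.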

40:04.720 --> 40:13.840
I was surprised by the low acceptance rate of those comments. My assumption

40:13.840 --> 40:21.440
would have been that it would have been a lot higher, and I was wondering if the AI is kind of forced to

40:21.440 --> 40:27.120
comment on every PR, or is it free to just say, I got nothing on this one and I don't

40:27.120 --> 40:34.720
have to add a comment? Just curious about that. Can you repeat, sorry? Does the AI have to comment

40:34.720 --> 40:41.120
on every PR, or can it say, I have nothing to add to this PR? And maybe, you know, that contributes to

40:41.120 --> 40:48.240
the low rate. So in general, even in cases where there are some issues, it will still generate some

40:48.240 --> 40:54.240
trivial comments, so that's why we're working on reducing that. But yes, what you say is true:

40:54.240 --> 41:04.160
the LLM has an urge to say something all the time. So in some cases, when it

41:04.160 --> 41:08.880
finds something, it still generates something that is not so interesting,

41:08.880 --> 41:15.920
but many of them are interesting. But when it doesn't find anything, it still really wants to say

41:15.920 --> 41:21.840
something, and so it will fall back to its behavior. Even if we say, please don't say this, please

41:21.840 --> 41:28.320
don't say that, it will still fall back to its behavior of saying something. And yes, so there are still

41:28.320 --> 41:42.720
many of these cases, and we're working on reducing them. I have a question as well. I think I have

41:42.880 --> 41:53.680
to ask, maybe: do you dogfood the AI tool to also review the code that is being used to generate

41:53.680 --> 42:03.680
the tool? No, not currently. Right now we are only using it on Firefox source code, not on any other

42:03.680 --> 42:11.040
Mozilla project. We have a small group of developers who are participating in this

42:11.040 --> 42:18.000
experiment, and they're all Firefox developers working on Firefox source code. Currently, we only

42:18.000 --> 42:23.840
show the results to the reviewer, because we want to collect as much feedback as possible. If we

42:23.840 --> 42:30.080
showed it to the developer, the developer could just fix the issue without clicking on the button

42:30.080 --> 42:34.480
accept or clicking on the button reject. It wouldn't make sense for the developer to accept the

42:34.480 --> 42:40.560
comment, publish the comment, and then fix it. So that's why, in the experimentation phase, we're

42:40.560 --> 42:54.160
just showing it to reviewers, and only Firefox reviewers. Hello, yes, another question. So I may have

42:54.160 --> 43:01.360
misunderstood, but right now the tool only goes over the added lines, the lines that were added

43:01.440 --> 43:08.000
in the patch. What about the lines that were removed? Why are they not being reviewed by the tool?

43:08.000 --> 43:17.200
The issue was that, before we asked it to focus on added lines, it was focusing on changes that

43:17.200 --> 43:25.440
were already made by the patch, so it was suggesting to do what the patch was already doing. So it was

43:25.760 --> 43:32.000
kind of a hack to avoid the LLM focusing on things that the patch was already doing, which

43:32.000 --> 43:37.280
wouldn't make sense to comment about. I think it's a limitation that we should resolve.

43:38.960 --> 43:46.880
While building the prompt engineering process, it sounds easy, but it's extremely complex. Like I said,

43:46.880 --> 43:53.600
we go back and forth, back and forth, back and forth. We built some tools to validate changes, so whenever

43:53.760 --> 44:03.360
we make a change, we run the tool on many past reviews, and we

44:03.360 --> 44:08.400
compare the comments that are generated by the new version of the prompt with the old version

44:08.400 --> 44:16.400
of the prompt, automatically. And then we manually label the ones that are completely new

44:16.480 --> 44:20.800
compared to before, so we can see: okay, is the new version of the prompt

44:22.640 --> 44:29.440
generating the same comments as before? Is it generating more comments

44:29.440 --> 44:34.240
that were previously accepted? Is it generating fewer comments that were previously rejected? So

44:34.240 --> 44:42.400
now we have all of these tools around the tool itself, and we can more easily iterate on the prompt.
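The prompt-regression tooling described above could look roughly like the following sketch. The function name, the set-based representation of comments, and the feedback labels are all assumptions for illustration, not the real tool's API:

```python
# Compare comments generated by two prompt versions over past reviews:
# how many previously accepted comments are kept, how many previously
# rejected ones disappear, and which comments are new and need labelling.

def compare_prompt_versions(old_comments, new_comments, feedback):
    """old_comments / new_comments: sets of comment texts per prompt version.
    feedback: dict comment -> "accepted" | "rejected" from past reviews."""
    kept = old_comments & new_comments
    return {
        "kept_accepted": {c for c in kept if feedback.get(c) == "accepted"},
        "dropped_rejected": {c for c in old_comments - new_comments
                             if feedback.get(c) == "rejected"},
        "new_to_label": new_comments - old_comments,  # needs manual labelling
    }

report = compare_prompt_versions(
    old_comments={"use a smart pointer here", "nit: trailing whitespace"},
    new_comments={"use a smart pointer here", "possible null dereference"},
    feedback={"use a smart pointer here": "accepted",
              "nit: trailing whitespace": "rejected"},
)
assert report["kept_accepted"] == {"use a smart pointer here"}
assert report["dropped_rejected"] == {"nit: trailing whitespace"}
assert report["new_to_label"] == {"possible null dereference"}
```

A prompt change that keeps the accepted comments, drops the rejected ones, and introduces few new-to-label comments is then an improvement, which is the iteration loop the answer describes.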

44:42.480 --> 44:48.320
So this is one of the things that I want to try next: try to remove this, or rephrase it, to ask

44:48.320 --> 44:56.480
it to focus on new changes, not only on added lines but also on removed lines. But we need to find

44:56.480 --> 45:03.040
a good way so that it doesn't fall back to adding comments on things that are already

45:03.120 --> 45:12.880
done by the patch. Thanks. Any more questions? You have described some comments that were

45:14.080 --> 45:23.200
good, but not good review comments, rather good tips for development. How can we have a comment given

45:23.200 --> 45:29.680
during a review which is good for development but is not a good review comment? That is a little bit

45:29.680 --> 45:39.440
curious to me. So basically, there were things like... I could find some examples later and share them

45:39.440 --> 45:46.960
with you, but one example that comes to mind is things like: ensure that this code path is not

45:46.960 --> 45:54.080
leading to a crash, for example. This is something that the reviewer should do; it's not for the developer

45:54.160 --> 46:02.160
to do. So it's not a comment; it's something like, ensure this is good, like, please double check this.

46:02.160 --> 46:07.680
But it's a suggestion to the reviewer. Actually, I mean, it could be considered as a suggestion for

46:07.680 --> 46:11.840
both, but usually when you are reviewing a patch, you don't add a comment to say, please

46:11.840 --> 46:17.120
make sure this is correct. You will make sure that this is correct, and then if it is not correct,

46:18.000 --> 46:24.560
you will say something. But you will never comment saying, please make sure

46:24.560 --> 46:26.560
that this is correct.

46:26.560 --> 46:56.000
Okay, I wonder, does your system support iterative development and review? Is the LLM comment

46:56.000 --> 47:05.920
only done once per review, and will it have context on previous comments it made?

47:07.200 --> 47:13.120
At the moment it's one review and that's it. We are thinking of

47:13.120 --> 47:23.520
re-architecting it a bit to take into account the previous review

47:23.520 --> 47:31.600
as well, but also to maybe make it conversational, in case of those useful review tips. For

47:31.600 --> 47:38.480
example, if the LLM says, please make sure that this is correct, you could ask, why are you saying

47:38.960 --> 47:46.720
that? Or you could ask, can you look

47:46.720 --> 47:54.160
up this function for me? You know, you could interact with the comment itself to dig deeper.

47:55.760 --> 48:01.200
The other thing that comes to mind when I see this:

48:01.200 --> 48:06.240
right now we're generating the comment and that's it, like you saw before. We're planning to also

48:06.240 --> 48:14.160
generate reasoning behind the comment. There are two reasons why we're

48:14.160 --> 48:19.600
doing that. One is so that the LLM maybe will realize by itself that the comment is

48:19.600 --> 48:24.800
wrong, by generating the reasoning behind it. And the other reason is, if we generate both the

48:24.800 --> 48:30.720
comment and the reasoning, the developer does not have to spend as much time validating it, because

48:30.720 --> 48:35.840
they will see the reasoning, and it will be easier to see if there is something wrong

48:35.840 --> 48:52.400
in the reasoning, if there is a flaw. Okay, any last questions? There is one there.
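The "comment plus reasoning" idea in the answer above could be sketched minimally as follows. The dataclass, the self-check field, and the filtering heuristic are hypothetical, not Mozilla's actual design:

```python
# Each generated comment carries the model's reasoning, and a self-check
# step can drop comments whose own reasoning undermines them, which is one
# way the model might "realize by itself that the comment is wrong".

from dataclasses import dataclass

@dataclass
class ReviewComment:
    text: str
    reasoning: str
    self_check: str  # model's verdict after re-reading its own reasoning

def keep_comment(c: ReviewComment) -> bool:
    """Publish only comments the model still stands behind."""
    return c.self_check == "valid"

comments = [
    ReviewComment("possible off-by-one in loop bound",
                  "the loop uses <= while the array has n elements",
                  "valid"),
    ReviewComment("consider renaming this variable",
                  "on reflection the name already matches the style guide",
                  "invalid"),
]
published = [c for c in comments if keep_comment(c)]
assert [c.text for c in published] == ["possible off-by-one in loop bound"]
```

Surfacing the `reasoning` field alongside the published text is what would let a reviewer spot a flaw quickly, as described above.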

49:00.960 --> 49:13.680
Yeah, awesome talk. I had a question about tokens: like, how many tokens generally are being

49:13.680 --> 49:19.840
computed during each review, and is it a multi-step review to get the final result, or is it just

49:19.840 --> 49:27.520
like a single conversation that happens once? So, I don't know about the tokens; I need to

49:27.600 --> 49:35.760
look into that. It's a multi-step process, so it's not just a single prompt and the result. It's:

49:36.720 --> 49:44.240
summarize the changes; okay, now please find issues; okay, now please filter these issues based on

49:44.240 --> 49:53.440
these other results. So there are multiple steps to get to the answer. There is a link to the project

49:54.000 --> 50:02.480
in the presentation, so anybody can have a look at how the prompting

50:02.480 --> 50:19.280
process works. Thanks. Just checking to see if anyone has any more questions. Okay, looks like we

50:19.280 --> 50:24.320
are done. Thank you so much, Marco, that was a really nice presentation.
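The multi-step review process described in the last answer (summarize the changes, then find issues, then filter them against the earlier results) could be sketched as below. `ask_llm` is a stand-in for whatever LLM client the tool actually uses, and the prompts are illustrative only:

```python
# Chain three LLM calls, feeding each step's output into the next,
# mirroring the summarize -> find issues -> filter pipeline.

def ask_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; echoes the prompt prefix here."""
    return f"<answer to: {prompt[:40]}...>"

def review_patch(patch: str) -> str:
    # Step 1: build a high-level summary of the change.
    summary = ask_llm(f"Summarize this patch:\n{patch}")
    # Step 2: look for issues, with the summary as extra context.
    issues = ask_llm(f"Given this summary:\n{summary}\nFind issues in:\n{patch}")
    # Step 3: filter out trivial or already-addressed issues.
    return ask_llm(f"Filter these issues, dropping trivial ones:\n{issues}")

result = review_patch("+int x = arr[n];")
assert result.startswith("<answer to:")
```

Each step consumes tokens of its own, which is why the per-review token count asked about above depends on the whole chain, not a single prompt.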

