WEBVTT

00:00.000 --> 00:11.200
Hello everyone, I'm Javier Lasse. I've been a software engineer at Red Hat for a bit more than

00:11.200 --> 00:16.000
three years now. And today I want to share with you a little bit of our story with

00:16.000 --> 00:22.500
observability in the KubeVirt project. We'll start with a bit of background about

00:22.500 --> 00:30.900
KubeVirt, the observability challenges that we faced, and what our approach was to solve those difficulties.

00:31.700 --> 00:36.900
Then a little demo of how we are trying to do things now, and a few of the lessons learned.

00:38.420 --> 00:47.380
So, for those of you that don't know KubeVirt, basically it's an API and a runtime that you can install

00:47.380 --> 00:55.780
on Kubernetes to manage and run virtual machines as native workloads. We are a CNCF project.

00:55.780 --> 01:03.540
We have over 5.8 thousand GitHub stars, more than three hundred contributors,

01:03.540 --> 01:11.300
one thousand plus forks, and ten thousand PRs. This project is vendor-neutral and adopted by

01:11.300 --> 01:17.060
many organizations such as Red Hat, Microsoft, Cloudflare, NVIDIA, Arm, and many others.

01:18.900 --> 01:25.300
Basically, KubeVirt works with a lot of components. On the master node, we have, for example,

01:26.180 --> 01:33.940
the main controller to manage virtual machines and the API to handle requests. But on each worker

01:33.940 --> 01:39.700
we also have an agent installed that's responsible for handling the VM containers.

01:40.660 --> 01:47.700
Here we have a bit more detail. We have the names of these components. The idea here is to

01:47.700 --> 01:58.740
just understand that KubeVirt is composed of many small parts. What was then the challenge?

01:58.820 --> 02:09.700
So, as the project progressed, KubeVirt had more and more users, more vendors. And obviously, with

02:09.700 --> 02:17.860
those, the requests for observability features and new signals kept increasing. In the beginning,

02:17.860 --> 02:24.420
it was rapidly growing. And as in many of the projects you probably work on, in the initial stage

02:24.420 --> 02:31.460
we usually don't care that much about having metrics, right? We just want to develop the features

02:31.460 --> 02:37.300
that are needed. And sometimes, it would be nice to have this specific metric to track this.

02:37.300 --> 02:44.420
So, let's just put it somewhere. Let's just make it work so we can track it. Let's just go

02:45.620 --> 02:51.220
step by step, applying some glue here and there, so we can have the necessary observability

02:51.220 --> 03:00.740
stuff that comes up each time. So, we call it the initial Wild West approach.

03:01.540 --> 03:08.500
We have a lot of components and usually teams responsible for each of those components.

03:08.500 --> 03:14.580
And each developer would just go to the code and, for example, add Prometheus metrics.

03:15.540 --> 03:21.620
They just read the documentation and thought, okay, this seems like the correct way to do it.

03:21.620 --> 03:30.580
Let's just do it this way. And, okay, it was working totally fine. But a few years after that,

03:30.580 --> 03:36.100
this led to a high cognitive load. For example, when I started working on the project,

03:36.100 --> 03:42.420
I focused more on observability. Sometimes, people would just ask me to add some

03:43.380 --> 03:48.180
observability feature. For example, we want to add stability levels for metrics.

03:48.180 --> 03:55.220
And what would it take to do it? I would need to go across several repositories.

03:55.220 --> 04:02.260
I didn't know where the metrics were being defined, how they were being defined. And it was

04:02.260 --> 04:08.980
really hard across those projects, for example, to keep a mental model of how each of those projects

04:08.980 --> 04:18.420
implements the observability signals. And alongside that, we had a lot of business logic

04:18.420 --> 04:29.460
intertwined with the observability logic. Usually, you just go to the file where you want to

04:29.460 --> 04:35.140
track some specific signal and just add the logic to create a metric there and just define the

04:35.220 --> 04:43.460
metric there. But that doesn't scale when you have a dedicated observability team.

04:43.460 --> 04:50.020
And along with that, we noticed a lot of inconsistencies in the naming conventions and the variables.

04:50.020 --> 04:56.020
And as I said before, it was hard to add new features across the repositories.

04:56.660 --> 05:04.820
So, our approach, and this might not be the ideal solution for some of you, but we tried to

05:04.820 --> 05:10.980
modularize observability in a separate package. And as I said, these are all the repositories

05:10.980 --> 05:19.620
that our observability team works with every day. So what we thought was the best solution for us

05:19.620 --> 05:27.700
was to try to make something strict that did not provide all the flexibility that,

05:27.700 --> 05:34.340
for example, the Prometheus library gives, but would be the right fit for our project.

05:35.300 --> 05:42.180
We submitted the initial proposal for the code refactoring. And the goals were mainly

05:42.180 --> 05:47.940
to decouple the monitoring logic from the business logic, to encapsulate the monitoring

05:47.940 --> 05:53.460
best practices and the patterns that we wanted to follow in this library and just have it as

05:53.460 --> 05:59.940
a dependency for all those KubeVirt components. This would also mean that it would be easier for us

05:59.940 --> 06:06.180
to keep the code and the utilities easy to maintain and evolve. We would only need to

06:06.180 --> 06:12.580
change them in a single location. And we could also have a structure and many tools to

06:12.580 --> 06:19.220
automatically generate, for example, documentation and linting rules, and define, for example,

06:19.220 --> 06:25.700
allow lists and deny lists for our customers. And all of this without having to change

06:25.700 --> 06:34.180
the code in multiple places. We started with a strict interface and dependency model. We wanted to

06:34.260 --> 06:42.580
really limit what we have across all those repositories and try to move most of it to our

06:43.380 --> 06:51.860
dependency. So, I don't know if you can see it very well, but here we have the operations

06:51.860 --> 06:59.140
just to create the metrics. And the idea here is to have a list of all the metrics that are

06:59.140 --> 07:06.180
actually being exposed. And we have something similar to the registration that the Prometheus

07:06.180 --> 07:13.140
library has. But this is very limited and it forces certain parameters to be passed and to be

07:13.140 --> 07:21.220
defined. And this is all in a single package, for example,

07:21.220 --> 07:25.780
pkg/monitoring/metrics/virt-controller. And then we have multiple files there. And you

07:25.780 --> 07:33.620
can see at the bottom that the logic to create the metric and all the information that we need

07:33.620 --> 07:42.020
for the metric is also defined in this file. So this can be just the responsibility of the observability

07:42.020 --> 07:52.100
team and not intertwined with other, not-so-related code. Here we have a very similar example,

07:52.100 --> 07:59.380
but this time for the implementation of a collector. So, in this approach, we have the list of metrics,

07:59.380 --> 08:10.020
then the definition of the metrics and then the code for the collector. And the registration

08:10.020 --> 08:16.660
of the metrics, all the metrics from this component are always registered here. They cannot be

08:16.660 --> 08:24.740
registered anywhere else. So we have this central focal point where we can quickly see

08:24.740 --> 08:31.540
everything that is being created. And we make sure that nothing that does not appear here might

08:31.540 --> 08:38.580
be created somewhere else. So this is our single point of truth.
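
NOTE
Editor's sketch, not part of the talk. It illustrates the "single registration point with
forced parameters" idea in plain Python. MetricOpts, Registry, register and list_metrics are
hypothetical names chosen for illustration; they are not the operator-observability API.
```python
from dataclasses import dataclass
from typing import Dict, List
# Hypothetical names throughout; this only illustrates the pattern.
@dataclass(frozen=True)
class MetricOpts:
    name: str
    help: str
    kind: str  # "counter" or "gauge"
class Registry:
    """Single place where every metric of a component must be registered."""
    def __init__(self) -> None:
        self._metrics: Dict[str, MetricOpts] = {}
    def register(self, metrics: List[MetricOpts]) -> None:
        for m in metrics:
            # Force the fields the team cares about to be filled in.
            if not m.name or not m.help:
                raise ValueError(f"metric {m.name!r} needs a name and help text")
            if m.kind not in ("counter", "gauge"):
                raise ValueError(f"metric {m.name!r} has unsupported kind {m.kind!r}")
            if m.name in self._metrics:
                raise ValueError(f"metric {m.name!r} is already registered")
            self._metrics[m.name] = m
    def list_metrics(self) -> List[MetricOpts]:
        # Sorted so that docs and linters always see a stable order.
        return sorted(self._metrics.values(), key=lambda m: m.name)
```
Because every metric goes through register, list_metrics can serve as the one list that
documentation and validation tooling read from.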

08:39.140 --> 08:48.580
Here we have the link, if you want to check it later, for the biggest PR. This was on the

08:48.580 --> 08:57.300
core KubeVirt platform: the initial PR to move from the old implementation to the new implementation.

08:57.300 --> 09:04.500
I'll also show you some examples. This is the first one that we have, from the initial

09:04.500 --> 09:09.700
application file for the virt-operator. In this example you can see what I was mentioning

09:09.700 --> 09:16.340
in the beginning. It's probably something that you might be used to seeing. We just define the

09:16.340 --> 09:21.940
metrics that we want at the beginning of our file. We probably have an init function that,

09:22.660 --> 09:30.100
in this case, must register all those metrics, and then throughout the code we are just populating

09:30.100 --> 09:37.700
the metrics directly there. Here we have an example, because in the goals I mentioned that

09:37.700 --> 09:45.300
we wanted to be more careful with the generation of documentation both for the tooling and for

09:45.300 --> 09:52.900
customers. Here in green we have an example of four metrics that were being created in the code

09:53.460 --> 10:00.260
but since they were a bit all over the place they were not being picked up by the code

10:00.260 --> 10:07.780
that was generating the documentation. When we had this new system, in which all the metrics

10:07.780 --> 10:14.420
were registered in that single point, we were automatically able to catch those metrics and

10:14.420 --> 10:22.740
add them to the documentation and provide extra value for the customers. I also mentioned a bit

10:22.740 --> 10:29.140
about the validations. Here, in the validations, we have, for example, the ones that are also

10:29.140 --> 10:38.100
provided with the Prometheus libraries, but for our use case we wanted to have additional validations.

10:38.100 --> 10:45.060
Those you can see in those functions in the middle. They are incorporated in our observability

10:45.060 --> 10:52.500
package. Then each project can just create the test, defining the custom validations that

10:52.500 --> 11:00.980
they want, and then just call lint alerts, pass the list of alerts, and expect no problems.
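
NOTE
Editor's sketch, not part of the talk. It shows one way a "lint the alerts with custom
validations, expect no problems" test helper could look. The dict shape of an alert, the two
validators and lint_alerts are assumptions made for illustration only.
```python
from typing import Callable, Dict, List
# Hypothetical shapes: an alert is a plain dict, a validator returns a list of problems.
Alert = Dict[str, object]
Validator = Callable[[Alert], List[str]]
def require_severity(alert: Alert) -> List[str]:
    labels = alert.get("labels") or {}
    if labels.get("severity") not in ("critical", "warning", "info"):
        return [f"{alert.get('alert')}: missing or invalid severity label"]
    return []
def require_runbook(alert: Alert) -> List[str]:
    annotations = alert.get("annotations") or {}
    if not annotations.get("runbook_url"):
        return [f"{alert.get('alert')}: missing runbook_url annotation"]
    return []
def lint_alerts(alerts: List[Alert], validators: List[Validator]) -> List[str]:
    """Run every validator on every alert and collect the reported problems."""
    problems: List[str] = []
    for alert in alerts:
        for validate in validators:
            problems.extend(validate(alert))
    return problems
```
A project test would then assert that lint_alerts returns an empty list for its own alerts.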

11:01.060 --> 11:09.940
So we tried, as I said in the beginning, to take as much as we could from each of the repositories

11:09.940 --> 11:18.980
and tried to move it all to the operator-observability package. This also provided us with

11:18.980 --> 11:30.980
ways to make it easier to write unit tests. Since the monitoring logic is now limited to

11:30.980 --> 11:37.620
that monitoring package and we have strict interfaces for how to create those metrics,

11:37.620 --> 11:46.660
we can just simplify the ways that we write tests. For example, the collector for virtual machines

11:46.740 --> 11:54.340
just receives the list of VMs, so we can easily define a specific list of VMs that we want

11:54.340 --> 12:03.700
to track and just write very tiny specific unit tests for those behaviors that we want to track.
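
NOTE
Editor's sketch, not part of the talk. It illustrates why a collector that only receives a
list of VMs is easy to unit test. The VM dict fields, the metric name example_vm_count and
the function names are hypothetical.
```python
from collections import Counter
from typing import Dict, List, Tuple
# Hypothetical data model: each VM is a dict with "namespace" and "status" keys.
def collect_vm_counts(vms: List[Dict[str, str]]) -> List[Tuple[str, Dict[str, str], float]]:
    """Turn a list of VMs into (metric name, labels, value) samples."""
    counts = Counter((vm["namespace"], vm["status"]) for vm in vms)
    return [
        ("example_vm_count", {"namespace": ns, "status": status}, float(n))
        for (ns, status), n in sorted(counts.items())
    ]
def test_collect_vm_counts() -> None:
    # The test builds exactly the VMs it cares about; no cluster is needed.
    vms = [
        {"namespace": "a", "status": "Running"},
        {"namespace": "a", "status": "Running"},
        {"namespace": "b", "status": "Stopped"},
    ]
    assert collect_vm_counts(vms) == [
        ("example_vm_count", {"namespace": "a", "status": "Running"}, 2.0),
        ("example_vm_count", {"namespace": "b", "status": "Stopped"}, 1.0),
    ]
```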

12:04.820 --> 12:11.220
Previously, many times we saw that the metrics didn't have any unit tests, because people would just say,

12:11.220 --> 12:16.900
oh, we are already testing the business logic and the metric follows all that business logic, so

12:16.900 --> 12:24.580
it's probably fine. That was the initial thought that people had about making sure that the

12:24.580 --> 12:31.540
metrics were right. As I said in the beginning, they were just trying to create them and go to the next

12:31.540 --> 12:39.540
feature that the customers were looking for, and this mentality only changed as the project grew more mature.

12:41.300 --> 12:51.700
And this also allowed us to have more clearly bounded ownership over the observability code. Usually,

12:52.980 --> 13:00.980
in KubeVirt, for example in core KubeVirt, there are like six different teams for a single repository.

13:00.980 --> 13:06.900
So most of the teams have very clearly scoped permissions to review and approve code that

13:06.900 --> 13:15.220
touches certain packages, and just a few approvers can approve for the whole project.

13:16.260 --> 13:24.260
What we found in the beginning was that each time I was going to open a pull request, it touched the

13:24.260 --> 13:31.460
packages of like three different teams, so I would always have to ask the general approvers of the

13:31.460 --> 13:39.540
project to review and approve the PRs and this usually took a long time because they were

13:39.540 --> 13:47.780
always busy, and the PRs always took, as I said, a lot of time to move forward.

13:48.660 --> 13:55.140
With this approach we were able to simplify all these processes because everything that is

13:55.220 --> 14:03.620
related to defining the values of metrics and testing metrics was just on our team. And when

14:03.620 --> 14:09.940
sometimes we needed to change something in the logic, to pass more data or to fetch more data from the VMs,

14:09.940 --> 14:14.980
then yes, it made sense for the people that were responsible for that code to also take a look, to

14:14.980 --> 14:20.820
also verify that everything was working right. But if we want to change metric calculations

14:21.060 --> 14:28.180
or add new metrics from the data we are receiving in some collector, that's more our knowledge

14:28.180 --> 14:34.180
than their knowledge. So it didn't make sense that they were forced to review that code and this

14:34.180 --> 14:44.420
was a big advantage to our velocity. Obviously there were still a lot of challenges in the transition

14:44.420 --> 14:52.420
to this new format. There was a lot of overhead. I think that first PR was one of the biggest,

14:52.420 --> 14:59.460
but we had to open at least like two or three PRs across all those repositories, and there were like

14:59.460 --> 15:07.460
eight, probably, so at least 24 PRs just for all this effort. But we found later that

15:08.420 --> 15:17.940
the velocity at which we are able to move now is worth all that initial work. And some of

15:17.940 --> 15:25.140
those PRs were also very, very complex. The metrics were scattered all over the code. The business

15:25.140 --> 15:35.140
logic was very intertwined. So we had to be careful with moving all that logic to the new package.

15:35.140 --> 15:42.260
And another challenge was educating developers on the things that they should not

15:42.260 --> 15:49.220
do anymore. Sometimes we saw that they were still creating new metrics that they thought were

15:49.220 --> 16:00.020
useful around those old blocks of code. So we also created a linter that verifies that these

16:00.740 --> 16:06.580
practices that I'm sharing here are being followed, that they are not using

16:06.580 --> 16:13.380
Prometheus Register or MustRegister calls across the code, and even that the operator-observability calls

16:13.380 --> 16:17.860
to register metrics and to define them are all being done in that specific package.
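
NOTE
Editor's sketch, not part of the talk. The linter described here works on Go code; this is
only a simplified, text-based illustration of the rule "no Register or MustRegister calls
outside the monitoring package". The allowed prefix and function name are assumptions.
```python
import re
from typing import Dict, List
# Assumed layout: only files under pkg/monitoring/ may register metrics.
ALLOWED_PREFIX = "pkg/monitoring/"
REGISTER_CALL = re.compile(r"\bprometheus\.(?:Must)?Register\(")
def find_violations(files: Dict[str, str]) -> List[str]:
    """Return "path:line" for every registration call outside the allowed package."""
    violations: List[str] = []
    for path, source in sorted(files.items()):
        if path.startswith(ALLOWED_PREFIX):
            continue
        for lineno, line in enumerate(source.splitlines(), start=1):
            if REGISTER_CALL.search(line):
                violations.append(f"{path}:{lineno}")
    return violations
```
A plain text match like this would miss aliased imports; a real linter would inspect the
parsed source instead.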

16:18.420 --> 16:37.060
So now, a bit of a demo. I created a new operator with Operator SDK. It's very empty, just

16:37.060 --> 16:43.300
an API for Memcached. Those of you who work with Kubernetes and with operators should know

16:43.300 --> 16:50.020
this is like a default example. And, for example, let's imagine that this project was being used

16:50.020 --> 16:57.300
by some teams. Usually what we would do is just install that monitoring linter and try to see

16:57.300 --> 17:06.580
where it catches definitions of metrics that we don't like and that we want to remove.

17:06.580 --> 17:13.940
So here we will look at the controller and see that here it's just an init function with the

17:13.940 --> 17:22.180
metric inside and the MustRegister. So we want to delete this piece of code. Obviously, it's not

17:22.180 --> 17:28.820
being called, I just added it here for example purposes, but usually we want to remove it and then

17:28.820 --> 17:35.060
move it to pkg/monitoring, under the metrics section, which we will take a look at in a second.

17:35.060 --> 17:40.260
And then when we call the monitoring linter, we expect it to be clean.

17:43.620 --> 17:51.460
As I was showing in those small examples during the presentation, we have the main setup for the

17:51.460 --> 17:59.700
metrics, and we want to register all the metrics that are defined across the specific files in this package,

17:59.700 --> 18:09.140
and also the collectors. We see that here, for metrics, we only have the operator metrics

18:09.140 --> 18:19.780
and just this custom resource collector. Here we have the example for the operator metrics.

18:19.780 --> 18:25.620
For example, right now we have a reconcile count, a reconcile action, and an old count. Many of

18:26.020 --> 18:33.220
these are just ideas for this example. The reconcile count is a totally normal

18:33.220 --> 18:40.900
metric. As I said, we just enforce some fields, and, for example, we also added these extra fields.

18:40.900 --> 18:48.100
I was mentioning in the beginning that when we wanted to have new features we needed to go to all the

18:48.100 --> 18:53.140
repositories to add support for all those features, because, for example, these extra fields

18:53.220 --> 18:59.620
do not come easily with the Prometheus library, but it is something that is in the Kubernetes

18:59.620 --> 19:06.260
code, and we just wanted, for example, to add the stability level. So this is possible now,

19:06.260 --> 19:11.940
and we can enforce, with those tests in the operator package, the validation to make sure the stability

19:11.940 --> 19:20.180
level is present and the values are in a given list of values. In this case, we can see that this

19:20.340 --> 19:28.020
metric is in beta, the reconcile action, for example, is in alpha, and this old count that I mentioned

19:28.020 --> 19:35.460
is, for example, deprecated, and we can also state the version in which it was deprecated,

19:35.460 --> 19:42.500
because one of the problems that we also faced a lot of times was metrics with incorrect names,

19:42.500 --> 19:48.660
because sometimes developers would not use the best practices for Prometheus in terms of the

19:48.740 --> 19:57.460
units for the metrics or the names for the counters, and we wanted to rename those metrics, but

19:57.460 --> 20:04.500
a lot of customers were already using them. So we added these new features to be able to state

20:04.500 --> 20:11.140
for them what's the stability level of the metric and, in case it's deprecated, the version, so we

20:11.140 --> 20:19.700
can later remove it and replace it with a new metric. Here you have the example for the resource

20:19.700 --> 20:27.620
collector. It's very simple. The only difference is we have a callback function, and in the callback

20:27.620 --> 20:36.580
function we can just access a collector client variable that is able to, for example, list

20:36.580 --> 20:44.180
all the resources in the cluster, so that we can count how many custom resources were created.

20:44.180 --> 20:50.420
And the idea, as I said, was to try to have all this logic separated from the other logic code

20:50.420 --> 21:01.380
so we can do this in a neat, tight scope. And the same for the rules. If you notice, the

21:01.380 --> 21:10.180
structure is very similar to the metrics. We want to make everything as consistent as possible

21:10.180 --> 21:16.420
across the files and across all the repositories, so that when we change projects, even if it's a

21:16.420 --> 21:23.220
storage project, a networking project, or core KubeVirt, we know that everything is implemented the same way.

21:23.220 --> 21:30.660
It's easy to keep the mental model, and we don't need to spend a lot of time going all over

21:30.660 --> 21:38.900
again through the code to see how things are implemented in this repository. And yeah, we just have

21:38.900 --> 21:44.260
the normal alerts, not many differences. Here you can see that we are using the rule definition from

21:44.340 --> 21:55.860
Prometheus, and the same for the recording rules. And we also added some utilities. I showed you one

21:55.860 --> 22:04.900
in the slides about the unit testing, and here I'll show you that we also try to make it simple

22:04.900 --> 22:11.700
to generate documentation. The package already provides a basic template to generate

22:11.700 --> 22:18.980
documentation. We just need to set up the rules and call build alerts docs, and it will just

22:20.580 --> 22:32.340
generate it. In this case, by default, it's just a Markdown with the definition of the alerts

22:32.340 --> 22:40.420
and all of their fields. For example, for the metrics, I put in an extra template,

22:41.220 --> 22:47.860
and here you can see that it's defining a certain structure for the deprecated version

22:47.860 --> 22:56.260
and for the stability levels, and here we just call build metrics docs with a custom template.

23:03.300 --> 23:10.020
And we end up with something very similar. And this was actually useful later,

23:10.420 --> 23:18.660
because initially we just wanted this Markdown with the list of all the metrics, but now we are trying

23:18.660 --> 23:26.660
to feed, for example, OpenShift AI systems with the metrics documentation, and we found that, for

23:26.660 --> 23:33.860
example, Markdown is not the best input for it, so it was really easy to just go to the

23:33.860 --> 23:41.540
template and change this to XML and just have all the tags we need with the fields, help texts,

23:41.540 --> 23:47.380
and even the labels. We are now also going through all the labels that metrics

23:47.380 --> 23:54.740
have, trying to add some more detailed descriptions of what the labels mean and the values they can have,

23:54.740 --> 24:02.900
so when someone later goes to those AI systems and asks, I want to track this information or generate

24:02.980 --> 24:10.020
these dashboards, everything about the metrics and their labels is documented neatly here,

24:10.020 --> 24:13.780
and it's easier to return the correct answers for them.
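
NOTE
Editor's sketch, not part of the talk. It illustrates generating metric documentation from a
single metric list through a swappable template, Markdown by default and XML when a tool
needs tagged input. The record fields, template strings and function name are hypothetical.
```python
from string import Template
from typing import Dict, List
from xml.sax.saxutils import escape
# Hypothetical metric records; field names are illustrative only.
MARKDOWN_ROW = Template("### $name\n$help Type: $kind. Stability: $stability.\n")
XML_ROW = Template("<metric><name>$name</name><help>$help</help><type>$kind</type><stability>$stability</stability></metric>\n")
def build_metrics_docs(metrics: List[Dict[str, str]], row: Template, xml: bool = False) -> str:
    """Render one row per metric, sorted by name, using the given template."""
    out = []
    for m in sorted(metrics, key=lambda m: m["name"]):
        # Escape field values only when the output is XML.
        fields = {k: escape(v) for k, v in m.items()} if xml else m
        out.append(row.substitute(fields))
    return "".join(out)
```
Changing the output format only means passing a different template; the metric list and the
calling code stay the same.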

24:19.380 --> 24:26.420
So, a few of the lessons that we learned. Obviously, we should try to address technical

24:26.500 --> 24:36.340
debt early. Many times we think, okay, I'm a good developer, I'll take care of making

24:36.340 --> 24:42.580
it all beautiful and everything very well structured, but then we need this feature for tomorrow

24:42.580 --> 24:51.860
and this feature for the next day, so sometimes we deprioritize some things we think are smaller.

24:51.940 --> 24:58.740
Usually metrics, at the beginning of projects, tend to fall into that category.

24:58.740 --> 25:07.780
And that's why we tried to create this consistent and strict dependency model, so everything

25:07.780 --> 25:14.500
is easy to do but still enforces the right best practices. Obviously, also encourage an

25:15.220 --> 25:20.500
observability mindset. Probably, since we are in the monitoring and observability room, we understand

25:20.580 --> 25:26.580
the importance of monitoring, but we also need to educate our developers and other teams on the

25:26.580 --> 25:34.580
importance of this. It might not work for you, but in my experience I suggest something like this

25:34.580 --> 25:42.980
library-based approach: everything neatly structured, with some tools, separate from the logic

25:43.380 --> 25:50.420
code. And obviously, continuous iteration and adaptation as you need new features, and try to see

25:50.420 --> 25:55.300
what works best for you and for your team.

26:05.300 --> 26:07.940
Thanks a lot for your talk. Are there any questions?

26:12.980 --> 26:31.780
Thank you so much. Did you manage to... I can get closer. Is that good? Did you manage to find any

26:31.780 --> 26:37.220
efficient ways... I think it's starting to work. Do you hear me now?

26:38.420 --> 26:46.820
Let me try. Now it's okay. Did you manage to find any efficient ways of actually

26:46.820 --> 26:53.780
measuring the cognitive load before and after your approach, to understand and find the improvements?

26:54.740 --> 27:05.700
Yeah, you asked about ways to measure the performance? Let me go back. Yeah, it's... sorry, I

27:05.700 --> 27:09.780
didn't hear it well and I can't repeat the question. Maybe like...

27:14.420 --> 27:19.460
The velocity of code development? The velocity now?

27:20.020 --> 27:21.460
That's the question?

27:26.660 --> 27:42.900
Yeah, so, to repeat the question: the question was about whether we have measured the

27:42.980 --> 27:52.820
differences in cognitive load before and after the changes. So, cognitive load is a bit of

27:57.220 --> 28:04.820
an ideal, right? And for me it's about the feeling that I had before and after the

28:04.820 --> 28:12.580
changes. I was mentioning that before, every time I had to change repositories to work on, I don't

28:12.580 --> 28:19.780
know, metrics, I felt that I needed to take some time just to see how things were being

28:19.780 --> 28:27.380
specifically implemented in that repository. And afterwards, now I go to any of those repositories

28:27.380 --> 28:33.220
and I know exactly where things are being created or how metrics are being registered.

28:33.220 --> 28:42.500
And you mentioned measuring. I didn't bring any data on the velocity, but I know that

28:43.300 --> 28:49.700
in the Jira that we use to track our features, we were able to do more story points

28:49.700 --> 28:57.700
per sprint than we were previously capable of, because we are now able to develop the

28:57.700 --> 29:05.540
features faster, but also because the PRs are much more quickly reviewed and approved, because we can

29:05.540 --> 29:16.100
rely mostly on our team to do it. So what do you think about the kube-state-metrics approach

29:16.100 --> 29:21.860
compared to this one? Because you could totally separate metrics from your business logic.

29:22.580 --> 29:30.580
So we tried to take inspiration from both the approach that the main Kubernetes code

29:30.580 --> 29:36.020
takes and kube-state-metrics. I think kube-state-metrics is like the best example

29:36.020 --> 29:45.140
for this, because usually it's just fetching information on the resources that are available on the

29:45.140 --> 29:51.300
cluster. For example, when I showed the collector approach, kube-state-metrics is a lot like

29:51.300 --> 29:56.820
that, right? Because you can just go to the resources and fetch information there, so it's

29:56.900 --> 30:01.780
more of a collector-based approach. But we also took a lot of inspiration from the Kubernetes code.

30:02.820 --> 30:07.700
Kubernetes works a bit differently from KubeVirt because they have more of a

30:09.060 --> 30:15.860
structured package that everything goes back to, and they just take everything from there. But I think we tried to

30:17.540 --> 30:24.020
go the same way as Kubernetes, and many of the things that we did here they are also doing there.

30:24.020 --> 30:28.580
All right then, thanks a lot. Thank you.