WEBVTT

00:00.000 --> 00:13.000
Hello and good afternoon, and welcome to this talk on the art of fleet-wide

00:13.000 --> 00:15.000
observability: three core strategies.

00:15.000 --> 00:19.000
I am Prathikpanda, a site reliability engineer at Red Hat.

00:19.000 --> 00:24.000
Hi all, I am Alibullah, a site reliability engineer at Red Hat, and this is our first

00:25.000 --> 00:31.000
time presenting here.

00:31.000 --> 00:40.000
So, this talk is based on our experience of managing a fairly large fleet of clusters,

00:40.000 --> 00:47.000
and it covers the strategies that you can use to ensure that your fleet of

00:48.000 --> 00:52.000
clusters is managed properly on the observability side.

00:52.000 --> 00:57.000
So, moving to the agenda, what we will be discussing today is, first, what fleet-wide

00:57.000 --> 01:02.000
observability is; then looking at the observability challenges at scale; looking

01:02.000 --> 01:07.000
at the three strategies that we will be discussing; implementing fleet-wide

01:07.000 --> 01:12.000
observability with an example; then looking ahead to the future of observability;

01:13.000 --> 01:16.000
and then concluding with the Q&A.

01:16.000 --> 01:21.000
So, starting with what fleet-wide observability is: it means observability

01:21.000 --> 01:28.000
at a scale where you have unified metrics, logs, alerts, traces, and so on, and you

01:28.000 --> 01:34.000
are able to meet your specific requirements at the fleet level.

01:35.000 --> 01:39.000
So, looking at the key aspects that make up fleet-wide observability, the first

01:39.000 --> 01:46.000
will be multi-cluster correlation: when multiple clusters exhibit the same symptoms, there

01:46.000 --> 01:48.000
should be a correlation between them.

01:48.000 --> 01:54.000
Then centralized monitoring, giving you a view of everything at once across your fleet,

01:54.000 --> 02:00.000
whether it is your Kubernetes clusters or the applications running on them, or any other

02:00.000 --> 02:01.000
place where you might be running workloads.

02:01.000 --> 02:07.000
A centralized monitoring view, a unified single pane of glass. Then proactive insights, which

02:07.000 --> 02:15.000
would include the early detection as well as the remediation of issues that could affect

02:15.000 --> 02:20.000
your fleet before they actually become dangerous.

02:20.000 --> 02:26.000
Standardization means that we have a specific format for everything that you are capturing

02:26.000 --> 02:32.000
and everything we are monitoring. And then we have scalability, which means that as your fleet grows,

02:32.000 --> 02:37.000
your monitoring, your observability, should be growing as well.

02:37.000 --> 02:43.000
Moving to the next slide, this is more about the observability challenges at scale.

02:43.000 --> 02:47.000
So, the first challenge would be high metric volumes.

02:47.000 --> 02:55.000
Let us say a single pod can expose at minimum 20 to 30 metrics, and adding

02:55.000 --> 02:58.000
a container adds roughly another 25 on top.

02:58.000 --> 03:05.000
So, let us say a single-node cluster running about 10 pods could easily expose 5,000 metrics,

03:05.000 --> 03:12.000
and that would mean that you have quite a lot of metrics to work with. Deciding what is useful and

03:12.000 --> 03:18.000
what is not can be a challenge, and that can leave you stuck at a point where

03:18.000 --> 03:20.000
you do not know what to work on.
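To make the metric-volume point concrete, here is a rough back-of-envelope sketch in Python; every per-component series count in it is an assumed, illustrative figure rather than a number from the talk, and real values depend heavily on exporters and configuration.

```python
# Back-of-envelope estimate of the time series a small single-node cluster
# might expose. All per-component counts below are illustrative assumptions.

PODS = 10
CONTAINERS_PER_POD = 2

assumed_series = {
    "node exporter (per node)": 800,
    "kubelet / cAdvisor (per container)": 70 * PODS * CONTAINERS_PER_POD,
    "kube-state-metrics (per pod)": 30 * PODS,
    "application metrics (per pod)": 25 * PODS,
    "control-plane components": 1500,
}

total = sum(assumed_series.values())
for source, count in assumed_series.items():
    print(f"{source:38s} {count:6d}")
print(f"{'rough total':38s} {total:6d}")  # lands in the same ballpark as ~5,000
```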

03:20.000 --> 03:25.000
Then data silos: there could be multiple systems running,

03:25.000 --> 03:29.000
multiple clusters running different kinds of workloads, different configurations,

03:29.000 --> 03:34.000
with everything producing different kinds of logs and metrics

03:34.000 --> 03:41.000
in a fragmented way, which means there are multiple challenges to work on in a unified manner.

03:41.000 --> 03:47.000
Then there is the performance aspect: metrics are easy to work with on any smaller

03:47.000 --> 03:52.000
system, but when you scale up, you run into challenges around cardinality.

03:52.000 --> 04:00.000
What happens is that when you scale up those metrics, high cardinality produces a huge amount of data,

04:00.000 --> 04:07.000
and that can lead to higher costs, while the specific thing that you have been looking for

04:07.000 --> 04:10.000
is no longer visible to you.

04:10.000 --> 04:12.000
Then you have the operational overhead.

04:12.000 --> 04:18.000
So, operational overhead means that if you manage different environments and different

04:18.000 --> 04:23.000
configurations, there is a lot of different activity that could be going on.

04:23.000 --> 04:30.000
And there need to be specific teams monitoring those, and that can lead

04:30.000 --> 04:36.000
to fatigue across those teams as they manage all of this at scale.

04:37.000 --> 04:43.000
So, looking at the strategies: the first one is metrics, and what that means is

04:43.000 --> 04:45.000
identifying what actually matters.

04:45.000 --> 04:52.000
So, if we look at Kubernetes metrics, or the metrics that are really useful, we have the infrastructure

04:52.000 --> 04:58.000
metrics, which would include CPU, memory, disk usage, and network usage coming from the core

04:58.000 --> 05:05.000
infrastructure itself, whether that is a cloud provider or bare metal; then you have the application metrics:

05:05.000 --> 05:10.000
request latencies, error rates, and throughput coming on top of that.

05:10.000 --> 05:14.000
Then the Kubernetes-specific ones: pod restarts, node utilization, or whatever resource-specific

05:14.000 --> 05:18.000
utilization you might be looking at.

05:18.000 --> 05:25.000
And at the top, something quite important: your specific SLO- or SLI-based metrics.

05:25.000 --> 05:30.000
Those are the custom metrics that you are working on, and that is the level at which you provide your

05:30.000 --> 05:31.000
service.
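As a minimal sketch of what an SLI-based metric at the top of that pyramid might look like, here is some illustrative Python computing an availability SLI per cluster; the request counts, cluster names, and the 99.5% target are all made-up assumptions.

```python
# Availability SLI per cluster, computed from assumed request counters.
requests = {
    # cluster: (total_requests, failed_requests) -- illustrative numbers
    "cluster-a": (120_000, 240),
    "cluster-b": (98_000, 1_960),
}

SLO_TARGET = 0.995  # assumed availability objective

for cluster, (total, failed) in requests.items():
    sli = (total - failed) / total
    status = "OK" if sli >= SLO_TARGET else "SLO at risk"
    print(f"{cluster}: availability SLI = {sli:.4f} ({status})")
```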

05:31.000 --> 05:39.000
So, a simple diagram can help here: how can you tell whether a metric will be useful?

05:39.000 --> 05:46.000
So, this is a flow that might not be relevant at all times, but it can provide you with a

05:46.000 --> 05:50.000
series of questions that can help filter out whether you need to include a metric in your

05:50.000 --> 05:51.000
observability or not.

05:51.000 --> 05:54.000
Let us say: is it actionable?

05:54.000 --> 05:59.000
A metric should be providing you with an action to work on; let us say a pod restart.

05:59.000 --> 06:05.000
So, a pod restart that occurs multiple times over the course of a few minutes.

06:05.000 --> 06:12.000
That would mean you need to look and see what has happened, and maybe increase the resources

06:13.000 --> 06:16.000
for it, or check whether the configuration is correct.

06:16.000 --> 06:18.000
Does it provide context?

06:18.000 --> 06:25.000
So, a metric should not be existing in isolation; it should have context,

06:25.000 --> 06:26.000
right?

06:26.000 --> 06:28.000
Let us say there was a spike in CPU.

06:28.000 --> 06:34.000
It should correlate with a metric that gives you a pointer that this spike was coming

06:34.000 --> 06:35.000
from somewhere:

06:35.000 --> 06:40.000
higher workload demand, or higher usage of your service itself.

06:41.000 --> 06:42.000
Is it predictive?

06:42.000 --> 06:49.000
So, prediction here would mean that, let us say, the metric allows us to know what could possibly

06:49.000 --> 06:53.000
happen if this metric continues trending the same way.

06:53.000 --> 06:58.000
Let us say there is a high CPU spike that you see for a specific pod,

06:58.000 --> 07:00.000
or a high memory spike.

07:00.000 --> 07:05.000
That could mean a potential future out-of-memory issue in

07:05.000 --> 07:07.000
the application that is running on that pod.

07:07.000 --> 07:11.000
And then: is it available in both a real-time and a historic view?

07:11.000 --> 07:16.000
Something that is important for a good metric: when you look at it in real time,

07:16.000 --> 07:21.000
it should indicate that you have an action to perform, that you can work on this metric,

07:21.000 --> 07:25.000
figure something out, and then clear out the root cause.

07:25.000 --> 07:30.000
And then the historic view is something that gives you the insight and analysis part:

07:30.000 --> 07:36.000
if over a period of, let us say, a few weeks we see that at a certain point this metric was

07:36.000 --> 07:40.000
standing out, we identify what the cause was and then we try to remove it.

07:40.000 --> 07:45.000
So, that covers what makes a metric useful.
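The filtering flow just described can be captured as a simple checklist; this is only a sketch of the four questions, and the field names plus the example metric (a standard kube-state-metrics restart counter) are used purely for illustration.

```python
from dataclasses import dataclass

@dataclass
class MetricCandidate:
    name: str
    actionable: bool              # does it point to a concrete action?
    has_context: bool             # can it be correlated with related signals?
    predictive: bool              # does it warn about likely future failures?
    realtime_and_historic: bool   # usable for live alerting and for analysis?

def keep_metric(m: MetricCandidate) -> bool:
    """Keep the metric only if it passes every question in the flow."""
    return all([m.actionable, m.has_context, m.predictive, m.realtime_and_historic])

pod_restarts = MetricCandidate(
    name="kube_pod_container_status_restarts_total",
    actionable=True, has_context=True, predictive=True, realtime_and_historic=True,
)
print(pod_restarts.name, "->", "keep" if keep_metric(pod_restarts) else "drop")
```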

07:45.000 --> 07:54.000
And then some things we can discuss related to metrics: what could

07:54.000 --> 07:56.000
mislead you, and how to avoid it.

07:56.000 --> 08:02.000
The first would be over-collection, which we discussed: with a lot of metrics, what is useful

08:03.000 --> 08:07.000
needs to be defined by what you are actually providing the service for.

08:07.000 --> 08:13.000
Metrics that have a high signal and a high value are the prioritized metrics

08:13.000 --> 08:17.000
that you should be collecting, rather than all of the metrics that you could be picking up.

08:17.000 --> 08:24.000
Lack of standardization: there could be metrics

08:24.000 --> 08:29.000
that convey the same thing but in different ways, and you need to figure out which

08:29.000 --> 08:35.000
metrics are better and then move them into a consistent monitoring framework.

08:35.000 --> 08:42.000
Ignoring cardinality: that has risks associated with it, let us say around performance as well as

08:42.000 --> 08:43.000
cost.

08:43.000 --> 08:49.000
So, you need to be selective in your label usage; you need to know which labels are the

08:49.000 --> 08:55.000
ones that you will actually be using and, if not, look to avoid them. Then reactive monitoring:

08:56.000 --> 09:02.000
if you try to alert on everything that is based on reactive metrics,

09:02.000 --> 09:07.000
let us say CPU spikes, that might not be the best way for your fleet to work.

09:07.000 --> 09:16.000
Instead, you should be looking to prioritize proactive metrics and actual event-based metrics.
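To illustrate why label selection matters for cardinality, here is a small sketch estimating a worst-case series count; the label value counts are assumptions chosen only to show the order-of-magnitude effect of dropping one high-cardinality label.

```python
# Worst-case series count: one series per combination of label values.
label_values = {
    "cluster": 200,     # assumed fleet size
    "namespace": 50,
    "pod": 30,          # average pods per namespace, illustrative
    "le": 12,           # histogram buckets
}

def worst_case_series(labels: dict) -> int:
    total = 1
    for count in labels.values():
        total *= count
    return total

print("with all labels:   ", worst_case_series(label_values))
# Dropping the per-pod label (e.g. by aggregating first with recording rules):
without_pod = {k: v for k, v in label_values.items() if k != "pod"}
print("without pod label: ", worst_case_series(without_pod))
```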

09:16.000 --> 09:21.000
Let us focus on the next strategy, which is going from noise to actionable signals.

09:21.000 --> 09:27.000
Transforming noise into actionable signals is crucial for maintaining reliability,

09:27.000 --> 09:31.000
reducing alert fatigue and for incident response.

09:31.000 --> 09:38.000
It basically involves refining raw system data to draw insights that can help in decision making

09:38.000 --> 09:43.000
and remediation actions.

09:43.000 --> 09:49.000
Now the big question: what actually makes an alert actionable in fleet-wide observability?

09:49.000 --> 09:56.000
To get an answer to this question, there is a series of questions that you might need to go through.

09:56.000 --> 09:58.000
Let us say you have an alert trigger.

09:58.000 --> 10:04.000
The very first question should be: is the alert relevant to the system's SLOs or business goals?

10:04.000 --> 10:11.000
If not, then it is best advised to exclude it from the observability setup, or just silence it.

10:11.000 --> 10:16.000
The next question: does the alert provide clear and sufficient context?

10:16.000 --> 10:18.000
Let us say I have an alert trigger.

10:18.000 --> 10:23.000
Is it just telling me something, or is it actually telling me what component is affected?

10:23.000 --> 10:25.000
What component is failing?

10:25.000 --> 10:31.000
What are some outcomes that I might expect to see at some point in time?

10:31.000 --> 10:35.000
The next question: is the alert actionable in real time?

10:35.000 --> 10:42.000
Will an on-call engineer be able to solve this upcoming issue, or is it just a warning?

10:42.000 --> 10:46.000
Or is it just system info that is being shown to me?

10:46.000 --> 10:50.000
The next question: is the alert prioritized appropriately?

10:50.000 --> 10:54.000
Do I know what is the severity of this particular alert?

10:54.000 --> 10:57.000
How is my customer affected at the moment?

10:57.000 --> 11:02.000
If not, then I had better be reassigning and readjusting its priority.

11:02.000 --> 11:06.000
The next: can the alert be automated?

11:06.000 --> 11:09.000
I would say this is a bonus golden question.

11:09.000 --> 11:15.000
Are there certain actions that can be automated and help me reduce the resolution time?

11:15.000 --> 11:19.000
And of course, at the end, we get an actionable alert notification.

11:19.000 --> 11:21.000
Take this as an example.

11:21.000 --> 11:24.000
Let us say we have a KubeAPIDown alert.

11:24.000 --> 11:33.000
Now, we all know it triggers when the kube-apiserver is not reachable by my observability or monitoring stack for 15 minutes.

11:33.000 --> 11:37.000
I know that it is affecting my SLOs and business goals.

11:37.000 --> 11:40.000
I know it has certain clear context.

11:40.000 --> 11:42.000
I know the kube-apiserver is badly affected.

11:42.000 --> 11:45.000
I know it is actionable in real time.

11:45.000 --> 11:48.000
I know it has a very high severity.

11:48.000 --> 11:51.000
My customers' workloads will be badly affected.

11:51.000 --> 11:54.000
I know there are certain actions I can automate.

11:54.000 --> 11:59.000
For example, I can automate the process of restarting the kube-apiserver pods.

11:59.000 --> 12:06.000
So I think going through this will help you get an answer as to whether an alert is actionable or not, fleet-wide.

12:06.000 --> 12:14.000
Now, we should also keep in mind that all these questions might not be relevant at every stage,

12:14.000 --> 12:24.000
but they will definitely help us optimize towards a mature observability scenario.
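The questions above can be turned into a small triage check; this is only a sketch, and the alert fields, the KubeAPIDown-style payload, and the automation step are illustrative assumptions rather than a real alerting pipeline.

```python
# Actionability checklist applied to a hypothetical KubeAPIDown-style alert.
alert = {
    "name": "KubeAPIDown",
    "tied_to_slo": True,              # affects SLOs / business goals
    "clear_context": True,            # names the failing component and impact
    "actionable_in_real_time": True,  # an on-call engineer can act on it
    "severity_assigned": True,        # e.g. critical, customer impact known
    "automatable_steps": ["restart kube-apiserver pods"],  # bonus question
}

def is_actionable(a: dict) -> bool:
    required = ("tied_to_slo", "clear_context",
                "actionable_in_real_time", "severity_assigned")
    return all(a.get(key, False) for key in required)

print(alert["name"], "actionable:", is_actionable(alert))
if alert["automatable_steps"]:
    print("automation candidates:", ", ".join(alert["automatable_steps"]))
```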

12:24.000 --> 12:31.000
Now let's journey down the path of effective alerting, which basically rests on two key principles:

12:32.000 --> 12:36.000
a solid alerting framework, and insights and analysis.

12:36.000 --> 12:48.000
In the alerting framework, we consider a few things, such as early detection, which basically covers proactively identifying potential issues before they escalate.

12:48.000 --> 12:56.000
Intelligent alerting: we use some AI/ML-based anomaly detection so that we focus on real issues.

12:56.000 --> 13:07.000
Custom thresholds: we move towards dynamic thresholds instead of static thresholds so that false positives can be eliminated as much as possible.

13:07.000 --> 13:19.000
Context-rich alerts: enriching my alerts with insights such as logs and remediation steps, including SOPs, that help engineers reduce the MTTR.

13:19.000 --> 13:28.000
Self-healing mechanisms: setting up auto-recovery actions such as restarting pods and essential services.
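As a minimal sketch of the dynamic-threshold idea mentioned above, here is some illustrative Python that flags a sample only when it sits well outside the recent rolling behaviour; the window size, the multiplier, and the CPU series are all assumptions to be tuned per signal.

```python
import statistics

def dynamic_threshold_breach(samples, window=30, k=3.0):
    """Alert only if the latest sample exceeds mean + k * stdev of recent history."""
    history, latest = samples[-window - 1:-1], samples[-1]
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return latest > mean + k * stdev

# Steady CPU usage around 40-48%, followed by a sudden spike to 95%.
cpu = [0.40 + 0.02 * (i % 5) for i in range(60)] + [0.95]
print("breach:", dynamic_threshold_breach(cpu))
```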

13:28.000 --> 13:32.000
Next we move towards insights and analysis.

13:32.000 --> 13:41.000
These cover a few methods, the first being trend analysis, where we analyze performance degradation over a period of time.

13:41.000 --> 13:47.000
Capacity planning, using historical data to foresee resource needs.

13:48.000 --> 14:01.000
User behavior analytics: a mechanism where I try to detect anomalies at as early a stage as possible and then try to forecast any upcoming potential issues.

14:01.000 --> 14:10.000
MTTR reduction: providing insights and logs, and correlating logs with alerts, helps to reduce the MTTR.

14:10.000 --> 14:23.000
And of course, incident post-mortems: learning from incidents so that we do not see a repetition of similar issues in the future.
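For the capacity-planning point, here is a tiny sketch that fits a straight-line trend to historical usage and extrapolates when capacity would run out; the daily usage series and the capacity figure are made-up numbers for illustration.

```python
# Fit a least-squares slope to daily usage and estimate days until capacity.
usage_gib = [410, 415, 422, 431, 436, 445, 452]   # assumed fleet memory, last 7 days
capacity_gib = 512                                 # assumed total capacity

def trend_slope(ys):
    """Least-squares slope of ys against day index 0..n-1 (units per day)."""
    xs = range(len(ys))
    mx, my = sum(xs) / len(ys), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

growth = trend_slope(usage_gib)
days_left = (capacity_gib - usage_gib[-1]) / growth if growth > 0 else float("inf")
print(f"growth ~{growth:.1f} GiB/day, capacity reached in ~{days_left:.0f} days")
```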

14:23.000 --> 14:31.000
Yeah, so that covers the second strategy that we had. Moving to the third: correlation, that is, connecting the dots.

14:31.000 --> 14:40.000
So, the connection here is through the context and the journey, that is, the logs and the traces.

14:40.000 --> 14:51.000
So, logs, as we know, are the detailed insights into specific events, more like the why it happened, and traces are the visualization of a request's journey.

14:51.000 --> 14:57.000
Traces might or might not be relevant when you work at a fleet level.

14:57.000 --> 15:10.000
It depends on what specific goal you have and what specific service you are working on, and by combining or using these, we get correlated insights.

15:10.000 --> 15:22.000
So, putting this together gives you the idea of the three strategies and what the complete fleet-wide observability picture looks like.

15:22.000 --> 15:34.000
It is the common area between metrics, alerts, and logs: metrics being the what is happening, alerts the when does it need attention, and logs or traces the why is it happening.

15:34.000 --> 15:50.000
And when you focus on the correlation part, it helps you combine signals to find the root cause and the solutions you are looking for much faster, and that helps us resolve the issues much faster.
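A minimal sketch of that correlation step, assuming the three signals are already collected somewhere queryable: given the time an alert fired (the when), pull the metric samples (the what) and log lines (the why) from a window around it. All timestamps, names, and values here are invented for illustration.

```python
from datetime import datetime, timedelta

alert_fired = datetime(2025, 2, 1, 10, 15)   # when: the alert firing time
window = timedelta(minutes=5)

logs = [  # why: illustrative log lines with timestamps
    (datetime(2025, 2, 1, 10, 13), "cluster-a", "etcd leader election lost"),
    (datetime(2025, 2, 1, 9, 50), "cluster-a", "image pull back-off"),
]
metric_samples = [  # what: illustrative metric samples
    (datetime(2025, 2, 1, 10, 14), "apiserver_request_errors", 340),
]

def within_window(ts):
    return abs(ts - alert_fired) <= window

print("correlated logs:   ", [line for ts, _, line in logs if within_window(ts)])
print("correlated metrics:", [(n, v) for ts, n, v in metric_samples if within_window(ts)])
```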

15:50.000 --> 16:03.000
Now, let us see how we implement fleet-wide observability, and this will primarily involve leveraging the concepts of SLOs, SLAs, and SLIs.

16:03.000 --> 16:14.000
Now, how do we decide SLOs fleet-wide? Let us say our SLO is that at least 95% of our clusters meet the requirement of getting upgraded within two hours.

16:14.000 --> 16:23.000
Now, we check whether an SLO is being breached; if so, we trigger an alert for it and then we investigate and resolve accordingly.

16:23.000 --> 16:43.000
Once the incident is resolved and the fixes are all in place, we do some post-incident actions, such as revisiting our error budget burn and then reviewing and adjusting our SLOs, and then the cycle goes on and on.
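Here is a small sketch of the error-budget bookkeeping behind an upgrade SLO of that shape; the fleet size, the number of on-time upgrades, and the 95% target are illustrative assumptions.

```python
# Error-budget view of an "X% of clusters upgraded within two hours" SLO.
SLO_TARGET = 0.95          # assumed: 95% of clusters must upgrade on time
fleet_size = 400           # assumed fleet size
upgraded_on_time = 372     # assumed outcome for this upgrade window

compliance = upgraded_on_time / fleet_size
error_budget = 1.0 - SLO_TARGET            # 5% of clusters may miss the window
budget_burned = (1.0 - compliance) / error_budget

print(f"compliance: {compliance:.1%}, error budget burned: {budget_burned:.0%}")
if compliance < SLO_TARGET:
    print("SLO breached -> alert, investigate, then revisit the budget burn and the SLO")
```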

16:43.000 --> 17:01.000
Some benefits of an SLO-driven approach: scalable reliability management. This is because it provides a uniform environment for troubleshooting and monitoring, decreasing cross-service dependencies and providing a better user experience.

17:01.000 --> 17:11.000
Proactive issue management: as we just discussed, with the error budget framework, alerting, and insights, we are proactively working on issue management.

17:11.000 --> 17:21.000
And of course enhanced user experience wherein we are focusing on customer impact and we are trying to reduce the downtime as much as possible.

17:21.000 --> 17:33.000
Aligned goals across teams: as uniformity comes in, we have unified metrics, we have prioritized efforts, everyone knows what they have to do, and of course this means that goals are aligned.

17:33.000 --> 17:47.000
And of course this overall indicates that our environment is continuously improving, and that our observability, the fleet-wide observability as such, is evolving in an effective manner.

17:47.000 --> 18:07.000
And of course business-driven observability: we are able to align our technical and business needs, and we provide end-to-end visibility, which further improves our customer impact, because at the end of the day, if the customer is happy, we are happy.

18:07.000 --> 18:23.000
Looking ahead to the future of observability: AI-driven insights. AI is going to play a huge role in observability by enabling problem resolution, detecting anomalies, and of course identifying issues.

18:23.000 --> 18:39.000
Cloud-native and cross-cloud-provider observability is going to evolve to support and enhance more cross-cloud workloads, providing more insights while not increasing the performance overhead.

18:39.000 --> 18:59.000
Automated remediation: observability is going to move towards proactive monitoring rather than passive monitoring, wherein identifying issues and taking recovery actions are all going to be automated, requiring less and less human intervention.

18:59.000 --> 19:13.000
Open standards and interoperability: open source is going to be the backbone of observability, supporting vendor-neutral and flexible integrations.

19:13.000 --> 19:21.000
This is all we had today in a nutshell, folks; we would be happy to take any questions or discussions you might want to have.

19:21.000 --> 19:41.000
Any questions? Please note we do five minutes of Q&A. If you have the time, stay seated, because if half the room leaves now we will not be able to understand the questions and answers.

19:41.000 --> 19:45.000
Does anyone have a question?

19:45.000 --> 20:07.000
Hi. So you said that we should trigger the alert when the SLA is breached. Shouldn't we trigger when we predict that it will be breached some time ahead, so that we are proactively working on the incident before the SLA itself is breached?

20:07.000 --> 20:17.000
I am going to answer that with a question: how do you know that you are close to an SLA being breached?

20:17.000 --> 20:45.000
That's a good question. So what I would recommend, the approach I have seen being implemented, is this: let's say I talked about a two-hour window, right?

20:45.000 --> 20:57.000
At my internal level, I decide that up to, let's say, one hour it is okay if the upgrade is not progressing, but anything beyond that one hour is something that might become troublesome.

20:57.000 --> 21:05.000
So I set up my alerting at that level only; let's say "cluster upgrade is delayed" is what I get triggered on.

21:05.000 --> 21:12.000
So I am already getting alerted at the per-cluster level when the upgrade is stalling in between.

21:12.000 --> 21:23.000
So the thing that I talked about, the SLA being breached, is at the level where, let's say, you are getting continuous alerts from different clusters that the upgrade has halted in between.

21:23.000 --> 21:27.000
So that is already set up in a proactive way.

21:27.000 --> 21:31.000
Do we have more questions?

21:31.000 --> 21:42.000
So you mentioned that we should enhance our alerts with logs, right? Alerts are based on metrics.

21:42.000 --> 21:49.000
How can we identify which logs to add to those specific alerts, and could you give an example as well?

21:49.000 --> 22:04.000
I think I can take that. So that depends on the type of alerting that you will be focusing on, and also on the observability tool that you might be using.

22:04.000 --> 22:18.000
There are certain observability tools that provide the view at a unified level, let's say one that has logs, metrics, everything at once.

22:18.000 --> 22:30.000
And that has the capability, I think, to do something like log-based alerting as well, alongside the metric-based alerting.

22:30.000 --> 22:42.000
So both metrics and logs can combine to drive the alerting. Also, if you work at, let's say, a per-cluster level and you are integrating something like

22:42.000 --> 22:52.000
PagerDuty, it would mean that you reference something at an SOP level that the engineer works from, to figure out from that alert what the logs mean,

22:52.000 --> 23:06.000
and what the logs are at that point in time. So the enrichment of alerts can vary depending on what the scenario is and what the actual workload being affected is.

23:06.000 --> 23:23.000
If you are storing all of the metrics and logs in one place, that could mean you can enrich the alert directly; if not, if it is at a, let's say, wider level, that means the engineer needs to do a bit of research before getting the exact cause behind that alert.
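As a sketch of the enrichment being described, here is some illustrative Python that attaches the most recent matching log lines and an SOP link to an alert payload before it is routed onwards; the payload shape, the log lines, and the SOP URL are all assumptions, not a real integration.

```python
recent_logs = [
    "kube-apiserver: connection refused",
    "kube-apiserver: leader changed",
]

alert = {"name": "KubeAPIDown", "severity": "critical", "cluster": "cluster-a"}

def enrich(alert, logs, sop_url):
    """Return a copy of the alert with the last few log lines and an SOP reference."""
    enriched = dict(alert)
    enriched["recent_logs"] = logs[-5:]   # keep only the last few lines
    enriched["sop"] = sop_url
    return enriched

print(enrich(alert, recent_logs, "https://example.com/sop/kube-api-down"))
```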

23:23.000 --> 23:32.000
More questions? Oh, there in the back.

23:32.000 --> 23:45.000
Hi, a very basic question. Do you have any publication of what the basic SLIs, the indicators, are, or what the basic SLOs are, for the different parts of your Kubernetes cluster as well as your whole fleet?

23:45.000 --> 23:57.000
Some kind of examples that we could use in our deployments, and how do you define those SLOs?

23:57.000 --> 24:06.000
Now, please take a mic, because there are people watching the video and the stream.

24:06.000 --> 24:11.000
Okay, so to answer that, it will depend from service to service.

24:11.000 --> 24:16.000
For example, if we talk about just Kubernetes, let's say upstream Kubernetes,

24:16.000 --> 24:23.000
you can refer to Google's guidelines on that. Google has already published what should be monitored and where.

24:23.000 --> 24:31.000
And since we work on managed OpenShift, that is our service, I don't think that is publicly open as of now.

24:31.000 --> 24:41.000
But you can definitely refer to the ones that are available from Google, and the Google SRE book is the bible for managed services when it comes to Kubernetes.

24:41.000 --> 24:51.000
So you'll find everything there, and as a bonus, Atlassian also has some of that available. So it's all openly available.

24:52.000 --> 24:58.000
Yeah, and something to add: it's a continuously evolving process as well, like we mentioned.

24:58.000 --> 25:09.000
So something that you might start with at a very basic SLO level, you then try to improve based on what your team requires or what you are actually providing.

25:09.000 --> 25:14.000
So it kind of keeps evolving as well.

25:14.000 --> 25:18.000
All right, one more question.

25:18.000 --> 25:21.000
No? Then thanks again.

