An important SRE best practice is analyzing and learning from incidents. When an incident occurs, you shouldn’t think of it as a setback, but as an opportunity to grow. Good incident analysis involves building an incident retrospective. This document contains everything from incident metrics to the narratives of those involved. The metrics aren’t the whole story, but they can help teams make data-driven decisions.
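But choosing which metrics to analyze can be difficult. You need to find the valuable signals among the noise. You’ll want your metrics to reflect how the incident impacted your customers. In this blog post, we’ll cover common incident metrics, how SLIs and SLOs bring in the customer’s perspective, how error budgets help teams prioritize reliability, and how to capture what you learn in an incident retrospective.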
Here are some common categories of metrics and how they can be helpful.
Number of incidents: This is the most basic thing you’ll want to keep track of. You can make this measure more meaningful by classifying your incidents.
Number of alerts: Different types of incidents generate different volumes of alerts. Keeping track of how many alerts fire, and who receives them, can help balance on-call loads.
Mean time to detect: This tracks the average amount of time it takes for your system to register an incident. To lower this time, consider investing in monitoring tools.
Mean time to acknowledge: This is the average time between the system registering an incident and the team responding. Alerting and on-call policies can impact this indicator.
Mean time to resolve: This covers the average time between the incident response starting and the service returning to full functionality. This number will be highly variable depending on the service, type of incident, and more.
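As a rough illustration of how the three timing metrics above relate to each other, here is a minimal Python sketch. The `Incident` record and its field names are hypothetical, so map them to whatever your incident tracker actually exports. It follows the definitions above: detection is measured from the start of the fault, acknowledgment from detection, and resolution from the start of the response.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical field names, not taken from any specific tool.
    started_at: datetime       # when the fault actually began
    detected_at: datetime      # when monitoring registered it
    acknowledged_at: datetime  # when a responder picked it up
    resolved_at: datetime      # when service returned to full functionality


def mean_minutes(deltas):
    """Average a sequence of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mttx(incidents):
    """Return mean time to detect, acknowledge, and resolve, in minutes."""
    return {
        "mttd": mean_minutes(i.detected_at - i.started_at for i in incidents),
        "mtta": mean_minutes(i.acknowledged_at - i.detected_at for i in incidents),
        "mttr": mean_minutes(i.resolved_at - i.acknowledged_at for i in incidents),
    }


# Made-up sample data for two incidents.
t0 = datetime(2024, 1, 1, 12, 0)
m = lambda n: timedelta(minutes=n)
incidents = [
    Incident(t0, t0 + m(5), t0 + m(7), t0 + m(37)),
    Incident(t0, t0 + m(15), t0 + m(21), t0 + m(111)),
]
print(mttx(incidents))
```

Keeping the three intervals separate, rather than reporting one end-to-end duration, makes it easier to see which stage of the response a change actually affected.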
These metrics, commonly referred to as MTTx metrics, may not always reflect the improvements you’re making in your overall reliability efforts. It’s important to understand which metrics are most indicative of certain areas of improvement. As Štěpán Davidovič noted in Incident Metrics in SRE, “If you are improving one step of the journey, including all other steps in the aggregate makes your ability to understand the impact of the change worse.”
There are alternatives to MTTx metrics that can better depict changes in reliability. As Štěpán also noted, SLOs can help answer the most important question, which is “Is our reliability getting better or worse, as a company?”
Metrics can offer insights into your practices, but you need more context. Without this context and deeper analysis, these numbers can be shallow indicators of how well your team is responding to incidents. They’re also very team-centric metrics. Customer-centric metrics can shed more light on the impact of an incident. To reflect customer happiness in your metrics, you can use SRE tools like SLIs and SLOs. Let’s break down the process of how to develop these.
Get into the mindset of a typical customer. Think about how they engage with your service. What aspects do they rely on? What slowdowns would annoy them most? Think of each action they take in your service as part of a user journey. Partnering with customer support or product teams can be useful during this process.
Determine the monitoring data that is most representative of what your customers find valuable. This could include the latency of a search result, the freshness of the search data, or the availability of the search service as a whole. Once you’ve determined which data type best fits your customers’ needs, you can begin measuring your performance.
Now consider what values of the SLI would be unacceptable to the customer. This threshold is where you’ll set your SLO, or service level objective. It should be comfortably stricter than any contractual commitments (SLAs) so the team has wiggle room to react before an agreement is breached. When incidents occur, determine their impact on customer happiness by looking at how service performance measures against the SLO.
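To make the SLI and SLO relationship concrete, here is a small Python sketch that uses search latency as the SLI. The 300 ms threshold, the 95% target, and the sample latencies are all assumptions chosen for illustration, not recommended values.

```python
def latency_sli(latencies_ms, threshold_ms):
    """SLI: fraction of requests served within the latency threshold."""
    if not latencies_ms:
        return 1.0  # assumption: no traffic is treated as no failures
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli, slo_target):
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


# Made-up latencies (in milliseconds) for ten search requests.
sample = [120, 180, 250, 90, 400, 130, 110, 95, 300, 105]
sli = latency_sli(sample, threshold_ms=300)
print(f"SLI = {sli:.0%}, SLO met: {meets_slo(sli, slo_target=0.95)}")
```

In this sample, one request in ten is slower than the threshold, so the service misses a 95% objective even though nothing was fully down.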
SLOs allow teams to have a better idea of how impactful an incident is. This can be more indicative of service health than many other solitary metrics.
For instance, imagine a team has a very minor incident. If the incident went unresolved for a week or two, it might have little to no impact on the customer. Yet, if you’re setting goals on MTTR, this outlier would “point” to an issue in your incident response process.
But, when you look at the incident through the lens of customer happiness, the response was appropriate. This context is important to note.
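The effect of a single low-impact outlier can be sketched in a few lines of Python. The incident names, durations, and failed-request counts below are invented for illustration.

```python
from statistics import mean

# Hypothetical data: (name, minutes to resolve, failed customer requests).
incidents = [
    ("checkout outage", 30, 4200),
    ("search latency spike", 45, 1800),
    ("login errors", 60, 2500),
    ("stale footer link", 14400, 0),  # minor issue left open for ten days
]


def mttr_minutes(rows):
    """Mean time to resolve across the given incidents, in minutes."""
    return mean(minutes for _, minutes, _ in rows)


def share_of_failed_requests(rows, name):
    """Fraction of all failed requests attributable to one incident."""
    total = sum(failed for _, _, failed in rows)
    return next(failed for n, _, failed in rows if n == name) / total


mttr_all = mttr_minutes(incidents)
mttr_without_outlier = mttr_minutes(incidents[:-1])
outlier_share = share_of_failed_requests(incidents, "stale footer link")

print(f"MTTR with outlier:    {mttr_all:.2f} min")
print(f"MTTR without outlier: {mttr_without_outlier:.2f} min")
print(f"Outlier share of failed requests: {outlier_share:.0%}")
```

The long-running minor incident dominates the mean while contributing nothing to customer-facing failures, which is the mismatch the example above describes.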
Another important consideration is the error budget. This also informs how critical an incident is. Let’s take a look at error budgets and how they help teams prioritize reliability.
The error budget is the complement of the SLO: a 99.9% SLO, for example, leaves an error budget of 0.1%. This reflects how much unreliability the system can experience within a window. As the error budget decreases, certain policies can kick in to preserve the SLO or respond to reliability challenges.
For example, if a service is only halfway through its window but has burned through 85% of the error budget, the policy might trigger a localized code freeze to keep from exceeding the error budget.
Or, a team that has exceeded its error budget in each of the last two windows might reallocate development resources to work on reliability needs. These methods can help teams maintain customer happiness and better prioritize development work.
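Here is a minimal Python sketch of how an error budget and a policy like the ones above could be computed. The 85% freeze threshold and the two-window rule come from the examples above. The function names, the 30-day window, and the action strings are assumptions for illustration, not a standard policy.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unreliability: (1 - SLO) times the window length."""
    return (1.0 - slo_target) * window_days * 24 * 60


def policy_action(window_elapsed: float, budget_consumed: float,
                  consecutive_windows_exceeded: int = 0,
                  freeze_threshold: float = 0.85) -> str:
    """Pick an action from the fraction of the window elapsed and budget spent.

    Both fractions are between 0 and 1 (budget_consumed may exceed 1).
    """
    if consecutive_windows_exceeded >= 2:
        return "reallocate development time to reliability work"
    # Freeze only when the budget is nearly gone and burning ahead of schedule.
    if budget_consumed >= freeze_threshold and budget_consumed > window_elapsed:
        return "localized code freeze"
    return "continue normal releases"


# A 99.9% SLO over a 30-day window allows about 43.2 minutes of unreliability.
budget = error_budget_minutes(slo_target=0.999, window_days=30)
bad_minutes_so_far = 37  # made-up figure, 15 days into the window
consumed = bad_minutes_so_far / budget

print(f"Budget: {budget:.1f} min, consumed: {consumed:.0%}")
print(policy_action(window_elapsed=0.5, budget_consumed=consumed))
```

Comparing budget consumed against window elapsed, rather than looking at consumption alone, separates a service that is burning its budget early from one that simply reaches the end of the window with little budget left.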
After an incident, you should record your insights in an incident retrospective. Some of your incident metrics can provide valuable context for each incident. Were there any outliers? Did an incident take an unusually long time to detect? Is an incident of this severity rare in this service area? Include a discussion section where you analyze potential contributing factors.
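One simple way to spot incidents that took unusually long to detect is to compare each one against the median. The sketch below is a heuristic only: the factor of three is an arbitrary assumption, and the detection times are made up.

```python
from statistics import median


def unusually_slow(detect_minutes, factor=3.0):
    """Return indices of incidents whose detection time exceeds factor x the median."""
    typical = median(detect_minutes)
    return [i for i, m in enumerate(detect_minutes) if m > factor * typical]


# Made-up detection times, in minutes, for five incidents.
detect_minutes = [4, 6, 5, 7, 42]
flagged = unusually_slow(detect_minutes)
print(flagged)
```

A flagged incident is a prompt for discussion in the retrospective, not a verdict. The narrative of what happened still has to explain why detection was slow.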
Teams should share these incident retrospectives throughout the organization. This helps everyone learn from failure. It can also inform further incident response policies. Look at what worked and what didn’t, and adjust. If a runbook was out of date, it’s time to update it. If a gap in monitoring caused a delay in response, look for ways to close that gap.
These learnings will benefit the customer, as well. As you get better at analyzing and learning from incidents, your response process will also mature. By looking at metrics through a customer-centric lens, you can home in on the metrics that matter. SLOs and error budgets are important indicators of your system’s reliability performance. They can act as guides when other metrics appear inconclusive.