Data helps best-in-class teams make the right decisions. Analyzing your system’s metrics shows you where to invest time and resources. A common type of metric is Mean Time to X, or MTTx. These metrics detail the average time it takes for something to happen. The “x” can represent events or stages in a system’s incident response process.
Yet, MTTx metrics rarely tell the whole story of a system’s reliability. To understand what MTTx metrics are really telling you, you’ll need to combine them with other data. In this blog post, we’ll cover:
For each metric, trends can suggest where to focus improvement work. For example, if the mean time to detect (MTTD) is increasing, you might work to improve your monitoring. But MTTx metrics alone are insufficient to identify trends in reliability.
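As a rough illustration of tracking such a trend, the sketch below groups incidents by month and averages the time from start to detection. The incident records and the `mttd_by_month` helper are hypothetical and stand in for whatever your incident tracker exports.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (started_at, detected_at).
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 12)),
    (datetime(2024, 1, 19, 2, 30), datetime(2024, 1, 19, 2, 38)),
    (datetime(2024, 2, 7, 14, 0), datetime(2024, 2, 7, 14, 25)),
    (datetime(2024, 2, 21, 9, 15), datetime(2024, 2, 21, 9, 50)),
]


def mttd_by_month(records):
    """Return {"YYYY-MM": mean minutes from incident start to detection}."""
    buckets = defaultdict(list)
    for started_at, detected_at in records:
        minutes = (detected_at - started_at).total_seconds() / 60
        buckets[started_at.strftime("%Y-%m")].append(minutes)
    # Sort by month so the result reads as a time series.
    return {month: mean(values) for month, values in sorted(buckets.items())}


monthly_mttd = mttd_by_month(incidents)
for month, value in monthly_mttd.items():
    print(month, round(value, 1))
```

With only a handful of incidents per month, a rising number like this is a prompt for investigation rather than proof of a trend.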
In an experiment detailed in the ebook Incident Metrics in SRE, author Štěpán Davidovič ran simulations of multiple systems with varying incident frequencies and durations. He generated sets of hypothetical data and compared the MTTx metrics from each. The goal was to determine whether changes made to improve MTTx metrics (such as buying a tool) would actually be reflected in those metrics.
The findings were conclusive: “MTTx metrics will probably mislead you.” As the experiment stated, “Even though in the simulation the improvement always worked, 38% of the simulations had the MTTR difference fall below zero for Company A, 40% for Company B, and 20% for Company C. Looking at the absolute change in MTTR, the probability of seeing at least a 15-minute improvement is only 49%, 50%, and 64%, respectively. Even though the product in the scenario worked and shortened incidents, the odds of detecting any improvement at all are well outside the tolerance of 10% random flukes.”
This means that even if your tool or process improvement is working, you may not be able to detect it in your MTTx metrics. That makes it hard to understand what actually improves incident response. It also tells you very little about overall system reliability.
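To get a feel for why this happens, here is a minimal Monte Carlo sketch. It is not the ebook's simulation: the lognormal duration distribution, its parameters, the 20 incidents per period, and the 10% improvement are all assumptions chosen for illustration. Every "after" incident is shortened by construction, yet the measured MTTR can still look worse purely by chance.

```python
import random
from statistics import mean


def simulate_mttr_difference(rng, incidents_per_period=20, improvement=0.10):
    """Return MTTR_before - MTTR_after for one simulated comparison.

    Durations are drawn from a lognormal distribution (an assumption), and
    every "after" incident is shortened by `improvement`, so the change
    always works by construction.
    """
    before = [rng.lognormvariate(4.0, 1.0) for _ in range(incidents_per_period)]
    after = [
        rng.lognormvariate(4.0, 1.0) * (1 - improvement)
        for _ in range(incidents_per_period)
    ]
    return mean(before) - mean(after)


def fraction_looking_worse(trials=10_000, seed=0):
    """Share of comparisons where MTTR appears to get worse (difference < 0)."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    diffs = [simulate_mttr_difference(rng) for _ in range(trials)]
    return sum(d < 0 for d in diffs) / trials


share = fraction_looking_worse()
print(f"{share:.0%} of simulated comparisons show MTTR getting worse")
```

Changing `incidents_per_period` or `improvement` shows how sample size and effect size affect the odds of seeing a real improvement in the average.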
MTTx metrics are more helpful when contextualized with other information about the incident. As Blameless SRE Architect Kurt Andersen suggests, “What can be enlightening is to combine these metrics with some form of incident categorization.” Using your incident classification process, you can analyze MTTx metrics for a smaller subset of incidents.
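A minimal sketch of that idea, assuming each incident already carries a category label from your classification process (the categories and durations below are made up):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical incidents: (category, minutes to resolve).
incidents = [
    ("database", 95), ("database", 120), ("deploy", 20),
    ("deploy", 35), ("network", 240), ("deploy", 25),
]


def mttr_by_category(records):
    """Group resolution times by category and average each group."""
    grouped = defaultdict(list)
    for category, minutes in records:
        grouped[category].append(minutes)
    return {category: mean(values) for category, values in grouped.items()}


overall_mttr = mean(minutes for _, minutes in incidents)
per_category = mttr_by_category(incidents)

print(f"overall: {overall_mttr:.1f} min")
for category, value in per_category.items():
    print(f"{category}: {value:.1f} min")
```

The overall average hides the fact that the subsets behave very differently, which is the kind of detail a single MTTR figure cannot show.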
Here are some ways you can further categorize incidents to work with more meaningful data:
Here are some examples of how these combinations can lead to actionable change:
As you conduct deeper analysis on your metrics, you’ll find there’s no single MTTx metric that can tell the whole story. However, there are better ways you can analyze your data to gain insight into your overall reliability and incident response processes.
One of the most important things you want to assess after an incident is customer impact. This can be difficult to determine. Reliability is subjective, based on how customers perceive your service.
To determine the impact on customer happiness, you can use SLIs and SLOs. SLIs, or service level indicators, measure how key areas of your services are performing against customer expectations. SLOs, or service level objectives, mark the threshold at which unreliability starts to hurt customers.
How you perform against your SLO is often a better indicator of reliability than MTTx metrics. This is because reliability is determined by your users. SLOs help you understand the effect that incidents have on customer happiness. Because SLOs are movable goals that change as your customers’ needs change, you and your team should never find yourselves held to an arbitrary number. Revision is part of setting good SLOs.
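As a simple illustration, the sketch below computes an availability SLI as the share of successful requests in a window and compares it with an SLO target. The request counts, the 99.9% target, and the `availability_sli` helper are hypothetical.

```python
def availability_sli(good_requests, total_requests):
    """Fraction of requests in the window that met customer expectations."""
    if total_requests == 0:
        raise ValueError("no requests in the measurement window")
    return good_requests / total_requests


SLO_TARGET = 0.999  # hypothetical objective; revise it as customer needs change

sli = availability_sli(good_requests=1_999_500, total_requests=2_000_000)
meets_slo = sli >= SLO_TARGET

print(f"SLI: {sli:.5f}, target: {SLO_TARGET}, meets SLO: {meets_slo}")
```

Running the same check for the window around an incident shows how much that incident mattered to customers, regardless of how long it took to resolve.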
Kurt also suggests looking at outliers instead of averages: “In general, I don't find the ‘central tendency’ to be as interesting as investigating outliers for a distribution.” Although they may not represent the typical incidents, outliers in your MTTx trends can be valuable.
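One way to surface such outliers, sketched here with an interquartile-range rule (a common heuristic, not something prescribed in the quote above) and made-up resolution times:

```python
from statistics import quantiles

# Hypothetical resolution times in minutes.
durations = [22, 30, 18, 41, 27, 35, 310, 25, 29, 33]


def long_outliers(values, k=1.5):
    """Return values above Q3 + k * IQR, a common rule of thumb for outliers."""
    q1, _, q3 = quantiles(values, n=4)
    threshold = q3 + k * (q3 - q1)
    return [v for v in values if v > threshold]


print(long_outliers(durations))
```

The threshold multiplier `k` is a judgment call. The point is to produce a short list of incidents worth a closer, qualitative look.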
Discover what was different about the incident that made it an outlier. Is it something that could occur again? You might need to focus on a qualitative rather than quantitative approach. Lorin Hochstein breaks this concept down in a blog post. Rather than relying on metrics to prevent major incidents, Lorin suggests looking for “signals.” Use your team’s expertise to catch and act on noteworthy data.
In a post for Adaptive Capacity Labs, John Allspaw looks at how to move beyond shallow data. His conclusion is that “meaningful insight comes from studying how real people do real work under real conditions.” Metrics alone cannot contain the many complicating factors in real work.
John shows how to build a “thicker” understanding of data. You can map out how an incident developed and was resolved. This is much “messier” than a single metric, but often more insightful. These richer representations are most worth building for incidents that deviate from the norm.
When you rely on shallow metrics, it can become tempting to game the system or even to give up on meeting KPIs. Team members may feel that their performance is measured by a particular (and sometimes irrelevant) metric. They could be tempted to improve just that metric instead of actually improving the system. This phenomenon exists in many industries, from manufacturing to healthcare.
This causes a multitude of problems:
To empower and encourage employees, you need to cultivate a blameless culture. Moving away from shallow metrics is part of this transformation. Emphasize that everyone has a shared goal of customer satisfaction. Using SLOs as your guiding metric can help teams quantify this.
Emphasize that there is no single “score” for an employee or team’s performance. This encourages teams to see incidents as a chance to learn rather than a major setback.
If you’re looking to get more from your metrics, we can help. Blameless SLOs put incidents in the context of customer satisfaction, and Reliability Insights allows teams to sort MTTx metrics into more informative subsets of data. To see how, feel free to sign up for a demo.