The most essential lesson of SRE is that failure is inevitable. This shouldn’t be a cause for despair. SRE shows how embracing failure is empowering. By celebrating failure, you can accelerate development and foster a culture of learning.
Rather than hoping to prevent failure, SRE prepares you to respond well to it. It can be difficult, if not impossible, to anticipate where failure will occur in complex systems given unknown unknowns. It follows that understanding how to evaluate severity and respond appropriately becomes a complex task, making preparedness even more critical. In this blog post, we’ll look at SRE techniques for mitigating the impacts of system failure, including building runbooks, assessing with SLOs, monitoring metrics, and building a blameless culture.
Runbooks are tools that help engineers respond to incidents. A runbook contains detailed checks and steps for incident responders to follow. For example, consider a server outage. The corresponding runbook could instruct the engineer to run a series of diagnostic tests. Based on the results of the tests, the runbook would then recommend a fix.
The goal of runbooks is to reduce toil for responding engineers. Runbooks codify knowledge and experience. This helps to decrease tribal knowledge and limit information silos. Where possible, runbooks can be automated to save time and minimize toil. This is critical to helping responders get services back online faster.
To create a runbook, first consider potential failure points of your system. What would need to change to bring your system back to normal operation? Break that solution down into steps that are as simple as possible. You can rearrange these steps in a modular fashion in order to build different runbooks. Look for situations where these steps are insufficient, and where a more complex response is necessary. These can be areas that are especially sensitive to failure, as they involve processes that aren’t as standardized. Consider how to document the appropriate response for these areas.
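As a rough illustration of modular runbook steps, here is a minimal Python sketch. Everything in it is hypothetical: the `Step` type, the `run_runbook` helper, the thresholds, and the remedies are assumptions chosen for the example, not part of any particular runbook tool. Each step pairs a diagnostic check with a recommended fix, and the same steps are reused to build two different runbooks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    """One reusable runbook step: a diagnostic check plus a recommended fix."""
    name: str
    check: Callable[[Dict[str, float]], bool]  # returns True when the check passes
    remedy: str


def run_runbook(steps: List[Step], state: Dict[str, float]) -> List[str]:
    """Run each check in order and collect the remedy for every failing check."""
    return [f"{s.name}: {s.remedy}" for s in steps if not s.check(state)]


# Hypothetical steps with illustrative thresholds.
health_step = Step("health endpoint", lambda s: s["http_status"] == 200,
                   "restart the service")
disk_step = Step("disk space", lambda s: s["disk_used_pct"] < 90,
                 "rotate logs or expand the volume")
memory_step = Step("memory", lambda s: s["mem_used_pct"] < 85,
                   "restart the leaking process")

# The same steps are rearranged into different runbooks.
server_outage_runbook = [health_step, disk_step, memory_step]
disk_alert_runbook = [disk_step]

# Made-up diagnostic readings for a server outage.
state = {"http_status": 503, "disk_used_pct": 95.0, "mem_used_pct": 40.0}
recommendations = run_runbook(server_outage_runbook, state)
for line in recommendations:
    print(line)
```

In practice the checks would query real systems rather than read a dictionary, but keeping them as small, independent units is what makes them easy to recombine and, later, to automate.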
Runbook documentation can be helpful in making your runbooks more robust. This gives you space to consolidate all the information around each step of the process. The documentation can include code snippets that enable simple steps and tests. It can also include more complex information, like diagrams or models. This enables more nuanced, context-rich decision making.
Just as failure is inevitable, there will always be incidents without runbooks. This is why it is essential to review and revise existing runbooks and to create new ones. After each incident, look at what worked and what didn’t. Is there documentation, dashboarding, or other context you could add to clarify? Or do certain incidents require you to create a new runbook altogether? As you review, you’ll also find opportunities to anticipate failure. If a runbook was ineffective at dealing with a certain incident, think about why. Is there other existing but underutilized documentation that overlaps in those areas?
In a complex system, it can be difficult to measure the impact of possible failures. SRE suggests that you look to the customer to classify the incident’s impact. Consider an incident that causes downtime for a particular service. Imagine that this service is only used by a small fraction of customers. Even though this is a total outage, it may be less impactful than an incident which causes a popular service to lag slightly.
SLIs and SLOs allow you to evaluate customer impact. SLIs, or service level indicators, are monitoring metrics that reflect the most important user journeys. SLOs, or service level objectives, set the minimum acceptable level for one or more SLIs. When incidents occur, the impact shows up in the SLI and can be measured against the SLO. The amount of unreliability the SLO tolerates is known as your error budget, and the distance between the current metric and the SLO shows how much of that budget remains. As the error budget runs out, policies can kick in to redirect efforts towards maintaining reliability.
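To make the arithmetic concrete, here is a small Python sketch of an error budget for a request-based SLI. It assumes the SLI is the fraction of good events over a fixed window; the function names and the sample numbers are illustrative, not taken from any specific SLO tool.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    return 1.0 - bad_events / error_budget(slo_target, total_events)


# Illustrative numbers: a 99.9% SLO over one million requests allows
# roughly 1,000 bad requests. With 250 bad requests so far, about three
# quarters of the budget is left.
budget = error_budget(0.999, 1_000_000)
remaining = budget_remaining(0.999, 1_000_000, 250)
print(f"budget: {budget:.0f} bad events, remaining: {remaining:.0%}")
```

An error budget policy can then be keyed to the remaining fraction, for example by pausing risky releases once it drops below a threshold the team has agreed on.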
These tools can also help you anticipate the customer effect of incidents. By looking at the rate your error budget decreases, you can predict when your SLO may be breached. SLO breaches may represent a major failure, where customers are likely to be impacted. This gives you a chance to anticipate the most impactful failures and intervene before your error budget runs out.
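One simple way to sketch this prediction is a linear projection: divide the remaining budget by the rate at which it was recently consumed. The Python below assumes a constant burn rate, which real traffic rarely has, so treat the result as a rough early warning rather than a forecast. The function name and inputs are hypothetical.

```python
def hours_until_exhausted(remaining_fraction: float, burned_last_hour: float) -> float:
    """Project hours until the error budget hits zero.

    remaining_fraction: share of the budget still unspent (0.75 means 75%).
    burned_last_hour: share of the total budget consumed in the last hour.
    Assumes the burn rate stays constant, which is a simplification.
    """
    if burned_last_hour <= 0:
        return float("inf")  # nothing is burning, so no projected breach
    return max(remaining_fraction, 0.0) / burned_last_hour


# Illustrative numbers: 75% of the budget left, 5% burned in the last hour.
eta = hours_until_exhausted(0.75, 0.05)
print(f"projected breach in about {eta:.0f} hours")
```

If the projected time is shorter than the time left in the SLO window, that is a signal to intervene before customers feel the breach.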
You can also analyze patterns in incidents to look for potential failures. Use incident retrospectives to build a library of data about past incidents. Combine this data with the impact these incidents had on SLOs. This will allow you to see where failure is most likely, as well as what incidents are most customer-impacting.
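A minimal sketch of that analysis in Python might look like the following. The incident records, service names, and the `budget_burned` field are made up for illustration. In practice this data would come from your retrospectives and SLO tooling.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical retrospective records: which service was affected and what
# share of its error budget the incident consumed.
incidents = [
    {"service": "checkout", "budget_burned": 0.20},
    {"service": "search", "budget_burned": 0.05},
    {"service": "checkout", "budget_burned": 0.10},
]


def rank_by_impact(records: List[dict]) -> List[Tuple[str, Dict[str, float]]]:
    """Group incidents by service and sort by total error budget consumed."""
    summary: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"count": 0, "budget_burned": 0.0}
    )
    for record in records:
        entry = summary[record["service"]]
        entry["count"] += 1
        entry["budget_burned"] += record["budget_burned"]
    return sorted(summary.items(), key=lambda kv: kv[1]["budget_burned"], reverse=True)


ranked = rank_by_impact(incidents)
for service, stats in ranked:
    print(f"{service}: {stats['count']} incidents, "
          f"{stats['budget_burned']:.0%} of budget burned")
```

Sorting by budget consumed rather than by incident count keeps the focus on customer impact: a service with a few severe incidents ranks above one with many minor ones.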
Another useful way to anticipate failure is by monitoring data. Failure doesn’t always occur in sudden incidents. Sometimes, things will decline in operational quality until they fail. Keeping an eye on key metrics will ensure these failures are caught in time. Monitoring operational data isn’t limited to the functioning of the service itself. It also involves looking at how the engineering team is operating.
Key metrics for operational health include:
Metrics like the ratio of thrash to development may not directly connect to points of failure. However, if your engineering team is operating effectively, you’ll be much better prepared to deal with failure. You’ll be able to respond much more quickly, and spend more time proactively anticipating and preparing for potential failure scenarios. Additionally, by decreasing toil, you’ll be able to spend more time on work that provides business value, such as innovation and new feature development. This value-adding work can also include investments in reliability that decrease the likelihood of certain future incidents.
Failure can be difficult to accept, but if people are blamed for failure, it will dissuade teams from discussing and preparing for potential issues in a transparent way. People won’t want to report their concerns if they’re afraid that they’ll be punished for them. This is debilitating for anticipating future failure events. As such, SRE teaches us that it is crucial to foster a culture of blamelessness.
When failure occurs in a blameless culture, teams can work together to find systemic causes. Even if someone made a mistake, in such a context there is ‘no such thing as human error’, as incidents are understood to be multi-layered problems within socio-technical systems. Instead of assigning fault, the team investigates why the mistake was made. Healthy, high-performing software teams incorporate this practice into incident retrospectives and other team interactions.
For people to truly believe in the blameless culture, they need to feel psychologically safe. When people are psychologically safe, they’re more curious and creative. Laura Delizonna notes in her article for the Harvard Business Review that curiosity is an alternative to blame. This is instrumental for anticipating future failure. Employees will proactively look for things that could go wrong. They’ll raise these issues with confidence that they’ll be protected, even if they might seem responsible. Once this blameless culture is established, the whole team can look forward together.
Blameless offers tools such as runbook documentation, incident retrospectives, and SLOs to help you effectively prepare for and respond to failure. To see how, check out a demo.