Originally published at The New Stack (https://thenewstack.io/how-to-avoid-the-5-sre-implementation-traps-that-catch-even-the-best-teams/).
Today’s businesses must release new features into production on a regular basis. Yet the majority fail to deliver these features with the quality users expect. A recent DevOps survey conducted by Harvard Business Review confirms this software delivery pain point: While 86 percent of respondents said it is important for their organizations to build and deploy software fast, only 10 percent report being successful at doing so.
We hear from companies that are experiencing the same challenges when it comes to implementing SRE: high importance, low success rate. SRE (Site Reliability Engineering) is one method that puts DevOps (collaboration between Development and Operations teams to deploy code to production) into practice. Yet, whether you are implementing SRE or DevOps, your best intentions are likely going to disappoint you on the first try. Much like with DevOps, the path to successful SRE implementation is not as elusive as it may seem. If you take the right steps, you can avoid costly mistakes and team dissatisfaction.
How can organizations maximize their chance at SRE success while avoiding potential pitfalls? We’ve pinpointed five traps that companies fall into on the path to SRE adoption. Here’s what they are and how you can avoid them:
1. You don’t have enough cross-team usage or buy-in.
You’ve formed a core group that believes in SRE. This group may include members of your engineering, ops or DevOps teams, or it may even be a full-fledged SRE team. This is a great start, but make sure you also secure enough cross-team buy-in. We’ve seen companies assign only one or two SREs for the entire organization, and that proved to be insufficient. SRE needs buy-in from ops, engineering, and product to stick. For high severity issues, sales, support and customer success should be looped in as well. Without broader buy-in, teams outside of the SRE group will lack motivation to improve their operations. These operational improvements, in turn, could prevent the underlying issues that cause the incidents the SRE group must then address.
Another challenge: How do you manage all the streams of cross-team communication? During an incident, it may be too much of a burden for one person to inform and get input from so many teams. One solution we’ve seen work well is to split the incident command role and create a dedicated communications lead role. The primary metric for the comms lead is how well stakeholders stay synchronized, rather than the resolution time. With cross-team buy-in, all of these communication streams can live in one SRE platform.
2. Your difficult and dense process is slowing down incident response.
If your incident response process references multiple docs and wikis, that’s too much. Established work, including The Checklist Manifesto, shows that when humans are under stress, tasks should be as simple as possible. The same applies to the high severity incidents that your company faces. Even the best-written procedure docs are hard to follow under stress. But a single checklist is insufficient for complex tasks and team activities. In the most successful SRE implementations, we observe that teams use checklists but customize them by role. The ideal system shows the right information and level of detail at the right time, while tasks stay as small as possible. And when looking back at the system, each checklist item can be measured for efficiency to allow for future improvements.
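As a minimal sketch of role-based checklists, the Python below filters one shared checklist down to the open tasks for a single role, and ranks completed items by how long they took so they can be reviewed later. The class name, field names, and role labels are all hypothetical and chosen for illustration; they are not taken from any particular SRE tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChecklistItem:
    role: str                              # hypothetical role label, e.g. "incident_commander"
    task: str                              # one small, concrete step
    seconds_taken: Optional[float] = None  # None until the item is completed


def open_items(checklist: List[ChecklistItem], role: str) -> List[str]:
    """Return only the unfinished tasks that belong to the given role."""
    return [
        item.task
        for item in checklist
        if item.role == role and item.seconds_taken is None
    ]


def slowest_items(checklist: List[ChecklistItem], top: int = 3) -> List[ChecklistItem]:
    """Return the completed items that took longest, as candidates for improvement."""
    done = [item for item in checklist if item.seconds_taken is not None]
    return sorted(done, key=lambda item: item.seconds_taken, reverse=True)[:top]
```

Each responder sees only the output of open_items for their own role, which keeps the list short under stress, while slowest_items supports the after-the-fact review of individual checklist steps.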
3. Teams underutilize retrospectives and don’t apply in-depth learnings.
Your company creates an incident retrospective for every incident. That’s already a big win as many teams don’t get this far due to the toil of creating retrospectives. If you’re still not at the point where your teams can create comprehensive retrospectives with low toil, then you may need to automate this process. Assuming you are past that, the next major pitfall we see is that retrospectives get written but are not reviewed again. Memory fades over time and issues repeat themselves. Moreover, the depth of lessons learned and the contents of each retrospective are inconsistent. Often, the experience of the retrospective team varies per incident. These challenges are usually due to the free-form nature of retrospectives.
An ideal system is one that turns unstructured retrospectives into a taxonomy with metadata. You should be able to extract data across the entire body of retrospectives. You can then analyze this data. What was the customer impact? Which of your monitoring tools caught this? Did your automated tests catch this? This makes the greatest impact for retrospectives with action items, as it allows for prioritization. In other words, an identified bug cannot be punted down the road if it impacts an important customer. In most cases, this process is manual and painful. As a result, retrospective actions do not get the right understanding and priority.
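To make the idea of retrospectives with metadata concrete, here is a small Python sketch. It assumes each retrospective records the customer impact, what detected the incident, whether automated tests caught it, and its action items. The field names and the three impact levels ("major", "minor", "none") are assumptions made for this example, not a standard taxonomy.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Retrospective:
    incident_id: str
    customer_impact: str             # assumed levels: "major", "minor", "none"
    detected_by: str                 # e.g. a monitoring tool name or "customer report"
    caught_by_automated_tests: bool
    action_items: List[str] = field(default_factory=list)


def detection_sources(retros: List[Retrospective]) -> Counter:
    """Count how often each detection source caught an incident."""
    return Counter(r.detected_by for r in retros)


def prioritized_actions(retros: List[Retrospective]) -> List[Tuple[str, str]]:
    """List (incident_id, action item) pairs, highest customer impact first."""
    rank = {"major": 0, "minor": 1, "none": 2}
    ordered = sorted(retros, key=lambda r: rank[r.customer_impact])
    return [(r.incident_id, item) for r in ordered for item in r.action_items]
```

Once retrospectives share a structure like this, the questions above become simple queries over the whole body of retrospectives instead of a manual read-through.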
4. You wait for incidents to happen.
Even the most senior engineers still need to learn from live incidents. But it is too costly to wait for issues to hit and impact customers before this learning can take place. The average cost of network downtime is around $300,000 per hour, according to Gartner, and the brand impact may be much higher than that. Companies cannot afford to train only on real incidents. The ideal SRE implementation would help train new team members with simulated incidents based on your own environment and past incidents. If no automation is in place, regular tabletop exercises are, at a minimum, a good way to do this.
5. You stop at incident management without SLOs.
This next pitfall often happens after the initial success with SRE, so it is hard for teams to spot. With the initial pieces of SRE, you will have a solid incident response process in place and are likely using retrospectives effectively. However, you will hit a plateau if you see SLOs (service level objectives) as a bonus component of SRE rather than the most important component.
In fact, SLOs are the essential building block of an SRE implementation. An SLO is the line between happy and unhappy customers. It also prevents companies from setting a reliability target that is unrealistic or higher than necessary. SLOs also enable error budgets (error budget = 1 – SLO; for example, an SLO of 99% availability leaves an error budget of 1% unavailability). An error budget is an allowance for downtime that gives the dev team permission to make riskier changes while setting the “stop” threshold. With error budgets, dev and ops teams are aligned for the first time on their reliability goals and can work together to achieve these goals long-term.
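The error budget arithmetic can be sketched in a few lines of Python. This is an illustration only: the function names and the 30-day window are assumptions made for the example, and your own SLO window and measurement method may differ.

```python
def error_budget(slo: float) -> float:
    """Return the error budget as a fraction, given an SLO fraction such as 0.99."""
    if not 0.0 < slo < 1.0:
        raise ValueError("SLO must be a fraction strictly between 0 and 1")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Convert the error budget into minutes of unavailability per window."""
    return error_budget(slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent; a negative value means overspent."""
    allowed = allowed_downtime_minutes(slo, window_days)
    return (allowed - downtime_minutes) / allowed
```

Under these assumptions, a 99% availability SLO over 30 days allows roughly 432 minutes of downtime. A team that has already used 108 of those minutes has about 75% of its budget left, and a negative result from budget_remaining would signal that the "stop" threshold has been crossed.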
Thus, SLOs are the essence of SRE. Two challenges for many teams are learning how to measure SLOs and getting teams to agree on an SLO at the level of customer impact. We have seen some teams take 6-9 months to agree on a “v1” of their SLOs. Much of this difficulty is due to not understanding the concept. However, teams can access Google’s new Coursera course on this topic. Even after understanding the SLO concept and how to query data for your SLIs (service level indicators), it is still hard to visualize and monitor your SLOs. An ideal solution would help identify your SLOs based on your user journeys and then create the metrics needed for tracking. Visualizing SLO success enables you to communicate the value of SRE to the entire company, thereby driving more investment into SRE. In short, to reap the full rewards of SRE, you cannot stop at incident management.
As we’ve outlined above, you can successfully navigate the SRE implementation journey. These five SRE implementation traps are coupled with specific best practices that allow you to bypass the pitfalls — from cross-team collaboration and checklists to retrospectives, training and SLOs. By following SRE best practices, you can make the software development path a smoother one.
Written by: Lyon Wong, Christina Tan
Co-founder and CEO