Originally published at The New Stack (https://thenewstack.io/how-to-avoid-the-5-sre-implementation-traps-that-catch-even-the-best-teams/).
Today’s businesses must quickly release new software into production, yet the majority fail to deliver quality software efficiently. A recent DevOps survey conducted by Harvard Business Review confirms this software delivery pain point: While 86 percent of respondents said it is important for their organizations to build and deploy software quickly, only 10 percent report being successful at doing so.
We hear from companies that face the same gap when implementing SRE: high importance, low success rate. SRE (Site Reliability Engineering) is one method of putting DevOps (collaboration between Development and Operations teams to deploy code to production) into practice. Whether you are implementing SRE or DevOps, your first attempt is likely to fall short of your intentions. Still, much like with DevOps, the path to successful SRE implementation is not as elusive as it may seem. Costly mistakes and team dissatisfaction can be avoided if you take the right steps.
How can organizations maximize their chance at SRE success while avoiding potential pitfalls? We’ve pinpointed five traps that companies fall into on the path to SRE adoption. Here’s what they are and how you can avoid them:
You’ve formed a core group that believes in SRE and is committed to its execution. Its members may come from your engineering, ops or DevOps teams, or the group may even be a full-fledged SRE team. This is a great start, but beware of stopping short of cross-team buy-in. We’ve seen companies assign just one or two SREs for the entire organization, and that proved to be insufficient. SRE needs buy-in from ops, engineering, and product to stick. For high severity issues, sales, support and customer success should be looped in as well. Without broader buy-in, teams outside the SRE group are not motivated to improve their own operations, yet those operational improvements are what prevent the underlying issues that cause the incidents the SRE group must then address.
Another challenge: How do you manage all the streams of cross-team communication? During an incident, it may be too much of a burden to inform and get input from so many teams. One solution we’ve seen work well is to split the incident command role and have a dedicated communications lead role. The primary metric for the comms lead is how well stakeholders are synchronized rather than the resolution time. If there’s cross-team buy-in, then all the streams of cross-team communication can be managed in one SRE platform.
If your incident response process references multiple docs and wikis, that’s too much. Prior work, including The Checklist Manifesto, has already established that when humans are under stress, tasks should be as simple as possible. The same applies to the high severity incidents that your company faces. Even the best-written procedure docs are extremely hard to follow under stress. But a single checklist is insufficient for complex tasks and team activities. In the most successful SRE implementations, we observe that checklists are used, but they are customized for different roles. The ideal system shows the right information at the right level of detail at the right time, while keeping each task as small as possible. And when looking back at the system overall, each checklist item can be measured for efficiency to allow for future improvements.
Your company is disciplined and creates a postmortem for every incident. That’s already a big win as many teams don’t get this far due to the toil of creating postmortems. If you’re still not at the point where your teams can easily create comprehensive postmortems with timestamps and key events, then you may need to automate this process. Assuming you are past that, the next major pitfall we see is that postmortems get written but are not reviewed again. Memory fades over time and issues repeat themselves. Moreover, the depth of lessons learned and the content of each postmortem are inconsistent and insufficient. Often, the experience of the postmortem team varies per incident. These challenges are usually due to the free-form nature of postmortems.
An ideal system is one that turns unstructured postmortems into a taxonomy, with metadata and data that can be extracted across the entire body of postmortems. This data can then be analyzed: How many customers were impacted? Which of our monitoring tools caught this? Did our automated tests catch this? The benefit is greatest for postmortems with action items, because the data allows for proper prioritization. In other words, identified bugs cannot be punted down the road if they impact an important customer. In most cases, doing this manually is painful, and postmortem actions end up poorly understood and wrongly prioritized.
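To make the idea of structured postmortems concrete, here is a minimal sketch in Python. The `Postmortem` fields, the sample incidents, and both helper functions are hypothetical and chosen only for illustration; a real taxonomy would follow your own incident data.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Postmortem:
    # Hypothetical fields: a real taxonomy would mirror your own incidents.
    incident_id: str
    customers_impacted: int
    detected_by: str                 # e.g. "monitoring" or "customer_report"
    caught_by_automated_tests: bool
    action_items: List[str] = field(default_factory=list)


# Made-up sample data, for illustration only.
POSTMORTEMS = [
    Postmortem("INC-1", 120, "monitoring", False, ["add retry to checkout client"]),
    Postmortem("INC-2", 3, "customer_report", False, ["alert on queue depth"]),
    Postmortem("INC-3", 45, "monitoring", True, []),
]


def detection_counts(postmortems: List[Postmortem]) -> Counter:
    """Count how many incidents each detection source caught."""
    return Counter(p.detected_by for p in postmortems)


def prioritized_action_items(postmortems: List[Postmortem]) -> List[Tuple[str, str]]:
    """List action items, starting with the incidents that hit the most customers."""
    ranked = sorted(postmortems, key=lambda p: p.customers_impacted, reverse=True)
    return [(p.incident_id, item) for p in ranked for item in p.action_items]


if __name__ == "__main__":
    print(detection_counts(POSTMORTEMS))
    for incident_id, item in prioritized_action_items(POSTMORTEMS):
        print(incident_id, item)
```

Ranking by customers impacted is only one possible prioritization rule; the point is that once postmortems share fields, questions like these become simple queries instead of manual reading.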
Even the most senior engineers still learn from live incidents. But it is too costly to wait for issues to hit and impact customers before this learning can take place. The average cost of network downtime is around $300,000 per hour, according to Gartner, and the brand impact may be much higher than that. Companies cannot afford to train only on real incidents. The ideal SRE implementation would help train new team members with simulated incidents based on your own environment and past incidents. At a minimum, regular tabletop exercises are a good way to do this if no automation is in place.
This next pitfall often happens after the initial success with SRE, so it is hard for teams to spot. With the initial pieces of SRE, you will have a solid incident response process in place and are likely using postmortems effectively. However, you will hit a plateau if you see SLOs (service level objectives) as a bonus component of SRE rather than the most important component.
In fact, SLOs are the essential building block of an SRE implementation. An SLO is the line between happy and unhappy customers. It also prevents companies from setting a reliability target that is higher than necessary or unrealistic. SLOs also enable error budgets (error budget = 1 – SLO; for example, an SLO of 99% availability leaves an error budget of 1% unavailability). An error budget is an allowance for downtime that gives the dev team permission to make riskier changes while also setting the “stop” threshold. With error budgets, dev and ops teams are aligned for the first time on their reliability goals and can work together to achieve these goals long-term.
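The error budget arithmetic can be sketched in a few lines of Python. The function names and the 30-day window are assumptions made for this illustration; the only formula taken from the text is error budget = 1 – SLO.

```python
def error_budget(slo: float) -> float:
    """Error budget as a fraction of the window: 1 - SLO."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1], e.g. 0.99 for 99%")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Translate the error budget into minutes of downtime over a window.

    The 30-day default is an assumption; use whatever window your SLO covers.
    """
    return error_budget(slo) * window_days * 24 * 60


if __name__ == "__main__":
    # A 99% availability SLO leaves a 1% error budget.
    print(round(error_budget(0.99), 4))
    # Over 30 days (43,200 minutes) that is about 432 minutes of downtime.
    print(round(allowed_downtime_minutes(0.99), 1))
```

Expressing the budget in minutes makes the "stop" threshold tangible: once the allowance for the window is spent, riskier changes wait.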
Thus, SLOs are the essence of SRE. Two challenges for many teams are learning how to measure SLOs and getting teams to agree on an SLO at the level of customer impact. We have seen some teams take 6-9 months to agree to a “v1” of their SLOs. Much of this difficulty is due to not understanding the concept. However, teams can access Google’s new Coursera course on this topic. Even after you understand the SLO concept and how to query data for your SLIs (service level indicators), it is still hard to visualize and monitor your SLOs. An ideal solution would help identify your SLOs based on your user journeys and then create the metrics needed for tracking. Visualizing SLO success enables you to communicate the value of SRE to the entire company, thereby driving more investment into SRE. Basically, to reap the full rewards of SRE, you cannot stop at incident management.
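As a rough illustration of tracking an SLO from an SLI, the Python sketch below computes a request-based availability SLI and the share of the error budget it has consumed. The request counts, the 99.9% target, and the function names are all hypothetical; real SLIs would come from queries against your own monitoring data.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: the fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def budget_consumed(sli: float, slo: float) -> float:
    """Fraction of the error budget used so far; above 1.0 the SLO is missed."""
    budget = 1.0 - slo
    if budget <= 0.0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return (1.0 - sli) / budget


if __name__ == "__main__":
    # Hypothetical numbers: 999,400 good requests out of 1,000,000.
    sli = availability_sli(999_400, 1_000_000)
    consumed = budget_consumed(sli, slo=0.999)
    print(f"SLI: {sli:.4%}, error budget consumed: {consumed:.0%}")
```

A single number such as "60% of this window's budget is spent" is easier to put on a dashboard and explain to the rest of the company than raw error rates.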
As we’ve outlined above, you can successfully navigate the SRE implementation journey. Each of these five SRE implementation traps comes with specific best practices that allow you to bypass it, from cross-team collaboration and checklists to postmortems, training and SLOs. By following SRE best practices, you can make the software development path a smoother one.