In the world of SRE, incidents are unplanned investments in reliability. Why? Because they are valuable opportunities to learn and grow. This perspective can be difficult to communicate to other stakeholders. Some may be upset about the cost incurred or the affected customers. Others might not understand why incidents happen in the first place. It is important to show how the lessons of an incident are relevant to each stakeholder role.
One of the most valuable tools in sharing these lessons is the incident retrospective or postmortem. These documents are built after the incident response process and reviewed in internal meetings. Sometimes an edited version is shared with external stakeholders. In this blog post, we’ll show how to coordinate incident retrospectives across different stakeholder groups, how to cultivate a culture of blamelessness during the process, and how to drive change from key findings.
Many stakeholder groups will want to see and discuss incident retrospectives. When writing them, keep in mind a few key pieces of information that each group will be most interested in. This can save you time answering questions later on. Modular documents with moveable sections can be especially helpful for this.
Incident retrospectives are sometimes shared with users of the service. In this external version, state clearly how the incident affected users. Use concise and specific language. Don’t bury this information amid technical details and other background. Instead, put it in a TL;DR at the top, or in bold at the bottom. This way, users who don’t want to read the entire document can scan the parts most important to them.
If the incident affects users in different ways, use a table to break down the impact for each user group. Where appropriate, consider emphasizing what didn’t happen to reassure customers: for example, that no account information was leaked and that there was no security breach.
Also add a summary of your internal follow-up if possible. Customers will feel more confident in your service if they see that learning is taking place.
Marketing and Public Relations
For severe incidents, marketing and public relations teams may need to prepare additional statements, both during and after the incident. Executives may reach out to top customers acknowledging the incident to maintain a trusting relationship. PR teams may need to meet with press and field questions. Real-time stakeholder communications are critical to drive clarity during incidents, and the incident retrospective creates critical post-incident alignment, learning, and dialogue.
Retrospectives tailored to these teams should help communicate the facts of the incident to a variety of audiences. A tech blog, a prospective customer, and a mainstream newspaper each need different language. This makes it harder to keep the details consistent across all responses. Sourcing each response from the retrospective keeps the statements aligned.
Legal contracts will likely include specific thresholds for the performance of the service. An incident that affects customers does not necessarily breach the contract. The retrospective can show how performance during the incident compares to these service level agreements, protecting the organization from overpaying.
Additionally, retrospectives are useful to legal teams even if there isn’t a violation. Legal can use them to assess the risk of future incidents and prepare responses.
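As a rough illustration of the comparison against service level agreements described above, here is a minimal Python sketch. It assumes the contract expresses its threshold as an availability percentage over a fixed period. The function names and the figures in the example are hypothetical, and real contracts often define downtime and measurement windows in their own terms.

```python
def availability_percent(period_minutes: float, downtime_minutes: float) -> float:
    """Return the percentage of the period during which the service was up."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime_minutes must be between 0 and period_minutes")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def sla_breached(period_minutes: float, downtime_minutes: float,
                 sla_target_percent: float) -> bool:
    """True if measured availability fell below the contractual target."""
    return availability_percent(period_minutes, downtime_minutes) < sla_target_percent


# Illustrative numbers only: a 30-day month and a 99.9% target (assumed).
MINUTES_IN_30_DAYS = 30 * 24 * 60
short_outage = sla_breached(MINUTES_IN_30_DAYS, 30, 99.9)  # 30 minutes down
long_outage = sla_breached(MINUTES_IN_30_DAYS, 60, 99.9)   # 60 minutes down
```

With these assumed numbers, the 30-minute outage stays within the threshold while the 60-minute outage does not, which is the kind of distinction the retrospective can make explicit for the legal team.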
The executive team focuses on the greater context of the incident. Executives need to take a long-term perspective on the direction of the organization. To build strategy and long-range planning, they think in time horizons of years or even decades. The incident retrospective should help connect the specific incident to these larger objectives.
The top- and bottom-line costs of the incident need to be emphasized. Try to make these figures as comprehensive as possible. Include the lost revenue during downtime and the overtime costs of incident responders. But also look at how the incident impacts the entire development cycle. This doesn’t always have to be negative. You should also factor in the benefits of learning from the incident and how they can be applied to make the organization more resilient. This will help the executive team have a more complete and accurate picture of the incident’s impact.
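One way to make the cost figure concrete is to add up its components explicitly. The Python sketch below is only an illustration: the field names, the `delayed_work_cost` category, and the example amounts are assumptions, not a standard accounting model. It also leaves out the benefits of learning, which are harder to put a number on.

```python
from dataclasses import dataclass


@dataclass
class IncidentCost:
    """Hypothetical cost breakdown for a single incident."""
    downtime_hours: float
    revenue_per_hour: float
    responder_overtime_hours: float
    overtime_hourly_rate: float
    delayed_work_cost: float = 0.0  # assumed estimate of knock-on delays

    def lost_revenue(self) -> float:
        return self.downtime_hours * self.revenue_per_hour

    def overtime_cost(self) -> float:
        return self.responder_overtime_hours * self.overtime_hourly_rate

    def total(self) -> float:
        return self.lost_revenue() + self.overtime_cost() + self.delayed_work_cost


# Illustrative figures only.
cost = IncidentCost(
    downtime_hours=2,
    revenue_per_hour=5000,
    responder_overtime_hours=12,
    overtime_hourly_rate=90,
    delayed_work_cost=3000,
)
```

Keeping the components separate lets the retrospective show executives where the cost came from, not just a single total.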
Developers will need the most technical detail of any stakeholder group. An incident can have complex ramifications for future projects and operations. It can reveal bugs in the codebase or highlight insufficient production resources. It can also impact SLOs and error budgets, potentially leading to a temporary emphasis on reliability work. The incident retrospective can serve as a hub for these details.
The retrospective can reference and link specific tickets, sprints, or larger projects. Conversely, these projects can link back to the retrospective. This keeps information available and consistent, and it can help diagnose larger issues of which the incident is a symptom.
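To show how an incident can affect an error budget, here is a small Python sketch. It assumes a time-based availability SLO measured over a fixed window. The function names and the example values are hypothetical, and many teams use request-based budgets instead.

```python
def error_budget_minutes(slo_target_percent: float, window_minutes: float) -> float:
    """Total downtime the SLO allows over the window."""
    if not 0 < slo_target_percent <= 100:
        raise ValueError("slo_target_percent must be in (0, 100]")
    return window_minutes * (100.0 - slo_target_percent) / 100.0


def budget_remaining_minutes(slo_target_percent: float, window_minutes: float,
                             downtime_minutes: float) -> float:
    """Budget left after the downtime so far; negative means the budget is spent."""
    return error_budget_minutes(slo_target_percent, window_minutes) - downtime_minutes


# Illustrative numbers only: a 99.9% SLO over a 30-day window (assumed).
WINDOW = 30 * 24 * 60
budget = error_budget_minutes(99.9, WINDOW)             # about 43.2 minutes
after_incident = budget_remaining_minutes(99.9, WINDOW, 30)
```

A negative remaining budget is the kind of signal that could prompt the temporary emphasis on reliability work mentioned above.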
Site reliability engineers themselves are very interested in incident retrospectives. As the stewards of SLOs and other reliability metrics, they need to know the incident’s impact. It's important to emphasize changes in availability, response time, and other key metrics. Beyond this, SREs are also interested in a meta-analysis of incident retrospectives.
Retrospectives should contain communication logs and timelines of the response. They should detail what tools, such as runbooks, the team used, as well as how well they worked. SREs can use this as a starting point to review the effectiveness of incident response procedures. Even the retrospective process itself can be analyzed and improved. The retrospective serves as a test case for a wide range of SRE procedures.
Incident retrospectives should always be available for others to learn from. But availability isn’t enough. Meetings should be scheduled for stakeholders to review and discuss incident retrospectives. This helps transform the lessons of the retrospective into change and action.
Depending on the scale and severity of an incident, different stakeholders may be invited to the meeting. In general, you should lean towards inclusiveness. Give people a chance to attend review meetings, even if they weren’t involved in the incident or its follow-ups. The lessons of an incident aren’t just technical or procedural. You can learn a lot about teamwork, problem-solving, and other valuable skills.
With this in mind, the meeting has to have an atmosphere of blamelessness and empathy. If people blame individuals for the incident, the meeting will devolve into unproductive arguments. Work backwards through the causality of the incident. Assume that everyone had the best intentions at each step, doing the best with the information they had. If someone made a choice that contributed to a failure, what were the factors that led them to make that choice at the time? If something slipped past someone’s notice, what check can you add to catch it next time? Keep investigating and asking the right questions.
Empathize with the challenges that each person faced during the incident. Consider what sort of guardrails would have helped them best, and what actions can be taken to codify such guardrails in the future. By working from this perspective, you can ensure that your procedures are helping your team. This provides a common ground for all stakeholders in the meeting.
After the incident retrospective, it's time to apply what you learned. During operational reviews, consider having each stakeholder group create tasks for each lesson learned. These can involve specifics like fixing certain bugs or adding new checks. Or they can be adjustments to development policies or incident response procedures.
This work can then be incorporated into upcoming sprints. This ensures tasks don’t fall by the wayside, and are integrated into longer-range planning. The retrospective can house these tasks and schedules. As follow-up tasks are completed, they can be checked off in the document. The retrospective then becomes a log of all the work completed because of the incident. This can be helpful for analyzing the impact of an incident. It will include not only the disruption or impact caused by unplanned work during the incident, but also the downstream value generated by incorporating associated learnings.
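The idea of the retrospective housing follow-up tasks and checking them off can be sketched as a small data structure. This Python example is a hypothetical model, not the format of any particular tool: the class names, fields, and sample tasks are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ActionItem:
    """A follow-up task created from a lesson learned."""
    description: str
    owner_group: str
    done: bool = False


@dataclass
class Retrospective:
    """Hypothetical retrospective record that tracks its follow-up work."""
    title: str
    action_items: list[ActionItem] = field(default_factory=list)

    def open_items(self) -> list[ActionItem]:
        return [item for item in self.action_items if not item.done]

    def completion_ratio(self) -> float:
        """Fraction of follow-up tasks completed (0.0 when there are none)."""
        if not self.action_items:
            return 0.0
        done = sum(1 for item in self.action_items if item.done)
        return done / len(self.action_items)


# Illustrative usage with made-up tasks.
retro = Retrospective("Example checkout outage")
retro.action_items.append(ActionItem("Add alert on queue depth", "SRE"))
retro.action_items.append(ActionItem("Fix retry bug in payment client", "Developers"))
retro.action_items[0].done = True  # checked off once the work ships
```

Tracking completion in one place makes it easier to see, during later operational reviews, which lessons have actually turned into finished work.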