When an incident occurs, it's important to take the time to review what happened, understand all the contributing factors, and identify systemic changes to prevent similar incidents from happening in the future. This process is known as an incident retrospective. However, conducting incident retrospectives can be time-consuming and difficult, especially when dealing with multiple stakeholders and a large amount of data.
Blameless Incident Retrospectives can greatly reduce the overhead of retrospectives by automatically gathering and organizing information from the response process. What information is gathered and how it’s organized depend on the retrospective template you use. You’ll want to customize templates and choose among them based on the audience of the retrospective, the type and severity of the incident, and the process used to resolve it. Let’s look at different templates you may consider using.
Your most standard retrospective template will probably be aimed at developers, operators, and other engineers. It focuses on the technical causes of the incident, how it was diagnosed and resolved, and what follow-up tasks can make the codebase more robust against similar incidents.
These retrospectives are likely to be accessed by future on-call engineers dealing with similar incidents. To help them, make sure the section detailing the solution itself is complete, easy to follow, and points to specific steps and resources.
When an incident impacts customers, releasing a public retrospective is a good way to restore confidence and trust. It shows customers you’re being transparent and upfront about what went wrong. It gives you a chance to show what you’re doing to stop the incident from happening again. It also lets you reassure them that nothing else went wrong: you can highlight that their data is secure, that they won’t be charged for erroneous usage, or address anything else they may be worried about.
The template for these retrospectives should highlight the impact right at the start. Be frank and straightforward about the full impact of the incident, and mention what remained safe and secure to ease customers’ worries. Then move on to the steps being taken to prevent recurrence. Details about the technical causes and solutions are less important, but a general summary lends credibility by showing that you understand what went wrong.
Sometimes, when an incident is severe enough, you want to use it as a jumping-off point for more substantial systemic changes. For example, an extreme outage may reveal the need to fundamentally rework your architecture. Other times, you may want to highlight an incident, even one with little impact, as a useful example for reviewing more general processes.
A strategic retrospective should focus on the contributing factors of an incident. Use techniques such as the five whys to dig deep into why an incident occurred. Be holistic, considering not just technical causes, but things like human resources, training, coinciding problems, and customer expectations. The retrospective should encourage broad conversations about all aspects of the organization in order to reveal where the most impactful changes can be made.
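To make the three templates described above concrete, here is a minimal sketch that models each one as an ordered list of sections and picks which retrospectives to write for a given incident. This is purely illustrative and is not the Blameless product's API or template format: the names `Template` and `templates_for`, the section titles, the severity scale (1 is most severe), and the `flagged_for_review` parameter are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Template:
    """A hypothetical retrospective template: a name plus ordered sections."""
    name: str
    sections: Tuple[str, ...]  # order matters; the first section leads the document


# Section titles are illustrative, paraphrased from the descriptions above.
TECHNICAL = Template(
    "technical",
    ("Technical causes", "Diagnosis", "Resolution steps and resources", "Follow-up tasks"),
)
CUSTOMER_FACING = Template(
    "customer-facing",
    ("Impact", "What remained safe", "Steps to prevent recurrence", "Summary of cause and fix"),
)
STRATEGIC = Template(
    "strategic",
    ("Contributing factors", "Five whys", "Organizational factors", "Proposed systemic changes"),
)


def templates_for(severity: int, customer_impact: bool,
                  flagged_for_review: bool = False) -> List[Template]:
    """Return the retrospectives to write for one incident.

    Assumed rules: every incident gets a technical retrospective; a
    customer-facing one is added when customers were affected; a strategic
    one is added for the most severe incidents (severity 1 on this assumed
    scale) or when someone flags the incident as a useful example.
    """
    chosen = [TECHNICAL]
    if customer_impact:
        chosen.append(CUSTOMER_FACING)
    if severity <= 1 or flagged_for_review:
        chosen.append(STRATEGIC)
    return chosen


if __name__ == "__main__":
    for template in templates_for(severity=1, customer_impact=True):
        print(template.name, "->", ", ".join(template.sections))
```

Keeping the selection rules in one small function makes them easy to adjust as your organization decides which incidents deserve which kind of retrospective.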
Every organization will have its own specific needs for incident information. Having useful customized templates at the ready will make it easy to get everything you need to learn from each incident. To see it all in action, start a free trial today!