Curious about the post-incident review process? This article explains what post-incident reviews are, why they matter, and the best practices for running them.
What is a post-incident review?
A post-incident review is an evaluation of the incident response process. Its goal is to produce clear actions that improve the response and help prevent further incidents.
Post-incident reviews vs incident retrospectives/postmortems
If you’re familiar with SRE practices, you’ve probably heard of incident retrospectives, also known as postmortems. A retrospective is a valuable document that guides learning and improvement of a system after an incident. A post-incident review can be built from the information in the retrospective, but it has a different goal. Rather than summarizing the lessons of the incident, it focuses on improving the incident response process.
Whatever policies you have around creating and reviewing retrospectives should also apply to post-incident reviews. This includes deciding when creating the document is necessary based on incident classification, how to drive followup tasks based on the incident, and how to ensure that people can review the document when needed.
The information in a post-incident review will often be contained in the retrospective itself – analyzing the response process will always go hand-in-hand with analyzing the incident. However, it can be helpful to have a separate document that focuses on the process to make sure you’re always improving your response.
Building a post-incident review
Defining the requirements
The first stage of a good post-incident review is to define what the document needs to contain. Without defining these requirements ahead of time, your reviews will be inconsistent and difficult to work with. SRE and/or DevOps teams will likely take the lead in setting up these policies by working with both development and operations teams.
The first thing to define is when a post-incident review is necessary. Generally, you want to conduct a post-incident review for each incident, as even small incidents can exemplify your response process and provide insights. However, you can use incident classification to decide what qualifies as an incident or what requires review. One useful definition of an incident is “something that disrupts normal operation and requires some response to restore system functionality.” Where you draw this line will differ from organization to organization, as “normal operation” depends on the type of system.
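As a rough illustration, the definition and classification rule described here could be encoded as a small policy check. This is only a sketch: the severity levels, the field names, and the `review_threshold` parameter are hypothetical and would need to match your own incident classification scheme.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical levels; a lower number means a more severe incident.
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass
class Event:
    title: str
    severity: Severity
    disrupted_normal_operation: bool
    required_response: bool


def is_incident(event: Event) -> bool:
    """Apply the definition: disrupts normal operation AND needs a response."""
    return event.disrupted_normal_operation and event.required_response


def needs_review(event: Event, review_threshold: Severity = Severity.SEV4) -> bool:
    """Require a review for incidents at or above the chosen severity.

    The default threshold (SEV4, the least severe level) means every
    incident gets a review, matching the general recommendation.
    """
    return is_incident(event) and event.severity <= review_threshold
```

With the default threshold, every event that meets the definition is reviewed. An organization that only reviews more serious incidents could pass a stricter threshold such as `Severity.SEV2`.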
Next, these teams need to agree on what needs to be included in the post-incident review document, and the policies around its review. We’ll cover these in the following sections.
Making the document
Helpful sections for the post-incident review document include:
- A timeline of communication and steps taken
- A list of resources used in the response and their effectiveness
- Monitoring data to provide context for the system’s health, in order to judge effectiveness
- Comments from responders giving insights on what was helpful and what wasn’t
- Suggestions for improvement to the response process
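One way to keep reviews consistent is to treat the sections above as a simple structure that can be checked for completeness. The sketch below is an assumption about how such a template might look, not a format used by any particular tool; all class and field names are hypothetical.

```python
from dataclasses import dataclass, field, fields
from datetime import datetime


@dataclass
class TimelineEntry:
    at: datetime
    description: str  # a message sent or a step taken


@dataclass
class ResourceUsed:
    name: str         # e.g. a runbook or dashboard
    helpful: bool     # responders' judgment of its effectiveness
    notes: str = ""


@dataclass
class PostIncidentReview:
    timeline: list[TimelineEntry] = field(default_factory=list)
    resources: list[ResourceUsed] = field(default_factory=list)
    monitoring_data: list[str] = field(default_factory=list)
    responder_comments: list[str] = field(default_factory=list)
    suggestions: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]
```

A reminder policy could call `missing_sections()` before the review meeting and prompt responders to fill in whatever is still empty.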
As the incident response happens, this document should be built in parallel. Tools such as Blameless’s retrospectives can help automatically collect and organize this information without responders losing focus and momentum on solving the problem. Other information, such as responders’ commentary on the process, needs to be added manually. Policies with reminders should be in place to make sure these components are added; some of them may only come up during the review meetings.
Reviewing the document
Once the post-incident review document is completed, it’s important to learn from it so that improvements can be made to the response process. The responders involved in the incident should hold a review meeting alongside SREs and others who helped create the response process.
In these meetings, walk through the incident again. Where did the process move smoothly, and where did things stall out? Look for major pauses in the timeline. It could be that the team was sidelined by some other priority, or it could be that their attempted solutions weren’t working. It’s important to determine which causes of slowdown are avoidable and which are inherent to the incident.
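Spotting major pauses can be partly automated if the timeline has timestamps. The following is a minimal sketch, assuming the timeline is available as a list of `datetime` values; the function name and the 15-minute default threshold are illustrative choices, not a recommended standard.

```python
from datetime import datetime, timedelta


def find_pauses(timestamps, threshold=timedelta(minutes=15)):
    """Return (start, end, gap) for each gap between consecutive
    timeline entries that is at least as long as the threshold."""
    ordered = sorted(timestamps)
    pauses = []
    # Compare each entry with the one that follows it.
    for earlier, later in zip(ordered, ordered[1:]):
        gap = later - earlier
        if gap >= threshold:
            pauses.append((earlier, later, gap))
    return pauses
```

The output only shows where the response went quiet. Deciding whether a pause was avoidable or inherent to the incident still requires discussion with the people who were there.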
Looking at the resources used and their effectiveness is also important. Do runbooks need to be updated or automated? Is there a process that was relied on that’s now out of date? Did the solution involve a particular person’s expertise, creating a single point of failure? Examining how the response would play out in different scenarios can help you create more robustness.
Improving from the post-incident review
Simply identifying issues in your response process is, of course, not enough to actually improve the process. Your post-incident review needs to drive systemic change.
During the review process, create followup tasks for improving the response process. These could involve revising a runbook, training more people on key procedures, documenting helpful solutions, or overhauling code to be easier to work with during incidents.
The post-incident review process can be improved and made easier with tooling. Blameless provides a suite of tools that help improve your incident response. Blameless retrospectives automatically build documents to help with post-incident review without distracting responders. CommsFlow lets responders keep stakeholders informed without breaking focus, leading to smoother responses. Find out how by checking out a demo!