Incident retrospectives (or postmortems, post-incident reports, RCAs, etc.) are the most important part of an incident. This is where you take the gift of that experience and turn it into knowledge. This knowledge then feeds back into the product, improving reliability and ensuring that no incident is a wasted learning opportunity. Every incident is an unplanned investment and teams should strive to make the most of it.
Yet, many teams find themselves unable to complete incident retrospectives on a regular basis. One common reason for this is that day-to-day tasks such as fixing bugs, managing fire drills, and deploying new features take precedence, making it hard to invest in a process to streamline post-incident report completion. To make the most of each incident, teams need a solid post-incident template that can help minimize cognitive load during the analysis process. Below is an example of what a comprehensive, narrative incident retrospective could look like.
The summary should contain 2-3 sentences that give a reader an overview of the incident’s contributing factors, resolution, classification, and customer impact level. The briefer, the better, as this is what engineers will look at first when trying to resolve a similar incident.
Example: Google Compute Engine Incident #17007
This summary states “On Wednesday 5 April 2017, requests to the Google Cloud HTTP(S) Load Balancer experienced a 25% error rate for a duration of 22 minutes. We apologize for this incident. We understand that the Load Balancer needs to be very reliable for you to offer a high quality service to your customers. We have taken and will be taking various measures to prevent this type of incident from recurring.”
This section describes the level of customer impact. How many customers did the incident affect? Did customers lose partial or total functionality? Adding tags can be helpful here as well to help with future reporting, filtering and search.
Example: Google Cloud Networking Incident #19009
In the section titled “DETAILED DESCRIPTION OF IMPACT,” the authors thoroughly break down which users and capabilities were affected.
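To make the tagging idea concrete, here is a minimal sketch of how impact records could be tagged and filtered later. The `ImpactRecord` class, its fields, the incident IDs, and the tag names are all made up for illustration; they do not come from any particular incident tool.

```python
from dataclasses import dataclass, field


@dataclass
class ImpactRecord:
    """Hypothetical customer-impact entry for one retrospective."""
    incident_id: str
    customers_affected: int
    functionality_lost: str  # "partial" or "total"
    tags: set[str] = field(default_factory=set)


def filter_by_tag(records, tag):
    """Return the records that carry the given tag (case-insensitive)."""
    wanted = tag.lower()
    return [r for r in records if wanted in {t.lower() for t in r.tags}]


# Illustrative data only.
records = [
    ImpactRecord("INC-1", 120, "partial", {"load-balancer", "EU"}),
    ImpactRecord("INC-2", 15, "total", {"storage"}),
    ImpactRecord("INC-3", 900, "partial", {"Load-Balancer"}),
]

matches = filter_by_tag(records, "load-balancer")
print([r.incident_id for r in matches])
```

Normalizing tag case at query time is one simple way to keep search results consistent when different authors write tags slightly differently.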
The follow-up actions section is critical: it keeps accountability for addressing an incident’s contributing factors looking forward. Follow-up actions can include upgrading your monitoring and observability, fixing bugs, or even larger initiatives like refactoring part of the code base. The best follow-up actions also detail who is responsible for each item and when the rest of the team should expect an update.
Example: Sentry’s Security Incident (June 12 2016)
While detailed action items are rarely visible to the public, Sentry did publish a list of improvements the team planned to make after this outage covering both fixes and process changes.
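As a rough sketch of what "who is responsible and when to expect an update" could look like in practice, the snippet below tracks follow-up actions with an owner and an update date, then lists the open items whose update is overdue. The `FollowUpAction` class, the owner names, and the dates are assumptions for illustration, not part of any specific tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FollowUpAction:
    """Hypothetical follow-up item from a retrospective."""
    description: str
    owner: str
    update_due: date
    done: bool = False


def overdue(actions, today):
    """Return open actions whose promised update date has passed."""
    return [a for a in actions if not a.done and a.update_due < today]


# Illustrative data only.
actions = [
    FollowUpAction("Add alert for elevated error rate", "alice", date(2024, 3, 1)),
    FollowUpAction("Fix retry bug in client", "bob", date(2024, 3, 10), done=True),
    FollowUpAction("Refactor config loader", "carol", date(2024, 4, 1)),
]

late = overdue(actions, today=date(2024, 3, 15))
print([a.owner for a in late])
```

Passing `today` in explicitly, rather than reading the clock inside the function, keeps the check easy to test and to run against past dates.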
As systems grow more complex, it is harder than ever to pinpoint a single root cause for an incident. Each incident might involve multiple dependencies that affect the service, and each dependency might result in its own action items, so there is no single root cause. To identify contributing factors, consider using “because, why” statements.
Example: Travis CI’s Container-based Linux Precise infrastructure emergency maintenance
In this retrospective, the authors cover contributing factors such as a change in how the Docker backend executes build scripts, missing alerting coverage for the errors, and more.
This section is one of the most important, yet one of the most rarely filled out. The narrative section is where you write out an incident like you’re telling a story.
Who are the characters and how did they feel and react during the incident? What were the plot points? How did the story end? This will be incomplete without everyone’s perspective.
Make sure the entire team involved in the incident gets a chance to write their own part of this narrative, whether through async document collaboration, templated questions, or other means.
The timeline is a crucial snapshot of the incident. It details the most important moments and can contain key communications, screenshots, and logs. Building it is often one of the most time-consuming parts of a post-incident report, which is why we recommend using tooling to aggregate the timeline automatically.
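To illustrate what automatic aggregation can mean, here is a minimal sketch that merges events from several sources into one chronological timeline. The event shape (`at`, `source`, `text`), the source names, and the sample events are assumptions for illustration; real tooling would pull these from chat, alerting, and deploy systems and would need to normalize time zones first.

```python
from datetime import datetime


def build_timeline(*sources):
    """Merge event lists from several sources into one list sorted by time."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda e: e["at"])


def render(timeline):
    """Format each event as 'HH:MM [source] text'."""
    return [f'{e["at"]:%H:%M} [{e["source"]}] {e["text"]}' for e in timeline]


# Illustrative events only; all timestamps are assumed to share one time zone.
chat = [
    {"at": datetime(2024, 3, 1, 14, 5), "source": "chat", "text": "Incident channel opened"},
    {"at": datetime(2024, 3, 1, 14, 40), "source": "chat", "text": "Rollback confirmed"},
]
alerts = [
    {"at": datetime(2024, 3, 1, 14, 2), "source": "alert", "text": "Error rate above threshold"},
]
deploys = [
    {"at": datetime(2024, 3, 1, 13, 55), "source": "deploy", "text": "Release shipped"},
    {"at": datetime(2024, 3, 1, 14, 30), "source": "deploy", "text": "Rollback started"},
]

lines = render(build_timeline(chat, alerts, deploys))
print("\n".join(lines))
```

Even a simple merge like this puts the deploy, the alert, and the human response in one view, which is usually the first thing a reader of the timeline wants to see.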
Technical analyses are key to any successful retrospective. After all, this serves as a record and a possible resolution for future incidents. Any information relevant to the incident, from architecture graphs to related incidents to recurring bugs, should be detailed here.
Here are some questions to answer with your team:
- Have you seen an incident like this before?
- Has this bug occurred previously, and if so, how often?
- What dependencies came into play here?
Incident management process analysis
At the heart of every incident is a team trying to right the ship. But how does that process go? Is your team panicked, hanging by a thread and relying on heroics? Or, does your team have a codified process that keeps everyone cool? This is the time to reflect on how the team worked together.
Here are some questions to answer with your team:
- What went well?
- What went poorly?
- Where did you get lucky and how can you improve moving forward?
- Did your monitoring and alerting capture this issue?
Communication during an incident is a necessity. Stakeholders such as managers, the lines of business (e.g., sales, support, PR), C-level executives, and customers will want updates. But internal and external communication might look very different. Even internal communication might differ between what you would send to a VP of Engineering and what you would send to your sales team.
Here, document the messaging that was disseminated to different categories of stakeholders. This way, you can build templates for the future to continue streamlining communication.
Example: Google Compute Engine Incident #15056
In this incident, Google ensures that all major updates are regularly communicated. The team also lets users know when they can next expect to be updated. “We are still working on restoring the service of Google Compute Engine Persistent Disks in europe-west1-b. We will provide another status update by 19:00 US/Pacific with current details.”
Other Best Practices to Keep in Mind
- Complete the report within 48 hours
- Ensure reports are housed such that they can be dynamically surfaced during incidents
- Add graphics and charts to help readers visualize the incident
- Be blameless. Remember that everyone is doing their best and failure is an opportunity to learn
Failure is the most powerful learning tool, and deserves time and attention. Each retrospective you complete pushes you closer to optimal reliability. While they do take time and effort, the result is an artifact that is useful long after the incident is resolved.
By using this template, your team is on the way to taking full advantage of every incident.