Incident retrospectives (or postmortems, post-incident reports, RCAs, etc.) are the most important part of an incident. This is where you take the gift of that experience and turn it into knowledge. This knowledge then feeds back into the product, improving reliability and ensuring that no incident is a wasted learning opportunity. Every incident is an unplanned investment and teams should strive to make the most of it.
Yet, many teams find themselves unable to complete incident retrospectives on a regular basis. One common reason for this is that day-to-day tasks such as fixing bugs, managing fire drills, and deploying new features take precedence, making it hard to invest in a process to streamline post-incident report completion. To make the most of each incident, teams need a solid post-incident template that can help minimize cognitive load during the analysis process. In this article we provide an example of what a comprehensive, narrative incident retrospective could look like.
What is a postmortem (now called retrospective) template?
A retrospective template is a tool your team uses to report on an incident, with details on what happened, what went wrong, and what systemic improvements could be made.
Why are retrospectives important?
Incident management is a stressful process in which everyone works hard to resolve the issue as quickly as possible. While the team is focused on restoring service, there is often no time to reflect or to dig into how to prevent the incident from happening again.
The immediate need is to resolve the issue and ensure customers have the best experience with the solution. However, after the incident is dealt with, a retrospective is a crucial part of the process. A retrospective enables teams to look deeper at their effort and identify areas of improvement for next time.
But that’s not all. Retrospectives are valuable in other settings too. For example, they are a useful tool when a big project is completed, or as a regularly scheduled review of how the team is doing. The ultimate goal of a retrospective is to look back on the work completed and identify what went well, what didn’t, and where improvement is needed.
Using the information, teams work together to create and implement solutions. Doing so fosters collaboration among teams while also ensuring that everyone’s voice is heard during the process. The retrospective can serve as a hub for this follow-up work. Keep track of all the systemic improvements that result from a given incident in the retrospective, and check in with it to make sure people are on track.
What are the different types of retrospectives I can do?
There are many types of retrospectives that your team can do, so it’s essential to look for templates that support the type you choose.
Types of retrospectives could include:
- Habit building: What are things your team needs to start doing and/or stop doing? What are some things to continue doing? Group ideas together as they come in, and talk about common themes and next steps. Building and putting the spotlight on helpful processes can make team members execute them more naturally.
- Emotional: Another type of retrospective is to consider emotional health and how that can improve. What are team members upset about, and what are they happy about? How does that change before, during, and after an incident? If team members feel heard and supported, they’ll have psychological safety to continue working at their best.
- Vision building: How do teams envision their work personally and in the larger context? What is stopping them from achieving that vision, and how can the team move forward as a whole and as individuals?
- Process: This can work for incident management and other situations where teams come together to identify what works in their current processes and what needs to improve.
- Incident management: After a major incident, teams can come together to discuss what went well, what didn’t, and what they learned to improve incident management moving forward. Below is an example of an incident retrospective template.
Sections of a good retrospective
The summary should contain 2-3 sentences that give a reader an overview of the incident’s contributing factors, resolution, classification, and customer impact level. The briefer, the better, as this is what engineers will look at first when trying to resolve a similar incident.
Example: Google Compute Engine Incident #17007
This summary states “On Wednesday 5 April 2017, requests to the Google Cloud HTTP(S) Load Balancer experienced a 25% error rate for a duration of 22 minutes. We apologize for this incident. We understand that the Load Balancer needs to be very reliable for you to offer a high quality service to your customers. We have taken and will be taking various measures to prevent this type of incident from recurring.”
This section describes the level of customer impact. How many customers did the incident affect? Did customers lose partial or total functionality? Adding tags can be helpful here as well to help with future reporting, filtering and search.
Example: Google Cloud Networking Incident #19009
In the section titled “DETAILED DESCRIPTION OF IMPACT,” the authors thoroughly break down which users and capabilities were affected.
The follow-up actions section is incredibly important for making accountability forward-looking: it records how the incident’s contributing factors will be addressed. Follow-up actions can include upgrading your monitoring and observability, bug fixes, or even larger initiatives like refactoring part of the code base. The best follow-up actions also detail who is responsible for each item and when the rest of the team should expect an update.
Example: Sentry’s Security Incident (June 12 2016)
While detailed action items are rarely visible to the public, Sentry did publish a list of improvements the team planned to make after this outage covering both fixes and process changes.
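As a rough illustration of tracking follow-up actions with an owner and an expected update date, here is a minimal Python sketch. The `FollowUpAction` structure, the `overdue` helper, and the sample items are all hypothetical, invented for this example rather than taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FollowUpAction:
    """One follow-up item from a retrospective (hypothetical structure)."""
    description: str
    owner: str          # who is responsible for the item
    next_update: date   # when the team should expect an update
    done: bool = False


def overdue(actions, today):
    """Return open actions whose promised update date has already passed."""
    return [a for a in actions if not a.done and a.next_update < today]


# Sample data, made up for illustration only.
actions = [
    FollowUpAction("Add alerting on load balancer error rate", "alice", date(2024, 1, 10)),
    FollowUpAction("Fix retry bug in deploy script", "bob", date(2024, 1, 20)),
    FollowUpAction("Refactor config loader", "carol", date(2024, 1, 5), done=True),
]

for action in overdue(actions, today=date(2024, 1, 15)):
    print(f"{action.owner}: {action.description} (update was due {action.next_update})")
```

A periodic check like this is one way to use the retrospective as a hub for follow-up work: completed items drop out, and items past their update date are flagged for a check-in.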
With the increase in system complexity, it’s harder than ever to pinpoint a root cause for an incident. Each incident might have multiple dependencies that impact the service. Each dependency might result in action items. So there is no single root cause. To determine a contributing factor, consider using “because, why” statements.
Example: Travis CI’s Container-based Linux Precise infrastructure emergency maintenance
In this retrospective, the authors cover contributing factors such as a change in how the Docker backend executes build scripts, missing alerting coverage for the errors, and more.
This section is one of the most important, yet one of the most rarely filled out. The narrative section is where you write out an incident like you’re telling a story.
Who are the characters and how did they feel and react during the incident? What were the plot points? How did the story end? This will be incomplete without everyone’s perspective.
Make sure the entire team involved in the incident gets a chance to write their own part of this narrative, whether through async document collaboration, templated questions, or other means.
The timeline is a crucial snapshot of the incident. It details the most important moments and can contain key communications, screenshots, and logs. Building it by hand is often one of the most time-consuming parts of a post-incident report, which is why we recommend using tooling to aggregate the timeline automatically.
Technical analyses are key to any successful retrospective. After all, this section serves as a record and a possible resolution guide for future incidents. Any information relevant to the incident, from architecture graphs to related incidents to recurring bugs, should be detailed here.
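To make the idea of automatic timeline aggregation concrete, here is a minimal Python sketch that merges events from several sources into one chronological list. The `TimelineEvent` type, the `build_timeline` function, the source names, and the sample events are assumptions made for this example; real tooling would pull events from your chat, alerting, and deployment systems.

```python
from datetime import datetime
from itertools import chain
from typing import NamedTuple


class TimelineEvent(NamedTuple):
    """A single timeline entry (hypothetical structure)."""
    timestamp: datetime
    source: str
    message: str


def build_timeline(*sources):
    """Merge per-source event lists into one list ordered by timestamp."""
    return sorted(chain.from_iterable(sources), key=lambda e: e.timestamp)


# Sample events, made up for illustration only.
chat = [
    TimelineEvent(datetime(2024, 1, 15, 14, 7), "chat", "Incident channel opened"),
    TimelineEvent(datetime(2024, 1, 15, 14, 31), "chat", "Rollback confirmed, errors dropping"),
]
alerts = [
    TimelineEvent(datetime(2024, 1, 15, 14, 2), "alerting", "Error rate above threshold"),
]
deploys = [
    TimelineEvent(datetime(2024, 1, 15, 13, 55), "deploy", "Release deployed"),
    TimelineEvent(datetime(2024, 1, 15, 14, 20), "deploy", "Rollback started"),
]

timeline = build_timeline(chat, alerts, deploys)
for event in timeline:
    print(f"{event.timestamp:%H:%M} [{event.source}] {event.message}")
```

Sorting on the timestamp alone keeps the sketch simple; events with identical timestamps stay in the order their sources were passed in, because Python's sort is stable.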
Here are some questions to answer with your team:
- Have you seen an incident like this before?
- Has this bug occurred previously, and if so, how often?
- What dependencies came into play here?
Incident management process analysis
At the heart of every incident is a team trying to right the ship. But how does that process go? Is your team panicked, hanging by a thread and relying on heroics? Or, does your team have a codified process that keeps everyone cool? This is the time to reflect on how the team worked together.
Here are some questions to answer with your team:
- What went well?
- What went poorly?
- Where did you get lucky and how can you improve moving forward?
- Did your monitoring and alerting capture this issue?
Communication during an incident is a necessity. Stakeholders such as managers, the lines of business (e.g., sales, support, PR), C-level executives, and customers will want updates. But internal and external communication might look very different. Even internal communication might differ between what you would send to a VP of Engineering and what you would send to your sales team.
Here, document the messaging that was disseminated to different categories of stakeholders. This way, you can build templates for the future to continue streamlining communication.
Example: Google Compute Engine Incident #15056
In this incident, Google ensures that all major updates are regularly communicated. The team also lets users know when they can next expect to be updated. “We are still working on restoring the service of Google Compute Engine Persistent Disks in europe-west1-b. We will provide another status update by 19:00 US/Pacific with current details.”
Other Best Practices to Keep in Mind
- Do the report within 48 hours
- Ensure reports are stored where they can be dynamically surfaced during incidents
- Add graphics and charts to help readers visualize the incident
- Be blameless. Remember that everyone is doing their best and failure is an opportunity to learn
Failure is the most powerful learning tool, and deserves time and attention. Each retrospective you complete pushes you closer to optimal reliability. While they do take time and effort, the result is an artifact that is useful long after the incident is resolved.
By using this template, your team is on the way to taking full advantage of every incident.