Do blameless retrospectives (or postmortems) help your team? We will explain what they are, whether they really work, and how to do them right.
What is a Blameless Postmortem?
A blameless postmortem (or retrospective) is a post-incident document that helps teams figure out why an incident happened, and brainstorm how to improve the process to prevent similar incidents from happening again.
In most engineering organizations, everyone agrees that in complex systems, failure is inevitable. It’s possible to prevent the recurrence of certain incidents, reduce their impact, or shorten the time to resolution. However, it’s impossible to avoid them altogether.
In the end, what really matters is how we respond to failure. A blameless retrospective begins with a cultural mindset: no one throws blame at any individual or team. Incident postmortems are also referred to as incident retrospectives, post-incident reviews, or post-action reviews, and may include a root cause analysis (RCA) exercise.
What is Included in the Incident Retrospective Document?
Through the retrospective process, engineers whose actions contributed to the incident provide detailed accounts without fear of punishment or retribution. Each detailed account includes:
- Their actions at a particular time
- The effects they observed
- Their understanding of the timeline of events
Goals of the Incident Retrospective
SREs work on complex distributed systems that are continuously changing. New features are added, and with change comes the likelihood of instability. When an incident occurs, SREs identify and resolve the underlying issues so the service can return to normal operation.
However, unless there’s a formal plan in place to learn from the process, the same patterns will likely recur indefinitely. A blameless retrospective is primarily a learning opportunity for everyone across the engineering team and, in fact, the whole company. The goal is not to find a culprit, but to find out how and why an incident happened and how to stop it from happening again. Since the whole process is blameless, no one is singled out for the incident, and everyone can pitch in to help solve it without fear.
Finally, the incident is documented and preventive measures are put in place. The document is not meant to be written and forgotten; it’s an opportunity for engineers to fix a weakness in the system and make it more resilient. The incident retrospective should contain action items that carry the lessons from incident response into development work. These action items should be reviewed alongside the retrospective to ensure they’re completed.
Why it’s Only Human to Blame
In a 2010 TED talk, Brené Brown, an American researcher, described the human tendency to blame as “a way to discharge pain and discomfort”. We, as humans, are hardwired to use blame to channel our pain and discomfort. Going against millennia of biology and sociology is a tough task, so it makes sense that, in the back of our minds, we tend to point blame, and usually away from ourselves.
It’s difficult to adopt a blameless culture because doing so means overriding an ingrained part of ourselves that uses blame as a coping mechanism.
Blame Vs. Blameless
In the blameless culture, language is of utmost importance. Here are a few statements that are blameless versus blameful.
How is Blame Culture Restraining Growth?
Blameless culture ensures that people won’t be blamed or fired for errors, mishaps, and events.
You might believe that everyone gets off the hook for making a mistake in a blameless culture. That’s not entirely true. Having a blameless culture means that instead of focusing your efforts on the witch hunt, you’re investigating mistakes to emphasize the systemic causes of failure.
Anyone would be reluctant to say that they messed up if they were afraid of getting fired or ridiculed. That cultivates a culture of fear, and fearful employees don’t usually call attention to potential problems early. They would rather not take the blame if things go south.
Throwing blame around is a poisonous practice and recognizing this negative culture is a good start. However, cultivating a true blameless culture requires significant organizational maturity.
Adopting a blameless culture is a gradual process and requires consistent effort over time, according to a STELLA report on how to better cope with the complexities of anomalies. We have summarized various tiers of the “blame culture” in the table below.
How to Cultivate a Blameless Culture in your Organization?
Cultivating a blameless culture in your organization is easier said than done. Acknowledging blame and working past it can be a challenging task. To start, all you need is the right attitude to use the right words. The words that we use to discuss events are very important.
Never underestimate the importance of words.
The role of the leader in this scenario is especially critical. According to an article on higher ambition leaders, such leaders assume personal responsibility when things are not going well, and share credit when things are going well. When a leader takes the blame and shares credit, people trust them more and have the psychological safety to be ambitious.
Here are a few things that you can adopt to cultivate a blameless culture in your organization:
Ask the What, How, and Why Questions
In a blameless culture, asking the “what” and “how” questions helps us analyze a situation without placing blame on a person. “What” questions uncover reasoning, an important factor in building empathy, while “how” questions articulate the mechanics. For example:
“What steps did you take to resolve the incident?” or “How did you respond to the incident?”.
The “why” questions allow you to better understand the systemic causes of the incident. By understanding why someone took an action that turned out to be mistaken, you’ll find out what information they lacked or what safeguards should have been in place.
Avoid the Who Questions
“Who” always points towards an individual. It is used to ask for a scapegoat, someone to blame the incident on.
A blameful culture focuses so much on people that learning how the incident occurred and how to stop it from happening again is often lost in the mayhem. Neither of those questions requires blaming someone or pointing fingers.
How Do Incident Retrospectives Work?
Incident retrospectives generally involve two major components: a retrospective document and a meeting reviewing the document.
In the meetings, an emphasis on blamelessness is key. Conversations are focused on gaps in the service rather than the people working on the service. Throughout the meeting, make sure that any biased remarks targeting individuals are redirected to investigating systemic issues.
Share the Meeting Agenda
Before the meeting, sharing an agenda that covers all items that will be reviewed in the meeting is a great idea. That way, everyone comes prepared and no one is blindsided. Additionally, let everyone know explicitly that the meeting will be blameless. The goal is to learn from failure, and not to blame each other.
Start on the Right Note and Maintain It
At the very beginning, reiterate and focus on the fact that the meeting is not about the blame game. Since your participants may be nervous, you can use a little humor to set a pleasant tone. Throughout the meeting, focus on what went right and applaud it. The actual meeting is usually no longer than a couple of hours. Ideally, it’s just long enough that everyone can discuss ideas at length.
Focus on the Contributing Actions
To map the incident, you need to focus on the steps in the incident instead of the people involved. To understand the situation, ask the following questions:
- What systems were affected?
- Who was involved in responding on the spot?
- How did we find out about the incident?
- When did we start responding to the incident?
- What temporary or permanent mitigation steps did we deploy?
This information should be collected into the blameless retrospective document. This document can form the foundation for deeper conversation.
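As a rough illustration, the answers to the questions above could be captured in a small structured record before the meeting. The sketch below is only one possible shape: the class name `IncidentRecord`, its field names, and the sample values are all hypothetical and are not taken from any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    """Hypothetical container for the answers to the mapping questions."""
    systems_affected: list[str]      # What systems were affected?
    responders: list[str]            # Who was involved in responding?
    detected_via: str                # How did we find out?
    detected_at: datetime            # When did we find out?
    response_started_at: datetime    # When did we start responding?
    mitigations: list[str] = field(default_factory=list)  # Mitigation steps

    def time_to_respond(self) -> timedelta:
        """Gap between detection and the start of the response."""
        return self.response_started_at - self.detected_at

    def missing_sections(self) -> list[str]:
        """Names of sections that still need to be filled in."""
        sections = {
            "systems_affected": self.systems_affected,
            "responders": self.responders,
            "detected_via": self.detected_via,
            "mitigations": self.mitigations,
        }
        return [name for name, value in sections.items() if not value]


# Made-up example values, for illustration only.
record = IncidentRecord(
    systems_affected=["checkout-api"],
    responders=["on-call engineer"],
    detected_via="latency alert",
    detected_at=datetime(2024, 1, 10, 14, 0),
    response_started_at=datetime(2024, 1, 10, 14, 12),
)
print(record.time_to_respond())   # 0:12:00
print(record.missing_sections())  # ['mitigations']
```

Keeping the record focused on systems, times, and steps, rather than on who made which decision, is one way to keep the document itself blameless.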
Hear from the Right People
During the meeting, the people who were directly involved in resolving the issue must be heard. That includes people who introduced, identified, responded to, debugged, or resolved the issue, as well as anyone who wants to weigh in on it.
Along with those directly involved, people from all across the organization should be welcome to attend the meeting. They can offer unique insights into possible action items, and the learning around this specific incident can become broadly applicable.
Identify Potential Solutions
Once you’ve identified the potential problem and spoken to the relevant teams, it’s time to do a root cause analysis (RCA) to identify the action items. The action items are then prioritized (Priority0, Priority1, onwards), categorized according to type (prevent, mitigate, other), and assigned to various team members. These action items shouldn’t be able to fall through the cracks. Make sure clear timelines are set out and progress is regularly reviewed. The goal is to make sure any lessons learned from the incident are shared among everyone who can act on them.
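To make the tracking concrete, here is a minimal sketch of how prioritized, categorized action items could be modeled and reviewed. Everything in it is an assumption made for illustration: the `ActionItem` class, the `Kind` enum, the helper functions, and the sample items are hypothetical, and most teams would keep this data in their existing ticketing system instead.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Kind(Enum):
    """Action item types named in the text above."""
    PREVENT = "prevent"
    MITIGATE = "mitigate"
    OTHER = "other"


@dataclass
class ActionItem:
    title: str
    priority: int   # 0 = Priority0 (most urgent), 1 = Priority1, and so on
    kind: Kind
    owner: str      # team or role the item is assigned to
    due: date
    done: bool = False


def review_order(items):
    """Sort items by priority first, then by due date."""
    return sorted(items, key=lambda item: (item.priority, item.due))


def overdue(items, today):
    """Open items whose due date has already passed."""
    return [item for item in items if not item.done and item.due < today]


# Made-up example items, for illustration only.
items = [
    ActionItem("Add alert on queue depth", 1, Kind.MITIGATE, "team-a", date(2024, 2, 1)),
    ActionItem("Add config validation to deploy step", 0, Kind.PREVENT, "team-b", date(2024, 1, 20)),
    ActionItem("Update runbook", 2, Kind.OTHER, "team-a", date(2024, 1, 15), done=True),
]

for item in review_order(items):
    print(f"P{item.priority} [{item.kind.value}] {item.title} -> {item.owner}")

late = overdue(items, today=date(2024, 1, 25))
print([item.title for item in late])  # ['Add config validation to deploy step']
```

Running a check like `overdue` as part of each regular review is one simple way to keep action items from falling through the cracks.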
Document the Meeting
Postmortem meetings shouldn’t be audio-only. You need to be able to reconstruct what happened after the fact, which requires capturing the details in writing. That can be done by recording the meeting and taking notes alongside it. The notes can then be shared with the teams for discussion of potential solutions.
How Often Should You Perform an Incident Retrospective?
An incident retrospective usually takes place after any incident. Ideally, the meeting is held shortly after the incident (within 48 hours), while the context is still fresh in the minds of all responders.
Retrospectives usually fall into two categories. The first type of retrospective meeting is held after a DevOps or IT incident such as data corruption or website crash. The second type of retrospective takes place after project completion where the team looks at the project from the start to the end to determine what went smoothly and what can be improved.
Incident retrospectives should be performed after every incident. Every incident has the potential to teach you something about your system. Depending on the severity, the amount of time spent on each retrospective can vary.
Who is Involved in the Postmortem Process?
Humans are the most important part of incident management and retrospectives. Often there's a conundrum where there's too much communication and too little communication happening at the same time. Everyone wants to participate and sometimes, in the spirit of participation, they end up talking without contributing relevant information. Among all that noise, the people doing all the work don't get a say. Therefore it's important to ensure that you invite all the right people to the postmortem meeting and remain focused on the problem at hand.
At the top of the incident response hierarchy is the incident commander (IC), the team lead responsible for driving the incident to resolution. To avoid the bystander effect, the IC selects a postmortem owner, whose job is to ensure that the process goes smoothly.
The retrospective or postmortem is a collaborative process that starts with a draft written by the incident commander. All participants contribute relevant information using a collaborative tool such as Google Docs. It’s easier to collaborate as you go rather than piecing together everything after the fact.
Incident response team members who attend the postmortem meeting include:
- Incident commander (IC): responsible for keeping the discussion on track and focused on solutions.
- Incident commander shadow or deputy (if any): helps the IC stay on track.
- Postmortem owner: the person selected by the IC to keep the postmortem process running smoothly.
- Project manager: PM of the affected system.
- Key engineer: the one involved directly in incident response.
- Project team members: the ones who responded to the incident and contributed to its resolution.
- Scribe: documents the incident timeline and records all important decisions for review later.
- Internal liaison: interacts with the internal stakeholders by notifying the internal team or mobilizing more responders within the organization.
- Customer liaison: interacts with customers either directly or through public communication channels.
- Subject matter expert (SME) or responder: a domain expert who helps identify and fix issues within the service.
How can Blameless Help with the Process?
The blameless culture originated in the aviation and healthcare industries, where mistakes can be fatal. Flight controllers are considered the first SREs. They need to identify any potential causes of failure early and work toward mitigating them. Failure is an inevitable part of any operation, and blameless postmortems help companies learn from their mistakes and grow. At the end of the day, what’s even the point of blaming? It’s easier to fix a mistake in the system than to fix a human being!
Conducting a blameless retrospective can be testing at the beginning, but with Blameless’s Deep Retrospective Platform, you can turn incidents into investments. It takes the toil out of post-incident learning and turns it into a collaborative experience for your business.