What are Blameless Retrospectives? How Do You Run Them?

In most engineering organizations, everyone agrees that in complex systems, failure is inevitable. It’s possible to prevent the recurrence of certain incidents, reduce their impact, or shorten the time to resolution. However, it’s impossible to avoid them altogether.

In the past, we asserted failures are a result of people’s mistakes. It was all about “the bad apple theory,” focused on finding the “guilty party” and removing them to prevent future failures. Although this approach is about accountability, it assumes negligence, or in some cases even bad intent, leading to action taken against the individual via some form of punishment.

The blame game discourages team members, who come to fear the consequences of unintentional (and, as we now know, inevitable) failures. The silence that follows a failure makes it impossible to find the cause objectively or to work out solutions together.

Today, human error is treated as a symptom of a systemic problem, and collaborators feel empowered to speak up. This creates an environment where mistakes are used to learn what went wrong, avoid repeating them, and improve the system.

In the end, what really matters is how we respond to failure. This is where the blameless retrospective comes into play. The process begins with a cultural mindset, which means no blaming any individual or team. It fosters learning, focusing on improvement by eliminating finger-pointing among team members.

Here, we explain how to run a blameless postmortem and why a blameless culture is the key to creating a more collaborative method of problem-solving.

What is a Blameless Postmortem?

A blameless postmortem (or retrospective) is a post-incident document that helps teams figure out why an incident happened and brainstorm how to improve the process to prevent similar incidents. Also known as incident retrospectives, post-incident reviews, or post-action reviews, the blameless postmortem might also include a root cause analysis (RCA) exercise.

The document is created post-failure as a tool to help teams understand the causes and then work collaboratively to brainstorm ways to prevent similar incidents. This method ensures no one feels their reputation or job is on the line, so honesty prevails, hidden problems are revealed, and all teams involved understand what went wrong.

SREs work on complex distributed systems that are continuously changing. New features are added, and with change comes the likelihood of instability. When an incident occurs, SREs identify and resolve the underlying issues so the service can return to normal operation.

However, unless there’s a formal strategic plan in place to learn from the process, the same patterns will likely recur indefinitely. A blameless retrospective is primarily a learning opportunity for everyone, across the entire engineering team and, in fact, the company. The goal is not to find the culprit but to find out how and why an incident happened and how to stop it from happening again. Since the whole process is blameless, no one is held responsible for the incident, and everyone can pitch in to help solve it without fear.

What is Included in the Incident Retrospective Document?

Through the retrospective process, engineers who contributed to the incident are critical players providing detailed accounts without the fear of punishment or retribution. The detailed account comes directly from primary sources who witnessed or contributed to the failure, including:

·   Their actions at a particular time

·   Efforts observed

·   Expectations

·   Assumptions

·   An understanding of the timeline of events

Goals of the Incident Retrospective

The incident is documented, and preventive measures are put in place. The document is not meant to be written and forgotten; it is an opportunity for engineers to fix a weakness in the system and make it more resilient. The incident retrospective should contain action items that carry the lessons of the incident response into development work. These action items should be reviewed alongside the retrospective to ensure they’re completed.

Why It's Only Human to Blame

In a 2010 TED talk, Brené Brown, an American researcher, described the human tendency to blame as “a way to discharge pain and discomfort.” We, as humans, are hardwired to use blame to channel our pain and discomfort. Going against millennia of biology and sociology is a tough task, so it makes sense that, in the back of our minds, we tend to assign blame, usually away from ourselves.

It’s difficult to adopt a blameless culture because it forces us to ignore an ingrained part of ourselves that uses blame as a coping mechanism.

Blame vs. Blameless Culture

In the blameless culture, language is of the utmost importance. Here are a few statements separating the blameless from the blameful.

Ask How, Not Why

Asking why in a blame culture puts the onus on the actor, which makes them feel like their actions are being scrutinized. Asking how, on the other hand, allows them the opportunity to explain what happened in detail without worries of being blamed for the outcomes of their actions. It allows the person to distance themselves from those actions to create neutral ground. How it happened is more important than why it happened. For example, why it happened might be that I missed a step, while how it happened explains what step was missed. Save the whys for later.

Say We Not You

Another important part of blameless language is avoiding the use of “you.” By focusing on a “we” mentality, blame is not fixed on any one individual. Instead, those involved in the glitch feel like members of a team trying to resolve the issue, rather than targets of the rest of the team's blame. You remove the blame-based victim versus villain storyline. For example, instead of saying, “It appears you didn’t ensure the reliability and performance of these backbone components, which caused an access failure,” you might say something more like, “Okay, so we missed a step in the testing to confirm access to backbone components.” From there, you can focus on implementing proper processes to ensure their reliability and performance.

Use Affirming Language

One last example is using affirming language as individuals participate in the postmortem. By acknowledging contributions and stating how their input is an important part of the puzzle to finding solutions, people will feel more comfortable and willing to provide valuable, honest input that is actionable.

In a blame culture, people get their backs up and lose focus on explaining the systemic factors that led to the incident. Affirmation helps maintain a position of mutual respect. Again, keeping the why questions on the back burner helps keep language more neutral and positive. As a result, you avoid implying someone has acted badly, even in situations where the team is questioning an individual’s actions.

How is Blame Culture Restraining Growth?

Blameless culture ensures that people won’t be blamed or fired for errors, mishaps, and events.

You might believe that everyone gets off the hook for making a mistake in a blameless culture. That’s not entirely true. Having a blameless culture means that instead of focusing your efforts on the witch hunt, you’re investigating mistakes to emphasize the systemic causes of failure.

Anyone would be reluctant to say that they messed up if they were afraid of getting fired or ridiculed. That cultivates a culture of fear, and fearful employees don’t usually call attention to potential problems early. They would rather not take the blame if things go south.

Throwing blame around is a poisonous practice and recognizing this negative culture is a good start. However, cultivating a true blameless culture requires significant organizational maturity.

Adopting a blameless culture is a gradual process and requires consistent effort over time, according to a STELLA report on how to better cope with the complexities of anomalies. We have summarized various tiers of the “blame culture” in the table below.

How to Cultivate a Blameless Culture in your Organization?

Cultivating a blameless culture in your organization is easier said than done. Acknowledging blame and working past it can be a challenging task.

Focus on Language

As mentioned, all we need is the right attitude and language to discuss events.

Never underestimate the importance of words. It’s an excellent place to start.

The role of the leader in this scenario is especially critical. According to an article on higher ambition leaders, they assume personal responsibility when things are not going well, and share credit when things are going well. When a leader takes the blame and shares credit, people trust them more and have the psychological safety to be ambitious.

While we’ve contrasted examples of a blameless versus blame culture above, it helps to further understand ways to use language to start cultivating a blameless culture in your organization:

Ask the What, How, and Why Questions

In the blameless culture, asking the whats and hows helps us analyze a situation without placing blame on a human. “What” unravels reasoning, an important factor in building empathy, and “how” articulates the mechanics. For example:

“What steps did you take to resolve the incident?” or “How did you respond to the incident?”

Once the what and how are understood, the why questions should be left as follow-ups. This allows for a better understanding of the systemic causes of the incident. Understanding why someone took a mistaken action reveals the information they lacked or the safeguards that should have been in place.

Avoid the Who Questions

Blameful culture focuses so much on people that learning how the incident occurred and how to stop it from happening is often lost in the blame mayhem. Neither one of these questions requires blaming someone or pointing fingers. How describes the events and clarifies important technical details.

Who, on the other hand, always points towards an individual, a scapegoat to take blame for the incident. In other words, asking “who” did something insinuates you’re looking for a patsy to take the fall. By focusing on how, teams and individuals are more improvement-minded and honest, without worrying they’ll incriminate themselves.

Recognize Biases

One of the biggest roadblocks in adopting a blameless culture is moving past our biases. By acknowledging common biases, we can overcome them when performing postmortems. Here are some common biases that are likely to come up, with tips to help redirect the discussion:

  • Fundamental attribution error: Believing that what people do reflects their character, rather than the circumstances that triggered their actions, leads to the assumption that character flaws cause human error. Redirecting attention to situational causes takes the focus away from an individual’s actions.
  • Confirmation bias: Favoring information that reinforces assumptions or beliefs based on ambiguous information allows us to support those assumptions based on irrelevant details. In a blameless culture, removing assumptions someone is at fault allows us to see the facts instead of trying to find insignificant details that justify our beliefs. Having an unbiased person ask the questions, such as someone from another team, avoids pointed questions that skew answers.
  • Hindsight bias: Recalling events in a distorted way to avoid blame makes it too easy to see the event as predictable even though no one did, in fact, predict it. It is often related to a know-it-all attitude where the person analyzing the causes acts like they saw it coming and even pointed out the event would occur when the project was underway. It’s important to avoid hindsight in the analysis by focusing on what occurred before the incident, not the incident itself.
  • Negativity bias: This idea relates to the bad apple theory where negative information is given prominence in the analysis. It can even have team members believing there was intent behind the failure in a situation where being neutral is crucial. It creates a mindset that things always go wrong when in reality, they usually go right. This negative attitude is demoralizing to the entire team. Focusing on learning opportunities and the positive steps of individuals helps overcome this bias and keep things in perspective.

How Do Blameless Incident Retrospectives Work?

Incident retrospectives generally involve two major components: a retrospective document and a meeting reviewing the document.

In the meetings, an emphasis on blamelessness is key. Conversations are focused on gaps in the service rather than the people working on the service. Throughout the meeting, make sure that any biased remarks targeting individuals are redirected to investigating systemic issues. 

Share the Meeting Agenda

Before the meeting, sharing an agenda that covers all items that will be reviewed in the meeting is a great idea. That way, everyone comes prepared and no one is blindsided. Additionally, let everyone know explicitly that the meeting will be blameless. The goal is to learn from failure, and not to blame each other. 

Start on the Right Note and Maintain

At the very beginning, reiterate and focus on the fact that the meeting is not about the blame game. Since your participants may be nervous, you can use a little humor to set a pleasant tone. Throughout the meeting, focus on what went right and applaud it. The actual meeting is usually no longer than a couple of hours. Ideally, it’s just long enough that everyone can discuss ideas at length. 

Focus on the Contributing Actions 

To map the incident, you need to focus on the steps in the incident instead of the people involved. To understand the situation, ask the following questions:

  • What systems were affected?
  • Who was involved in responding on the spot?
  • How did we find out about the incident?
  • When did we start responding to the incident?
  • What temporary or permanent mitigation steps did we deploy?

This information should be collected into the blameless retrospective document. This document can form the foundation for deeper conversation. 
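
As a rough illustration, the answers to the questions above could be captured in a small structured record before the meeting. This is only a sketch: the `RetrospectiveRecord` class, its field names, and the sample values are hypothetical and not part of any particular tool.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class RetrospectiveRecord:
    """Facts gathered for one incident. All names here are illustrative."""
    systems_affected: List[str]
    responders: List[str]            # who responded on the spot (for context, not blame)
    detected_via: str                # how we found out about the incident
    response_started_at: datetime    # when we started responding
    mitigations: List[str] = field(default_factory=list)

    def missing_answers(self) -> List[str]:
        """Return the names of the questions that are still unanswered."""
        answers = {
            "systems_affected": self.systems_affected,
            "responders": self.responders,
            "detected_via": self.detected_via,
            "mitigations": self.mitigations,
        }
        return [name for name, value in answers.items() if not value]


# Hypothetical sample incident; mitigation steps have not been filled in yet.
record = RetrospectiveRecord(
    systems_affected=["checkout-api"],
    responders=["on-call engineer"],
    detected_via="latency alert",
    response_started_at=datetime(2024, 1, 15, 9, 30),
)
print(record.missing_answers())  # ['mitigations']
```

A check like `missing_answers()` gives the facilitator a quick way to see which questions still need input before the document is circulated.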

Hear from the Right People

During the meeting, the people who were directly involved in resolving the issue must be heard. That includes people who introduced, identified, responded to, debugged, or resolved the issue, as well as anyone who wants to weigh in on it.

Along with those directly involved, people from all across the organization should be welcome to attend the meeting. They can offer unique insights into possible action items, and the learning around this specific incident can become broadly applicable.

Identify Potential Solutions

Once you’ve identified the potential problem and spoken to the relevant teams, it’s time to do a root cause analysis (RCA) to identify the action items. The action items are then prioritized (Priority0, Priority1, onwards), categorized according to type (prevent, mitigate, other), and assigned to various team members. These action items shouldn’t be able to fall through the cracks. Make sure clear timelines are set out and progress is regularly reviewed. The goal is to make sure any lessons learned from the incident are shared among everyone who can act on them.
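
One way to keep action items from falling through the cracks is to track them as structured data. The sketch below assumes a hypothetical `ActionItem` record with the priority, category, owner, and due date described above; the helper names and the sample items are illustrative only.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List


class Category(Enum):
    PREVENT = "prevent"
    MITIGATE = "mitigate"
    OTHER = "other"


@dataclass
class ActionItem:
    title: str
    priority: int       # 0 = Priority0 (most urgent), 1 = Priority1, and so on
    category: Category
    owner: str
    due: date
    done: bool = False


def review_order(items: List[ActionItem]) -> List[ActionItem]:
    """Open items only, most urgent first, then by earliest due date."""
    open_items = [item for item in items if not item.done]
    return sorted(open_items, key=lambda item: (item.priority, item.due))


def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Open items whose due date has already passed."""
    return [item for item in items if not item.done and item.due < today]


# Hypothetical action items from a single retrospective.
items = [
    ActionItem("Document rollback steps", 1, Category.MITIGATE, "team-b", date(2024, 2, 1)),
    ActionItem("Add access check to test suite", 0, Category.PREVENT, "team-a", date(2024, 1, 25)),
    ActionItem("Update on-call handbook", 2, Category.OTHER, "team-a", date(2024, 1, 20), done=True),
]

for item in review_order(items):
    print(f"P{item.priority} [{item.category.value}] {item.title} (due {item.due})")
```

Running `overdue(items, today)` during each regular review surfaces anything that has slipped past its timeline.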

Document the Meeting

Postmortem meetings shouldn’t be audio-only. You need to reconstruct what happened after the fact, which requires recording every detail. That can be done by recording the meeting and documenting it alongside. The notes can be shared with the teams for discussions and potential solutions. 

How Often Should You Perform an Incident Retrospective?

An incident retrospective usually takes place after any incident. Ideally, the meeting is held shortly (within 48 hours) after the incident, while the context is still fresh in the minds of all responders.
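
For teams that schedule these meetings with tooling, the 48-hour guideline is easy to check. The following is a minimal sketch; it assumes the window starts when the incident is resolved, which the text does not specify, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# The "within 48 hours" guideline; counting from resolution is an assumption.
RETRO_WINDOW = timedelta(hours=48)


def retrospective_deadline(resolved_at: datetime) -> datetime:
    """Latest time the retrospective meeting should start."""
    return resolved_at + RETRO_WINDOW


def is_within_window(resolved_at: datetime, meeting_at: datetime) -> bool:
    """True if the meeting falls between resolution and the deadline."""
    return resolved_at <= meeting_at <= retrospective_deadline(resolved_at)


resolved = datetime(2024, 1, 15, 10, 0)
print(retrospective_deadline(resolved))                          # 2024-01-17 10:00:00
print(is_within_window(resolved, datetime(2024, 1, 16, 14, 0)))  # True
```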

Retrospectives usually fall into two categories. The first type of retrospective meeting is held after a DevOps or IT incident such as data corruption or website crash. The second type of retrospective takes place after project completion where the team looks at the project from the start to the end to determine what went smoothly and what can be improved.

Incident retrospectives should be performed after every incident. Every incident has the potential to teach you something about your system. Depending on the severity, the amount of time spent on each retrospective can vary.

Who is Involved in the Postmortem Process?

Humans are the most important part of incident management and retrospectives. Often there is a conundrum: too much communication and too little communication happen at the same time. Everyone wants to participate and sometimes, in the spirit of participation, they end up talking without contributing relevant information. Amid all that noise, the people doing the work don't get a say. It's therefore important to invite the right people to the postmortem meeting and to remain focused on the problem at hand.

At the top of the incident response hierarchy is the incident commander (IC), the team lead responsible for driving the incident to resolution. To avoid the bystander effect, the IC selects a postmortem owner, whose job is to ensure that the process goes smoothly.

The retrospective or postmortem is a collaborative process that starts with a draft written by the incident commander. All participants contribute relevant information using a collaborative tool such as Google Docs. It’s easier to collaborate as you go rather than piecing together everything after the fact. 

Incident response team members who attend the postmortem meeting include:

  • Incident commander (IC): responsible for keeping the discussion on track and focused on solutions. 
  • Incident commander shadowee or Deputy (if any): helps the IC stay on track. 
  • Postmortem owner: team lead running the incident points. 
  • Project manager: PM of the affected system. 
  • Key engineer: the one involved directly in incident response.
  • Project team members: the ones who responded to the incident and contributed to its resolution.
  • Scribe: documents the incident timeline and records all important decisions for review later. 
  • Internal liaison: interacts with the internal stakeholders by notifying the internal team or mobilizing more responders within the organization.  
  • Customer liaison: interacts with customers either directly or through public communication channels.
  • Subject matter expert (SME) or responder: a domain expert who helps identify and fix issues within the service. 

How can Blameless Help with the Process?

The blameless culture originated in the avionics and healthcare industry, where mistakes can be fatal. Flight controllers are considered the first SREs. They need to identify any potential causes of failure early and work toward mitigating them. Failure is an inevitable part of any operation, and blameless postmortems help companies learn from their mistakes and grow. At the end of the day, what’s even the point of blaming? It’s easier to fix a mistake in the system than fix a human being! 

Conducting a blameless retrospective can be testing at the beginning, but with Blameless’s Deep Retrospective Platform, you can turn incidents into investments. It takes the toil out of post-incident learning and turns it into a collaborative experience for your business. Sign up for a free trial today.
