
Why Every Company Can Benefit from a Blameless Culture

7.1.2019

In 2005, a major explosion at BP’s refinery in Texas City killed 15 people and injured 180 others.

The vice president blamed the staff, stating that “if [our people] followed the start-up procedures, we wouldn’t have had this accident.” But analyses found the explosion to be “years in the making” due to substandard equipment and inadequate safety procedures. The blame then shifted to management officials for choosing to operate under flawed conditions.

Safety conditions did not improve after this incident. More incidents occurred after the initial explosion, resulting in over $100 million in fines, dozens of lawsuits, and payouts to victims of up to $1.6 billion. Fires continue to ravage refineries on a weekly basis. Data shows that at least 58 people have died in U.S. refineries since 2005.

Why It’s Human to Blame

What does this story tell us about the nature of blame? For one, it highlights our tendency as a species to focus on human fallibility as the reason for failure. This is a common theme across industries, particularly in aviation safety and healthcare. In fact, the U.S. National Transportation Safety Board lists “people” as the probable cause in 96% of accidents. We have this innate need to localize error, to find “what” broke, and to determine who was the “Bad Apple.”


Psychologists term this the Fundamental Attribution Error. This is the belief that individuals, not situations, cause error. For example, imagine an incident takes a site down. We discover during the retrospective that the engineer in charge could have prevented the incident. We tend to judge the engineer as sloppy or neglectful instead of considering the contributing factors.

It’s hard to fault this tendency because blame satisfies two key human needs.

  • If we can point to one specific person as the cause of failure, then the rest of the organization is no longer responsible. The thinking goes: By getting rid of these “Bad Apples,” we can get rid of failure. But it is human nature to make mistakes, so it is impossible to remove all failure. Firing employees who made a mistake will not help.
  • We have a deep fear of uncertainty, ambiguity and loss of control. After major catastrophes, we grasp at tangible reasons for failure. If we don’t, we must acknowledge that at any time, any number of complex situational and contextual factors can cause failure.

While it is an unsettling thought, it also speaks to a wider culture of blame.

How Companies Get Stifled by a Culture of Blame

Companies fall into this trap when they breed a culture of placing blame on people. By punishing and scapegoating individuals, companies create an environment of fear. Fearful employees are not incentivized to surface incidents early, resolve technical debt buried deep within the system, or take risks in shipping new features. They don't want anyone to blame them when things go south.

When companies blame individuals to resolve problems, they miss the chance for introspection. This introspection is necessary for system-wide improvements that can prevent future failures. If we don't improve systems, they will continue to fail.

Recognizing a poisonous culture is a good start. But cultivating a culture of true blamelessness requires significant organizational maturity.

Blameless Culture Maturity Model

The following table summarizes the different tiers of “blame culture” in companies today. We've adapted this from STELLA, a report published by a consortium of site reliability engineers (SREs) on how to better cope with the complexities of anomalies.

True Blamelessness

A blameless culture focuses on finding systems-level problems. Teams should be able to engage in collaborative dialogue to mitigate future incidents. Far too often, we attribute failure to one singular reason. We acknowledge that success requires multiple contributing conditions, yet we treat failure as having a single cause.[1]

Research shows that systemic failure is caused by complex and non-linear conditions. So, to discover the full story of an incident, we need to draw out the narratives and circumstances leading up to it. Sidney Dekker, one of the foremost scholars in human factors theory, writes:

“Rather than judging people for not doing what they should have done, the new view presents tools explaining why people did what they did. Human error becomes a starting point, not a conclusion.”

The most successful incident retrospectives do this.

Yes, they draw out the technical barriers causing an incident. But they also provide insights on the systemic factors that contribute to failure. By adopting a culture of true blamelessness, teams can adapt to unforeseen circumstances while engaging in constant learning to untangle systemic complexities.

What a Blameless Culture Enables

Even experts admit to their own lack of understanding in the face of complexity. The STELLA report proposes that as the complexity of a system increases, the accuracy of an engineer’s own mental model of that system decreases. During retrospectives, even engineering experts discovered that their own mental models of the system didn’t match the actual behavior of the system.

The fact is, complex systems fail. Rather than blaming individuals for these failures, we should empower people. People have the adaptive capacity that machines do not.

We can't place the demands for resilience on the individual. Resilience requires addressing the organizational and environmental factors that support or undermine it. Whether it’s setting a tolerance for failure with error budgets, managing technical debt, or implementing blameless retrospectives, companies need to understand the complex dynamics between humans and systems.
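To make "a tolerance for failure" concrete, here is a minimal sketch of the arithmetic commonly used for an error budget: the downtime a service may accrue in a window before it misses its availability target. The function names, the 99.9% target, and the 30-day window are illustrative assumptions, not figures from any team discussed here.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Downtime allowed in `window` for an availability target such as 0.999.

    Hypothetical helper for illustration: budget = (1 - target) * window.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    return window * (1.0 - slo_target)


def budget_remaining(
    slo_target: float, window: timedelta, downtime_so_far: timedelta
) -> timedelta:
    """Budget left after subtracting downtime already spent in the window.

    A negative result means the budget is exhausted.
    """
    return error_budget(slo_target, window) - downtime_so_far


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed rolling window
    print("budget:", error_budget(0.999, window))
    print("remaining:", budget_remaining(0.999, window, timedelta(minutes=20)))
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of tolerated downtime. Framing an incident as "we spent part of the budget" keeps the conversation on the system rather than on who caused the outage.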

A true blameless culture is the first step to enabling human resilience. The ability to navigate unanticipated, unforeseen, and unexpected incidents is key to success. When navigating a world filled with uncertainty, only the adaptable will succeed.

Written by Rui Su and Christina Tan

References

  1. https://www.oreilly.com/ideas/the-infinite-hows
  2. https://qualitysafety.bmj.com/content/26/8/671
  3. https://snafucatchers.github.io/
  4. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3115647/
  5. https://www.kitchensoap.com/2012/02/10/each-necessary-but-only-jointly-sufficient/
  6. http://radar.oreilly.com/2014/11/if-it-werent-for-the-people.html
  7. http://www.bertramgawronski.com/documents/G2007EncycFAE.pdf
  8. https://books.google.ca/books?id=lpIcmy9pLAcC&pg=PA14&lpg=PA14&dq
  9. https://apps.texastribune.org/blood-lessons/disaster/
  10. https://apps.texastribune.org/blood-lessons/
  11. https://www.adaptivecapacitylabs.com/blog/2018/11/06/redeploy-conference-finding-sources-of-resilience/