In 2005, a major explosion at BP’s refinery in Texas City killed 15 people and injured 180 others.
The vice president publicly blamed their staff, stating that “if [our people] followed the start-up procedures, we wouldn’t have had this accident.” However, when analyses found the explosion to be “years in the making” due to substandard equipment and inadequate safety procedures, the blame was then placed on management officials for choosing to operate under flawed conditions. The safety conditions did not improve after this incident, and more incidents occurred after the initial explosion, resulting in over $100 million in fines, dozens of lawsuits, and payouts to victims of up to $1.6 billion. A decade later, fires continue to ravage refineries on a weekly basis and data shows that at least 58 people have died in U.S. refineries since 2005.
What does this story tell us about the nature of blame? For one, it highlights our tendency as a species to focus on human fallibility as the reason for failure. This theme recurs across industries, particularly in aviation safety and healthcare. In fact, the U.S. National Transportation Safety Board lists “people” as the probable cause in 96% of accidents. We have this innate need to localize error, to find “what” broke, and to determine who was the “Bad Apple.”
Psychologists term this the Fundamental Attribution Error — the belief that individuals, not situations, cause error. For example, if an incident takes a site down and we discover during the post-mortem that the engineer in charge could have prevented the incident, we tend to judge the engineer as sloppy or neglectful instead of considering other situational factors that could have contributed to the crash (such as an understaffed team or a lack of standardization of development practices).
It’s hard to fault this tendency because blame satisfies two key human needs.
While it is a deeply unsettling thought, it also speaks to a wider culture of blame.
Companies fall into this trap when they breed a culture of placing blame on people. By punishing and scapegoating individuals, companies create an environment of fear. Fearful employees are not incentivized to surface incidents early, resolve technical debt buried deep within the system, or take risks in shipping new features — all out of fear of getting blamed for things going south.
When companies consistently try to resolve problems by blaming individuals, they miss out on the thorough, multi-level introspection required to design system-wide improvements that can prevent future failures. If we don't improve these systems, servers will continue to fail, airplanes will continue to crash, and explosions will continue to take innocent lives.
Recognizing a poisonous culture is a good start, but cultivating a culture of true blamelessness requires significant organizational maturity.
The following table summarizes the different tiers of “blame culture” present in companies today. This was adapted from STELLA, a report published by a consortium of major site reliability engineers (SREs) on how to better cope with the complexities of anomalies.
A culture of true blamelessness focuses on finding systems-level problems and engaging in collaborative dialogue with employees to mitigate future incidents. Far too often, we attribute one singular reason to failure, but how is it that when success is universally acknowledged as requiring multiple contributing conditions, failure needs just one?
Emerging research has shown that failures in IT systems are often caused by complex and non-linear conditions. Therefore, in order to discover the full story of an incident, we need to draw out the narratives and circumstances leading up to it. Sidney Dekker, one of the foremost scholars in human factors theory, writes:
“Rather than judging people for not doing what they should have done, the new view presents tools explaining why people did what they did. Human error becomes a starting point, not a conclusion.”
The most successful blameless post-mortems do just this.
They draw out not only the technical barriers that caused an incident, but also valuable insights into the organizational, economic, and political factors that contribute to success and failure within a company. By adopting a culture of true blamelessness, companies empower employees to anticipate and adapt to unforeseen circumstances while learning continuously to cope with the complexity of their systems.
Even experts admit to their own lack of understanding in the face of complexity. The recently released STELLA report proposes that as the complexity of a system increases, the accuracy of an engineer’s own mental model of that system decreases. Its authors found that, time and again during post-mortems, even engineering experts were surprised that their mental models of the system didn’t match its actual behaviour. Sometimes, even their fundamental beliefs were challenged by these unanticipated incidents.
The fact is, complex systems fail. Rather than blaming individuals for these failures, companies need to empower people with the adaptive capacity that machines lack. That is the only way to navigate this complexity.
We can't place the demands for resilience on the individual. Resilience requires addressing the organizational and environmental factors that support or undermine it. Whether it’s setting a tolerance for failure with error budgets, proactively managing technical debt, or implementing blameless post-mortems, companies need to invest in understanding the complex dynamics between humans and systems in order to cultivate resilience.
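To make the idea of an error budget concrete, here is a minimal sketch of the arithmetic. It assumes the budget is counted in failed requests against an availability target over some window. The function names, the 99.9% target, and the request counts are all hypothetical illustrations, not figures from any report cited here.

```python
from fractions import Fraction


def error_budget(slo_target: str, total_requests: int) -> int:
    """Return how many requests may fail in the window without breaking the target.

    slo_target is passed as a string (e.g. "0.999") so it can be converted
    to an exact fraction and avoid floating-point rounding surprises.
    """
    slo = Fraction(slo_target)
    if not 0 < slo < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The budget is the tolerated failure rate times the traffic, rounded down.
    return int((1 - slo) * total_requests)


def budget_remaining(slo_target: str, total_requests: int, failed_requests: int) -> Fraction:
    """Return the unspent share of the budget (negative once it is overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        raise ValueError("window too small to allow any failures at this target")
    return Fraction(budget - failed_requests, budget)


if __name__ == "__main__":
    # Hypothetical numbers: a 99.9% target over one million requests.
    allowed = error_budget("0.999", 1_000_000)
    left = budget_remaining("0.999", 1_000_000, 250)
    print(f"allowed failures: {allowed}, budget remaining: {float(left):.0%}")
```

The point of a number like this is that it turns "how much failure is acceptable?" into a shared, agreed quantity, so a failed request counts against the budget rather than against a person.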
A true blameless culture is the first step to enabling human resilience: the ability to navigate unanticipated, unforeseen, and unexpected incidents. After all, when navigating a world filled with uncertainty and complexity, only the adaptable will succeed.
Written by Rui Su