In 2005, a major explosion at BP’s refinery in Texas City killed 15 people and injured 180 others.
The vice president blamed the staff, stating that “if [our people] followed the start-up procedures, we wouldn’t have had this accident.” Yet analyses found the explosion to be “years in the making” due to substandard equipment and inadequate safety procedures. The blame then shifted to management officials for choosing to operate under flawed conditions.
Safety conditions did not improve after this incident. More incidents occurred after the initial explosion, resulting in over $100 million in fines, dozens of lawsuits, and payouts to victims of up to $1.6 billion. Fires continue to ravage refineries on a weekly basis. Data shows that at least 58 people have died in U.S. refineries since 2005.
What does this story tell us about the nature of blame? For one, it highlights our tendency as a species to focus on human fallibility as the reason for failure. This is a common theme across industries, particularly in aviation safety and healthcare. In fact, the U.S. National Transportation Safety Board lists “people” as the probable cause in 96% of accidents. We have this innate need to localize error, to find “what” broke, and to determine who was the “Bad Apple.”
Psychologists term this the Fundamental Attribution Error. This is the belief that individuals, not situations, cause error. For example, imagine an incident takes a site down. We discover during the retrospective that the engineer in charge could have prevented the incident. We tend to judge the engineer as sloppy or neglectful instead of considering the contributing factors.
It’s hard to fault this tendency, because blame is deeply satisfying: it offers a simple story and a sense of closure. While that is an unsettling thought, it also speaks to a wider culture of blame.
Companies fall into this trap when they breed a culture of placing blame on people. By punishing and scapegoating individuals, companies create an environment of fear. Fearful employees are not incentivized to surface incidents early, resolve technical debt buried deep within the system, or take risks in shipping new features. They don't want anyone to blame them when things go south.
When companies blame individuals to resolve problems, they miss the chance for introspection. That introspection is necessary for the system-wide improvements that prevent future failures. If systems don't improve, they will continue to fail.
Recognizing a poisonous culture is a good start. But cultivating a culture of true blamelessness requires significant organizational maturity.
The following table summarizes the different tiers of “blame culture” in companies today. We've adapted this from STELLA, a report published by a consortium of site reliability engineers (SREs) on how to better cope with the complexities of anomalies.
A blameless culture focuses on finding systems-level problems, and teams engage in collaborative dialogue to mitigate future incidents. Far too often, though, we readily acknowledge that success requires many contributing conditions, yet attribute failure to a single cause.
Research shows that systemic failure is caused by complex and non-linear conditions. So, to discover the full story of an incident, we need to draw out the narratives and circumstances leading up to it. Sidney Dekker, one of the foremost scholars in human factors theory, writes:
“Rather than judging people for not doing what they should have done, the new view presents tools explaining why people did what they did. Human error becomes a starting point, not a conclusion.”
The most successful incident retrospectives do this.
Yes, they draw out the technical barriers causing an incident. But they also provide insights into the systemic factors that contribute to failure. By adopting a culture of true blamelessness, teams can adapt to unforeseen circumstances while learning constantly to manage systemic complexity.
Even experts admit to their own lack of understanding in the face of complexity. The STELLA report proposes that as the complexity of a system increases, the accuracy of an engineer’s own mental model of that system decreases. During retrospectives, even engineering experts discovered that their own mental models of the system didn’t match the actual behavior of the system.
The fact is, complex systems fail. Rather than blaming individuals for these failures, we should empower people. People have the adaptive capacity that machines do not.
We can't place the demands of resilience on individuals alone. Building resilience requires addressing the organizational and environmental factors that support or undermine it. Whether it’s setting a tolerance for failure with error budgets, managing technical debt, or implementing blameless retrospectives, companies need to understand the complex dynamics between humans and systems.
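The error-budget idea mentioned above is simple enough to sketch. The function below is a minimal illustration (not any particular company's implementation, and the function name is my own): an availability SLO implies a budget of allowed downtime per period, and once that budget is spent, a team slows risky releases rather than blaming whoever shipped last.

```python
def error_budget_minutes(slo: float, period_minutes: int = 30 * 24 * 60) -> float:
    """Allowed downtime for the period, given an availability SLO such as 0.999.

    A 99.9% SLO over a 30-day month leaves (1 - 0.999) of the period,
    about 43.2 minutes, as the budget for failure.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("SLO must be a fraction strictly between 0 and 1")
    return (1 - slo) * period_minutes


# A 99.9% availability target over a 30-day month:
budget = error_budget_minutes(0.999)
print(f"{budget:.1f} minutes of downtime allowed")
```

Framed this way, some failure is expected and pre-approved, which takes the sting, and the blame, out of individual incidents.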
A true blameless culture is the first step to enabling human resilience. The ability to navigate unanticipated, unforeseen, and unexpected incidents is key to success. When navigating a world filled with uncertainty, only the adaptable will succeed.
Written by Rui Su