
DevOps & SRE Words Matter: How Our Language has Evolved

Emily Arnott
DevOps Practices

As the tech world changes, language changes with it. New technologies will always introduce new terms and descriptions to provide clear understanding. For example, the emergence of the cloud introduced language to describe the changing relationship between servers and clients. Then, of course, product providers will also dictate how their products are to be described, such as describing services as “cloud-native”.

On other occasions, language changes through deliberate effort to influence behavior. Thought leaders will often invent alternative words to describe existing ideas in order to effect cultural change. Even a slight change in diction can massively affect a person’s engagement, attitude, and even worldview. In this blog post, we’ll look at how language colours how we perceive our environments, and we’ll break down three examples of how language has evolved in tech.

How language affects and shifts world perspectives

We all have associations with language. Because of our past experiences and culture, different types of messages will trigger different emotional responses. The language we use thus influences the way we think. Whether our associations are positive or negative can impact things such as:

  • Whether we dread something or get excited by it
  • How important we perceive something to be
  • Whether we perceive something as collaborative or combative
  • Whether it seems innovative or legacy
  • Whether it seems bleeding-edge or mainstream
  • Whether it seems safe or provocative

“Postmortem” vs. “Retrospective”

Both of these terms refer to a document that summarizes a past incident and the steps that were taken to resolve it. “Postmortem” was originally a medical term dating back to the 1820s. The metaphorical usage of examining other things after their “death” has been widely used in many industries, including tech.

In recent years, many organizations have begun differentiating the idea of a retrospective from a postmortem as the cultural mindset shifts toward ongoing learning from events and failures. The two practices are commonly considered to have some small differences, such as the timing and content of the documents. However, just as important as these differences are the psychological effects of the terminology being used, especially when these reviews are conducted in a high-pressure environment. Here are some of the reasons we’re using “retrospective” instead of “postmortem” at Blameless.

The negativity of postmortems: death has a negative association in most people’s minds. As responders attend to incidents, the negative connotation lingers. Engineers may feel worried about the consequences of an incident, and the idea of “death” surrounding this process may encourage feelings of guilt and fear. By removing negative associations, people will be more eager to review and look back at what actually occurred and take the time to revisit it as a team. 

The finality of postmortems: at Blameless, we don’t see failure as the end. We see it as an opportunity to learn and grow, a starting point for positive change. Postmortems are very final; no examination happens “post-postmortem”. A retrospective implies that you’re looking back at something, whether it just happened or occurred a while ago, that could still serve a purpose in the future.

The wide scope of retrospectives: a postmortem is defined by the single moment of failure and works backwards to determine the causes. A retrospective is concerned with more than just the direct causes of failure. Instead, it seeks to tell the complete story of the service, systems, and people, up to and beyond the incident.

We want our incident retrospectives to be documents that we are proud to contribute to, that serve as hubs of learning and impetus for change going forward. We believe the word “retrospective” conveys this intent much better than “postmortem”.

“Root Cause Analysis” vs “Contributing Factors Analysis”

When determining why something went wrong, there are several competing schools of thought. The root cause analysis, or RCA, is a popular tool for uncovering the reason for failure. The idea of a “root cause” as being the primary factor causing failure dates back to the early 1900s, with “root cause analysis” emerging as a concept in engineering companies in the 1930s. It is commonly attributed to Sakichi Toyoda, founder of Toyota Industries, who developed the Five Whys technique to find root causes.

Contributing factors analysis is a more recent term that has been growing in popularity. It also seeks to understand the causes of an incident, but with a different mindset. That mindset is reflected in the language itself as much as any specific practice. Let’s look at some examples of these differences, and why we at Blameless feel the contributing factors analysis is more useful.

The singularity of RCAs: the most obvious difference is that a root cause analysis refers to a singular root cause, whereas a contributing factors analysis emphasizes multiple factors. This is more important than it may seem. If you set out looking for a singular cause, you’ll resist branching out to other impactful areas. For example, if you only look for an engineering cause, you’ll disregard factors arising from product design or team culture.

The hierarchy of RCAs: the idea of a “root” cause is that it is the source from which other causes grow and branch off. Understanding which causes are more significant for the incident is necessary to properly prioritize follow-up items, but it isn’t the full story. You also have to consider how these changes will affect the team and system as a whole. Thinking about each factor’s contribution without trying to determine which is the “root” keeps you more open-minded.

The neutrality of contribution: when considering the cause of an incident, you’ll be inclined to find failures, mistakes, and other negative things. Instead you can think about every factor that contributed to the story of the incident - including things that went well, like helpful playbooks and good communication. The totality of this factor analysis gives you a more complete picture of how to respond to incidents going forward.
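To make the contrast with a single root cause concrete, here is a minimal Python sketch of how a contributing factors list might be recorded. The `Factor` and `Effect` names, the area labels, and the sample entries are all hypothetical illustrations, not part of any Blameless product or a prescribed format. The point is only that each factor carries an area and a neutral "helped or hindered" label, and no entry is singled out as the root.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Effect(Enum):
    """Whether a factor helped or hindered during the incident (hypothetical labels)."""
    HELPED = "helped"
    HINDERED = "hindered"


@dataclass(frozen=True)
class Factor:
    description: str
    area: str  # e.g. "engineering", "product design", "team culture"
    effect: Effect


def group_by_area(factors):
    """Group factors by area so no single area dominates the analysis."""
    grouped = defaultdict(list)
    for factor in factors:
        grouped[factor.area].append(factor)
    return dict(grouped)


# Invented sample data for illustration only.
factors = [
    Factor("Deploy lacked a canary stage", "engineering", Effect.HINDERED),
    Factor("Failover playbook was up to date", "engineering", Effect.HELPED),
    Factor("Feature flag defaults were ambiguous", "product design", Effect.HINDERED),
    Factor("Responders communicated clearly", "team culture", Effect.HELPED),
]

by_area = group_by_area(factors)
helped = [f for f in factors if f.effect is Effect.HELPED]
```

Keeping the things that went well in the same list as the things that went wrong is a deliberate choice in this sketch: it mirrors the idea that helpful playbooks and good communication are part of the incident's story too.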

Blameless advocates SRE as a holistic practice, one that incorporates learning from all available sources. The Contributing Factors Analysis brings in as many sources as possible to best understand incidents.

“Disaster Recovery” vs “Incident Response”

The overall process initiated by something going wrong has gone by different names over the years. The attitudes people have towards this have changed alongside the evolution of language and terminology. At first, organizations typically referred to this as disaster recovery. This terminology dates back to the 1970s, when it focused on how systems would recover if natural (or other) disasters wiped out infrastructure and its ability to operate.

As IT systems became more virtual, outages started to be caused by a much wider range of technical problems than natural disasters alone. Organizations moved to referring to this process as incident response to reflect the range of problems and new processes and tools. The processes themselves also evolved along with the technology changes. Let’s look at how these terms reflect the attitudes of each era, and why we now use incident response.

The singularity of recovery: incident response, sometimes referred to as incident management, is much more than just restoring the environment to its previous state. After services are back online, you still need to gather information from the incident itself and build a retrospective, develop action items to carry the learning forward, and review the effectiveness of the response steps and procedures. Recovery is really only the first step towards resolution, and doesn’t convey how you can get the most learning and improvement from each incident.

The severity of disasters: people see disasters as major catastrophic events. Setting up policies and procedures to trigger only in the event of a “disaster” is a very high bar. However, your incident response process should work just as efficiently for all incidents. In other words, not all incidents are “Sev 1”, and so knowing the right steps to take depending on each incident is equally important. We believe there’s learning in every incident, and so every incident is worth responding to properly.
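As a rough illustration of "the right steps depending on each incident", the Python sketch below maps severity levels to response steps. The severity names, the step names, and the `response_steps` function are assumptions made up for this example; they do not describe the Blameless platform or any standard. What the sketch tries to show is that the learning steps are shared by every severity, while only the escalation steps differ.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; lower number means higher severity."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# Learning steps that apply to every incident, whatever its severity.
COMMON_STEPS = ["record timeline", "write retrospective", "file action items"]

# Escalation steps that vary by severity (illustrative names only).
EXTRA_STEPS = {
    Severity.SEV1: ["page on-call", "open incident channel", "notify customers"],
    Severity.SEV2: ["page on-call", "open incident channel"],
    Severity.SEV3: ["create ticket"],
}


def response_steps(severity):
    """Return the escalation steps for this severity followed by the common steps."""
    return EXTRA_STEPS[severity] + COMMON_STEPS
```

Calling `response_steps(Severity.SEV3)` in this sketch still ends with the timeline, retrospective, and action items, so a minor incident gets the same learning loop as a major one.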

The inevitability of incidents: disasters are also thought of as something to avoid at all costs. Any effort spent on reducing the chances of a disaster would be justified, given how severe disasters can be to both customers and engineering teams. A goal of zero disasters is reasonable. However, we know that 100% reliability is impossible. By recognizing the inevitability of incidents, you embrace them and avoid overspending on infrastructure and other resources in trying to prevent them. Using the term “incidents” vs “disasters” helps team members understand their true inevitability and impact.

Incident response is a major component of the Blameless platform. We are the foundation teams rely on when things go wrong, no matter the severity of the impact.

As language and culture in the market evolve, you should embrace change and the new concepts that emerge. Blameless helps you modernize and stay current. To find out how, check out a demo.
