Failure Analysis: Engineering incidents are a bigger problem than you think

Aaron Lober | 1.5.2023

How big of a problem are engineering incidents?

Engineering incidents can be quite harmful for companies, both in terms of financial costs and reputational damage. In some cases, engineering incidents can even put people's lives at risk, which can have serious legal and moral implications for the company involved. 

You don’t have to look far to find contemporary examples of how software failures wound companies. For instance, Southwest Airlines canceled 2,900 flights during the 2022 winter holiday season, at a cost of over $800 million US. Employees who have been vocal about the matter blame Southwest’s technology as the main problem. More specifically, union members described “inadequate computer systems” that could not handle the overwhelming load of pairing crews with flights. Millions of Americans were left unable to find alternative flights or travel accommodations to reach their final destinations, leading to a major class action lawsuit. Southwest Airlines could pay a costly fine, in the billions of dollars, on top of the public rebuke from the Secretary of Transportation, Pete Buttigieg.

Another very high-profile example of the cost of system failure is the self-driving capability in Tesla automobiles. The Dawn Project says its recent test-track results revealed that the latest version of Tesla’s Full Self-Driving (FSD) Beta software failed to detect a stationary, child-sized mannequin at an average speed of 25mph.

These two examples illustrate very different, extreme types of engineering or platform failure. They show the big, obvious costs of engineering incidents. However, most engineering incidents are smaller and more easily contained, making their impact more subtle and difficult to quantify.
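One rough way to put a number on these smaller incidents is to total the engineer-hours they consume. The sketch below is only an illustration under stated assumptions: the `Incident` record, its field names, and the sample figures are all hypothetical and do not come from any real incident log or tool.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One entry in a hypothetical incident log (illustrative fields only)."""
    name: str
    responders: int        # engineers pulled into the response
    duration_hours: float  # time from detection to resolution


def engineer_hours(incidents: list[Incident]) -> float:
    """Total engineer-hours consumed: responders x duration, summed."""
    return sum(i.responders * i.duration_hours for i in incidents)


# Made-up sample data, not real measurements.
log = [
    Incident("checkout latency spike", responders=3, duration_hours=1.5),
    Incident("failed deploy rollback", responders=2, duration_hours=0.5),
    Incident("expired TLS certificate", responders=4, duration_hours=2.0),
]

total = engineer_hours(log)
print(f"{len(log)} incidents consumed {total:.1f} engineer-hours")
# prints "3 incidents consumed 13.5 engineer-hours"
```

A tally like this leaves out context switching, follow-up work, and the human cost discussed next, so it is best treated as a lower bound rather than a full accounting.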

The subtle human cost of engineering incidents

Engineering incidents can contribute to burnout among engineers, especially if those engineers are dealing with a high volume of incidents or are not properly supported during the incident response process. Burnout can have a number of negative effects, including reduced productivity, increased absenteeism, and a higher likelihood of making mistakes.

To prevent burnout, it's important for engineering teams to have adequate support and resources to manage incidents. This can include having clear procedures in place for responding to incidents, providing regular training and support to engineers, and ensuring that there is adequate time for rest and recovery after an incident. It can also be helpful to have a plan in place for managing the psychological impact of incidents, such as providing access to counseling or other forms of support.

The effects of burnout can’t be overstated. At SREcon23 APAC, the most popular talk at the conference by far was one titled "Burnout: What to do when you just can't". I attended the talk thinking it would be chock full of scientific, actionable steps to combat burnout, as if there were a guaranteed formula for success, and as if humans were robots that could be restored with a simple battery recharge. First, the speaker spent a whole 20 minutes talking about the horrors of on-call and intense outages. It felt drawn out for my taste, but as I looked around, I saw unanimous head nodding, engaged eyes, and a lot of affirming "mmm"s. Then the speaker began sharing her advice on how to react in those situations. She talked about taking breaks, breathing exercises, talking to loved ones, going for walks, and playing with pets. She even mentioned eating proper meals.

After the session, there was chatter throughout the entire expo. Everyone kept saying how much the talk resonated with them, how helpful it was, and how encouraged they felt. This topic is clearly relevant to engineers. In fact, a Google Engineering Manager told me that he’s going to take the advice and actually use it. If you are in a leadership role, assume that your engineers are experiencing some form of burnout. Maybe they aren't right now, but what are you doing to prevent that from happening? Engineers are important. They are not a dime a dozen. They are also human. If you take care of people, they will take care of you too.

How can my site reliability engineering team reduce the impact of incidents?

Whether big or small, it's important for companies to take steps to prevent engineering incidents from occurring, and to have plans in place to mitigate the damage if an incident does happen. This can include regular safety inspections and training for employees, as well as having backup systems and contingency plans in place.

There are a few key ways that your engineering team can reduce the impact of incidents:

  1. Invest in prevention: This means regularly inspecting infrastructure and systems to identify potential issues before they become major problems. It can also include implementing safety protocols and training employees on best practices.

  2. Have a plan: Your team should have a plan in place for responding to incidents, including steps for mitigating the damage and communicating with stakeholders. This plan should be regularly reviewed and updated.

  3. Be transparent: If an incident does occur, it's important to be transparent and open with stakeholders. This can help to build trust and minimize reputational damage.

  4. Learn from incidents: After an incident, your team should conduct a thorough review to identify what went wrong and how to prevent it from happening again in the future. This can help to improve processes and prevent future incidents.

Overall, reducing the impact of incidents requires a combination of prevention, planning, transparency, and continuous learning. By taking these steps, your engineering team can minimize the impact of incidents and help to protect the company's reputation and bottom line.

When should I invest in incident management tooling?

It can be helpful to invest in incident management tooling if your engineering team is frequently dealing with incidents, or if the potential consequences of an incident are particularly severe. Incident management tooling can make it easier to respond to incidents quickly and effectively, which can help to minimize the damage and get systems back up and running as quickly as possible.

Some signs that your engineering team could benefit from incident management tooling include:

  1. Your team is regularly dealing with a high volume of incidents.
  2. Your current incident response process is manual and time-consuming.
  3. Your team is having difficulty coordinating and communicating during incident response.
  4. Your team is unable to track the status of incidents and resolution efforts effectively.
  5. The potential consequences of an incident (e.g. financial losses, reputational damage) are significant.

If any of these apply to your team, it may be worth investing in incident management tooling to improve your incident response process and reduce the impact of incidents. To learn more about incident management tooling, sign up for a free trial today!
