Blameless has been speaking to on-call engineers...

We've found that incidents take longer than you think. You might look back at an incident and remember how long it took to find and implement a solution, but there's so much more than that. We broke down every part of incident management using the experiences of your on-call peers to see how much time you really spend.

Spending too much time on incidents can cause problems in a variety of areas, including...

Delays planned work

Every minute you spend fixing an incident is a minute that could have been spent on new feature work.

Engineers burn out

No engineer wants to spend all day fighting fires. Less time on incidents means happier engineers.

Reputation takes a hit

If you don't handle incidents quickly and thoroughly, customers will worry and look to competitors.

WHAT IS AN INCIDENT?

Incidents are more than just bugs or outages. They’re anything that pulls you away from your planned work, and they last until you’ve returned to your planned task.

Responding to the Alert

When you get the alert that something is wrong, it takes you some time to process and act on the information.

minutes per 
incident

Wrapping up Previous Work

You probably want to leap immediately onto solving an incident you’ve been alerted to, but things aren’t so simple. Every interruption comes with some time spent wrapping up the previous task.

minutes per 
incident

Classifying the Incident

Before you know how to triage and respond, you need to judge the severity of the incident and the areas it affects. This will inform who you invite to resolve it.

minutes per 
incident

Gathering Responders

Few incidents can be resolved totally solo. You need to look up on-call schedules and figure out service owners and subject matter experts, then ping them to gather in a central place.

minutes per 
incident

Coordinating Work

Once you have your dream team prepared, you need to deploy them efficiently. Making sure every task has been covered without redundancy isn’t trivial.

minutes per 
incident

Collecting Information

Understanding what’s going wrong requires gathering information from your monitoring tools. You need to know how your system is functioning now compared with when it was in a healthy state.

minutes per 
incident

Diagnosing the Problem

Now that you have some context, you need to zoom into the specific problem through testing. Analyze your codebase and architecture to pinpoint where things are breaking.

minutes per 
incident

Communicating Incident Status

While you're busy working, many stakeholders – from customers to executives – will want to know what’s happening. Time needs to be allocated to inform them.

minutes per 
incident

Devising a Solution

This is the biggest chunk of the resolution process, and the one that can vary the most in time spent. Sometimes you’re up all night, and sometimes the fix comes to you in an instant. An hour is an optimistic estimate based on conversations with your peers.

60 minutes per incident

Implementing the Solution

This stage can also vary wildly in time, from a single line of code needing a change, to a tedious migration of databases, to a total architectural overhaul. We’ll remain optimistic with a half-hour estimate.

30 minutes per incident

Testing the Solution

As tempting as it is, deploying the first idea you come up with can lead to an even worse disaster. Spending time on quick QA is vital.

minutes per 
incident

Deploying the Solution

Depending on your architecture and deployment process, this can take a long time or very little. We’ll assume a fairly automated and mature process.

minutes per 
incident

Verifying that the Problem was Fixed

Now that the fix is in users’ hands, you need to make sure it actually… fixed the problem. Running more tests that mirror your initial diagnosis confirms that you’re in the clear.

minutes per 
incident

Now that the solution has been deployed and the problem has been solved, you’re done, right?

WRONG!

You’re just getting started.

Summarizing the Resolution

You need to create a retrospective document as a resource for future incident responders. Step one is collecting and summarizing what went wrong, what was tried, what happened, and when.

minutes per 
incident

Judging the Impact

The next step for your retrospective is figuring out how big an impact it actually had. Look at how many users were affected, how badly they were affected, and for how long.

minutes per 
incident

Tracking the Impact

Now that you understand the impact the incident had, see how it changes your tracked metrics such as overall uptime.

minutes per 
incident
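To make the uptime math concrete, here is a minimal sketch of how an incident's downtime moves an uptime percentage. The 30-day window, the downtime figures, and the `uptime_percent` helper are all hypothetical, chosen only for illustration; your own tracking tools may measure this differently.

```python
MINUTES_PER_DAY = 24 * 60


def uptime_percent(downtime_minutes: float, window_days: int = 30) -> float:
    """Return uptime as a percentage of a rolling window (assumed 30 days)."""
    window_minutes = window_days * MINUTES_PER_DAY
    return 100.0 * (window_minutes - downtime_minutes) / window_minutes


# Hypothetical numbers: 10 minutes of downtime already recorded this window,
# plus a new incident that kept users offline for 45 minutes.
before = uptime_percent(downtime_minutes=10)
after = uptime_percent(downtime_minutes=10 + 45)

print(f"Uptime before the incident: {before:.3f}%")
print(f"Uptime after the incident:  {after:.3f}%")
```

Comparing the two values shows how much of your uptime target a single incident used up.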

Analyzing the Causes

This is the most substantial part of your retrospective. Work together to think holistically about every factor that contributed to the incident occurring. Dig deep, and think about the causes of causes.

minutes per 
incident

Devising Systemic Changes

When you’ve identified the causes of the incident, do what you can to prevent them from recurring. Find what systemic changes can be made to prevent those scenarios. They can be codebase changes, new policies, additional resources, and more.

minutes per 
incident

Implementing Systemic Changes

Actually implementing the changes you’ve prioritized could take minutes, hours, or weeks. They’ll likely involve other teams. But merely tracking and allocating these tasks is time-consuming enough.

minutes per 
incident

Communicating the Retrospective

Many stakeholders will need to know what went wrong and how it was fixed. Take time to make sure they receive the retrospective document and answer questions about it.

minutes per 
incident

Refocusing on Your Original Work

You’re finally done with incident-related work. Hooray! However, the incident isn’t really over until you’re back making progress on your original task. Don’t discount the time it takes to refocus on where you were before.

minutes per 
incident

So, what’s our Grand Total?

Thinking holistically, we find that each incident causes a delay of 475 minutes, or almost 8 hours. That’s a whole working day spent resolving an incident!

475 minutes

There probably haven’t been many times when you consciously spent a whole day on an incident. If there have been, we feel for you. Often these tasks are distributed over multiple days, or handled by multiple people. But that time is still being spent, and planned work is being delayed.

If a typical engineering salary is about $150,000 per year, this will cost you $660 in engineering time alone. If engineers are dealing with an incident a week, this adds up to $34,320 in engineering costs for each engineer. Check out our Return on Investment calculator to see more about the costs of incidents.
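The arithmetic behind those figures can be sketched as follows. The salary, the 475 minutes, and the one-incident-a-week rate come from the text above; the 1,800 working hours per year is our assumption, chosen because it reproduces the $660 figure, and the function name is hypothetical.

```python
MINUTES_PER_INCIDENT = 475    # grand total from this article
ANNUAL_SALARY = 150_000       # salary figure from this article
WORK_HOURS_PER_YEAR = 1_800   # ASSUMPTION: not stated in the article
INCIDENTS_PER_YEAR = 52       # one incident a week


def cost_per_incident(minutes: float, salary: float, hours_per_year: float) -> float:
    """Engineering cost of one incident, based on an hourly rate."""
    hourly_rate = salary / hours_per_year
    return hourly_rate * minutes / 60


# Round to whole dollars per incident first, then scale up to a year.
per_incident = round(
    cost_per_incident(MINUTES_PER_INCIDENT, ANNUAL_SALARY, WORK_HOURS_PER_YEAR)
)
per_year = per_incident * INCIDENTS_PER_YEAR

print(f"Cost per incident: ${per_incident:,}")
print(f"Cost per engineer per year: ${per_year:,}")
```

Swap in your own salary, working hours, and incident rate to see how the totals change for your team.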

The good news is that Blameless can reduce many of these times. We make many incident tasks automatic, letting you focus on resolving the problem fast with our role-based guidance. After you're back online, prevent repeat incidents and strengthen your system with our suite of features for incident learning.
Find out more by signing up for a demo!
