WHAT IS AN INCIDENT?
Incidents are more than just bugs or outages. An incident is anything that pulls you away from your planned work, and it lasts until you’ve returned to that work.

Responding to the Alert

When you get the alert that something is wrong, it takes you some time to process and act on the information.

Wrapping Up Previous Work

You probably want to leap immediately onto solving an incident you’ve been alerted to, but things aren’t so simple. Every interruption comes with some time spent wrapping up the previous task.

Classifying the Incident

Before you can triage and respond, you need to judge the severity of the incident and which areas it affects. This will inform who you invite to resolve it.

Gathering Responders

Few incidents can be resolved totally solo. You need to look up on-call schedules and figure out service owners and subject matter experts, then ping them to gather in a central place.
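
The lookup described above can be sketched in a few lines of Python. This is only an illustration: the service names, team names, people, and rotation dates are all hypothetical, and a real team would pull this data from its on-call tool rather than hard-code it.

```python
from datetime import datetime, timezone

# Hypothetical data: every service, team, person, and date here is made up.
SERVICE_OWNERS = {
    "checkout": "payments-team",
    "search": "discovery-team",
}

# Each team's rotation is a list of (person, shift_start, shift_end) tuples.
ON_CALL = {
    "payments-team": [
        ("alice", datetime(2024, 1, 1, tzinfo=timezone.utc), datetime(2024, 1, 8, tzinfo=timezone.utc)),
        ("bob", datetime(2024, 1, 8, tzinfo=timezone.utc), datetime(2024, 1, 15, tzinfo=timezone.utc)),
    ],
    "discovery-team": [
        ("carol", datetime(2024, 1, 1, tzinfo=timezone.utc), datetime(2024, 1, 15, tzinfo=timezone.utc)),
    ],
}


def responders_for(services, at):
    """Map each affected service to whoever is on call for its owning team at `at`.

    A value of None means the owner or the shift could not be found and
    someone has to look it up by hand.
    """
    found = {}
    for service in services:
        team = SERVICE_OWNERS.get(service)
        shifts = ON_CALL.get(team, []) if team is not None else []
        found[service] = next(
            (person for person, start, end in shifts if start <= at < end),
            None,
        )
    return found


incident_time = datetime(2024, 1, 9, 3, 30, tzinfo=timezone.utc)
print(responders_for(["checkout", "search", "billing"], incident_time))
```

Services with no known owner come back as `None`, which is exactly the manual digging that makes this step take time.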

Coordinating Work

Once you have your dream team prepared, you need to deploy them efficiently. Making sure every task has been covered without redundancy isn’t trivial.

Collecting Information

Understanding what’s going wrong requires gathering information from your monitoring tools. You need to know how your system is functioning now compared with when it was in a healthy state.

Diagnosing the Problem

Now that you have some context, you need to zoom into the specific problem through testing. Analyze your codebase and architecture to pinpoint where things are breaking.

Communicating Incident Status

While you're busy working, many stakeholders – from customers to executives – will want to know what’s happening. Time needs to be allocated to inform them.

Devising a Solution

This is the biggest chunk of the resolution process, and the one that can vary the most in time spent. Sometimes you’re up all night, and sometimes the fix comes to you in an instant. An hour is an optimistic estimate based on conversations with your peers.

Implementing the Solution

This stage can also vary wildly in time, from a single line of code needing a change, to a tedious migration of databases, to a total architectural overhaul. We’ll remain optimistic with a half hour estimate.

Testing the Solution

As tempting as it is, deploying the first idea you come up with can lead to an even worse disaster. Spending time on quick QA is vital.

Deploying the Solution

Depending on your architecture and deployment process, this can take a long time or a little. We’ll reflect a fairly automated and mature process.

Verifying that the Problem was Fixed

Now that the fix is in the users’ hands, you need to make sure it actually… fixed it. Running more tests aligned with your initial diagnosis confirms you’re in the clear.

Now that the solution has been deployed and the problem has been solved, you’re done, right?

WRONG!

You’re just getting started.

Summarizing the Resolution

You need to create a retrospective document as a resource for future incident responders. Step one is collecting and summarizing what went wrong, what was tried, what happened, and when.

Judging the Impact

The next step for your retrospective is figuring out how big an impact the incident actually had. Look at how many users were affected, how badly they were affected, and for how long.
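
One rough way to combine those three factors is a single "user-minutes lost" figure. The sketch below is an assumption rather than a standard formula: the function name, the 0 to 1 degradation scale, and the example numbers are all hypothetical.

```python
def user_minutes_lost(users_affected, degradation, duration_minutes):
    """Rough impact estimate: affected users x how badly (0..1) x how long.

    `degradation` is a hypothetical severity weight: 1.0 means the product
    was unusable for those users, 0.5 means it was roughly half as useful.
    """
    if users_affected < 0 or duration_minutes < 0:
        raise ValueError("users_affected and duration_minutes must be non-negative")
    if not 0.0 <= degradation <= 1.0:
        raise ValueError("degradation must be between 0 and 1")
    return users_affected * degradation * duration_minutes


# Made-up example: 200 users, half-degraded service, 30 minutes.
print(user_minutes_lost(200, 0.5, 30))
```

A single number like this loses nuance, but it makes incidents comparable from one retrospective to the next.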

Tracking the Impact

Now that you understand the impact the incident had, see how it changes your tracked metrics, such as overall uptime.
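
For uptime specifically, the arithmetic is simple enough to sketch. The 30-day window and the 90 minutes of downtime below are illustrative assumptions, not figures from any real incident.

```python
def availability_percent(period_minutes, downtime_minutes):
    """Percentage of the period during which the service was up."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime_minutes must be between 0 and period_minutes")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


MINUTES_IN_30_DAYS = 30 * 24 * 60  # 43,200 minutes

# Hypothetical incident: 90 minutes of full downtime in a 30-day window.
uptime = availability_percent(MINUTES_IN_30_DAYS, 90)
print(f"{uptime:.3f}%")
```

Under these assumptions, a single 90-minute outage is enough to pull a 30-day window below 99.8% availability.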

Analyzing the Causes

This is the most substantial part of your retrospective. Work together to think holistically about every factor that contributed to the incident occurring. Dig deep, and think about the causes of causes.

Devising Systemic Changes

When you’ve identified the causes of the incident, do what you can to prevent them from recurring. Find what systemic changes can be made to prevent those scenarios. They can be codebase changes, new policies, additional resources, and more.

Implementing Systemic Changes

Actually implementing the changes you’ve prioritized could take minutes, hours, or weeks. They’ll likely involve other teams. But merely tracking and allocating these tasks is time-consuming enough.

Communicating the Retrospective

Many stakeholders will need to know what went wrong and how it was fixed. Take time to make sure they receive the retrospective document and answer questions about it.

Refocusing on Your Original Work

You’re finally done with incident-related work. Hooray! However, the incident isn’t really over until you’re back making progress on your original task. Don’t discount the time it takes to refocus on where you were before.
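
To see what all of these stages add up to, you can total your own per-stage estimates. The sketch below fills in only the two figures stated in the text (an hour for devising a solution, half an hour for implementing it) and leaves every other stage as `None` for you to supply. The function name and data layout are assumptions for illustration.

```python
# Minutes per incident for each stage. Only the two values stated in the
# text are filled in; None marks a stage you still need to estimate.
stage_minutes = {
    "Responding to the Alert": None,
    "Wrapping Up Previous Work": None,
    "Classifying the Incident": None,
    "Gathering Responders": None,
    "Coordinating Work": None,
    "Collecting Information": None,
    "Diagnosing the Problem": None,
    "Communicating Incident Status": None,
    "Devising a Solution": 60,
    "Implementing the Solution": 30,
    "Testing the Solution": None,
    "Deploying the Solution": None,
    "Verifying that the Problem was Fixed": None,
    "Summarizing the Resolution": None,
    "Judging the Impact": None,
    "Tracking the Impact": None,
    "Analyzing the Causes": None,
    "Devising Systemic Changes": None,
    "Implementing Systemic Changes": None,
    "Communicating the Retrospective": None,
    "Refocusing on Your Original Work": None,
}


def total_minutes(stages):
    """Return (sum of known estimates, names of stages still missing one)."""
    known = sum(minutes for minutes in stages.values() if minutes is not None)
    missing = [name for name, minutes in stages.items() if minutes is None]
    return known, missing


known, missing = total_minutes(stage_minutes)
print(f"{known} minutes accounted for, {len(missing)} stages still to estimate")
```

Even with nineteen stages left blank, the two stated estimates alone account for an hour and a half per incident.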