The Blameless Complete Guide to Incident Management (Part 1)

This eBook will break down what to do when things go wrong. Let's dive in and level up our incident management skills!

Incidents are inevitable. As your service expands and becomes more complex, you’re more likely to encounter outages, slowdowns, errors, and other disruptions to healthy operation. At the same time, as your service becomes more popular and more users rely on it, the cost of each incident rises. A bad incident can lead to all of the following and more:

  • Loss of revenue from a service being unavailable or substandard
  • Customers churning to more reliable competitors
  • Potential customers abandoning the product during evaluation
  • Delay of feature work that could provide a competitive advantage

All of these factors directly hurt your business’s bottom line. Studies have shown that the cost of downtime is high and growing fast in a digital-first world. Since you can never fully prevent incidents, it’s important to resolve them as efficiently as possible.

This eBook breaks down what to do when things go wrong. We’ll cover:

  • What to do in the heat of an incident
  • How to prepare for incidents by building resources
  • How to learn from incidents to become more resilient and robust

Let’s dive in and level up our incident management skills!

Key Takeaways

  1. The most important thing to know is what you’ll do while things are going wrong. No matter how much preparation and learning you do, there will always be things you aren’t ready for. 
  2. Don’t hesitate to declare an incident: If you aren’t sure, remember that there’s a reason you’re concerned enough to consider declaring one. Even if nothing turns out to be wrong, it is still an opportunity to collaborate and learn.
  3. Diagnose and solve with deliberation: To stay grounded and focused, come up with a hypothesis of what is causing the problem, test it, and adjust and retest based on what you see.
  4. Keep communication flowing: Communicate continually in a central place, such as a dedicated Slack channel set up for the incident. Appoint one person as the designated Communications Lead, whose sole job is to keep internal and external stakeholders informed.
  5. Escalate and ask for help: It can be tough to admit that you aren’t sure what to try, or that you don’t have the resources you need to continue. But getting help is essential to solving some incidents, so do your best to ask when it’s necessary.
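
The hypothesis, test, and adjust loop from takeaway 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Hypothesis` class, the `diagnose` function, and the two example checks are hypothetical names invented here, not part of any Blameless product or API. In a real incident, each test would query dashboards, logs, or deploy history instead of returning a fixed value.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Hypothesis:
    """One candidate explanation for the incident (illustrative only)."""
    description: str
    test: Callable[[], bool]  # returns True if the evidence supports the hypothesis


def diagnose(hypotheses: List[Hypothesis], log: List[str]) -> Optional[Hypothesis]:
    """Test hypotheses one at a time, recording every result in a shared log."""
    for h in hypotheses:
        supported = h.test()
        log.append(f"{h.description}: {'supported' if supported else 'ruled out'}")
        if supported:
            return h
    return None  # nothing confirmed: a signal to escalate and ask for help


# Hypothetical checks with hard-coded outcomes, standing in for real investigation.
timeline: List[str] = []
candidates = [
    Hypothesis("Recent deploy introduced a regression", lambda: False),
    Hypothesis("Database connection pool is exhausted", lambda: True),
]
confirmed = diagnose(candidates, timeline)
```

Writing each result to a shared log mirrors takeaway 4: everyone in the incident channel can see what has already been ruled out. A `None` result maps to takeaway 5, where the next step is to escalate.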

Table of Contents

  1. Introduction
  2. During an incident
    • Don’t hesitate to declare an incident
    • Diagnose and solve with deliberation
    • Keep communication flowing
    • Escalate and ask for help
  3. Learning from incidents
    • Incident retrospectives
    • Patterns in incidents
  4. Conclusion & next steps

"I have less anxiety being on-call now. It’s great knowing comms, tasks, etc. are pre-configured in Blameless. Just the fact that I know there’s an automated process, roles are clear, I just need to follow the instructions and I’m covered. That’s very helpful."
Jean Clermont, Sr. Program Manager, Flatiron
"I love the Blameless product name. When you have an incident, 'Blameless' serves as a great reminder to not blame anything or anyone (not even yourself) and just focus on the incident resolving itself."
Lili Cosic, Sr. Software Engineer, HashiCorp