Ebook
The Blameless Complete Guide to Incident Management (Part 1)
Summary
Incidents are inevitable. As your service expands and becomes more complex, you’re more likely to encounter outages, slowdowns, errors, and other disruptions to healthy operation. At the same time, as your service grows more popular and users come to rely on it, the cost of each incident rises. A bad incident can lead to any of the following, and more:
- Loss of revenue from a service being unavailable or substandard
- Customers churning to more reliable competitors
- Potential customers abandoning the product during evaluation
- Delay of feature work that could provide a competitive advantage
All of these factors directly hurt your business’s bottom line. Studies have shown that the cost of downtime is high and growing fast in the digital-first world. Since you can never fully prevent incidents, it’s important to resolve them as efficiently as possible.
This eBook breaks down what to do when things go wrong. We’ll cover:
- What to do in the heat of an incident
- How to prepare for incidents by building resources
- How to learn from incidents to become more resilient and robust
Let’s dive in and level up our incident management skills!
Key Takeaways
- The most important thing to know is what you’ll do while things are going wrong. No matter how much preparation and learning you do, there will always be things you aren’t ready for.
- Don’t hesitate to declare an incident: If you aren’t sure, remember that something made you concerned enough to consider declaring one. Even if nothing turns out to be wrong, it’s still an opportunity to collaborate and learn.
- Diagnose and solve with deliberation: To stay grounded and focused, form a hypothesis about what is causing the problem, test it, then adjust and retest based on what you see.
- Keep communication flowing: Communicate continually in one central place, such as a dedicated Slack channel set up for the incident. Appoint one person as the designated Communications Lead, whose sole job is to keep internal and external stakeholders informed.
- Escalate and ask for help: It can be tough to admit that you aren’t sure what to try, or that you don’t have the resources you need to continue. But getting help is essential to solving some incidents, so do your best to ask when it’s necessary.
Table of Contents
1. Introduction
2. During an incident
   - Don’t hesitate to declare an incident
   - Diagnose and solve with deliberation
   - Keep communication flowing
   - Escalate and ask for help
3. Learning from incidents
   - Incident retrospectives
   - Patterns in incidents