Looking into Incident Response? We explain incident response, the end-to-end process, the teams involved, and steps to take to avoid friction and slow-down. The goal is to manage the incident as efficiently as possible in order to restore or resume the service to its expected operational state.
Incident response (IR) is the way an organization reacts to an event that has caused a disruption of services. Typically, the incident response process moves through five steps: detect, acknowledge, respond, resolve, and learn.
An incident occurs when something unexpected happens to your code, system infrastructure (including network), or underlying data that threatens the integrity of the overall service. The factors that cause an incident differ in almost every case, but the consequences usually involve degraded latency, reduced availability (an outage), compromised data integrity, or a security breach.
When an incident occurs, you want to know about it as soon as possible so that you can address it as quickly as possible. That is the process of “detection,” made possible with the help of observability and monitoring tools. Once an incident is detected, and if it needs to be addressed, the next step is to formally “acknowledge” the incident. The Google SRE Book offers guidelines on when it is appropriate to formally declare and acknowledge an incident.
The goal of every organization is to run its operations smoothly and as consistently as possible. However, issues naturally occur from time to time. When an incident happens, the responder not only handles the incident response, but also coordinates the communication between teams and with other stakeholders — think of this as informing up, down, and across the chain of relevant team members. Successful incident response makes sure that all parties involved have a clear understanding of the situation at hand, the path toward resolution, and everyone’s individual tasks.
Incident response and, more specifically, the process around incident response is vital to every organization. How teams detect, analyze, resolve, and learn from incidents are the keys to improving over time and building more reliable services.
As soon as an incident is detected, the first step should be to designate clearly defined roles. It may seem simple, but a clear line of command can mark the difference between a managed and an unmanaged incident. Incident management often involves many different teams, but a handful of core roles anchor the response.
The central role in incident response is the Incident Commander (IC), who oversees the overall response: assigning roles, prioritizing work, and keeping the incident’s state and communications current.
For complicated or challenging incidents, the IC may sometimes enlist the help of another team lead to manage specific areas of the incident response. If the incident is small or minor, the IC can sometimes take on the role of communications lead (CL) as well.
Earlier we briefly touched on the key steps to incident response and how important they are to building reliable services. The steps are: detect, acknowledge, respond, resolve, and learn. One thing that amplifies the success of each step is a proper flow of communication. To that end, it’s best practice to document your incident management process so that it’s universally clear and always available. This is especially important for new team members that need to come up to speed as quickly as possible. The following is a summary of the basic stages of the incident management lifecycle. As you build out your own workflow specifications, remember to write them down!
The first and most critical step in incident response is incident detection. Before you can even know to take any action, you have to identify that there is in fact an incident.
Implementing continuous monitoring practices increases the visibility of a system’s overall state. Make your system observable by monitoring and querying the right events including metrics, logs, and traces. It’s also important to evaluate monitoring tools, thresholds, and alert rules from time to time to make sure that you’re tracking the right parts of your system and notifying the right team members when anything unexpected happens.
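As a rough sketch of what alert-rule evaluation looks like under the hood, here is a minimal, hypothetical Python example; the metric names, thresholds, and on-call routing keys are illustrative and not tied to any particular monitoring product.

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    """A single alert rule: fire when a metric reading crosses its threshold."""
    metric: str
    threshold: float
    notify: str  # hypothetical on-call routing key to page

def evaluate(rules, readings):
    """Return the rules whose current metric reading exceeds the threshold."""
    return [r for r in rules if readings.get(r.metric, 0.0) > r.threshold]

rules = [
    AlertRule("p99_latency_ms", 500, "oncall-backend"),
    AlertRule("error_rate", 0.01, "oncall-backend"),
]
readings = {"p99_latency_ms": 820, "error_rate": 0.002}
fired = evaluate(rules, readings)  # only the latency rule fires here
```

Revisiting thresholds like these periodically, as the text suggests, keeps alerts pointed at the parts of the system that actually matter.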
Detection is about more than monitoring dashboards and charts, though. You should also understand how the service and its code work and how they interact with the various parts of the underlying infrastructure.
As soon as an incident is detected and acknowledged, the next step is to efficiently get the right people involved. It’s important to know the context of an incident in order to figure out who should be involved, so that you can communicate the relevant information through the proper channels.
Usually, co-located teams would gather together in a conference room. In a mostly remote world, teams depend on communication tools such as Slack or Microsoft Teams to plan and communicate. Automated incident response platforms can integrate with Slack, Microsoft Teams, PagerDuty, Jira, and other tools to automate both workflows and communication, making it much more efficient to collaborate.
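To illustrate what such an integration boils down to, here is a tiny Python helper that builds the JSON payload a chat integration such as Slack’s incoming webhooks accepts (a body with a "text" field); the incident ID and message format are made up for the example.

```python
import json

def incident_update(incident_id, status, summary):
    """Build a chat-webhook JSON payload announcing an incident update.

    The "text" field matches the shape Slack's incoming webhooks expect;
    the message format itself is a hypothetical convention.
    """
    text = f"[{incident_id}] status={status}: {summary}"
    return json.dumps({"text": text})

payload = incident_update("INC-042", "investigating",
                          "elevated error rate on checkout")
```

An automation platform would then POST this payload to the webhook URL, so every responder sees the same update in the same channel.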
Runbooks are also extremely helpful to have during an incident because they include a checklist of items that need to be addressed at the outset. Remember earlier we said that documenting your process is important? During critical moments like these, runbooks reduce stress and ensure that nothing is forgotten.
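The checklist nature of a runbook can be sketched as a small data structure; the step names and their order below are purely illustrative, not a prescribed process.

```python
# A minimal sketch of a runbook as an ordered checklist the responder
# walks through. Each step name here is invented for illustration.
runbook = [
    "Acknowledge the page and declare an incident",
    "Assign an Incident Commander",
    "Open a dedicated incident channel and live document",
    "Assess user impact and set severity",
    "Notify stakeholders on the pre-arranged list",
]

def next_step(done):
    """Return the first step not yet checked off, or None when finished."""
    return next((s for s in runbook if s not in done), None)

next_step({"Acknowledge the page and declare an incident"})
# returns "Assign an Incident Commander"
```

Even this trivial structure captures why runbooks reduce stress: the next action is always explicit, so nothing is forgotten in the heat of the moment.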
After successful detection and response, teams must start working on resolving or fixing whatever caused the incident. Ideally, responders collect the necessary information and context during detection and further inspection, making remediation a smooth process.
A key part of resolution is involving the right subject matter expert(s) (SMEs). Most engineering teams have an on-call rotation in place, but not everyone on call when the incident occurs will have the knowledge to resolve it easily. Waking up a developer in the middle of the night is never ideal, but high-severity incidents often require it. The trick is knowing exactly who to include and when, so establishing the type and severity (or criticality) of the incident is very important before disrupting anyone on the team.
As with other stages in the incident management lifecycle, collaboration is critical during resolution. In many cases, several teams get involved in working or triaging a particular issue. Staying in communication each step of the way makes the process run faster and much more efficiently.
Resolving an incident is the end goal, but it shouldn’t be the end of the story. Mature engineering teams conclude incidents by debriefing and learning from the event. A live discussion coupled with thorough documentation via post-mortems, or what we now prefer to call retrospectives, begins the process of systemic improvement. It allows the relevant teams to come together and share information, metrics, and insights to improve the system’s reliability and better prepare for future incidents.
One more thing. It’s vital to keep the retrospective blameless. What does this mean? Pointing fingers never helps anyone. In fact, it disincentivizes people from identifying roadblocks or suggesting creative solutions. A good rule of thumb is to approach retrospectives with the attitude that everyone has valuable insights to share, and there is always an opportunity to improve.
Hope for the best, but prepare for the worst. Incidents are inevitable, so make sure you are equipped when the time comes.
Improvement is the last step in incident response. Think of it as preparation: taking proactive measures to protect your organization from further incidents. Keeping runbooks up to date is a great first step, and so is informing all teams, on call or not. Building on the retrospective, teams then have to translate their post-incident analyses into action: identify what needs to be improved, and start working on those tasks.
Once you’ve mastered the basic stages of the incident management lifecycle, elevate your process by employing some industry best practices. Below are five examples of incident response best practices along with brief explanations on why they are helpful and how to implement them.
During an incident, getting the right people involved and keeping everyone informed can be quite a challenge. Automating your workflow and process with an incident response platform can lead to improved communication between various teams. On top of that, an up-to-date runbook helps teams hit the ground running when an incident occurs.
Communication must remain persistent throughout the incident lifecycle. Keep your team members and other stakeholders in the loop regarding any progress with the incident. The best way to do that is to record any progress in a live incident document or channels such as Slack or Microsoft Teams. That way, anyone can take a look any time and know what has been done and what is currently happening.
Once you’ve resolved an incident, the next step is to come together as a team for a blameless retrospective or post-mortem. During the review, avoid pointing fingers and focus on sharing anything that can improve the process (including the runbook), the tooling, and of course the system or service itself. This learning helps the entire organization better manage incidents in the future.
With the right mindset, every incident is an opportunity to learn and grow — that includes learning how to improve the response process too. The process of incident response gives us an opportunity to break down organizational silos and improve collaboration among various teams, from developers to release engineers, operations, and site reliability engineers. How you manage incidents inside your organization will likely evolve over time as the company grows and the team matures. At a minimum, everyone should have a solid understanding of the process. That way you can carry on building your product and ultimately deploy more frequently, with minimal downtime.
Over time, tracking mean-time-to-detection (MTTD), mean-time-to-repair (MTTR), and mean-time-between-failures (MTBF) can provide insight into your team’s rate of improvement.
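As a rough sketch of how these metrics can be computed from incident records, here is a minimal Python example; the timestamps are invented, and MTTR is measured here from detection to resolution (teams differ on whether to start the clock at detection or acknowledgment).

```python
from datetime import datetime, timedelta

# Hypothetical incident records: when each issue started, was detected,
# and was resolved.
incidents = [
    {"started": datetime(2023, 1, 1, 3, 0),
     "detected": datetime(2023, 1, 1, 3, 10),
     "resolved": datetime(2023, 1, 1, 4, 0)},
    {"started": datetime(2023, 1, 8, 14, 0),
     "detected": datetime(2023, 1, 8, 14, 4),
     "resolved": datetime(2023, 1, 8, 14, 34)},
]

def mean(deltas):
    """Average a list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)

mttd = mean([i["detected"] - i["started"] for i in incidents])   # time to detect
mttr = mean([i["resolved"] - i["detected"] for i in incidents])  # time to repair
# MTBF: average gap between the starts of consecutive incidents.
starts = sorted(i["started"] for i in incidents)
mtbf = mean([b - a for a, b in zip(starts, starts[1:])])
```

Watching these averages trend downward (MTTD, MTTR) or upward (MTBF) over months is a simple, concrete signal that the process is improving.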
The only real way to build a reliable system with resilient teams and practices in a constantly shifting environment is through practice and preparation. Developers and sysadmins often become site reliability engineers (SREs) who focus on responding to and resolving incidents. Chaos Engineering and running a GameDay are two excellent ways to prepare your team for various incidents.
What your first incident response plan looks like depends on the maturity of your organization’s incident management processes. Some organizations are in the proactive stage, while others are still in the reactive stage.
To define an incident response plan for your organization, start with a template that can be applied to any incident within the system. It should also determine which individual or team is notified first. Once the general template is defined, create tailored plans for the incident types you encounter most frequently.
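Here is one way such a template might be sketched in code; the severity levels, routing targets, and channel names are all hypothetical placeholders.

```python
# A minimal sketch of an incident response plan template, assuming a
# simple severity scheme (SEV1 highest). Every name below is invented.
PLAN_TEMPLATE = {
    "SEV1": {"notify_first": "incident-commander", "channel": "#inc-sev1",    "page": True},
    "SEV2": {"notify_first": "oncall-engineer",    "channel": "#inc-sev2",    "page": True},
    "SEV3": {"notify_first": "oncall-engineer",    "channel": "#inc-general", "page": False},
}

def first_contact(severity):
    """Return who should be notified first for a given severity."""
    return PLAN_TEMPLATE[severity]["notify_first"]

first_contact("SEV1")  # "incident-commander"
```

Encoding the plan this way makes the “who gets notified first” decision explicit and reviewable, rather than tribal knowledge.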
An efficient incident response plan settles these decisions ahead of time, so that no one has to improvise in the middle of an incident.
Whether an incident was well managed often determines how, and how quickly, it gets resolved. Let’s look at an example of an unmanaged incident to see the role that good management or orchestration plays in incident response.
Suppose it’s 3 a.m. on a Wednesday and the on-call engineer, David, is working through routine tasks. Suddenly he is alerted that one of the company’s data centers is down. He goes through the logs at once, and a brief look suggests that a recently updated feature is causing the issue. He tries rolling back to the previous version, which doesn’t do the trick, so he calls the developer who worked on that exact update and asks them to look at the problem.
So far, only the dev team is involved, and as soon as the management team finds out about the outage, they will want answers and updates. However, David can only focus on one thing at a time. Hours pass, two more data centers go out, and the single remaining data center cannot handle all the traffic, which ultimately brings down the entire service.
The challenges with how this incident was managed come down to a few key points: no one took overall command of the response, stakeholders were left without updates, no additional help was brought in as the incident grew, and there was no shared record of what had been tried.
Let’s explore that same scenario with some minor changes. David, the on-call engineer, is going through routine tasks when he’s paged that one of the data centers is down. As he starts to investigate, yet another alert notifies him that a second data center is also down. He immediately contacts another teammate, Ariana, to ask if she can take command while he continues troubleshooting.
After assuming command, Ariana quickly goes through a rundown with David and sends out the incident details to a pre-arranged list via email. David and Ariana discuss the details and agree that users will be impacted if a third data center goes down. They record the assessment in a live incident document.
As soon as the third alert goes out, Ariana updates everyone on the same email list, follows up with David, and alerts the on-call developer, Marie, who has expert knowledge about data centers. She and her team go through the incident document, prioritize tasks, and start working on the problem. They try a few fixes that don’t work, and Ariana updates the incident management document.
The day is coming to an end, so Ariana starts looking for replacement staff to take over the incident so her colleagues can go home and rest after many hours of intense work. Before handing off the command, Ariana has a Zoom video meeting with the new team she’s handing off to and everything runs efficiently with clear responsibilities.
The next morning, David returns to work and finds that the problem has been mitigated and the incident closed. The teams are now working on the retrospective report. Finally, David settles down to document improvements and follow-up actions, so that similar incidents will be well managed in the future and everyone can learn from the steps taken.
In the same incident, a few details changed the outcome: David handed off command early so he could keep troubleshooting, Ariana kept stakeholders informed through a pre-arranged list and a live incident document, the right subject matter expert was brought in deliberately, and a clean handoff kept fresh responders on the problem.
Hopefully, you found this article helpful, and you feel confident about approaching incident response in your organization. If you’re looking for a way to track the stages of the incident management lifecycle, Blameless is a great way to collate system reliability metrics, store runbooks, assign tasks during incident response, and even store retrospectives that promote learning and growth. To see it in action, schedule a demo or sign up for the newsletter to learn more about building reliable services.
Acknowledge — The action of identifying an event as an actual incident that needs to be worked on, triaged, or debugged which then leads to the next natural step of taking ownership of the incident response
Availability/Uptime — Defines the percentage of time a system is accessible and functioning as intended, usually measured over a month-long time frame
Detect — The process of identifying that an incident has occurred and beginning to analyze its cause, using a variety of tools (APM, monitoring, observability)
Latency — The response time of a system or the total time a system takes to respond to a request
Monitoring — The practice of “watching,” or continuously observing, important predetermined metrics in charts and dashboards that tell you how the system is behaving overall
MTBF (Mean Time Between Failures) — The average time between failures (or incidents)
MTTD (Mean Time to Detect) — The average time interval between when an issue occurs and when an alert for the issue is triggered
MTTR (Mean Time to Repair) — The average time lapsed between acknowledging an incident and resolving the incident
Network — The connection between computers, servers, and other devices that enables data sharing, allowing clients to make requests to servers and to run services using data and code hosted on those servers
Observability — The ability to fully understand a system from the outside. In control theory, observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. In practice, it is the ability to ask any question of your system to better understand how it’s behaving, without having to re-instrument it or write new code.
Postmortem/Retrospective — A post-incident record documenting the impact, process steps, and resolution of an incident, which helps teams improve and manage incidents better in the future
Reliability — A measure of how likely it is that a system will perform and function properly as it is intended to, which includes assessment of availability, latency, and stability among other performance metrics.
Resolve — The process of fixing the contributing factors that led to an incident in order to restore the service
Respond — An organized approach to addressing and managing an incident including logging steps, recording actions or tasks by ‘owner’, and communicating across relevant stakeholders
Runbook — A document compiling the necessary procedures and operations to follow when an incident happens. In other words the ‘recipe’ for how to manage an incident end-to-end.
SRE (Site Reliability Engineering) — A set of practices and principles aimed to improve a service’s reliability
System — A grouping of interconnected components including code, infrastructure, and networking that together make a greater whole, i.e. “the system”