Incident Management Process | A Step-By-Step Guide

How does an incident management workflow look? We give a step-by-step guide to the ITIL process and best practices for an effective resolution.

What is an Incident Management Process?

An incident management process is the set of actions and procedures an organization follows to recover from an unplanned service interruption.

Importance of a Standardized Incident Management Workflow

Incidents affect companies in drastic ways. Service unavailability or downtime can create huge costs for businesses. In ITIC research, 98% of organizations said that one hour of downtime costs them over $100K, and 81% said that one hour of downtime costs their business more than $300K. A study by Gartner reports that system or service downtime can cost organizations up to $300K per hour.

Defining a clear incident management workflow is key to resolving incidents faster and reducing costs. IT support teams are most efficient when you’ve implemented a clear incident management process following the best practices. The benefits of having a clear incident management workflow include:

  • Faster incident resolution and improved MTTR (Mean Time to Resolution)
  • Reduced costs and impact on revenue for the business
  • Better internal and external communication during incident management 
  • Continuous improvement and learning
  • Improved customer experience

Incident Management and ITSM 

Organizations rarely invent their incident management process from scratch. Instead, they draw on industry best practices and adapt them to fit their individual needs. Before diving deeper into incident management, we need to define some important terms.

An incident is any unplanned event that disrupts the normal operation of a service or degrades its quality. Anything from service downtime to a slow web server can be categorized as an incident.

Incidents are often confused with problems, but an incident is an unplanned event, whereas a problem is the underlying cause behind one or more incidents. Incident management focuses on restoring the service to normal operation as quickly as possible. Problem management involves identifying the root cause of the incident to prevent it from recurring.

ITSM (IT Service Management) includes the processes and tasks involved in managing end-to-end IT services delivery. ITSM’s key concept is that IT should be delivered as a service, and incident management is one of its practices. 

ITIL (IT Infrastructure Library) is a detailed set of best practices (similar to a playbook) focused on aligning IT services with business needs. 

ITIL 4 vs ITIL 3 Incident Workflows - is there a difference?

ITIL 3 and ITIL 4 have the same overall goals in managing incidents effectively and consistently. The major difference is in how they accomplish these goals.

ITIL 3 prescribes 26 processes that span the whole service lifecycle, with incident management as one of them. These processes take you through the development and operation of a service in five major categories: service strategy, service design, service transition, service operation, and continual service improvement.

ITIL 4, on the other hand, is less prescriptive with processes and instead encourages best practices that can be applied to create processes specific to your organization. It thinks more holistically, not just looking at the specific steps of development and operations, but also including the contextual factors of your organization that affect how you can respond. For example, best practices around talent management and training are included.

ITIL Incident Management Workflow: Step by Step

We will follow the ITIL framework to go through a high-level overview of proper ticket handling in incident management. Most other frameworks outline roughly similar concepts. In incident management, it’s vital to have a good process and stick to it. 

Step 1: Incident Identification and Logging 

Anyone can identify an incident. Sometimes an employee reports the issue, and sometimes it's identified by end users or monitoring systems. Reports can arrive via an automatic alert, text message, email, or phone call. Upon receiving the report, the service desk team records it and determines whether it's an incident or a service request, as each one is handled differently.

After identifying an incident, the help desk team logs it as an incident, and creates a ticket with the following information:

  • Name and Contact of the person who reported the incident.
  • Date and Time of the incident report.
  • Incident Description along with what is not working properly or went down.
  • A unique Incident ID for tracking the incident.
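
As a rough illustration, the ticket fields above could be captured in a small record type. This is only a sketch: the class name `IncidentTicket`, the field names, and the `INC-` ID format are assumptions made for the example, not part of ITIL or of any particular ticketing tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import uuid4


@dataclass
class IncidentTicket:
    """Minimum information recorded when an incident is logged (hypothetical schema)."""
    reporter_name: str
    reporter_contact: str
    description: str  # what is not working properly or went down
    # Date and time of the report, stored in UTC.
    reported_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    # Unique ID for tracking; the "INC-" prefix is an assumed convention.
    incident_id: str = field(default_factory=lambda: f"INC-{uuid4().hex[:8].upper()}")


ticket = IncidentTicket(
    reporter_name="Jane Doe",
    reporter_contact="jane.doe@example.com",
    description="Checkout page returns HTTP 500 for all users",
)
print(ticket.incident_id, ticket.reported_at.isoformat())
```

Generating the ID and timestamp automatically keeps the person logging the ticket from having to invent or look up either value.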

Step 2: Incident Categorization 

The second step in ITIL incident management, categorization, marks the difference between an efficient and inefficient help desk team. An efficient incident categorization streamlines the logging process and reduces redundancy while speeding up the overall incident resolution.

Firstly, you must assign a category (and sub-category) to every incident. Categorization helps the help desk team sort and prioritize issues. For example, an incident categorized as Category: “Network” and Sub-category: “Network Outage” will be considered high-priority as it has a direct impact on the customer.

Categorized incidents are also easier to track in the long run. When an incident is accurately categorized, patterns emerge making it easier to identify trends that require problem management or training. Trends also make it easier for teams to sell an idea to the C-suite. For example, if a trend indicates that you need to update your hardware, then the CFO is more likely to approve. 
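
To make the point about trends concrete, here is a minimal sketch that counts closed tickets by category and sub-category. The ticket data and the helper name `top_trends` are hypothetical, and a real help desk would pull these pairs from its ticketing system.

```python
from collections import Counter

# Hypothetical (category, sub-category) pairs taken from closed tickets.
closed_tickets = [
    ("Network", "Network Outage"),
    ("Hardware", "Disk Failure"),
    ("Network", "Network Outage"),
    ("Access", "Password Reset"),
    ("Hardware", "Disk Failure"),
    ("Hardware", "Disk Failure"),
]


def top_trends(tickets, n=2):
    """Return the n most frequent (category, sub-category) pairs with their counts."""
    return Counter(tickets).most_common(n)


trends = top_trends(closed_tickets)
for (category, sub_category), count in trends:
    print(f"{category} / {sub_category}: {count} incidents")
```

A recurring pair such as repeated disk failures is the kind of pattern that might justify a hardware refresh or a problem management investigation.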

Step 3: Incident Prioritization 

Once categorized, every incident must be prioritized. Prioritization helps teams identify which incidents are causing more damage and require an urgent response. Incidents are prioritized by considering various factors including:

  • How many people will be impacted
  • Potential financial impact
  • Security impact
  • SLA compliance implications

Incidents are usually classified into three levels (low priority, medium priority, and high priority) based on the level of damage and urgency.

  1. Low-priority incidents: small, non-critical issues that do not interrupt users or the business and can be resolved quickly. 
  2. Medium-priority incidents: issues that impact some part of the staff or some business operations, but do not have a big impact on the customers.
  3. High-priority incidents: issues that impact a large number of users or customers, interrupt regular business operations and impact service delivery. High-priority events usually have a financial impact. 

When in doubt about the priority of an incident, always go with a high-priority level. It’s better to err on the side of caution rather than letting a severe incident slip through the cracks. 
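
The rules above can be sketched as a small function. The parameter names, the `LARGE_USER_COUNT` threshold, and the exact mapping are assumptions chosen for illustration; financial impact is left out to keep the example short, and each organization would tune these rules to its own SLAs.

```python
from enum import IntEnum
from typing import Optional

# Hypothetical threshold: how many affected users counts as "a large number".
LARGE_USER_COUNT = 100


class Priority(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def prioritize(users_affected: Optional[int],
               customers_affected: Optional[bool],
               security_impact: Optional[bool],
               sla_at_risk: Optional[bool]) -> Priority:
    """Map impact factors to a priority level; None means "not yet known"."""
    factors = (users_affected, customers_affected, security_impact, sla_at_risk)
    if any(f is None for f in factors):
        # When in doubt, err on the side of caution.
        return Priority.HIGH
    if customers_affected or security_impact or sla_at_risk:
        return Priority.HIGH
    if users_affected >= LARGE_USER_COUNT:
        return Priority.HIGH
    if users_affected > 0:
        return Priority.MEDIUM
    return Priority.LOW


print(prioritize(12, False, False, False).name)
print(prioritize(None, False, False, False).name)
```

Treating any unknown factor as high priority encodes the "when in doubt" rule directly, so an incomplete ticket can never be quietly filed as low priority.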

Roles Involved in IT Incident Management

Although the basics are the same, every organization tailors roles and responsibilities to its own incident requirements. Still, the following IT incident management roles are the most common.

  1. End-user/user: the stakeholder who first experienced and reported the issue.
  2. Incident manager or Incident Commander: the person who has the overall responsibility and authority. 
  3. Tech lead: the senior technical responder who is responsible for restoring the service to its operations. 
  4. Communications lead: a person from the customer support or PR teams, responsible for internal and external communication about the incident progress.  
  5. Tier 1 Service desk: the front-line service team consisting of people with knowledge and experience about the most common incidents such as password resets and Wi-Fi problems.
  6. Tier 2 Service desk: people with advanced incident management knowledge and experience. They primarily work on escalated incidents.
  7. Tier 3 Service desk: specialists and subject matter experts with advanced knowledge of a particular domain within the IT infrastructure.

Step 4: Incident Response

Now that we have logged the incident, categorized it, and prioritized it, the next step is incident response. Incident response is a pretty broad term and breaks down into further steps. There are generally five steps involved in incident response that we will discuss below.

  1. Initial diagnosis

The initial diagnosis is similar to what a doctor does after listening to a patient's symptoms. The doctor may not be able to diagnose the exact illness yet, but the symptoms make it easier to form a hypothesis about what might be wrong. At this step in incident management, diagnostic manuals, troubleshooting runbooks, and knowledge bases can come in handy.

The first responder to the incident tries to resolve the issue at this stage based on their own initial diagnosis. If they can’t resolve the problem, then it escalates to the next level. 

  2. Incident escalation and SLA management

The front-line support team is able to solve most of the small issues. However, if the problem is more complex, then they gather and log information and pass it on to the next level of technical support. That way, the second or third-level support teams can quickly and efficiently start working on the problem. 

During this stage, the support teams must ensure that they do not exceed the error budget and that the SLA (service level agreement) is not breached. An error budget is the accepted level of unavailability before customer happiness is impacted. An SLA is a formal agreement between the customer and the service provider that specifies the repercussions of failure. Breaching an SLA usually has financial repercussions for the business.

In case the SLA is about to be breached or has already been breached, the incident is promptly escalated functionally (escalated to a specialized or high-level team) or hierarchically (incident escalated to a person of authority who assigns a specialized resource to resolve the issue). 
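
Both ideas in this step come down to simple arithmetic, sketched below. The 99.9% availability target, the 30-day window, the 4-hour resolution time, and the 80% warning threshold are all assumed values for illustration, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def error_budget_minutes(availability_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability target over a window."""
    return (1.0 - availability_target) * window_days * 24 * 60


def should_escalate(opened_at: datetime, now: datetime,
                    sla_resolution: timedelta,
                    warn_fraction: float = 0.8) -> bool:
    """True once the ticket has used warn_fraction of its SLA resolution time."""
    return (now - opened_at) >= sla_resolution * warn_fraction


# A 99.9% target over 30 days leaves roughly 43 minutes of downtime.
print(f"{error_budget_minutes(0.999):.1f} minutes")

opened = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
checked = datetime(2024, 1, 1, 15, 15, tzinfo=timezone.utc)
print(should_escalate(opened, checked, timedelta(hours=4)))
```

Escalating at a fraction of the SLA time, rather than at the deadline itself, gives the next support level some room to act before a breach occurs.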

  3. Investigation and diagnosis

In ITIL incident management, the investigation is singled out as a particular step. However, an investigation happens at every step of the process. 

It starts with the front-line team during the initial diagnosis. If they successfully resolve the issue, then you directly skip to the resolution and closure phases. Otherwise, investigation and diagnosis are carried out as the incident is escalated to the level 2 and 3 support teams. In some cases, a specialized resource is assigned or other department members come together to assist with the problem. 

  4. Incident resolution and recovery

Once the incident is diagnosed correctly, the team promptly starts working on a resolution. Even after a fix is found, it may need to be tested and deployed, so recovery covers the time it takes to fully restore the service's operations. At the end of this stage, the service desk team confirms that the service has been restored.

  5. Incident closure

After the incident is resolved, it's passed back to the service desk for closure. To ensure quality, only the service desk team can close an incident. Additionally, before closing the incident, it's important to check with the person who reported the issue to confirm that the resolution is satisfactory and services have been fully restored.
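
One way to picture the full workflow is as a small state machine. The sketch below is an interpretation of the steps in this guide, not an official ITIL model: the state names, the transition table, and the `service_desk` role string are assumptions made for the example.

```python
from enum import Enum


class State(Enum):
    LOGGED = "logged"
    CATEGORIZED = "categorized"
    PRIORITIZED = "prioritized"
    IN_DIAGNOSIS = "in_diagnosis"
    ESCALATED = "escalated"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Assumed transition table: which states may follow each state.
ALLOWED = {
    State.LOGGED: {State.CATEGORIZED},
    State.CATEGORIZED: {State.PRIORITIZED},
    State.PRIORITIZED: {State.IN_DIAGNOSIS},
    State.IN_DIAGNOSIS: {State.ESCALATED, State.RESOLVED},
    State.ESCALATED: {State.ESCALATED, State.RESOLVED},  # tier 2 may escalate to tier 3
    State.RESOLVED: {State.CLOSED, State.IN_DIAGNOSIS},  # reopen if the reporter is not satisfied
    State.CLOSED: set(),
}


def advance(current: State, target: State, actor_role: str) -> State:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    if target is State.CLOSED and actor_role != "service_desk":
        # Only the service desk team can close an incident.
        raise PermissionError("only the service desk can close an incident")
    return target


state = State.LOGGED
for step in (State.CATEGORIZED, State.PRIORITIZED, State.IN_DIAGNOSIS,
             State.RESOLVED, State.CLOSED):
    state = advance(state, step, actor_role="service_desk")
print(state.value)
```

Encoding the closure rule in the transition function means a resolver cannot skip the confirmation step by closing the ticket directly.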

Incident Response Best Practices

Once you’ve mastered the basic stages of the incident management lifecycle, elevate your process by employing some industry best practices. Below are five examples of incident response best practices along with brief explanations on why they are helpful and how to implement them.

Workflow Automation

During an incident, getting the right people involved and keeping everyone informed can be quite a challenge. Automating your workflow and process with an incident response platform can lead to improved communication between various teams. On top of that, an up-to-date runbook helps teams hit the ground running when an incident occurs.

Proper Communication Between Teams

Communication must remain persistent throughout the incident lifecycle. Keep your team members and other stakeholders in the loop regarding any progress with the incident. The best way to do that is to record any progress in a live incident document or channels such as Slack or Microsoft Teams. That way, anyone can take a look any time and know what has been done and what is currently happening.

Take a Blameless Approach to Incident Response

Once you’ve resolved an incident, the next step is to come together as a team for a blameless retrospective or post-mortem. During the review, avoid pointing fingers and focus on sharing anything that can improve the process (including the runbook), the tooling, and of course the system or service itself. This learning helps the entire organization better manage incidents in the future.

Learn From and Improve Incident Response

With the right mindset, every incident is an opportunity to learn and grow — that includes learning how to improve the response process too. The process of incident response gives us an opportunity to break down organizational silos and improve collaboration among various teams, from developers to release engineers, operations, and site reliability engineers. How you manage incidents inside your organization will likely evolve over time as the company grows and the team matures. At a minimum, everyone should have a solid understanding of the process. That way you can carry on building your product and ultimately deploy more frequently, with minimal downtime.

Over time, tracking mean-time-to-detection (MTTD), mean-time-to-repair (MTTR), and mean-time-between-failures (MTBF) can provide insight into your team’s rate of improvement.
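
As a rough sketch, these metrics can be computed from a list of incident timestamps. The records below are made up, and the definitions are assumptions: MTTD is measured from the start of the failure to detection, MTTR from the start of the failure to resolution, and MTBF as the uptime between one incident's resolution and the next incident's start. Teams define these metrics in slightly different ways, so check which convention yours uses.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (started, detected, resolved), in time order.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 10), datetime(2024, 1, 1, 11, 0)),
    (datetime(2024, 1, 11, 10, 0), datetime(2024, 1, 11, 10, 20), datetime(2024, 1, 11, 12, 0)),
    (datetime(2024, 1, 21, 10, 0), datetime(2024, 1, 21, 10, 0), datetime(2024, 1, 21, 10, 30)),
]


def mean_minutes(deltas):
    """Average a sequence of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


mttd_minutes = mean_minutes([detected - started for started, detected, _ in incidents])
mttr_minutes = mean_minutes([resolved - started for started, _, resolved in incidents])

# Uptime between the end of one incident and the start of the next.
uptimes = [nxt[0] - prev[2] for prev, nxt in zip(incidents, incidents[1:])]
mtbf_hours = mean_minutes(uptimes) / 60

print(f"MTTD: {mttd_minutes:.1f} min, MTTR: {mttr_minutes:.1f} min, MTBF: {mtbf_hours:.1f} h")
```

Tracking the same three numbers quarter over quarter shows whether detection, repair, or overall stability is the area that is actually improving.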

Practice and Prepare

The only real way to build a reliable system with resilient teams and practices in a constantly shifting environment is through practice and preparation. Developers and sysadmins often become site reliability engineers (SREs) who focus on responding to and resolving incidents. Chaos engineering and running a GameDay are two excellent ways to prepare your team for various incidents:

  • Chaos engineering: a discipline that involves deliberately injecting failures into a distributed system to test and improve its reliability.
  • GameDay: an incident simulation to test the system and its processes alongside the team's response to the incident.

Example of an Unmanaged Incident

How well an incident is managed is often the difference between whether or not it gets resolved. Let's look at an example of an unmanaged incident to see how good management or orchestration plays a role in incident response.

Suppose it’s 3 a.m. on a Wednesday and the on-call engineer, David, is working on regular everyday tasks. Suddenly he is alerted that one of their data centers is down. At once, he goes through the logs, and a brief look suggests that a recently updated feature is causing the issue. He tries rolling back to the previous version, which doesn’t do the trick, so he calls the developer who worked on that exact update and asks them to look at the problem.

So far, only the dev team is involved, and as soon as the management team finds out about the outage, they will want some answers and updates. However, David can only focus on one thing at a time. Hours pass and two more data centers go out, and there’s only one server to handle all the traffic, which ultimately brings down the entire service altogether.

What Made the Incident Unmanageable?

The challenges with how this incident was managed can be broken down into four points:

  • Extreme focus on the technical side of the issue.
  • Poor or absent communication between the dev team and other teams.
  • Poor collaboration among team members and no record of communications such as phone calls.
  • No centralized command.

Example of a Well-Managed Incident

Let’s explore that same scenario with some minor changes. David, the on-call engineer, is going through routine tasks when he’s paged that one of the data centers is down. As he starts to investigate, yet another alert notifies him that a second data center is also down. He immediately contacts another teammate, Ariana, to ask if she can take command while he continues troubleshooting.

After assuming command, Ariana quickly goes through a rundown with David and sends out the incident details to a pre-arranged list via email. David and Ariana discuss the details and agree that users will be impacted if a third data center goes down. They record the assessment in a live incident document.

As soon as the third alert goes out, Ariana updates everyone on the same email list, follows up with David, and alerts the on-call developer, Marie, who has expert knowledge about data centers. She and her team go through the incident document, prioritize tasks, and start working on the problem. They try a few fixes that don’t work, and Ariana updates the incident management document.

The day is coming to an end, so Ariana starts looking for replacement staff to take over the incident so her colleagues can go home and rest after many hours of intense work. Before handing off the command, Ariana has a Zoom video meeting with the new team she’s handing off to and everything runs efficiently with clear responsibilities.

The next morning, David gets back to work and finds out that the problem was mitigated and the incident has been closed. The teams are now working on the retrospective report. Finally, David settles down to document improvements and follow-up actions so that similar incidents are handled well in the future and everyone can learn from the steps taken.

What Makes the Incident Well-Managed?

In the same incident, here are some details that changed the outcome.

  • A clear chain of command allowed communication to flow properly between teams, giving David adequate time to focus on the issue at hand.
  • Separation of responsibilities between individuals. The incident command, ops team, and communication team had their own set of responsibilities and tasks.
  • A live incident document kept a record of various steps taken to resolve the incident.
  • There was a clear and specific handoff between the initial team and the new team.

Learning from Incidents: Incident Retrospectives 

Efficient incident management goes beyond resolving the issue. You should always analyze and learn from incidents to see how you can be better prepared in the future. An incident retrospective or “postmortem” is a document that outlines the details of an incident such as its contributing factors, the response team members, the steps taken to resolve the incident, and other contextual information to provide a full story.

An incident retrospective should always be blameless. Having a blameless attitude means no pointing fingers, and it encourages orgs to root out systemic problems (and solutions), which is much more productive anyway. Another benefit of taking a blameless approach is that everyone feels safe to share creative ideas and solutions without fear of retaliation. 

 

Analyzing an incident involves more than writing the retrospective; the stakeholders also hold an actual meeting. The meeting should typically be held within 24 hours of the incident resolution, while the context is still fresh in everyone’s minds.

An important goal of the meeting is to identify areas of improvement for the org, whether that be the service itself, the process, or the tooling. How can the system overall be improved? Feel free to ask the group questions to really get a sense of what led to the incident and what steps led to the resolution. Take what’s shared and turn it into learning. This is the hard part, but there’s always a valuable lesson to be learned. Make sure to document the discussion.

How can Blameless Help?

Service reliability is a supremely important quality of any organization, and incident management plays an important part in maintaining that. We hope this article helped answer most of your questions about the incident management workflow. Blameless is the home base for on-call teams that want to achieve seamless incident response and smooth workflows. Our incident response feature allows you to manage runbooks, automate task checklists, assign roles, and easily communicate with stakeholders throughout the entire process. Sign up for a free trial today.
