Looking into the Incident Management Process? We give a step-by-step guide to the incident management process and best practices for an effective resolution.
An incident management process is the actions and procedures an organization takes to recover from an unplanned service interruption.
Incidents affect companies in drastic ways. Service unavailability, or downtime, can create huge costs for businesses. In research by ITIC, 98% of organizations said that one hour of downtime costs them over $100K, and 81% said that it costs their business more than $300K. A study by Gartner reports that system or service downtime can cost organizations up to $300K per hour.
Defining a clear incident management process is key to resolving incidents faster and reducing costs. IT support teams are most efficient when you’ve implemented a clear incident management process following the best practices. The benefits of having a clear incident management process include:
Organizations do not usually define or reinvent the incident management process from scratch. Instead, they draw on industry best practices and adapt them to fit their individual needs. Before diving deeper into incident management, we need to define some important terms.
An incident is any unplanned event that disrupts the normal operation of a service or reduces its quality. Anything from service downtime to a slow web server can be categorized as an incident.
Incidents are often confused with problems, but an incident is an unplanned event, whereas a problem is the underlying cause behind one or more incidents. Incident management focuses on returning the service to normal operation as quickly as possible. Problem management involves identifying the root cause of the incident to prevent it from recurring.
ITSM (IT Service Management) includes the processes and tasks involved in managing end-to-end IT services delivery. ITSM’s key concept is that IT should be delivered as a service, and incident management is one of its practices.
ITIL (IT Infrastructure Library) is a detailed set of best practices (similar to a playbook) focused on aligning IT services with business needs.
We will follow the ITIL framework to go through a high-level overview of proper ticket handling in incident management. Most other frameworks outline roughly similar concepts. In incident management, it’s vital to have a good process and stick to it.
Anyone can identify an incident. Sometimes an employee reports the issue, and sometimes it’s identified by end-users or monitoring systems. The report can arrive via an automatic alert, text message, email, or phone call. Upon receiving the report, the service desk team records it and determines whether it’s an incident or a service request, as each one is handled differently.
After identifying an incident, the help desk team logs it as an incident, and creates a ticket with the following information:
The second step in incident management, categorization, marks the difference between an efficient and an inefficient help desk team. Efficient incident categorization streamlines the logging process and reduces redundancy while speeding up the overall incident resolution.
Firstly, you must assign a category (and sub-category) to every incident. Categorization helps the help desk team sort and prioritize issues. For example, an incident categorized as Category: “Network” and Sub-category: “Network Outage” will be considered high-priority as it has a direct impact on the customer.
Categorized incidents are also easier to track in the long run. When an incident is accurately categorized, patterns emerge making it easier to identify trends that require problem management or training. Trends also make it easier for teams to sell an idea to the C-suite. For example, if a trend indicates that you need to update your hardware, then the CFO is more likely to approve.
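As a rough illustration of how categorized tickets make trends visible, here is a minimal Python sketch. The `Ticket` fields and the `category_trends` helper are hypothetical names chosen for this example and are not part of ITIL or any particular help desk tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Ticket:
    """Hypothetical ticket record; field names are illustrative only."""
    ticket_id: int
    summary: str
    category: str
    sub_category: str


def category_trends(tickets):
    """Count tickets per (category, sub-category), most frequent first."""
    counts = Counter((t.category, t.sub_category) for t in tickets)
    return counts.most_common()


tickets = [
    Ticket(1, "Office VPN unreachable", "Network", "Network Outage"),
    Ticket(2, "Laptop fan failing", "Hardware", "Laptop"),
    Ticket(3, "Branch site offline", "Network", "Network Outage"),
]

# Print each category pair with its ticket count.
for (category, sub_category), count in category_trends(tickets):
    print(f"{category} / {sub_category}: {count}")
```

A recurring pair at the top of this list is the kind of pattern that might justify problem management, training, or a hardware request.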
Once categorized, every incident must be prioritized. Prioritization helps teams identify which incidents are causing more damage and require an urgent response. Incidents are prioritized by considering various factors including:
Incidents are usually classified into three priority levels (low, medium, and high) based on the level of damage and urgency.
When in doubt about the priority of an incident, always go with a high-priority level. It’s better to err on the side of caution rather than letting a severe incident slip through the cracks.
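The prioritization rule can be sketched in a few lines of Python. This is one possible reading, not a standard formula: it assumes the priority is the higher of the damage and urgency ratings, and it treats an unknown rating as a reason to default to high priority. The `Level` enum and `prioritize` function are hypothetical names.

```python
from enum import IntEnum
from typing import Optional


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def prioritize(damage: Optional[Level], urgency: Optional[Level]) -> Level:
    """Return the incident priority from damage and urgency ratings.

    Assumption: the priority is the higher of the two ratings. If either
    rating is unknown (None), default to HIGH to err on the side of caution.
    """
    if damage is None or urgency is None:
        return Level.HIGH
    return max(damage, urgency)
```

For example, `prioritize(Level.LOW, Level.MEDIUM)` yields `Level.MEDIUM`, while `prioritize(None, Level.LOW)` yields `Level.HIGH` because the damage is not yet known.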
Now that we have logged the incident, categorized it, and prioritized it, the next step is incident response. Incident response is a pretty broad term and breaks down into further steps. There are generally five steps involved in incident response that we will discuss below.
The initial diagnosis is similar to what a doctor does after listening to a patient’s symptoms. The doctor cannot yet diagnose the exact illness, but the symptoms make it easier to form a hypothesis about what might be wrong. At this step in incident management, diagnostic manuals, troubleshooting runbooks, and knowledge bases can come in handy.
The first responder to the incident tries to resolve the issue at this stage based on their own initial diagnosis. If they can’t resolve the problem, then it escalates to the next level.
The front-line support team is able to solve most of the small issues. However, if the problem is more complex, then they gather and log information and pass it on to the next level of technical support. That way, the second or third-level support teams can quickly and efficiently start working on the problem.
During this stage, the support teams must ensure that they do not exceed the error budget and that the SLA (service level agreement) is not breached. An error budget is the accepted level of unavailability before customer happiness is impacted. An SLA is a formal agreement between the customer and the service provider that specifies the repercussions of failure. Breaching an SLA usually has financial repercussions for the business.
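To make the error budget concrete, the sketch below converts an availability target into allowed downtime. It assumes the budget is expressed as an availability target over a fixed window (for example, 99.9% over 30 days), which is a common convention but not something this article prescribes. The function names are hypothetical.

```python
def error_budget_minutes(availability_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a target such as 0.999 over a window."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - availability_target)


def budget_remaining(availability_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Minutes of budget left; a negative value means the budget is exceeded."""
    return error_budget_minutes(availability_target, window_days) - downtime_minutes
```

Under these assumptions, a 99.9% target over 30 days allows roughly 43.2 minutes of downtime, so an incident that has already lasted 30 minutes leaves about 13 minutes of budget.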
In case the SLA is about to be breached or has already been breached, the incident is promptly escalated functionally (escalated to a specialized or high-level team) or hierarchically (incident escalated to a person of authority who assigns a specialized resource to resolve the issue).
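A simple way to detect when escalation is due is to compare the time elapsed on an incident against its SLA window. The following Python sketch is illustrative only: `escalation_needed` and the `warn_fraction` threshold (the share of the SLA window after which an incident counts as "about to be breached") are assumptions made for this example.

```python
from datetime import datetime, timedelta
from typing import Optional


def escalation_needed(opened_at: datetime, now: datetime,
                      sla: timedelta, warn_fraction: float = 0.8) -> Optional[str]:
    """Return 'breached', 'at_risk', or None for an open incident.

    warn_fraction is a hypothetical threshold: the share of the SLA window
    after which the incident is treated as about to breach.
    """
    elapsed = now - opened_at
    if elapsed >= sla:
        return "breached"
    if elapsed >= sla * warn_fraction:
        return "at_risk"
    return None
```

Either non-`None` result would trigger an escalation. Whether that escalation is functional or hierarchical depends on the organization’s own policy.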
In ITIL incident management, the investigation is singled out as a particular step. However, an investigation happens at every step of the process.
It starts with the front-line team during the initial diagnosis. If they successfully resolve the issue, then you directly skip to the resolution and closure phases. Otherwise, investigation and diagnosis are carried out as the incident is escalated to the level 2 and 3 support teams. In some cases, a specialized resource is assigned or other department members come together to assist with the problem.
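The hand-off between support levels can be modeled as trying each tier in order and carrying notes forward. This is a minimal sketch under that assumption. The `resolve_through_tiers` function and the tier handlers are hypothetical stand-ins for real support teams.

```python
from typing import Callable, List, Optional, Tuple

# Each tier is a (name, handler) pair. A handler returns a resolution
# string if it can resolve the incident, or None if it cannot.
Tier = Tuple[str, Callable[[str], Optional[str]]]


def resolve_through_tiers(incident: str, tiers: List[Tier]):
    """Try each support tier in order, logging a note on every escalation."""
    notes: List[str] = []
    for name, handler in tiers:
        resolution = handler(incident)
        if resolution is not None:
            return name, resolution, notes
        notes.append(f"{name}: could not resolve, escalating")
    # No tier resolved the incident; the notes record every attempt.
    return None, None, notes


tiers = [
    ("level 1", lambda incident: None),
    ("level 2", lambda incident: "restarted the affected service"),
    ("level 3", lambda incident: "patched the underlying defect"),
]
resolved_by, resolution, notes = resolve_through_tiers("checkout page is slow", tiers)
```

In this usage example the front-line handler cannot resolve the incident, so it is escalated once and resolved at level 2. The accumulated notes are what lets the next level start working quickly.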
Once the incident is diagnosed correctly, the team promptly starts working on a resolution. Even after a resolution is found, some fixes need to be tested and deployed. Recovery covers the time it takes to fully restore the service’s operations. At the end of this stage, the service desk team confirms that the service has been restored.
After resolution, the incident is passed back to the service desk for closure. To ensure quality, only the service desk team can close an incident. Additionally, before closing the incident, it’s important to check with the person who reported the issue to confirm that the resolution is satisfactory and services have been fully restored.
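The closure rules lend themselves to a small guard function. The sketch below is an assumption-laden illustration: the `Incident` record, the status strings, and the `"service_desk"` role name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record used only for this sketch."""
    status: str = "resolved"
    reporter_confirmed: bool = False


def close_incident(incident: Incident, closed_by_role: str) -> None:
    """Close a resolved incident, enforcing the closure rules."""
    if closed_by_role != "service_desk":
        raise PermissionError("only the service desk can close an incident")
    if incident.status != "resolved":
        raise ValueError("only a resolved incident can be closed")
    if not incident.reporter_confirmed:
        raise ValueError("the reporter has not confirmed the resolution")
    incident.status = "closed"
```

Raising an error instead of silently closing keeps an unconfirmed fix from being recorded as done.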
Although the basics are the same, every organization customizes roles and responsibilities according to its incident requirements. Still, most organizations share the following common IT incident management roles.
Efficient incident management goes beyond resolving the issue. You should always analyze and learn from incidents to see how you can be better prepared in the future. An incident retrospective or “postmortem” is a document that outlines the details of an incident such as its contributing factors, the response team members, the steps taken to resolve the incident, and other contextual information to provide a full story.
An incident retrospective should always be blameless. Having a blameless attitude means no pointing fingers, and it encourages orgs to root out systemic problems (and solutions), which is much more productive anyway. Another benefit of taking a blameless approach is that everyone feels safe to share creative ideas and solutions without fear of retaliation.
Analyzing an incident involves not only writing the retrospective but also holding a meeting with the stakeholders. The meeting should typically be held within 24 hours of the incident resolution, while the context is still fresh in everyone’s minds.
An important goal of the meeting is to identify areas of improvement for the org, whether that be the service itself, the process, or the tooling. How can the system overall be improved? Feel free to ask the group questions to really get a sense of what led to the incident and what steps led to the resolution. Take what’s shared and turn it into learnings. This is the hard part, but there’s always a valuable lesson to be learned. Make sure to document the discussion.
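As a starting point, the retrospective contents and the 24-hour meeting guideline could be captured in a small record like the one below. The `Retrospective` class and its field names are hypothetical, chosen to mirror the details listed above. A real template would likely hold more context.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class Retrospective:
    """Hypothetical retrospective record; field names are illustrative."""
    title: str
    resolved_at: datetime
    contributing_factors: List[str] = field(default_factory=list)
    responders: List[str] = field(default_factory=list)
    resolution_steps: List[str] = field(default_factory=list)

    def meeting_deadline(self) -> datetime:
        """Latest meeting time under the 24-hour guideline."""
        return self.resolved_at + timedelta(hours=24)
```

The record names contributing factors and steps, not a culprit, which fits the blameless approach.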
Service reliability is a supremely important quality of any organization, and incident management plays an important part in maintaining that. We hope this article helped answer most of your questions about the incident management process. Blameless is the home base for on-call teams that want to achieve seamless incident response and smooth workflows. Our incident response feature allows you to manage runbooks, automate task checklists, assign roles, and easily communicate with stakeholders throughout the entire process.