What does an incident management workflow look like? We give a step-by-step guide to the ITIL process and best practices for effective resolution.
What is an Incident Management Process?
An incident management process is the set of actions and procedures an organization follows to recover from an unplanned service interruption.
Importance of a Standardized Incident Management Workflow
Incidents can affect companies drastically. Service unavailability, or downtime, can be very costly for businesses. In research by ITIC, 98% of organizations said that one hour of downtime costs them over $100K, and 81% said that it costs their business more than $300K. A study by Gartner also reports that system or service downtime can cost organizations up to $300K per hour.
Defining a clear incident management workflow is key to resolving incidents faster and reducing costs. IT support teams are most efficient when you’ve implemented a clear incident management process following the best practices. The benefits of having a clear incident management workflow include:
- Faster incident resolution and improved MTTR (Mean Time to Resolution)
- Reduced costs and impact on revenue for the business
- Better internal and external communication during incident management
- Continuous improvement and learning
- Improved customer experience
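To make the MTTR benefit concrete, here is a minimal sketch of how the metric can be computed, assuming each resolved incident is stored as a pair of report and resolution timestamps. The function name and the sample data are hypothetical and only illustrate the arithmetic.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average the time between report and resolution over resolved incidents."""
    durations = [resolved - reported for reported, resolved in incidents]
    if not durations:
        raise ValueError("need at least one resolved incident")
    return sum(durations, timedelta()) / len(durations)


# Hypothetical sample data: (reported_at, resolved_at) pairs.
sample = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 10, 0)),    # 60 minutes
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 30)),  # 30 minutes
    (datetime(2024, 1, 3, 8, 0), datetime(2024, 1, 3, 9, 30)),    # 90 minutes
]

mttr = mean_time_to_resolution(sample)
print(mttr)  # prints 1:00:00
```

Tracking this number before and after a workflow change is one simple way to check whether the change actually sped up resolution.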
Incident Management and ITSM
Organizations do not usually define or reinvent the incident management process from scratch. Instead, they draw on industry best practices and adapt them to fit their individual needs. Before diving deeper into incident management, here are some important terms and definitions.
An incident is any unplanned event that disrupts the normal operation of a service or reduces its quality. Anything from service downtime to a slow web server can be categorized as an incident.
Incidents are often confused with problems, but an incident is the unplanned event itself, whereas a problem is the underlying cause behind one or more incidents. Incident management focuses on returning the service to normal operation as quickly as possible. Problem management involves identifying the root cause of the incident to prevent it from recurring.
ITSM (IT Service Management) includes the processes and tasks involved in managing the end-to-end delivery of IT services. ITSM’s key concept is that IT should be delivered as a service, and incident management is one of its practices.
ITIL (IT Infrastructure Library) is a detailed set of best practices (similar to a playbook) focused on aligning IT services with business needs.
ITIL 4 vs ITIL 3 Incident Workflows - is there a difference?
ITIL 3 and ITIL 4 have the same overall goals in managing incidents effectively and consistently. The major difference is in how they accomplish these goals.
ITIL 3 prescribes 26 processes, of which incident management is one. These processes take you through the development and operation of a service in five major categories: service strategy, service design, service transition, service operation, and continual service improvement.
ITIL 4, on the other hand, is less prescriptive about processes and instead encourages best practices that can be applied to create processes specific to your organization. It takes a more holistic view, looking not just at the specific steps of development and operations but also at the contextual factors of your organization that affect how you can respond. For example, best practices around talent management and training are included.
ITIL Incident Management Workflow: Step by Step
We will follow the ITIL framework to go through a high-level overview of proper ticket handling in incident management. Most other frameworks outline roughly similar concepts. In incident management, it’s vital to have a good process and stick to it.
Step 1: Incident Identification and Logging
Anyone can identify an incident. Sometimes an employee reports the issue, and sometimes it’s identified by end-users or monitoring systems. Reports can arrive via an automatic alert, text message, email, or phone call. Upon receiving the report, the service desk team records it and determines whether it’s an incident or a service request, as each one is handled differently.
After identifying an incident, the help desk team logs it as an incident, and creates a ticket with the following information:
- Name and Contact of the person who reported the incident.
- Date and Time of the incident report.
- Incident Description along with what is not working properly or went down.
- A unique Incident ID for tracking the incident.
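The ticket fields listed above map naturally onto a small record type. The following is a minimal sketch, assuming a UTC timestamp and a random ID with an `INC-` prefix. The class name, field names, and ID format are hypothetical and not taken from any particular ticketing tool.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _new_incident_id() -> str:
    # Hypothetical ID scheme: "INC-" plus 8 random hex characters.
    return f"INC-{uuid.uuid4().hex[:8].upper()}"


@dataclass
class IncidentTicket:
    reporter_name: str      # who reported the incident
    reporter_contact: str   # how to reach them for follow-up and closure
    description: str        # what is not working or went down
    reported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    incident_id: str = field(default_factory=_new_incident_id)


ticket = IncidentTicket(
    reporter_name="Sam Example",
    reporter_contact="sam@example.com",
    description="Checkout page returns HTTP 500 for all users.",
)
print(ticket.incident_id, ticket.reported_at.isoformat())
```

Generating the ID and timestamp automatically keeps the reporter from having to supply them and makes every ticket traceable from the moment it is logged.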
Step 2: Incident Categorization
The second step in ITIL incident management, categorization, marks the difference between an efficient and an inefficient help desk team. Efficient incident categorization streamlines the logging process and reduces redundancy while speeding up overall incident resolution.
Firstly, you must assign a category (and sub-category) to every incident. Categorization helps the help desk team sort and prioritize issues. For example, an incident categorized as Category: “Network” and Sub-category: “Network Outage” will be considered high-priority as it has a direct impact on the customer.
Categorized incidents are also easier to track in the long run. When an incident is accurately categorized, patterns emerge making it easier to identify trends that require problem management or training. Trends also make it easier for teams to sell an idea to the C-suite. For example, if a trend indicates that you need to update your hardware, then the CFO is more likely to approve.
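As a rough illustration of how categorized tickets surface trends, the sketch below counts incidents per category and per category/sub-category pair. The ticket data is invented for the example, and the category names other than "Network" and "Network Outage" are assumptions.

```python
from collections import Counter

# Hypothetical closed tickets as (category, sub_category) pairs.
tickets = [
    ("Network", "Network Outage"),
    ("Hardware", "Laptop Failure"),
    ("Network", "Slow Wi-Fi"),
    ("Hardware", "Laptop Failure"),
    ("Access", "Password Reset"),
    ("Hardware", "Laptop Failure"),
]

# Count incidents per top-level category and per exact pair.
by_category = Counter(category for category, _ in tickets)
by_pair = Counter(tickets)

top_category, top_count = by_category.most_common(1)[0]
top_pair, pair_count = by_pair.most_common(1)[0]

print(top_category, top_count)  # prints Hardware 3
print(top_pair, pair_count)
```

A recurring pair such as repeated hardware failures is the kind of pattern that can justify a problem management investigation or a budget request.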
Step 3: Incident Prioritization
Once categorized, every incident must be prioritized. Prioritization helps teams identify which incidents are causing more damage and require an urgent response. Incidents are prioritized by considering various factors including:
- How many people will be impacted
- Potential financial impact
- Security impact
- SLA compliance implications
Incidents are usually classified into three levels based on the degree of damage and urgency: low priority, medium priority, and high priority.
- Low-priority incidents: small, non-critical issues that do not interrupt users or the business and can be resolved quickly.
- Medium-priority incidents: issues that impact some part of the staff or some business operations, but do not have a big impact on the customers.
- High-priority incidents: issues that impact a large number of users or customers, interrupt regular business operations, and impact service delivery. High-priority incidents usually have a financial impact.
When in doubt about the priority of an incident, always go with a high-priority level. It’s better to err on the side of caution rather than letting a severe incident slip through the cracks.
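The prioritization rules above can be sketched as a small decision function. This is only an illustration under stated assumptions: the parameter names, the 100-user threshold, and the exact rule order are hypothetical, and a real team would tune them to its own SLAs. Unknown answers are passed as `None` and treated as high priority, following the advice to err on the side of caution.

```python
from enum import Enum
from typing import Optional


class Priority(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Hypothetical cutoff for "a large number of users".
LARGE_USER_COUNT = 100


def prioritize(
    users_impacted: Optional[int],
    revenue_at_risk: Optional[bool],
    security_impact: Optional[bool],
    sla_at_risk: Optional[bool],
) -> Priority:
    """Map the four factors to a priority; None means 'not known yet'."""
    answers = (users_impacted, revenue_at_risk, security_impact, sla_at_risk)
    if any(answer is None for answer in answers):
        return Priority.HIGH  # when in doubt, go with high priority
    if revenue_at_risk or security_impact or sla_at_risk:
        return Priority.HIGH
    if users_impacted >= LARGE_USER_COUNT:
        return Priority.HIGH
    if users_impacted > 0:
        return Priority.MEDIUM
    return Priority.LOW


print(prioritize(0, False, False, False))     # prints Priority.LOW
print(prioritize(5, False, False, False))     # prints Priority.MEDIUM
print(prioritize(None, False, False, False))  # prints Priority.HIGH
```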
Step 4: Incident Response
Now that we have logged the incident, categorized it, and prioritized it, the next step is incident response. Incident response is a pretty broad term and breaks down into further steps. There are generally five steps involved in incident response that we will discuss below.
- Initial diagnosis
The initial diagnosis is similar to what a doctor does after listening to a patient’s symptoms. The doctor cannot yet diagnose the exact illness, but the symptoms make it easier to form a hypothesis about what might be wrong. At this step in incident management, diagnostic manuals, troubleshooting runbooks, and knowledge bases can come in handy.
The first responder to the incident tries to resolve the issue at this stage based on their own initial diagnosis. If they can’t resolve the problem, then it escalates to the next level.
- Incident Escalation and SLA management
The front-line support team is able to solve most of the small issues. However, if the problem is more complex, then they gather and log information and pass it on to the next level of technical support. That way, the second or third-level support teams can quickly and efficiently start working on the problem.
During this stage, the support teams must ensure that the error budget is not exceeded and the SLA (service level agreement) is not breached. An error budget is the accepted level of unavailability before customer happiness is impacted. An SLA is a formal agreement between the customer and the service provider that specifies the repercussions of failure. Breaching an SLA usually has financial repercussions for the business.
In case the SLA is about to be breached or has already been breached, the incident is promptly escalated functionally (escalated to a specialized or high-level team) or hierarchically (incident escalated to a person of authority who assigns a specialized resource to resolve the issue).
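Both ideas in this step reduce to simple time arithmetic. The sketch below assumes an availability target expressed as a fraction and an SLA expressed as a maximum resolution time. The function names, the 80% warning threshold, and the sample figures are hypothetical.

```python
from datetime import datetime, timedelta


def error_budget(availability_target: float, window: timedelta) -> timedelta:
    """Allowed downtime in a window for a given availability target."""
    return window * (1 - availability_target)


def should_escalate(
    opened_at: datetime,
    now: datetime,
    sla_resolution_time: timedelta,
    warn_fraction: float = 0.8,  # hypothetical: escalate at 80% of the SLA
) -> bool:
    """True once the incident has used warn_fraction of its SLA time."""
    elapsed = now - opened_at
    return elapsed >= sla_resolution_time * warn_fraction


# About 43 minutes of allowed downtime for 99.9% over 30 days.
budget = error_budget(0.999, timedelta(days=30))
print(budget)

opened = datetime(2024, 1, 1, 9, 0)
sla = timedelta(hours=4)
print(should_escalate(opened, datetime(2024, 1, 1, 10, 0), sla))   # prints False
print(should_escalate(opened, datetime(2024, 1, 1, 12, 30), sla))  # prints True
```

Escalating before the deadline, rather than at it, leaves the next support level some time to act before the SLA is actually breached.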
- Investigation and diagnosis
In ITIL incident management, the investigation is singled out as a particular step. However, an investigation happens at every step of the process.
It starts with the front-line team during the initial diagnosis. If they successfully resolve the issue, then you directly skip to the resolution and closure phases. Otherwise, investigation and diagnosis are carried out as the incident is escalated to the level 2 and 3 support teams. In some cases, a specialized resource is assigned or other department members come together to assist with the problem.
- Incident resolution and recovery
Once the incident is diagnosed correctly, the team promptly starts working on a resolution. Even after a fix is found, it may still need to be tested and deployed, so recovery, the time it takes to fully restore the service’s operations, can extend beyond the moment a resolution is identified. At the end of this stage, the service desk team confirms that the service has been restored.
- Incident closure
After resolution, the incident is passed back to the service desk for closure. To ensure quality, only the service desk team can close an incident. Additionally, before closing the incident, it’s important to check with the person who reported the issue to confirm that the resolution is satisfactory and services have been fully restored.
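The closure rules can be enforced as explicit guards in tooling. The following is a minimal sketch, assuming incident status and the actor's role are plain strings. The status values, role names, and exception type are hypothetical.

```python
class IncidentClosureError(Exception):
    """Raised when an incident does not meet the conditions for closure."""


def close_incident(status: str, actor_role: str, reporter_confirmed: bool) -> str:
    """Return the new status if closure is allowed, otherwise raise."""
    if status != "resolved":
        raise IncidentClosureError("only resolved incidents can be closed")
    if actor_role != "service_desk":
        raise IncidentClosureError("only the service desk can close an incident")
    if not reporter_confirmed:
        raise IncidentClosureError("the reporter has not confirmed the resolution")
    return "closed"


print(close_incident("resolved", "service_desk", True))  # prints closed

try:
    close_incident("resolved", "tier_2", True)
except IncidentClosureError as error:
    print(f"rejected: {error}")
```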
Roles Involved in IT Incident Management
Although the basics are the same, every organization customizes roles and responsibilities to fit its incident requirements. The following are the most common IT incident management roles.
- End-user/user: the stakeholder who first experienced and reported the issue.
- Incident manager or Incident Commander: the person who has overall responsibility and authority for the incident response.
- Tech lead: the senior technical responder who is responsible for restoring the service to normal operation.
- Communications lead: a person from the customer support or PR teams, responsible for internal and external communication about the incident progress.
- Tier 1 Service desk: the front-line service team consisting of people with knowledge and experience about the most common incidents such as password resets and Wi-Fi problems.
- Tier 2 Service desk: people with advanced incident management knowledge and experience. They primarily work on escalated incidents.
- Tier 3 Service desk: specialists and subject matter experts with advanced knowledge of a particular domain within the IT infrastructure.
Learning from Incidents: Incident Retrospectives
Efficient incident management goes beyond resolving the issue. You should always analyze and learn from incidents to see how you can be better prepared in the future. An incident retrospective or “postmortem” is a document that outlines the details of an incident such as its contributing factors, the response team members, the steps taken to resolve the incident, and other contextual information to provide a full story.
An incident retrospective should always be blameless. Having a blameless attitude means no pointing fingers, and it encourages orgs to root out systemic problems (and solutions), which is much more productive anyway. Another benefit of taking a blameless approach is that everyone feels safe to share creative ideas and solutions without fear of retaliation.
Analyzing an incident involves more than writing the retrospective. The stakeholders also hold an actual meeting, typically within 24 hours of the incident resolution, while the context is still fresh in everyone’s minds.
An important goal of the meeting is to identify areas of improvement for the org, whether that be the service itself, the process, or the tooling. How can the system overall be improved? Feel free to ask the group questions to really get a sense of what led to the incident and what steps led to the resolution. Take what’s shared and turn it into lessons. This is the hard part, but there’s always a valuable lesson to be learned. Make sure to document the discussion.
How can Blameless Help?
Service reliability is a supremely important quality of any organization, and incident management plays an important part in maintaining that. We hope this article helped answer most of your questions about the incident management workflow. Blameless is the home base for on-call teams that want to achieve seamless incident response and smooth workflows. Our incident response feature allows you to manage runbooks, automate task checklists, assign roles, and easily communicate with stakeholders throughout the entire process. Sign up for a free trial today.