
What Is Incident Management in ITIL? Best Practices

Myra Nizami | 2.26.2024 | Incident Management Process

Incidents happen, so how do you handle them? We explain incident management, how to prioritize incidents, and the process for resolving them.

What is incident management?

Incident management is the process software teams use to identify, analyze, and resolve incidents to resume normal operations as soon as possible. An incident refers to an unexpected disruption to a service that affects the end user. A wide range of incidents can occur, including server crashes, network issues, and authentication errors, but ultimately, an incident will affect the end user somehow. 

With that in mind, incident management has two overarching goals: respond and resolve. Therefore, incident management will include procedures and actions that must be taken to respond to the incident and fix it to minimize end-user disruption as much as possible. 

What is ITIL?

The Information Technology Infrastructure Library, or ITIL, is a set of IT best practices that help companies efficiently meet their customer or business needs. It plays a critical role in incident management, helping you address every aspect of an incident to mitigate risks and restore services as quickly as possible.

Why does incident management matter?

Incidents are inevitable, but having a straightforward process to manage them when they occur has significant benefits not just for the team but also for the business and customers. For teams following a DevOps model, an SRE team and a dedicated incident management process can be designed to implement your DevOps goals. With incident management processes, teams feel empowered with tools and resources in place, and responsibilities are clearly distributed. Incident resolution is stressful for the team, and a clear process reduces scrambling and panic.

Ultimately, once teams hone their incident management process, it enables them to work faster and better. That means software metrics such as mean time to resolution and downtime reduction see improvement. As a result, customers aren’t impacted for as long. And all of that combined leads to better business outcomes since the customer experience improves, and the solution becomes more efficient overall.  
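To make a metric such as mean time to resolution concrete, here is a minimal sketch of how it could be computed. It assumes each incident is recorded as a pair of detection and resolution timestamps; the function name and the sample data are hypothetical and only for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average of (resolved - detected) across resolved incidents."""
    durations = [resolved - detected for detected, resolved in incidents]
    if not durations:
        raise ValueError("need at least one resolved incident")
    return sum(durations, timedelta()) / len(durations)


# Hypothetical sample: (detected_at, resolved_at) pairs.
incidents = [
    (datetime(2024, 2, 1, 9, 0), datetime(2024, 2, 1, 9, 45)),     # 45 minutes
    (datetime(2024, 2, 3, 14, 0), datetime(2024, 2, 3, 16, 0)),    # 120 minutes
    (datetime(2024, 2, 7, 22, 30), datetime(2024, 2, 7, 22, 45)),  # 15 minutes
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 1:00:00
```

Tracking this value over time is one way to check whether changes to the process are actually shortening incidents.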

Using the ITIL framework goes a step further, providing a more detailed playbook based on the insights of thousands of other IT service teams. There’s no need to reinvent the wheel when you have access to proven methods that create a coordinated response to minimize incident impact.

Incident management is a critical function for businesses of all sizes, ensuring that vulnerabilities and issues are addressed, especially when it comes to meeting service level agreements (SLAs).

Benefits of Incident Management

Incident management is critical to any organization that relies on systems, networks, applications, and software that contribute to customer experience, protect confidential and sensitive data or provide business-critical technology. The benefits of incident management include:

·   Improved system reliability: lessons learned from past incidents improve responses and help find and repair vulnerabilities. Quick answers also reduce the impact on users, which can lower call center volumes.

·   Faster response times help contain breaches, stop bad actors in their tracks, and interrupt the deployment of malware or ransomware, reducing the amount of damage the incident causes.

·   A clear understanding of roles and responsibilities and ongoing training and simulation exercises improve cross-departmental cooperation and enhance collaboration.

·   Service disruptions are minimized by quickly identifying issues to contain them and responding so the system is up and running in as short a time as possible.

Who Uses Incident Management?

Incident management is a cross-functional approach that includes many stakeholders, including team leaders, investigative leads, communication specialists and analysts. It is used by development and IT Operations teams to address unplanned events and service interruptions to ensure service and function are restored using a logical process.

Companies often enlist outsourced expertise to help provide quick resolutions and fill skills gaps when internal resources are limited. However, reliance on internal IT teams and system users is crucial to provide valuable insights regarding abnormal activity, evidence of bad actors, critical asset prioritization and navigation of the IT environment. Incident management also applies to all organizations and industries, ensuring response to incidents is swift to mitigate risks related to unwanted activities and breaches.  

Types of Incident Management Processes

There are three types of incident management processes you can follow:

1. ITIL

The ITIL framework ensures improvement to service quality and customer satisfaction by identifying ways to achieve continuous improvement. Steps include:

  • Incident Identification: This is usually initiated with a report sent to the IT service desk from employees, live chat, or network monitoring systems. The service desk reviews the report to determine whether it is an incident or simply a request, and then decides how to respond.
  • Incident Logging: When an incident is identified, it is logged by the service desk (or whoever makes the determination) opening a ticket or incident log that includes:
    o   Contact information of the person who reported the incident
    o   The date and time
    o   Incident description
    o   A tracking number, in most cases
  • Incident Categorization: This step assigns a category and subcategory that makes it easier for the service desk to analyze the incident and spot patterns to help prevent future incidents. It also streamlines the logging process to prioritize resolutions based on the seriousness of the incident.
  • Incident Prioritization: Prioritization is assigned based on an assessment of the impact the incident will have and the assets involved to determine how quickly the team must act to resolve the issue.
  • Incident Response: This process requires several steps, including:
    o   Initial diagnosis directing the ticket to the best team
    o   Incident escalation for complex or urgent issues
    o   Investigation and diagnosis conducted by the most qualified team confirming the initial diagnosis
    o   Incident resolution and recovery fixing the issue and investigating the causes to help ensure it doesn’t happen again
    o   Incident closure including follow up with the originator of the complaint, confirming everything is working as expected and providing insights into ways to improve the process for future incidents
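The logging, categorization, and prioritization steps above can be sketched in code. This is a minimal illustration rather than anything ITIL prescribes: the `Incident` class, the `INC-` ticket format, the three-level scale, and the scoring that combines impact with urgency are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

# Hypothetical ticket counter; a real service desk tool assigns its own IDs.
_ticket_numbers = count(1)

# Assumed three-level scale used for both impact and urgency.
LEVELS = {"low": 1, "medium": 2, "high": 3}


@dataclass
class Incident:
    # Fields captured when the incident is logged.
    reporter_contact: str
    description: str
    # Category and subcategory assigned by the service desk.
    category: str
    subcategory: str
    # Inputs to prioritization (the urgency field is an assumption).
    impact: str
    urgency: str
    reported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    ticket_id: str = field(
        default_factory=lambda: f"INC-{next(_ticket_numbers):05d}"
    )

    @property
    def priority(self) -> str:
        """Map combined impact and urgency to P1 (most urgent) through P4."""
        score = LEVELS[self.impact] + LEVELS[self.urgency]
        return {6: "P1", 5: "P2", 4: "P3"}.get(score, "P4")


outage = Incident(
    reporter_contact="jane@example.com",
    description="Checkout page returns HTTP 500",
    category="application",
    subcategory="payments",
    impact="high",
    urgency="high",
)
print(outage.ticket_id, outage.priority)
```

Deriving priority from recorded fields, rather than setting it by hand, keeps prioritization consistent between responders.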

2. Site reliability engineering (SRE)

The primary goal of SRE is to create scalable, reliable solutions using software to streamline the process. Although the process is similar to other incident management practices, emphasis is placed on preventing incidents from happening. Steps include:

·   Incident Identification, Logging, and Categorization following similar steps to ITIL

·   Incident Notification, Assignment, or Escalation following similar steps to ITIL

·   Incident Investigation and Diagnosis: In this case, observability tools are used to determine the state of the system and collect information to build a hypothesis about the causes to inform the resolution.

·   Incident Resolution: The responder team fixes the problem and observes performance to ensure everything is functioning appropriately. Each attempt to resolve complex issues is used to evolve the hypothesis to create more effective fixes.

·   Incident Closure: Once resolutions are confirmed, follow-up determines what is needed, such as a permanent fix, further preventative maintenance, etc.

3. DevOps

DevOps creates a unified approach with a focus on continuous delivery and infrastructure as code. This approach uses incidents to improve processes with resolutions followed up by adjustments such as changing code or updating automated tests to avoid future issues. Steps include:

·   Detection: DevOps incident response teams plan their responses to potential incidents proactively, identifying system weaknesses using monitoring tools, alert systems, and runbooks. They work collaboratively to ensure the right person is contacted based on the type of incident to streamline escalation.

·   Response: Incidents are assigned to multiple team members, so that if one on-call engineer cannot resolve the issue, they can use the runbook to bring in the right people.

·   Resolution: At this stage, resolutions happen quickly because the runbook gives responders access to the best knowledge on all aspects of the application or system code. Combined with the team’s proactive problem-solving approach, this shared knowledge helps resolve issues quickly.

·   Analysis: Once the issue is resolved, they analyze the incident as a team, sharing information, reviewing metrics, and taking a lessons-learned approach to improve system resilience.

·   Readiness: The final step assesses the team’s readiness for future incidents, using their postmortem findings to update their runbooks, adjust their monitoring tools/alert systems, and improve their process both from a systems and skills standpoint.
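The escalation idea in the Response step can be illustrated with a small sketch. It assumes a runbook that maps each incident type to an ordered list of responders; the `RUNBOOK` contents, the responder names, and the `next_responder` helper are hypothetical.

```python
# Hypothetical runbook: incident type -> ordered escalation chain.
RUNBOOK = {
    "database": ["db-oncall-primary", "db-oncall-secondary", "platform-lead"],
    "network": ["net-oncall-primary", "net-oncall-secondary"],
}

# Fallback chain for incident types the runbook does not cover.
DEFAULT_CHAIN = ["general-oncall", "engineering-manager"]


def next_responder(incident_type, already_paged):
    """Return the next person to page, or None if the chain is exhausted."""
    chain = RUNBOOK.get(incident_type, DEFAULT_CHAIN)
    for responder in chain:
        if responder not in already_paged:
            return responder
    return None


paged = []
first = next_responder("database", paged)
paged.append(first)
second = next_responder("database", paged)
print(first, "->", second)
```

Keeping the chain in data rather than in people's heads is what lets any on-call engineer bring in the right person without guessing.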

 

What is the incident management process?

It’s unrealistic to aim for a strategy where incidents never occur. Instead, the incident management process focuses on how incidents are prioritized and how responsibility is assigned.

The incident management process broadly addresses each part of the incident response life cycle:

  • Incident monitoring and detection: Incidents are identified through various methods, including continuous monitoring tools, user reports, and more. After identification, the incident needs to be logged and categorized based on severity and impact on the end customer.
  • Communication to relevant team members when an incident is detected: Once an incident is detected, notifications to the incident response team members are crucial. Depending on how the incident is categorized, this could mean official red alerts to team members. For a minor issue, non-urgent notifications may be enough.
  • Assigning responsibility to team members to resolve the incident: How the incident is communicated will largely depend on the classification system. This part of the process needs to ensure that all team members are involved in deciding how team members will take ownership and responsibility. 
  • Steps needed to resolve the incident: This part of the process will include investigation, diagnosis, and potential resolution. Then, as responsibilities are assigned, teams can start to put together remediation steps as needed to resolve the incident while keeping customers and stakeholders in the loop. This tends to be the variable part of the process since it could mean intensive work to eliminate threats or resolve root issues if it’s a severe incident. 
  • Learnings or retrospectives and documentation of incident resolution afterward: After the incident is resolved, it’s essential to ensure the knowledge is shared so that team members can follow standard procedures and protocols as part of continuous improvement. Post-incident resolution steps could include retrospectives or automating runbooks based on how the incident was resolved to reduce the impact if the incident recurs.
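One way to picture the life cycle above is as a simple state machine in which an incident can only move forward one stage at a time. The sketch below is illustrative only: the stage names and the strict ordering are assumptions, and real teams often allow loops (for example, reopening a resolved incident).

```python
from enum import Enum


class Stage(Enum):
    DETECTED = "detected"
    COMMUNICATED = "communicated"
    ASSIGNED = "assigned"
    RESOLVING = "resolving"
    RESOLVED = "resolved"
    REVIEWED = "reviewed"  # retrospective completed and documented


# Each stage may only move forward to the stages listed for it.
ALLOWED = {
    Stage.DETECTED: {Stage.COMMUNICATED},
    Stage.COMMUNICATED: {Stage.ASSIGNED},
    Stage.ASSIGNED: {Stage.RESOLVING},
    Stage.RESOLVING: {Stage.RESOLVED},
    Stage.RESOLVED: {Stage.REVIEWED},
    Stage.REVIEWED: set(),
}


class IncidentLifecycle:
    def __init__(self):
        self.stage = Stage.DETECTED
        self.history = [Stage.DETECTED]

    def advance(self, new_stage):
        """Move to new_stage, rejecting transitions that skip a step."""
        if new_stage not in ALLOWED[self.stage]:
            raise ValueError(
                f"cannot move from {self.stage.value} to {new_stage.value}"
            )
        self.stage = new_stage
        self.history.append(new_stage)


lifecycle = IncidentLifecycle()
lifecycle.advance(Stage.COMMUNICATED)
lifecycle.advance(Stage.ASSIGNED)
print([stage.value for stage in lifecycle.history])
```

Rejecting skipped transitions makes it harder to close an incident without the retrospective step being recorded.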

Incident Management Tools

Incident management tools are available in several categories, allowing you to address a broad range of incident management challenges, including:

·   Development Tools: These tools cover a broad spectrum of functions from task management to managing scrums and sprints. Jira tends to stand out in this category, making it easy to track and assign development projects.   

·   Integration Tools: Tools such as Beanstalk prevent coders from making conflicting changes while also allowing you to differentiate versions, roll back versions with issues, etc.

·   Continuous Testing Tools: Top-rated tools like Katalon allow you to run regularly scheduled automated tests and update code changes with simple integration across different environments and conditions.

·   Deployment Tools: Tools such as Jenkins are top-rated for their ability to facilitate your software deployment. You can automate key stages of configuration and deployment and enable continuous integration and deployment and frequent small iterations.  

·   Monitoring Tools: Highly rated tools such as Prometheus use automated monitoring to help detect problems early, so teams can address issues before they affect users.

·   Feedback Tools: Tools like Blameless, Hotjar, and UserReport gather data, providing insights based on user interaction with code to spot opportunities for improvements.

·   Operations Tools: Leading operation tools such as Blameless help manage changes and incidents by preventing and mitigating incidents, diagnosing problems and improving deployment of responses, and creating retrospectives to improve response for future incidents.

How to optimize incident management processes

Once a workflow is established for how incidents will be managed, teams can start to consider ways to improve the workflow where possible. Some ways to improve the incident management process include:

  • Tools: In each part of the process, teams must decide what tools are used to monitor, detect and resolve incidents. Tools should include clear responsibilities, assignments, alerts and notifications, and dedicated tools for automating parts of the process. 
  • Accessibility: Identifying multiple channels for communication to report incidents can help with monitoring. Think about opening up channels such as web, email, video conferencing, and phone so that users have the methods available to them to report problems. 
  • Effective communication: There must be procedures in place to notify users of incidents, such as email alerts, and to provide real-time updates on incident resolution.
  • Automation: Teams can come together to look at the processes in place and identify where they can automate intensive manual work. Automation helps free up team resources while keeping standard procedures in place.
  • Don’t overload alerts: Alerts are integral to incident resolution, but alert fatigue is a real issue. Consider which categories of incidents need alerts and what types of alerts they need, and try to keep the system as easy to understand as possible.
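As an illustration of the last point, the sketch below shows one possible way to limit alert noise: only page for selected severities, and suppress repeats of the same alert inside a time window. The `AlertThrottle` class, the severity names, and the ten-minute window are assumptions for the example, not a recommendation for specific values.

```python
from datetime import datetime, timedelta


class AlertThrottle:
    """Decide whether an alert should notify a responder."""

    def __init__(self, window=timedelta(minutes=10),
                 paging_severities=("critical", "high")):
        self.window = window
        self.paging_severities = set(paging_severities)
        self._last_sent = {}  # alert key -> time of last notification

    def should_notify(self, key, severity, now):
        # Low-severity alerts are recorded elsewhere but never page anyone.
        if severity not in self.paging_severities:
            return False
        # Suppress repeats of the same alert inside the window.
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window:
            return False
        self._last_sent[key] = now
        return True


throttle = AlertThrottle()
start = datetime(2024, 2, 26, 12, 0)
decisions = [
    throttle.should_notify("api-latency", "high", start),
    throttle.should_notify("api-latency", "high", start + timedelta(minutes=3)),
    throttle.should_notify("api-latency", "high", start + timedelta(minutes=15)),
    throttle.should_notify("disk-usage", "low", start),
]
print(decisions)  # [True, False, True, False]
```

The window and severity list are the knobs a team would tune when deciding which categories of incidents need alerts.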

How Blameless can help

Blameless is a dedicated incident management tool that supports engineering and DevOps teams during incidents and retrospectives with data. With Blameless, teams can remove manual work across the incident management process, resolve incidents significantly faster, and free up team resources. Teams can set up service level objectives (SLOs) and service level agreements (SLAs) and operate with the customer journey in mind. With detailed retrospectives after incidents, teams can take action to optimize incident management through automation and a centralized resource for data and steps taken. Request a demo today to see how Blameless enables teams to improve incident management workflows and accelerate development velocity.
