
The Role of the SRE in the Incident Management Process

Lee Atchison | 3.14.2024

In modern businesses, where IT systems play a major role, the Site Reliability Engineer (SRE) has become central to managing the effectiveness and reliability of the entire business. SREs are the bridge between the rapid deployment of software and systems and the stable operation of those systems in a production environment. They ensure that reliability and performance criteria are defined and met.

When an incident does occur, they are at the forefront of resolving it and, even more importantly, ensuring it does not repeat in the future.

Understanding the SRE Role

The SRE role was first formulated by Google as a hybrid DevOps position that combines aspects of both software engineering and operations. SREs are tasked with creating scalable and highly reliable software systems. This is more than writing software that scales. It also means architecting and building the infrastructure that supports that software, so that the two together form a complete, modern, operational application.

Central to this hybrid role is the incident response process.

Incident Response and the SRE

Incident management is a structured method for responding to unplanned interruptions or reductions in quality of service. It involves identifying, analyzing, and correcting the underlying problems to prevent a recurrence. For SREs, incident management is not just about fixing what’s broken; it’s about understanding why an incident occurred and how to prevent it in the future.

A modern incident response process typically consists of the following components:

·      Ongoing monitoring to detect and notify when problems occur.

·      A standardized and reliable notification process to engage the proper individuals when an incident does occur.

·      A set of standards and tools for resolving an incident using well-understood procedures.

·      An escalation process to ensure the correct resources are brought in when needed based on the severity and duration of the incident.

·      A standard communications process for informing individuals inside the company about the status of the incident. This can be important, for example, for a sales team in the middle of a customer evaluation, so they can give proper guidance to the prospective customer.

·      A retrospective process that is designed to evaluate what happened and how to ensure the problem doesn’t repeat. Additionally, the retrospective process can be used to determine if the incident management and response process itself needs to be updated.

·      A process to follow up on assigned tasks to ensure the learnings from the incident are incorporated into the business’s applications, processes, and other systems.
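
To make the escalation component above more concrete, here is a minimal Python sketch of choosing who to engage based on an incident's severity and how long it has been open. The severity levels, time thresholds, role names, and the `who_to_engage` function are all hypothetical, chosen only for illustration; a real policy would depend on the organization and its tooling.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta

# Hypothetical severity levels; 1 is the most severe. Not taken from the article.
SEV1, SEV2, SEV3 = 1, 2, 3


@dataclass(frozen=True)
class EscalationRule:
    max_severity: int   # rule applies to incidents at this severity or worse
    after: timedelta    # ...once the incident has been open at least this long
    engage: str         # hypothetical role to bring in


# Illustrative thresholds only; real values are an organizational decision.
RULES = [
    EscalationRule(SEV3, timedelta(minutes=0), "on-call engineer"),
    EscalationRule(SEV2, timedelta(minutes=30), "team lead"),
    EscalationRule(SEV1, timedelta(minutes=0), "incident commander"),
    EscalationRule(SEV1, timedelta(minutes=60), "engineering director"),
]


def who_to_engage(severity: int, open_for: timedelta) -> list[str]:
    """Return every role whose rule matches the incident's severity and duration."""
    return [
        rule.engage
        for rule in RULES
        if severity <= rule.max_severity and open_for >= rule.after
    ]


if __name__ == "__main__":
    # A low-severity incident that has just been opened.
    print(who_to_engage(SEV3, timedelta(minutes=5)))
    # A top-severity incident that has been open for 45 minutes.
    print(who_to_engage(SEV1, timedelta(minutes=45)))
```

Keeping the rules in a plain data table, rather than in nested conditionals, is one way to make the policy easy to review and adjust during a retrospective.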

The SRE's role is to drive the creation of this process and to ensure it is followed for every incident, in every situation. The SRE is also the champion of the process to upper management.

They additionally drive the efforts post-incident to ensure all incidents are understood and that improvements are in place to make sure that major incidents do not repeat in the future. Preventing repeat incidents is a central strategy in improving the reliability of an application and is a foremost goal of the SRE.

The Process is the Key

This incident management process is critical to creating the feedback loop that improves the quality of the products and services a modern digital business produces, and the SRE is critical to creating, championing, driving, and improving this process.
