
Alert Fatigue in SRE: What It Is & How To Avoid It

Emily Arnott

Wondering about alert fatigue? We describe what it is, how it affects software development teams, and how to avoid it.

What is alert fatigue?

Alert fatigue is the phenomenon of employees becoming desensitized to alert messages because of the overwhelming volume of alerts they receive and the number of those alerts that turn out to be false positives. The risk is that important information will be overlooked or ignored.

Recognizing alert fatigue

Setting up an alert system to notify people when something breaks in your system is a necessary part of your reliability solution. You want your monitoring tools to detect any anomaly in your system, and your alerting tools to respond appropriately as soon as detection happens. Your top priority is to avoid missing any incidents, as an incident that goes unnoticed can cause continuous damage to your system and customers.

This priority of not missing incidents can produce an alerting system that overreacts, and an overreacting system causes alert fatigue, which can be even more damaging. Alert fatigue leads to employee burnout and inefficient responses. In the worst cases, it leads to many missed or ignored alerts, creating the same problem the alerting system was meant to solve!

How do you tell if your alert system is overreacting and your employees are suffering from alert fatigue? Let’s take a look at some common alert fatigue symptoms:

Engineers are alerted when they aren’t needed. When something goes wrong, it’s tempting to get all hands on deck. However, the only people who should be involved are those who have the expertise to help with the response, or whose responsibilities are affected by the incident in a way that requires them to take part in the resolution. If your system brings in people who have nothing to do, they will start to expect that an alert doesn’t require action from them, and they may begin ignoring alerts.

Small incidents get major responses. If engineers are woken up in the middle of the night, brought in on huge team responses, or told to prioritize a response over any other tasks, they’d reasonably expect that the incident must be severe – something causing immediate customer pain. If instead they’re brought in urgently to deal with minor troubles, like an unpopular service running slow, they’ll become desensitized to alerts and not respond quickly when major incidents do occur.

Engineers are burnt out. Being alerted is stressful. Each time an engineer’s pager goes off, there’s some cost to their focus or their rest. You have to weigh the benefit of alerting an engineer against the cost to their ability to do good work. As stress accumulates, engineers can become burnt out: unable to do good work, and very likely to leave the organization. Burnt-out employees lower morale and slow progress. If you’re noticing stressed-out and unproductive engineers, alert fatigue could be an issue.

Solving alert fatigue

Getting alert fatigue under control can be a challenge. It can be tough to find the line where an alert system doesn’t over-alert people but also doesn’t miss any incidents. However, solving this problem is worth the investment. The techniques below not only address alert fatigue, they also make your system more robust and informative.

Making a robust classification system

Not all incidents are created equal, and they shouldn’t all get the same response. We discussed how important it is to get only the right people involved when something goes wrong. It’s much easier to avoid fatigue when every alert you get is actually relevant to you. Classification is how you determine the severity and service area of an incident, and alert based on them.

Building a classification system is a collaborative and iterative process. Each service area’s development and operations teams should have input on marking which incidents they have expertise in and ownership over. They’ll also be the experts on how to tell severe incidents from minor ones in their service area. You won’t build a perfect classification system the first time. After each incident, review how it was classified and who was alerted, and refine the classification to make sure the right team is brought in.
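To make the idea of alerting by severity and service area concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the three severity levels, the team names in `OWNERS`, the fallback owner, and the "page" versus "ticket" channels are hypothetical and would come from your own teams' input.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical three-level scale; your teams define the real one.
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Incident:
    service: str
    severity: Severity


# Hypothetical ownership map: which team owns which service area.
OWNERS = {
    "checkout": "payments-team",
    "search": "search-team",
}

# Assumed fallback owner for services nobody has claimed yet.
DEFAULT_OWNER = "platform-team"


def route(incident: Incident) -> dict:
    """Decide who is notified, and how urgently, for one incident."""
    team = OWNERS.get(incident.service, DEFAULT_OWNER)
    if incident.severity == Severity.CRITICAL:
        # Only the most severe incidents pull in someone beyond the owning team.
        return {"notify": [team, "incident-commander"], "channel": "page"}
    if incident.severity == Severity.MAJOR:
        return {"notify": [team], "channel": "page"}
    # Minor incidents wait for working hours instead of paging anyone.
    return {"notify": [team], "channel": "ticket"}
```

With this mapping, a minor slowdown in search produces a ticket for the search team only, while a critical checkout failure pages the payments team and an incident commander. Reviewing the map after each incident is where the iterative refinement happens.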

Teams can disagree about how to judge severity. The solution is to use customer happiness as a universal metric for the whole organization. Using SLIs and SLOs, you can build a metric that reflects whether customers are happy with any particular experience of using your services. When an incident occurs, you can judge how much it disrupts that experience, and use that as the basis of your severity. If people know what an alert means in terms of user happiness and business value, they’ll feel less overwhelmed and fatigued.
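One common way to turn an SLO into a severity signal is to measure how fast an incident is consuming the error budget. The sketch below is one possible approach, not a prescribed method: the function names and the burn-rate thresholds (2 and 10) are assumptions chosen for illustration, and the SLI is assumed to be a simple ratio of good events to total events.

```python
def error_budget_burn(good_events: int, total_events: int, slo_target: float) -> float:
    """Return the burn rate: observed error rate / error rate the SLO allows.

    A burn rate of 1.0 means errors arrive exactly as fast as the SLO permits.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # no traffic, so nothing was burned
    observed_error_rate = 1 - good_events / total_events
    allowed_error_rate = 1 - slo_target
    return observed_error_rate / allowed_error_rate


def severity_from_burn(burn_rate: float) -> str:
    """Map a burn rate to a severity label (thresholds are illustrative)."""
    if burn_rate >= 10:
        return "critical"
    if burn_rate >= 2:
        return "major"
    if burn_rate > 1:
        return "minor"
    return "none"
```

For example, with a 99.9% SLO, 9,800 good requests out of 10,000 is a 2% error rate, roughly 20 times what the SLO allows, so the sketch labels it critical. Because every team's incidents are measured against the same customer-facing target, the label means the same thing across the organization.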

Adding nuance to your on-call schedule

On-call engineers are those who are available to respond to incidents the moment they happen. Generally, engineers take rotating on-call shifts. This allows some responders to be available 24/7 while also giving everyone periods of total rest. Making an on-call schedule that is fair, keeps people from burning out, and responds to incidents effectively can be challenging, but it’s necessary.

When building an on-call schedule, your first instinct may be to give everyone an equal amount of time for their shifts. However, not all shifts are the same. Incidents can often correlate with periods where services are used more frequently, or when updates are pushed. By tracking patterns in incidents, you can judge when severe incidents occur most frequently. You can also judge what types of incidents are most difficult to resolve, which can be different from those that are most severe.

By having this more nuanced understanding of incidents, you can judge the overall “difficulty” of on-call shifts. This likely corresponds to how much alert fatigue will accumulate from working that shift. Balancing on-call schedules with more nuance will greatly reduce the alert fatigue of any given engineer. Of course, you won’t get the perfect balance right away. Reviewing and adjusting on-call shifts continuously is necessary to keep everyone at their best. The most important thing is to communicate and empathize with on-call engineers to make sure their needs are being met.
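As a rough illustration of scoring shift difficulty from incident history, the sketch below groups past incidents by the shift they fell in and adds up a score that combines severity with time to resolve. The record fields (`hour`, `severity`, `minutes_to_resolve`), the severity weights, and the 30-minute scaling are all hypothetical choices, not a standard formula.

```python
from collections import defaultdict

# Hypothetical weights: how much each severity level adds to a shift's load.
SEVERITY_WEIGHT = {"minor": 1, "major": 3, "critical": 5}


def shift_difficulty(incidents, shift_hours=8):
    """Score each shift of the day from past incidents.

    Each incident is a dict with an 'hour' (0-23), a 'severity' label, and
    'minutes_to_resolve'. Severity and resolution time are scored separately
    because the hardest incidents are not always the most severe ones.
    """
    scores = defaultdict(float)
    for incident in incidents:
        shift = incident["hour"] // shift_hours  # 0, 1, 2 for 8-hour shifts
        effort = incident["minutes_to_resolve"] / 30  # assumed scaling
        scores[shift] += SEVERITY_WEIGHT[incident["severity"]] + effort
    return dict(scores)


history = [
    {"hour": 2, "severity": "critical", "minutes_to_resolve": 90},
    {"hour": 3, "severity": "minor", "minutes_to_resolve": 30},
    {"hour": 14, "severity": "major", "minutes_to_resolve": 60},
]
print(shift_difficulty(history))  # {0: 10.0, 1: 5.0}
```

In this sample history the overnight shift scores twice as high as the afternoon shift, which suggests it should be shorter, shared more widely, or compensated differently. The numbers are a starting point for a conversation with on-call engineers, not a replacement for it.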

Improving the reliability of your system

It seems like it goes without saying – of course, if you have fewer incidents, you’ll have fewer alerts, and less alert fatigue! But it’s worth considering proactively improving reliability and reducing incidents, instead of just reactively alerting better. Improving reliability is a complex and multifaceted process. However, in terms of reducing alerts, the main factor to consider is how often your system produces incidents that require an alert and response.

Some amount of failure is inevitable, and some incidents will occur. An important goal should be to prevent incidents that you’ve already dealt with before, so that you don’t make the same mistake twice. Being alerted for something that has gone wrong again is especially fatiguing. To prevent repeat incidents, use tools like incident retrospectives. These documents help you find the causes of incidents and drive follow-up changes to stop those causes from recurring.
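If your retrospectives record a cause for each incident, spotting repeats can be as simple as counting. This is a minimal sketch that assumes each retrospective is a dict with a hypothetical `root_cause` field; real retrospective tools will store this differently.

```python
from collections import Counter


def repeat_causes(retrospectives, min_count=2):
    """Return causes that appear in at least min_count retrospectives."""
    counts = Counter(retro["root_cause"] for retro in retrospectives)
    return {cause: n for cause, n in counts.items() if n >= min_count}


retros = [
    {"incident": "A", "root_cause": "expired certificate"},
    {"incident": "B", "root_cause": "disk full"},
    {"incident": "C", "root_cause": "expired certificate"},
]
print(repeat_causes(retros))  # {'expired certificate': 2}
```

Any cause that shows up here is a candidate for a follow-up change, since fixing it removes a whole class of future alerts rather than a single one.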

Tooling to reduce alert fatigue

Having tooling to implement these changes can greatly increase their impact on alert fatigue. A sophisticated alerting process can’t be toilsome, or it will create more work than it saves. Blameless can help. Our SLOs can show you the true severity of incidents, and our retrospectives let you learn from incidents and prevent recurrence. To see how, check out a demo.
