Alert fatigue, or pager fatigue, is something that can drastically reduce even the most seasoned team’s ability to respond to incidents. This is the effect of receiving too many alerts, either because there are simply too many incidents occurring, or because your monitoring is picking up on insignificant issues and notifying you for things that do not require your attention (also known as alert noise). This can lower your team’s cognitive ability and capacity, making incident response a slow, difficult process. It can also lead your team to ignore crucial alerts, resulting in major incidents going unresolved or unnoticed until it’s too late.
A good parallel for understanding alert fatigue during an on-call rotation comes from the medical community, where reports have been written on the danger of allowing alerts to run your team. In an article in the Journal of Graduate Medical Education, Jess L. Rush et al. state, “A large number of irrelevant alerts may result in alert fatigue... which may result in critical warnings being missed. The ECRI Institute, a nonprofit medical safety organization, listed alert fatigue as a top technology hazard. The consequences are illustrated when a child received 38 times the normal dose of an antibiotic largely due to this information being overshadowed by a number of clinically inconsequential alerts.”
In the worst-case scenario, alert fatigue can be deadly. In our day-to-day operations, it’s more like death by a thousand cuts than one fatal wound. The result is still the same, though: alert fatigue wears down your team. It’s important to minimize alert or pager fatigue as much as possible, for the health and well-being of your team members. After all, the health of your systems depends on the health of your people.
Here are 5 tips on how to cut down on alert fatigue and improve your signal-to-noise ratio.
Service ownership can help cut down on the number of alerts you get, because it improves service robustness and minimizes repetitive incidents. Service ownership can help:
What happens when you receive a notification that something is wrong with your system and you have no clue what it means, or why you’re receiving that alert? Maybe you have to parse through the alert conditions to suss out what the alert indicates, or maybe you need to ping a coworker and ask. Not knowing what to do with an alert also contributes to alert fatigue, because it increases the toil and time required to respond.
To resolve this, make sure that you create runbooks for the alert conditions you set. These runbooks should explain why you received this alert and what the alert is monitoring. Runbooks should also contain the below information:
Runbooks can be a huge help to on-call engineers, but they need to be made easily accessible to the right people on the front lines. In addition to making sure your team has all the information they need to deal with alerts while on-call, it’s also important to take a deeper look at the on-call schedule you maintain.
Maybe you don’t have superfluous alerts or an overwhelming number of incidents to tackle, but some of your team members are constantly burning the midnight oil and experiencing alert fatigue. This is likely a sign that load is poorly balanced across your on-call schedule, rotation, and escalation policies.
One way to make sure that your on-call schedule minimizes alert fatigue is to take a qualitative approach rather than a quantitative one. Imagine you have two engineers. Engineer A spends a full week on call and receives notifications twice about suboptimal service function. Each alert triggers an incident which requires an hour to resolve.
Engineer B spends only a weekend on call and receives seven notifications. Three of those seven are deemed irrelevant (more on how to limit this later!), and the other four trigger incidents which take two hours each to resolve.
Who should be on call next? And who needs a break? By measuring time spent on call, it would seem that Engineer A needs a break. However, it’s Engineer B who is spending more time directly dealing with interrupt work, and likely more stressed by alert fatigue. By taking a proactive and qualitative approach, managers can anticipate burnout and fatigue before it becomes an issue.
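The comparison above can be sketched in a few lines of Python. This is only an illustration of the idea, not a prescribed tool: the `OnCallShift` class and its field names are hypothetical, and treating "a weekend" as 2 days is an assumption.

```python
from dataclasses import dataclass


@dataclass
class OnCallShift:
    """One engineer's shift (hypothetical structure for illustration)."""
    engineer: str
    days_on_call: float
    alerts_received: int
    incidents: int
    hours_per_incident: float

    @property
    def interrupt_hours(self) -> float:
        # Time spent directly handling incidents during the shift.
        return self.incidents * self.hours_per_incident

    @property
    def noise_ratio(self) -> float:
        # Fraction of alerts that did not turn into a real incident.
        if self.alerts_received == 0:
            return 0.0
        return (self.alerts_received - self.incidents) / self.alerts_received


shifts = [
    OnCallShift("Engineer A", days_on_call=7, alerts_received=2,
                incidents=2, hours_per_incident=1.0),
    # Assumption: "a weekend" is counted as 2 days.
    OnCallShift("Engineer B", days_on_call=2, alerts_received=7,
                incidents=4, hours_per_incident=2.0),
]

# Rank by interrupt work rather than by calendar time on call.
by_load = sorted(shifts, key=lambda s: s.interrupt_hours, reverse=True)
for s in by_load:
    print(f"{s.engineer}: {s.days_on_call:g} days on call, "
          f"{s.interrupt_hours:g} h of interrupt work, "
          f"{s.noise_ratio:.0%} of alerts were noise")
```

Sorted by time on call, Engineer A comes first. Sorted by interrupt hours, Engineer B does, which is the signal a manager would want to act on.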
In addition to looking at who is on call, you should also check in on those who were pinged on their day off to help. You may notice that some greybeards on your team are answering calls when they shouldn’t have to. In this case, you’ll need to work to eliminate SPOFs (single points of failure) and train your on-call engineers so that when they’re scheduled to handle the pager, they don’t always need to phone a friend.
Sometimes the problem isn’t that you have too many incidents; it could be that you’re alerting on the wrong things, or have set the wrong alerting thresholds. To minimize alert fatigue, it’s important to distinguish what is worth alerting on from what isn’t. One way to do this is with SLOs.
SLOs are internal thresholds that allow teams to guard their customer satisfaction. These thresholds are set based on SLIs, or singular metrics captured by the service’s monitorable data. SLIs take into account points on a user journey that are most important to a customer, such as latency, availability, throughput, or freshness of the data at certain junctions. These metrics (stated as good events/valid events over a period of time) indicate what your customers will care most about. SLOs are the objectives that you must meet to keep them happy.
Imagine that your service is an online shopping platform. Your customers care most about availability. In this case, you’ve determined that to keep customers happy, your service requires 99.9% availability. That means you can only have 43.83 minutes of downtime per month before customer happiness will be affected.
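As a rough check on that figure, the arithmetic can be written out as below. The function name is hypothetical, and the sketch assumes an average month of 365.25 / 12 days, which is what yields 43.83 minutes; a fixed 30-day window gives a slightly smaller budget.

```python
AVG_DAYS_PER_MONTH = 365.25 / 12  # assumption: average calendar month


def error_budget_minutes(slo_target: float,
                         window_days: float = AVG_DAYS_PER_MONTH) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be a fraction in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


budget = error_budget_minutes(0.999)
print(f"99.9% over an average month: {budget:.2f} minutes")
print(f"99.9% over 30 days: {error_budget_minutes(0.999, 30):.2f} minutes")
```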
Based on this, you have wiggle room for planned maintenance or shipping new features that might risk a potential blip. But how much wiggle room are you comfortable with? An error budget policy is an agreement that all stakeholders make on what will be done in the event an SLO is threatened.
So imagine that out of your 43.83 minutes, you’ve used 21 minutes. Do you need minute-by-minute alerts? Not likely. Instead, you’ll want to set up alerting thresholds that let you know when you’ve reached certain milestones such as 25%, 50%, 75%, and so on. You might even automate these alerts so that, depending on when in your monthly rolling window you hit a threshold, you aren’t alerted at all. For example, you might not care that 75% of your error budget has been used if there are only 2 days left in the window.
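One way this milestone logic might look in code is sketched below. Everything here is illustrative: the function, its parameters, and the choice to still alert at 100% even late in the window are assumptions, not part of any particular alerting product.

```python
MILESTONES = (0.25, 0.50, 0.75, 1.00)  # fractions of the error budget


def milestones_to_alert(budget_minutes: float,
                        used_minutes: float,
                        days_left_in_window: float,
                        already_alerted: set,
                        quiet_days: float = 2,
                        always_alert_at: float = 1.00) -> list:
    """Return newly crossed milestones that are worth notifying someone about."""
    fraction_used = used_minutes / budget_minutes
    crossed = [m for m in MILESTONES
               if fraction_used >= m and m not in already_alerted]
    # Near the end of the window, suppress everything below `always_alert_at`.
    # Assumption: a fully exhausted budget is still worth an alert.
    if days_left_in_window <= quiet_days:
        crossed = [m for m in crossed if m >= always_alert_at]
    return crossed


# 21 of 43.83 minutes used mid-window: only the 25% milestone has been crossed.
print(milestones_to_alert(43.83, 21, days_left_in_window=15,
                          already_alerted=set()))
# 75% crossed with 10 days left: alert.
print(milestones_to_alert(43.83, 33, days_left_in_window=10,
                          already_alerted={0.25, 0.50}))
# 75% crossed with 2 days left: stay quiet.
print(milestones_to_alert(43.83, 33, days_left_in_window=2,
                          already_alerted={0.25, 0.50}))
```

Tracking `already_alerted` keeps each milestone to a single notification per window, instead of re-alerting every time the budget is evaluated.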
By setting SLOs and creating thresholds that apply to them, you can minimize the unnecessary alerts you receive, allowing you to pay attention to the ones you really need to know about.
The only thing more annoying than being paged is being paged for the same thing over and over again. After a while, it’s likely that an engineer will begin to ignore alerts for repeat bugs, especially if they’re not customer-impacting. They will become desensitized to it, possibly overlooking it until the issue becomes larger and customer-affecting.
To avoid this, make sure that repeat issues are either addressed or have their alerting turned off. Look through your incident retrospectives and note which issues crop up again and again. Then get alignment between product and engineering on the severity of each one.
Does it need to be prioritized in the next sprint to cut down on repeat incidents? If so, taking care of that issue as soon as possible can save your on-call engineers a lot of unnecessary stress and frustration. If the bug isn’t important enough to prioritize anytime soon, then consider turning off alerting on it, or only alert when it’s tied to customer impact.
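A small script can help surface the repeat offenders before that conversation. The sketch below assumes a hypothetical retrospective log in which each incident records a `root_cause` label and a `customer_impacting` flag; the field names, sample data, and threshold of 3 are made up for illustration, and the output is a starting point for discussion rather than a decision.

```python
from collections import Counter


def triage_repeat_issues(incidents: list, repeat_threshold: int = 3) -> dict:
    """Suggest a follow-up for each root cause seen at least `repeat_threshold` times."""
    counts = Counter(i["root_cause"] for i in incidents)
    impacting = {i["root_cause"] for i in incidents if i["customer_impacting"]}
    suggestions = {}
    for cause, seen in counts.items():
        if seen < repeat_threshold:
            continue  # not a repeat issue yet
        if cause in impacting:
            suggestions[cause] = "prioritize a fix in an upcoming sprint"
        else:
            suggestions[cause] = "consider muting, or alert only on customer impact"
    return suggestions


# Hypothetical retrospective data.
retro_log = [
    {"root_cause": "disk-full-on-cache", "customer_impacting": False},
    {"root_cause": "checkout-timeout", "customer_impacting": True},
    {"root_cause": "disk-full-on-cache", "customer_impacting": False},
    {"root_cause": "checkout-timeout", "customer_impacting": False},
    {"root_cause": "cert-expiry", "customer_impacting": True},
    {"root_cause": "disk-full-on-cache", "customer_impacting": False},
    {"root_cause": "checkout-timeout", "customer_impacting": True},
]

for cause, action in triage_repeat_issues(retro_log).items():
    print(f"{cause}: {action}")
```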
While it may take some time from your next sprint or two to work on getting your alerts right, it’s certainly worth it. Minimizing alert fatigue has many benefits, and will eventually give your team more time and energy. This short-term sacrifice has long-term benefits. A few of the most noticeable benefits include:
For all the reasons above, minimizing alert fatigue is an incredibly strategic, important use of time and has a significant positive impact on the team. By using these techniques to cut down on alert fatigue, you’ll be able to concentrate on what matters: exciting your users with the awesome things you build.
If you’re looking to reduce on-call stress and alert fatigue, Blameless partners closely with top on-call management platforms such as PagerDuty and OpsGenie to automate context, streamline communication, minimize toil, and more.