Alert fatigue, or pager fatigue, is something that can drastically reduce even the most seasoned team’s ability to respond to incidents. This is the effect of receiving too many alerts, either because there are simply too many incidents occurring, or because your monitoring is picking up on insignificant issues and notifying you for things that do not require your attention (also known as alert noise). This can lower your team’s cognitive ability and capacity, making incident response a slow, difficult process. It can also lead your team to ignore crucial alerts, resulting in major incidents going unresolved or unnoticed until it’s too late.
A good parallel for understanding alert fatigue during an on-call rotation is that of the medical community. Reports have been written on the danger of allowing alerts to run your team. In an article in the Journal of Graduate Medical Education, Jess L. Rush et al. state, “A large number of irrelevant alerts may result in alert fatigue... which may result in critical warnings being missed. The ECRI Institute, a nonprofit medical safety organization, listed alert fatigue as a top technology hazard. The consequences are illustrated when a child received 38 times the normal dose of an antibiotic largely due to this information being overshadowed by a number of clinically inconsequential alerts.”
In the worst-case scenario, alert fatigue can be deadly. In our day-to-day operation, it’s more like death by a thousand cuts than one fatal wound. However, the result is the same: alert fatigue wears down your team. It’s important to minimize alert or pager fatigue as much as possible, for the health and well-being of your team members. After all, the health of your systems is dependent on the health of your people.
Here are 5 tips on how to cut down on alert fatigue and improve your signal-to-noise ratio.
Encourage service ownership
Service ownership can help cut down on the number of alerts you get, simply because it’s a way to help improve service robustness and minimize repetitive incidents. Service ownership can help:
Prevent code from being “thrown over the wall” and encourage learning. When developers are responsible for owning the services they build, it encourages better practices around shipping performant code. It’s easy to overlook potential flaws in your code when you’re not the one supporting the service, and focusing on reliability over shipping new features doesn’t seem as important when you’re not the one being paged over the weekend. But when service ownership is encouraged, developers will begin scrutinizing their code more thoroughly. Additionally, action items and learnings from incidents are more likely to be fed back into the software development lifecycle (SDLC) when developers are in the loop on what issues are occurring.
Keep teams under pressure from burning out by sharing the load. If your traditional ops team is up every night, working 70-hour work weeks and spending every weekend with their laptop, it should come as no surprise that eventually productivity will falter. People need breaks and time away from work. Without that time to themselves, teams under pressure will be susceptible to burnout. Management will be stuck in a cycle of hiring and training for roles they filled just a few months ago. Service ownership helps spread the on-call responsibilities out so that everyone has a turn carrying the pager. This can also have the unexpected benefit of familiarizing on-call engineers with their product a little more, as they’ll have to triage it during incidents. Service ownership helps balance on-call to keep engineers practiced and prepared.
Create the same incentive for everyone, limiting silos. Service ownership can also help ease the tension between innovation and reliability, and will encourage even heavily siloed organizations to talk between teams in order to prioritize feature work and reliability work better. Everyone wants to move fast, sleep well, and have strong enough systems that they aren’t alerted about an issue every time they check their phone.
Create runbooks for your alerts
What happens when you receive a notification that something is wrong with your system and you have no clue what it means, or why you’re receiving that alert? Maybe you have to parse through the alert conditions to suss out what the alert indicates, or maybe you need to ping a coworker and ask. Not knowing what to do with an alert also contributes to alert fatigue, because it increases the toil and time required to respond.
To resolve this, make sure that you create runbooks for the alert conditions you set. These runbooks should explain why you received this alert and what the alert is monitoring. Runbooks should also contain the following information:
Map of your system architecture: You’ll need to understand how each service functions and connects. This can help your on-call team have better visibility into dependencies that might be what ultimately triggers an alert.
Service owners: This gives you someone to contact in the event that the alert is still not making sense, or the incident requires a technical expert for the service affected.
Key procedures and checklist tasks: Checklists can give on-call engineers a place to start when looking into an alert. This helps preserve cognitive capacity for resolving the actual issue behind the alert.
Identify methods to bake into automation: Does this alert actually require human intervention? If not, add in scripts that can handle this alert and which notify you only if the automation cannot fix the issue for you.
Continue refining, learning, and improving: Runbooks are next to worthless if they aren't up to date. When you revisit these to make updates, take the opportunity to learn from them again, looking for new opportunities to automate and optimize.
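The automation item in the list above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: `handle_alert`, `remediate`, `notify`, and `max_attempts` are hypothetical names, and the remediation and paging hooks stand in for whatever scripts and notification tooling your team actually uses.

```python
from typing import Callable


def handle_alert(
    alert_name: str,
    remediate: Callable[[], bool],
    notify: Callable[[str], None],
    max_attempts: int = 2,
) -> bool:
    """Try an automated fix first; notify a human only if it keeps failing.

    `remediate` is a hypothetical script hook that returns True when the
    issue is fixed. `notify` is a hypothetical paging hook.
    """
    for _ in range(max_attempts):
        try:
            if remediate():
                return True  # fixed automatically, nobody is paged
        except Exception:
            # A crashing remediation script counts as a failed attempt.
            pass
    notify(f"{alert_name}: automation failed after {max_attempts} attempts")
    return False


# Example: the fix works, so the pager stays quiet.
pages = []
handle_alert("disk_almost_full", lambda: True, pages.append)
```

The point of the design is that the human is the fallback rather than the first responder: the on-call engineer only hears about the alert once the script has had its chance and failed.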
Runbooks can be a huge help to on-call engineers, but they need to be made easily accessible to the right people on the front lines. In addition to making sure your team has all the information they need to deal with alerts while on-call, it’s also important to take a deeper look at the on-call schedule you maintain.
Take a closer look at your on-call schedule
Maybe you don’t have superfluous alerts or an overwhelming amount of incidents to tackle, but some of your team members are constantly burning the midnight oil and experiencing alert fatigue. This is likely an issue of improper load balancing of the on-call schedule, rotation, and escalation policies.
One way to make sure that your on-call schedule minimizes alert fatigue is to take a qualitative approach rather than a quantitative one. Imagine you have two engineers. Engineer A spends a full week on call and receives notifications twice about suboptimal service function. Each alert triggers an incident which requires an hour to resolve.
Engineer B spends only a weekend on call and receives seven notifications. Three of those seven are deemed irrelevant (more on how to limit this later!), and the other four trigger incidents which take two hours each to resolve.
Who should be on call next? And who needs a break? By measuring time spent on call, it would seem that Engineer A needs a break. However, it’s Engineer B who is spending more time directly dealing with interrupt work, and likely more stressed by alert fatigue. By taking a proactive and qualitative approach, managers can anticipate burnout and fatigue before it becomes an issue.
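To make the comparison concrete, here is a small Python sketch that scores the two shifts from the example by interrupt hours and by share of noisy pages instead of by time on call. The `OnCallShift` class and its field names are assumptions made for illustration; in practice the numbers would come from your paging or incident tooling.

```python
from dataclasses import dataclass


@dataclass
class OnCallShift:
    """Hypothetical record of one engineer's on-call shift."""
    engineer: str
    pages: int                 # notifications received
    incidents: int             # pages that turned into real incidents
    hours_per_incident: float  # time to resolve each incident

    @property
    def interrupt_hours(self) -> float:
        # Time spent directly on interrupt work during the shift.
        return self.incidents * self.hours_per_incident

    @property
    def noise_ratio(self) -> float:
        # Share of pages that were irrelevant.
        if self.pages == 0:
            return 0.0
        return (self.pages - self.incidents) / self.pages


shift_a = OnCallShift("Engineer A", pages=2, incidents=2, hours_per_incident=1)
shift_b = OnCallShift("Engineer B", pages=7, incidents=4, hours_per_incident=2)

# Rank by interrupt work rather than by length of the shift.
most_loaded = max([shift_a, shift_b], key=lambda s: s.interrupt_hours)
```

With the numbers from the example, Engineer A has 2 interrupt hours and no noisy pages, while Engineer B has 8 interrupt hours and 3 of 7 pages were noise, so `most_loaded` is Engineer B despite the shorter shift.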
In addition to looking at who is on call, you should also be checking in on those who were pinged on their day off to help. You may notice that some grey beards on your team are answering calls when they shouldn’t have to. In this case, you’ll need to work to eliminate SPOFs (single points of failure), and train your on-call engineers so that when they’re scheduled to handle the pager, they don’t need to always phone a friend.
Set SLOs to create guidelines for alerts
Sometimes the problem isn’t that you have too many incidents; instead, it could be that you’re alerting on the wrong things or have set the wrong alerting thresholds. To minimize alert fatigue, it’s important to distinguish what is worth alerting on and what isn’t. One way to do this is with SLOs.
SLOs are internal thresholds that allow teams to guard their customer satisfaction. These thresholds are set based on SLIs, or singular metrics captured by the service’s monitorable data. SLIs take into account points on a user journey that are most important to a customer, such as latency, availability, throughput, or freshness of the data at certain junctions. These metrics (stated as good events/valid events over a period of time) indicate what your customers will care most about. SLOs are the objectives that you must meet to keep them happy.
Imagine that your service is an online shopping platform. Your customers care most about availability. In this case, you’ve determined that to keep customers happy, your service requires 99.9% availability. That means you can only have 43.83 minutes of downtime per month before customer happiness will be affected.
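The 43.83-minute figure follows from simple arithmetic, sketched below in Python. The function name is hypothetical, and the default window of 30.4375 days (an average month, 365.25 / 12) is an assumption that reproduces the number above; a fixed 30-day window would give 43.2 minutes instead.

```python
def error_budget_minutes(slo: float, window_days: float = 30.4375) -> float:
    """Allowed downtime in minutes for an availability SLO over a window.

    `slo` is a fraction, e.g. 0.999 for 99.9% availability.
    The default window is an assumed average month (365.25 / 12 days).
    """
    minutes_in_window = window_days * 24 * 60
    return (1.0 - slo) * minutes_in_window


monthly_budget = round(error_budget_minutes(0.999), 2)
thirty_day_budget = round(error_budget_minutes(0.999, window_days=30), 2)
```

Whichever window you choose, write it down alongside the SLO so that everyone is computing the budget the same way.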
Based on this, you have wiggle room for planned maintenance or shipping new features that might risk a potential blip. But how much wiggle room are you comfortable with? An error budget policy is an agreement that all stakeholders make on what will be done in the event an SLO is threatened.
So imagine that out of your 43.83 minutes, you’ve used 21 minutes. Do you need minute-by-minute alerts? Not likely. Instead, you’ll want to set up alerting thresholds that let you know when you’ve reached certain milestones such as 25%, 50%, 75%, and so on. You might even automate these alerts so that a threshold reached at certain points in your monthly rolling window doesn’t alert you at all. For example, you might not care that 75% of your error budget has been used if there are only 2 days left in the window.
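One way to express milestone alerting with an end-of-window quiet period is sketched below. Everything here is an assumption for illustration: the function name, the `quiet_days` parameter, and the choice to always alert once the budget is fully spent are one possible policy, not a prescribed one.

```python
from typing import Optional

# Milestones from the text, plus 100% for a fully spent budget.
MILESTONES = (0.25, 0.50, 0.75, 1.00)


def milestone_to_alert(
    budget_minutes: float,
    used_before: float,
    used_now: float,
    days_left: float,
    quiet_days: float = 2.0,
) -> Optional[float]:
    """Return the highest milestone newly crossed, or None if no alert is due.

    Assumed policy: milestones below 100% are suppressed when `quiet_days`
    or fewer remain in the window; a fully spent budget always alerts.
    """
    before = used_before / budget_minutes
    now = used_now / budget_minutes
    crossed = [m for m in MILESTONES if before < m <= now]
    if not crossed:
        return None  # still between milestones, stay quiet
    highest = max(crossed)
    if highest < 1.0 and days_left <= quiet_days:
        return None  # late in the window, partial burn is not worth a page
    return highest


# Going from 20 to 21 minutes used crosses no milestone, so no alert.
quiet = milestone_to_alert(43.83, used_before=20, used_now=21, days_left=15)
```

In this sketch, crossing 75% with ten days left returns `0.75`, while the same crossing with two days left returns `None`, which matches the example above.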
By setting SLOs and creating thresholds that apply to them, you can minimize the unnecessary alerts you receive, allowing you to pay attention to the ones you really need to know about.
Squash repeat bugs
The only thing more annoying than being paged is being paged for the same thing over and over again. After a while, it’s likely that an engineer will begin to ignore alerts for repeat bugs, especially if they’re not customer-impacting. They will become desensitized to it, possibly overlooking it until the issue becomes larger and customer-affecting.
To avoid this, make sure that repeat issues are addressed or that alerting for them is turned off. Look through your incident retrospectives and note which issues crop up again and again. Then get alignment between product and engineering on the severity of these issues.
Does it need to be prioritized in the next sprint to cut down on repeat incidents? If so, taking care of that issue as soon as possible can save your on-call engineers a lot of unnecessary stress and frustration. If the bug isn’t important enough to prioritize anytime soon, then consider turning off alerting on it, or only alert when it’s tied to customer impact.
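If your retrospectives record a cause or tag for each incident, surfacing the repeat offenders can be as simple as counting. The Python sketch below assumes a plain list of cause labels and a hypothetical `min_count` cutoff; the labels shown are made up for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def repeat_offenders(
    causes: Iterable[str], min_count: int = 3
) -> List[Tuple[str, int]]:
    """Return (cause, count) pairs seen at least `min_count` times,
    most frequent first."""
    counts = Counter(causes)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]


# Hypothetical cause labels pulled from past retrospectives.
retro_causes = [
    "cert-expiry", "queue-backlog", "cert-expiry", "disk-full",
    "queue-backlog", "cert-expiry", "queue-backlog", "queue-backlog",
]
candidates = repeat_offenders(retro_causes)
```

The resulting list is a starting point for the conversation between product and engineering: each entry either gets prioritized or has its alerting turned down.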
Benefits of minimizing alert fatigue
While it may take some time from your next sprint or two to work on getting your alerts right, it’s certainly worth it. Minimizing alert fatigue has many benefits, and will eventually give your team more time and energy. This short-term sacrifice has long-term benefits. A few of the most noticeable benefits include:
No more bloodshot eyes! All jokes aside, when your team is up each night and working each weekend, both physical and mental health decline. By minimizing alert noise, your team will be less susceptible to burnout, they’ll be more focused on the task at hand, and your customer and employee retention rates will be better.
You know what’s important. There’s a rush of adrenaline and cortisol that most engineers feel when they get notified of an incident. That constant feeling of being on-edge can weigh on a person and even lead to trauma. When you figure out what to alert on and what to ignore, your team doesn’t need to experience all those false alarms, and is more likely to respond in an optimal way to issues that need attention.
Innovation, innovation, innovation. When your team is able to focus without context switching, and can spend their cognitive capacity on strategic work rather than unplanned work and responding to alerts, your innovation will skyrocket.
For all the reasons above, minimizing alert fatigue is an incredibly strategic, important use of time and has a significant positive impact on the team. By using these techniques to cut down on alert fatigue, you’ll be able to concentrate on what matters: exciting your users with the awesome things you build.