On-call: you may see it as a necessary evil. When responding to incidents quickly can make or break your reputation, designating people across the team to be ready to react at all hours of the day is a necessity, but often creates immense stress while eating into personal lives. It isn’t a surprise that many engineers have horror stories about the difficulty of carrying a pager around the clock.
But does on-call have to be so dreadful? We think not. Here are five best practices that can help your team respond more quickly and build more resilient systems that minimize repetitive interruptions.
Not all incidents are created equal, and on-call escalations should only start when it’s really worth getting out of bed. The metrics you can monitor and alert on are often too low-level to capture an incident’s actual severity. Instead, consider the impact different types of incidents have on your customers, and create severity tiers based on that impact.
To determine impact, use techniques such as user journeys (where metrics are consolidated based on typical usage patterns) and black box monitoring (where metrics are gathered only using what external customers can see). These will help you break down an incident into the specific metrics you’ll monitor to trigger alerts and help you cut out metrics that only make things noisier.
Once you have your metrics to understand customer impact, make sure your team is in agreement on how incidents should be classified and what response each class requires. Schedule time to review these choices based on postmortems of previous incidents. Was that Sev 0 really a Sev 0? Does a Sev 3 really need all those people alerted? Your classification system should be logical and consistent enough that classifying new incidents is easy for any engineer. Atlassian also has a good guide to help you get started.
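To make the idea of impact-based tiers concrete, here is a minimal sketch of a severity classifier driven by customer-facing metrics rather than low-level system signals. The thresholds, metric names, and tier meanings are assumptions for illustration, not recommendations; your own tiers should come from the review process described above.

```python
# Hypothetical severity classifier: tiers are driven by customer impact
# (error rate and share of affected users), not by low-level system metrics.
# All thresholds here are illustrative assumptions.

def classify_severity(error_rate: float, affected_users_pct: float) -> int:
    """Return a severity tier: 0 = page immediately, 3 = next business day."""
    if error_rate >= 0.50 or affected_users_pct >= 50:
        return 0  # major outage: wake someone up
    if error_rate >= 0.10 or affected_users_pct >= 10:
        return 1  # degraded for many customers: page the on-call engineer
    if error_rate >= 0.01:
        return 2  # limited impact: notify, handle during working hours
    return 3  # minor: file a ticket, no page
```

Encoding the tiers as code also makes them testable, so the team can check that a proposed threshold change still classifies past incidents the way the postmortem review agreed it should.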
Making sure that your team can tell the difference between a Sev 0 and a Sev 3 incident can save you from having to drive to the office on the weekend, or open your laptop at 2 AM. It can also save you from underestimating a critical, customer-facing incident.
Imagine an incident that is crucial enough to rouse a team member in the wee hours of the morning. What can your team do to help them resolve the incident and get back to bed as quickly as possible? The answer is a runbook.
A runbook is a set of detailed instructions that help engineers resolve each type of incident. This guidance helps ease the cognitive burden of on-call troubleshooting, and contains specific commands to execute or places in code to check.
Your runbooks should also document your incident response process as a whole, not just the fix for each type of incident.
Creating runbooks will also help you discover procedures that can be automated, or toil that can be alleviated through tooling. Give engineers time to vent about which processes are most tedious or distracting, and work to improve them. Working through these steps at 3 AM feels very different from writing them down in the afternoon.
Despite all the planning required, incidents often need more than just a standard solution, so good runbooks should also leave space for the engineers’ creativity. AWS has some fantastic guidance on the technical details of building a runbook.
It can be tough to strike this balance of freedom and guidance, so aim for continuous improvement. Runbooks should be regularly reviewed by the engineers using them. Make analyzing the runbook’s performance part of your postmortem review. Having ownership of the runbook across the team ensures that every engineer is confident in implementing it, and thus is confident to be on-call.
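As a minimal sketch of treating runbook steps as automatable procedures, the runner below executes checks in order and stops at the first failure, so the on-call engineer knows exactly where manual judgment has to take over. The step names and check functions are hypothetical; the point is that anything a runbook phrases as "run X, then verify Y" is a candidate for automation.

```python
# Minimal sketch: a runbook's ordered steps as (name, check) pairs.
# Each check returns True on success; execution stops at the first failure
# so the on-call engineer can pick up exactly where automation left off.

from typing import Callable

def run_steps(steps: list[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run automated runbook steps in order, recording each outcome."""
    log = []
    for name, check in steps:
        if not check():
            log.append(f"FAILED: {name}")
            break  # hand off to a human at the point of failure
        log.append(f"ok: {name}")
    return log

# Hypothetical usage with stubbed checks:
steps = [
    ("service restarted", lambda: True),
    ("health check passing", lambda: False),  # this is where a human steps in
    ("cache warmed", lambda: True),
]
```

Keeping the log of completed steps also gives the postmortem an exact record of what the automation did before the engineer intervened.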
Load balancing people isn’t like load balancing servers: numbers will never tell the full story, and fairness isn’t as simple as giving everyone an equal number of shifts. The goal is to ensure engineers don’t burn out or feel they’re being treated unfairly. Any given on-call shift could be blissful silence or a sprawling, maddening disaster. When people feel they’re taking on an unfair burden — even when they know it’s just unfortunate timing — morale can quickly drop and on-call dread can rise.
A good first step in distributing the most challenging incidents is using your severity classifications to estimate the workload of on-call shifts. An incident’s severity doesn’t necessarily reflect the difficulty in resolving it, though, so also incorporate metrics such as time to resolution.
Most importantly, listen to your on-call engineers. Use post-incident reviews to discuss the impact of on-call incidents, and create qualitative metrics that capture how burnt out responding leaves the engineers involved.
Establishing a system to manage on-call load isn’t easy, so again, continuous iteration and improvement is key. Buy-in from your on-call teams is essential, so make sure they’re involved in evaluating the load of their peers. Try techniques originally used to plan development time, such as story point estimation. These techniques can be repurposed to help teams collaborate on coming up with agreeable estimates of on-call load.
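One simple way to combine severity classifications with time to resolution, as described above, is a per-shift load score. The weights and incident record format below are illustrative assumptions, in the spirit of story points; the real values should come out of the team's own estimation sessions.

```python
# Sketch of estimating on-call load per shift. Severity alone doesn't
# capture effort, so each incident's severity weight is scaled by the
# hours spent resolving it. Weights are hypothetical, like story points.

SEVERITY_WEIGHT = {0: 8, 1: 5, 2: 2, 3: 1}  # illustrative assumption

def shift_load(incidents: list[dict]) -> float:
    """Score a shift as the sum of severity-weighted resolution hours."""
    return sum(
        SEVERITY_WEIGHT[i["severity"]] * i["hours_to_resolve"]
        for i in incidents
    )
```

Comparing these scores across rotations makes uneven shifts visible early, so the team can rebalance before resentment builds rather than after.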
Blameless postmortems can help discover the true systemic causes of incidents and proactively address reliability issues. You can use metrics, such as time to resolution or severity, generated across a variety of postmortems to find recurring root causes, and prioritize development to resolve them. Remember that reliability is a feature: reducing the unplanned work of incident management through reliability engineering is just as valuable as building new features.
Beyond developing reliability as a feature, incorporating other SRE principles will be instrumental in reducing your on-call load. Building service level objectives provides a safety net, warning you of potential crises long before they necessitate a call. Chaos engineering techniques, such as simulating incidents and practicing responses, can also help develop runbooks and uncover areas of underpreparedness.
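As a sketch of the SLO safety net mentioned above, an error-budget burn-rate check can warn long before the budget is exhausted. The 99.9% target and the 10x warning threshold below are assumptions for illustration; real deployments typically alert on burn rates over multiple time windows.

```python
# Hedged sketch of an SLO safety net: warn when the error budget is
# burning fast enough to run out well before the SLO window ends.
# The 99.9% target and 10x threshold are illustrative assumptions.

def budget_burn_rate(errors: int, requests: int, slo: float = 0.999) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if requests == 0:
        return 0.0
    return (errors / requests) / (1 - slo)

def should_warn(errors: int, requests: int, threshold: float = 10.0) -> bool:
    """Warn early: a sustained 10x burn rate exhausts a 30-day error
    budget in about three days, long before customers force a page."""
    return budget_burn_rate(errors, requests) >= threshold
```

Because the warning fires on budget consumption rather than on a hard outage, the team can schedule the fix during working hours instead of being woken up for it.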
You can also be considerate of on-call engineers by looking for patterns of incident timing. Although the business impact of an outage happening in the afternoon or in the middle of the night could be around the same, the outage that wakes up an on-call team has a greater human impact. Include these human considerations when looking for reliability areas to proactively address.
At the core of all our best practices is empathy towards the on-call engineers: considering their pain points, preparing things to help them, and proactively reducing their burden. Bake these empathetic practices into your culture to ensure that on-call decisions will always keep the human in mind.
Celebrate the successes of the on-call team through internal updates, emphasizing the sacrifices and challenges team members had to face. On-call incidents can begin and end in a single night, leaving other engineers unaware and the responders feeling unappreciated. Recognizing the hard work of being on-call can help motivate engineers and reduce burnout.
Try to shift the perception of incidents from unavoidable setbacks to unplanned investments. Every incident is an investment into learning and an opportunity to make all future incidents go better. Likewise, every on-call shift is an investment in learning to improve on-call going forward. Championing this attitude is powerful to make on-call a meaningful challenge, rather than a burden.
Sure, on-call might never be something that engineers enthusiastically sign up for. But it shouldn’t be something you dread, either. The most important thing is to make every effort to alleviate the pain of on-call, and these best practices are a great place to start. You can learn more in Increment’s on-call issue.