In the era of reliability, where mere minutes of downtime or latency can cost hundreds of thousands of dollars, 24x7 availability and on-call coverage to respond to incidents have become requirements for the vast majority of organizations. But setting up an on-call system that drives effective incident response while minimizing the stress placed on engineers isn’t a trivial task. Establishing equitable on-call rotations, putting the right guardrails and automation in place, and practicing incident response regularly are key. In this post, we’ll share key tools and practices to ensure your on-call engineers are set up for success.
When setting up your on-call system, it is important to define clear and consistent policies and practices. When taking on on-call responsibilities, engineers shouldn’t need to reinvent the wheel when the pager goes off; ideally, the planning around severity, incident playbooks, and more should take place during peacetime. The team should work together to create rules that dictate when and how on-call escalations happen. Make sure you have the following worked out before implementing an on-call system.
First, you’ll need to build your on-call schedule. Work out which engineers need to be available for the different system areas where incidents could occur by looking at where each engineer has ownership and domain expertise. Create teams to maximize diversity and coverage, allowing each team to respond effectively to many different types of incidents. Fill out a calendar with these teams, making sure every shift is covered for your rotation period.
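The calendar step above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended scheduler: it assumes a simple round-robin order and fixed-length shifts, and the names `build_rotation` and `find_gaps` and the team names are hypothetical.

```python
from datetime import date, timedelta
from itertools import cycle


def build_rotation(teams, start, shift_days, num_shifts):
    """Assign teams to consecutive fixed-length shifts in round-robin order."""
    if not teams:
        raise ValueError("at least one team is required")
    schedule = []
    team_cycle = cycle(teams)
    for i in range(num_shifts):
        shift_start = start + timedelta(days=i * shift_days)
        shift_end = shift_start + timedelta(days=shift_days)
        schedule.append((shift_start, shift_end, next(team_cycle)))
    return schedule


def find_gaps(schedule):
    """Return (end, next_start) pairs where no team is covering."""
    ordered = sorted(schedule, key=lambda shift: shift[0])
    return [(a[1], b[0]) for a, b in zip(ordered, ordered[1:]) if a[1] < b[0]]


# Hypothetical example: two teams alternating weekly shifts.
rotation = build_rotation(["payments", "platform"], date(2024, 1, 1), 7, 4)
uncovered = find_gaps(rotation)  # an empty list means every shift is covered
```

Running a gap check like `find_gaps` after every manual swap is one way to confirm that a shift change did not leave part of the rotation period uncovered.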
During all of this, consult with your engineers to ensure that your schedules are reasonable and fair. How long should an on-call shift last? How frequently should a team go on-call? What should the procedure be if an engineer has to change shifts? To keep morale high and teams responding effectively, make sure every engineer has a fair say in these choices.
Be prepared to change your rotation schedule frequently, even after implementation. The reality of working on-call shifts is often very different than predicted, so look at on-call data to uncover whether certain individuals are overburdened with off-hours interruptions or critical incidents, and load balance accordingly. Be flexible in hearing out people’s concerns as they develop. External business changes and stages in development cycles can also drastically change the nature of on-call shifts, so be prepared to reflect those with adjustments to shift lengths and rotation frequencies.
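As a rough sketch of what looking at on-call data might involve, the Python below counts off-hours pages per engineer and flags anyone well above the average. The working hours (09:00 to 18:00 on weekdays), the 1.5x threshold, and the function names are assumptions for illustration; real data would come from your paging tool’s export.

```python
from collections import Counter
from datetime import datetime


def is_off_hours(ts, day_start=9, day_end=18):
    """True for weekends or times outside the assumed working day."""
    return ts.weekday() >= 5 or not (day_start <= ts.hour < day_end)


def off_hours_load(pages):
    """Count off-hours pages per engineer from (engineer, timestamp) pairs."""
    return Counter(engineer for engineer, ts in pages if is_off_hours(ts))


def overburdened(load, factor=1.5):
    """Engineers whose off-hours count exceeds `factor` times the mean.

    The mean is taken over engineers with at least one off-hours page.
    """
    if not load:
        return []
    mean = sum(load.values()) / len(load)
    return sorted(name for name, count in load.items() if count > factor * mean)


# Hypothetical page log: (engineer, time the page fired).
pages = [
    ("ana", datetime(2024, 1, 6, 3, 0)),
    ("ana", datetime(2024, 1, 8, 2, 0)),
    ("ben", datetime(2024, 1, 8, 10, 0)),
]
flagged = overburdened(off_hours_load(pages))
```

A report like this is only a starting point for the conversation; the threshold that counts as overburdened is something to agree on with the team.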
Because of these constant changes, it’s important to keep the rotation schedule up-to-date. Keep it somewhere that is convenient to edit, easy to automate and integrate with other systems, and accessible to anyone. Many on-call platforms also offer scheduling tools to make this process easier and more robust.
The next set of policies to define covers when your on-call teams are actually contacted and how they respond. To combat alert fatigue, you’ll want to be judicious about when your teams are notified, but also ensure that critical incidents are not overlooked.
You should have a system to classify incidents, sorting them based on severity and affected area into established classifications. These classifications will determine who is alerted and what response is necessary. The response should also include timelines for when incidents of each severity need to be resolved before you violate SLOs or SLAs.
You can determine severity by looking at the business impact of an incident — issues preventing customers from using services or violating SLAs require a much faster and larger response than a small component loading slightly slower than usual.
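One way to make such a classification concrete is a small lookup from severity to response policy. The sketch below is illustrative only: the three severity levels, the 5% error-rate cutoff, the role names, and the resolution targets are all assumptions that each team would replace with its own definitions.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # customers cannot use the service, or an SLA is at risk
    SEV2 = 2  # degraded experience for a meaningful share of requests
    SEV3 = 3  # minor issue, e.g. one component loading slightly slower


@dataclass(frozen=True)
class ResponsePolicy:
    notify: tuple              # roles to alert (hypothetical names)
    page_immediately: bool     # page now, or wait for working hours
    resolve_within: timedelta  # assumed resolution target


POLICIES = {
    Severity.SEV1: ResponsePolicy(
        ("primary", "secondary", "incident-commander"), True, timedelta(hours=1)),
    Severity.SEV2: ResponsePolicy(
        ("primary",), True, timedelta(hours=4)),
    Severity.SEV3: ResponsePolicy(
        ("primary",), False, timedelta(days=3)),
}


def classify(customers_blocked, sla_at_risk, error_rate):
    """Map business impact to a severity level using assumed thresholds."""
    if customers_blocked or sla_at_risk:
        return Severity.SEV1
    if error_rate >= 0.05:
        return Severity.SEV2
    return Severity.SEV3


policy = POLICIES[classify(customers_blocked=False, sla_at_risk=False, error_rate=0.08)]
```

Keeping the policy in one table means the question of who gets woken up is answered during peacetime, not while the pager is going off.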
You’ll also need to prepare a defined response to each category of incidents. Engineers should be equipped with tools like runbooks to begin tackling an incident as soon as they’re alerted. These runbooks can also include checks for triggering further escalation. Make sure your on-call engineers are familiar with these runbooks, and confident about executing them when the time comes. Schedule regular review sessions to update runbooks based on incident retrospectives.
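A runbook with built-in escalation checks could be modeled roughly as follows. This is a hypothetical structure, not the format of any particular tool: each step has an action that reports success or failure, and a step may name who to escalate to if it fails.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunbookStep:
    description: str
    action: Callable[[], bool]         # returns True if the step succeeded
    escalate_to: Optional[str] = None  # who to involve if the step fails


def run_runbook(steps):
    """Run steps in order; stop at the first failed step that names an escalation.

    Returns (log, escalation_target), where the target is None if no
    escalation was triggered.
    """
    log = []
    for step in steps:
        ok = step.action()
        log.append((step.description, ok))
        if not ok and step.escalate_to:
            return log, step.escalate_to
    return log, None


# Hypothetical runbook; the lambdas stand in for real checks and fixes.
steps = [
    RunbookStep("Confirm the alert on the dashboard", lambda: True),
    RunbookStep("Restart the stuck worker", lambda: False, "database on-call"),
    RunbookStep("Verify that the queue is draining", lambda: True),
]
log, escalation = run_runbook(steps)
```

The returned log doubles as a record of what was tried, which can feed the incident retrospective.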
Between being called out of bed in the wee hours, having to handle incidents with fewer teammates and resources than normal, and facing extreme pressure to restore service as business reputation is on the line, on-call can be an extremely stressful experience. Being overwhelmed by on-call responsibilities, believing that on-call duties are assigned unfairly, or generally feeling under-appreciated can quickly destroy engineers’ morale and accelerate burnout.
Combat these challenges by cultivating an empathetic on-call culture that puts people first.
Involve engineers in setting schedules and other policies. Hear out their experiences, celebrating their successes and addressing their struggles. Make sure you hear these concerns blamelessly; instead of attributing setbacks or miscommunications to individuals, look at the systems behind them. Protect against a ‘hero’ culture, and make on-call sustainable by eliminating single points of failure and embracing smaller and more frequent changes, distributed rotations, and continuous learning.
Reframe incidents from failures and setbacks to investments in future reliability: every incident, when properly addressed, makes the response to each future incident better. Likewise, each on-call shift is an investment in making future on-call shifts better. When there are challenges in load balancing, preparing effective responses, or escalating properly, embrace them as opportunities to refine and grow.
For more tips on how to implement empathetic and effective on-call practices, check out our top 5 on-call practices here.
Implementing on-call practices is a complicated process, but fortunately there are great paid and free on-call tools and platforms to help. The most popular tools include PagerDuty, OpsGenie, VictorOps, Cabot, and LinkedIn On-Call (open source).
When selecting an on-call tool, some important requirements to consider include:
On-call is an essential component of a reliable system. To take your on-call and reliability practice to the next level, you’ll need to codify context into guardrails and automation, minimize toil, and foster a culture that is inclined toward curiosity instead of blame. Blameless can help you get more out of your on-call and broader reliability efforts by integrating valuable data from SLOs, incident checklists, postmortems, follow-up action items, and much more. To find out how to empower your SRE solution with Blameless, join us for a demo!