SLOs are key pillars in organizations’ reliability journeys. But, once you’ve set your SLOs, you need to know what to do with them. If they’re only metrics that you’re paged for once in a blue moon, they’ll become obsolete. To make sure your SLOs stay relevant, determine error budgets and policies for your teams. In this blog, we’ll look at the basics of error budgeting, how to set corresponding policies, and how to operationalize SLOs for the long term.
An error budget is the percentage of wiggle room you have left in your SLO. Generally, you’ll measure it over a rolling window rather than a fixed historical view of your data. This keeps the SLO fresh, monitored, and always moving forward. The error budget can be expressed as the following calculation: error budget = 100% − SLO target.
Imagine you’ve set an SLO for 99.5% uptime per month. This means your error budget is 0.5%, which works out to about 3.65 hours of downtime in an average 730-hour month. If an incident causes a 1.22-hour outage, you’ve lost approximately one third of your error budget for this month.
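The arithmetic in this example can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the function names are hypothetical, and the 730-hour month is an assumption (365 days × 24 hours ÷ 12 months) chosen because it reproduces the 3.65-hour figure.

```python
# Assumption: an "average" month of 365 * 24 / 12 = 730 hours.
HOURS_PER_MONTH = 365 * 24 / 12


def error_budget_hours(slo_percent: float, window_hours: float = HOURS_PER_MONTH) -> float:
    """Allowed downtime, in hours, for an uptime SLO over the given window."""
    return (100.0 - slo_percent) / 100.0 * window_hours


def budget_consumed(outage_hours: float, slo_percent: float,
                    window_hours: float = HOURS_PER_MONTH) -> float:
    """Fraction of the error budget used up by an outage of the given length."""
    return outage_hours / error_budget_hours(slo_percent, window_hours)


budget = error_budget_hours(99.5)
used = budget_consumed(1.22, 99.5)
print(f"{budget:.2f} hours of budget, {used:.0%} consumed")
# prints "3.65 hours of budget, 33% consumed"
```

Passing a different `window_hours` (for example, 720 for a 30-day rolling window) gives the budget for that window instead.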
So, what does this information mean to teams? Your error budget policy will determine this.
It’s not enough to know what your error budget is. You also need to know what you’ll do in the event of error budget violations. You can do this through an error budget policy. This determines alerting thresholds and actions to take to ensure that error budget depletion is addressed. It will also denote escalation policies as well as the point at which SRE or ops should hand the pager back to the developer if reliability standards are not met.
Alerting: Alert (or pager) fatigue harms even seasoned teams’ ability to respond to incidents. It sets in when a team receives too many alerts, either because there are too many incidents or because monitoring is picking up on insignificant issues (also known as alert noise). This can lower your team’s cognitive capacity, making incident response more difficult. It can also lead your team to ignore crucial alerts, resulting in major incidents going unresolved or unnoticed.
You’ll want to make sure that your alerting isn’t notifying you every time a small part of your error budget is eaten. After all, this will happen throughout the rolling window. Instead, make sure that alerts are meaningful to your team and indicate actions you need to take. This is why many teams care more about being notified of the error budget burn rate over a specific time interval than about fixed depletion percentages (e.g., 25% vs. 50% vs. 75%).
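As a rough sketch of what alerting on burn rate could look like, the snippet below uses one common definition: the observed error rate in an interval divided by the error rate the SLO allows. A burn rate of 1 means the budget would run out exactly at the end of the window. The function names and the threshold of 2.0 are hypothetical, chosen only for illustration; your own policy should set the real values.

```python
def burn_rate(bad_events: int, total_events: int, slo_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total_events == 0:
        return 0.0  # no traffic in the interval, so nothing was burned
    observed_error_rate = bad_events / total_events
    allowed_error_rate = (100.0 - slo_percent) / 100.0
    return observed_error_rate / allowed_error_rate


def should_alert(rate: float, threshold: float = 2.0) -> bool:
    """Alert only when the budget is burning faster than the chosen threshold.

    The default threshold is an assumption for this example, not a recommendation.
    """
    return rate > threshold


# 30 failed requests out of 2,000 in the last hour, against a 99.5% SLO:
# the observed error rate (1.5%) is about three times the allowed rate (0.5%).
rate = burn_rate(bad_events=30, total_events=2000, slo_percent=99.5)
print(should_alert(rate))
```

Alerting on the rate rather than on each percentage milestone means a slow, expected trickle of errors stays quiet while a sudden spike still pages someone.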
To determine if you need to take action for error budget burn, write in stipulations. Stipulations could look something like this: if the error budget % burned ≤ % of rolling window elapsed, no alerting is necessary. After all, a 90% burn for error budget isn’t concerning if you only have 3 hours left in your window and no code pushes.
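The stipulation above can be written down as a simple comparison. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the 720-hour window (30 days) is only an example value.

```python
def window_elapsed_pct(hours_left: float, window_hours: float) -> float:
    """Percentage of the rolling window that has already elapsed."""
    return (window_hours - hours_left) / window_hours * 100.0


def needs_attention(budget_burned_pct: float, elapsed_pct: float) -> bool:
    """True only when the budget is burning faster than time is elapsing."""
    return budget_burned_pct > elapsed_pct


# 90% of the budget burned with 3 hours left in a 720-hour (30-day) window:
# more than 99% of the window has elapsed, so no alert is needed.
print(needs_attention(90.0, window_elapsed_pct(hours_left=3, window_hours=720)))

# 50% of the budget burned when only 25% of the window has elapsed:
# burn is outpacing time, so the policy should kick in.
print(needs_attention(50.0, 25.0))
```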
But, if burn is occurring faster than time is elapsing, you’ll need to know what to do. Who needs to be notified? At what point do you need to halt features to work on reliability? Who should own the product and be on-call for it at this point? Add answers to questions like these to your error budget policy. Google has published an example of what this document looks like. It contains information on:
Handing back the pager: In the example policy above, Google reminds us, “This policy is not intended to serve as a punishment for missing SLOs. Halting change is undesirable; this policy gives teams permission to focus exclusively on reliability when data indicates that reliability is more important than other product features.” If a certain level of reliability is not met and the product is unable to remain within the error budget over a determined period of time, SRE or operations can hand back the pager to the developers.
This is not a punishment. It’s a way to keep dev, SREs, and ops all on the same page, and to shift quality left in the software lifecycle by incentivizing developer accountability. Quality matters. Developers are held accountable for their code. If it’s not up to par, feature work will halt, reliability work will take center stage, and SRE or ops will hand the pager over to those who write the code. This helps protect SRE and ops from experiencing pager fatigue or spending all their time on reactive work. Error budget policies are an efficient way to keep everyone aligned on what matters most, which is customer happiness.
The process that goes into creating SLOs, especially the people aspect, is critical for staying consistent and for scaling the practice across your entire organization. To operationalize SLOs, you’ll need to remember a few key things:
Once you’ve got these basics down, you can begin to expand your SLO practices.
Here are some additional, more advanced SLO practices that you can start using once you’ve found success with the basics:
Maybe you’ll be ready to take these advanced steps in a few months. Maybe it will take a few years. No organization’s SLO journey looks the same. The important thing to remember is that iteration, alignment, and a blameless culture are what’s core to your SRE practice. SLOs and error budgets are only components.