When your product is down for one minute, your users might refresh their browsers. They might not even notice an issue. When it’s down for longer, your customer service representatives start to receive questions and complaints. When the outage affects your customers’ core business processes, you run into SLA breaches and harm customer satisfaction and renewal odds. SaaS executives know the first situation needs to be minimized and the latter two are unacceptable.
Walk the halls of any SaaS startup and you’ll hear the same thing over and over again: “Before we can scale, we have to be enterprise-ready.” While “enterprise-ready” has many definitions, it generally means:
Being secure and compliant
Having the baseline features the market expects
More mature companies, which already have “enterprise-ready” offerings, have different discussions. They focus on innovating faster to keep pace with startups without injecting too much risk into the system.
No matter the stage of your company, the trade-off question is the same. How can we speed up the rate of innovation while maintaining reliability?
This is the critical question that Site Reliability Engineering (SRE) seeks to answer. Here’s how:
1. Start by tracking Service Level Indicators (SLIs). SLIs are key performance metrics that describe the usability of your offering.
2. Set Service Level Objectives (SLOs). These are your internal targets for each SLI.
3. Determine your error budget, or the rate of acceptable failure. This helps engineering teams make trade-offs between innovation velocity and reliability. Teams that are not using most of their error budget should consider taking more risks. Teams that are using more than their allotted error budget need to focus on reliability.
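To make the three steps above concrete, here is a minimal Python sketch of the arithmetic. It assumes an event-based availability SLI (good events divided by total events). The names `SLO` and `error_budget_report` and the 50% "low usage" threshold are illustrative assumptions, not part of any standard; the article only says that teams using little of their budget can take more risk and teams over budget should focus on reliability.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    """An internal target for one SLI (illustrative structure)."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of events should be good


# Assumption: "not using most of the budget" is read here as under 50% consumed.
LOW_USAGE_THRESHOLD = 0.5


def error_budget_report(slo: SLO, good_events: int, total_events: int) -> dict:
    """Compare the measured SLI against the SLO and report budget usage."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    sli = good_events / total_events                 # step 1: measured indicator
    allowed_bad = (1.0 - slo.target) * total_events  # step 3: budget, in events
    actual_bad = total_events - good_events
    consumed = actual_bad / allowed_bad if allowed_bad > 0 else float("inf")

    if consumed > 1.0:
        guidance = "focus on reliability"
    elif consumed < LOW_USAGE_THRESHOLD:
        guidance = "room for more risk"
    else:
        guidance = "on track"

    return {
        "sli": sli,
        "allowed_bad_events": allowed_bad,
        "budget_consumed": consumed,  # 1.0 means the budget is fully spent
        "guidance": guidance,
    }


# Step 2: a 99.9% availability objective, checked against one window of traffic.
availability = SLO(name="availability", target=0.999)
report = error_budget_report(availability, good_events=999_700, total_events=1_000_000)
print(report)
```

In this example the objective allows roughly 1,000 failed requests per million, and 300 failures consume about 30% of that budget, so the sketch reports room for more risk. A team with 1,500 failures in the same window would be over budget and directed toward reliability work.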
But it doesn’t stop there.
Since 100% uptime is impossible, protecting the error budget during incidents is critical. The best SRE teams design and then automate an incident management runbook that defines roles, responsibilities, and resolution work items. With this automated runbook in place, responders no longer have to guess who does what or which steps come next.
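One way to picture an automated runbook is as data plus a function that opens an incident from it. The sketch below is a hypothetical illustration only: the `Runbook` and `Incident` classes, the `open_incident` helper, and the role names are assumptions made for this example, not the API of any particular incident management tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Runbook:
    """Hypothetical runbook: defined roles plus ordered resolution work items."""
    roles: Dict[str, str]   # role name -> responsibility
    work_items: List[str]   # resolution steps, in order


@dataclass
class Incident:
    title: str
    assignments: Dict[str, str] = field(default_factory=dict)  # role -> person
    checklist: Dict[str, bool] = field(default_factory=dict)   # step -> done?


def open_incident(title: str, runbook: Runbook, on_call: Dict[str, str]) -> Incident:
    """Create an incident with every role assigned and every step pending."""
    missing = [role for role in runbook.roles if role not in on_call]
    if missing:
        # Fail loudly rather than start an incident with an unowned role.
        raise ValueError(f"no on-call person for roles: {missing}")
    return Incident(
        title=title,
        assignments={role: on_call[role] for role in runbook.roles},
        checklist={step: False for step in runbook.work_items},
    )


# Example role and step names are illustrative.
runbook = Runbook(
    roles={
        "incident commander": "coordinates the response",
        "communications lead": "updates customers and stakeholders",
    },
    work_items=["acknowledge alert", "mitigate impact", "confirm recovery"],
)
incident = open_incident(
    "Checkout API errors",
    runbook,
    on_call={"incident commander": "alex", "communications lead": "sam"},
)
print(incident.assignments)
```

Encoding the runbook this way means an incident cannot start with a role unfilled, and the checklist gives the later retrospective a record of which steps were completed.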
After each incident, you’ll need to conduct a blameless incident retrospective. Discovering the contributing factors while maintaining a collaborative, productive culture is imperative.
The best SRE teams take the learnings from these processes and adjust their runbook to mitigate future risk. This improves their incident response and reliability processes further.