When your product is down for one minute, your users might refresh their browsers, thinking the problem is on their side, or might not notice an issue at all. When it’s down for longer, your customer service representatives start to receive questions and complaints. Longer still, and you begin to affect your customers’ core business processes, run into SLA breaches, and harm customer satisfaction and, consequently, renewal odds. SaaS executives know the first situation needs to be minimized and the latter two are unacceptable.
Walk the halls of any SaaS startup and you’ll hear the same thing over and over again: “Before we can really scale, we have to be enterprise-ready.” While “enterprise-ready” has many definitions, it generally means being secure and compliant, having the baseline features the market expects, and being consistently reliable.
More mature companies, which already have “enterprise-ready” offerings, have different discussions. Theirs focus on how to innovate fast enough to keep pace with fast-moving, disruptive startups without injecting too much risk into the system.
No matter the stage of your company, the trade-off question is ultimately the same: how can we accelerate the rate of innovation while maintaining or improving reliability?
This is the critical question that Site Reliability Engineering (SRE) seeks to answer. Here’s how:
Start by tracking Service Level Indicators (SLIs), the key performance metrics that describe the usability of your offering. Once you have defined your SLIs and started tracking them, set Service Level Objectives (SLOs), your internal targets for each SLI. Your SLOs determine your error budget: the rate of acceptable failure. This error budget helps engineering teams make intelligent trade-offs between innovation velocity and reliability. Teams that are not using most of their error budget should consider moving faster and taking more risks, whereas teams that are using more than their allotted error budget need to slow down and focus on reliability.
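To make the arithmetic concrete, here is a minimal Python sketch of how an error budget might fall out of an SLO. The names (`Slo`, `error_budget`, `budget_consumed`, `guidance`) and the 50% threshold for "budget to spare" are illustrative assumptions, not part of any standard SRE tooling; your own thresholds and measurement windows will differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """An internal target for one SLI, e.g. 99.9% of requests succeed."""
    name: str
    target: float  # fraction of good events, between 0 and 1


def error_budget(slo: Slo, total_events: int) -> float:
    """Number of bad events the SLO tolerates in the measurement window."""
    return (1.0 - slo.target) * total_events


def budget_consumed(slo: Slo, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget already spent (1.0 means fully spent)."""
    budget = error_budget(slo, total_events)
    if budget <= 0:
        # A 100% target leaves no budget, so any failure overspends it.
        return float("inf") if bad_events else 0.0
    return bad_events / budget


def guidance(consumed: float) -> str:
    """Map budget consumption to the trade-off described above."""
    if consumed > 1.0:
        return "over budget: slow down and focus on reliability"
    if consumed < 0.5:  # assumed threshold for "not using most of the budget"
        return "budget to spare: consider moving faster"
    return "on track: keep the current pace"


if __name__ == "__main__":
    availability = Slo("availability", 0.999)
    consumed = budget_consumed(availability, total_events=1_000_000, bad_events=250)
    print(f"{consumed:.0%} of budget used -> {guidance(consumed)}")
```

With a 99.9% target, one million requests leave room for roughly 1,000 failures; the same target applied to a 30-day window of uptime leaves about 43 minutes of downtime.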
But it doesn’t stop there.
Since 100% uptime can never be achieved, protecting the error budget during incidents is critical. The best SRE teams design and then automate an incident management playbook, which should include clearly defined roles and responsibilities and resolution work items. With this automated playbook in place, responders know who does what and which steps to take, instead of improvising under pressure.
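As a rough illustration of what "automating" a playbook can mean, the Python sketch below models a playbook as data and opens an incident from it. Everything here is hypothetical: the `Playbook` and `Incident` types, the `open_incident` helper, and the role names are assumptions chosen for the example, not a prescribed structure.

```python
from dataclasses import dataclass, field


@dataclass
class Playbook:
    """Roles (with responsibilities) and ordered resolution work items."""
    roles: dict[str, str]
    work_items: list[str]


@dataclass
class Incident:
    title: str
    assignments: dict[str, str] = field(default_factory=dict)  # role -> person
    checklist: list[str] = field(default_factory=list)


def open_incident(title: str, playbook: Playbook, on_call: dict[str, str]) -> Incident:
    """Assign every playbook role to a person and copy the work items."""
    missing = [role for role in playbook.roles if role not in on_call]
    if missing:
        # Refuse to start with unfilled roles rather than guess who owns them.
        raise ValueError(f"no one on call for: {', '.join(missing)}")
    assignments = {role: on_call[role] for role in playbook.roles}
    return Incident(title, assignments, list(playbook.work_items))


if __name__ == "__main__":
    playbook = Playbook(
        roles={
            "incident commander": "coordinates the response",
            "communications lead": "updates customers and stakeholders",
        },
        work_items=["confirm impact", "mitigate", "verify recovery"],
    )
    on_call = {"incident commander": "Ana", "communications lead": "Raj"}
    incident = open_incident("Checkout errors", playbook, on_call)
    print(incident.assignments)
    print(incident.checklist)
```

Failing loudly when a role has no owner is one possible design choice; it surfaces staffing gaps before an incident rather than during one.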
Once you’ve successfully managed an incident, blameless postmortems can’t be skipped. Getting to the contributing factors of problems – while maintaining a collaborative, productive culture – is imperative. The best SRE teams take the learnings from these processes and adjust their playbook to mitigate future risk and improve their incident response and reliability processes further.
Want to get started with SRE? Reach out!