Getting SRE Buy-in from C-Levels for Error Budgets and SLOs, Part 3

You now have postmortems properly implemented, automated, and well-structured. You’re generating reports and data automatically based on all your incidents. Two levels of management have agreed to your SRE buy-in efforts. That is a huge accomplishment! If you’re here, you’re gaining real traction in adopting SRE best practices, but the battle is not won yet. The hardest and most strategically important effort will be showing your C-levels why they should buy into SRE.

The situation

You’re now moving from an incident-driven reactive mode to a proactive mode with the overall goal of reliability. You want to elevate reliability as something that facilitates innovation instead of blocking it, and that guides decisions across the software lifecycle. You’ve assigned a metric to it and are upholding it. Now you’re looking ahead to the next phase.

This phase is characterized by SLOs and SLIs being tightly defined and hooked into the right parts of the system. Additionally, your business teams will have agreed on the SLOs, the error budget thresholds, and what will happen if those thresholds are breached. To propose this, keep two key thoughts in mind.

First, what does your error budget policy include? We define error budget policies as including SLOs, SLIs, and negotiated error budget responses. Second, it’s important to remember that organization-wide adoption of SRE best practices will be a large undertaking for your C-levels. Your CEO/CTO/CIO will need company-wide support to connect engineering, product and business units to make this goal happen. So, your incentives need to speak to them convincingly.
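To make the three parts of an error budget policy concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `ErrorBudgetPolicy` class, the `budget_consumed` and `agreed_response` helpers, the 99.9% target, and the example thresholds and actions are hypothetical, not a prescribed policy. Your own SLIs, targets, and negotiated responses will differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudgetPolicy:
    """Hypothetical container for the three pieces of an error budget policy."""
    sli_name: str        # the SLI: what is measured
    slo_target: float    # the SLO: target fraction of good events, e.g. 0.999
    # Negotiated responses as (fraction of budget consumed, agreed action) pairs.
    responses: tuple

    def __post_init__(self):
        # A target of exactly 1.0 would leave no error budget at all.
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")


def budget_consumed(policy, good_events, total_events):
    """Fraction of the error budget used so far (can exceed 1.0)."""
    if total_events == 0:
        return 0.0
    allowed_bad = (1.0 - policy.slo_target) * total_events
    actual_bad = total_events - good_events
    return actual_bad / allowed_bad


def agreed_response(policy, consumed):
    """Return the action negotiated for the highest threshold reached."""
    action = "none: budget is healthy"
    for threshold, response in sorted(policy.responses):
        if consumed >= threshold:
            action = response
    return action


# Illustrative values only.
policy = ErrorBudgetPolicy(
    sli_name="successful checkout requests",
    slo_target=0.999,
    responses=((0.5, "review risky launches"), (1.0, "freeze feature releases")),
)
consumed = budget_consumed(policy, good_events=999_400, total_events=1_000_000)
print(f"{consumed:.0%} of budget used -> {agreed_response(policy, consumed)}")
```

The point of writing the policy down this way is that the response is decided before the incident, so nobody has to negotiate priorities in the middle of one.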

The incentives

These incentives won’t be what you see day-in and day-out from SRE. Instead, these will be the ones that your C-level is most excited by.

  • Long-term competitive advantage for being able to better protect customer experience compared to competitors, hence increased customer loyalty.
  • Growing complexity of tech stacks and dependency on microservices means issues only get worse if unaddressed. As we move toward a world of complex, distributed systems, the way we operate must evolve to support that. This is the chance to catch up.
  • Reliability is feature #1 for championing the customer experience. If a user can’t access your service or has a degraded experience, then the latest feature is rendered irrelevant. Reliability is the foundation that all other features build upon.

Of course, you can anticipate resistance towards adoption, even with these high-level incentives.

The resistance

What this will come down to is company priorities. C-level executives might not see the link between business performance and reliability, because incentives are often aligned toward new product innovation. Therefore, it may be difficult to convince them that SRE should be a company-level priority. However, by leveraging both an emotional and a logical appeal, you can succeed.

The emotional appeal

Here, we lean heavily on customer impact. Everyone at the C-level cares about whether or not customers are satisfied with their service. Satisfied customers cultivate pride, while dissatisfied customers create fear.

Additionally, there is a significant financial aspect involved. Without SRE, organizations feel customer impact directly through SLA losses. That can be very expensive and damaging to the brand and to customer trust. If the reliability issues are too disruptive to overlook, customers may begin to churn. The data you collect on the cost of downtime can indicate how reliability affects your brand value.

To avoid triggering an SLA breach, SLOs must be implemented. These often act as a safety net, letting you know when you’re in danger before you need to start sounding the alarms. To prove to C-levels that SLOs are crucial, you can do two things.

  • First, quantify the cost of downtime (e.g. measure SLA losses) and estimate a bottom line for reliability impact.
  • Second, you can show them your organization's Net Promoter Score (NPS) as an indicator of brand and customer sentiment, alongside a detailed customer satisfaction survey, to correlate the score with reliability.
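
The first of these, quantifying the cost of downtime, can start as simple arithmetic over your incident records. The sketch below assumes a deliberately rough model: lost revenue per minute of downtime plus any SLA credits paid out. The function names, the record fields, and all of the figures are hypothetical placeholders; substitute numbers from your own finance and incident data.

```python
def incident_cost(minutes_down, revenue_per_minute, sla_credits=0):
    """Rough cost of one incident: lost revenue plus SLA credits paid out."""
    return minutes_down * revenue_per_minute + sla_credits


def total_downtime_cost(incidents, revenue_per_minute):
    """Sum the rough cost across a list of incident records."""
    return sum(
        incident_cost(
            incident["minutes_down"],
            revenue_per_minute,
            incident.get("sla_credits", 0),  # not every incident breaches an SLA
        )
        for incident in incidents
    )


# Placeholder figures for illustration only, not real data.
example_incidents = [
    {"minutes_down": 30, "sla_credits": 2_000},
    {"minutes_down": 12},
]
print(total_downtime_cost(example_incidents, revenue_per_minute=500))
```

This model leaves out churn, support load, and brand damage, so treat the result as a floor on the true cost rather than a full estimate.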

The logical appeal

The first logical appeal you can present is the need for a competitive advantage. When you offer services similar to your competition's (e.g., AWS and GCP), you look like a less viable option if a competitor is able to respond to, recover from, and prevent incidents better than you. SLOs are an important lever for understanding your product and customer experience, so you can be the best and stay ahead of the competition.

Show your executives the metrics on the SLOs and explain how they are set to optimize the performance of the most important paths in the user’s journey. Also consider bringing to the table the amount of data and the number of access points in the cloud, along with the number of services the company depends on. This can demonstrate the need for a system that can adapt to the complexity of technology moving towards cloud and microservices.

The second logical appeal is that proactive is always better than reactive. SLOs and the use of error budgets help us move from a reactive mode (knowing that incidents will occur but not where and why), to a proactive mode of anticipating areas of risk and failure. Error budgets with negotiated terms between the business and engineering teams allow teams to automatically respond in the right way, by standardizing actions and protocols.
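
One way to illustrate the shift from reactive to proactive is to project when an error budget will run out at the current pace, so a risk area is flagged before the SLO is actually breached. The sketch below is a simple linear projection under stated assumptions: the function names and the 30-day window are hypothetical, and real burn is rarely this uniform.

```python
def days_until_budget_exhausted(consumed, days_elapsed):
    """Linear projection of days left before the error budget is fully spent.

    consumed is the fraction of the budget used so far (0.0 to 1.0 or more);
    days_elapsed is how far into the SLO window we are.
    Returns None when nothing has been consumed, since no trend exists yet.
    """
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    if consumed <= 0:
        return None
    if consumed >= 1:
        return 0.0
    return (1.0 - consumed) * days_elapsed / consumed


def at_risk(consumed, days_elapsed, window_days=30):
    """True if the budget is projected to run out before the window ends."""
    projected = days_until_budget_exhausted(consumed, days_elapsed)
    return projected is not None and projected < window_days - days_elapsed


# Half the budget gone a third of the way through an assumed 30-day window.
print(at_risk(consumed=0.5, days_elapsed=10))
```

A flag like this is what lets the negotiated response kick in while there is still budget left to protect.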

To demonstrate this, you’ll need two things. First, you’ll need automated reporting on incidents, plus SLOs and error budgets that highlight risk areas before customers are impacted. Second, you’ll need a map of all the areas of customer impact in previous incidents that could have been prevented with this knowledge. With this evidence, and appeals to both the emotion and logic of your C-level executives, you’ll be able to convince them that investing in SRE is a strategic initiative that impacts the success of the entire company.

About the Author
Lyon Wong

Co-founder and COO
