You now have postmortems implemented, automated, and well-structured. You're generating reports and data automatically based on all your incidents. Two levels of management have bought into your SRE efforts. That is a huge accomplishment! If you’re here, you're gaining real traction in adopting SRE best practices, but the battle is not won yet. The hardest but most strategic effort will be proving to your C-levels why they should buy into SRE.
You’re moving from an incident-driven, reactive mode to a proactive mode. You're elevating reliability to help innovation and guide decisions across the software lifecycle. You’ve assigned a metric to it and are upholding it. You’re looking ahead to the next phase.
This phase revolves around well-defined SLOs and SLIs hooking into the right parts of the system. You'll need your business teams to agree on the SLO, the error budget thresholds, and what will happen if a threshold is breached. To propose this, keep two key thoughts in mind.
- What does your error budget policy include? We define error budget policies as including SLOs, SLIs, and error budget responses.
- Organization-wide adoption of SRE will be a large undertaking for your C-levels. Your CEO/CTO/CIO will need company-wide support to connect engineering, product, and business units. So, your incentives need to persuade them.
These incentives won’t be what you see day-in and day-out from SRE. Instead, these will be the ones that your C-level is most excited by.
- Long-term competitive advantage: Protect customer experience compared to competitors and increase customer loyalty.
- Growing complexity of tech stacks and dependency on microservices: Issues worsen if unaddressed. As we move toward a world of complex, distributed systems, the way we operate must evolve to support that. This is the chance to catch up.
- Reliability is feature No. 1: If a user can’t access your service or has a degraded experience, then features are irrelevant. Reliability is the foundation that all other features build upon.
Of course, you can expect resistance towards adoption, even with these high-level incentives.
What this will come down to is company priorities. C-level executives might not see the link between business performance and reliability. This is because incentives are often aligned toward new product innovation. So, it may be difficult to convince them that SRE should be a company-level priority. But, by leveraging both an emotional and logical appeal, you can succeed.
The emotional appeal
Here, we lean on customer impact. Everyone at the C-level cares about customer happiness. Satisfied customers cultivate pride, while dissatisfied customers create fear.
Additionally, there is a significant financial aspect involved. Without SRE, reliability failures reach customers directly and show up as SLA losses. That can be very expensive and damaging to the brand and to customer trust. If the reliability issues are too disruptive to overlook, customers may churn. The data you collect on the cost of downtime can indicate how reliability affects your brand value.
To avoid triggering an SLA breach, you'll need to adopt SLOs. These often act as a safety net, letting you know when you’re in danger before you need to start sounding the alarms. To prove to C-levels that SLOs are crucial, you can do two things.
- Quantify the cost of downtime (e.g. SLA losses) and estimate a bottom line for reliability impact.
- Show them your organization's Net Promoter Score (NPS) alongside a detailed customer satisfaction survey to correlate the score with reliability.
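The first item above, quantifying the cost of downtime, can be roughed out with a few lines of arithmetic. The following is a minimal sketch, not a definitive model: the class and function names are hypothetical, every figure in the example is made up for illustration, and it assumes the cost is simply lost revenue plus any SLA credits owed.

```python
from dataclasses import dataclass


@dataclass
class DowntimeInputs:
    """Hypothetical inputs for one reporting period (all values are assumptions)."""
    downtime_minutes: float         # customer-facing downtime in the period
    revenue_per_minute: float       # average revenue earned per minute of uptime
    sla_breached: bool              # whether downtime exceeded the SLA allowance
    sla_credit_rate: float          # fraction of contract value refunded on breach
    affected_contract_value: float  # contract value covered by the breached SLA


def estimate_downtime_cost(inputs: DowntimeInputs) -> dict:
    """Estimate a bottom line: lost revenue plus SLA credits (if breached)."""
    lost_revenue = inputs.downtime_minutes * inputs.revenue_per_minute
    sla_credits = (
        inputs.sla_credit_rate * inputs.affected_contract_value
        if inputs.sla_breached
        else 0.0
    )
    return {
        "lost_revenue": lost_revenue,
        "sla_credits": sla_credits,
        "total": lost_revenue + sla_credits,
    }


# Illustrative numbers only; substitute your own incident and contract data.
example = DowntimeInputs(
    downtime_minutes=90,
    revenue_per_minute=200.0,
    sla_breached=True,
    sla_credit_rate=0.10,
    affected_contract_value=500_000.0,
)
cost = estimate_downtime_cost(example)
print(cost)
```

A real estimate would likely add harder-to-measure terms such as churn and engineering time spent on incidents, but even this simple total gives executives a concrete number to react to.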
The logical appeal
Need for a competitive advantage: When you offer services similar to your competitors', you look like a less viable option if a competitor can respond to, recover from, and prevent incidents better than you. SLOs are an important lever for understanding your product and customer experience so you can stay ahead of the competition.
Show your executive the metrics on the SLOs and explain how they are set to optimize performance of the most important paths in the user’s journey. Consider bringing figures on the amount of data and the number of access points in the cloud, and on the number of services the company depends on. This shows the need for a system that can adapt to the growing complexity of cloud and microservices.
Proactive is always better than reactive: SLOs and the use of error budgets help us move from a reactive mode (knowing that incidents will occur but not where and why) to a proactive mode of anticipating areas of risk and failure. Error budgets with terms negotiated between the business and engineering teams allow teams to respond in the right way by standardizing actions and protocols.
To prove this, you’ll need two metrics:
- Automated reporting on incidents, SLOs, and error budgets that highlights risk areas before customers are impacted.
- A map of all areas of customer impact that could have been prevented with this knowledge.
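To make the automated reporting above concrete, here is a minimal sketch of an error budget report built from the three parts of an error budget policy named earlier: an SLO, an SLI, and agreed responses. It assumes a request-based SLI (good events divided by total events). The names, thresholds, and response wording are hypothetical placeholders for whatever your business and engineering teams negotiate.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudgetPolicy:
    """Hypothetical policy: an SLO target plus agreed responses per threshold."""
    slo_name: str
    slo_target: float  # e.g. 0.999 means 99.9% of events should be good
    responses: list    # (fraction of budget consumed, agreed response) pairs


def error_budget_report(policy: ErrorBudgetPolicy, good_events: int, total_events: int) -> dict:
    """Compute the SLI, the share of error budget consumed, and the response due."""
    sli = good_events / total_events
    allowed_bad = (1 - policy.slo_target) * total_events  # the error budget
    actual_bad = total_events - good_events
    consumed = actual_bad / allowed_bad if allowed_bad > 0 else float("inf")

    # Pick the response for the highest threshold that has been reached.
    response = "No action needed"
    for threshold, action in sorted(policy.responses):
        if consumed >= threshold:
            response = action
    return {"slo": policy.slo_name, "sli": sli, "budget_consumed": consumed, "response": response}


# Illustrative policy and traffic numbers only.
policy = ErrorBudgetPolicy(
    slo_name="checkout availability",
    slo_target=0.999,
    responses=[
        (0.5, "Review recent incidents with the service owner"),
        (0.75, "Prioritize reliability work in the next sprint"),
        (1.0, "Pause feature releases until the budget recovers"),
    ],
)
report = error_budget_report(policy, good_events=999_400, total_events=1_000_000)
print(report)
```

Because the thresholds sit below 100% of the budget, a report like this flags risk while the SLO is still being met, which is the point of showing it to executives before customers are impacted.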
With these metrics and appeals to both the emotion and logic of your C-level executives, you’ll be able to convince them that investing in SRE is a strategic initiative that impacts the success of the entire company.
Written By: Lyon Wong, Christina Tan