You now have postmortems implemented, automated, and well-structured. You're generating reports and data automatically from all your incidents. Two levels of management have bought into your SRE efforts. That is a huge accomplishment! If you’re here, you're gaining real traction in adopting SRE best practices, but the battle is not won yet. The hardest but most strategic effort will be proving to your C-level executives why they should buy into SRE.
You’re moving from an incident-driven, reactive mode to a proactive mode. You're elevating reliability to help innovation and guide decisions across the software lifecycle. You’ve assigned a metric to it and are upholding it. You’re looking ahead to the next phase.
This phase revolves around well-defined SLOs and SLIs that hook into the right parts of the system. You'll need your business teams to agree on the SLO, the error budget thresholds, and what will happen if a threshold is breached. To propose this, keep two key thoughts in mind.
These incentives won’t be what you see day-in and day-out from SRE. Instead, these will be the ones that your C-level is most excited by.
Of course, you can expect resistance towards adoption, even with these high-level incentives.
What this will come down to is company priorities. C-level executives might not see the link between business performance and reliability. This is because incentives are often aligned toward new product innovation. So, it may be difficult to convince them that SRE should be a company-level priority. But, by leveraging both an emotional and logical appeal, you can succeed.
Here, we lean on customer impact. Everyone at the C-level cares about customer happiness. Satisfied customers cultivate pride, while dissatisfied customers create fear.
Additionally, there is a significant financial aspect. Without SRE, reliability failures reach customers directly and show up as SLA penalties. That can be very expensive and damaging to the brand and to customer trust. If the reliability issues are too disruptive to overlook, customers may churn. Data you collect on the cost of downtime can indicate how reliability affects your brand value.
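As a rough illustration of how cost-of-downtime data might be totaled for an executive audience, here is a minimal sketch. The `Incident` fields, the revenue-per-minute figure, and the SLA credit amounts are all hypothetical; real numbers would come from your own incident and billing data, and this simple model leaves out harder-to-measure costs such as churn.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    minutes_down: float        # customer-facing downtime, in minutes
    revenue_per_minute: float  # hypothetical average revenue at risk per minute
    sla_credits: float = 0.0   # hypothetical credits or refunds owed under the SLA


def downtime_cost(incident: Incident) -> float:
    """Direct cost of one incident: lost revenue plus SLA credits."""
    return incident.minutes_down * incident.revenue_per_minute + incident.sla_credits


def total_downtime_cost(incidents: list[Incident]) -> float:
    """Sum the direct cost across a list of incidents."""
    return sum(downtime_cost(i) for i in incidents)


# Made-up example data for one quarter.
incidents = [
    Incident(minutes_down=30, revenue_per_minute=200.0, sla_credits=1_000.0),
    Incident(minutes_down=12, revenue_per_minute=200.0),
]

print(f"Estimated direct cost of downtime: ${total_downtime_cost(incidents):,.2f}")
```

Even a simple total like this gives the financial argument a concrete starting point.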
To avoid triggering an SLA breach, you'll need to adopt SLOs. These often act as a safety net, letting you know when you’re in danger before you need to start sounding the alarms. To prove to C-levels that SLOs are crucial, you can do two things.
Need for a competitive advantage: When you offer services similar to your competitors', you look like the less viable option if a competitor can respond to, recover from, and prevent incidents better than you can. SLOs are an important lever for understanding your product and customer experience so you can stay ahead of the competition.
Show your executives the SLO metrics and explain how they are set to optimize the performance of the most important paths in the user’s journey. Consider bringing figures on the amount of data and the number of access points in the cloud, and on the number of services the company depends on. This shows the need for a system that can adapt to the added complexity of moving toward cloud and microservices.
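To make SLO metrics concrete, here is a minimal sketch of how an SLI and the remaining error budget could be computed per user journey from request counts. The journey names, request counts, targets, and policy thresholds below are hypothetical; the actual thresholds and actions are whatever your business and engineering teams negotiate.

```python
from dataclasses import dataclass


@dataclass
class JourneySLO:
    name: str
    slo_target: float    # e.g. 0.999 means 99.9% of requests must be good (must be < 1)
    total_requests: int  # requests observed in the SLO window
    good_requests: int   # requests that met the success criteria

    @property
    def sli(self) -> float:
        """Measured service level: fraction of good requests."""
        return self.good_requests / self.total_requests

    @property
    def budget_remaining(self) -> float:
        """Fraction of the error budget left (negative means overspent)."""
        allowed_bad = (1 - self.slo_target) * self.total_requests
        actual_bad = self.total_requests - self.good_requests
        return 1 - actual_bad / allowed_bad


def policy_action(budget_remaining: float) -> str:
    """Hypothetical error budget policy; thresholds are illustrative only."""
    if budget_remaining <= 0:
        return "freeze feature releases and prioritize reliability work"
    if budget_remaining < 0.25:
        return "slow releases and review risky changes"
    return "ship as planned"


# Made-up numbers for two critical journeys.
checkout = JourneySLO("checkout", slo_target=0.999, total_requests=1_000_000, good_requests=999_600)
login = JourneySLO("login", slo_target=0.995, total_requests=200_000, good_requests=198_900)

for journey in (checkout, login):
    print(
        f"{journey.name}: SLI={journey.sli:.4%}, "
        f"budget remaining={journey.budget_remaining:.0%}, "
        f"action: {policy_action(journey.budget_remaining)}"
    )
```

A summary like this shows at a glance which journeys are within budget and which have already overspent it, which is usually easier to discuss with executives than raw availability percentages.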
Proactive is always better than reactive: SLOs and error budgets help us move from a reactive mode (knowing that incidents will occur but not where or why) to a proactive mode of anticipating areas of risk and failure. Error budgets with terms negotiated between the business and engineering teams let teams respond in the right way by standardizing actions and protocols.
To prove this, you’ll need two metrics:
With these metrics and appeals to both the emotion and logic of your C-level executives, you’ll be able to convince them that investing in SRE is a strategic initiative that impacts the success of the entire company.
Co-founder and CEO