After getting managerial approval for incident management, your SRE buy-in program is well underway. How can you prove that it's effective, and that adopting more best practices is necessary? In part 2 of this blog series, we're going to share how to convince a VP or director to invest in additional SRE practices to strategically improve business results: automated metrics and continuous learning.
Your team has implemented incident management and can react to incidents and resolve them faster than ever. However, you aren’t learning as much from these incidents as you could be. Manually trying to figure out what to measure (let alone how to do so) is extremely time-consuming, and you need to find a better way to report data so your team can stay focused on learning, improving, and innovating. This requires buy-in at the VP or director level. Before approaching this conversation, you need to take two considerations into account.
First, you need to understand that this is a big undertaking for your VP/director. They will need to get support from the entire engineering and DevOps team, as well as members of the product team, in order for this initiative to succeed. Thus, it’s important that your appeal takes their point of view and goals into account.
Second, you need to define what continuous learning looks like. We loosely define it as a collection of capabilities that drive shared context and focus. This includes but is not limited to automated and aggregated data measurement (MTTR, customers impacted, etc.), standardized means of reporting, dashboards, postmortems/post-incident reviews, and team training.
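As a rough illustration of what "automated and aggregated data measurement" could mean in practice, here is a minimal Python sketch that computes MTTR and total customers impacted from a list of incident records. The `Incident` shape, its field names, and the sample values are assumptions made up for this example; your incident tooling will expose its own data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    # Hypothetical record shape; field names are illustrative only.
    started_at: datetime
    resolved_at: datetime
    customers_impacted: int


def mean_time_to_resolve(incidents: List[Incident]) -> timedelta:
    """Average of (resolved_at - started_at) across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.resolved_at - i.started_at for i in incidents), timedelta())
    return total / len(incidents)


def total_customers_impacted(incidents: List[Incident]) -> int:
    """Sum of customers impacted across all incidents."""
    return sum(i.customers_impacted for i in incidents)


# Made-up sample data, purely for demonstration.
sample_incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30), 120),
    Incident(datetime(2024, 1, 5, 14, 0), datetime(2024, 1, 5, 15, 30), 40),
]
mttr = mean_time_to_resolve(sample_incidents)
impacted = total_customers_impacted(sample_incidents)
```

Once numbers like these are produced automatically for every incident, they can feed a dashboard or a standardized report instead of being assembled by hand.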
Now that you’ve got the basics of your proposal, it’s time to articulate the incentives.
There are several major incentives for automating metrics and continuous learning. These are framed in a way that your VP or director will care about.
Even with these incentives laid out, resistance is still likely. Here are some common rebuttals you should be prepared for.
Capturing crucial knowledge in postmortems, and making it actionable, allows you to codify that knowledge, train new engineers, and get them up to speed faster.
If you’ve just adopted incident management best practices, your VP or director might say that’s good enough. They could also argue that reliability isn’t a pressing issue at the moment and that new features are a higher priority; in other words, the incentives favor immediate-term over longer-term goals. Another likely objection is that postmortems are too varied, hard to review, and usually one-and-done: after being completed, they are filed away and forgotten. Many might not be completed at all, as they take too much time to write and don’t command the same urgency as resolving incidents or shipping new product features. Though these concerns seem difficult to counter, looking at them in terms of both emotional and logical appeal lets you present the reasons why adopting continuous learning and automation is necessary, alongside metrics to prove it.
From an emotional perspective, to connect with VPs and directors, it will be important to illustrate ‘hair on fire’ moments that prove that reliability is a pressing issue. You can begin by addressing team stress. Engineering teams are dealing with incidents, but it’s a continuous battle. Without the ability to aggregate system, incident, and postmortem data and see patterns, the true extent of any reliability issue stays hidden. This can frustrate engineering teams, who get bogged down by manual or repetitive work, eventually leading to burnout and churn.
If engineering stress levels aren't a concern for your VP or director (although they always should be!), you can speak to customer satisfaction. Reliability is now the most important feature. Shipping new features is relatively easy, but reliability is the net sum of all the features you've already shipped. If any shipped feature is unreliable, the value of all the other features is moot. So the sum matters more than any single new feature you're about to ship. If customers are unhappy with the sum, they will leave for a competitor that delivers a better, more consistent experience.
These appeals are important, but you need data to back them up. To prove to your VP or director that automation efforts are worthwhile, quantify the number of incidents, bugs, and regressions caused by new feature work and the additional time needed to fix them. How many fires and on-call incidents are generated, and how do these correlate to feature and project work? How much money and how many resources go into new features that can’t hold up to customer standards? The numbers will likely surprise them.
Let’s focus on the argument that incident management is already enough. Here, it’s crucial to point out that SRE is not just about incident recovery, but a way to maximize learning from the patterns of incidents and see the whole picture so that the same issues don’t repeat themselves. Without continuous learning efforts, you're not improving and the situation will worsen over time. You can look at a past set of incidents, do postmortems on those, and see what the correlation is. It’s likely that many incidents could have been prevented if you were able to automate metrics and track patterns from them.
Once you have the chance to do a trial run on automation and continuous learning, you’ll need to prove the effectiveness of the initiative. To do this, collect company-wide metrics on all incidents. Show the time saved with automated reporting on incidents and the rate of follow-up action items being completed. Additionally, you should measure new hire errors and issues, both with and without key learnings captured from patterns of incidents from the postmortem metadata.
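To make these measurements concrete, the sketch below shows one way the follow-up completion rate and the reporting time saved might be calculated. The function names, the `done` field, and the sample numbers are hypothetical; substitute whatever your ticketing and incident tools actually record.

```python
from typing import Dict, List


def completion_rate(action_items: List[Dict]) -> float:
    """Fraction of follow-up action items marked done (0.0 if there are none)."""
    if not action_items:
        return 0.0
    done = sum(1 for item in action_items if item["done"])
    return done / len(action_items)


def reporting_hours_saved(
    manual_hours_per_incident: float,
    automated_hours_per_incident: float,
    incident_count: int,
) -> float:
    """Total reporting hours saved across incidents by automating the report."""
    return (manual_hours_per_incident - automated_hours_per_incident) * incident_count


# Made-up sample data, purely for demonstration.
sample_items = [
    {"title": "Add alert on queue depth", "done": True},
    {"title": "Document failover runbook", "done": True},
    {"title": "Raise connection pool limit", "done": True},
    {"title": "Add retry budget", "done": False},
]
rate = completion_rate(sample_items)
saved = reporting_hours_saved(2.0, 0.5, 10)
```

Tracking the same two numbers before and after the trial gives you a simple before/after comparison to bring to the conversation.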
These two metrics alone should demonstrate the need for automation and continuous learning. Postmortems are varied, yet patterns lurk under the surface and your team needs the ability to uncover them. However, there is still one common resistance to address.
Postmortems are too varied. Experienced teams often have a gut instinct that incidents are related, but until a formal process to aggregate and use the data is in place, the metrics needed to drive organizational change are time-consuming to produce. Postmortems are generally written in a freeform structure, which makes them difficult to go back and analyze.
To combat this, you can suggest tooling that automates the creation of postmortems so that writing them is less time-consuming. From a reporting perspective, you can create a metadata schema (e.g., services impacted, customers impacted, contributing factors) to show underlying patterns. Map the metadata from postmortems to feature and project work to show correlation.
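A metadata schema like this can be very small. The following Python sketch assumes a hypothetical `PostmortemMetadata` record with the three fields mentioned above plus an optional `related_project` link to feature or project work; all names and sample values are illustrative, not a prescribed format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PostmortemMetadata:
    # Hypothetical schema; adapt the fields to your own postmortem template.
    incident_id: str
    services_impacted: List[str]
    customers_impacted: int
    contributing_factors: List[str]
    related_project: Optional[str] = None  # assumed link to feature/project work


def factor_counts(postmortems: List[PostmortemMetadata]) -> Counter:
    """Count how often each contributing factor appears across postmortems."""
    return Counter(f for pm in postmortems for f in pm.contributing_factors)


def incidents_by_project(postmortems: List[PostmortemMetadata]) -> Counter:
    """Count incidents per related project, skipping postmortems with no link."""
    return Counter(pm.related_project for pm in postmortems if pm.related_project)


# Made-up sample data, purely for demonstration.
sample_postmortems = [
    PostmortemMetadata("INC-1", ["checkout"], 120, ["config change", "missing alert"], "new-cart"),
    PostmortemMetadata("INC-2", ["search"], 40, ["config change"], "new-cart"),
    PostmortemMetadata("INC-3", ["checkout"], 15, ["capacity"], None),
]
factors = factor_counts(sample_postmortems)
projects = incidents_by_project(sample_postmortems)
```

With structured fields in place of freeform text, questions such as "which contributing factor shows up most often?" or "which project generated the most incidents?" become simple counts rather than a manual review of old documents.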
With these metrics in hand, you’ll be prepared to voice your proposal for automated metrics and continuous learning with your VP or director. However, there’s one more level of leadership that will need to be convinced to facilitate SRE adoption. Stay tuned for part 3, where we discuss how to get buy-in from your CEO or CTO.
Until then, feel free to check out these posts: