Getting SRE Buy-in from a Manager or Lead for Incident Response, Part 1

Adopting SRE best practices can be difficult, especially when you need approval from managers, VPs, CTOs, and everything in between. In this blog series, we will walk you through how to come up with a winning pitch for each level of leadership to ensure that SRE buy-in will succeed in your organization. Let’s start at the beginning with your team lead or manager.

The situation

As one of the first steps towards SRE adoption, incident management is key. You have come to the realization you want to implement an effective incident management system within your team and now it’s time to convince your lead/manager. How will you accomplish this?

First, recognize that your manager will need substantial support from the engineering and DevOps teams for this transition, because those teams will need to be trained in the incident management system and use it consistently when incidents occur.

Second, you need to define what you mean by incident management. For the purpose of this blog post, we will define incident management as the process of assembling responders, investigating, resolving, and learning from incidents. This includes incident response playbooks, measuring time to detection, monitoring systems, and ticketing workflow.

Once you have a handle on the basic proposal you plan to bring to the table, it’s time to think about what the team (your manager included) will gain from implementing an incident management system.

The incentives

There are four main incentives that will motivate your team to adopt incident management best practices:

  • Incident management best practices restore your systems to working order as quickly as possible when an incident occurs.
  • A playbook gives everyone a sense of control amidst the chaos, defining a set of repeatable practices to drive consistency while helping everyone to be thorough with their problem-solving.
  • Measuring time to resolution (TTR) and time to detection (TTD) gives the manager a baseline and a way to quantify the team’s improvement on both metrics over time.
  • Integration with alerting and ticketing systems reduces time wasted on context switching between different apps during an incident, and reduces the stress from mentally keeping track of multiple systems.
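To make the TTD and TTR incentive concrete, here is a minimal sketch of how the two metrics could be computed from incident timestamps. It assumes TTD runs from the start of the incident to its detection and TTR from the start of the incident to its resolution; your team may define them differently. The `Incident` record, its field names, and the sample timestamps are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative, not from any specific tool.
    started_at: datetime   # when the problem began
    detected_at: datetime  # when an alert fired or someone noticed
    resolved_at: datetime  # when service was restored

    def ttd(self) -> timedelta:
        # Assumed definition: time from incident start to detection.
        return self.detected_at - self.started_at

    def ttr(self) -> timedelta:
        # Assumed definition: time from incident start to resolution.
        return self.resolved_at - self.started_at


def mean_minutes(durations: list[timedelta]) -> float:
    """Average a list of durations, expressed in minutes."""
    return mean(d.total_seconds() / 60 for d in durations)


# Made-up sample incidents, for illustration only.
incidents = [
    Incident(datetime(2024, 1, 5, 2, 0), datetime(2024, 1, 5, 2, 20), datetime(2024, 1, 5, 3, 30)),
    Incident(datetime(2024, 2, 9, 14, 0), datetime(2024, 2, 9, 14, 10), datetime(2024, 2, 9, 14, 50)),
]

mean_ttd = mean_minutes([i.ttd() for i in incidents])
mean_ttr = mean_minutes([i.ttr() for i in incidents])
print(f"mean TTD: {mean_ttd:.1f} min, mean TTR: {mean_ttr:.1f} min")
```

Even a small script like this, run over the incidents your team already has on record, can give your manager a starting baseline to compare against later.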

However, simply explaining these incentives to your manager and hoping for immediate support will not guarantee buy-in. You need to anticipate the resistance your manager will have towards this big change.

The resistance

Your manager might say, “Our current process is manual but good enough.” OSAGE syndrome, or “Our Systems Are Good Enough,” can be difficult to overcome. The system that’s in place has been running for a long time, so it’ll be up to you to change your manager’s mind and convince them that it’s time for something better than “just okay.”

To make this argument, you’ll need to rely on both a factual, logical appeal and an emotional one. There is no single right answer, since every organization, team, and manager is different, but there are some topics your manager might connect with better than others.

Here, you’ll have to empathize and put yourself in your manager’s shoes. What would motivate you?

The emotional appeal

If you were responsible for a whole team and a major incident occurred, what would your first emotion be? Most likely, you would be afraid. While a culture of fear is not what you want when adopting SRE, fear can help spur the adoption of important best practices. After all, if new processes can reduce your manager’s fear by establishing safeguards and preparedness, that will appeal to them.

One of the major sources of fear is loss of control. When an incident occurs, current manual processes fail. And with the move to microservices, it can be hard to understand where the incident originated, and what path is best for mitigation. Rollbacks are an option, but they don’t solve the underlying problem. Your manager will be held accountable for the service returning to normal efficiency and answering why this happened in the first place.

This responsibility is a considerable challenge. With a better incident management system in place, your service can be up and functioning much sooner. And with automated runbooks, the incident can be resolved with minimal chaos. Faster and more consistent incident resolution can help your manager regain some control.

Another source of fear is losing your team. If your teammates are waking up at 2:00 AM regularly with no end in sight, morale will be low. Additionally, manual processes are toilsome and stressful. The team wants to see the process getting better and less stressful over time, not worse as the number of services increases. Increasing operational complexity is inevitable, but if that results in more incidents and unplanned work, that will lead to burnout as well as unhealthy team culture. People will eventually begin searching for other employment options if these issues are not resolved. When headcount drops drastically and turnover rates soar, your manager will be pressured to keep the ship sailing while drowning in the labor-intensive process of backfilling, hiring, and onboarding new engineers. This cycle is not sustainable, and is probably enough to keep your manager up at night.

The logical appeal

This is where you’ll need to really tackle OSAGE syndrome. When your manager says, “the current process is manual but good enough,” ask them whether all of the process’s consequences are really intended. Are the repeated 2:00 AM calls intentional? Are the incidents and potential customer issues occurring at a frequency that puts customer trust at risk? If the consequences are not intended, or customer trust is at risk, then your system is not good enough.

That being said, it’s important not to blame your manager for these struggles. After all, some of these issues are beyond their control. Systems have become more complex, and the bar has been raised. Instead of pointing fingers, it’s time to lay on some more logic. For this, you’ll need two things that give your manager metrics to promote adoption within your team and company-wide:

  1. You’ll need to create a service catalog for the number of services/microservices you have and their dependencies, to show how these have grown and will continue to grow.
  2. During the proof-of-concept phase for the new incident management system, you’ll need to track the trends of TTD and TTR. If there are positive results, then you can justify rolling out the system and process changes to more teams at the company.
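As a rough illustration of the first item, a service catalog can start as a simple mapping from each service to the services it calls directly. The sketch below uses made-up service names and a hypothetical helper, `transitive_dependencies`, to count how many other services each one ultimately relies on. Recording these counts over time is one possible way to show how operational complexity has grown.

```python
# Hypothetical catalog: service name -> services it calls directly.
# The service names are invented for illustration.
catalog = {
    "checkout": ["payments", "inventory"],
    "payments": ["auth"],
    "inventory": ["auth"],
    "auth": [],
}


def transitive_dependencies(service: str, catalog: dict[str, list[str]]) -> set[str]:
    """Return every service reachable from `service` via direct or indirect calls."""
    seen: set[str] = set()
    stack = list(catalog.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:  # the `seen` set also guards against dependency cycles
            seen.add(dep)
            stack.extend(catalog.get(dep, []))
    return seen


# Number of direct and indirect dependencies per service.
summary = {name: len(transitive_dependencies(name, catalog)) for name in catalog}
for name, count in sorted(summary.items()):
    print(f"{name}: {count} dependencies")
```

A snapshot like this, taken each quarter, pairs naturally with the TTD and TTR trends from the proof of concept when you make the case to your manager.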

Armed with emotional and logical appeals, you can confidently approach your team lead or manager and have a discussion with them about improving your incident management system. This is a great first step towards SRE adoption, but you can’t stop here: you’ll reach a local maximum that falls short in the long term. You’ll need to think about how to frame SRE adoption for the next level of leadership in order to gain the SRE buy-in you need.

Watch for our next post in this series on getting buy-in from someone at the VP or director level.

About the Author
Lyon Wong

Co-founder and COO
