Getting Buy-in from Management on Reliability Investments

Emily Arnott | 2.1.2024

If you’re reading the Blameless blog, you probably have a good idea of how important reliability is to your customers’ happiness, your business’s bottom line, and your overall sanity. Unfortunately, this perspective is frequently downplayed by management. Even if they understand the importance of reliability, they often see it as something that should emerge automatically from having the right mindset, and not something that requires investment.

Convincing management to make investments in reliability essentially requires three successive arguments:

  • Having good reliability for your services is crucial to business success
  • Making investments into your reliability systems is essential to making things more reliable; you can’t just will better reliability into existence
  • These investments will pay for themselves in short order

Let’s break down each of these arguments to uncover what might convince your management to start down the journey to great reliability.

Having good reliability is crucial to business success

SRE and good reliability deliver business value in several ways. The most direct value you can recognize is through incident management: when you have an incident, you lose money; the shorter and less frequent incidents are, the more money you save. Another fairly direct connection is with preventing burnout. Rehiring and retraining to replace engineers who churn is a major cost, so the value of anything that keeps them happier, more productive, and staying at the organization longer should be obvious.

There are more subtle, but no less impactful, opportunities for business value to be created. For example, you’ll have better insights into customer happiness: what service aspects are most crucial to their retention, what expectations they have for those aspects, etc. This will let you be more strategic with where you spend your efforts: not overspending on unimportant areas, and responding impactfully when customers are dissatisfied.

Having good reliability requires investment

When management understands the value of reliability, there’s often a temptation to hope it “just happens” rather than actually invest resources into improving it. There is some logic to this. There are many impactful changes you can make to improve reliability that don’t require additional spending. These include perspective shifts in culture, DIY implementation of practices like writing retrospectives or runbooks, or reconfiguring on-call shifts based on workload tracking.

As helpful as these free changes can be, they are often a double-edged sword. Asking engineers to tackle these changes on top of their regular workload will just lead to resentment and engineers doing the bare minimum. Instead, these practices need to be made an explicit part of their workload, with other responsibilities and timelines adjusted to accommodate them. In order to get the most from these practices, you may also need to spend money on tooling, training, or even hiring.

Investing in reliability pays for itself

So we’ve shown that there’s business value in having good reliability, and that improving reliability requires some level of investment, whether in the form of readjusted workloads or money spent. The final, and most important, part of this equation is proving that the benefits of this investment outweigh the costs.

The first thing to consider is that the costs of the investment are relatively minor. For example, investing in tooling can be a major force multiplier. Tools won’t typically cost anywhere near as much as an engineer’s yearly salary, but can provide major time and effort savings for every engineer.

The same force multiplier applies to scheduling time. Setting aside dedicated time in engineers’ schedules for implementing reliability practices may appear to delay other work. But scheduling even one weekly hour-long meeting to review incidents, for example, is likely to quickly reduce the time spent on recurring incidents by more than an hour a week.
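To make that trade-off concrete, here is a minimal back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function names, the number of attendees, the hours currently lost to recurring incidents, and the reduction the review achieves are hypothetical inputs you would replace with your own tracking data. It also counts the meeting cost per attendee, which is a stricter test than comparing against a single hour.

```python
def weekly_net_hours(review_hours, attendees, recurring_incident_hours, reduction):
    """Net engineer-hours saved per week by a recurring incident review.

    review_hours: length of the weekly review meeting, in hours
    attendees: number of engineers in the meeting
    recurring_incident_hours: team hours currently spent per week on recurring incidents
    reduction: fraction (0.0 to 1.0) of that time the review eliminates (an assumption)
    """
    meeting_cost = review_hours * attendees
    hours_saved = recurring_incident_hours * reduction
    return hours_saved - meeting_cost


def break_even_reduction(review_hours, attendees, recurring_incident_hours):
    """Fraction of recurring-incident time the review must eliminate to pay for itself."""
    if recurring_incident_hours <= 0:
        raise ValueError("recurring_incident_hours must be positive")
    return (review_hours * attendees) / recurring_incident_hours


# Hypothetical team: a 1-hour review with 5 engineers, 20 team hours a week
# lost to recurring incidents, and a review that halves that time.
print(weekly_net_hours(1, 5, 20, 0.5))   # 5.0
print(break_even_reduction(1, 5, 20))    # 0.25
```

In this made-up scenario the review pays for itself once it removes a quarter of the recurring incident work. The useful number to bring to management is the break-even fraction for your own team, since it shows how little improvement is needed to justify the scheduled time.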

The second thing to consider is that the benefits are likely much greater than you estimate. For example, you may have some estimate of how much money your organization loses during an outage. But there are likely other costs downstream that you aren’t considering. Check out our incident impact calculator to see our estimate for the full depths of incident costs, including brand damage, burnout, opportunity costs, and more. Even a small reduction in incident frequency or length translates into major savings when all these costs are considered.
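As a rough illustration of how these costs add up, the Python sketch below totals a year of incident costs. It is not the incident impact calculator mentioned above; the `IncidentCosts` fields, the formula, and every figure are hypothetical placeholders, and the downstream costs (brand damage, burnout, opportunity costs) are lumped into one assumed per-incident amount.

```python
from dataclasses import dataclass


@dataclass
class IncidentCosts:
    """Hypothetical cost inputs; replace every value with your own estimates."""
    lost_revenue_per_hour: float         # direct revenue lost while the service is down
    responders: int                      # engineers pulled into a typical incident
    responder_hourly_cost: float         # loaded hourly cost per responder
    downstream_cost_per_incident: float  # brand damage, burnout, opportunity cost (lumped)


def annual_incident_cost(costs, incidents_per_year, avg_hours_per_incident):
    """Estimated yearly cost: hourly costs scale with duration, downstream costs per incident."""
    hourly = costs.lost_revenue_per_hour + costs.responders * costs.responder_hourly_cost
    per_incident = hourly * avg_hours_per_incident + costs.downstream_cost_per_incident
    return per_incident * incidents_per_year


costs = IncidentCosts(
    lost_revenue_per_hour=10_000.0,
    responders=4,
    responder_hourly_cost=100.0,
    downstream_cost_per_incident=5_000.0,
)

baseline = annual_incident_cost(costs, incidents_per_year=24, avg_hours_per_incident=2.0)
improved = annual_incident_cost(costs, incidents_per_year=22, avg_hours_per_incident=2.0)
print(baseline - improved)  # 51600.0
```

With these invented inputs, preventing just two incidents a year is worth 51,600 in whatever currency you plug in. The point of the exercise is the shape of the formula: once downstream costs are included, each avoided incident is worth more than the lost-revenue figure alone suggests.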

Taking this holistic perspective is necessary for other benefits too. The value of having a better understanding of customer happiness is difficult to quantify, but is profoundly impactful. Try to notice all the times when you have to guess whether customers would churn under certain conditions, or which service area ought to take priority. How helpful would it be to have a specific number you could cite with confidence to make these decisions?

Make your first investment Blameless

By automating SRE practices like retrospective building and incident pattern tracking, speeding up incidents with role-based checklists, and getting a handle on customer contentment with SLOs, Blameless is your all-in-one reliability solution. Check out a demo today to see it all in action!
