The blameless blog

The Iceberg of Engineering Incident Costs

Aaron Lober

I've long been fascinated with the metaphor of an iceberg to describe a problem whose true magnitude is obscured beneath the surface. If you're not familiar with the phenomenon: when water freezes, it decreases in density, which allows solid ice to float atop the water with only a small fraction exposed. In fact, icebergs hold nearly 90% of their mass hidden below the waterline. What looks innocuous from above is more than capable of sinking your ship if not approached with care.

You don’t need to be an amateur sailor or a climate scientist to understand the metaphor. Thanks to the work of one visionary director and two Academy Award-winning actors, an iceberg’s ability to ruin an otherwise pleasant voyage may be one of the most well-understood stories in our culture today. So it should surprise no one that folks from nearly every discipline imaginable have adapted the iceberg metaphor to describe something within their domain of expertise.

So, at the risk of sounding cliché, and while standing on the shoulders of the giants who came before me, I am going to share with you The Iceberg of Engineering Incident Costs.

The tip of the iceberg – the most obvious costs of incidents

The costs of engineering incidents that sit above the waterline tend to be obvious and well understood. Reactions to them fall on a spectrum from “that is something we should probably avoid” to “dear God, steer us away from that thing before it kills us!”

For consumer brands or B2B companies running a product-led growth (PLG) strategy, engineering incidents often result in lost revenue. Amazon makes over $14,900 in revenue every second; Google generates about $5,700 a second. A single high-severity incident that interrupts business for 20 minutes could cost these companies millions, or even tens of millions, of dollars. The history of catastrophic software incidents is an interesting blog post in and of itself, and you can check my last blog exploring the cost of incidents here.
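This back-of-the-envelope math is simple enough to sketch in a few lines. The per-second revenue figures are the estimates quoted above, and the 20-minute outage duration is a hypothetical example:

```python
# Back-of-the-envelope outage cost: revenue per second * outage duration.
# The revenue figures are rough estimates, not official company numbers.
def outage_cost(revenue_per_second: float, outage_minutes: float) -> float:
    """Estimated revenue lost during an outage of the given length."""
    return revenue_per_second * outage_minutes * 60

amazon_loss = outage_cost(14_900, 20)  # 20-minute high-severity outage
google_loss = outage_cost(5_700, 20)
print(f"Amazon: ${amazon_loss:,.0f}")  # Amazon: $17,880,000
print(f"Google: ${google_loss:,.0f}")  # Google: $6,840,000
```

Even at a fraction of these companies' scale, minutes of downtime translate directly into a dollar figure worth tracking.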

For folks in the B2B space, the downtime = lost revenue formula isn’t always as cut and dried, but it’s still not difficult to calculate. Violating an SLA is a surefire way to get customers to withhold payment or cancel a contract, and it may incur additional penalties for your organization. For any company executing a PLG strategy, poor incident management is also a surefire way to damage your flywheel, either by preventing new users from gaining access or by creating a bad experience for existing users.
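To make the SLA math concrete, here is a minimal sketch of the monthly downtime "budget" implied by an uptime commitment. The 99.9% target and the 60-minute incident are hypothetical examples, not any particular vendor's terms:

```python
# Monthly downtime budget implied by an uptime SLA. A single incident
# longer than this budget puts you in breach. The 99.9% target below
# is a hypothetical example.
def allowed_downtime_minutes(sla_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime per month permitted by the given uptime SLA."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - sla_percent / 100)

budget = allowed_downtime_minutes(99.9)
print(f"Downtime budget: {budget:.1f} min/month")  # ~43.2 minutes
incident_minutes = 60
print("SLA violated" if incident_minutes > budget else "Within SLA")
```

A "three nines" SLA leaves roughly 43 minutes of slack per month, which a single mishandled incident can easily consume.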

What lies just beneath the surface – unexpected costs of incidents

As I said, lost revenue and hampered customer acquisition are big problems, but they’re also the most obvious ones. Once we get beneath the surface, we reveal a wonderland of new concerns to keep us up at night.

An example I find myself talking about often is LinkedIn. Nearly 141 million people log in to the LinkedIn platform every single day, roughly 16% of its entire user base. These are LinkedIn’s most important customers, and they do more to bring new people to the platform and drive monetization than any other group, save advertisers themselves. So what do these folks do if the LinkedIn feed goes down tomorrow?

I’ll bet anyone a nickel that they’ll go straight to Twitter (or whatever they’re calling it now) and complain about it. Engineering incidents like this have the potential to do real brand damage that can cost businesses dearly. Whenever this example comes up, I find myself wondering: if job posting alerts on LinkedIn stopped working for a day, how many new accounts would get created on Indeed? What is the true cost of alienating your super users and putting wind at your competitors’ backs?

There is another great hidden cost of engineering incidents that we need to discuss at this level: the impact on your engineering team of putting out the same fire over and over again. It’s easy to get swept up in the flashiness of a Sev0 incident and forget that for every catastrophic failure, your team is fixing tens, maybe hundreds, of bugs on a regular basis. Don’t get me wrong, no one likes getting paged at 3am because the website is down and the company is hemorrhaging money. That is terribly stressful. It’s also far less common than something unexpected breaking during a code deployment.

It is unfortunately common for engineering teams to confront the same low-grade failures over and over again as a product grows in complexity. This frustrates the hell out of engineers who care about the impact of their work and don’t like having their time wasted. Let’s not forget that this time also has a non-trivial value to the company itself. When organizations don’t take the retrospective and remediation process seriously, engineering hours get wasted, engineers burn out and churn, rehiring and retraining rack up massive costs, and everybody loses.

Traveling into the depths – the most hidden costs of incidents

Let’s travel even deeper and explore some of the most hidden, and most substantial, incident costs. I already introduced the idea of engineering frustration and lost engineering hours. Those hours have a dollar figure associated with them, and job satisfaction can be surveyed easily enough. But what about the impact of slowing innovation? Poor, reactive incident response strategies force more engineering hours to be spent on unplanned work. That comes at the cost of new product development, and that stagnation creates opportunities for your competitors to challenge your product advantage.

This is an anxiety that’s probably most keenly felt by leaders in B2B SaaS companies with competitors challenging for market dominance. Consider spaces like construction tech, where companies like Procore, Autodesk, and Oracle are battling for market supremacy. A lost day of engineering focus may not cost you the market, but a month or a quarter could easily cost one of these competitors the edge. The competition is stiff and the margin for error is small. That reality ultimately led Procore to seek out a partnership with Blameless several years ago. You can watch a case study video detailing their experience here.

Companies like Procore have recognized that empowering their engineering teams to approach incident management proactively can lessen the burden of incident management and increase their productivity. Improved engineering output can, in turn, be a source of competitive advantage for companies large and small. That advantage can ultimately be the difference between becoming a multi-billion dollar business and a historical footnote.

In conclusion

If you’re interested in exploring the real cost of incidents for your business, check out our ROI calculator linked here. It can help you build a detailed model of incident impact based on your business, and you can even test the impact of specific variables like MTTx metrics, number of incidents, and many others.
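In the same spirit as that calculator, a simplified incident-cost model might combine downtime revenue loss with the engineering hours burned on response. All inputs below are hypothetical placeholders, not the actual formula behind Blameless's ROI calculator:

```python
# Simplified annual incident cost: downtime revenue loss plus the cost
# of engineers pulled into response. All parameters are hypothetical
# inputs, not the actual ROI calculator formula.
def annual_incident_cost(
    incidents_per_year: int,
    mttr_minutes: float,        # mean time to resolve per incident (an MTTx metric)
    revenue_per_minute: float,  # revenue lost per minute of downtime
    responders: int,            # engineers pulled into each incident
    loaded_hourly_rate: float,  # fully loaded engineering cost per hour
) -> float:
    downtime_loss = incidents_per_year * mttr_minutes * revenue_per_minute
    eng_cost = incidents_per_year * (mttr_minutes / 60) * responders * loaded_hourly_rate
    return downtime_loss + eng_cost

# Example: 120 incidents/year, 45-min MTTR, $500/min revenue, 3 responders at $150/hr
print(f"${annual_incident_cost(120, 45, 500, 3, 150):,.0f}")  # $2,740,500
```

Even this toy model shows why shaving minutes off MTTR compounds: every input above scales the total linearly, so a proactive process that cuts incident count or resolution time pays for itself quickly.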

As always, if you’d like to speak with a member of our team or see the Blameless platform in action, contact us and we’ll schedule a personalized demo.

"I have less anxiety being on-call now. It’s great knowing comms, tasks, etc. are pre-configured in Blameless. Just the fact that I know there’s an automated process, roles are clear, I just need to follow the instructions and I’m covered. That’s very helpful."
Jean Clermont, Sr. Program Manager, Flatiron
"I love the Blameless product name. When you have an incident, "Blameless" serves as a great reminder to not blame anything or anyone (not even yourself) and just focus on the incident resolving itself."
Lili Cosic, Sr. Software Engineer, Hashicorp
Read their stories