I recently joined Blameless after a fruitful 5 ½ year stint at Box building and supporting production backend systems. How did 15 years of work in backend infrastructure lead me to Blameless? Why is it important to make infrastructure reliable and easier to manage blamelessly? I will explain the problems and the opportunity I see.
The Chaotic World We are Entering
With the ubiquity of services like Amazon S3, the surface area of cloud services is expanding at a breakneck pace in our increasingly connected world. Dependencies between devices and services are becoming more complicated and opaque. It’s easy to overlook even minor details and affect millions of people across many products and services. The consequences can range from an annoying interruption while binge-watching Stranger Things to the life-threatening risks of a buggy update to a self-driving car.
With “The Cloud” came managed services. We are in the process of saying farewell to building and managing our own object storage systems, messaging systems, database systems, and so on. The processes needed to interface with external service providers are much more involved than those for homegrown services. As companies move to managed services for storage, databases, log aggregation, metrics visualization, on-call management and so on, incident management and root cause analysis get really complicated and time consuming.
That, and the “blame game” gets dirtier.
My Personal Experience with Chaos and Blame
I have dealt with the pains of multi-provider outages firsthand. In most cases, the entire incident management process was largely manual and involved way too many people. One particular outage lasted almost 12 hours due to these inefficiencies and a misunderstanding of who caused the problem (a.k.a. the blame game). It turned out that the root cause was in a peripheral service within the service provider. The only way to mitigate was to redirect traffic away from the problem service. And this was just a problem with our product integrating with a single cloud service. Imagine the complexity of an issue across many providers. Simply having visibility into your own infrastructure is not going to be sufficient in a world of cloud integrations.
Entering Chaos with Grace
Managing such complexities is no easy feat. How can we make living in this chaotic world easier? Here are some mentality shifts that I believe can make our transition more graceful.
Mentality Shift #1: Software doesn’t “just work”, errors and incidents are the norm
“How many minutes of outage can your company tolerate?” Most executives will answer, “Zero.” In fact, many companies expect their software to “just work” all of the time. Errors and incidents should be anomalies, not the norm. But, are they?
In the world of “cheap, fast or good, pick two”, the software industry tends to prefer cheap and fast. This means that most software gets to “good enough”, and then teams move on to something else.
Software development is all about change, and with change comes instability. Bugs are created not just from fast feature development, but also from updates of the complex downstream dependencies. How these changes affect a system can be very complex and unpredictable.
Failure is the norm. We just have to accept it. Your internet carrier cannot deliver 100% uptime, so it’s not worthwhile for you to chase 100% either. When you set an SLO (e.g. 99% availability), you automatically embrace failure through an error budget (e.g. 1% downtime). Google, Facebook, and some of the best tech companies have embraced failure as the norm (even outside of software, as seen with hard drives with RAID), leading to phenomenal results like 99.999% availability.
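To make the SLO and error budget relationship concrete, here is a minimal sketch of the arithmetic. The function name and the 30-day window are my own assumptions for illustration, not a standard API; pick whatever window your team actually measures against.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO allows over a window.

    `slo` is a fraction (0.99 means 99% availability). The 30-day
    default window is an assumption made for this example.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    # The error budget is simply the share of the window the SLO does not cover.
    return (1.0 - slo) * total_minutes


for slo in (0.99, 0.999, 0.99999):
    budget = error_budget_minutes(slo)
    print(f"{slo:.3%} availability -> {budget:.2f} minutes of budget per 30 days")
```

Under these assumptions, 99% availability leaves roughly 432 minutes (a bit over 7 hours) of acceptable downtime per 30 days, while 99.999% leaves well under a minute.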
Mentality Shift #2: Blame is counterproductive, focus on resolution and learning
In many companies, incident management has turned into a game of hot potato. Our natural human tendency is to blame, but blaming actually discourages resolution, because whoever fixes the problem could be blamed for creating it in the first place.
Also, blame doesn’t fix incidents, but it does add stress and hinder creative thinking. More often than not, everyone involved in an incident has done their best. The root cause can often be systematically fixed via process changes rather than punishment of individuals.
In any incident, the focus should be on minimizing the impact, time-to-root-cause, and time-to-resolve. Learning comes after incident resolution. In a blameless postmortem, ask “what action items or automations will make the next incident easier to handle?”
Mentality Shift #3: Treat reliability not as a burden, but as a feature
Product managers want the development team to complete the features, fast. Operations teams want the development team to maintain reliability, seemingly slowing down development. The dev team is stuck in the middle.
But what if reliability were a quantifiable metric? Then it becomes a feature just like any other on a PM’s priority list. Site reliability should be a core feature of every modern product. Realizing this requires all parts of the business to make informed trade-offs between new work and reliability and technical risk. If customers are asking for a new feature, it should only be pursued if the constituent parts have the support to maintain the expected levels of reliability.
Site reliability engineering (SRE) is like hygiene. You may not like brushing your teeth, but if you don’t do it regularly, the dentist will have a big and expensive mess to clean up.
The Opportunity – How it Comes Together
How do we maintain order in this world, while still making progress on the products and systems we love to create? Here is a minimum list of things needed to set ourselves up for better practices and more efficient operations:
- Have a comprehensive understanding of internal and external dependencies: Without knowing how your infrastructure components relate to each other and external services, the incident management process will always be a manual treasure hunt without a map. In an ideal world, the tooling will draw the map for the incident commander.
- Get a firm handle on how things are changing: If you don’t know what is changing and when, you will have a hard time maintaining stability.
- Capture and aggregate incident metrics: Automating the incident management process has two main advantages: it alleviates the pain of manually constructing the postmortem documentation, and it provides data to be used for planning, resourcing and product decisions.
- Define realistic SLOs and error budgets: Properly defined, these metrics are a good way to determine the health of your services. I say realistic, because I have seen aggressive SLOs cause noise and lax SLOs result in pointless alerts. In either case, messing this up generally causes angry engineers.
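As a rough illustration of the first item, here is a minimal sketch of what “drawing the map” could look like: given a record of which service depends on which, find everything that could be affected when one component fails. The service names and the `impacted_by` helper are hypothetical, made up for this example; a real map would be discovered from your infrastructure rather than written by hand.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it depends on.
# "object-storage" and "message-queue" stand in for external managed services.
DEPENDS_ON = {
    "web-app": ["auth-service", "upload-service"],
    "upload-service": ["object-storage", "message-queue"],
    "auth-service": ["user-db"],
    "object-storage": [],
    "message-queue": [],
    "user-db": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency up to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk upward from the failed component.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted


print(sorted(impacted_by("object-storage", DEPENDS_ON)))
# prints ['upload-service', 'web-app']
```

Even a toy map like this answers the first question an incident commander asks: if this provider is down, which of our services should we expect to be hurting?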
Ideally, someone can build a platform that provides the above features. The platform would not only aggregate site reliability data, but also integrate it with other business metrics, so that engineers as well as product, finance and business development leadership can make more informed decisions.
To go even further, if one platform became ubiquitous, it could integrate with each provider’s infrastructure and be used as a bridge between disparate providers in our increasingly connected world. Imagine being able to diagnose and mitigate a problem between three providers without having to join a multi-day bridge. Hell, imagine being able to infer the root cause and only involve a single on-call engineer, or even trigger an automated mitigation.
The hope is that, with the right data, tools and processes, we can incentivize people to move away from a culture of blame to one that focuses on the facts and on doing the right things: getting to the root cause during issues, getting teams more operational support when they are on fire, and empowering teams to replace old crusty systems.
This is the opportunity I see with Blameless.