Two-day shipping guarantees. Five nines of uptime. Average page load times under half a second. Tech companies are shifting from talking about what they can do to how reliably they can do it, because in a digital world, people want things now. While speed of innovation is important, consumers place their trust in services that “just work.”
You’ve likely heard about the principles of SRE, and the benefits of implementation: better incident management, continuous learning, faster development velocity, and more.
But how do you actually begin to implement a solution to improve software reliability? If you could wave a wand and have the ideal reliability solution drop into your lap, what would you want it to look like? Investing in such a solution and making it a fundamental part of your tech stack is a major commitment, so you should be confident that the solution will solve your individual pain points.
Whether you’re a manager looking to overhaul processes or an engineer curious about tooling options, we want to help guide you into the era of reliability. We’ll look at a few questions that will help you envision your reliability solution, and then discuss how you can implement it.
How will you understand service health?
Before attempting to prevent incidents, you need to understand where an incident can happen. This means understanding how your services look while healthy, and how they look when things go wrong. This sounds obvious, but you might be surprised at how obscure the most vital information about the health of your services is, especially as services splinter into more and more microservices. It’s easy enough to log incidents, but not so easy to turn a long list into a diagnosis. In other words, it’s difficult but incredibly important to find the signal in the noise.
You need something that makes these patterns obvious, as your time should be spent solving problems, not scouring huge spreadsheets looking for connections. Solutions that provide multifaceted insights are especially helpful, as they serve stakeholders at every level of technical involvement: engineers get the ‘speeds and feeds’ level of detail, while management sees those details connected back to business impact.
There shouldn’t be any gaps in these reports, so you’ll need to gather data from across your entire service registry. Look for solutions that “play nice” with all the different data sources across your environment. Your solution should help you consolidate this data into the most impactful metrics, helping you prioritize future development and potentially prevent incidents that would have previously gone unnoticed until too late.
How will you respond when something goes wrong?
On-call and alerting systems are important investments, but in the era of SRE, they’re only the first step in making your system more resilient. In a complex world of distributed systems, you need a solution that empowers you to respond faster and more consistently, and that also facilitates true learning and prevention.
Key requirements to look for include:
Codified, role-based responses: Designate people to lead different facets of the response, and arm them with automation, to drive coordination, minimize cognitive load, and improve load-balancing.
Consistent playbooks: Establish response playbooks that give people the confidence to proceed, but leave them with agency to creatively solve problems.
Comprehensive timelines: Log the conversations, the decisions made, and the timing of each event (alerting, responding, resolving, etc.) to aid in learning.
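To make the timeline requirement above more concrete, here is a minimal sketch in Python of what a logged incident timeline could look like. The class names, event kinds, people, and timestamps are all hypothetical and chosen for illustration; a real solution would typically capture these events automatically from your chat and alerting tools rather than by hand.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class TimelineEvent:
    at: datetime   # when the event happened
    kind: str      # hypothetical kinds: "alerted", "responded", "decision", "resolved"
    author: str    # person or system that produced the event
    note: str      # the conversation snippet or decision being logged


@dataclass
class IncidentTimeline:
    title: str
    events: list = field(default_factory=list)

    def record(self, at, kind, author, note):
        """Append an event and keep the timeline in chronological order."""
        self.events.append(TimelineEvent(at, kind, author, note))
        self.events.sort(key=lambda e: e.at)

    def elapsed(self, start_kind, end_kind):
        """Time between the first start_kind event and the first end_kind event.

        Returns None if either kind of event has not been recorded.
        """
        start = next((e.at for e in self.events if e.kind == start_kind), None)
        end = next((e.at for e in self.events if e.kind == end_kind), None)
        if start is None or end is None:
            return None
        return end - start


# Illustrative data only: an invented incident with invented responders.
t0 = datetime(2024, 1, 1, 9, 0)
timeline = IncidentTimeline("Checkout latency spike")
timeline.record(t0, "alerted", "pager", "p95 latency above threshold")
timeline.record(t0 + timedelta(minutes=4), "responded", "alice", "Acknowledged; taking incident lead")
timeline.record(t0 + timedelta(minutes=20), "decision", "bob", "Roll back the latest deploy")
timeline.record(t0 + timedelta(minutes=35), "resolved", "alice", "Latency back to normal")

time_to_respond = timeline.elapsed("alerted", "responded")
time_to_resolve = timeline.elapsed("alerted", "resolved")
print("Time to respond:", time_to_respond)
print("Time to resolve:", time_to_resolve)
```

Keeping the who, what, and when of each event in one structure means response times fall out of the data, and the same record can later feed the postmortem.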
Most importantly, you want these steps to be easy and baked into your process. If the response procedure feels like tedious overhead, engineers will skip it entirely. If you ask your team, you’ll likely find that toil is discouraging people from logging incidents or writing postmortems, allowing important pieces of information to slip through the cracks. To make your solution easy, make sure it integrates with the tools your team is already using. Every time an engineer swaps between applications, there’s a possible break in flow, which can be devastating to progress. You’ll want them to do as much as they can from a central place.
You’ll also want your solution to automate as much of the response process as possible. Engineers shouldn’t be distracted by having to remember how to log details or contact all the right people; instead, your solution should help prompt and automate these tasks.
What will you learn from incidents?
Once an incident is resolved, you might be tempted to pat each other’s backs, go for a well-deserved drink, and try to forget it ever happened. However, incidents are an opportunity to learn, grow, and prevent similar incidents from happening in the future.
You’re probably already collecting some incident data in various ad-hoc ways. And it’s true, making a quick note in a Confluence doc or adding more comments in your code is much better than nothing, and can be a lifesaver in a repeat incident. But to truly learn from incidents, you’ll want a robust solution that can gather and present the entire narrative of the incident, from discovery all the way to mitigation.
A postmortem (or RCA, incident retrospective, or post-incident report) document should do all of the following:
Answer key questions: Why the incident occurred, what impact it had, and how it fits into larger patterns.
Present information gathered during resolution efforts in ways accessible to each and every stakeholder.
Give an opportunity for those involved to further discuss the incident.
Help track immediate follow-up actions and persist across longer efforts to fix larger issues.
Most importantly, your solution should create this postmortem document as easily as possible, limiting the manual toil of collecting data so more time can be spent on meaningful analysis. And just as with your incident response process, the best intentions are moot if people can’t be bothered to actually do it. Integrations and automated prompts smooth and speed the process, leading to fewer gaps in reporting and less friction in getting postmortems done in the first place.
How will you ensure things go better next time?
A key component of SRE is facing reality: there will always be another incident. And another, and another after that. What we can do is work to ensure that each time, it’ll go a little better. To prepare for what’s next, you’ll want a solution that helps you incorporate your learning into new practices and policies.
Using service level objectives and error budgeting helps you evaluate the tradeoffs between development speed and reliability risks. The ideal solution would track these using the information you’re already generating through your incident response and postmortem processes.
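As a rough illustration of the arithmetic behind error budgeting, the sketch below assumes a simple availability SLO measured in minutes of downtime over a rolling window. The function names and the 99.9% over 30 days example are assumptions made for illustration; your own SLOs may be defined on latency, error rate, or something else entirely.

```python
def error_budget_minutes(slo_target, window_days):
    """Minutes of downtime allowed in the window for an availability target.

    slo_target is a fraction strictly between 0 and 1, e.g. 0.999 for 99.9%.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_target)


def budget_remaining(slo_target, window_days, downtime_minutes):
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


# Hypothetical example: a 99.9% availability target over a 30-day window
# allows about 43.2 minutes of downtime.
budget = error_budget_minutes(0.999, 30)
remaining = budget_remaining(0.999, 30, downtime_minutes=10.8)
print(f"Budget: {budget:.1f} minutes, remaining: {remaining:.0%}")
```

When the remaining budget is healthy, there is room to ship faster; when it approaches zero or goes negative, that is a signal to prioritize reliability work over new features.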
Your solution should be constantly building insights across past incidents. As we discussed earlier, consolidating data from all your services is key to diagnosing your services’ health. If you keep up this gathering and reporting, you’ll soon see patterns across time of where, and how often, things go wrong, and put a spotlight on opportunities to improve.
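The kind of cross-incident insight described above can be sketched with a few lines of Python. The incident records, field names, and service names here are invented for illustration; in practice this data would come from your incident response and postmortem tooling.

```python
from collections import Counter, defaultdict

# Hypothetical incident records, standing in for data exported from your tooling.
incidents = [
    {"service": "checkout", "cause": "deploy", "minutes": 35},
    {"service": "search", "cause": "capacity", "minutes": 12},
    {"service": "checkout", "cause": "deploy", "minutes": 50},
    {"service": "auth", "cause": "config", "minutes": 8},
    {"service": "checkout", "cause": "capacity", "minutes": 20},
]


def count_by(records, key):
    """Count how many incidents share each value of the given field."""
    return Counter(r[key] for r in records)


def downtime_by_service(records):
    """Sum incident duration in minutes for each service."""
    totals = defaultdict(int)
    for r in records:
        totals[r["service"]] += r["minutes"]
    return dict(totals)


by_service = count_by(incidents, "service")
by_cause = count_by(incidents, "cause")
downtime = downtime_by_service(incidents)

# Services with the most incidents come first.
for service, n in by_service.most_common():
    print(f"{service}: {n} incidents, {downtime[service]} minutes")
```

Even a summary this simple shows where incidents cluster and which causes recur, which is the starting point for deciding where to invest next.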
The goal is to become proactive instead of reactive. Rather than just treating each incident as an isolated event, unplanned and wholly outside of development, integrate what you’re seeing in these patterns across the entire software lifecycle. Your solution should help inform you of reliability repercussions, from the earliest development discussions all the way to the final production decisions.
Buy, Build, or Open Source?
Now that we have a good idea of what you want your reliability solution to do, there’s one final question: How will you implement it? You could evaluate different vendors and purchase an existing solution that meets your needs. Your engineering teams could build something entirely new for your purposes. Or you could turn to an open-source solution, where developers around the world freely collaborate to create a tool.
Let’s break down how these options compare in crucial categories:
As you can see, there are pros and cons to each approach. For each tool in your technology stack, and at different stages of your company’s growth, the best choice might vary. Typically, however, a good rule of thumb is you should “buy it, then blend it, but don’t build it unless you can really prove it’s necessary”.
The opportunity costs, slower time-to-value, and risks of building your own tool or customizing an open-source one are generally much greater than the cost of investing in a vendor who lives and breathes to deliver the solution you need. Ultimately, the important thing is to properly consider all the hidden costs and challenges of every option, and to get buy-in from all stakeholders before proceeding.
When we look at reliability software, we find that building tools is often only feasible for extremely large and mature tech-focused companies that can devote the permanent resources required for development and maintenance. These companies likely have plenty of custom-built infrastructure and skilled engineering talent, and they require unique integrations, which makes the tailor-made approach a feasible investment.
However, if your company is like most, it probably isn’t yet at that stage. For most organizations, the costs of building and maintaining tools quickly exceed the cost of buying them. Furthermore, when your own tool breaks, only you can fix it. The idea of an unreliable reliability solution is unthinkable, so why not let the vendor carry that responsibility?
An open-source solution, which is generally free and allows for full modification of the source code, can seem like an appealing middle ground. For small tools, like chatbots or chart generators, you may be able to find open-source projects that require little adapting, saving you time and money. However, open source can introduce risks of its own.
Before leaping to an open-source solution, ask yourself the following questions:
Is the solution scalable, performant, and compliant enough to meet your business’s needs?
How often are fixes deployed?
Is there a devoted community behind it, or is it a side project that could be suddenly abandoned?
If things break, is reliable support available?
How well does it integrate into the rest of your technology stack?
How much work will your team have to do to adapt it to your specific needs?
How much additional security work will you need to do to bring it up to your security standards?
Could the solution create additional data fragmentation? Does it simply automate tasks, or does it feed data back into the rest of your reliability system for complete context across postmortems, search/tagging, insights, and more?
In summary, you’ll want a solution that gives you insights into your services’ health, streamlines your incident response procedures, expands your capability to learn from incidents, and weaves that learning back through the software lifecycle — all without creating unnecessary overhead.
You’ll want to implement this solution without breaking the bank or draining limited budget and engineering resources. It may sound daunting, but in identifying what it is that you need, you’ve already taken a crucial first step.