By David Blank-Edelman
I've had the pleasure of talking to many organizations about their operations practices and how they think about the challenges they face around maintaining a production environment that adequately serves their business. I have yet to meet an organization that hasn't had to balance often conflicting needs around feature velocity and operational stability. Simply put, this is the classic "devs gotta make stuff (customers want features and functionality) and the ops gotta keep things running (systems that aren't up can't serve customers)" dichotomy that everyone wrestles with.
Two "movements" (for lack of a better word) sprang up in direct response to this challenge: DevOps and Site Reliability Engineering (SRE). The former is better known because it grew up in the more public sphere. The latter, until relatively recently, was cloistered in larger organizations as part of their efforts to scale operations beyond anything the planet had seen before. As a result, public understanding and adoption of DevOps is significantly further along than that of SRE. This post is an attempt to provide a very brief introduction to SRE and, hopefully, to suggest how it could relate to your existing DevOps practices.
Just like there's no one canonical definition for DevOps, I won't pretend to be able to give you the final word on SRE. You will hear a range of answers starting from "SRE is what happens when you ask a software engineer to design an operations team" to "Site reliability engineering (SRE) is the application of scripting and automation to IT operations tasks such as maintenance and support."
When I speak about SRE I usually describe it as an engineering discipline devoted to helping an organization achieve the appropriate level of reliability in its systems, services, and products. There are two crucial parts to that definition. First, SRE is specifically focused on reliability as a fundamental property (perhaps the fundamental property). The rationale behind this is pretty straightforward. You can expend a huge amount of effort and resources adding features and functionality to your service or product. Kerjillions of dollars and countless hours could be expended to create something incredibly rich in features and functionality. But if it is not up, if it is not available when your customers attempt to use it, it doesn't do them, your business, or your profits a lick of good.
The second, slightly more subtle part of my definition hangs on the word "appropriate" when speaking about level of reliability. An important observation made by the SRE world early on was that there are actually very few systems and services that have to be 100% reliable. In fact, there are very few situations where it is even desirable, because the cost and effort of achieving greater reliability almost always rise at a very steep rate. And as friends at Google are fond of pointing out, sometimes it's not even possible to hit certain levels of reliability. SRE seeks not only to acknowledge this gap between perfect reliability and desired reliability but, in many cases, to exploit it for the greater good of an organization's engineering priorities.
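To make that gap a little more concrete, here is a minimal sketch of the arithmetic, not a definitive implementation. It assumes reliability is measured as simple time-based availability over a 30-day window; the function name `allowed_downtime`, the 30-day window, and the example targets are my own illustrative choices. SRE practitioners commonly call the resulting allowance an "error budget."

```python
from datetime import timedelta


def allowed_downtime(target: float,
                     window: timedelta = timedelta(days=30)) -> timedelta:
    """Return the downtime an availability target leaves within a window.

    `target` is a fraction such as 0.999 (99.9%). Both the name and the
    default 30-day window are assumptions made for this illustration.
    """
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    # Whatever is not promised as "up" is the room left for failure.
    return window * (1.0 - target)


if __name__ == "__main__":
    # Each extra "nine" cuts the allowance by a factor of ten.
    for target in (0.99, 0.999, 0.9999):
        print(f"{target:.2%} target -> {allowed_downtime(target)} of downtime")
```

Under these assumptions a 99% target leaves roughly 7.2 hours per 30 days, 99.9% leaves about 43 minutes, and 99.99% leaves a little over 4 minutes. A 100% target leaves nothing at all, which is one way to see why it is so rarely the appropriate goal.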
Many (most?) of the challenges SRE was created to address were also the motivation for the formation of DevOps. I think of SRE and DevOps as parallel tracks, both attempting to solve the same problems. As a result, it is no coincidence that some of the same best practices are requirements for both. For example, both SRE and DevOps require you to bring automation to bear as a way of addressing scaling (and other) problems. Sound release engineering processes (including CI/CD) are required to create a manageable production environment. And everybody's favorite subjects, monitoring and observability, are core to both SRE and DevOps practices.
So does that make SRE and DevOps the same thing? Well, not so much. While there is an overlap in practices, the two do not share the same philosophy, attitude, or approach toward many of them, and the emphasis can often be different. Plus, in order for SRE to succeed en masse, crucial parts of the organization have to be willing to accept some of the values and priorities that permit SRE to properly operate. At the very least, there has to be buy-in in the right places around the value of reliability to the business, as discussed earlier.
It's really important to distinguish between dedicated SRE roles (people who call themselves site reliability engineers) and SRE practices. In many organizations, as they grow, there comes an inflection point where it becomes appropriate to hire people (and then form teams of them) whose expertise and primary focus is on reliability. It is important to note that when you have such people on the payroll, they are not the only people in the organization responsible for paying attention to reliability (everybody is responsible for constructing reliable software and infrastructure). SREs are the people who have a specialization that can be brought to bear on these challenges. In this regard there is a direct analogy to security. At a certain point, it makes sense for a business to hire people who focus primarily on security. They are not the only people in the organization paying attention to security (they'd better not be: security is everyone's responsibility), but they do serve a crucial role with regard to that domain.
That's SRE roles, but what about SRE practices? Before that inflection point arrives, before you've hired dedicated SREs, it absolutely makes sense to introduce some of the more congruent SRE practices and SRE tools into your organization.
Often DevOps organizations are the perfect place to plant these seeds because they already have a culture that values modern operations practices (as mentioned in Whither DevOps above). An easy example is one near and dear to the people who run this blog: incident response and post-incident follow-up (the blameless postmortems from those incidents). Another might be the creation of service level objectives (more on this in a later piece). As these practices become popular, as the culture changes as a result, and as some individuals start to gravitate towards a reliability-centric viewpoint, it becomes natural to consider introducing SRE as a full-time role for these people.
SRE offers a set of principles, practices, and a particular focus. If you already have a DevOps culture and practices in place in your organization, there is no good reason to walk the halls ripping up people's business cards and handing out new titles. In fact, don't do that (that's a subject for a future blog post).
But if you do look at SRE and find its ability to deftly address the feature velocity vs. operational stability conundrum compelling, by all means, consider exploring those principles. Experiment with some of the practices and tooling that make adoption easier. See if it can offer you the same benefits enjoyed by the many organizations that have already walked this path. And be sure to let me know how it goes for you.
If you'd like to see how a platform can help you adopt SRE (and DevOps) best practices via tools, the Blameless team can show you how. Sign up for a trial at www.blameless.com.