In the world of technology, the stakes have never been higher. The move to the cloud and microservices in pursuit of agility has opened the door to digital disruptors and unprecedented competitive threats. As distributed systems grow more complex, the scale of ‘unknown unknowns’ grows with them. On top of this, customer expectations are sky-high. The cost of downtime is catastrophic, and customers are quick to churn if their needs are not promptly met. According to Gartner, the average cost of downtime is $300,000 per hour. For some companies, this number is considerably higher; for example, Amazon lost approximately $90 million during its 2018 Prime Day outage, which lasted only 75 minutes.
Organizations need to prioritize reliability so they can innovate as quickly as possible on top of a strong foundation that won’t compromise customer experience. This will become even more critical as more businesses move toward distributed systems with high reliability requirements.
That’s where site reliability engineering (SRE) comes in. The SRE function is growing quickly (30-70% YoY growth in job listings), but there is not enough skilled talent in the market to meet demand. In other words, it’s important to understand not just how to hire SREs, but how to grow your existing organization to adopt the practices and mindsets required for production excellence. With the shortage of SREs for hire, what can you do to ensure your service’s reliability?
To answer this question, you’ll need a deeper understanding of what SRE actually is.
SRE is a practice first coined by Google in 2003 that seeks to create systems and services that are reliable enough to satisfy customer expectations. Since then, many large organizations such as LinkedIn and Netflix have adopted SRE best practices.
In recent years, SRE has become more widely adopted by many organizations globally, with the goal of reliability and resilience in mind in light of exponentially growing customer expectations as well as systems complexity.
SRE is based on a customer-first mentality. This means that SRE efforts are all tied to customer satisfaction, even if the customers using the service are actually internal users. Each decision should result in protecting or improving customer satisfaction.
Teams work together to determine which factors and experiences affect customer happiness, measure them, set goals, and balance reliability requirements with the innovation velocity required to stay viable in an increasingly competitive digital landscape.
To achieve this goal, SREs and teams that have adopted SRE best practices refer to several key tenets of SRE.
According to Google, these include:
According to Forrester, 46% of the tenets can be applied out-of-the-box for most software teams in the enterprise, but the rest require customizations or won’t make sense for the vast majority of organizations. The important question to ask yourself is how these tenets fit in with what you’re already doing, and how your teams can improve. We’ve got more answers below.
A common early mistake in adopting SRE best practices is assuming that following SRE best practices means you’ll need to rip and replace your current procedures, which simply isn’t true. In fact, SRE can work as a complement to both DevOps and ITIL methodologies. The trick is to ensure that regardless of your organizations’ different operating models or toolchains, there is shared visibility, communication, and collaboration across teams.
This will allow your disparate teams to stay aligned while using the best practices from each methodology.
Think of SRE as the practice that brings the DevOps philosophy to life. The core principles of DevOps and SRE are nearly identical.
According to Google’s course on SRE, “class SRE implements DevOps,” the five DevOps principles are as follows:
- Reduce organizational silos
- Accept failure as normal
- Implement gradual change
- Leverage tooling and automation
- Measure everything
In practice, ITIL and SRE can also make for a great combination. The first reason why is simple: every organization wants happy customers, and ITIL and SRE can help different functions work together to make that a reality. Embedding reliability throughout the software lifecycle can ensure a higher rate of customer happiness.
With the newest revision of ITIL (ITIL 4), which introduces seven guiding principles, SRE and ITIL align even more closely.
Whether you identify as a DevOps or ITIL shop, your organization has something to gain by following the principles of SRE.
Let’s dive into what exactly these principles entail.
Resiliency isn’t something that just happens; it’s a result of dedication and hard work. To reach your optimal state of resilience, there are some crucial SRE best practices you should adopt to strengthen your processes.
As you know, failure is not an option… because actually, it’s inevitable. Things will go wrong, especially with growing systems complexity and reliance on third-party service providers. You’ll need to be prepared to make the right decisions fast. There’s nothing worse than being called in the wee hours of a Sunday morning to handle a situation where thousands of dollars are going down the drain every second. Your brain is foggy, and you’ll likely need time to adjust to the extreme pressure of a critical incident. In these cases (and really, all cases where an incident is involved), incident playbooks can help guide you through the process and maximize the use of time.
According to Chris Taylor at Taksati Consulting, good incident playbooks help you cover all your bases. They typically include flowcharts and checklists to depict both the big picture and the minute details, a RACI (responsible, accountable, consulted, informed) chart for each step, and a list of environmental influences that are unique to your system.
To create your incident playbook, Chris recommends aggregating the following information:
By developing incident playbooks and practicing running through them, you’ll be more prepared for the inevitable.
Change management is often done haphazardly, if at all. This means that organizations are unable to manage the risk of pushing new code, possibly leading to more incidents. Rather than employ ITIL’s arduous CAB method, SRE seeks to empower teams to push code according to their own schedule while still managing risk. To do this, SRE uses SLOs and error budgets.
SLOs, or service level objectives, are internal goals for service availability and speed, set according to customer needs. These SLOs serve as a benchmark for safety. Each month, you have a certain allowable amount of downtime determined by your SLO. You can spend this downtime to push new features. If a feature is at risk of exceeding your error budget, it cannot be pushed until the next window; if it poses little or no risk to your SLO, you can push it.
Each month, teams should aspire to use the entirety of their error budgets without exceeding them. This way, your organization can optimize for innovation, but do so safely, without risking unacceptable levels of customer impact.
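As a rough sketch of how this gating works in practice (the function names, and the simplification of expressing a change’s risk as estimated downtime minutes, are illustrative assumptions rather than a standard API):

```python
def monthly_error_budget_minutes(slo: float, hours_in_month: float = 730.5) -> float:
    """Downtime allowed per month before the SLO is breached.

    730.5 hours is the length of an average month (365.25 days * 24 / 12).
    """
    return hours_in_month * 60 * (1 - slo)

def can_push(estimated_risk_minutes: float, budget_remaining_minutes: float) -> bool:
    """A push is allowed only if its estimated downtime risk fits in
    what is left of this month's error budget."""
    return estimated_risk_minutes <= budget_remaining_minutes

budget = monthly_error_budget_minutes(0.999)  # ~43.83 minutes for a 99.9% SLO
remaining = budget - 30.0                     # suppose 30 minutes already spent
risky_feature_ok = can_push(20.0, remaining)  # False: would overspend the budget
safe_feature_ok = can_push(5.0, remaining)    # True: fits within what remains
```

The point of the sketch is that the decision to ship becomes a local, mechanical check against the budget rather than a negotiation with other teams.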
Black Friday outages, scaling, moving to the cloud: all of these big events require heightened capacity planning. If you don’t have enough load balancers on Black Friday or Cyber Monday, you might be sunk. Or, if your company is simply growing quickly, you’ll need to adopt best practices to make sure that your team has everything it needs to be successful. There are two types of demand that require additional capacity: organic demand (your organization’s natural growth) and inorganic demand (growth driven by a marketing campaign or a holiday). To prepare for these events, you’ll need to forecast the demand and plan time for acquisition.
Important facets of capacity planning include regular load testing and accurate provisioning. Regular load testing allows you to see how your system is operating under the average strain of daily users. As Google SRE Stephen Thorne writes, “It’s important to know that when you reach boundary conditions (such as CPU starvation or memory limits) things can go catastrophic, so sometimes it’s important to know where those limits are.” If your service is struggling to load balance, or the CPU usage is through the roof, you know that you’ll need to add capacity in the event of increased demand. That’s where provisioning comes in.
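A minimal load-testing sketch along these lines might look as follows. The service call is simulated here with a sleep; in a real load test you would time actual requests against a staging endpoint, and the request counts and percentile method are simplified assumptions:

```python
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request() -> float:
    """Stand-in for a real service call; returns observed latency in seconds.
    In a real load test this would time an HTTP request to a staging endpoint."""
    start = time.perf_counter()
    time.sleep(random.uniform(0.001, 0.01))  # simulated service work
    return time.perf_counter() - start

def load_test(requests: int = 200, concurrency: int = 20) -> dict:
    """Fire `requests` calls using `concurrency` workers and summarize latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda _: handle_request(), range(requests)))
    # Crude percentiles: index directly into the sorted sample.
    return {
        "p50": statistics.median(latencies),
        "p95": latencies[int(0.95 * len(latencies)) - 1],
        "p99": latencies[int(0.99 * len(latencies)) - 1],
    }
```

Running this periodically, and watching how the tail percentiles move as you raise the concurrency, gives you an early signal of where your boundary conditions sit before real demand finds them for you.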
Adding capacity in any form can be expensive, so knowing where you need additional resources is key. It’s important to routinely plan for inorganic demand so you have time to provision correctly. The process of adding capacity can be a lengthy effort, especially if it involves moving to the cloud. You’ll also need to know how many hands you’ll need on deck for these momentous occasions.
Capacity planning is an important part of having a resilient system because in thinking about the allocation of resources, your team members matter. They need time off for holidays, personal vacations, and the obligatory annual cold. When you fail to plan for time off, you won’t have enough hands on deck to handle incidents as they occur. Denying people time off is obviously not the answer, as that will only lead to burnout and churn. So it’s important to develop a capacity plan that can accommodate people being, well, people.
Johann Strasser shares four steps you can take to develop a capacity plan that will eliminate staffing insecurity:
So, now you’ve got the people and the process, but how can you learn and improve on your resilience? For that, you’ll need great postmortem practices in place that facilitate real introspection, psychological safety, and forward-looking accountability.
When something goes wrong, it’s important to learn from it to prevent the same mistake from happening again. To do this, it’s important to craft and analyze postmortems (or post-incident reviews, RCA reports, or whatever you like to call them). To have postmortems worthy of analysis, applying SRE best practices will be key. In fact, postmortems are a great place to begin your SRE adoption journey.
As Steve McGhee, SRE Leader at Google shares, “Conducting blameless postmortems will enable you to see gaps in your current monitoring as well as operational processes.”
Armed with better monitoring, you will find it easier and faster to detect, triage, and resolve incidents. More effective incident resolution will then free up time and mental bandwidth for more in-depth learning during postmortems, leading to even better monitoring.
Building a postmortem practice will eventually enable you to identify and tackle classes of issues, including fixing deeply rooted technical debt. With time, you’ll be able to directly improve systems continuously.
One of the most important elements of a postmortem, and of SRE as a whole, is the notion of blamelessness. To learn from postmortems, there needs to be total transparency. Opening up about mistakes can often be frightening, and requires a psychologically safe space to do so. Positive intent should always be assumed in order to foster the trust that allows for true openness. Blaming team members or defining people as the root cause for failure will only lead to more insecurity, covering up the important truths that postmortems are meant to uncover.
To craft great postmortems, there are four other best practices that will ensure your incidents are being used to their full advantage:
Creating incident playbooks, utilizing change management and capacity planning, and following postmortem best practices will all contribute to your system’s resilience, but that’s not all that SRE seeks to do.
Focusing on the customer has been a key business strategy since the beginning of time. But how do you really know what your customers want, and how can you guarantee you’re providing it? SRE’s concept of SLIs (service level indicators), SLOs (service level objectives), and error budgets will keep your organization aligned on what customer success looks like.
When you look at your product through the eyes of your user, you aren’t just finding the right SLIs, but creating key information for constructing a user journey. A user journey is a powerful tool for many aspects of product design as it helps designers focus on users’ priorities. The lessons you learn from developing and analyzing user journeys can be insightful in the most fundamental areas of product design, but for these insights to be accurate, the underlying data must be carefully selected.
The touch points between the user and your service all involve requests and responses – the building blocks of SLIs. For each touchpoint you identify, you should be able to break down the specific SLIs measuring that interaction. From there, you can follow each branch that the user could take, gathering the SLIs for the following requests into a bundle for that journey.
To understand user intent, you must identify potential pain points for the chosen journey. Your bundle of SLIs can be instrumental in finding pains that might otherwise be invisible.
Let’s say that a user’s journey involves making repeated requests to the same service component – like clicking through many pages of search results. Individually, these requests return quickly enough that users won’t be bothered, maybe under a second, and a user looking at just one or two pages will be satisfied with this speed. However, if your user journey involves looking through twenty pages, the annoyance of a nearly one-second wait, repeated twenty times, could be intolerable. Only by looking at relevant monitoring data and understanding the broader context could you discover this point of user frustration.
Finding these pain points along the user journey could lead to a radical redesign of the service as a whole. Additionally, it opens up a path to solutions deep in the backend and helps determine priorities for development. In our example above, you could either redesign the catalog to avoid the need to look through twenty pages, or you could optimize the components serving those pages until the total delay across the twenty pages is still acceptable.
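The arithmetic behind the paging example is worth making explicit. Assuming a hypothetical 0.9-second latency per search page, each request individually clears a 1-second threshold, yet the journey view reveals the pain:

```python
PER_REQUEST_THRESHOLD_S = 1.0  # each page individually feels "fast enough"
page_latencies = [0.9] * 20    # hypothetical latencies for a 20-page journey

# Request-level SLI view: every single request counts as "good".
good = [t < PER_REQUEST_THRESHOLD_S for t in page_latencies]
per_request_sli = sum(good) / len(good)  # 1.0, i.e. 100% good requests

# Journey view: the same data adds up to 18 seconds of cumulative waiting.
total_wait_s = sum(page_latencies)
```

The per-request SLI reports perfect health while the user sits through 18 seconds of waiting overall, which is exactly the kind of frustration that only journey-level analysis surfaces.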
Once you identify what makes your customer happy, it’s important to set goals to reach them.
Service Level Objectives, or SLOs, are internal goals for the essential metrics of a service, such as uptime or response speed, that correlate to customer happiness.
As SLOs are always set to be more stringent than any external-facing agreements you have with your clients (SLAs), they provide a safety net to ensure that issues are addressed before the user experience becomes unacceptable. For example, you may have an agreement with your client that the service will be available 99% of the time each month. You could then set an internal SLO where alerts activate when availability dips below 99.9%. This provides you a significant time buffer to resolve the issue before violating the agreement:
Service Level Agreement with Clients: 99% availability – 7.31 hours acceptable downtime per month
Service Level Objective Internally: 99.9% availability – 43.83 minutes acceptable downtime per month
Safety Buffer: 6.57 hours
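These figures can be reproduced with a quick calculation, assuming an average month of 730.5 hours (365.25 days × 24 hours / 12 months):

```python
HOURS_PER_MONTH = 365.25 * 24 / 12  # 730.5 hours in an average month

def downtime_per_month_hours(availability: float) -> float:
    """Hours of downtime allowed per month at a given availability target."""
    return HOURS_PER_MONTH * (1 - availability)

sla_downtime = downtime_per_month_hours(0.99)   # ~7.31 hours (99% SLA)
slo_downtime = downtime_per_month_hours(0.999)  # ~43.83 minutes (99.9% SLO)
buffer_hours = sla_downtime - slo_downtime      # ~6.57 hours of safety buffer
```

The same function works for any target, which makes it easy to see how quickly the budget shrinks: each additional nine cuts the allowable downtime by a factor of ten.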
Knowing that you’ll have over six and a half hours between your internal objective and an agreement breach can provide some peace of mind as you deploy. However, it can be difficult to determine a buffer that provides sufficient time to respond when disruptions occur. Garrett Plasky, who previously led Evernote’s SRE team, describes this challenge:
“Setting an appropriate SLO is an art in and of itself, but ultimately you should endeavor to set a target that is above the point at which your users feel pain and also one that you can realistically meet (i.e. SLOs should not be aspirational).”
It may be tempting from a management perspective to set an SLO of 100%, but it just isn’t realistic. Development would be paralyzed by fear that the smallest change could trigger an SLO breach. Moreover, such a high target isn’t helpful. As Garrett points out, the SLO should still be set above the point where the users of the service are pained, as any refinement beyond that quickly gives diminishing returns for additional user satisfaction.
Setting SLOs can also positively impact development velocity by giving developers the opportunity to use small amounts of downtime to improve the service. This amount of time allowed is called an error budget.
Error budgets are the amount of downtime that can be spared per window before violating an SLO. Setting error budgets can positively impact your organization in many ways. First, it can increase the rate of innovation. Developers no longer need to spend time consulting with other teams before doing a code push, as long as the push won’t endanger the SLO and falls within the error budget. They can spend down the error budget on new features, or choose to allocate time instead to fixing technical debt or infrastructure. This also ensures that pushes don’t threaten the reliability of your system or customer satisfaction.
Beyond increasing innovation, error budgets also align different parts of the organization on incentives and consequences. With an error budget in place, developers can push code as fast as they need to without compromising reliability. Thus, developers, product, and production teams are all happy. If error budgets are overextended for a certain period of time, there are also consequences predetermined by the error budget policy, such as a code freeze.
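A sketch of how such predetermined consequences might be encoded, assuming a hypothetical policy with a code freeze at full budget consumption and heightened review at 75% (the thresholds and action names are illustrative, not a standard):

```python
def budget_consumed_fraction(downtime_minutes: float, budget_minutes: float) -> float:
    """What fraction of this window's error budget has been burned."""
    return downtime_minutes / budget_minutes

def policy_action(consumed: float) -> str:
    """Hypothetical error budget policy: consequences are agreed in advance,
    so no one has to negotiate them in the middle of a bad month."""
    if consumed >= 1.0:
        return "code freeze: reliability work only"
    if consumed >= 0.75:
        return "heightened review for risky changes"
    return "normal release cadence"
```

For example, `policy_action(budget_consumed_fraction(50.0, 43.83))` would trigger the code freeze, since more downtime has been spent than the budget allows.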
SRE not only helps customers stay happy, it also boosts morale within the organization.
Happy engineers mean happy customers: engineers won’t build the best products possible without support from the organization.
There are two major ways that SRE can help brighten engineering’s day.
Additionally, SREs invest in cultural change that prevents more tech debt from accruing in the future, while still making way for innovation. Jean Hsu, Co-Founder of Co Leadership, wrote about refactoring Medium's codebase, and realized that the most important thing she could do for her team wasn’t just to fix spaghetti code; it was to create a culture that fixes technical debt as it goes along, deleting old code as needed.
Jean wrote, “I realized that if I always did this type of work myself, I would be constantly refactoring, and the rest of the team would take away the lesson that I'd cleanup after them. Though I did enjoy it myself, I really wanted to foster a long-term culture where engineers felt pride and ownership over this type of work.”
SREs are often the cultural drivers for this sort of work, improving the way engineering teams function as a whole rather than simply going from project to project fixing bugs. These changes are long-term initiatives that spark growth and adoption of best practices for the entire organization.
As you can see, SRE can positively impact each engineer’s day-to-day productivity. In fact, SRE is not about tooling or job titles; rather, it is a more human-centric approach to systems as a whole.
With this context in mind, adoption brings positive business benefits for everyone in the organization.
Resiliency engineering as a practice looks at systems holistically, considering not only infrastructure but also human, process, and cultural factors. Without adopting the culture and mindset behind SRE, you’ll simply have new processes with no uniting value at the center to keep the initiative in place. Focusing on the human approach to systems requires reevaluating your organization’s attitude towards the following: on-call and full service ownership practices, keeping burnout at bay, and celebrating failure.
Any organization can adopt SRE best practices, and it can begin in small increments. The most important change you will make will be the cultural one. As organizations are made of people, any organization can foster continuous learning, blameless culture, and psychological safety so long as its people are committed to a growth mindset. Once these cultural factors are in place, it becomes much easier to implement the practices, processes, and tools that scale that culture of excellence.
To dive deeper and get more bonus reading material on the above topics, download your copy of The Essential Guide to SRE.