Whether you're just adopting SRE or optimizing your current processes, we can help. We’ll explain the 7 key principles of SRE and how to put them into practice.
So, what are the SRE principles? The seven fundamental SRE principles are:
- Embracing risk
- Service level objectives
- Eliminating toil
- Monitoring
- Automation
- Release engineering
- Simplicity
Why SRE principles are important
SRE is a method that operates through principles. Instead of prescribing specific solutions, it guides you with best practices. These SRE principles help organizations decide what's best for them. Once you understand the principles, you can apply them in many areas. When considering a new policy or procedure, you can judge it in the context of these principles.
All SRE principles align on one ultimate goal: customer satisfaction. By following these SRE core tenets, your efforts will make a positive impact on customers. It’s important to maintain this focus on business value.
SRE principles vs DevOps principles
SRE and DevOps both operate based on a set of principles. Both sets of principles drive alignment towards business goals. Some of their principles overlap. When comparing SRE vs DevOps, the biggest difference is that DevOps principles describe goals. SRE principles describe processes to achieve goals. In this sense, SRE best practices are a way of implementing DevOps principles.
The seven principles of SRE
These principles were laid out by Google in its book, Site Reliability Engineering.
Principle 1: embracing risk
Embracing risk means weighing the cost of improving reliability against its impact on customer satisfaction. No service can ever be 100% reliable. Customers accept this, and will only be unhappy if unreliability causes them pain. Increasing reliability past this pain point will likely go unnoticed by customers. This means further improvements won’t generate business value.
Improving reliability always comes at some cost, whether it be money, time, or energy. Embracing risk allows you to know when this cost is unnecessary and could be better spent. Not overspending on reliability allows you to increase development velocity.
There is also a cultural component to embracing risk. Individuals should feel psychologically safe in their work. When you’ve chosen to embrace the risk of accelerating development, your team should have security and agency to take advantage of the opportunity. Know that, in cases of failure, all teammates had the best intentions.
How to implement the principle of embracing risk
- Determine an acceptable level of reliability for customers. Look at use patterns and gather feedback to find these levels. This is the process of building SLIs and SLOs, which we’ll cover in the next section.
- Determine the cost of any improvements to reliability. These improvements could include setting up redundant servers, automation efforts, or allocating engineers to reliability projects. Consider both the financial and opportunity cost.
- Determine the risk of not implementing the improvement. Could the reliability of the service become unacceptable? How likely is it? If there is a setback, how much damage could it cause?
- Weigh the costs vs. the risk. Set standards for when your team embraces risk with error budgets. Make sure your team feels psychologically safe to fail and learn.
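One rough way to weigh cost against risk is to compare what an improvement costs with the expected loss it removes. The sketch below is only an illustration: the function names and every figure in the usage note are hypothetical, and real decisions would also factor in opportunity cost and customer trust.

```python
def expected_annual_loss(incidents_per_year: float, cost_per_incident: float) -> float:
    """Expected yearly loss: how often incidents happen times what each one costs."""
    return incidents_per_year * cost_per_incident


def worth_doing(improvement_cost: float, loss_before: float, loss_after: float) -> bool:
    """An improvement pays off when the loss it avoids exceeds what it costs."""
    return (loss_before - loss_after) > improvement_cost
```

For example, if a hypothetical improvement cuts incidents from four a year to one, at 20,000 per incident, it avoids 60,000 in expected loss. It would be worth doing at a cost of 50,000, but not at 70,000.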
Principle 2: service level objectives
Service level objectives, or SLOs, translate customer satisfaction into an internal goal. They help manage risk and budget for error. SLOs are based on service level indicators, or SLIs. An SLI is a set of metrics that represent what’s most important to your customers. By looking at how customers use your service, you can build SLIs that represent reliability better than any single metric. Teams do this by mapping distinct user journeys. Then they set SLOs for the most important steps.
SLOs are set to the point where unreliability causes customers pain. They should be stricter than any legal agreements you have with customers, such as SLAs. SLOs can serve as a safety net to ensure that SLAs aren’t breached.
SLOs leave room for an error budget. This is the amount of unreliability allowable within a timeframe. Whenever a failure or degradation affects the service, the error budget decreases. When the error budget is high, you can speed up development. When the error budget is on track to run out, you can hit the brakes or focus on reliability work.
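To make the error budget concrete, here is a minimal sketch of the arithmetic for a time-based availability SLO. The function names are our own, and the 99.9% target and 30-day window in the usage note are example values, not recommendations.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Total unreliability allowed in the window, e.g. slo_target=0.999."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return window * (1.0 - slo_target)


def remaining_budget(budget: timedelta, downtime_so_far: timedelta) -> timedelta:
    """Budget left after subtracting downtime already spent (may go negative)."""
    return budget - downtime_so_far
```

With a 99.9% target over 30 days, `error_budget(0.999, timedelta(days=30))` comes to roughly 43 minutes. A 10-minute outage would leave about 33 minutes for the rest of the window.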
How to implement the principle of service level objectives
- Build SLIs by looking at how customers use your services. Craft user journeys and consider what services are most essential at each step.
- Set your SLO at the customer’s pain point. For each SLI, determine where the customer would experience pain from unreliability.
- Ensure that your SLOs are monitorable. Get access to all the data you need to keep the SLO up-to-date. Make sure that anything that could affect the SLI is being represented.
- Set policies for your error budget. When the error budget runs low, figure out what you’ll do to prevent an SLO breach. When you have budget left, figure out how you’ll increase development efforts.
- Review and revise your SLIs and SLOs. As your service changes and grows, what’s important to customers will change too. Set a schedule to review your SLOs and make sure they’re still reflecting customer happiness.
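The steps above can be sketched for a request-based SLI. This is a simplified illustration: the 75% threshold and the action names are hypothetical placeholders for whatever error budget policy your team agrees on.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Share of requests served successfully; treat no traffic as fully available."""
    return 1.0 if total_requests == 0 else good_requests / total_requests


def budget_consumed(sli: float, slo_target: float) -> float:
    """Fraction of the error budget used: observed error rate / allowed error rate."""
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return (1.0 - sli) / allowed


def policy_action(consumed: float) -> str:
    """Map budget consumption to an action (thresholds are assumptions)."""
    if consumed >= 1.0:
        return "freeze releases"
    if consumed >= 0.75:
        return "prioritize reliability work"
    return "ship as normal"
```

For instance, 99,950 good requests out of 100,000 against a 99.9% SLO consumes about half the budget, so this example policy would keep shipping as normal.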
Principle 3: eliminating toil
Eliminating toil means reducing the amount of repetitive work a team must do. SRE advocates looking for ways to reduce toil in any area of work. By eliminating toil, you free up energy and time for other tasks. You also increase morale, as the team will be able to focus on more interesting work instead.
A common way of reducing toil is through automation, which we’ll cover later. You can also find steps in processes that are redundant or unnecessary. Weigh the risk of eliminating a task against the opportunities it opens up.
Teams can also eliminate toil by adding guides and processes for tasks. Having to remember what to do, or search for instructions, adds cognitive toil. By documenting tasks, teams can use their cognitive capacity for higher value work.
How to implement the principle of eliminating toil
- Create standards and templates for resources. Create guidelines for how to make guidelines, and invest in the tools needed to automate. This investment will enable you to remove toil more efficiently going forward.
- Look for areas of high toil. Look for common and time-consuming tasks. Even small optimizations in them will reduce toil over time.
- Prioritize improvements. Include toil elimination in sprints and plan time for regular improvements. This can also be a dedicated duty of an SRE.
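One simple way to find areas of high toil is to estimate how much time each recurring task takes per month and rank them. The sketch below assumes you track frequency and duration per task; the `Task` fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    runs_per_month: int
    minutes_per_run: float

    @property
    def monthly_minutes(self) -> float:
        """Total time this task costs the team each month."""
        return self.runs_per_month * self.minutes_per_run


def rank_by_toil(tasks: list[Task]) -> list[Task]:
    """Highest monthly time cost first, so the biggest wins surface at the top."""
    return sorted(tasks, key=lambda t: t.monthly_minutes, reverse=True)
```

A quick task done many times a month can outrank a long task done once, which is why frequency is part of the estimate.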
Principle 4: monitoring
Monitoring means looking at the meaningful and actionable data your system produces, and making decisions based on it. It may be tempting to log every bit of information you can get from your services. But too much data can be overwhelming. Additionally, metrics can be misleading. You need to go past shallow metrics to understand how customers see your system.
Monitoring tools are a good way to help separate signal from noise. They can help you consolidate a lot of information into fewer meaningful metrics. Custom and out-of-the-box dashboards allow you to see the most important information.
The metrics most commonly tracked for reliability are the four golden signals:
- Latency: the time it takes for a service to respond to a request
- Traffic: the amount of load a service is experiencing
- Error rate: how often requests to the service fail
- Saturation: how full the service is, meaning how much of its most constrained resources are in use and how soon they could run out
These are often the metrics you'll measure as components of your SLIs. By keeping an eye on these metrics, you can understand your customers’ happiness.
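As a rough sketch, three of the four golden signals can be derived from a window of request records, while saturation usually comes from a separate resource metric. The `Request` shape, the choice of the 95th percentile, and counting HTTP 5xx responses as errors are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int  # HTTP status code


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def golden_signals(requests: list[Request], window_seconds: float,
                   resource_utilization: float) -> dict[str, float]:
    """Summarize one time window; utilization (0 to 1) is supplied by the caller."""
    if not requests:
        raise ValueError("need at least one request in the window")
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p95_ms": percentile([r.latency_ms for r in requests], 95),
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": resource_utilization,
    }
```

In practice a monitoring tool would compute these for you. The point here is only to show what each signal measures.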
How to implement the principle of monitoring
- Make sure your service produces the metrics you need. Your services should generate a log of requests and information on how they were served.
- Consolidate these metrics into statistics. Monitoring tools are the most efficient way to achieve this.
- Build up deeper metrics. Bridge these simple metrics to what impacts customers. This is a similar process to building an SLI. Make sure these metrics are available to anyone who needs them.
- Connect your alerting tools to monitoring data. Monitoring data is one of the most common ways of detecting incidents. Have your monitoring systems trigger on-call alerts.
- Incorporate monitoring data into incident retrospectives. Incident retrospectives are documents built to summarize how teams resolved an incident. Include monitoring data from when the incident occurred as context.
- Look for patterns in data. Schedule time to review trends in your monitoring data. Make sure there aren’t any trends that threaten your system.
- Consider data when making decisions. Make policies to incorporate data into your strategic decision-making.
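Connecting alerting to monitoring data can be as simple as a threshold rule. The sketch below pages only when the error rate stays above a threshold for several consecutive samples, which helps avoid alerting on a single blip. The function name and parameters are hypothetical, and most monitoring tools offer an equivalent rule out of the box.

```python
def should_page(error_rates: list[float], threshold: float, consecutive: int) -> bool:
    """True when the most recent `consecutive` samples all exceed the threshold."""
    if consecutive <= 0:
        raise ValueError("consecutive must be positive")
    recent = error_rates[-consecutive:]
    return len(recent) == consecutive and all(rate > threshold for rate in recent)
```

Requiring a sustained breach trades a little detection speed for fewer false alarms. Tune both values to your SLO.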
Principle 5: automation
Automation means creating ways to complete repetitive tasks without human intervention. This helps free up teams for higher value work and codify processes. Automation increases the speed of completing many tasks, improving your development velocity.
Automation can help in many different areas of work, such as:
- Testing: Tools can simulate use of your services to find bugs and test how your system handles load
- Deployment: Automate tasks like creating new servers, reallocating load, and swapping over codebases
- Incident response: Automated runbooks can help teams respond to incidents faster
- Communication: Tools can spin up collaboration channels and log key events
How to implement the principle of automation
- Look for places to automate. Keep track of smaller tasks that your team does that are repetitive. These will be your low-hanging fruit to automate. From there, build out to more ambitious tasks.
- Invest in automation tools. Either buy or build automation tools. These will be able to interface with your system as if they were a human. Investing in tooling will pay off in the long run.
- Roll out automation with testing. A benefit of automation is the consistency of outcomes. You want to make sure it’s the outcome you want. Test your automated processes on a regular basis to ensure they’re still functioning.
- Keep optimizing. Just because something is automated doesn’t mean it can’t be improved. Look for optimizations that increase speed or decrease resources used.
- Develop with automation in mind. When building new services, think about how the code will interact with automation tools.
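As an illustration of an automated runbook, the sketch below runs a list of named steps in order, logs each outcome, and stops at the first failure so a human can take over. The step names in the usage note are hypothetical, and real steps would call your own tooling.

```python
from typing import Callable

# Each step is a name plus an action that returns True on success.
Step = tuple[str, Callable[[], bool]]


def run_runbook(steps: list[Step]) -> list[str]:
    """Run steps in order, logging each result; stop at the first failure."""
    log: list[str] = []
    for name, action in steps:
        try:
            ok = action()
        except Exception as exc:  # record the error instead of crashing the run
            log.append(f"{name}: error ({exc})")
            break
        log.append(f"{name}: {'ok' if ok else 'failed'}")
        if not ok:
            break
    return log
```

Because every step returns a clear result, the same runbook can be exercised in tests with stand-in steps such as `("restart service", lambda: True)` before it is trusted in production.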
Principle 6: release engineering
Release engineering means building and deploying software in a consistent, stable, repeatable way. It applies SRE principles to releasing software. Here are some of the qualities of good release engineering:
- Configuration management: Create a single, agreed-upon standard for how releases should be configured. Some releases may need changes, but these should be made by modifying the baseline configuration.
- Process documentation: Invest in creating guides for different types of releases. This will reduce the toil of determining what to do each time. It will also increase the reliability of releases. Schedule reviews of your processes to ensure they’re still up to date.
- Automation: Where possible, automate the release process. This removes the chance of releases being done in an inconsistent way, leading to more reliable deploys.
- Rapid deployment: Through automation and documentation, deployment becomes faster and easier. This allows you to deploy frequent, small releases. This is more reliable, as it reduces the chance of a major incident caused by a bad deployment.
- Testing: Implement a continuous testing process to catch errors as soon as possible. Use automated testing to achieve this frequency without much toil.
How to implement the principle of release engineering
- Decide on release standards. Collaborate to come up with standards for all releases. This can cover timelines, testing protocols, resources to have available, and more. Decide on policies for modifications to this plan when the need arises.
- Build release guides. Create guidelines for releasing code in a way that meets the release standards. These should walk someone through the release as if it were their first time doing it.
- Automate. Once you have a process, look for steps to automate. Steps shared between many release processes, such as spinning up a new server, are good targets.
- Review and revise. Monitor statistics about your releases. What types of releases generally take longest? Which tests consistently catch errors, and which may be unnecessary? Use questions like these to optimize your release processes.
Principle 7: simplicity
Simplicity means developing the least complex system that still performs as intended. The goals of simplicity and reliability go hand-in-hand. A simpler system is easier to monitor, repair, and improve.
SRE advocates a holistic, end-to-end approach to reliability. Teams can apply this perspective to simplicity, too. Consider the complexity of your system on the most micro and macro levels. Building models is a good way to understand the inner workings of your system. Look for unnecessary nodes to simplify. Each extraneous step represents time and energy that could be better spent elsewhere.
Systems trend towards complexity as new features are introduced. Consider the cost of additional complexity when proposing new features. Use customer satisfaction as your standard. Will these features contribute enough business value to offset the complexity?
How to implement the principle of simplicity
- Develop a shared understanding of complexity. Come up with metrics to evaluate the complexity of a system. These could include how long it takes someone to make a change, or how many other systems it interacts with.
- Model systems to find areas of unnecessary complexity. Map out how your systems work. Look for nodes and connections that are unnecessary. Evaluate the risk of removing them versus the time saved.
- Evaluate development with simplicity in mind. When designing new features, judge their business value against the complexity they would add. Set standards for how much complexity they can add to the system in design.
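One way to model a system, as described above, is a dependency graph: each component maps to the components it calls. The sketch below flags components that no entry point reaches and counts each component's outgoing dependencies as a crude complexity metric. The component names in the usage note are hypothetical, and an unreachable node is only a candidate for removal, not proof that it is safe to delete.

```python
def reachable(graph: dict[str, list[str]], entry_points: list[str]) -> set[str]:
    """All components reachable from the entry points (depth-first walk)."""
    seen: set[str] = set()
    stack = list(entry_points)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen


def unused_components(graph: dict[str, list[str]], entry_points: list[str]) -> set[str]:
    """Components that nothing user-facing depends on."""
    all_nodes = set(graph) | {dep for deps in graph.values() for dep in deps}
    return all_nodes - reachable(graph, entry_points)


def fan_out(graph: dict[str, list[str]]) -> dict[str, int]:
    """Number of direct dependencies per component."""
    return {node: len(deps) for node, deps in graph.items()}
```

Given a graph where `web` calls `api`, `api` calls `db` and `cache`, and a `legacy_batch` job also calls `db`, starting from `web` would flag `legacy_batch` as worth a closer look.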
Other SRE best practices
These are the seven major principles of SRE, but many other core tenets exist. Here are some other SRE best practices you can follow:
- Work blamelessly, always assuming the best intentions of others, and finding systemic causes together
- Celebrate failure as an investment in reliability. Learn from each one with incident retrospectives
- Create on-call schedules that are empathetic and fair
- Treat reliability as a feature. Put your reliability goals in specifications right from the start
- Share information within the organization, and work in collaboration with other teams
- Build an SRE team that works in roles from code development to spreading cultural values
- Create an SRE culture that reinforces your SRE principles.
Ready to start the journey of adopting SRE principles? Blameless can help. Our SLO and incident retrospective tools help you implement SRE best practices. To learn more, check out a demo or sign up for our newsletter below.