
SRE Culture [How to Build a Better Team]

Emily Arnott | 5.24.2021

Whether you’re just adopting SRE or improving your current environment, this article explains SRE culture and how to create a blameless development process.

So what is SRE culture? SRE culture is founded on these main tenets:

  • Accepting failure as normal and adopting a blameless approach
  • Creating strong teams and relationships
  • Hiring team players and educating your hires
  • Creating a shared ownership of the product among teams
  • Balancing a resiliency-first approach with risk acceptance

Why is SRE culture important?

SRE is more than a set of best practices; it’s a cultural movement based on a set of key principles. The cultural lessons SRE teaches are as important as the technical processes. A healthy SRE culture makes the structure of your organization more reliable, using empathy and psychological safety to build trust and agency.

SRE goals will not succeed without cultural investment. Culture provides the foundation for the specific practices you put into place. You may feel unsure how to adopt SRE best practices for your organization. If you use the cultural tenets to guide you, you’ll have a stronger sense of where to invest time and energy.

SRE culture and DevOps culture

SRE and DevOps both have a set of cultural tenets that motivate their practices. Both methods encourage collaboration across teams and accepting failure. The major difference between SRE and DevOps culture is that SRE focuses more on how to turn these cultural improvements into actionable processes.

Hiring for SRE culture

When building your SRE team (or hiring for any other role), you need to ensure the candidate is a good cultural fit. When interviewing, don’t just focus on technical questions. It is often easier to get up to speed on technical practices than to change a cultural viewpoint.

Talk to candidates about their perspectives on reliability, failure, and blame. Ask them to recall times when their viewpoints were tested. You shouldn’t expect people to have perfect definitions or anecdotes on the spot. Rather, look for people who consider the questions in depth and answer from an empathetic perspective.

The seven cultural tenets of SRE

Accepting failure as normal

Accepting failure as normal means understanding that having 100% uptime is impossible. No matter how much effort you put into improving the reliability of something, there will always be the possibility of an unforeseen issue.

From this perspective, healthier attitudes towards failure emerge:

Understand tradeoffs in reliability: Attempting to reach 100% uptime is futile. Each additional “nine” of reliability has an exponential cost associated with it. And customers may not even notice! Instead, teams should set a goal for a level of reliability that satisfies users and aim to maintain that level. When you’re consistently achieving this goal, invest resources elsewhere.
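
To make the tradeoff concrete, here is a minimal Python sketch that converts an availability target into the downtime it allows. The 30-day window and the function name `allowed_downtime_minutes` are assumptions chosen for illustration, not something prescribed above.

```python
def allowed_downtime_minutes(availability_percent: float, window_days: int = 30) -> float:
    """Return the minutes of downtime a target permits over the window."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    # Compare how much downtime each extra nine leaves over an assumed 30-day window.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% allows {minutes:.2f} minutes of downtime per 30 days")
```

Each extra nine cuts the allowed downtime by a factor of ten, so the margin for error shrinks quickly while the benefit to users may be hard to perceive.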

Celebrate failure as an opportunity to learn: Failure is inevitable, so embrace it! Every failure is a chance to learn and grow. Understand that everyone on the team is doing their best. Use blameless retrospectives to uncover systemic issues rather than attribute blame. We’ll cover this in more depth later.

By normalizing failure, your team will operate more effectively, and with higher morale. Failure won’t be a demoralizing setback, but an unplanned investment in reliability.

Adopting a blameless culture

When failure occurs, address it blamelessly. Rather than finding faults with individuals, look for systemic causes together. Assume the best intentions from everyone. If someone’s judgment was incorrect, what information would help them make the right decision next time?

Adopting a blameless culture leads to improved agency and psychological safety. People won’t hesitate to proactively raise issues and concerns. They’ll know that they won’t be blamed or punished, and systemic issues will be addressed faster. People will also be more willing to take risks and experiment when they know their attempts will be taken in good faith.

Using blameless retrospectives

Blameless retrospectives, also known as postmortems, are documents built for each incident that occurs. They typically contain:

  • A summary of the events of the incident including classification and severity
  • A breakdown of the impact the incident had on customers
  • A timeline of the actions taken to resolve the incident
  • A log of communication between responders
  • A technical analysis of what happened, including monitoring data to provide context
  • Follow-up tasks to convert the lessons of the incident into better practices
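
As a rough illustration, the sections listed above could be captured in a simple record so that gaps are easy to spot before a review. This is only a sketch: the class name `Retrospective`, its field names, and the sample incident are hypothetical and do not reflect any particular tool’s format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Retrospective:
    """One incident retrospective; all field names are illustrative only."""
    summary: str = ""                                            # events, classification, severity
    customer_impact: str = ""                                    # how customers were affected
    timeline: List[str] = field(default_factory=list)            # actions taken to resolve
    communication_log: List[str] = field(default_factory=list)   # messages between responders
    technical_analysis: str = ""                                 # what happened, with monitoring data
    followup_tasks: List[str] = field(default_factory=list)      # lessons turned into work

    def missing_sections(self) -> List[str]:
        """Return the names of sections that are still empty."""
        return [name for name, value in vars(self).items() if not value]


if __name__ == "__main__":
    # Hypothetical, partially written retrospective.
    retro = Retrospective(
        summary="Checkout API outage, severity 2",
        timeline=["14:02 alert fired", "14:10 rollback started"],
    )
    print("Sections still to write:", retro.missing_sections())
```

A check like `missing_sections()` keeps the focus on completing the document rather than on who was involved in the incident.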

Creating retrospectives reflects an important cultural tenet: failure is an opportunity to grow. Whenever an incident occurs, you should collaborate to learn as much as possible from it. Establish a routine of reviewing past retrospectives in groups. Reflect on how they’ve influenced systemic changes. This reinforces motivation to continue learning and growing.

Retrospectives need to be blameless to be effective. If your conclusion boils down to human error, no effective change will follow. Look for deeper causes, such as information being unavailable or failsafes being absent.

Building a strong team

SRE culture emerges from within a team; it cannot be enforced top-down without support. Involve people at every level to help form this culture mindfully. Allow people the agency to play to their strengths while providing them opportunities to grow. Psychological safety is key to encouraging everyone to do their best. People need to feel secure enough to fail to be able to improve.

Breaking down silos of information is also key to good team dynamics. In a siloed structure, development ships code to operations without communicating intent. Operations then throws incidents back to development without context. With no additional communication, resentment can grow.

SRE culture supports communication and alignment between teams. Documents like incident retrospectives can link operations concerns with development goals. Shared metrics like SLOs make sure everyone’s focus is aligned on customer satisfaction. Runbooks codify information so everyone feels confident responding to incidents.

To coach a strong team, emphasize collaboration and transparency in decision making. Use accessible data to inform your choices. Then consult with your team members on where the data is pointing. Cultivate a shared trust in processes and cohesion rather than authority.

Shared ownership

Shared ownership is a cultural principle that shifts reliability concerns to earlier in the development cycle. Reliability shouldn’t just be the responsibility of those operating the code. Instead, many teams should weigh in on reliability concerns throughout development.

Here are some of the teams that should share ownership of the service’s reliability:

  • Operations should raise concerns about maintaining the code in production
  • Security teams need to proactively discover potential vulnerabilities
  • Product development can contribute estimates of the importance of services, so you can better understand how reliability will be perceived
  • SREs can help establish SLOs to reflect the required reliability
  • Engineering teams can integrate reliability requirements into the feature specs of services

Reliability should be part of the conversation at every stage of development. This investment in reliability will align teams on the most important goal: customer satisfaction.

Putting resiliency first

Reliability is feature No. 1. It doesn’t matter how feature-rich or innovative your service is if users can’t reliably access it.

When designing the specifications for a new service, include your reliability goals. Consider how these goals will be impacted by each development choice. This also requires estimating the importance of different service aspects. Crafting user journeys and noting points that are especially key to customers can help you weigh the areas where reliability is the most important.

When operating a service, use SLOs to keep its reliability in an acceptable range. When the reliability dips close to violating the SLO, trigger policies to keep it safe. These policies can ensure that teams make mindful tradeoffs between reliability work and feature work.
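
One way to picture such a policy is as a small function that maps the remaining error budget to an agreed response. The sketch below assumes a request-based SLO; the function names, the 25% warning threshold, and the wording of each response are hypothetical and would need to be agreed on by your own teams.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_requests <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def policy_action(remaining: float, warn_below: float = 0.25) -> str:
    """Map the remaining budget to a hypothetical policy response."""
    if remaining <= 0:
        return "freeze feature releases and prioritize reliability work"
    if remaining < warn_below:
        return "slow releases and review recent incidents"
    return "continue feature work as planned"


if __name__ == "__main__":
    # Assumed example: a 99.9% SLO, one million requests, 900 of them failed.
    remaining = error_budget_remaining(0.999, 1_000_000, 900)
    print(f"Budget remaining: {remaining:.0%}")
    print("Policy:", policy_action(remaining))
```

Writing the thresholds down ahead of time means the tradeoff is decided calmly in advance rather than negotiated during an incident.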

Your reliability goal shouldn’t be 100%. Customers will accept some amount of unreliability, and efforts made to improve reliability beyond that point will go unappreciated. We’ll cover embracing this risk of failure in the next section.

Embracing risk

Building a culture around embracing risk means evaluating risks to reliability against increases in development velocity. This is also a key SRE principle. Customers may not notice the difference between 99.9% and 99.99%. If this is the case, investing more resources in adding another nine may not add business value. Instead, you can look towards additional features that would increase business value.

Tools like SLOs and error budgets tell you when development velocity can be accelerated. Culturally, you also need people to be ready to take advantage of these opportunities. This involves cultivating psychological safety. People need to feel secure in their agency and freedom to develop. Error budgets help provide this by contextualizing the impact a change could have. People will be able to evaluate the risk of their choices, and embrace risks that are worth taking.
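
As a simple illustration of that evaluation, a team could compare the estimated worst-case downtime of an experiment with the budget that is left. The function name `fits_in_budget` and all of the numbers below are assumptions for the sake of the example; real estimates would come from your own monitoring data.

```python
def fits_in_budget(budget_minutes: float, spent_minutes: float, estimated_risk_minutes: float) -> bool:
    """Return True if the estimated worst-case downtime fits in the unspent budget."""
    remaining = budget_minutes - spent_minutes
    return estimated_risk_minutes <= remaining


if __name__ == "__main__":
    # Assumed figures: 43.2 minutes of budget for the window, 10 already spent.
    budget, spent = 43.2, 10.0
    for name, risk in (("canary rollout", 15.0), ("full migration", 40.0)):
        verdict = "worth considering" if fits_in_budget(budget, spent, risk) else "wait or reduce the risk"
        print(f"{name}: estimated {risk} min at risk -> {verdict}")
```

The point is not the arithmetic itself but that the decision rests on shared data rather than on anyone’s authority.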

Other cultural tenets of SRE

These are the core SRE cultural beliefs. There are many other cultural tenets that support your SRE goals. Here are some examples:

  • Aligning teams on goals for customer satisfaction and business value with SLOs
  • Assuming that processes can be improved, and working on automation to better support the humans who run the system
  • Continually reviewing and revising policies, practices, and resources to promote continuous improvement
  • Embracing tooling to eliminate toil and leverage new types of information

Ready to start cultivating SRE culture at your organization? Blameless can help. Our SLO and incident retrospective tools help you achieve your SRE goals. To find out how, check out a demo or sign up for our newsletter below.
