
"I'm Just Doing my Job," An SRE Myth

SRE Fundamentals

"Sorry, but I'm just doing my job." I heard this recently from a customer service representative. What they were saying made sense (after all, we don’t have total control over our work environments), but it felt wrong. As a customer, I was left dissatisfied with our interaction. However, the representative assured me that they were simply following protocol. This got me thinking: can established practices and protocols sometimes get in the way of an excellent customer experience?

As an SRE, I know one thing: we exist to serve an end user. That's it. Sure, we are people, too, with our own needs and wants. But from a business standpoint, we have a customer to serve, not the other way around. SRE can help ensure that teams are customer-focused, even if the best way forward breaks the rules or requires you to rewrite them. Two ways SRE accomplishes this are by fostering a culture of blamelessness and by using SLOs to glean insights into the customers’ experience.


Align incentives through a blameless culture

It’s become commonplace to justify actions that achieve the end result, even if the end result doesn’t make the customer happy. Consider automated phone systems. Are they more efficient? Maybe. Do users like them? No! We skip to the “speak to representative” option if we can.

Intent is everything. Organizations with the intention of getting the biggest payout regardless of customer happiness often risk churn. To align incentives, the first and most important step is to create a blameless environment where team members are challenged to bubble up issues and reimagine processes and procedures.

A blameless culture encourages a company to take a look at incidents (or any challenge) and shift blame from people to the system. A systems view provides teams with a way to be open and honest about issues that arise, and allows everyone to collaborate on a solution.

This is hard to do well because it is human nature to attach blame to others. If we can shift our perspective to focus on changing the system, we can quickly see that these problems are not caused by a single person, and that they can be rectified with analysis.

Resilience engineering expands on this concept of blamelessness and supports the truism that there is no single root cause of an incident. In complex distributed systems, there are a multitude of failures and successes which contribute to an end-user’s feeling of satisfaction. 

When we hear people say, "I'm just doing my job," it’s often a clue that there is a lack of deep analysis. Issues arise, the status-quo mitigation measures take place, and teams live to fight another day. But was there an improvement? Was there any learning? By moving beyond shallow incident data, teams can push past this reactive mode and improve the experience for their customers.

Here are my top tips for a deep, blameless analysis:

  • Allocate time for analysis in the first place. If your engineers don’t feel supported to take their time with an analysis, it may not be done well—or at all.
  • Ask what is responsible for an outcome, not who. This helps move blame off individuals and ask hard questions about the system. Human error is not a sufficient analysis. Just like root cause, it is only a starting point for investigation. See The Field Guide to Understanding Human Error by Sidney Dekker.
  • Understand how operators made the decisions they did during an incident. What was important to them at that time, and why? What else was happening in the setting of the incident that influenced their perspective? Avoid hindsight bias at all costs by looking through the lens of the operator.
  • “Provide accountability that encourages learning” (Dekker). A safe environment to fail encourages more team members to feel comfortable participating in discussions. This establishes more accounts of the incident, and the variety of viewpoints contributes to greater understanding of the system.
  • Reviewing and discovering new contributing factors can get complex, and it is impossible to capture everything. Set clear expectations for the incident retrospective. Don’t be afraid to timebox the review process, and allow time for ideas to soak between reviews.

This type of analysis can help uncover issues that you might otherwise overlook. These issues extend beyond the product and into the overall business goals. Is your company focusing on the wrong metrics for success? Is short-term revenue being prioritized over long-term customer happiness? Are you going after the wrong users because you haven't established a product/market fit? Keeping things blameless means you can ask these questions by focusing on why decisions were made, not on who made them.

Use SLOs to get in touch with customers

Have unhappy users? It's time to delve into what your customer is feeling and why. Remember, an unhappy customer will eventually go somewhere else. If you don’t have SLOs (service level objectives) established yet, there are resources within your organization you can consult.

Your CSMs (customer success managers) are a great gauge for what makes customers tick. They’ve built strong relationships with clients and deeply understand their desires and pain points. This information can help you determine key user journeys to focus on and improve.

Once you have these user journeys marked, you can establish SLOs that dictate the acceptable performance level for each one. This doesn’t mean 100% uptime. Nobody is perfect, and people don't expect you to be. SLOs keep you in touch with customer needs by spelling out the level of performance your customers will actually accept.

Once you’ve established some key SLOs, you can create a corresponding error budget. An error budget can be used to enforce reliability standards. If the SLO is impacted beyond an agreed percentage, the error budget policy requires teams to prioritize system improvements over new work. Bumping up the priority shows the team is willing to slow development velocity to maintain the product’s No. 1 feature: reliability. Yes, reliability. As I said before, we exist for our customers. If the product isn’t reliable, then to our end users, it may as well not exist.
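
To make the idea concrete, here is a minimal Python sketch of how an error budget and a policy check might be calculated. Everything in it is an assumption for illustration: the ErrorBudget class, the should_prioritize_reliability function, the request-based SLO, and the 75% policy threshold are hypothetical names and values, not part of any particular tool or a prescribed policy.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests observed in the SLO window
    failed_requests: int   # requests that missed the SLO in that window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1 - self.slo_target) * self.total_requests

    @property
    def consumed(self) -> float:
        # Fraction of the budget already spent; values above 1.0 mean
        # the budget is exhausted and the SLO has been missed.
        if self.allowed_failures == 0:
            return float("inf") if self.failed_requests else 0.0
        return self.failed_requests / self.allowed_failures


def should_prioritize_reliability(budget: ErrorBudget,
                                  policy_threshold: float = 0.75) -> bool:
    # Assumed policy: once the agreed share of the budget is spent,
    # reliability work takes priority over new feature work.
    return budget.consumed >= policy_threshold


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999,
                         total_requests=1_000_000,
                         failed_requests=800)
    print(f"Budget consumed: {budget.consumed:.0%}")
    print("Prioritize reliability:", should_prioritize_reliability(budget))
```

In this example, a 99.9% SLO over 1,000,000 requests leaves room for roughly 1,000 failed requests. With 800 failures, about 80% of the budget is spent, which crosses the assumed 75% threshold and signals that system improvements should move ahead of new work. Your own threshold and window are decisions for your team and your error budget policy.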

Additionally, even if the application is usable, customers may still experience pain if the product doesn’t offer a satisfactory experience. Creating a custom error budget is critical to help your product team bubble up important feedback and prioritize feature development. Your customer success teams can help validate the budget and priority.

This is an organization-wide effort not limited to engineering. Everyone has a part to play to ensure that customers are happy. In my interaction with the customer service representative, this blameless, customer-centric perspective could have dramatically changed the outcome for me, and likely many other customers struggling with similar issues. 

Customers deserve better, and we should always be their biggest advocate. So, next time you find yourself saying, “Sorry, but I’m just doing my job,” try to shift your perspective to the customer. View these problems as systemic, use SRE best practices like SLOs and error budgets, and embrace a blameless culture to help make a change.
