The New Reliability

What is reliability? When you try to nail it down, it's surprisingly nebulous. But with our new definition, you'll see reliability as something clear, measurable, and concrete. We'll equip you to present reliability to your org in a way that's true to your needs and resonates across the business overall. This talk reintroduces humanity into the reliability equation. It's not just about product health, it's about the humans on your app and the humans behind it.

Description

In this talk, Blameless’s Emily Arnott discusses The New Reliability, which is based on product health, customer happiness, and socio-technical resilience.

0:00 - 1:18 - Introduction

Emily Arnott, Content Marketing Manager at Blameless, proposes a new definition of reliability. What definition of reliability are you currently working with, and does it really help your team define what reliability is?

1:19 - 3:22 - What is reliability?

You might be surprised at the diversity of answers you get when you ask different engineers to define reliability. Isn’t reliability just uptime? Google says to consider the customer’s expectations. That invites further questions. Which customers are we looking at? How do we determine what their expectations are? What we do know is that an absence of reliability incurs major costs to an organization.

3:23 - 4:27 - Our thesis: The New Reliability

We have spoken about reliability with many engineers from many organizations. It boils down to three things: the health of your product, the happiness of your customers, and the socio-technical resilience of your team.

4:28 - 12:00 - The reliability of flying

Emily provides a real-world example. When you fly, you assume that the airline is prioritizing your safety and your needs (customer happiness). You also assume that the airline's systems are working properly and the airplane is properly stocked (product health). You also assume that the pilot knows how to fly, that the crew will show up on time, and that the airport is properly staffed (socio-technical resilience). The second real-world example is holiday flight disasters. Heightened demand and terrible weather left many travelers unable to reach their destinations on time. The airline systems hit their limit and flights were cancelled (product health), poor communication left travelers frustrated (customer happiness), and the staff were not trained to handle this level of strain (socio-technical resilience). There are countless other examples: poor cell phone service, cars breaking down, and apartment buildings needing maintenance. This type of unreliability is everywhere.

12:00 - 17:35 - Let’s take a tech example

We evaluate three services, A, B, and C, on system health, user expectations, and socio-technical resilience. You may think a service should be improved because its system health is poor, but if user expectations of that service are not high, it may not need to be prioritized over other services. You must consider all three buckets of the New Reliability framework to decide which services should get resources first.
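The comparison above can be sketched in a few lines of Python. This is only an illustration, not the scoring method used in the talk: the 0-to-1 scales, the example numbers for services A, B, and C, and the `priority` formula are all assumptions made for the sake of the example.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    product_health: float    # assumed scale: 0.0 (failing) to 1.0 (healthy)
    user_expectation: float  # assumed scale: 0.0 (tolerant) to 1.0 (demanding)
    team_resilience: float   # assumed scale: 0.0 (struggling) to 1.0 (resilient)


def priority(service: Service) -> float:
    """Hypothetical score: higher means the service needs attention sooner."""
    # How far the product falls short of what users expect (never negative).
    gap = max(0.0, service.user_expectation - service.product_health)
    # A less resilient team makes the same gap riskier, so it scales the gap up.
    return gap * (2.0 - service.team_resilience)


def rank(services: list[Service]) -> list[Service]:
    """Order services from most to least urgent."""
    return sorted(services, key=priority, reverse=True)


# Made-up example values, not figures from the talk.
services = [
    Service("A", product_health=0.5, user_expectation=0.4, team_resilience=0.8),
    Service("B", product_health=0.8, user_expectation=0.95, team_resilience=0.5),
    Service("C", product_health=0.9, user_expectation=0.9, team_resilience=0.9),
]

for service in rank(services):
    print(f"Service {service.name}: priority {priority(service):.3f}")
```

With these made-up numbers, Service A has the worst health but ranks below Service B, because users expect little of A while B falls short of high expectations and is owned by a less resilient team. A real scoring scheme would need its own scales and weights agreed on by the team.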

17:35 - 22:13 - Let’s break it down a bit further

What exactly do product health, customer happiness, and socio-technical resilience entail? Product health includes observability, telemetry, and the four golden signals. Customer happiness includes the user experience, what matters most to users, what their expectations are, and how confident they are in your service. Socio-technical resilience includes how effective your team is during incident response, whether service ownership is clear, and whether teams are aligned on their priorities and responsibilities.

22:13 - 24:11 - Why this definition works for you

This definition is all-encompassing and holistic. More important than this particular definition is that your entire team is aligned on a single definition. This framework also motivates impactful changes. Lastly, it helps your team prioritize where changes are most needed.

24:11 - 27:27 - How to measure the new reliability

Ask yourself some questions. What are the sources of manual labour for each type of incident? What is toilsome, tedious and repetitive? How many incident hours has each engineer spent on-call? How much time has your team spent fixing each service?
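Two of these questions, incident hours per engineer and time spent per service, can be answered by summing over incident records. The sketch below assumes a hypothetical record format with `service`, `engineer`, and `hours` fields and uses made-up data; real numbers would come from whatever incident tracker your team uses.

```python
from collections import defaultdict

# Hypothetical incident records; field names and values are assumptions.
incidents = [
    {"service": "checkout", "engineer": "ana", "hours": 3.0},
    {"service": "checkout", "engineer": "ben", "hours": 1.5},
    {"service": "search", "engineer": "ana", "hours": 2.0},
    {"service": "search", "engineer": "ben", "hours": 0.5},
    {"service": "checkout", "engineer": "ana", "hours": 0.5},
]


def total_hours(records, key):
    """Sum incident hours grouped by the given field ('engineer' or 'service')."""
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record["hours"]
    return dict(totals)


print(total_hours(incidents, "engineer"))
print(total_hours(incidents, "service"))
```

Grouping the same records by different fields shows both where the time goes (which service) and who is absorbing it (which engineer), which is the kind of input the socio-technical resilience bucket needs.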

27:27 - 29:01 - Conclusion

The New Reliability takes your system’s health, contextualizes it based on your users’ expectations, and prioritizes all of that based on your engineers’ socio-technical resilience.

Speakers

Emily Arnott

Community Relations Manager, Blameless

Emily is the Community Relations Manager at Blameless, where she fosters a place for discussing the latest in SRE. She has also presented talks at SREcon, Conf42, and Chaos Carnival.