LinkedIn’s Kurt Andersen Uncovers Blindspots in SRE Implementations

Christina Tan

9.19.2018

Kurt Andersen is an engineer who is fascinated by how entire systems interrelate. Through his work at NASA, IBM, HP, and now LinkedIn, Kurt distills insights on how to make hundreds of constantly moving parts work together. Blameless interviewed Kurt to shine light on the blind spots that companies often have when implementing SRE.

Besides his role as a senior staff site reliability engineer at LinkedIn, Kurt also sits on the board of USENIX, an organization that hosts a wealth of conferences, including SREcon, that bring together top professionals in the computing world.

Here are the key nuggets of SRE wisdom from Kurt in the interview.

SRE = available + secure

Availability usually takes the spotlight when people explain the purpose of Site Reliability Engineering. LinkedIn, however, puts equal emphasis on security. The SRE team at LinkedIn works to keep the site available and secure. Data privacy and integrity are top priorities for LinkedIn’s SRE team.

Differentiating DevOps vs. SRE

Many DevOps engineers are still convincing their organizations of the value of continuous integration (CI) and continuous delivery (CD). CI and CD are designed to do things faster, but that does not always mean doing the right things.

SRE teams focus on business success. Organizations with SRE teams tend to already have CI/CD as a staple, rather than a source of resistance. SRE builds on top of CI/CD and ensures that whatever moves fast contributes to business success. (See chapter 22 in the book Seeking SRE for detailed explanations from Kurt.)

Key Success Factor to SRE

Culture. A blameless culture is one that encourages learning and continuous improvement.

Feature Developers’ Blindspot: Retirement of their Services

Most feature developers don’t plan for the retirement of features. Microservices give you the illusion that you can yank and replace, but that’s not really the case. It’s tough to turn off a microservice without losing an arm or leg. That’s why it’s important for SREs to have a full life-cycle engagement, providing input starting from the design phase, so we can avoid the high cost of fixing bugs (and retiring features) later. When SREs contribute throughout the entire life cycle of products, we can ensure that products are being built for observability, reliability, and resilience from day one.

Terminology Confusion: SLO or SLA?

At companies that do not face financial penalties for violating Service Level Agreements (SLAs), internal engineering teams tend to use SLO and SLA interchangeably. An SLO, or service level objective, is really an internal metric for services that depend on one another. Distinguishing the two makes communication clearer when an SLA does become important (or is tied to dollar penalties).

Coming Up with Meaningful SLOs - a Missing Protocol

How would you come up with the best and most reasonable SLO for availability, latency (site speed), error rate, performance relative to traffic load, or how a service performs under stress conditions?

You can’t, not at the beginning. It’s hard to get the team’s buy-in for an arbitrary goal unless there’s a clear mechanism for revising the goal. For example, at Home Depot, SLOs are reviewed every 6 months. Teams can revise to tighter or looser SLOs (e.g., going from 99% availability to 99.5% or 98%). Each team at an organization can review its SLOs at a tempo that works for them. The key is to have a regular means to adjust rather than signing a lifelong commitment. (See chapter 3 in The Site Reliability Workbook for more details.)
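To make the difference between those targets concrete, here is a minimal sketch that converts an availability SLO into the downtime it permits. The function name `allowed_downtime_minutes` and the 30-day window are assumptions made for illustration; the interview does not say how any team measures its window.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the minutes of downtime an availability SLO permits in a window.

    Assumption: availability is measured as uptime over a fixed window
    (30 days by default). Real teams may use request-based SLOs instead.
    """
    if not 0.0 < slo_percent <= 100.0:
        raise ValueError("slo_percent must be in (0, 100]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    # The three targets mentioned in the text: looser, current, tighter.
    for target in (98.0, 99.0, 99.5):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {budget:.0f} minutes of downtime per 30 days")
```

Under these assumptions, 99% allows roughly 432 minutes of downtime per 30 days, tightening to 99.5% halves that to 216, and loosening to 98% doubles it to 864. Seeing the numbers side by side can make a review conversation about tightening or loosening a target less abstract.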

SLO Challenge: Measuring the Business Impact of Grey Failures

A grey failure refers to a partial failure of a system, for example, if a specific feature of LinkedIn were to stop responding only in Canada. Calculating the impact of a grey failure is difficult. The estimates are rough, the process is manual, and it’s difficult to take into account any bounce-back effect. When Amazon went down on Prime Day, possibly more customers came back the next day to buy more. However, it’s also possible that what customers wanted to buy had already sold out. Because it’s difficult to quantify the business impact, we currently bucket impact into 3 categories (minor, major, and critical) and prioritize accordingly.
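As a rough illustration of bucketing, the sketch below turns a coarse estimate (the fraction of users affected and how long the failure lasted) into one of the three categories. Everything here is hypothetical: the interview does not describe the actual criteria, so the `bucket_impact` function, the scoring rule, and both thresholds are invented for the example.

```python
from enum import Enum


class Impact(Enum):
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


def bucket_impact(affected_fraction: float, duration_minutes: float) -> Impact:
    """Bucket a partial failure using a rough, hypothetical score.

    The score is affected_fraction * duration_minutes, i.e. the average
    minutes of disruption per user. The thresholds (5 and 30) are invented
    for illustration and are not taken from any real policy.
    """
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0 and 1")
    if duration_minutes < 0:
        raise ValueError("duration_minutes must be non-negative")
    score = affected_fraction * duration_minutes
    if score >= 30.0:
        return Impact.CRITICAL
    if score >= 5.0:
        return Impact.MAJOR
    return Impact.MINOR


if __name__ == "__main__":
    # Hypothetical incidents: (description, affected fraction, duration in minutes).
    incidents = [
        ("feature down in one region", 0.04, 60),
        ("slow search for half of users", 0.5, 20),
        ("site-wide outage", 1.0, 45),
    ]
    # Handle the most severe buckets first.
    for name, fraction, minutes in sorted(
        incidents, key=lambda i: bucket_impact(i[1], i[2]).value, reverse=True
    ):
        print(f"{bucket_impact(fraction, minutes).name:8s} {name}")
```

A coarse rule like this does not capture bounce-back effects or lost sales, which is exactly the limitation described above. It only gives a consistent way to sort incidents when a precise dollar figure is out of reach.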

Vision for SRE

SRE brings an ongoing emphasis and a continual drumbeat on the importance of reliability, much like what QA does for unit testing. In an ideal world, every engineer would take reliability into account in everything they do.
