Kurt Andersen is an engineer who is fascinated by how entire systems interrelate. Through his work at NASA, IBM, HP, and now LinkedIn, Kurt distills insights on how to make hundreds of constantly moving parts work together. Blameless interviewed Kurt to shine a light on the blind spots that companies often have when implementing SRE.
Besides serving as a senior staff site reliability engineer at LinkedIn, Kurt also sits on the board of USENIX, an organization that hosts a wealth of conferences, including SREcon, that bring together top professionals in the computing world.
Here are the key nuggets of SRE wisdom from Kurt in the interview.
SRE = available + secure
Availability usually takes the spotlight whenever people explain the purpose of Site Reliability Engineering. At LinkedIn, however, availability shares that spotlight with a second emphasis: security. The SRE team at LinkedIn works to keep the site available and secure. Data privacy and integrity are top priorities for LinkedIn’s SRE team.
Differentiating DevOps vs. SRE
Many DevOps engineers are still convincing their organizations of the value of continuous integration (CI) and continuous delivery (CD). CI and CD are designed to do things faster, but doing things faster does not always mean doing the right things.
SRE teams focus on business success. Organizations with SRE teams tend to already have CI/CD as a staple, rather than a source of resistance. SRE builds on top of CI/CD and ensures that whatever moves fast contributes to business success. (See chapter 22 in the book Seeking SRE for detailed explanations from Kurt.)
Key Success Factor to SRE
Culture. A blameless culture is one that encourages learning and continuous improvement.
Feature Developers’ Blindspot: Retirement of their Services
Most feature developers don’t plan for the retirement of features. Microservices give you the illusion that you can yank and replace, but that’s not really the case. It’s tough to turn off a microservice without losing an arm or a leg. That’s why it’s important for SREs to have full life cycle engagement, providing input starting from the design phase, so we can avoid the high cost of fixing bugs (and retiring features) later. When SREs contribute throughout the entire life cycle of products, we can ensure that products are being built for observability, reliability, and resilience from day one.
Terminology Confusion: SLO or SLA?
For companies that do not suffer financial penalties for violating Service Level Agreements (SLAs), the internal engineering team tends to use SLO and SLA interchangeably. An SLO, or service level objective, is really an internal metric for services that depend on another service. Distinguishing the two will make communication clearer when an SLA does become important (or tied to dollar-amount penalties).
Coming Up with Meaningful SLOs - a Missing Protocol
How would you come up with the best and most reasonable SLO for availability, latency (site speed), error rate, performance relative to traffic load, or how a service performs under stress conditions?
You can’t, not at the beginning. It’s hard to get the team’s buy-in for an arbitrary goal unless there’s a clear mechanism for revising the goal. For example, at Home Depot, SLOs are reviewed every 6 months. Teams can revise them to be tighter or looser (e.g. going from 99% availability to 99.5% or 98%). Each team at an organization can review their SLOs at a tempo that works for them. The key is to have a regular means to adjust rather than signing a lifelong commitment. (See chapter 3 in The Site Reliability Workbook for more details.)
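To make the trade-off in such a revision concrete, here is a minimal Python sketch that converts an availability SLO into the downtime it permits. The 30-day measurement window and the function name `error_budget` are assumptions made for illustration; they do not come from the interview or from any team's actual process.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO permits over a window.

    Hypothetical helper: an SLO of 99% leaves 1% of the window as budget.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    return window * ((100 - slo_percent) / 100)


# Assumed measurement window; a real team would pick its own.
WINDOW = timedelta(days=30)

# The three availability targets mentioned in the text.
for slo in (98.0, 99.0, 99.5):
    minutes = error_budget(slo, WINDOW).total_seconds() / 60
    print(f"{slo}% availability allows about {minutes:.0f} minutes of downtime")
```

Under these assumptions, tightening from 99% to 99.5% halves the allowed downtime (from 432 to 216 minutes per 30 days), while loosening to 98% doubles it to 864 minutes. Seeing the numbers side by side can make a periodic SLO review less abstract.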
SLO Challenge: Measuring the Business Impact of Grey Failures
A grey failure refers to a partial failure of a system, for example, if a specific feature of LinkedIn were to stop responding only in Canada. Calculating the impact of a grey failure is difficult. The estimates are rough, the process is manual, and it’s difficult to take into account any bounce-back effect. When Amazon Prime went down on Prime Day, it’s possible that more customers came back the next day to buy more. However, it’s also possible that what customers wanted to buy had already sold out. Because it’s difficult to quantify the business impact, we currently bucket impact into 3 categories (minor, major, and critical) and prioritize accordingly.
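As a rough illustration of bucketing, the Python sketch below turns a coarse estimate of a grey failure into one of the three categories. Everything specific here is hypothetical: the "affected user-minutes" measure, the function names, and the threshold values are assumptions chosen for the example, not LinkedIn's actual method, and the sketch deliberately ignores the bounce-back effect described above.

```python
from enum import Enum


class Impact(Enum):
    MINOR = "minor"
    MAJOR = "major"
    CRITICAL = "critical"


def affected_user_minutes(total_users: int,
                          affected_fraction: float,
                          duration_minutes: float) -> float:
    """Rough estimate: how many user-minutes the partial failure touched."""
    if not 0 <= affected_fraction <= 1:
        raise ValueError("affected_fraction must be between 0 and 1")
    return total_users * affected_fraction * duration_minutes


def bucket_impact(user_minutes: float,
                  major_threshold: float = 100_000,
                  critical_threshold: float = 1_000_000) -> Impact:
    """Map an estimate to a bucket. The default thresholds are made up."""
    if user_minutes >= critical_threshold:
        return Impact.CRITICAL
    if user_minutes >= major_threshold:
        return Impact.MAJOR
    return Impact.MINOR


# Hypothetical incident: a feature fails for 1% of 1,000,000 users for 20 minutes.
estimate = affected_user_minutes(1_000_000, 0.01, 20)
print(bucket_impact(estimate).value)
```

Coarse buckets like these tolerate rough, manually produced estimates: an error of a factor of two in the input rarely changes the category, which is part of why bucketing is a workable substitute for a precise dollar figure.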
Vision for SRE
SRE brings an ongoing emphasis and a continual drumbeat on the importance of reliability, much like what QAs do for unit testing. In an ideal world, every engineer will take reliability into account in everything they do.