Industry Leader Insights

Why We Curated This List

We've had the privilege of interviewing some of the leading minds in the SRE and resilience industry. Here is a handy collection of their insights, drawn from decades of experience in the trenches, to help you fast-track your team's own journey to production excellence. If you are also interested in sharing your knowledge with the community, give us a shout!

SRE Best Practices

Improving Postmortem Practices

Steve McGhee, SRE Cloud Solutions Architect, Google

Description: Steve shares how to take your postmortems to the next level, offering pragmatic advice on action items, questions to ask, and more.

Narratives in Incidents

Lorin Hochstein, Sr. Software Engineer at Netflix and curator of surfingcomplexity.blog

Description: Lorin pioneered the “Oops” write-ups at Netflix. He dives into the organization's unique post-incident review culture, which embraces psychological safety and detailed narratives.

Twitter's SRE Journey

Brian Brophy, Sr. Staff SRE, Carrie Fernandez, Head of Site Reliability Engineering, JP Doherty, Engineering Manager, and Zac Kiehl, Sr. Staff SRE at Twitter

Description: We had the privilege of interviewing members of Twitter's team about how they practice SRE, and how their organization and practices have changed as they've scaled.

Metrics & Measures

Getting to 5 9's of Availability

Tyler Wells, Sr. Director of Engineering - SRE/Platform, Twilio

Description: Five 9s (99.999%) availability means less than 30 seconds of service unavailability per month (about 26 seconds over a 30-day month). Tyler shares the key building blocks that the Twilio team adopted to reach five 9s.
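
For readers who want to check the arithmetic behind that figure, here is a minimal sketch. It is not taken from the interview; the function name allowed_downtime_seconds and the 30-day window are assumptions made for illustration.

```python
# Illustrative only: the function name and the 30-day default window are
# assumptions, not something described in the interview.
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(availability: float, window_days: int = 30) -> float:
    """Return the downtime (in seconds) permitted by an availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * window_days * SECONDS_PER_DAY


if __name__ == "__main__":
    targets = [("three 9s", 0.999), ("four 9s", 0.9999), ("five 9s", 0.99999)]
    for label, target in targets:
        seconds = allowed_downtime_seconds(target)
        print(f"{label}: {seconds:.1f} seconds of downtime per 30 days")
```

Each additional 9 cuts the allowance by a factor of ten, which is why five 9s leaves only about 26 seconds per month.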

SRE Panel: Managing Systems Complexity

Jessica Kerr, Host of Arrested DevOps and Greater Than Code, Tim Tischler, Site Reliability Champion at New Relic, Ward Cunningham, Staff Engineer at New Relic

Description: Panelists discuss how systems complexity has evolved since the early days of Agile, what the future of testing will look like in surfacing systems boundaries, how to understand socio-economic issues through the lens of SRE principles like observability, and much more.

Culture & Team Structure

Top Blindspots in SRE Implementations

Kurt Andersen, Sr. Staff SRE, LinkedIn

Description: Through his work at NASA, IBM, HP, and now LinkedIn, Kurt distills insights on managing complexity and blindspots that companies often encounter when implementing SRE.

How Resilience and Security Shift Left

Melody Hildebrandt, EVP Product & Engineering and CISO, FOX

Description: Melody gives us a front-row seat into the operations behind record-smashing events such as the Super Bowl, and how resilience and security are shifting left in the software lifecycle.

Work as Done vs. Imagined

Ben Rockwood, Packet Head of Site Engineering, Morgan Schryver, Netflix Sr. SRE, and Rein Henrichs, Procore Principal Engineer

Description: Ben, Morgan, and Rein discuss the effects of imposter syndrome and ways to counter it during high-tempo situations, and how culture directly affects the availability of our systems.

Building Reliability through Culture

Steve McGhee, SRE Cloud Solutions Architect, Google

Description: Steve shares the three stages of incident preparedness, ensuring bug hygiene, building a flexible monitoring system, and other practices to improve control over reliability.

But I Already Have DevOps? (How SRE Fits in the Picture)

David Blank-Edelman, Sr. Cloud Advocate, Microsoft and Co-founder SREcon

Description: Based on his decades of experience within systems administration, DevOps, and SRE, David shares an introduction to SRE and how to relate it to your existing DevOps practices.

A Culture of Reliability

Matt Klein, Lyft Engineer & creator of Envoy, and David Blank-Edelman

Description: Matt shares how service mesh architectures improve the reliability and observability of microservice-based environments, and relates that to the evolving discipline around SRE.

Applying SRE Outside of Engineering

Dave Rensin, Sr. Director of Engineering, Google

Description: Dave shares how SRE can be applied outside of engineering to functions such as sales and marketing, creating new dimensions to IT operations principles.

Adaptability, Ego, and Scaling

Tim Banks, Technical Account Manager, Mission

Description: Tim shares the importance of adaptability, how ego can cause teams to pivot too slowly, and things leaders should consider when scaling in the face of uncertainty.

Fostering Inclusion and Integrity

Sidney Miller, Talent Acquisition Lead at Packet

Description: Sidney shares the importance of fostering inclusion and integrity within our organizations, along with best practices for recruiting and retention.

Service Level Objectives

SLO Adoption at Twitter

Brian Brophy, Sr. Staff SRE, Carrie Fernandez, Head of Site Reliability Engineering, JP Doherty, Engineering Manager, and Zac Kiehl, Sr. Staff SRE at Twitter

Description: We walk through several important breakthroughs for the Twitter SRE team that were crucial to facilitating the adoption of SLOs within the organization.

How SLOs transformed Evernote

Garrett Plasky, Sr. SRE Manager

Description: Traditionally, DevOps engineers have their hands tied when it comes to reducing technical debt, as it's difficult to quantify impact. At Evernote, SLOs provide exactly that clarity.

Getting the Most Out of SRE, SLOs, and Error Budgets

Joseph Bironas, Director of Engineering, Collective Health

Description: Joseph shares the nuances in execution and mentality between mature and immature SRE implementations, how to implement SLOs and error budgets, and more.
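
As a rough illustration of the error budget idea mentioned above, here is a minimal sketch for a request-based SLO. It is not Joseph's method; the function name error_budget_remaining and the example numbers are hypothetical.

```python
# Illustrative only: a request-based error budget under assumed inputs.
# The function name and the sample numbers are hypothetical.
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 <= slo_target < 1.0:
        raise ValueError("slo_target must be a fraction in [0, 1)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # A 99.9% SLO over 1,000,000 requests tolerates about 1,000 failures.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"Error budget remaining: {remaining:.0%}")
```

A team might slow releases as the remaining fraction approaches zero and spend more freely on new features while plenty of budget is left.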

What are Service Level Objectives? Lessons Learned

Description: In this piece we take a look at SLOs as both a powerful safety net and a tool to inform the allocation of engineering resources, and walk through a few examples.

By embracing failure and learning, SRE enables teams to build and run more resilient systems.
