Industry Leader Insights

Why We Curated This List

We've had the privilege of interviewing some of the leading minds in the SRE and resilience industry. Here is a handy collection of their insights from decades of experience in the trenches, to help you fast-track your team's own journey to production excellence. If you are also interested in sharing your wealth of knowledge with the community, give us a shout!

SRE Best Practices

Improving Postmortem Practices

Steve McGhee, SRE Cloud Solutions Architect, Google

Description: Steve shares how to take your postmortems to the next level, offering pragmatic advice on action items, questions to ask, and more.

Narratives in Incidents

Lorin Hochstein, Sr. Software Engineer at Netflix and curator of surfingcomplexity.blog

Description: Lorin pioneered the “Oops” write-ups at Netflix. He dives into the organization's unique post-incident review culture, which embraces psychological safety and detailed narratives.

Twitter's SRE Journey

Brian Brophy, Sr. Staff SRE, Carrie Fernandez, Head of Site Reliability Engineering, JP Doherty, Engineering Manager, and Zac Kiehl, Sr. Staff SRE at Twitter

Description: We had the privilege of interviewing members of Twitter's team about how they practice SRE, and how their organization and practices have changed as they've scaled.

Bringing Operational Excellence to Development Teams

Lauren Rubin, Senior Software Engineer at GitHub

Description: Lauren Rubin describes what skills operationally mature teams can bring to development to improve processes.

Taking Postmortems from Chore to Masterclass

Paul Osman, Senior Software Engineer at Honeycomb

Description: Paul Osman speaks about how to take postmortems or incident retrospectives to a new level.

Industry Experts Explain How to Thrive in a Post-COVID World

Ashar Rizqi, CEO and Co-founder at Blameless, Raj Dutt, CEO and Co-founder at Grafana Labs, Kelsey Waters, Senior Director of Operations at Packet

Description: Industry experts chat about challenges and lessons learned during COVID-19.

SRE Thought Leader Panel: Testing in Production

Shelby Spees, Developer Advocate at Honeycomb.io, Talia Nassi, Developer Advocate at Split.io

Description: Our panelists discuss testing in production, how feature flagging can help us do it safely, and how to get managers on board.

Metrics & Measures

Getting to 5 9's of Availability

Tyler Wells, Sr. Director of Engineering - SRE/Platform, Twilio

Description: Five 9s availability = less than 30 seconds of service unavailability per month. Tyler shares the key building blocks that the Twilio team adopted to reach five 9s.
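The "less than 30 seconds" figure follows from simple arithmetic, which the sketch below works through. It is an illustration only, not something taken from Tyler's talk: the function name `downtime_budget_seconds` and the choice of a 30-day month are assumptions made for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
THIRTY_DAYS = 30 * SECONDS_PER_DAY  # assumed length of a "month" for this example


def downtime_budget_seconds(nines: int, period_seconds: int) -> float:
    """Return the downtime allowed by an availability target of `nines` nines.

    Hypothetical helper: five nines means 99.999% availability, so the
    allowed unavailability is 10**-5 of the period.
    """
    if nines < 1:
        raise ValueError("nines must be a positive integer")
    return period_seconds / 10**nines


if __name__ == "__main__":
    for nines in range(2, 6):
        budget = downtime_budget_seconds(nines, THIRTY_DAYS)
        print(f"{nines} nines: {budget:,.2f} seconds of downtime per 30 days")
```

Under these assumptions, five 9s leaves 25.92 seconds of allowed downtime in a 30-day month, while three 9s leaves about 43 minutes.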

SRE Panel: Managing Systems Complexity

Jessica Kerr, Host of Arrested DevOps and Greater Than Code, Tim Tischler, Site Reliability Champion at New Relic, Ward Cunningham, Staff Engineer at New Relic

Description: Panelists discuss how systems complexity has evolved since the early days of Agile, what the future of testing will look like in surfacing systems boundaries, how to understand socio-economic issues through the lens of SRE principles like observability, and much more.

Enabling the Stripe and Lyft Platforms Through Modern Safety Science

Jacob Scott, Engineer at Stripe, previously at Lyft

Description: Blameless SRE Darrell Pappa interviews Jacob about how his research has informed his career and experiences to date, especially in his latest role at Stripe, where he helps operate the economic infrastructure of the Internet.

Culture & Team Structure

Top Blindspots in SRE Implementations

Kurt Andersen, Sr. Staff SRE, LinkedIn

Description: Through his work at NASA, IBM, HP, and now LinkedIn, Kurt distills insights on managing complexity and blindspots that companies often encounter when implementing SRE.

How Resilience and Security Shift Left

Melody Hildebrandt, EVP Product & Engineering and CISO, FOX

Description: Melody gives us a front-row seat into the operations behind record-smashing events such as the Super Bowl, and how resilience and security are shifting left in the software lifecycle.

Work as Done vs. Imagined

Ben Rockwood, Packet Head of Site Engineering, Morgan Schryver, Netflix Sr. SRE, and Rein Henrichs, Procore Principal Engineer

Description: Ben, Morgan, and Rein discuss the effects of imposter syndrome during high-tempo situations, ways to counter it, and how culture directly affects the availability of our systems.

Building Reliability through Culture

Steve McGhee, SRE Cloud Solutions Architect, Google

Description: Steve shares the three stages of incident preparedness, ensuring bug hygiene, building a flexible monitoring system, and other practices to improve control over reliability.

But I Already Have DevOps? (How SRE Fits in the Picture)

David Blank-Edelman, Sr. Cloud Advocate, Microsoft and Co-founder SREcon

Description: Based on his decades of experience within systems administration, DevOps, and SRE, David shares an introduction to SRE and how to relate it to your existing DevOps practices.

A Culture of Reliability

Matt Klein, Lyft Engineer & creator of Envoy, and David Blank-Edelman

Description: Matt shares how service mesh architectures improve the reliability and observability of microservice-based environments, and relates that to the evolving discipline around SRE.

Applying SRE Outside of Engineering

Dave Rensin, Sr. Director of Engineering, Google

Description: Dave shares how SRE can be applied outside of engineering to functions such as sales and marketing, creating new dimensions to IT operations principles.

Adaptability, Ego, and Scaling

Tim Banks, Technical Account Manager, Mission

Description: Tim shares the importance of adaptability, how ego can cause teams to pivot too slowly, and things leaders should consider when scaling in the face of uncertainty.

Fostering Inclusion and Integrity

Sidney Miller, Talent Acquisition Lead at Packet

Description: Sidney shares the importance of fostering inclusion and integrity within our organizations with best practices for recruiting and retention.

Engineering AMA with Dustin Pearce

Dustin Pearce, VP of Infrastructure at Instacart

Description: Blameless CEO Ashar Rizqi interviewed Dustin Pearce in a virtual executive fireside chat and AMA. Dustin is an experienced leader in scaling hyper-growth, cloud-native companies. He is the VP of Infrastructure at Instacart and previously served as Head of Service Engineering at Slack.

The Good Old Days of the Internet and SRE Education with Craig Sebenik

Craig Sebenik, SRE at Aurora

Description: Craig shares how growing his career alongside the internet changed his trajectory, along with his thoughts on SRE and tech education.

Enabling the Stripe and Lyft Platforms Through Modern Safety Science

Jacob Scott, Reliability Engineer at Stripe

Description: Jacob shares how he has applied learnings from modern safety science to care for real, complex socio-technical systems at hyper-growth organizations such as Lyft and Stripe.

The Importance of Glue Work with Tammy Bryant and Eric Roberts

Tammy Bryant, Principal SRE at Gremlin, and Eric Roberts, Sr. Manager SRE at Under Armour

Description: Tammy and Eric join host Amy Tobey on an episode of Resilience in Action to discuss the importance of glue work, leadership skills, and having fun on the job.

How Mercari Scales Vision, Culture, & Reliability

Mohan Bhatkar, Head of Engineering for the Customer Reliability Platform at Mercari, Inc.

Description: Mohan sat down with Blameless Co-Founder Ashar Rizqi to talk about scaling while avoiding silos, exciting day-to-day challenges, instilling a culture of empowerment, and more.

Service Level Objectives

SLO Adoption at Twitter

Brian Brophy, Sr. Staff SRE, Carrie Fernandez, Head of Site Reliability Engineering, JP Doherty, Engineering Manager, and Zac Kiehl, Sr. Staff SRE at Twitter

Description: We walk through several important breakthroughs for the Twitter SRE team that were crucial to facilitating the adoption of SLOs within the organization.

How SLOs Transformed Evernote

Garrett Plasky, Sr. SRE Manager, Evernote

Description: Traditionally, DevOps engineers have their hands tied when it comes to reducing technical debt, as it's difficult to quantify impact. At Evernote, SLOs provide exactly that clarity.

Getting the Most Out of SRE, SLOs, and Error Budgets

Joseph Bironas, Director of Engineering, Collective Health

Description: Joseph shares the nuances in execution and mentality between mature and immature SRE implementations, how to implement SLOs and error budgets, and more.

What are Service Level Objectives? Lessons Learned

Description: In this piece we take a look at SLOs as both a powerful safety net and a tool to inform the allocation of engineering resources, and walk through a few examples.

By embracing failure and learning, SRE enables teams to build and run more resilient systems.
