We've had the privilege of interviewing some of the leading minds in the SRE and resilience industry. Here is a handy collection of their insights from decades of experience in the trenches, to help you fast-track your team's own journey to production excellence. If you are also interested in sharing your wealth of knowledge with the community, give us a shout!
Improving Postmortem Practices
Steve McGhee, SRE Cloud Solutions Architect, Google
Description: Steve shares how to take your postmortems to the next level, offering pragmatic advice on action items, questions to ask, and more.
Lorin Hochstein, Sr. Software Engineer at Netflix and curator of surfingcomplexity.blog
Description: Lorin pioneered the “Oops” write-ups at Netflix. He dives into the organization's unique post-incident review culture, which embraces psychological safety and detailed narratives.
Brian Brophy, Sr. Staff SRE, Carrie Fernandez, Head of Site Reliability Engineering, JP Doherty, Engineering Manager, and Zac Kiehl, Sr. Staff SRE at Twitter
Description: We had the privilege of interviewing members of Twitter's team about how they practice SRE, and how their organization and practices have changed as they've scaled.
Bringing Operational Excellence to Development Teams
Lauren Rubin, Senior Software Engineer at GitHub
Description: Lauren Rubin describes what skills operationally mature teams can bring to development to improve processes.
Taking Postmortems from Chore to Masterclass
Paul Osman, Senior Software Engineer at Honeycomb
Description: Paul Osman speaks about how to take postmortems or incident retrospectives to a new level.
Industry Experts Explain how to Thrive in a Post-COVID World
Ashar Rizqi, CEO and Co-founder at Blameless, Raj Dutt, CEO and Co-founder at Grafana Labs, and Kelsey Waters, Senior Director of Operations at Packet
Description: Industry experts chat about challenges and lessons learned during COVID-19.
SRE Thought Leader Panel: Testing in Production
Shelby Spees, Developer Advocate at Honeycomb.io, and Talia Nassi, Developer Advocate at Split.io
Description: Our panelists discussed testing in production, how feature flagging and testing can help us do that, and how to get managers to be on board with testing in production.
Getting to 5 9's of Availability
Tyler Wells, Sr. Director of Engineering - SRE/Platform, Twilio
Description: Five 9s of availability (99.999%) means less than 30 seconds of service unavailability per month. Tyler shares the key building blocks that the Twilio team adopted to reach five 9s.
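To make the "less than 30 seconds per month" figure concrete, here is a minimal sketch of the arithmetic. It assumes a 30-day month, and the function name `allowed_downtime_seconds` is a hypothetical helper chosen for illustration; it is not taken from the interview.

```python
def allowed_downtime_seconds(availability: float, period_days: float = 30.0) -> float:
    """Return the downtime allowance, in seconds, for an availability target.

    availability is a fraction (for example 0.99999 for five 9s).
    period_days is the length of the measurement window; 30 days is an assumption.
    """
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1.0 - availability)


# Compare three, four, and five 9s over an assumed 30-day month.
for nines, target in [(3, 0.999), (4, 0.9999), (5, 0.99999)]:
    budget = allowed_downtime_seconds(target)
    print(f"{nines} nines: about {budget:.1f} seconds of downtime per 30-day month")
```

Under these assumptions, five 9s works out to roughly 26 seconds per 30-day month, which is consistent with the "less than 30 seconds" figure above.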
SRE Panel: Managing Systems Complexity
Jessica Kerr, Host of Arrested DevOps and Greater Than Code, Tim Tischler, Site Reliability Champion at New Relic, and Ward Cunningham, Staff Engineer at New Relic
Description: Panelists discuss how systems complexity has evolved since the early days of Agile, what the future of testing will look like in surfacing systems boundaries, how to understand socio-economic issues through the lens of SRE principles like observability, and much more.
Enabling the Stripe and Lyft Platforms Through Modern Safety Science
Jacob Scott, Engineer at Stripe, previously at Lyft
Description: Blameless SRE Darrell Pappa interviews Jacob to delve into how his research has informed his career journey and experiences to date, especially in his latest role at Stripe, where he helps operate the economic infrastructure of the Internet.
Top Blindspots in SRE Implementations
Kurt Andersen, Sr. Staff SRE, LinkedIn
Description: Through his work at NASA, IBM, HP, and now LinkedIn, Kurt distills insights on managing complexity and blindspots that companies often encounter when implementing SRE.
How Resilience and Security Shift Left
Melody Hildebrandt, EVP Product & Engineering and CISO, FOX
Description: Melody gives us a front-row seat into the operations behind record-smashing events such as the Super Bowl, and how resilience and security are shifting left in the software lifecycle.
Ben Rockwood, Packet Head of Site Engineering, Morgan Schryver, Netflix Sr. SRE, and Rein Henrichs, Procore Principal Engineer
Description: Ben, Morgan, and Rein discussed the effects of imposter syndrome during high-tempo situations, ways to counter it, and how culture directly affects the availability of our systems.
Building Reliability through Culture
Steve McGhee, SRE Cloud Solutions Architect, Google
Description: Steve shares the three stages of incident preparedness, ensuring bug hygiene, building a flexible monitoring system, and other practices to improve control over reliability.
But I Already Have DevOps? (How SRE Fits in the Picture)
David Blank-Edelman, Sr. Cloud Advocate at Microsoft and Co-founder of SREcon
Description: Based on his decades of experience within systems administration, DevOps, and SRE, David shares an introduction to SRE and how to relate it to your existing DevOps practices.
Matt Klein, Lyft Engineer & creator of Envoy, and David Blank-Edelman
Description: Matt shares how service mesh architectures improve the reliability and observability of microservice-based environments, and relates that to the evolving discipline around SRE.
Applying SRE Outside of Engineering
Dave Rensin, Sr. Director of Engineering, Google
Description: Dave shares how SRE can be applied outside of engineering to functions such as sales and marketing, creating new dimensions to IT operations principles.
Adaptability, Ego, and Scaling
Tim Banks, Technical Account Manager, Mission
Description: Tim shares the importance of adaptability, how ego can cause teams to pivot too slowly, and things leaders should consider when scaling in the face of uncertainty.
Fostering Inclusion and Integrity
Sidney Miller, Talent Acquisition Lead at Packet
Description: Sidney shares the importance of fostering inclusion and integrity within our organizations with best practices for recruiting and retention.
Engineering AMA with Dustin Pearce
Dustin Pearce, VP of Infrastructure at Instacart
Description: Blameless CEO Ashar Rizqi interviewed Dustin Pearce in a virtual executive fireside chat and AMA. Dustin is an experienced leader in scaling hyper-growth, cloud-native companies, as the VP of Infrastructure at Instacart and having previously served as Head of Service Engineering at Slack.
The Good Old Days of the Internet and SRE Education with Craig Sebenik
Craig Sebenik, SRE at Aurora
Description: Craig shares how growing his career alongside the internet changed his trajectory, as well as his thoughts on SRE and tech education.
Enabling the Stripe and Lyft Platforms Through Modern Safety Science
Jacob Scott, Reliability Engineer at Stripe
Description: Jacob shares how he has applied learnings from modern safety science to care for real, complex socio-technical systems at hyper-growth organizations such as Lyft and Stripe.
The Importance of Glue Work with Tammy Bryant and Eric Roberts
Tammy Bryant, Principal SRE at Gremlin, and Eric Roberts, Sr. Manager SRE at Under Armour
Description: Tammy and Eric join host Amy Tobey on an episode of Resilience in Action to discuss the importance of glue work, leadership skills, and having fun on the job.
How Mercari Scales Vision, Culture, & Reliability
Mohan Bhatkar, Head of Engineering for the Customer Reliability Platform at Mercari, Inc.
Description: Mohan sat down with Blameless Co-Founder Ashar Rizqi to talk about scaling while avoiding silos, exciting day-to-day challenges, instilling a culture of empowerment, and more.
SRE Leader Panel: SRE Adoption as Organizational Transformation
Kurt Andersen, SRE Architect at Blameless, Vanessa Yiu, Executive Director, Enterprise Architecture at Goldman Sachs, and Tony Hansmann, Former Global CTO at Pivotal Software, Inc.
Description: Industry leaders discuss how organizations adopt SRE, including the processes to put in place, how to change minds and behaviors, how to get the right message to the right people, and how to garner internal support with both individual contributors and leaders.
Brian Brophy, Sr. Staff SRE, Carrie Fernandez, Head of Site Reliability Engineering, JP Doherty, Engineering Manager, and Zac Kiehl, Sr. Staff SRE at Twitter
Description: We walk through several important breakthroughs for the Twitter SRE team that were crucial to facilitating the adoption of SLOs within the organization.
How SLOs Transformed Evernote
Garrett Plasky, Sr. SRE Manager
Description: Traditionally, DevOps engineers have their hands tied when it comes to reducing technical debt, as it's difficult to quantify impact. At Evernote, SLOs provide exactly that clarity.
Getting the Most Out of SRE, SLOs, and Error Budgets
Joseph Bironas, Director of Engineering, Collective Health
Description: Joseph shares the nuances in execution and mentality between mature and immature SRE implementations, how to implement SLOs and error budgets, and more.
What are Service Level Objectives? Lessons Learned
Description: In this piece we take a look at SLOs as both a powerful safety net and a tool to inform the allocation of engineering resources, and walk through a few examples.
By embracing failure and learning, SRE enables teams to build and run more resilient systems.