What nuances in execution and mentality separate successful SRE implementations from failed ones? How can you get the most out of your SLOs and error budgets?
Joseph Bironas shares the often-overlooked but critical insights to answer these questions. Joseph has 14 years of experience in SRE, 12 of them at Google. His insider's perspective is incisive, multi-disciplinary, and empathetic, linking the significance of SRE to both business and engineering.
Joseph currently leads the SRE team at Collective Health, a company that is transforming the employer-driven healthcare economy, redefining the way health benefits work.
This “CliffsNotes” summary curates the key points that were discussed by Joseph Bironas in the 50-minute interview. It is not a standalone article and is most valuable when contextualized by the podcast.
The significance of reliability:
- To engineering: product quality is just as important as product functionality.
- To business: a reliable product is key to a company's brand & its customers' trust.
- SLIs are user experience-centric.
- SLOs are an organizational guardrail for managing risk.
- SRE teams set a perimeter of defense, then slowly expand.
- Operate without blame.
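To make the SLO-as-guardrail idea above concrete, here is a minimal sketch of the arithmetic behind an error budget. It is not taken from the interview: the function names, the 30-day window, and the 99.9% target are illustrative assumptions, and real SLO tooling will differ.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    A target of exactly 1.0 is rejected because it leaves no budget at all.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent, based on event counts.

    Returns 1.0 when nothing has been spent, 0.0 when the budget is exactly
    used up, and a negative number when the SLO has been breached.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0  # no traffic yet, so no budget has been spent
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # A 99.9% target over 30 days allows about 43.2 minutes of downtime.
    print(error_budget_minutes(0.999))
    # 500 failed requests out of 1,000,000 spends about half of the budget.
    print(budget_remaining(0.999, good_events=999_500, total_events=1_000_000))
```

A team could use a number like the remaining budget to decide how much risk to take: plenty of budget left suggests room to ship faster, while a negative value suggests shifting effort toward reliability work.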
Counterintuitive Mentality Shifts
For successful SRE implementations
- Consider reliability as a core feature.
- People implicitly fixate on 100% reliability, but we should never aim for it.
- It's not enough to set a boundary for risk with SLOs; you also want to proactively control risk by running experiments that test and address key system vulnerabilities.