Are you looking to get up to speed on SRE fundamentals with the best SRE and DevOps books? Or are you hoping to expand your SRE knowledge into new domains? Either way, we’ve got you covered with our list of essential SRE resources!
The big books
These comprehensive tomes of SRE expertise are a great place to start.
Google’s Site Reliability Workbook
Google provides an overview of SRE implementation, covering the guiding principles that led to its organization-wide adoption of SRE and detailing practices ranging from upper-level management to the nuances of load balancing.
The Essential Guide to SRE Best Practices
Offered by Blameless, this eBook guides you through implementing your own SRE solution and is centered around three key principles: creating a mindset of resiliency, reducing engineering problems and innovation blockers, and approaching systems from a human perspective. If you’re looking to see how SRE will work within your organization, this eBook offers solutions that you can tailor to your needs and begin implementing today.
Inspired by Google's SRE book, this book delves deeper into SRE, covering topics such as implementation methods and principles, best practices and technologies that make SRE easier, and the human side of SRE.
If you’re pressed for time, Liz Fong-Jones, Principal Developer Advocate at Honeycomb, offers a playlist of essential O’Reilly SRE resources.
Site Reliability Engineering Tools
A variety of tools have been developed to help you on your SRE journey. These guides will help you decide what best fits your needs.
Blameless Buyers’ Guide for Reliability
Offered by Blameless, this guide looks at the goals of a successful SRE solution, and discusses what features a tool should have to accomplish them. It also breaks down the pros and cons of building tooling yourself, purchasing a tool, or adapting an open-source tool.
Awesome Site Reliability Tools
Curated by SREs, this list of tools is sorted by function to help you find vendors who provide services ranging from project management tools to infrastructure and container orchestration.
This article looks at a complete cycle of development and operations and breaks down how SRE tooling could help DevOps teams at each stage.
Choosing the Right Tools when Building Your SRE Toolchain
This talk by engineers at VictorOps, Grafana, and InfluxData outlines what an SRE toolchain could look like and how to experiment with options to build a solution.
Hiring Site Reliability Engineers: Why You Need an SRE
Thinking about staffing an SRE team? Having dedicated engineers who take the long view of reliability problems is a worthy investment. But how can you find good SREs, and what should they be doing? These articles and talks will answer these questions and more.
This SREcon talk given by Andrew Fong breaks down how Dropbox hired its SRE team, covering everything from sourcing talent to interviewing rubrics.
This guide explains the importance of investing in reliability staff and outlines how to find the perfect candidate for your first SRE role.
Training Site Reliability Engineers: What Your Organization Needs to Create a Learning Program
This report explains how to train SREs based on factors such as organizational maturity, candidate knowledge, familiarity with SRE, and more.
Becoming a Certified SRE
Are you looking to step into the exciting role of SRE? These links will help you find site reliability engineering certifications and other learning opportunities.
Kubedex - How do I become a SRE?
This guide provides a concise spreadsheet of online courses in SRE topics. It builds up the SRE role from fundamental skills in Linux system administration and software development, making it the perfect guide for someone starting their career.
Site Reliability Engineering: Measuring and Managing Reliability on Coursera
Created by the Google Cloud team, this course covers the Google SRE book in an engaging guided format. Quizzes and short assignments reinforce your learning, with an optional paid certification for completion.
Site Reliability Engineering Philosophy and Culture
SRE isn’t just a set of practices and tools. The philosophies that motivate these practices are fundamental to making your organization truly resilient. These articles and blogs will help you embrace failure as inevitable, put aside blame, develop for resiliency, and more.
The Many Shapes of Site Reliability Engineering
This article looks at the different ways SRE can be implemented and the benefits of each on both practical and cultural levels.
What exactly is the difference between DevOps and SRE? How do you incorporate the practices of each? This presentation by Google will answer these questions and more.
Convincing Management to Invest in Reliability
This talk by Blameless co-founder Lyon Wong provides strategies for getting SRE buy-in at the level of management, VPs, and CTOs. You can also read a series of blog posts covering the topic here: management, VP level, CTO level.
This weekly newsletter curated by Lex Neva, SRE at Fastly, brings you the latest in case studies, think pieces, and SRE news.
Many links in this list were sourced from the Awesome Site Reliability Resources page. Check it out if you’d like further resources on any of these topics, or if there are other areas of SRE you’d like to explore.
If you’d like to learn more about SRE and how to begin employing best practices in your organization, feel free to reach out to us for a demo or try us out for free.