Since 2015, Lex Neva has been publishing SRE Weekly. If you’re interested enough in reading about SRE to have found this post, you’re probably familiar with it. If not, there are a lot of great articles to catch up on! Lex selects around 10 entries from across the internet for each issue, focusing on everything from SRE best practices to the socio- side of systems to major outages in the news.
I had always figured Lex must be among the most well-read people in SRE, and likely #1. I met up with Lex on a call, and was excited to chat with him about how SRE Weekly came to be, how it continues to run, and his perspective on SRE.
The origins of SRE Weekly
It felt appropriate to begin our conversation with the beginning of SRE Weekly: why did he take on this project? Like the creators of many good projects, Lex was motivated to “be the change he wanted to see”. He was an avid reader of Devops Weekly, but wished that something similar existed for SRE. With so much great and educational content created in the SRE space, shouldn’t there be something to help people find the very best?
“I wanted there to be a list of things related to SRE every week, and such a thing didn’t exist, and I’m like… Oh,” Lex explained. “I almost fell into it sideways, I thought this was gonna be a huge time sink, but it ended up being pretty fun, actually.”
How SRE Weekly is made
When thinking about the logistics of SRE Weekly, one question likely comes to mind: how? How does he have time to read all those articles? SRE is a methodology of methodologies, a practice that encourages building and improving practices. Lex certainly embodies this with his efficient method of finding and digesting dozens of articles a week.
First, he finds new articles. For this, RSS feeds are his favorite tool. Once he’s got a buffer of new articles queued up, he uses an Android application called @voice to listen to them with text-to-speech – at 2.5x speed! Building up the ability to comprehend an article at that speed is a challenge, but for someone tackling the writing output of the entire community, it’s worth it.
To choose which articles to include, Lex doesn’t have any sort of strict requirements. He’s interested in articles that can bring new ideas or perspectives, but also likes to periodically include well-written introductory articles to get people up to speed. Things that focus on the socio- side of the sociotechnical spectrum also interest him, especially when highlighting the diversity of voices in SRE.
Incident retrospectives are another genre of post that Lex likes to highlight. Companies posting public statements about outages they’ve experienced and what they’ve learned is a trend Lex wants to encourage. Although they might seem to only tell the story of one incident at one company, good incident retrospectives can bring out a more universal lesson. “An incident is like an unexpected situation that can teach us something – if it’s something that made you surprised about your system, it probably can teach someone else about their system too.”
Lex explained how in the aviation industry, massive leaps forward in reliability were made when competing airlines started sharing what they learned after crashes. They realized that any potential competitive advantages should be secondary to working together to keep people safe. “The more you share about your incidents, the more we can realize that everyone makes errors, that we’re all human,” Lex says. Promoting incident retrospectives is how he can further these beneficial trends.
Lex’s view of SRE
Since Lex has a front-row seat to the evolution of SRE, I was curious what trends he had seen and how he foresees them growing and changing. We touched on many subjects, but I’ll cover three major ones here:
Going beyond the Google SRE book
Since it was published in 2016, the Google SRE book has been the canonical text when it comes to SRE. In recent years, however, the idea that this book shouldn’t be the be-all and end-all has become more prominent. At SREcon 21, Niall Murphy, one of the book’s authors, ripped it up live on camera!
Lex has seen this shift in attitudes in a lot of recent writing, and he’s happy to see a more diverse understanding of what SRE can be: “Even if Google came up with the term SRE, lots of companies had been doing this sort of work for even longer,” Lex said. “I want SRE to not just mean the technical core of making a reliable piece of code – although that’s important too – but to encompass everything that goes into building a reliable system.”
As SRE becomes more popular, companies of all sizes are seeing the benefits and wanting to hop aboard. Not all of these companies can muster the same resources as Google… Actually, practically only Google is at Google’s level! Lex has been seeing more learning emerge around the challenges of doing SRE at other scales, like startups, where there aren’t any extra resources to spare.
Broadening what an SRE can be
As we break away from the Google SRE book, we also start to break away from traditional descriptions of what a Site Reliability Engineer needs to do. “SRE is still in growing pains,” Lex said. “We’re still trying to figure out what we are. But it’s not a bad thing. I’ve embraced that there’s a lot under the umbrella.”
We often think of the “Engineer” in Site Reliability Engineer as being like the one in “Software Engineer”, that is, someone who primarily writes code. But Lex encourages a more holistic view: that SRE is about engineering reliability into a system, which involves so much more than just writing code. He’s been seeing more writing and perspectives from SREs who have “writing code” as a small percentage of their duties – even 0%.
“They’re focusing more on the people side of things, the incident response, and coming up with the policies that engender reliability in their company… And I think there’s room for that in SRE, because at the heart of it is still engineering, it’s still the engineering mindset. If you only do the technical side of things, you’re really missing out.”
Diversifying the perspectives of SREs
Alongside diversifying the role of SREs, Lex hopes to see more diversity among SREs themselves. In our closing discussion, I asked Lex what message he would broadcast to everyone in this space if he could. “It’s all about the people,” he said. “These complex systems that we’re building, they will always have people. They’re a critical piece of the infrastructure, just as much as servers.”
Even if what we build in SRE seems to be governed just by technical interactions, people are intrinsic to making those systems reliable. This isn’t a negative; this isn’t just people being “error-makers”. People are what gives a system strength and resiliency. To this point, Lex highlighted what can make this socio- side of systems better: diversity and inclusion.
“Inclusion is important for the reliability of our socio-technical systems because we need to understand the perspective of all our users, not just the ones that are like us. That means thinking across race, gender expression, class, neurodivergence, everything. It’s an area where we need to do better.” Lex hopes to highlight the richness in this diversity in SRE Weekly.
As people standing at the relative beginning of SRE, working together to build and evolve the practice, we’re given both a challenge and an opportunity. In order to truly understand and engineer reliability into what we do, we need to proactively discuss our goals and how we’re achieving them. We hope you take the time to reflect on the learning that many great SRE writers share through spaces like SRE Weekly.
Read SRE Weekly here, and follow it on Twitter here. What are your thoughts on how SRE changes and grows? Where do you like to keep up with SRE news? Let us know on Twitter!