
SREcon 2022 Americas Wrap Up

Emily Arnott | 3.24.2022

Hi everyone! We had a fantastic time at SREcon 2022 Americas last week, and I thought I’d share our stories and experiences. As the SRE community grows and evolves, these chances for collaboration become more and more important… and fun! Although I only attended virtually, I could still feel an exciting atmosphere as great minds came together.

The Talks

The stars of the show are, of course, the talks. I virtually attended nine this year, and I’d like to highlight some of the insights I got from them.

A Postmortem of SRE Interviewing by Michael Kehoe

Michael summarizes his experience going through over 100 interviews for SRE positions. He gives great tips on how to build an interview process that respects the applicant, engages and challenges them, and makes sure you get a good fit. It's a candidates' market out there for SREs, so recruiters need to stand out.

Tales from the VOID: The Scary Truth about Incident Metrics by Courtney Nash

If you've never heard of the VOID (Verica Open Incident Database), check it out! It's a big collaborative database of incidents for everyone to study. Courtney Nash breaks down some key lessons from the database, with some pretty amazing insights: disentangling incident length from incident severity, finding common patterns even across different companies with different types of metadata, diving into more meaningful incident data (like which tools were used), and more! If you work with incident response, this is a can't-miss talk.

How We Survived (and Thrived) During The Pandemic and Helped Millions of Students Learn Remotely by Chinmay Tripathi

McGraw Hill, where Chinmay works, is an online education company. He starts the talk off with some jaw-dropping charts showing their traffic spiking at the start of the pandemic. But they rose to the challenge! Chinmay works through the changes they made to survive and thrive, from the cultural to the technical. An inspirational story!

Are We There Yet? Metrics-Driven Prioritization for Your Reliability Roadmap by Christina Tan and Mindy Stevenson

Okay, of course I was going to pick this one. But I couldn't be prouder of my Blameless colleagues: this was a fantastic talk! They break down how to go from business needs, to probing questions, to the specific metrics that answer them. This data feeds a powerful "Reliability Dashboard" that helps you understand at a glance whether your business needs are being met. The crowd of people snapping pictures of the dashboard slide is all the proof you need that it's a powerful tool.

SRE stands for...Skydiving Resilience Engineer by Victor Lei

One of the things that fascinates me most about SRE is how universal its mindset seems to be. At first you might think these are very specifically technical concepts, but you'll see them crop up again and again everywhere you look, and learn something new each time you do. So why not look... up in the sky? Skydiving certainly stands out as a sport where reliability REALLY matters, so it was exciting to see how they achieve safe jumps and how the concepts translate into our own discipline.

Building a Path to the Future: Mentoring New SREs by Chastity Blackwell

The SRE role is always evolving and growing as the practice blossoms into the mainstream. So what is the new generation of SREs thinking when they step into the role for the first time? Maybe... "help me!"? Chastity gives valuable advice on how to be a good mentor, and how to establish beneficial mentorships in your organization. My favorite tip: talk to yourself on Slack as you work through a problem! Your colleagues can learn so much from seeing your process.

Modeling Alert Quality by Moshe Zadka

What's worse: missing an alert that you needed to see, or seeing an alert that you didn't need? Moshe works through how to determine this by breaking down the positive and negative outcomes of different alert categories. I really appreciate how they looked at costs holistically, incorporating every aspect of an incident's cost. Good alerts are what really allow your system to communicate its health to you, so they’re worth investing in.
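To make that trade-off concrete, here is a minimal sketch of one way to compare alerts by outcome cost. It is not taken from Moshe's talk: the class names, the four outcome categories, and every number below are hypothetical, chosen only to illustrate weighing missed alerts against noisy ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertOutcomes:
    """Hypothetical counts of each outcome for one alert over a review period."""
    true_positives: int   # alert fired and action was needed
    false_positives: int  # alert fired but no action was needed
    false_negatives: int  # alert stayed silent but action was needed
    true_negatives: int   # alert stayed silent and nothing was wrong


@dataclass(frozen=True)
class OutcomeCosts:
    """Assumed cost per outcome, in any consistent unit (e.g. engineer-minutes)."""
    true_positive: float
    false_positive: float
    false_negative: float
    true_negative: float = 0.0


def total_cost(outcomes: AlertOutcomes, costs: OutcomeCosts) -> float:
    """Sum the cost of every outcome the alert produced."""
    return (
        outcomes.true_positives * costs.true_positive
        + outcomes.false_positives * costs.false_positive
        + outcomes.false_negatives * costs.false_negative
        + outcomes.true_negatives * costs.true_negative
    )


def precision(outcomes: AlertOutcomes) -> float:
    """Share of fired alerts that actually needed action."""
    fired = outcomes.true_positives + outcomes.false_positives
    return outcomes.true_positives / fired if fired else 0.0


def recall(outcomes: AlertOutcomes) -> float:
    """Share of real problems that the alert caught."""
    needed = outcomes.true_positives + outcomes.false_negatives
    return outcomes.true_positives / needed if needed else 0.0


if __name__ == "__main__":
    # Made-up numbers: a missed problem is assumed to cost far more than a false page.
    costs = OutcomeCosts(true_positive=30, false_positive=15, false_negative=240)
    noisy = AlertOutcomes(true_positives=9, false_positives=40,
                          false_negatives=1, true_negatives=950)
    quiet = AlertOutcomes(true_positives=6, false_positives=5,
                          false_negatives=4, true_negatives=985)
    for name, outcomes in (("noisy", noisy), ("quiet", quiet)):
        print(name, total_cost(outcomes, costs),
              round(precision(outcomes), 2), round(recall(outcomes), 2))
```

With these assumed costs, the noisier alert comes out cheaper overall because each missed problem is so expensive. Change the cost assumptions and the answer flips, which is exactly why the costs deserve scrutiny.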

Emergent Organizational Failure: Five Disconnections by Mattie Toia

Everyone comes into SRE with the best intentions, but sometimes assumptions can lead to disconnects in achieving your goals. Mattie's talk dives into five major disconnects that can emerge and comes up with solutions. I loved their emphasis on the psychological, looking into how humans form belief structures. It's important to revisit the fundamentals of how people model the world and how motivation is born from perception. It's a lot about trust, which you can only build through consistent action. A really insightful talk!

DO, RE, Me: Measuring the Effectiveness of Site Reliability Engineering by Dave Stanke

The DORA State of DevOps report has long been a gold standard and sort of barometer for how different practices help teams. Now, the DORA team is turning their attention to SRE practices! Dave's talk does an excellent job of not just reporting the findings, but explaining their focus and methodology. Their findings are encouraging: orgs of all sizes are starting to pick up SRE practices, their use of SRE has been beneficial, and they have tons of room to grow into further adoption. Dave finishes off the talk with a collection of "hot takes" that ought to stimulate some discussion on the fundamentals of SRE and DevOps.

Videos of the conference will be available on the USENIX YouTube channel soon, so check these and the other great talks out!

Reliability Jenga

One event we were super pumped to present was Reliability Jenga! Among drinks and snacks at our afterparty, teams competed to make the tallest freestanding structure out of the Jenga blocks. The prize: an Oculus VR headset!

The teams size up their building materials…

Our Head of Marketing, Deirdre Mahon, summarized the experience: “The Jenga party was a perfect way to not only invite attendees to mix and have fun - it was actually a really interesting way to watch how the different teams competed to get the highest tower.

Some teams really came together to work on their base foundation, others focused on the design aspects and others inside teams took the approach to observe and share insights and strategy without getting hands on with the blocks. It was a great game to see how teams approach a problem and goal against the clock!”

Skyscrapers begin to emerge…

Tarun Nappoly, one of our Sales Development Representatives, has some great Jenga trivia: “Fun fact: The name Jenga is derived from a Swahili word that means ‘to build’. It may seem corny that we picked this as our activity for this reason, but we thought it was a great choice for SREcon given we’re all striving to build and maintain reliable services!”

Will it fall??

Jenga really does exemplify a lot of the reliability messages we value the most. Like: reliability is a team sport! Everyone on the team has to get involved to build a tower quickly. The variety of the tower designs was also enlightening… even with such a simple task, different people approach it completely differently. Embracing this diversity in thinking is key to being reliable. Because reliability really is so holistic! You have to think of everything that can knock your tower down, from structural flaws to rumbling footsteps.

Reliability lessons really are everywhere!

Closing thoughts

My colleagues also had insights to share from their time at the conference:

“As I reflect on SREcon, I think there are a few things that really stood out to me,” said Jason Montgomery, our VP of Sales. “First, (more internal) SREcon continues to validate the market opportunity and how early we are in the market. As far as the show goes, I feel that the market is maturing to be much broader than just SRE and reliability is becoming a part of organizations' business strategy.

To help drive this at the business level, there is a desire for metrics that provide interlock across both technical and non-technical stakeholders. I think a lot of organizations know reliability is important, but are struggling to articulate this across the business. I think that Mindy and Christina’s talk really touched on the importance of this and sparked a lot of good conversation at the conference. This seems to be an area we could lean into and own as a company.”

Deirdre also shared her thoughts on the conference: “Ditto what Jason said - I would reference Casey Rosenthal’s plenary session where he validated the criticality and importance of reliability for all businesses. Although it’s an event for engineering teams, it’s clear that their work extends far beyond what they do each day — managing incidents etc. is just the tip of the spear.

There were a lot of very large enterprises in attendance. Bloomberg sent over 30 attendees from multiple divisions. When big companies invest in an area such as this, that is market validation. The other main point I would bring up is that teams continue to struggle with hiring, onboarding and training for the SRE function. It is a practice that is still forming (& storming ;) ) and there are now teams dedicated to helping train other engineers into this role (rather than trying to hire from the outside). Meta did a plenary on this, which was an interesting approach in partnership with the Linux Foundation. Of course, Meta is large with deep pockets, but it’s a model that I believe will be copied and edited based on team/company size.”

Tarun spoke on the atmosphere of the conference: “Although I wasn't able to attend the talks this time around, I learned a lot about the current sentiment of the market by simply listening to the attendees communicate their pain, their triumphs, their areas for improvement within their own organizations and teams, and their optimism for where they believe our industry is heading. 

I think a major focal point right now is hiring SREs. Right now it seems like teams want to grow but they don't necessarily know how to onboard and train these new hires. When a legacy employee leaves the company they also carry with them the tribal knowledge that guided their team's ‘ship’ through the night. We have a chance at being that ‘lighthouse’ to guide these lost SRE teams back to shore and provide structure, tooling, and support for these rapidly growing organizations.”

And finally, our presenter Christina Tan shared her highlights: “The highlights of the conference for me were 1) looking into the audience and seeing them taking out their phones to photograph nearly every slide towards the end of our presentation, 2) seeing the Blameless team in the audience supporting us, and 3) hearing the brilliant questions from the community during and after the talk. The drive in the community to elevate reliability teams from cost centers to top-line contributors is strong, and we hope to see more people do this through shared visibility and open conversations.

Working with Mindy on this talk was a very rewarding experience. We pushed each other to think outside of the boundaries of our worlds in business and SRE. These cross-functional conversations have tremendous potential to drive change in an organization. When we see one another's goals and consider that in the context of what the company as a whole is aiming to achieve, we can better align on metrics of success, headcount decisions, and reliability investments.”

As the world reopens, SREcon 2022 Americas is just the tip of the iceberg for conferences. We’re excited to keep showing up and showing off the next level of reliability software. Check out our upcoming events here!
