On April 21, 2020, thousands of industry professionals came together virtually to attend a revolutionary conference, Gremlin’s Failover Conf. With dozens of events cancelled, social distancing policies in place, and stress heightened by the current crisis, it was more necessary than ever to take a moment to learn, share, and talk to one another about something we are all passionate about. We loved the experience at Failover Conf and want to share some of our favorite parts with you.
Tammy Butow’s keynote
A good conference begins with a good keynote, and Gremlin’s Tammy Butow absolutely delivered. Tammy drew on her personal experience with reliability engineering, beginning as a 20-year-old in the industry struggling with her organization’s reliability issues, which resulted in an ATM error that gave out free hundred-dollar bills. Tammy kept the audience engaged with a Twitch-esque presentation, responding to attendees on Slack while she talked! She was so responsive, and huge props to her multitasking skills.
Tammy left the audience with some great sound bites, such as “Being an SRE is like being a detective,” as well as ten awesome tips for enhancing your organization’s reliability:
Remember that reliability is feature zero.
Identify your top 5 critical systems.
Create a map of the key players who are critical to improving reliability.
Tell everyone (including the business and product teams) the critical issues you need to fix in the next 3 months, and track your progress toward these milestones.
Prep before disaster recovery day or before disaster strikes.
Start small and gradually expand the blast radius.
Gradually roll attacks out: shutdown, CPU, packet loss, latency…
Practice Chaos Engineering on your automation.
Don’t wait to see how your system handles failure; be proactive.
Share your progress and results.
What an awesome start to the conference!
Our favorite talks
Don’t get us wrong, we loved all the talks. Each one was unique, informative, thoughtful, and engaging. But there were a few that stood out. Here are three that we found particularly interesting, as well as some key takeaways.
Jennifer Petoff’s “Why Training Matters for an SRE Practice”
Jennifer Petoff’s talk on training addressed an issue that most organizations, regardless of their maturity, have to reckon with. Training is crucial for the upward trajectory of a person’s career, their comfort and trust with their organization, and overall job satisfaction. Yet many organizations have difficulty training their employees well.
Jennifer began her talk by explaining exactly why training is so crucial. As she stated, training helps build confidence, fights imposter syndrome, and drives the desired organizational culture. There are differing levels of training, ranging from low-effort to high-effort.
Higher-effort training methods are highly recommended, as they demonstrate leadership commitment to development and reinforce desired behaviors in trainees. However, attaining this quality of training can be difficult, so Jennifer proposed what low-maturity and high-maturity organizations can each do to make their training process as effective as possible. Low-maturity teams can start by addressing skill gaps in processes or tooling. They can also work on gaining a deeper knowledge of their teams in order to tailor training messaging. High-maturity teams can assess the team mix: how many newbies, internal transfers, old-timers, and industry veterans do you have? Are the ratios equal?
Jennifer addresses a crucial cultural issue that many organizations grapple with each day. For its applicability, humanity, and thoughtfulness, we enjoyed this talk.
Marco Coulter’s “Slowdown is the New Outage”
Marco Coulter’s talk addressed a subtle and silent issue: slowdowns. Compared to outages, slowdowns seem to be of lower concern. Marco began his talk by polling the audience on how they would react to a lengthy outage versus a minor slowdown, and the results were interesting: 48% would immediately start a war room for a major outage, while 74% would simply raise a JIRA ticket for a minor slowdown.
While the outage seemed like the bigger issue, Marco demonstrated how a slowdown is just as painful to customers and eventually just as costly to a business as a complete outage. As he put it, “You will lose the customer and not even realize it.”
However, slowdowns can be combated with great observability into your systems. He reminded attendees that the metrics you aren’t watching will be the ones that sneak up and bite you. Marco left us with a list of functions needed for insight into observability problems:
Baseline of normal performance
Segmented metrics of customer business transactions
Levers for code isolation
Cross-silo metric observability
ML-trained noise filtering and anomaly detection
By building these into your systems, you can make sure that your slowdowns don’t build up and cost you customers. For this eye-opening take on slowdowns, the interactive nature of the polls, and the candid quotes, we added Marco’s talk to our favorites list.
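To make the first item on that list a little more concrete, here is a minimal Python sketch of comparing current latency against a baseline of normal performance. This is our own illustration, not something shown in Marco’s talk: the function names, the 1.5x tolerance, and the sample latencies are all assumptions chosen for the example.

```python
from statistics import quantiles

def p95(samples):
    """Estimate the 95th percentile of a list of latency samples (ms)."""
    # quantiles(..., n=20) returns 19 cut points; the last one estimates p95.
    return quantiles(samples, n=20)[-1]

def is_slowdown(baseline_ms, current_ms, tolerance=1.5):
    """Flag a slowdown when current p95 exceeds baseline p95 by `tolerance` times.

    The 1.5x tolerance is an arbitrary illustrative threshold.
    """
    return p95(current_ms) > tolerance * p95(baseline_ms)

# Hypothetical latency samples in milliseconds.
baseline = [100, 105, 98, 110, 102, 99, 104, 101, 107, 103]
normal = [101, 99, 108, 104, 100, 106, 97, 103, 105, 102]
degraded = [150, 180, 160, 210, 175, 190, 165, 200, 185, 170]

print(is_slowdown(baseline, normal))
print(is_slowdown(baseline, degraded))
```

Nothing here is down in the outage sense, which is exactly the point: a check like this only fires if someone has recorded what normal looks like first.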
Amy Tobey’s “The Future of DevOps is Resilience Engineering”
Last but certainly not least is our own Amy Tobey’s talk, which summed up key concepts from resilience engineering as a way to understand DevOps. Amy began her talk by having the audience participate in a group breathing exercise, taking 3 deep breaths, before drawing a comparison between DevOps and the yoga practice of pranayama. Like pranayama, DevOps is relatively ancient (at least in terms of internet years). However, prior to resilience engineering, there was no way to scientifically prove the benefits of DevOps practices. Amy went on to discuss how we can build resilience practices into our DevOps teams by taking stock of the integral human aspect.
Amy then turned to socio-technical systems, or systems where humans and tech must work together. In these systems, it’s important to note human limitations. She talked about the power of creating common ground, and how important cognitive capacity is to individual productivity. To increase cognitive capacity, we need to design things that eliminate toil.
When we work to increase cognitive capacity, cognitive adaptability also increases. Imagine an on-call scenario where an engineer needs to respond to a high-severity incident. If the engineer has higher cognitive capacity thanks to runbooks and automation, they’ll have room for adaptability, allowing them to think on their feet and solve the issue faster.
Amy also posits that there is no such thing as human error or root causes. Rather, incidents are caused by multi-layered problems within systems. As people, it’s our job to look at those systems and solve the issues that lead to incidents occurring. Lastly, Amy says that it’s important to learn from failure, but it’s equally important to learn from success.
We loved this talk’s timely nature, compassionate approach, and breadth of knowledge. Here are the resources that Amy shared if you want to learn more:
We all come to conferences for the talks, but we stay for the hallway track, socialization, and sense of community. Though this conference was held remotely, we certainly had our share of fun. Here are some of the moments we liked best, plus our thoughts on the future of virtual conferences.
“The metric you are not watching will get you.” -Marco Coulter
“Disasters are inevitable. It’s never just one thing that makes a disaster—it’s all the straws on the camel’s back.” -Heidi Waterhouse
“I just want to remind all of you that it’s the people that make it possible to adapt to change. Automation isn’t going to do it all, it’s not going to come from an executive mandate—it’s you, it’s me, it’s all of us resilienc-ing our way through this together.” -J Paul Reed
Love for the hallway track
Gremlin gave us everything we wanted socially from a conference, sans the free T-shirts. We had spaces for open discussion with different channels in Slack. We had a place to go in search of awesome tools to help us stay more reliable. We had happy hours, and polls, and Q&A!
And attendance reflected all these amazing attributes. The conference was buzzing all day. With increased accessibility, no need for travel or board expenses, and the ability to tune in while wearing PJs, more people were able to join from all over the world. Some were listening to a talk while having lunch, others while winding down at the end of the night before bed. Other people mentioned enjoying the virtual setup more, as it was a friendlier environment for introverts.
We’d like to give huge kudos to our friends at Gremlin for making this day one to remember, and for giving people a safe space to come together during a difficult time.