On April 22, 2020 at 11:20 AM PST, Amy Tobey began her talk “The Future of DevOps is Resilience Engineering” at Gremlin’s Failover Conf. The talk used key concepts from DevOps as a way to understand resilience engineering. Amy began by leading the audience through a group breathing exercise of 3 deep breaths, then spoke about the yoga practice of pranayama as a way to understand DevOps. Like pranayama, DevOps is relatively ancient (at least in internet years). With resilience engineering taking root in tech, we now have vast amounts of research to push DevOps forward and to scientifically prove the benefits of DevOps practices. Amy went on to discuss how we can build resilience practices into our DevOps teams by taking stock of the integral human aspect, including topics like common ground, socio-technical systems, cognitive capacity, and cognitive ability.
During her talk, attendees submitted additional questions. These questions and the answers are recorded in the timeline below.
Talk Q&A Timeline
11:44 AM PST:
- Q: “Regarding ‘root cause’ would saying ‘trigger’ be better? Like, given the set of things that resulted in the incident, what tipped it over the edge in this situation?”
- A: The term most of us are using today is ‘contributing factors.’ You can think of it as all the root causes bundled together, all the little things that happened to make an event possible. If a contributing factor is new, and you want to distinguish it from known contributing factors, you can try labeling it as an emergent contributing factor.
11:49 AM PST:
- Q: “What do you think are the biggest challenges to people in DevOps (and broader systems) to embracing resilience engineering principles?”
- A: The main thing is growing our scope. The focus on CI/CD has pushed us forward in terms of bridging Dev and Ops, but there's so much more to do. I feel that resilience engineering is the path to opening those doors: looking deeper at how and why we do things, identifying where we can make systems more robust, and keeping resilience in the conversation at all times. We should also lean away from the focus on creating perfect tech stacks. When we understand the practices and purposes of our work, it becomes much easier to consume those stacks and choose the right solutions for our individual, team, and organizational needs. We can also zoom out. If we can get leaders to look a little more holistically, they'll see that the 10x engineers who are shipping at incredible rates are privileged in many ways. One form of that privilege is being in a position to ship code endlessly without doing maintenance work.
12:00 PM PST:
- Q: “These ideas are so important for us to improve managing the ever-growing complexity in our systems. Do you have any recommendations for bringing these ideas to more traditional IT organizations?”
- A: My approach is to start using the terminology and practices without asking. Start talking about cognitive capacity. It's infectious! Additionally, incident analysis is a good way to begin. It doesn’t require additional tooling, high levels of managerial buy-in, or a lengthy ramp-up time. Instead, you can simply schedule a quick review meeting for customer-facing incidents and invite teammates to join. Once in the meeting, you can work together to conduct a blameless retrospective and fill out the analysis: giving narrative detail, establishing a timeline, and recording any action items that stem from the incident. By doing this, you’re helping your team to grow and learn from each incident.
12:01 PM PST:
- Q: “How do we measure and deal with situations of reduced cognitive capacity? Do we rely on the engineer to know they have reduced capacity (tired, stressed, etc)?”
- A: The answer is almost always rest: vacations, sick days, etc. It's super helpful when engineers are aware of their cognitive capacity. When talking about it is normalized, people become more likely to admit when they're low and have non-ableist language to talk about feeling less smart. To foster this sense of community, you can begin asking teammates, “How is your cognitive capacity today?” rather than “How are you doing today?” I might be fine, but my mind isn't working at full speed, so the answers to those two questions will be very different. It’s also important to empathize. When someone's been working on an incident for even 4 or 5 hours, I think it's fair to say, “There's no way you're operating at nominal cognitive capacity, go take a break and come back in an hour.” This takes honesty with yourself, and psychological safety within teams and organizations.
Action Items (aka Additional Reading)