Burnout is real. Today, the source of burnout can be anything from pandemic fatigue to the onslaught of political divisiveness to the sheer pace of life worldwide. Whatever the culprit, we’re living in a stressful time. People working in cloud native environments feel it acutely. Silicon Valley investor Marc Andreessen famously said, “Software is eating the world,” and that has proven true. High demand is fueling churn. System and cloud operators feel pressure. Developers are stuck with unsupported and unfunded projects. Even open source maintainers face unsustainable review loads. We care deeply about what we do, and that makes it doubly upsetting.
The end-user experience sits at the forefront of our minds as software engineers, site reliability engineers, product managers, and business owners. Most fundamental to that experience is the reliability of our applications. If a user cannot access my app, then the rest of the experience does not matter. So how do we deliver service reliability amidst these challenges?
The Impact of Team Burnout
Burnt-out people make more mistakes, are less satisfied with their own work, and are generally not much fun to be around. They’re also likely to quit in search of greener pastures. As the economy rebounds, there are plenty of job opportunities. Lots of things happen when someone leaves an organization, both good and bad. On the positive side, workers progress to new opportunities and organizations welcome new talent with fresh perspectives. On the downside, it’s often the case that when employees leave, they do so before a replacement is hired. That means all their soft skills and tribal knowledge “walk out the door” before there’s a chance to train the next person.
Maybe this is you. You’re burnt out, or you manage a team that’s burnt out. Hopefully, since you’re reading this, you want to fix the problem. Let’s take a minute to understand why it’s happening. We’ll focus on the example of working in cloud native and explore a key strategy that can help combat burnout, or at least manage through it.
Constant Change and Continuous Learning
By design, containers and Kubernetes enable teams to deliver value faster by means of standard tooling, packaging, and deployment models that reduce friction when deploying into production. The focus is automation, and the goal is increased speed. In fact, the pace has accelerated dramatically for teams leveraging Kubernetes and its related tools, whose own release cadences are also accelerating. Unfortunately, things begin to break down when parts of a system change. New versions — not just of Kubernetes, service meshes, network proxies, and ingresses, but also of container runtimes (containerd, runc, etc.) and the Linux kernel — all bring churn. No matter how much testing you build into the process, though, you can’t test every corner case.
Without support, and without teams resourced adequately to orchestrate, run, and manage containerized applications, any initiative to reduce burnout is likely to go south, or at least a bit sideways. Investing time in training, documentation, and team cross-training is a must, especially given the constant change inherent to cloud native.
It’s true that most engineers and ops teams are self-taught, especially when it comes to new tools. They rely on docs, forums, and samples to gain proficiency. Experiential learning is common, and many engineers have enough context to just dive in. However, once you start running in production and face scale and performance demands, you must assess whether your systems, tools, and resources are adequate to the task. This is when your people need all the support they can get to head off burnout.
If a team is ripe for burnout, you need to look for the signs by asking the right questions. Discover where there is ambiguity, where process gaps exist, and where communication could improve. It’s important to do this before it’s too late and churn kicks you in the (you know the rest).
Managing Possible Burnout through Checklists
So how do we make our cloud native systems reliable? Most of us working in cloud native already appreciate the importance of site reliability. But while many books have been written on the topic since Google founded its SRE practice in 2003 (and published its Site Reliability Engineering book in 2016), in reality it’s not so easy to pull off. It’s especially daunting for smaller, younger organizations that are even more resource-constrained than larger, more mature teams. The real question is how much we can tolerate, over what timeframe, without running into user issues and business trouble.
What we need to do is start with resilience: the capacity to recover when things go wrong. This concept applies both to team resilience and to system-level resilience. Reliability is about focusing on the end-user experience and then setting a budget threshold that balances what users will accept against the business goals for investment: Service Level Objectives (SLOs).
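To make the idea of a budget threshold concrete, here is a minimal sketch in Python that translates an SLO target into an error budget. The 99.9% target and 30-day window are illustrative assumptions, not a recommendation:

```python
# Illustrative sketch: converting an availability SLO into an error budget.
# The target (99.9%) and window (30 days) below are example values only.

def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a given SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)

# "Three nines" over 30 days leaves roughly 43 minutes of budget.
budget = error_budget_minutes(0.999)
print(f"Allowed downtime: {budget:.1f} minutes per 30 days")
```

Framing reliability this way turns an abstract goal ("be reliable") into a number the team can spend deliberately, which is exactly the balance between user expectations and business investment that an SLO encodes.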
Service Level Objectives (SLOs) represent the degree to which we should invest in system-level resilience, which will vary over time and by service. As we’ve already discussed, things fail amid rapid change no matter how much you invest in the bits. And when they do, the user experience depends on how resilient the other, often underappreciated, part of the system is: the human system. So how do we build resilience in the people? How can we do this when they are burnt out? Atul Gawande studied this question extensively in highly complex and stressful environments and, in The Checklist Manifesto, discovered that a simple tool is key: checklists.
Checklists codify known methods of resolving issues and can be improved over time. Introducing checklists reduces errors in high-stress situations, such as a Sev1 incident, by reducing cognitive load: individuals don’t have to rely on memory alone when under stress. Checklists also improve our ability to understand, analyze, and learn from crises by capturing a common set of attributes (what happened, when, and by whom) across multiple incidents. And finally, living, actively updated checklists reduce the cost of onboarding new team members, which makes it easier for others to move on and brings new perspectives to the wider team.
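As a sketch of how a “living” checklist carrying those common attributes (what happened, when, and by whom) might be represented, here is a minimal Python illustration. The step names and fields are hypothetical, not drawn from any particular tool:

```python
# Minimal sketch of a "living" incident checklist: steps are codified,
# completion is recorded with who/when, and the list can evolve over time.
# All step descriptions and field names are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

@dataclass
class ChecklistStep:
    description: str
    done_by: Optional[str] = None
    done_at: Optional[datetime] = None

    def complete(self, who: str) -> None:
        """Record who finished this step and when (UTC)."""
        self.done_by = who
        self.done_at = datetime.now(timezone.utc)

@dataclass
class IncidentChecklist:
    name: str
    steps: List[ChecklistStep] = field(default_factory=list)

    def remaining(self) -> List[str]:
        """Steps not yet completed, so responders never rely on memory alone."""
        return [s.description for s in self.steps if s.done_at is None]

# Hypothetical Sev1 checklist
sev1 = IncidentChecklist("Sev1 response", [
    ChecklistStep("Page the on-call engineer"),
    ChecklistStep("Open an incident channel"),
    ChecklistStep("Post a status-page update"),
])
sev1.steps[0].complete("alice")
print(sev1.remaining())
```

Because each completed step records who did it and when, the same structure that guides responders during the incident becomes the raw material for the post-incident review afterward.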
I hope you will consider these steps of asking questions, establishing SLOs, and developing checklists as you explore new ways to manage the stress and burnout that come with an environment of quick and constant change driven by growing technical innovation. This is why I’m happy to be part of the team at Blameless. I’m not just building something for SREs, I’m also encouraging them to live out a culture of Blamelessness so that we can all provide value to our customers and continue to do what we love.