A Little Resilience Goes A Long Way

Emily Arnott

“Security and reliability are important” - Every Sysadmin… Ever

Let’s call this the mother of all understatements. If you’re reading this blog, there’s a good chance that you:

a.) Agree wholeheartedly with this sentiment and think it should go without saying, AND…

b.) Are surrounded by folks who pay lip service to this idea while not taking it as seriously as they should.

If that description sounds familiar and you’re looking for some tips and tricks to get people onboard and make incremental gains in the face of organizational resistance, this blog is for you.

The practical constraints on engineering are not easy to navigate

We live in a world of compromise and uncertainty, where conflicting priorities can make focusing on security and reliability a challenge. Making security and reliability a shared responsibility across your engineering organization is a step in the right direction, but it comes with costs. Developers asked to carry a pager are particularly susceptible to the costs of context switching. Given the opportunity, these folks would commit 100% of their focus to expanding and innovating their products, and rightly so. Innovation is hard, but the rewards for teams that outpace their competitors are considerable.

The smart money establishes dedicated disciplines within the engineering organization, such as SRE, DevSecOps, Platform, or infra engineering, to support and supplement software development. Still, it’s all too common for these practitioners to end up spread across more competing demands than they can effectively manage. This naturally leads to ad-hoc, reactive solutions that quickly become a drain on overall productivity.

This can happen more easily than you think. Even when you know it’s the right thing to do, convincing the organization to invest time in proactive reliability and security measures isn’t always easy. The time, money, and engineers required to implement or improve systems come with an opportunity cost, and it can be hard to convince senior stakeholders that “slowing down to speed up” is in their long-term interest.

Unfortunately, unless we all get better at making the argument, more resources aren’t likely to be on the way any time soon. One recent study found almost 50% of organizations are planning on reducing cybersecurity headcounts, despite the frequency and severity of cybersecurity incidents increasing. Even if you’re not feeling overwhelmed yet, it may be time to prepare for when you could be.

Why proactive planning is even more important in lean times

Don’t worry, it’s not all doom and gloom. You can start to lighten the load on practitioners even if your leadership won't commit more resources. It may seem counterintuitive to be concerned about process and planning when things are on fire, but it’s actually the most important time to care. Building and evolving an incident management process is the best way to quickly improve your situation.

Dealing with incidents in isolation, and scrambling to come up with a solution each time, leads to problems recurring and worsening. When your resources are spread thin, you’re at risk of hitting a tipping point. Something could break so badly, or in some weird new way, or too quickly after another incident, that a more catastrophic result could occur. Maybe that catastrophe looks like customers churning, service level agreements being violated, engineers burning out or quitting, or your public reception plummeting. The fewer resources you have, the more potential every incident has to bring you to this disastrous point. It’s like putting fires out in a room full of dynamite. Treating every incident like an opportunity to interrogate your process, and achieve some iterative improvement, can quickly alleviate the pressure on your whole team.

Where to invest when you have little to invest

To start preventing fires and installing sprinkler systems, so to speak, you have to start investing in process. It can be tough to know what this should look like. A lot of guides to SRE assume that you’re a big organization, capable of deploying resources like full-time engineers and bespoke tools to solve these problems. Even if you are a big org, you may not be able to prioritize these big resource investments right away, or scale up to them quickly enough when dealing with incidents.

Here are some strategies to employ that require few resources, but can have great rewards:

Logging your incident resolution steps

The first thing to implement is tracking what you do while you solve an incident. This doesn’t need to be elaborate, and ideally it is quick and unintrusive. Noting each thing you try as you try it adds only a few seconds, but it can form a valuable document for subsequent incidents. Make sure these notes end up somewhere searchable by anyone in the organization. When your feet are under you again and you have the time, you can review and expand these notes, turning them into runbooks for incident types.
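As a rough illustration of how lightweight this can be, here is a minimal Python sketch of a timestamped incident log that is saved as JSON and can be searched by keyword. The names IncidentLog, note, save, and search_logs are hypothetical, invented for this example; they are not part of Blameless or any other tool, and a shared doc or chat thread would serve the same purpose.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class IncidentLog:
    """Hypothetical helper: a timestamped list of notes for one incident."""
    incident_id: str
    entries: list = field(default_factory=list)

    def note(self, text, author="unknown"):
        """Record one step as it is tried, with a UTC timestamp."""
        entry = {
            "time": datetime.now(timezone.utc).isoformat(),
            "author": author,
            "text": text,
        }
        self.entries.append(entry)
        return entry

    def save(self, directory):
        """Write the log as JSON so it can be found and searched later."""
        path = Path(directory) / f"{self.incident_id}.json"
        path.write_text(json.dumps(asdict(self), indent=2))
        return path


def search_logs(directory, keyword):
    """Return ids of saved incidents whose notes mention the keyword."""
    matches = []
    for path in sorted(Path(directory).glob("*.json")):
        data = json.loads(path.read_text())
        if any(keyword.lower() in e["text"].lower() for e in data["entries"]):
            matches.append(data["incident_id"])
    return matches
```

During an incident, a responder would call note() after each attempted step and save() once service is restored. The saved files are the raw material that can later be expanded into runbooks.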

Communicate more during resolution

Communicating during the incident both motivates and improves your logging and tracking. As soon as you recognize an incident has occurred, set up a new communication channel for that incident in whatever software your engineers use. Ping a more general channel and provide a link to get more people involved. Then, as you diagnose and solve, keep the channel updated.

This won’t be a perfect process right away. There may be too many people getting involved, leading to redundant work and distractions. People may not have all the info they need to contribute. But as you work through incidents and build more processes, this communication channel will become more and more valuable. Eventually, it can guide incident responders itself.

Run a quick retrospective after each incident

When your service is restored, you’re likely to be eager to return to whatever other pressing tasks you were taken away from. In the long run, though, you’ll save time by thinking more about the incident before moving on. What caused the incident? What caused the causes of the incident? How likely is it to recur? How much damage did it really do? What could be done to prevent it from recurring? Meet with your peers to discuss these questions.

As you run more retrospectives, you can build templates to ensure all the most important information is captured. They can be combined with the aforementioned tracking documents to make a knowledge base and guide for future incidents.
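A retrospective template can start as nothing more than the questions above in a fixed order. The Python sketch below is one hypothetical way to do that: it creates a blank form, reports which questions are still unanswered, and renders the result as plain text. The function names are assumptions made for this example, not an existing tool’s API.

```python
# The five questions suggested above, kept in a fixed order.
RETRO_QUESTIONS = [
    "What caused the incident?",
    "What caused the causes of the incident?",
    "How likely is it to recur?",
    "How much damage did it really do?",
    "What could be done to prevent it from recurring?",
]


def blank_retrospective():
    """Return a form mapping each question to an empty answer."""
    return {question: "" for question in RETRO_QUESTIONS}


def unanswered(retro):
    """List the questions that still have no answer."""
    return [q for q in RETRO_QUESTIONS if not retro.get(q, "").strip()]


def render(incident_id, retro):
    """Format the retrospective as plain text for a shared knowledge base."""
    lines = [f"Retrospective for {incident_id}", ""]
    for question in RETRO_QUESTIONS:
        answer = retro.get(question, "").strip() or "(unanswered)"
        lines.append(question)
        lines.append(f"  {answer}")
    return "\n".join(lines)
```

Checking unanswered() before closing out an incident is a cheap way to make sure no question gets skipped when everyone is eager to move on.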

The most important thing to remember: self-compassion

Although trying your best to meet these goals is important, remember that no one is perfect. No matter what standards you set for yourself, there are always going to be new situations outside of your expectations. Don’t think of these unprecedented incidents as failures. Not only is that demoralizing, it reinforces the idea that the answer is simply being perfect from then on. It makes your way of thinking rigid and distracts you from opportunities for improvement.

It’s important to extend this self-compassion even further, though. It’s one thing to say “my systems aren’t perfect, so I won’t be frustrated when they break”. But the reality is that if you’re stretched thin and putting out fires, you will feel frustrated. It’s just being human. And knowing that being frustrated is at odds with improvement will likely just make you feel more frustrated. That’s OK. As long as you keep striving towards acceptance and improvement, you will make progress.

Investing in tooling is the best force multiplier

Although it may seem counterintuitive when you aren’t working with a lot of money, putting some of your budget into tooling can quickly pay for itself. Investing in tooling, such as Blameless, can automate the tips we’ve recommended above. You’ll have a strong foundation for your processes and a guide for future improvement.

When you really do it right, building incident management processes and improving reliability and security is a full-time job – if not more. Tooling allows you to bridge that need, cheaper and quicker. Check out a Blameless demo to see how we can help you out, wherever you’re starting from.
