
Weathering Black Friday and Other Storms Reliably

Emily Arnott

If you work in eCommerce, you can see the storm on the horizon. Black Friday, the biggest shopping day of the year both online and off, is only a few days away. Your services are going to hit usage spikes you may never have seen before. And every aspect of your services will be pushed to its limit at the same time: people won’t just be searching, or just buying, or just signing up for programs. They’ll be doing all of these at once.

Most crucially, everyone else is offering deals too. If your service is laggy, let alone crashing, people won’t wait around for it to get fixed. They’ll be off to your competitors before you can blink. Keeping your services operating smoothly is truly make or break on Black Friday.

Even if you aren’t in eCommerce, you’ll inevitably experience waves of usage, and sometimes storms. Feeling confident enough to tackle them head-on is important for keeping your services running and your customers happy.

Make a game plan – and test it

Think about what could happen when you hit a big usage spike. Which parts of your services will become bottlenecks because of their speed? What will happen if your servers hit their traffic capacity? Try to think holistically and comprehensively. To whatever extent you can, simulate a huge usage spike and see what happens.
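Before you run a real load test, a toy model of traffic against capacity can help you reason about what a spike does. The sketch below is only an illustration: the function name `simulate_backlog`, the per-minute request counts, and the fixed capacity are made-up assumptions, not measurements from any real service. It shows how unserved requests pile up once arrivals outpace what you can handle, and how long the backlog lingers after the spike passes.

```python
def simulate_backlog(arrivals_per_minute: list[int], capacity_per_minute: int) -> list[int]:
    """Return the number of requests still waiting at the end of each minute.

    Toy model: anything the service cannot handle in a minute carries over
    to the next one. Real systems also time out, retry, and shed load.
    """
    backlog = 0
    history = []
    for arrivals in arrivals_per_minute:
        # Serve up to capacity; whatever is left over waits for the next minute.
        backlog = max(0, backlog + arrivals - capacity_per_minute)
        history.append(backlog)
    return history


# Hypothetical traffic: a quiet start, a two-minute spike, then a cooldown.
traffic = [80, 100, 250, 300, 120, 60]
history = simulate_backlog(traffic, capacity_per_minute=150)
print("Backlog per minute:", history)
print("Worst backlog:", max(history))
```

In this made-up run the spike lasts two minutes, but the backlog is still there when the simulation ends. That is the kind of behavior worth confirming with a real test against your own services.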

Once you have a list of things that could occur, make a plan that addresses each of them. There likely won’t be one single magical solution that takes care of everything, so instead segment out the necessary tasks based on who would be completing them. Come up with roles for each set of tasks and make checklists.

Don’t assume that your plan will execute flawlessly. Black Friday falls right after Thanksgiving, when many people will be away from their computers, and anyone may be called away unexpectedly. When looking at your roles and checklists, make sure you have a backup person, and a backup person for your backup person, in mind for each one. This may require some proactive knowledge transfer so that everyone is capable of handling the tasks that might fall to them.
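If your roles live in a simple roster, you can check this kind of coverage mechanically. The following is a minimal sketch under stated assumptions: the role names, people, and the `coverage_gaps` helper are all hypothetical, and "three deep" is just one reading of "a backup for your backup".

```python
def coverage_gaps(
    roster: dict[str, list[str]],
    unavailable: frozenset[str] = frozenset(),
    required_depth: int = 3,
) -> dict[str, int]:
    """Map each under-covered role to the number of distinct people still available for it."""
    gaps = {}
    for role, people in roster.items():
        # dict.fromkeys drops duplicate names while keeping the listed order.
        available = [p for p in dict.fromkeys(people) if p not in unavailable]
        if len(available) < required_depth:
            gaps[role] = len(available)
    return gaps


# Hypothetical roster: primary first, then backups.
roster = {
    "incident commander": ["Ana", "Ben", "Chloe"],
    "communications lead": ["Dev", "Ana"],
    "database on-call": ["Eli", "Fay", "Ben"],
}

# Which roles lack a primary, a backup, and a backup's backup?
print(coverage_gaps(roster))

# What if Dev and Ana are both away? Which roles have nobody left?
print(coverage_gaps(roster, unavailable=frozenset({"Dev", "Ana"}), required_depth=1))
```

Running the second check with different sets of absent people is a quick way to find where knowledge transfer is most needed.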

Stick to reasonable standards

In reliability, perfect is the enemy of good enough. Unfortunately, you won’t have a perfect Black Friday. You’ll likely have some sessions go poorly, some customers left dissatisfied, and some services running slowly. Trying to fix absolutely every problem that crops up will have you panicked, worn out, and unable to address major incidents when they occur.

Instead, set a standard for your services based on the outcomes you want to see. Work holistically across your organization to decide what a successful Black Friday would look like – how many sales, how many visits, how many new customers. Then figure out what percentage of sessions would have to be successful (fast enough, consistent enough, and error-free enough for the user to complete their objective) to meet those goals.
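As a rough illustration of that last step, here is a small sketch that turns a sales goal into a required session success rate. Every number in it is invented, and it rests on a simplifying assumption that is not from any real dataset: only successful sessions can convert, and they convert at a fixed rate. The function name `required_success_rate` is likewise hypothetical.

```python
def required_success_rate(
    sales_goal: int, expected_sessions: int, conversion_rate: float
) -> float:
    """Fraction of sessions that must succeed for the sales goal to be reachable.

    Assumes only successful sessions convert, at a fixed conversion_rate.
    A result above 1 means the goal is out of reach at the expected traffic.
    """
    if sales_goal < 0 or expected_sessions <= 0 or not 0 < conversion_rate <= 1:
        raise ValueError("expected_sessions must be positive and conversion_rate in (0, 1]")
    successful_sessions_needed = sales_goal / conversion_rate
    return successful_sessions_needed / expected_sessions


# Made-up goals: 28,500 sales from 1,000,000 expected sessions at 3% conversion.
rate = required_success_rate(
    sales_goal=28_500, expected_sessions=1_000_000, conversion_rate=0.03
)
if rate > 1:
    print("Goal is out of reach even if every session succeeds")
else:
    print(f"At least {rate:.1%} of sessions need to succeed")
```

With these invented inputs the target works out to roughly 95% of sessions. Your own inputs will differ, and you would likely repeat the calculation for each goal (sales, visits, new customers) and keep the strictest result.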

Once you have these objectives in mind, find a way to track them automatically as a service level objective. This will show you how much breathing room you have before you’re tracking below your goals. Triaging problems as they emerge based on how much they impact the SLO will keep you focused on big fires and not panicking about lit matches. Plus, when you approach an SLO breach, you can have policies in place to throw on the brakes, call in help, and make sure you stay OK.
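To make the "breathing room" idea concrete, here is a minimal sketch of an error budget check with a simple policy and triage step. It is not how any particular SLO tool works. The 95% target, the session counts, the 25% warning threshold, and the function names are all assumptions chosen for illustration.

```python
def budget_remaining(slo_target: float, expected_sessions: int, failed_sessions: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    allowed_failures = (1 - slo_target) * expected_sessions
    if allowed_failures <= 0:
        raise ValueError("slo_target must be below 1 and expected_sessions above 0")
    return 1 - failed_sessions / allowed_failures


def policy(remaining: float) -> str:
    """Pick a response based on how much budget is left (thresholds are assumptions)."""
    if remaining <= 0:
        return "SLO breached: freeze risky changes and escalate"
    if remaining <= 0.25:
        return "budget nearly spent: call in extra help"
    return "within budget: keep monitoring"


def triage(failed_sessions_by_problem: dict[str, int]) -> list[str]:
    """Order open problems by how many failed sessions each one is causing."""
    return sorted(
        failed_sessions_by_problem, key=failed_sessions_by_problem.get, reverse=True
    )


# Hypothetical snapshot partway through the day.
problems = {"slow search": 1_200, "checkout errors": 18_000, "broken footer link": 40}
remaining = budget_remaining(
    slo_target=0.95, expected_sessions=1_000_000, failed_sessions=sum(problems.values())
)
print(f"Error budget remaining: {remaining:.0%}")
print("Policy:", policy(remaining))
print("Fix first:", triage(problems))
```

In this made-up snapshot, the checkout errors account for nearly all of the spent budget, so they are the big fire. The broken footer link is a lit match that can wait.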

Proactively build sociotechnical buffers

When you expect increased usage, you can take proactive steps to handle it technically. Maybe this looks like upgrading your cloud hosting plan, eliminating common bottlenecks in your service, or scheduling additional on-call engineers. This technical preparation is usually standard operating procedure, and it is necessary. But it isn’t sufficient to truly weather a storm like Black Friday.

You have to build sociotechnical buffers to make a truly resilient system. Think about the potential stress and workload of your on-call engineers. How can you proactively build a “buffer” of emotional energy for them so they don’t burn out? Maybe it looks like restructuring their workflow in the days leading up to Black Friday so they aren’t overworked and can focus on the upcoming tasks. Maybe it looks like offering other compensation to show your appreciation for the effort they put in. Work with your teams to determine what would be most useful to allow them to rise to the challenge of Black Friday.

Blameless can help!

Blameless incident management can help orchestrate your plan, with role-based checklists, reminders, and automatic communication. Plus, our SLO manager makes sure you’re sailing comfortably within your expectations. See how we do all of this and more by signing up for a demo!
