Situation Room: On-Call Team Faces Worst Case of Sunday Scaries

Emily Arnott

Picture this: it’s Sunday night. You’re relaxing in bed, in that sweet spot where you’re geared up for Monday, but the fun of the weekend hasn’t yet faded. As you idly scroll through content on your phone, you see a message preview pop up.

It’s to your work email. That’s bad.

It’s from the hosting company you contract. That’s really bad.

They’re saying they accidentally deleted the production database. That’s “jump out of bed” bad.

This is the scenario faced by a Blameless engineer at a previous organization. For this story, let’s call him Joe. Joe worked in backend at a small bootstrapped startup with a total engineering team of a dozen or so. As they had no infrastructure experience, they had contracted out their hosting to a third-party company.

Let’s return to our story. On that fateful night, Joe sees the alert on his phone. Jumping out of bed, he checks his laptop and confirms that customers have completely lost access to the service. The third party wiped out the database powering their entire product.

Joe immediately leaps into action, joining a video call with his boss and CTO. They SSH into the production environment and confirm that the worst has occurred: all of the tables and the data within them are gone. In the MySQL logs, they see that a ‘drop database’ command had previously been issued and replicated to every DB in the cluster. They don’t know why that command was issued, but it doesn’t really matter: they have to fix it.

All is not lost! The engineering team runs nightly backups on their servers. This means they can get the service back online with no data lost. As they watch potential revenue slip away with every second of outage, they wonder: how long will the restoration take? They have no answer. They’ve never tried restoring the database from a backup before.

The answer turns out to be… a very long time. With no in-house infrastructure team and no possible support from the third party, they must rely on one engineer, with little SQL experience, following a runbook written by someone else. The database is around 3 TB after heavy compression. Decompressing and restoring from this backup will take ages, even if all else goes smoothly. They start to pursue other restoration options in parallel, ready to go with whatever gets the service back up first.

But restoring the service is only half the story: the team must also communicate with customers and other stakeholders. They aren’t sure when they’ll be back online. Having to decompress and restore the backups will take the longest, but there’s little certainty that other methods will work. How do you assure stakeholders that the issue will be sorted while setting realistic expectations, and while preserving their support for the company?

Try one thing. Tell the customers, “We have a new idea”. It doesn’t work. Give them the bad news.

Try another thing. “There’s potential? Let’s give it an hour.” Nope.

Here’s something that might work; with a few variants we can try if it doesn’t. Let the customers know we have more ideas. But no. Not this one. And not that one either.

And another. Nope. And another. Still no.

On and on like this, for 24 hours, then 48 hours. The entire engineering team sleeps in shifts, taking only short naps, because at any time, one of their solutions might work. They update their customers every hour. The CTO and CEO are constantly on the phone. When they have a second to spare, they shout over to the engineers for an update. But what is there to update? They don’t know; nobody knows.

As it turns out, the entire disaster started because an engineer at the third party infrastructure company forgot to turn off automatic command replication when dropping a specific database during an update. Wow. It’s good to know, but it doesn’t solve the problem. They must continue the work to restore their service.

It’s the morning of day three. Joe has worked over 40 hours in the last 48. Tempers are running high as sleep is running low. The CTO and CEO want answers. The customers want answers. The engineering team more than anyone wants answers. Then, it ends. Finally, finally, unceremoniously and unsurprisingly, it ends. The long restoration of the backup databases finishes. The data looks fine. The applications are healthy. But our story isn’t over.

The aftermath

With any major incident, the aftermath can be just as impactful as the incident itself. In our story, the CTO was fired. An infrastructure team was established and they switched from the third party hosting company to a major cloud provider. More backend engineers were hired. A new head of engineering was hired, who said he was there to “whip the team into shape”. The CEO stepped down and was replaced by the head of product.

These were major shakeups. The CTO was the only founder to leave the company in its 10-year history. The remaining C-level executives had lost faith in the engineering team. The new head of engineering fostered a toxic work culture in the name of preventing future incidents. Joe felt embarrassed to be involved in such a public incident. Between the trauma of the incident and burning out from the culture of his new boss, he ultimately left the company.

Apart from the technical lessons, there are social and cultural lessons to be learned from this incident. A key problem with how the organization reacted is that it attached blame to particular individuals instead of looking for systemic solutions. Sometimes individuals need to be held accountable, but removing and replacing people isn’t a sufficient solution. Also, disciplining through scolding or authoritarian leadership will not achieve the results you might hope for. In other words, “whipping people into shape” doesn’t “whip incidents away”.

For the startup, it wasn’t all bad. They did make some systemic changes, mostly through the infrastructure team, such as:

  • Reducing the database backup restoration down to 30 minutes with a new strategy
  • Reducing the size of the database from 3 TB to 30 GB by moving the diagram content from the MySQL `diagram` table to S3 buckets (fronted by a thin Diagram Storage service layer)
  • Moving from the third party hosting provider to a major cloud provider
  • Not using a master-master database setup in the new cloud environment
  • Starting the initial work on instrumentation and observability 

These are the changes that actually work to improve the response to future incidents. Not only does this new setup prevent another similar outage, it also ensures that any related incidents that occur will be solved much faster.
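
To see why shrinking the database matters so much for recovery, a back-of-envelope estimate helps. The sketch below is illustrative only: the function name and the 50 MB/s sustained restore throughput are assumptions, not figures from Joe’s team, and the estimate ignores decompression, index rebuilds, and anything else that would add time on top of moving the bytes.

```python
def estimated_restore_minutes(size_gb: float, throughput_mb_per_s: float) -> float:
    """Rough lower bound on restore time: bytes to move divided by sustained throughput.

    Uses decimal units (1 GB = 1000 MB). Real restores are slower than this.
    """
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    size_mb = size_gb * 1000
    return size_mb / throughput_mb_per_s / 60


# Hypothetical sustained restore throughput; measure your own in a drill.
ASSUMED_THROUGHPUT_MB_PER_S = 50.0

if __name__ == "__main__":
    for label, size_gb in [("3 TB database", 3000), ("30 GB database", 30)]:
        minutes = estimated_restore_minutes(size_gb, ASSUMED_THROUGHPUT_MB_PER_S)
        print(f"{label}: about {minutes:.0f} minutes of raw transfer")
```

Under that assumed throughput, 3 TB works out to roughly 1,000 minutes (over 16 hours) of raw transfer, while 30 GB takes about 10 minutes. The exact numbers will differ in any real environment; the point is that restore time scales with data size, so moving bulk content out of the database shortens recovery directly.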

Even more important are the more general lessons they learned. Here are a few that Joe shared:

  • If it can be deleted or undone, someone will eventually do it.
  • If you haven’t run disaster recovery (DR) drills, you don’t know whether your recovery plan works or how long recovery will take.
  • If your DR plan takes too long, then it shouldn’t be considered a functional DR plan.
  • Outsourcing your infrastructure team entirely is a really bad idea, even if you are small.
  • If you think you can get contractors within a short window for an emergency, you are wrong. Any DR plan must assume only current employees.
  • Master-master database replication setups are extremely risky and should be avoided.
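
The lessons about DR drills can be made concrete with a small timing harness. This is a minimal sketch, not a description of what Joe’s team built: `run_restore_drill`, `DrillResult`, and the 30-minute objective in the usage example are all hypothetical, and the `restore` callable stands in for whatever your real restore procedure is.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class DrillResult:
    succeeded: bool          # did the restore procedure report success?
    elapsed_seconds: float   # how long the drill took
    within_objective: bool   # succeeded AND finished inside the time objective


def run_restore_drill(
    restore: Callable[[], bool],
    objective_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> DrillResult:
    """Run a restore procedure once and compare its duration to an objective.

    `restore` is a hypothetical callable that performs the restore against a
    scratch environment and returns True on success. An exception counts as
    a failed drill.
    """
    start = clock()
    try:
        succeeded = bool(restore())
    except Exception:
        succeeded = False
    elapsed = clock() - start
    return DrillResult(succeeded, elapsed, succeeded and elapsed <= objective_seconds)


if __name__ == "__main__":
    # Placeholder restore step; replace with a real restore into a test environment.
    result = run_restore_drill(lambda: True, objective_seconds=30 * 60)
    print(result)
```

Running something like this on a schedule turns “how long will the restoration take?” from an open question during an outage into a number the team already knows, and a drill that fails or overruns the objective becomes a signal to fix the plan before it is needed.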

Joe’s most important lesson: prioritize sleep. “Sleep is the most important resource in a long running outage. Leadership should make sure people get sleep no matter what. In the end no one is dying no matter how severe the incident.”

Incident response is vital to organizations of any size. You can never guarantee that an outage won’t occur. Instead of aiming to prevent any and all incidents from occurring, prepare yourself to respond faster and learn more each time an incident occurs. Blameless can help with runbook documentation and other incident response tools. Find out how by checking out a demo.

If you’d like to have your incident war story featured on the Blameless blog, reach out to