
The Reverse Red Herring

During an incident, time is elastic. At points it seems to go way too fast, and at others it feels like an eternity for a command to complete. More important, though, is how it feels to be in an incident. It’s a heightened state of being, where any and every piece of information could be “the one” that cracks open what is really going on. At the same time, there is an inherent distrust of incoming information. That distrust is necessary to keep the facts you do trust closely guarded. You find yourself only letting in things that have been thoroughly verified and that jibe with the current set of trusted facts in a meaningful way (that fit your mental picture of the incident).

However, there is a third kind of information that is often overlooked: the “Reverse Red Herring.” Everyone knows about red herrings – a piece of information that seems so compelling and believable, yet in the end turns out to be completely irrelevant. The Reverse Red Herring is the opposite – a piece of information that is discarded early on in an incident, and later turns out to be the very key that cracks the whole case. More importantly, with regard to time: once discarded, this piece of information is no longer a focus, so it seemingly comes “back through time” from when you first discovered it to instantly become a focus again – if you are lucky.

Some time ago, I was a Director of Engineering at a company that made laptop security software (client software) for Windows and Mac. One particular morning, we had pushed out a garden-variety update to all endpoints. I ended up in meetings for the entire morning and was largely absent from the Engineering floor.

Right after lunch, I noticed we had an open incident running in the outage channel. Our web cluster had hit its high-water mark on requests and was setting off alerts to the Ops team. We had recently put in high-water marks for various conditions so we could start debugging before the infrastructure actually went down.
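To make that idea concrete, here is a minimal sketch of a high-water-mark check: warn well below the point where the cluster actually falls over, so there is still time to debug. The numbers, metric, and the page_ops helper are all hypothetical, not our actual setup.

```python
# Hypothetical high-water-mark check: alert well before the cluster's real
# capacity so there is time to debug while the service is still up.
REQUESTS_PER_SEC_CAPACITY = 10_000                    # rough sustainable load
HIGH_WATER_MARK = 0.7 * REQUESTS_PER_SEC_CAPACITY     # warn at 70% of capacity


def check_request_rate(current_rps: float) -> None:
    """Page the Ops team when traffic crosses the high-water mark."""
    if current_rps >= REQUESTS_PER_SEC_CAPACITY:
        page_ops(f"CRITICAL: web cluster saturated at {current_rps:.0f} req/s")
    elif current_rps >= HIGH_WATER_MARK:
        page_ops(f"WARNING: high-water mark crossed at {current_rps:.0f} req/s")


def page_ops(message: str) -> None:
    # Stand-in for whatever alerting integration is in place (PagerDuty, Slack, ...).
    print(message)
```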

As we waited for the team to diagnose which requests were causing the alerts, there was wild speculation in the war room as to what the root cause could be. Was it a DDoS? The databases seemed to be fine, so whatever was being requested wasn’t even hitting them. Whatever was being requested seemed to be of very low value and wasn’t affecting our main service, other than by oversubscribing the infrastructure. Someone said, “Could the laptop software update be to blame? It’s the only thing that changed, just saying,” but everyone disregarded this as totally unrelated and a distraction. Of course it couldn’t be; it was just a code change to the endpoint software itself and had nothing to do with our web servers.

We soon figured out that the majority of the traffic was resulting in 500 errors (internal server error), and it looked like the requests were coming in from all over the world. Certainly this was a DDoS, and the next step was to mitigate the type of request that was causing it.

As we investigated what was being requested so we could block or cache it, it became clear that all the traffic was going to the same destination API: the configuration API for our endpoint software. Clearly someone had discovered that the API wasn’t rate limited (we would have had to constantly push the limit up as we added more customers) and was pounding it to DDoS us. The requests seemed well formed, and we couldn’t block or cache the traffic because our endpoint software would no longer be able to update.

We were stuck.

We continued to propose solutions for an hour or so, but it felt like a whole day as everyone watched the graphs on the TVs in the Engineering department go up and to the right with 500 errors. The web infrastructure was well built and was holding up, but the end was near, and we had to figure out what was going on. My role was to communicate up and out from Engineering, and needing to send an update every 20 minutes was excruciating. Trying to characterize a situation without knowing what is causing it is not an easy task.

Who had enough distributed boxes to do such a thing? Why were they interested in taking down our web application when it only served configuration and didn’t affect our main service? They must have had our endpoint software and analyzed its behavior to craft the URL they were using. So who were they?

They were us.

We all knew we had pushed a software update to the endpoint software, but it sat in the back of our minds as a useless fact that was completely orthogonal to our current situation. However, the change that had gone out resulted in a malformed ID in the URL the endpoint software used to “radio home” and get its configuration. The malformed request caused the API to choke and throw a 500. The endpoint software was written to request more frequently if it received an error response, so it wouldn’t be left with a stale configuration. We were DDoS-ing ourselves.
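To make the failure mode concrete, here is a rough sketch of the difference between a client that retries faster on errors (roughly the behavior our endpoint had) and one that backs off with jitter. This is not our actual client code; the function names and intervals are made up for illustration.

```python
import random
import time


def fetch_config_retry_storm(fetch) -> None:
    """Anti-pattern: retry *faster* on errors so the config never goes stale.
    Every failing client multiplies the load on an already failing API."""
    interval = 60.0                       # normal polling interval, in seconds
    while True:                           # polling clients run forever
        ok = fetch()                      # True on success, False on a 500
        interval = 60.0 if ok else max(1.0, interval / 2)   # errors shrink the wait
        time.sleep(interval)


def fetch_config_with_backoff(fetch) -> None:
    """Safer pattern: exponential backoff with jitter, so failing clients
    spread out and ease up instead of piling on."""
    interval = 60.0
    while True:
        ok = fetch()
        if ok:
            interval = 60.0                          # reset on success
        else:
            interval = min(interval * 2, 3600.0)     # back off, capped at an hour
        time.sleep(interval + random.uniform(0, interval * 0.1))  # add jitter
```

With thousands of endpoints all hitting the same broken API, the first pattern turns one bad deploy into a self-inflicted DDoS; the second degrades gracefully.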

The endpoint change had suddenly “returned through time” to become the most relevant fact of the entire incident. From that point forward, we didn’t shoo away information so quickly; we started to at least pause and give it a second look. We also gave the engineer who had asked about the update a lollipop.

We managed to catch it this time, but building a foolproof way to find reverse red herrings is tough. I’d like to toss the question over to our readers: Is there a method we can follow so we don’t dismiss possible contributing factors too easily? How do we do that without compromising efficiency? Have you experienced a reverse red herring before? If so, please share your advice with us on Twitter or in our new community Slack channel!
