At first glance, people tend to think that incidents are cut-and-dried, relatively objective occurrences. But if you look closely, incidents are highly varied, often require unique handling, and often defy clear answers to something as seemingly simple as knowing when they even start.
In this article, we’ll look at how systems can be a lot more ambiguous about their health, set up a structure for how an incident progresses, and use this structure to work through the ambiguities of three example incidents.
Let’s start by taking a look at this image, which is intended to depict the conditions of a system with two states: normal vs. broken:
I’d be relatively certain that most people can readily determine when and whether this system is broken as it moves through time:
The biggest problem with this system, easy as it is to imagine and understand, is that it does not represent the reality of today’s complex, highly distributed systems very well. Those systems tend to degrade partially rather than break outright. These partial degradations are called “gray failures” and could be represented like this:
Or maybe even like this:
In this last case, is the system ever really working? Or ever really broken?!?
Everybody wants to have systems like the first one (or imagines them as such). Binary, 0 or 1, clearly and objectively working or broken. This clear, bright line makes it easy to reason about the systems and it makes it easy to talk about them too: The system is either clearly working, or it is not. But such systems are becoming a vanishingly small fraction of the world that reliability engineers have to manage.
Not only do we have to deal with the ambiguity of this type of gray failure in distributed systems, but we also have organizationally defined constraints which bound the “expected” use cases for our systems. If a user is attempting to do something to (or with) your system using an “unsupported” browser – is that a problem? What if they are trying to do “too much” or “too fast”? Who defines these boundaries – the supported browser versions, or the rate and performance limitations? They might be policies within your org, agreed-upon standards in the industry, or the boundaries of what service you can realistically provide. Arriving at a singular definition of “normal usage” can be very complex.
So what is an incident?
ITIL defines an incident as an “unplanned interruption to or quality reduction of an IT service”. However, this isn’t very helpful because we are now in an always-on, 24x7 world of gray failure - there is always some degree of “quality reduction” and the key for service owners is to know when that degradation is too much.
Declaring an incident is a call for help. It’s a recognition that “normal”, business-as-usual plans need to be changed because of some significant deviation from normal. This call for help might be just on an individual level, where the on-call engineer is having to divert their attention from other planned work, or possibly having to bow out of a previously scheduled meeting. It could involve a few people for a slightly more involved incident or it could involve multiple teams to respond to a major incident.
The other main purpose of declaring an incident is for public awareness. Sometimes the incident responder does not necessarily need additional help but it’s important that other people avoid falling into the same trap or refrain from actions which could make a situation worse. If the deployment system has begun misbehaving, then it may be very important to avoid attempting additional deployments until the system can be corrected or caught up with a backlog of work.
There’s another aspect of “declaring an incident” which is related to the idea of public awareness. This aspect could be called a “signpost for the future” and is closely connected to the ideas around learning from incidents, and especially to the importance attached to learning from “near misses”. One of the tenets of the Safety-II community is that the work to maintain normal operational performance is critical to creating a system’s resilience to disruptions. If “incident retrospectives” help to inform an organization about weaknesses and risks in their services, then they serve a role similar to “DANGER” signs posted near the edge of a cliff, even if no one (and no system) has fallen off the cliff yet. From this point of view, it can even be beneficial to declare incidents retroactively so that the occurrences can be cataloged for future benefit.
Before exploring a few different incidents to see how these concepts work in practice, let’s borrow some terms from the electronic music world to make a structure for incidents that will help us navigate some of their ambiguities. Specifically, we’ll use those which deal with sound envelopes to break down the timeline of an incident: delay, attack, sustain, and release. Here’s a diagram:
Let’s explore these components by looking at a few real world examples:
In the first scenario, the internet outage in Tonga that followed the eruption of a nearby volcano, the initial trigger could be identified as happening in December when the volcano began actively erupting, or it could be considered to be the climactic eruption on January 15th. The choice should be guided by which starting point leads to a better understanding of the incident.
The delay portion of the timeline would span the time from the trigger until the physical failure of the undersea cable itself. Cloudflare has documented a variety of different measures to characterize the attack phase of the outage envelope, which spanned roughly two and a half hours, from 03:00 to 05:35.
Due to the physical challenges of repairing a cable under the ocean, as well as safety concerns regarding the nearby volcano, this outage was sustained for 38 days until the cable could be repaired. During this time a small amount of satellite connectivity was established but, as documented in the Cloudflare report, it was extremely limited compared to the normal traffic volumes. While the cable repair was effective for the two main islands of Tonga (75% of the population), the residents of other outlying islands may still have to wait through 6-9 months of continued outage in the release phase of this incident as inter-island cables and facilities are repaired.
This situation in Tonga provides what might be considered a fairly clear-cut example of an incident, but it still depends on your point of view. If you were a retailer in Europe, even an e-commerce shop, this would probably not be an incident for you; but if you are a South Pacific logistics service provider, it certainly could have been; and if you were one of the two internet service providers for Tonga or the undersea cable operator, it definitely was an incident – of the most severe variety. Due to the physical impacts of the volcano in the region (and the Pacific Rim due to tsunami danger), the declaration of this as an incident served both to recruit help for repair and to advise people about the problems related to the event.
For another example, let’s look at the Fastly incident of 2021-06-08, based on their own outage report. In this case, the earliest trigger for the event dates back to May 12th, with an almost 4-week delay before a customer change incited errors across 85% of the platform. Within a minute, the event crossed the threshold of detection. After about 49 minutes of sustained errors, release and a return to normal performance took an additional 59 minutes. Preventative work to correct the underlying flaw (disarming the trigger) took 4.5 hours more (not represented on the envelope diagram).
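As a rough illustration, the envelope for this incident can be written down as a small data structure. This is only a sketch: the `IncidentEnvelope` class and its method names are hypothetical, the durations are the approximate figures quoted above, and the one-minute attack is an assumed upper bound for "within a minute".

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class IncidentEnvelope:
    """Hypothetical container for the four phases of an incident envelope."""
    delay: timedelta    # trigger introduced -> first visible impact
    attack: timedelta   # first impact -> crossing the detection threshold
    sustain: timedelta  # period of sustained degradation
    release: timedelta  # recovery back to normal performance

    def impact_duration(self) -> timedelta:
        # Time during which users could notice the problem.
        return self.attack + self.sustain + self.release

    def total_duration(self) -> timedelta:
        # Full envelope, including the latent delay phase.
        return self.delay + self.impact_duration()


# Approximate figures from the text; the attack value is an assumption.
fastly = IncidentEnvelope(
    delay=date(2021, 6, 8) - date(2021, 5, 12),
    attack=timedelta(minutes=1),
    sustain=timedelta(minutes=49),
    release=timedelta(minutes=59),
)

print(fastly.delay.days)         # days the trigger lay dormant
print(fastly.impact_duration())  # user-visible portion of the envelope
```

Written this way, the contrast between the phases stands out: the latent delay is measured in weeks, while the user-visible portion of the envelope is measured in minutes.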
With the public declaration of “an incident” at 09:58, Fastly was able to let their customers (the service providers) and the customers of their customers (the service users) know that there was a problem going on. This notice served like a yellow “caution” flag at a race while the Fastly engineers investigated and mitigated the problem condition. Internally, the declaration of an incident would have facilitated getting the attention of both engineering and management resources in order to speed the repair and guide public messaging. Given that Fastly posted their public statement the very same day of the incident, it worked.
Hospital decision support systems
For a third example, ascending the scale of ambiguity to determine whether an “incident” is occurring, let’s look at a decision support system used at an emergency department of a hospital. This system employs machine learning algorithms to guide clinicians in their determination of whether a patient should be admitted to the hospital. The model was trained on data from April through October 2019 and then evaluated against test data spanning November 2019 through April 2020. While this was a retrospective modeling study rather than a “live streamed” or “in the moment” event, we can still use the same framework to look at the data and effects. Here is an image from the paper (Figure 3 in the paper) showing the accuracy of the model predictions over time:
With the confidence bands as marked at 95%, and if you did not have knowledge of the absolute date as the timeline plays out, where would you draw the line to declare an incident of “the model is too wrong”? The pattern seen in late December is not significantly different from the pattern in mid-March 2020. In many ways this pattern matches the third diagram in this article, where the system is never fully working, nor fully broken.
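To make the "where would you draw the line" question concrete, here is a minimal sketch of how the same signal produces different incidents depending on the threshold chosen. The `degraded_spans` function is hypothetical, and the health scores are synthetic, made-up values; they are not data from the paper or from any real system.

```python
from typing import List, Tuple


def degraded_spans(
    health: List[float], threshold: float, min_len: int
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, where health stays
    below `threshold` for at least `min_len` consecutive samples."""
    spans: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(health):
        if value < threshold:
            if start is None:
                start = i  # a degraded run begins here
        else:
            if start is not None and i - start >= min_len:
                spans.append((start, i))
            start = None
    # Close a run that is still open at the end of the series.
    if start is not None and len(health) - start >= min_len:
        spans.append((start, len(health)))
    return spans


# Synthetic health scores (1.0 = fully healthy), invented for illustration.
health = [0.97, 0.93, 0.88, 0.91, 0.84, 0.86, 0.92, 0.95, 0.83, 0.81, 0.85, 0.94]

strict = degraded_spans(health, threshold=0.90, min_len=2)
lenient = degraded_spans(health, threshold=0.85, min_len=2)

print(strict)   # candidate incidents under the stricter threshold
print(lenient)  # candidate incidents under the more lenient threshold
```

With these made-up numbers, the stricter threshold yields two candidate incidents and the lenient one yields a single, shorter one. Neither answer is "correct"; the choice of line is a judgment about what level of degradation matters to the people relying on the system.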
Declaring an incident is a way to get help by focusing the attention of the responders on a particular set of circumstances. This focus can help by removing distractions but it can also hinder if relevant situational cues are excluded from consideration. It can be difficult to determine a priori which cues are relevant so casting a wide net is a good idea. Declaring an incident can also be a way to warn people who may not be able to take corrective action but may be able to compensate for the adverse conditions. Or, as in the case of the volcano near Tonga, alerts and warnings of elevated danger served as “signposts to the future” so that people could focus their cognitive efforts constructively.
While you may not be able to precisely define when an incident starts, or whether your system is “too degraded”, don’t get stuck on those details. Consider the purpose(s) of your system and whether the users and/or support teams will benefit now, or in the future. When you declare your next incident, consider how these aspects can help your team and your customers respond most effectively. You don’t need to “overthink” the question because it is generally better to err on the side of caution and future learning value (declaring an incident), than to be worried about the stigma of yet another incident.
What do you think? When do the benefits of declaring an incident outweigh the overhead for you? Join the discussion on Twitter or in our community Slack channel!
Kurt Andersen is a practitioner and an active thought leader in the SRE community. He speaks at major DevOps & SRE conferences and publishes his work through O'Reilly in quintessential SRE books such as Seeking SRE, What is SRE?, and 97 Things Every SRE Should Know.