In this episode, Lauren Caliolio, Director of Reliability Engineering at Nike, and Cat Swetel, Director of Engineering at Nubank, join Matt Davis and Kurt Andersen from Blameless to give the proper severity to the topic of incident severity. Incident management always begins with prioritizing and triaging based on the incident’s impact. Judging that impact isn’t trivial! Join our panel as they break down the nuances and details of incident severity.

Speakers

Kurt Andersen

Strategy, Blameless
Kurt Andersen is a practitioner and an active thought leader in the SRE community. He speaks at major DevOps & SRE conferences and publishes his work through O'Reilly in quintessential SRE books such as Seeking SRE, What is SRE?, and 97 Things Every SRE Should Know. Before Blameless, Kurt was a Sr. Staff SRE at LinkedIn, implementing SLOs (reliability metrics) at scale. Kurt is a member of the USENIX Board of Directors and part of the steering committee for the world-wide SREcon conferences.

Matt Davis

Staff Infrastructure Engineer, Blameless
Matt is a Sr. Infrastructure Engineer at Blameless. He brings to bear a varied background including data-center operations, storage hardware and distributed databases, IT security, site reliability, support services, observability systems, and techops leadership. He has a passion for exploring the relationships between the artistic mind and operating distributed software architectures.

Lauren Caliolio

Director, Reliability Engineering, Nike
Lauren is a non-traditional technology professional with 15 years of experience working within systems administration, emergency medical services, and site reliability engineering. She currently works as a director of reliability engineering. When she’s not working, you can find her riding the ambulance, knitting, and playing the piano poorly.

Cat Swetel

Director of Engineering, Nubank
Cat is a technology leader specializing in lean-inspired, data-informed coaching for technology organizations. She is passionate about increasing diversity in STEAM as a means of creating possibilities for a more equitable human future based on generative institutions. In her leisure time, Cat enjoys making jokes about Bitcoin, hiking, and reading feminist literature.

Video Transcript

Matt Davis:

Hi, everyone. Welcome to another episode of From Theory to Practice. Today, we are going to be talking about the topic, what is difficult about incident severity, and our panel today includes some guests and a member of Blameless. I'll go ahead and let everyone introduce themselves. We'll start with Cat. Go ahead.

Cat Swetel:

Cool, thanks Matt. I am Cat Swetel. I'm the director of engineering at Nubank, which is a Latin American bank. I run the foundational technologies group there. And I guess I will pass it off to Lauren.

Lauren Caliolio:

Hi, everyone. My name's Lauren. I'm a director of reliability engineering over at Nike. I manage teams of embedded SREs responsible for overseeing a couple of business lines that are pretty critical to running nike.com, as well as Nike's series of mobile apps.

Kurt Andersen:

I'm Kurt Andersen, SRE architect here at Blameless, and kind of honorary co-host with Matt who runs most of this show. So I'll pass it over to Matt.

Matt Davis:

Thank you very much. Matt Davis is my name, and I am an SRE advocate at Blameless. Thank you everyone for joining us today, and I want to thank our guests for being here. And I think severity is the best place to start, because this is a very contentious thing. I've seen some pretty interesting Twitter threads where people were arguing about severity. I see some people on LinkedIn posting a lot of different opinions about what severity means, and there are actually some great articles online about how to distinguish severities and what they actually mean.

So a question that came to mind last night as I was thinking about how do we start this discussion about severity is, well, let's ask the question right up front. Can severity be measured? Lauren, I want to ask you first, as a director of SRE, is this a question that people ask? Is this something that you deal with? Or what do you think, can it be measured?

Lauren Caliolio:

I think the trouble with even attempting to tackle this, not just at Nike where I've just arrived (I'm only about two months in), but at other places I've worked, is that it's highly subjective. And you have to have alignment across, well, there are a lot of stakeholders, a lot of cooks in the kitchen that you have to have alignment with in order to answer that fully.

And once you have alignment regarding severity being measured and we all agree on it and what the measures and goalposts should be that are aligned to it, then it's something that you really have to just continuously monitor and continuously gauge across all the different stakeholders that you have that would care about how you're defining severity, or even levels that are aligned with it too. So it's highly subjective, I'd say.

And every org, I feel, has a different perspective on it. And it gets even more contentious, I feel, when you are part of orgs that have an existing, I'd say, ITIL practice, where your SRE practice basically has to fall into alignment with whatever the existing ITIL practice is. So I feel like I gave a very engineering-centric answer to that, which is: it depends. I'm not sure if that's what you're looking for, but I feel that's the real answer here.

Matt Davis:

No, you brought up a lot of good points, and the one I like the most is how subjective that assessment is. I think I read in a Slack, when we set severity, it's all about the feels, it's all about the vibe at the beginning. And that's how we all have these different perspectives on it when we're coming into the incident. And it's really curious, I love how you say that you revisit it and you keep... It reminds me of revisiting common ground, and it feels like understanding what the severity is, is very much a piece of common ground. Cat, what about you? What do you think about the idea of measuring severity?

Cat Swetel:

I want to just plus-one everything that Lauren just said. Obviously, it's highly subjective, and I love the point of revisiting, same as we would anything else. But I've been in a situation before where the product landscape for the organization changed drastically, and we still had the same idea of severity for incidents, and it just was a nightmare. Exactly like you would expect it to be, right? And it, I guess, gave people that feeling where they're like, "Am I crazy? This feels like a SEV 1, but I'm looking at this and it's not..." And that's always a bad sign for me, when people are asking, "Am I crazy? Or..." I think it's highly contextual and probably shifts over time, as do most things in your business.

Lauren Caliolio:

It's interesting that you mentioned the feels that are attached to it, especially when you get into levels and stuff. Because especially in the world of e-commerce and at certain times of year, I mean, you might have an S3 where it's like, "This is an S3," and it's like, "No, this has all the feels and, basically, all the actions being taken are like, oh, wait, S1." It's like, "But wait, wait a minute." So much of it is attached to what comes down from the business. And especially if I use the example of e-commerce, and especially in organizations like Nike, a lot of it's just like, "Oh wait, but we're in holiday."

So it's really interesting. Which is also why one thing that I'm actually working on right now is how to ensure that we have a process for regularly reviewing alignment on severity levels, and gauging where things should be, dependent even on time of year, I'd say. And trying to figure out how we can get there. Because at least my experience with this stuff has been that once you have a definition of what your severity is and what your levels are and everyone is in agreement, it's really easy to just say, "Okay, we've done this, on to the next thing," and you don't revisit it again.

So with where we are right now at Nike, for example, the last time we've done an overall review of these things was maybe four or five years ago, when SRE was still very much in its infancy here. So it's one of those things where I think it's especially important. At other organizations that I've worked at, I implemented a process to do at least quarterly reviews, which I think is realistic for a lot of folks, given how busy we all are. I think those things are really critical and key for ensuring that you have a realistic gauge of what your levels are, and what they should be.

Matt Davis:

And being able to keep newly onboarded employees up-to-date, and then refresh people who have been there a long time and haven't even thought about it in six months, nine months. Kurt, what about you? Can severity be measured?

Kurt Andersen:

Well, I wanted to pick up on the term that both Cat and Lauren used, in terms of it being subjective, and kind of look at that aspect. Because if it's subjective, then that means that different subjects are going to look at it in different ways, to me. And maybe it's because they come from different contexts, maybe it's because they have different purposes in mind: what do I get out of a SEV 1 versus what do I get out of a SEV 3? Or what do I have to put in, because it's a SEV 3 or a SEV 1, depending on who you are in the organization or what the incident is that's impacting it?

And I think coming to a common ground and a common understanding like Lauren was talking about, and then regularly checking in to make sure that it still matches your needs or matches the needs of whatever stakeholders you decide have to have input into it, I think is really important and valuable. Because, otherwise, you get the kind of drift that Cat was mentioning, that the environment changes, times change, your product changes, and if you're not keeping up with those changes, you're out of date.

Matt Davis:

I have to say I feel the same way about revisiting and revising, because I feel like it's every week we have something new come up that we hadn't thought about that pertains to our incident management program, how we define severities. And an incident will come up and it'll be this brand new edge case and we're like, "Well, we never really accounted for this kind of case when we designed this. So what do we do? How do we think about this?"

And we actually do the same thing. We do a monthly review now and incident response monthly is the name of the session, and we get together and we do exactly these things, revisiting how we think about severity and how we do the actual response and things like we have on-call teams for services, and how all that happens too. So-

Kurt Andersen:

I almost wonder, and to answer the question of whether it can be measured, I think you can assign values to it. I mean, kind of a Likert scale: extremely disagree, sort of disagree, sort of agree, or extremely agree.

Matt Davis:

Wouldn't it be cool if you could have an automatic Likert scale when severity gets changed, so everyone in the incident could say whether they actually agree on the severity?

Kurt Andersen:

Well, but in some ways, with severity, you can tell if something is massively more important or more severe, or relatively not massive or severe, but sometimes those fine gradations are where everybody [inaudible 00:11:04], I think. It's like, "Is this a SEV 2 or a SEV 3?" And it's like, "Okay, what difference does it really make?" is sometimes the sense I have. "Okay, a two or a three, who cares?" It's like, is it major or is it minor? I think that's sometimes all the level that you really need to designate.

Matt Davis:

Lauren, you said something interesting that I had not heard before, but it makes sense. You said, "Time of year." And I can relate to this, because I've worked in ad tech and time of year matters so much in ad tech-

Lauren Caliolio:

And in retail.

Matt Davis:

... and in retail and e-commerce. They're linked, all of that stuff. So tell us, does time of year change how you define and look at severity?

Lauren Caliolio:

It's a squishy question, and I think it's a very squishy answer, because I feel that at least in retail and in the e-commerce world, that the perspective of each level seems to change. I use the example of something that comes in as an S3 and within engineering, it's like, this is an S3, but the business is treating it as an S1. And I've also been in the position where it's been vice versa, where I think Cat actually made the point earlier, it's like, "Oh wait, we're treating this, it's an S4, but this is really an S1." It's like, "This should be an S1. This is an all hands on deck situation."

And I think that at least within the retail and e-commerce world, there is a sense of heightened alert around holiday season, which is why I mentioned time of year. I use that as an example, too. There's a lot of prep that goes into it months beforehand, which I think contributes to the perception basically changing, regardless of what might be defined in a Confluence document or discussions that have happened with engineering and product folks regarding what is comfortable in terms of severity, what escalation should be attached to each one, and target resolutions for each.

So I feel like it's a matter of perception and a sense of what I would call heightened awareness, where, when it comes down to at least that time of year, where there's a lot more attention being paid to revenue and projections for revenue, you get, I feel like, the squishiness that comes with it. Where it's just like you might end up throwing out the playbook that we all have and we've agreed to. It's like, "Oh, well we have this beautiful runbook, and we're all going to follow it when we have this type of incident occur." And it's just sort of like, "Okay, no, I'm going to make a phone call. And I'm going to throw this out the window." And this S3 will become an S1, or vice versa, which also tends to happen too.

So I feel that what contributes to it is that heightened awareness, that heightened attention to different things, revenue, I mean, being the number one driver, but things like performance from an SRE perspective too, just gauging and watching and monitoring, essentially.

Matt Davis:

All eyes are on the graphs when it comes to the holiday seasons, for sure. So as we're talking about severity changing so much, and I'm a little bit on the side with Kurt in that major and minor feel like the most you can really get to, but I know that's not reality. Here at Blameless, we have basically four severity levels. We have zero through three. So when we get into these incidents and we're thinking, okay, we're revisiting our severity, for whatever reason, it goes up and down, it may be time of day or like we were just talking about time of year, there are all these different factors that people talk about contributing to severity.

And so my question I want to ask you, Cat, is: can we automate this? Is this automatable? In some ways this does feel toil-ish, but in other ways, like Lauren was just talking about, we need so much flexibility when we choose severity and when we're in incidents that it doesn't feel automatable. What do you think, Cat? Can we automate something like this?

Cat Swetel:

Oh, I don't know. I struggle with the question of "can we automate" for anything, every single day of my life. But I think we can automate detection of certain problems and have some sort of, "Yes, this is in a place where things could go sideways quickly, let's open an incident or something like that and start a discussion." But I don't think you can just automate away any discussion of severity. And I think part of that... I don't know, I struggle with this. I'm the wrong person to be asking. I struggle with this so much. But what I have seen is that the humans in our system get anchored in specific absolute numbers. And that's what I've seen around the holidays and stuff like that.

"Oh, we have this many calls in our customer support queue." "Right, but we also had a record business day. And so that's still just 0.01% of all transactions, whatever." Yes, something is wrong, but also that number that you're so anchored in day-to-day means essentially nothing right now. And so my concern when people are like, "Oh, we need to automate this like a sign..." I guess, in this arena, I would rather have multiple biases instead of just automating one bias.

Matt Davis:

I love the way that you stated that, because I mean, that goes back to what we were saying about different perspectives and severity being so subjective. This is the same thing, it reminds me of how we think about complexity and how we think about the fact that one person can't hold the whole system in their head. But it's not only just that, it's that every single person has a different mental model of the system in their head.

And especially when you have multiple people in an incident responding to one, well, however, your first response looks you've got three, four, or five people who have three, four, or five different perspectives on what the production pressure is, how it's affecting the customer, how long it's been affecting the customer, how many customers are affected. And so it's like this is really difficult. These are split decisions, split second decisions that responders have to make. Lauren, can I ask you a little bit about your EMT practice in the past? You were an EMT in the past, is that right?

Lauren Caliolio:

Yep, that's right. I'm actually still an EMT, that's

Matt Davis:

Oh, you still are. Great.

Lauren Caliolio:

... why here.

Matt Davis:

Yep. Oh, awesome. Now, I know a couple of EMTs, and I've never asked this question of them, but do EMTs worry about severity?

Lauren Caliolio:

Wow. Okay. Speaking for myself, I can say that there's... Well, the way you're trained and the way it works, I'd say in my experience and practice, is that when you think about severity as a first responder, or at least when you're on an EMS call, you literally have a matter of seconds to sort out how severe a concern might be. And it's, at least for me, somewhat black and white. It's not as squishy, I would say, if I were to make a parallel to how it seems to work, at least in the corporate world. And it's just because of the time that you have to basically sort out, "Okay, how are you going to treat this?" At least in terms of severity.

So I guess my answer to this would be yes, but we're not thinking along the lines of, okay, S1 through S5 for levels. It might just be two of these things to sort out how you might treat a call at that time. So I'd say yes, I guess, to that. It's interesting, because I haven't had to think about that before. It's sort of an automatic thing, and that's how you're trained, basically, to decide really quickly, "Okay, so how are we going to treat this?"

As a part of the triage mode that you get into to iron out what sorting this out looks like, at least in EMS, there's a common phrase that gets thrown around, which is, "Okay, stay and play," or you're just going to book it, and try to get this person treatment as soon as humanly possible. So in cases where you might have a more severe or significant issue that you're dealing with, you're not staying and playing, which is the phrase that's used.

You're booking it. You're going to do whatever you can to ensure that the person is in stable condition when they arrive at the hospital. You're not going to stay and try to treat them there immediately. So that's basically I'd say how I think of these things in terms of severity, it's very much so what I would consider more black and white than how we're thinking of it in the corporate world, where we might have four to five levels, usually, is what's typical. And it's a lot more granular. I'd say, if I'd have to define the main thing that's different, it's that there's a lot more granularity that exists for, I think, understandable reasons.

Matt Davis:

I mean when human lives are on the line, it feels like a much easier decision. And that's a weird thing to say, but it does feel more black and white. When I'm in an incident and we are arguing about what severity to set it at, sometimes people take the position that the severity is something that guides the incident instead of the other way around.

So for example, an easy example of this would be, and we do this: our major severities, SEV 0 and SEV 1, are basically 24/7 response. And our minor severities, SEV 2 and SEV 3, are not 24/7 response. So we've gotten into these discussions multiple times, where there was an argument or contention over: what are we doing? Are we setting the severity to match the response? Or are we matching our response to meet the severity that we picked? And-

Kurt Andersen:

And to me, that's the question of what's the purpose of a severity, essentially? And I have advocated for an instrumentalist point of view, it's like, you pick the severity as a point of common ground, so that everybody agrees, here's the level of effort that we're going to put into resolving this thing. And to me, that's the simplest sort of framework that I can think of for a severity is you decide what level of effort's appropriate, and then you use a label that says, here's the level of effort that we have chosen.

Matt Davis:

And do you do that after the fact or before? It's-

Kurt Andersen:

During, usually. I mean, because often you'll detect something is a problem and not necessarily understand enough about it, what's wrong or how wide the ramifications are, until you've done some investigation. So you may not know until you've done some investigation what the severity is, and you may find out information later that changes your mind, too.

Matt Davis:

I was just thinking of how I've seen really strong opinions on incidents about severities. And one of the strong opinions that comes to mind, and this goes back to that very subjective thing, and back to what the response looks like, is the balance between severity and priority. So I'm trying to think of it in terms of the EMT world, too. I'm thinking there's no concept of separating these two things in the medical world, especially in emergency medicine.

Kurt Andersen:

Well, I think you can separate urgency and importance, and if somebody's not breathing, that's both. I mean, you can almost do an Eisenhower matrix kind of a thing of urgency and importance. And it's like, this is really important. They're not breathing and they don't have a heartbeat. Breathing comes first, and you got to make sure that they can breathe. I mean, it's the CPR algorithm, you got to make sure they can breathe, that they are breathing, and then you work on circulation. And they're all very important, but they have to be done in order, because you can pump non-oxygenated blood around and it doesn't do anybody any good.

Matt Davis:

Right. That's a great point. This is the question I'm thinking about now, I wonder how important is severity to the people involved?

Kurt Andersen:

To the responding team? Is that what you're asking?

Matt Davis:

To the responding team and maybe also the customer. In some senses, I also consider the customer a responder, instead of necessarily like a victim. So why is it important? And how is it important? For example, we've been talking about severity kind of directs our actions or helps us get sense of the incident. But does severity also matter in a non-response way? I shouldn't maybe say non-response way, but maybe in a functional way is what I'm trying to say. When I'm a responder, I almost don't care what the severity is. And when I'm a responder, I don't think about it that much.

And I also think about the customer. Does the customer care about severity? In what instances do they care about severity? Lauren, maybe you can give us some of your experience. You just joined Nike and I don't know how many incidents you may have encountered in your two months there so far, but do customers care what the severity is? Or do they really just care that they get the service back? Does it matter?

Lauren Caliolio:

Yeah, that's interesting. When we define customers, are we talking about clients? Are we talking about people who are buying products, who are trying to access the dotcom? Is that what we're referring to when we're saying customers?

Matt Davis:

Well, I mean, that's a good distinction to make. When I think about customers, I do, as someone who's been in operations a while, I do consider developers some of my customers. So I do consider internal customers as customers too. But the customers who are buying things, I think is what I'm more talking about. The customer that we're building the product for.

Lauren Caliolio:

If I'm to think about the question that way, then I'd say, do customers trying to purchase our product care about the severity of an issue? I'd say no. But perception, if dotcom is just completely out, matters quite a bit, I'd say. And if customers don't have the information that we might have internally while we're dealing with the incident, I'd say the perception of not even being able to see a landing page might contribute to what they might think the severity of an issue is.

Matt Davis:

Oh, yeah.

Lauren Caliolio:

Versus how we might be treating it internally. For example, to a customer, not being able to hit dotcom without any information, without a maintenance page just letting them know, "Hey, this is temporary because there's work going on," et cetera, is completely different from how they might perceive the brand, or the severity of the issue or the outage, if they see a landing page that says, "Hey, we're going to be down for an hour. Doing some work."

Matt Davis:

And that is really curious, too, because I'm thinking about my mother actually, and a couple times my mother, she's not extremely adept at technology and she somehow logged out of her Netflix account, or did something to where her Netflix account wasn't working anymore or something like that. And she was calling me up, because she knows what I do for a living. And she was like, "Netflix is down." And I'm like, "Really? We were just watching it an hour ago. I think it's fine."

And then she continues to tell me about what is actually happening, and finally got to the conclusion of she somehow logged herself out. But her perception was that Netflix was down. And I can imagine when nike.com is down, it could be down because some errant JavaScript on the page screwed something up. Or it could be down because there's a huge database that went down. But the customers-

Kurt Andersen:

Or your Fastly CDN goes down, right?

Matt Davis:

Yeah. Oh right. It could be a third party altogether.

Lauren Caliolio:

For sure, I think that's basically what I was thinking about with these things. To the customer, you're down, if you're down, if you're unable to hit it, regardless of what might be going on behind the scenes, especially if there's lack of information or clarity regarding what's going on. And that's not being communicated appropriately.

Cat Swetel:

Yeah, for me-

Kurt Andersen:

Can I get-

Lauren Caliolio:

But, say, to the customer, they're not thinking about it in terms of how severe it might be. They're just looking at the fact that they can't access a thing, and if they don't have information regarding what might be going on, or if they're not thinking, "Oh, maybe I should just try hitting the page again and it'll load up for me and life will go on," I'd say to the external-facing customer, it all just looks the same. It all just looks like the site's unavailable.

Matt Davis:

Yeah. Go ahead, Cat.

Cat Swetel:

Yeah, for me, that's exactly why, even when I'm responding, severity matters a lot, because, for me, it's more a signal of how much help do I need resolving this? Rather than, oh, we have to assign numbers to things. But I know if I say, "This is a SEV 1," customer support is going to come and say, "Oh, okay, what can we do to help you out, to give you breathing room for this?" And maybe external comms is going to be like, "Let's write a notification, so that people don't do an app-open retry like eight bazillion times." So giving us some room to breathe and remediate it. And then also longer term, too, if it's a SEV 1, then I trust that that's a signal to the rest of the organization to approach me and offer help longer term too.

Matt Davis:

Are there-

Lauren Caliolio:

I was actually thinking about this while we were talking, just to piggyback off of Cat's point, and thinking about, well, do customers care about this or not? Versus whether or not your internal customers care about severity levels and how they're defined. And what I was thinking, which was in alignment with what Cat was sharing too, was that the levels you define internally matter a lot, if not as much to the responders, whose main immediate focus is just restoring stability, then in terms of, I feel like, the global apparatus in some cases that you might have to mobilize. How many folks are you waking up, across how many time zones?

So S1 versus S5 is a world of difference, which is where, when I think about what works with the model as most companies are following it versus what might not work with it, I would put a point in the column of, "Okay, well, this is why we have this and why it exists, and why we use it as a point of reference despite the challenges that come with it." And, linked to that, it's like, "Well, how many folks am I waking up? How many teams am I involving with this?" And that's where the level, I'd say, matters a lot.

Matt Davis:

Well, does anyone think that severity matters to the business in, I don't know what to call this, I don't want to say measuring reliability because I don't like that term, because I think reliability is unmeasurable. But this feels like the kind of statistic that a business would use to measure reliability to say, "Well, look, we're reliable. We've only had 10 SEV 1s in the past six months, but we've had 30 SEV 3s and whatever."

I guess, well, I mean as a metric. So if we take severity away from its function to help guide the response, to actually be what I consider a dynamic part of the common ground of an incident, does it then get frozen in time? Is it actually used? I'm interested to hear your experience, anyone's experience, on this. Are severity levels used as a metric to actually measure the "reliability" of the team?

Kurt Andersen:

I think that often it is used in that way, but I don't think it should be.

Matt Davis:

Okay.

Lauren Caliolio:

Yeah, I could talk about this alone for six hours on this topic, on using number of incidents as a, "Look how reliable we are." I have [inaudible 00:37:01]-

Matt Davis:

Yeah. Yeah.

Lauren Caliolio:

... all the time. But, yeah, I feel like this is a hot topic, for sure. Just that alone, I'd say could be its own episode.

Matt Davis:

Yeah. You mean like counting incidents and counting things? Counting [inaudible 00:37:17]-

Lauren Caliolio:

Yeah. And aligning severity levels, too. Reliability measures.

Matt Davis:

How often have you seen this done? Is it being done at Nike? Do you know?

Lauren Caliolio:

Yes. Yes it is. But Nike is not the only place I've worked where that's done. It's actually, I would call it a pretty typical approach-

Matt Davis:

Oh, okay.

Lauren Caliolio:

... in my experience.

Matt Davis:

All right. Cat, what about you, do you think it's very typical?

Cat Swetel:

I think it's very typical, but I think I have an unpopular opinion on this front.

Matt Davis:

Awesome.

Cat Swetel:

Sometimes I think it's good to track it at inflection points in your system, maybe. I've been in two situations before where there was rapid growth in the ecosystem, and so there were more incidents, because the growth was nonlinear. So yes, there are more incidents, but people get anchored on a number in their brains, "This is how many we have every month." Even the people responding are like, "Oh my god, this is bonkers," right?

Matt Davis:

Yeah.

Cat Swetel:

When that number goes up. And so I think sometimes it's good to say, "Right, but the number of incidents per transaction or per visit or whatever is holding steady or trending down. Are we scaling in a way that is safe? Are we scaling the system without also scaling the blast radius?" So that's my unpopular hot take.
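Cat's normalization idea can be sketched as a quick calculation. The numbers below are entirely hypothetical, but they show the pattern she describes: the raw monthly incident count rises with growth while the rate per unit of traffic actually falls.

```python
# Sketch of normalizing incident counts by traffic volume.
# All figures are hypothetical illustrations, not real data.

months = ["Jan", "Feb", "Mar", "Apr"]
incidents = [12, 20, 27, 40]                 # raw monthly incident counts (rising)
transactions = [1e6, 2e6, 3e6, 5e6]          # monthly transaction volume (rising faster)

def per_million(count: int, volume: float) -> float:
    """Incidents per million transactions."""
    return count / (volume / 1e6)

rates = [per_million(c, v) for c, v in zip(incidents, transactions)]
for m, c, r in zip(months, incidents, rates):
    print(f"{m}: {c} incidents, {r:.1f} per million transactions")
# Raw counts go 12 -> 40, but the normalized rate trends down: 12.0 -> 8.0
```

The choice of denominator (transactions, visits, requests) is the judgment call; the point is only that an absolute count without one can't distinguish "getting less safe" from "getting bigger."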

Matt Davis:

But you qualified it really well by saying it's per something. We talk about SLOs this way. SLOs are meaningless taken alone; they have to be taken per something, per hour, per day, per year, per whatever. But I really like the inflection point that you're... In music, we call this an articulator of form. Yes.

Cat Swetel:

And that's way cooler.

Matt Davis:

Hey, tech steals terms from music all the time. Orchestration is the best example, but articulators of form are extremely important in musical analysis. They tell you those inflection points, and there are certain things in music that happen around these articulators of form. So I don't know if it's an unpopular opinion with me, Cat. I kind of like that perspective. I do.

Cat Swetel:

All right. Okay. Well, I got to be signing off here folks. That's my [inaudible 00:40:08]-

Kurt Andersen:

Yeah, no, I-

Matt Davis:

And we're done.

Kurt Andersen:

Actually, I think that might-

Cat Swetel:

Just kidding.

Kurt Andersen:

... be a good place to drop the mic, because we're about at time.

Matt Davis:

We are. That is a great place. Are there any last words anyone wants to say about severity? Lauren, would you like to say something?

Lauren Caliolio:

I would just like to say that I thought Cat actually made a lot of really great points. And I did find myself thinking a lot more about SLOs as she was talking. I was like, "Oh, well this is how we use SLOs to measure other things." The challenge for me, historically, has just been the reporting side, in terms of what is easier to present to the business regarding reliability. And that's where it falls back into what I'd call the more traditional way of thinking about it, which is incident numbers.

But I just thought those were amazing points about how you'd want to not rely on the current state of something like availability, but instead think about tracking or trending over time, which is actually where the main benefit of introducing SLOs lies.

Matt Davis:

Kurt, how about you?

Kurt Andersen:

No, I think this is a great conversation. Nothing more to add at this point.

Matt Davis:

Yeah. I'm also happy to leave us with the wonderful mic dropped by Cat, and I really appreciate everyone joining us today. This has been a really awesome conversation. Really loved having both of you, Cat and Lauren, to our webinar show here. And look forward to seeing you sometime again. Thanks everybody.

Cat Swetel:

Thank you.

Lauren Caliolio:

Thanks for having us.