
Speakers

Stephen M. Dick

David Levinger

Matthew Dodge

John Weil
Video Transcript
Matthew Dodge (00:00):
Hello. Welcome everyone. Um, today we're gonna talk about how to talk to your executive team about reliability. My name is Matthew Dodge. I'm a senior customer success manager here at Blameless. Also from the Blameless side, I have with me John Weil, who's an account executive on the Blameless team. Um, and we also have a couple of our customers who have been really great advocates for us. Um, I'll actually let them introduce themselves. Stephen and David, if you wanna introduce yourselves.
Stephen M. Dick (00:29):
Thank you, Matt. David, do you wanna kick us off?
David Levinger (00:32):
Sure. Hi, my name is David Levinger. I'm, uh, the SVP of operations at Machine, uh, a healthcare technology startup company that uses artificial intelligence software to solve, uh, payers' problems in the cloud.
Stephen M. Dick (00:46):
Thank you, David. Good morning, good afternoon, good evening to wherever you are. Thank you for joining us at this webinar. My name is Stephen Dick. Uh, what I do in the industry is I build global engineering teams that focus on the reliability of critical systems. I've served in a variety of senior leadership positions at SAP and Salesforce, most recently at BetterCloud. And I'm just delighted that you could all join us, either live or watching the recording.
Matthew Dodge (01:15):
Great. Thank you guys. Um, and we'll just jump right into it, um, with a couple of questions. Um, first off, we wanna say, and this is, this is actually in the chat, but, um, if you have questions throughout, um, please feel free to put those in the chat if you're watching this live. Um, and we'll address those at the end. Um, but between now and then we'll just have a conversation. We're gonna talk to Stephen and David about kind of their, their journey through adopting Blameless at their respective companies. And, um, the, the initial thing that we look at when we have these kind of early conversations is everyone is doing something, right? Um, and you're considering Blameless to address your incident response. So initially when we have these conversations, we talk about things like, what is the current state, um, how does your executive team think about reliability? What are their expectations today? Um, and then also, once you bring in Blameless, how the expectations for your executive team kind of change over time. So David and Stephen, I, I know your stories are similar, but a little bit different. Um, and then we'll, we'll let you guys decide who goes first. But in that early phase, looking back, can you talk a little bit about the existing state, kind of pre-Blameless?
David Levinger (02:42):
You want to kick us off, Stephen?
Stephen M. Dick (02:44):
Yeah, a hundred percent. So I can speak to, uh, a number of journeys. I think when it comes to, uh, how, how executives, uh, think about reliability, um, the most important thing to think about is, uh, internal perception is driven by how customers perceive the product. And so you need to link the perception the whole way back to how customers perceive the product. And there's a number of different ways to do that. Um, you can look at CSAT. Um, typically many companies will be running NPS scores on their customer base, and you'll get qualitative feedback. And high customer perception is related to the reliability of your product. You can look at reliability-related churn. Uh, for SaaS companies, typically you want your churn to be below 10% or so. Uh, but then oftentimes you're looking at, out of that 10% of your customers that are churning, how much of that is related to reliability?
(03:42):
But you can also look at support ticket volume and a number of other areas as well. Um, so at one of my previous companies, uh, what we saw was reliability-related churn was sky high, it was 15%. Um, we were getting a lot of feedback from our customers that their perception of how we were providing a reliable experience was just quite low. Uh, that feedback was coming in through our NPS scores, through our support tickets, but of course our reliability-related churn as well. And that was a key driver for rolling out Blameless tooling, a whole new culture around our incidents and a whole new focus on product reliability. So if that encapsulates maybe the before of our story, maybe a good point here will be, uh, David, I'll pivot over to you if you wanna maybe describe, you know, what the before journey was, and then maybe we can look at how Blameless changed the game.
David Levinger (04:42):
Yeah. Uh, so it's interesting for us, uh, a lot of our customers were enterprise software companies, right? So they were used to non-SaaS products, and some of their integrations with us, uh, didn't really suffer from, from outages. So for us it was a little bit different, right? We had an incident management program that we used to, to deal with reliability and uptime. And, and from, uh, from most people's perspective, it was fine, right? But it wasn't fine from, from our perspective, from the ops team's perspective. From our perspective, uh, we were dealing with too many unaccounted-for outages and we weren't seeing the progress. So like for us, the focus was really about how do we revolutionize this so that we are woken up less, so that we have fewer problems that we're scrambling to solve, right? Um, that when we go to solve them, it's predictable. And then how do we feed that back into the process so that we do become more reliable, right? So we were kind of trying to head off, uh, some of what, what Stephen was, uh, suggesting, right? Like, we didn't want to get to that point. Um, and we knew we were, we were like on a track for it, right? We were heading right towards it. Eventually it was gonna kind of go off that edge and we didn't wanna get there. So that was the, the, the before state, you know what I mean?
John Weil (05:59):
Yeah. And I have a, I have a quick question for you on that, cuz um, what I hear a lot of times is sort of similar to what you had just described, right? Where maybe certain, certain parts of like an old guard say, well, this is fine, this is working for me. But you know, David, as you just said, the ops team says, well, this isn't fine. I wanna be woken up less. I wanna have, uh, a little bit of a better plan. Yeah. How, how do you help cross that bridge if parts of the organization, executive team or otherwise, say, hey, this is, this is fine, but you know it's not fine? How do you, you know, how do you help cross that bridge?
David Levinger (06:31):
Yeah. Uh, I think the easiest way to do it is to start small. Uh, and to show them that the, what you're asking for investment-wise, time-wise, et cetera, is not large. Um, and then you tie that into, but the benefits we can gain from this are massive. And then you throw in a little bit of a dose of fear, right? Like, you don't want us to not wake up, right? You don't want us to have a moment where we don't solve the problem and it becomes customer impacting in a way that is catastrophic for our business. And it's important to help them see that picture, even if they've never experienced it. So it's a little bit of, um, touting your team's performance. Look, we've done a phenomenal job, but what if we don't? Or what if this incident that we recovered from this last time turns into something 10x as big next time and we're not getting that, like, the feedback loop, right? So really pushing to that, that kind of next level, like, you want the business to grow in this productive way, in this positive way, we need a better tool to make that happen.
Stephen M. Dick (07:33):
If I can jump in there as well, like, one of the areas that I'll find success in is I like to think of my role, you know, obviously operating as a VP of SRE, I like to think of my role as being the chief apology officer. And so when an incident or a problem does happen, I like to pivot my role into more of a customer-facing role. And I love talking to customers, and it's always fascinating the type of feedback that you get from your customers around how incidents are affecting them, how performance is affecting them, how escaped defects are, are affecting them. And the way I use that tool, uh, uh, within my repertoire is I'll take these customer quotes and bring these back into our executive team to tell the customer's story. So I like to narrate, here's what our customers are saying, uh, here's the feedback that we're getting from our customers, either in the form of a quote or a video, which can be equally as powerful as well. And sometimes that can be a helpful double team to what David is saying, you know, you've gotta tout your team's abilities, but sometimes also having another team or an entity tout, uh, your own abilities or, uh, the effectiveness or ineffectiveness of some of your measures can be helpful as well.
Matthew Dodge (08:52):
That's great. And then with, with that, how does that, um, feed into the change of the executive expectations over time? Right? Uh, part of it is the team's performance kind of validating the performance, which you've spoken to. Part of it is that customer story and the power of that customer story. So over time, is there, do you see this shift with the executive expectations where they, they come to expect that customer story or they're, they're asking more questions around, um, the RCAs and some of the, the metrics and things like that because of this shift in, um, the tooling or the, the culture and the process? All of the above really.
Stephen M. Dick (09:35):
So I can speak to what I've seen in the industry and, uh, you know, I, I start with using industry-standard metrics like time to resolve, time to detect, escaped defect velocity. Um, but it's really, once you double click into the more granular details, I've noticed you really need to start to explain some more context around the metrics that you're using. And so I like to use data to instill a culture of operational, uh, urgency, especially in less, uh, mature environments where, you know, maybe you don't have automated rollbacks, maybe you haven't invested in the infrastructure around self-healing systems. Maybe you don't have the thought leadership in poka-yoke concepts and circuit breakers. And so maybe you do have these long-running incidents. And one of the challenges there is instilling a culture of operational urgency. And so what I do to supplement some of the industry-standard metrics is I start to publish dashboards that chart the detailed timeline of an incident through live metrics.
(10:38):
So I like to show our time to assemble, our time to acknowledge a page, our time to detect an issue, our time to mitigate the impact for our customers, using live yet aggregated data in a way that tells a story. And that helps people understand, okay, well listen, we can break down these metrics and break down the incident lifecycle in a way that we can optimize over time. And so part of, uh, what you're describing, Matthew, is yes, uh, data is important, but also I think how you use that data and how you explain data and how you bring people along for a journey is also particularly critical as well.
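[Editor's illustration, not part of the spoken transcript.] The per-phase breakdown Stephen describes could be sketched roughly as below. This is a minimal sketch under stated assumptions: the timestamp field names (`started`, `detected`, `paged`, `acknowledged`, `assembled`, `mitigated`), the phase definitions, and the sample incidents are all hypothetical, and this is not how Blameless or any particular dashboard computes these numbers.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical phase definitions: (metric name, start field, end field).
# The field names are assumptions made for this sketch only.
PHASES = [
    ("time_to_detect", "started", "detected"),
    ("time_to_acknowledge", "paged", "acknowledged"),
    ("time_to_assemble", "acknowledged", "assembled"),
    ("time_to_mitigate", "started", "mitigated"),
]


def phase_minutes(incident):
    """Return the duration of each lifecycle phase, in minutes, for one incident."""
    return {
        name: (incident[end] - incident[start]).total_seconds() / 60
        for name, start, end in PHASES
    }


def aggregate(incidents):
    """Average each phase across incidents, the 'aggregated' view for a dashboard."""
    per_incident = [phase_minutes(i) for i in incidents]
    return {name: mean(p[name] for p in per_incident) for name, _, _ in PHASES}


def make_incident(t0, detected, paged, acknowledged, assembled, mitigated):
    """Build a made-up incident from minute offsets relative to its start time."""
    offsets = {
        "started": 0,
        "detected": detected,
        "paged": paged,
        "acknowledged": acknowledged,
        "assembled": assembled,
        "mitigated": mitigated,
    }
    return {field: t0 + timedelta(minutes=m) for field, m in offsets.items()}


# Two invented incidents, purely for illustration.
incidents = [
    make_incident(datetime(2023, 1, 1, 10, 0), 5, 6, 8, 15, 45),
    make_incident(datetime(2023, 1, 8, 14, 0), 15, 16, 20, 25, 75),
]

summary = aggregate(incidents)
for name, minutes in summary.items():
    print(f"{name}: {minutes:.1f} min")
```

Splitting the lifecycle into named phases, rather than reporting one overall time to resolve, is what lets a team see which phase to optimize first.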
David Levinger (11:18):
Yeah, I'd say leveraging data is, is massively important. You need people to be able to see the benefits, so to speak, right? Like, you can, you can talk about it all you want, but when they can see that the data's changed or they can see that mean time to resolution has shrunk, um, that's very, very important. I'd say the other, the other piece on my side, one of the biggest shifts that started becoming an expectation at the executive level was around calmness during an incident. And it's, it's really interesting how many people don't want to declare an incident or just say there's a reliability problem. Like, they don't even want to do it. They're like, no, no, no, no, it's fine. Let's not talk about it. Um, and part of it is that there is a, a stress and a panic with it, right? So one of the ways that we tried to cross that was, um, after the operations team had started leveraging Blameless, we transitioned over to getting engineering to do it.
(12:14):
And the first few engineering incidents were like everybody talking and panicking and focusing on the wrong things. And it was, it was, anybody that had been in an operations incident was like, oh my goodness, this, we, we need to fix this. This isn't good. And we weren't even getting to that resolution as quickly, right? And then once we showed some of that, it became very easy for them to go, oh, this matters. It's not just something that you guys do, it actually matters for us in resolving a bug. It actually matters for us in resolving, uh, to Stephen's point, even a, a customer expectation mismatch, right? Like, hey, why was this slow? Was that actually an outage or not an outage? Right? Um, and being able to respond to those things very quickly, going back into metrics, matters, right? So sometimes it's mean time to resolution of a real incident, and sometimes it's mean time to resolution of a perceived incident or perceived problem. And honestly, from a customer perspective, it doesn't really matter, right? Perception is almost everything, right? Like, your system can be up and you're like, no, it's not an outage. They're like, yes, it is, right? So that definitely started to shift the executive perspective into, no, we need this for everything. This needs to be universal, right? Nobody can have, you know, incidents or issues and not leverage a tool that helps us with that feedback loop.
John Weil (13:33):
Awesome. I appreciate, uh, both of those stories, uh, right there. Um, and I think storytelling is everything, right? That's like the, the biggest key to life. If you can tell a good story, you can probably get your point across in almost any discipline, but it's, it's great to hear, uh, from the front lines a little bit about your experience, particularly with, uh, reliability and executives. Um, you know, David, you were talking about getting engineering and operations on the same page. So you know, here to start, obviously we wanted to dwell on sort of the executive buy-in, right? Because like, sort of getting that top-down buy-in, uh, is critically important. We've also seen for, you know, our customers and the conversations that we're having, you know, it's, it's equally as important, like you were saying, getting everyone from engineering and operations, uh, on the same page. So, um, you know, David, I'll, I'll go back to you to start, since you were just dwelling on this point here, which is, maybe the storytelling for an executive is different from the storytelling for folks in engineering. So, you know, if maybe you could tell us some stories from, from the battlefield of, you know, getting people on the same page, how that story's a little bit different, how that conversation is a little bit different, um, I think that might be helpful for everyone today.
David Levinger (14:41):
Yeah. Um, so on the executive side, it's very focused on why did it take you so long to resolve that? Or why was there such a challenge in understanding that this was a problem or why did we do a thing, right? So it's, it's kind of interjecting questions that you already know the answer to, but getting the other folks on the executive team to think about it, why is that? Why didn't they know? And then of course, to be able to provide an answer, right? And say like, Hey, you know, why don't we try this? Why don't we give this a shot? Why don't we, Hey, the next time this happens, let us shepherd you, let us help you, let us shepherd you through this, right? Um, and then on the engineering side, it's kind of flipping that, right? Like, hey, you don't wanna have like people questioning why it took you so long to resolve this, right?
(15:28):
You don't want to be stressed when there's a bug or an incident or a problem. This tooling will help you do that while providing the, the feedback loop to give you the, so sorry. In some cases, engineering knows there's a problem, but they don't have, uh, the priority to fix it. And being able to say to them, hey, this gives you that priority. This is showing that this did really cause a problem, right? It's not you saying there's a problem. Like, no, it's a legitimate problem. And then to what Stephen was saying, making it metrics driven. Did it happen once? Did it happen twice? Is this the fifth time? Right? Like, those things make a big, a big difference in increasing that priority, especially for problems that are not sev-zero, right? Slowness can still be a huge issue. So it's kind of two different sides of the same conversation.
Stephen M. Dick (16:15):
Yeah. I'll chime in on that. What I've noticed is people tend to be motivated by two different things. One is we're motivated by positive emotion, and then we're motivated to try to avoid negative emotion. And what I've noticed is people respond a lot more strongly to avoiding negative emotion. And so when it comes to an incident, nobody enjoys being in these kinds of emergency war rooms where everybody's stressed, the pressure's on, maybe everybody's on camera, maybe you have an engineer who is working on the issue. And what you hear is, tap, tap, tap. Oh no. Tap, tap, tap. That shouldn't be happening. Tap, tap, tap. Silence. And everybody's just watching this, right? I mean, nobody enjoys those kinds of environments. And you know, the feedback that I've heard from engineers, from executives, from engineering managers, it's much more around how do we avoid the pressure cooker environment when we do have an issue?
(17:18):
And this is where I think establishing an incident practice becomes critical. It's not just a tool, it's not just a culture, it's not just a process, but it's a practice. You know, it involves repetition, it involves making sure that what is perceived to be an emergency gets repeated often enough in safe environments where everybody becomes comfortable doing what they're supposed to be doing, and there's muscle memory there. And that brings the, uh, the negative emotion, the adrenaline, way down. And so if I was to make a recommendation to anybody listening out there, it's think about your practice as well as your tooling. Think about how you repeat incidents, do drills, do game day exercises. So everybody enters into these incidents knowing what to do, uh, with a brain that isn't under the influence of adrenaline, so they can do what they do to the best degree possible as well.
David Levinger (18:15):
Yeah, practice is incredibly important, and it's one of the things that it's, it's hard to get that practice when you're following a document, right? Like if your incident management response, your reliability, is this document that you're picking up and you're like, read step one and read step two, that's hard. Um, but when you have kind of these war rooms and these breakout sessions, uh, a lot of what, what Blameless enables, it makes it easier to go through that in a consistent manner where, you know, you're not freaked out about it anymore. But without the practice, like Stephen's saying, without that practice, you are still freaked out, right? They'll get in there and they're like, ah, um, but go through it a few times and then people are just, they get on, they're like, here it is. Here's the thing. It's just very factual. And in many cases, not in many cases, I would say in all cases, that leads to a faster resolution. And it also leads to more people wanting to use the tool because they see the benefit of, oh man, that wasn't stressful and we solved it in 10 minutes. Like, awesome, right? We did a great job, it was great for the customer. We have a good answer for engineering. It's all crystal clear. And I didn't, you know, feel freaked out, um, during that whole process.
Stephen M. Dick (19:25):
Yeah, I completely agree with that. I've, I've noticed like smaller companies, less mature companies, um, will focus on playbooks and will have like an incident management document that will prescribe, here's what everybody's supposed to do. And I was at a company that had this, uh, 32-row RACI document that prescribed roles and responsibilities for everybody. Um, but similar to what David is articulating there, people look at that and go into freeze mode during, during an event. And so a tool like Blameless can really help you automate your playbook. And so a good way to think about it is a playbook describes your process, but then you need to automate it somehow, and automating it provides a range of benefits. Um, it takes the cognitive load off your incident responders, it allows 'em to focus on technical analysis and resolving the issue, but it also makes the overall flow of the incident a lot more simple as well.
Matthew Dodge (20:25):
Yeah. And, and that's a great point. And I think some of that, um, ease, ease of use or like reducing the, the cognitive load is a really important point. Um, one thing that came up as David was speaking on a previous point just now, um, we find it interesting sometimes, um, what not to work on is interesting, right? Um, and there can be a value of Blameless there. And, and what I mean by that is to say, uh, this idea of like the ticket tracking and the priority tracking, right? Um, and then also kind of how you evaluate, how you evaluate that and how that evaluation process kind of changes over time. Have you seen, or, or I should say, let me ask it in a different way, what has been your experience with, um, this kind of bringing to the surface, um, how, how you evaluate that priority of what to work on and what not to work on?
David Levinger (21:26):
Oh yeah. I mean that's, to me, that's like top of, top of the heap on making an incident effective, right? Like nine times outta 10, you're working on an incident, you get into it, you find a problem. Is it the problem? No, but it is a problem, right? So like having that ability to really quickly be like, yeah, you're right, that's a problem, to-do that, right? Hashtag to-do that in the, in the thing so that there's a, a, a tracker for later, and move on, cuz it is not relevant to this issue. And in some of the really complicated issues we've come away with 10, 15 different to-dos. That ended up, again, when I talk about that, I'm gonna keep talking about that loop, right? That feeds back into, oh, now we can make sure this doesn't happen again. But people will get hung up, especially in an incident, especially when they're in that pressure cooker, right? Like, no, but this is a problem and we should fix it. And having, uh, kind of an easy pressure release valve to say, you're right, this is a problem, right? To-do, and let's move on with this incident, but we're gonna come back to that. We're not gonna let that drop by the wayside, right? We have a way to track that, tie it back to this incident, tie it back potentially to other incidents. I think it's just very valuable. It's been huge for us.
Stephen M. Dick (22:41):
Yeah, that's great. I, I see this come up a lot. The key question is oftentimes, when is this incident resolved, or when are we out of incident? That's another common question, right? And so I like to say whenever the customer impact is mitigated, that's when we go back to what we call peacetime mode. And if you think about an incident, that's wartime mode. It's expensive to have everybody there doing exception handling on a bridge. Uh, so you wanna get your staff back to peacetime mode, back to building great products, as quickly as possible. And so one of the most effective questions that I'll find to do that is, is the customer impact mitigated? And from there it becomes more of a decision tree.
John Weil (23:30):
Awesome. Thank you for that. Thank you. Uh, just another, uh, quick question in terms of like working on something and then not working on something. So what both of you just described was, hey, how do we, how do we maintain priority inside of like an active incident, right? How do we make sure we keep the, the focus on, you know, restoring customer service and, uh, going back to peacetime, right? What I'm curious about is, as a part of the after, right, like the post-incident process, and just in the vein of getting engineering and operations on the same page, have you found, um, either with your current experience with Blameless or in a previous life, um, and Stephen, I'll, I'll sort of lob this question over to you to, to kick things off, which is, you know, sometimes there's a lot of those to-dos that come out, right?
(24:14):
A lot of tickets that come out of an incident. And sometimes that can maybe lead to, you know, that that nervousness that you talk about amongst engineers as it relates to incidents, things you're working on. So have you found there to be value in terms of, hey, we created, you know, some action items or there were some tickets that came out of an incident, but actually we can deprioritize that work, um, just to keep maybe engineering a little bit more streamlined on other things and they don't feel like there's always this backlog of bugs or items for them to work on. So have you, have you found that to be a, a, a a tool you've ever used before?
Stephen M. Dick (24:48):
Yeah, so I see the anxiety come up in a number of different places. Sometimes during the incident itself, people just start finding overwhelming amounts of additional or tertiary issues, and that can overwhelm the incident response team. And then the other thing I've seen come up a lot is, uh, there's, there can be this sense of distrust or mistrust of the post-incident follow-up process. Uh, what we often describe as the RCA or post-mortem process. If that process isn't locked on and, uh, doesn't have a super strong feedback loop with metrics attached to it and cultural buy-in to fix underlying issues after the incident, I've noticed people can be anxious and not close off the incident, because they think the only way to resolve the issue is with an incident. And so I see incident response and, uh, problem management as two sides of the same coin. Uh, you resolve immediate issues during the incident itself, but then it's critical to have a post-game analysis phase where you actually fix underlying issues that you identified during the incident itself. And that speaks to the culture around your incident process. And it speaks to the confidence that people can have. Um, so I would definitely recommend thinking about the problem management process and making sure that that's pretty well locked on as well.
David Levinger (26:13):
Yeah, one of the most productive meetings that we have is our retrospective meeting of all incidents in the previous week. And that's where we kind of hold ourselves accountable, exactly like Stephen's saying, to: did we get all the correct follow-ups? Uh, did we mitigate the things that truly could cause this incident to happen immediately, right? Because there could be things that could help in the future, but they're not like a now thing, right? Um, and then by having that in a discussion, and we do this now for several product lines at our company, not just the ops side, but also in engineering, you also get the team, you, you, you end up with two benefits, well, probably more than two, but two that are top of mind for me. One is, um, you're, you're instantly sharing information, right? Everybody now understands this incident occurred.
(26:59):
Here are the things that happened that, that helped you to resolve it. Here are the follow-ups, here's what's been completed, here's what's planned for the future. Um, and then they also get to kind of question you on it. And that's really important as well, because there are many, many times that somebody's like, oh yeah, this thing happened. I restarted a service. So we're gonna go, what? That's not, no <laugh>, that's not the resolution to the incident. Like, wait, why did it happen? Right? So that, that speaks to what Stephen was saying related to culture, right? So when you have those meetings, people can't hide from it anymore, right? You're, you're having this forum to talk about it. The talking about it is always very focused on how do we just make sure this doesn't happen again. It's not blaming people, it's not trying to make anyone feel bad cuz they made a bad decision.
(27:41):
At the end of the day, we're all gonna make mistakes. The question is, are we making the same mistakes? Because if we're making the same mistakes, we have a problem. But if we're making new mistakes, things are failing in new ways, and you can, you can show that to everybody, then you end up with the whole team, not just your team, but the engineering team, the, the product team, even the folks that are engaging with customers going like, no, this process matters. I see the iterative improvement over time. I see it, you know, benefiting, but if you miss that, then people are like, this is a waste of my time, right? I I spent all this time on an incident, nothing improved. What's the point? Right? So you're like, the culture and that practice is very important.
Stephen M. Dick (28:20):
Yeah, I completely agree with that. So one of the metrics that I've used in the past is the number of repeat incidents. And that measures what David is talking about, is how many mistakes are just repeating time and time again. And that should be the metric that you want to be as close to zero as you can possibly get. Um, and that meeting that David is talking about, that weekly incident review meeting, I mean, executive behavior is so incredibly nuanced in those kinds of meetings. I mean, you can have a, a very Socratic style, but if your tone is off just a little bit, your engineers really pick up on that, right? Uh, so I've, I've placed, uh, a lot of emphasis on that meeting, um, over the years. I bring donuts in when I'm in the office to those kinds of meetings just to try to lighten the load and to mark the time as something celebratory.
(29:13):
It's more of a learning opportunity rather than a criticizing opportunity. But, um, it's incredibly nuanced. There's a lot of, I think there's a lot of really bad ways of doing meetings like that, and I've probably done those, and I've got the battle scars to prove it. Um, so I can definitely speak to it a little bit. But if you can find ways of lightening the, uh, the overall tone, bringing in donuts, making dad jokes is another tool that I use a lot in meetings like this. But to David's point, it's a critical meeting and can amplify the learnings of the incident itself.
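[Editor's illustration, not part of the spoken transcript.] The repeat-incident metric Stephen mentions could be counted along these lines. This is only a sketch: it assumes each incident record carries a `root_cause` label assigned during the retrospective, and that two incidents sharing a label count as a repeat. Both the field name and the sample data are hypothetical.

```python
from collections import Counter


def repeat_incident_count(incidents):
    """Count incidents whose root cause had already been seen before.

    The first occurrence of a root cause is a new mistake; every later
    occurrence of the same root cause counts as one repeat.
    """
    counts = Counter(i["root_cause"] for i in incidents)
    return sum(n - 1 for n in counts.values() if n > 1)


# Invented incident records, purely for illustration.
incidents = [
    {"id": 1, "root_cause": "expired-cert"},
    {"id": 2, "root_cause": "disk-full"},
    {"id": 3, "root_cause": "expired-cert"},
    {"id": 4, "root_cause": "bad-deploy"},
    {"id": 5, "root_cause": "expired-cert"},
]

repeats = repeat_incident_count(incidents)
repeat_rate = repeats / len(incidents)
print(f"repeat incidents: {repeats} of {len(incidents)} ({repeat_rate:.0%})")
```

How root causes get labeled matters more than the arithmetic; a weekly review like the one described above is one natural place to agree on those labels.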
David Levinger (29:46):
Yeah. I, I want to double down on what Stephen just said. You have to make it a positive meeting. They have to see that meeting as, this is improving the product. This is improving the customer experience. This is improving my stress level, your stress level. Like they have to, it has to be positive. So you have to be very careful about framing things through that light, right? This is about improvement, this is about making things better. This is about giving you the ammo you need to justify that fix that you always wanted to put in cuz you knew that there was a problem over there, right? Um, and if you keep it positive, then, then people walk outta that meeting going, this was good, right? But if it starts to trend into that negative side where people feel like they're being punished or, uh, or you know, it's their fault that something happened, like at the end of the day, it's, it's almost never one person's fault that something happened, right? It's a failure of the system. It's a failure of the teams of people that go into making these products work. So by, by fostering that message of positivity and team inclusion and, and all that sort of stuff, you really start to make everyone feel like, oh, this is good. This is positive. We're coming together as a group, right? It's, it's not adversarial, which is, I think, really important.
John Weil (31:03):
Uh, Stephen, do you bring in repeat donuts for repeat incidents or do you like to rotate through
Stephen M. Dick (31:08):
<laugh>? Uh, so I like to, uh, the, the trick up my sleeve was Taco Tuesdays, and here's just a bit of a lighthearted story about this. I grew up in Ireland, and tacos in Ireland back then were these crusty shell things with broken-up hamburger meat and lots of salt. So when I came to California and discovered the world of tacos, I was just blown away. And so I instituted Taco Tuesdays as a way of, well, just eating tacos, <laugh>, um, so no repeat donuts, uh, but Taco Tuesdays is definitely something I like to leverage.
Matthew Dodge (31:48):
That's great. Oh, that's a great story by the way. Um, in the different, uh, cuisines <laugh>. But, um, we, you guys have actually touched on this, so I wanna move forward a little bit and talk about this in a little bit of a different way, but, uh, it's a great time to talk about it because what you guys were just discussing really feeds into this, and that's the question around the culture and process and the tooling, actually. So this is all part of the same motion, if you will, but, uh, sometimes you can kind of emphasize one over the other. I think some of the stuff that we just spoke about is, uh, especially, uh, well actually for both of you, the way that you approach that meeting, the culture that you're trying to build around, the positivity around the learning and bringing people together and, and getting that, that feeling of, um, you know, motivated to kind of do more and motivated to kind of continue in this direction and, and everybody's supporting each other. I mean, I think that's fantastic from a culture standpoint, but is that something that is, uh, a result of the change to the tooling, the change to the process? Or how do you kind of see this motion working with these three pieces?
David Levinger (33:07):
You wanna kick us off, Stephen?
Stephen M. Dick (33:09):
Sure, I'll kick us off. Um, here's, here's an interesting fact, and this is not proprietary knowledge. Uh, when I joined Salesforce about seven or eight years ago, it was right after the worst downtime event in Salesforce's history. And you can read about this on Google. Uh, what happened was, uh, over 10,000 customers couldn't access the platform for over 24 hours. There was data loss, it was all over Twitter. Uh, Marc Benioff had to issue an apology, and that's what I was stepping into. Part of my remit, uh, back then was to materially improve the global incident management process across the entire company. Now, at the time, Salesforce was about 35,000 people and we had every problem you could imagine under the sun. We had long times to resolve issues. We had the press reaching out to our employees, uh, businessinsider.com, if you're watching this, uh, watch out.
(34:12):
Uh, we had the press reaching out to our employees en masse. Uh, we had Twitter and social media challenges and, and problems scaling our social media, uh, across such a vast media empire. And so one of the things that I was thinking about is, jeepers, you know, this seems like a bit of a catastrophe. It feels a lot like the California wildfires that we've, that we've experienced in the last five or six years. It kind of feels like a major hurricane over on the East Coast. Uh, we had massive amounts of teams that had never really talked to each other before. I mean, they didn't even know that each other existed at the company. And so what I was thinking about is how do you orchestrate all of these teams in a way that's cohesive? And so what I did is I brought in, uh, the incident management system through a set of external consultants, and they trained our executives in how to respond to incidents in exactly the same way that FEMA and the FBI and police departments respond to incidents across the United States. And so I think about tooling, process, culture, but also practice. How do you form a practice? And so we modeled our incident response practice after FEMA because it became clear that the scale of our incidents had just outscaled a normal incident response, uh, process. Um, so I think about that three-legged stool, but when you bring them all together, it forms a practice. And how you operationalize that practice is absolutely critical.
David Levinger (35:51):
Yeah, I agree. Um, I don't know that there's a solid first. Uh, I also think it's funny cuz we actually modeled some of ours after FEMA stuff as well, <laugh>. So I didn't even know that until just now that we, we shared that kind of outlook on some of this. But I think that the, um, I think for me it's, you start with lean process, and by lean I mean it can be as minimal as you can get, right? So initially it was just, just start, just use the tool. Just start an incident in the tool. That's it. We had no other process. Um, and then you, you're cultivating that culture of just do that one thing that takes no time at all, right? And then, okay, what worked and what didn't work, right? And so initially our process was a paper document, right? So follow this paper document and it was like, man, this is obtuse and I don't like it.
(36:44):
Um, how do we make this better? And then the recommendation was, oh, well let's look at, let's look at a tool. Let's look at Blameless. So when we started with Blameless, it was the same thing: as lean as we can make it. My goal was that you could start and close an incident in, in under five minutes. Like if it took longer than five minutes to do it, that is not how we're starting this process. And then you're cultivating that culture of a lean process, and tools help the process stay lean and make it easier. And then it's that cycle, right? So the, the culture is embrace this continuous improvement and only expand the process if it matters and contract it whenever it doesn't, right? So if you're doing something and at the end of that retrospective meeting you're like, I don't see value in this anymore.
(37:30):
If the team's like, yeah, I don't either. Nope, there's no value. Cool. Strip it out of the process, done, update the documents, move forward. And so if I had to pick a first, I would probably say the culture of iterative improvement, the culture of lean process and the belief that you can keep process lean while adding value, and then looking at tooling as a way to make that more optimized, to save your team's, you know, time and trouble and that sort of stuff. And you have to be continuously improving it, right? You never want people to get off an incident or a retrospective meeting or anything and go, what a waste of my time. Like if there's anything wasting somebody's time, we should be constantly questioning why. And, um, and the customizations in Blameless, the ability to configure it and change it, manipulate it and all that sort of stuff has been awesome because every team at our company follows the same rough process but with different intermediary steps, right? You can customize it for them because the questions aren't the same and the things they need to check aren't the same and the ways they need to respond to them may not be the same, but the process is, right? You open incidents, you resolve incidents, you follow up on follow-ups. That process is the same and the culture embraces the process. The culture embraces that. And then you have that customization and the tooling to make it easier for everybody.
Stephen M. Dick (38:50):
Yeah, I would just plus one that. I think continuous improvement of the process is critical. Something that I've used before to do that is I've done live RCAs on the incident bridge after the customer impact has been mitigated, but with a focus only on process improvement. When it comes to the RCA, uh, of the technical scenario, oftentimes that's left to, uh, digest and sit for maybe a day or two so people can really investigate and really uncover the true drivers. But the RCA of the process and continuing, continuing to evolve the process can be done live on the incident bridge when it's fresh on everybody's mind, and sort of spending five minutes at the end of an incident just to decompress and get everybody's thoughts around what parts of the process worked, what parts didn't, what needs to be evolved or evolved away. I think that's absolutely critical as well.
David Levinger (39:45):
Yeah, you said decompress and I think that that's really important too. You don't want people dropping off an incident feeling, oh gosh, that was just the worst. I just, you know, you want the, you wanna have this moment where they, they leave with hope, they leave with a belief that no, this isn't gonna happen again. This is better. Right? So having that, what you just described, I think is, is brilliant. I'll, I'll plus one back at you for that because it gives 'em that, that peace of mind. Like, no, the process worked or it didn't work, but we know why it didn't work and we fixed it so the next time it's gonna work. So they don't have that that, that hopelessness of like, God, we don't know why it broke and we didn't know when it broke and a customer told us and we just worked on this for, you know, this really long time cuz it was super hard and it might happen again in 15 minutes because we don't know. Right? So like shifting that where they do know, uh, I think is very, very important.
John Weil (40:42):
It's, it's really, uh, interesting to hear, um, both of you say practice and process improvement, um, as it relates to, um, you know, practicing the incident management process, whatever that might be. But then also related to the after incident, um, which I think is a, is a huge component, and the way that both of you described, you know, your journeys at previous organizations and even now was pretty interesting. I was trying to sort it out in my mind. So it sounds like, you know, to implement a new culture, you implement a new process. So it's a little bit of a chicken and the egg cuz you're doing them at the same time, right? Like, hey, here's the new process, the new process is the new culture, and we'll also drive cultural changes. And then tooling you can sort of layer on, right?
(41:28):
That you were either using existing tooling or maybe we're changing the process, changing the tooling. Maybe we change the tooling down the line as well. Um, but both of you keyed on, as a part of, as a part of building a practice, it's good to do, you know, game plans and game days either, you know, when they're in, when we're in peacetime, to borrow Stephen's language, or immediately after to do a live RCA. How do you, you know, how do you find, obviously with Blameless, I'm very aware of how, you know, capturing this sort of process improvement and changes happens. But, um, you know, uh, I'll, I guess David, I'll, I'll point this to you first, which is, you know, what, what's a good way of capturing that feedback, right? Either in a previous life, current life, how do you capture that feedback of, hey, we ran a game day, here's what we found. Or you know, Stephen, when we're, when he's doing those live RCAs, hey here's, here's these process improvement changes, here's how we're capturing that, here's how we're implementing it, you know, what, what are your strategies and, and, and sort of tools there.
David Levinger (42:27):
Yeah, I mean capturing it. We use the, like I said, the to-do functionality in Blameless like all the time. Um, and when we're doing that retrospective meeting and we're going through the incident, like some of those meetings are super short cuz we have a clear understanding. We're like, boom, boom, boom, we're done. Everybody's like, yeah, that sounds good. We move on. Some of them get, um, contentious is the, is the wrong word, but there's a lot of exciting conversation and sometimes we'll end up in those meetings, you know, going, oh, we need to add another follow up, we need to add another follow up, we need to add another follow up. Um, and then the question after that is, how do you prioritize it, right? Because you're not gonna be like, oh, so put 'em all on the next sprint, right? <laugh>, it's like, that's not gonna work.
(43:06):
Um, so you have like a bit of a risk mitigation kind of thing, right? Like how, how, how much is this gonna help us mitigate things in the future? Um, you know, have we seen related either repeats or things that are close enough to repeats that we should pay attention to it? Um, I think that when you're trying to kind of smash culture, tooling and process together, you know, people have to see that improvement and reduction of stress and believe that the tools they're using to do it are getting better. Um, and so you, you are doing them all at the same time, but you're, you're doing it in a way where people are excited about the next iteration because they truly believe that next iteration will improve their life, will improve the customer, will make it less stressful, make it easier. Um, and without that tracking piece, you're not gonna make those iterative improvements, right?
(43:59):
Um, like we tend to ask a couple questions all the time, right? Was this detected by an automated alert, right? If the answer is no, we need to make one. Was there a runbook written about how to solve this? If the answer is no, you need to write one. Um, now does it mean that that runbook is gonna help you solve the problem perfectly every time? No, but it's gonna, it's gonna point you in a lot of directions for previous incidents that did help solve it, right? And so there's some like, uh, consistent to-dos that we always walk out of an incident with, right? Like, you will not have an incident that we didn't detect through some sort of automated alert a second time. We will build the automated alert. Um, we will not have an incident that there was no runbook for. With, with the extreme corner case of no, this truly was a flash in the pan, it's never gonna happen again.
(44:49):
We literally have replaced that piece of software so it can't happen again. But outside of that, you, you do that and then the platform, the, the tooling's ability to link all that together so that it's easy to see it is very valuable, right? Like if we're working with engineering, it's like, Hey, engineering, this is what we want you to do. Here's the incident. You go into that incident, you see all the other to-dos, you see the conversation, you see the retrospective, you see all that communication. There's so much more rich context kind of automatically put together that otherwise is lost.
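The two standing questions David describes (was there an automated alert, and was there a runbook) can be sketched as a small check that produces the consistent to-dos. This is only an illustration in Python: the field names `detected_by_alert`, `has_runbook`, and `component_replaced`, and the `standard_followups` function, are hypothetical, and applying the "flash in the pan" exception to the runbook question only is an assumption based on how he phrases it.

```python
def standard_followups(incident):
    """Return the follow-up to-dos implied by the two standing retrospective questions.

    All field names are hypothetical; they stand in for whatever the team
    records about an incident during its retrospective.
    """
    todos = []
    # Question 1: was this detected by an automated alert? If not, build one.
    if not incident["detected_by_alert"]:
        todos.append("Build an automated alert for this failure mode")
    # Question 2: was there a runbook? If not, write one, unless the affected
    # software was replaced outright so the failure cannot recur (assumed exception).
    if not incident["has_runbook"] and not incident.get("component_replaced", False):
        todos.append("Write a runbook for this failure mode")
    return todos


# Hypothetical incident reported by a customer, with no runbook on file.
incident = {"detected_by_alert": False, "has_runbook": False}
for todo in standard_followups(incident):
    print(todo)
```

Because the questions live in code rather than in someone's memory, they get asked the same way in every retrospective.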
Stephen M. Dick (45:20):
I might reframe the question into what not to do because I've seen a lot of that in the industry. Uh, my only what not to do is don't spin up a Google Doc or Quip doc or Coda doc. Don't just spin up a doc and hope that that's gonna take care of, of your retrospective process. I've seen that a lot in the industry. A lot of companies still do it. There's a number of just very common pain points associated with it. It's difficult to search for after the fact. Everybody loses the link in Slack or whatever. Um, it's also inconsistent when you try to get the same information, like what David's describing, you know, asking the same questions for every retrospective. If that depends on a human being having to remember to ask those questions, your system is fundamentally broken. Uh, so don't spin up a Quip doc or a Google Doc. You need to have a system of record that combines the qualitative assessments during the RCA itself, but then also your system of record where you can open up tickets and then track those tickets the whole way through to completion. So the system of record is really important and it overcomes a number of pain points that Google Docs and other document systems have.
Matthew Dodge (46:39):
That's a great point. And more to that point, I think you, you kind of set this up, Stephen, without even really realizing it. But, um, to, to your point there about the correct kind of tooling to support this change in the culture and change in the process, um, how does Blameless fit into this journey? What's it been like? So, um, we've, you guys have kind of talked about this, um, a little bit more, um, in general terms, right? Needing to have that, needing to have this change and you have this, uh, system of record and things like that, but specific to Blameless. Um, why, and then we're, we're, we're running low on time, but just really quickly to, to both of you, kind of almost like a why Blameless or, you know, how does Blameless fit in that journey for filling in that
David Levinger (47:28):
Piece, that tooling piece?
Stephen M. Dick (47:33):
David, do you wanna kick us off?
David Levinger (47:35):
Yeah, yeah. Um, so for me, number one, I can do everything from Slack. Yeah, that's absolutely massive. And then I can share Slack links with folks after the fact, and I can invite anyone in my org into these incidents without them needing to go to a different tool or learn a different thing. Um, the fact that it integrates into teleconferencing solutions, we use Zoom, is, is also brilliant. There's never a question of like, what bridge are we on? The bridge is in, in the channel, just click the, that bridge. Um, and in many ways you, you want to, as Stephen said earlier, reduce cognitive load. And I think that's an incredibly important thing to do, especially during an incident, right? So when you hop into a Blameless channel and you get a summary like, Hey, here's what's going on with the incident, here's the incident, you know, Zoom link that you can go, uh, join, here's a series of questions that need to get answered.
(48:27):
Here are the different roles that people can have and what they're supposed to do and who's taking care of that. It just reduces that whole load, right? Now people are like, oh, I know what to do, I'm here. Yeah, I did that. Yeah. Oh crap, I forgot to do that. Excellent. Let me go do that. I mean, a common one for us is like reminding people to update external parties about the status of the incident, right? Like you're deep in fixing it and no one knows you're doing it. And so you're doing a great job, but no one knows you're doing a great job. And part of that job has to be communication. So the tool helps with that. Um, and then the integration with the, the issue tracking system so that we can actually track results at the end. That for me, like all of that had to be there. Um, it had to be easy for people to use without having to go to another tool. It had to be easy for them to get updates and, and reduce that cognitive load, and then it had to have that follow up at the end through retrospectives and ticketing so that we actually resolve things on an ongoing basis.
Stephen M. Dick (49:20):
Yeah, I absolutely love that. And then for me, you know, I met the founding team a number of years ago. I was evaluating the products and got to know some of the founding team, and the mission of the company and the team just really resonated with me. Uh, I had spent a number of years trying to code custom solutions for some of the common challenges that Blameless solves for. I remember spending weekends, Saturday nights, up late at night with a bottle of bourbon trying to figure out how do I auto record a bridge? And then looking at my code the next morning and figuring out that, okay, that bourbon was maybe not so great for my coding skills. So, so I had personal experience in trying to figure in, in trying to automate some of the challenges that the product solves for. Um, and then to David's point, uh, the workflow is embedded in a lot of the tools that our engineers use day-to-day. So it doesn't exist outside of the most common workflows. And so when you have a tool that's embedded as part of your day-to-day work, it's a lot easier to adopt. And so the challenges with education, the challenges with change resistance, I find are just a lot more, uh, uh, maybe reduced as a result of how the implementation of Lepo has, uh, has matured over the last number of years. So we're big fans of the product, of the team, uh, big fans of the results that we've gotten through the product as well.
(50:54):
Thank you.
John Weil (50:56):
Yeah, no, appreciate the kind words and uh, I know both of you have been, uh, great customers as well, so we, we appreciate that. Um, I know we have about nine minutes left here, so, uh, as much as I would love to spend the next hour talking about how great Blameless is, I think we can transition to, uh, just a little bit of Q&A from, uh, some of the, the questions we've got. I really appreciate the, the engagement we got throughout this, obviously. Uh, David, Stephen, thank you for, uh, the great discussion. It's always informative for, uh, for me as well. Uh, the first question I'll pose from the chat is, um, how hard was it to get institutional buy-in amongst all teams? Uh, and, and Stephen, I'll, I'll, I'll flip it back to you to answer this first question, which again, was how hard is it to get institutional buy-in amongst all of the teams there? Open to, open to Stephen or David <laugh>?
Stephen M. Dick (51:52):
Sure, happy to jump in. Um, it's always a big spike. Um, it depends on the scenario, what's going on in the business, what's going on in the industry. Usually the trick is, um, what I've noticed is I'll, I'll start with what doesn't work, and I've seen this a lot. Uh, people will start off with what they're solving for in their departments. What are the pain points? Maybe it's, there's, there's too many pages or maybe they wanna migrate a database or something. Um, they have a plan and try to sell that plan. And I think that's the wrong way to go about it. I think people need to invert their perspective and look at what does the business actually need at this point, and how does this process or tool or culture enable the business to accomplish the things that it's trying to accomplish. And so oftentimes that's retention of customers, which is absolutely key in an economy like this.
(52:47):
Uh, oftentimes it's, uh, reducing customer friction and increasing NPS scores. It's absolutely key that the customers you do have are just delighted with your products. And so starting at the business first and figuring out what's happening in the, in the business and how can I sell what I'm trying to do in a way that enables the business, uh, I think that has proven to be useful for me in the past. And just as long as you can focus on the business and business enablement, then the change resistance you often see is maybe a lot less, or maybe primed to be reduced, uh, from there.
David Levinger (53:31):
Yeah, I would, I would say I've never worked at a company where it wasn't hard to get buy-in across, across teams for, for processes like this. Um, but I am primarily a startup guy. I've been in many, many different startups over the course of my career. Startups tend to favor no process. Um, like controlled chaos is like a, a common theme of startups, but a successful startup needs to transition from that into a medium business and hopefully a large business. So there's, there's these kind of moments where you can interject, uh, like I've said many times, lean process, interject it back in, help people see the value of it, and that it's actually an enabling factor. And then they shift from, this is hard and a waste of my time, to why didn't we do this a month ago? This has made me so much more effective.
(54:21):
But you have to, uh, paint that picture, tell that story about, I know it seems bleak now, but I can get you over here. Right? And, and here's the path to making that happen. And then you, of course have to actually do it, right. They have to see those results. And then once you've done that, um, so I guess, sorry, I'll back up a little bit. Picking the right first project, picking the right first team, picking the right first moment to do it, ensuring massive success, repeating it, repeating it, repeating it, talking about how the process helps and the tooling helps, and all that sort of stuff, shifts them from hard to easy. And then you'll cross this boundary where, um, people are then coming to you saying, Hey, I want to use this. Hey, can you train my team on this? Hey, can we shadow when you have an incident so we can see how you run it so that we can do it better? Um, but I've always had it start hard. I've always had it start with pushback on, uh, why do I have to do this? This is just extra work for me. Um, so you have that. You always, I've always had that journey or that hurdle to, to cross, but once you do, it is transformative. It is a multiplying effect in productivity and in knowledge sharing and in answering the questions, um, that leads to everything that Stephen said, uh, and helps the team feel more efficient while they're doing it.
Matthew Dodge (55:39):
Well, thank you for that. Um, well, we've answered some of the questions in the Q&A. I think this is a good place to wrap up for us. Thank you guys again for your time. Of course. Great conversation today. Um, for those of you watching, attendees, we will email, or registered attendees, excuse me, we'll email you the recording. Um, it'll also be available on blameless.com as of tomorrow. If you have any questions, we put this in the chat, but if you have any questions, you can reach us@helloblameless.com. And then also David and Stephen, um, if you wanna take this opportunity to let people know how they can stay in touch with you, whether that's LinkedIn, uh, company website, any upcoming talks you wanna plug?
David Levinger (56:26):
Uh, yeah, you can definitely find me on LinkedIn. Uh, David Levinger, I'm working at Machine. I, uh, feel free to reach out. I'm happy to, always happy to talk about this type of stuff with anybody that's interested in chatting about it. I find it, I, I, I mean, maybe it's weird to say it's a fun topic for me, but it is a fun topic for me, largely because I've gone through this process a few times, seen how transformative it is, and when you, when you're on the other side, you're just like, I'm never going back to that. Right? Like, this is, this is a better place to be and I'm happy to share that and help folks think about how they can solve those problems over time.
Stephen M. Dick (57:00):
Yeah, same for me as well. LinkedIn probably the best place to find me. Feel free to reach out and connect.
Matthew Dodge (57:08):
Um, real quick, sorry, throw one more question at you, but, uh, we did just get one through the chat and got maybe two minutes left. Um, how to pick the right project. Do you have any tips on that?
David Levinger (57:20):
I would say I'm ha Oh, go ahead Stephen. No,
Stephen M. Dick (57:24):
So there's, there's two strategies I use. Follow the pain and follow the chain. So follow the pain is just working out which teams are having the most pain. What's the pager volume, and what times of, uh, what times are people getting pages at? Is it 2:00 AM in the morning or 9:00 AM during the workday? Um, so follow the pain. Uh, look for where the pain is. And usually if you can solve for that pain, that greases the skids. But then you can also follow the chain, and that's breaking down your, uh, maybe it's your CI/CD pipeline if you're having change-related issues, or maybe it's some other bottleneck within your system. You can follow the chain of your systems to identify where you have bottlenecks. And oftentimes a bottleneck will have a team attached to that bottleneck, and coming in and solving for that bottleneck greases the skids as well.
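As a rough sketch of the "follow the pain" idea, the Python below ranks teams by pager volume, putting off-hours pages first. Everything here is assumed for illustration: the team names, the timestamps, the `pager_pain` function, and a 9:00 to 18:00 workday. A real version would read from whatever paging system a team actually uses.

```python
from collections import defaultdict
from datetime import datetime


def pager_pain(pages, workday_start=9, workday_end=18):
    """Rank teams by off-hours pages first, then by total page volume.

    `pages` is a list of (team, ISO 8601 timestamp) pairs. The workday
    window is an assumption and should be adjusted per team.
    """
    totals = defaultdict(lambda: {"total": 0, "off_hours": 0})
    for team, timestamp in pages:
        hour = datetime.fromisoformat(timestamp).hour
        totals[team]["total"] += 1
        # A page outside the workday window counts as off-hours pain.
        if not (workday_start <= hour < workday_end):
            totals[team]["off_hours"] += 1
    return sorted(
        totals.items(),
        key=lambda item: (item[1]["off_hours"], item[1]["total"]),
        reverse=True,
    )


# Hypothetical page log for illustration only.
pages = [
    ("payments", "2024-01-09T02:05:00"),
    ("payments", "2024-01-09T02:40:00"),
    ("search", "2024-01-09T09:15:00"),
    ("payments", "2024-01-09T10:00:00"),
    ("search", "2024-01-10T14:30:00"),
]

for team, stats in pager_pain(pages):
    print(team, stats)
```

The team at the top of the ranking is a candidate for the first project, since relief there is the most visible.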
David Levinger (58:16):
Yeah, I think, I think that's spot on. Uh, only thing I would add is where you have sway helps a lot. Yeah. Um, so if you, if you know that you, you have a really great rapport with a certain team, they're more likely to embrace the tool, really follow the process, really listen, they tend to be a good place to start. Uh, the other place I would say is, uh, I, uh, my team knows that no incident, uh, goes wasted, from a "what can we learn from this? How can we use this to drive positive improvement to the company or to the process?" perspective, uh, which is very similar to follow the pain. So I will be opportunistic as well. Um, so like for example, we've, we were doing it on the ops side, it wasn't rolled out to the rest of the company. All of a sudden an incident started, CEO's involved. It's a really big deal. And I'm like, Hey, why don't you jump into Blameless and let us help you run this incident. Like, you know, we'll have two ops folks jump in, we'll help you run it the whole nine yards. And then at the end of that they're like, man, that was so much better. So that opportunistic, which is pain for sure, but that opportunistic like, oh, there's something going on right now. We already have this in place. Let us guide you. Helps a lot.
Matthew Dodge (59:30):
Thank you again so much David and Stephen, especially for taking the time to answer that last question. Really appreciate your time today. For those of you watching listening, appreciate you, uh, spending the time with us as well. Um, and then we'll, we'll end it there. Thank you again, everyone. Thank you. Thanks all. Yeah. Take care.
David Levinger (59:49):
Take care.