In this episode, Alyson van Hardenberg, Engineering Manager at Honeycomb.io, and Varun Pal, Staff SRE at Procore, talk to Matt Davis and Jake Englund from the Blameless team. Find out how their engineering orgs pick an incident commander. What type of technical knowledge should this person have about an incident? And does it take a certain type of personality to excel as incident commander? Register to watch the full conversation!


Speakers

Matt Davis

Staff Infrastructure Engineer, Blameless
Matt is a Sr. Infrastructure Engineer at Blameless. His expertise brings to bear a variegated background including data-center operations, storage hardware and distributed databases, IT security, site reliability, support services, observability systems, and techops leadership. He has a passion for exploring the relationships between the artistic mind and operating distributed software architectures.

Jake Englund

Sr. Site Reliability Engineer, Blameless
Jake has an insatiable curiosity for learning about how complex systems work. Ever since his serendipitous introduction to SRE, Jake has been fascinated by the unique challenges and innovative solutions which come with scaling web services by orders of magnitude. In his spare time, he enjoys video and tabletop games, dancing, and cooking.

Alyson van Hardenberg

Engineering Manager, Honeycomb.io
Engineering manager at Honeycomb.io. Mother, gardener, baker, potter, 🇨🇦, she/her. Find me on Twitter: @akvanh.

Varun Pal

Staff Site Reliability Engineer, Procore
Varun has over 16 years of experience with designing and engineering solutions. Currently he sits on the Cloud Platform Engineering Team at Procore Technologies where he's scaling systems using Kubernetes on AWS. His expertise lies in working with large engineering teams and designing scalable solutions, especially on a global scale.

Video Transcript

Matt Davis (00:00):

Hi, everyone. Welcome to From Theory to Practice. This is our second episode and this time, we're asking the question, "What's difficult about incident command?" My name is Matt Davis, I am a staff infrastructure engineer at Blameless and I do a lot of instrumentation and observability systems here at Blameless and I host this mini webinar and, in general, I ask a lot of questions. I'm going to ask a question of Jake, could you please introduce yourself?

Jake Englund (00:35):

Hi, I'm Jake Englund. I'm a senior site reliability engineer on the infrastructure team at Blameless. And I got introduced to SRE when I applied to Google as a software engineer back in 2010, and they were like, "Have you heard of SRE?" And that's what's happened since then, so my world has kind of revolved around how to apply software engineering solutions to infrastructure and operations, especially just to make life a little easier for all of us on the sharp end and the applied part of taking care of our systems and making them reliable. Today, with us, we have Alyson from Honeycomb, if you'd like to introduce yourself.

Alyson van Hardenberg (01:15):

Yeah. Hi, I'm Alyson van Hardenberg at Honeycomb. I'm an engineering manager, and we focus mostly on front-end engineering on my team. I've been at Honeycomb for four years, and over that time, I have also been on call for our production systems, and we use Honeycomb to help debug those systems. We use it as one of our tools during incidents.

Jake Englund (01:39):

Fantastic. And we also have with us Varun from Procore.

(01:46):

Welcome to then the challenge of hosting us. Like, "Okay, does this turn into an incident where we're now trying to debug this actively?" I'm reminded of the Slack... There was that writeup that just got shared around and I'm trying to remember which channel it got shared in. But about how like, when you're doing incident response in your Slack and you work at Slack and you're having an incident, that that could be a little bit of a challenge, adds an extra level to it.

Alyson van Hardenberg (02:12):

We have the same problem at Honeycomb where it's like, "No, Honeycomb!"

Jake Englund (02:16):

And I guess same thing here at Blameless too.

Matt Davis (02:18):

We did.

Varun Kumar Pal (02:23):

I think I'm good. I don't know what's happening. This only happens when I'm on Zoom with you guys. This doesn't happen often, so I'm not sure what's going on here. I introduced myself twice till now. I don't know if that got recorded or I need to go third time is a charm?

Matt Davis (02:40):

Why don't you go ahead and just introduce yourself a third time so we get a good introduction.

Varun Kumar Pal (02:45):

All right. I'm Varun Kumar Pal, I'm a staff site reliability engineer with the cloud systems engineering team here at Procore. I've been with Procore for over two and a half years and mostly I lead the Kubernetes and infrastructure team. With the incident management, we have been using Blameless for two and a half years, so I kind of lead the guild for incident commanders as well.

Matt Davis (03:08):

Oh, awesome. So I immediately have a question. You have a guild for incident commanders?

Varun Kumar Pal (03:17):

We have been trying to get on a guild, because when Blameless was... When we adopted Blameless we were a significantly smaller company of over 100 engineers. Now we have grown exponentially to around 650 engineers and previously we were like each team, when an incident happens, one person becomes the incident commander, while the other becomes a communication lead and you go about the incident. But now it's just not possible. So we are trying to put together, we are endeavoring to put a guild together so that like-minded people come together and we see this place through.

(04:01):

So one of the challenges that we have seen is when we have different incident commanders, you have different experiences within an incident. That becomes a huge problem, because all our stakeholders are interested in metrics and reports and everything and now everything is skewed because every incident commander has a different way of going about incidents. Now we wanted to get those experiences together, we wanted to consolidate that to a few group of people who can provide a very same or a similar kind of experience for all the other engineers. And that's what led to the foundation of a guild of incident commanders.

Matt Davis (04:46):

That is fascinating. And is this guild something that... I mean it sounded like it's very bottom up, like, "This is something we see that we need and so we're going to do it." Is the company officially sponsoring this guild? Is it extracurricular? Do you have meetings and stuff?

Varun Kumar Pal (05:05):

The company has not yet bought into it. And they are, but they're slowly seeing the reason why this is necessary for a team. Because when things are asked in a retrospective of, why these things are recurring, "Why do we have recurring incidents?" That's when the answer comes about like, "Hey, this is what it is." And then when people see different experiences during an incident, the obvious question comes like, "That incident was different. Why is this different? Why is the experience so different? Why is it so strenuous to go around this incident?"

(05:43):

That's where these experiences come. So this has been founded by a couple of principal engineers and couple of engineering managers apart [inaudible 00:05:50], and we were like, "Let's try to make our own lives easier by coming together, having a session once a week, exchanging ideas, what needs to be done and let's make our own lives easier, because this is not going good."

Matt Davis (06:07):

That is really interesting. I had no idea that Procore did something like that. Alyson, does Honeycomb have any kind of a thing like this? I know Honeycomb is a much smaller company than Procore.

Alyson van Hardenberg (06:21):

Yeah, it's much smaller. Our whole company is smaller than your engineering team, much smaller than your engineering team. But at Honeycomb we expect all engineers to be on-call, and in those on-call rotations, anybody could be the incident commander. We have roles in incidents. So it might be whoever hops in first and acknowledges an alert and they say, "Okay, I'm going to be the incident commander." They pull up the roles in incidents document, "Okay, these are my roles, these are the things I have to do." And we have that document set up such that anybody who is on call can be the incident commander, anybody could be the communications manager and so on.

Matt Davis (07:03):

And when you say anybody, how do you decide? I heard you say maybe it's the person who's the first responder, but is it always the first responder?

Alyson van Hardenberg (07:11):

It is not always the first responder, because we find it to be really important that whoever is the incident commander is not the person who is doing the deep dive into what is going on. It's more of the traffic controller knowing who is doing the investigation, when they're going to report back, what needs reporting out to support to the communications manager, who's taking notes and recording the process, all of those kinds of things.

(07:34):

So the first person to acknowledge the alert might be like, "Oh, this is my area of expertise. I'm the best person to do the deep dive into the problem. So-and-so, my on-call buddy, will you be the incident commander?" And then the roles balance that way.

Matt Davis (07:52):

What is the on-call buddy?

Alyson van Hardenberg (07:55):

Oh, on our on-call rotations, we always have two, sometimes three engineers on-call. We'll have a product on-call rotation, a platform rotation, and sometimes a telemetry rotation. And we all help each other out. We're never alone on call.

Matt Davis (08:09):

I love that idea of the telemetry on-call. I've thought about that exact concept before, because as someone here at Blameless who architects and runs a lot of the telemetry infrastructure, I'm often just asked to do something in an incident. It happens all the time. "Hey Matt, this log isn't showing up. Can you help us find this log?" "Hey Matt, can you help us locate where this metric is?" Those kinds of questions, I can get there immediately because I'm living in the system and I'm like, "Oh yeah, it's right here." And I can answer those questions immediately, but immediate answers are not always possible.

(09:05):

In fact, I want to ask both of you or anybody actually, who has thought about this, the difficulty in filling out the roles. And I'll tell you what I mean. When we get called upon to be an incident commander, it often falls in our lap. I feel that way. And often when that happens, I don't know what to do, or I'm so busy mitigating that I literally don't have the cognitive ability to make sure that communications happen or to pull in the right person. How do I even know who the right person is to pull in? How do I know where to find that expertise? Alyson, you've used the term and coming from an observability company, you talked about observability in the context of incidents. Can you explain what you mean about that? Is it kind of like what I'm talking about?

Alyson van Hardenberg (10:19):

When I think of observability, I think of it as a shared language between engineers and your systems. So systems are speaking their own language of logs and events and metrics. We don't necessarily speak that same language. As a product engineer, someone who's working in the front end a lot, I don't know what the instrumentation is that's coming out of our Kubernetes cluster. But I can use my observability tooling, in this case Honeycomb, to ask those questions. I can say, "Show me a count where there's an error." And some errors would come up in that particular dataset and I can dig in from there. That's observability for me.

Matt Davis (10:56):

Okay. Do you think that there is a key bit about observability that we talk about on the human side of things?

Alyson van Hardenberg (11:13):

Absolutely. There's like a culture around observability of asking questions. Asking questions of your data, asking questions of your system, and the ability for anybody to access the tools. I mean anybody, support, sales, marketing, product, to ask questions of the system unlocks the power to ask questions of each other as well. So as a product engineer on call, I feel a lot more comfortable being like, "Hey, I ran this query over our data, I found these things, but it's not clear to me what they mean yet." I know that the platform team has more expertise in this area. "Hey, platform team, hey, platform on-call buddy, can you help me make sense of this information?" It helps me present my question from an informed place, even though I might not know anything about Kubernetes or Kafka for example.

Matt Davis (12:03):

Yeah. How would you know to go to the platform team? Is this just something that's innate knowledge or is there a spreadsheet to go check and, "Oh yeah, this is for this team." Or is this a culture thing or tribal knowledge?

Alyson van Hardenberg (12:21):

Yeah, it's a little bit of all those things. One thing we do at Honeycomb, whenever an engineer joins the organization, we do an architectural overview. And it's always hosted by the second most recent engineer to join the team backed up by the teammates. So that helps you have an idea, "Okay, this is the architecture, these are the systems." As a new person to the engineering team, you fumble through the architecture presentation supported by the rest of the team, and everybody goes through this process of trying to understand what pieces are there in place. You don't have to have a deep understanding, but just a surface-level knowledge of what's there, where to go and look in the documentation for issues in that area.

Matt Davis (13:06):

Right. Is this the kind of training that people will retake when things change, or do only new people get the advantage of learning?

Alyson van Hardenberg (13:15):

Yeah, it's one of those meetings that's really fun to go to because there's always a like, "Ooh, which one is not going to be covered? That's my area of specialties." And there's always the peanut gallery in the background chipping in fun facts about the history of Honeycomb and then the story of how this new piece got added in. It's like the lore of the system.

Matt Davis (13:39):

Varun, how about at Procore? Do y'all do something like this to prepare people for knowing where to look for expertise?

Varun Kumar Pal (13:47):

We have Incident Management 101, but that's exactly what it is. It is just the tip of the iceberg. So until very recently it was all tribal knowledge. People knew whom to go to when things happen, and then slowly it started being put into spreadsheets. And then we started having multiple spreadsheets and now people have started updating multiple spreadsheets. The last survey we did within our technology organization, we found out that none of the engineers, or most of the engineers actually did not want to be an incident commander. Because it's just pain. It's just a pain to be the incident commander, because you are now stuck with an incident for probably, which you have no idea about, and you don't know what to do about, you don't know who to approach. They would rather be the SME and be the hands-on keyboard rather than be a commander and do all the responsibilities of over there.

(14:50):

We have found a way to get around this by introducing a new software called Backstage, which is our service catalog. We document all our systems over there, all our owners, Slack teams, Opsgenie team, and put tags over there so that when someone is on call, they can just search that system, like put in API and know which all teams are there and just page them out via Opsgenie and get their attention.

(15:20):

It's been difficult doing all these things because we are around 650 engineers spread across 8,200 teams and getting everyone at the same knowledge level is a huge challenge. So it requires a lot of training, lot of Confluence documentation, lot of meetings that are required, office hours, lot of those training, but we are slowly and surely getting there.
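The catalog lookup Varun describes (search a term such as "API", find the owning teams, then page them) can be sketched in a few lines. This is only an illustration under stated assumptions: the records, field names, and the `teams_to_page` helper are hypothetical, and the sketch uses an in-memory list rather than the real Backstage or Opsgenie APIs.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass(frozen=True)
class CatalogEntry:
    """One service record; every field name here is hypothetical."""
    service: str
    owner_team: str
    slack_channel: str
    tags: FrozenSet[str] = field(default_factory=frozenset)


# Made-up records standing in for a real service catalog.
CATALOG = [
    CatalogEntry("public-api", "api-platform", "#team-api", frozenset({"api", "http"})),
    CatalogEntry("billing-api", "payments", "#team-payments", frozenset({"api", "billing"})),
    CatalogEntry("report-worker", "analytics", "#team-analytics", frozenset({"batch"})),
]


def teams_to_page(query: str, catalog=CATALOG) -> List[str]:
    """Return the owning teams whose service name or tags match the query."""
    needle = query.lower()
    matches = {
        entry.owner_team
        for entry in catalog
        if needle in entry.service.lower() or needle in entry.tags
    }
    # Sorted so the on-call engineer sees a stable, de-duplicated list.
    return sorted(matches)
```

In a real setup the returned team names would be handed to the paging tool; the point of the sketch is only that tagging services with owners turns "who do I call?" into a search rather than tribal knowledge.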

Matt Davis (15:48):

That's an interesting statistic, and you said the word SME, the SME, subject-matter expert is what Varun was talking about there. And I've run into this too where engineers, they really just want to engineer. That's what they do, they engineer, "And that's what I'm going to do." But the pain part of this is what really interests me about the incident commander role. Jake, we've had some similar pain around incident commander. Is there some recent pain from a recent incident where you held the incident commander role that you would've called it painful?

Jake Englund (16:39):

Yes. I'm thinking specifically just about even when I... So this is one where I came in as both the incident commander and I was still actually being the kind of operational lead at the same time, the person who was digging into things, especially because it was just early on and how things were developing. And it was that one where we... We have a file storage microservice called Cabinet and it was so serendipitous too, because I'm delighted hearing about people talk about how you prepare people to be on call, because it's something that we have multiple meetings that we're talking about throughout the week at Blameless. We have Observability [inaudible 00:17:19] just to talk about observability and how to see and understand what's going on in our systems.

(17:24):

We have Practice to Practice Gamelan, which both of these meetings I'm already mentioning are things that Matt does, and this intrinsic value of understanding that with Practice to Practice is that we're talking about the actual work as done and that it's people from all over the company being able to come together and understand what everybody is doing and what everyone else is good at and how to find different things.

(17:44):

And one of the things that came out of one of our Practice to Practice sessions was digging into the Cabinet microservice. And so it was really exciting when we noticed that there was an issue with the Cabinet microservice, that because we had just talked about this in one of our Practice to Practice sessions, that I was able to dig in and start figuring out what was going on. But the cognitive overhead of just going like, "I know what's going on, but in the same amount of time that I'm needing to... I'm digging into trying to mitigate this." Or just even to understand it the first part, "Are we losing customer data?" Because that's always my first and most immediate response is that like, "If we're losing information that I got to be light speed in that direction." Then which leaves very little clock cycles wise to be able to even just communicate to other people what's going on.

(18:29):

So being able to hit the brakes a little bit to be able to communicate out, and/or to be able to tag someone else in. And I think that something else that complicated things a little bit was that it was after 5:00 PM for the East Coast, so there was already that little bit of extra inertia of, "Hey, I need somebody else here." And especially because I think my secondary was on the East Coast as well at that point, just to load balance here a little bit and just to be able to say, "Hey, it's this thing we talked about and so I'll be able to shorthand to you and have you be able to communicate more."

(19:00):

But then fighting the fact is that I should have been more in the IC role just from where logistically things were. But I knew that I was the person that was like, "Well, at least I can do this. I know there's a couple other experts here, but I think I can dig right into this one." And I was almost really excited to because it was this new piece of knowledge that I had. But just almost that kind of emotional struggle there, which is a little different from... As we're talking about, I've said multiple times and as Varun had brought up as I think a lot of engineers feel like, I would rather be on call, and I'd rather be on call 24/7 for something that I am the subject-matter expert on than to be on call for five minutes for something that is going to point out everything I do not know about a system.

(19:40):

And so that's something that's always kind of an anxiety when I'm on call for something as well, so having buddies at least is another thing that I think is a great thing to alleviate that kind of pressure. Because at least if all three of us don't know about something, that at least it's clearly not something that is spread knowledge there.

(19:59):

But I do think there's this implicit assumption that all engineers are fungible in filling an on-call role or in filling an incident commander role, and that's clearly not true. But then to the extent that you want that to be possible, there has to be some amount of training, there has to be some amount of common grounding. And that I find so many organizations are resistant to providing that even as everyone here is going, "Hey, can we get on the same page about things?" It's always exciting to hear about how people are getting that common grounding experience just because you do have to make that investment.

Matt Davis (20:31):

I'm thinking about, I love the on-call buddy concept by the way, that's just beautiful. I love that. And we've actually talked about this too. I think this is maybe becoming more common to have a product on call, and like I said, to have an instrumentation, or my pet name for it is a seer on-call rotation. And then you have the first responder. We're trying to reboot our on-call incident response program here at Blameless, and one of the big questions that's coming up is, "Who gets to get incident commander? Who is that person?"

(21:24):

And does it make sense for the first responder to be it? I mean Alyson, what do you think? Should the incident commander be a role that floats around, that's kind of amorphous? Or does it make more sense for there to be... It's like people arguing whether or not we should hire a scrum master, should we hire an incident commander? That's their job, hire a scrum master to do scrum master stuff, hire an incident commander to do incident commander stuff, or is that too much? Should everyone be doing it, I guess, is the other side of that question. What do you think, Alyson?

Alyson van Hardenberg (22:09):

Yeah. One thing I do think for sure is that the SME, the subject-matter expert should not be the incident commander in any given incident. At Honeycomb with our smaller company, we don't have enough incidents to have a full-time role for an incident commander. I do think that having shared knowledge of how to be an incident commander is important, and that does require some training. So at Honeycomb with our on-call buddies, I think it's usually pretty clear who the incident commander is, and who the SME is going to be in any given incident.

(22:49):

If the UI is broken, the platform engineer is going to be the incident commander and the product engineer is going to be the subject-matter expert. If our database is down, then the platform engineer is going to be the SME, and the product engineer is going to be the IC. And because those people have been trained to be on call and know what those roles look like, they know where to jump in.
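The rule Alyson lays out (whoever owns the affected area becomes the SME, and their on-call buddy takes incident command) can be written down as a small function. This is a minimal sketch, not Honeycomb's tooling: the `assign_roles` helper, the rotation names, and the engineer names are assumptions made for illustration.

```python
from typing import Dict


def assign_roles(affected_area: str, on_call: Dict[str, str]) -> Dict[str, str]:
    """Split SME and IC between on-call buddies for one incident.

    on_call maps a rotation name (hypothetical: "product", "platform")
    to the engineer currently on call for that rotation.
    """
    if affected_area not in on_call:
        raise ValueError(f"no rotation covers area: {affected_area!r}")

    # The engineer whose rotation owns the broken area does the deep dive.
    sme = on_call[affected_area]

    # Someone from a different rotation takes command, so the IC is never
    # the person buried in the investigation.
    buddies = [name for rotation, name in on_call.items() if rotation != affected_area]
    if not buddies:
        raise ValueError("no on-call buddy available to act as incident commander")

    return {"sme": sme, "incident_commander": buddies[0]}
```

With a hypothetical rotation of `{"product": "pat", "platform": "sam"}`, a broken UI (a product problem) makes pat the SME and sam the IC, and a database outage (a platform problem) flips the two.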

Matt Davis (23:16):

That's an excellent point. Being trained for incident command is the biggest part of what I just heard there. I don't think that this is done enough. I don't think that this is done nearly as extensively as it could be. Varun, is this what your incident command guild does in a sense?

Varun Kumar Pal (23:40):

Exactly. I think when we are a huge company, when different people become incident commander, we get different experiences. And we have seen that it directly impacts people's experience during an incident. And the bottom-line metrics like MTTR, right? We would like incidents right now because we all are remote, and mostly it's Slack channel through which incidents are resolved. Sometimes it just goes silent. We want incident commanders to be the active person over there. We want the incident commanders to summarize the incident. We want the incident commanders to gather the troops, put the status in, so that if they need help, they know where the status is. Is it the PR being constructed? Has it been committed? Is it being approved? Where is it? Is your CI going on? Is your CD going on? What's happening? So we have found that in spite of all the training, it is just a personality trait within the people.

(24:39):

And you can provide as much Confluence documentation, you can provide as much help that they need. But the incident commander needs some certain personality traits that we were not able to find in all engineers. And we wanted to be cognizant and mindful of that. So that led to the formation of the guild where we actually found people who were willing to do the job, who had the personality traits, who could drive the incidents towards that.

(25:13):

To slightly touch on the on-call philosophy, we have a slightly different philosophy here at Procore. We believe that when a person is on call, he's representative of the entire team. It is still the team's responsibility. It is not just the on-call person's responsibility. So if the on-call person needs help, whatever be that be, the team has to come help them regardless of where they are. So the on-call person is just a representative for us.

(25:43):

During an incident, if for whatever reason there is no incident commander, the on-call person then needs to become the commander as well as the SME. That's where he pages out the entire team that, "Hey, I need help." And that's when whoever is available, they jump in, they understand the pain that the person on call is going through, and they jump in and come in, wear the different hats that are required, and go through the whole process.

Matt Davis (26:14):

This touches on something that I deal with a lot and we just touched on a little bit. I guess to put it one way, it's the ability of the incident commander to know what's going on. I'll give you a specific example, because one thing that... I like to stay away from hypotheticals when I talk about these sharp-end topics. I was in an incident a few weeks back where I discovered the defect. I actually didn't even open the incident until the next day because the defect originally appeared in dev. So I didn't have any foresight or insight into that problem until the next day when it happened in production. I was the only one who noticed. I was the person who opened the incident. I was the person who immediately had to start mitigating the effects of the problem and I had to be the person to try to gather people into the incident.

(27:41):

So I just listed off, how many did I list off there, six roles, how is that supposed to happen? How is that supposed to work? I don't know. And it was a fairly high severity incident. The interesting thing about it though was that it wasn't customer facing, but it affected our entire observability infrastructure. So we were blind, in other words. We were not blind to all of our observability, but to metrics specifically for some of our clusters. We were blind. Kind of like blind to one perspective of the system. We have other ways of knowing if the system is failing, so it didn't feel so bad that I had to be spread so thin and wear so many hats because I was grappling with the same question that Jake brought up. "How are we actually being affected? How are our customers being affected?" So I could answer that question, but then I think, who else is watching me? Who else is watching me and knowing, "That engineer is spread too thin, we need to help."

(29:07):

How does that happen? So that's kind of the question I'm posing. I'll ask you first, Alyson, when you are on an incident commander role, do you notice that? Do you notice, "Hey, this person seems to be overstretching themselves." Or, "Hey, this person, I just noticed they've been working on this for four hours straight and they've never left the channel to go do something. Or if they did, they didn't let us know." How does an incident commander figure out that common ground has been lost? How does an IC do all of this? Alyson, do you have an answer? We would love an answer.

Alyson van Hardenberg (29:55):

I do not have an answer. I only have my own experience. One of my strategies when I am IC is a lot like when I did first-aid training. You know in first-aid training, when you come onto the scene and you're like, "Okay, I'm taking charge. I'm the incident commander of this emergency, you with the red shirt on, go call 911 and report back to me." And it's the same thing in incident command, I find. Like, "You subject-matter expert, go and dive into this thing, report back to me in half an hour, whether it's you found something, you're continuing down your research, whatever status you are at, report back to me."

(30:39):

And with those constant check-ins, I can sort of gauge the feeling in the room. I'll know if they've been at it for an hour and they haven't found anything, it's probably time to reach out to somebody else. And I can ask them, because of those repeated check-ins, "How are you doing? Should we ask so-and-so for help? Do you need a break?" Or, "Hey, I've been IC for a couple hours now, I need a break and I need to hand IC off to somebody."

Matt Davis (31:11):

I've gotten myself into the habit of doing that, of going, "You know what, I have not left this laptop for two hours. I'm taking a break." And I will tell the incident channel, "I am taking a screen break." I don't see enough people do that, to be honest with you. I really love the repeated check-ins idea.

Alyson van Hardenberg (31:35):

Follow-up question for you, Matt, do you hand IC to somebody else explicitly when you take those breaks?

Matt Davis (31:42):

No, I don't think I do. I don't think I do and that's a great question.

Alyson van Hardenberg (31:47):

Could you?

Matt Davis (31:49):

I should. I should be able to.

Jake Englund (31:52):

You should be able to. I would say I think with a lot of things is that the need to transfer enough state between there can often feel like a burden to where you feel like that's too much of an imposition. Because I feel like that's how I would feel sometimes being able to do that trade-off, unless there's somebody kind of actively in almost like a hot seat/standby role to be able to do that. I feel like that's sometimes half the reason why I hop into almost a comms lead role is that off...

(32:20):

Because I love, Alyson as you're pointing out, the first-aid training kind of thing is that if you notice that there isn't... A lack of order in something, that if you impose that order, people will often just latch onto that. But then also it really does come down to, "Okay, you literally are the person giving the directions imposing that order now. So if you don't, it doesn't happen."

(32:42):

And kind of in the same way it's like, "It seems like multiple people are handling comms lead." But if I try to step away because it seems like that's the case, I find every time, nope, I was actually the comms lead just because I stepped into that role and was doing that. But it doesn't always have to be explicit. Sometimes somebody can just step in and fill that role. But then it can help to be explicit about it, especially if you do want to be able to step away or transfer or things like that. But I do kind of look at a comms lead role almost as sometimes being able to be a standby at least, because if you're communicating about the entire incident, I feel like you at least have some context about what's going on still, even if you weren't the person like in lead. So that's immediately what my brain goes to. If I needed to step away, that would be the person I would ask.

Matt Davis (33:24):

That's a really good point.

Alyson van Hardenberg (33:25):

It's funny you say that, Jake. I was going to say I would suggest comms to step in, because usually they have enough context and that's usually my go-to as well.

Matt Davis (33:36):

Varun, how about at Procore? Is this something that you do? Do you trade off IC roles in the middle of an incident like this?

Varun Kumar Pal (33:43):

Oh yeah. It's very common, actually. And that's where we stress on the importance of the incident commander summarizing the incident at regular intervals so that anyone can just take it over and people can deserve breaks. Specific to my team, we have a practice where if anyone is onboarded to an incident, we add the entire team. Everyone knows that if you're not on call, you don't have to keep an eye. But it's just that if one person has been at it as a team lead and then we have a manager, we keep an eye out that if that person is getting burnt out, then someone else steps in. So that's the way we mitigate it, and we make sure that the on-call person is not left alone in troubled waters to navigate the situation. So that's one way we navigate this particular issue. The entire team joins into an incident, though they're not looking at a Slack channel of what's going on, but when needed they act or they call out for help and that's when we swap roles, we are like, "Okay, I can take over and do what's needed."

Matt Davis (34:50):

Do you think that people get afraid of doing that, of speaking up? I'm thinking of an example where... I don't know a system well enough and I might be afraid to speak up to get help. I may be suffering a lot of imposter syndrome, in other words. A lot of this, "If I don't try to figure out this problem and I have to escalate to somebody else, I am going to be looked at as not able to do my job." And that just feels artificial and weird. But I think that's what goes through our heads when we're in the middle of an incident and we have got this huge production pressure on top of us and we're like, "Well, everyone's expecting me to fix the problem because it's in my lap." And it absolutely doesn't need to be that way.

Varun Kumar Pal (36:01):

You're right. And everything that you said, it is part of being human. That is the first thing that happens to us. And that's why to my earlier point, when someone is on call, it doesn't mean he alone is responsible. He's a representative of a team that is responsible for something. And this comes by example. If you yourself show you're vulnerable to the team, the team takes that and empowers you. So that's how you do it. You do not expect for someone else to show the vulnerability and then you step in, you show that you are vulnerable and you ask for help. And then when people step in, that's when they realize that, "Hey, it just can go either way." That's how you start it. It just takes the very first person to start it and then it carries on from there.

Matt Davis (36:48):

That's great. I'm getting chills right now because that is the beautiful part about what we're talking about is the human part of it, and building that reciprocity is so critical to this role. Even if we just go beyond incident command and we just talk about being on call, feeling safe to be able to say, "I don't know what I'm doing. I need help." Or, "I've been working for four hours, I need to go eat. I understand things are completely down, but I am going to faint if I don't get food in my stomach." It's just things like that. Yeah, it's human. And I think that can be forgotten during incidents is that, yeah, there's machines here, but there's humans here too.

(37:42):

I want to start to wrap us up with another question, and it has to do with incident command and the multiple ways that we need to communicate and coordinate. And I'll tell you a specific example of where something happened. And actually this wasn't an incident, it was a maintenance, but it could have been an incident because it was maintenance on a production thing. We had the entire team gathered together to watch another engineer do the maintenance, and I think at one point we decided to get into a huddle, a Slack huddle. So you get into a Slack huddle, well then you lose everything that happens in the channel. And Slack huddles, they have a closed captioning kind of transcription, but it doesn't get... It's not capturable, you can't capture it and then put it into the channel.

(38:53):

Another example, you get into an incident and you have stuff going on in the Slack channel. It doesn't have to be Slack. It could be IRC, it could be MS Teams, it could be anything. But someone decides we're going to get it on a video call. So they get on a video bridge, and then you've got three or four people on the bridge, you've got 15 or 20 people on the channel, and then you also have all of these different threads that are going along in the channel, and then you have people actually calling other people. Oh, and then there's the direct message because the managers need to talk about what's... And the list goes on.

(39:34):

Is this a challenge that either of you have faced? How do we cope with this as incident commanders? How do we figure out, how do we even know that those DMs are happening? How do we even know what's being talked about, if I have to sit here on the Zoom and try to transcribe what's going on in the Zoom, how do I keep up with what's going on in the Slack channel? I don't know Varun, do you know what I'm talking about?

Varun Kumar Pal (40:07):

This is quite common in organizations and that's why the whole role, yeah, you need to have the right mindset to go about it. Just not everybody can become an incident commander, because you need to know when to shut off the noise and how to channel the energy in the right place. You have conversations going on incident Slack channels that are completely tangent to what's going on. Then you need to get people back into the detail, "Hey, can we focus on what we were talking about? We are talking about approving the PR. Why are we worried about some other thing that is absolutely not related?" So it means you need to be in control of the room. You are the moderator of the call per se. And things like you called out either swimlanes in Blameless or Zoom calls. Those are very common in the organization, and that's where you hold people accountable.

(41:01):

Even if you are not part of the Zoom call, you add people and say that, "Hey, can one of you just act like a scribe. Let us know, summarize what happened over there." So this has to be a really proactive approach. Because engineers are not used to it. Ask them to scribe a Zoom call. Oh, they will run away to the corners of the world. They just won't cut it. So this is a very common problem, unfortunately, now that we all are remote, we are not on premise in the same room. If we were all in the same room, we would've been in the same boardroom, we would've been talking to each other during an incident, getting it resolved. We are not. We're in a new world where we are all remote.

(41:43):

So the incident commander has to be very proactive and deliberate about this. When there are separate breakout sessions, assign someone responsibility that you come back, scribe and summarize what's going on. Consolidate people and let people know that, "Hey, as per this DM, this is going on." You need to really channel that energy. And that's why to Alyson's initial point, the incident commander cannot be the subject-matter expert because this is a full-time responsibility during an incident.

Matt Davis (42:16):

Alyson, have you encountered this particular type of issue? Especially, I don't know how distributed of a team Honeycomb is, but at Blameless we're completely distributed, so we run into this problem all the time, period. How about you?

Alyson van Hardenberg (42:35):

Yes, Honeycomb is fully distributed and has been since before COVID. And so we run into this all the time where there's too much chatter in Slack. Somebody starts a Zoom call... This is my child. Just a minute, buddy. Someone starts a Zoom call and then what? All of that context is lost into Zoom and nobody was summarizing it back. So we actually found that we would get our comms person to join the Zooms and be the recorder of those notes, and the incident commander, using a word that Varun used, the moderator of those Zoom calls, would organize it and make sure those summary notes were getting posted, not even at the end of the Zoom call, which can sometimes go on for hours, but at a regular cadence. "These are the decisions that are being made. These are the threads that we are following. This is where we're at." Reporting it back with a regular cadence to Slack, and just making that the norm of summarizing along the way.

Matt Davis (43:41):

I think that's so excellent. It's hearing such great ideas and approaches to these problems. I want to thank everyone for joining. Alyson, especially your son. I understand you're sick, little guy. It's okay. You've got an expert incident commander there in the room. You'll be good. Thank you again to our panelists, participants. Really enjoyed this conversation. I learned so much just from hearing about the guild and the buddy system. I'd love the on-call buddies, that's great. Thank you so much. We'll see you next time.

Jake Englund (44:27):

Thank you.

Matt Davis (44:28):

All right, thanks everyone. Appreciate your time today. I really love this conversation. This is just the height of my week. Thank you so much.

Varun Kumar Pal (44:38):

Thanks, Matt.

Alyson van Hardenberg (44:38):

It was really great chatting with you all. Hopefully you can still use that end clip. It's okay if he's in the video. It's fine. He joins all my meetings.

Matt Davis (44:46):

That's actually great to know. Let our coordinators know that it's okay to leave him in.

Alyson van Hardenberg (44:52):

It's okay to leave the bit with Nathan in it.

Jake Englund (44:54):

Awesome. Well, I very much appreciate and thank you so much for being able to join today, that I tremendously enjoyed the time here. I feel like I could talk for another hour still.

Matt Davis (45:03):

I know, me too. I was like, "Has it been 45 minutes? I think it has." All right. I hope I'll get to talk to all of you all again sometime soon.

Varun Kumar Pal (45:14):

Thanks guys.

Alyson van Hardenberg (45:15):

It was lovely to meet you all.

Matt Davis (45:16):

Nice [inaudible 00:45:17] to meet you. Bye-bye.

Varun Kumar Pal (45:17):

Yeah.