Description

Our first episode covers a juicy topic to kick us off — What's difficult about on-call? We invited Yvonne Lam, Staff Software Engineer at Kong, and Charles Cary, CTO at Shoreline.io, to chat with Kurt Andersen and Matt Davis from the Blameless team. Watch the full conversation, honest and unscripted, where the four discuss their personal experiences and learnings about on-call duty.

Speakers

Matt Davis

Staff Infrastructure Engineer, Blameless
Matt is a Staff Infrastructure Engineer at Blameless. He brings to bear a variegated background spanning data-center operations, storage hardware and distributed databases, IT security, site reliability, support services, observability systems, and TechOps leadership. He has a passion for exploring the relationships between the artistic mind and operating distributed software architectures.

Kurt Andersen

Strategy, Blameless
Kurt Andersen is a practitioner and an active thought leader in the SRE community. He speaks at major DevOps and SRE conferences and publishes his work through O'Reilly in quintessential SRE books such as Seeking SRE, What Is SRE?, and 97 Things Every SRE Should Know. Before Blameless, Kurt was a Sr. Staff SRE at LinkedIn, implementing SLOs (reliability metrics) at scale. Kurt is a member of the USENIX Board of Directors and part of the steering committee for the worldwide SREcon conferences.

Yvonne Lam

Staff Software Engineer, Kong
I play with books, cats, food, yarn, and dirt, not all at the same time. Staff Software Engineer on the Engineering Enablement team at Kong Software. @yvonnezlam on Twitter.

Charles Cary

CTO, Shoreline.io
Charles Cary is Shoreline's CTO. Shoreline helps SREs rapidly automate away toil so they can stop fielding pages. He was formerly at AWS, where he did operations for DynamoDB. He loves debugging distributed systems and figuring out how to make on-call less painful!

Video Description

Join Blameless engineers and industry experts as they work through the challenges of SRE and incident management.

Video Transcript

1

00:00:04.049 --> 00:00:04.440

Alright.

2

00:00:07.200 --> 00:00:20.370

Matt Davis: Hi there, welcome to the inaugural version of SRE: From Theory to Practice. My name is Matt Davis, and I'm a Staff Infrastructure Engineer at Blameless.

3

00:00:20.850 --> 00:00:32.520

Matt Davis: And I work on our infrastructure and SRE team, and I do a lot of operational and architectural types of things over on the operations and systems side.

4

00:00:33.660 --> 00:00:39.030

Matt Davis: Why don't we go ahead and introduce the rest of our panel today. Yvonne, please go ahead.

5

00:00:39.840 --> 00:00:59.730

Yvonne Lam (she/her): Hey, um, I am Yvonne Lam. I am a Staff Software Engineer at Kong, mostly working on CI and CD things and developer tools. I have done all kinds of deployment engineering and SRE things and ops things, and yeah, that's basically me.

6

00:01:04.440 --> 00:01:05.610

Matt Davis: Charles, go ahead, please.

7

00:01:06.390 --> 00:01:10.170

Charles Cary: Hi, I'm Charles Cary. I'm Shoreline's CTO.

8

00:01:12.210 --> 00:01:20.310

Charles Cary: I deal with a lot of incidents, anything that kind of gets escalated, lots of triage routing, you know, in addition to just kind of leading technical development over here.

9

00:01:20.790 --> 00:01:31.230

Charles Cary: And yeah, I've been doing ops in various forms for a while now. I was at AWS working on some of those services as well, and yeah, that's what I'm all about.

10

00:01:33.690 --> 00:01:48.240

Kurt Andersen: And I'm Kurt Andersen. I'm an SRE architect here at Blameless also, and previously was with LinkedIn, and before that HP Managed Services, so I've been dealing with operational issues for a long time as well.

11

00:01:49.920 --> 00:01:58.770

Matt Davis: I'm glad we have such well-rounded operational expertise here, because today we're going to talk about on-call.

12

00:01:59.370 --> 00:02:15.900

Matt Davis: And we're going to talk about basically the overriding question of what is difficult about on-call. You know, there's all kinds of ways that we can talk about this, and I think I'm going to go ahead and start the discussion myself, excuse me.

13

00:02:17.160 --> 00:02:19.890

Matt Davis: Because I happen to be on call this week.

14

00:02:21.060 --> 00:02:23.190

Matt Davis: I'm not on call at the moment.

15

00:02:23.490 --> 00:02:37.530

Matt Davis: And I actually got someone to cover for me. A new employee who hasn't actually been in the rotation yet volunteered to cover my shift for the day, so I could have a good time, you know, get some good sleep,

16

00:02:38.580 --> 00:02:40.710

Matt Davis: and then be able to do this recording today.

17

00:02:42.420 --> 00:03:00.840

Matt Davis: Now, one of the things that I like to think about when I'm talking about on-call with other people is the very difficult decisions that we make when we're in the middle of doing our work. And one of the things that I found very difficult,

18

00:03:01.920 --> 00:03:04.500

Matt Davis: in fact, in an incident that happened yesterday,

19

00:03:04.920 --> 00:03:16.590

Matt Davis: was, as the on-call, um, you know, I was pretty much called upon, also with my expertise, to mitigate the problem.

20

00:03:18.060 --> 00:03:20.220

Matt Davis: And I was having some difficulty.

21

00:03:21.360 --> 00:03:24.240

Matt Davis: reaching out. I had some difficulty

22

00:03:26.460 --> 00:03:29.100

Matt Davis: Being a kind of incident commander, if you will.

23

00:03:30.480 --> 00:03:39.960

Matt Davis: You know, this was a SEV2 incident, and I won't go into our severity levels right now, but it wasn't a low incident and it wasn't a high incident. So

24

00:03:40.590 --> 00:03:52.980

Matt Davis: something was down, and I, as the engineer, needed to get it fixed. As the person with expertise, I was also the next person who I would probably escalate to to get this fixed.

25

00:03:53.400 --> 00:04:02.280

Matt Davis: So I wasn't able to do things like, oh, should we contact the customer because of this, you know, this blip that might happen,

26

00:04:02.640 --> 00:04:13.620

Matt Davis: or, hey, what other communication is going out? So that's one of the things that I was finding really difficult to do while I was actually trying to handle the incident.

27

00:04:15.720 --> 00:04:24.720

Matt Davis: Yvonne, I'm going to call on you, if I may. Let me ask you this first: when's the last time you were on call?

28

00:04:25.620 --> 00:04:33.600

Yvonne Lam (she/her): Um, probably a year or so ago, I mean, and that was on call for internal systems.

29

00:04:33.960 --> 00:04:34.350

Okay.

30

00:04:36.840 --> 00:04:43.200

Matt Davis: Describe that. Why do you make that distinction between an internal system and an external system?

31

00:04:44.070 --> 00:04:52.380

Yvonne Lam (she/her): It's a little bit different, just in terms of the amount of monitoring support, in terms of

32

00:04:53.010 --> 00:04:58.620

Yvonne Lam (she/her): the business level of support. Um, so, for instance, when I used to work at Chef, we,

33

00:04:59.250 --> 00:05:08.910

Yvonne Lam (she/her): I worked for release engineering. We were probably the biggest AWS expense, because we ran a giant production build cluster

34

00:05:09.300 --> 00:05:14.040

Yvonne Lam (she/her): for, you know, something like, I don't know, round about 100 platforms. So, lots of stuff.

35

00:05:14.730 --> 00:05:23.280

Yvonne Lam (she/her): And it had tentacles in all kinds of places at the time that I was working on it. I mean, it's very different now. And

36

00:05:24.270 --> 00:05:36.240

Yvonne Lam (she/her): being on call for internal things is a little bit different, because a lot of times you just get people, like, you don't necessarily get an alert, right? You get people saying things like,

37

00:05:37.590 --> 00:05:40.320

Yvonne Lam (she/her): something seems funny, like, my build is slow.

38

00:05:43.350 --> 00:05:47.100

Matt Davis: Yeah, yeah, that's really interesting.

39

00:05:48.540 --> 00:05:59.250

Matt Davis: Because I think you're right, we tend to not think about our internal customers as customers. I mean, they are customers.

40

00:05:59.790 --> 00:06:14.400

Matt Davis: Um, now, is there something that you can recall from being on call back then, when you were doing the internal, you know, response? Do you find that more difficult than being on call for an external response?

41

00:06:15.120 --> 00:06:15.630

um.

42

00:06:16.980 --> 00:06:22.140

Yvonne Lam (she/her): I find it, I mean, like, it's different. I don't know that it's more difficult.

43

00:06:23.190 --> 00:06:24.660

Yvonne Lam (she/her): In some ways.

44

00:06:25.710 --> 00:06:35.430

Yvonne Lam (she/her): the pressure is less, because, you know, it's not necessarily a SEV0 for something that's directly customer-facing right this minute. But

45

00:06:36.300 --> 00:06:45.300

Yvonne Lam (she/her): nobody is fronting for you. Like, that's the thing that's hard, right? Like, you do not have a support team or

46

00:06:46.020 --> 00:06:57.510

Yvonne Lam (she/her): some kind of customer-facing team that is answering the phone or answering emails and saying, yes, we know it's down, we know it's down, we know it's not right. Like, you've got, you know, 100 people showing up one by one,

47

00:06:57.810 --> 00:06:57.990

or.

48

00:06:59.430 --> 00:07:05.340

Yvonne Lam (she/her): do you know that nobody can build anything? And it's like, I know, I know, thank you, yes, on it, on it, no.

49

00:07:06.600 --> 00:07:07.530

Matt Davis: They do.

50

00:07:09.060 --> 00:07:18.360

Matt Davis: This makes me think of another question that we're often faced with, and I think this is true of every operation: what is an incident?

51

00:07:18.660 --> 00:07:19.140

Yes.

52

00:07:20.700 --> 00:07:25.170

Matt Davis: Does the fact that it's internal or external mean it isn't or is an incident?

53

00:07:26.520 --> 00:07:30.450

Matt Davis: I don't know. My personal opinion is it doesn't really matter.

54

00:07:30.900 --> 00:07:31.680

Matt Davis: An incident

55

00:07:31.740 --> 00:07:34.530

Matt Davis: in a complex system is an incident, yes.

56

00:07:35.070 --> 00:07:38.040

Yvonne Lam (she/her): And I think that we had,

57

00:07:39.060 --> 00:07:52.800

Yvonne Lam (she/her): like, one of the reasons why we had a pretty serious on-call rotation for, um, the build systems, for the build platform, is that, well, you know, right about the time that I started, we got hit with Heartbleed.

58

00:07:53.220 --> 00:08:09.420

Yvonne Lam (she/her): Right, so our ability to rebuild our software to mitigate customer issues was at risk, because we first had to mitigate things, you know, because we had a bunch of stuff go wrong with our build platform too. So

59

00:08:10.230 --> 00:08:29.280

Yvonne Lam (she/her): that is, yeah, so, you know, we tend to think of on-call as something that mostly affects people in the SaaS world, but, you know, I think there's a movement towards having it be, if it affects your ability to get things to customers, then,

60

00:08:30.300 --> 00:08:36.120

Yvonne Lam (she/her): you know, there may be an on-call there, there may be some kind of an incident structure that you want to have.

61

00:08:36.210 --> 00:08:40.320

Matt Davis: Right. The QA system is down, cannot QA,

62

00:08:42.030 --> 00:08:48.300

Matt Davis: much less deploy. But I've hit that instance in the past; I've taken down QA in the past.

63

00:08:50.280 --> 00:09:00.300

Yvonne Lam (she/her): Well, and there are offline systems too, right? Like, so many places are doing sort of, like, data crunching, machine learning, all of those pipelines. It's like,

64

00:09:01.080 --> 00:09:05.790

Yvonne Lam (she/her): like, how big a deal is it if your machine learning pipeline goes down? I mean,

65

00:09:06.390 --> 00:09:20.070

Yvonne Lam (she/her): there's probably nothing bad that's going to happen right this minute, but over time somebody's service, you know, people's services will be impacted, like, you know, there's some kind of a data quality measure that you will not be able to meet.

66

00:09:21.060 --> 00:09:28.530

Kurt Andersen: Yeah, speaking from experience, when you have pipelines that take 22 hours to run, and if they fail at hour

67

00:09:30.210 --> 00:09:37.470

Kurt Andersen: you're a day behind by the time you even just restart the pipeline, and it can have really serious impacts on

68

00:09:39.210 --> 00:09:40.830

Kurt Andersen: user experience, I'll say,

69

00:09:41.970 --> 00:09:42.480

Kurt Andersen: in general.

70

00:09:43.530 --> 00:09:48.480

Yvonne Lam (she/her): People like seeing their updates right away. Like, that makes them feel like something has happened, like it's important.

71

00:09:49.050 --> 00:09:53.340

Kurt Andersen: Right, or seeing an appropriate list of people that you might want to connect with.

72

00:09:54.960 --> 00:09:57.030

Matt Davis: Or being served the correct ad.

73

00:09:58.530 --> 00:10:15.510

Matt Davis: Yes, I've had this same exact experience like you're saying, Kurt, where, well, this feed takes 24 hours to process, we actually won't know until tomorrow if something went wrong, and then, you know,

74

00:10:16.530 --> 00:10:25.530

Matt Davis: it's difficult to know, well, should we stop the feed because we know this is broken, or do we wait until we see the results to understand how broken it got?

75

00:10:26.100 --> 00:10:30.270

Kurt Andersen: And these are very interesting and difficult

76

00:10:31.290 --> 00:10:52.830

Kurt Andersen: problems that are being grappled with in the MLOps community, which is essentially looking at how do you apply general reliability principles in the context of machine learning and training scenarios, which can take significant amounts of time and hardware to process, yeah.

77

00:10:53.070 --> 00:11:07.860

Matt Davis: Yeah. Hey, Charles, how about you? When we think about things that are difficult for an on-call engineer to do, to make difficult decisions, what crosses your mind?

78

00:11:09.750 --> 00:11:17.550

Charles Cary: I think one of the most challenging things is assessing customer impact, and the reason I bring that up is.

79

00:11:19.170 --> 00:11:30.210

Charles Cary: Often, you know, you're on call, right, and what you're thinking about is, you know, is the system functioning or not, is there something I should be attending to right now. Or often, if it's a quiet shift, like, people,

80

00:11:30.450 --> 00:11:33.570

Charles Cary: to be honest, they start doing other stuff, unfortunately, which can happen.

81

00:11:34.050 --> 00:11:34.740

Yvonne Lam (she/her): And then.

82

00:11:35.100 --> 00:11:47.280

Charles Cary: What will happen is, like, you jump into the moment of the issue, and I think folks often, you're now in a debugging mindset, when actually the first thing that needs to be sussed out is, like, well,

83

00:11:47.850 --> 00:11:53.460

Charles Cary: Irrespective of what the ticketing system has declared in terms of the severity of the issue like.

84

00:11:54.300 --> 00:12:03.420

Charles Cary: is this a global outage? Does this impact all the customers, does it impact only a subset, is it one customer, right? And then really, I think the next part of that is,

85

00:12:04.170 --> 00:12:08.700

Charles Cary: you know, beyond just how many customers, for those that it does impact, what is the impact?

86

00:12:09.030 --> 00:12:14.190

Charles Cary: Is it just a degradation of the service, is it a total outage, is it like a data loss event, right?

87

00:12:14.460 --> 00:12:19.890

Charles Cary: And the reason I think it matters is, that's how you know how quickly should I be escalating right now.

88

00:12:20.190 --> 00:12:29.340

Charles Cary: Right, because even if you know exactly what to do, if it's a meaningful percentage of the customers and, let's say, it's a potential data loss event, you probably should escalate immediately.

89

00:12:29.880 --> 00:12:36.330

Charles Cary: Right, like, irrespective of, it's like, hey, I just need to restart this instance, it's like, well, yeah, but the business ramifications are such that

90

00:12:36.750 --> 00:12:49.530

Charles Cary: you better let people know, because we have to begin, you know, preparing to talk to folks, or doing a deeper analysis of, well, what really happened, you know, what was the impact of that. And I find that that's also a really hard thing for

91

00:12:50.580 --> 00:13:06.330

Charles Cary: folks to quickly train on. Like, you can train on runbooks, you can train by observing others; it takes a long time to build up an intuitive sense for this specific business: if certain things happen, what is the true severity and who do I need to talk to, like, really fast.

92

00:13:06.390 --> 00:13:15.120

Matt Davis: Ooh, I'm so glad you used the word intuition here, the intuitive sense. Because, you know, when we get called,

93

00:13:16.560 --> 00:13:17.220

Matt Davis: we're.

94

00:13:18.420 --> 00:13:22.860

Matt Davis: we almost automatically enter that firefighting mode, because,

95

00:13:23.370 --> 00:13:32.730

Matt Davis: for whatever reason, we've been paged, so our cognition gets a little blip and we're automatically into this mode of what patterns are matching right now.

96

00:13:33.360 --> 00:13:46.440

Matt Davis: And so that intuition, we really have to lean on. I was thinking, when you were describing that, how would an on-call engineer even know that restarting this thing

97

00:13:47.640 --> 00:13:56.040

Matt Davis: will result in data loss, and therefore will result in so many customers having, you know, a data outage, and

98

00:13:57.570 --> 00:14:07.500

Matt Davis: how would they even know that? How would they even know to immediately think, oh, this may mean data loss, I better escalate to the data team or whatever?

99

00:14:08.790 --> 00:14:14.190

Matt Davis: I don't know the answer to that question. Have you seen any examples of how an on-call person would do that?

100

00:14:16.680 --> 00:14:17.070

Charles Cary: well.

101

00:14:19.110 --> 00:14:31.830

Charles Cary: I think it usually comes from spending time. I've heard other folks talk about this, but I think it's really true that, like, in on-call, it's hard to replace time spent on call.

102

00:14:32.760 --> 00:14:44.070

Charles Cary: Like, often, you know, just reading documentation or working on code doesn't actually give you a systemic model in your brain of how the system works, but being on incidents does, for

103

00:14:44.610 --> 00:14:53.490

Charles Cary: us at least. And so I do think there's an hours-of-practice aspect to on-call in terms of how it gets there. I think

104

00:14:54.930 --> 00:15:02.580

Charles Cary: the general technique I recommend to people, since it's hard to develop strategies to develop intuition,

105

00:15:03.060 --> 00:15:13.170

Charles Cary: If you don't have them, which is true at the beginning, the most important thing I find is to create like a kind of a culture where it's okay to escalate really fast.

106

00:15:14.100 --> 00:15:27.690

Charles Cary: Right, in the sense that someone in the org has an intuition about this. I think it's important to establish, hey, it's okay to go and page other people, even if they may come back and say, like, come on, it's a small thing,

107

00:15:27.720 --> 00:15:43.290

Charles Cary: it's not a big deal. It's important to make that safe to do, because otherwise I think it's very hard to develop the intuition, right? And it's usually just not possible, I think, even from reading the code; it's very hard, you have to see the running system and experience it.

108

00:15:44.310 --> 00:15:57.390

Matt Davis: Yeah, and feeling safe that you can escalate to someone is really, really important. Because I've hit this in the past myself, being on call, it's 3 a.m.

109

00:15:58.350 --> 00:16:00.360

Yvonne Lam (she/her): Who do I think we're waking up, right?

110

00:16:00.390 --> 00:16:00.780

yeah.

111

00:16:02.610 --> 00:16:04.980

Matt Davis: You know, what were you going to say, Yvonne?

112

00:16:05.430 --> 00:16:07.440

Yvonne Lam (she/her): Oh, just that, you know, you're waking a person up.

113

00:16:07.440 --> 00:16:08.850

Right yeah.

114

00:16:09.990 --> 00:16:14.250

Kurt Andersen: So I wanted to ask a question, because you brought up the escalation term, Charles.

115

00:16:15.960 --> 00:16:26.700

Kurt Andersen: Does that imply some sort of directionality? Is it a matter of, like Matt was alluding, it's 3 a.m., I don't know what's going on,

116

00:16:27.090 --> 00:16:41.370

Kurt Andersen: I'm going to call my manager, I'm going to call the PR team, I'm going to call a colleague, I'm going to call somebody who's on another team but is basically a peer. What do you have in mind when you say escalation, Charles?

117

00:16:42.750 --> 00:16:56.610

Charles Cary: Yeah, so sometimes I'll see that there's a notion of a formal escalation, like there's kind of a bit of a command and control hierarchy in on-call, and there's primary, secondary, sometimes tertiary, you know, it can be extended arbitrarily.

118

00:16:57.270 --> 00:17:12.330

Charles Cary: And, I don't know, that's not really the only escalation, I think. Like, when I say escalation, maybe the way to define it is just: go get other people because you don't know it, whoever those people may be.

119

00:17:12.450 --> 00:17:13.140

Kurt Andersen: Right and what are.

120

00:17:13.290 --> 00:17:26.580

Charles Cary: Ah, I see. And I think, in terms of, if I had to provide guidance to folks on who they should go and get, one valuable thing to develop while being on call is

121

00:17:27.960 --> 00:17:35.850

Charles Cary: not necessarily who's the expert in everything, because you can't page the expert every single time; they might be busy at that moment too, like, with another issue,

122

00:17:35.880 --> 00:17:39.690

Charles Cary: which happens. I think it is useful, though, to have a mental map,

123

00:17:41.280 --> 00:17:44.850

Charles Cary: for the different subcomponents of what you're trying to control,

124

00:17:46.440 --> 00:17:55.140

Charles Cary: is there a one-hop person in each direction that you're comfortable escalating to, so that, in the kind of limit, you get to the person who can help?

125

00:17:55.590 --> 00:18:05.580

Charles Cary: And so it's more about just, kind of, well, here are all my supporting services, do I have a contact on them that I'm comfortable talking to? Here are my downstream services. And so, kind of, building up a little bit of that,

126

00:18:06.480 --> 00:18:12.720

Charles Cary: you know, the next hop, so to speak, I think, is useful, and maybe that's probably a better term than escalation.

127

00:18:14.790 --> 00:18:33.120

Matt Davis: The way you described it, I would actually even call that a little bit of adaptive capacity, being able to lean on those other people. I'm exhausting my capacity in this area of expertise; I need to

128

00:18:34.560 --> 00:18:45.120

Matt Davis: tap into someone else's adaptive capacity to be able to help me with this problem. And yeah, that feels a lot more lateral than an escalation up.

129

00:18:45.420 --> 00:18:46.170

Charles Cary: Yeah, the thing.

130

00:18:48.210 --> 00:19:05.730

Kurt Andersen: And it almost sounds like, in the case that you mentioned, Matt, where you got in and you started debugging, looking at it technically, almost having a one-hop saying, hey, I need somebody to run the incident for me while I look at it as the technical responder or the technical expert.

131

00:19:06.930 --> 00:19:12.450

Kurt Andersen: That could be one of these one-hop resources to be able to take advantage of as well.

132

00:19:13.380 --> 00:19:14.880

Matt Davis: Oh yeah yeah you're right.

133

00:19:15.930 --> 00:19:30.840

Matt Davis: Like, probably, as I look back on my time with the incident yesterday, I can think of several examples where it would have made a difference, and I'm looking in hindsight, I know this.

134

00:19:31.920 --> 00:19:37.950

Matt Davis: And you know, maybe there were opportunities for me to reach out to other people.

135

00:19:39.390 --> 00:19:45.570

Matt Davis: And that's something else, I think, you know, just talking about that difficulty, being able to,

136

00:19:46.860 --> 00:19:49.680

Matt Davis: you know, we talked about flow,

137

00:19:51.000 --> 00:20:00.600

Matt Davis: and we talked about getting into the intuition of a system. And oftentimes, I like drawing parallels with music, because that's what I do.

138

00:20:02.040 --> 00:20:03.840

Matt Davis: But oftentimes when you're.

139

00:20:04.890 --> 00:20:10.200

Matt Davis: you know, when you're playing the gig, when you're into the gig, when you're in the middle of an incident,

140

00:20:10.770 --> 00:20:22.350

Matt Davis: and you're basically in a flow, in a sense, because you're constructing this mental model in your head about what, you know, the system is about, and then you're

141

00:20:22.920 --> 00:20:39.120

Matt Davis: actually building on that mental model in real time, as you learn about the failure states, and then you're pulling people across, and you're in this flow. It's like, how do you break out of that flow and remember to reach out to that next hop?

142

00:20:40.920 --> 00:20:48.330

Matt Davis: Yvonne, what do you think? Have you ever run into this problem where you don't know where to find expertise?

143

00:20:48.960 --> 00:20:52.770

Yvonne Lam (she/her): Absolutely. I mean, I work a lot with build systems, right? And so,

144

00:20:53.460 --> 00:21:01.290

Yvonne Lam (she/her): No matter how hard teams try to make a very clear line between this is the build platform, and this is the application.

145

00:21:01.530 --> 00:21:16.980

Yvonne Lam (she/her): and we deal with everything over here that happens to the application, and you all deal with everything over there that happens to the build platform, in reality there are lots of bugs where it's not clear where the problem is, right? Like the ever-popular "my build is slow."

146

00:21:18.420 --> 00:21:28.080

Yvonne Lam (she/her): Right, I mean, well, sometimes it's slow for system reasons, and sometimes it's slow because of an application change, or a change to the application build. But, um,

147

00:21:28.980 --> 00:21:41.070

Yvonne Lam (she/her): you know, I feel like one of the things that I'm trying to figure out how to do is, you know, get us to the point where we're actually good at working on those kinds of in-between issues.

148

00:21:43.320 --> 00:21:50.790

Yvonne Lam (she/her): You know, it's like, I tend to feel that, like, in software,

149

00:21:51.900 --> 00:22:00.180

Yvonne Lam (she/her): we tend to be sort of very focused on, very ownership-model driven, very legalistically

150

00:22:00.570 --> 00:22:07.740

Yvonne Lam (she/her): driven. And it's like, okay, so that's kind of like the floor, but that shouldn't be the ceiling, right? Like,

151

00:22:08.520 --> 00:22:15.600

Yvonne Lam (she/her): we should just be able to say, I don't know, the whole system looks like it's doing something weird, and I don't have,

152

00:22:16.290 --> 00:22:23.610

Yvonne Lam (she/her): like, you know, we need to have people, cross-disciplinary, cross-department, where we can go and say,

153

00:22:24.210 --> 00:22:32.580

Yvonne Lam (she/her): look, this just looks funny. You can tell me that I'm wrong, and if y'all want to laugh at me afterwards, I'm fine, but I am really concerned, because this does not look right to me.

154

00:22:34.170 --> 00:22:44.580

Matt Davis: You describe that as the in-between, and would you say that's kind of like the same area that a lot of people talk about, glue work?

155

00:22:45.660 --> 00:22:46.050

Yvonne Lam (she/her): or.

156

00:22:46.380 --> 00:22:48.060

Matt Davis: Are you thinking of something different or.

157

00:22:48.090 --> 00:22:53.730

Yvonne Lam (she/her): I do think that it's very related to glue work, because a lot of that is, um,

158

00:22:54.840 --> 00:22:55.290

Yvonne Lam (she/her): You know.

159

00:22:56.400 --> 00:23:05.910

Yvonne Lam (she/her): it's because part of your mental model isn't just, um, who's the next team on the list, right? It's, like, who do I have a relationship with

160

00:23:06.270 --> 00:23:17.760

Yvonne Lam (she/her): that I won't feel bad about asking to look at this. Like, you know, not just do they know something, but, you know, do I feel okay talking to them.

161

00:23:17.940 --> 00:23:21.000

Matt Davis: Oh yeah, the relationship is actually

162

00:23:22.110 --> 00:23:26.910

Matt Davis: more attractive to you than the expertise. That's fascinating.

163

00:23:28.740 --> 00:23:42.690

Matt Davis: You know, we're humans, and so we live in this social construct. That's how we are human, that's how we're mammals; we're in a society, and we build these relationships.

164

00:23:43.140 --> 00:23:58.710

Matt Davis: And yeah, I can think of that right now: getting paged at 3 a.m., well, I have a better relationship with my mentor than I do with the expert that built the system.

165

00:24:00.060 --> 00:24:09.180

Matt Davis: My mentor is going to be a lot more forgiving that I woke them up at 3 a.m., and can help me through this thing that I'm completely,

166

00:24:09.660 --> 00:24:24.090

Matt Davis: you know, just out of my element about. Whereas I might think the expert, they might look down on me, or they might think I don't know what I'm doing, and I can see this cognitive dissonance getting in the middle of trying to make that decision.

167

00:24:25.770 --> 00:24:36.420

Matt Davis: That's, I hadn't thought about it that way, that the escalation, that next hop, may actually not be the expert. It just may be

168

00:24:37.980 --> 00:24:43.680

Matt Davis: the most reasonable place for you to reach to when you are under strain yourself.

169

00:24:45.780 --> 00:25:03.750

Kurt Andersen: And I encountered that a lot, too, in terms of having a team of SREs and all of you kind of sharing a rotation, if you like. And I think that's one of the benefits of a primary/secondary on-call structure:

170

00:25:04.680 --> 00:25:20.700

Kurt Andersen: you have a defined next hop. If you're the primary and you get a call, you've got somebody who's already on deck, so to speak, that you can reach out to for building a shared understanding.

171

00:25:22.500 --> 00:25:28.020

Kurt Andersen: And sometimes experts can be a little prickly,

172

00:25:28.740 --> 00:25:45.780

Kurt Andersen: yeah, and especially if they get overloaded with everybody asking them questions. I mean, that's another downside of kind of siloed information, or experts who don't share information effectively: everybody does have to ask them questions and call them to get an answer.

173

00:25:47.370 --> 00:25:56.790

Matt Davis: Yeah, and you can also run the danger of that expert experiencing tunnel vision, which experts,

174

00:25:58.020 --> 00:26:03.870

Matt Davis: I kind of like to say they're allowed to do that, but you have to help them through it.

175

00:26:04.350 --> 00:26:06.060

Matt Davis: Yes, where the expert needs help.

176

00:26:06.480 --> 00:26:16.980

Matt Davis: which is breaking them out of their expertise, so that they don't fall into that tunnel vision and go, well, it has to be this, this is my area of expertise, so this is where I'm going to look.

177

00:26:17.610 --> 00:26:37.050

Kurt Andersen: Right. Another really negative thing that I've seen as a pattern is this idea of mean time to innocence, where you get everybody to swarm on an incident, and then everybody looks for the fastest way to escape as fast as possible: oh, not my system, bye.

178

00:26:38.760 --> 00:26:51.750

Kurt Andersen: And it's essentially trying to not be responsible, or to deny responsibility and get out of there, and then it becomes difficult for the person who's stuck

179

00:26:52.920 --> 00:27:06.030

Kurt Andersen: coordinating the incident, or responding as the tech lead, to say, oh, wait a minute, it actually is the DNS, and so we've got to go get the DNS people back in here, because they bailed early or something. I'm picking on DNS just because that's the kind of

180

00:27:06.420 --> 00:27:10.020

Kurt Andersen: the... And that raises the cost of being wrong, too, because that's like,

181

00:27:10.260 --> 00:27:15.780

Yvonne Lam (she/her): what if you're like, oh, it is DNS, and then you call them back and it's not DNS, right? Like, so that's,

182

00:27:17.340 --> 00:27:20.400

Yvonne Lam (she/her): you know, that is, I feel like a lot of what,

183

00:27:21.990 --> 00:27:27.810

Yvonne Lam (she/her): a lot of where the interesting-to-me work is, is lowering the cost of being wrong, yes.

184

00:27:28.110 --> 00:27:28.950

That's absolutely right.

185

00:27:30.390 --> 00:27:30.780

Kurt Andersen: That.

186

00:27:32.040 --> 00:27:33.420

Matt Davis: That sounds to me like.

187

00:27:34.620 --> 00:27:37.050

Matt Davis: helping people make difficult decisions.

188

00:27:38.190 --> 00:27:46.980

Matt Davis: It sounds like the same kind of thing. And also, you know, going back to what Charles was saying about feeling safe to escalate,

189

00:27:49.110 --> 00:27:59.760

Matt Davis: I would put it another way: feeling safe to admit you're wrong, or feeling safe to admit I don't know, or being able to say that.

190

00:28:01.140 --> 00:28:07.170

Matt Davis: And I've done this, too. And actually, I was thinking about this question:

191

00:28:10.800 --> 00:28:24.090

Matt Davis: when on call, we're often faced with learning opportunities. In fact, I would guess every single incident is a learning opportunity. And so when we're actively on call,

192

00:28:25.530 --> 00:28:35.760

Matt Davis: I often wonder about this: how hard is it to remember that this is a learning opportunity, this isn't just an outage?

193

00:28:36.600 --> 00:28:49.170

Matt Davis: This is an opportunity where our complexity has opened itself up for us to see inside it, and the view may be very narrow or maybe wide, but

194

00:28:49.650 --> 00:29:01.590

Matt Davis: those are the times that we can learn a lot about our coworkers, about our company, about how we organize our work, and about the technical aspects of the system.

195

00:29:03.090 --> 00:29:17.880

Matt Davis: Charles, have you run into anything like this, where you've faced, or have seen, responders or engineers faced with: do I learn, or do I get this fixed as fast as possible?

196

00:29:21.960 --> 00:29:23.010

Charles Cary: That's a good question.

197

00:29:24.630 --> 00:29:28.290

Charles Cary: I mean, my general recommendation is mitigate as quickly as possible,

198

00:29:29.550 --> 00:29:32.280

Charles Cary: just for the sake of, like, minimizing customer impact.

199

00:29:32.370 --> 00:29:35.220

Charles Cary: Sure, in terms of learning.

200

00:29:36.780 --> 00:29:49.230

Charles Cary: I do believe you learn to be a better on-call by being at the incidents themselves. In terms of translating that into kind of deeper observations, right,

201

00:29:50.310 --> 00:29:52.980

Charles Cary: the place where I see it being very useful is,

202

00:29:54.390 --> 00:30:05.670

Charles Cary: I am a great believer in doing at least some of the on-call for the software you write, and the reason is, I think, in those moments you kind of recognize where

203

00:30:07.860 --> 00:30:19.170

Charles Cary: maybe the software was inadequate. And, in particular, there's a lot of stuff that's easy to take for granted, because it'll get through quite a bit of testing but then fails in prod.

204

00:30:19.920 --> 00:30:33.090

Charles Cary: Just understanding the benefits of, like, logging, other telemetry, building systems that handle exceptions in the general case, right, building things that are self-healing, building things that self-defend, right. And

205

00:30:33.870 --> 00:30:38.880

Charles Cary: I think all of those kinds of principles, which aren't really necessary for, like, application correctness,

206

00:30:39.720 --> 00:30:49.260

Charles Cary: if you have an on-call where, well, if the folks who wrote the application are never on call, one of the things is that the learning doesn't feed back

207

00:30:49.830 --> 00:31:00.660

Charles Cary: to the development process. And I find that that's one place where you can create a virtuous cycle, where development improves if folks do some on-call, and they

208

00:31:01.650 --> 00:31:07.050

Charles Cary: become better at the engineering practices as well, because they think about how they will feel in that moment.

209

00:31:09.810 --> 00:31:13.260

Matt Davis: How about you, Yvonne, on learning while you're on call?

210

00:31:13.980 --> 00:31:25.650

Yvonne Lam (she/her): Oh, um, so I was going to take Charles's answer in a little bit of a different direction, which is that, you know, I've known a lot of devs who ended up getting put on call for a service that they wrote, and they

211

00:31:26.460 --> 00:31:34.650

Yvonne Lam (she/her): don't see the value in it. Like, you know, I think as ops people, as SREs, you know, I mean, nobody likes being on call, right? It's a chore.

212

00:31:35.070 --> 00:31:45.600

Yvonne Lam (she/her): But for us, like, you know, when you start a new job and then you get to be on call, it's not like, this is fun, but it's like, okay, I'm a member of the team, I'm pulling my weight.

213

00:31:45.990 --> 00:31:52.740

Yvonne Lam (she/her): I'm trusted enough that people think that I won't shut down some critical service, I mean, they might be wrong, but, like, I am trusted enough.

214

00:31:53.160 --> 00:31:57.930

Yvonne Lam (she/her): Whereas, you know, one of the devs that I've worked with described being on call for,

215

00:31:58.260 --> 00:32:09.180

Yvonne Lam (she/her): you know, this sort of complicated distributed system as, he just felt like his job was being a sleep protector, you know, it was like giving the experts a few more hours of sleep, and he couldn't really do anything.

216

00:32:09.750 --> 00:32:17.610

Yvonne Lam (she/her): And, like, you know, I don't know that he was completely correct. I mean, I think he did more than he

217

00:32:18.870 --> 00:32:26.640

Yvonne Lam (she/her): thinks he did. But I do think there's something about, like, how do we make that experience of on-call valuable to devs. Like,

218

00:32:26.850 --> 00:32:45.330

Yvonne Lam (she/her): I think a lot of times what happens, too, is that there's stuff that's valuable to individual devs, like, you know, sometimes there are things you only learn by experiencing them, and it's like, you know, having nice retry logic, for instance, stuff like that. Um,

219

00:32:46.470 --> 00:32:53.280

Yvonne Lam (she/her): you know, I think that's all good, but I think the harder thing is where the team has to fight for

220

00:32:54.390 --> 00:32:58.410

Yvonne Lam (she/her): something that would make their on-call lives better, that

221

00:32:58.410 --> 00:32:59.940

Yvonne Lam (she/her): makes for a better service.

222

00:33:00.240 --> 00:33:04.560

Yvonne Lam (she/her): And they have to do that learning together, I think that's often very hard for people.

223

00:33:07.710 --> 00:33:22.410

Matt Davis: That's a good point, because I've had that experience too, with developers on both sides of the fence, where they didn't see the need for them to be on call. They shouldn't be on call, it's not,

224

00:33:23.490 --> 00:33:28.110

Matt Davis: the way they think, it's not the way they work, it's not the way they do their thing, it's not them.

225

00:33:29.460 --> 00:33:38.310

Matt Davis: And so it was really hard to get them to see on-call in a different light. At the same time,

226

00:33:39.750 --> 00:33:43.050

Matt Davis: I've seen developers embrace it,

227

00:33:44.070 --> 00:33:46.980

Matt Davis: 100%. And

228

00:33:48.360 --> 00:33:50.970

Matt Davis: I think it's a really interesting question.

229

00:33:51.990 --> 00:33:58.290

Matt Davis: I forget who tweeted this, but it was some video

230

00:34:01.140 --> 00:34:05.250

Matt Davis: of a bunch of cops chasing a guy, and the guy had

231

00:34:06.420 --> 00:34:20.220

Matt Davis: run behind the van, and all the cops ran right by the van where the guy was hiding. The helicopter filming it knew exactly where the guy was, but all the cops ran around the guy.

232

00:34:20.700 --> 00:34:41.790

Matt Davis: And someone made a tweet comment like, oh, this is like seeing newbie people on call when you know what the answer is, but you want them to discover how to fix it. And I'm wondering if that sort of thing kind of ties into this, like, experiencing on-call, but then,

233

00:34:42.840 --> 00:34:43.560

Matt Davis: How do you.

234

00:34:45.390 --> 00:34:46.590

Matt Davis: How do you make it.

235

00:34:47.880 --> 00:34:50.730

Matt Davis: exciting? How do you make it, you know,

236

00:34:52.080 --> 00:35:04.560

Matt Davis: an opportunity? How do we spin it on its head, and instead of it being something bad that happens, show that this is actually just normal?

237

00:35:05.850 --> 00:35:20.910

Matt Davis: This is just the system operating. The system is operating in a normal way, always on the brink of failure, and we just happened to catch the system right when it touched that brink, and so that's where we are.

238

00:35:22.170 --> 00:35:25.650

Matt Davis: I had trouble answering this question myself. Um,

239

00:35:27.090 --> 00:35:27.420

Matt Davis: I.

240

00:35:29.070 --> 00:35:29.670

Matt Davis: I don't know the.

241

00:35:30.030 --> 00:35:32.340

Kurt Andersen: It may be that a lot of people don't want to take that red pill.

242

00:35:34.650 --> 00:35:35.790

Yvonne Lam (she/her): And I think that.

243

00:35:36.930 --> 00:35:43.500

Yvonne Lam (she/her): I think there's also a very human desire to be done with things, like, okay, I shipped my feature, it's done, right? Like,

244

00:35:43.770 --> 00:36:02.610

Yvonne Lam (she/her): I checked in my code change and it's done. Like, I think it was, um, Liz Fong-Jones at Honeycomb who said something about how they found that three and a half hours is kind of their limit on situational awareness, for,

245

00:36:02.730 --> 00:36:04.230

Yvonne Lam (she/her): Somebody checks in a change.

246

00:36:04.530 --> 00:36:17.460

Yvonne Lam (she/her): and they haven't paged it out yet, so that if something happens in production, they're like, oh wait, that could be me. And, you know, she said it in a tweet, so I don't have a reference off

247

00:36:17.790 --> 00:36:27.900

Yvonne Lam (she/her): the top of my head. But I do think that's very interesting, you know, since I work on CI/CD pipelines, I do think that's something that, it's like,

248

00:36:28.500 --> 00:36:45.870

Yvonne Lam (she/her): you know, I mean, it's not just the speed, but there's something about, you know, when do people have enough situational awareness around change to think, oh, you know what, that could be my thing.

249

00:36:49.440 --> 00:36:50.400

Matt Davis: It's a good question.

250

00:36:51.450 --> 00:36:56.910

Matt Davis: Because I think that changes from person to person. It's a very subjective thing to have

251

00:36:58.230 --> 00:37:06.750

Matt Davis: situational awareness, which is a troublesome thing to begin with: which situation are we being aware of, right?

252

00:37:08.340 --> 00:37:16.320

Matt Davis: I've never heard the three-and-a-half-hour thing, but that makes a lot of sense. Yeah, it's, okay, we finally deployed this,

253

00:37:17.190 --> 00:37:29.700

Matt Davis: I have another ticket, I'm going to close this ticket, I'm going to move on to my next ticket, I'm going to pull another ticket from the backlog. I can't be thinking about this while I'm trying to get into my next new part of work.

254

00:37:33.270 --> 00:37:51.810

Kurt Andersen: And something I read the other day pointed out that being on high alert for issues is fatiguing. Just maintaining that sense of awareness and alertness to the risk of failure is something that requires

255

00:37:53.130 --> 00:37:55.020

Kurt Andersen: effort and.

256

00:37:57.810 --> 00:38:04.800

Kurt Andersen: it's easy to understand. I think it may have come up in the context of one of the talks last week from SREcon,

257

00:38:06.120 --> 00:38:08.730

Kurt Andersen: where the SRE,

258

00:38:10.230 --> 00:38:21.090

Kurt Andersen: one of the managers for the Microsoft Teams organization was talking about the load on the teams for on-call when the pandemic started.

259

00:38:21.870 --> 00:38:30.390

Kurt Andersen: And all of a sudden everybody was piling, all the schools in the world were piling into Teams for virtual classrooms.

260

00:38:30.780 --> 00:38:49.020

Kurt Andersen: And they had had rotations that were a week long, and people were being crisped out of that, and so they ended up having to go to one-day rotations for on-call, just because it was so much of a burden on the people.

261

00:38:50.100 --> 00:38:51.570

Matt Davis: Wow.

262

00:38:54.840 --> 00:38:58.800

Matt Davis: It makes me think of the burden of on-call

263

00:39:00.090 --> 00:39:03.600

Matt Davis: translating into the burden of.

264

00:39:04.710 --> 00:39:07.050

Matt Davis: feeling like you have to be the one to fix the problem.

265

00:39:08.190 --> 00:39:08.430

Kurt Andersen: hmm.

266

00:39:12.060 --> 00:39:17.310

Matt Davis: I struggle with this all the time as a responder,

267

00:39:18.510 --> 00:39:43.020

Matt Davis: um, and I think I struggle with it because I do contain an area of expertise, and I really, genuinely want to be able to fix it. Um, but it kind of goes back to what Charles was saying before, you know, at what point do you hit the escalation button?

268

00:39:44.520 --> 00:39:47.280

Matt Davis: And you know when do you feel like you've.

269

00:39:48.660 --> 00:39:54.840

Kurt Andersen: The escalation button, I think part of the question there, though, is the escalation button is not an eject button.

270

00:39:55.200 --> 00:39:57.120

Kurt Andersen: It is not a defeat button.

271

00:39:57.510 --> 00:40:04.710

Kurt Andersen: And if it feels that way, then it's going to cause people to be unwilling to use it.

272

00:40:06.720 --> 00:40:18.450

Kurt Andersen: And I notice that our time is running short here, also, so I don't want to have the last word on this, but I think it's important to not view escalating as abdicating responsibility.

273

00:40:20.040 --> 00:40:21.510

Matt Davis: Charles, you're nodding.

274

00:40:22.830 --> 00:40:25.320

Charles Cary: Absolutely right. And I think that, um,

275

00:40:27.540 --> 00:40:42.540

Charles Cary: escalate means, I'd like you to help me solve this problem, is usually how I kind of portray it to folks. Now, if it's ongoing, there needs to be release at some point; it doesn't mean you're tied to it forever, but there is some amount of, you know,

276

00:40:43.680 --> 00:40:50.760

Charles Cary: when you bring another person in, you're now working with them and continuing to do it, right. And I think that's only fair to whoever you're going to.

277

00:40:52.050 --> 00:40:56.070

Charles Cary: And that's also, going back, that is one of those opportunities to learn.

278

00:40:56.640 --> 00:41:02.100

Charles Cary: Right, in the sense that that's where you can actually, even if you can't really contribute meaningfully to the debugging anymore,

279

00:41:02.430 --> 00:41:14.910

Charles Cary: that's how you can actually see, you know, things to learn. And so escalation, even if you're taking a backseat in terms of the ongoing effort, is usually worthwhile, at least till the shift runs out.

280

00:41:15.540 --> 00:41:27.390

Yvonne Lam (she/her): And something I would add, too, is that, you know, I don't think we should be hard on ourselves about wanting to solve the problem, because all of tech culture is about being the person with the answers, right? Like,

281

00:41:27.690 --> 00:41:32.280

Yvonne Lam (she/her): there's no way we can just turn that off in our brains when we're on call,

282

00:41:32.550 --> 00:41:35.400

Yvonne Lam (she/her): Even if we know that it's not a good idea.

283

00:41:37.080 --> 00:41:39.840

Matt Davis: You're absolutely right. We,

284

00:41:40.920 --> 00:41:45.990

Matt Davis: we want to be the hero, we want to be the person who figured it out, we want to be,

285

00:41:46.380 --> 00:42:01.680

Matt Davis: you know, we want to show that we did something on this on-call shift. I couldn't work on feature work all week, or all day, or whatever, but hey, look what I learned, look what I fixed, and I was on call and I did that.

286

00:42:02.190 --> 00:42:04.590

Yvonne Lam (she/her): I did that at three in the morning, by myself.

287

00:42:07.290 --> 00:42:12.660

Matt Davis: Exactly. Oh yeah, no one knew about it, I didn't have to escalate at three in the morning, haha.

288

00:42:13.620 --> 00:42:29.670

Matt Davis: Well, I wanted to thank everyone for joining. Thank you, Charles; thank you, Yvonne. It's been a wonderful time talking to you. This conversation has sparked some ideas and some moments of learning for me, and I really hope it's done the same for you all.

289

00:42:30.180 --> 00:42:31.650

Yvonne Lam (she/her): Absolutely, all right.

290

00:42:32.160 --> 00:42:34.500

Kurt Andersen: Thank you very much, thank you all. What a great conversation.

291

00:42:34.860 --> 00:42:35.400

Thank you.