

Speakers

Matt Davis

Kurt Andersen

Yvonne Lam

Charles Cary
Video Transcript
1
00:00:04.049 --> 00:00:04.440
alright.
2
00:00:07.200 --> 00:00:20.370
Matt Davis: Hi there, welcome to the inaugural version of SRE: From Theory to Practice. My name is Matt Davis, and I'm a staff infrastructure engineer at Blameless.
3
00:00:20.850 --> 00:00:32.520
Matt Davis: And I work on our infrastructure and SRE team, and I do a lot of operational and architectural types of things over on the operations and systems side.
4
00:00:33.660 --> 00:00:39.030
Matt Davis: Why don't we go ahead and introduce the rest of our panel today. Yvonne, please go ahead.
5
00:00:39.840 --> 00:00:59.730
Yvonne Lam (she/her): Hey, um, I am Yvonne Lam. I am a staff software engineer at Kong, mostly working on CI and CD things and developer tools. I have done all kinds of deployment engineering and SRE things and ops things, and yeah, that's basically me.
6
00:01:04.440 --> 00:01:05.610
Matt Davis: Charles, go ahead, please.
7
00:01:06.390 --> 00:01:10.170
Charles Cary: Hi, I'm Charles Cary. I'm Shoreline's CTO.
8
00:01:12.210 --> 00:01:20.310
Charles Cary: I deal with a lot of incidents, anything that kind of gets escalated, lots of trigger routing, you know, in addition to just kind of leading technical development over here.
9
00:01:20.790 --> 00:01:31.230
Charles Cary: And yeah, I've been doing ops in various forms for a while now. I was at AWS working on some of those services as well, and yeah, that's what I'm all about.
10
00:01:33.690 --> 00:01:48.240
Kurt Andersen: And I'm Kurt Andersen. I'm an SRE architect here at Blameless also, and previously was with LinkedIn, and before that HP Managed Services, so I've been dealing with operational issues for a long time as well.
11
00:01:49.920 --> 00:01:58.770
Matt Davis: I'm glad we have such nicely rounded operational expertise here, because today we're going to talk about on call.
12
00:01:59.370 --> 00:02:15.900
Matt Davis: And we're going to talk about basically the overriding question of what is difficult about on call. You know, there's all kinds of ways that we can talk about this, and I think that I'm going to go ahead and start the discussion myself, excuse me.
13
00:02:17.160 --> 00:02:19.890
Matt Davis: Because I happen to be on call this week.
14
00:02:21.060 --> 00:02:23.190
Matt Davis: I'm not on call at the moment.
15
00:02:23.490 --> 00:02:37.530
Matt Davis: And I actually got someone to cover for me. A new employee who hasn't actually been in the rotation yet volunteered to cover my shift for the day, so I could have a good time, get some good sleep,
16
00:02:38.580 --> 00:02:40.710
Matt Davis: and then be able to do this recording today.
17
00:02:42.420 --> 00:03:00.840
Matt Davis: Now, one of the things that I like to think about when I'm talking about on call with other people is the very difficult decisions that we make when we're in the middle of doing our work. And one of the things that I found very difficult,
18
00:03:01.920 --> 00:03:04.500
Matt Davis: in fact, in an incident that happened yesterday,
19
00:03:04.920 --> 00:03:16.590
Matt Davis: was, as the on call, um, you know, I was pretty much called upon, also with my expertise, to mitigate the problem.
20
00:03:18.060 --> 00:03:20.220
Matt Davis: And I was having some difficulty.
21
00:03:21.360 --> 00:03:24.240
Matt Davis: reaching out. I had some difficulty
22
00:03:26.460 --> 00:03:29.100
Matt Davis: Being a kind of incident commander, if you will.
23
00:03:30.480 --> 00:03:39.960
Matt Davis: You know, this was a SEV2 incident, and I won't go into our severity levels right now, but it wasn't a low incident and it wasn't a high incident. So,
24
00:03:40.590 --> 00:03:52.980
Matt Davis: something was down, and I, as the engineer, needed to get it fixed. As the person with expertise, I was also the next person who I'd probably escalate to to get this fixed.
25
00:03:53.400 --> 00:04:02.280
Matt Davis: So I wasn't able to do things like, oh, should we contact the customer because of this, you know, this blip that might happen,
26
00:04:02.640 --> 00:04:13.620
Matt Davis: or, hey, what other communication is going out. So that's one of the things that I was finding really difficult to do while I was actually trying to handle the incident.
27
00:04:15.720 --> 00:04:24.720
Matt Davis: Yvonne, I'm going to call on you, if I may. Is there something... well, let me ask you this first: when's the last time you were on call?
28
00:04:25.620 --> 00:04:33.600
Yvonne Lam (she/her): Um, probably a year or so ago, I mean, and that was on call for internal systems.
29
00:04:33.960 --> 00:04:34.350
Okay.
30
00:04:36.840 --> 00:04:43.200
Matt Davis: Describe that. Why do you make that distinction between an internal system and an external system?
31
00:04:44.070 --> 00:04:52.380
Yvonne Lam (she/her): It's a little bit different just in terms of the amount of monitoring support, in terms of
32
00:04:53.010 --> 00:04:58.620
Yvonne Lam (she/her): the business level of support. Um, so, for instance, when I used to work at Chef, we...
33
00:04:59.250 --> 00:05:08.910
Yvonne Lam (she/her): I worked for release engineering. We were probably like the biggest AWS expense, because we ran a giant production build cluster
34
00:05:09.300 --> 00:05:14.040
Yvonne Lam (she/her): for, you know, something like, I don't know, round about 100 platforms. So, lots of stuff.
35
00:05:14.730 --> 00:05:23.280
Yvonne Lam (she/her): And it had tentacles in all kinds of places at the time that I was working on it. I mean, it's very different now.
36
00:05:24.270 --> 00:05:36.240
Yvonne Lam (she/her): Being on call for internal things is a little bit different, because a lot of times you don't necessarily get an alert, right? You get people saying things like,
37
00:05:37.590 --> 00:05:40.320
Yvonne Lam (she/her): something seems funny, like, my build is slow.
38
00:05:43.350 --> 00:05:47.100
Matt Davis: Yeah, yeah, that's really interesting.
39
00:05:48.540 --> 00:05:59.250
Matt Davis: Because I think you're right, we tend to not think about our internal customers as customers, I mean they are customers.
40
00:05:59.790 --> 00:06:14.400
Matt Davis: Um, now, is there something that you can recall from being on call back then, when you were doing the internal response? Do you find that more difficult than being on call doing an external response?
41
00:06:15.120 --> 00:06:15.630
um.
42
00:06:16.980 --> 00:06:22.140
Yvonne Lam (she/her): So, I mean, like, it's different. I don't know that it's more difficult.
43
00:06:23.190 --> 00:06:24.660
Yvonne Lam (she/her): In some ways.
44
00:06:25.710 --> 00:06:35.430
Yvonne Lam (she/her): the pressure is less, because it's not necessarily a SEV0 for something that's directly customer-facing right this minute. But
45
00:06:36.300 --> 00:06:45.300
Yvonne Lam (she/her): nobody is fronting for you. That's the thing that's hard, right? Like, you do not have a support team or
46
00:06:46.020 --> 00:06:57.510
Yvonne Lam (she/her): some kind of customer-facing team that is answering the phone or answering emails and saying yes, we know it's down, we know it's not right. Like, you've got, you know, 100 people showing up one by one,
47
00:06:57.810 --> 00:06:57.990
or.
48
00:06:59.430 --> 00:07:05.340
Yvonne Lam (she/her): "Do you know that nobody can build anything?" And it's like, I know, I know, thank you, yes, on it, on it.
49
00:07:06.600 --> 00:07:07.530
Matt Davis: They do.
50
00:07:09.060 --> 00:07:18.360
Matt Davis: This makes me think of another question that we're often faced with, and I think this is true of every operation: what is an incident?
51
00:07:18.660 --> 00:07:19.140
Yes.
52
00:07:20.700 --> 00:07:25.170
Matt Davis: Does the fact that it's internal or external mean it isn't or is an incident?
53
00:07:26.520 --> 00:07:30.450
Matt Davis: I don't know. My personal opinion is it doesn't really matter.
54
00:07:30.900 --> 00:07:31.680
Matt Davis: An incident
55
00:07:31.740 --> 00:07:34.530
Matt Davis: in a complex system is an incident, yes.
56
00:07:35.070 --> 00:07:38.040
Yvonne Lam (she/her): And I think that we had...
57
00:07:39.060 --> 00:07:52.800
Yvonne Lam (she/her): Like, one of the reasons why we had a pretty serious on-call rotation for, um, the build systems, for the build platform, is that, well, right about the time that I started, we got hit with Heartbleed.
58
00:07:53.220 --> 00:08:09.420
Yvonne Lam (she/her): Right, so our ability to rebuild our software to mitigate customer issues was at risk, because we first had to mitigate things on our side, because we had a bunch of stuff go wrong with our build platform too. So
59
00:08:10.230 --> 00:08:29.280
Yvonne Lam (she/her): yeah, so, you know, we tend to think of on call as something that mostly affects people in the SaaS world, but I think there's a movement towards having it be: if it affects your ability to get things to customers, then,
60
00:08:30.300 --> 00:08:36.120
Yvonne Lam (she/her): you know, there may be an on-call there, there may be some kind of an incident structure that you want to have.
61
00:08:36.210 --> 00:08:40.320
Matt Davis: Right. The QA system is down, we cannot QA,
62
00:08:42.030 --> 00:08:48.300
Matt Davis: much less deploy. I've hit that instance in the past; I've taken down QA in the past.
63
00:08:50.280 --> 00:09:00.300
Yvonne Lam (she/her): Well, and there are offline systems too, right? Like, so many places are doing sort of data crunching, machine learning, all of those pipelines. It's like,
64
00:09:01.080 --> 00:09:05.790
Yvonne Lam (she/her): how big a deal is it if your machine learning pipeline goes down? I mean,
65
00:09:06.390 --> 00:09:20.070
Yvonne Lam (she/her): probably nothing bad is going to happen right this minute, but over time some surface, you know, some people-facing surface will be impacted; like, there's some kind of a data quality measure that you will not be able to meet.
66
00:09:21.060 --> 00:09:28.530
Kurt Andersen: Yeah, speaking from experience, when you have pipelines that take 22 hours to run, and they fail partway through,
67
00:09:30.210 --> 00:09:37.470
Kurt Andersen: you're a day behind by the time you even just restart the pipeline and it can have really serious impacts on.
68
00:09:39.210 --> 00:09:40.830
Kurt Andersen: user experience, I'll say,
69
00:09:41.970 --> 00:09:42.480
Kurt Andersen: in general.
70
00:09:43.530 --> 00:09:48.480
Yvonne Lam (she/her): People like seeing their updates right away like that makes them feel like something has happened, like it's important.
71
00:09:49.050 --> 00:09:53.340
Kurt Andersen: Right, or seeing an appropriate list of people that you might want to connect with.
72
00:09:54.960 --> 00:09:57.030
Matt Davis: Or being served the correct ad.
73
00:09:58.530 --> 00:10:15.510
Matt Davis: Yes, I've had this same exact experience, like you're saying, Kurt, where this feed takes 24 hours to process; we actually won't know until tomorrow if something went wrong. Then, you know,
74
00:10:16.530 --> 00:10:25.530
Matt Davis: it's difficult to know: well, should we stop the feed because we know this is broken, or do we wait until we see the results to understand how broken it got?
75
00:10:26.100 --> 00:10:30.270
Kurt Andersen: And these are very interesting and difficult
76
00:10:31.290 --> 00:10:52.830
Kurt Andersen: problems that are being grappled with in the MLOps community, which is essentially looking at how you apply general reliability principles in the context of machine learning and training scenarios, which can take significant amounts of time and hardware to process, yeah.
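The restart problem Kurt describes is one reason long batch pipelines are usually checkpointed, so a late failure resumes from the last completed stage instead of rerunning all 22 hours. A minimal sketch of stage-level checkpointing, with hypothetical stage names and a local state file standing in for real infrastructure:

    import json, os

    CHECKPOINT = "pipeline_checkpoint.json"  # hypothetical local state file

    def load_done():
        # Read the set of stages that finished before a crash, if any.
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return set(json.load(f))
        return set()

    def mark_done(done, stage):
        done.add(stage)
        with open(CHECKPOINT, "w") as f:
            json.dump(sorted(done), f)

    def run_pipeline(stages):
        done = load_done()
        for name, fn in stages:
            if name in done:
                continue  # skip work that completed before the failure
            fn()
            mark_done(done, name)

    # In practice each stage would be an hour-scale job:
    run_pipeline([("extract", lambda: None), ("train", lambda: None)])

On a rerun after a mid-run failure, only the unfinished stages execute, which is the difference between losing an hour and losing a day.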
77
00:10:53.070 --> 00:11:07.860
Matt Davis: Yeah. Hey, Charles, how about you? When we think about things that are difficult for an on-call engineer, the difficult decisions they have to make, what crosses your mind?
78
00:11:09.750 --> 00:11:17.550
Charles Cary: I think one of the most challenging things is assessing customer impact, and the reason I bring that up is.
79
00:11:19.170 --> 00:11:30.210
Charles Cary: Often, you know, you're on call, right, and what you're thinking about is: is the system functioning or not, is there something I should be attending to right now. Or often, if it's a quiet shift, like, people,
80
00:11:30.450 --> 00:11:33.570
Charles Cary: To be honest, they start doing other stuff unfortunately which can happen.
81
00:11:34.050 --> 00:11:34.740
Yvonne Lam (she/her): And then.
82
00:11:35.100 --> 00:11:47.280
Charles Cary: What will happen is, like, you jump into the moment of the issue, and I think folks often are now in a debugging mindset, when actually the first thing that needs to be sussed out is, well:
83
00:11:47.850 --> 00:11:53.460
Charles Cary: Irrespective of what the ticketing system has declared in terms of the severity of the issue like.
84
00:11:54.300 --> 00:12:03.420
Charles Cary: is this a global outage, does this impact all the customers, does it impact only a subset, is it one customer, right? And then really, I think, the next part of that is,
85
00:12:04.170 --> 00:12:08.700
Charles Cary: beyond just how many customers: for those that it does impact, what is the impact?
86
00:12:09.030 --> 00:12:14.190
Charles Cary: Is it just a degradation of the service, is it a total outage, is it like a data loss event, right?
87
00:12:14.460 --> 00:12:19.890
Charles Cary: And the reason I think it matters is just, like, that's how you know: how quickly should I be escalating right now?
88
00:12:20.190 --> 00:12:29.340
Charles Cary: Right, because even if you know exactly what to do, if it's a meaningful percentage of the customers, and let's say it's a potential data loss event, you probably should escalate immediately.
89
00:12:29.880 --> 00:12:36.330
Charles Cary: Right, like, irrespective of, hey, I just need to restart this instance; it's like, well, yeah, but the business ramifications are such that
90
00:12:36.750 --> 00:12:49.530
Charles Cary: you better let people know, because we have to begin, you know, preparing to talk to folks, or doing a deeper analysis of, well, what really happened, what was the impact of that. And I find that that's also a really hard thing for
91
00:12:50.580 --> 00:13:06.330
Charles Cary: folks to quickly train on. Like, you can train on runbooks, you can train by observing others; it takes a long time to build up an intuitive sense for this specific business: if certain things happen, what is the true severity, and who do I need to talk to, like, really fast.
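Charles's triage boils down to two questions asked before any debugging: how many customers, and what kind of impact. A toy sketch of that decision as code; the scope and impact categories come from his description, but the escalate-immediately rule is a hypothetical example, not anyone's real policy:

    from enum import Enum

    class Scope(Enum):
        ONE_CUSTOMER = 1
        SUBSET = 2
        ALL_CUSTOMERS = 3

    class Impact(Enum):
        DEGRADATION = 1
        TOTAL_OUTAGE = 2
        DATA_LOSS = 3

    def should_escalate_immediately(scope, impact):
        # Hypothetical rule of thumb: data loss or global scope pages humans
        # first, regardless of what severity the ticketing system declared.
        return impact is Impact.DATA_LOSS or scope is Scope.ALL_CUSTOMERS

    # Even an "I just need to restart this instance" fix still escalates
    # when data loss is on the table:
    assert should_escalate_immediately(Scope.SUBSET, Impact.DATA_LOSS)

The point isn't the code itself but the ordering: the check runs before the debugging mindset takes over.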
92
00:13:06.390 --> 00:13:15.120
Matt Davis: I'm so glad you used the word intuition here, the intuitive sense, because, you know, when we get called,
93
00:13:16.560 --> 00:13:17.220
Matt Davis: we're.
94
00:13:18.420 --> 00:13:22.860
Matt Davis: we almost automatically enter that firefighting mode, because
95
00:13:23.370 --> 00:13:32.730
Matt Davis: for whatever reason we've been paged, so our cognition gets a little blip, and we're automatically into this mode of: what patterns are matching right now?
96
00:13:33.360 --> 00:13:46.440
Matt Davis: And so that intuition, we really have to lean on. I was thinking, while you were describing that: how would an on-call engineer even know that restarting this thing
97
00:13:47.640 --> 00:13:56.040
Matt Davis: will result in data loss and therefore will result in so many customers having you know, a data outage and.
98
00:13:57.570 --> 00:14:07.500
Matt Davis: How would they even know that? How would they even know to immediately think: oh, this may mean data loss, I'd better escalate to the data team, or whatever?
99
00:14:08.790 --> 00:14:14.190
Matt Davis: I don't know the answer to that question. Have you seen any examples of how an on-call person would do that?
100
00:14:16.680 --> 00:14:17.070
Charles Cary: well.
101
00:14:19.110 --> 00:14:31.830
Charles Cary: I think it usually comes from spending time. I've heard other folks talk about this, but I think it's really true that, like, in on call, it's hard to replace time spent on call.
102
00:14:32.760 --> 00:14:44.070
Charles Cary: Like, often, just reading documentation or working on code doesn't actually give you a systemic model in your brain of how the system works, but being on incidents does, for
103
00:14:44.610 --> 00:14:53.490
Charles Cary: us at least. And so I do think there's, like, an hours-of-practice aspect to on call in terms of how it gets there, I think.
104
00:14:54.930 --> 00:15:02.580
Charles Cary: The general technique I recommend to people, since it's hard to develop strategies to develop intuition
105
00:15:03.060 --> 00:15:13.170
Charles Cary: if you don't have it, which is true at the beginning: the most important thing, I find, is to create a kind of a culture where it's okay to escalate really fast.
106
00:15:14.100 --> 00:15:27.690
Charles Cary: Right, in the sense that someone in the org has an intuition about this. I think it's important to establish: hey, it's okay to go and page other people, even if they may come back and say, come on, it's a small thing,
107
00:15:27.720 --> 00:15:43.290
Charles Cary: it's not a big deal. It's important to make that safe to do, because otherwise I think it's very hard to develop the intuition, right? And it's usually just not possible even from reading the code; it's very hard, you have to see the running system and experience it.
108
00:15:44.310 --> 00:15:57.390
Matt Davis: Yeah, and feeling safe that you can escalate to someone is really, really important. Because I've hit this in the past myself, being on call: it's 3am,
109
00:15:58.350 --> 00:16:00.360
Yvonne Lam (she/her): And you don't want to wake anyone up, right?
110
00:16:00.390 --> 00:16:00.780
yeah.
111
00:16:02.610 --> 00:16:04.980
Matt Davis: You know, who are we going to escalate to?
112
00:16:05.430 --> 00:16:07.440
Yvonne Lam (she/her): Oh, just that, you know, you're waking a person up.
113
00:16:07.440 --> 00:16:08.850
Right yeah.
114
00:16:09.990 --> 00:16:14.250
Kurt Andersen: So I wanted to ask a question, because you brought up the escalation term Charles.
115
00:16:15.960 --> 00:16:26.700
Kurt Andersen: Does that imply some sort of directionality? Is it a matter of, like Matt was alluding, it's 3am, I don't know what's going on:
116
00:16:27.090 --> 00:16:41.370
Kurt Andersen: am I going to call my manager, am I going to call the PR team, am I going to call a colleague, am I going to call somebody who's on another team but is basically a peer? What do you have in mind when you say escalation, Charles?
117
00:16:42.750 --> 00:16:56.610
Charles Cary: Yeah, so sometimes I'll see that there's a notion of a formal escalation, like there's kind of a command-and-control hierarchy in on call, and there's primary, secondary, sometimes tertiary; you know, it can be extended arbitrarily.
118
00:16:57.270 --> 00:17:12.330
Charles Cary: But that's not really the only escalation, I think. When I say escalation, maybe the way to define it is just: go get other people, because you don't know it, whoever those people may be.
119
00:17:12.450 --> 00:17:13.140
Kurt Andersen: Right and what are.
120
00:17:13.290 --> 00:17:26.580
Charles Cary: And I think, in terms of, if I had to provide guidance to folks on who they should go and get, one valuable thing to develop while being on call is:
121
00:17:27.960 --> 00:17:35.850
Charles Cary: not necessarily who's the expert in everything, because you can't page the expert every single time; they might be busy at that moment, too, like with another issue,
122
00:17:35.880 --> 00:17:39.690
Charles Cary: which happens. I think it is useful, though, to have a mental map of,
123
00:17:41.280 --> 00:17:44.850
Charles Cary: for the different subcomponents of what you're trying to control:
124
00:17:46.440 --> 00:17:55.140
Charles Cary: is there a one-hop person in each direction that you're comfortable escalating to, so that, in the kind of limit, you get to the person who can help?
125
00:17:55.590 --> 00:18:05.580
Charles Cary: And so it's more about just, kind of, well, here are all my supporting services: do I have a contact on them that I'm comfortable talking to? Here are my downstream services, and so on. Kind of building up a little bit of that,
126
00:18:06.480 --> 00:18:12.720
Charles Cary: you know, the next hop, so to speak, I think, is useful, and maybe that's probably a better term than escalation.
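The "mental map" Charles describes can be as simple as a lookup from each neighboring service to the one person you're comfortable paging. A toy sketch, with invented service and contact names:

    # Hypothetical map from neighboring services to one contact you are
    # comfortable paging -- not necessarily the deepest expert.
    NEXT_HOP = {
        "auth-service": "dana",
        "billing-db": "priya",
        "edge-proxy": "sam",
    }

    def next_hop(service):
        # Fall back to a shared incident channel when no direct contact exists.
        return NEXT_HOP.get(service, "#incident-response")

    print(next_hop("billing-db"))   # -> priya
    print(next_hop("ml-pipeline"))  # -> #incident-response

Writing the map down, even informally, is what turns "go get other people" from a vague intention into a one-hop move at 3am.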
127
00:18:14.790 --> 00:18:33.120
Matt Davis: The way you described it, I would actually even call that a little bit of adaptive capacity: being able to lean on those other people. I'm exhausting my capacity in this area of expertise; I need to
128
00:18:34.560 --> 00:18:45.120
Matt Davis: tap into someone else's adaptive capacity to be able to help me with this problem. And yeah, that feels a lot more lateral than an escalation up.
129
00:18:45.420 --> 00:18:46.170
Charles Cary: yeah the thing.
130
00:18:48.210 --> 00:19:05.730
Kurt Andersen: And it almost sounds like, in your case that you mentioned, Matt, you got in, you started debugging, looking at it technically; almost having a one hop, saying, hey, I need somebody to run the incident for me while I look at it as the technical responder or the technical expert.
131
00:19:06.930 --> 00:19:12.450
Kurt Andersen: That could be one of these one hop resources to be able to take advantage of as well.
132
00:19:13.380 --> 00:19:14.880
Matt Davis: Oh yeah yeah you're right.
133
00:19:15.930 --> 00:19:30.840
Matt Davis: Probably, as I look back on my time with the incident yesterday, I can think of several examples where it would have helped; and I'm operating in hindsight, I know this.
134
00:19:31.920 --> 00:19:37.950
Matt Davis: And you know, maybe there were opportunities for me to reach out to other people.
135
00:19:39.390 --> 00:19:45.570
Matt Davis: And that's something else I think you know just talking about that difficulty being able to.
136
00:19:46.860 --> 00:19:49.680
Matt Davis: You know, we talked about flow.
137
00:19:51.000 --> 00:20:00.600
Matt Davis: And we talked about getting into the intuition of a system. And oftentimes, and I like drawing parallels with music, because that's what I do,
138
00:20:02.040 --> 00:20:03.840
Matt Davis: But oftentimes when you're.
139
00:20:04.890 --> 00:20:10.200
Matt Davis: You know when you're playing the gig and when you're into the gig when you're in the middle of an incident.
140
00:20:10.770 --> 00:20:22.350
Matt Davis: you're basically in a flow, in a sense, because you're constructing this mental model in your head about what the system is about, and then you're
141
00:20:22.920 --> 00:20:39.120
Matt Davis: actually building on that mental model in real time as you learn about the failure states, and then you're pulling people across, and you're in this flow. It's like, how do you break out of that flow and remember to reach out to that next hop?
142
00:20:40.920 --> 00:20:48.330
Matt Davis: Yvonne, what do you think? Have you ever run into this problem where you don't know where to find expertise?
143
00:20:48.960 --> 00:20:52.770
Yvonne Lam (she/her): Absolutely, I mean I work a lot with build systems right and so.
144
00:20:53.460 --> 00:21:01.290
Yvonne Lam (she/her): No matter how hard teams try to make a very clear line between this is the build platform, and this is the application.
145
00:21:01.530 --> 00:21:16.980
Yvonne Lam (she/her): and we deal with everything over here that happens to the application, and you all deal with everything over there that happens to the build platform; in reality, there are lots of bugs where it's not clear where the problem is, right? Like the ever-popular "my build is slow."
146
00:21:18.420 --> 00:21:28.080
Yvonne Lam (she/her): Right, I mean, sometimes it's slow for system reasons, and sometimes it's slow because of an application change or a change to the application build. But, um,
147
00:21:28.980 --> 00:21:41.070
Yvonne Lam (she/her): you know, I feel like one of the things that I'm trying to figure out how to do is get us to the point where we're actually good at working on those kinds of in-between issues.
148
00:21:43.320 --> 00:21:50.790
Yvonne Lam (she/her): You know, I tend to feel that, like, in software,
149
00:21:51.900 --> 00:22:00.180
Yvonne Lam (she/her): we tend to be very ownership-model driven, very legalistic,
150
00:22:00.570 --> 00:22:07.740
Yvonne Lam (she/her): and it's like, okay, so that's kind of like the floor, but that shouldn't be the ceiling, right? Like,
151
00:22:08.520 --> 00:22:15.600
Yvonne Lam (she/her): we should just be able to say: I don't know, the whole system looks like it's doing something weird, and I don't have...
152
00:22:16.290 --> 00:22:23.610
Yvonne Lam (she/her): Like, you know, we need to have people, cross-disciplinary, cross-department, where we can go and say:
153
00:22:24.210 --> 00:22:32.580
Yvonne Lam (she/her): look, this just looks funny. You can tell me that I'm wrong, and if y'all want to laugh at me afterwards, I'm fine, but I am really concerned, because this does not look right to me.
154
00:22:34.170 --> 00:22:44.580
Matt Davis: You describe that as the in-between; would you say that's kind of the same area that a lot of people talk about as glue work?
155
00:22:45.660 --> 00:22:46.050
Yvonne Lam (she/her): or.
156
00:22:46.380 --> 00:22:48.060
Matt Davis: Are you thinking of something different or.
157
00:22:48.090 --> 00:22:53.730
Yvonne Lam (she/her): I do think that it's very related to glue work, because a lot of that is, um,
158
00:22:54.840 --> 00:22:55.290
Yvonne Lam (she/her): You know.
159
00:22:56.400 --> 00:23:05.910
Yvonne Lam (she/her): it's because part of your mental model isn't just, um, who's the next team on the list, right? It's: who do I have a relationship with
160
00:23:06.270 --> 00:23:17.760
Yvonne Lam (she/her): that I won't feel bad about asking to look at this? Like, not just, do they know something, but, do I feel okay talking to them?
161
00:23:17.940 --> 00:23:21.000
Matt Davis: Oh yeah, the relationship is actually
162
00:23:22.110 --> 00:23:26.910
Matt Davis: more attractive to you than the expertise. That's fascinating.
163
00:23:28.740 --> 00:23:42.690
Matt Davis: You know, we're humans, and so we live in this social construct; that's how we are human, that's how we're mammals. We're in a society, and we build these relationships.
164
00:23:43.140 --> 00:23:58.710
Matt Davis: And yeah, I can think of that right now: getting paged at 3am, well, I have a better relationship with my mentor than I do with the expert that built the system.
165
00:24:00.060 --> 00:24:09.180
Matt Davis: My mentor is going to be a lot more forgiving that I woke them up at 3am, and can help me through this thing that I'm completely
166
00:24:09.660 --> 00:24:24.090
Matt Davis: out of my element on. Whereas I might think the expert is going to look down on me, or they might think I don't know what I'm doing, and I can see this cognitive dissonance getting in the middle of trying to make that decision.
167
00:24:25.770 --> 00:24:36.420
Matt Davis: That's really... I hadn't thought about it that way: that the escalation, that next hop, may actually not be the expert. It just may be
168
00:24:37.980 --> 00:24:43.680
Matt Davis: the most reasonable place for you to reach when you are under strain yourself.
169
00:24:45.780 --> 00:25:03.750
Kurt Andersen: And I encountered that a lot, too, in terms of having a team of SREs and all of you kind of sharing a rotation, if you like. And I think that's one of the benefits of a primary/secondary on-call structure:
170
00:25:04.680 --> 00:25:20.700
Kurt Andersen: you have a defined next hop. If you're the primary and you get a call, you've got somebody who's already on deck, so to speak, that you can reach out to for building a shared understanding.
171
00:25:22.500 --> 00:25:28.020
Kurt Andersen: And sometimes experts can be a little prickly,
172
00:25:28.740 --> 00:25:45.780
Kurt Andersen: especially if they get overloaded with everybody asking them questions. I mean, that's another downside of siloed information, or experts who don't share information effectively: everybody does have to ask them questions and call them to get an answer.
173
00:25:47.370 --> 00:25:56.790
Matt Davis: Yeah, and you can also run the danger of that expert experiencing tunnel vision, which experts are prone to.
174
00:25:58.020 --> 00:26:03.870
Matt Davis: I kind of like to say they're allowed to do that, but you have to help them through it.
175
00:26:04.350 --> 00:26:06.060
Matt Davis: Yes, where the expert needs help.
176
00:26:06.480 --> 00:26:16.980
Matt Davis: is breaking them out of their expertise, so that they don't fall into that tunnel vision and go: well, it has to be this, this is my area of expertise, so this is where I'm going to look.
177
00:26:17.610 --> 00:26:37.050
Kurt Andersen: Right. Another really negative thing that I've seen as a pattern is this idea of "mean time to innocence," where you get everybody to swarm on an incident, and then everybody looks for the fastest way to escape as fast as possible: oh, not my system, bye.
178
00:26:38.760 --> 00:26:51.750
Kurt Andersen: And it's essentially trying to not be responsible, or deny responsibility, and get out of there. And then it becomes difficult for the person who's stuck
179
00:26:52.920 --> 00:27:06.030
Kurt Andersen: coordinating the incident, or responding as the tech lead, to say: oh, wait a minute, it actually is the DNS, and so we've got to go get the DNS people back in here, because they bailed early or something. I'm picking on DNS just because that's the kind of
180
00:27:06.420 --> 00:27:10.020
Kurt Andersen: usual suspect. And that raises the cost of being wrong, too, because that's like...
181
00:27:10.260 --> 00:27:15.780
Yvonne Lam (she/her): What if you're like, oh, it is DNS, and then you call them back and it's not DNS, right? So that's...
182
00:27:17.340 --> 00:27:20.400
Yvonne Lam (she/her): You know, I feel like a lot of what...
183
00:27:21.990 --> 00:27:27.810
Yvonne Lam (she/her): a lot of where the work that's interesting to me is, is lowering the cost of being wrong, yes.
184
00:27:28.110 --> 00:27:28.950
that's absolutely right.
185
00:27:30.390 --> 00:27:30.780
Kurt Andersen: That.
186
00:27:32.040 --> 00:27:33.420
Matt Davis: That sounds to me like.
187
00:27:34.620 --> 00:27:37.050
Matt Davis: helping people make difficult decisions.
188
00:27:38.190 --> 00:27:46.980
Matt Davis: It sounds like the same kind of thing. And also, you know, going back to what Charles was saying about feeling safe to escalate:
189
00:27:49.110 --> 00:27:59.760
Matt Davis: I would put it another way: feeling safe to admit you're wrong, or feeling safe to admit I don't know, or being able to say that.
190
00:28:01.140 --> 00:28:07.170
Matt Davis: And I've done this, too. And actually, I was thinking about this question:
191
00:28:10.800 --> 00:28:24.090
Matt Davis: when on call, we're often faced with learning opportunities; in fact, I would guess every single incident is a learning opportunity. And so when we're actively on call,
192
00:28:25.530 --> 00:28:35.760
Matt Davis: I often wonder about this: how hard is it to remember that this is a learning opportunity, this isn't just an outage?
193
00:28:36.600 --> 00:28:49.170
Matt Davis: This is an opportunity where our complexity has opened itself up for us to see inside it, and the view may be very narrow or it may be wide, but
194
00:28:49.650 --> 00:29:01.590
Matt Davis: those are the times that we can learn a lot about our coworkers, about our company, about how we organize our work, and about the technical aspects of the system.
195
00:29:03.090 --> 00:29:17.880
Matt Davis: Charles, have you run into anything like this, where you've faced, or have seen, responders or engineers faced with: do I learn, or do I get this fixed as fast as possible?
196
00:29:21.960 --> 00:29:23.010
Charles Cary: that's a good question.
197
00:29:24.630 --> 00:29:28.290
Charles Cary: I mean my general recommendation is mitigate as quickly as possible.
198
00:29:29.550 --> 00:29:32.280
Charles Cary: just for the sake of like minimizing customer impact.
199
00:29:32.370 --> 00:29:35.220
Charles Cary: Sure, in terms of learning.
200
00:29:36.780 --> 00:29:49.230
Charles Cary: I do believe you learn to be a better on-call by being at the incidents themselves. In terms of translating that into kind of deeper observations, right,
201
00:29:50.310 --> 00:29:52.980
Charles Cary: the place where I see it being very useful is:
202
00:29:54.390 --> 00:30:05.670
Charles Cary: I am a great believer in doing at least some of the on call for the software you write. And the reason is, I think, in those moments you kind of recognize where
203
00:30:07.860 --> 00:30:19.170
Charles Cary: maybe the software was inadequate. And, in particular, there's a lot of stuff that's easy to take for granted, because it'll get through quite a bit of testing but then fails in prod. Like,
204
00:30:19.920 --> 00:30:33.090
Charles Cary: just understanding the benefits of logging and other telemetry, building systems that handle exceptions in the general case, right, building things that are self-healing, building things that defend themselves, right. And
205
00:30:33.870 --> 00:30:38.880
Charles Cary: I think all of those kind of principles which aren't really necessary for like application correctness.
206
00:30:39.720 --> 00:30:49.260
Charles Cary: If the folks who wrote the application are never on call, one of the things is that the learning doesn't feed back
207
00:30:49.830 --> 00:31:00.660
Charles Cary: to the development process. And I find that that's one place where you can create a virtuous cycle, where development improves if folks do some of the on call, and they
208
00:31:01.650 --> 00:31:07.050
Charles Cary: become better at the engineering practices as well, because they think about how they will feel in that moment.
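The habits Charles lists, telemetry, handling exceptions in the general case, self-healing, often come down to wrapping the work loop so that a failure is logged with context and survived rather than killing the process. A minimal sketch, where process_job is a hypothetical stand-in for real application logic:

    import logging, time

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("worker")

    def process_job(job):
        # Hypothetical unit of work; stands in for real application logic.
        pass

    def worker_loop(jobs):
        for job in jobs:
            try:
                process_job(job)
            except Exception:
                # General-case handler: record full context and keep serving,
                # so the on-call sees what failed instead of a dead process.
                log.exception("job failed, continuing: %r", job)
                time.sleep(1)  # crude self-healing backoff

    worker_loop(["a", "b"])

None of this is needed for the application to be "correct" in testing, which is exactly Charles's point: it's the kind of code you only think to write after you've been paged for its absence.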
209
00:31:09.810 --> 00:31:13.260
Matt Davis: How about you, Yvonne? On learning while you're on call.
210
00:31:13.980 --> 00:31:25.650
Yvonne Lam (she/her): Oh, um, so I was going to take Charles's answer in a little bit of a different direction, which is that, you know, I've known a lot of devs who ended up getting put on call for a service that they wrote, and they
211
00:31:26.460 --> 00:31:34.650
Yvonne Lam (she/her): don't see the value in it. Like, you know, I think as ops people, as SREs, I mean, nobody likes being on call, right? It's a chore.
212
00:31:35.070 --> 00:31:45.600
Yvonne Lam (she/her): But for us, when you start a new job and then you get to be on call, it's not like, this is fun, but it's like, okay, I'm a member of the team, I'm pulling my weight.
213
00:31:45.990 --> 00:31:52.740
Yvonne Lam (she/her): I'm trusted enough that people think that I won't shut down some critical service. I mean, they might be wrong, but I am trusted enough.
214
00:31:53.160 --> 00:31:57.930
Yvonne Lam (she/her): Whereas, you know, one of the devs that I've worked with described being on call for
215
00:31:58.260 --> 00:32:09.180
Yvonne Lam (she/her): this sort of complicated distributed system as: he just felt like his job was being a sleep protector. You know, it was like giving the experts a few more hours of sleep, and he couldn't really do anything.
216
00:32:09.750 --> 00:32:17.610
Yvonne Lam (she/her): And, you know, I don't know that he was completely correct; I mean, I think he did more than he
217
00:32:18.870 --> 00:32:26.640
Yvonne Lam (she/her): thinks he did. But I do think there's something about, like, how do we make that experience of on call valuable to devs?
218
00:32:26.850 --> 00:32:45.330
Yvonne Lam (she/her): I think a lot of times what happens, too, is that there's stuff that's valuable to individual devs; sometimes there are some things you only learn by experiencing them, like having nice retry logic, for instance, stuff like that. Um,
219
00:32:46.470 --> 00:32:53.280
Yvonne Lam (she/her): you know, I think that's all good, but I think the harder thing is where the team has to fight for
220
00:32:54.390 --> 00:32:58.410
Yvonne Lam (she/her): something that would make their on-call lives better, that
221
00:32:58.410 --> 00:32:59.940
Yvonne Lam (she/her): makes for a better service.
222
00:33:00.240 --> 00:33:04.560
Yvonne Lam (she/her): And they have to do that learning together, I think that's often very hard for people.
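The "nice retry logic" Yvonne mentioned a moment earlier is a classic example of something learned on call: exponential backoff with jitter, so a struggling dependency isn't hammered by synchronized retries. A minimal sketch, with the flaky call left as a hypothetical stand-in:

    import random, time

    def retry(call, attempts=5, base=0.5):
        # Exponential backoff with jitter; re-raise once attempts run out.
        for i in range(attempts):
            try:
                return call()
            except Exception:
                if i == attempts - 1:
                    raise
                time.sleep(base * 2 ** i + random.random() * 0.1)

    # Usage with a stand-in for a flaky dependency:
    retry(lambda: print("ok"))

The jitter term is the part people tend to discover only after an incident, when every client retries on the same schedule and the recovering service gets knocked over again.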
223
00:33:07.710 --> 00:33:22.410
Matt Davis: That's a good point, because I've had that experience, too, with developers on both sides of the fence, where they didn't see the need for them to be on call: they shouldn't be on call, it's not their...
224
00:33:23.490 --> 00:33:28.110
Matt Davis: it's not the way they think, it's not the way they work, it's not the way they do their thing, it's not them.
225
00:33:29.460 --> 00:33:38.310
Matt Davis: And so it was really hard to get them to see on call in a different light. At the same time,
226
00:33:39.750 --> 00:33:43.050
Matt Davis: I've seen developers embrace it,
227
00:33:44.070 --> 00:33:46.980
Matt Davis: 100% and.
228
00:33:48.360 --> 00:33:50.970
Matt Davis: I think it's a really interesting question.
229
00:33:51.990 --> 00:33:58.290
Matt Davis: I forget who tweeted this, but it was some video of
230
00:34:01.140 --> 00:34:05.250
Matt Davis: a bunch of cops chasing a guy, and the guy had
231
00:34:06.420 --> 00:34:20.220
Matt Davis: run behind the van, and all the cops ran right by the van where the guy was hiding; the helicopter filming it knew exactly where the guy was, but all the cops ran around the guy.
232
00:34:20.700 --> 00:34:41.790
Matt Davis: And someone made a comment, like: oh, this is like watching newbie people on call when you know what the answer is, but you want them to discover how to fix it. And I'm wondering if that sort of thing ties into this, like, experiencing on call, but then,
233
00:34:42.840 --> 00:34:43.560
Matt Davis: How do you.
234
00:34:45.390 --> 00:34:46.590
Matt Davis: How do you make it.
235
00:34:47.880 --> 00:34:50.730
Matt Davis: exciting, how do you make it, you know.
236
00:34:52.080 --> 00:35:04.560
Matt Davis: an opportunity? How do we spin it on its head, and instead of it being something bad that happens, show that this is actually just normal?
237
00:35:05.850 --> 00:35:20.910
Matt Davis: This is just the system operating. The system is operating in a normal way, always on the brink of failure, and we just happened to catch the system right when it touched that brink, and so that's where we are.
238
00:35:22.170 --> 00:35:25.650
Matt Davis: I had trouble answering this question myself um.
239
00:35:27.090 --> 00:35:27.420
Matt Davis: I.
240
00:35:29.070 --> 00:35:29.670
Matt Davis: I don't know the.
241
00:35:30.030 --> 00:35:32.340
Kurt Andersen: It's like the way that a lot of people don't want to take that red pill.
242
00:35:34.650 --> 00:35:35.790
Yvonne Lam (she/her): And I think that.
243
00:35:36.930 --> 00:35:43.500
Yvonne Lam (she/her): I think there's also a very human desire to be done with things. Like, okay, I shipped my feature, it's done, right? Like,
244
00:35:43.770 --> 00:36:02.610
Yvonne Lam (she/her): I checked in my code change and it's done. I think it was, um, Liz Fong-Jones at Honeycomb who said something about how they found that three and a half hours is kind of their limit on situational awareness:
245
00:36:02.730 --> 00:36:04.230
Yvonne Lam (she/her): Somebody checks in a change.
246
00:36:04.530 --> 00:36:17.460
Yvonne Lam (she/her): and they haven't paged it out yet, so that if something happens in production they're like, oh wait, that could be me. And, you know, she said it in a tweet, so I don't have a reference off
247
00:36:17.790 --> 00:36:27.900
Yvonne Lam (she/her): the top of my head. But I do think that's very interesting; since I work on CI/CD pipelines, I do think that's something that, like,
248
00:36:28.500 --> 00:36:45.870
Yvonne Lam (she/her): I mean, it's not just the speed, but there's something about, you know: when do people have enough situational awareness around a change to think, oh, you know what, that could be my thing?
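One concrete way to act on the situational-awareness window Yvonne cites (whatever the right number of hours turns out to be for a given team) is to surface recent deploys alongside every page, prompting the "could this be my change?" question. A toy sketch with an invented deploy log:

    from datetime import datetime, timedelta

    # Hypothetical deploy log: (time, author, change description).
    DEPLOYS = [
        (datetime(2022, 3, 1, 11, 40), "dev-a", "bump cache TTL"),
        (datetime(2022, 3, 1, 13, 5), "dev-b", "new retry policy"),
    ]

    def suspects(page_time, window_hours=3.5):
        # Return deploys recent enough that the author likely still has context.
        cutoff = page_time - timedelta(hours=window_hours)
        return [d for d in DEPLOYS if d[0] >= cutoff]

    for when, author, change in suspects(datetime(2022, 3, 1, 14, 0)):
        print(f"{when:%H:%M} {author}: {change}")

Attaching that list to the alert moves the correlation work from the responder's memory into the tooling, which matters most exactly when the author's own awareness has already faded.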
249
00:36:49.440 --> 00:36:50.400
Matt Davis: it's a good question.
250
00:36:51.450 --> 00:36:56.910
Matt Davis: Because I think that changes from person to person; it's a very subjective thing, to have
251
00:36:58.230 --> 00:37:06.750
Matt Davis: situational awareness, which is a troublesome term to begin with: which situation are we being aware of, right?
252
00:37:08.340 --> 00:37:16.320
Matt Davis: I've never heard the three-and-a-half-hour thing, but that makes a lot of sense. Yeah: okay, we finally deployed this,
253
00:37:17.190 --> 00:37:29.700
Matt Davis: I have another ticket, I'm going to close this ticket, I'm going to move on to my next ticket, I'm going to pull another ticket from the backlog. I can't be thinking about this while I'm trying to get into my next new piece of work.
254
00:37:33.270 --> 00:37:51.810
Kurt Andersen: And something I read the other day pointed out that being on high alert for issues is fatiguing; just maintaining that sense of awareness and alertness to the risk of failure is something that requires
255
00:37:53.130 --> 00:37:55.020
Kurt Andersen: effort and.
256
00:37:57.810 --> 00:38:04.800
Kurt Andersen: It's easy to understand. I think it may have come up in the context of one of the talks last week from SREcon,
257
00:38:06.120 --> 00:38:08.730
Kurt Andersen: where, um,
258
00:38:10.230 --> 00:38:21.090
Kurt Andersen: one of the managers for the Microsoft Teams organization was talking about the load on the teams for on call when the pandemic started.
259
00:38:21.870 --> 00:38:30.390
Kurt Andersen: And all of a sudden, all the schools in the world were piling into Teams for virtual classrooms.
260
00:38:30.780 --> 00:38:49.020
Kurt Andersen: And they had had rotations that were a week long, and people were getting burned to a crisp by that, and so they ended up having to go to one-day rotations for on call, just because it was so much of a burden on the people.
261
00:38:50.100 --> 00:38:51.570
Matt Davis: wow I.
262
00:38:54.840 --> 00:38:58.800
Matt Davis: It makes me think of the burden of on call
263
00:39:00.090 --> 00:39:03.600
Matt Davis: translating into the burden of.
264
00:39:04.710 --> 00:39:07.050
Matt Davis: feeling like you have to be the one to fix the problem.
265
00:39:08.190 --> 00:39:08.430
Kurt Andersen: hmm.
266
00:39:12.060 --> 00:39:17.310
Matt Davis: I struggle with this all the time as a responder.
267
00:39:18.510 --> 00:39:43.020
Matt Davis: Um, and I think I struggle with it because I do have an area of expertise, and I really, genuinely want to be able to fix it. Um, but it kind of goes back to what Charles was saying before: at what point do you hit the escalation button?
268
00:39:44.520 --> 00:39:47.280
Matt Davis: And you know when do you feel like you've.
269
00:39:48.660 --> 00:39:54.840
Kurt Andersen: The escalation button... I think part of the question there, though, is the escalation button is not an eject button.
270
00:39:55.200 --> 00:39:57.120
Kurt Andersen: It is not a defeat button.
271
00:39:57.510 --> 00:40:04.710
Kurt Andersen: And if it feels that way then it's going to cause people to be unwilling to use it.
272
00:40:06.720 --> 00:40:18.450
Kurt Andersen: And I notice that our time is running short here, so I don't want to have the last word on this, but I think it's important to not view escalating as abdicating responsibility.
273
00:40:20.040 --> 00:40:21.510
Matt Davis: Charles, you're nodding.
274
00:40:22.830 --> 00:40:25.320
Charles Cary: Absolutely right, and I think that um.
275
00:40:27.540 --> 00:40:42.540
Charles Cary: "Escalate" means "I'd like you to help me solve this problem" is usually how I portray it to folks. Now, if it's ongoing, there needs to be release at some point; it doesn't mean you're tied to it forever. But there is some amount of, you know,
276
00:40:43.680 --> 00:40:50.760
Charles Cary: when you bring another person in, you're now working with them and continuing to work it, right. And I think that's only fair to whoever you're going to.
277
00:40:52.050 --> 00:40:56.070
Charles Cary: And that's also, going back, one of those opportunities to learn.
278
00:40:56.640 --> 00:41:02.100
Charles Cary: Right, in the sense that, even if you can't really contribute meaningfully to the debugging anymore,
279
00:41:02.430 --> 00:41:14.910
Charles Cary: that's how you can actually see, you know, things to learn. And so escalation, even if you're taking a backseat in terms of the ongoing effort, is usually worthwhile, at least until the shift runs out.
280
00:41:15.540 --> 00:41:27.390
Yvonne Lam (she/her): And something I would add, too, is that I don't think we should be hard on ourselves about wanting to solve the problem, because all of tech culture is about being the person with the answers, right? Like,
281
00:41:27.690 --> 00:41:32.280
Yvonne Lam (she/her): there's no way we can just turn that off in our brains when we're on call,
282
00:41:32.550 --> 00:41:35.400
Yvonne Lam (she/her): Even if we know that it's not a good idea.
283
00:41:37.080 --> 00:41:39.840
Matt Davis: You're absolutely right. We
284
00:41:40.920 --> 00:41:45.990
Matt Davis: want to be the hero, we want to be the person who figured it out, we want to...
285
00:41:46.380 --> 00:42:01.680
Matt Davis: You know, we want to show that we did something on this on-call shift: I couldn't work on feature work all week or all day or whatever, but hey, look what I learned, look what I fixed, and I was on call, and I did that.
286
00:42:02.190 --> 00:42:04.590
Yvonne Lam (she/her): I did that at three in the morning, all by myself.
287
00:42:07.290 --> 00:42:12.660
Matt Davis: Exactly! Oh yeah, no one knew about it, I didn't have to escalate at three in the morning, haha.
288
00:42:13.620 --> 00:42:29.670
Matt Davis: Well, I wanted to thank everyone for joining. Thank you, Charles; thank you, Yvonne. It's been a wonderful time talking to you. This conversation has sparked some ideas and some moments of learning for me, and I really hope it's done the same for you all.
289
00:42:30.180 --> 00:42:31.650
Yvonne Lam (she/her): absolutely all right.
290
00:42:32.160 --> 00:42:34.500
Kurt Andersen: Thank you very much, thank you all. It was a great conversation.
291
00:42:34.860 --> 00:42:35.400
Thank you.