The Blameless Podcast

Resilience in Action E12:

Being Born from Google, Dependency Management, and Sustainable Teams First with Steve McGhee

January 13, 2022

Kurt Andersen

Kurt Andersen is a practitioner and an active thought leader in the SRE community. He speaks at major DevOps & SRE conferences and publishes his work through O'Reilly in quintessential SRE books such as Seeking SRE, What is SRE?, and 97 Things Every SRE Should Know. Before joining Blameless, Kurt was a Sr. Staff SRE at LinkedIn, implementing SLOs (reliability metrics) at scale across the board for thousands of independently deployable services. Kurt is a member of the USENIX Board of Directors and part of the steering committee for the world-wide SREcon conferences.

Steve McGhee

Referenced in the Podcast - Building Reliable Services on the Cloud by Phillip Tischler with Steve McGhee and Shylaja Nukala

Kurt Andersen (00:03):

Hello, I'm Kurt Andersen. Welcome back to Resilience in Action. Today, we are talking with Steve McGhee, who has spent over 10 years as an SRE within Google, learning how to scale global systems across multiple product areas within Google. He's also managed multiple engineering teams around the world, in California and Japan and the UK. And he currently works as a reliability advocate, helping teams to understand how to build and operate world-class reliable services. But Steve, you haven't always been at Google. And I think that our listeners might be interested in understanding how your experiences at a smaller company inform your perspective on reliability.

Steve McGhee (00:52):

That's right. Thanks for having me, Kurt. Really happy to be here. Yeah, I tend to joke that I was born at Google because I kind of went to Google right out of college. And so like I was sort of born in space in this weird place that is nothing like anything else. And then at one point I left. I was like, see you later. So I moved back to California and I joined this smaller company. I wouldn't say it's like a small company, but it was definitely smaller than Google at the time. And kind of did the whole modernize, helped move them onto the cloud from on-prem data center, doing things the old-fashioned way. And man, I learned a lot just about what the old-fashioned way was. I honestly had no idea because I was born on the spaceship. So coming back to kind of the way things were really done was a really good education for me. And so I was able to then try to translate how Google did things and how cloud works to this existing rightly complicated system and figure out how to make these two worlds match up. It was a challenge for sure.

Kurt Andersen (02:04):

So just for context, can you give us an idea of how big the company was? Like how many engineers were with the company when you joined for this cloud transformation?

Steve McGhee (02:15):

Sure. Engineering I think was around 100 people. So not a tiny startup. In fact, the company's older than Google, funny enough, which is pretty amazing. Started off writing software in a box on a CD or maybe floppy, I don't know, I think a CD. But yeah, so when I joined, it was about 100 people. It was a SaaS provider and it was very much in the ship it to the data center and then there's the other data center where DR lives, and that was kind of it, like that was the distributed system. So it wasn't exactly a monolith, but it was approximately a monolith. And so, but they were very aware of new ways of doing things and they were actively trying to transition into this new model.

Steve McGhee (03:08):

And so that's where I kind of came in, was I was sort of like just sort of outside talent, if you will. Like how would you go about this problem? I kind of had carte blanche to bring in new technologies and new ways of doing things, but we still had to make it work with the existing system and processes that were in place. And turns out that was the tricky part. It wasn't really the technologies. It was the getting all the stuff to work that was already there with the new stuff as well. So that was pretty great actually.

Kurt Andersen (03:36):

So functioning in a hybrid mode while you went through the transition, is that sort of where the challenge was?

Steve McGhee (03:42):

Yeah, yeah, exactly. So coming up with a migration plan or running multiple things in parallel, there were a couple of M&A companies that we had acquired. They were at different levels of maturity as well. So you hear the joke a lot about multi-cloud, like why would you even multi-cloud. And it turns out lots of times you don't choose to. It just gets foisted on you because you acquired a company who did this other thing and had a different upbringing than you. And so now you've got these two worlds you have to manage under one corporate umbrella. And in our case, I think we had seven. So it was pretty gnarly. Like we had four different cloud providers at one point and that's pretty awesome. So that was a big challenge.

Kurt Andersen (04:33):

A lot of fun there, I'm sure. What were the reliability practices at the time? Because with that many cloud providers scattered around doing bits and pieces, how are the teams keeping it reliable?

Steve McGhee (04:49):

So it was very much like an ITIL ITSM shop there. So IT service management is this whole thing. There's a little yellow book you can buy that explains it all to you. ITIL is on version 4, I think or something like that, 4.1, who knows, I don't really keep track, but there was the corporate side. So there was like service desk. And then there was like the production side, which was the NOC or a SOC. I think they call it sometimes systems operation center and being able to kind of feed all of the signals into one room that had a bunch of screens on the wall. And then having people who are in that room have the ability to run certain commands, basically like when this light turns on, push this other button over here. It was obviously more complicated than that, but that was the generic model was route everything through this one room that has locked doors. So that was kind of it.

Kurt Andersen (05:47):

And so what interest, what brought them to the point of bringing you in? I mean how did they decide they needed to make the transition to a more modern approach? Because it sounds like they had bits and pieces scattered around already.

Steve McGhee (06:01):

Yeah. Oh yeah. So the other part of that is there was these acquisition companies and some of them were kind of pulled into this traditional model and some weren't. So as you, I'm sure are aware, lots of acquired companies get sort of left alone. They continue to operate as they were operating before, but they just sort of share the revenue or whatever, but they kind of do their own thing. So there are some companies, or some of the sort of sub companies were doing that. And some of them were kind of doing the opposite where they were clearly operating at a better level than the parent company.

Steve McGhee (06:37):

And so sort of like, well, how do we take the best from each of these and incorporate them into one operational model? And that I think was just kind of coincidental with when I started. I don't think I was brought in to be reliability czar in any way. It was very much like I came in as an infrastructure guy because they wanted to know how to do Kubernetes and packages and containers and stuff. And then I was like, by the way, there's this SRE thing. And they're like, tell us more about that. What is that? So this was a while ago. This was kind of before SRE got really big. It was like immediately post book, if that's a good touch point.

Kurt Andersen (07:15):

Okay. Yeah, so five years, few years ago, then the book was out. People were reading it and starting to grok the idea or at least parts of the idea.

Steve McGhee (07:24):

That's right. I think I remember, I think I bought six or seven copies and just brought the book in and just handed them out to whoever would take one. And then I don't know if they got read, but there was interest for sure of this weird new thing, so yeah. It was a good place to start.

Kurt Andersen (07:41):

So speaking of books or book-like things, you are one of the co-authors of a new O'Reilly publication called Building Reliable Services on the Cloud. And there will be a link in the show notes for people to access that. This is sort of like a mini book, certainly compared to the SRE tomes that have come out. But if you had to boil down your mini book into just two or three key takeaway ideas, where would you start?

Steve McGhee (08:14):

Yeah. So apparently it's called a report. This is the term of art. It's a small, it's like it's 50-ish pages actually. I don't know what it actually comes out to in the printed form. I guess we'll have to see. And to be clear, Phil is the main author. He wrote a huge amount of it. He was able to bounce things off of me and I provided some kind of other angles to think about things and some diagrams and stuff like that, but we worked closely on it, but he had the structure in his mind the whole time, which is pretty awesome. But I would say the main to me, at least as I'm not the primary author, but sort of the observer of what Phil was writing and what I would get out of it if I were reading it for the first time, I hope that what people would get out of it is that SRE, or, sorry, not SRE, but reliability sort of more generic than SRE is not a thing that you can really just bolt onto an existing system.

Steve McGhee (09:19):

And it's really more about like how the system is designed and understanding a lot of these kind of weird details that you would hope you wouldn't have to care about, but in fact, they're really important. So if you just build a system in a vacuum and you're like, now let's make it reliable, you're in trouble. So the whole point of this report is to show you how to design a reliable system, which implies that you're designing something that doesn't exist to begin with. So a lot of these learnings, we can apply to existing systems, but you can't just ignore the stuff under a certain level. You do have to go in and change some things that you were hoping to ignore, but realize we actually can't.

Kurt Andersen (10:02):

So somewhat of the problem of leaky abstractions then?

Steve McGhee (10:06):

Yeah. It turns out most abstractions are leaky and even when they're really clean, they leak out in ways that you don't even notice. They leak out on the bad days instead of the normal days. And then this is like a, it's not a new concept. I was actually, this is kind of a side sort of tangent, but I was looking at an old video internally the other day. And it was from a Google thing from, I don't know, 10, 15 years ago. And during that time, someone from the audience is like, "Hey, what's the deal with reliability? How come our data centers are only 99% available? We want our services to be more available than that."

Steve McGhee (10:53):

And then the sort of the heads of state that are sitting up at the front are like, "Yeah, that's right. You need to use more than one data center." It's like those data centers aren't going to get more reliable. You need to be okay with that and you need to find ways to work around that. And I think that that's a message that I've been trying to help lots of companies with right now, which is like, please don't ask for a more reliable data center because it won't happen. What you need to do instead is accept that there is no spoon. Accept that a single data center, whether it's a zone or a region, or like a physical building that you own yourself, isn't going to get more reliable and you probably shouldn't try to invest in making it itself more reliable.

Steve McGhee (11:35):

And instead you should use many of them and make the system itself, the greater system, account for the inherent unreliability in a single data center. And that's a message I try to get across a lot right now. And it was just kind of fun to see this in a video from inside Google from 15 years ago, where the person asking the question was asking the exact same question from today but just in this internal system, so.
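To put rough numbers on the "use many of them" point, here is a minimal sketch of the arithmetic. It assumes data centers fail independently and that the service is up whenever at least one copy is up; both are simplifying assumptions for illustration, not claims from the conversation, and the function name is hypothetical.

```python
def combined_availability(single_availability: float, copies: int) -> float:
    """Availability of a service that is up when at least one copy is up.

    Assumes independent failures, which real data centers only approximate.
    """
    if copies < 1:
        raise ValueError("need at least one copy")
    # The service is down only when every copy is down at the same time.
    probability_all_down = (1 - single_availability) ** copies
    return 1 - probability_all_down


if __name__ == "__main__":
    for copies in (1, 2, 3):
        print(copies, combined_availability(0.99, copies))
```

Under these assumptions, two 99% data centers give roughly 99.99%, and three give roughly 99.9999%. Correlated failures (shared software, shared configuration pushes) make real numbers worse than this sketch suggests.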

Kurt Andersen (12:04):

Right, and do you feel like the message is getting across? I mean, do you still see people asking that question inside of Google or have they pretty much assimilated that understanding?

Steve McGhee (12:16):

Yeah. I mean, not really, not at that level of detail, sorry, not at that layer of abstraction, if you will. No one talks about single data centers inside of Google. They're abstracted away enough where you really just kind of choose what level of availability you need or expect, and the system will kind of give it to you. So if you don't care how reliable it is, yeah maybe we'll put it in one data center for you. But you don't really talk about the geography of it. You really talk about the attributes that you want out of the service and the system will kind of do it for you. And when I say the system, it's like a sociotechnical system. So it's like there's some people involved as well too, but I just mean like the hand wavy system.

Kurt Andersen (13:04):

Okay, good to know that there are still some people involved that it is sociotechnical and it's not just the-

Steve McGhee (13:11):

AI is not involved. It's not a futuristic space alien technology or anything like that. It's still humans. There are spreadsheets, occasionally, things like that. So, yeah.

Kurt Andersen (13:24):

You still have the people in the tank from Minority Report underlying the system, right?

Steve McGhee (13:31):

No comment on the tank people.

Kurt Andersen (13:35):

So what other point might you want to make from building reliable systems?

Steve McGhee (13:41):

So yeah, the rest of the book, or sorry, not the book, but the report, sorry, because it's smaller, is that this reliability stuff, you can't obviously bolt it on later. It's not pixie dust that you can just buy at the store, but it is actually engineering. So the way I think about it is in order to perform engineering, you have to kind of know what the problem is. You have to kind of understand the state of play. You kind of have to know what you have available to you. What's in my machine shop? What tools and parts do I have and then what trade offs am I going to make? So engineering at the end of the day is like applying science to problems and making these trade offs.

Steve McGhee (14:21):

So a good example of this is like, can I have it be more available or more correct, but slightly slower. Is that okay? If I can make this trade off, then sweet. I now have an ability that I didn't have before. And so the simplest example here is retries. So if something breaks the first time and I just silently retry it again, and then it works, to the user it was just a little bit slower, but it was good. So this is just the most simple form of reliability engineering. And so the report has a bunch of this stuff in it, which is like, what are some of the things that you would actually do ahead of time? What would you design into the system in terms of capabilities that would allow you to make these trade offs over time? Sometimes it's complexity as well. You're just making the system more complex in order to make it more available. And that in itself is also a cost of course.
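The retry trade-off described here (a little slower, but it works) can be sketched as follows. This is an illustrative sketch only: `TransientError`, `call_with_retries`, and `make_flaky` are hypothetical names, and the backoff-with-jitter policy is a common convention rather than something specified in the conversation.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical error type marking failures that are worth retrying."""


def call_with_retries(operation, max_attempts=3, base_delay=0.05, sleep=time.sleep):
    """Run operation, silently retrying transient failures.

    Each retry waits longer than the last (exponential backoff plus random
    jitter), trading extra latency for a higher chance of success.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))


def make_flaky(failures_before_success):
    """Build a stand-in operation that fails a fixed number of times first."""
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError("temporary failure")
        return "ok"

    return flaky, state


if __name__ == "__main__":
    operation, state = make_flaky(failures_before_success=2)
    print(call_with_retries(operation), "after", state["calls"], "calls")
```

The cost side of the trade-off is visible in the sketch: every retry adds latency and load, which is why the attempt count is capped.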

Kurt Andersen (15:16):

Okay. So one of the things that people struggle with on this trade offs spectrum is understanding what's okay enough, I'll say. So to go back to your retry example, it's going to be a little slower because it's done one or more retries, but eventually it gets an answer for them. And part of the art of SLOs talks about this, trying to distinguish happy from unhappy customer experiences, how do you figure that out?

Steve McGhee (15:55):

Science. No, I mean, kind of like I say that tongue in cheek, but I'm kind of serious.

Kurt Andersen (16:00):

Explain that one.

Steve McGhee (16:01):

The best thing you can do is, or the first thing you can do is kind of your intuition. If this is something where people are waiting for the answer so that they can do something in the real world, then maybe latency is more important than something where you're just back filling a cache for something to happen later. There's huge orders of magnitude between those things, like user waiting versus asynchronous. Let's just separate those for starters. When we get into really closer and closer, you have to kind of consider is it eyeballs on the screen? And if so, how fast is human perception?

Steve McGhee (16:37):

So at a certain point, there's hugely diminishing returns. So the difference between 300 milliseconds and 40 milliseconds is extremely difficult in terms of the actual computery stuff and makes zero difference to the human. Don't do that one. If it's a human waiting to click a button and they can't perceive the difference in latency, just leave it out. There might be other reasons for, because it's part of a chain of events or something like that. But if it's purely for a human to interact with, consider the distance from the eyeball to the brain is real. There's transmission speed there. So you're not going to control that latency. So think of the whole, whole system. And even tighter systems though, like when it's the input to another system, which is to another system, which another blah, blah, blah, blah, blah, then yeah, you're getting down to really narrow kind of windows where happiness is in the short latency windows.

Steve McGhee (17:33):

Then when I'm talking about science, what I mean is like trying to introduce latency and having a proxy metric that allows you to determine if it's too much. It's super important, so you can develop like tranches of customers or internal customers. You can perform experiments essentially. So you can say if we double the latency here, do we get more support tickets or do we get people abandoning their carts? Or do we get some other proxy metric that says it is getting bad. People are unhappy. And it doesn't have to be across the entire board. It's just the people in that tranche, in that experiment group. So is their payment throughput rate lower because the site is harder to use? Then in that case, it sounds like we found the number.

Steve McGhee (18:20):

Let's not make it that slow. Let's now bisect it. Let's go halfway between A and B and do it again. And so you don't have to get the number right the first day. That's kind of the most important thing is that you can work your way towards what you think is the right number. And then you have to remember that that number that you've decided is not going to be right tomorrow. Something's going to change in your products. Who knows, like there's going to be some new type of customer that's out there, or you're going to be promoting a new widget in your system or something like that. So it's always going to be changing. So the important thing here is to have the ability to perform this science and to keep doing it. So making sure that you are not just kind of like falling back to old patterns, because I don't know, it looks pretty green on the dashboard, so it must be fine, but actually find reasons to believe that your customers are happy. Constantly prove it to yourself.
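The "bisect it" loop described above can be sketched like this. It assumes you start with one latency known to be fine and one known to hurt, and that some experiment can tell you whether a cohort is unhappy at a given latency; `cohort_is_unhappy` is a hypothetical stand-in for that experiment (support tickets, cart abandonment, or another proxy metric), not a real API.

```python
def find_latency_threshold(low_ok_ms, high_bad_ms, cohort_is_unhappy, tolerance_ms=25):
    """Narrow down the highest latency a cohort tolerates by bisection.

    low_ok_ms: a latency already known to keep the cohort happy.
    high_bad_ms: a latency already known to make the cohort unhappy.
    cohort_is_unhappy: callable taking a latency in ms and returning True
        when the proxy metric says the experiment group is unhappy.
    """
    while high_bad_ms - low_ok_ms > tolerance_ms:
        midpoint = (low_ok_ms + high_bad_ms) / 2
        if cohort_is_unhappy(midpoint):
            high_bad_ms = midpoint  # too slow: search the faster half
        else:
            low_ok_ms = midpoint  # still fine: search the slower half
    return low_ok_ms


if __name__ == "__main__":
    # Simulated cohort that starts complaining above 430 ms (made-up number).
    print(find_latency_threshold(200, 800, lambda ms: ms > 430))
```

In practice each call to the experiment takes days or weeks, and the answer drifts as the product changes, so the result is a number to revisit on a schedule rather than a constant.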

Kurt Andersen (19:18):

That's an interesting perspective to undertake a constant re-validation of the things you're assuming to be true.

Steve McGhee (19:27):

At least periodic, it doesn't have to be every minute of every day. As long as you put down a schedule, like every six months we're going to revisit this or something like that, that works too, but you don't want to just throw the dart once and be like, that's it for eternity. We're going with 400 milliseconds because who knows?

Kurt Andersen (19:43):

Yeah. Maybe you had some skew that day, and so your number isn't even all that great from that one experiment too.

Steve McGhee (19:49):

Exactly, yeah.

Kurt Andersen (19:51):

So you talked about this, the principles in this report as applying to the way you ought to build systems, but then you also made a really brief allusion to trying to retrofit and kind of apply them to existing systems and learn from where you're at. How would you suggest people tackle that because that seems easier to do it coming out of the gate with a greenfield implementation.

Steve McGhee (20:19):

It certainly is. Yeah, it's definitely easier said than done. One example of this that's kind of touched on a bit, and it's one of the harder things to talk about in the book, is dependency management. And so if you are, sorry, report, I keep saying book, but it's actually a report. There's a difference.

Kurt Andersen (20:40):

Fair enough, okay.

Steve McGhee (20:41):

The process of dependency management seems academic and somewhat trivial and inconsequential at first, if you're not down in the depth of it. And then-

Kurt Andersen (20:55):

By dependency management, can you expand on what that is for listeners that might not be familiar with the term?

Steve McGhee (21:01):

Perfect example is the Log4j thing that's going on or has been going on the last couple weeks. So this is a temporal thing and podcasts are forever, I know. But if you depend on a library that is written by someone else and is managed by someone else, and it is the best and you should definitely use it and it will save you a bunch of time, and then later someone's like, oh, by the way, it's broken. Or it can cause this major problem, either it's a performance regression, or in this case, there was a security vulnerability. Dependency management is the ability to know that you depend on that thing and be able to change your dependence on that thing, or the ability to know how that thing fails. In the case of the Log4j thing, it wasn't a reliability problem. It was a security problem. But you could imagine, instead of a security problem, let's say the Log4j problem was that like one in 1,000 requests that go through this code path would just corrupt the data or it would just crash the server or something like that.

Steve McGhee (21:58):

That would be a very similar problem in terms of the response, in terms of let's all upgrade everybody at the same time, but like the detection would be very different and its impact on your business is probably going to be a lot different as well. It would be a lot more kind of clear potentially. So if you had the case where your particular architecture hits one set of these bad servers somewhere, and every request goes through them, and one in 1,000 gets dropped, then all of a sudden, you're potentially losing an entire nine in one day or something like that. That would be bad. But on the other hand, if you designed for something like this, by spreading out your load and you have all of your requests go through many different copies of these or many versions of these, or maybe they have what we call like a soft dependency on this type of thing, then either you're going to have zero impact, or you're going to have a much more distributed impact of this type of failure.

Steve McGhee (22:53):

So the key here is the concept of a soft dependency. So if you can make it so the request coming in the front door spins off sub-requests to go to databases and to other services. And they say like, "Hey, give me this and give me that. And like what color should I make this button?" And so on and so forth, a lot of these are kind of optional. They make the experience more rich or easier on the user or something. But if you get no answer from that service or you get an old answer from that service, it might still be okay. And in those cases, being able to what we call degrade is really, really important to the stability of the overall system. And so finding as many of these as you can within your system is highly encouraged.
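A soft dependency of the kind described here can be sketched as a call that falls back to a default instead of failing the whole request. The names `soft_call`, `render_page`, `load_user`, and `load_button_color` are hypothetical, chosen to echo the button-color example; a real service would usually also put a timeout on the optional call, which this sketch leaves out.

```python
def soft_call(fetch, fallback):
    """Call an optional dependency; on any failure, degrade to the fallback.

    Returns (value, degraded) so the caller can record that it degraded.
    """
    try:
        return fetch(), False
    except Exception:
        return fallback, True


def render_page(load_user, load_button_color):
    """Build a response with one hard dependency and one soft dependency."""
    user = load_user()  # hard dependency: if this fails, the request fails
    color, degraded = soft_call(load_button_color, fallback="gray")
    return {"user": user, "button_color": color, "degraded": degraded}


if __name__ == "__main__":
    def broken_color_service():
        raise ConnectionError("color service unavailable")

    print(render_page(lambda: "alice", broken_color_service))
```

Returning the `degraded` flag alongside the value matters: it is what lets degraded responses be counted separately from fully healthy ones.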

Steve McGhee (23:39):

So the analogy falls apart, because Log4j is not something that you would actually have a soft dependency on, but you could imagine something like this. So being able to understand how dependencies work to begin with, and then being able to unravel them to the point where you can say like, well, this one, we don't really need it. If we turn it off, it should be fine. So an example of this inside of Google was that Google search searches the whole internet. So let's say for some period of time, half of the internet, or some portion of the internet was just unavailable, like the index was unavailable, should we still return results? And the answer is yeah, probably. We can't return all the results. The answer might be in this half of the internet that is not available to the index at the time. But the other half still is, so let's still return some results and maybe you even add a little indicator, like this is a partially degraded result. So it might not be perfect right now. And that's way better than it just being like, oh Google's down, sorry. Turn it off. So the partial degradation is an important characteristic of a highly reliable system.

Kurt Andersen (24:50):

Okay. Out of curiosity, do you count those periods where it's partially degraded against the reliability in some fashion? Do you charge it 50%?

Steve McGhee (25:01):

I mean, yeah. I mean, it depends on your objective, I guess. So the way that we suggest people think through this is like initially yeah, sure. If you don't know how often you're degrading, then definitely count it. You should be aware of how often this is happening, but count it in two different buckets. So be like how often are we returning results at all? Is that high nines? Are we doing great? Then how many times are we returning complete results? Is that also high nines? Or are we most of the time being degraded? Then maybe that's not good. Maybe it's okay. I don't know, but measure it first and be able to distinguish between degraded or not. And then when you get really clever, you can say how degraded are we?

Steve McGhee (25:41):

So in the case of like an index for the web, like what portion of the index was able to return a response before we rendered a result to the customer, like 99%? Did we actually hit all of the, imagine the web index is across 100 MySQL databases, which it's not. But did you get a response from all 100 shards? Then in that case, your ratio is good. It's 100%. Was one of the shards rebooting? 99%, still pretty good. But record that as a separate SLI. And that way you can know over time, our availability was good, but our quality dropped a little bit and is that okay? Maybe. It depends again on kind of your business needs.
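The two-bucket idea (availability as one SLI, result quality as a separate one) can be sketched as below. The `QueryOutcome` record and the function names are hypothetical, and the 100-shard figures simply mirror the imagined example in the conversation.

```python
from dataclasses import dataclass


@dataclass
class QueryOutcome:
    """Hypothetical per-request record for an index sharded across backends."""
    answered: bool          # did we return any result at all?
    shards_responded: int   # shards that replied before we rendered
    shards_total: int


def availability_sli(outcomes):
    """Fraction of requests that got any answer."""
    if not outcomes:
        raise ValueError("no outcomes recorded")
    return sum(1 for o in outcomes if o.answered) / len(outcomes)


def quality_sli(outcomes):
    """Average share of shards behind each answered request."""
    answered = [o for o in outcomes if o.answered]
    if not answered:
        return 0.0
    return sum(o.shards_responded / o.shards_total for o in answered) / len(answered)


def degraded_fraction(outcomes):
    """Fraction of answered requests served with at least one shard missing."""
    answered = [o for o in outcomes if o.answered]
    if not answered:
        return 0.0
    return sum(1 for o in answered if o.shards_responded < o.shards_total) / len(answered)


if __name__ == "__main__":
    sample = [QueryOutcome(True, 100, 100)] * 3 + [
        QueryOutcome(True, 99, 100),   # one shard was rebooting
        QueryOutcome(False, 0, 100),   # no answer at all
    ]
    print(availability_sli(sample), quality_sli(sample), degraded_fraction(sample))
```

Keeping the numbers separate is what makes it possible to say "availability was good, but quality dropped a little" instead of folding both into one figure.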

Kurt Andersen (26:26):

Yeah. No, I think the visibility is the key there and that seems to be an area where a lot of people don't have the visibility that they need in order to answer the questions well.

Steve McGhee (26:39):

Yeah, it's a really kind of a slippery slope when you first get into SLOs to try to cram everything into this one golden SLO. This is the indicator for customer happiness. It's fast, it's correct. It's available. It's got a nice coat of paint on it. All the things in one number. And I mean that, okay, you can track how many moments of perfection we have, but when something becomes imperfect, in which way is it imperfect? That's actually important when it comes down to mitigations and preventions and things like that. So having more visibility is pretty important in that respect.

Kurt Andersen (27:19):

Nice. So since building reliable systems in the cloud or services in the cloud is a report and is short or shorter than a full book, what are some of the things that you would have liked to have included but didn't have space for or time?

Steve McGhee (27:37):

Yeah, this is a funny story. Originally, we were planning on writing like a full size, kind of like another like the SRE book or the SRE workbook, like a big book. And so this was actually going to be one chapter in that whole thing, or maybe a couple, or we were going to kind of figure out how to break it up. And so we had all these other chapters we were planning as well. And so the section that I was kind of in charge of was, we were calling it, putting it all together, and this was one of the putting it all together. So this is sort of like the breadth first part of the book or the potential book I should say. And so, like other things that we were going to talk about was like, how does YouTube work?

Steve McGhee (28:17):

Let's just talk about it end to end. How do uploads work and processing. And then we can talk about CDNs and we can talk about cache filling and we can talk about transcoding and there's all this stuff. And it's actually a really good example of like a modern warehouse-scale consumer internet business. Because there's a lot of stuff going on. It's not just like a simple eCommerce widget shop. There's a whole bunch of things happening. And then things like stuff that I didn't go into too much, but other people were working on was like, how do warehouse data centers work? Like the plant, like what's going on and how much steel do you need?

Steve McGhee (28:58):

And what do you do for like fire suppression? And what's the water for? Stuff like that. It turns out that's super, super complicated. And so there was a lot of depth that we could go into in different directions. And so this was sort of like the, I don't know, the top of the pyramid in terms of like, this is the entry point into lots of other chapters. And you can imagine talking about load balancing, like CAP theorem, and we had a paper on just time, like how do we deal with time in terms of time zone and clock drift and daylight savings and all of this stuff. It turns out that in a global-scale distributed system, time is not straightforward. It's actually pretty complicated.

Kurt Andersen (29:48):

Yeah, I believe it was Leslie Lamport that said that there's no one so miserable as a person who has two clocks, not to mention 2,000 of them or 2 million of them. Right?

Steve McGhee (29:58):

That's right. I think anybody who can cite Leslie Lamport is like a pal of mine. The Lamport clocks are kind of my kind of eye opening moment in computer science and in college. And if you're into distributed systems and you haven't read stuff from Dr. Lamport, I highly recommend it. Even the old stuff, the stuff from way back then, it's still super relevant. It's pretty amazing.

Kurt Andersen (30:24):

Oh yeah. Yeah, it's all excellent stuff. So tell me a little bit about what you've learned in managing teams, because right now you're working kind of as a consultant, at least that's what I gather a reliability advocate does. What did you learn about the other side of the table from the individual contributors when you were sitting in the manager's chair?

Steve McGhee (30:52):

Oh, that's a good question. Yeah, I was an SRE manager for seven or eight years and I think at one point I had like four or five different teams, so like 20 to 40 SREs in my different groups. And they were very different. Obviously the individuals were very different, but the teams were very different, how they all interacted. And it was really, I thought really fun. I thought it was really great, kind of getting to know everybody and seeing how different people handled the weirdness of this role in different ways. And so one case in particular was there were some people, and this was kind of early in SRE, who just did not like being on call. And I get that.

Steve McGhee (31:49):

I don't like being on call either, but I also kind of do like it. I'm kind of weird. I'm kind of good at it. I can do it. It's not that bad. But there were a couple individuals where they're just like, look, I tried it. It's not for me. And at the time we didn't have a terribly progressive view on this. We're like, well, if you're an SRE, you're on call. This is just the thing. And at a certain point, we kind of went like, well, I mean, but does it have to be? Does it have to be the thing? Do you have to be on call to be an SRE? How about we revisit this? Because SRE was big enough that it's not like we were suffering from not enough people.

Steve McGhee (32:25):

I mean there's always not enough people, but it's not like there was nobody available to be on call. And so what we did was we just kind of like were open to the idea, like okay, no problem. Especially if it was a medical reason or just like a lifestyle change or if someone was just being grumpy and they were just like, I don't want to be on call today. That's not a great excuse, but if it's like, no, really for me, the on call's not going to happen, but I can still contribute to the goal of making Google a reliable place, then yeah, we have a place for you for sure. That was sort of the change in tone.

Steve McGhee (33:00):

And I'm not saying like I kicked this off or anything like that, but I was kind of there when that change started to happen and a few people on my teams benefited from it, and it was pretty great. One of the things that I thought was kind of funny in retrospect, or maybe it's obvious, but to me, it was neat was when we became a lot more explicit about on call pay and basically this coming back to the on call thing, which was like if you don't want to be on call, you're not going to get like an on call bonus or whatever. And in the past it was sort of like hidden, like some teams would do it, and some teams would do it differently, and different parts of the world would do it in different ways.

Steve McGhee (33:40):

And really what we did was we just kind of published everything. We made everything consistent and we kind of like took the best practices from each team. And even the tools we used and things like that, how do we track these things and how do we perform shift trades and how do you know if someone's been on call too much and they're going to burn out and things like that. And just making this all a lot more consistent across all the teams really kind of made things a lot more clear and people had a lot less anxiety about being on call because they didn't feel like it was this sort of like a dark box that was a secret like, oh, I got to be on call again.

Steve McGhee (34:15):

I'm not sure if it's going to be a bad week and I know it's going to happen again. And it was just like, it developed anxiety in people, and just making it a lot more transparent took a few folks who didn't like being on call and got them to say actually, I didn't mind being on call, like getting a page in the middle of the day and fixing it. It was actually pretty great, but it was like I had this anxiety of like, I didn't know when I was going to be on call. And I didn't know if next quarter I was going to have to be on call twice as much and like, I didn't know if it was worth me doing this through the weekend, but now that I know how, I can see graphs and see how often everyone's on call.

Steve McGhee (34:49):

And I know we have these two new team members joining us soon and they're going to make it look like this and blah, blah, blah, blah, blah. Just adding more visibility into the planning system, as a technical engineering manager, is really, really helpful for your team. It's just like show them how the system works. And have them have a little bit of autonomy and input into the system and let them have some visibility into how the sausage is made a little bit more. And it just helps a lot for sure. Because it's a stressful job. And you want to kind of limit that if you can.
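
As one way to picture the kind of visibility described above (seeing how often everyone is on call and spotting who has been on call too much), here is a minimal sketch that totals on-call hours per person from a shift list and flags anyone over a limit. The names, the shift data, and the 40-hour limit are all hypothetical; the podcast does not describe the actual tools or thresholds Google used.

```python
from collections import Counter
from datetime import datetime


def oncall_hours(shifts):
    """Sum on-call hours per person from (person, start, end) tuples."""
    totals = Counter()
    for person, start, end in shifts:
        totals[person] += (end - start).total_seconds() / 3600
    return totals


def over_limit(totals, limit_hours):
    """Return, in sorted order, the people whose total exceeds the limit."""
    return sorted(p for p, hours in totals.items() if hours > limit_hours)


# Hypothetical schedule: 12-hour shifts over three days.
shifts = [
    ("alice", datetime(2022, 1, 10, 8), datetime(2022, 1, 10, 20)),
    ("bob", datetime(2022, 1, 11, 8), datetime(2022, 1, 11, 20)),
    ("alice", datetime(2022, 1, 12, 8), datetime(2022, 1, 12, 20)),
]

totals = oncall_hours(shifts)
flagged = over_limit(totals, limit_hours=20)  # assumed limit, for illustration only
```

Publishing a summary like `totals` to the whole team is the transparency step; the flagging is only a prompt for a conversation about shift trades, not an automated decision.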

Kurt Andersen (35:27):

Cool. It sounds like it improved your empathy for the people who are answering the pager.

Steve McGhee (35:33):

Totally, yeah. And the other thing that we benefited from was just the fact that Google's pretty big. At one point we were able to say like, no one gets woken up anymore, as much as possible. You have two teams, one is on one side of the world and one is on the other side. And so that way, you just don't have a shift that goes through midnight or whatever, and it's not so much that it was like dedicated, like these are the hours that you're allowed to be on call. It's like just have two teams and amongst your two sets of people, just figure it out. You guys can make up your own minds. And some teams did like noon to midnight just was their expected time. And some teams did eight to eight or whatever it was.

Steve McGhee (36:16):

It was up to each team and that worked out really well, too. So again, it was giving the teams some empowerment, but also providing the guidelines around, like you don't want to burn yourselves out. So try to do this. And here's the tool link to allow you to see who's on call when, and this team over here tried this way and it didn't work for these reasons and this team did it this way and it did work for these reasons. So one thing that I did do, which was, I think, unique in kind of a bad way, it sounded good but ended up not working, was we did have one team that was like a triumvirate, I don't know. We had three sites, so it was San Francisco-

Kurt Andersen (36:56):

Follow the sun kind of a thing around the world.

Steve McGhee (36:59):

Yeah. It was San Francisco, Tokyo, London. And we each had eight hours and it was like nine to five. The time zones work out amazingly well between those three cities. And it sounded awesome and it totally didn't work. So the being awake was great. Like when you get the page, you're in the office and it's wonderful, but it turned out that the communication between three teams tended to route through one of the teams. And so you didn't get the circular communication between the teams. You kind of got out and back and then out and back, and then no, the third leg didn't really talk very much for whatever reason. And because the one team was like central and sort of took over and so there were some bad patterns there.

Steve McGhee (37:45):

But we really kind of found out that because of that bad communication, it turned out that those two, I don't know, out there, like the non-central teams suffered because they weren't able to get the context of the on-call shift very well either. And so there were handoffs, kind of managed through some tools, but it just wasn't as rich as it would be if it was just a pairing of two teams, because in a pairing, you only have one party to tell all of your context to. And so it's hard when you hand off some context to the next team, but then that team doesn't re-hand off that context to the next. It's this strange explosion in complexity that just kind of didn't work. So that was unfortunate.

Kurt Andersen (38:33):

So once you go to three teams, you end up with the game of telephone going on. I mean, it was part of it. I mean, you just didn't get everything passed along. Interesting, interesting observation. So would you be, I presume you know Charity Majors and she's been an advocate for people taking this sort of pendulum in their careers and taking some time as a manager and then if they want, taking some time as an IC, what do you think about that concept? Is that a reasonable one or is that sort of a corner case?

Steve McGhee (39:09):

I think it's reasonable, especially I think it's really good advice for those who are a little unsure about which way they're going or that they feel stuck or they feel on some sort of like pathway that they have to stick with. Google has been pretty good about, especially within SRE, at least has been pretty good at like, you don't have to be a manager to get promoted. You can just keep getting promoted on the IC track and making that very clear, and you kind of have to do more than just lip service to that. You have to have people who have shown that they've done it and then you kind of have to escalate, or not escalate, but elevate those people a little bit and be like, look, this is that person we were talking about.

Steve McGhee (39:55):

And look, they're doing some cool stuff and this is their name and you can go talk to them and you can ask them questions. So just by eliminating the perceived need to go into managership and saying there is another path, I think that's kind of like one step. The other step is kind of what I think what you're alluding to, which is kind of what I've done, which is I did the managers thing for a while and it was fine. And then I stopped. I haven't been a manager for five, six years maybe. Part of that was because I left and I was at a different company, but when I came back, I'm an IC now. Totally different job. But yeah, I know how the manager thing works.

Steve McGhee (40:34):

So when I talk to my manager, I'm like, I know all the tools that they're using and blah, blah. It's pretty funny. But having empathy for other roles that you interact with is super, super important. That said, if you have no interest in being a manager, I wouldn't feel like you should be forced to try it for a while. Especially if you know that you wouldn't like it. Because it is a very different job. And the problem with doing a job that you wouldn't like, when it's a manager job, is it's not just you that gets affected. You will then be like passing on the pain to other people who are depending on you. And that's no good. I definitely suffered from burnout right before I left Google. I was managing too much in a space I didn't really get terribly well and my team suffered, and I feel really bad about that.

Steve McGhee (41:33):

So I wish I had kind of seen the light sooner and gone, hey, maybe I need to step back. What I did was I left the company, which was like a very aggressive move.

Kurt Andersen (41:44):

That's the reason you're stepping back, yeah.

Steve McGhee (41:46):

I had other reasons for doing it too, but that's not the only move. You can just step back in other ways like the world will not end. So knowing that the track is there is super, super important.

Kurt Andersen (42:03):

Good, okay. So as we draw to the end of our time, are there any particular takeaways that you would want people to remember from the conversation?

Steve McGhee (42:16):

I would say like, well when looking through the report, it's like a breadth first view of all these systems and how you would design and operate these systems. But I think the really high level view that I think kind of comes across is like, you need to be able to know where you are in terms of how your system is running and what you're actually expecting out of it. And so this is where observability and SLOs come out of. This is just like orientation, like knowing where you are in the woods, then you got to know where you want to go. Are we trying to get to the top of the mountain or are we just trying to get out of the woods? Which is it?

Steve McGhee (42:55):

So being able to know those two things is super important. And then being able to know what you're capable of, and not just like you as a person. Do I have grit or not or anything like that? But I mean what is your system able to do? Can you perform rollbacks quickly? Or can you like drain from one data center to another data center without affecting users? What are the things that you can do today? And have you done them? Do you know they work and can you do them quickly? Because the important thing when it comes to reliability is, or not the most important thing, but one of the more visible things is being able to respond in times of crisis.

Steve McGhee (43:34):

And so in order to respond in times of crisis, you have to kind of like know where you are, know where you're supposed to be and be able to know how the system works underneath you. So if you know the ways in which it fails and you know the methods that are available to you to mitigate or to improve the system while it's shifting underneath you, then you're going to have a much better day.

Steve McGhee (43:53):

You're going to be able to recover from a strange event or a bad push or like even like a hurricane or something like that. When these things happen, you'll be able to step in and move quickly and mitigate failure as soon as you can. And with these high nine systems like speed matters actually. So you want to be able to be back online as soon as possible if you're talking about a system that makes millions of dollars a second or something crazy like that, or is responsible for dispatching 911 calls. These are important systems that we don't hand out five nines for no reason. They should be out there for important reasons. So when your system is down and it's not fixing itself, you do want to be quick if you can. So that's where the nines come in, turns out.
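
To make the point about the nines and speed concrete, here is a small worked calculation of how much downtime a given number of nines leaves per year. The function name and the 365-day year are assumptions chosen for illustration; real SLOs are usually measured over shorter windows and against requests rather than wall-clock time.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def downtime_budget_seconds(nines: int, period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Allowed downtime for an availability target of `nines` nines.

    Three nines means 99.9% availability, five nines means 99.999%.
    The unavailable fraction is 10 ** -nines.
    """
    return period_seconds * 10 ** (-nines)


for n in (3, 4, 5):
    budget = downtime_budget_seconds(n)
    print(f"{n} nines: about {budget / 60:.2f} minutes of downtime per year")
```

At five nines the whole yearly budget is a little over five minutes, so a single incident that takes ten minutes to detect and mitigate already spends it twice over. That is why being able to roll back or drain quickly matters so much for these systems.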

Kurt Andersen (44:43):

Okay, awesome. Well, thank you again, Steve, for joining us on Resilience in Action today.

Steve McGhee (44:51):

Thanks for having me.

Kurt Andersen (44:53):

And yeah, for folks that want to check out the report that he and Phil wrote, you can look at the landing page for this podcast and we'll have a link there to get your very own copy of Building Resilient Services in the Cloud. And we'll also have a link there in general to the other resources that Google SRE and the CRE teams have made available for helping you to understand reliability engineering. So thanks again, Steve.

Steve McGhee (45:22):

Thanks, Kurt.

Kurt Andersen (45:23):

And I will cut off the recording, cool.
