Thought Leadership Panel: What is a "real" SRE?

Blameless recently had the privilege of hosting SRE leaders Craig Sebenik, David Blank-Edelman, and Kurt Andersen to discuss how SREs can approach work-as-done versus work-as-imagined, how to define SRE and DevOps and the complementary nature of the two, the ethics of purchasing packaged versions of open source software, and more.

The transcript below has been lightly edited, and if you’re interested in watching the full panel, you can do so here.

Amy Tobey: Hi everyone - I'm Amy Tobey, and I'm the staff SRE at Blameless. I've been an SRE and DevOps practitioner since before those names existed, as have my guests here. I'm excited to be joined by Craig Sebenik, Kurt Andersen, and David Blank-Edelman. They've all lived and breathed SRE for decades, written books, and I'm excited to get into our discussion. I'll have you each introduce yourselves in alphabetical order.

Craig Sebenik: My name is Craig Sebenik. I am an SRE at Split Software. Split is a small startup in Redwood City. We build a feature flagging and A/B testing framework. I've been at a number of startups, and I was at LinkedIn for a number of years. Just like Amy mentioned, I was doing SRE before it was called SRE, back when it was called systems engineering or even sysadmin, so great to be here.

David Blank-Edelman: I'm David Blank-Edelman. I'm a cloud advocate at a small company called Microsoft that you may have heard of. I have been in the operations field some 35-odd years in one of these capacities. I still consider myself a sysadmin after all this time. I'm the curator and editor of the book Seeking SRE and a cofounder of the SREcon conferences. I'm delighted to be here.

Kurt Andersen: I'm Kurt Andersen, and I'm an SRE in our product SRE organization at LinkedIn. We broadly divide our SREs into three groups: product, data, and infra. I'm in our product group, which covers everything that our users touch.

Amy Tobey: Thank you, everyone. Let's get started. We discussed what we want to talk about in this panel and Kurt brought up something that's very near and dear to my heart, which is the work-as-done versus the work-as-imagined model. I feel like this is a great way to talk about SRE work because it's a tool we can use to look at SRE as well as the work of the people around us so we know what to do for them. With that, Kurt, could you give us a quick overview of what SRE work-as-done is and then I'll ask Craig to talk about coding through that lens.

Kurt Andersen: Sure. Work-as-done versus work-as-imagined is a concept that comes out of postmortems or incident retrospectives, and a learning-from-incidents mindset where one undertakes an exploratory journey to understand, as best you can, what actually happened as opposed to what you think happened. But this applies not just to the postmortems of incidents. It also applies, for instance, to planning a new project. If you want to implement something to make your SRE teams’ lives better, it's really important to understand what those lives actually consist of before you try to make them better, not just what you hypothetically want them to be.

One of the areas where this comes to the fore is understanding toil. If you, as a senior director or VP of SRE, have this theoretical concept of SRE as 50% engineering and no more than 50% toil, that's a great construct to have in mind. But if that is your imagination and it doesn't align with the way that work is actually being done by the teams, you are setting your teams up for burnout and frustration, and yourself for frustration when you can't align with what is actually hurting the teams and costing them time, effort, and brain space.

Amy Tobey: Another angle is that a lot of people see SRE work as including a large component of coding. I know Craig had something he wanted to bring up around that. Let's align ourselves to the problem as Kurt suggested, the automation and toil problems in front of us, and how solving those problems shapes the expectations placed on us.

Craig Sebenik: I've been in a number of startups and hiring is always a problem. One of the complications has been defining SRE. Before you do that, you have to define what DevOps is. From that, you can get to what you are trying to hire for. The first thing you have to do is figure out whether you want a narrow or broad definition of SRE. If you want a relatively narrow definition, you go back to the original idea of SRE where Ben Treynor said it is essentially what happens when you take a software engineer and give them an ops role. With that in mind, you want to solve as many of these problems as you can with software as opposed to people.

Amy Tobey: One of the things that's been fascinating me over the last few years is how many people have an SRE title but don't come from software, like people who come from systems administration, NOCs, and the less traditional paths, and how those folks come forward into this vision of SRE as coder in an ops job. We have ops people in ops jobs that are, hopefully, learning to automate or doing more of it.

David Blank-Edelman: From my experience and from the experience of having people on my team, one of the best sysadmins that ever worked for me was a linguist by training. My experience was that, back in the day we were not quite walking upright and we were still realizing, "Hey, it is important to do some automation." Even if we're not getting the software engineering background, per se, we are still writing code of some sort that is helping us do repetitive activities— "Okay, I'm tired of doing that. I'm going to make an alias that's called blah, and it's going to just do the thing, every time I have to do it.”

Then, later on we moved into things like configuration management where we were attempting to represent our environment or how to repeat the creation of parts of our environment using stuff that looked a lot like code, even if it was not really code. Maybe it was a DSL of some sort.

When we started doing that, we're like, "Okay, so we're starting to build out all these configurations." Then we're like, "Oh, wait a second, I'm copying and pasting these two things. Why am I doing that? This one should include that one." Next we were dragged kicking and screaming into software engineering principles, using them wherever possible because we realized this just doesn't work otherwise. Even in the early days of Perl, one of the selling points was that it was for people who are lazy. That was meant to be a good thing. We didn't want to do these repetitive tasks.

It was laziness, impatience, and hubris, I think, from Larry Wall; the basic notion is that we've come closer. For me, the journey into SRE and the journey into SRE code came exactly from this direction. Then at some point, we're like, "Oh, this source control system, that's kind of important. We could figure out how to use that to keep our stuff, because maybe we want to know who changed what and why or when." I think that people have come in through this.

Then as we started to, in the cloud world, think about things like distributed systems, or when we started doing reliability work where we had to deal with somebody else's code, we had to know enough of whatever that language was to be able to read the code.

At the moment, I don't consider myself to be a software engineer per se, independent of what Ben says. I certainly have picked up as much as I could; I just don't do it professionally. I have so much respect for software engineers that I don't call myself one, but I have, for sure, tried to learn everything I possibly could from that field.

Amy Tobey: It sounds like what we're really looking at is bringing software engineering practices to our approach to toil, automation, and infrastructure. One thing you didn't mention that this reminds me of: the Puppet and Chef communities really brought testing to the forefront. Maybe there's some different gauge than just writing code in a particular language.

Craig Sebenik: One thing I want to highlight that David said was, essentially, you have to define what coding means. Do DSLs count, or the classic old stuff (old Puppet, Chef, CFEngine) or the newer stuff, Terraform, et cetera? Again, it kind of depends on where you're at and how you would want to define them, this idea of narrow definition or broad definition, and how much you want to include normal software engineering practices.

A big one that I run into a lot: when you want to unit test, say, Terraform, it can be complicated. However, if you use a higher-level language, for example Python or Go, unit testing is a little more straightforward. Again, it depends on how much you want to incorporate those basic ideas into what you're doing versus relying on the tool itself, a.k.a. the Puppets, Chefs, and Terraforms of the world.
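
As a rough illustration of Craig's point, logic pulled out of a DSL and into a general-purpose language becomes trivially testable. A minimal sketch in Python, assuming a hypothetical naming helper (none of these names come from the panel):

```python
import unittest

def bucket_name(team: str, env: str) -> str:
    """Hypothetical helper that derives a storage bucket name.

    Keeping naming rules in plain Python (rather than inside a
    Terraform module) means they can be unit tested directly.
    """
    if env not in ("dev", "staging", "prod"):
        raise ValueError(f"unknown environment: {env}")
    return f"{team}-{env}-data".lower()

class BucketNameTest(unittest.TestCase):
    def test_lowercases_output(self):
        self.assertEqual(bucket_name("Payments", "prod"), "payments-prod-data")

    def test_rejects_unknown_env(self):
        with self.assertRaises(ValueError):
            bucket_name("payments", "qa")

if __name__ == "__main__":
    unittest.main()
```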

Kurt Andersen: Well, I think that's mostly a question of the degree of rigor. The principles of version control, peer review, code review (in some fashion), unit testing, staged release—all of these things build rigor in the team. You can practice them individually. It's hard to do peer review individually, but you can do something short of peer reviewing. But you can do all the other things individually if you choose to. I don't know that that is necessarily what we're describing as coding. I think you can write in a DSL. Perhaps writing the code that interprets the DSL is more what we would think of as software engineering. But it doesn't take long before you find with any DSL that it needs to be extended because it didn't cover your use case. That harkens back to the joys of free and open source software, right?

David Blank-Edelman: I think it's not just coding. From software engineering, I also get things like algorithmic analysis. I want to be able to look at a piece of code and figure out how much space it's going to use, how much memory, how much CPU. Is it quadratic time? Big O notation is actually useful for being able to at least reason about things like this.
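
To make the big O point concrete, here is a contrived example (not from the panel) of the same question answered two ways, one quadratic and one linear:

```python
def has_duplicates_quadratic(items: list) -> bool:
    # O(n^2): compares every pair. Fine for tiny inputs, painful
    # when the input grows to hundreds of thousands of entries.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicates_linear(items: list) -> bool:
    # O(n) time, O(n) space: trades memory for CPU. Being able to
    # see this tradeoff at a glance is the point of the notation.
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```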

Similarly, the stuff I get from the software engineering world, and from what I've had to pick up along the way, is around failure modes. Sysadmins know about failure modes. They see them all the time. But after a while, you start to see classes of failure, you start to understand these things, like what happens when you have a cascading failure. You start to know what happens when you have thundering herd problems, where everything gets turned on at the same time and all the devices or systems attempt to make the same request, and that knocks you over. I think that's where there is value in getting into that place, even if you're not going to be building large-scale systems, because these large systems we build, we're in them as much as we are making them. Understanding them is crucial.
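
A standard defense against the thundering herd David describes is randomized jitter on retries, so clients that failed together don't all come back together. A minimal sketch, illustrative only; `operation` is any callable you want to retry:

```python
import random
import time

def retry_with_jitter(operation, max_attempts=5, base_delay=0.5, cap=30.0):
    """Retry with capped exponential backoff plus full jitter.

    Without the random jitter, every client that failed at the same
    moment retries at the same moment, re-creating the herd.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller handle it
            delay = min(cap, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))  # full jitter
```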

Craig Sebenik: I was just thinking that some of the simplest parts of software engineering, for example documentation, are things that historically got skipped. When I first started doing sysadmin work, you'd write a 20- or 30-line Perl or Bash script and maybe document the more complicated pieces, but otherwise it didn't happen. It was like, it's just going to live for a few days, right? So why would I bother documenting?

If you look at one of the core concepts of the Python world, you read code far more than you write it. You make sure that your code is readable. Especially coming from Perl, that often means writing something that is potentially not the Perl way or the Pythonic way of doing it, but you write it so it's simple to understand. That is probably more valuable than writing some really cool regular expression that, six months later, you literally have to take 20 minutes to figure out.
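
A contrived before-and-after of Craig's point (hypothetical code, not from the panel): both functions parse `key=value;key=value` strings, but only one will be obvious in six months.

```python
import re

def parse_pairs_clever(line: str) -> dict:
    # Dense and clever; six months from now this is the one-liner
    # you spend 20 minutes re-deciphering.
    return dict(re.findall(r"(\w+)=([^;]*)", line))

def parse_pairs_readable(line: str) -> dict:
    # Longer, but every step states its intent.
    pairs = {}
    for field in line.split(";"):
        if not field.strip():
            continue  # tolerate trailing semicolons
        key, _, value = field.partition("=")
        pairs[key.strip()] = value.strip()
    return pairs

assert parse_pairs_clever("a=1;b=2") == parse_pairs_readable("a=1;b=2")
```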

Amy Tobey: That's always been my approach to comments. I leave them for the forgetful me. It's not like, "What did I do?" It was like, "What did I intend to do?"

Craig Sebenik: That's another good point, about what you intended. When David mentioned cascading failures and edge cases, those are essentially the cases you didn't consider. It's "Oh, I wrote this code to do this," but in practice it doesn't really do that. It's like, "Oh, right, I forgot about this weird case when somebody sends an integer as a string." Stupid stuff like that.

Kurt Andersen: I'd say that also loops back, to a certain degree, to documentation and work-as-done versus work-as-imagined, because I end up spending a lot of time reviewing architecture documents that are essentially pre-implementation designs. One of the big problems we have is that those don't ever get updated to reflect the as-built system. If you think about blueprints for a building, there is the design upfront, and then there is the as-built version afterwards, reflecting the changes made along the way.

Amy Tobey: It's evolved over time because very few buildings stay static, just like people.

David Blank-Edelman: I also have had really cool conversations with people like Theo from Circonus, in which he suggested that, in addition to a README, we should be putting an ethics.md in our repositories, in which we try to speak not just to what the thing is and what we intended, but also, as best we can, to document what some of the ethical considerations might be around that code or the thing we're running. This could include what some of the ramifications are and who it might affect. I would love to see more of us in the field pick that up as well and understand that what we do has impact. It would be darn great if we could try to document that.

Amy Tobey: Craig and Kurt, do you run into any of that in your work?

Kurt Andersen: It's certainly an important component of doing Internet standards work. I've encountered something like that there. I don't know that writing up an ethical considerations statement for a Terraform construct is a terribly great use of time. I'm thinking about some of the metric systems that I'm involved with right now and the adoption that we're endeavoring to achieve across disparate product groups. Even there, some of the aspects of the measures and the SLIs could have ethical complications, but to define the latency of request response into a database, that seems like a bit of a stretch.

David Blank-Edelman: How about what goes into that database?

Kurt Andersen: The datasets, definitely. The datasets have their own data definitions and are defined in a data dictionary of sorts.

David Blank-Edelman: How about that training that has to go on for an AI system?

Kurt Andersen: That is an area that definitely involves a lot of assumptions that are worth documenting. I'm not saying it's completely irrelevant, but that it probably doesn't apply across the board universally.

David Blank-Edelman: Yeah, I agree. I'm certain that not every little piece of code has to have an ethics statement, but I guess I'm saying that it wouldn't be bad if we, for example, even in a Terraform config were to say, "This is specifying a region there. If this region goes down, the following consequences will happen." I mean, there are consequences to where we choose to build and what infrastructure we allocate to. It just wouldn't be a bad idea to be thinking about this a little bit, even if it's the start of that thought.

Amy Tobey: Maybe in our corporate gigs, the ethics angle isn't as useful but, in open source, I guess the ethics would seem a lot more generally applicable.

Kurt Andersen: Major choices like where to deploy a data center or to turn down a data center have driving forces that are well beyond just the software level. They have risks, cost, taxation, and all kinds of other factors that go into choices like that. To the extent that we can document those, I think the documentation is good because then when things change over time, you can go back and see these were the assumptions or the ground rules that went into the choice that was made three years ago. Now, we can see that three of the five conditions have changed, so it no longer makes sense to have X, Y, Z.

Amy Tobey: It's some small comfort when, at three o'clock in the morning, you're trying to fix a system and you discover that and go, "Why? Oh, that's why. Okay." To take that a little further, I actually had a direction for us to go, and I feel like this is a good segue: what do you think of this growing trend of taking all the undifferentiated lifting we're doing in our organizations and moving it off to infrastructure as a service or software as a service? We're in this situation, with the way the economy is behaving in the United States, where people are having to make decisions about which SaaS tools to keep on their budget sheets, and there are ethical implications to that.

Craig Sebenik: We're a small company, and we provide a SaaS service. Therefore, we use a bunch of SaaS services as well. One of the biggest single consumers of my time is managing all those SaaS vendors. Right now, we have several dozen, which, for a small company, is a lot to manage, especially when the costs are potentially related to the amount of traffic. As we add more customers, the cost that we have budgeted for various SaaS vendors also goes up, and not necessarily in any kind of obvious relationship. It's not linear. Some will plateau; some will grow exponentially. It all varies. Then obviously, the reason for going with a SaaS vendor is that you're trying to balance letting somebody else do X against whether X really solves the problem you're trying to solve.

The reason why companies go with SaaS is they don’t want to hire to do that themselves, essentially the build versus buy argument. Going with SaaS means, "Okay, we don’t have to build this." However, again, we have these costs that don't increase linearly as our company grows. They might grow logarithmically which would be great, but sometimes they grow exponentially, at least for a period of time. So managing that is a nonzero amount of time.

Amy Tobey: As we move forward, we're going to have to make decisions and some of these things are going to be really critical to our business. So how do we balance all that as SREs, especially considering how low-level we are in terms of integrating those services and the cost of un-integrating them.

David Blank-Edelman: I can speak a little bit to this, but I want to make sure that my biases and my conflicts of interest are totally out there. I'm from a large provider of those services. Obviously, there's some value to us when people consume them and pay us for it. So, setting that aside—and I don't always agree with everything that Microsoft does, obviously—one of the things I'm pretty interested in is the work they're starting to do around sustainability, because there are ethical choices in terms of how much you consume. It's a very interesting conversation to have because often, the way to get to better sustainability is to consume less.

As SREs, as we think about this, it may be the case that we're going to pivot a bit, especially in this time, towards things like performance analysis or cost optimization. Can we reduce our spending? Can we reduce what we're consuming? These are activities around scaling, but they also help with reliability, optimization, and sustainability, because you realize, "Oh, okay, we didn't need to run that all the time."

To meld your ethics-plus-SRE question, one place where this stuff touches is how much we consume, and our capacity as SREs to help our companies and organizations determine that and reduce it where possible, or increase it if it makes sense to take your stuff out of your data center and give it to a cloud provider which, in theory, runs more efficiently than your data center because of its larger scale. That may be a positive thing. Again, conflict of interest noted.

Kurt Andersen: I think one of the things you touch on there, David, in raising sustainability as a consideration, is how many choices are involved. SREs tend to be biased toward building, I think, without factoring in considerations of cost and risk, especially total cost of ownership or of operation. The typical software development lifecycle is very waterfall-biased, and it talks about developing the software in its very name. It says nothing about the cost of running that stuff forever, or hypothetically forever.

We need to have a much more comprehensive view of the software lifecycle, be less focused on the development phase, and more concerned with the total lifecycle and total consumed effort over the lifespan of a given product. It's a mindset change that takes some educating: to get people to think about software from the point where it’s conceived, until the point where it's decommissioned and the code is deleted from your code base. That is the life cycle. You need to figure out what it takes throughout that whole time, including all the patch changes along the way.

Craig Sebenik: One of the other ethical considerations to weigh is the R&D cost. One of the things you see right now, no offense to David, is that the big cloud providers' customers want more and more services to be offered. The providers offer them, which puts them in competition with the original creators. One of the biggest places you see this, at least from my perspective: Amazon recently announced its managed Kafka service, yet Confluent, the original creator of Kafka, has its own service. They're competing. AWS is using Confluent's and LinkedIn's R&D over the past decade to create a service. If you're an AWS customer, even though it might be easier to use the hosted Kafka from AWS, is it more ethical to use Confluent? And there are a bunch of pricing considerations.

Amy Tobey: If you go back to the roots, it's kind of where they started, with Xen and Linux.

David Blank-Edelman: "Are my hands clean?" to quote, Sweet Honey in the Rock. It's a really good question. I don't even know how to begin. I think you can take potshots at Microsoft around this too, for sure.

Kurt Andersen: You could raise the question, as an SRE, if you happen to work for one of these companies, should you be supporting the use of free software as a packaged service to sell to other companies, or what's the responsibility there as far as contributing back to the open source software movement?

David Blank-Edelman: I would say two things. One is that that ship has kind of sailed. I think that's true, but it does lead me to an interesting question about SRE when it comes to packaged software. We now have a lot of SREs in situations where we have a development group and we're building our own software, but we were certainly in the other realm for a long time: "Here's the thing I got off the shelf. I'm an SRE. Keep it reliable." That's a challenge that I don't think the SRE world has thought very much about, probably because the origins of these practices are from companies that had the ability to spin up large groups to build what they needed along these lines.

Kurt Andersen: There have been a number of talks at SREcon over the last couple of years about SRE for commercial, off-the-shelf software.

David Blank-Edelman: So Kurt, can you share any of the ideas that came out of those talks? I have seen them myself, but what do we tell the people on this call who are like, "Yeah, I'm thinking of doing the SRE thing, but my company runs X"?

Kurt Andersen: The general approach we've found useful is to wrap said software with some sort of framework that allows you to get useful metrics out of it. Measuring things is usually not something that commercial off-the-shelf software has been really strong at, in terms of exposing useful APIs for measuring its behavior. So, putting it inside some sort of package that gives you those measurements seems to be the most common approach, like pairing it with a sidecar such as Istio or Envoy, so that you can run it, get some meaningful intelligence about how it's functioning, and tear it down and spin it back up on demand.
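
In the same spirit as the wrapper approach Kurt describes, here is a toy sketch in Python using the prometheus_client library to put metrics around a black-box binary. `./vendor-tool` is a made-up name, and real sidecars like Envoy do this at the network layer rather than around a process:

```python
import subprocess
from prometheus_client import Counter, Histogram, start_http_server

CALLS = Counter("blackbox_calls_total",
                "Invocations of the wrapped binary", ["status"])
LATENCY = Histogram("blackbox_latency_seconds", "Wrapped-binary latency")

def run_wrapped(args):
    """Run the off-the-shelf binary, recording latency and exit status.

    The binary itself exposes nothing; the wrapper is where the
    observability lives.
    """
    with LATENCY.time():
        result = subprocess.run(["./vendor-tool", *args], capture_output=True)
    CALLS.labels(status="ok" if result.returncode == 0 else "error").inc()
    return result

if __name__ == "__main__":
    start_http_server(9100)  # Prometheus scrapes metrics from :9100
    run_wrapped(["--healthcheck"])
```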

David Blank-Edelman: I think that's exactly the first place I go as well. I was thinking about, "Okay, well, that's cool. That gives you the ability to know about your reliability, but how do we, as SREs, make an impact on that reliability?” Sometimes, you can stick a caching layer in front of things, and that can help. But to me, it's really tricky because it's all of the same ilk. Like, congratulations, you bought a piece of software, you have it well monitored, you know it sucks. Now what?

Amy Tobey: This has me thinking about the DBA and DBRE communities, because that's been their whole life, with a few exceptions. DBA is database administrator, which is probably what it's been called since back when we were all sysadmins. DBRE, database reliability engineering, is the SRE equivalent of database administration. You had me thinking of that because you get MySQL, right? It's gotten a lot better, but you still get the stuff that was imagined by the MySQL developers. Then you have this really long possible feedback loop.

Kurt Andersen: MySQL is open source. If you went to something like Oracle or SQL Server, MS SQL, whatever it is, you've got this hermetically sealed box with very limited knobs and metrics coming out of it.

David Blank-Edelman: Or somebody else's SaaS system or PaaS system.

Craig Sebenik: When others are dependent on us, it becomes this whole series of dominoes.

David Blank-Edelman: How do you handle a situation where you don't necessarily have the ability to directly influence the reliability of the core thing that you are depending on, which is kind of part of our purview?

Craig Sebenik: One of the big differences in being a SaaS provider, as opposed to the conversations around, say, Jira: if you've implemented Jira locally rather than through the cloud vendor, how do you make it reliable so all your teams don't go down when you upgrade? The big difference for a SaaS provider is that we have a bunch of software sitting between all these providers and what we provide to customers. What we do is add caching layers. How do you distribute the data? How do you fail gracefully when one of these providers fails? Our onboarding of a new SaaS vendor, even for somebody as small as we are, is relatively rigorous. We go through things like: what do you do in case of failure? What happens?

One of the big concerns we have is security. Since we take customer data and store some of it, we go through this relatively formal process. It pushes everybody up. We ask vendors a bunch of questions about reliability and redundancy, and then, to come full circle, we go back to product and say, "Hey, what happens if this vendor is not available in this piece of our service?" Obviously, if AWS goes down, we're screwed.

But if this vendor goes down and it takes down this piece of a service, how important is that to our customers? That starts a bigger conversation: if it's really important, then we add more layers of protection within our software to at least provide the read path, maybe not the write path, but at least the read path, so the customers can continue to operate.
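
A minimal sketch of that read-path protection, assuming a hypothetical vendor client that raises ConnectionError during an outage (all names here are illustrative, not Split's actual design):

```python
import time

class ReadPathFallback:
    """Serve stale reads from a local cache when the vendor is down."""

    def __init__(self, vendor_client, max_staleness=300.0):
        self.vendor = vendor_client   # hypothetical vendor SDK
        self.cache = {}               # key -> (value, fetched_at)
        self.max_staleness = max_staleness

    def read(self, key):
        try:
            value = self.vendor.get(key)           # happy path
            self.cache[key] = (value, time.time())
            return value
        except ConnectionError:
            value, fetched_at = self.cache.get(key, (None, 0))
            if time.time() - fetched_at <= self.max_staleness:
                return value                       # degraded but available
            raise                                  # too stale to be safe

    def write(self, key, value):
        # No write path during an outage: fail fast rather than
        # silently dropping or reordering customer data.
        return self.vendor.put(key, value)
```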

David Blank-Edelman: Okay, Craig, I've got one last question for you: how did you learn what questions to ask your vendors?

Craig Sebenik: To be blunt, a lot of that came from LinkedIn. When you see things scaled to large sizes, you see weird edge cases. There are things that you just never thought of. To be honest, there are questions that continue to pop up. One of the big areas I had not focused on a lot in my career is security. If you look at LinkedIn, minus the payments piece, most of the security concerns are relatively light. But at my current company, some of our customers are big banks, for example, which obviously have major security concerns. Those security questions continue to evolve.

I think one of the biggest things you can do is try to take lessons from the big companies, not because they're necessarily doing things right, but because they've seen all kinds of weird things. But don't take them wholesale. Take what they say, figure out what applies to you. Take those pieces and continue to evolve them. Don't remain static. Don't say, “These are our questions forever and ever.” Every couple of months, just review them and say, "Do these questions still make sense?" especially if you have a new failure.

David Blank-Edelman: That was my question, actually. Specifically, have you had outages that have led to better questions for your vendors?

Craig Sebenik: Yes. The changes from each outage are relatively small, but over time they have led to a 30-50% change in the vendor review process. That's over the course of a couple of years, which, again, goes back to the point that you have to continue to evolve it, continue to review all of your policies and say, "Hey, how has this changed?" Part of that comes back to raising the bar. As the vendors improve, your questions can hit more and more edge cases, and they can evolve as the entire industry evolves. Hopefully, ten years from now, this will be significantly more sophisticated than it is today.

Amy Tobey: I really like that. I'm going to open up now to Q&A. The first question: How has the classification of SRE roles into products, data, and infrastructure types helped make your implementation of SRE successful? How are the approaches of each team different or the same? What are the pitfalls to be aware of?

Kurt Andersen: Good question. I would say, to a certain extent, that it is an administrative division, in span of control and in focus. But I would also say it allows the teams to be partnered with their development teams in different ways. Our product SREs function in a largely embedded role with specific portions of our product space. For instance, your profile on LinkedIn is developed and maintained by a group that we call the profile team. There are profile SREs who are embedded with those developers. So the product SRE teams are split up by product areas.

Data teams are split up more along the technology framework. There are Kafka SREs. There are Oracle SREs. There are other SREs having to do with different types of data storage technologies. Infrastructure is sort of a catch-all to a certain degree. The infrastructure SREs support what LinkedIn calls the foundation team. It's what other companies might call their DevOps engineers. They're the ones who build the tools that the developers use for deployment, testing and everything else.

Infra SRE supports the foundation team. They also support some of our in-house private cloud technology. They support things like performance engineering. They kind of have a mixture in their space.

Amy Tobey: Were there any pitfalls with that?

Kurt Andersen: The pitfalls are the typical pitfalls of any sort of organizational distinction in that you have to be careful not to allow siloization. Whenever you've got multiple levels of hierarchy that you may have to traverse, you have to make sure that you keep good communication going.

Craig Sebenik: Interesting, because being in startups for quite a while now, I actually have more exposure to the reverse problem: what happens when you don't have a team of SREs with that specific title? One of the things you have to remember is that it isn't so important that you have an SRE person, as long as the SRE work gets done. When you're smaller, people wear a number of different hats. I've seen two kinds of approaches: you essentially take a developer and have them wear an SRE hat, or you take a dedicated SRE and have them jump between teams. In my experience, the latter works better than the former. The biggest problem is this competition between product features and reliability. Reliability should be part of the product, but that's a different discussion. You have this push and pull between the product desires and stability desires. Having somebody just focus on stability, even if that means they have to learn the quirks of every team, has been generally more successful than trying to take somebody who has more product focus and have them also focus on reliability.

Amy Tobey: Okay, cool. Another question: Does SRE replace or complement DevOps?

David Blank-Edelman: I feel very strongly that it is a complement because, in my opinion, SRE and DevOps, though both modern operations practices, focus on different things. The way I have come to understand it: if the keyword for SRE is reliability, one possible keyword for DevOps would be delivery. I believe, though they intersect in certain ways, that those are different. I just don't think that SREs, even though they touch into the realm of security, replace security. There's a complementary thing here. These are all people working on different things. In practice, I think that DevOps has traditionally focused on different things, and the way it is practiced is complementary to the way SRE is practiced. I am a big fan of DevOps people adopting SRE practices, taking up SLAs and SLOs. There's no licensing fee or any requirement along these lines; you can immediately start doing these things. I'm a big fan of the SRE folks learning things around collaboration. I think there's a lot to be learned here, for sure complementary. There is no big "When you're a Jet, you're a Jet" dance number in which they square off and fight, in my opinion.

Craig Sebenik: There's a pretty good video on YouTube from Liz Fong-Jones taking an object-oriented approach, where she says "class SRE implements DevOps." The first thing is how you define those terms; your definition then trickles down into how you implement it. If you define DevOps as essentially merging dev and ops, then SRE and DevOps are essentially two sides of the same coin, right? They're different flavors of the same thing, and you can look at it that way.
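
Taking the metaphor literally, here's a playful sketch in Python, where "implements" becomes subclassing an abstract base class (the method names are paraphrases of commonly cited DevOps pillars, not quotes from the talk):

```python
from abc import ABC, abstractmethod

class DevOps(ABC):
    """The interface: principles without a prescribed implementation."""

    @abstractmethod
    def reduce_organizational_silos(self): ...

    @abstractmethod
    def accept_failure_as_normal(self): ...

    @abstractmethod
    def measure_everything(self): ...

class SRE(DevOps):
    """One concrete implementation of the DevOps interface."""

    def reduce_organizational_silos(self):
        return "shared ownership of production with product teams"

    def accept_failure_as_normal(self):
        return "error budgets and blameless postmortems"

    def measure_everything(self):
        return "SLIs and SLOs"
```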

If you look at DevOps as the team that writes the tools for CI/CD and monitoring, then I think they could be the same thing. When I talked to SREs from Google years ago, they had essentially two different types of SRE: the embedded SREs that worked directly with the product teams, and what they called SRE SWEs, who essentially wrote the tools. If you take the definition where DevOps is the implementation of all these tools, then essentially your DevOps team is your SRE SWEs, and the SREs embedded with the teams are your SREs. But in the end, it all just comes down to the terms you want to use, how you've done your recruiting, and how your company has organically grown. In the end, it doesn't matter as long as the work gets done. Who does it and what title they have really doesn't matter.

Kurt Andersen: I think, as you pointed out, Craig, it's all in the definition. David started off by saying that there are areas of focus with lots of overlap. I do think that there are slightly distinct areas of focus. I'm in full agreement with everything you both said.

Amy Tobey: I'll take this opportunity to put in a little plug because I'm doing a whole talk about this at Failover Conf. The title is “The Future of DevOps is Resilience Engineering,” and this topic is very much what I'm going to be talking about.

David Blank-Edelman: I was going to say I'm a big fan of Tom Limoncelli's comparison of the two. He was kind enough to allow me to put it in the book. He suggests that DevOps people focus on going from somebody's laptop to production, and they're looking in that direction. He suggests that SREs start in production and look backwards in the way they think about things, like how do I get the stuff into production that makes sense. This is a lovely picture, and I've found it really helpful in my understanding. DevOps folks think about how to move into production, and SRE folks think about how to make production what you want by figuring out how things work, looking backwards. That's one of the ways I've come to understand the difference along these lines, but I think we can all be friends.

Amy Tobey: Great, thank you. In closing, a huge thanks to our panelists, Kurt, Craig, and David. I hope everyone enjoyed our chat today and got as much from it as I did. If you have any other ideas for ways we can band together and help our community or stay connected in this time of social distancing, please feel free to reach out to me on Twitter, @MissAmyTobey. Our SRE Leaders Panel is a recurring series, so we encourage you to follow us on Twitter, @blamelesshq, to stay tuned for the next one. Please share your ideas with us on Twitter or otherwise; we look forward to hearing from you all.
