SRE Leaders Panel: Testing in Production

Blameless recently had the privilege of hosting some fantastic leaders in the SRE and resilience community for a panel discussion.

Our panelists discussed testing in production, how feature flagging and testing can help us do that, and how to get managers to be on board with testing in production.

The transcript below has been lightly edited, and if you’re interested in watching the full panel, you can do so here.

Amy Tobey: Hi everyone. I'm Amy Tobey, Staff SRE at Blameless. I've been doing SRE for a long time. I'm really into the SRE community and figuring out ways we can enable reliability in all of our organizations. Today, I’m really excited to be joined by Shelby Spees from Honeycomb and Talia Nassi from Split. I'm going to let them introduce themselves.

Shelby Spees: I'm Shelby. I'm a Developer Advocate at Honeycomb. I've been developing software and running production systems for about five years now and I want to help people deliver more value in their business.

Talia Nassi: I'm Talia. I'm a Developer Advocate at Split. Before this I was a QA and automation engineer for six years, so I have a background in testing. That's why I'm so passionate about testing in production. I'm really excited to be here.

Amy Tobey: To get started, let’s talk about trunk-based development. First though, we're going to talk about something that I believe that most people should be doing by default: testing our code with unit and functional tests on our workstations and in our dev environments. Before we merge our code to our main branch, it has to be ready to go for production.

Let's assume that we all have that in place. What's next? My role as somebody in incident management comes a little bit later in the process. In terms of when we say go and deploy that code to production, what's the first thing we need? Let’s start with Talia to talk about the Split perspective, and how they enable code to be shipped out to the edge and into production before we have full confidence that it's doing exactly what we expect.

Talia Nassi: With Split you can use feature flags. For those of you who aren't familiar, a feature flag is basically a piece of code that differentiates code deployment from feature releases. You can deploy code and it won't be available to your end-users. It’ll only be available for a specific group of people that you can specify in the UI. If you're doing trunk-based development, you can deploy your code to production behind a feature flag and then, once you're ready and you've tested the code in production, you can turn the feature flag on.

Amy Tobey: The feature flag space has expanded a lot over the last few years. I heard you say something about the ability to turn flags on for a fraction of users. Could you speak on how that enables us to ship code confidently?

Talia Nassi: Let's say I'm a developer on a team and I am working on a to-do list app. I want to add the ability to delete items from the list. I'm a front-end developer and the backend developer is creating this new API to delete tasks. So, my change is done and I have to wait for the backend developer.

Either I can wait for the backend developer to be done, or I can push the code to production behind a feature flag, wait for the backend change to be done, and have the backend developer push his code to production behind the feature flag as well. Then I can put myself inside the feature flag, which means that only I will be able to see those changes and can test it with both the front-end and the backend changes. It's a way to do it safely without affecting your end users.
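
As a rough illustration of the workflow Talia describes, here is a minimal Python sketch of a flag that is off by default and on only for targeted users. The `FeatureFlag` class, the flag name, and the user IDs are hypothetical and are not Split's SDK; a real flag service would evaluate targeting rules remotely.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlag:
    """Hypothetical in-memory flag: off by default, on only for targeted users."""
    name: str
    targeted_users: set = field(default_factory=set)
    default_on: bool = False  # default rule is off, so untargeted users see nothing new

    def is_on(self, user_id: str) -> bool:
        return self.default_on or user_id in self.targeted_users


def render_todo_item(flag: FeatureFlag, user_id: str, title: str) -> str:
    # The new code path is deployed, but only reachable when the flag is on.
    if flag.is_on(user_id):
        return f"{title} [delete]"
    return title


# Only the developer is "inside" the flag; everyone else gets the old behavior.
delete_flag = FeatureFlag("delete-todo-items", targeted_users={"talia"})
print(render_todo_item(delete_flag, "talia", "Buy milk"))
print(render_todo_item(delete_flag, "end-user-42", "Buy milk"))
```

Releasing the feature to everyone then becomes a configuration change (`default_on=True` in this sketch) rather than a new deployment.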

Amy Tobey: Let's say a young developer has these tools and is in a trunk-based development system. They put a feature flag on. They do all the right things. They deploy it to production and they flip the flag on and go home for the weekend.

We're testing in production. Customer service wakes up and there are complaints. Maybe engineering needs to figure out what's going on. I think this is where we can hand off to Shelby and talk a little bit about the role of observability.

Shelby Spees: Whether you're using feature flags or not, I think feature flags are a core part of this. But even if you're not behind a feature flag and you just want to observe your changes, having the data to allow you to observe that change and see the impact of your code changes on real users, that's really important.

The thing about observability is that it requires high-quality, high-context data, and the context we talk about is the context you care about as you're writing your code. As you're writing your code, you think, "Okay, what's important for this? How am I going to know it's working in production?" This is similar to how we do test-driven development. We say, "How do I know this test is effective? How do I know this code is effective?" You write a failing test first, and then you write the code to pass the test. You can do something very similar where you write instrumentation and interact with your system, and live in that production space to learn from the actual behavior.

When you get paged or get customer complaints saying, "Hey, there's something wrong," you can flip your feature flag back off. That's a really quick way to fix it. Then you can go back and look at the difference between who was behind that feature flag versus your baseline and understand the impact.

Amy Tobey: Does that imply that in an integration between the feature flagging service and the observability service, the feature flag would be passed into the events being sent to the observability system?

Shelby Spees: Absolutely. You can send that context along with all kinds of other context.
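
To make the idea of sending flag context along with each event concrete, here is a small Python sketch. The event shape and the field name `feature_flag.delete_todo_items` are assumptions made for illustration and are not Honeycomb's API; the point is only that once the flag state is a field on every event, the flagged cohort can be compared against the baseline.

```python
from collections import defaultdict

FLAG_FIELD = "feature_flag.delete_todo_items"  # hypothetical field name


def build_event(user_id, status_code, flag_on):
    # The flag state travels with the rest of the request context.
    return {"user_id": user_id, "status_code": status_code, FLAG_FIELD: flag_on}


def error_rate_by_flag(events):
    """Return {flag_state: fraction of events with a 5xx status}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        cohort = event[FLAG_FIELD]
        totals[cohort] += 1
        if event["status_code"] >= 500:
            errors[cohort] += 1
    return {cohort: errors[cohort] / totals[cohort] for cohort in totals}


events = [
    build_event("a", 200, False),
    build_event("b", 200, False),
    build_event("c", 500, True),
    build_event("d", 200, True),
]
rates = error_rate_by_flag(events)
print(rates)
```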

Talia Nassi: I wanted to piggyback off what Shelby said: that if you release to a subset of users or the entire user base, then you can monitor with observability. That's also a really important point. When you do a canary release or a percentage rollout where you release the feature to only a small subset of users, that goes hand-in-hand with testing in production because if something goes wrong, would you want 1% of your users to experience this issue, or 100%? It's an added layer of protection to have a percentage rollout where you incrementally roll out the change as opposed to everyone seeing this feature all at once in production.
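
One common way to implement a percentage rollout is to hash each user into a stable bucket, so the same user always gets the same answer and raising the percentage only ever adds users. The Python sketch below assumes that approach; the function names are hypothetical, and the source does not say how Split or any other product assigns users.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99 for this flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def in_rollout(flag_name: str, user_id: str, percent: int) -> bool:
    # A user is included when their bucket falls below the rollout percentage.
    return bucket(flag_name, user_id) < percent


users = [f"user-{i}" for i in range(1000)]
for percent in (0, 1, 10, 100):
    exposed = sum(in_rollout("delete-todo-items", u, percent) for u in users)
    print(f"{percent}% rollout -> {exposed} of {len(users)} users exposed")
```

Because the bucket depends only on the flag name and user ID, anyone exposed at 1% stays exposed at 10%, which keeps the experience consistent as the rollout widens.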

Amy Tobey: We've got the system in place and we're rolling code out. Let's say this time things are going fairly well. But say my director of software engineering is bugging me regularly about it. What's next?

We've got the ability to see how the code is behaving in production. We've got the ability to have very low latency, fine-grained control over which features are enabled in production. What are the other layers that you see out in the wild that help people build that confidence? These things give us a lot of confidence, but there's always a little bit more we can do.

Shelby Spees: It comes down to the DORA metrics: how quickly can you have a change go out into production? Is your lead time on the order of weeks to months or is it on the order of minutes to hours? If it takes several weeks between writing code and it living in production, you lose all of that context. When something inevitably doesn't go right in thousands of changes over the year, you'll want to see the impact of the work you did. But you lose all that context. It's much more expensive to gather all that back up in your head and remember the point of this thing you were working on.

Lead time for changes rolling out is a big one. That's where CI tooling makes a big difference. And at Honeycomb we instrument our builds so that we can see not only how long it takes for a build to run, but what part of it is slow and whether that can be optimized. We try to keep our build times under 10 minutes because we don't want people to start working on something else and then have to switch context back to see whether the build failed. Context switching is so expensive, and that's something I feel really passionate about.

Amy Tobey: It sure is. It's a huge drain on cognitive resources. I like how you brought that up. Vanessa asked in the chat about folks using trunk-based development. If there are known defects in the code or it's incomplete, is it still safe to ship behind a feature flag?

Talia Nassi: As long as the default rule is off, then it's safe to do it because if you're in this bubble (if you're internal or if you're part of our beta testing group), then you get to see all the defects, play with this thing, and break it in production. But if you're not, then you don't get to see anything related to this feature at all. As long as the default is off, then it's safe to deploy.

Amy Tobey: There's an advantage right here: modern software engineering is about working in teams and groups. Let’s say I get the UI code done and I'm confident that the simulator for the APIs is done. I ship it with the feature flag turned off, and now the backend team doesn't need to bug me and cause another context switch or wait for me to respond. They can simply turn it on in their environment.

There's a new level of coordination that's now captured in the technical system as opposed to us having to walk around and talk to each other.

Shelby Spees: What I really appreciate about this workflow (and one of the benefits of trunk-based development) is that you don't have your front-end feature sitting on a feature branch for weeks at a time and falling behind your trunk. 

I love Git. I have used the username GitGoddess. I've taught Git to juniors and seniors in different jobs, but making people manage integrations at the Git level is error-prone and complicated. Using feature flags to manage when things roll out is a lot better. I've been the person to go in and squash and manage other people's feature branches. One was a 115 commit, six months of work. I broke it up to like nine PRs or something.

I don't want anyone to ever have to do that. Hopefully it never gets that bad, but even before you're ready to release the feature, having it guarded by a feature flag removes all of that cognitive load and complexity.

Amy Tobey: We've talked a lot about production and letting our developers fly a little faster. But along that, there's a series of testing and validation we can do. The Twitter threads we've seen over the last couple of weeks about this received oh-hell-no responses. I think that came from a place of needing confidence before we put anything in production. So what are our options pre-production that these tools help us carry through in our dev and integration environments or even local testing? What do these tools help us accelerate?

Talia Nassi: It's a recommended approach for unit testing or integration testing. Something helpful that I recently learned about is to make a custom feature flag abstraction. Let's say you're a developer who's experimenting with giving people free shipping on an e-commerce site and now you're testing the shipping calculator. If the feature flag is on for you, you get free shipping, and if the feature flag is off for you, then you get the existing shipping cost.

In this example, you would have three tests. In the first test, you simulate the feature flag on, where the shipping cost is zero. For the duration of this test, any requests asking if the feature flag is on, you say yes. In the second test, you simulate the feature flag off, where the shipping cost is the existing shipping cost. If any requests come in from the test asking if the feature flag is on, then you say no. In the last test, you just validate that you can go through the entire purchasing flow regardless of whether the flag is on or off.

With this approach you're being super explicit in the test, and then the test just becomes much more self-documenting and descriptive.
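
The three tests described above might look something like the following Python sketch. Everything here is hypothetical: `FakeFlags` stands in for the custom flag abstraction, `free-shipping` is a made-up flag name, and the existing shipping cost of 599 cents is an invented value.

```python
EXISTING_SHIPPING_CENTS = 599  # hypothetical existing flat rate, in cents


class FakeFlags:
    """Test double for the flag abstraction: gives the same answer to every request."""

    def __init__(self, on: bool):
        self._on = on

    def is_on(self, flag_name: str, user_id: str) -> bool:
        return self._on


def shipping_cost(flags, user_id: str) -> int:
    if flags.is_on("free-shipping", user_id):
        return 0
    return EXISTING_SHIPPING_CENTS


def checkout(flags, user_id: str, subtotal_cents: int) -> int:
    """Simplified purchasing flow: returns the order total in cents."""
    return subtotal_cents + shipping_cost(flags, user_id)


def test_flag_on_gives_free_shipping():
    assert shipping_cost(FakeFlags(on=True), "u1") == 0


def test_flag_off_keeps_existing_cost():
    assert shipping_cost(FakeFlags(on=False), "u1") == EXISTING_SHIPPING_CENTS


def test_purchase_flow_completes_either_way():
    for on in (True, False):
        assert checkout(FakeFlags(on=on), "u1", 2000) >= 2000


# Run the tests directly; a test runner such as pytest would also discover them.
for test in (
    test_flag_on_gives_free_shipping,
    test_flag_off_keeps_existing_cost,
    test_purchase_flow_completes_either_way,
):
    test()
print("all shipping tests passed")
```

Each test states the flag state it simulates, so no test depends on how a shared test user happens to be configured in the flag service.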

Amy Tobey: I like how that sets people up to validate both paths in parallel. I've seen so many incidents where the current path works fine, the new path works fine, but once both get into production together, they interact in strange and unusual ways.

Talia Nassi: I actually just learned about this approach, and up until a few months ago I would always recommend getting your test users and targeting your test users inside of the feature flag and then using those to run. But that can be fragile. If someone deletes a user from a certain treatment in the UI, this causes problems because you're not the only person who can configure changes to the UI. I like this approach a lot better because you take away that fragility.

Shelby Spees: It comes down to thinking about the impact of all your changes. As trivial as it sounds to turn feature flags on and off, it's still adding a fork in the road, it's still a different path that your code can take. You want to be intentional about where you're including it, why you're including it, how you know it's going to be successful, and then testing for that.

Amy Tobey: That brings me back to chaos engineering, where we have to have a hypothesis before we start, and if we don't, that's the difference between science and screwing around. It's the hypothesis.

Shelby Spees: So many teams throw things at the wall and see what sticks. There's times when you need to do that, but I think we can do a really good job of reducing how much of the time we're throwing things at the wall and thinking more about the impact of this change. If you know the impact of your change, you can include that in your PR, and that facilitates code review, and that facilitates knowledge transferring in your team. It helps you write better code that's more self-documenting, write better comments on your code, and write better documentation around your code. Being intentional helps you just build better software.

Amy Tobey: Speaking of intentional, we have this fork in the road. When we get the fork in the road, we push the new feature flag out at 0%. And then we bring it up to 1%. We do the whole process, and eventually we flip it to 100%. At some point, there needs to be a loop back through the process to remove the dead path. Could you talk a little bit about the processes that you've seen work for making sure that happens? I've seen incidents where a feature flag sat forgotten for months or years and then some day later somebody else goes, "What's this?" and they flip it and then there's an incident.

Talia Nassi: The first thing is piggybacking off what Shelby said. Changes to your feature flags should be treated like changes to your code base because of their sensitivity. If you require two code reviews for pushing code to production, then you should require two code reviews for making any changes to your feature flags. In terms of what to do when you have stale feature flags, there are a few things.

A lot of feature flag management systems have an alert that'll say, "Hey, this feature flag hasn't had any impressions," which means people going in and making changes or being targeted in that flag. It'll ask, "Do you want to turn it off? Do you want to delete it? What do you want to do?"

In the UI, there's some configuration to set up. In your task management system, like Jira or Asana, whenever you create the ticket to set up the flag, run the test, roll the feature out, etc. you should also create a ticket to delete the flag and remove the old code. Then inside of the code, a feature flag is just an if/else statement. You're saying if you're in this bucket, do this, and if you're in this other bucket, do this other thing. That if/else statement just needs to be reworked and whatever version you chose needs to be put in and that if/else statement needs to be taken out.
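
A minimal before-and-after sketch of that cleanup, in Python. The function names and the 599-cent cost are hypothetical, and the sketch assumes the free-shipping variant was the one chosen.

```python
def shipping_cost_with_flag(flag_on: bool) -> int:
    # While the experiment runs: two paths, selected by the flag.
    if flag_on:
        return 0
    else:
        return 599


def shipping_cost_after_cleanup() -> int:
    # After free shipping was chosen: the flag check and the old branch are removed.
    return 0


# The cleaned-up function should behave exactly like the winning branch.
print(shipping_cost_with_flag(True) == shipping_cost_after_cleanup())
```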

Amy Tobey: Then that if/else needs to be kept as simple as possible, too, because if you get a case statement, then you're just asking for trouble down the road.

Shelby Spees: So earlier this year, one of our engineers, Allison, wrote a blog post about using Honeycomb to remember to delete a feature flag. It's a hygiene thing. I appreciate the idea of when you open a ticket that involves creating a feature flag, you also open a ticket to later delete that feature flag. Thankfully, removing a feature flag from code involves code review, so there's a knowledge transfer and context sharing. Having that step forces you to be like, "Okay, what's the impact of this?" Then, if you're using observability tools, you can see if there’s anyone who's still behind this feature flag.

Amy Tobey: Would that be through the metadata that comes in the events, or do you actually decorate your feature flag code with spans? Or maybe situational?

Shelby Spees: I think you just add your regular code with it. You can add arbitrary fields. You can just give it a name that says feature flag xyz.

Amy Tobey: I was just curious if there were cases where I have a flag A and B, and maybe in A, I have a new span or something. Does that impact the ability of my observability system to consistently display and compare spans?

Shelby Spees: It can be really useful with tracing if the code behind your feature flag is complicated, but you would probably instrument it just the way you would want to instrument it without the feature flag to see how it's going to behave normally, and then add the feature flag as a field.

Amy Tobey: When we do these tests in production, sometimes they can impact our data and our backends. What have you seen out in the field for techniques when people say, “We have a new route for writing through to the database, we're replacing the old one”? We flip to the new one, but maybe it's writing data in a slightly different format. There are all these side effects that can still happen and get out to production and cause incidents, which is more work for me. So what have you seen out there that people are doing to protect the data?

Talia Nassi: Some of the things that I've seen work well involve differentiating test data from real production data. The first thing is that your test users should have some sort of Boolean or something in the backend that says it's a test user, and that would be set to true. So anything that this test user does in production is going to be marked with this Boolean set to true, and then wherever you collect your data, you can just say that if you have any actions coming from this test user, put them somewhere else. Don't put them in the same place as production data, because for real production users that Boolean is false.

Same thing with all the other test entities. If you have a test cart and a test page or whatever your testing entities are, then those should have some sort of flag in the backend for “is test object,” and that should be set to true. Then everything that has that flag set should be put somewhere else in your dashboard.
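
A small Python sketch of that separation, assuming a hypothetical `is_test` marker on each recorded action. The `Action` type and the `route` helper are illustrative names, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass
class Action:
    user_id: str
    kind: str
    is_test: bool = False  # real production actions default to False


def route(actions):
    """Split actions into (production, test) lists based on the is_test marker."""
    production, test = [], []
    for action in actions:
        (test if action.is_test else production).append(action)
    return production, test


actions = [
    Action("customer-1", "purchase"),
    Action("qa-bot", "purchase", is_test=True),
    Action("customer-2", "add_to_cart"),
]
production, test = route(actions)
print(len(production), "production actions,", len(test), "test actions")
```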

Shelby Spees: The other thing is, if you're testing on real production users, that's real production data. It might be like an experiment, but every time you release a change, that changes something. And you can't possibly know what. That's the underlying theme when we talk about testing in prod. You're just not being intentional about it. The difference here is if you have a subset of your traffic behind a feature flag, if you know that they're specifically opted in as beta users. Then, like Talia said, you can mark them as beta users or test users. If you're dogfooding and you're limiting things to your internal team, then those people should be marked as your team's users.

If your business data reflects that, then it's probably not a lot of extra work to be able to report on changes between these different groups.

Amy Tobey: We've talked quite a bit about confidence, risk, and danger. I want to shift and talk about the opportunities new tools provide. I've been doing this for more than 20 years. Back when we started, we didn't have this stuff. We had stones and chisels and then we got Vim and some people claimed that it wasn't really a step forward.

We've come a long way, and we have these new capabilities. They allow us to do new things. Since we were just talking about the data, let's talk a little bit about what's possible with testing in production that we can't do in synthetic lab environments, because the data isn't real and because we have things like GDPR that prevent us from doing that testing.

Shelby Spees: I'll share another Honeycomb blog post from one of our users who used Honeycomb to debug an emergent failure in Redis. He monkey patched the Redis Ruby library in order to observe what was going on. They ended up with all these connections. Redis called them and said, "We're shutting you off." It was this huge problem that without observability, he couldn't possibly debug. He could not reproduce it locally or in QA. You needed a certain amount of traffic in order to debug it.

I appreciate that there are industries and certain domains where you need to have a synthetic environment. If you're working on pacemakers, that's really important. You want to be very, very confident in your tests. But there's also a cost to that. Similar to how we talk about three nines versus four nines as an order of magnitude more effort. It's similar to having a test environment that accurately represents production in order to give you any answers.

Being able to reproduce emergent failure modes in production in a test environment is super expensive and it's often not worth the infrastructure cost and the engineering brain cycles to actually do that. That's where a lot of the arguments against testing in prod fall apart, because you're going to have things that can only happen in prod. You may as well give yourself the tools to address them.

Talia Nassi: You're preaching to the choir. I used to be an automation engineer. Up until the very end of that part of my career, I was only testing in staging because the companies that I was working at didn't test in production. I would spend so much time testing features in staging. They would be pushed to production and then break in production, but they were working perfectly in staging, so what's the point of testing in staging if they're going to break in production? My users aren't going in and using the product in staging, so why do I care? It took enough of that happening over and over again where I was like, "Okay, there has to be another way to do this," and then I interviewed at a company that tested in production and I haven't looked back.

Shelby Spees: The thing about that, too, is that you as a tester are responsible for the quality of the code going out. It's super demoralizing when you have a certain level of confidence in staging and then all of that falls apart in production. It's like, what were you even testing? You were doing your job according to what was assigned to you, but you can't do your job according to what's actually good for the business. For people like us who care about the impact of the changes going out, we want to be able to validate that. We want to be able to feel like things are going to work well.

I've talked to a few testers who are learning about observability and the intersection between being a tester and observing code changes. There's so much about testing that involves the sense of responsibility and ownership over the services in production. You don't have to be a developer to have an impact there. I really appreciate your story because I totally feel you on that.

Amy Tobey: A lot of shops still have dedicated testers doing old school QA. But what we want is to uplift those people so that they're doing more high value work, just like we try to do through all of our careers. Maybe the place for them to move towards is this idea of owning and nurturing the test spectrum, but extending that all the way out into production.

Talia Nassi: And when you have the right tools available, they're working together, and you can see the impact of your changes and test them in production before they go out to your end users, you're creating this bubble of risk mitigation. If something goes wrong, you're covered. You can look at your logs and see what's going on before your end users will be affected by them. You're doing it in the safest way possible with the right tools. What else do you need?

Amy Tobey: We've gone from the beginning of the development cycle where we have our unit tests and we've talked all the way through into production, but sometimes stuff slips through. Having these tools in place still makes our life better. I'm an incident commander. I'm in there asking questions, trying to bring an incident to resolution. Now with observability and feature control in place, we have additional tools for resolving that incident. Maybe we could talk about that for a second.

Talia Nassi: This goes back to what Shelby was saying about not being able to reproduce an issue in production. If there's an incident in production, I'm not going to go to staging to test it. I'm going to go to production to figure out how to reproduce the issue. That's something that used to happen a lot when I was a QA engineer. There would be incidents and things were being reported to us in production. It's one of those “Oh, it's working on my machine” type of things where I did everything I could to test it, and it would be fine for me in staging, and then you would go to production and these incidents would only be in production. You're never going to know the differences between your staging environment and your production environment until you test in production.

Shelby Spees: We talk about this at Honeycomb in terms of approaching on call in a healthy way. If your on-call engineer gets paged at two in the morning and it turns out to be something where they can turn off a feature flag and then debug it in the morning when they've had enough sleep and a cup of coffee, why aren't we doing this more? On-call doesn't have to suck as much as we make it out to. It doesn't have to be this painful. We can put guardrails in place so that incidents get resolved and we stop impacting customers right away, and then we have the observability to go back and actually resolve what went wrong. You can fix the code, but with a full night of sleep.

Amy Tobey: I love that it has an impact on the health of our engineers. We're not stuck at four o'clock in the morning when we're at our lowest possible cognitive capacity. I don't know about you two, but you wake me up in the middle of an REM cycle and I'm a damn idiot for at least an hour before I'm ready to do anything. I really like the idea of it as a tool for reducing burnout and attrition even within our engineering teams.

Talia Nassi: I like that idea of having different types of alerts. If something is broken but it's not a huge issue, there should be different severity levels for the different types of alerts. You should only be woken up in the middle of the night to turn off a feature flag if your entire app is crashing and things aren't fine.

Amy Tobey: With my SRE hat on, I'd say we should only really be waking engineers up for things that are actually harming our customers. If it isn't impacting the critical user journeys, then we should sleep and hit it with a full mind in the morning.

Shelby Spees: That's where SLOs and error budgets come in. Alex Hidalgo's new book, Implementing Service Level Objectives, talks exactly about how we can reduce alert fatigue and have more meaningful pages. He also discusses how to be able to anticipate things like, “Okay, things are affecting a tiny, tiny percentage of traffic right now, but in four hours this is going to start impacting everybody. Should I wake somebody up to fix this or can that wait till morning?” That's exactly how you do it. It involves some math behind the scenes, but when you have good data about the health of your system, it's a lot easier to be able to alert on meaningful things instead of CPU usage or because traffic isn't within the band that we expect it to be. You shouldn't be waking somebody up for that if it's not going to impact your customer experience.
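
The "math behind the scenes" is often framed as an error budget burn rate. The Python sketch below is a simplified illustration of that idea, not the method from the book: it assumes a constant error rate, a 99.9% SLO over a 30-day window, and a hypothetical rule of paging only when the budget would run out within four hours.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    return error_rate / (1.0 - slo_target)


def hours_until_exhausted(error_rate, slo_target, window_hours, budget_remaining=1.0):
    """Hours until the remaining budget fraction is gone at the current error rate."""
    rate = burn_rate(error_rate, slo_target)
    if rate <= 0:
        return float("inf")
    return budget_remaining * window_hours / rate


def should_page(hours_left: float, page_threshold_hours: float = 4.0) -> bool:
    # Hypothetical policy: wake someone only if the budget is about to run out.
    return hours_left < page_threshold_hours


WINDOW_HOURS = 30 * 24  # 30-day SLO window
for observed_error_rate in (0.002, 0.5):
    hours_left = hours_until_exhausted(observed_error_rate, 0.999, WINDOW_HOURS)
    print(f"error rate {observed_error_rate}: about {hours_left:.1f}h of budget left, "
          f"page now: {should_page(hours_left)}")
```

With these assumed numbers, a 0.2% error rate leaves roughly two weeks of budget and can wait until morning, while a 50% error rate exhausts the budget in under two hours and justifies a page.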

Amy Tobey: I think of those as anxiety-driven alerts. You don't know that something is wrong. You're just maybe worried that something might be wrong. It's waking up and going, "Is this good? Is the baby okay?" It isn't great for anybody because disturbed sleep patterns are bad for us.

Talia Nassi: Shelby and I were talking on Twitter yesterday, and she said something really smart. She said that a lot of companies and people are making business decisions based on hunches or an idea, but they're not looking at data. It's so important that you write your tests based on actual data and make your business decisions based on data that comes from production. Using an observability tool like Honeycomb will allow you to do that.

Amy Tobey: And making it based on your risk assessment from your SLOs as opposed to some fuzzy idea of how many alerts are coming through, which nobody really has a clear picture of in any infrastructure of size.

Shelby Spees: I always have questions about cultural changes in a team. If you're watching this talk and you're like, "Oh man, like I get it. Some things we can't help. The only way to test it is in prod,” and you go to your manager and you're like, "Why aren't we doing this?", and your manager's like, "What are you talking about?” How do you start moving the needle? How do you start pushing for that cultural change on your teams and in your organizations?

Talia Nassi: Something I always like to tell people is to use examples from your past. If this is something that consistently happens where you test something fully in staging and it works and your automation tests are passing, and then you push to production and it fails, then maybe this is something that you should bring up. Like can we get a trial of these tools that we need just to see if it works for us for a few weeks?

Amy Tobey: Most of my testing in production earlier in my career was done on the rule of “It's easier to ask for forgiveness than permission.” It would just be like, "Well, we don't really have a way to do this. I'm just going to do it and I'm not going to ask anybody." I'm not recommending this for everyone unless you're really confident in your ability to get a new job. But we're often stuck in those places.

Shelby Spees: At my last job, my manager told me to take more risks because I would learn faster. I really appreciated that because I tend to be very hesitant, which is not the making of a good tester. I was on the DevOps team. I had access to all of our production systems. So I didn't want to just make changes willy-nilly.

Amy Tobey: That's usually what our leadership is looking for. What I coach people on is to think about the goals of our leaders. Their job is to give the business confidence in what we're doing. Whether we're writing software or not, often the task is to say, “This is the velocity I have to offer you and this is my confidence in that velocity.” In operations it's typically, “This is the availability I can offer you and this is the confidence I have in that.” We've started to move to more encoded systems like SLOs for availability. We use it for feature flags and observability too. That's our metric we use to determine whether our estimates of confidence are meeting the actual wheels on the road.

Our next question is, “I'm thinking now about destructive tests in production such as stress tests, security tests. Shall I avoid these kinds of tests in production and do you have any experience with that?” So let's start with stress tests. Do feature flags have a role, and how do we integrate that with our stress testing regimen?

Talia Nassi: When you run these tests, it's better to do this yourself at a time of low traffic rather than have your site crash in production because you didn't do these types of tests in production. So yes, you should be running performance, load, and stress tests in production, but you should do it at a time of low traffic and a time where you know that if you run the test and the site crashes, you can bring it back up. Feature flags can play a role in that, but yes, they definitely should be done.

Amy Tobey: You can also do things like enable features just for the stress test, if your stress testing has a synthetic user. Then you can go back to your observability system and see the impact of the new code on your systems.
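
As a rough illustration of turning a feature on only for a synthetic stress-test user, here is a small Python sketch. It deliberately does not use Split's SDK or any real feature-flagging API; the `Flag` class, the `checkout_path` function, and the user ID `stress-test-bot` are hypothetical names made up for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Flag:
    """A toy feature flag: on for everyone, or only for an allow-list of users."""

    name: str
    enabled_for_all: bool = False
    allowed_users: set[str] = field(default_factory=set)

    def is_on(self, user_id: str) -> bool:
        return self.enabled_for_all or user_id in self.allowed_users


def checkout_path(user_id: str, flag: Flag) -> str:
    # Real traffic keeps the old code path until the flag is opened up.
    return "new-checkout" if flag.is_on(user_id) else "old-checkout"


# Hypothetical setup: only the synthetic load-test user sees the new code.
new_checkout = Flag("new-checkout", allowed_users={"stress-test-bot"})

print(checkout_path("stress-test-bot", new_checkout))  # synthetic user hits the new path
print(checkout_path("real-customer-1", new_checkout))  # everyone else stays on the old path
```

The same allow-list idea would apply to a security team's test accounts: the new code runs in production, but only for the identities you have chosen, and the effect shows up in your observability data separately from normal traffic.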

Shelby Spees: It brings me back to chaos engineering and how, if you're going to perform a chaos experiment, you want to have confidence that your system will be resilient no matter when you deploy it. But at the same time, you don't want to drop it on Black Friday. I mean, Black Friday kind of is a chaos experiment in itself, but you want to be smart about when you start an experiment or you release a chaos monkey script. It's the same thing with performing stress tests or security tests, if you have pen testers behind the scenes. You want to be confident that if they're starting to cause problems, you can either block them off from the rest of the traffic, quarantine them, or ban them, and that you have mitigation measures in place. The point of these tests is to have more confidence in your production environment. So absolutely you should be doing these in production. But also, you should have confidence in your production environment.

Amy Tobey: A lot of the standards around security mean the pen tests have to happen in a production environment to really be valid. I really like this idea of being able to enable new features just for the security team to give them the opportunity to attack it while it's out in the real world, but before it's exposed to the wild hordes of bots and script kiddies and stuff.

Talia Nassi: I think years ago there was no way to safely do it in production. Observability tools and feature flagging didn't exist. Now that there are these tools available, those people who are still in tech from when these tools weren't available are still in that mindset of like, "No, it's not possible.”

Shelby Spees: It makes me think about the purpose of running all these software systems that we're running. We're here to deliver business value, and testing in prod gives us more confidence in our systems and helps us learn more about our systems so that we can do a better job of delivering business value. So when you refuse to even acknowledge the possibility of testing in prod, it's like trying to deliver packages on horses when you could deliver packages on trains. You're going to be stuck behind because you're just not learning about your systems and you can't address the problems that are there. They're just going to stay under the surface forever.

Amy Tobey: You brushed up against an accessibility element there too, which is old farts like me and some of my peers are looking at us and going, "You people, what are you doing? Testing in production? That is verboten." But we also have younger developers or new tech people who now are empowered to do things that we would not have empowered a young developer to do 10 years ago.

Talia Nassi: There's always going to be those people who say, "Testing in production will never work. What are you guys doing?" These naysayers, I like to call them. Honestly, I don't care about these people.

Amy Tobey: I just call them wrong.

Talia Nassi: You're stuck in this mindset and you absolutely refuse to change your mindset. That's fine. You do you. I'll be over here testing in prod.

Shelby Spees: You're already testing in production. Add the tools to your toolbelt so you can get the most value out of those risks that you're already taking because every time you push out a change, that is a test.

Talia Nassi: Honeycomb has a free version and Split has a free version. I would go in and download both of them, start using them, and figure out if the tools are right for you and if you like the process of testing in production. And again, use examples from your past to bring this up to your management.

Amy Tobey: So with that, that's our time. Thank you Talia and Shelby so much for joining us today. I had a great time. I hope you did too. To our audience, thank you for joining us so that we can all have this time together at a time when we can't actually be together. Stay safe out there, stay healthy, and stay resilient.

Follow Blameless on Twitter and check out our Events page to stay posted on future webinars and events! 

About the Author
Blameless Community
