At Blameless, we want to embrace all the benefits of the SRE best practices we preach. We’re proud to announce that we’ve started using a new system of feature flagging with canaried and iterative rollouts. This is a system where new releases are broken down and flagged based on the features each part of the release implements. Then, an increasing subset of users are given access to an increasing number of features. By avoiding big changes for big groups, we reduce the chances of major outages and provide a more reliable product faster.
Of course, switching to this system comes with challenges and decisions to make. In this blog post, we’ll share what we’ve learned about canarying and flagging best practices. We’ll look at:
Iteration and canarying is more involved than traditional big releases. You need to look at the code being deployed and flag everything that comprises each new feature. You’ll also need to tag groups of users. Finally, instead of one big release, you do several smaller releases where more groups of users receive more features each time. Flagging features and making user groups creates overhead, and each release will take a bit of additional time. However, the benefits of this system are worth it. Here are a few to consider:
More reliable services. Perhaps the biggest benefit of this approach is improved reliability for your service. Changes in production are a common source of incidents and outages. With small, iterative deployments, you’ll know that things can only go wrong for a subset of your services. Likewise, canarying means that only a subset of your users will be affected. By preventing major outages, you’ll greatly improve the perceived reliability of your service.
Continuous feedback. By iterating through new features, you have many more opportunities to hear feedback from users. You’ll be able to tell what should be improved before you commit to the entire feature set.
More manageable operations. By reducing the scope and spacing out each release, you give operations teams the opportunity to ensure they’re ready to support each feature.
This approach does come with challenges. As this approach only deals with the deployment of code, you don’t have to retroactively change your code to be modular or feature-flagged. However, you need to build new practices for code going forward. Developers need to invest time and energy building new habits for development. They need to instill a mindset of everything being modular and iterative. Switching gears like this can initially cause development to slow, but the payoff of better releases is worth it.
Our release system has essentially two dimensions: the amount of users accessing new features, the canary size; and the new features they have access to, the iteration. Our goal is to expand both of these until they encompass all users and all features without ever making leaps large enough to jeopardize the reliability of each change. How do you find this balance and cadence?
First, you need to build a release roadmap. This is a project that product, development, and operations teams should share. It outlines which features should be included in each iteration, and which canary groups should receive them. It should also contain an aspirational timeline for each stage. However, this timeline shouldn’t be written in stone. You’ll need to adjust your rollout speed based on how each iteration is performing.
The key is monitoring data with feature flagging. You need to be able to see how each feature is individually performing. Whether or not a given feature should be rolled out can depend more on the performance of specific other features, rather than the overall health of the system. Blameless uses monitoring tools such as Sumologic to parse the information our system outputs. It allows us to break down which features are causing issues or unreliability, and which are stable enough to be built upon.
Once you know a given iteration is safe, you can roll it out to the designated groups. The modular setup gives an extra layer of protection, as individual features can be rolled back without impacting the entire system. Don’t depend on this going off without a hitch, though. Like any other backup system, simulating the need to roll back an iteration is necessary to understand what your options actually are.
Another best practice for canarying releases is to use customized specific canary groups. Intuitively, you might just break your users down into indiscriminate chunks — maybe 10 groups of 10% each. This works fine to get many of the benefits of canarying, but you can get even more insights with tailor-made canary groups.
To do this, you first need to understand how your users interact with your service. Blameless uses tools such as Pendo to see how much each user relies on each feature. This is supplemented by meeting with customer success teams, who can relay reports from customers on what matters most to them. Creating things like user journeys and SLIs can quantify this importance.
Once you have profiles for your users, create groups for each iteration based on the features in that iteration. Some qualities that you’d like your ideal canary group to have include:
Of course, you won’t necessarily find a perfect set of users for every feature. The important thing is considering these things when building canary groups and choosing the best candidates you have.
To make these frequent, iterative, and specifically targeted deployments, you’ll need to have a strong deployment system first. Practices like CI/CD are necessarily for this speed and flexibility.
You’ll also need to respond well when things go wrong, which they inevitably will. Blameless’s incident response tools, like runbooks and retrospectives, help you recover quickly and strengthen your responses. Our SLIs and SLOs quantify what matters to users, helping you build user profiles for canary groups. To see all of this and more, check out a demo!