Wondering about DevOps Metrics? We explain what metrics matter in DevOps, why they are important, and how to measure them.
What are DevOps Metrics?
DevOps metrics are a barometer that allows DevOps teams to measure the performance of their development pipeline and identify which areas are succeeding and which need improvement. Some key metrics are lead time for changes, deployment frequency, mean time to recovery, and change fail percentage.
As you implement DevOps practices, there are many specifics and details that you need to decide at the outset. For example, if you’re implementing the practices of continuous integration and deployment, you’ll need to decide just how frequently you attempt to release. There is no one right answer for every organization. There isn’t even one answer that’s always right for each organization. In fact, it’s best practice to continuously improve and iterate systems and processes by adapting to changes in the organization and customer needs.
Monitoring and reviewing DevOps metrics ensures that those iterations are data-driven. Let's look at some of the most common metrics, why they matter, and how to get started measuring each one.
Lead time for changes
Lead time for changes is the average time it takes for code to go from being committed to running in production. It essentially measures the speed of your deployment process. Here, committed code means code that requires no further development before it's ready for deployment. The deployment process could involve running the code in a test environment, performing additional tests, and packaging it into whatever architecture the service runs on.
Why it’s important to monitor: Lead time for changes is an important metric for budgeting time. Frameworks like Agile help you estimate how long development will take. However, if you don’t also know how long deployment will take, you don’t know when you can expect the project to reach users.
Improving lead time for changes is also important to increase deployment frequency. Many aspects of deployment take the same amount of time regardless of the size of the project, so frequent small changes are more costly if your lead time for changes is high.
How do I measure lead time for changes? Lead time for changes is monitored by logging whenever code is committed and whenever that code first runs in production. This is usually handled by whatever code management software you use, such as GitHub. By taking the time between these two events for each piece of code and averaging them across some time period, you produce your lead time for changes.
You can also evaluate the lead time for changes for only specific sections of your code. This can be helpful to see if deployments in some areas take longer than others, and if specific problems with these areas can be fixed.
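As a minimal sketch of this calculation, here's how you might compute lead time for changes from commit and deploy timestamps, both overall and broken down by code area. The records, areas, and timestamps below are hypothetical; in practice they would come from your code management and deployment tooling.

```python
from datetime import datetime
from statistics import mean

# Hypothetical deployment records: when code was committed and when it
# first ran in production, tagged with the area of the codebase it touched.
records = [
    {"area": "api",      "committed": datetime(2023, 5, 1, 9, 0),  "deployed": datetime(2023, 5, 2, 14, 0)},
    {"area": "api",      "committed": datetime(2023, 5, 3, 10, 0), "deployed": datetime(2023, 5, 3, 18, 0)},
    {"area": "frontend", "committed": datetime(2023, 5, 2, 11, 0), "deployed": datetime(2023, 5, 5, 11, 0)},
]

def lead_time_hours(recs):
    """Average hours between commit and first production run."""
    return mean((r["deployed"] - r["committed"]).total_seconds() / 3600 for r in recs)

print(f"Overall lead time: {lead_time_hours(records):.1f} hours")

# Break the metric down by code area to spot slow sections.
for area in sorted({r["area"] for r in records}):
    subset = [r for r in records if r["area"] == area]
    print(f"  {area}: {lead_time_hours(subset):.1f} hours")
```

Averaging over a rolling window (say, the last 30 or 90 days) rather than all time keeps the metric responsive to recent process changes.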
How do I know if I need to improve? A bad lead time for changes metric means that projects release slowly, even after all the development work is done. This can cause a lag between a demand for new features and those features being available to the public. Slow deployment processes are often shared between many types of projects, so you’re even more delayed if you’re trying to release many changes. This discourages you from deploying frequently, as advocated by other DevOps principles. Finally, future development can depend on users’ response to previous deployments. Development teams sitting idle and waiting for deployments to finish is an indicator of a lead time for changes that needs to improve.
How do I improve? Try breaking down your deployment process into smaller chunks and see where the biggest delays happen. Also look at how the metric varies based on the service area of the code. Once you've found specific areas that are causing problems, look into tools and processes that could automate them. If that's impossible, try creating runbooks to guide engineers through these stages more efficiently. If these don't result in enough improvement, consider changing the process or architecture entirely to eliminate the problematic step. Increasing resources for slow steps, whether through guides, tools, or headcount, is key to lowering your lead time for changes.
Deployment frequency
Deployment frequency measures the number of deployments of code your organization makes to production over some time period. Usually, every deployment is counted, but you can also measure the rate of only major changes.
Why it’s important to monitor: Deployment frequency can serve as a general health check for your entire DevOps lifecycle. A major delay in any stage will cause deployments to become more infrequent. Frequent deployments also mean your service can react quickly to new demands. If a bug needs to be fixed or a new feature has to be launched, you’ll want deployments frequent enough to address these needs.
How do I measure deployment frequency? Deployment frequency is measured by counting each deployment of new code into production. Tools that manage your deployment, like LaunchDarkly, can help record this. You then keep a running count of the number of deployments per some time period. This could be every day, every week, every month, or something else depending on your organization. You can also keep track of the size and type of deployments to get more specific frequency data.
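As a small sketch of this counting, here's one way to bucket a log of production deployments by ISO week to get a weekly frequency. The deployment dates are hypothetical; a real log would come from your deployment tooling.

```python
from collections import Counter
from datetime import date

# Hypothetical log of production deployment dates.
deployments = [
    date(2023, 5, 1), date(2023, 5, 1), date(2023, 5, 3),
    date(2023, 5, 8), date(2023, 5, 10), date(2023, 5, 11), date(2023, 5, 12),
]

# Bucket deployments by (ISO year, ISO week) to get a weekly frequency.
per_week = Counter(d.isocalendar()[:2] for d in deployments)
for (year, week), count in sorted(per_week.items()):
    print(f"{year} week {week}: {count} deployments")
```

Swapping the bucketing key (e.g. to `d.strftime("%Y-%m")` for monthly counts) lets you match whatever reporting period your organization uses.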
How do I know if I need to improve? Like other DevOps metrics, the ideal deployment frequency is unique to each organization. However, the warning signs for insufficient deployment frequency are often shared. Deployments should be made in time to address a particular user need or production issue. If you find that the issue remains in production long enough to cause user pain, or that users are dissatisfied waiting for updates, then you should increase the deployment frequency.
How do I improve deployment frequency? Deployments are one of the final stages of the DevOps lifecycle, so improving the efficiency of any stage before it will increase deployment frequency. Track how long it takes a project to go through each stage. Look for stages that take especially long or vary wildly in their length. These will make the biggest impact when improved. Also focus on steps that every project has to proceed through, instead of stages that only matter to a subset of projects. Minor improvements to universal processes can make a bigger difference than major improvements to niche processes.
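As a minimal sketch of this kind of stage tracking, here's how you might rank pipeline stages by average duration and variability to find improvement targets. The stage names and durations are hypothetical, standing in for data exported from a CI/CD tool.

```python
from statistics import mean

# Hypothetical per-deployment stage durations in minutes, as might be
# exported from a CI/CD tool.
stage_durations = {
    "build":           [12, 14, 11, 13],
    "test":            [45, 50, 120, 48],     # one slow outlier worth investigating
    "package":         [8, 9, 8, 8],
    "manual-approval": [240, 30, 400, 90],    # highly variable: a candidate for automation
}

# Rank stages by average duration; also report the spread (max - min),
# since stages that vary wildly are good candidates for improvement.
for stage, times in sorted(stage_durations.items(), key=lambda kv: -mean(kv[1])):
    spread = max(times) - min(times)
    print(f"{stage}: avg {mean(times):.1f} min, spread {spread} min")
```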
Keep in mind that there’s another upper limit on your deployment frequency: the number of changes to deploy. You can’t deploy code that hasn’t been written, so slow development velocity can also cause infrequent deployments. If there are unmet demands for your service, increasing development velocity could be the only way to address them. However, there’s no point making changes and deploying frequently just for the sake of it. DevOps metrics should reflect the actual state of your development lifecycle, not be manipulated just to show the numbers you want.
Mean time to recovery
Mean time to recovery measures the average time it takes to go from an incident causing an outage to the service returning to its previous functionality. Mean time to recovery is an MTTx metric, one of a family of metrics that show how long the stages of incident recovery take on average. By tracking each one, you can determine where inefficiencies exist.
How can I track mean time to recovery? Log every time an incident occurs that causes a disruption in service. Then, when the service is functioning again, make another log. Record the time between the two events, and then average all of these times. You might want to just look at the average of the last 30 days to see the effect of your current policies. You can also look at subsets of incidents to see which types are the most difficult to recover from.
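The logging and averaging above can be sketched as follows, including the rolling 30-day window and per-type breakdown. The incident records and type labels here are hypothetical; in practice they would come from your incident management tooling.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident log: when the outage started, when service was
# restored, and a type label for grouping.
incidents = [
    {"type": "deploy-bug", "started": datetime(2023, 5, 1, 9, 0),   "recovered": datetime(2023, 5, 1, 10, 30)},
    {"type": "infra",      "started": datetime(2023, 5, 10, 2, 0),  "recovered": datetime(2023, 5, 10, 8, 0)},
    {"type": "deploy-bug", "started": datetime(2023, 5, 20, 15, 0), "recovered": datetime(2023, 5, 20, 15, 45)},
]

def mttr_minutes(incs):
    """Average minutes between outage start and service restoration."""
    return mean((i["recovered"] - i["started"]).total_seconds() / 60 for i in incs)

# Restrict to the last 30 days to see the effect of current policies.
cutoff = datetime(2023, 5, 25) - timedelta(days=30)
recent = [i for i in incidents if i["started"] >= cutoff]
print(f"MTTR (last 30 days): {mttr_minutes(recent):.0f} minutes")

# Subset by incident type to see which are hardest to recover from.
for t in sorted({i["type"] for i in incidents}):
    subset = [i for i in incidents if i["type"] == t]
    print(f"  {t}: {mttr_minutes(subset):.0f} minutes")
```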
Keep in mind that mean time to recovery isn’t the same as mean time to repair or mean time to resolve. Repair refers to correcting the problem that initially caused the incident, for example by deploying a bug fix. Resolve refers to correcting the problem and also doing the work to learn from the incident, essentially wrapping up the incident as a whole. Recovery is necessary for repair or resolution, but repair or resolution aren’t necessary for recovery. Through methods such as backup systems, services can recover and be running again without the initial problem being fixed. All three of these metrics are important to track and keep separate. Blameless’s incident response section can record all three.
Why it’s important to monitor: Although all MTTx metrics are important, mean time to recovery has a unique importance, as it reflects the pain users experience from an outage. Simply restoring the service doesn’t complete your incident response process; you still need to resolve the issue and learn from it. However, users mainly care about being able to use the service again as soon as possible, even if it’s just a stopgap version. Therefore, mean time to recovery is the best way to know your outages aren’t causing major inconvenience for users.
How do I know if I need to improve? If your outages last so long that users are upset, or if you’re violating your service level agreements for availability, you definitely need to improve your mean time to recovery. If you have emergency systems like backup services in place, you should see that your mean time to recovery is much lower than your mean time to resolve. Your backup systems should allow the service to recover and be functional quickly while engineers tackle the underlying issues. If these numbers are close despite having recovery systems in place, you should focus on improving your mean time to recovery.
How do I improve my mean time to recovery? The best way to improve mean time to recovery is to focus on systems that restore service even if the original issue isn’t yet addressed. For example, if a deployment introduces a bug that leads to an outage, you can have a system in place to roll back the production environment to before the bug was deployed. Or if a server crash causes major data loss, you can have a backup server that the service switches to. In both cases, some new information will be lost: the new additions that unfortunately had a bug, or the data created since the backup. However, the service will stay functional, keeping users happy while the issues are addressed, and your mean time to recovery will be reduced.
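The rollback idea can be sketched very simply: keep a history of releases and a way to revert to the last known-good one, so service is restored before the root cause is fixed. The class and version names below are hypothetical, standing in for whatever your deployment tooling provides.

```python
# Minimal sketch of rollback-based recovery, with hypothetical version names.
class ReleaseManager:
    def __init__(self):
        self.history = []          # deployment history, newest last

    def deploy(self, version):
        self.history.append(version)

    def roll_back(self):
        """Revert production to the previous release to restore service quickly."""
        if len(self.history) > 1:
            self.history.pop()     # discard the release that caused the outage
        return self.history[-1]    # service now runs the last known-good version

mgr = ReleaseManager()
for v in ["v1.0", "v1.1", "v1.2"]:
    mgr.deploy(v)

# v1.2 introduces an outage-causing bug; rolling back restores v1.1.
print(mgr.roll_back())  # prints "v1.1"
```

Note that rolling back restores service but does not repair the underlying bug, which is exactly why mean time to recovery can be much lower than mean time to resolve.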
To really understand how changes to your processes are reflected in your metrics, you need to continually monitor your DevOps processes. Tools like Blameless can help. With our Incident Resolution tool, you can have a timeline automatically built for every incident. This can help you see where improvements to efficiency can be made. To find out how, check out a demo.