We explain what software reliability testing is and how it works, how to conduct it, and how you can use it to identify problems in the software design process.
What is reliability testing?
Software reliability testing is the process of testing software’s ability to function under a given workload for a specific period of time. Types of reliability testing include feature testing, regression testing, and load testing.
Why do software reliability testing?
As users depend more on your services and competition increases, reliability becomes your most important feature. After all, any other features you have are irrelevant if users can’t access them.
Increasing the reliability of your software is a complex, holistic process where you need to account for many factors contributing to unreliability. It involves making changes on every level from the cultural to the technical.
However, whatever changes you make, you need to measure the impact the changes have on reliability. Without reliability testing, you won’t know if you’re really increasing your reliability. Testing allows you to prioritize effective changes and stop changes that aren’t helpful.
Reliability testing vs performance testing
Reliability testing and performance testing overlap in some of their methods and goals. Both try to ensure that software will work as expected under all the conditions it could encounter, and both achieve this by running the code through simulations of those conditions and measuring the response. Reliability testing can be thought of as a specific type of performance testing, distinguished by its focus on meeting the reliability standards your users depend on rather than on raw performance metrics alone.
How to do reliability testing
The exact process of reliability testing depends on the type of tests you’re trying to run and the architecture of your system. However, some steps are always necessary.
Setting service level objectives. Having a service level objective (SLO) set to a level of reliability that will make users happy keeps you from over- or underspending on reliability. Then you’ll be able to test to ensure that code reaches those standards.
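As a minimal sketch of how a test run can be checked against an SLO, the function below compares measured availability to a target. The 99.9% target and the request counts are hypothetical examples, not values from any particular tool.

```python
def availability(successful_requests: int, total_requests: int) -> float:
    """Fraction of test requests that succeeded during the test window."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return successful_requests / total_requests

def meets_slo(successful_requests: int, total_requests: int,
              target: float = 0.999) -> bool:
    """True if measured availability meets the SLO target (here 99.9%)."""
    return availability(successful_requests, total_requests) >= target

# 9,995 of 10,000 test requests succeeded: 99.95% availability passes
# a 99.9% target; 9,985 of 10,000 (99.85%) fails it.
print(meets_slo(9995, 10000))
print(meets_slo(9985, 10000))
```

A real pipeline would pull the success and total counts from monitoring data rather than hard-coding them, but the pass/fail comparison against the SLO target stays this simple.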
Building a testing environment. Reliability testing generally deals with code that’s already running in users’ hands, proactively looking for potential disruption to production code. However, you don’t want to test the system that users are relying on, as you’re likely to cause outages. The solution is to build a separate environment to test in.
The testing environment should be as similar to the production environment as possible. As unreliability can come from many different sources, from bugs in the code to hardware limitations, you should replicate all of those sources in the testing environment.
Defining areas of code. Reliability testing makes sure that different features of the service function as intended. A major part of this is making sure that new features are compatible with old features and that all interactions happen as expected. To be able to test for this, you need to be able to clearly designate where one feature or update ends and others begin.
Techniques such as feature flagging help you understand what parts of code contribute to each feature. Then, when testing reveals an issue, you know exactly what part of code should be examined.
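A minimal sketch of feature flagging is shown below: code paths are gated by named flags, so a failing test points directly at the flagged code path. The flag names and the search function are hypothetical illustrations, not a specific flagging library.

```python
# Hypothetical flag store; real systems read this from a config service.
FLAGS = {
    "new_search_ranking": True,
    "beta_checkout": False,
}

def is_enabled(flag: str) -> bool:
    """Unknown flags default to off, so untested paths stay dark."""
    return FLAGS.get(flag, False)

def search(query: str) -> str:
    # The new implementation runs only when its flag is on, so when a
    # test fails you know exactly which feature's code to examine.
    if is_enabled("new_search_ranking"):
        return f"new-ranking results for {query!r}"
    return f"legacy results for {query!r}"
```

Flipping `new_search_ranking` off in the test environment immediately isolates whether an observed issue lives in the new ranking code or in the legacy path.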
Setting up monitoring tools. When your tests are running, you need to be able to monitor and measure the output of the code to know if it’s meeting the standards. Monitoring tools watch these outputs and can show you meaningful context.
Setting up testing tools. Testing tools can help automate the testing process by running the specified requests without manual direction. Make sure the testing tool can integrate with your monitoring tools.
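The shape of such a testing tool can be sketched as a small runner that executes a request, checks the response, and reports latency and outcome to a monitoring hook. Here `record_metric` and the in-memory `metrics` list stand in for whatever API your actual monitoring tool exposes; they are assumptions for illustration.

```python
import time

# Stand-in for a monitoring integration: in practice this would push
# to your monitoring tool instead of appending to a list.
metrics = []

def record_metric(name: str, latency_seconds: float, ok: bool) -> None:
    metrics.append({"name": name, "latency": latency_seconds, "ok": ok})

def run_test(name, request_fn, expected) -> bool:
    """Run one test request, compare to the expected response, and
    report the latency and pass/fail result to monitoring."""
    start = time.perf_counter()
    try:
        ok = request_fn() == expected
    except Exception:
        ok = False  # a crash or timeout counts as a failed request
    record_metric(name, time.perf_counter() - start, ok)
    return ok
```

Calling something like `run_test("health_check", fetch_health, "ok")` (with `fetch_health` being your own request function) records one latency sample and one pass/fail result per run, giving the monitoring side the context it needs.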
Setting up a testing policy. Testing should consistently happen whenever it’s necessary. Set up policies for when different tests need to occur. This could be based on a time-based schedule or specific events depending on the tests. Write runbooks that guide people through the test to make it easy, and automate them where possible. Simple tests that need to be frequently run are good targets for automation.
Once you have these steps set up, you can begin testing.
Types of reliability testing
Reliability testing can take many different forms. Any test you run to ensure that software is reliable enough for users to be happy falls under the umbrella of reliability testing. Your system might have particular aspects that need specific testing. Think about what experiences your users are counting on, and design tests for those experiences.
For example, if users of your medical advice service often need to complete a search very quickly, your reliability testing should focus on the speed of the search results, and not just their overall availability.
However, no matter what specific tests your service needs, they’ll likely fall into some of these broad categories.
Feature testing ensures that new features in your service work as expected by running through all their possible use cases. In the testing environment, try every setting, option, and combination that can change how your feature functions. Then, see if the feature behaves as you expect. Doing these tests is important every time a feature is added or updated.
To do this, you first need to understand what your feature is exactly and what it should do. Create a table of the possible inputs and the expected outputs, then fill in the table with the actual outputs. Use tools like feature flagging to understand what inputs will be relevant to the tested feature.
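The input/expected-output table described above maps naturally onto a table-driven test. The feature under test here, a dosage formatter, is purely hypothetical; the pattern is what matters.

```python
def format_dosage(mg: float) -> str:
    """Hypothetical feature under test: render a dosage for display."""
    if mg <= 0:
        return "invalid"
    if mg >= 1000:
        return f"{mg / 1000:g} g"
    return f"{mg:g} mg"

# Each row: (input, expected output) — the table from the planning step.
cases = [
    (250, "250 mg"),
    (1500, "1.5 g"),
    (0, "invalid"),
]

# Fill in the "actual output" column by running the feature.
results = [(mg, expected, format_dosage(mg)) for mg, expected in cases]
failures = [row for row in results if row[1] != row[2]]
```

Any row left in `failures` identifies exactly which input produced behavior that differs from the specification, which is the output feature testing needs to produce.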
When feature testing reveals issues with how features behave, you need to restructure the code to account for them. Generally, problems found in feature testing need to be addressed in code, as incorrect functionality isn’t likely caused by insufficient resources or other operational factors.
Regression testing is a series of tests run to make sure that a service still functions fully when changes occur to it. A change can be anything from a bug fix to an infrastructure change to a major update. Code is complex, and it isn’t always easy to tell what effects a change could have. Problems that seemed to be solved in a previous update can recur when a new update is pushed, so retesting for old issues is necessary. If a change causes new issues, this is known as a regression and must be dealt with.
Regression testing could reveal issues with code, which would require rewrites and retests. It could also reveal issues with resource usage or other infrastructural problems. A new update could end up using additional resources such that another part of the code no longer works.
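One common way to retest for old issues is to keep a case per previously fixed bug and re-run the whole list on every change. The parser and bug IDs below are hypothetical; the structure is the point.

```python
def parse_duration(text: str) -> int:
    """Parse '5m' or '30s' into seconds.
    (Hypothetically, earlier versions crashed on zero durations.)"""
    value, unit = int(text[:-1]), text[-1]
    return value * 60 if unit == "m" else value

# One entry per previously fixed bug: (bug id, input, expected output).
# Re-running these on every change catches any old issue that resurfaces.
regression_cases = [
    ("BUG-101", "5m", 300),
    ("BUG-142", "0s", 0),
    ("BUG-177", "30s", 30),
]

regressions = [bug for bug, arg, want in regression_cases
               if parse_duration(arg) != want]
```

A non-empty `regressions` list names exactly which past bugs a new change has reintroduced, which is the signal a regression suite exists to produce.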
Load testing refers to simulating the load of expected usage on some code to ensure it’s still reliable enough to make users happy when being fully used. The testing process could involve sending lots of requests to the service in the testing environment and recording details of how the service replies – is the response accurate, fast, and consistent? Look at how many resources are being used in terms of your servers or cloud provider. You want to simulate the maximum amount of traffic you expect the service to receive.
Once you’ve seen how the code performs under this load, make adjustments if it doesn’t meet your reliability standards. This could involve dedicating more resources to running the code, or making changes to the code itself so it runs more efficiently.
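A small load test along these lines can be sketched with a thread pool that fires concurrent requests and summarizes latency and success rate. Here `fake_service` is a stub standing in for real requests against your testing environment, and the request counts are arbitrary examples.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def fake_service() -> bool:
    """Stub for a real request: sleeps briefly, reports success."""
    time.sleep(0.001)
    return True  # a real check would validate the actual response

def load_test(total_requests: int = 200, concurrency: int = 20) -> dict:
    """Send total_requests requests with the given concurrency and
    summarize how the service held up."""
    def one_request(_):
        start = time.perf_counter()
        ok = fake_service()
        return time.perf_counter() - start, ok

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_request, range(total_requests)))

    latencies = sorted(latency for latency, _ in results)
    return {
        "success_rate": sum(ok for _, ok in results) / total_requests,
        "mean_seconds": statistics.mean(latencies),
        "p95_seconds": latencies[int(0.95 * len(latencies))],
    }
```

Comparing the reported success rate and tail latency against your SLOs tells you whether the code needs more resources or efficiency work before it can handle peak traffic.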
By combining these types of tests, you can get a complete picture of how reliable your code is compared to your standards.
How can Blameless help?
However, testing is only part of the process of improving reliability. To understand what to test and how to set your standards, you need to understand user experiences and build SLOs. Blameless can help. Our SLO tools are best in class and provide deep reliability insights. To see how, check out a demo!