Three Ways To Measure Unit Testing Effectiveness

Eduardo Fischer dos Santos

Code Coverage Reporting By Itself Is Not Enough!

Have you ever been asked how much unit test coverage is ideal? 75 percent? 80? 100?

It’s a common discussion in meetings for new projects, refactoring old projects, or even job interviews. It’s easy to name an arbitrarily high number, or to say that there is no perfect number because every project is different. But neither answer is informative. And even at 100 percent coverage there is no guarantee that the test suite is adequate.

Imagine that our coverage tool reports that our project has 100 percent test coverage, a perfect score. Every single line of code is covered. Then we take a moment to look at one of our tests. We open the file, look at the test code, and we are left in shock. The test does not contain even a single assertion. We look at the next file and the next: no assertions. We have 100 percent test coverage, but it is the same as having none, because none of our tests is testing anything. Coverage metrics do not report on how effective the tests actually are.

So here we have reached a point where we can agree on one conclusion. Test coverage is not enough: we need something more.

In the following sections, I’ll discuss three different ways of measuring whether the test suites are truly useful. All of them are field-tested by big companies and have proven themselves reliable. It’s best to use at least one (preferably all three).

Test Case Effectiveness Metric

What does “test case effectiveness” mean? There are many ways to define and measure it. I will show here the metric used at Sun Microsystems for the OpenSolaris Desktop project.

At OpenSolaris, the test case effectiveness (TCE) metric is a way to measure the quality of your test cases. It helps you discover why bugs are escaping to production and how well your tests have kept up with updates to your project. If the TCE goes down, it means that your tests are not being maintained adequately to handle the updates.

This metric is especially useful when you can get a lot of client feedback on defects before a general release. Open source projects are a good example, since alpha and beta releases are popular, and the users are generally knowledgeable, detailed bug reporters. 

TCE is calculated with a simple formula:

    TCE = BF / (BF + BNF) * 100% where:

    BF = number of bugs found as a result of executing test cases
    BNF = bugs not found, which are bugs found in production that slipped through the test cases

As a result you will have a score between 0 and 100 percent, where 0 is the worst possible score and 100 is the best. A score approaching 0 means that your test cases found few or no bugs and that many bugs escaped to production.
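As a quick worked example, the formula can be written as a small Python function. The function name tce_percent and the sample bug counts are assumptions made for illustration, not part of the OpenSolaris process.

```python
def tce_percent(bugs_found, bugs_not_found):
    """Test case effectiveness: BF / (BF + BNF) * 100."""
    total = bugs_found + bugs_not_found
    if total == 0:
        # With no recorded bugs at all, the ratio is undefined.
        raise ValueError("TCE is undefined when no bugs have been recorded")
    return 100 * bugs_found / total


# Hypothetical release: 45 bugs caught by test cases, 5 escaped to production.
print(tce_percent(45, 5))  # 90.0
```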

We can also apply this formula with bugs weighted by severity: the most serious bugs get more weight and minor bugs get less. This gives us a good idea of whether our test cases are catching more or fewer of the bugs that matter with each new version.
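A severity-weighted variant might look like the sketch below. The severity labels and the weights 5, 3, and 1 are hypothetical; each team would choose its own scale.

```python
# Hypothetical severity weights; choose values that fit your own bug tracker.
SEVERITY_WEIGHTS = {"critical": 5, "major": 3, "minor": 1}


def weighted_tce_percent(found, escaped, weights=SEVERITY_WEIGHTS):
    """TCE where each bug counts according to its severity weight.

    found and escaped are lists of severity labels, one entry per bug.
    """
    bf = sum(weights[severity] for severity in found)
    bnf = sum(weights[severity] for severity in escaped)
    if bf + bnf == 0:
        raise ValueError("TCE is undefined when no bugs have been recorded")
    return 100 * bf / (bf + bnf)


# Tests caught eight minor bugs and one major bug; one critical bug escaped.
found = ["minor"] * 8 + ["major"]
escaped = ["critical"]
print(weighted_tce_percent(found, escaped))  # 68.75
```

Unweighted, the same data gives 9 / (9 + 1) = 90 percent, so the weighting makes the escaped critical bug far more visible.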

Ideally, besides keeping track of the TCE metric, you should pay attention to the reason each bug escaped, since this can give you a good idea of how to improve your test suite. No test cases covering the bugs? Write test cases covering them. Test execution problems? Review your pipelines. Incorrect specification? Maintain closer contact with your project manager, developer, and designer.

Mutation Score

Mutation score is a metric used in mutation testing to evaluate the effectiveness of your unit tests. If you have never heard of mutation testing I will give you a brief explanation before focusing on the metric itself.

Mutation testing means using a tool to deliberately insert bugs into our code and then checking whether our tests find them. If the tests catch all of the defects, perfect. But if they still pass, that’s a bad sign: the code is now buggy, so they shouldn’t. In more detail, the tool creates a copy of a file with one mutation inserted (for example, changing a ‘==’ to ‘!=’ or a ‘>’ to ‘>=’) and runs the tests against that copy instead of our original code. If the tests fail, the mutant is killed, which shows the tests are robust against that change. If the tests pass, the mutant has survived and it’s time to improve them.
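The idea can be sketched by hand in Python, without any mutation tool. Everything here is hypothetical: is_adult stands in for production code, is_adult_mutant is the kind of copy a tool would generate, and the two test functions represent a weak and a strong test suite.

```python
def is_adult(age):
    """Original code under test."""
    return age >= 18


def is_adult_mutant(age):
    """Mutated copy: '>=' has been changed to '>'."""
    return age > 18


def weak_test(fn):
    # Only checks values far away from the boundary.
    return fn(30) is True and fn(5) is False


def strong_test(fn):
    # Also checks the boundary values where the mutation changes behavior.
    return weak_test(fn) and fn(18) is True and fn(17) is False


print(weak_test(is_adult), weak_test(is_adult_mutant))      # True True
print(strong_test(is_adult), strong_test(is_adult_mutant))  # True False
```

The weak test passes against the mutant, so the mutant survives. The strong test fails against it, so the mutant is killed.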

At first glance, mutation testing seems like something magical for a tester and a nightmare for a developer. After all, inserting bugs into your code might seem counter-productive; we want fewer bugs, not more. But as a technique it’s a fantastic way to evaluate how your tests behave and respond to bugs. If you are still not convinced, just look at Google: not only do they use mutation testing, they have also published many research articles about how to improve it.

When using mutation testing, you can calculate the mutation score.

Mutation score is defined as:

    Mutation Score = (mutants detected / mutants inserted) * 100% where:

    mutants detected = number of inserted defects that our unit tests detected

    mutants inserted = number of defects inserted by the mutation testing tool

As a result you will have a score between 0 and 100 percent, where 0 is a really bad score and 100 is a perfect score. A score of 0 means that your unit tests couldn’t detect any of the defects introduced, so they are lackluster and in need of improvement.
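As a small worked example, the score can be computed from the two counts a mutation tool reports. The function name and the sample numbers are assumptions made for illustration.

```python
def mutation_score_percent(mutants_detected, mutants_inserted):
    """Mutation score: mutants detected / mutants inserted * 100."""
    if mutants_inserted <= 0:
        raise ValueError("at least one mutant must be inserted")
    if not 0 <= mutants_detected <= mutants_inserted:
        raise ValueError("detected mutants must be between 0 and inserted mutants")
    return 100 * mutants_detected / mutants_inserted


# Hypothetical run: the tool inserted 50 mutants and the tests killed 42.
print(mutation_score_percent(42, 50))  # 84.0
```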

With this metric, you can learn more about the quality of your tests even when your unit test coverage is at 100 percent. The metric is still useful if you have low unit test coverage, but to take maximum advantage of it you should aim for more unit tests, since that means more tests are being evaluated.

The great advantage of the mutation score is that the mutation tests that produce it can run in your continuous integration / continuous delivery (CI/CD) pipeline alongside your unit test coverage check. You could stipulate a minimum of 75 percent test coverage and a minimum mutation score of 80 percent and enforce both conditions in your pipeline. Any new pull request that doesn’t meet these requirements won’t get merged. The only downside is that creating a new file and executing the tests for every mutation is very resource intensive, so it could greatly increase your CI/CD times.
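Such a gate could be sketched as follows. The function gate_failures is hypothetical, and how the coverage and mutation score values reach it depends on your tools; this sketch simply assumes both percentages are already available as numbers.

```python
MIN_COVERAGE = 75.0        # minimum unit test coverage, in percent
MIN_MUTATION_SCORE = 80.0  # minimum mutation score, in percent


def gate_failures(coverage, mutation_score):
    """Return a list of human-readable reasons the quality gate fails."""
    failures = []
    if coverage < MIN_COVERAGE:
        failures.append(
            f"coverage {coverage:.1f}% is below the minimum of {MIN_COVERAGE:.1f}%"
        )
    if mutation_score < MIN_MUTATION_SCORE:
        failures.append(
            f"mutation score {mutation_score:.1f}% is below the minimum of "
            f"{MIN_MUTATION_SCORE:.1f}%"
        )
    return failures


# Hypothetical pull request: good coverage, but weak tests.
for reason in gate_failures(coverage=92.0, mutation_score=61.5):
    print(reason)
```

A pipeline step could exit with a non-zero status whenever the returned list is not empty, which would block the merge.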

Code Metrics - Static Analysis

Code metrics, as I use the term here, are metrics given to you by a static analysis tool. The two previous methods have their own strengths and weaknesses, and static analysis avoids some of those weaknesses.

Static analysis is the use of a program that runs through your code without executing it. The results will inform you of ways you could have written better code. This method doesn’t demand the time that mutation testing requires or the manual labor of test case effectiveness measurement, so it could be a great solution for you.

Static analysis can find bugs and vulnerabilities and can check for consistent use of clean code practices as well. Maybe there are too many duplicated lines and you could have created methods to encapsulate them, or maybe there are circular dependencies that you didn’t notice, libraries with known vulnerabilities, or exceptions that could be avoided. Having a tool that could detect these things for you automatically in your CI could save a lot of time debugging and bug fixing.

To give you an example, let’s say we have a mobile app written in Flutter and we want some code metrics for this app. We could use “Dart code metrics,” an open source package that offers some good measures. Let’s take a look at some:

  • Maintainability index, which measures how maintainable (easy to support and change) your source code is. It is calculated from lines of code, number of operators, number of independent paths, and other characteristics. The more you help developers keep their code friendly to change, the less likely it is that bugs will appear when it inevitably does change.

  • Weight of a class, which quantifies whether a class reveals more data than behavior. This is important since revealing too much data is a sign that the code is poorly encapsulated, making it harder to understand and more prone to error when it is changed.
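To give a feel for how such numbers can be derived, here is a sketch in Python. It uses the classic maintainability index formula (based on Halstead volume, cyclomatic complexity, and lines of code) rescaled to 0 to 100, and computes weight of a class as the share of public members that are functional methods. Whether Dart Code Metrics calculates its metrics in exactly this way is an assumption, and the sample inputs are made up.

```python
import math


def maintainability_index(halstead_volume, cyclomatic_complexity, lines_of_code):
    """Classic maintainability index, rescaled to the range 0..100.

    halstead_volume reflects the operators and operands in the code, and
    cyclomatic_complexity is the number of independent paths through it.
    """
    raw = (
        171
        - 5.2 * math.log(halstead_volume)
        - 0.23 * cyclomatic_complexity
        - 16.2 * math.log(lines_of_code)
    )
    return max(0.0, raw * 100 / 171)


def weight_of_class(functional_public_methods, total_public_members):
    """Share of a class's public interface that is behavior rather than data."""
    return functional_public_methods / total_public_members


# A small, simple function scores higher than a long, branchy one.
small = maintainability_index(halstead_volume=100, cyclomatic_complexity=2, lines_of_code=10)
large = maintainability_index(halstead_volume=5000, cyclomatic_complexity=25, lines_of_code=400)
print(small > large)  # True

# A class with 2 functional methods among 8 public members exposes mostly data.
print(weight_of_class(2, 8))  # 0.25
```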


You can also set rules such as disallowing magic numbers (numbers used directly in the code without a name). Magic numbers tend to turn into legacy code whose underlying rules have been forgotten, which makes the code a nightmare to work with when documentation is poor.
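A rule like this is a good illustration of how static analysis inspects code without running it. The sketch below is a toy checker built on Python’s standard ast module, not the Dart Code Metrics rule; the allow-list of 0 and 1 and the ALL_CAPS naming convention for constants are assumptions.

```python
import ast

ALLOWED_LITERALS = {0, 1}  # hypothetical allow-list of harmless numbers


def find_magic_numbers(source):
    """Return (line, value) pairs for numeric literals that are not named constants."""
    tree = ast.parse(source)  # parses the code; never executes it

    # Literals assigned directly to ALL_CAPS names count as named constants.
    named = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Constant):
            if all(isinstance(t, ast.Name) and t.id.isupper() for t in node.targets):
                named.add(id(node.value))

    findings = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
            and node.value not in ALLOWED_LITERALS
            and id(node) not in named
        ):
            findings.append((node.lineno, node.value))
    return sorted(findings)


SAMPLE = '''
MAX_RETRIES = 3

def fee(amount):
    return amount * 0.0275 + 1
'''

print(find_magic_numbers(SAMPLE))  # [(5, 0.0275)]
```

The named constant MAX_RETRIES is accepted, while the unexplained 0.0275 on line 5 of the sample is reported.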

There are many static analysis tools on the market. SonarSource, which has several static analysis offerings, counts MasterCard, Moody’s, and Vanguard among its customers.

Conclusion

In this article I have shown three ways to improve your unit testing process, using tools and metrics that are easy to adopt at your company. As with test coverage analysis, none of these techniques by itself will give you perfect results. The more of them you can integrate into coding practices at your company, your team, or even your open source project, the more effective your test suite will be.

Further Reading

Kill the Mutants! - Nico Jansen and Simon de Lang

Complex Problem Solving - Martijn Mass

Stryker Mutator

Dart Code Metrics

SonarSource

Beautiful Testing: Leading Professionals Reveal How They Improve Software (Theory in Practice) - various authors

Author Bio

Eduardo Fischer is a quality assurance engineer who works on the Mobile squad for Avenue Securities, an international retail stockbroker. He discovered his passion for testing during his time as a back-end developer and eventually made testing his full-time job. You can find him on LinkedIn: https://www.linkedin.com/mwlite/in/eduardo-fischer-dos-santos
