By Matt Fleming
All of the previous parts of this series laid the groundwork for this final step: interpreting your results.
It’s possible to make mistakes at any time during this series, but the most common missteps in performance testing are made at this point. Correctly analysing results can be difficult unless you follow the principles outlined below.
The output of this step also has a wider impact than anything else because, while only your team is likely to see the data from the previous steps, you should be sharing these results with your stakeholders. You’ll often need to summarise your results and provide guidance to your stakeholders.
That guidance can have a direct impact on your business, for example, if development time needs to be allocated to improving the performance of a known bottleneck in your product which was discovered through performance testing.
The format of your results directly impacts how easy it is for your stakeholders to understand the work you’ve done. It’s in your interest to make it intuitive and clear, especially if you’re providing recommendations too. The most persuasive results are also the clearest.
What follows are some recommendations for making results as complete and easy to understand as possible.
1.1 Use A Combination Of Numbers And Visualisations
People have different preferences for how they digest data. Providing both numbers and visualisations, such as graphs and charts, ensures everyone can understand your results.
Numbers and visualisations also have different strengths and weaknesses. Graphs are great for looking at trends over time, while single numbers, such as summary statistics, can be quicker to read and can give a simpler answer for whether performance has changed.
Graphs are also excellent for displaying multi-dimensional data, such as when you need to correlate test results with system metrics, i.e. understanding what event caused performance to change. Correlating data this way helps you to establish cause and effect. When you want to understand why your e-commerce app is processing transactions slowly (effect), you might discover that the processors are saturated because there are too many tasks running (cause).
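To make the cause-and-effect idea concrete, here's a minimal Python sketch, using entirely hypothetical data and a hand-rolled Pearson correlation, that checks how strongly transaction latency moves with CPU utilisation:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-minute samples: transaction latency (ms) and CPU utilisation (%).
latency_ms = [12, 14, 13, 45, 52, 48, 15, 13]
cpu_util = [35, 40, 38, 97, 99, 98, 41, 37]

r = pearson(latency_ms, cpu_util)
print(f"correlation: {r:.2f}")  # a value near 1.0 suggests the two move together
```

A strong correlation doesn't prove causation on its own, but it tells you where to look: here, the latency spikes line up with saturated processors.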
1.2 Provide Absolute And Relative Differences
When analysing results, you might be most concerned with the differences between two results, e.g. when looking at results before and after a software update.
The absolute difference of your results is what you get if you subtract the before result from the after result. The units for an absolute difference are the same as the test itself, so if you have a throughput macro benchmark that measures the average rate of data transferred from a hard disk in megabytes per second (MB/s), the absolute difference will also be in MB/s.
Relative differences, on the other hand, are usually expressed as a percentage. The percentage will be positive if performance improved (higher throughput) and negative if it declined (lower throughput).
For example, if version 1.1 of your software product has an average throughput of 600 MB/s and version 1.2 has an average throughput of 625 MB/s, the absolute difference is 25 MB/s and the relative difference is about +4.17%.
Both the absolute and relative differences help you understand how important a change in performance is. Each shows you something that the other doesn’t. For instance, if the average result for your latency test is 5 seconds, you might not be concerned about a 5% increase in latency (250ms), but if your latency test usually took 100ms then maybe you would be concerned by a 5% (5ms) difference; it depends entirely on the test and your product.
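As a quick sanity check on the arithmetic, here's a small Python sketch (the function name is my own) that computes both differences for the throughput example above:

```python
def differences(before, after):
    """Absolute and relative (percentage) difference between two results."""
    absolute = after - before
    relative = absolute / before * 100  # positive means the value increased
    return absolute, relative

# Throughput example from above: 600 MB/s before, 625 MB/s after.
abs_diff, rel_diff = differences(600, 625)
print(f"{abs_diff:+} MB/s ({rel_diff:+.2f}%)")  # +25 MB/s (+4.17%)
```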
1.3 Be Consistent
Using a consistent format for all of your results is one of the best ways to help stakeholders quickly understand your results. For graphs that could mean:
- using the same axis orientation (e.g. always putting time on the bottom)
- having axes run in the same direction (e.g. time running from left to right)
- putting the title, legend, and any other keys in the same spot
It’s much easier for stakeholders to interpret data for results they’ve never seen before with consistent formatting.
2. Use Individual Sample Data
Individual sample data is usually only available if you’ve written the test yourself, or if you’re using low-level tracing that records timestamps. Where samples are available, you should collect them rather than relying on the summary statistics output at the end of most tests.
That’s because individual samples let you look at the full distribution of values for your test and give far more accurate percentiles than is possible from summary statistics such as the mean, minimum, and maximum. They can also make it easier to spot patterns in the data that might suggest performance issues.
Note that calculating summary statistics from other summary statistics, e.g. taking the mean of three runs of a test that outputs a mean score, can be even less helpful than a single summary statistic because the results are smoothed a second time, making performance issues still harder to spot.
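To illustrate why individual samples beat pre-computed summaries, here's a Python sketch using the standard library's statistics.quantiles on hypothetical latency samples; note how the upper percentiles expose an outlier that the mean smooths over:

```python
from statistics import mean, quantiles

# Hypothetical individual latency samples (ms) from one test run.
samples = [101, 99, 102, 98, 100, 103, 97, 100, 250, 99]

cuts = quantiles(samples, n=100)  # the 99 percentile cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

# The mean alone hides the 250 ms sample; the upper percentiles reveal it.
print(f"mean={mean(samples):.1f}ms p50={p50:.1f}ms p95={p95:.1f}ms p99={p99:.1f}ms")
```

None of this is possible if all the test hands you is a single averaged score.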
3. Detecting Results In The Noise
Figuring out whether a change in a test result is statistically significant or not is the single most difficult part of analysing results. Results are rarely identical from run-to-run and there’s usually some difference which needs to be understood.
Your ability to detect changes depends on the type of performance test. Nano benchmarks, when run correctly, should be able to highlight the smallest changes in performance. At the other end of the scale, macro benchmarks are less precise because they’re usually more variable.
You can use the standard deviation (covered in Statistics for Testers) and the 3-sigma rule to get a sense of whether a result is within the expected range.
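As a sketch of how the 3-sigma rule might be applied, here's a small Python check against a hypothetical throughput history (the function and data are illustrative, not a prescribed implementation):

```python
from statistics import mean, stdev

def within_three_sigma(history, new_result):
    """Check whether a new result falls inside mean +/- 3 standard
    deviations of the historical results (the 3-sigma rule)."""
    mu, sigma = mean(history), stdev(history)
    return abs(new_result - mu) <= 3 * sigma

# Hypothetical throughput history (MB/s) from previous runs.
history = [598, 602, 600, 601, 599, 603, 597, 600]
print(within_three_sigma(history, 601))  # True: within normal run-to-run noise
print(within_three_sigma(history, 550))  # False: worth investigating
```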
You can also increase the number of times you run the test (the number of whole-test invocations, not the internal loop count, if you can pass that as a parameter) to generate more data. This is particularly useful if your test emits an average score: the more data you have, the more accurate the displayed average will be.
4. Detecting Outliers
Outliers can skew your results and lead you to misinterpret your data, so you need to detect them. Averages, such as the arithmetic mean, are summary statistics that can hide the presence of outliers. Instead, you should use percentiles or plot the distribution on a graph.
When access to the entire test data, which is required to calculate percentiles, isn’t available, you might want to modify the benchmark’s source code to print percentiles. If the source code isn’t available either, you can use the minimum, maximum, and average to guess whether outliers exist: if either the minimum or the maximum is very different from the average, there’s the potential for outliers.
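That guess can be automated. Below is a rough Python heuristic (the 50% threshold is entirely hypothetical; tune it for your own tests) that flags possible outliers from only the minimum, maximum, and average:

```python
def maybe_has_outliers(minimum, maximum, average, threshold=0.5):
    """Rough heuristic: flag possible outliers when the minimum or maximum
    deviates from the average by more than `threshold` (here 50%) of the
    average. The threshold is a hypothetical choice, not a standard value."""
    spread = threshold * average
    return (average - minimum) > spread or (maximum - average) > spread

# Summary statistics from a hypothetical latency test (ms).
print(maybe_has_outliers(minimum=95, maximum=110, average=100))  # False
print(maybe_has_outliers(minimum=95, maximum=400, average=105))  # True
```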
Graphs are probably the single best tool for understanding the distribution of your data because they visually show data samples relative to each other. If you have outliers in your data, they’ll be displayed at a distance from all the others.
[Figure: scatter graph with an outlier circled in red]
5. Result Precision
The amount of precision (the smallest measurable unit) for your results can vary with each test. Generally speaking, the amount of precision available depends on the type of test. Macro benchmarks have the least precision (minutes or even hours) and nano benchmarks have the most (nanoseconds or microseconds).
Just because a test can produce results with nanosecond resolution doesn’t mean you should always use it. It’s very easy to make performance changes seem more important than they actually are if you don’t decide in advance how large a change needs to be before you investigate its cause.
Precision With Performance Example
Take a hypothetical micro benchmark that measures the latency of sending messages through a message-broker API. If the typical latency for sending a message is 5 microseconds, you probably wouldn’t want to measure the latency down to nanosecond resolution, even if that precision is available. Most people would consider 5.245 microseconds and 5.454 microseconds to be roughly equal, and would round to the nearest whole microsecond.
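One way to encode that decision is a pre-agreed threshold below which changes are ignored. A minimal Python sketch, with a hypothetical 1-microsecond threshold:

```python
def significant_change(before_us, after_us, threshold_us=1.0):
    """Treat a latency change as worth investigating only if it exceeds a
    pre-agreed threshold (here a hypothetical 1 microsecond)."""
    return abs(after_us - before_us) >= threshold_us

# The two measurements from the example above are effectively equal.
print(significant_change(5.245, 5.454))  # False: below the threshold
print(significant_change(5.245, 7.100))  # True: worth investigating
```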
6. If All Else Fails, Use Test Duration
After looking at test variability, if you’re unsure whether performance has changed, you can use the test duration as a final tie-breaker. If you’re comparing two different versions of your product, the one that completes the test quickest has the best performance.
This rule does assume that the exact same number and type of operations were performed during every run of the test. For example, if you’re using a benchmark that tries to account for noise on your test system (see Ensure The Test Duration Is Consistent) and varies the number or type of operations performed, you cannot use test duration to detect performance changes.
Ultimately, the duration is the measure that matters most for latency tests because this is what your users will care about: did the operation complete more quickly?
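A minimal Python sketch of the tie-breaker, using time.perf_counter and two hypothetical workloads standing in for product versions (assuming, as above, that each run performs the exact same operations):

```python
import time

def time_test(fn):
    """Run a test function and return its wall-clock duration in seconds.
    Assumes every run performs the same number and type of operations."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

# Hypothetical workloads standing in for two versions of a product.
def version_a():
    sum(range(100_000))

def version_b():
    sum(range(200_000))

# The version with the shorter duration wins the tie-break.
faster = "A" if time_test(version_a) < time_test(version_b) else "B"
print(f"version {faster} completed the test more quickly")
```

In practice you'd time many runs of each version and compare the distributions, not a single run as shown here.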
7. Delivering Your Results
Analysing your test results is the final step in the process of building a performance testing stack from scratch. Doing this well can be difficult, and it’s easy for testers to run into trouble by misunderstanding test results. The guidelines above can give you a repeatable method of correctly interpreting the data from your performance tests and benchmarks.
Reporting these results to your stakeholders, and working with them to incorporate suggested changes, can allow you to steer your product away from performance hazards and towards a responsive, optimised product that delights your users.
Matt is a Senior Performance Engineer at SUSE where he’s responsible for improving the performance of the Linux kernel. You can find him on Twitter, and writing about Linux performance on his personal blog.