In the last section, goals were set with stakeholders to understand what metrics to measure, and the best way to share the results. This section will look at the skills needed to draw conclusions from those results and detect changes in the performance of the product.
Understanding Statistics
This is the domain of statistics. It’s important to have some understanding of statistics because the output of many performance tests and benchmarks will contain summary statistics such as the minimum, maximum, and mean.
These summary statistics provide a single number that you can use to judge the health of your application. The minimum, maximum, and mean show the lowest, highest, and average values for the duration of actions (latency), or for the speed at which data can be transferred by your product (throughput).
If you’re writing your own tests, you need to know which statistic to use for a given test scenario.
If you're nervous about anything maths-related, don't worry. We only need to look at a small region within the vast topic of statistics. And everything in this article will be backed by real-world examples.
1. Latency
Assume you spoke with the Product Development team (one of your stakeholders) and because your application contains an in-memory database, everyone agreed to measure the performance of database queries. Specifically, you want to measure the time it takes for queries to complete and return a result to the user: this is the query latency.
Latency is how long something takes: you perform one specific action and measure the duration from start to completion. That duration is the latency of the action. If an e-commerce site suffers high latency, web pages load slowly for users, which causes a poor user experience. Low latency is better than high latency because users expect near-instant responses to their actions.
Suppose that you needed to write your own test because an existing one wasn’t available or usable. Modern software is complex, and it’s unlikely that you will see the same latency every time you run your query. Even so, having a single latency value is desirable because it’s much easier to reason about a single value.
To calculate the typical latency, you’ll want to send the same query multiple times and calculate the average. The average latency gives you the typical query duration that you can expect to see each time you perform a database query.
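As a sketch of how this might look in practice, the snippet below times a query function repeatedly and averages the results with Python’s `statistics.mean`. The names `measure_latency` and `run_query` are hypothetical stand-ins, not part of any real test harness:

```python
import statistics
import time

def measure_latency(action, runs=100):
    """Run `action` repeatedly and return the mean latency in milliseconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        latencies.append((time.perf_counter() - start) * 1000)  # convert to ms
    return statistics.mean(latencies)

def run_query():
    time.sleep(0.001)  # hypothetical stand-in for a real ~1ms database query

print(f"mean latency: {measure_latency(run_query):.2f} ms")
```

`time.perf_counter` is used because it’s a monotonic, high-resolution clock, which makes it better suited to timing short durations than wall-clock functions.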
Developer tools often allow latency to be recorded. Here’s the Google Chrome Developer Tools view for measuring network latency, which provides a waterfall diagram showing how long each element took to load.
Waterfall diagram of network latency using Chrome DevTools
1.1 Arithmetic Mean
The arithmetic mean is the "average" that we use commonly in our daily lives. When a dinner bill is split evenly between you and other members of the dinner party, you're calculating the arithmetic mean of the cost of your meals.
Apart from a few cases that we'll discuss later, the arithmetic mean is a very good indication of the typical value of a collection of latencies, which is why it’s one of the statistics used by many popular benchmarks.
The major downside of the arithmetic mean is that it is susceptible to outliers: latency values that are very different from the arithmetic mean. These extreme values can skew the arithmetic mean, and they can be caused by any number of things in your product, such as software bugs or poor internet connectivity. Looking at the minimum and maximum latencies can give you some hint as to whether outliers are present in your latency data.
Only a single extreme value is needed to perturb the arithmetic mean. The following table of hypothetical data demonstrates how the arithmetic mean becomes skewed as the maximum becomes more extreme. The calculations are done assuming:
- 100 latency values were collected
- 99 of those latency values were identical (the value 15)
- 1 latency value was different to all the others and it was the maximum value
Table of hypothetical data showing the effect of increasing maximum outliers on the arithmetic mean
| Number of latency values | 99% of latency values | Mean | Maximum |
|---|---|---|---|
| 100 | 15 | 15.01 | 16 |
| 100 | 15 | 16.35 | 150 |
| 100 | 15 | 29.85 | 1500 |
| 100 | 15 | 164.85 | 15000 |
This is a worst-case example because you wouldn’t expect 99% of the latency values to be the same in the real-world. It does, however, demonstrate how quickly things can become skewed because of a single outlier.
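If you want to see this effect for yourself, the numbers in the table can be reproduced in a few lines of Python:

```python
import statistics

# 99 identical latencies (value 15) plus a single outlier as the maximum.
for maximum in (16, 150, 1500, 15000):
    data = [15] * 99 + [maximum]
    print(f"max={maximum:>6}  mean={statistics.mean(data):.2f}")
```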
Examples of typical benchmarks that use the arithmetic mean are covered later, in the section titled Performance Tests And Benchmarks.
While the arithmetic mean is skewed by outliers, there is another statistic that doesn’t suffer from that problem.
1.2 Median
The median is simply the middle number when all the latencies are sorted in non-descending order; it sits exactly halfway through the sorted data. Put another way: 50% of the latencies are at or below the median and 50% are at or above it.
Most of the time, the arithmetic mean and the median will be very close to each other, but not when the latency data contain outliers. You can use the median to ignore outliers and still know the typical latency.
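A quick sketch shows this robustness: using Python’s `statistics` module, a single extreme outlier drags the mean far from the typical value but leaves the median untouched.

```python
import statistics

data = [15] * 99 + [15000]      # 99 typical latencies plus one extreme outlier
print(statistics.mean(data))    # 164.85 -- badly skewed by the outlier
print(statistics.median(data))  # 15.0 -- unaffected
```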
If you've ever wondered why the median is used to report the average house price, or why national average income uses the median, it’s because the median gives you the value in the middle of your data, and most house prices cluster around that middle value. House prices in a capital city, like London, skew the mean but not the median.
The table from our example above is reproduced below and updated to include the median.
Table of hypothetical data showing the benefit of using the median over the arithmetic mean when your data has maximum outliers
| Number of latency values | 99% of latency values | Mean | Median | Maximum |
|---|---|---|---|---|
| 100 | 15 | 15.01 | 15 | 16 |
| 100 | 15 | 16.35 | 15 | 150 |
| 100 | 15 | 29.85 | 15 | 1500 |
| 100 | 15 | 164.85 | 15 | 15000 |
The median is an example of a percentile; it’s the 50th percentile. Looking at other percentiles can be useful to more fully understand your data, e.g. if you’re unsure whether most of your latency values are near the arithmetic mean or median.
1.3 Percentiles
More generally, the Xth percentile has X% values at or below it and (100-X)% values above it. For example, if you sort all the latency data in non-descending order, the 75th percentile will be equal to or greater than 75% of the latencies and less than the top 25%.
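There are several ways to compute a percentile; the sketch below uses the simple nearest-rank method, which picks the smallest value covering at least X% of the sorted data. The `percentile` helper is illustrative, not a library function:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies = list(range(1, 101))   # 1..100 makes the results easy to check
print(percentile(latencies, 50))  # 50 (the median)
print(percentile(latencies, 75))  # 75
print(percentile(latencies, 100)) # 100 (the maximum)
```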
Percentiles are useful in analysing latency because they give you the full picture of what your users will see, outliers and all. You can easily see what percentage of users typically experience a specific latency value.
Knowing percentiles is important for many modern web applications with many users. For example, assume that you typically have 5000 visitors to your product’s website every day and that you collected a day’s worth of latency values. Assume the following statistics:
- 0th percentile is 20ms (the minimum)
- 50th percentile is 50ms (the median)
- 75th percentile is 400ms
- 99th percentile is 490ms
- 100th percentile is 500ms (the maximum)
While it’s important to know the median value (50th percentile) of 50ms, we get a lot of useful and valuable information from knowing the other percentile latency values. For instance, the 75th percentile shows us that 75% of visitors (3750) experienced latencies at or under 400ms, and 25% of visitors saw latency of more than 400ms. That’s 1250 users in a single day. Even the 99th percentile, which, relatively speaking, accounts for nearly all of your daily users, still means that 50 visitors see a latency of more than 490ms every day.
In this example, if you were to solely focus on the median value, you’d miss the big picture and lose an understanding of what thousands of users were experiencing when visiting your product’s website.
2. Throughput
Latency is concerned with measuring how long it takes to do a fixed amount of work, but throughput is about measuring the amount of work completed in a fixed amount of time. Throughput provides you with an understanding of how many transactions, requests, or operations your product can perform per time interval, i.e.:
- network packets processed per second
- bytes transferred from disk per second
- transactions per day
Throughput is expressed as a rate, which means that using the arithmetic mean in calculations can make your average throughput rates inaccurate. You should use the harmonic mean instead.
2.1 Harmonic Mean
The harmonic mean is similar to the arithmetic mean, only instead of adding values together and dividing by however many values you added, you divide the number of values by the sum of the reciprocals of the values.
Consider this example: running a hard disk read-throughput benchmark 3 times produced the following throughput rates, 240MB/s, 350MB/s and 300MB/s.
If you want to know the average throughput of those 3 runs, you can use either the arithmetic mean or the harmonic mean. Notice the different results you get from each:
Table showing the different average throughput rates when using the arithmetic mean and the harmonic mean
| Mean | Calculation | Result |
|---|---|---|
| Arithmetic Mean | (240+350+300)/3 | 296.67 MB/s |
| Harmonic Mean | 3/(1/240 + 1/350 + 1/300) | 289.66 MB/s |
It’s unlikely that you’ll ever need to calculate the harmonic mean by hand since most test harnesses and statistics packages provide an implementation. Still, understanding when to use the harmonic mean is useful for verifying that the correct mean is being used when averaging rates. If you need to find the mean of a collection of speeds or ratios (rates), the harmonic mean is the correct method to use.
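For example, Python’s `statistics` module provides both means, so reproducing the calculations above takes only a few lines:

```python
import statistics

rates = [240, 350, 300]  # MB/s from the three benchmark runs
print(f"arithmetic mean: {statistics.mean(rates):.2f} MB/s")          # 296.67
print(f"harmonic mean:   {statistics.harmonic_mean(rates):.2f} MB/s") # 289.66
```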
3. Statistical Significance
When you see two different test results for the same test, you might want to know whether they’re different because something changed in the code you’re testing, or due to random chance.
A statistically significant difference in results is a difference that is not caused by chance. Testing for statistical significance can be used to detect when a software change has introduced an observed change in performance.
Differences that are not statistically significant are said to be in the noise, or irrelevant.
Being able to detect performance changes is both good for catching unintended changes, such as when a bug is introduced, and also confirming that intentional performance changes, such as software optimisations or better algorithms, actually make the software more performant.
The most common way to detect significance is by using the standard deviation.
3.1 Standard Deviation
The standard deviation measures the spread of your results. It measures how far an individual data point is, on average, from the mean value.
This is very useful information because it gives you some idea of how variable your test is; tests with less variability give more consistent results and are able to detect smaller changes in performance.
There is a rule for determining whether a metric result is within the expected value range based on the metric’s standard deviation (stddev), and that rule is called the 3-sigma rule of thumb or the 68-95-99.7 rule.
The rule states that:
- 68% of values lie within 1 standard deviation of the mean
- 95% of values lie within 2 standard deviations of the mean
- 99.7% of values lie within 3 standard deviations of the mean
In other words, if you have a value that lies more than 3 standard deviations away from the mean, it’s almost certainly caused by a genuine change in performance.
Here’s an example of how to use this rule in practice: let's assume you ran a disk read throughput benchmark twice and gathered the following statistics:
Table showing the throughput rates and calculated Stddev of 2 test runs
| Run | Throughput | Stddev |
|---|---|---|
| 1. | 35 MB/s | 0.34 |
| 2. | 32 MB/s | 0.34 |
Let’s also assume that the software may or may not have been changed between runs 1. and 2. The question you want to answer is: are these values different because of a change in performance, or is the difference due to random chance?
The unit of the standard deviation is the same as the mean, so in run 1. the standard deviation is 0.34 MB/s. We can use the 3-sigma rule of thumb to calculate the range of expected values around run 1.’s mean:
35 + 3*0.34 = 36.02 MB/s (high end)
35 - 3*0.34 = 33.98 MB/s (low end)
This means that the throughput rate from 2. is almost certainly caused by a change in performance because it’s outside of the range 33.98-36.02 MB/s.
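This check is easy to automate. A minimal sketch, using the run 1. statistics from the table above (the helper name is illustrative):

```python
def outside_three_sigma(mean, stddev, value):
    """Return True when `value` falls outside mean +/- 3 standard deviations."""
    return not (mean - 3 * stddev <= value <= mean + 3 * stddev)

# Run 1 averaged 35 MB/s with a stddev of 0.34; run 2 measured 32 MB/s.
print(outside_three_sigma(35, 0.34, 32))  # True: very likely a real change
```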
3.2 Coefficient Of Variation
Sometimes it’s useful to convert a standard deviation into a unitless number, particularly when comparing standard deviations that were computed from different means. This can be done with the coefficient of variation: the standard deviation divided by the mean.
It’s common to express the coefficient of variation as a percentage, which can give a more intuitive feel for the variance in your test results.
Using the same example as from section 3.1, we can rewrite the standard deviations as coefficients of variation (CV):
Table showing the benefit of calculating the CV from the Stddev
| Run | Throughput | Stddev | CV |
|---|---|---|---|
| 1. | 35 MB/s | 0.34 | 0.97% |
| 2. | 32 MB/s | 0.34 | 1.06% |
Applying the 3-sigma rule of thumb using the CV leads to the same conclusion as in section 3.1.
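Since the CV is just the standard deviation expressed as a percentage of the mean, the table’s values can be reproduced with a one-line helper (`cv_percent` is an illustrative name, not a library function):

```python
def cv_percent(mean, stddev):
    """Coefficient of variation: the stddev as a percentage of the mean."""
    return stddev / mean * 100

print(f"run 1: {cv_percent(35, 0.34):.2f}%")  # 0.97%
print(f"run 2: {cv_percent(32, 0.34):.2f}%")  # 1.06%
```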
3.3 Distributions
A distribution is a collection of values from a performance test, usually sorted in order from lowest to highest. The shape of the distribution can tell you lots of interesting facts about your performance data.
Many distributions closely resemble a bell shape, also known as the normal distribution.
Most of the statistics covered so far work best when your data is normally distributed, but it is possible for your data to have more than one most common value (mode). These are called multimodal distributions, and they can produce misleading results for the mean, median, and standard deviation.
Here’s an example of a multimodal distribution graph taken from Wikipedia that shows the body length of 300 weaver ant workers. You can see that it has two prominent peaks which show the two typical values for body length.
Multimodal distribution graph showing the body length (mm) of 300 weaver ant workers. Source: Wikipedia
Multimodal distributions are quite common in software that has any level of data caching, e.g. the operating system page cache, or the object cache of a web proxy application.
The most robust test for detecting whether the distribution of your data has more than one mode is to plot it on a graph and manually examine it. It would be difficult to see this pattern from only looking at the statistics we’ve covered so far.
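If a plotting library isn’t handy, even a crude text histogram can reveal a second mode. The sketch below buckets hypothetical latency values into fixed-width ranges; two separated clusters of bars suggest a multimodal distribution:

```python
from collections import Counter

def histogram_buckets(values, bucket_size=10):
    """Count how many values fall into each fixed-width bucket."""
    return Counter((v // bucket_size) * bucket_size for v in values)

# Hypothetical latencies (ms) clustered near 20ms (cache hits) and 200ms (cache misses).
latencies = [18, 19, 21, 22, 23, 198, 201, 203, 205]
for start, count in sorted(histogram_buckets(latencies).items()):
    print(f"{start:>4}ms | {'#' * count}")
```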
If your results show a maximum value that is considerably greater than the mean and median, you might want to graph your data (if you haven’t already) and check the distribution of values.
If you find a multimodal distribution, you might want to investigate the cause of the second peak to understand why it’s happening (perhaps due to caching), because eliminating that peak not only makes the mean value more useful, it might speed up your application too.
What’s Up Next
This was a whirlwind tour of statistics that are useful in performance testing, which hopefully provided you with just enough information to help you figure out which statistic to apply in any given situation.
Next, we’ll look at selecting performance tests and metrics for a variety of workloads, and discuss some tips for avoiding the pitfalls of benchmarking.