How To Build A Performance Testing Stack From Scratch: Performance Tests And Benchmarks

How To Build A Performance Testing Stack From Scratch: Article 3 of 5

Having put together a performance plan with your stakeholders and acquired a solid grasp of statistics, you know what you want to test and how to interpret the results.

It’s now time to select tests and benchmarks to generate those results.

Benchmarks are a specific kind of test: they measure the performance of the system while the test is running. The term benchmark has several definitions, e.g. a standard, point of reference, or value against which things may be compared, but throughout this series we will use it solely to mean a specific type of performance test. The terms test, performance test, and benchmark will be used interchangeably.

Many performance tests and benchmarks are publicly available. Below, we'll cover the different types of benchmarks, their pros and cons, and pitfalls to watch out for when choosing tests. We’ll also look at some of the most popular benchmarks and discuss best practices for writing your own. Regardless of whether you use an existing test or write your own, you’ll need to verify that the test is measuring what you expect.

1. The Benchmark Hierarchy

Benchmarks can be divided into three categories:

  • macro benchmarks
  • micro benchmarks
  • nano benchmarks

These are ordered by scope, from largest (macro benchmarks) to smallest (nano benchmarks). Each targets a different level of software component, and the smaller the benchmark, the further it moves from real-world scenarios.

For example, macro benchmarks should accurately model the way customers use your product. At the other end of the spectrum, nano benchmarks measure the performance of the smallest operations in your software, like adding numbers together or calling functions. Micro benchmarks lie somewhere in the middle: they are not concerned with emulating user behaviour, but they measure larger pieces of software than low-level operations.

The boundary between these levels isn’t exact; it’s more fluid than that. Different benchmark types have different tradeoffs and it’s useful to have some idea which one will give you the performance data you need.

1.1 Macro Benchmarks

Macro benchmarks, or more commonly benchmarks, accurately model a real-life workload. This usually means that they cover much more code than any other type of benchmark. Macro benchmarks tend to target major components of your product and use them in ways that mirror the way your users will.

For example, a performance test that measures the time to render a graph in a web dashboard would be a macro benchmark. That’s the kind of thing a user would want to do, and many software operations take place during the rendering.
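To make this concrete, here’s a minimal sketch in Python of what the timing side of such a macro benchmark might look like. The dashboard URL is hypothetical, and a fuller macro benchmark would drive a real browser rather than a bare HTTP request:

    import time
    import urllib.request

    # Hypothetical dashboard URL, used purely for illustration.
    DASHBOARD_URL = "https://dashboard.example.com/graphs/cpu?range=24h"

    start = time.perf_counter()
    with urllib.request.urlopen(DASHBOARD_URL) as response:
        body = response.read()  # pull down the fully rendered graph page
    elapsed = time.perf_counter() - start

    print(f"rendered {len(body)} bytes in {elapsed:.3f}s")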

1.2 Micro Benchmarks

Micro benchmarks target a specific part of a software product, usually a single component. They are not intended to be representative of user behaviour. These benchmarks can be very useful for capacity planning and discovering bottlenecks that might affect your product in the future as the software grows.

The best use of micro benchmarks is to help optimise the performance-critical parts of your software, i.e. those parts that are executed most frequently. This does require that you first identify those critical parts. Because they focus on a single software component, micro benchmarks can detect small changes in performance for that component.

Continuing with the graph rendering example above, a performance test that measures the average time to read the graph data from a file would be an example of a micro benchmark because it targets a single piece of the software: file I/O logic.
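A minimal sketch of that micro benchmark using Python’s standard timeit module (the file name and format are hypothetical):

    import timeit

    def load_graph_data(path="graph_data.csv"):  # hypothetical data file
        with open(path, "rb") as f:
            return f.read()

    # Each run times 100 calls; report the best-of-five average per call.
    runs = timeit.repeat(load_graph_data, number=100, repeat=5)
    print(f"average read time: {min(runs) / 100 * 1e3:.3f} ms")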

1.3 Nano Benchmarks

Nano benchmarks measure the performance of the smallest units of execution in your product, such as the time to add two integers together or call a function.
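As an illustration, a nano benchmark for integer addition might look like the following Python sketch. Note that at this scale the measurement loop itself costs about as much as the operation being measured:

    import timeit

    # Time one million integer additions, repeated five times.
    runs = timeit.repeat("a + b", setup="a, b = 3, 5",
                         number=1_000_000, repeat=5)
    per_op_ns = min(runs) / 1_000_000 * 1e9
    print(f"integer add: ~{per_op_ns:.1f} ns per operation")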

One of the biggest problems with nano benchmarks is that they are extremely susceptible to interference from other running tasks and to the architecture of the underlying hardware platform. You may see wildly different results from run to run depending on whether the nano benchmark fits within the processor’s cache.

Nano benchmarks can also lead you to test things that are not performance-critical because they’re so myopic, such as measuring the cost of floating-point calculations even though your product mainly uses integer arithmetic.

Unless you’re working on a low-level library or an operating system, it’s unlikely that you’ll need to use nano benchmarks in your testing. None of the examples discussed below are nano benchmarks.

2. Picking Tests

There are a variety of tests and benchmarks available for most workloads, and Google is an excellent resource for finding tests you can use. Some of the more popular ones are discussed below.

For each test, it’s helpful to keep in mind whether it’s a macro benchmark, micro benchmark, or nano benchmark since they each have different characteristics.

2.1 Load Testing

Many products are built as web applications, and understanding the maximum load they can handle is an important part of testing their performance. There are a number of load testing tools available; they vary in the language they’re implemented in (Node.js, Java, Python, etc.) and in whether they’re open source (HP’s LoadRunner being a proprietary example).

There are a number of things to consider when picking load testing tools, such as:

  • license
  • programming language implementation
  • scripting support

Some of the more popular load testing tools are:

  • Apache JMeter
  • Gatling.io
  • Artillery.io
  • Locust.io

Since these usually connect over HTTP (though that isn’t required), they’re classed as macro benchmark tools because they test major interfaces of your product. The tests created with these tools usually report things like average response time in milliseconds and average throughput rate in MB/sec.
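As one example, a minimal Locust test is just a small Python file. The /dashboard endpoint below is hypothetical; what matters is the shape of the test, not the specific URL:

    from locust import HttpUser, task, between

    class DashboardUser(HttpUser):
        # Simulated think time between requests for each virtual user.
        wait_time = between(1, 5)

        @task
        def render_dashboard(self):
            self.client.get("/dashboard")  # hypothetical endpoint

Running locust -f locustfile.py --host https://your-app.example.com then lets you ramp up the number of simulated users while watching response times and throughput.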

2.2 Database Testing

Most applications today talk to a database, either to read or store data. The performance of the database component, whether SQL or NoSQL, is critical for many products.

Some databases come with their own performance tools. For example, PostgreSQL comes with a benchmark named pgbench and MySQL comes with a benchmark suite.
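For example, a pgbench run could be driven from Python with something like the following sketch. The database name benchdb is hypothetical, and the flags initialise a test database at scale factor 10 before running 8 clients for 60 seconds:

    import subprocess

    # Assumes a database named "benchdb" already exists (e.g. via createdb).
    # One-time initialisation: create pgbench's tables at scale factor 10.
    subprocess.run(["pgbench", "-i", "-s", "10", "benchdb"], check=True)

    # Run the default TPC-B-like workload: 8 clients, 2 worker threads, 60s.
    result = subprocess.run(
        ["pgbench", "-c", "8", "-j", "2", "-T", "60", "benchdb"],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)  # includes average latency and transactions per second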

Sysbench is an open source benchmark tool with drivers to test the performance of databases (PostgreSQL, MySQL), file I/O, memory accesses, and CPU speed. Its workloads are written in the Lua programming language, which makes it very easy to control with scripts. The majority of the tests that come with sysbench are micro benchmarks because they target individual components of the database stack, e.g. CPU, file I/O, memory, locking primitives, and threading.

Yahoo! Cloud Serving Benchmark (YCSB) is a popular benchmark for comparing different databases and has been used in a number of experiments by different database vendors. It reports throughput and latency, and includes a set of workload scenarios that cover common application workloads. YCSB’s workloads were designed to be easily updated and extended. Since the core workloads model real-world scenarios, YCSB is a macro benchmark.
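YCSB splits a benchmark into a load phase, which populates the database, and a run phase, which executes the workload. A sketch of driving both phases from Python, using the "basic" dummy binding and the workloada file that ship with YCSB (run from the root of the YCSB distribution):

    import subprocess

    # Phase 1: load the dataset described by the workload definition.
    subprocess.run(["bin/ycsb", "load", "basic", "-P", "workloads/workloada"],
                   check=True)

    # Phase 2: run the workload; YCSB reports throughput and latency.
    subprocess.run(["bin/ycsb", "run", "basic", "-P", "workloads/workloada"],
                   check=True)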

2.3 Network Testing

Network performance is important because slow networks will affect your users, and may even make your product unavailable to them.

Performance testing of the network may also turn up intermittent issues that are not easily identified during normal operation, such as packet loss at high load. Monitoring the network under controlled conditions makes it simple to identify and diagnose these kinds of issues.

Measuring performance allows you to be proactive about impending failures, catching them before they occur and affect users. It also helps you understand any bottlenecks users might hit.

The two main transport protocols, Transmission Control Protocol (TCP) and User Datagram Protocol (UDP), can have very different performance characteristics, so it’s a good idea to test whichever one your product uses, or both if it uses both.

Some popular benchmarks for testing network performance are:

  • iperf
  • netperf

Both support TCP and UDP, IPv4 and IPv6, and a variety of options to tune the performance tests that are executed.

Both of these test packages are examples of macro benchmark tests because they test multiple operations: connection setup, data transfer, connection teardown.
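As an illustration, the sketch below drives iperf3 from Python and pulls the TCP throughput out of its JSON report. It assumes an iperf3 server is already running on a (hypothetical) target host, and the field names reflect iperf3’s JSON output format:

    import json
    import subprocess

    # Requires "iperf3 -s" running on server.example.com (hypothetical host).
    out = subprocess.run(
        ["iperf3", "-c", "server.example.com", "-t", "10", "-J"],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(out.stdout)
    bps = report["end"]["sum_received"]["bits_per_second"]
    print(f"TCP throughput: {bps / 1e6:.1f} Mbit/s")
    # Pass "-u" (with a bandwidth target such as "-b", "100M") to test UDP.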

2.4 I/O Testing

Depending on the architecture of the application, you might want to measure the performance of the disk subsystem.

I/O testing is most useful for those products where I/O is a critical part of the code path, and where poor performance is visible to the user. This isn’t true for every product since caching can hide the performance of the I/O subsystem.

But if your product frequently performs disk accesses, you should have some idea of the best, typical, and worst performance.

Popular disk and filesystem I/O benchmark packages include:

  • DBENCH
  • fio
  • iozone

Disk and filesystem benchmark tools are very complex to configure, which makes it all the more important that parameters are not copied and pasted from elsewhere, e.g. from examples found on the web. You need to select parameters and parameter values methodically because the results will be specific to your environment.

Both DBENCH and fio are driven by configuration files that describe a workload. This means that they can be classed as either micro benchmark or macro benchmark tools depending on the contents of the workload file.

DBENCH and fio are excellent tools for experimentation and reproducing known performance issues because they allow you to recreate an application’s I/O patterns by specifying commands in a workload file.
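A sketch of that workflow: write a small fio job file approximating an application’s I/O pattern, then run it. The parameter values below are purely illustrative; as noted above, you need to choose values that match your own environment:

    import subprocess
    import textwrap

    # A hypothetical job: 60 seconds of 4KiB random reads at queue depth 16.
    job = textwrap.dedent("""\
        [global]
        ioengine=libaio
        direct=1
        runtime=60
        time_based

        [random-read]
        rw=randread
        bs=4k
        size=1g
        iodepth=16
    """)

    with open("app-pattern.fio", "w") as f:
        f.write(job)

    # fio prints per-job latency and bandwidth statistics when the run ends.
    subprocess.run(["fio", "app-pattern.fio"], check=True)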

On the other hand, iozone is a micro benchmark tool because it measures the performance of individual I/O operations such as read, write, and mmap.

2.5 Writing Your Own

Sometimes you might find that a performance test does not exist for your use case and you need to create one. There are some things to be aware of when doing this, and some best practices to follow:

  1. Try not to write your own test harness. Many already exist, such as JMH for Java, benchmark.js for JavaScript, and Google’s benchmark library for C++. Ruby and Python also have benchmark libraries: benchmark and perf, respectively.
  2. Decide what type of benchmark you want to write, so that you understand the tradeoffs in terms of accuracy and scope.
  3. Beware of short-running tests: they are easily perturbed by OS jitter.
  4. Don’t just report the average (mean); help readers understand the distribution of the data by also reporting percentiles (see the sketch after this list).
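To illustrate the last two points, here is a minimal Python harness sketch that runs a function enough times to smooth out jitter and reports percentiles alongside the mean (statistics.quantiles requires Python 3.8 or later):

    import statistics
    import time

    def benchmark(fn, iterations=1000):
        samples = []
        for _ in range(iterations):
            start = time.perf_counter()
            fn()
            samples.append(time.perf_counter() - start)
        qs = statistics.quantiles(samples, n=100)  # 99 percentile cut points
        print(f"mean={statistics.mean(samples) * 1e6:.1f}us "
              f"p50={qs[49] * 1e6:.1f}us "
              f"p95={qs[94] * 1e6:.1f}us "
              f"p99={qs[98] * 1e6:.1f}us")

    benchmark(lambda: sum(range(1000)))  # example workload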

3. Validating Tests

Whether you decide to use an existing test or write your own, it’s important to check that the test measures something important and that it measures it correctly.

Writing benchmark tests can be hard and even popular ones have been known to contain bugs that give incorrect performance figures. Always ensure that you have the latest version and monitor the benchmark project for new releases.

If possible, it’s a good idea to analyse the test with profiling and tracing to verify that it correctly tests what it claims. At a minimum, you can use the time command to check that the test duration was measured accurately by the test.
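For example, one quick sanity check is to compare your own wall-clock measurement of a run against the duration the benchmark reports for itself. On the command line, time ./my_benchmark does this; the Python equivalent is a short sketch (./my_benchmark is a hypothetical test binary):

    import subprocess
    import time

    start = time.perf_counter()
    subprocess.run(["./my_benchmark"], check=True)  # hypothetical test binary
    wall_clock = time.perf_counter() - start

    # If the benchmark reports a very different duration than this,
    # its internal timing is suspect and worth investigating.
    print(f"external wall-clock time: {wall_clock:.2f}s")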

While functional tests verify that software behaves predictably, performance tests measure the speed of operations; the two types of tests have very different objectives and designs.

Because of this, it is a bad idea to use a functional test as part of your performance work even if the test provides seemingly useful statistics and measurements. If a performance test doesn’t exist for your workload, consider writing your own (using the guidelines from 2.5) instead of repurposing a functional test.

There are a number of differences between functional and performance tests:

  • The performance of short-lived tests (short runtimes being a typical goal of functional tests) is hard to analyse.
  • Performance tests use well-tested timing library functions.
  • Functional tests return a pass/fail status, while performance tests can return any range of numbers.
  • Functional tests typically outnumber performance tests; every feature needs to work correctly, but they’re not all critical to performance.

4. What’s Up Next

There are three different types of benchmarks, and you’ll need to choose among them when selecting performance tests for your product. Many popular benchmark tools exist for common workloads, but if you can’t find one that fits your scenario, you can use the guidelines discussed above to write your own.

Whichever tests and benchmarks you pick, be sure to verify that they behave as you expect and that they measure performance correctly. Using external tools and utilities such as the time command on Linux can give a quick indication of whether the test is behaving correctly.

Now that you understand the different types of benchmarks and have seen examples of some common ones, you’re ready to start running them. Armed with your statistical knowledge from the section on statistics for testers, you’ll also be able to interpret the results to draw conclusions.

Next, we’ll look at some of the best practices for running tests, such as using a common testing framework (if available), configuring the test environment, and making sure test results are not invalidated by ignoring errors.

Matt Fleming

Matt is a Senior Performance Engineer at SUSE where he’s responsible for improving the performance of the Linux kernel. You can find him on Twitter, and writing about Linux performance on his personal blog.


