In the past, quality in computing usually meant software quality; data stayed in the background. Although data has long been important in some fields, the rise of machine learning and other data science tools, such as generative AI, has made data quality a far more widespread concern than it was before.
Some aspects of software quality carry over to data quality, but the two disciplines differ in many respects. This article looks at some of the similarities and differences between them. It starts by describing some tools and concepts that are often useful in software quality, then shows how data is different and outlines some tools for working with data quality.
The familiar ground of testing user interfaces
The first concept most new testers encounter is the interface, specifically the user interface. It might be such a common idea that you’ve ceased to notice it, but all software has a user interface. The interface could be a GUI (graphical user interface), CLI (command line interface), API (application programming interface), or specific file formats that you use to submit inputs to the software and to receive outputs from it.
The interface acts as a shell around the functionality of the software. It also represents a contract with the user: if you submit input in a certain way, you can expect output in a certain way. Inside the shell, the software is free to do almost anything, as long as it honours the terms of the contract that the interface represents.
There might be quality problems such as duplication, poor structure, or poor names in the software’s source code. However, as long as the code honours the interface’s contract, then to some degree these can be overlooked.
Using mental models of cause and effect to test user interfaces
Much of the time when we test software we are testing its behaviour. As with interfaces, this is something that we might not even notice because it’s so much part of our everyday lives. We are interested in cause and effect: when this cause occurs (for instance, I select a link) I expect this effect (I am moved to a different part of the software).
Behaviour feeds into a useful tool for exploring software called the mental model. We use mental models all the time without being aware we are doing so. For instance, when we visit a new city, it’s common to gradually build up something like a map in our head of where things are. This map will start off almost empty, with only a small number of places and routes between them, such as the railway station where we arrived and the route from the station to a hotel. As we spend more time in the city, we add more places and more routes between them. This is how we navigate without needing to refer to a map; it is a mental model of the city.
Similarly, we can build up a mental model of software using two main inputs: how the software is built up from components, and the behaviour of each component. The components could be something like different sections of a website or different endpoints of an API. The behaviour is how the software changes from a starting state to another state.
Testing data quality is different from testing a user interface
We now turn to data. Other than the file it came in, the database connection, or some other transport mechanism, data has no interface. There is no meaningful outer shell that detail can hide within, as you have direct access to all the detail.
If the data is in a database table, then the columns of the table can enforce a loose contract. For instance, a FirstName column could be restricted to a text string of no more than 40 characters. However, while the database might enforce that, for instance, a Latitude column is a floating point number, it might not be able to limit the values to the range -90 to +90 or even the range for a particular region such as a country.
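To make this concrete, here is a minimal sketch of a range check that goes beyond what a column type can enforce, assuming the data has been loaded into a pandas DataFrame; the file name and column name are hypothetical.

```python
# A range check that the column type alone cannot enforce.
# "locations.csv" and "Latitude" are hypothetical names used for illustration.
import pandas as pd

df = pd.read_csv("locations.csv")

# The storage layer only guarantees that Latitude is a number;
# the range check is up to us.
out_of_range = df[(df["Latitude"] < -90) | (df["Latitude"] > 90)]
print(f"{len(out_of_range)} rows have an impossible latitude")
```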
Data can often be divided into a small number of large conceptual chunks via things like different tables, where one table is about customers, another about orders, and so on. However, within a table it can be hard to divide data up into meaningful chunks. All you see is 400,000 customers, rather than distinct groups of customers.
So you have no interface, only a loose contract, and very few meaningful chunks to work with. What's more, there’s no cause and effect to help you. Often all you see is the end state of the software that produced the data (in a very large amount of detail). In these cases you can’t monitor the behaviour of the software, just its effect. All these factors mean it’s very hard to build up a mental model that helps you divide the data into meaningful chunks, navigate between them, reason about them, and assess their quality separately.
Can you trust this data?
One consequence of the lack of context described above is how someone trusts (or doesn’t trust) some data. If you see one cell in a spreadsheet and think that its value is wrong, there’s little framework to help you put that into a context. If the spreadsheet is about yields of different crops, for example, you can’t immediately spot that the problem pertains only to wheat grown in the northwest of the country. The lack of framework means there’s little to stop your doubt about one cell's data from spreading to doubt in all the data.
The way doubt in one part of the data spreads to other parts can be well-founded and not just an overreaction caused by the lack of a visible structure. This is because of the old computer adage: garbage in, garbage out. If you put bad data into a data processing pipeline, you are very likely to get bad data out of it. Also, data quality problems tend to get bigger, not smaller.
It’s a bit like a bad batch of a raw ingredient shipped to a food factory that leads to all food made from that ingredient needing to be thrown out. Similarly, if there’s a problem with the data about customers who bought things from an online store, then summaries of all customers (both online and from physical shops) will also be incorrect. This could then affect other data about cash flow, profit and loss, and so on.
If data is so different, how should I begin to test it?
Now that we have established that data quality is hard to get to grips with and that trust in data is fragile, what should we do about it? Working out how to test effectively is an area of similarity between data and software quality, in terms of both the problem and a possible solution.
The problem is similar for data and software quality in that, unfortunately, there is usually more work that could be done than there is time to do it. At least, there’s enough work to do that it’s important to know what to tackle first.
One solution is to look beyond the product (the software or data) to the people using it. How does the software or data help someone to tackle a problem or question they have? With software the problem might be: "I want to buy a train ticket." With data the problem might be: "I want to know the best place to study Physics at university."
Putting yourself in the user’s shoes can help you to break the problem down into smaller problems that are more manageable to investigate. In the case of studying Physics at university, some useful questions to tackle include:
- Which country or countries need to be covered?
- How up to date is the information?
- Is there a link to the subjects and grades of any entry qualifications?
- Is there a link to the jobs and research that graduates go on to?
- Is there qualitative data as well as quantitative data, such as how much students are supported, how happy they feel with the place and course, and so on?
There might be more than one kind of user for the data, so you should consider their relative importance. While parents, careers advisors, and recruiters of graduates might use data about studying Physics at university, their needs are probably less important than those of students. So testing should prioritise areas that students would need.
Putting it another way, instead of checking how many rows of data there are, ask: how useful is this data? To whom?
Assessing data quality
We’ve now established that data quality is important but tricky, and identified a rough area to tackle first by thinking about people who will use the data. What can we do to assess the quality of what we consider to be the most important data?
This is where it helps to have some of the skills of a data analyst, just as it can help to have programming skills when assessing software quality. This might be nothing fancier than looking at the data in a spreadsheet and calculating simple values such as counts, min(imum), max(imum), mean, median, inter-quartile range, and standard deviation. Plotting simple graphs can help, if only to allow an initial “That looks weird” reaction.
If you wanted to, you could go beyond this to tools such as a Jupyter notebook, which is a bit like a Python version of a spreadsheet. A Jupyter notebook, like a spreadsheet, allows you to explore some data as a series of small steps. You can take a small step by adding a graph or a small amount of code to produce some new data and see the result of that step right away. The pairing of small steps and fast feedback means you can spot and fix mistakes quickly, abandon dead ends quickly, and so on. The benefit of a Jupyter notebook with Python is that there are lots of useful Python libraries to help with data crunching and drawing graphs. (Jupyter notebooks can also be used with other languages, but Python is a good choice for data analysis due to the wide availability of libraries.)
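To make this concrete, here is a minimal sketch of the kind of first look described above, using pandas in a notebook or script; the file and column names are hypothetical.

```python
# A first pass over some data: summary statistics and a quick plot.
# "orders.csv" and "order_value" are hypothetical names.
import pandas as pd

orders = pd.read_csv("orders.csv")

# Count, mean, standard deviation, min, quartiles (including the median), max.
print(orders["order_value"].describe())

# A quick histogram, if only to allow a "that looks weird" reaction.
orders["order_value"].plot(kind="hist", bins=50, title="Order values")
```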
Understanding dimensions of data quality
An initial way to break down data quality is into value and trust. What makes data valuable to your users based on the problem they’re trying to solve? What will help users trust it? It might seem an odd thing for data, but consider the user’s experience of the data, just as you would consider the user’s experience of software.
"Value" can still be too big and shapeless an idea to be useful, so data quality is often broken down into several dimensions or aspects. Not all of these will be equally valuable to your users, so you should concentrate on the dimensions that matter most to them. Different people use different-sized lists of data quality dimensions, but a common list is:
- Accuracy: how closely does the data represent the "real" world?
- Completeness: how much data is missing? This could be individual cells being empty, or entire rows or columns missing.
- Validity: does the data comply with expected standards, such as a text column containing valid phone numbers, or a number column containing integers in the range 0-8?
- Consistency: does one part of the data match with all other parts of the data it should? For instance, a product code on an order should match a code in the products table.
- Timeliness: is there too much delay between an event happening in the real world and its representation in the data?
- Uniqueness: is there unexpected repetition in the data? For instance, a single day’s orders from an online store might have been loaded twice.
You can find much longer lists of data quality dimensions online, such as on the DAMA (Data Management Association) wiki.
Testing for data validity
As discussed earlier, a database can enforce validity to a certain degree, such as ensuring that data in a column is text rather than a number. However, you can’t get this from something like a CSV file. And even with a database, you often want to go beyond the guarantees that the database can provide and check, for example, that a value is a valid phone number for a particular country.
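As a sketch of what such a check might look like, the following uses a deliberately simplified regular expression for UK-style phone numbers; the pattern is illustrative rather than a complete validator, and the file and column names are hypothetical.

```python
# A validity check that goes beyond what storage can guarantee.
# The regular expression is a simplified, illustrative pattern for
# UK-style numbers, not a complete validator; names are hypothetical.
import re
import pandas as pd

customers = pd.read_csv("customers.csv")
uk_phone = re.compile(r"^(\+44|0)\d{9,10}$")

def is_valid_phone(value) -> bool:
    # Strip spaces before matching, so values like "07123 456 789" pass.
    return bool(uk_phone.match(str(value).replace(" ", "")))

invalid = customers[~customers["phone"].apply(is_valid_phone)]
print(f"{len(invalid)} of {len(customers)} rows have an invalid phone number")
```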
Testing for data timeliness
To measure timeliness, it is helpful to have at least two timestamps per event, such as a purchase at a shop:
- When the event occurred in the real world
- When the data corresponding to the event was created in the system
You might want other timestamps, such as when the data was last updated. The data for the first timestamp could jump forwards and backwards in time as you ingest data if, for example, the recording of some timestamps is delayed more than others. For instance, you might receive some data relating to last week after you receive data about yesterday.
The second timestamp should always be later than the first. If it isn’t, then this is an error.
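Here is a minimal sketch of both measurements, assuming two hypothetical timestamp columns that pandas can parse as dates.

```python
# Timeliness checks on two hypothetical timestamp columns.
import pandas as pd

events = pd.read_csv("events.csv", parse_dates=["event_time", "created_time"])

# How long after the real-world event was the data created?
lag = events["created_time"] - events["event_time"]
print("Median lag:", lag.median(), "Maximum lag:", lag.max())

# The record should never be created before the event it describes.
impossible = events[events["created_time"] < events["event_time"]]
print(f"{len(impossible)} rows were created before their event occurred")
```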
Testing for data uniqueness
It can be difficult to measure uniqueness. At the simple end of the scale are items such as orders for an online shop that get a unique order ID when they’re created. A uniqueness problem can be detected when two or more rows have the same order ID.
Things get more complicated with items like names and addresses, or any case where there are different ways of expressing the same concept. For example, even though the following are different text strings, they might all refer to the same location:
- 3 St. James Street
- 3 St. James St.
- 3 Saint James Street
- 3, St. James St
Sometimes it’s possible to go from a value with a lot of variability like this to a less varied value, and then check for duplicates based on the second value. In the case of addresses, you might be able to find a latitude and longitude for the address, and then compare rows based on those coordinates rather than via the text.
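The sketch below shows both ends of this scale: an exact check on an order ID that should be unique, and a crude normalisation of address text before checking for duplicates. The column names and normalisation rules are hypothetical; a real system might use a geocoding service to derive coordinates instead.

```python
# Two uniqueness checks: an exact ID check and a normalised-text check.
# Column names and the normalisation rules are hypothetical.
import pandas as pd

orders = pd.read_csv("orders.csv")

# Simple case: an ID that should be unique.
duplicate_ids = orders[orders.duplicated(subset="order_id", keep=False)]
print(f"{len(duplicate_ids)} rows share an order ID with another row")

# Harder case: free-text addresses. Reduce variability before comparing.
def normalise_address(address: str) -> str:
    text = address.lower().replace(",", "").replace(".", "")
    for full, abbreviation in [("street", "st"), ("saint", "st")]:
        text = text.replace(full, abbreviation)
    return " ".join(text.split())

orders["address_key"] = orders["address"].apply(normalise_address)
duplicates = orders[orders.duplicated(subset="address_key", keep=False)]
print(f"{len(duplicates)} rows appear to share a delivery address")
```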
Examining data structure and context
While it can be useful to know that, for example, 17 percent of rows in a file have an invalid value in a particular column, we might be able to produce a more useful assessment of the file’s quality. To do this, it’s worth seeing if a data set, such as a file, can be divided into meaningful parts.
What these parts are will depend on your situation, but they could be things like product line, geographic region, month, business unit, and so on. You might know that the invalid values are all for plumbing supplies, but the values for timber, concrete and so forth are all OK. So you know where it’s safe to have confidence in the data, and the knowledge can be helpful with diagnosing and fixing the underlying problem. In the software interface world it's often easy to see the structure of the input, for example via different endpoints in the API we’re using, or different screens in a GUI. But with data we often have to put effort into deriving a helpful structure.
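As a sketch of that kind of breakdown, the following assumes a hypothetical product_line column and an is_valid flag produced by earlier checks.

```python
# Break an overall error rate down by a meaningful segment.
# "product_line", "unit_price" and the validity rule are hypothetical.
import pandas as pd

rows = pd.read_csv("stock.csv")
rows["is_valid"] = rows["unit_price"] > 0  # stand-in for real validity checks

# Percentage of invalid rows per product line, worst first.
invalid_by_line = (
    rows.groupby("product_line")["is_valid"]
    .apply(lambda valid: 100 * (1 - valid.mean()))
    .sort_values(ascending=False)
)
print(invalid_by_line)
```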
Similarly, what is valid and what isn’t can depend on the context. Is 300kg a valid weight? It’s too heavy for a bar of soap, too light for a family car, and possibly OK for a steel beam.
Is it OK to find 700 people in a data set whose profession is "teacher"? If those people all appear to live in the same town and the town has only 1,000 inhabitants, then that’s probably too many. If the data set is an annual census for a country, and the previous census had 80,000 teachers, then that’s probably too few.
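A context-dependent check like the teacher example can often be expressed as a comparison against other data you hold, or against a previous period. The sketch below compares counts per town with town populations; the file names, column names, and the five percent threshold are all hypothetical.

```python
# A context-dependent plausibility check: teachers per town versus population.
# File names, column names, and the 5% threshold are hypothetical.
import pandas as pd

people = pd.read_csv("people.csv")  # one row per person
towns = pd.read_csv("towns.csv")    # town name and population

teachers = (
    people[people["profession"] == "teacher"]
    .groupby("town")
    .size()
    .rename("teacher_count")
    .reset_index()
)

check = teachers.merge(towns, on="town")
check["teacher_share"] = check["teacher_count"] / check["population"]
print(check[check["teacher_share"] > 0.05])  # suspiciously teacher-heavy towns
```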
Just as structure will vary by situation, so will context.
Moving from detecting problems to preventing them
Another aspect that data quality shares with software quality is that it’s possible to move from simply detecting problems to trying to prevent them. Quality people can help engineers improve their practices, to help spot, reduce, and even prevent quality problems. This is true whether the engineers are software engineers or data engineers.
The practices to consider merit one or more articles of their own, but here is a quick summary of some of them. They all relate to the data processing pipelines that produce the data whose quality matters to us. They can be combined or used on their own.
Improving data quality stage by stage: understanding medallion architecture
The medallion in medallion architecture refers to three labels: bronze, silver, and gold. They allow you to divide the work of ingesting and processing data into three stages, where the quality attributes change as you move from one stage to the next.
First comes the bronze stage. Data arriving in the system, whether through a file import or an API call, is stored in bronze tables (or whatever container for data makes sense for your situation). The aim with bronze tables is to capture accurately what was received, including any problems that were in the data when it arrived. We deliberately apply little or no validation at this step, so that we capture potentially troublesome values rather than excluding them.
Then comes the silver stage. A series of business decisions needs to be made before we can design the data processing pipeline that creates silver data from bronze data. If a value is invalid or missing, what do we do? Ignore the whole row or file? Try to repair it? Something else?
Data in silver tables aims to have individual rows that are as useful as possible, rather than being an accurate reflection of what was received. This means dealing with problems as described above, but also doing things like using a product ID to find details for the relevant product (such as name or description) and adding them to the row.
Finally comes the gold stage. To create gold data from silver data, we aggregate silver rows until they are at the appropriate level of detail. For instance, if silver rows represent hourly readings from a smart meter and downstream uses of the data all need daily totals, we can create the daily aggregates in a gold table.
The quality attributes we apply change with the stage:
- Bronze: accurate
- Silver: useful but possibly too detailed
- Gold: ready for customer use (such as in a data warehouse)
If a row in a gold table of daily meter readings has a value many times lower than we’d expect, having the silver and bronze values can help us to diagnose the problem. For instance:
- Many of the hourly readings didn’t arrive (signalled by too few rows in the bronze table)
- The hourly readings all arrived, but many had invalid values and were rejected (signalled by too few rows in the silver table)
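A minimal sketch of the three stages for the smart meter example follows; the column names and the validity rule are hypothetical, and a real pipeline would write each stage to durable storage rather than hold everything in memory.

```python
# Bronze -> silver -> gold for hourly smart meter readings.
# Column names and the validity rule are hypothetical; a real pipeline
# would persist each stage as tables or files.
import pandas as pd

# Bronze: capture exactly what arrived, problems and all.
bronze = pd.read_csv("meter_readings.csv", parse_dates=["reading_time"])

# Silver: keep rows that are individually useful (here: a plausible reading).
silver = bronze[(bronze["kwh"] >= 0) & (bronze["kwh"] < 100)].copy()

# Gold: aggregate to the level of detail that downstream users need.
gold = (
    silver.assign(reading_date=silver["reading_time"].dt.date)
    .groupby(["meter_id", "reading_date"], as_index=False)["kwh"]
    .sum()
)
print(gold.head())
```

Comparing row counts across the three tables is what makes this kind of diagnosis possible.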
There is a cost to medallion architecture: extra storage and extra reading and writing of data. This must be weighed against the benefits for your particular situation.
Catching data quality issues with the write, audit, publish pattern
Write, audit, publish (WAP) is a pattern for designing data pipelines. It treats a pipeline as being built up of several sections. Each section produces a result data set which is used by one or more other sections.
WAP splits each section into three stages:
- Write: create the data and store it somewhere that is inaccessible to other sections
- Audit: check the data
- Publish: make the data available to other sections, but only if it has passed the audit checks
You will need to decide if rows will individually pass or fail the check, or if one row that fails the check will invalidate all the other rows that came from the same source. The point is that you stop quality problems in one section from spreading to downstream sections.
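Here is a minimal sketch of the pattern using local files as stand-ins for the staging and published locations; the file names, columns, and audit rules are hypothetical.

```python
# Write, audit, publish with local files standing in for real storage.
# File names, columns, and the audit rules are hypothetical.
import shutil
import pandas as pd

def write(df: pd.DataFrame, staging_path: str) -> None:
    # Write: store the result where downstream sections cannot yet see it.
    df.to_csv(staging_path, index=False)

def audit(staging_path: str) -> bool:
    # Audit: run the checks that matter to downstream consumers.
    df = pd.read_csv(staging_path)
    checks = [
        df["order_id"].notna().all(),
        not df["order_id"].duplicated().any(),
        (df["order_value"] >= 0).all(),
    ]
    return all(checks)

def publish(staging_path: str, published_path: str) -> None:
    # Publish: only data that passed the audit becomes visible downstream.
    shutil.copy(staging_path, published_path)

# Stand-in for the data this section of the pipeline produced.
orders = pd.DataFrame({"order_id": [1, 2], "order_value": [10.0, 25.5]})

write(orders, "orders_staging.csv")
if audit("orders_staging.csv"):
    publish("orders_staging.csv", "orders_published.csv")
```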
Designing and implementing data contracts
Data contracts, as their name suggests, are a way to enforce a contract between distinct sections of a data pipeline. These sections can correspond to different teams in an organisation, or different technologies or uses of data, such as a transactional database supporting operations versus a data warehouse. Data contracts can be thought of as a communication mechanism that lets teams collaborate better, as well as a checking mechanism.
They allow the consumers of some data to specify the attributes that matter to them. These attributes could be, for example, the names and types of columns in the data, or the range of values for each column. It might be that a database table has 30 columns, but a given consumer of the data in the table is interested in only 17 of them. In this case, their contract will specify only the 17 columns that matter to them.
The contract is written in something like YAML, which both humans and computers can read. It is put under version control. The producer of the data can look at the contracts specified by the various consumers of the data and know if changing a table will affect any consumers. Contracts can be used to send notifications, or even to stop a build pipeline from running to completion.
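As a minimal sketch of how this might look in practice, the following embeds an illustrative contract (not any particular data contract standard) and checks a DataFrame against it; in a real setup the YAML would live in its own version-controlled file.

```python
# A consumer-side data contract and an automated check against it.
# The contract format is illustrative, not a specific standard, and the
# dataset and column names are hypothetical.
import yaml          # PyYAML
import pandas as pd

CONTRACT = yaml.safe_load("""
dataset: customers
columns:
  - name: customer_id
    type: int64
  - name: country_code
    type: object
""")

def check_contract(df: pd.DataFrame, contract: dict) -> list:
    problems = []
    for column in contract["columns"]:
        name = column["name"]
        if name not in df.columns:
            problems.append(f"missing column: {name}")
        elif str(df[name].dtype) != column["type"]:
            problems.append(f"{name} is {df[name].dtype}, expected {column['type']}")
    return problems

customers = pd.read_csv("customers.csv")
print(check_contract(customers, CONTRACT) or "contract satisfied")
```

A check like this can run in the producer's build pipeline, which is how a contract becomes a blocking gate rather than just documentation.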
To wrap up
In some ways, data and its quality can seem strange to people used to assessing the quality of software. There’s often no observable behaviour to check and little in the way of structure to help you break the world of the data into manageable pieces. This can mean that a tester’s sense of “that looks weird” has little to work with.
However, you can use a common software testing technique, thinking about the people who matter and the problems that are important to them, to find a useful starting point. From there you can refine the problem into attributes such as validity, uniqueness, and so on, and evaluate those using simple data analysis tools.
If testers are able to work with the engineers who produce data, then, just as with working alongside interface engineers, testers have a valuable role to play. They can point out risks and other consequences of design decisions, and suggest test approaches to help mitigate them.
What do YOU think?
Got comments or thoughts? Share them in the comments box below. If you like, use the ideas below as starting points for reflection and discussion.
Questions to discuss
- Have you ever worked with data whose quality you doubted?
- How did you discover the problem?
- Was there an isolated problem or was it widespread? If it was isolated, how did you know this?
Actions to take
- Think about decisions, products and services in your organisation that depend on data.
- Consider for whom the data is important, and what problems it helps them to solve.
- Investigate how the data’s quality is assessed.
- If the data is produced or modified within your organisation, explore how the data pipeline is designed to prevent or at least catch data quality problems.
For more information
- Baking Quality into Your Data Pipeline, Ali Khalid
- Quality + Data Engineering: How to Thrive, James Prescott
- Using Data to Model User Behaviour, Heather Reid