Today's software platforms rely on complex databases that contain data from many technical and business entities. When new features are added or legacy code is refactored, existing data is often altered because new tables and fields are introduced in the database and old ones are removed.
If your organization's data will be affected in this way, it's likely that a data migration from the old to the new database structure will be necessary. This type of migration can sometimes be more complex and time-consuming than the implementation and testing of the new feature itself.
And that's where you as the software tester come in. Let's discuss data migration testing. I've tested a few different kinds of data migrations and I'll share some of my experience and lessons learned with you, giving you a basic framework from which you can develop your own approach.
What Is Data Migration Testing?
Data migration testing is part of a holistic effort that ensures a seamless transition from a legacy system to a new one with minimal disruption and no loss or corruption of data. It verifies that current data as well as new data will be handled correctly by the functional and non-functional aspects of your application.
In a data migration, you move all of the existing data (system and user data alike) into a new database. The end goal is for the application to handle both the migrated data and data created in the new structure effectively. So you must ensure that:
- Existing data gets to the new structure with no loss or corruption
- Legacy and new applications are functioning properly with respect to the "new" database structure (assuming that legacy applications will be in use in production after the migration, which is likely)
Why Is Data Migration Testing Necessary?
Data is migrated to new systems for various reasons, such as system consolidation, technology obsolescence, new features, or optimization.
Key considerations during migration testing are:
- Minimize disruption to customers. Your team should keep disruption in production to a minimum and avoid any inconvenience to the system's users, whether they are B2C customers or employees of organizations that use the system.
- Ensure feature continuity. Users should be able to access all software features with no issues after the migration is complete.
- Comply with applicable laws on data privacy and protection. Consult with your product manager or stakeholders and, if necessary, the organization's legal team to determine if the data migration involves any regulated user data and if additional steps have to be taken. This is crucial for compliance with legal standards and to avoid potential legal issues.
Data Migration Testing Strategy In A Nutshell
Phases of Data Migration Testing
Each of these phases should be executed in test before your team attempts a production migration.
Pre-migration audit: Before the migration begins, examine the data in legacy systems and database tables to identify issues such as inconsistencies, duplicates, corruption, or incompleteness. This can prevent complications during the actual migration in production and it helps to prepare realistic test data.
Backward compatibility testing: Ensure compatibility with existing data and features.
Rollback testing: Validate the ability to revert to the legacy database if needed.
Post-migration data validation: Confirm that all data is migrated, is in the expected format, and functions as intended in the new environment.
Features and performance testing:
- Verify that features and systems work with migrated data as expected.
- Test the interfaces between the migrated data in applications and other services they interact with.
- Test performance to ensure that it is on a par with (or faster than) that of the legacy system.
Data Migration Challenges
- Handling large datasets
- Ensuring data consistency
- Addressing potential bugs that arise during system migration
- Data in more than one character set
- Migrating data and introducing new features at the same time
- Proper obfuscation of personally identifying user data
- "Movability" of a data set that contains obfuscated user data and user IDs
Data Migration Testing In Action
Let's consider our strategy in two real-world cases to get a better understanding of data migration testing.
Scenario One: A Multinational, Mirrored Web Service
Imagine a complex web service that's integrated with several different microservices and third-party apps. Think of it as a customer support service for a big company.
The system consists of instances across the world: Europe, Africa, Asia, and North, Central, and South America. Each of the instances mirrors the whole setup for a specific region.
Here's the deal: the company wants to save some cash. Running all these separate systems for different regions isn't cheap. We're using more servers, there's a lot of hassle with updates and management, and those third-party apps cost a lot. So, the plan is to bring everything into one instance: one system to rule the data for all the regions. The system has to be able to handle all these regions without any issues.
Clearly, this is a big challenge. We can't just mash all the data together. Data from each region needs to stay in its lane. It's crucial that the system knows where each piece of data came from and keeps it all sorted according to that region's specific rules and regulations. This way, even though it's all running on one system, each region's data is still playing by its own rules.
How I Would Test This Scenario
Use a mix of real-world data and generated test data. How do you know which type to use? Well, in my case, I used real-world data whenever possible, but I generated data if there was a particular test case that wouldn't have been covered otherwise.
- Using real-world data also provided a lot of volume, meaning that performance would be assessed realistically.
- Finally, using real data directly spared me the extensive effort of inspecting it in detail just to reproduce it synthetically, making the process more efficient.
Use several staging environments (test environments) to emulate different regions.
Make a copy of your production data. In my case, I was able to use an API to get all the needed data from the production environment. The system was integrated with a third-party solution (Zendesk), so I used Zendesk REST APIs to get all the needed data without the need for a ton of custom SQL queries and unwieldy database dumps.
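Here is a minimal sketch of that kind of extraction, assuming Zendesk's offset-paginated REST endpoints and token-based authentication; the subdomain, credentials, and output file name are placeholders.

```python
import json
import requests

SUBDOMAIN = "your-subdomain"                          # placeholder
AUTH = ("agent@example.com/token", "YOUR_API_TOKEN")  # Zendesk token auth

def fetch_all(endpoint, key):
    """Follow Zendesk's paginated responses and collect every record."""
    url = f"https://{SUBDOMAIN}.zendesk.com/api/v2/{endpoint}.json"
    records = []
    while url:
        response = requests.get(url, auth=AUTH, timeout=30)
        response.raise_for_status()
        payload = response.json()
        records.extend(payload[key])
        url = payload.get("next_page")  # None on the last page
    return records

if __name__ == "__main__":
    tickets = fetch_all("tickets", "tickets")
    with open("production_tickets.json", "w") as f:
        json.dump(tickets, f, indent=2)
    print(f"Exported {len(tickets)} tickets")
```

The same loop works for other record types such as users and organizations, so the whole extract stays in one small script instead of a pile of SQL queries and dumps.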
Obfuscate or remove all personally identifying data. This isn't an easy task for the most part. You may need to consult with the product owner or stakeholders to clarify which data identifies people. That data should be removed or obfuscated to avoid potential legal issues.
- You will have to determine whether to remove or obfuscate data. For example, if you have usernames and email addresses, you can't simply delete them and leave the fields empty. Instead, you would replace the original data with randomly generated names and email addresses derived from those names.
- The good news: these tasks can be automated.
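For example, here is a minimal sketch of that kind of automation, assuming the user records were exported to JSON and using the names library to generate replacement values; the field and file names are illustrative.

```python
import json
import random

import names  # third-party library that generates random person names

def obfuscate_user(user):
    """Replace personally identifying fields with consistent fake values."""
    first = names.get_first_name()
    last = names.get_last_name()
    user["name"] = f"{first} {last}"
    # Derive the email from the fake name so the two fields stay plausible together.
    user["email"] = f"{first.lower()}.{last.lower()}{random.randint(1, 999)}@example.com"
    return user

with open("production_users.json") as f:
    users = json.load(f)

obfuscated = [obfuscate_user(u) for u in users]

with open("obfuscated_users.json", "w") as f:
    json.dump(obfuscated, f, indent=2)
```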
Generate synthetic test data where needed. In some cases, it might not be necessary, especially if the entities in your databases are simple and few. But you might need synthetic data to cover specific test cases. For example, a customer support service database that includes the names of company products might have a lot of data connected to Adobe product names like "Photoshop" and "Illustrator". There is no need to spend time on getting this data from the production database, but you still need those names because many other entities will be connected to them.
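A small generator is usually enough for this. The sketch below is illustrative only: the field names and status values are assumptions, not a real schema.

```python
import json
import random
import uuid

PRODUCTS = ["Photoshop", "Illustrator"]          # known product names from the domain
STATUSES = ["new", "open", "pending", "solved"]  # illustrative status values

def make_ticket():
    """Create one synthetic support ticket tied to a known product name."""
    product = random.choice(PRODUCTS)
    return {
        "id": str(uuid.uuid4()),
        "subject": f"Issue with {product}",
        "product": product,
        "status": random.choice(STATUSES),
        "priority": random.choice(["low", "normal", "high"]),
    }

tickets = [make_ticket() for _ in range(500)]
with open("synthetic_tickets.json", "w") as f:
    json.dump(tickets, f, indent=2)
```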
Verify that all data sets are correctly represented. Implement data validation that compares source and target data sets for consistency. Depending on your data sets, you may need to do this manually or use scripts that will check selected characteristics of data.
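If your data sets were exported to JSON, a simple comparison script can go a long way. This sketch checks record counts and a few selected fields keyed by ID; the field names are assumptions.

```python
import json

CHECKED_FIELDS = ["subject", "status", "region"]  # illustrative fields to compare

def index_by_id(records):
    """Index a list of records by their ID for fast lookup."""
    return {record["id"]: record for record in records}

with open("source_tickets.json") as f:
    source = index_by_id(json.load(f))
with open("target_tickets.json") as f:
    target = index_by_id(json.load(f))

missing = set(source) - set(target)
if missing:
    print(f"{len(missing)} records were lost in the migration, e.g. {sorted(missing)[:5]}")

for record_id in source.keys() & target.keys():
    for field in CHECKED_FIELDS:
        if source[record_id].get(field) != target[record_id].get(field):
            print(f"Mismatch in '{field}' for record {record_id}")
```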
Add data to each staging environment. If you have an API to do this, as I did (Zendesk), it makes this task easier.
- The challenge here is to match staging IDs with production IDs, because many records can't simply be migrated with their original identifiers. If you have, for example, a table of users' actions on the site that uses the user UUID as its primary key, then that UUID must be matched with the other tables and services where it is also used (see the sketch after this list).
- Without this matching, you'll get inconsistencies, and some system functionality may not work at all. And because of the large number of integrations with different services, it's impossible simply to get a dump and use it as is.
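Here is a sketch of how that matching can work, assuming users are re-created in staging through the API and receive new UUIDs there; the field names (email, uuid, user_uuid) are placeholders for whatever the real schema uses.

```python
def build_id_map(prod_users, staging_users):
    """Match users by a stable natural key (here: email) and map production UUID -> staging UUID."""
    staging_by_email = {user["email"]: user["uuid"] for user in staging_users}
    return {user["uuid"]: staging_by_email[user["email"]] for user in prod_users}

def remap_user_actions(actions, id_map):
    """Replace the production user UUID in each user-action record with its staging counterpart."""
    remapped = []
    for action in actions:
        prod_id = action["user_uuid"]
        if prod_id not in id_map:
            raise KeyError(f"No staging user matches production UUID {prod_id}")
        remapped.append({**action, "user_uuid": id_map[prod_id]})
    return remapped
```

Every dependent table or payload goes through the same remapping before it is loaded into staging, which is what keeps the integrated services consistent.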
Perform static and functional testing on each test environment. Check that the data looks fine, everything works, and there are no crashes on each staging environment separately.
Test the migration itself. This is the test of the big production migration. Showtime!
- Run all of the region-specific migrations (data from different staging environments to a single new environment).
- Check that everything works with “old” (migrated) data and with new data, ensuring there are no conflicts, issues, or crashes.
- Check that everything works with fresh data created in the new consolidated environment.
- Ensure that data from different regions (Europe, Asia) still belongs to its original region, no data was lost, and all data is accessible and can be changed, deleted, or updated according to the application logic.
- Test the rest of the features: full regression testing.
- If any step fails, then go back, make fixes, and repeat. When I ran into an invalid migration, I already had all the scripts for data generation, creation, and obfuscation, so it wasn’t a problem to wipe the database of the new staging and retry migrations with fixes.
Scenario Two: A Cloud-Based Tool For Browser Extension Management
In this scenario, we have an application that manages several browsers and their settings on your laptop or desktop, similar to an antidetect browser. My focus here will be on browser extensions.
Each browser has a unique set of extensions. Some of these extensions may store users' data, and some extensions may run in multiple browsers. Our application allows you to manage extensions across browsers, no matter where you have the application installed. These extensions are stored in the cloud along with their configuration information, for example, which extensions are installed on each browser profile.
The product owner wants to enhance this extension management tool, introducing new features and more options for installing extensions from various sources such as the Chrome Web Store, .crx or .zip files, or unpacked extensions. The end user should be able to install extensions without the need to open the browsers for each installation. The end user will also be able to delete extensions or use different (older) versions of the extensions.
These improvements will require significant changes to the database structure. As a result, existing data on the extensions and versions installed on each browser instance will need to be migrated.
How will we know the migration was successful? Clearly, we do not want to lose any user data. All extensions and their configuration information, including extensions that aren't available in the Chrome Web Store, should be available and working after the migration. And new features, such as the ability to delete extensions, should work on existing extensions.
How I Would Test This Scenario
The main approach is similar to the first example, but specific requirements meant I had to change some of the steps.
Create a test environment that is identical to the planned production environment.
Get a copy of the production database and inspect the data. This approach is different from using obfuscated data in the first example, where I had access to some of that data through application functionality (API), and the data was matched with staging data and entities. When you use a full production data dump, obfuscate it, and perform migration tests, it's not about making the data functional within the application. It's about ensuring that the migration commands and logic work correctly. This is a technique that developers can use when they write migration scripts.
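As a sketch of that technique, assuming a PostgreSQL dump restored into a local database (the connection string, table, and column names are placeholders), the obfuscation can run directly against the restored copy before the migration scripts are executed:

```python
import random

import names
import psycopg2  # PostgreSQL driver

conn = psycopg2.connect("dbname=migration_test user=tester password=secret host=localhost")
cur = conn.cursor()

# Replace identifying columns in place on the restored dump.
cur.execute("SELECT id FROM users")
for (user_id,) in cur.fetchall():
    first, last = names.get_first_name(), names.get_last_name()
    cur.execute(
        "UPDATE users SET full_name = %s, email = %s WHERE id = %s",
        (
            f"{first} {last}",
            f"{first.lower()}.{last.lower()}{random.randint(1, 999)}@example.com",
            user_id,
        ),
    )

conn.commit()
cur.close()
conn.close()
```

After this pass, the migration scripts run against realistic but de-identified data, which is exactly what this kind of test needs to exercise.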
Generate some test data that covers most of the known cases (a sketch follows this list).
- Use real data as a source of statistics and particular details.
- Add different extensions and their versions, including the popular ones, the complex ones, those with different permissions, and so on.
- Use different ways to add these extensions, such as Web Store, .crx files, and unpacked extensions.
- It is important to have a test environment that is identical to the production environment at this point.
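The sketch below shows one way to generate such records; the extension names, versions, and field names are illustrative assumptions rather than the product's actual schema.

```python
import json
import random
import uuid

SOURCES = ["chrome_web_store", "crx_file", "zip_file", "unpacked"]
EXTENSIONS = [
    {"name": "uBlock Origin", "versions": ["1.57.0", "1.58.0"]},
    {"name": "Grammarly", "versions": ["14.1100.0"]},
]

def make_installed_extension(browser_profile_id):
    """Create one installed-extension record for a browser profile."""
    ext = random.choice(EXTENSIONS)
    return {
        "id": str(uuid.uuid4()),
        "browser_profile_id": browser_profile_id,
        "name": ext["name"],
        "version": random.choice(ext["versions"]),
        "source": random.choice(SOURCES),
        "permissions": random.sample(["tabs", "storage", "cookies", "history"], k=2),
    }

profiles = [str(uuid.uuid4()) for _ in range(20)]
records = [
    make_installed_extension(profile)
    for profile in profiles
    for _ in range(random.randint(1, 5))
]

with open("extension_test_data.json", "w") as f:
    json.dump(records, f, indent=2)
```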
Test backward compatibility. Update the server code and test that the legacy client application works properly. Make sure that all user data is available and that the application works with it as expected. In this phase, you will get three sets of similar data:
- Data from step one that wasn’t used in any way
- Data from step one with which some tests were performed (updated, deleted, and so on)
- New data that was created in this step
Test the new client application and periodically check backward compatibility during the testing to make sure that data created with the new client application is properly handled by the old client application. Also: test new features and conduct a smoke test and regression test of the existing features that were affected.
To Wrap Up
These two examples demonstrate that the phases of data migration testing are broadly similar no matter what the scenario, but each migration will be unique in certain ways.
The key to a successful data migration is the quality of the test data. Whenever possible, use real (obfuscated) data, and use real data and its statistics as a reference to ensure all test cases are covered and to generate robust test data sets.
Test backward compatibility. It may not always be necessary, but when it is, it adds many additional tests and more test data generation.
The data migration should not cause any loss or corruption of users' data, or change it in a way that it cannot be used as it was before. Test to eliminate this risk.
You need some skill in working with databases and scripting.
- I used Python: I had some programming experience, but my scripts relied on plain functions rather than the full set of object-oriented programming constructs.
- Among the tools I used: PyCharm for development, along with the Zendesk REST API library. I used many Python libraries, such as requests, uuid, names, random, and json.
- I also worked with databases, wrote SQL queries, and used tools like pgAdmin and Postman for debugging request and response flows.
Make sure that the entire team understands and accepts your plan for testing the data migration.
Do you have some data migration testing coming up? Have you tested a migration recently? Tell us about it in the Ministry of Testing Club!
For More Information
- Using Personal Data In Test Safely: How To Comply With the GDPR, Ioan Solderea
- Quality + Data Engineering: How to Thrive, James Prescott
- Baking Quality Into Your Data Pipeline, Ali Khalid