Are You Seeing RED? Restoring Reliability To Test Results

Learn to enhance CI/CD pipelines for trustworthy testing results.

“As quality assurance professionals, part of our role is to provide meaningful information to all interested stakeholders so they can make better informed decisions on time. Building and maintaining stable test stages plays a key role in helping our teams achieve a trustworthy CI / CD setup.”

Is There A Problem Here? Why Certain Tests Always Seem To Fail 

As testers, we want our tests to tell us when product code behaves in a way we don’t expect. And in many continuous integration / continuous deployment (CI / CD) environments, we have come to expect that failed tests will show up as “red” on our results dashboards. The color red alerts us to a problem … somewhere.

Tests that fail every so often, especially in areas of the product that are prone to regression defects, are very useful IF they in fact turn red for good reason: when a defect has been introduced or something unexpected occurred in the test environment. 

But when tests fail frequently, sometimes as often as every day, it can be a sign that we cannot trust the results of those tests. Over time, after chasing down one false positive (failed) result after another without finding any actual defect in the code, people on the team start ignoring the failed tests. And when the tests do actually fail because of a defect in the code, no one notices. Meanwhile, the testers who wrote the automation and report the defects lose credibility. 

Today, many tests live in what are known in DevOps terminology as “pipelines.” In The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations, the authors state that the role of a “deployment pipeline” is to ensure that code and infrastructure are always in a deployable state, and to achieve that we require continuous build, testing, and integration.

So the focus of this article is not automated tests that yield true, reliable passes and failures. Instead, our main concern is what we can call an unreliable test pipeline, which produces untrustworthy results apt to be ignored. This scenario can be seen with both functional and non-functional tests. In this article, we’ll come to understand how we can identify unreliable test pipelines, and more importantly, what we can do to fix them or keep them from occurring.

What Leads To Unreliable Results From A Test Pipeline?

Notification Problems

One root cause of unreliable test results is the absence of proper failure notifications. This shows up in a few different ways:

  • The tests do not have notifications in place at all 
  • Too many notifications point to a single channel, which then ends up being ignored for being too noisy

Lack Of Ownership

After a while, team members can start walking away from investigating problems that the testing dashboard is supposed to report. Who is responsible for investigating and fixing those failures? Is it the developer who is creating the merge request? Is there enough QA support to work together on this? Should DevOps join in? 

All these questions are relevant, and at some point, you might need cross-expertise collaboration to get to the bottom of a given issue. As a rule of thumb, we should assume that whenever a pipeline’s test stage fails, we as quality assurance professionals need to take the lead.

Test Types And Environmental Differences

Last but not least, creating properly stable test stages is not as easy as it seems. If we have a mix of different types of tests in the pipeline (functional and non-functional, for example), there is a broad range of problems we can run into when executing those checks. Another complicating factor is that our local machines often differ from the agents where the tests run, which leads testers into the “works on my machine” trap and makes existing problems harder to fix.

Restoring Reliability To Test Results: Monitor, Act, Share 

Continuous reliable pipeline cycle

To create a test pipeline that reliably reports on application errors, we need to

  • Monitor the results, with appropriate notifications to the right people
  • Act on what we find out, fixing not just product code but any environmental issues that might cause unreliable test results
  • Share our experiences with other team members, for example, by documenting good practices 

This cycle helps to ensure a trustworthy test pipeline that produces a RED stage only when an actual application error is found. And the cycle needs to be applied continuously during the software development life cycle, not just once or twice. In the following sections, we’ll dive deeper into each step.

Monitor

Throughout a software development project, the application and its test code will evolve side by side. New scenarios will be added, existing ones will be modified, and new tools will be introduced to allow for other types of testing, such as accessibility and security. 

As changes take place, a once stable pipeline can quickly destabilize. A test that works perfectly on a development machine might not behave the same when running against a deployed version of our application. And so we need to monitor these changes continuously. 

First up, we need an efficient notification mechanism in place. We need to be notified as soon as something goes wrong in one of our test stages. And then we can investigate the issue.

A failed pipeline notification

As illustrated in the image above, notifications should be clean and simple, giving just enough information for us to start our investigation. Providing the team with a summary and a link to the actual pipeline execution details supports thorough issue analysis. The output of the pipeline should include logs and any available test reports.

Sending notifications ONLY when they are relevant to your team is another important aspect to consider. As an example: a notification that a given test stage has started running is of low value and is likely to clutter the channel, obscuring more important notifications.
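As a concrete sketch of the failure-only approach, assuming the pipeline passes the stage’s exit status, a link to the run, and a chat webhook URL through environment variables (the variable names below are illustrative, not tied to any particular CI product):

    # notify_on_failure.py: a minimal sketch of a failure-only notification step.
    # TEST_EXIT_CODE, PIPELINE_URL and CHAT_WEBHOOK_URL are assumed to be provided
    # by the surrounding pipeline; the names are illustrative only.
    import json
    import os
    import urllib.request

    def notify_if_failed() -> None:
        exit_code = int(os.environ.get("TEST_EXIT_CODE", "0"))
        if exit_code == 0:
            return  # stay quiet on success to keep the channel useful

        summary = {
            "text": (
                f"Test stage failed (exit code {exit_code}).\n"
                f"Details: {os.environ.get('PIPELINE_URL', 'link unavailable')}"
            )
        }
        request = urllib.request.Request(
            os.environ["CHAT_WEBHOOK_URL"],
            data=json.dumps(summary).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(request, timeout=10)

    if __name__ == "__main__":
        notify_if_failed()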

Act

A pipeline’s test stage can fail for any number of reasons. The cause could be something as simple as a network issue, or it could be due to a very challenging test scenario that continues to misbehave no matter how much you work on trying to make it stable. 

The most important part of the Act step is that no failed test stage is ignored. Investigating it might not be a priority at the moment, but it cannot continue to be overlooked. The flowchart below shows how we can analyze the issue and determine our course of action.

Flowchart of analysis and action steps

1. Investigating Application Issues

The first and most important question we want to answer by looking at a failed test is: Is this an application issue? (flowchart question 1) Determining this might involve going back to the requirements, chatting with the developers and business analysts, and even manually attempting to reproduce the problem. Once we confirm that an actual issue exists, we can then raise it as a bug and gain confidence that our test set is doing its job properly.

2. Is The Test Code Correct?

On the other hand, when a test failure does not represent an actual issue with the product code, we can then question if the automated check is exercising the application properly (flowchart question 2). In that case, we inquire:

  • Is the test set up properly? 
  • Are we using the correct inputs? 
  • Do we have appropriate assertions?
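A minimal pytest sketch of how those three questions can map onto a test’s structure; the endpoint, payload, and expected values are invented for illustration:

    # A hypothetical API check, structured so each question above maps to a
    # visible part of the test. Endpoint, payload and field names are invented.
    import requests  # assumes the requests library is installed

    BASE_URL = "https://test.example.com"  # illustrative test environment address

    def test_order_total_is_calculated():
        # Setup: start from a known, valid state
        payload = {"items": [{"sku": "ABC-1", "quantity": 2, "unit_price": 5.25}]}

        # Inputs: send what the requirement actually describes
        response = requests.post(f"{BASE_URL}/orders", json=payload, timeout=10)

        # Assertions: check the outcome precisely, not just "it returned something"
        assert response.status_code == 201
        assert response.json()["total"] == 10.50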

3. Is There An Environmental Issue?

If we determine that the test code is correct, then the next area we can look into is our environment (flowchart question 3). Here are some typical environmental issues that can lead to test failures:

  • Required environment variables are undefined, set to the wrong values, or are not configured for the test environment
  • Network restrictions did not allow for the tests to communicate with a third party or database
  • Database is not in the appropriate state
  • The pipeline couldn’t retrieve its required dependencies 
  • The application did not start properly

Some of the problems above require real work to address, such as adding the missing environment variables, ensuring network restrictions are dealt with, or correcting the database setup. Others, such as a needed service being temporarily unavailable or the application failing to start, might only require re-running the pipeline.
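For the environment variable case, failing fast with an explicit message usually makes the cause obvious to whoever reads the pipeline output. A minimal pytest sketch, where the variable names are placeholders for whatever your tests actually require:

    # conftest.py: abort the session early, with an explicit message, when
    # required environment variables are missing. The names are placeholders.
    import os
    import pytest

    REQUIRED_VARIABLES = ["BASE_URL", "DB_CONNECTION_STRING", "API_TOKEN"]

    def pytest_sessionstart(session):
        missing = [name for name in REQUIRED_VARIABLES if not os.environ.get(name)]
        if missing:
            pytest.exit(
                f"Missing required environment variables: {', '.join(missing)}",
                returncode=1,
            )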

4. Should The Test Be Fixed Or Should It Be Removed?

Sometimes, tests can be fixed, and sometimes, they can’t. Here, we are asking flowchart question 4. 

In some cases, the fix is trivial, as illustrated in the image below, where the test was expecting a Double format value but received a String. 

Test assertion expecting Double but getting a string
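In Python terms, the same kind of mismatch might look like the sketch below; depending on what the requirement actually promises, the fix is either to correct the expectation or to convert before comparing (the field name is hypothetical):

    # The API (hypothetically) returns the amount as a string, while the check
    # expected a numeric value.
    response_body = {"amount": "10.50"}

    # Before: fails because "10.50" (a string) is never equal to 10.50 (a number)
    # assert response_body["amount"] == 10.50

    # After: make the intended comparison explicit
    assert float(response_body["amount"]) == 10.50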

In other cases, we need to change the test to be more sensitive to application state, ensuring that we move between steps only when the application is ready (a minimal waiting helper is sketched after these examples). Examples:

  • UI test: Code attempts to interact with a given element before it is in an active state.
  • API test: An HTTP request can only be made after an event has been consumed.
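The waiting helper promised above, in a minimal generic form: poll a readiness condition until it holds or a timeout expires. The timeout values and the condition itself are assumptions you would tune per application:

    import time

    def wait_until(condition, timeout_seconds=10.0, interval_seconds=0.5):
        """Poll `condition` until it returns True or the timeout expires."""
        deadline = time.monotonic() + timeout_seconds
        while time.monotonic() < deadline:
            if condition():
                return
            time.sleep(interval_seconds)
        raise TimeoutError(f"Condition not met within {timeout_seconds} seconds")

    # Example: only send the follow-up request once the (hypothetical) order has
    # actually been processed by the event consumer.
    # wait_until(lambda: get_order_status(order_id) == "processed")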

But sometimes, the problem with a given check is not that straightforward. Some examples:

The issue can’t be reproduced in the local environment. To address this, you could:

  • Enhance the test logs, making sure they output as much relevant information as possible
  • Run the failed test from your local machine, targeting the deployed test instance (see the sketch after this list). This method allows you to experiment with code changes and even debug each step for a complete view of how the test behaves

The test runs code in an area of the application that is typically hard to test:

  • Examples include dealing with event-driven platforms like Kafka or RabbitMQ, reading data from logs or the console, asynchronous behavior, or interaction with third-party libraries
  • Fixing the broken test might be more time-consuming than it’s worth. It will probably require you to better understand the technologies in place and figure out whether there are better ways to approach it from a test automation perspective
  • This situation is a great opportunity to pair with the developers on your team who have been involved with the implementation
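As referenced above, one low-effort setup for the “can’t reproduce locally” case is to let the same test suite target either a local instance or the deployed test environment, so a failing pipeline test can be re-run and debugged from your own machine. A minimal sketch, where the BASE_URL variable and default address are assumptions:

    # conftest.py: point the suite at whichever instance you need.
    # Run locally:                        pytest
    # Run against the deployed instance:  BASE_URL=https://app.test.example.com pytest
    import os
    import pytest

    @pytest.fixture(scope="session")
    def base_url():
        # BASE_URL and the default address are illustrative assumptions
        return os.environ.get("BASE_URL", "http://localhost:8080")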

If we determine that this test can’t be fixed in a reasonable amount of time, then we need to decide what to do with it. We could: 

  • Test the application at a different level. Some features of the application are easier to test at a lower level. So you could delete the higher-level scenario (UI or API testing) and ensure there is adequate coverage at the unit or integration level
  • Add a retry or polling mechanism (a minimal sketch follows this list). This may eliminate the unreliability and preserve coverage for that area of the application
  • Delete the test. This might feel scary, but keep in mind that it is better to have a smaller, trustworthy subset of tests running than a larger set that gets ignored. Deleting tests is not an excuse for decreased coverage; you can continue to cover the area through lower-level tests and exploratory testing where applicable
  • Stop the test from causing an overall pipeline failure. You could have the pipeline exit in a warning state, for example. This is a great technique when you are introducing a new type of test stage, such as accessibility, security, or contract testing, and you are aware that its findings won’t yet be prioritized by the development team. You can change the behavior back to exit with an error when the team is ready to start working on it again

Share

Quality is everyone’s responsibility, and part of that involves working to stabilize our test stages. Therefore, all involved parties need to document, formally and informally, solid practices and guides for dealing with the test stages of the existing pipeline. 

The level of desired documentation will depend on your team’s way of working, but at a high level, you should cover:

  • Links to relevant pipelines and their related code repositories
  • Channels in which notifications are sent and where they can be managed
  • Common issues

Making sure everyone knows how they will be notified about failures and how to act on a failure will help them get on board and feel like true owners of the entire continuous integration pipeline. Collaboration with other team members (such as DevOps) will play a fundamental role in making this information accessible to everyone.

To Wrap Up…

As quality assurance professionals, part of our role is to provide meaningful information to all interested stakeholders so they can make better informed decisions on time. Building and maintaining stable test stages plays a key role in helping our teams achieve a trustworthy CI / CD setup.

“The fundamental tenet of continuous delivery (CD) is to work so that our software is always in a releasable state. To achieve this, we need to work with high quality. That way, when we detect a problem, it is easy to fix, so that we can recover to releasability quickly and easily.” Dave Farley, Benefits of Continuous Delivery, DORA Metrics Report 2023

To have releasable code we need to be able to detect and fix a problem quickly, and to achieve that we require our test stages to provide feedback we can trust.

“Teams that prioritize getting and acting on high-quality, fast feedback have better software delivery performance.” Dave Farley, Benefits of Continuous Delivery, DORA Metrics Report 2023

Taking ownership of the test stages of our pipeline will demand time and effort, but only by prioritizing and acting on it daily will we be able to provide fast feedback and improve our overall performance.

Therefore, make sure each test you add and every pipeline run are watched closely. In doing so, you’ll ensure that they always provide valuable feedback.

For More Information

Jose Carrera
Senior QA Consultant
José Carréra is from Brazil, where he graduated in computer science in 2007 and completed his master’s in 2009. Professionally, he has been working in quality-related roles since 2006. He moved to the UK in 2015, where he currently works as a Senior QA Consultant at Ensono Digital.