
Failing with grace: A tester's guide to error culture

Improve user satisfaction and software resilience by prioritising thoughtful error messaging and a good error culture


"Errors happen. And if they do, we focus on how they appear. A graceful failure is often better than none."

Why spend time thinking about error messages? 

Before my latest trip, I attempted to check in online. The user flow, which led through a tedious series of form fields, ended abruptly with a succinct error message: HTTP_400_BAD_REQUEST. 

Finding an error is an exciting experience for any software tester. In this particular instance I could have done quite well without it, but I had more reason than usual to appreciate how the failure was handled.

On the whole, users are more satisfied with a service that fails gracefully than a service that never appears to fail. This is known as the service recovery paradox, and it occurs because the user can clearly see that when failures happen, the system responds appropriately. It is similar to the IKEA effect, in which users are happier with a product or process that cost them some effort than if little or no effort was required.

The abrupt error message I encountered in that online check-in process encouraged me to write about why an effort to ensure good error messaging is more worthwhile than any attempt to eliminate errors entirely. This is especially true in the early stages of development, when sound error-handling practice has the greatest effect.

The real message behind the error message

At first glance an error message of the type HTTP_400_BAD_REQUEST seems like it offers nothing in terms of user-friendliness. However, it conceals a "secret message" that I could decode with knowledge from my previous position as a web developer: "Don’t retry." 

The HTTP protocol defines error codes in the 400 block, including HTTP_400_BAD_REQUEST and FORBIDDEN (403), as resulting from requests that will continue to fail. This is in contrast to the retryable 500 block, which includes GATEWAY TIMEOUT (504). Before I learned this, I used to spend hours in retry cycles only to have each attempt fail with an error message saying: “Try again later.” In fact, nothing would ever make a retry of those requests worthwhile. How much of my time would have been saved if the developers had told me the real story behind code 400: "Don’t retry."
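
To make the distinction concrete, here is a minimal sketch in Java (my own illustration, not code from the check-in system) of a client using the status class to decide whether a retry can ever succeed:

import java.net.http.HttpResponse;

// Decide retryability from the HTTP status class alone:
// 4xx means the request itself is at fault, 5xx means the server (or
// something transient) failed and a later retry may succeed.
static boolean isRetryable(HttpResponse<?> response) {
    int status = response.statusCode();
    if (status >= 400 && status < 500) {
        return false; // e.g. 400 BAD REQUEST, 403 FORBIDDEN: don't retry
    }
    return status >= 500; // e.g. 504 GATEWAY TIMEOUT: retrying may help
}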

Good error culture: fix messages first, then fix the underlying error

Obviously, an error code of 400 will be understood only by a few people. In the absence of an adequate understanding of errors, a 400-block error might not even be the correct one.

In my experience, product owners tend to home in on eliminating the presumptive code defects far too quickly, overlooking the detrimental long-term cost of bad error messaging. In this article I want to advocate for an error culture in which the messaging gets as much attention as the underlying product defects themselves. The reason is threefold: 

  • Fixing error messages is far easier than fixing functional code defects. Clear, correct error messages usually help users. 
  • Error handling is not updated as frequently as feature code. So it's more stable and often extends to scenarios outside the original error case. 
  • Testing error scenarios gives you, as a software tester, a rare glimpse into the quality of the error-handling code. Take advantage of it. Writing tests that cover error messaging is often tedious, and as a result this area rarely gets the attention it deserves.

I dare to argue that the highest-quality software reveals itself in the case of an unexpected failure. That’s when you see the difference between developers who foresee the unexpected and those who simply write functions to be "testable at the time."

Common mistakes in error handling

Many tutorials purport to teach programmers to handle errors correctly, but there are few resources on how to spot violations of those patterns in the finished product. In other words, many programmers never learn why sound error handling practice matters from an end user’s perspective. So I want to give you, as a tester, a list of common error handling mistakes and how you can spot them.

The swallowed root cause

The most common pitfall in error handling is a lack of reporting on what led to the failure. The root cause is suppressed and the error message the end user sees is generic: "something went wrong", for example. Suppression of root cause isn't due to a lack of effort in creating a descriptive message. More often, a far more precisely worded error message is omitted in favor of a general one, out of fear that the precise message might be too technical for end users. 

Such error handling practices can have serious consequences for your team's error culture. By suppressing error descriptions from the relevant submodules, maintainers are discouraged from writing user-friendly error messages. Knowing their effort will be disregarded, these developers don’t see their contribution to the user experience. This anti-pattern in error culture can quickly spread, undermining any sense of responsibility within the team.

Programmers face an economic tradeoff between a thorough but time-consuming approach to potential failures and a quick but shallow one. As the swallowed root cause shows, less code usually yields better results, but it requires discipline and consistency across an entire project. Hence, it is beyond the scope of a single developer and becomes an issue of code (and product) quality.

Programmers see errors as objects with two parts: a plain-language message that describes the issue at a high level, and a reference to a previous exception that was detected as a precursor to the current failure. The second part reveals the history of all previous errors along with links to the failing code segment. End users, in contrast, see only the plain-language message without internal details.

The code example below illustrates how such error handling might be implemented. If end users never saw the result of this function, the code would never pose a problem. However, the practice illustrated below swallows the more detailed reasons that we should expect to emerge from the subfunctions. In the best case, users would see the message "Check-in failed"; in the worst case, the surrounding code could suppress even this message in favor of something more generic.

try {
    checkKYC();
    checkAgeAndAttendance();
    checkVisa();
    checkPayment();
} catch (Exception e) {
    // The specific reason from the failing sub-check is chained as the cause,
    // but only this generic message can ever reach the end user.
    throw new Exception("Check-in failed", e);
}

Luckily, that was not the case in my encounter with the online check-in process. The low-level error code 400 was indeed forwarded directly to the end user: me. While the message did not make for an "optimal" user experience, it showed that anyone on the team, from front end to back end, could have helped to add more detail.

The misleading error message

What could be worse than an error message that simply says “Something went wrong”? A misleading error message! This type of message can easily lead you to waste your time trying and retrying to fix things. For example, a system could prompt you to repeat an action even though the failure is permanent. It may tell you to wait although it has already stopped processing. A rotating spinner could feign ongoing activity when there is nothing to wait for. An error message might simply cite an alternative cause that a developer had in mind during testing, without considering other causes that could arise in the same situation. 

Such messages are lies, maybe not intentional, but negligent at best. 

Here is code with a misleading error message, the likes of which I have seen far too often in projects with paying customers:

try {
    ...
} catch (Exception e) {
    throw new Exception("Try again later.", e);
}

If you see a misleading error message like this, report it as an issue and explain in detail why it's a problem. Often these messages are more harmful than the functional defect in the code. Once end users start to distrust the messages displayed by your product, it will be hard to regain their confidence. I've seen instances where end users continued to dismiss accurate error details as hallucinations months after the root cause was fixed.

Reporting incorrect details is far worse than reporting no details at all. The consequences can far outlive the moment when your application actually faltered.

The missing error message

While a missing error message might appear to be worse for the end user, there are cases where less can be more. 

In many instances, developers add generic messages like "try again later" without thoroughly considering whether they provide any real value. They try to be helpful, but aren't. This habit is often based on faulty assumptions about user experience and about what end users will or will not understand. End users often grasp more than expected, including error codes. They also form their own theories about how the software works. If error messages are difficult to comprehend, they may end up confused. But that's still better than being misled. 

Programmers rely on external libraries that produce peculiar error messages like HTTP_400_BAD_REQUEST. Such a message can reach the end user when nothing is added or hidden. While this may not be ideal, it at least ensures that the message is not inaccurate. Sometimes error resolution can be faster when low-level error codes are communicated to the support team over a hotline.

The overly inclusive error message

Imagine that the product that you are testing isn't targeted at end users directly, but instead is embedded in something bigger, perhaps as an integrated widget, or simply as a data source. You might not have much information about your actual end users. Your module might be used by tech-savvy hackers or buried deep inside some other app. And so the error message that the code produces might seem relevant to the end user, but may also be too detailed. You can't know for sure. 

For that reason, the ideal error message begins with the most general description and adds detail as the end user reads on. That means you add new information as it becomes available. 

The example below shows the idea. Assuming that all module owners make their best efforts to report their failures in a user-friendly fashion, the final message could read “Check-in failed because Visa could not be verified due to a service missing a compulsory visa id.” Obviously there is room for improvement, but at least end users would know where things failed.

try {
    ...
} catch (UserFacingException e) {
    // A submodule already produced a user-friendly message: prepend context.
    throw new UserFacingException("Check-in failed because " + e.getMessage(), e);
} catch (Exception e) {
    // No user-friendly detail is available: fall back to the general description.
    throw new UserFacingException("Check-in failed", e);
}

Why is this code snippet in the list of mistakes? I think it's close to perfection. The pitfall is that this pattern is rarely an integral part of the whole application, applied consistently by every module. 
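
Note that UserFacingException is not a standard Java class; it stands for whatever custom type your project uses. A minimal sketch of what the snippet assumes might look like this:

// Hypothetical helper assumed by the snippet above: an exception whose
// message is safe to show to end users, while the chained cause keeps
// the technical history for logs and support staff.
class UserFacingException extends Exception {
    UserFacingException(String userMessage, Throwable cause) {
        super(userMessage, cause);
    }
}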

As a tester you obviously can't determine how code is written, but you can push for better messaging. 

Takeaways for software testers

I want to share a few ideas on how to handle various error cases from a quality engineering perspective. The recommendations are based on my experience as CTO at testup.io. They might require some initial efforts during testing and fixing, but I have seen them pay off in the long run. Not only will new errors be more traceable, but end users will also feel better informed and more tolerant of occasional hiccups.

How, not why

When your application fails, it is naturally your first instinct to focus on the error cause. Of course, you want to know why it failed and you want the failure to go away. In the moment that might seem like the most direct route: find the why, then find a fix. 

However, taking a broader perspective, you don't care that much about this issue by itself. You care about how fast issues are fixed in general. In the long run, speed is more important than the single step.

In the language of DORA metrics, this is called mean time to recovery (MTTR). It is a proven measure that shows a high correlation with overall project success, and it is empirically more predictive of your outcome than the number of defects in your backlog. It measures how fast you are at detecting, explaining, and fixing errors, not how many errors there are. For this to work, you should focus on how your product code fails and how much time is wasted finding, describing, reproducing, or mitigating an issue. In a way, it measures how smooth your job as a tester is.
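
As a rough illustration (my own sketch, not an official DORA formula), the metric averages the time from detecting an incident to restoring service:

import java.time.Duration;
import java.util.List;

// Mean time to recovery: average duration from detection to restored
// service, across a list of incident restore times.
static Duration meanTimeToRecovery(List<Duration> timesToRestore) {
    long totalSeconds = timesToRestore.stream()
        .mapToLong(Duration::toSeconds)
        .sum();
    // Guard against an empty list to avoid division by zero.
    return Duration.ofSeconds(totalSeconds / Math.max(1, timesToRestore.size()));
}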

Recoverability

There are multiple reasons why your application can fail on the end user side. Many errors are outside of your control. Internet connectivity might break down, external services might fail, or heavy load could lead to exhaustion of available resources. 

Since your test coverage will never be perfect, you will not see all such errors before going live. Before trying to fix errors that may be temporary or even irreproducible, I recommend fixing the software's overall recoverability in a more generic way:

  • Form field recoverability reduces repeated effort if an end user's interaction is interrupted and requires a restart. Browsers do a very good job of suggesting previous entries if form fields are annotated correctly. Typical weak points are drop-down selections and multiline text areas. 
  • Recoverable checkpoints in a multi-screen user flow allow the end user to recover the initial screens in the event of a service breakdown. Is the shopping basket still available after a browser restart, for example? Do end users get a persistent ID for their work that allows them to continue from another device or ask for support? Are recoverable files written to disk before shutdown? (See the sketch after this list.)
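
A minimal sketch of such a checkpoint, with hypothetical names (saveCheckpoint, restoreCheckpoint, and the on-disk layout are my own illustration): the basket is persisted under a stable ID, so a browser restart or a device switch can pick up where the user left off.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Persist the basket state under a stable ID before a risky step.
static void saveCheckpoint(String basketId, String basketJson) throws IOException {
    Path file = Path.of("checkpoints", basketId + ".json");
    Files.createDirectories(file.getParent());
    Files.writeString(file, basketJson);
}

// Restore the basket after a restart; returns null if no checkpoint exists.
static String restoreCheckpoint(String basketId) throws IOException {
    Path file = Path.of("checkpoints", basketId + ".json");
    return Files.exists(file) ? Files.readString(file) : null;
}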

Restarting a process is an unavoidable reality in web-based applications. Fast recovery can do more for user acceptance than a marginally lower failure rate. 

Test coverage

Let's assume for a moment that you found an issue before production deployment. You advocated for a reasonably descriptive error message that allows end users to repeat the process smoothly with minimum effort. If you are in it for the long haul you have now paved the way for a smoother error experience and faster recovery in the future. 

Now it's time to fix the actual defect in the product code. How can you keep up your good work? Here are a few ideas for good test coverage:

  • Add error scenarios to your test plan. You probably won’t have the resources to address all errors immediately, but you can test for graceful failure instead. Since a fix to the error messaging is usually much faster than a fix to its root cause, you can split the ticket: fix the message first and the cause later. In the meantime, a well-crafted error message can be the accepted result, so that regression tests still pass.
  • Create error triggers. With the help of your developers you can bake error triggers deep inside your product. Using the code snippet we reviewed above, if your customer has a family name like "BlocksOnVisaCheck", your product code could create a nicely reported timeout error on the end user’s side. (Obviously, the effect should be limited to the test system and the component addressed; see the sketch after this list.)
  • Log your progress. Nothing is more frustrating for end users than incomplete or misleading error reports. I recommend that you not only log the frequency of errors, but also rate the quality of the error message to measure your progress. 
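
A minimal sketch of such a trigger, reusing the UserFacingException sketched earlier (Passenger, the DEPLOY_ENV variable, and the magic family name handling are hypothetical illustrations, not from any real code base):

// A passenger record, defined here only to make the sketch self-contained.
record Passenger(String familyName) {}

// A magic family name forces a well-reported failure, but only when a
// (hypothetical) environment flag marks this deployment as a test system.
static void checkVisa(Passenger passenger) throws UserFacingException {
    boolean testSystem = "test".equals(System.getenv("DEPLOY_ENV"));
    if (testSystem && "BlocksOnVisaCheck".equals(passenger.familyName())) {
        throw new UserFacingException(
            "Visa could not be verified: forced timeout for testing", null);
    }
    // ... the real visa check would run here ...
}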

To create a culture of good error messaging it is important to appreciate progress. Do not simply throw the error count at your development team and call it a day. Show them how you value the efforts they spend crafting concise messages. Give positive feedback in cases of good error communication.

To wrap up

Error-handling quality is often overlooked under time pressure. With thorough testing already a common target for cost cuts, this isn’t surprising. However, software’s true resilience is often revealed in how it handles failure. Customer satisfaction can actually increase when errors occur but are clearly communicated along with mitigation options. This aligns with counterintuitive psychological concepts like the “service recovery paradox” and the “IKEA effect.”

The impact of a poor error culture is especially damaging when applications provide misleading guidance such as “Try again” despite the issue being permanent or “Please wait” when no processing is happening. These flaws often stem from deep-rooted issues within the error infrastructure, affecting more than just isolated cases. If these mistakes persist, it may signal an error culture that needs changing.

Failing to consider how to describe unexpected events often means no thought was given to the issue at all. Thoughtful anticipation of error origins is a prerequisite, not a consequence, of robust software. 

My advice: Fix your error culture first, then fix the errors.

For more information

Stefan Dirnstorfer is the CTO at testup.io, where he focuses on creating an entirely visual workflow for test automation. With many years of experience as a software developer, he has now dedicated himself to a new mission in software quality assurance.