Introduction
This article breaks down the “5 Whys” method and how it can be applied to a variety of problem statements, from negative customer reviews and live bugs to identifying root causes and finding continuous improvement areas that stop issues from repeating.
What is the “5 Whys” method?
Developed at Toyota in the 1930s, the “5 Whys” method became integral to Toyota's manufacturing process and was later popularised globally as a fundamental component of problem-solving in Lean management. It emphasises tracing issues back to their root causes through consecutive questioning, fostering more effective solutions. (Source: https://purplegriffon.com/blog/5-why-analysis)
The “5 Whys” is a thinking technique that prompts you to go beyond the surface-level issue, sometimes exploring divergent lines of questioning within a single problem statement. Whilst it can be used retrospectively to understand why an issue arose, which we’ll cover later, it is also a useful tool for identifying bugs and helping us respond swiftly to incidents, and is often used by incident response teams within corporations. Although it is called the 5 Whys, you may need more or fewer questions to reach the root cause of your investigation. This could mean asking three or four questions, or even six or more.
A problem in live brings an opportunity to use the “5 Whys”
Imagine the chaos. The internal platform team have highlighted a sharp decrease in the number of logins per hour, and a customer has opened a support ticket to complain that they have been unable to log in to the site since the maintenance window ended. The ticket has been forwarded to the engineers in the account team. Our team! Regardless of how the issue is flagged, our role now is to determine the root cause and resolve it as quickly as possible, so customers can log in and use the product.
Most businesses would be very concerned to hear that customers couldn’t log in and use their products, but the industry context also matters. If a customer cannot log in to pay their quarterly water bill, a day of downtime won’t have the same impact as being unable to log in to a taxi app to get home in the cold early hours of the morning. In short, when following the 5 Whys method, be mindful of the context in which decisions were made and of their criticality, given the industry.
Application 1: Using the “5 Whys” method to identify bugs
One way to use the “5 Whys” method is to help us track down the exact cause of bugs in our products. It is often the responsibility of quality professionals to help understand the root cause.
First, write a brief, clear problem statement or user story that focuses on the known facts of the issue, ensuring you have access to relevant tickets, reports, and logs. For example, your problem statement may be: ‘Customers are reporting being unable to log in following site downtime.’
Next, ask why this is happening using the “5 Whys method”. In our example, we would ask, “Why would customers be failing to log in after site downtime?”. We might initially have a few ideas of different potential causes for this, such as:
- The login API was returning with errors.
- The login API could not be reached.
- The client was not sending requests to the login API.
Based on the logs, error messages shown in the user interface, and what you can reproduce, you may be able to narrow this down to one or two possibilities. In this example, we may have seen that the API was returning the following:
```json
{"errors":[{"code":399}]}
```
Following the “5 Whys”, we don’t stop and report the bug as soon as we get the error code. Instead, the next step is to ask our second question: “Why would I get this error code?”. Looking at the internal documentation, we can see that this code is returned when the API rejects the token.
Then we can ask our third question, although there are a couple of possibilities here:
- Why did the API reject the token being provided?
- Why did the client send a token that the API would reject?
Diving into the API first, we learn that this only happens if the token is not recognised by the API. This leads us to our fourth question. Why wasn’t this token recognised by the API? Asking this question reveals that the tokens are only kept in memory by the API. The downtime in the problem statement was due to a server reboot. This meant that the login API would no longer know the token the client was using before the reboot. We may be satisfied with the depth that we’ve covered with these questions. Asking “Why was the server rebooted?” doesn’t give us any further insight into the bug that we’re looking into, so we can stop following this route of questioning.
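To make this failure mode concrete, here is a minimal Python sketch. Everything in it is hypothetical (the class, the meaning of code 399) and exists only to illustrate why a reboot invalidates every existing session when tokens live solely in the API process’s memory:

```python
# Hypothetical sketch: the login API keeps issued tokens only in a
# process-local dict, so a reboot wipes them all.
import secrets

class LoginAPI:
    def __init__(self):
        self._tokens = {}  # token -> username, held only in memory

    def log_in(self, username):
        token = secrets.token_hex(16)
        self._tokens[token] = username
        return token

    def validate(self, token):
        # Unknown tokens produce the (assumed) 399 error seen in the logs
        if token not in self._tokens:
            return {"errors": [{"code": 399}]}
        return {"user": self._tokens[token]}

api = LoginAPI()
token = api.log_in("simon")
assert api.validate(token) == {"user": "simon"}

api = LoginAPI()  # simulate the server reboot: a fresh process, empty store
assert api.validate(token) == {"errors": [{"code": 399}]}
```

The client still holds a token that looks perfectly valid to it, which is exactly why the questioning has to continue on the client side.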
We can summarise our findings in the following format:
| Problem statement to question | Result |
| --- | --- |
| Customers were unable to log in following site downtime. Why? | The API was returning a 399 error code |
| The login API was returning an error code when people tried to log in. Why? | The session token was rejected |
| The session token was being rejected by the login API. Why? | The API didn’t recognise the token from the client |
| The login API didn’t recognise the token the client was using. Why? | The tokens are stored in memory and were lost when the server was restarted. |
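If you record chains like this often, it can help to capture them as data rather than rebuilding the table by hand each time. A small, illustrative helper (all names here are made up for this sketch) that stores each question-and-finding pair and prints it in the table format used above:

```python
# Illustrative helper for recording a 5 Whys chain as (question, result)
# pairs and rendering it as a markdown table.
def five_whys_table(rows):
    lines = ["| Problem statement to question | Result |", "| --- | --- |"]
    for question, result in rows:
        lines.append(f"| {question} Why? | {result} |")
    return "\n".join(lines)

chain = [
    ("Customers were unable to log in following site downtime.",
     "The API was returning a 399 error code"),
    ("The login API was returning an error code when people tried to log in.",
     "The session token was rejected"),
]
print(five_whys_table(chain))
```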
However, asking these questions is only part of the puzzle, and we do not have a complete understanding of the root cause in this example. By continuing to ask ‘why?’, we have learnt that the login API was rejecting the requests. Let’s go back to that other question from earlier. “Why did the client send a token that the API would reject?”
By continuing to ask why, we may be able to learn not only what the bug was at a user-facing level (e.g., unable to log in after downtime) but also to focus on understanding the root cause of the issue the customer was seeing. This helps us resolve the issue more reliably.
Once complete, our “5 Whys” may have uncovered the following:
| Problem statement to question | Result |
| --- | --- |
| Customers were unable to log in following site downtime. Why? | The API was returning a 399 error code |
| The login API was returning an error code when people tried to log in. Why? | The session token was rejected |
| The session token was being rejected by the login API. Why? | The API didn’t recognise the token from the client |
| The login API didn’t recognise the token the client was using. Why? | The server had been restarted, and this caused the existing tokens to be lost from the login API’s memory. |
| The client was sending an out-of-date token. Why? | It wasn’t aware that the server had been restarted and that the tokens would no longer be valid. |
| The client wasn’t aware that the tokens were no longer valid. Why? | The token lifetime is longer than a server restart takes, so the token wouldn’t expire for up to another 60 minutes. |
| The client wasn’t requesting a new token when the server rejected it. Why? | The client only requests new tokens when a timeout error code is returned. |
| The client requests new tokens only when a timeout error code is returned. Why? | The error handling covers a very narrow case, so when the token error was returned, something not previously considered, the client didn’t try to request a new token. This root cause could also lead to similar bugs elsewhere. |
The conclusions that we might write on the ticket are:
- Every time the server is restarted, existing login tokens become invalid.
- The client is only handling specific error messages, meaning that it does not request new tokens. This needs to be addressed to prevent this issue happening again.
- There may be further undiscovered issues as a result of this error handling behaviour.
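As a sketch of that final conclusion about narrow error handling, the hypothetical client logic might look like this, with the proposed fix alongside. The error codes and function names are assumptions for illustration, not the real client:

```python
# Hypothetical sketch of the client bug: new tokens are requested only on a
# timeout, so a 399 "token not recognised" error is never recovered from.
TIMEOUT = 408
TOKEN_NOT_RECOGNISED = 399  # assumed internal code from the investigation

def should_refresh_token(error_code, fixed=False):
    if fixed:
        # Proposed fix: refresh on any token rejection, not just timeouts
        return error_code in (TIMEOUT, TOKEN_NOT_RECOGNISED)
    return error_code == TIMEOUT  # the narrow handling found via the 5 Whys

assert should_refresh_token(408) is True
assert should_refresh_token(399) is False      # the bug: no refresh, login fails
assert should_refresh_token(399, fixed=True) is True
```

Note how the fix addresses the root cause (the refresh condition), not just the symptom (one failed login).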
From the same starting point, there are countless possible chains of questioning, and many different questions we could have asked, each potentially surfacing a different root cause. Later in this article, you can read some useful questions to ask in different problem spaces to identify a variety of root causes.
Application 2: Using the “5 Whys” method to reveal opportunities for continuous improvement
In the previous section, we looked at an example of using “5 Whys” for understanding the root cause within the system itself, which may lead to features being implemented to prevent similar issues.
Another use case is to apply the same approach within sessions such as a post-incident Root Cause Analysis (RCA) or a Blameless Postmortem: learning from the problem to avoid similar challenges, exploring continuous improvement, and building quality into our working practices.
Worked example
Another example of our “5 Whys” may have uncovered the following:
- Customers were unable to log in following site downtime. Why?
- The site was unable to connect to the login API. Why?
- API requests timed out. Why?
- The authentication service was overloaded. Why?
- Traffic spiked after downtime, and responses slowed. Why?
- Autoscaling was not configured correctly. Why?
In this example, it was determined that scaling thresholds were never tested under load, identifying a new type of testing which should be incorporated into processes moving forward.
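A unit-level sketch of what “testing scaling thresholds under load” could mean: comparing the replicas a hypothetical autoscaler will provide against the replicas a given traffic level actually needs. All figures and names below are invented for illustration:

```python
# Illustrative check of hypothetical autoscaling configuration: given a
# post-downtime traffic spike, does scaling keep capacity ahead of load?
def replicas_needed(requests_per_sec, capacity_per_replica=100):
    # assumed per-replica capacity, for illustration only
    return -(-requests_per_sec // capacity_per_replica)  # ceiling division

def autoscaler(requests_per_sec, max_replicas=3):
    # the misconfiguration: replicas are capped too low for spikes
    return min(replicas_needed(requests_per_sec), max_replicas)

normal, spike = 250, 900
assert autoscaler(normal) >= replicas_needed(normal)  # fine day-to-day
assert autoscaler(spike) < replicas_needed(spike)     # overloaded in a spike
```

A check like this would have failed before the incident, which is exactly the gap the 5 Whys surfaced.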
Similarly, you may choose to use each “why” to look for mini process improvements.
- Customer “Simon” was unable to log in following site downtime. Why?
- Simon entered the wrong password, but the login API returned the wrong-password message too late. Why?
- This wasn’t identified in the test environment. Why?
Continuous improvement suggestions:
- Assess API latency monitoring in case of performance issues or response timing issues
- Add timeout handling tests
- Add automated tests that check incorrect password error messages are displayed correctly, including on smaller devices
Not all examples lead to such obvious gaps. We can ask questions based on the full Software Development Life Cycle (SDLC). For example, you can start by asking how we could have caught or prevented this issue from reaching production, then ask about our testing in staging and in development, and even go back to refinement and planning.
These questions can give us meaningful insights into where our processes might have gaps or areas for improvement. One production bug can lead to multiple improvement initiatives (or highlight where practices such as Shift Left are not consistently applied) as we gain insights from all the processes involved in shipping a change.
For example, by continuing to ask “why” around detecting the issue ourselves (observability), we may have uncovered that the team didn’t have visibility on how to set up error alerting for their services, so we can organise some training. As we continue to ask “why?” about earlier stages of the SDLC, we may uncover that quality specialists were not involved early enough, and we can take action to help us shift left.
To learn more about the value in performing blameless postmortems, watch a great talk by Jitesh Gosai on the 2024 CrowdStrike incident and how we can learn from it. There are many questions we can ask to help us understand the real reason our processes have let us down.
Types of questions
Starting point
Regardless of the use case (identifying bugs or areas for continuous improvement), following the five whys method requires you to frame questions to enable effective investigation. If working on a customer-impacting bug, a negative App Store review, or informal feedback, the problem is likely to be written in a chatty or long-winded way, so it might help to structure the problem statement as a formal user story or a given-when-then statement.
Diagnostic checklists
A diagnostic checklist can help uncover known facts without jumping to conclusions. Questioning each of these data points may help us to uncover more of the problem statement:
- Were there any error messages?
- User-visible errors
- Back-end errors, such as HTTP status codes, API responses
- Internal observability, such as metrics, logs or stack traces
- Infrastructure errors
- Was the problem environment-specific?
- Browser
- Device
- OS
- App version
- Was the problem user-specific? Were all customer types affected? If not, which customer types were and were not affected?
- One customer
- Account type or permission level
- Region
- All customers
- Can the development team reproduce the issue?
Questions to enable continuous improvement
Earlier, we explored how asking questions can help us understand what led to the problem arising. A different diagnostic checklist reframes the questions to focus on procedures and ways of working, rather than application logic:
- Why were we unable to detect this issue until customers raised it?
- What alerting do we have in place?
- Or, why don’t we use alerts?
- Or, why were none of the alerts checked?
- Was this scenario tested?
- Is there a difference between lower environments and production which was not covered by testing?
- Was the testing reviewed?
- Was there any collaboration between the developer and tester? If not, why not?
- Could we have caught this in code review, and why wasn’t it discussed?
- Why was this design pattern for error handling used?
- Why weren’t we aware of the need to handle this error condition?
- Who was involved in planning and refinement?
- Why weren’t testers/quality specialists involved?
- Why is there no record of the refinement session?
- Why do we not call out test scenarios in our planning?
When asking these questions, ensure you provide psychological safety for your colleagues. This means people shouldn’t fear repercussions for any mistakes they have made. We are focused on asking questions around the processes (or lack thereof!) that allowed “human error” to result in customer impact. Read more in the book The Field Guide to Understanding Human Error, or watch Butch Mayhew’s short talk titled Blame The Process Not The People.
What do YOU think?
Got comments or thoughts? Share them in the comments box below. If you like, use the questions below as starting points for reflection and discussion.
Questions to discuss
- Do you use the 5 Whys to identify continuous improvements?
- Are you a quality coach or engineer who gets involved with customer-impacting bugs or postmortems?
- Have you done a process similar to this, or have you seen this occur?
- When you have worked under tight time constraints, how did you decide what not to test?
- What signals do you use to decide whether a risk is acceptable or needs attention now?
- Where has test coverage given you a false sense of confidence?
Actions to take
- Pick one recent release and review it through a user-impact lens. What failures would have broken trust?
- In your next planning or review, replace one “test everything” discussion with a conversation about risk, reversibility, and user impact.
- Share this article with someone on your team and ask: What would we do differently if time were even tighter?