Introduction
This article breaks down the “5 Whys” method and how it can be applied to a variety of problem statements, from negative customer reviews and live bugs to identifying root causes and finding continuous improvement areas that stop issues from repeating.
What is the “5 Whys” method?
Developed at Toyota in the 1930s, the “5 Whys” method became integral to Toyota's manufacturing process and was later popularised globally as a fundamental component of problem-solving in Lean management. It emphasises tracing issues back to their root causes through consecutive questioning, fostering more effective solutions. (Source: https://purplegriffon.com/blog/5-why-analysis)
The “5 Whys” is a thinking technique that prompts you to go beyond the surface-level issue, sometimes exploring divergent lines of questioning within a single problem statement. Whilst it can be used retrospectively to understand why an issue arose, which we’ll cover later, it is also a useful tool for identifying bugs and helping us respond swiftly to incidents, and is often used by incident response teams within corporations. Although it is called the 5 Whys, you may need more or fewer questions to reach the root cause of your investigation. This could mean asking three or four questions, or even six or more.
A problem in live brings an opportunity to use the “5 Whys”
Imagine the chaos. The internal platform team have highlighted a sharp decrease in the number of logins per hour, and a customer has opened a support ticket to complain that they have been unable to log in to the site since the maintenance window ended. The ticket has been forwarded to the engineers in the account team. Our team! Regardless of how the issue is flagged, our role now is to determine the root cause and resolve it as quickly as possible, so customers can log in and use the product.
Most businesses would be very concerned to hear that customers couldn’t log in and use their products, but the industry context also matters. If a customer cannot log in to pay their quarterly water bill, a day of downtime won’t have the same impact as being unable to log in to a taxi app to get home in the cold early hours of the morning. In short, when following the 5 Whys method, be mindful of the context in which decisions were made and of their criticality, given the industry.
Application 1: Using the “5 Whys” method to identify bugs
One way to use the “5 Whys” method is to help us track down the exact cause of bugs in our products. It is often the responsibility of quality professionals to help understand the root cause.
First, write a brief, clear problem statement or user story that focuses on the known facts of the issue, ensuring you have access to relevant tickets, reports, and logs. For example, your problem statement may be: ‘Customers are reporting being unable to log in following site downtime.’
Next, ask why this is happening using the “5 Whys method”. In our example, we would ask, “Why would customers be failing to log in after site downtime?”. We might initially have a few ideas of different potential causes for this, such as:
- The login API was returning with errors.
- The login API could not be reached.
- The client was not sending requests to the login API.
Based on the logs, error messages shown in the user interface, and what you can reproduce, you may be able to narrow this down to one or two possibilities. In this example, we may have seen that the API was returning the following:
```json
{"errors":[{"code":399}]}
```
Following the “5 Whys”, we don’t stop and report the bug as soon as we get the error code. Instead, the next step is to ask our second question: “Why would I get this error code?”. Looking at the internal documentation, we can see that this code is returned when the API rejects the token.
Then we can ask our third question, although there are a couple of possibilities here:
- Why did the API reject the token being provided?
- Why did the client send a token that the API would reject?
Diving into the API first, we learn that this only happens if the token is not recognised by the API. This leads us to our fourth question. Why wasn’t this token recognised by the API? Asking this question reveals that the tokens are only kept in memory by the API. The downtime in the problem statement was due to a server reboot. This meant that the login API would no longer know the token the client was using before the reboot. We may be satisfied with the depth that we’ve covered with these questions. Asking “Why was the server rebooted?” doesn’t give us any further insight into the bug that we’re looking into, so we can stop following this route of questioning.
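To make this failure mode concrete, here is a minimal Python sketch. Everything in it is hypothetical (the class, the meaning of code 399) and exists only to illustrate why a reboot invalidates every existing session when tokens live solely in the API process’s memory:

```python
# Hypothetical sketch: the login API keeps issued tokens only in a
# process-local dict, so a reboot wipes them all.
import secrets

class LoginAPI:
    def __init__(self):
        self._tokens = {}  # token -> username, held only in memory

    def log_in(self, username):
        token = secrets.token_hex(16)
        self._tokens[token] = username
        return token

    def validate(self, token):
        # Unknown tokens produce the (assumed) 399 error seen in the logs
        if token not in self._tokens:
            return {"errors": [{"code": 399}]}
        return {"user": self._tokens[token]}

api = LoginAPI()
token = api.log_in("simon")
assert api.validate(token) == {"user": "simon"}

api = LoginAPI()  # simulate the server reboot: a fresh process, empty store
assert api.validate(token) == {"errors": [{"code": 399}]}
```

The client still holds a token that looks perfectly valid to it, which is exactly why the questioning has to continue on the client side.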
We can summarise our findings in the following format:
| Problem statement to question | Result |
| --- | --- |
| Customers were unable to log in following site downtime. Why? | The API was returning a 399 error code |
| The login API was returning an error code when people tried to log in. Why? | The session token was rejected |
| The session token was being rejected by the login API. Why? | The API didn’t recognise the token from the client |
| The login API didn’t recognise the token the client was using. Why? | The tokens are stored in memory and were lost when the server was restarted. |
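If you record chains like this often, it can help to capture them as data rather than rebuilding the table by hand each time. A small, illustrative helper (all names here are made up for this sketch) that stores each question-and-finding pair and prints it in the table format used above:

```python
# Illustrative helper for recording a 5 Whys chain as (question, result)
# pairs and rendering it as a markdown table.
def five_whys_table(rows):
    lines = ["| Problem statement to question | Result |", "| --- | --- |"]
    for question, result in rows:
        lines.append(f"| {question} Why? | {result} |")
    return "\n".join(lines)

chain = [
    ("Customers were unable to log in following site downtime.",
     "The API was returning a 399 error code"),
    ("The login API was returning an error code when people tried to log in.",
     "The session token was rejected"),
]
print(five_whys_table(chain))
```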
However, asking these questions is only part of the puzzle, and we do not have a complete understanding of the root cause in this example. By continuing to ask ‘why?’, we have learnt that the login API was rejecting the requests. Let’s go back to that other question from earlier. “Why did the client send a token that the API would reject?”
By continuing to ask why, we may be able to learn not only what the bug was at a user-facing level (e.g., unable to log in after downtime) but also to focus on understanding the root cause of the issue the customer was seeing. This helps us resolve the issue more reliably.
Once complete, our “5 Whys” may have uncovered the following:
| Problem statement to question | Result |
| --- | --- |
| Customers were unable to log in following site downtime. Why? | The API was returning a 399 error code |
| The login API was returning an error code when people tried to log in. Why? | The session token was rejected |
| The session token was being rejected by the login API. Why? | The API didn’t recognise the token from the client |
| The login API didn’t recognise the token the client was using. Why? | The server had been restarted, and this caused the existing tokens to be lost from the login API’s memory. |
| The client was sending an out-of-date token. Why? | It wasn’t aware that the server had been restarted and that the tokens would no longer be valid. |
| The client wasn’t aware that the tokens were no longer valid. Why? | The token lifetime is longer than a server restart takes, so the token wouldn’t expire for up to another 60 minutes. |
| The client wasn’t requesting a new token when the server rejected it. Why? | The client only requests new tokens when a timeout error code is returned. |
| The client requests new tokens only when a timeout error code is returned. Why? | The error handling covers a very narrow case, so when the token error was returned, something not previously considered, the client didn’t try to request a new token. This root cause could also lead to similar bugs elsewhere. |
The conclusions that we might write on the ticket are:
- Every time the server is restarted, existing login tokens become invalid.
- The client is only handling specific error messages, meaning that it does not request new tokens. This needs to be addressed to prevent this issue happening again.
- There may be further undiscovered issues as a result of this error handling behaviour.
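As a sketch of that final conclusion about narrow error handling, the hypothetical client logic might look like this, with the proposed fix alongside. The error codes and function names are assumptions for illustration, not the real client:

```python
# Hypothetical sketch of the client bug: new tokens are requested only on a
# timeout, so a 399 "token not recognised" error is never recovered from.
TIMEOUT = 408
TOKEN_NOT_RECOGNISED = 399  # assumed internal code from the investigation

def should_refresh_token(error_code, fixed=False):
    if fixed:
        # Proposed fix: refresh on any token rejection, not just timeouts
        return error_code in (TIMEOUT, TOKEN_NOT_RECOGNISED)
    return error_code == TIMEOUT  # the narrow handling found via the 5 Whys

assert should_refresh_token(408) is True
assert should_refresh_token(399) is False      # the bug: no refresh, login fails
assert should_refresh_token(399, fixed=True) is True
```

Note how the fix addresses the root cause (the refresh condition), not just the symptom (one failed login).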
From the same starting point, there are countless possible chains of questioning, and many different questions we could have asked, each potentially surfacing a different root cause. Later in this article, you can read some useful questions to ask in different problem spaces to identify a variety of root causes.
Application 2: Using the “5 Whys” method to reveal opportunities for continuous improvement
In the previous section, we looked at an example of using “5 Whys” for understanding the root cause within the system itself, which may lead to features being implemented to prevent similar issues.
Another use case is to apply the same approach within sessions such as a post-incident Root Cause Analysis (RCA) or a Blameless Postmortem: learning from the problem to avoid similar challenges, exploring continuous improvement, and building quality into our working practices.
Worked example
Another example of our “5 Whys” may have uncovered the following:
- Customers were unable to log in following site downtime. Why?
- The site was unable to connect to the login API. Why?
- API requests timed out. Why?
- The authentication service was overloaded. Why?
- Traffic spiked after downtime, and responses slowed. Why?
- Autoscaling was not configured correctly. Why?
In this example, it was determined that scaling thresholds were never tested under load, identifying a new type of testing which should be incorporated into processes moving forward.
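A unit-level sketch of what “testing scaling thresholds under load” could mean: comparing the replicas a hypothetical autoscaler will provide against the replicas a given traffic level actually needs. All figures and names below are invented for illustration:

```python
# Illustrative check of hypothetical autoscaling configuration: given a
# post-downtime traffic spike, does scaling keep capacity ahead of load?
def replicas_needed(requests_per_sec, capacity_per_replica=100):
    # assumed per-replica capacity, for illustration only
    return -(-requests_per_sec // capacity_per_replica)  # ceiling division

def autoscaler(requests_per_sec, max_replicas=3):
    # the misconfiguration: replicas are capped too low for spikes
    return min(replicas_needed(requests_per_sec), max_replicas)

normal, spike = 250, 900
assert autoscaler(normal) >= replicas_needed(normal)  # fine day-to-day
assert autoscaler(spike) < replicas_needed(spike)     # overloaded in a spike
```

A check like this would have failed before the incident, which is exactly the gap the 5 Whys surfaced.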
Similarly, you may choose to use each “why” to look for mini process improvements.
- Customer “Simon” was unable to log in following site downtime. Why?
- Simon entered the wrong password, but the login API returned the wrong-password message too late. Why?
- This wasn’t identified in the test environment. Why?
Continuous improvement suggestions:
- Assess API latency monitoring in case of performance issues or response timing issues
- Add timeout handling tests
- Add automated tests that check incorrect password error messages are displayed correctly, including on smaller devices
Not all examples lead to such obvious gaps. We can ask questions based on the full Software Development Life Cycle (SDLC). For example, you can start by asking how we could have caught or prevented this issue from reaching production, then ask about our testing in staging and in development, and even go back to refinement and planning.
These questions can give us meaningful insights into where our processes might have gaps or areas for improvement. One production bug can lead to multiple improvement initiatives (or highlight where practices such as Shift Left are not consistently applied) as we gain insights from all the processes involved in shipping a change.
For example, by continuing to ask “why” around detecting the issue ourselves (observability), we may have uncovered that the team didn’t have visibility on how to set up error alerting for their services, so we can organise some training. As we continue to ask “why?” about earlier stages of the SDLC, we may uncover that quality specialists were not involved early enough, and we can take action to help us shift left.
To learn more about the value in performing blameless postmortems, watch a great talk by Jitesh Gosai on the 2024 CrowdStrike incident and how we can learn from it. There are many questions we can ask to help us understand the real reason our processes have let us down.
Types of questions
Starting point
Regardless of the use case (identifying bugs or areas for continuous improvement), following the five whys method requires you to frame questions to enable effective investigation. If working on a customer-impacting bug, a negative App Store review, or informal feedback, the problem is likely to be written in a chatty or long-winded way, so it might help to structure the problem statement as a formal user story or a given-when-then statement.
Diagnostic checklists
A diagnostic checklist can help uncover known facts without jumping to conclusions. Questioning each of these data points may help us to uncover more of the problem statement:
- Were there any error messages?
- User-visible errors
- Back-end errors, such as HTTP status codes, API responses
- Internal observability, such as metrics, logs or stack traces
- Infrastructure errors
- Was the problem environment-specific?
- Browser
- Device
- OS
- App version
- Was the problem user-specific? Were all customer types affected? If not, which customer types were and were not affected?
- One customer
- Account type or permission level
- Region
- All customers
- Can the development team reproduce the issue?
Questions to enable continuous improvement
Earlier, we explored how asking questions can help us understand what led to the problem arising. A different diagnostic checklist reframes the questions to focus on procedures and ways of working, rather than application logic:
- Why were we unable to detect this issue until customers raised it?
- What alerting do we have in place?
- Or, why don’t we use alerts?
- Or, why were none of the alerts checked?
- Was this scenario tested?
- Is there a difference between lower environments and production which was not covered by testing?
- Was the testing reviewed?
- Was there any collaboration between the developer and tester? If not, why not?
- Could we have caught this in code review, and why wasn’t it discussed?
- Why was this design pattern for error handling used?
- Why weren’t we aware of the need to handle this error condition?
- Who was involved in planning and refinement?
- Why weren’t testers/quality specialists involved?
- Why is there no record of the refinement session?
- Why do we not call out test scenarios in our planning?
When asking these questions, ensure you provide psychological safety for your colleagues. This means people shouldn’t fear repercussions for any mistakes they have made. We are focused on asking questions around the processes (or lack thereof!) that allowed “human error” to result in customer impact. Read more in the book The Field Guide to Understanding Human Error, or watch Butch Mayhew’s short talk titled Blame The Process Not The People.
What do YOU think?
Got comments or thoughts? Share them in the comments box below. If you like, use the questions below as starting points for reflection and discussion.
Questions to discuss
- Do you use the 5 Whys to identify continuous improvements?
- Are you a quality coach or engineer who gets involved with customer-impacting bugs or postmortems?
- Have you done a process similar to this, or have you seen this occur?
- When you have worked under tight time constraints, how did you decide what not to test?
- What signals do you use to decide whether a risk is acceptable or needs attention now?
- Where has test coverage given you a false sense of confidence?
Actions to take
- Pick one recent release and review it through a user-impact lens. What failures would have broken trust?
- In your next planning or review, replace one “test everything” discussion with a conversation about risk, reversibility, and user impact.
- Share this article with someone on your team and ask: What would we do differently if time were even tighter?