
Back to the basics: Rethinking how we use AI in testing

Identify blind spots created by AI-driven testing and prevent false confidence in testing outcomes


Over the past few years, AI has become an integral part of nearly every conversation in software testing.  

It’s used to summarize product requirements, generate test cases, refine test plan documents, suggest potential API tests, and, through MCP and AI agents, improve test automation scripts. In some cases, we’ve even seen “self-healing” tests, where automated suites detect and adapt to changes in the application under test with minimal human intervention.

It feels like we’re entering a new golden era of effortless testing, but in practice, it’s not that simple. While AI gives the impression of easing our testing efforts, it often replaces thoughtful analysis with superficial convenience. We start outsourcing curiosity, depth, and problem-solving to something that doesn’t understand or care about context, outcome, or business impact. Somewhere along the way, we’ve started forgetting the basics: analysis, collaboration, and deep understanding of the systems we test.

The AI hype versus testing reality

I’ve explored numerous AI-driven testing tools, watched impressive demos, listened to polished pitches, and even run a few solid proofs of concept. Across all of them, I’ve seen a consistent, disturbing pattern: once you move from the controlled demo environment into your product’s real workflows, major limitations appear quickly. Perhaps these tools work better in simpler or more predictable domains, but for complex applications and environments, I’m still not entirely convinced.

What is clear is that the challenge goes far beyond prompts and output. AI doesn’t share our product’s history, architecture, or shifting business priorities: the context that actually shapes testing decisions. And meaningful testing requires rich contextual information. AI can generate output, but it cannot interpret unclear requirements, root-cause production incidents, or replace the human-centered understanding of our systems that guides risk-based testing. That’s judgment, and judgment comes from people.

In chasing automation or 10x productivity, it’s easy to forget a critical part of testing: discovery. Understanding system constraints. Learning how data truly behaves. Asking questions. Engaging with people. 

I saw this play out when I evaluated an AI-powered low-code automation platform that claimed it could transform requirements into automated tests, promising faster coverage with minimal maintenance. The demo looked great. But the platform broke down when we applied it to our actual data workflows, in which pipelines pull data from our data warehouse, transform and enrich large volumes of data, and deliver that data to a third-party system.

Our AI-generated tests verified the requirements and ran without error in our test environments. But when we released the code to production, the external third-party system rejected some records due to validation rules we did not know existed. In our review apps, the mocked responses reflected only the behavior we believed was correct based on what we knew during analysis, design, and implementation. Each team worked within its own constraints, so the mocks naturally covered only the rules we were aware of, not the full set enforced by the partner system. As a result, our AI-generated tests were technically correct for the context we had, but they did not give us meaningful confidence in the regression test suite. 
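
To make the blind spot concrete, here is a minimal, hypothetical sketch of the kind of mock our review apps relied on. The endpoint, field names, and rules are illustrative rather than our real integration; the point is that a mock can only enforce the validation rules its authors already know about.

```python
# Hypothetical sketch (names and rules are illustrative, not our real integration).
# The mock encodes only the validation rules we knew about at design time, so any
# rule the partner enforces beyond these is invisible to the regression suite.
import pytest

KNOWN_REQUIRED_FIELDS = {"record_id", "title", "amount"}  # the rules we were aware of


def mock_partner_delivery(record: dict) -> dict:
    """Stand-in for the third-party delivery endpoint used in our review apps."""
    missing = KNOWN_REQUIRED_FIELDS - record.keys()
    if missing:
        return {"status": "rejected", "reason": f"missing fields: {sorted(missing)}"}
    # The real partner also rejected records that broke rules we had never modeled,
    # so "accepted" here proved less than we thought.
    return {"status": "accepted"}


@pytest.mark.parametrize("record", [
    {"record_id": "r-1", "title": "Track A", "amount": 12.5},
    {"record_id": "r-2", "title": "Track B", "amount": 0},
])
def test_delivery_accepts_known_valid_records(record):
    # Green against the mock, yet it only exercises the rules we already knew about.
    assert mock_partner_delivery(record)["status"] == "accepted"
```

Nothing in a test like this is wrong on its own; it simply cannot see rules nobody wrote down, which is exactly the gap the partner system exposed in production.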

Some might argue that this could happen even without AI, and that is true, but the lesson is different. When we over-rely on AI and stop thinking critically about gaps across the entire flow, these situations become far more likely to occur. In the end, the resolution didn’t stem from the use of any AI tool. It came from experienced testers, cross-team collaboration, and revisiting the acceptance criteria with the business stakeholder. 

We’ve forgotten the basic principles of testing

In the rush of delivery cycles, it’s tempting to swap collaboration for convenience, even if we lose depth when we do. Instead of clarifying requirements with developers or product managers, we ask AI to explain business logic. Instead of discussing flows together, we ask AI to generate scenarios. In doing so, we lose another core pillar of testing: shared understanding.

Before writing a single test, we need alignment: a shared understanding of what we’re building, the intent behind features, the actors involved, how systems interact, and which assumptions or unknowns might influence behavior and create risk. Simply prompting an AI model skips this crucial phase entirely. This kind of work often seems “slow” and doesn’t fit neatly into today’s push for constant acceleration. It requires short workshops, whiteboard discussions, architecture sketches, and mind maps. But this is where clarity is created. It’s where good testing actually begins.

AI can support this process, turning rough notes into diagrams or first-draft visuals, but it cannot build understanding for us. That comes from people. 

We’ve also drifted away from core testing principles. Risk-based testing requires context: knowledge of users' needs and preferences, business impact, architecture, and operations. AI cannot infer risk from first principles. Likewise, learning a new system or library isn’t just about generating code examples. You need to read documentation, explore the codebase, ask questions, and trace how data moves.

Overreliance on AI often produces shallow understanding and occasionally hilarious failures. Take a look at the Reddit thread of Microsoft engineers reacting to flawed AI-generated pull requests. But production failures aren't so funny when they happen to your own team.

In short, AI can accelerate parts of the process, but it cannot replace the human work of uncovering assumptions, validating intent, and connecting real-world behavior across systems.

Where AI is a great addition to a tester's toolbox

I don’t blame AI for its shortcomings, nor do I consider it useless. The pace of progress is incredible, and the benefits across many areas are undeniable. 

But that doesn’t mean we should use it everywhere. The real impact comes from using it wisely, in places where it actually helps. AI becomes genuinely valuable when it supports our thinking rather than trying to replace it. Here are the areas where, in my personal experience, I’ve seen AI consistently provide real leverage:

  • Structuring test plans, test summaries, and other information: Using speech-to-text or whiteboard photos, AI can transform rough thoughts into structured test plans, summaries, or stakeholder updates. You keep the insight while AI cleans up the presentation.
  • Extending existing testing patterns: In one of our API automation projects, I used GitHub Copilot Chat by feeding it our current tests, the acceptance criteria of the new feature, and the API specs. It proposed parameterized variants, negative cases, and boilerplate updates (a sketch of this kind of output follows this list). With a quick review, a few more prompts, and some “manual” code fixes, this approach genuinely accelerated the work and saved time.
  • Generating repetitive data or test factories: For large message sets, event payloads, JSON templates, CSV files, or load-testing datasets, AI can quickly produce high-volume variations once you define a clear structure. When done well, this saves hours of manual setup time.
  • Summarizing test results and feedback for stakeholders: Teams often accumulate logs, comments, defects, and user feedback that are hard to communicate clearly to upper management. AI can turn this raw material into concise summaries that highlight risks, impact, and next steps, giving clarity about testing outcomes and their value.
  • Using AI-assisted code reviews: In our own teams, we’ve also enabled GitHub Copilot Chat directly in our IDEs, allowing engineers to get inline feedback on code as they write. Beyond that, we’ve added Copilot as a reviewer on our pull requests. It automatically provides early comments before the human review begins. This small addition helps us catch simple mistakes, formatting issues, and missing assertions early in the process, making the human review more focused on logic and design rather than syntax or structure.
  • Customizable AI agents: Tools like Goose can automate tedious testing tasks when they’re properly configured and guided by senior engineers. In one of my recent projects, my team experimented with using an AI agent to help migrate several of our test automation repositories from poetry to uv for Python dependency management. After several iterations, corrections, and back-and-forth refinement, the agent reached a point where it could produce decent, reusable changes across multiple repos. Training the agent had a cost: we had to shape its prompts, define configuration patterns, validate the output, and allocate a small budget for GPT API usage. But once it “learned” our conventions and constraints, it started saving us meaningful time.
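
As a concrete illustration of the second bullet, here is a minimal sketch, assuming a hypothetical orders endpoint, of the kind of output this workflow produced: a small data factory plus parameterized positive and negative cases. The URL, payload shape, and status codes are illustrative, not the real project’s API.

```python
# A minimal sketch of the kind of tests Copilot proposed after seeing our existing
# suite, the acceptance criteria, and the API spec. Endpoint, fields, and status
# codes are hypothetical placeholders, not the real project's API.
import pytest
import requests

BASE_URL = "https://api.example.test"  # placeholder environment URL


def make_order(**overrides) -> dict:
    """Tiny test-data factory: a valid payload with selective overrides."""
    payload = {"sku": "ABC-123", "quantity": 1, "currency": "EUR"}
    payload.update(overrides)
    return payload


def create_order(payload: dict) -> requests.Response:
    return requests.post(f"{BASE_URL}/orders", json=payload, timeout=10)


@pytest.mark.parametrize("quantity", [1, 10, 999])
def test_create_order_accepts_valid_quantities(quantity):
    response = create_order(make_order(quantity=quantity))
    assert response.status_code == 201


@pytest.mark.parametrize("field, value, expected_status", [
    ("quantity", 0, 422),      # below the allowed minimum
    ("quantity", -5, 422),     # negative values rejected
    ("currency", "XYZ", 422),  # unsupported currency code
    ("sku", "", 422),          # empty identifier
])
def test_create_order_rejects_invalid_input(field, value, expected_status):
    response = create_order(make_order(**{field: value}))
    assert response.status_code == expected_status
```

The value was not that the AI invented these cases; it extended patterns we had already established, and the review, the fixes, and the final judgment stayed with us.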

AI becomes genuinely useful only when we treat its output as a starting signal rather than a final answer. The first thing a model gives you is often rough, merely plausible, or biased by the context it guessed. It becomes meaningful only after we validate it, challenge it, and connect it to the real behavior of the system.

A simple rule of thumb: 

  • In general, AI is best reserved for well-defined, repetitive tasks. This includes anything that benefits from consistency over critical analysis. Examples include formatting test plans, generating mock data, extending test templates, and summarizing feedback.
  • Avoid AI when the task is ambiguous, risk-oriented, domain-specific, or dependent on system behavior. Examples include risk assessment, interpreting business rules, test analysis, identifying system constraints, and designing the first version of a test strategy.

AI helps us move faster, but without human understanding, we’re just automating blind spots. And this isn’t just practitioner intuition. A 2025 field experiment found that even seasoned engineers became 19 percent slower when relying on LLMs for real work, because the models struggled with context, risk, and system knowledge: the very aspects that make testing meaningful.

The mindset shift: back to human-driven testing

When delivery pressure rises, it’s easy to rely on AI as a shortcut. But the most important part of testing still begins before any tool enters the picture: understanding the problem. That work comes from discussion, clarification, and shared exploration, not automation.

I saw this clearly in an internal project aimed at improving an operations tool used for processing large datasets. Our early assumption seemed logical: add more filters, expose more information, allow users to undo mistakes, and productivity should improve. Instead, we moved too quickly into design and implementation without validating real user needs.

Once we paused, met with the operations team, and ran a few alpha-style usability sessions, the actual pain points became obvious. Some workflows weren’t used the way we imagined. A few actions were slower than expected. Certain screens created cognitive overload. None of this came through in documentation, dashboards, or discussions, but only through direct conversations and observation.

That experience reinforced something essential: quality improves when teams collaborate, challenge assumptions, and connect with real users. The turning points in testing rarely come from tools; they come from dialogue. Every organization has its own domain, constraints, and nuances, and understanding them requires human insight.

AI is most powerful after we understand the system. Once the thinking is done, AI can help us refine a test plan, draft communication, or automate repetitive parts of the workflow. But it can’t define the problem, uncover the unknowns, or align teams. That’s the human side of testing, and it remains irreplaceable.

The human-centered future of testing

The best testers I’ve worked with, regardless of title, seniority, or tech stack, all share one trait: they use tools to amplify their thinking, not replace it. They balance automation with exploration and interpret metrics with critical thinking, not blind trust. What makes them effective is not the number of tests they create, but their ability to understand the problem, challenge assumptions, and navigate ambiguity.

As AI becomes part of our everyday workflow, we need to keep those habits alive. One area where I genuinely hope to see more progress is in AI tools that support real test design and exploratory or experiential testing, not just generating steps or automating scripts. Most current tools focus on producing artifacts or extending patterns, but very few help testers think better. The value of AI is real, but only when paired with human judgment. A healthy balance looks like this:

  • Ask the right questions early in the process
  • Collaborate deeply with product, engineering, and operations, especially when requirements are unclear
  • Understand the system before automating scenarios around it
  • Use AI in areas where it genuinely reduces effort, saves time, or accelerates clarity
  • Keep advocating for quality, even when delivery pressure rises

The future of testing won’t be shaped by how much work AI can automate, but by how intentionally we integrate it. The real advantage comes when testers use AI as a support mechanism, a way to speed up routine work while staying focused on analysis, risk, and understanding. 

There’s also a nuance worth calling out: today’s AI often gets dropped into legacy systems, where much of the understanding is embedded in design and code it cannot easily be trained on. In a company starting from zero, however, with agents embedded early and learning as the system grows, AI could play a far bigger role in testing than what we see today. But even then, the foundation remains the same: people define the reasoning, the intent, and the decisions that any agent builds on.

To wrap up

What matters most in testing is still the human part, the conversations, the curiosity, the collaboration, and the decisions we make based on context that no tool can fully capture. Testing has never been about producing quick, simplistic answers. It has always been about thinking deeply, challenging assumptions, and understanding the real impact of what we build.

AI might run fast, but humans still decide where to run and why. The future of testing will still be human and thoughtful, creative, and grounded in real understanding.

What do YOU think?

Got comments or thoughts? Share them in the comments box below. If you like, use the ideas below as starting points for reflection and discussion.

Questions to discuss

  • When was the last time you stepped back from AI summaries and actually read the full requirements or API docs end to end?
  • Has your collaboration with developers and product managers changed now that AI can explain features for you?
  • Have you compared AI-generated test ideas with your own and noticed gaps in risk, assumptions, or real product behavior? What quality metrics does AI improve in your company?
  • Do you still start test planning for complex features with mind maps, diagrams, or analysis sessions? Or have those habits faded?
  • If AI disappeared for a week, would your testing approach stay the same, or would you need to rebuild your process?

Actions to take

  • Revisit one recent feature and rebuild the test analysis manually using mind maps, flows, and risk lists. Then compare it with the AI-assisted version.
  • Hold a short alignment session with a developer or product manager before your next feature instead of prompting an AI for clarification.
  • Choose one AI tool you already use and evaluate it properly by checking what it gets right, what it misses, and where it creates blind spots.
  • Review your last three regression failures and identify whether they were caused by assumptions, gaps in context, or misunderstood system behavior.
  • Run one exploratory testing session without AI assistance and note the differences in depth, curiosity, and patterns you uncover.

For more information

Konstantinos Konstantakopoulos
Director of Quality Engineering at Orfium
Konstantinos Konstantakopoulos is Director of Quality Engineering at Orfium, leading quality and test automation with a focus on Developer Experience.
Comments
Gary Hawkes
I mentioned this in a club post. Quality is about balancing cost, quality and timescale. To improve, we need to get cheaper and/or better and/or quicker. How many articles do you see from people or vendors about using AI to be better? I'm seeing those outside QA, the focus is always on cheaper and faster, never on better.

Simon Tomes
This is an essential read and I highly recommend it to the tech community. Thanks for writing it, Konstantinos. It's brilliant!

Lewis Prescott
Absolutely agree. This is a must read!

Konstantinos Konstantakopoulos
Thank you all, I really appreciate the feedback and the discussion. Especially now, as AI produces code at scale, our role in quality and testing is more critical than ever. To adopt AI meaningfully, we must first deeply understand and respect the fundamentals. Otherwise, we risk creating noise and false signals, repeating past debates like manual vs automation that missed the essence of quality. This is the time to strengthen our craft, not dilute it, as Gary pointed out.

Lisa Crispin
Thank you for this comprehensive and valuable guidance! IMO it aligns well with what DORA has found in their extensive research on the impact of AI assistance in software development. Also congruent with DORA's AI capabilities model - https://cloud.google.com/resources/content/2025-dora-ai-capabilities-model-report

Deepak Karn
Good piece. AI can generate output, not understanding. Most failures happen because assumptions go unchallenged, context is missing, or risk is misunderstood. AI does none of that unless humans already did the thinking. Use it to assist work, not to outsource judgment. That’s the line teams keep crossing.

Iain
A must read, really highlights the pro's and con's - thank you for writing. Some take-aways for me "That experience reinforced something essential: quality improves when teams collaborate, challenge assumptions, and connect with real users." "AI helps us move faster, but without human understanding, we’re just automating blind spots."

Dan Alloun
Wow ... that just shakes you and makes you open your eyes. Don’t get fooled by buzzwords. Really good article! Thank you.
