Introduction
The role of the tester has never been static! From the personal touch of manual verification to automated regression, Quality Assurance (QA), and now Quality Engineering, software testing has evolved alongside the software industry's transformations. Yet the rise of Artificial Intelligence (AI), Machine Learning (ML) and self-adaptive systems introduces a fundamentally different challenge for software testing. Self-adaptive systems are structured to change their behaviour while running, in response to changes in their environment or within the system itself. They are systems that learn, decide and evolve autonomously. Testing them is no longer just about verifying static requirements. We have to think about the bigger picture: dynamic reliability, ethical integrity and, above all, ongoing client and customer trust.
Why does traditional testing fall short?
Traditional testing, even when automated, is fundamentally deterministic. We operate on the assumption that a system will behave in a certain way and that a user will operate it in a certain way. Testers design cases to confirm that a system produces the expected outputs, measure coverage and ensure compliance with functional requirements.
This model works well for static, rule-based systems, such as banking transaction software, e-commerce checkout flows, or RESTful APIs, where the logic is well-defined and repeatable. However, a new generation of software, such as AI-driven, self-adaptive and context-sensitive systems, no longer follows these predictable patterns.
The nature of probabilistic behaviour
Probabilistic behaviour refers to a system's behaviour being governed by probability rather than being fully deterministic. In such systems, the same input or situation may lead to different outcomes, each with a certain likelihood. AI systems, particularly those based on ML, operate in this probabilistic manner. Instead of executing predefined logic, they infer outcomes from patterns learnt from data.
- A chatbot trained on natural language models might often provide a different response to the same query, influenced by previous context or by randomisation in language generation.
- A recommendation engine recalibrates its results continuously in response to shifting user behaviour and external trends.
- An autonomous vehicle perception system might interpret identical sensor data differently depending on environmental data or prior model adjustments.
In such cases, there may not be a single “correct” output but a range of acceptable behaviours that meet confidence thresholds. This variability essentially challenges the logic of pass/fail testing.
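To make this concrete, here is a minimal sketch (in Python) of how a pass/fail assertion might be reframed as a statistical acceptance check over repeated runs. The recommendation function, the notion of "relevant items" and the 90% threshold are hypothetical assumptions for illustration, not a prescribed standard.

```python
# A minimal sketch: accept a *range* of behaviours instead of one exact output.
# `get_recommendations` is a hypothetical stand-in for a non-deterministic system.
import random

def get_recommendations(user_id: int, seed: int) -> list:
    """Stand-in for a probabilistic recommendation call."""
    random.seed(seed)
    catalogue = ["laptop", "phone", "headphones", "monitor", "keyboard"]
    return random.sample(catalogue, k=3)

def test_recommendations_meet_relevance_threshold():
    # Run the system repeatedly and check an agreed quality measure,
    # rather than asserting a single "correct" answer.
    relevant_items = {"laptop", "phone", "headphones"}
    runs = 100
    hits = sum(
        1 for seed in range(runs)
        if relevant_items & set(get_recommendations(user_id=42, seed=seed))
    )
    hit_rate = hits / runs
    assert hit_rate >= 0.9, f"Relevance rate {hit_rate:.2f} below agreed threshold"
```

The pass condition becomes a confidence threshold agreed with stakeholders, which is exactly the shift away from deterministic pass/fail logic described above.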
Data Dependency and Model Drift
Unlike deterministic systems whose behaviour is embedded in the code, AI behaviour emerges from data. Testing must therefore validate not just the code that has been written but also the quality of the data and its inherent bias.
- If the data distribution changes (for example, due to shifting user demographics or seasonal behaviour), the system's performance may degrade without any code changes.
- Models retrained on incomplete or skewed datasets may unintentionally alter the system's fairness or accuracy.
Traditional test automation rarely monitors data drift, leaving organisations unaware of subtle or cumulative quality degradation.
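As an illustration of what such monitoring could look like, here is a minimal drift-check sketch using a two-sample Kolmogorov-Smirnov test from SciPy. The feature name, sample data and p-value threshold are assumptions made purely for the example.

```python
# A minimal sketch of a data-drift check between a baseline (release-time)
# sample of a feature and values observed later in production.
import numpy as np
from scipy.stats import ks_2samp

def check_feature_drift(baseline: np.ndarray, live: np.ndarray,
                        p_threshold: float = 0.01) -> bool:
    """Return True if the live distribution has drifted from the baseline."""
    statistic, p_value = ks_2samp(baseline, live)
    return p_value < p_threshold

# Illustrative data: the hypothetical 'transaction_amount' feature has shifted.
baseline_amounts = np.random.normal(loc=50, scale=10, size=5_000)
live_amounts = np.random.normal(loc=65, scale=12, size=5_000)

if check_feature_drift(baseline_amounts, live_amounts):
    print("Drift detected: schedule model re-validation and fairness re-checks")
```

A check like this can run on a schedule alongside traditional regression suites, flagging quality degradation that no code change would ever reveal.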
The limits of scripted automation
Even the most robust automation frameworks often struggle with adaptive systems! Scripted tests are fragile when the UI or even the APIs evolve frequently. In AI systems where behaviour may change with each retraining cycle, maintaining a static set of test scripts becomes unsustainable.
To solve this, testers must build more adaptive, self-learning validation frameworks that can respond to uncertainty. This is a major shift from static regression testing to continuous, intelligent regression testing.
Evaluating explainability, not just accuracy
Another limitation of traditional testing is its narrow focus on output correctness. For AI systems, correctness alone is not enough: testers must also evaluate explainability, that is, whether the system can justify its decisions. For example, a credit-scoring model may produce accurate predictions, but if its reasoning cannot be explained it may fail compliance audits or ethical standards, even when it is technically right.
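One way a tester might probe explainability is sketched below using scikit-learn's permutation importance. The credit-scoring model, feature names and the list of "allowed drivers" are hypothetical assumptions; the idea is simply to check that the strongest influence on decisions is a feature the business can justify.

```python
# A minimal sketch of an explainability check: verify that the model's top
# decision driver is a feature the organisation is allowed to rely on.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

feature_names = ["income", "debt_ratio", "age", "postcode_cluster"]
X = np.random.rand(500, len(feature_names))
y = (X[:, 0] - X[:, 1] > 0).astype(int)      # synthetic labels for illustration

model = RandomForestClassifier(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

top_feature = feature_names[int(np.argmax(result.importances_mean))]
allowed_drivers = {"income", "debt_ratio"}    # assumption: agreed with compliance
assert top_feature in allowed_drivers, f"Unexplained decision driver: {top_feature}"
```

A failing assertion here does not mean the predictions are wrong; it means the model cannot be justified, which is a different and equally serious quality problem.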
Complexity and emergent behaviour
AI-driven systems often exhibit emergent properties, such as unexpected behaviours arising from the interactions among multiple models or agents. Testing such emergent complexity requires scenario simulation, behavioural analytics and stress testing across all interacting components.
In summary, traditional testing assumes stability, whereas AI systems embrace evolution. Software testing’s job is no longer just to validate outcomes but to assess confidence, interpret variability, and safeguard trust in both the system and the AI's behaviour.
The emerging challenge
As AI transitions from a supporting technology to an operational decision maker, the quality conversation expands from technical reliability to include ethical and social accountability.
The critical question moves from “does it work as designed?” to “can we trust it to act responsibly?”.
Redefining quality in the age of AI: The AI FART Model
Historically, when we say the word “Quality”, we mean meeting both functional requirements and quality characteristics. Now, in the AI world, we need a broader view of quality that includes human and societal impacts. This broader view is captured in the AI FART model:
- Fairness: The system must not discriminate or disadvantage people
- Accountability: There must be clarity on who is responsible for the AI outcomes
- Resilience: Systems must adapt safely to changing data and context
- Transparency: Decisions should be explainable and traceable
This much wider definition of quality demands new metrics, including fairness indices, confidence intervals, and interpretability scores, alongside traditional KPIs such as accuracy, defect density, and pass/fail rates. These additional measures help us assess whether our AI systems are living up to the AI FART model in practice.
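As a small example of what a fairness index might look like in practice, the sketch below computes a demographic parity gap between two groups. The group labels, sample predictions and the 0.2 threshold are illustrative assumptions, not a recommended compliance bar.

```python
# A minimal sketch of a fairness metric: the gap in favourable-outcome rates
# between two groups (demographic parity difference).
import numpy as np

def demographic_parity_difference(predictions: np.ndarray,
                                  groups: np.ndarray) -> float:
    """Absolute gap in positive-outcome rates between groups A and B."""
    rate_a = predictions[groups == "A"].mean()
    rate_b = predictions[groups == "B"].mean()
    return abs(rate_a - rate_b)

preds = np.array([1, 0, 1, 0, 1, 1, 0, 0])          # 1 = favourable outcome
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

gap = demographic_parity_difference(preds, groups)
print(f"Demographic parity gap: {gap:.2f}")
assert gap <= 0.2, "Fairness threshold breached: investigate before release"
```

Metrics like this sit alongside accuracy and defect density in the quality dashboard, turning the Fairness pillar of AI FART into something measurable.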
The trust gap
AI systems often make decisions faster and at a scale that humans can’t match. However, when they fail, the consequences can be severe and far-reaching. Biased hiring algorithms, self-driving car accidents, and misdiagnosed medical conditions are among the most visible examples.
Software testing has become the guardian of trust, ensuring that automation doesn’t remove accountability. Building this trust requires combining quantitative validation (accuracy, robustness, and regulatory compliance) with qualitative assurance (ethics and user perception) and continually asking whether Fairness, Accountability, Resilience and Transparency are being upheld.
Blending technical and ethical assurance
The new generation of software testing professionals must operate at the intersection of three areas:
- Technical Assurance: Understanding ML pipelines, testing AI explainability and verifying data integrity
- Ethical Governance: Applying frameworks such as ISO standards and IEEE 7000 to guide responsible testing
- AI Literacy: Being able to interpret model outputs, understand bias mechanisms and engage with data meaningfully
This fusion transforms software testing from a downstream activity into a strategic governance function, one that not only detects bugs but shapes organisational ethics and risk culture around fairness, accountability, resilience and transparency.
Continuous reasoning over static execution
AI systems continue to develop after deployment. They retrain, readapt and sometimes self-optimise in real time. This means testing can't end at release. It becomes a continuous loop, monitoring live data, recalibrating benchmarks and validating how the system learns.
Preparing for the next phase
In the coming years, we will find that software testing becomes less about tools and more about judgment.
The software testing of the future must:
- Question whether system behaviour aligns with societal values
- Collaborate with data and ethics teams to define risk thresholds
- Audit autonomous decisions for fairness and transparency
- Interpret AI-driven decisions, not just verify them
The professionals who succeed will not just find bugs but also build trust in intelligent systems.
The rise of autonomous testing agents?
Autonomous testing agents represent the most advanced form of AI-augmented software testing. Unlike rule-based automation frameworks, these agents leverage ML, natural language processing (NLP), and reinforcement learning to independently create, maintain, and adapt tests.
Capabilities and mechanisms
- Self-healing test automation: When a UI element or API changes, the agent dynamically updates its locators and parameters using pattern recognition
- Autonomous test generation: Agents use application models, code coverage metrics or behaviour-driven patterns to generate test cases without human scripting
- Intelligent prioritisation: ML models analyse past defect trends and runtime data to focus on high-risk areas
- Continuous learning: Through reinforcement feedback loops, the system learns from past executions to improve efficiency and accuracy
- Predictive analytics: Agents forecast potential failure points or regression hotspots before they occur
These capabilities extend testing beyond automation and move towards autonomy, enabling the system to operate, adapt, and optimise independently within predefined ethical and operational constraints.
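To give a flavour of the self-healing idea mentioned above, here is a minimal, framework-agnostic Python sketch of a locator lookup that falls back through alternative strategies when the primary one breaks. The "DOM", locator strings and driver call are stand-ins; real tools combine this with pattern recognition over the application model.

```python
# A minimal sketch of self-healing locators: try the preferred locator, fall
# back through alternatives, and record any repair for the suite to learn from.
from typing import Optional

# Pretend DOM after a UI change: the old id is gone, but alternatives still work.
CURRENT_DOM = {"css:#checkout-button-v2": "checkout", "text:Checkout": "checkout"}

def find_element(locator: str) -> Optional[str]:
    """Stand-in for a UI driver lookup; returns None when nothing matches."""
    return CURRENT_DOM.get(locator)

def self_healing_find(preferred: str, fallbacks: list) -> Optional[str]:
    element = find_element(preferred)
    if element is not None:
        return element
    for candidate in fallbacks:
        element = find_element(candidate)
        if element is not None:
            # Record the healed locator so the test assets can update themselves.
            print(f"Healed locator: '{preferred}' -> '{candidate}'")
            return element
    return None

# The scripted locator broke after the UI change, but the lookup recovers.
button = self_healing_find("css:#checkout-button",
                           fallbacks=["css:#checkout-button-v2", "text:Checkout"])
assert button == "checkout"
```

Production-grade agents apply the same pattern with learned similarity scores rather than a hand-written fallback list, but the governance question is identical: every "heal" is a decision that someone must be able to review.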
Benefits and real-world impact
Autonomous testing accelerates development cycles and strengthens reliability in continuous delivery environments.
- Efficiency: Testing coverage increases without taking time away from the sprint and with minimal human effort
- Resilience: Systems adapt to change without constant script maintenance
- Scalability: Large-scale and distributed testing becomes more feasible
- Cost reduction: Long-term maintenance and regression costs decline
Emerging implementations
- AI-powered frameworks: Tools like Mabl and Testim already employ ML to generate and self-heal tests
- Model-based testing with AI: Systems such as Diffblue Cover (for Java) use AI to automatically create unit tests from code
- Autonomous agents in DevOps: AI-driven platforms that integrate directly into CI/CD pipelines and perform tests in real time, whilst adjusting coverage based on release frequency
Alongside all the positive points mentioned in this section, these agents also risk introducing new complexities, such as opaque decision-making, ethical risks, and a growing dependence on systems that testers cannot fully explain.
The ethical dimension of AI in software testing?
Testing at its core is about trust. But trust in AI cannot be assumed. It must be engineered, audited and governed.
Bias and fairness
AI systems learn from data that often encodes human prejudice. In testing, this can lead to:
- Biased datasets: Automated validation might ignore edge cases involving underrepresented user groups
- Discriminatory logic: Algorithms may replicate societal inequities (e.g. lending decisions or hiring filters)
- Confirmation bias in test selection: AI agents could favour scenarios similar to previous successful ones, which in turn reduces the coverage diversity
An ethical software testing framework requires testers to act as bias detectors by using data-sampling audits, fairness metrics, and adversarial testing to uncover hidden inequities.
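Adversarial testing for bias can be as simple as a counterfactual probe: flip only the protected attribute and check that the decision does not change. The toy lending rule below is a hypothetical stand-in for a deployed model, used purely to show the shape of such a test.

```python
# A minimal sketch of a counterfactual bias probe.
def loan_model(features: dict) -> str:
    """Stand-in scoring rule; a real test would call the deployed model."""
    score = features["income"] / max(features["debt"], 1)
    return "approve" if score > 2 else "decline"

def counterfactual_flip_changes_outcome(features: dict, protected_key: str,
                                         alternative_value: str) -> bool:
    """Return True if flipping the protected attribute changes the decision."""
    original = loan_model(features)
    flipped = dict(features, **{protected_key: alternative_value})
    return loan_model(flipped) != original

applicant = {"income": 52_000, "debt": 20_000, "gender": "female"}
assert not counterfactual_flip_changes_outcome(applicant, "gender", "male"), \
    "Decision depends on a protected attribute"
```

Run across a representative sample of records, probes like this expose discriminatory logic that accuracy metrics alone would never surface.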
Transparency and explainability
“Black box” AI testing presents a critical problem: when an autonomous testing agent decides to skip, fail or prioritise a test, the reasoning is often opaque.
Software testing must then push for explainable AI techniques, such as visualisations, traceability logs and model introspection, that make decision-making transparent. This transparency is not optional; it is the foundation of accountability.
Privacy and data ethics
Test data often mirrors production environments, and in an AI context, this means personal or sensitive information might inadvertently train or inform models. Ethical testing demands that:
- Data anonymisation is applied consistently (see the sketch after this list)
- Synthetic data is generated for privacy-preserving tests
- Clear data retention policies are aligned with GDPR and ISO 27001
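A minimal sketch of the first two points follows, assuming a simple record structure and a per-environment salt; real pipelines would use vetted anonymisation and synthetic-data tooling rather than this illustration.

```python
# A minimal sketch of privacy-preserving test data: deterministic
# pseudonymisation of direct identifiers plus synthetic quasi-identifiers.
import hashlib
import random

def pseudonymise(value: str, salt: str) -> str:
    """Replace an identifier with a stable, non-reversible token."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def anonymise_record(record: dict, salt: str) -> dict:
    safe = dict(record)
    safe["email"] = pseudonymise(record["email"], salt)
    safe["name"] = f"user_{pseudonymise(record['name'], salt)[:6]}"
    # Replace a quasi-identifier with a synthetic value in a realistic range.
    safe["age"] = random.randint(18, 80)
    return safe

production_record = {"name": "Jane Doe", "email": "jane@example.com", "age": 34}
print(anonymise_record(production_record, salt="rotate-me-per-environment"))
```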
The ethical tester’s mandate
Ethical software testing goes beyond compliance. It represents a commitment to societal responsibility and, in turn, ensures that technology enhances rather than exploits human welfare. Future software testers must be fluent in ethical frameworks (e.g. IEEE 7000, EU AI Act) and able to operationalise them in testing pipelines.
Human oversight: Why machines still need us?
Trust but verify
Autonomous agents excel at pattern recognition but lack the moral reasoning that humans bring to the table. They can’t understand why a bug matters, only that a pattern deviated. Human oversight provides the interpretive layer that converts data anomalies into actionable insight.
The human cognitive edge
Human testers bring a wealth of talents and expertise: intuition, empathy and contextual reasoning that no algorithm can replicate.
- A human recognises when an error impacts accessibility for visually impaired users
- A human senses reputational risks beyond metrics
- A human questions whether “passing” AI behaviour is ethically acceptable
The oversight models
Modern software testing strategies should employ what some people may call a Human-in-the-loop (HITL) or Human-on-the-loop (HOTL) model:
- HITL: Humans actively guide or correct AI actions during testing
- HOTL: Humans supervise at a strategic level and intervene only when anomalies appear
This oversight ensures a balance between automation efficiency and human discernment, guided by an ethical compass.
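A minimal sketch of how a HITL gate could be wired into a pipeline: verdicts from a testing agent are auto-accepted only above a confidence threshold, and everything else is queued for a human tester. The threshold, data structure and review queue are illustrative assumptions.

```python
# A minimal sketch of a human-in-the-loop confidence gate for agent verdicts.
from dataclasses import dataclass

@dataclass
class AgentVerdict:
    test_name: str
    verdict: str        # "pass" or "fail" proposed by the testing agent
    confidence: float   # 0.0 - 1.0, reported by the agent

REVIEW_THRESHOLD = 0.85           # assumption: agreed with the testing team
human_review_queue = []

def triage(verdict: AgentVerdict) -> str:
    if verdict.confidence >= REVIEW_THRESHOLD:
        return f"{verdict.test_name}: auto-accepted ({verdict.verdict})"
    human_review_queue.append(verdict)        # HITL: a person makes the call
    return f"{verdict.test_name}: routed to human review"

print(triage(AgentVerdict("checkout_flow", "pass", 0.97)))
print(triage(AgentVerdict("accessibility_contrast", "fail", 0.55)))
```

A HOTL variant would auto-accept everything and only alert humans when anomaly rates drift, but the gate itself is where automation efficiency meets human discernment.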
Evolving roles and responsibilities in software testing?
From executors to orchestrators
The job of software testing is shifting from executing tests to designing intelligent testing environments. Software testing leaders will manage networks of testing agents, interpret results and align outcomes with the business strategy. Testing then becomes an act of orchestration, coordinating team members, tools, and data flows toward measurable quality outcomes.
New role archetypes
- AI Trainer: Feeds and calibrates testing agents with the relevant data
- AI Curator: Validates model outputs by ensuring alignment with goals and expected outcomes
- Ethics Champion: Embeds fairness and transparency requirements into acceptance criteria
- AI Auditor: Evaluates AI-driven software testing pipelines for compliance and accountability
- Data Steward: Oversees data governance and synthetic data generation for ethical testing
The evolution of skills in an AI context
The table below explains how traditional software testing skills evolve in the context of AI. The intention is not to suggest that future-oriented skills directly replace traditional ones. Many of these traditional skills can be built on, extended or recontextualised by amending or updating existing testing practices to address the unique characteristics of AI-based systems, such as the uncertainty, learning behaviour and ethical concerns that AI brings.
For example, test case design remains an important part of the process, but in AI systems it can increasingly be completed by an AI tool. Testers therefore need the skills to undertake model validation, which assesses the model's performance, bias, robustness and generalisation rather than fixed input/output correctness. Similarly, automation scripting evolves into configuring and evaluating learning processes, such as reinforcement learning, rather than scripting deterministic test flows.
This table highlights a shift in emphasis from testing predictable, rule-based software to testing systems that adapt, learn, and operate under uncertainty. Traditional skills are therefore not discarded, but augmented with new competencies that reflect the demands of AI-driven systems.
| Traditional software testing skill | Future-oriented skill |
| --- | --- |
| Test case design | AI model validation |
| Automation scripting | Reinforcement learning configuration |
| Bug reporting | Ethical impact assessment |
| Regression testing | Continuous risk evaluation |
| Tool expertise | AI literacy and governance |
Organisational integration
Future software testing teams will collaborate across multiple domains:
- With developers: embedding testing earlier through AI-driven shift left practices
- With AI engineers: co-designing interpretable and fair models
- With compliance teams: ensuring legal and ethical conformity
- With UX and product teams: connecting quality to user experience
Mapping the software testing horizon: Future scenarios
Fully autonomous pipelines
In this vision, AI-driven agents continuously generate, execute and evaluate tests, reporting to the software testing team for validation. Systems learn from real-time findings, enabling autonomic quality control in DevOps environments, an evolution parallel to self-driving vehicles.
Key risks:
- Over-reliance on machine decisions
- Lack of explainability in automated go / no-go gates
- Drift between organisational ethics and AI behaviour
Hybrid human-AI collaboration
A more balanced future sees the software testing team and AI working in sync. AI handles scalability, coverage and optimisation, whilst software testing focuses on ethics, creativity and strategic direction. This hybrid model is the most practical near-term scenario and aligns with current business adoption methods.
Key risks:
- Over-reliance on AI outputs
- Skill gaps and misalignment
- Decision accountability
- Cultural resistance
- Data dependency
Regulation-driven software testing
As AI testing becomes integral to critical infrastructure, governments and industry bodies will begin to structure standards more formally around AI assurance, auditability, and ethical compliance. Software testing will become a more regulated discipline, emphasising certification, traceability and ethical conformance. Software testing teams will need to demonstrate not only technical competence but also adherence to emerging frameworks such as ISO/IEC 42001:2023 (AI management systems) and the EU AI Act.
Key risks:
- Regulatory fragmentation
- Compliance over innovation
- Audit fatigue
- Evolving standards
- Accountability gaps
The tester as an AI Auditor
In this scenario, the software testing team member evolves into an AI auditor who specialises in reviewing, validating and certifying the quality decisions made by autonomous systems. This role demands multidisciplinary expertise spanning ML, data governance, ethics and risk analytics. Software testing in this model serves as the final line of defence, ensuring that AI-driven decisions remain transparent, explainable, and aligned with values. In many ways, the AI auditor becomes the guardian of the AI FART model across the organisation.
Key risks:
- Knowledge barriers
- Tooling limitations
- Bias blind spots
- Conflicts of interest
- Liability exposure
To sum up
The evolution of testing toward autonomy is not just a technical revolution. It is a philosophical one. As AI-driven systems take on greater responsibility for quality, the definition of “testing” itself expands from verifying code correctness to ensuring the integrity of intelligent systems.
Autonomous agents will bring unprecedented precision, scalability and adaptability. Yet, they will also magnify the ethical, social and governance implications of technology. Human oversight will remain indispensable, not because machines are incapable, but because quality is ultimately a human value, rooted in judgement, empathy and responsibility.
The future of testing will belong to those who embrace this duality and master the science of AI while also championing the art of ethics. In doing so, testers will transcend their traditional role, becoming not just guardians of software quality but stewards of digital trust in an AI world. Simple frameworks like the AI FART model can help keep that trust work concrete and focused.
What do YOU think?
Got comments or thoughts? Share them in the comments box below. If you like, use the ideas below as starting points for reflection and discussion.
Questions to discuss
- How is your team adapting software testing practices for AI-driven or self-adaptive systems?
- Have you implemented autonomous testing agents in your pipeline? What challenges or benefits have you observed?
- How do you currently ensure ethical oversight, fairness and transparency in your testing processes? Do you use any simple quality models, such as the AI FART model, to support those conversations?
- In hybrid human-AI software testing models, how do you balance automation efficiency with human judgment?
- What skills do you think software testing professionals need to thrive in a world of AI-driven testing?
Actions to take
- Educate yourself on AI-driven and self-adaptive systems
- Conduct a proof of concept with AI tooling
- Capture your thoughts and share with others
References:
- EU AI Act: first regulation on artificial intelligence - European Parliament
- Prompting for testers course by Rahul Parwal
- Advanced prompting for testers course by Rahul Parwal
- Using personal data in test safely: How to comply with the GDPR article by Ioan Solderea
- ISO/IEC 42001:2023 (AI management systems) - ISO