Evals in practice for an AI coding agent

Head of Testing

30 Apr 2026


Implementing an AI coding agent introduces a quality challenge that traditional tests cannot address: how do you systematically verify that an agent produces the right code at the expected quality level?

This talk answers that question through a hands-on walkthrough of an evaluation system built for an API Automation Agent that generates test frameworks from API specifications.

The session introduces a practical eval taxonomy (assertion-based, rule-based, model-graded, hybrid, and security evals) and demonstrates each type with real examples: enforcing architectural patterns through deterministic checks, detecting hallucinated API fields, catching prompt injection attacks, and grading test quality with LLM evaluators.
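To make one of these checks concrete, here is a minimal sketch of a deterministic hallucination check. It is not the talk's implementation: the spec excerpt, the `body[...]` access convention, and the function name are all assumptions made for illustration. The idea is simply to compare the response fields that generated test code references against the fields the API specification defines.

```python
import re

# Hypothetical spec excerpt: the response fields each endpoint really defines.
SPEC_FIELDS = {
    "/users": {"id", "name", "email"},
}


def find_hallucinated_fields(generated_code: str, endpoint: str) -> set[str]:
    """Return fields referenced in the generated code that the spec does not define."""
    # Assumption: generated tests read response fields as body["field"] or body['field'].
    referenced = set(re.findall(r"body\[['\"](\w+)['\"]\]", generated_code))
    return referenced - SPEC_FIELDS[endpoint]


# Invented agent output, used only to exercise the check.
generated = '''
assert body["id"] == 1
assert body["email"].endswith("@example.com")
assert body["loyaltyTier"] == "gold"
'''

print(find_hallucinated_fields(generated, "/users"))  # {'loyaltyTier'}
```

A check like this needs no model call, so it costs nothing to run on every generated file and always returns the same answer for the same input.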

Attendees will see concrete engineering decisions such as when string matching outperforms LLM grading, how to make scores deterministic and reproducible, and how to combine rule-based and model-graded criteria into composite evaluations.
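One way such a composite evaluation could look is sketched below, under stated assumptions: the two rules, the judge questions, and `stub_judge` are hypothetical, and the stub stands in for a real LLM call. Rule-based checks act as a gate, and the judge is asked only yes/no questions, so the score is an average of binary verdicts rather than a number the judge picks freely.

```python
from typing import Callable


def uses_service_layer(code: str) -> bool:
    # Hypothetical architectural rule: tests go through a service class, not raw HTTP calls.
    return "requests." not in code and "Service(" in code


def has_assertion(code: str) -> bool:
    # Plain string matching is enough to tell whether the test asserts anything.
    return "assert " in code


RULES = [uses_service_layer, has_assertion]

# Each question is a yes/no decision, so the judge never chooses a score itself.
JUDGE_QUESTIONS = [
    "Does the test verify the response status code?",
    "Does the test cover at least one negative case?",
]


def composite_score(code: str, judge: Callable[[str, str], bool]) -> float:
    """Gate on the deterministic rules, then average the binary judge verdicts."""
    if not all(rule(code) for rule in RULES):
        return 0.0  # fail fast: no judge call is spent on code that breaks a rule
    verdicts = [judge(code, question) for question in JUDGE_QUESTIONS]
    return sum(verdicts) / len(verdicts)


def stub_judge(code: str, question: str) -> bool:
    # Stand-in for an LLM judge (an assumption for this sketch, not a real grader).
    if "status code" in question:
        return "status_code" in code
    return "404" in code


good = "svc = UserService(client)\nresp = svc.get_user(1)\nassert resp.status_code == 200"
bad = "resp = requests.get(url)\nassert resp.status_code == 200"

print(composite_score(good, stub_judge))  # 0.5
print(composite_score(bad, stub_judge))   # 0.0
```

With a real model behind `judge`, reproducibility would also depend on choices outside this sketch, such as the prompt wording and sampling settings.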

The talk positions evals alongside unit tests and benchmarks within a complete testing strategy, and shows how eval results are interpreted to improve agent behaviour over time. Attendees will leave with a clear framework for designing and running evals for their own AI-powered developer tools.

After this session, you will be able to:

Identify why evaluating AI agents requires a different approach than traditional testing
Build a layered evaluation strategy that starts with free, deterministic checks and scales up to LLM-based grading only where judgment is truly needed
Design model-graded evaluations that produce reliable, reproducible scores by constraining what you ask the judge to decide
Detect AI-specific failure modes, such as hallucinations and prompt injection, and choose the right evaluation technique for each
Interpret evaluation results across multiple models and datasets to drive concrete improvements to your agent
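
As a rough illustration of the last outcome, the sketch below aggregates eval records into pass rates per model and per eval. Every record, model name, dataset name, and eval name here is invented for the example and does not come from the talk.

```python
from collections import defaultdict

# Invented eval records: (model, dataset, eval_name, passed).
RESULTS = [
    ("model-a", "petstore", "no_hallucinated_fields", True),
    ("model-a", "petstore", "uses_service_layer", False),
    ("model-a", "billing", "no_hallucinated_fields", True),
    ("model-b", "petstore", "no_hallucinated_fields", False),
    ("model-b", "petstore", "uses_service_layer", True),
    ("model-b", "billing", "no_hallucinated_fields", False),
]


def pass_rates(results, key_index):
    """Group records by one column and return the pass rate for each group."""
    totals = defaultdict(lambda: [0, 0])  # key -> [passed, total]
    for record in results:
        key = record[key_index]
        totals[key][0] += int(record[3])
        totals[key][1] += 1
    return {key: passed / total for key, (passed, total) in totals.items()}


by_model = pass_rates(RESULTS, 0)  # which model fails more often overall
by_eval = pass_rates(RESULTS, 2)   # which check fails more often across models

for name, rate in sorted(by_model.items()):
    print(f"{name}: {rate:.0%}")
```

Grouping by model shows which model to prefer, while grouping by eval or dataset points at the prompt, rule, or specification that most needs attention.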
