Evals in practice for an AI coding agent

Implementing an AI coding agent introduces a quality challenge that traditional tests cannot address: how do you systematically verify that an agent produces the right code at the expected quality level?

This talk answers that question through a hands-on walkthrough of an evaluation system built for an API Automation Agent that generates test frameworks from API specifications.

The session introduces a practical eval taxonomy (assertion-based, rule-based, model-graded, hybrid, and security evals) and demonstrates each type with real examples: enforcing architectural patterns through deterministic checks, detecting hallucinated API fields, catching prompt injection attacks, and grading test quality with LLM evaluators.
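As a rough illustration of one deterministic check named above, detecting hallucinated API fields, the sketch below compares the fields a generated test reads against the fields the specification defines. The simplified spec shape, the `response.body.<field>` access convention, and the function name `find_hallucinated_fields` are all assumptions made for this example, not the talk's actual implementation.

```python
import re

# Hypothetical, simplified spec: endpoint -> set of response fields it defines.
SPEC_FIELDS = {
    "GET /users/{id}": {"id", "name", "email"},
}

# Assumed convention: generated tests read fields as `response.body.<field>`.
FIELD_ACCESS = re.compile(r"response\.body\.([A-Za-z_]\w*)")


def find_hallucinated_fields(generated_code: str, endpoint: str) -> set[str]:
    """Return fields the generated code references that the spec does not define."""
    referenced = set(FIELD_ACCESS.findall(generated_code))
    return referenced - SPEC_FIELDS[endpoint]


# Example agent output (illustrative only).
generated = """
expect(response.body.id).to.equal(1);
expect(response.body.email).to.be.a("string");
expect(response.body.phoneNumber).to.exist;
"""

hallucinated = find_hallucinated_fields(generated, "GET /users/{id}")
print(sorted(hallucinated))  # ['phoneNumber']
```

A check like this needs no model call, so it is cheap to run on every generated file and gives the same result every time.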

Attendees will see concrete engineering decisions such as when string matching outperforms LLM grading, how to make scores deterministic and reproducible, and how to combine rule-based and model-graded criteria into composite evaluations.
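One possible shape for a composite evaluation is a weighted list of pass/fail criteria, where some are plain string matches and others wrap a model-graded verdict. The sketch below is a minimal version under stated assumptions: the `Criterion` class, the weights, the `ServiceBase` convention, and the stubbed judge are hypothetical and stand in for whatever the real system uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str
    weight: float
    check: Callable[[str], bool]  # returns pass/fail for the generated code


def uses_service_model(code: str) -> bool:
    # Rule-based: string matching is enough for a structural convention.
    # `ServiceBase` is a hypothetical base class name.
    return "extends ServiceBase" in code


def judge_assertions_meaningful(code: str) -> bool:
    # Placeholder for a model-graded criterion. A real version would call an
    # LLM and parse a yes/no verdict; it is stubbed here so the sketch runs.
    return "expect(" in code


def composite_score(code: str, criteria: list[Criterion]) -> float:
    """Weighted share of criteria that pass, in the range 0.0 to 1.0."""
    total = sum(c.weight for c in criteria)
    passed = sum(c.weight for c in criteria if c.check(code))
    return passed / total


criteria = [
    Criterion("architecture", 2.0, uses_service_model),
    Criterion("assertion quality", 1.0, judge_assertions_meaningful),
]

sample = "class UserService extends ServiceBase {}"
score = composite_score(sample, criteria)
print(round(score, 2))  # 0.67: the rule passes, the stubbed judge fails
```

Because every criterion reduces to a boolean, the composite score only varies when a criterion flips, which makes regressions easier to trace to a specific rule or judge question.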

The talk positions evals alongside unit tests and benchmarks within a complete testing strategy, and shows how eval results are interpreted to improve agent behaviour over time. Attendees will leave with a clear framework for designing and running evals for their own AI-powered developer tools.

After this session, you will be able to:

  • Identify why evaluating AI agents requires a different approach than traditional testing
  • Build a layered evaluation strategy that starts with free, deterministic checks and scales up to LLM-based grading only where judgment is truly needed
  • Design model-graded evaluations that produce reliable, reproducible scores by constraining what you ask the judge to decide
  • Detect AI-specific failure modes, such as hallucinations and prompt injection, and choose the right evaluation technique for each
  • Interpret evaluation results across multiple models and datasets to drive concrete improvements to your agent
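The layered strategy and the constrained judge from the list above can be sketched together. In this minimal example, a free deterministic check runs first, and the judge is only called afterwards with one narrow question and two allowed answers. The function names, the `describe(` check, the prompt wording, and `fake_judge` are assumptions for illustration; a real judge would be an LLM call.

```python
from typing import Callable

ALLOWED_VERDICTS = {"PASS", "FAIL"}


def parse_verdict(raw: str) -> str:
    """Accept only a fixed label set; anything else is an error, not a score."""
    verdict = raw.strip().upper()
    if verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected judge output: {raw!r}")
    return verdict


def evaluate(code: str, judge: Callable[[str], str]) -> dict:
    # Layer 1: free, deterministic check runs first and can fail fast.
    if "describe(" not in code:
        return {"verdict": "FAIL", "layer": "deterministic", "judge_called": False}
    # Layer 2: the judge gets one narrow question with two allowed answers.
    prompt = (
        "Does every test below assert on the response status code? "
        "Answer PASS or FAIL only.\n\n" + code
    )
    verdict = parse_verdict(judge(prompt))
    return {"verdict": verdict, "layer": "model", "judge_called": True}


def fake_judge(prompt: str) -> str:
    # Stand-in for an LLM call so the sketch runs offline.
    return " pass\n"


print(evaluate("const x = 1;", fake_judge))
print(evaluate("describe('users', () => {});", fake_judge))
```

Rejecting any judge output outside the allowed labels keeps free-form model text from leaking into the score, and skipping the judge on deterministic failures avoids paying for a verdict that would not change the outcome.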
