Evals in practice for an AI coding agent

30 Apr 2026
Damian Pereira

Head of Testing

Talk Description

Implementing an AI coding agent introduces a quality challenge that traditional tests cannot address: how do you systematically verify that an agent produces the right code at the expected quality level?

This talk answers that question through a hands-on walkthrough of an evaluation system built for an API Automation Agent that generates test frameworks from API specifications.

The session introduces a practical eval taxonomy (assertion-based, rule-based, model-graded, hybrid, and security evals) and demonstrates each type with real examples: enforcing architectural patterns through deterministic checks, detecting hallucinated API fields, catching prompt injection attacks, and grading test quality with LLM evaluators.
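
To make that taxonomy concrete, here are two minimal sketches in the spirit of the checks the talk describes (the simplified spec shape, the quoted-string heuristic, and the injection patterns are illustrative assumptions, not the talk's implementation). The first flags hallucinated API fields by diffing the fields referenced in generated test code against those declared in the spec; the second is a rule-based scan for injection phrases smuggled into spec text:

    import re

    def declared_fields(spec: dict) -> set:
        """Collect every property name declared in a (simplified) API spec."""
        fields = set()
        for schema in spec.get("schemas", {}).values():
            fields.update(schema.get("properties", {}).keys())
        return fields

    def hallucinated_fields(generated_code: str, spec: dict) -> set:
        """Fields referenced in generated test code but never declared in the spec."""
        # Naive heuristic: treat quoted identifiers in the code as field references.
        referenced = set(re.findall(r'["\'](\w+)["\']', generated_code))
        return referenced - declared_fields(spec)

    INJECTION_PATTERNS = [
        r"ignore (all |any )?(previous|prior) instructions",
        r"disregard the system prompt",
    ]

    def contains_prompt_injection(spec_text: str) -> bool:
        """Rule-based scan for injection phrases hidden in spec descriptions."""
        return any(re.search(p, spec_text, re.IGNORECASE) for p in INJECTION_PATTERNS)

    # The agent invented a "discount_rate" field that the spec never declares.
    spec = {"schemas": {"Order": {"properties": {"id": {}, "total": {}}}}}
    code = 'assert response.json()["total"] > 0 and "discount_rate" in response.json()'
    assert hallucinated_fields(code, spec) == {"discount_rate"}

Because both checks are pure functions of the agent's output and its inputs, they cost nothing to run and give the same verdict every time.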

Attendees will see concrete engineering decisions such as when string matching outperforms LLM grading, how to make scores deterministic and reproducible, and how to combine rule-based and model-graded criteria into composite evaluations.
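
As a rough sketch of that last point, a composite evaluation can combine a deterministic rule score with a model-graded one under explicit weights (the criteria, the weights, and the assumption that the second score comes from an LLM judge are illustrative, not the talk's actual setup):

    from dataclasses import dataclass

    @dataclass
    class Criterion:
        name: str
        score: float   # normalised to 0.0-1.0
        weight: float

    def rule_based_score(generated_code: str) -> float:
        """Deterministic architectural checks: free, fast, and reproducible."""
        checks = [
            "def test_" in generated_code,        # at least one test function
            "import requests" in generated_code,  # uses the expected HTTP client
            "assert " in generated_code,          # actually asserts something
        ]
        return sum(checks) / len(checks)

    def composite_score(criteria: list) -> float:
        """Weighted average over rule-based and model-graded criteria."""
        total = sum(c.weight for c in criteria)
        return sum(c.score * c.weight for c in criteria) / total

    generated = "import requests\n\ndef test_get_order():\n    assert requests.get(URL).ok\n"
    criteria = [
        Criterion("architecture", rule_based_score(generated), weight=0.6),
        Criterion("test_quality", 0.75, weight=0.4),  # would come from an LLM judge
    ]
    print(f"composite: {composite_score(criteria):.2f}")  # composite: 0.90

Keeping the weights explicit makes it obvious how much of a composite score rests on cheap deterministic checks versus paid judge calls.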

The talk positions evals alongside unit tests and benchmarks within a complete testing strategy, and shows how eval results are interpreted to improve agent behaviour over time. Attendees will leave with a clear framework for designing and running evals for their own AI-powered developer tools.

After this session, you will be able to:

  • Identify why evaluating AI agents requires a different approach than traditional testing
  • Build a layered evaluation strategy that starts with free, deterministic checks and scales up to LLM-based grading only where judgment is truly needed
  • Design model-graded evaluations that produce reliable, reproducible scores by constraining what you ask the judge to decide (see the sketch after this list)
  • Detect AI-specific failure modes, such as hallucinations and prompt injection, and choose the right evaluation technique for each
  • Interpret evaluation results across multiple models and datasets to drive concrete improvements to your agent
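
A minimal sketch of the constrained-judge idea flagged in the list above, assuming a hypothetical call_judge(prompt) helper that queries an LLM deterministically (temperature 0): rather than asking for a free-form 1-10 rating, ask one narrow yes/no question per rubric item and average the verdicts.

    RUBRIC = [
        "Does every test assert on the response body, not just the status code?",
        "Do the tests cover at least one error status code?",
        "Are test names descriptive of the behaviour under test?",
    ]

    def grade(generated_code: str, call_judge) -> float:
        """Ask one yes/no question per rubric item; average the binary verdicts.
        call_judge(prompt) -> str is assumed to run an LLM at temperature 0."""
        verdicts = []
        for question in RUBRIC:
            prompt = (
                "Answer with exactly YES or NO.\n"
                f"Question: {question}\n"
                f"Code under review:\n{generated_code}"
            )
            verdicts.append(call_judge(prompt).strip().upper() == "YES")
        return sum(verdicts) / len(RUBRIC)

Each rubric item has a binary answer the judge can rarely waffle on, so the same inputs tend to yield the same score across runs.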
Damian Pereira
Head of Testing

Passionate about research and innovation in software testing, especially in how AI can support better quality engineering. Creates open-source tools, experiments with new approaches, and enjoys sharing knowledge with the testing community to help teams evolve their practices, think differently about quality, and make testing more effective and accessible.

