
Evals in practice for an AI coding agent

Master a practical framework for building a layered evaluation strategy that systematically verifies, secures, and improves the output quality of AI coding agents using both deterministic and model-graded evaluations.

Damián Pereira will join us for a Masterclass, followed by a live Q&A. 


Implementing an AI coding agent introduces a quality challenge that traditional tests cannot address: how do you systematically verify that an agent produces the right code at the expected quality level?

This talk answers that question through a hands-on walkthrough of an evaluation system built for an API Automation Agent that generates test frameworks from API specifications.

The session introduces a practical eval taxonomy (assertion-based, rule-based, model-graded, hybrid, and security evals) and demonstrates each type with real examples: enforcing architectural patterns through deterministic checks, detecting hallucinated API fields, catching prompt injection attacks, and grading test quality with LLM evaluators.
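To make the taxonomy concrete, here is a minimal sketch of one assertion-based eval: flagging hallucinated API fields by comparing the fields a generated test references against the fields the API specification actually defines. The `response["..."]` access pattern and all names are illustrative assumptions, not the talk's actual implementation.

```python
# Hypothetical assertion-based eval: detect hallucinated API fields.
# Assumes generated tests access response fields as response["field_name"];
# adapt the pattern to your agent's actual output style.
import re

def hallucinated_fields(generated_code: str, spec_fields: set[str]) -> set[str]:
    """Return fields referenced in generated code but absent from the spec."""
    referenced = set(re.findall(r'response\["(\w+)"\]', generated_code))
    return referenced - spec_fields

spec = {"id", "name", "email"}
code = 'assert response["id"]\nassert response["username"]'
print(hallucinated_fields(code, spec))  # {'username'}
```

Because the check is pure string matching against the spec, it is free, deterministic, and reproducible, with no LLM call needed.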

Attendees will see concrete engineering decisions such as when string matching outperforms LLM grading, how to make scores deterministic and reproducible, and how to combine rule-based and model-graded criteria into composite evaluations.
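One way to picture a composite evaluation is as a weighted blend of deterministic and model-graded scores. The sketch below is an assumption about how such a combination might look; `rule_check_imports` and the hardcoded judge score are illustrative stand-ins, not the speaker's framework.

```python
# Hypothetical composite (hybrid) eval: combine a cheap rule-based check
# with a model-graded score using explicit weights.
from dataclasses import dataclass

@dataclass
class EvalResult:
    name: str
    score: float   # normalized to 0.0-1.0
    weight: float

def rule_check_imports(code: str) -> float:
    # Deterministic check: the generated framework must use the mandated client
    return 1.0 if "import requests" in code else 0.0

def composite_score(results: list[EvalResult]) -> float:
    total = sum(r.weight for r in results)
    return sum(r.score * r.weight for r in results) / total

results = [
    EvalResult("imports", rule_check_imports("import requests\n..."), 0.4),
    EvalResult("judge_quality", 0.75, 0.6),  # stand-in for an LLM judge's score
]
print(round(composite_score(results), 2))  # 0.85
```

Keeping the rule-based portion deterministic means a change in the composite score can be attributed cleanly to either the rules or the judge.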

The talk positions evals alongside unit tests and benchmarks within a complete testing strategy, and shows how eval results are interpreted to improve agent behavior over time. Attendees will leave with a clear framework for designing and running evals for their own AI-powered developer tools.

 

After this session you will be able to:

  • Identify why evaluating AI agents requires a different approach than traditional testing
  • Build a layered evaluation strategy that starts with free, deterministic checks and scales up to LLM-based grading only where judgment is truly needed
  • Design model-graded evaluations that produce reliable, reproducible scores by constraining what you ask the judge to decide
  • Detect AI-specific failure modes, such as hallucinations and prompt injection, and choose the right evaluation technique for each
  • Interpret evaluation results across multiple models and datasets to drive concrete improvements to your agent

16:00 - 17:00 BST
Location: Online