Evals in practice for an AI coding agent

30 Apr 2026
Damian Pereira

Head of Testing

Talk Description

Implementing an AI coding agent introduces a quality challenge that traditional tests cannot address: how do you systematically verify that an agent produces the right code at the expected quality level?

This talk answers that question through a hands-on walkthrough of an evaluation system built for an API Automation Agent that generates test frameworks from API specifications.

The session introduces a practical eval taxonomy (assertion-based, rule-based, model-graded, hybrid, and security evals) and demonstrates each type with real examples: enforcing architectural patterns through deterministic checks, detecting hallucinated API fields, catching prompt injection attacks, and grading test quality with LLM evaluators.
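As an illustration of the deterministic end of that taxonomy, here is a minimal sketch of a hallucinated-field check. It is not taken from the talk: the `SPEC_FIELDS` table, the `/users` endpoint, and the `response.body.<field>` access pattern are all assumptions chosen for the example.

```python
import re

# Hypothetical spec extract: the fields each endpoint actually defines.
SPEC_FIELDS = {
    "/users": {"id", "name", "email"},
}

def hallucinated_fields(endpoint: str, generated_code: str) -> set[str]:
    """Return fields accessed as response.body.<field> that the spec does not define."""
    referenced = set(re.findall(r"response\.body\.(\w+)", generated_code))
    return referenced - SPEC_FIELDS.get(endpoint, set())

# Hypothetical agent output: the third assertion uses a field the spec lacks.
generated = """
expect(response.body.id).toBeDefined();
expect(response.body.email).toContain("@");
expect(response.body.phoneNumber).toBeDefined();
"""

print(hallucinated_fields("/users", generated))  # {'phoneNumber'}
```

A check like this costs nothing to run and returns the same answer every time, which is the property that makes it a reasonable first line of defence before any model-graded step.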

Attendees will see concrete engineering decisions such as when string matching outperforms LLM grading, how to make scores deterministic and reproducible, and how to combine rule-based and model-graded criteria into composite evaluations.
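One way a composite evaluation could be put together is sketched below. The two rules, the 0.6 weighting, and the idea of passing in an already-computed judge score are assumptions for illustration only, not the scoring scheme used in the talk.

```python
from typing import Callable

Rule = Callable[[str], bool]

def uses_service_layer(code: str) -> bool:
    # String matching: hypothetical architectural rule that calls go through a service class.
    return "Service(" in code

def has_no_hardcoded_url(code: str) -> bool:
    # String matching: hypothetical rule that base URLs come from configuration.
    return "http://" not in code and "https://" not in code

def composite_score(code: str, rules: list[Rule], judge_score: float,
                    rule_weight: float = 0.6) -> float:
    """Blend the pass rate of deterministic rules with a judge score in [0, 1]."""
    rule_rate = sum(rule(code) for rule in rules) / len(rules)
    return round(rule_weight * rule_rate + (1 - rule_weight) * judge_score, 2)

sample = 'const users = new UserService(); await users.get("/users/1");'
print(composite_score(sample, [uses_service_layer, has_no_hardcoded_url], judge_score=0.5))
```

Keeping the rule-based part separate from the judge score means a regression in either component can be traced without rerunning the other.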

The talk positions evals alongside unit tests and benchmarks within a complete testing strategy, and shows how eval results are interpreted to improve agent behaviour over time. Attendees will leave with a clear framework for designing and running evals for their own AI-powered developer tools.

After this session, you will be able to

  • Identify why evaluating AI agents requires a different approach from traditional testing
  • Build a layered evaluation strategy that starts with free, deterministic checks and scales up to LLM-based grading only where judgment is truly needed
  • Design model-graded evaluations that produce reliable, reproducible scores by constraining what you ask the judge to decide
  • Detect AI-specific failure modes, such as hallucinations and prompt injection, and choose the right evaluation technique for each
  • Interpret evaluation results across multiple models and datasets to drive concrete improvements to your agent
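
The layered strategy and the constrained judge from the outcomes above can be sketched roughly as follows. Everything here is an assumption made for illustration: the questions, the `non_empty` check, and the `ask` callable, which stands in for whatever LLM client is used and is stubbed out in the example.

```python
from typing import Callable

# Hypothetical yes/no questions; narrow questions leave the judge less room to drift.
QUESTIONS = [
    "Does every test assert on the response status code?",
    "Does each test name describe the behaviour under test?",
]

def parse_verdict(raw: str) -> bool:
    """Accept only an exact YES or NO; anything else is an error, not a guess."""
    answer = raw.strip().upper()
    if answer not in {"YES", "NO"}:
        raise ValueError(f"unparseable judge answer: {raw!r}")
    return answer == "YES"

def layered_eval(code: str, cheap_checks: list[Callable[[str], bool]],
                 ask: Callable[[str, str], str]) -> dict:
    """Run free deterministic checks first; call the judge only if they all pass."""
    failed = [check.__name__ for check in cheap_checks if not check(code)]
    if failed:
        return {"stage": "deterministic", "passed": False, "failed": failed}
    verdicts = [parse_verdict(ask(question, code)) for question in QUESTIONS]
    return {"stage": "judge", "passed": all(verdicts),
            "score": sum(verdicts) / len(verdicts)}

def non_empty(code: str) -> bool:
    return bool(code.strip())

def stub_judge(question: str, code: str) -> str:
    # Stand-in for a real model call; always answers yes.
    return "yes"

print(layered_eval("expect(res.status).toBe(200)", [non_empty], stub_judge))
print(layered_eval("   ", [non_empty], stub_judge))
```

Scoring as the fraction of yes answers keeps the arithmetic deterministic; pinning the judge model and its sampling settings is a separate concern that this sketch does not cover.
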
Damian Pereira
Head of Testing

Passionate about research and innovation in software testing, especially in how AI can support better quality engineering. Creates open-source tools, experiments with new approaches, and enjoys sharing knowledge with the testing community to help teams evolve their practices, think differently about quality, and make testing more effective and accessible.

