A reference point used to decide whether a test has passed or failed. For LLM testing this becomes unreliable because multiple valid outputs can exist for the same input.
So what? The absence of a stable oracle is one of the central challenges of AI testing. Techniques like metamorphic testing exist partly to work around it by checking consistency rather than correctness.
Example: "Who was the first president of the USA?" has a clear oracle. A summarisation request does not.
"The 'expected result' can be determined, but will always have some ambiguity. Comparing it to the 'actual result' won't be as straightforward as you are used to." — Demi Van Malcot
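To make "checking consistency rather than correctness" concrete, here is a minimal sketch of a metamorphic check. Everything in it is an assumption made for illustration: `toy_summarise` is a hypothetical stand-in for a real model call, `token_overlap` is a deliberately crude similarity measure, and the 0.8 threshold is arbitrary. The check never states what the right summary is; it only asserts that padding the input with whitespace should not materially change the output.

```python
from typing import Callable


def toy_summarise(text: str) -> str:
    """Hypothetical stand-in for an LLM call: returns the first sentence."""
    first = text.strip().split(".")[0].strip()
    return first + "." if first else ""


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (a crude similarity measure)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def check_whitespace_invariance(
    summarise: Callable[[str], str], text: str, threshold: float = 0.8
) -> bool:
    """Metamorphic relation: whitespace padding should not change the summary much."""
    original = summarise(text)
    follow_up = summarise("  \n" + text + "  \n")
    return token_overlap(original, follow_up) >= threshold


doc = "The oracle problem makes LLM testing hard. Many outputs can be valid."
print(check_whitespace_invariance(toy_summarise, doc))
```

With a real model, `summarise` would wrap the actual API call, and the similarity measure and threshold would need tuning, since two valid summaries can share few exact words.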