Quality Statements for LLMs: The Good, The Bad and The Ugly
Bastian Knerr
Teamlead Testing
Talk Description
AI as a buzzword is everywhere. It will steal our jobs, make us all obsolete and, in the end, it will rule the world. We have been experiencing a paradigm shift for two years, and, most prominently, Large Language Models like LLaMA, ChatGPT or Bard are reshaping industries and our everyday lives.
Using a Copilot for coding or testing is seen as boosting productivity and lowering barriers to entry.
But now that the use of these LLMs is increasing rapidly:
- Who is testing them?
- And what does quality actually mean in the age of AI?
In this talk, I will share results from my experience testing Large Language Models and regression-based AI in real projects. I will explain, at a high level, how a Large Language Model works.
I will map the components of a Copilot onto a rethought testing pyramid, from the component level up to the system level. With this framework for testing LLMs in place, I will outline the metrics we use and why testers will still be needed in the age of AI, maybe even more than ever.
By the end of this session, you'll be able to:
- Explain, at a high level, how a Large Language Model works and where the pitfalls for testing lie
- Apply a standardized, high-level approach to testing Large Language Models
- Navigate a new testing pyramid and identify the component level in LLM systems
- Evaluate what quality means in the age of AI, which metrics we can use and how context-dependent they are
- Articulate the importance of a tester's perspective and why testers will still be important going forward