The dataset used to teach a machine learning model by exposing it to examples from which it learns statistical patterns, relationships, or classifications. The composition, quality, and representativeness of training data directly shape what a model can and cannot do well.
So what? For testers working with AI systems, training data is a primary source of risk. Gaps, skews, or errors in training data manifest as model failures that cannot be fixed through code alone they require the data itself to be identified, understood, and addressed.
Examples: A sentiment analysis model trained on English-language product reviews will perform poorly on reviews written in other languages or registers. A fraud detection model trained only on historical fraud patterns will fail to catch novel attack types not present in its training set.
So what? For testers working with AI systems, training data is a primary source of risk. Gaps, skews, or errors in training data manifest as model failures that cannot be fixed through code alone they require the data itself to be identified, understood, and addressed.
Examples: A sentiment analysis model trained on English-language product reviews will perform poorly on reviews written in other languages or registers. A fraud detection model trained only on historical fraud patterns will fail to catch novel attack types not present in its training set.