A systematic skew in a dataset that causes a model trained on it to produce outputs that are consistently inaccurate, unfair, or unrepresentative for certain inputs, groups, or contexts. Data bias can originate from how data was collected, labelled, filtered, or weighted and is often invisible until the model is tested across a broad range of conditions.
So what? Data bias is one of the most consequential quality risks in AI systems because it is baked in before a line of application code is written. Testing for it requires deliberate coverage of underrepresented groups, edge cases, and real-world distributions, not just happy-path inputs.
Examples: A hiring tool trained predominantly on CVs from male candidates learns to downrank applications from women, not because of an explicit rule but because of patterns in the training data. An image recognition model trained on photographs taken in high-income countries performs poorly on images from lower-income contexts where lighting conditions, camera quality, and subject framing differ.
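One way to make "deliberate coverage of underrepresented groups" concrete is to report accuracy per group rather than as a single aggregate. The sketch below is illustrative only: the group names, the records, the `accuracy_by_group` and `flag_gaps` helpers, and the 0.1 gap threshold are all assumptions, not part of any particular tool or standard.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Return accuracy per group from (group, y_true, y_pred) tuples."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        if y_true == y_pred:
            correct[group] += 1
    return {g: correct[g] / totals[g] for g in totals}

def flag_gaps(per_group, max_gap=0.1):
    """List groups whose accuracy trails the best group by more than max_gap.

    max_gap is a hypothetical threshold; a real project would set its own.
    """
    best = max(per_group.values())
    return sorted(g for g, acc in per_group.items() if best - acc > max_gap)

# Made-up evaluation records: (group, true label, predicted label).
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 0), ("group_b", 1, 0), ("group_b", 0, 0),
]

per_group = accuracy_by_group(records)
flagged = flag_gaps(per_group)
print(per_group)
print(flagged)
```

On this toy data the aggregate accuracy is 75%, which hides the fact that one group is served perfectly and the other only half the time. Slicing the evaluation by group is what surfaces the gap.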