The practice by which large language model companies ingest publicly available material, peer-produced archives, journalism, creative works, and public knowledge repositories, at scale to build proprietary model capacity, without systematic consent, attribution, or benefit-sharing with the communities whose labour produced that material.
So what? Framing this solely as a copyright question obscures a deeper structural shift: the political economy through which shared knowledge becomes private computational power. For technologists and testers evaluating AI tools, it is a lens for asking who actually benefits from the systems they build with.
Example: GPT-3 was trained on filtered subsets of Common Crawl, Wikipedia, and books; once incorporated into a proprietary model, the underlying knowledge became part of a system whose internal workings remain opaque and privately controlled.
So what? Framing this solely as a copyright question obscures a deeper structural shift: the political economy through which shared knowledge becomes private computational power. For technologists and testers evaluating AI tools, it is a lens for asking who actually benefits from the systems they build with.
Example: GPT-3 was trained on filtered subsets of Common Crawl, Wikipedia, and books; once incorporated into a proprietary model, the underlying knowledge became part of a system whose internal workings remain opaque and privately controlled.