Languages that are significantly underrepresented in computational datasets and NLP research, typically because the volume of digitised text available for training is small relative to that available for dominant languages such as English.
So what? Over 90% of the world's languages are classified as low-resource; this imbalance means AI models perform unevenly across linguistic and cultural contexts, producing representational injustice for speakers of those languages.
Example: A multilingual model trained predominantly on English-language web data may perform well for English speakers but generate unreliable or culturally inappropriate outputs for speakers of languages with limited digital corpora.