Testing Language Models With The Philosophy of Wittgenstein

It’s Time To Think About Language Models Again

Nowadays everybody is talking about the new large language models (LLMs), such as GPT. I feel like it’s time to talk about a point of view that is too often forgotten while testing them. More than ever, we are confronted with models in various contexts, and it is our job to ensure their reliability, robustness, and unbiasedness.

Too often we rely on some technical method that some expert decided is the best fit for modern AI. What matters now is to question language itself, which forms the basis of LLMs after all. Since I have a bit of a linguistics background, in this article, I’m going to introduce you to some basic ideas demonstrating that some of the best insights might actually not be technical in nature.

Entering the realm of language and technology: who would be more suitable than the analytic philosopher Ludwig Wittgenstein?

Wittgenstein And AI

As one of the most influential thinkers in analytic philosophy, Wittgenstein (1889-1951) contributed a rather engineer-y perspective on language. This comes as no surprise, since he originally trained as a mechanical engineer.

“Perhaps, Wittgenstein never became a philosopher but was always a scientist and engineer.”

- Nordmann [3]

Especially in the realm of NLP, there has been a growing interest in his work over the last decade, since his ideas help us to formulate the expectations we might or even should have for language-processing AI.

“The solution to any problem in AI may be found in the writings of Wittgenstein, though the details of the implementation are sometimes rather sketchy.” 

- Duck-Lewis [1]

AI-generated picture of Wittgenstein, created by the author with hotpot.ai

Expectations might be the central theme here, since QA (including testing) is all about meeting expectations of certain stakeholders. Our expectations of the model should of course match its purpose. Simply put, matching our intention with expectations is what this article is all about. 

But this time we will try doing it with the help of language philosophy and see what it means for testing. This might sound harder than it actually is, so let me try breaking down two of Wittgenstein’s main ideas and how they relate to testing.

Reviewing Data Sets Through A Lens Of ‘Early’ Wittgenstein

The Limits Of Reality

Wittgenstein’s earlier works, like Tractatus Logico-Philosophicus, can give us valuable insights when we are testing the pre-processing of language data. In that work, he highlights the gap in meaning between a word and the object it is pointing to in the real world. This is easy to imagine for terms like “bird”, “dog”, and “cat”: we are just using an arbitrary symbol to point towards some sensory experience.

The sentence “Ludwig is happy,” however, is more complex than a single word. Suddenly, multiple terms relate to each other. What do we do here? One way to understand this sentence is to decompose it: the symbol “Ludwig” represents a specific person and “happy” a state or property. In logical notation, imagine the expression as

x R y

Now, “R” stands for the relationship of possessing a property while the substitutions “x=Ludwig” and “y=happy” render the original sentence.

This slight detour illustrates Wittgenstein’s belief that we think about our reality in terms of such compositions.

“The limits of my language mean the limits of my world.” - Wittgenstein [5]

Applying Decomposition To Language Pre-Processing

Common ways for your fellow NLP engineers to restructure ideas expressed as textual data with the intent of preserving their meaning are: tokenization; lemmatization; and removal of stopwords.

If you’re not familiar with these three terms yet: in practice, they ensure that the data is formatted consistently.

1. Tokenization: breaking the text into individual words or phrases, such as “Ludwig is happy” to “Ludwig”, “is”, and “happy”.

2. Lemmatization: reducing words to their base forms, such as “eating” to “eat” and “healthier” to “healthy”.

3. Stopwords removal: removing words that carry little information in the given context, for example, reducing the sentence “The dog barks at the mailman” to the words “dog”, “barks”, and “mailman”.

In cases 1 and 2 above, the pre-processing techniques reduced the meaning of each word to its core information so that the relationship to other words becomes more visible. In case 3, the composition of the remaining words still draws a similar picture.
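To make the three steps concrete, here is a minimal sketch in plain Python. The stopword list and the lemma dictionary are tiny, made-up stand-ins for the much larger resources a real pipeline would use, and the function names are my own assumptions rather than any particular library's API.

```python
import re

# Toy resources invented for this sketch; real pipelines use far larger ones.
STOPWORDS = {"the", "a", "an", "at", "with"}
LEMMAS = {"eating": "eat", "healthier": "healthy", "barks": "bark"}


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())


def lemmatize(tokens):
    """Map each token to its base form if the toy dictionary knows it."""
    return [LEMMAS.get(token, token) for token in tokens]


def remove_stopwords(tokens):
    """Drop tokens that appear in the toy stopword list."""
    return [token for token in tokens if token not in STOPWORDS]


print(tokenize("Ludwig is happy"))
print(lemmatize(["eating", "healthier"]))
print(remove_stopwords(tokenize("The dog barks at the mailman")))
```

Note that even this small tokenizer makes a decision about meaning: lowercasing turns the name “Ludwig” into the same kind of symbol as every other word.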

Meeting Expectations

After breaking down the pre-processing part, you probably can already imagine where this is going. How does the relationship between the symbols and the objects they are pointing to tell us about our expectations? In application, a data set consisting of “bird”, “dog”, “cat”, and “Ludwig is happy” would be such a set of symbols.

Reviewing Text In Data Pipelines For Fidelity To Original Meaning

With this in mind, let me show you two clear expectations one could have for the processed data set, if the intent is to leave the meaning unchanged. 

  1. No single manipulation of the data changes the logical picture of the object. For example, lemmatization does not remove crucial information from the expression.
  2. The representation of the object in the data set is correct. For example, no bias has crept into the meaning.

Of course, if I have already found that expectation 2 is not met, you could argue that it is unnecessary even to check expectation 1. Now, where could pre-processing go wrong? Here are some more concrete examples.

  1. Recall the sentence “The dog barks at the mailman” which is reduced to the words “dog”, “barks”, and “mailman” after removing the stopwords. Using a little common sense, these three words by themselves are probably enough to understand the original sentence (or the object it is pointing at). However, what if the less common case “the dog barks with the mailman” or “‘Dog!’ barks the mailman” is what we were looking for?

    So we have to be careful when certain processing steps change the object referred to, or the original meaning. If this ambiguity is undesired, we must review the pre-processing steps used by developers and see whether their output still matches the intended meaning.
  2. Engineers usually train these language models on a vast amount of data. But something you will probably encounter in a lot of these data sets is bias that distorts the image of reality in the data set.

    For example, comments on Wikipedia talk pages have long been known to be a breeding ground for hostility against women. An LLM that is trained on such a data set might very well pick up this hostility and present a false image of women.

    Another example is when we expect balanced weighting of input from people in different demographic groups. Imagine a chatbot trained on blog posts and comments that contain many more texts from 20-year-olds than from any other age group. Depending on the purpose of the model, this will distort your output if you expect it to use only “age neutral” expressions.

    To make testing a little more dynamic, you could schedule regular statistical tests in this area, whether manual or automated. To reduce age bias, for example, the developer could be warned every time the processed data contains too much offensive language, slang, or other potential signals of the writer’s age.
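Both kinds of check can be sketched in a few lines of Python. The first function looks for different sentences that collapse into the same words after stopword removal; the second warns when too many tokens come from a watch list. The stopword list, the slang list, and the threshold are all made-up assumptions for illustration, not recommended values.

```python
STOPWORDS = {"the", "a", "an", "at", "with"}  # toy list, assumed for this sketch
PUNCTUATION = ".,!?'\"“”‘’"


def reduce_sentence(sentence):
    """Lowercase, strip punctuation, and drop stopwords."""
    words = [word.strip(PUNCTUATION) for word in sentence.lower().split()]
    return tuple(word for word in words if word and word not in STOPWORDS)


def find_collisions(sentences):
    """Group distinct sentences that become identical after pre-processing."""
    groups = {}
    for sentence in sentences:
        groups.setdefault(reduce_sentence(sentence), []).append(sentence)
    return [group for group in groups.values() if len(group) > 1]


def marker_rate(tokens, markers):
    """Share of tokens that belong to a watch list, such as slang terms."""
    if not tokens:
        return 0.0
    return sum(token in markers for token in tokens) / len(tokens)


def check_marker_rate(tokens, markers, threshold):
    """Return a warning string if the watch-list share exceeds the threshold."""
    rate = marker_rate(tokens, markers)
    if rate > threshold:
        return f"WARNING: {rate:.0%} of tokens are on the watch list"
    return None


sentences = [
    "The dog barks at the mailman",
    "The dog barks with the mailman",
    "'Dog!' barks the mailman",
]
print(find_collisions(sentences))

slang = {"lol", "lit", "ngl"}  # hypothetical signals of a young writer
print(check_marker_rate("lol that movie was lit ngl".split(), slang, 0.2))
```

A collision is not automatically a defect. It only tells you where to ask whether the lost distinction matters for the purpose of the model.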

Testing Contextual Models With The Ideas Of ‘Late’ Wittgenstein

The Construction Of Meaning As A Social Action

We can find a more pragmatic approach to language and meaning in Wittgenstein’s later works, such as Philosophical Investigations. In the last section, you saw the consequences of defining meaning somewhere between the word and the object you are referring to. In his later work, Wittgenstein posited that the meaning of a word exists simply in the social context in which it is used. This replaces pointing to some object, which might just as well be your own arbitrary idea of it.

The language is meant to serve for communication between a builder A and an assistant B. A is building with building-stones: there are blocks, pillars, slabs and beams. B has to pass the stones, and that in the order in which A needs them. For this purpose they use a language consisting of the words “block”, “pillar”, “slab”, “beam”. A calls them out;—B brings the stone which he has learnt to bring at such-and-such a call. 

- Wittgenstein [4]

This concept is perhaps the most significant of Wittgenstein’s with regard to language processing. The assistant learns to bring a pillar when the builder shouts “pillar” even though the assistant might still have no idea what a pillar actually is. He learns that in the context of this social interaction, “pillar” is some speech sound that signals him to act in a certain way. So “pillar” is not referring to any concrete object out there. It is just a social contract that helps us to communicate.

Modelling Meaning As A Public Event

What does it mean now for AI? I want to avoid going into the technical depths of machine learning models (particularly neural networks) here. Instead, we are going to link the previous example directly to the observable behaviour of such models.

The model itself doesn’t actually understand the input. It has just learnt how to act on it by giving the corresponding output. As for the assistant in Wittgenstein’s example, the meaning of an input for the model lies in reacting with the most probable output. The construction and maintenance of meaning becomes a public event, independent of any personal and private idea we might have about it.

Word embeddings are a good example of this. Imagine having the words “queen”, “king”, “apple”, and “banana”. How could you capture their meaning in a data set? Both “The king is ruling over the country” and “The queen is ruling over the country” are possible sentences that mean something to most people. However, “The apple is ruling over the country” is a sentence we would not encounter.

Figure: a king, a queen, an apple, and a banana plotted on X and Y axes, running from the king (highest on Y, lowest on X) to the banana (lowest on Y, highest on X).

In the examples above, we didn’t even define the words “queen” and “king”. There was no need. Nevertheless, they share some meaning since they fit into the same sentence, while “apple” does not. Word embeddings encode such similarities of meaning as distances in a vector space, as in the image. If we simply measure the sentences a word occurs in, instead of thinking about the word itself, the construction of meaning becomes a public event as in Wittgenstein’s example.
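To see how meaning can emerge purely from usage, here is a minimal sketch that counts which words share a sentence with each target word and then compares those counts. This is a simple count-based stand-in, not how learned embeddings such as word2vec are actually trained, and the six-sentence corpus is invented for illustration.

```python
import math
from collections import Counter

# Tiny invented corpus; only the contexts matter, not any definition of the words.
corpus = [
    "the king is ruling over the country",
    "the queen is ruling over the country",
    "the king is wearing a crown",
    "the queen is wearing a crown",
    "the apple is lying in the basket",
    "the banana is lying in the basket",
]


def context_vectors(sentences, targets):
    """For each target word, count the other words in the sentences it occurs in."""
    vectors = {target: Counter() for target in targets}
    for sentence in sentences:
        words = sentence.split()
        for target in targets:
            if target in words:
                vectors[target].update(w for w in words if w != target)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


vectors = context_vectors(corpus, ["king", "queen", "apple", "banana"])
print(cosine(vectors["queen"], vectors["king"]))
print(cosine(vectors["queen"], vectors["apple"]))
```

In this toy corpus, “queen” and “king” occur in identical contexts, so their similarity is higher than that of “queen” and “apple”. The latter is still well above zero, because shared function words like “the” and “is” count as context too.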

Reviewing The Data, With Examples

So, once you are familiar with these concepts, the perspective you need when reviewing the quality of the output data changes. Since the context is what matters now, you can double-check examples in the data set. In the example above, is “queen” closer to “king” than to “apple”? If not, what is the model doing when it defines the context that makes its output differ from what you expected? Or is it indeed just the data set showing a distorted image of “queen”? For the latter, you can also return to the last section, checking with ‘early’ Wittgenstein why the meaning of the context itself seems skewed.
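Such a review can be written down as a list of expectations of the form “this word should be closer to that word than to this other word”. The sketch below assumes the embeddings are available as a plain dictionary from word to vector; the two-dimensional vectors are made up, and `failed_expectations` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def failed_expectations(embeddings, triples):
    """Return (anchor, near, far) triples where anchor is not closer to near than to far."""
    failures = []
    for anchor, near, far in triples:
        near_score = cosine(embeddings[anchor], embeddings[near])
        far_score = cosine(embeddings[anchor], embeddings[far])
        if near_score <= far_score:
            failures.append((anchor, near, far))
    return failures


# Made-up vectors standing in for a real model's embeddings.
embeddings = {
    "king": [0.9, 0.8],
    "queen": [0.8, 0.9],
    "apple": [0.9, 0.1],
    "banana": [0.8, 0.0],
}
triples = [("queen", "king", "apple"), ("apple", "banana", "king")]
print(failed_expectations(embeddings, triples))
```

An empty list means every expectation held. Each triple that comes back is a starting point for the two questions above: is the model's notion of context at fault, or the data set?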

At this point, you might think “Well, that seems a little impractical if I have to review 20,000 different words in such a way”. Of course, if you’re working with bigger data sets, you can at least focus on a subset. A good way to start is by thinking of abstract categories, such as words related to time or belief. In Wittgenstein’s example, these categories also get rather complicated. Believing in some value is not something I can concretely point to in the physical world, so it can easily be misunderstood in any social context as well as in your model.

Secondly, it is often such abstract words that are very prone to bias and will skew your model’s output. Abstract words occur especially in the contexts of ethics and aesthetics. What is good or beautiful is usually just a mere opinion of the person saying it. So the model won’t capture any sense of the word. It will probably just reproduce mere opinions in the data set. After all, most of the time, we talk about something being pretty and not about prettiness itself. The model will then regard the same thing as pretty.

Checking The Parameters

How do we define what the context of the word even is? Word embeddings usually look to the word’s left and right, but the number of words to consider can vary quite a bit. So, the size of the context is something to consider. Also, does the context have to be a single sentence? Maybe the document where you find the word also matters since usage can differ, for example between prose and poetry.
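The effect of the context size is easy to see in a small sketch. The function below is a hypothetical helper that returns the words within a given distance to the left and right of a target word; the sentence is the one used earlier.

```python
def context_window(tokens, index, size):
    """Return up to `size` tokens on each side of tokens[index]."""
    left = tokens[max(0, index - size):index]
    right = tokens[index + 1:index + 1 + size]
    return left + right


tokens = "the queen is ruling over the country".split()
position = tokens.index("queen")

print(context_window(tokens, position, 1))
print(context_window(tokens, position, 3))
```

With a window of one word, “queen” is described only by “the” and “is”, which says very little. Only with a wider window does “ruling” enter the picture, so the window size directly shapes what the model can treat as meaning.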

There is even more meta-information you should consider beyond what you can see on your screen, like sentences or documents. If I write something with sarcasm, then you won’t get far by just looking at single words. Or perhaps the expression is accompanied by a performance of action as in Wittgenstein’s example. If I say “We are going to the cinema”, can the sentence be understood in the same way if we are not actually going right now? To come back to our original hypothesis, the purpose of the model defines the expectations we should have of it. So, it could also be your responsibility as a tester to review if the design and testing parameters fit the intention, and add meta-descriptions accordingly.

Statistical Testing

The last point I’d like to make is about the limits of machine learning models when we are talking about Wittgenstein’s philosophy. We define words only through the context in which we use the word. But the problem we face then is that no data set could ever contain every possible context where the word could make sense. In practice, it is impossible to save infinite amounts of information in a database.

At the risk of repeating myself, I’m emphasising here again that the expectations we ought to have depend on the intended use of the model. The contexts we choose to include in the data set should match the purpose of the data set. For example, one has different expectations of a translator than of a grammar checker. While we should expect the former to preserve meaning as much as possible, the latter must focus especially on the structural rules of the language you are working with.

Finally, to create appropriate statistical tests, you can always search the web for existing test sets. Many have already been created for nearly every purpose and can often fulfil the role of gold standards. Best of all, many are openly accessible. An interesting test set that I can recommend for checking the representation of meaning in the model is the Google analogy test set. Mixing different test sets will help you tailor the test process to the expectations you have of the behaviour of language.
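As a rough sketch of how such an analogy test can be scored: the Google analogy test set is distributed as a text file in which lines starting with a colon name a category and every other line holds four words, read as “a is to b as c is to d”. The code below parses that layout and checks whether the word closest to b - a + c is d. The four-dimensional embeddings and the three sample lines are invented for illustration, and I am assuming the model's vocabulary uses the same casing as the test file.

```python
import math


def parse_analogies(lines):
    """Parse 'a b c d' rows; lines starting with ':' begin a new category."""
    category, items = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            category = line[1:].strip()
            continue
        parts = line.split()
        if len(parts) == 4:
            items.append((category, *parts))
    return items


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def solve(embeddings, a, b, c):
    """Return the word closest to b - a + c, excluding the three query words."""
    target = [y - x + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [word for word in embeddings if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(embeddings[word], target))


def accuracy(embeddings, items):
    """Share of analogies answered correctly, skipping rows with unknown words."""
    known = [item for item in items if all(word in embeddings for word in item[1:])]
    if not known:
        return 0.0
    hits = sum(solve(embeddings, a, b, c) == d for _, a, b, c, d in known)
    return hits / len(known)


# Invented vectors with the dimensions (royal, male, female, fruit).
embeddings = {
    "king": [1, 1, 0, 0],
    "queen": [1, 0, 1, 0],
    "man": [0, 1, 0, 0],
    "woman": [0, 0, 1, 0],
    "apple": [0, 0, 0, 1],
}
lines = [": toy-family", "man king woman queen", "king man queen woman"]
print(accuracy(embeddings, parse_analogies(lines)))
```

Reporting the accuracy per category, rather than as one overall number, makes it easier to match the result against the purpose of the model.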

Why Should We Even Care About Philosophy?

When we want to test large language models, our expectations of language always influence us. I hope this article was able to demonstrate why it is important to ask questions about the underlying principles. For me personally, Wittgenstein is a good entry point to language philosophy, since he has a very analytical perspective matching IT contexts nicely.

In general, philosophy can also help us to re-discover the wonder and fascination of the everyday things we take for granted. It can remind us to look deeper and to see the world in a new light. Of course, my sketch here might not be the only solution, but changing our perspective on things leads to a more meaningful engagement with software testing.

Sources

[1] Hirst, Graeme. 1997. Context as a spurious concept. arXiv preprint cmp-lg/9712003.

[2] Luksic, Sandra. 2020. Wittgenstein, natural language processing, and ethics of technology. Master’s thesis, Duke University, 4. Department of Philosophy.

[3] Nordmann, Alfred. 2002. Another new Wittgenstein: The scientific and engineering background of the Tractatus. Perspectives on Science, 10(3):356–384.

[4] Wittgenstein, Ludwig. 2010. Philosophical investigations. John Wiley & Sons.

[5] Wittgenstein, Ludwig. 2013. Tractatus logico-philosophicus. Routledge.

For Further Information

Rise of the Guardians: Testing Machine Learning Algorithms 101, Patrick Prill

Technical Risk Analysis For AI Systems, Bill Matthews

Ludwig Wittgenstein, Stanford Encyclopedia of Philosophy

Software Quality Architect & Technical Editor
Joël Doat is currently a Software Quality Architect & Technical Editor at Eloquest GmbH. He has a background in mathematics and is currently studying philosophy. His interest lies in conceptualizing software quality assurance phenomena from a mathematical-philosophical perspective, with the aim of finding the most suitable methods for addressing the problem at hand.