Apple just tested the smartest "reasoning" AI models out there: Claude 3.7 Sonnet, DeepSeek-R1, OpenAI's o1/o3.
The verdict?
They didn't just underperform.
They 𝗰𝗼𝗹𝗹𝗮𝗽𝘀𝗲𝗱 when things got too complex.
Even when you gave them the algorithm, they couldn't follow it.
Worse, when tasks got harder, they 𝗿𝗲𝗮𝘀𝗼𝗻𝗲𝗱 𝗹𝗲𝘀𝘀, not more.
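(Aside for the curious: one of the paper's benchmark puzzles is Tower of Hanoi, and in one experiment the models were handed the solution algorithm in the prompt. Here's a minimal Python sketch of that standard recursion, my illustration rather than the paper's exact prompt, just to show how mechanical the procedure is that the models still failed to execute at depth:)

```python
# Standard recursive Tower of Hanoi solver -- illustrative sketch,
# not the paper's exact prompt.
def hanoi(n, source, target, spare, moves):
    """Append the moves that transfer n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the way
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # stack the rest back on top

moves = []
hanoi(3, "A", "C", "B", moves)
print(len(moves), moves)  # 7 moves: solution length grows as 2^n - 1
```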
This confirms what many testers already feel in their gut:
AI looks smart until it has to think.
Because real reasoning isn't just generating confident answers.
It's about:
• Navigating uncertainty
• Spotting what's missing
• Asking, "Wait, does this even make sense?"
And thatโs what great testers do every day.
We don't just validate that something works.
We question 𝘄𝗵𝘆, 𝗵𝗼𝘄, 𝗮𝗻𝗱 𝘄𝗵𝗮𝘁 could break it next.
AI can make us more productive.
But when complexity scales, 𝘁𝗵𝗲 𝗔𝗜 𝗶𝘀 𝗻𝗼𝘁 the reasoning engine.
𝗬𝗼𝘂 𝗮𝗿𝗲.
Original Paper: https://machinelearning.apple.com/research/illusion-of-thinking