AI doesn't fail at random. It fails at complexity.
![A screenshot from the paper "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity", showing a composite figure that illustrates how reasoning models solve the Tower of Hanoi problem and how their performance varies with problem complexity.
Top Section – LLM Response Workflow:
On the left is a code-like LLM response with two sections: one listing intermediate disk moves (e.g., [1, 0, 2], [2, 0, 1], etc.) and one giving the final moves list. Arrows indicate:
Moves are extracted from the intermediate section for analysis.
The final answer is extracted from the final section for measuring accuracy.
To the right, a sequence of three Tower of Hanoi diagrams represents:
Initial State: All disks stacked on peg 0.
Middle State: Disks distributed across pegs.
Target State: All disks correctly stacked on peg 2.
Each disk is color-coded and numbered for clarity.
Bottom Row – Three Line Graphs:
Left Graph: Accuracy vs. Complexity
Y-axis: Accuracy (%)
X-axis: Problem complexity (number of disks, from 1 to 20)
Two lines: Claude 3.7 (red circles) and Claude 3.7 with "thinking" mode (blue triangles).
Accuracy drops sharply for both as the number of disks increases, with "thinking" performing slightly better up to 8 disks.
Middle Graph: Response Length vs. Complexity
Y-axis: Token count
X-axis: Number of disks
"Thinking" responses grow rapidly in length with complexity, peaking near 8 disks.
Right Graph: Position of Error in Thought Process
Y-axis: Normalized position in the LLMโs reasoning (0 to 1)
X-axis: Complexity (1 to 15 disks)
Shows where correct vs. incorrect reasoning paths diverge; incorrect solutions typically fail earlier in the thoughts.
Background colors across all graphs denote complexity bands: yellow (easy), blue (moderate), red (hard).](https://www.ministryoftesting.com/rails/active_storage/blobs/redirect/eyJfcmFpbHMiOnsibWVzc2FnZSI6IkJBaHBBeGRrQVE9PSIsImV4cCI6bnVsbCwicHVyIjoiYmxvYl9pZCJ9fQ==--fedc637a41c8b2a7ac496d64cff3f1fb1ff08050/Screenshot%202025-06-09%20at%2016.07.56.png)
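To make the evaluation workflow in the figure concrete: the moves extracted from a model's response can simply be replayed against the puzzle's rules, so correctness is checked mechanically rather than judged. Here is a minimal sketch of such a move checker, assuming the [disk, from_peg, to_peg] format shown in the figure's example moves; it is my own illustration, not the paper's evaluation harness.

```python
# Sketch of a Tower of Hanoi solution checker (illustrative, not the paper's code).
# It replays a list of [disk, from_peg, to_peg] moves and verifies every move is
# legal and the final state matches the target (all disks on peg 2).

def check_hanoi_solution(n: int, moves: list[list[int]]) -> bool:
    """Return True if `moves` legally transfers n disks from peg 0 to peg 2."""
    pegs = [list(range(n, 0, -1)), [], []]   # peg 0 holds disks n..1, largest at the bottom
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False                     # named disk is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                     # would place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))  # all disks end up on peg 2, in order

print(check_hanoi_solution(2, [[1, 0, 1], [2, 0, 2], [1, 1, 2]]))  # True
```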
Apple just tested the smartest "reasoning" AI models out there: Claude 3.7 Sonnet, DeepSeek-R1, OpenAI's o1/o3.
The verdict?
They didn't just underperform.
They **collapsed** when things got too complex.
Even when you gave them the algorithm, they couldn't follow it.
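For context, the algorithm in question is the textbook recursive Tower of Hanoi procedure, which fits in a few lines. The sketch below is my illustration (not the exact pseudocode Apple put in the prompt); the point is how mechanical it is to follow, yet the models still failed to execute it at scale.

```python
# The classic recursive Tower of Hanoi algorithm, sketched for illustration.
def hanoi_moves(n: int, source: int = 0, target: int = 2, spare: int = 1) -> list:
    """Return the optimal move list for n disks as [disk, from_peg, to_peg] triples."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)    # park the n-1 smaller disks on the spare peg
        + [[n, source, target]]                      # move the largest disk straight to the target
        + hanoi_moves(n - 1, spare, target, source)  # stack the smaller disks back on top of it
    )

print(len(hanoi_moves(8)))  # 255 -- the move count grows as 2^n - 1
```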
Worse, when tasks got harder, they **reasoned less**, not more.
This confirms what many testers already feel in their gut:
AI looks smart until it has to think.
Because real reasoning isn't just generating confident answers.
Itโs about:
• Navigating uncertainty
• Spotting what's missing
• Asking, "Wait, does this even make sense?"
And thatโs what great testers do every day.
We don't just validate that something works.
We question **why, how, and what** could break it next.
AI can make us more productive.
But when complexity scales, **the AI is not** the reasoning engine.
**You are.**
Original Paper: https://machinelearning.apple.com/research/illusion-of-thinking