AI doesn't fail at randomness. It fails at complexity.

A screenshot from the paper "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity", showing a composite figure that illustrates how reasoning models solve the Tower of Hanoi puzzle and how their performance varies with problem complexity.

Top Section – LLM Response Workflow:

On the left is a code-like LLM response showing a thoughts section with a list of disk moves (e.g., [1, 0, 2], [2, 0, 1], etc.) and an answer section referencing the final moves list. Arrows indicate:

Intermediate moves are extracted from the thoughts section for analysis.

The final answer is extracted from the answer section to measure accuracy.

To the right, a sequence of three Tower of Hanoi diagrams represents:

Initial State: All disks stacked on peg 0.

Middle State: Disks distributed across pegs.

Target State: All disks correctly stacked on peg 2.
Each disk is color-coded and numbered for clarity.

Bottom Row – Three Line Graphs:

Left Graph: Accuracy vs. Complexity

Y-axis: Accuracy (%)

X-axis: Problem complexity (number of disks, from 1 to 20)

Two lines: Claude 3.7 (red circles) and Claude 3.7 with "thinking" mode (blue triangles).

Accuracy drops sharply for both as the number of disks increases, with "thinking" performing slightly better up to 8 disks.

Middle Graph: Response Length vs. Complexity

Y-axis: Token count

X-axis: Number of disks

"Thinking" responses grow rapidly in length with complexity, peaking near 8 disks.

Right Graph: Position of Error in Thought Process

Y-axis: Normalized position in the LLM's reasoning (0 to 1)

X-axis: Complexity (1 to 15 disks)

Shows where correct vs. incorrect reasoning paths diverge; incorrect solutions typically fail earlier in the thoughts.

Background colors across all graphs denote complexity bands: yellow (easy), blue (moderate), red (hard).
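
For a sense of why that complexity axis bites so hard, here is a minimal Python sketch (mine, not the paper's code) of the standard recursive Tower of Hanoi solution plus a rule checker. It assumes the figure's move format is [disk, from_peg, to_peg] with pegs numbered 0 to 2. The optimal solution needs 2^n - 1 moves, so the right-hand end of the x-axis (20 disks) already demands over a million correct moves in a row.

# Minimal sketch, not from the paper: generate and verify Tower of Hanoi moves.
# Assumption: moves are written as [disk, from_peg, to_peg], pegs numbered 0-2.

def hanoi_moves(n, source=0, target=2, spare=1):
    """Optimal move list for n disks (disk 1 = smallest); its length is 2**n - 1."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, source, spare, target)
            + [[n, source, target]]
            + hanoi_moves(n - 1, spare, target, source))

def is_valid_solution(n, moves):
    """Replay a move list, enforce the puzzle rules, and check the goal state."""
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 starts with all disks, largest at the bottom
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # the moved disk is not on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a larger disk may not sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))  # every disk ends up on peg 2

print(is_valid_solution(3, hanoi_moves(3)))      # True
for n in (3, 8, 20):
    print(n, "disks need", 2 ** n - 1, "moves")  # 7, 255, 1,048,575

Scoring an extracted move list against a checker like this is, in spirit, how a final answer can be marked right or wrong; the exponential growth in required moves is what the accuracy and token-count curves above are up against.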

Apple just tested the smartest "reasoning" AI models out there: Claude 3.7 Sonnet, DeepSeek-R1, OpenAI's o1/o3.
The verdict?

They didn't just underperform.
They 𝗰𝗼𝗹𝗹𝗮𝗽𝘀𝗲𝗱 when things got too complex.

Even when you gave them the algorithm, they couldn't follow it.
Worse, when tasks got harder, they 𝗿𝗲𝗮𝘀𝗼𝗻𝗲𝗱 𝗹𝗲𝘀𝘀, not more.

This confirms what many testers already feel in their gut:
AI looks smart until it has to think.

Because real reasoning isn't just generating confident answers.
It's about:

• Navigating uncertainty
• Spotting what's missing
• Asking, "Wait, does this even make sense?"

And that's what great testers do every day.

We don't just validate that something works.
We question 𝘄𝗵𝘆, 𝗵𝗼𝘄, 𝗮𝗻𝗱 𝘄𝗵𝗮𝘁 could break it next.

AI can make us more productive.
But when complexity scales, 𝘁𝗵𝗲 𝗔𝗜 𝗶𝘀 𝗻𝗼𝘁 the reasoning engine.
𝗬𝗼𝘂 𝗮𝗿𝗲.

Original Paper: https://machinelearning.apple.com/research/illusion-of-thinking

CPTO of Epic Test Quest
Berlin-based Automation Engineer with 18+ years in tech. Now Co-Founder & CPTO at Epic Test Quest: building a playful, powerful Slack & Teams app that puts quality into daily team flow.