
"The Illusion of Thinking". A deep dive into Apple's new research paper.

Updated: Sep 27

Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.



It is not unusual in my community for a new paper to stir controversy, but Apple's latest, "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity" (Shojaee et al., 2025), has done more than that: it provides a bracing, quantified reality check on today's vaunted AI reasoning capabilities.


As a researcher in this area, I believe the paper's real value lies in the distinction it captures: good pattern-matching as opposed to actual, generalizable reasoning.


Its systematic approach is, in my view, what makes this work especially interesting. Rather than relying on established commercial benchmarks, the authors built controllable tests in the style of classic logic puzzles such as the Tower of Hanoi, River Crossing, and Blocks World.


By incrementally increasing the complexity of these problems, the authors were able to observe precisely how Large Reasoning Models (LRMs), that is, large language models (LLMs) that generate an explicit step-by-step thinking process, perform across different difficulty levels.
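To make "incrementally increasing the complexity" concrete: in Tower of Hanoi the optimal solution grows exponentially with the number of disks (2^n - 1 moves), so each additional disk is a controlled step up in difficulty, and any proposed solution can be checked mechanically against the puzzle's rules. The sketch below is a minimal illustration of that idea in Python; the model_moves variable and the simple verifier interface are my own assumptions for the example, not the paper's actual evaluation harness.

```python
# Minimal sketch: ground-truth Tower of Hanoi moves and a rules-based verifier.
# Hypothetical illustration only; not the paper's evaluation code.

def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Optimal move sequence for n disks: 2**n - 1 moves."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_moves(n - 1, aux, src, dst))

def is_valid_solution(n, moves):
    """Replay a proposed move list and check it solves the puzzle legally."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}   # largest disk at bottom
    for src, dst in moves:
        if not pegs[src]:
            return False                        # moving from an empty peg
        disk = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                        # larger disk placed on a smaller one
        pegs[dst].append(disk)
    return pegs["C"] == list(range(n, 0, -1))   # all disks stacked on the target peg

# Complexity grows exponentially: 3 disks need 7 moves, 10 disks need 1,023.
for n in (3, 5, 10):
    print(n, "disks ->", len(hanoi_moves(n)), "optimal moves")

# A hypothetical model answer for 3 disks, checked against the rules.
model_moves = [("A", "C"), ("A", "B"), ("C", "B"),
               ("A", "C"), ("B", "A"), ("B", "C"), ("A", "C")]
print("valid:", is_valid_solution(3, model_moves))
```

The jump from 7 optimal moves at three disks to 1,023 at ten is exactly the kind of controlled complexity knob the authors turn.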


My Take on the Three Performance Regimes


The findings from this paper resonate with some of my own observations and intuitions about LLMs. The study identifies three distinct performance regimes, and they are worth examining closely:


  • Low Complexity (Simple Puzzles): At the simplest level, I was not surprised to see that standard LLMs (non-reasoning models) often outperformed their LRM counterparts. The step-by-step thinking process we assume to be a strength became a liability: the reasoning models tended to "overthink", generating superfluous tokens that introduced unnecessary mistakes, whereas the standard models were more direct and accurate.


  • Medium Complexity (Moderate Puzzles): This is where LRMs truly showcase their strengths. The explicit, sequential reasoning process lets them break problems down into manageable steps, giving them a clear advantage over standard LLMs. This "sweet spot" is where the illusion of thinking is at its strongest, and where we are most tempted to project human-like intelligence onto these models.


  • High Complexity (Very Hard Puzzles): This, for me, is the most profound and concerning finding. Once the puzzles reached a high level of complexity, the performance of both model types, LRMs included, collapsed completely, with accuracy plummeting to near zero. Even more revealing is the models' behavior at this threshold: as problems became harder, the reasoning models counterintuitively used fewer tokens for "thinking", suggesting they "gave up" or became inefficient when faced with overwhelming complexity. That is a stark contrast to what we would expect from a system with genuine reasoning capabilities; a human would likely increase their effort as a problem gets harder. (A sketch of how such a complexity sweep can be measured follows after this list.)
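Methodologically, the three regimes emerge from a sweep over a single complexity knob, with two curves recorded at each level: accuracy and the length of the reasoning trace. The sketch below shows roughly what such a loop could look like; model_fn and check_fn are placeholders I am assuming for illustration (any callable that returns a move list plus a thinking trace, and any rules-based verifier such as the one sketched earlier), not the paper's code.

```python
# Sketch of a complexity sweep over Tower of Hanoi, assuming placeholder
# callables: model_fn(prompt) -> (moves, thinking_text) queries a model, and
# check_fn(n, moves) -> bool verifies a solution against the puzzle rules.

def sweep_complexity(model_fn, check_fn, max_disks=12, trials=5):
    """Record accuracy and reasoning-trace length at each complexity level."""
    results = []
    for n in range(1, max_disks + 1):
        correct, thinking_tokens = 0, 0
        for _ in range(trials):
            prompt = f"Solve Tower of Hanoi with {n} disks. List every move."
            moves, thinking = model_fn(prompt)
            correct += bool(check_fn(n, moves))
            thinking_tokens += len(thinking.split())   # crude proxy for token count
        results.append({"disks": n,
                        "accuracy": correct / trials,
                        "avg_thinking_tokens": thinking_tokens / trials})
    return results
```

Plotting both curves against the number of disks is what exposes the pattern described above: near-ceiling accuracy at small n, an LRM advantage at medium n, and a collapse toward zero accuracy, accompanied by shorter reasoning traces, at large n.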


The Uncomfortable Truth About AI Reasoning


My primary takeaway from this research is that our current perception of AI's reasoning abilities is fundamentally flawed. The paper convincingly argues that what we are observing is not generalizable reasoning, but rather a highly sophisticated form of pattern recognition and procedural memorization. The systems excel when a problem's structure aligns with patterns in their vast training data, but they fail when true algorithmic or abstract problem-solving is required. The research raises questions about the true nature of "reasoning" in these models, as they fail to use explicit algorithms even when provided with them.
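It helps to be concrete about what generalizable, algorithmic problem-solving looks like, since that is the standard against which the models fall short. A generic search procedure solves any instance of these puzzles, seen or unseen, because it operates on the problem's state space rather than on familiar surface patterns. Below is a small sketch using the classic wolf/goat/cabbage river crossing; note that the paper uses a larger, parameterized River Crossing variant, so this is only an illustration of the contrast, not a reproduction of its setup.

```python
# Minimal sketch: generic breadth-first search over a puzzle's state space.
# Classic wolf/goat/cabbage river crossing, used only to illustrate what
# size-independent, algorithmic solving looks like.
from collections import deque

ITEMS = ("farmer", "wolf", "goat", "cabbage")

def unsafe(side):
    # A bank is unsafe if the wolf is left with the goat, or the goat with the
    # cabbage, without the farmer present.
    return "farmer" not in side and (
        {"wolf", "goat"} <= side or {"goat", "cabbage"} <= side)

def solve():
    start = frozenset(ITEMS)            # everyone starts on the left bank
    goal = frozenset()                  # goal: the left bank is empty
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        left, path = queue.popleft()
        if left == goal:
            return path
        right = set(ITEMS) - left
        going_right = "farmer" in left  # the boat travels with the farmer
        src = left if going_right else right
        for passenger in [None] + [x for x in src if x != "farmer"]:
            crossing = {"farmer"} | ({passenger} if passenger else set())
            new_left = frozenset(left - crossing if going_right else left | crossing)
            new_right = set(ITEMS) - new_left
            if unsafe(new_left) or unsafe(new_right) or new_left in seen:
                continue
            seen.add(new_left)
            move = (passenger or "nothing", "right" if going_right else "left")
            queue.append((new_left, path + [move]))
    return None

print(solve())  # shortest sequence of crossings, starting by ferrying the goat
```

The point is not that the search is clever; it is that its correctness does not depend on having seen similar solutions before, which is precisely the property the paper finds lacking in the models.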


The paper has sparked debate in the AI community. Some researchers argue that the findings are a crucial revelation about the limitations of current AI architectures, while others contend that the experimental design may have been flawed, for example, by not accounting for a model's token limits. For my fellow researchers and for anyone building applications with these models, this paper serves as a critical warning. We must be cautious about over-extrapolating from impressive benchmark results and understand that our models have a hard, definable limit. The "thinking" we see may be nothing more than a generated text block that mimics a human thought process without actually replicating it. This realization should not be seen as a setback, but as a catalyst for a new wave of research focused on creating truly robust, generalizable AI systems.


References



Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., and Farajtabar, M. (2025). The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. Apple. Available at: https://arxiv.org/abs/2506.06941
 
 

©2025 Dr. Shibu Valsalan. All rights reserved.
