Why Apple’s Critique of AI Reasoning Is Premature


The debate around the reasoning capabilities of Large Reasoning Models (LRMs) has recently been reinvigorated by two prominent yet conflicting papers: Apple’s “The Illusion of Thinking” and Anthropic’s rebuttal titled “The Illusion of the Illusion of Thinking”. Apple’s paper claims fundamental limits in LRMs’ reasoning abilities, while Anthropic argues these claims stem from evaluation shortcomings rather than model failures.

Apple’s study systematically tested LRMs on controlled puzzle environments, observing an “accuracy collapse” beyond specific complexity thresholds. Models such as Claude 3.7 Sonnet and DeepSeek-R1 reportedly failed to solve puzzles like Tower of Hanoi and River Crossing as complexity increased, even exhibiting reduced reasoning effort (token usage) at higher complexities. Apple identified three distinct complexity regimes: standard LLMs outperform LRMs at low complexity, LRMs excel at medium complexity, and both collapse at high complexity. Critically, Apple concluded that these limitations reflect LRMs’ inability to apply exact computation and consistent algorithmic reasoning across puzzles.
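To see why complexity escalates so quickly in these benchmarks, consider the Tower of Hanoi: the shortest solution for n disks requires 2^n - 1 moves, so a fully enumerated answer grows exponentially even though the underlying strategy never changes. The sketch below is a minimal illustration written for this article (the per-move token cost and the output budget are assumptions, not figures from either paper), showing how quickly an enumerated solution outgrows a typical output limit.

```python
# Minimal illustration (not code from either paper): how fast a fully
# enumerated Tower of Hanoi solution grows with the number of disks.
TOKENS_PER_MOVE = 10      # assumed rough cost of printing one move, e.g. "move disk 3 from A to C"
OUTPUT_BUDGET = 64_000    # assumed typical output-token limit for a reasoning model

for n in (5, 10, 12, 15, 20):
    moves = 2**n - 1                        # optimal move count for n disks
    est_tokens = moves * TOKENS_PER_MOVE    # rough size of the enumerated answer
    verdict = "fits" if est_tokens <= OUTPUT_BUDGET else "exceeds budget"
    print(f"n={n:2d}: {moves:>9,} moves, ~{est_tokens:>10,} tokens -> {verdict}")
```

Under these assumed numbers, a 15-disk instance already demands roughly 327,000 output tokens just to list the moves, which is exactly the regime where Apple reports collapse.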

Anthropic, however, sharply challenges Apple’s conclusions, identifying critical flaws in the experimental design rather than the models themselves. They highlight three major issues:

  1. Token Limitations vs. Logical Failures: Anthropic emphasizes that the failures observed in Apple’s Tower of Hanoi experiments were primarily due to output token limits rather than reasoning deficits. Models explicitly noted their token constraints and deliberately truncated their outputs. What appeared to be a “reasoning collapse” was therefore a practical limitation, not a cognitive failure.
  2. Misclassification of Reasoning Breakdown: Anthropic identifies that Apple’s automated evaluation framework misinterpreted these intentional truncations as reasoning failures. The rigid scoring method did not accommodate the models’ awareness of, and decisions about, output length, and so unjustly penalized the LRMs.
  3. Unsolvable Problems Misinterpreted: Perhaps most significantly, Anthropic demonstrates that some of Apple’s River Crossing benchmarks were mathematically impossible to solve (e.g., instances with six or more actor–agent pairs and a boat capacity of three). Scoring these unsolvable instances as failures drastically skewed the results, making models appear incapable of solving puzzles that are fundamentally unsolvable. A brute-force check of this impossibility is sketched after this list.
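The impossibility claim is straightforward to verify by exhaustive search. The sketch below is an illustrative check written for this article, not code from either paper; it assumes the standard jealous-husbands style constraint underlying the River Crossing benchmark (an actor may not be on a bank or in the boat with another pair’s agent unless their own agent is also present) and that anyone can row.

```python
# Brute-force solvability check for the River Crossing puzzle (illustrative only).
from collections import deque
from itertools import combinations

def safe(group):
    """A group is safe if every actor in it either has their own agent present
    or is with no agents at all."""
    actors = {i for kind, i in group if kind == "actor"}
    agents = {i for kind, i in group if kind == "agent"}
    return all(i in agents or not agents for i in actors)

def solvable(n_pairs, boat_capacity):
    people = frozenset([("actor", i) for i in range(n_pairs)] +
                       [("agent", i) for i in range(n_pairs)])
    start = (people, "left")              # everyone (and the boat) starts on the left bank
    seen, queue = {start}, deque([start])
    while queue:
        left, boat_side = queue.popleft()
        if not left:                      # left bank empty: everyone has crossed
            return True
        bank = left if boat_side == "left" else people - left
        for size in range(1, boat_capacity + 1):
            for group in map(frozenset, combinations(bank, size)):
                if not safe(group):       # constraint inside the boat
                    continue
                new_left = left - group if boat_side == "left" else left | group
                if not (safe(new_left) and safe(people - new_left)):
                    continue              # constraint on both banks after the crossing
                state = (new_left, "right" if boat_side == "left" else "left")
                if state not in seen:
                    seen.add(state)
                    queue.append(state)
    return False                          # state space exhausted without reaching the goal

print(solvable(5, 3))   # True  -- five pairs can cross with a three-seat boat (classical result)
print(solvable(6, 3))   # False -- six pairs cannot: the benchmark instance has no solution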

Anthropic further tested an alternative representation method—asking models to provide concise solutions (like Lua functions)—and found high accuracy even on complex puzzles previously labeled as failures. This outcome clearly indicates the issue was with evaluation methods rather than reasoning capabilities.
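To make the representation point concrete, consider what such a compact answer looks like. The Python sketch below is an illustrative analogue of the Lua-function format described in the rebuttal, not code from either paper: it encodes the complete Tower of Hanoi strategy in a few lines, even though executing it for 15 disks yields 32,767 individual moves.

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Yield every move needed to transfer n disks from source to target."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, spare, target)   # park the n-1 smaller disks on the spare peg
    yield (source, target)                           # move the largest disk
    yield from hanoi(n - 1, spare, target, source)   # stack the smaller disks back on top

moves = list(hanoi(15))
print(len(moves))   # 32767 -- far too many to enumerate within a tight output budget,
                    # yet the strategy itself fits in roughly ten lines of code
```

A model that can produce such a function has arguably demonstrated command of the algorithm; asking it to print every move instead largely tests its output budget.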

Another key point raised by Anthropic pertains to the complexity metric used by Apple—compositional depth (number of required moves). They argue this metric conflates mechanical execution with genuine cognitive difficulty. For example, while Tower of Hanoi puzzles require exponentially more moves, each decision step is trivial, whereas puzzles like River Crossing involve fewer steps but significantly higher cognitive complexity due to constraint satisfaction and search requirements.

Both papers significantly contribute to understanding LRMs, but the tension between their findings underscores a critical gap in current AI evaluation practices. Apple’s conclusion—that LRMs inherently lack robust, generalizable reasoning—is substantially weakened by Anthropic’s critique. Instead, Anthropic’s findings suggest LRMs are constrained by their testing environments and evaluation frameworks rather than their intrinsic reasoning capacities.

Given these insights, future research and practical evaluations of LRMs must:

  • Differentiate Clearly Between Reasoning and Practical Constraints: Tests should accommodate the practical realities of token limits and model decision-making.
  • Validate Problem Solvability: Ensuring puzzles or problems tested are solvable is essential for fair evaluation.
  • Refine Complexity Metrics: Metrics must reflect genuine cognitive challenges, not merely the volume of mechanical execution steps.
  • Explore Diverse Solution Formats: Assessing LRMs’ capabilities across various solution representations can better reveal their underlying reasoning strengths.

Ultimately, Apple’s claim that LRMs “can’t really reason” appears premature. Anthropic’s rebuttal demonstrates that LRMs possess sophisticated reasoning capabilities that can handle substantial cognitive tasks when evaluated appropriately. The exchange also underscores how much careful, nuanced evaluation methods matter for understanding both the capabilities and the limitations of emerging AI models.


Check out the Apple Paper and Anthropic Paper. All credit for this research goes to the researchers of this project.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among readers.



