
@ FLASH⚡️
2025-06-14 07:45:05
⚡️🚨 NEW - Claude Opus published a response paper to Apple’s paper, criticizing its experimental design for putting models under token limit constraints and for asking them to solve unsolvable problems.
The study rehabilitates the thinking ability of large models; Apple’s study is wrong.
The new follow-up study to Apple’s paper “The Illusion of Thinking” shows that the alleged collapse of model performance on complex tasks was not due to a lack of thinking ability but to testing errors: overly restrictive token limits and problematic output formats.
The original paper claimed that reasoning models fail completely on increasingly complex reasoning tasks such as the “Tower of Hanoi” or the “River Crossing” puzzle. The follow-up study, however, shows that as soon as the models are allowed to give compressed answers instead of step-by-step move lists (e.g., as a Lua function that generates the solution), they reliably solve even difficult instances – in some cases with impressive efficiency.
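For illustration only: the rebuttal’s prompts reportedly asked for a Lua function, and the names here are made up for the sketch, but the idea of a “compressed answer” is the same in any language: a small generator program instead of an enumerated move list.

```python
def hanoi_moves(n, source="A", target="C", spare="B"):
    """Recursively generate the full move list for n disks.

    The point of the rebuttal: a model can emit this short generator
    (the paper's prompts used Lua) instead of spelling out all
    2**n - 1 moves one by one, which is what blew past the token
    budget in the original evaluation.
    """
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)  # move n-1 disks out of the way
        + [(source, target)]                       # move the largest disk
        + hanoi_moves(n - 1, spare, target, source)  # stack n-1 disks back on top
    )

if __name__ == "__main__":
    moves = hanoi_moves(10)
    print(len(moves))  # 1023 moves for 10 disks, produced by a constant-size answer
```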
The alleged “breakdown” occurred not because the models did not understand the problem, but because the step-by-step format consumed too many tokens and the evaluation scored truncated outputs, as well as mathematically unsolvable instances, as hard failures. With better-suited formats and fair evaluations, the effect disappears completely.
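A rough back-of-the-envelope sketch of the token math (the per-move token cost and the output budget below are assumptions for illustration, not figures taken from either paper):

```python
# Why enumerating every move hits the output cap: the minimal Tower of Hanoi
# solution for n disks has 2**n - 1 moves, so the required output grows
# exponentially while the token budget stays fixed.
TOKENS_PER_MOVE = 10      # assumed cost of writing out one move
OUTPUT_BUDGET = 64_000    # assumed model output limit

for n in range(8, 16):
    moves = 2**n - 1                   # minimal move count for n disks
    needed = moves * TOKENS_PER_MOVE   # tokens to list every single move
    status = "over budget" if needed > OUTPUT_BUDGET else "fits"
    print(f"n={n:2d}  moves={moves:6d}  tokens~{needed:7d}  {status}")
```

Under these assumptions the enumerated format stops fitting around n = 13, which is the kind of cliff the original paper interpreted as a collapse of reasoning ability.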
Conclusion: The study rehabilitates the thinking ability of large models. It wasn't their logic that was the problem, but the way we tested them.
https://blossom.primal.net/f6228d36c75d3e6a1d38c1507b6d70b9bcfe5bc4d6637c88e8932e90e7183cec.png