How do machine learning models work? And are they really capable of “thinking” or “reasoning” in the same way humans do? This philosophical and practical question has been the subject of ongoing debate. However, a recent paper titled “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models,” authored by a team of AI research scientists at Apple, suggests a clear answer: not yet.
The research centers on the difference between genuine symbolic reasoning and pattern reproduction. While these concepts are complex, the basic premise is straightforward. To illustrate, consider a simple math problem:
Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday. How many kiwis does Oliver have?
The correct answer, using basic arithmetic, is 44 + 58 + (44 * 2) = 190. While large language models (LLMs) have a known weakness when it comes to math, they can often handle problems like this. But what if we add a random detail?
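Purely as a sanity check of the arithmetic above (this snippet is illustrative and not from the paper), the calculation looks like this in Python:

```python
# Kiwis picked each day, from the problem statement
friday = 44
saturday = 58
sunday = 2 * friday  # "double the number of kiwis he did on Friday"

total = friday + saturday + sunday
print(total)  # 190
```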
Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?
Though this is essentially the same math problem, the extra, irrelevant detail about the smaller kiwis tends to confuse even advanced LLMs. For instance, one model, o1-mini, responded as follows:
“On Sunday, 5 of these kiwis were smaller than average. We need to subtract them from the Sunday total: 88 (Sunday’s kiwis) – 5 (smaller kiwis) = 83 kiwis.”
Clearly, the model misread the problem: it subtracted the smaller kiwis even though their size has no bearing on how many there are. This is just one of many slightly altered questions in the study that produced a significant drop in performance across the models tested.
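The spirit of this kind of test can be approximated with a small harness like the sketch below. It is not the paper’s benchmark code: the template is invented for illustration, and `query_model` is a hypothetical placeholder for whichever LLM API is being evaluated.

```python
# Illustrative "distractor clause" test, loosely inspired by the study's setup.
# query_model() is a hypothetical stand-in for a call to the model under test.

def query_model(prompt: str) -> int:
    """Hypothetical LLM call returning the numeric answer parsed from the reply."""
    raise NotImplementedError("Wire this up to the model you want to evaluate.")

BASE = ("Oliver picks {f} kiwis on Friday. Then he picks {s} kiwis on Saturday. "
        "On Sunday, he picks double the number of kiwis he did on Friday.")
DISTRACTOR = " But {d} of them were a bit smaller than average."
QUESTION = " How many kiwis does Oliver have?"

def expected(f: int, s: int) -> int:
    # The distractor clause never changes the correct count.
    return f + s + 2 * f

def run_trial(f: int, s: int, d: int) -> dict:
    plain = BASE.format(f=f, s=s) + QUESTION
    perturbed = BASE.format(f=f, s=s) + DISTRACTOR.format(d=d) + QUESTION
    truth = expected(f, s)
    return {
        "plain_correct": query_model(plain) == truth,
        "perturbed_correct": query_model(perturbed) == truth,
    }
```

Comparing accuracy on the plain and perturbed variants across many sampled values is, in spirit, how a performance drop like the one reported can be quantified.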
Why does this happen? The researchers suggest that this kind of failure reveals a deeper issue: LLMs don’t truly “understand” the problems they are solving. Instead, they rely on patterns absorbed from their training data to produce the right response in familiar situations, but when faced with even minor distractions or deviations, their “reasoning” falls apart.
As the paper explains, “we investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data.”
This issue is not limited to math. Similar behaviors show up in language tasks. For example, when an LLM encounters the phrase “I love you,” it often predicts the response “I love you, too,” based purely on statistical patterns rather than any true comprehension of emotion. The same holds for more complex reasoning chains: if the model has seen the pattern before, it can follow along, but once that pattern is slightly altered, its responses can become incoherent or incorrect.
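As a drastically simplified picture of what “statistical patterns” means here, a toy next-word predictor can be built from nothing more than frequency counts. Real LLMs use learned neural representations and are vastly more capable, but the training objective is analogous: predict the likely continuation of the context.

```python
from collections import Counter, defaultdict

# Toy next-word predictor built purely from co-occurrence counts.
# It reproduces whatever continuation it has seen most often; it has no
# notion of meaning, which is the (simplified) point of the analogy.

corpus = [
    "i love you i love you too",
    "i love you i love you too",
    "thank you thank you very much",
]

counts = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the continuation seen most frequently after `word`."""
    if word not in counts:
        return "<unknown>"
    return counts[word].most_common(1)[0][0]

print(predict_next("love"))  # "you" - a pattern reproduced, not understood
```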
The study reinforces the idea that while LLMs are impressive in their ability to replicate human-like language and reasoning, they are still far from actual understanding. For now, these models excel at pattern matching, not real thought.
By Impact Lab