
The Savant Syndrome: Is Pattern Recognition Equivalent to Intelligence?

Exploring the limits of artificial intelligence: why mastering patterns may not equal genuine reasoning

|LLM|INTELLIGENCE|REASONING|

Image generated by the author using DALL-E

I have hardly ever known a mathematician who was capable of reasoning. – Plato

Reasoning draws a conclusion, but does not make the conclusion certain, unless the mind discovers it by the path of experience. – Roger Bacon

Large Language Models (LLMs) have shown remarkable capabilities, especially in classical natural language processing tasks (such as question answering). More surprisingly, they have also improved at complex tasks that seem to require reasoning (such as coding and mathematics), capabilities long considered exclusive to humans. The claim that LLMs can solve tasks requiring reasoning has therefore opened a heated debate.

Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers?

Reasoning capabilities are crucial if AI systems are to interact with humans and be trusted with critical tasks. Reasoning involves thinking logically, drawing inferences, solving problems, and making decisions from the available information. The same skills are needed for models that are meant to genuinely help us in scientific discovery, healthcare, finance, and education.

This debate has become even more heated with the release of the newest models. Since OpenAI introduced o1, there has been strong interest in training models with Chain-of-Thought (CoT) to improve reasoning. The results of CoT-trained LLMs have led some companies to declare that today's LLMs possess reasoning capabilities and that AGI is getting closer.

So today we have a great debate: on the one hand, companies and some researchers claim that these models possess reasoning capabilities; on the other hand, others dismiss LLMs as stochastic parrots.

A Requiem for the Transformer?

OpenAI’s New ‘Reasoning’ AI Models Arrived: Will They Survive the Hype?

In this article we will focus on trying to answer these questions:

  • What does reasoning mean?
  • Do LLMs possess reasoning or are they just parrots?
  • Are we really measuring reasoning in the right way?

A definition of reasoning?

Reasoning is the fundamental cognitive process of drawing conclusions or making decisions based on available information, logic, and analysis. According to Aristotle, reasoning can be divided into two types:

  • Deductive reasoning, which derives a conclusion that necessarily follows from general premises: if the premises are true, the conclusion must be true.
  • Inductive reasoning, which infers a general conclusion from particular observations, so the conclusion is probable rather than certain.

For a long time, it was assumed that only human beings were capable of reasoning. Today we know that primates, octopuses, and birds also exhibit basic forms of reasoning, such as making decisions or solving problems.

In general, reasoning is supposed to be the process of solving complex problems or making decisions. Complex problem-solving requires identifying the problem, dividing it into subproblems, finding patterns, and then choosing the best solution. Decision-making similarly requires identifying problems and patterns and evaluating alternatives before choosing the best solution.

The problem with these definitions is that they are not entirely clear. Moreover, according to these definitions, LLMs could also be considered capable of reasoning.

Are LLMs able to reason?

In benchmarks that measure reasoning skills (such as GLUE, SuperGLUE, and HellaSwag), LLMs have outperformed humans. For some, this means that LLMs can conduct reasoning and draw logical conclusions.

The claim that LLMs can reason rests mainly on three arguments:

  • LLMs outperform humans on the benchmarks dedicated to reasoning.
  • New properties emerge as parameters, training tokens, and compute budget increase.
  • Techniques such as CoT allow the model to unlock its potential.

So if we want to claim that LLMs are incapable of reasoning, we have to challenge these claims.

LLMs' surprising results in reasoning benchmarks

Of course, when someone claims that LLMs do not reason, proponents of imminent AGI respond: "Look at the results on reasoning benchmarks." To paraphrase the duck test: if it solves problems like a human, decides like a human, and wins reasoning benchmarks, then it probably reasons like a human.

Other authors have questioned this conclusion [1]. While on a superficial level the models seem capable of complex reasoning, a closer look suggests that they rely on probabilistic pattern matching rather than formal reasoning.

A strong token bias suggests that the model is relying on superficial patterns in the input rather than truly understanding the underlying reasoning task. – source

In other words, this brittle performance shows that LLMs fail to generalize when they encounter examples that differ from the patterns seen during training. Simply changing the tokens in an example leads to logical fallacies, because the model can no longer map the example onto something it saw during training. The models are therefore highly sensitive to the specific examples on which they are tested, which would explain why they sometimes show great reasoning ability and sometimes fail spectacularly.

This fragility is exposed by perturbing the tokens of an example: the LLM then fails to solve the problem, showing that its "reasoning" depended on those specific tokens and on mapping them to what it had seen in the training set. This is confirmed by the correlation between an example's frequency in the training data and test performance [8].
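To make this concrete, here is a minimal sketch (my own illustration in Python, not the setup used in [1]) of such a perturbation test: it renames surface entities, which leaves the underlying logic untouched, and checks whether the model's answer survives the renaming. The `solver` callable and the `query_llm` function in the usage comment are hypothetical stand-ins for whatever model API you use.

```python
def rename_entities(problem: str, mapping: dict[str, str]) -> str:
    """Rename surface entities (e.g. 'horses' -> 'bunnies') while leaving
    the underlying logic of the problem untouched."""
    for old, new in mapping.items():
        problem = problem.replace(old, new)
    return problem


def answer_consistency(problem: str, solver, mappings: list[dict[str, str]]) -> float:
    """Fraction of renamed variants on which `solver` gives the same answer
    as on the original wording. A genuine reasoner should score 1.0; a
    pattern matcher tied to training tokens often will not."""
    baseline = solver(problem)
    hits = sum(solver(rename_entities(problem, m)) == baseline for m in mappings)
    return hits / len(mappings)


# Hypothetical usage, with `query_llm(prompt) -> str` standing in for a real model call:
# score = answer_consistency(
#     "Out of 25 horses, find the 3 fastest. You can race only 5 at a time...",
#     solver=query_llm,
#     mappings=[{"horses": "bunnies"}, {"horses": "drones"}],
# )
```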

"the classic "twenty-five horses" problem in graph theory. The top two sub-figures, generated by GPT-4o for illustration purposes only1 , demonstrate the concept by altering the name "horses" to "bunnies", irrelevant to the problem's underlying logic. The bottom two sub-figures show experimental results in GPT-4 and Claude, where performance significantly drops due to perturbations in animal names and numbers." -image source: here
"the classic "twenty-five horses" problem in graph theory. The top two sub-figures, generated by GPT-4o for illustration purposes only1 , demonstrate the concept by altering the name "horses" to "bunnies", irrelevant to the problem’s underlying logic. The bottom two sub-figures show experimental results in GPT-4 and Claude, where performance significantly drops due to perturbations in animal names and numbers." -image source: here

This phenomenon is called prompt sensitivity (giving a different response to a prompt that is semantically equivalent to another) [11, 12]. It suggests that the model responds better to prompts that are closer to the text it saw during training.

LLMs are also sensitive to noise [2]: an LLM is easily distracted by irrelevant context, which degrades its reasoning performance. Moreover, the noise effect is not canceled out even by prompting techniques specialized for reasoning. This suggests that disturbing the mapping with noise impairs the model's ability to find patterns in its memory.
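The sketch below is a toy illustration in the spirit of [2], not the authors' actual protocol: inject sentences that sound on-topic but carry no information relevant to the answer, then compare accuracy with and without them. The distractor sentences and the `solver` callable are my own placeholders.

```python
import random

DISTRACTORS = [
    "Note that the farmer's cousin recently bought a red tractor.",
    "Interestingly, the town library stays open on Sundays.",
    "The weather that week was unusually mild for the season.",
]


def add_irrelevant_context(question: str, n: int = 1, seed: int = 0) -> str:
    """Prepend sentences that sound plausible but are logically irrelevant,
    without touching any quantity the question actually uses."""
    rng = random.Random(seed)
    noise = " ".join(rng.sample(DISTRACTORS, k=min(n, len(DISTRACTORS))))
    return f"{noise} {question}"


def accuracy_drop(dataset, solver, n_noise: int = 1) -> float:
    """Accuracy on clean questions minus accuracy on noise-injected ones.
    `dataset` is a sequence of (question, gold_answer) pairs."""
    items = list(dataset)
    clean = sum(solver(q) == a for q, a in items)
    noisy = sum(solver(add_irrelevant_context(q, n_noise)) == a for q, a in items)
    return (clean - noisy) / len(items)
```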

Intelligence is an emergent property

For many, intelligence is an emergent property. Biological systems tend to become more complex and acquire new capabilities, or they are swept away by evolutionary pressure. Evolution thus produces increasingly intelligent or increasingly specialized beings; intelligence evolved under this pressure, and since it requires resources, the brain grew to the critical size needed to support it. For some, the loss function during training plays the role of this evolutionary pressure: once a model has enough 'neurons,' it can develop reasoning skills (in technical jargon, reasoning emerges with scale).

As noted, this increased capacity for reasoning is attributed to increasing scale (whether in parameters or in training tokens): for several authors, reasoning is an emergent property that appears only beyond a certain parameter threshold. Later studies, however, suggest that emergent properties in LLMs may be a measurement artifact, and with them falls the whole theory of emergent reasoning [3, 13].
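A toy numerical sketch helps to see the measurement-artifact argument of [3]. Assume, purely for illustration, that per-token accuracy improves smoothly with scale; a discontinuous metric such as exact match over a multi-token answer then appears to "jump" even though nothing sudden happens underneath. The logistic curve and answer length below are arbitrary assumptions, not values from the paper.

```python
import math


def per_token_accuracy(log10_params: float) -> float:
    """Toy assumption: per-token accuracy grows smoothly (logistically)
    with the log of the parameter count."""
    return 1.0 / (1.0 + math.exp(-(log10_params - 9.0)))


def exact_match(log10_params: float, answer_len: int = 10) -> float:
    """Exact match requires every one of `answer_len` tokens to be correct,
    so it equals the per-token accuracy raised to that power: a smooth
    quantity that looks like a sharp emergence when plotted against scale."""
    return per_token_accuracy(log10_params) ** answer_len


for log_n in range(7, 13):  # roughly 10^7 to 10^12 parameters
    print(f"10^{log_n} params: per-token={per_token_accuracy(log_n):.2f}, "
          f"exact-match={exact_match(log_n):.4f}")
```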

Emergent Abilities in AI: Are We Chasing a Myth?

Sometimes Noise is Music: How Beneficial Noise Can Improve Your RAG

CoT is not all you need

According to other authors, LLMs are capable of reasoning, but this capability needs to be unlocked. Chain-of-Thought (CoT) prompting helps the model express intermediate reasoning steps, guiding it to the correct answer in arithmetic problems [4]. A few weeks ago, an article questioned the real benefit of CoT [5]:

As much as 95% of the total performance gain from CoT on MMLU is attributed to questions containing "=" in the question or generated output. For non-math questions, we find no features to indicate when CoT will help. – source

So CoT at best helps with math problems; it does not unlock a general reasoning potential in the LLM. Despite this, CoT is touted as a panacea and is considered the basis of the claimed reasoning ability of the latest generation of LLMs.
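For readers who have not seen it spelled out, here is roughly what CoT prompting amounts to in practice. These are only prompt-construction helpers (the model call and answer extraction are omitted), and the exact wording is the common "let's think step by step" formulation rather than anything prescribed by [4] or [5].

```python
def direct_prompt(question: str) -> str:
    """Plain question-answer prompt, no reasoning elicited."""
    return f"Q: {question}\nA:"


def zero_shot_cot_prompt(question: str) -> str:
    """Zero-shot CoT: simply ask the model to spell out intermediate steps."""
    return f"Q: {question}\nA: Let's think step by step."


def few_shot_cot_prompt(question: str, worked_examples: list[tuple[str, str]]) -> str:
    """Few-shot CoT: prepend (question, step-by-step solution) pairs so the
    model imitates the reasoning format before answering the new question."""
    demos = "\n\n".join(f"Q: {q}\nA: {sol}" for q, sol in worked_examples)
    return f"{demos}\n\nQ: {question}\nA:"
```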

"meta-analysis of CoT literature. In both sets of results, math and other kinds of symbolic reasoning are the domains that consistently see substantial improvements from CoT (red dotted line indicates the mean improvement from CoT across experiments)." -image source: here
"meta-analysis of CoT literature. In both sets of results, math and other kinds of symbolic reasoning are the domains that consistently see substantial improvements from CoT (red dotted line indicates the mean improvement from CoT across experiments)." -image source: here

To CoT or Not to CoT: Do LLMs Really Need Chain-of-Thought?

These results seem to rule out common-sense reasoning abilities, but this does not rule out other forms of reasoning.

Are LLMs really capable of mathematical reasoning?

Although mathematics would seem to be the strong point of LLM reasoning, some studies suggest that here too LLMs merely recognize patterns: they search for patterns without really understanding the symbols they manipulate.

According to some authors [6], LLMs are not capable of formal mathematical reasoning because they cannot develop a plan (a plan being a course of actions, or policy, that when executed takes an agent from an initial state to a desired goal state). Without such a plan, a model cannot solve a problem except by mapping it onto patterns seen in training. In some cases it is even the user who unconsciously guides the LLM to the solution [7]:

The Clever Hans effect, where the LLM is merely generating guesses, and it is the human in the loop, with the knowledge of right vs. wrong solutions, who is steering the LLM–even if they didn't set out to do so deliberately. The credit and blame for the ensuing accuracy, if any, falls squarely on the human in the loop. – source

"Claimed reasoning capabilities of LLMs are sometimes due to the subconscious helpful iterative prompting by the humans in the loop"-image source: here
"Claimed reasoning capabilities of LLMs are sometimes due to the subconscious helpful iterative prompting by the humans in the loop"-image source: here

Summarizing so far: proponents of LLM reasoning put forward several arguments for why we observe this behavior. We have shown that several studies contradict these claims.

Yet despite these studies claiming that they do not reason, LLMs perform astoundingly well on benchmarks and pass tests that are hard even for humans. The evidence presented so far can therefore look theoretical when set against the experimental evidence that LLMs do solve mathematical and other complex problems.

Is this just a human outcry at being beaten by LLMs, or is something wrong with the way we measure?

Catching a student who is copying

Surely it is irritating to read claims that an LLM performs like a PhD student:

The o1-preview model is designed to handle challenging tasks by dedicating more time to thinking and refining its responses, similar to how a person would approach a complex problem. In tests, this approach has allowed the model to perform at a level close to that of PhD students in areas like physics, chemistry, and biology. – source

Irritation aside, the problem is how these model capabilities are measured. We are probably not measuring their reasoning skills in the right way, and it is time for new evaluation methods.

These models are all tested on the same benchmarks, such as the GSM8K (Grade School Math 8K) dataset, which provides grade-school arithmetic problems but is at risk of data leakage (given how many billions of tokens are used to train an LLM, the model may already have seen the answers during training). In addition, it provides only a single metric on a fixed set of questions, which tells us little about the LLM's reasoning (fun fact: an LLM can answer a question correctly while blatantly getting the reasoning wrong). Finally, the dataset is static and does not allow us to vary the conditions.

To address this, the authors propose a new benchmark, GSM-Symbolic [9], in which questions are generated from symbolic templates. This allows the difficulty of the questions to be varied and gives more fine-grained control during testing. It is essentially the same dataset on which reasoning had been tested, with the questions modified just enough to make statistical pattern matching difficult. If the LLM is capable of reasoning, it should solve the problems easily; if it is incapable of generalizing, it will fail miserably.
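To illustrate what a symbolic template means in practice, here is a minimal sketch in the spirit of GSM-Symbolic; the template text, names, and number ranges are my own invented example, not taken from the paper. The point is that names and numbers vary freely while the ground truth is computed programmatically, so the underlying reasoning stays fixed as the surface tokens change.

```python
import random

TEMPLATE = ("{name} has {x} {item}s. {name} buys {y} more {item}s and then "
            "gives away {z}. How many {item}s does {name} have now?")


def make_instance(seed: int) -> tuple[str, int]:
    """Instantiate the template with fresh names and numbers and compute the
    ground-truth answer, so every variant tests the same reasoning with
    different surface tokens."""
    rng = random.Random(seed)
    x, y = rng.randint(5, 40), rng.randint(2, 20)
    z = rng.randint(1, x + y - 1)
    values = {
        "name": rng.choice(["Sophia", "Liam", "Aisha", "Mateo"]),
        "item": rng.choice(["kiwi", "pencil", "sticker", "marble"]),
        "x": x, "y": y, "z": z,
    }
    return TEMPLATE.format(**values), x + y - z


question, gold = make_instance(seed=42)
print(question, "->", gold)
```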

Illustration of the GSM-Symbolic template creation process. Image source: here

Testing state-of-the-art LLMs, the authors found no evidence of formal reasoning in language models. The models are not robust: performance drops when numerical values are changed, and their capabilities degrade sharply as problem complexity increases.

One telling example: the model is easily fooled when seemingly relevant statements are added to a question that are, in fact, irrelevant to the reasoning and conclusion. The model takes these statements into account and is led into errors. According to this study, the model does not understand the mathematical concepts; it simply tries to convert every statement into an operation. The authors suggest this happens because the training data contained similar examples that did require conversion into mathematical operations.

For instance, a common case we observe is that models interpret statements about "discount" as "multiplication", regardless of the context. This raises the question of whether these models have truly understood the mathematical concepts well enough. – source

Image source: here

This is another sign that the model looks for patterns even when they are just background noise. When the noise increases and it becomes harder to find patterns (or to map them consistently onto a solution), performance drops dramatically [10]. This also holds for LLMs trained with CoT (such as OpenAI's o1). This is a further indication that CoT does not really improve reasoning skills.

Image source: here

Parting thoughts

In this article we discussed the great debate: are LLMs capable of reasoning? Or at least some form of reasoning?

The studies we have discussed disagree with that claim and suggest that LLMs are sophisticated pattern-matching machines. In summary, they suggest:

  • LLMs are trained with a huge number of tokens and there is a risk of data contamination with major benchmarks. Even if the model did not see a mathematical problem, it has probably seen plenty of similar examples.
  • Given their enormous knowledge and innate ability to find patterns (thanks to attention mechanisms and in-context learning) they manage to solve most problems.
  • Their lack of robustness to variations in the problem, token bias, and susceptibility to noise strongly suggest that LLMs are not capable of formal reasoning.
  • New results confirm that even using advanced prompting techniques the models remain susceptible to noise and irrelevant (or potentially misleading) information.
  • The models are capable of pattern matching but do not appear to understand any of the mathematical concepts underlying problem-solving.

These results do not question the usefulness of LLMs but challenge the assumption that an LLM is capable of reasoning. They suggest that an LLM can be seen as a machine with prodigious memory but incapable of reasoning (or as the most sophisticated mechanical parrot to date). This does not detract from the marvel of the technology required to create them, which remains a testament to human ingenuity. Further studies are probably needed to better explain the capabilities of LLMs, and new architectures will be needed for models that can truly reason.

What do you think? Are LLMs capable of reasoning? Let me know in the comments.


If you have found this interesting:

You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository for weekly updated ML & AI news. I am open to collaborations and projects. You can also subscribe for free to get notified when I publish a new story.


Here is the link to my GitHub repository, where I am collecting code and many resources related to machine learning, Artificial Intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

or you may be interested in one of my recent articles:

Power Corrupts: Hierarchies, Persuasion, and Anti-Social Behavior in LLMs

Through the Uncanny Mirror: Do LLMs Remember Like the Human Mind?

Lie to Me: Why Large Language Models Are Structural Liars

Forever Learning: Why AI Struggles with Adapting to New Challenges

Reference

Here is the list of the principal references I consulted while writing this article; only the first author of each work is cited.

  1. Jiang, 2024, A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners, link
  2. Shi, 2023, Large Language Models Can Be Easily Distracted by Irrelevant Context, link
  3. Schaeffer, 2023, Are emergent abilities of large language models a mirage? link
  4. Wei, 2022, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, link
  5. Sprague, 2024, To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning, link
  6. Valmeekam, 2023, PlanBench: An Extensible Benchmark for Evaluating Large Language Models on Planning and Reasoning about Change
  7. Kambhampati, 2024, Can Large Language Models Reason and Plan? link
  8. Razeghi, 2022, Impact of Pretraining Term Frequencies on Few-Shot Reasoning, link
  9. Mirzadeh, 2024, GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, link
  10. Valmeekam, 2024, LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench, link
  11. Lu, 2022, Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity, link
  12. Zhao, 2021, Calibrate Before Use: Improving Few-shot Performance of Language Models, link
  13. Rogers, 2024, Position: Key Claims in LLM Research Have a Long Tail of Footnotes, link
