Solving Reasoning Problems with LLMs in 2023

Zhaocheng Zhu
Towards Data Science
17 min read · Jan 6, 2024

It’s the beginning of 2024 and ChatGPT just celebrated its first birthday. One year is a long time in the world of large language models, where a myriad of interesting works have taken place. Let’s revisit the progress and discuss topics for the coming year.

The School of Agents: LLMs are retrieving knowledge from textbooks and performing reasoning. Image by authors & DALL·E 3.

This post was co-authored with Michael Galkin (Intel AI Lab), Abulhair Saparov (New York University), Shibo Hao (UC San Diego) and Yihong Chen (University College London & Meta AI Research). Many insights in this post were formed during the fruitful discussions with Emily Xue (Google), Hanjun Dai (Google DeepMind) and Bruno Ribeiro (Purdue University).

Introduction

🔥 Large language models (LLMs) were without doubt the hottest topic of 2023. At the NeurIPS conference last month, the recurring topics at social events were: 1) what research are we doing with/for LLMs? 2) how can my research be integrated with LLMs? 3) what is the best strategy to shift from XXX to LLMs? 4) what research can we do as a GPU-poor group? The reason is simple: everyone has been flooded with groundbreaking LLM news on X, Discord, Slack, and everywhere else.

If you take a look at the language model papers on arXiv, the count leaps from 2,837 to 11,033 in 2023, breaking the linear trend observed from 2019 to 2022. Papers from the past year can be roughly clustered into 3 major categories: 1️⃣ pretraining and alignment; 2️⃣ tool use and reasoning; 3️⃣ systems and serving. As the title indicates, this post focuses on the progress of LLM research on tool use and reasoning. We picked ~20 👀 mind-blowing 👀 papers and summarized their insights and implications. By no means is this post a comprehensive summary of all the achievements made by the community. Feel free to comment on any topics we missed.

Plot made by authors & ChatGPT.

This post is composed of two topics: tool use and reasoning.

  • Tool use is more about how to solve reasoning problems by equipping LLMs with external tools, such as retrievers, search engines, and code interpreters. While tool use is not essential for building strong AI (see Yann LeCun’s classification below), it provides a practical solution to many applications when domain-specific tools are easily accessible.
  • By contrast, reasoning focuses on solving complex problems with the internal reasoning capacities of LLMs. Research on reasoning tries to figure out the limit of the capabilities that LLMs possess and approaches to push that limit.

There isn’t a strict dichotomy between the two topics, as we will see in the rest of this post.

Yann LeCun’s classification of retrieval and reasoning.

Tool Use

In-context learning enables using more tools

➡️ One limitation of LLM tool usage is the need for sufficient human annotations. Whenever we want to teach an LLM to use a tool, we need enough annotated tool calls to finetune the LLM. In the Toolformer paper by Meta, the authors use in-context learning to create a model that annotates tool calls for the input query. This model is then used to generate tool calls on an unlabeled dataset. While the generations may be far from perfect, incorrect calls can be filtered out by executing the tools and checking their outputs against the ground truth answer. The correct calls are collected and used to finetune the model. In this way, we can teach an LLM to use any tool based on a conventional dataset and merely 5 additional annotations — easy work for any engineer.

Automatic annotation of tool calls. Source: Schick et al.
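
To make the annotate-then-filter loop concrete, here is a minimal sketch of the idea. The helpers llm_annotate and execute_tool, and the simplified filtering criterion (checking the tool output against the ground-truth answer), are our assumptions rather than the paper's actual implementation.

```python
def build_tool_dataset(unlabeled_examples, few_shot_demos, llm_annotate, execute_tool):
    """Collect (question, tool call, answer) triples for finetuning, Toolformer-style sketch."""
    finetune_data = []
    for example in unlabeled_examples:
        # In-context learning: the LLM proposes tool calls for this example,
        # conditioned on a handful of annotated demonstrations.
        candidate_calls = llm_annotate(few_shot_demos, example.question)
        for call in candidate_calls:
            result = execute_tool(call)      # e.g. run a calculator or a search engine
            # Simplified filter: keep the call only if its result supports the answer.
            if result is not None and str(example.answer) in str(result):
                finetune_data.append((example.question, call, example.answer))
    return finetune_data  # used to finetune the LLM on tool-augmented text
```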

➡️ Lu et al. proposed Chameleon 🦎 to compose tools for multi-step reasoning. The core idea is to use an LLM to decompose the question into a sequence of tools, and then generate the arguments for each tool call. Both steps are implemented with few-shot prompts. Such an idea is reminiscent of Neural Module Networks (NMNs) from 2016, which decompose a question into subtasks and learn a module for each subtask. The main obstacle of NMNs is that they are hard to train without annotations of the decompositions (see this study). Fortunately, this is not a problem for pretrained LLMs. With in-context learning, Chameleon can generate different compositions of tool calls to solve a problem. A similar idea on visual reasoning won the best paper award at CVPR this year.

Chameleon for multi-step tool use. Source: Lu et al.
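
Here is a rough sketch of what a Chameleon-style planner might look like, assuming a generic llm completion helper and a dictionary of tool callables; it paraphrases the idea rather than reproducing the authors' code.

```python
def chameleon(question, llm, tools, plan_prompt, arg_prompts):
    """Plan a sequence of tools, then call them one by one (sketch of the idea)."""
    # Step 1: few-shot prompt the LLM to decompose the question into a sequence of tools.
    plan = llm(plan_prompt + question).splitlines()     # e.g. ["retrieve", "calculator", "answer"]
    context = question
    for name in plan:
        # Step 2: few-shot prompt the LLM to generate the arguments for this tool call.
        args = llm(arg_prompts[name] + context)
        result = tools[name](args)                      # execute the tool
        context += f"\n[{name}] {result}"               # append the result for later steps
    return context                                      # the last tool usually produces the answer
```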

➡️ While in-context learning is highly efficient compared to traditional methods, it faces certain limitations, such as difficulty in managing a large array of tools. Addressing this, Hao et al. introduced ToolkenGPT, which augments a frozen LLM with new token embeddings specifically for tools, termed “toolkens”. This technique was originally used in multilingual language models to accommodate a new language. ToolkenGPT allows tool calling during inference in the same way as next-token prediction. It demonstrates the capacity to handle over 200 tools while being cost-efficient, establishing a new effectiveness-efficiency trade-off compared to LoRA finetuning. Similar ideas have also been integrated into multi-modal LLMs for robotic actions and image generation.

ToolkenGPT for massive tool use. Source: Hao et al.
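
Below is a minimal PyTorch sketch of the toolken idea. We assume a frozen backbone that returns hidden states and a frozen word head that produces token logits; only the toolken embeddings are trainable. This is our simplification, not the released implementation.

```python
import torch
import torch.nn as nn

class ToolkenHead(nn.Module):
    """Score tools alongside word tokens with trainable toolken embeddings (sketch)."""

    def __init__(self, backbone: nn.Module, word_head: nn.Module,
                 hidden_size: int, num_tools: int):
        super().__init__()
        self.backbone, self.word_head = backbone, word_head
        for p in list(backbone.parameters()) + list(word_head.parameters()):
            p.requires_grad = False  # the LLM itself is never updated
        # One trainable embedding per tool ("toolken"), acting as extra vocabulary.
        self.toolkens = nn.Parameter(torch.randn(num_tools, hidden_size) * 0.02)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)            # (batch, seq, hidden)
        word_logits = self.word_head(hidden)         # ordinary next-token logits
        tool_logits = hidden @ self.toolkens.T       # logits for calling each tool
        # If a toolken wins the softmax during decoding, the model switches to
        # tool mode, generates the arguments, executes the tool, and resumes text.
        return torch.cat([word_logits, tool_logits], dim=-1)
```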

Most used tools: code interpreters and retrievers

If you ask us which tools are most generally applicable to reasoning tasks, we would say they are code interpreters and retrievers. Code interpreters are probably the most expressive environment that humans have invented for logic and computation. Retrievers are a good complement to the parametric knowledge of LLMs when a question or assumed knowledge is out of their training distribution. Let’s see how these tools can be used by LLMs.

➡️ One common failure mode of chain-of-thought (CoT) is that LLMs fail to perform arithmetic operations. In program-aided language models (PAL) and program-of-thoughts (PoT) prompting, the authors prompt a code language model with programs to solve math problems. One may insert standard chain-of-thought texts as comments in the programs. The final answer is then obtained by executing the generated program with a Python interpreter. The insight behind these methods is that the code interpreter provides a perfect tool for all kinds of calculation, reducing failure cases to incorrect reasoning only. Code-style prompts are also commonly used in planning tasks.

Comparison between CoT and PAL. Source: Gao and Madaan et al.
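
To make this concrete, here is an illustrative PAL/PoT-style interaction of our own (not taken from the papers): the chain of thought lives in the comments, and Python, rather than the LLM, does the arithmetic.

```python
prompt = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?

# solution in Python:
"""

# A plausible completion from a code LLM:
generated_program = """
# Roger starts with 5 tennis balls.
tennis_balls = 5
# He buys 2 cans of 3 tennis balls each.
bought_balls = 2 * 3
# The answer is the total number of balls.
answer = tennis_balls + bought_balls
"""

namespace = {}
exec(generated_program, namespace)   # the interpreter, not the LLM, does the math
print(namespace["answer"])           # 11
```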

➡️ Retrievers are commonly used as preprocessing tools for LLMs to augment the question with relevant documents, often referred to as retrieval-augmented generation (RAG). However, when it comes to multi-step question answering, it is challenging to select the correct documents based on the question alone. In IRCoT, proposed by Trivedi et al., the authors interleave thought generation and knowledge retrieval. Whenever the LLM generates a thought sentence, IRCoT uses that sentence to retrieve documents from the corpus. The documents are prepended to the prompt to augment later generations. Even with a weak retriever like BM25, IRCoT outperforms one-step RAG on several open-domain question answering benchmarks.

IRCoT that interleaves CoT and knowledge retrieval. Source: Trivedi et al.
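
A rough sketch of the interleaving loop is shown below, assuming a generic llm helper and a retriever such as a BM25 index; the stopping heuristic is our simplification.

```python
def ircot(question, llm, retriever, max_steps=8, docs_per_step=3):
    """Interleave CoT generation and retrieval, IRCoT-style (sketch)."""
    documents = retriever(question, k=docs_per_step)
    thoughts = []
    for _ in range(max_steps):
        prompt = ("\n".join(documents)
                  + f"\nQ: {question}\nThoughts so far: " + " ".join(thoughts))
        thought = llm(prompt, stop="\n")         # generate one CoT sentence
        thoughts.append(thought)
        if "answer is" in thought.lower():       # simple stopping heuristic
            break
        # Use the freshly generated thought as the query for the next retrieval round.
        for doc in retriever(thought, k=docs_per_step):
            if doc not in documents:
                documents.append(doc)
    return thoughts
```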

➡️ Yang et al. presented a novel usage of RAG for theorem proving. They built a gym-like environment, LeanDojo 🏯, based on the proof assistant Lean. Lean is an interactive programming environment whose compiler can verify whether a written proof actually proves the goal. It also includes numerous proved theorems in its standard libraries, similar to the STL in C++. The cool thing is that because proofs are constructed by decomposing theorems into known premises, theorem proving can benefit from RAG. Given a theorem, we retrieve the relevant premises from the standard libraries and then ask an LLM to generate a proof step. The authors show that RAG requires far fewer training resources and generalizes better to novel premises.

Proof of a simple logical theorem in Lean. Source: Xena Project
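
For readers who have never seen Lean, here is a tiny self-contained example of our own (not from LeanDojo): library lemmas such as Nat.add_zero and Nat.add_comm play the role of retrieved premises, and the compiler checks that each step closes the goal.

```lean
-- Our own toy example: each tactic applies a known premise from the library.
theorem swap_add (a b : Nat) : a + b + 0 = b + a := by
  rw [Nat.add_zero]         -- simplify `+ 0` using a standard-library lemma
  exact Nat.add_comm a b    -- retrieved premise: addition is commutative
```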

➡️ Finally, DSPy by Khattab et al. presents a novel approach to programming LLMs, where the framework can actually improve the prompts over time and automatically combine prompting techniques (CoT, PoT) with retrieval. Further, DSPy introduces teleprompters for optimizing prompts and bootstrapping new ones. It’s hard to fit a description of DSPy in one paragraph — it’s not your average RAG technique, but rather an evolution of it.
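
To give a flavor of what a DSPy program looks like, here is a sketch adapted from DSPy's introductory RAG example; the exact API may differ across versions, and my_metric and trainset are placeholders.

```python
import dspy

class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""
    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)          # retrieval step
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)  # CoT step

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate_answer(context=context, question=question)

# A teleprompter can then "compile" this program, e.g. by bootstrapping few-shot
# demonstrations from a small training set:
# from dspy.teleprompt import BootstrapFewShot
# compiled_rag = BootstrapFewShot(metric=my_metric).compile(RAG(), trainset=trainset)
```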

Let LLMs create their own tools

Tool use has an inherent limitation: it relies on the existence of tools for a specific task. In nature, tool use is not an exclusive skill of humans, as many other animals can also use tools. However, what distinguishes humans from other animals is the ability to create tools. In 2023, we saw a few preliminary works exploring the tool-making abilities of LLMs.

➡️ In LLMs as tool makers (LATM), proposed by Cai et al., the authors prompt an LLM to craft tools in the form of Python functions for a given task. The tools are then verified on a few samples, similar to how engineers solve problems on LeetCode. Once a tool passes the verification test, it is wrapped with a documentation string generated by an LLM to describe its usage. At test time, an LLM is prompted to dispatch the question to one of the tools at hand and to execute the tool according to its documentation. LATM significantly outperforms CoT on a wide range of reasoning tasks in BIG-Bench.
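
A rough sketch of the tool-making and verification loop, under our own simplifying assumptions (a generic llm helper and a solve(inputs) interface), might look like this:

```python
def make_tool(task_description, train_examples, llm, max_attempts=3):
    """Ask an LLM to write a reusable tool and verify it on a few samples (sketch)."""
    for _ in range(max_attempts):
        # Tool making: request a reusable Python function for the task.
        code = llm(f"Write a Python function `solve(inputs)` that solves:\n{task_description}")
        namespace = {}
        try:
            exec(code, namespace)
            # Tool verification: the candidate must pass every training sample.
            if all(namespace["solve"](x) == y for x, y in train_examples):
                doc = llm(f"Write a short usage description for this function:\n{code}")
                return code, doc
        except Exception:
            pass  # syntax or runtime error: sample a new candidate tool
    return None

# At test time, a (possibly cheaper) LLM reads the usage descriptions, dispatches
# the question to one of the verified tools, and executes it to get the answer.
```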

➡️ Voyager brought the idea of tool making to the world of Minecraft and achieved incredible results. The core idea of Voyager is to use an LLM to propose tasks based on existing skills and the world state. The LLM is then prompted to synthesize code (i.e. skills) to solve those tasks. The skills are refined based on environment feedback, and mastered skills are committed to an external memory. Because new skills are built on top of existing skills, this significantly reduces the difficulty of learning a complex skill (e.g. crafting a diamond tool in Minecraft). While the idea of learning a library of skills can be traced back to DreamCoder, Voyager demonstrates the superiority of GPT-4 in searching over skills in a challenging open-world game. Take a look at the fancy demos from the paper!

Minecraft items and skills discovered over time. Source: Wang et al.

➡️ Both of the above works craft tools as code. In fact, tools can be expressed in natural language, too. (shameless self-promotion) In the hypothesis-to-theories (HtT) work by Zhu et al., the authors show that we can use LLMs to induce a library of textual rules from a standard multi-step reasoning training set. The insight is that among all the rules LLMs produce for different samples, rules that occur more often and lead to correct answers more often are likely to be correct. We then collect the rules and prepend them to a standard CoT prompt to perform deduction and get the answer. One interesting aspect of HtT is that it can be viewed as a novel way of learning: instead of learning model parameters, we learn a library of rules, which works well with black-box LLMs.

HtT that learns textual rules for multi-hop reasoning. Source: Zhu et al.
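
Here is a simplified sketch of the induction stage, under our own interface assumption that llm_solve_with_rules returns the rules used and the predicted answer for one training question; it is a paraphrase of the idea, not the paper's implementation.

```python
from collections import Counter

def induce_rule_library(train_set, llm_solve_with_rules, top_k=50):
    """Induce a library of textual rules from a training set (simplified HtT sketch)."""
    occurrences, correct = Counter(), Counter()
    for question, answer in train_set:
        rules, prediction = llm_solve_with_rules(question)   # rules used in this CoT
        for rule in rules:                                   # e.g. "5 + 9 = E" in base-16
            occurrences[rule] += 1
            if prediction == answer:
                correct[rule] += 1
    # Keep rules that occur often enough and mostly lead to correct answers.
    scored = [(rule, correct[rule] / occurrences[rule])
              for rule in occurrences if occurrences[rule] >= 2]
    library = [rule for rule, acc in sorted(scored, key=lambda x: -x[1])[:top_k]]
    return library   # prepended to a CoT prompt for the deduction stage
```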

Reasoning

Planning

One drawback of CoT-style reasoning is that LLMs have to greedily decode a path towards an answer. This is problematic for complex problems like math questions or games, since it is hard to predict a path without trial and error. In 2023, the community made some progress on this issue with new frameworks that enable planning with LLMs.

➡️ If we conceptualize CoT as “system 1” reasoning — characterized by its automatic, unconscious nature — a question arises: is it feasible to replicate the more deliberate “system 2” reasoning of humans using LLMs? Two methodologies tackle this question: reasoning-via-planning (RAP) and tree-of-thoughts (ToT). Both empower LLMs to navigate through possible reasoning steps and to search for the optimal reasoning chain based on specific evaluations. RAP additionally prompts an LLM as a “world model” that predicts the next state following an action. This enables the LLM to operate within a self-simulated world, as opposed to interacting with an external environment. Both algorithms are now available in the LLM Reasoners library!

RAP that repurposes LLMs as an agent and a world model. Source: Hao et al.
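
Below is a minimal beam search over reasoning steps in the spirit of ToT and RAP; propose_steps and score_state stand in for the LLM calls that expand and evaluate partial chains. The papers use more elaborate search procedures (e.g. BFS/DFS in ToT, MCTS in RAP), so treat this as a sketch of the general shape.

```python
import heapq

def search_reasoning_chain(question, propose_steps, score_state, beam_size=3, depth=4):
    """Beam search over LLM-proposed reasoning steps (simplified ToT/RAP-style sketch)."""
    beam = [(0.0, [question])]                        # (score, partial chain)
    for _ in range(depth):
        candidates = []
        for _, chain in beam:
            for step in propose_steps(chain):         # LLM samples candidate next thoughts
                new_chain = chain + [step]
                # The LLM (or a world model) evaluates the partial chain.
                candidates.append((score_state(new_chain), new_chain))
        if not candidates:
            break
        beam = heapq.nlargest(beam_size, candidates, key=lambda x: x[0])
    return max(beam, key=lambda x: x[0])[1]           # best reasoning chain found
```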

Self series

The self series is a family of techniques that replace human effort with LLM predictions in the loop of LLM development. The year 2023 witnessed quite a few papers on this track. Let’s take a closer look at some representative works.

➡️ Many people have had the experience that ChatGPT doesn’t provide the desired output on the first try, and this can sometimes be fixed by pointing out its mistake. Self-debugging and self-refinement automate this procedure by replacing human feedback with machine feedback. The feedback comes either from a program executor or from an LLM that compares the generation with the explanation of the problem. One key observation is that the performance of self-refinement depends on the quality of the feedback: stronger base models that provide better feedback benefit more. Such iterative refinement methods have also been shown to be super effective in pose estimation and protein structure prediction, where it is difficult to predict the structure in a single run.

Illustration of Self-Debugging. Source: Chen et al.
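
A rough sketch of the loop, with an assumed llm helper and a run_tests executor providing machine feedback:

```python
def self_debug(problem, llm, run_tests, max_rounds=5):
    """Draft code, then iteratively revise it from executor feedback (sketch)."""
    code = llm(f"Write a Python solution for:\n{problem}")
    for _ in range(max_rounds):
        ok, feedback = run_tests(code)      # machine feedback from the executor
        if ok:
            return code
        # Feed the error message or failing test back to the model and retry.
        code = llm(f"Problem:\n{problem}\n\nCurrent code:\n{code}\n\n"
                   f"Feedback:\n{feedback}\n\nFix the code.")
    return code
```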

➡️ In the memory-of-thought (MoT) framework from Li and Qiu, the authors ask an LLM to generate CoT rationales on an unlabeled dataset and use them for RAG. You may ask how this can be useful given that the generated rationales often contain errors. The key trick is to filter the rationales based on majority vote or entropy minimization (a similar idea is used in Wan et al. to filter rationales). Once we have good rationales on the unlabeled dataset, we dynamically retrieve few-shot examples based on the test question, which is shown to be much better than fixed few-shot examples. MoT can be interpreted as converting a parametric model to a non-parametric model without additional supervision.

MoT that generates and recalls memory. Source: Li and Qiu.
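
Here is a simplified sketch of the two stages, under our own interface assumptions (llm_cot returns a rationale and an answer; nearest is any off-the-shelf retriever over the memory):

```python
from collections import Counter

def build_memory(unlabeled_questions, llm_cot, num_samples=8):
    """Generate rationales on unlabeled data and keep majority-voted ones (sketch)."""
    memory = []
    for question in unlabeled_questions:
        samples = [llm_cot(question) for _ in range(num_samples)]   # (rationale, answer) pairs
        majority = Counter(a for _, a in samples).most_common(1)[0][0]
        # Keep a rationale only if its answer agrees with the majority vote.
        kept = [r for r, a in samples if a == majority]
        if kept:
            memory.append((question, kept[0], majority))
    return memory

def answer_with_memory(test_question, memory, llm_cot, nearest, k=4):
    """Retrieve similar memorized cases as dynamic few-shot exemplars (sketch)."""
    exemplars = nearest(test_question, memory, k)
    prompt = "\n\n".join(f"Q: {q}\n{r}\nThe answer is {a}." for q, r, a in exemplars)
    _, answer = llm_cot(prompt + f"\n\nQ: {test_question}")
    return answer
```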

➡️ Going beyond MoT, Yasunaga et al. proposed analogical prompting, which eliminates the need to precompute rationales on an unlabeled dataset. Analogical prompting asks an LLM to recall relevant exemplars based on the question, thereby generating dynamic few-shot exemplars from scratch. In fact, the authors found that analogical prompting is an emergent ability of large language models, similar to previous findings in open-domain question answering. Larger-scale LLMs can self-generate better exemplars than standard RAG solutions. Besides, this work provides a cool trick to fuse multi-step generations into a single prompt with markdown grammar — a godsend for prompt engineers with a tight budget! 💡

Analogical prompting. Source: Yasunaga et al.
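
The single-call prompt might look roughly like the following; this is our paraphrase of the idea, not the paper's exact wording.

```python
# Our paraphrase of an analogical prompt: the model first recalls exemplars,
# then solves the problem, all in one call, with markdown headings separating
# the self-generated stages.
analogical_prompt = """Problem: {question}

# Instructions
## Relevant problems
Recall three relevant and distinct problems. For each, describe the problem and
explain its solution.
## Solve the initial problem
Using the lessons from the relevant problems, solve the initial problem step by
step, and state the final answer.
"""
```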

➡️ Are self-refine and self-generate the limit of LLM reasoning? Yang et al. show a more advanced usage of the reasoning abilities of LLMs — to optimize a prompt based on the history of generated prompts. This is a cool reinvention of the famous meta-learning paper “Learning to learn by gradient descent by gradient descent”, but all the steps here are performed by LLMs on text. At each step, an LLM is prompted with previous solutions and corresponding performance metrics and tries to predict a new solution. Notably, even without telling the LLM how to perform optimization, the LLM can gradually find better solutions that maximize the metric. Maybe this work brings prompt engineers one step closer to unemployment?

Performance of prompts optimized by LLM. Source: Yang et al.
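
A rough sketch of the optimization loop, assuming a generic llm helper and an evaluate function that scores a prompt on a validation set:

```python
def optimize_prompt(seed_prompt, llm, evaluate, steps=20):
    """Let an LLM propose better prompts from a scored history (OPRO-style sketch)."""
    history = [(seed_prompt, evaluate(seed_prompt))]
    for _ in range(steps):
        # Show previous prompts sorted by score (ascending), then ask for a better one.
        trajectory = "\n".join(f"prompt: {p}\nscore: {s:.1f}"
                               for p, s in sorted(history, key=lambda x: x[1]))
        candidate = llm("Below are prompts with their scores, in ascending order:\n"
                        + trajectory
                        + "\nWrite a new prompt that achieves a higher score.")
        history.append((candidate, evaluate(candidate)))
    return max(history, key=lambda x: x[1])[0]
```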

🔁 Probably the most eye-opening 👀 work in the self series is the self-taught optimizer (STOP) by Zelikman et al. We know LLMs are guided by textual prompts, take text as input and output text. While these texts are usually treated as separate variables, what happens if we model them as a single variable? In STOP, the authors draw inspiration from self-modifying code and use a self-improvement prompt to improve itself.

The seed improver that improves itself in STOP. Source: Zelikman et al.
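
To illustrate the self-referential trick, here is a rough sketch in the spirit of the seed improver; the function names and the utility interface are our assumptions, not the paper's exact seed program.

```python
def improve(program: str, utility, llm, num_candidates: int = 4) -> str:
    """Ask the LLM for improved versions of `program` and keep the best one (sketch)."""
    candidates = [program]
    for _ in range(num_candidates):
        candidates.append(llm(
            "Improve the following program so that it scores higher on its task:\n"
            + program))
    return max(candidates, key=utility)

# The self-referential twist: `improve` is itself a piece of source code, so it can
# be handed to itself (e.g. via inspect.getsource) and iteratively rewritten into a
# better improver, judged by a meta-utility that measures how well it improves programs.
```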

While the seed prompt is no more complicated than a random search algorithm, with a strong LLM one can discover many advanced meta-heuristic algorithms. Interestingly, GPT-4 discovers several prompting strategies that were published after its training cutoff, including ToT and Parsel. It seems the day when LLMs conduct research by themselves is approaching. One step in this direction is a recent work by Huang et al. showing that LLMs are capable of designing ML models for common benchmarks and even Kaggle challenges.

Algorithms found by STOP. Source: Zelikman et al.

Evaluations and observations

➡️ Kandpal et al. conducted a systematic study on the memorization ability of LLMs. They asked LLMs factual questions from Wikipedia and found that accuracy is highly correlated with the frequency of the questioned entities in the pretraining documents, regardless of the scale of the model. By extrapolating the trend, the authors estimate that a model with 10¹⁸ parameters is needed to match human performance on long-tail entities — way bigger than today’s LLMs. Hence an important takeaway is to use LLM reasoning for tasks related to frequent knowledge, and to consider RAG or other tools for tasks related to long-tail knowledge.

LLMs can hardly memorize long-tail knowledge. Source: Kandpal et al.

➡️ As the community tries to build bigger mixtures for training LLMs, one concern is that LLMs may not learn to actually reason but simply memorize solutions from the training distribution, just like students taught to the test. Wu et al. address this concern by comparing the performance of GPT-4 with zero-shot CoT on 11 different tasks, each with a default setting and a counterfactual setting. They observe that although LLMs perform better than random in the counterfactual settings, their performance consistently lags behind that in the default settings. It remains an open question how we can train models to focus more on reasoning rather than memorization.

GPT-4 underperforms on counterfactual variants. Source: Wu et al.

➡️ Saparov et al. extended the synthetic dataset PrOntoQA to an OOD setting to test the generalization ability of LLMs on deductive reasoning with controlled depth, width, compositional structure, etc. The authors found that CoT can generalize to compositional and longer proofs. This contrasts with previous conclusions on compositional semantic parsing, possibly because deductive reasoning only requires composing deduction steps, while semantic parsing additionally deals with growing outputs. While LLMs are able to use most deduction rules, they require explicit demonstrations of proof by cases and proof by contradiction. There are also counterintuitive qualitative differences between in-context learning and supervised learning.

OOD generalization over deductive reasoning. Source: Saparov et al.

➡️ Regarding the parametric knowledge in LLMs, Berglund et al. found a phenomenon they call the reversal curse: LLMs trained to memorize “A is B” do not know that “B is A” in closed-book question answering, despite the fact that they can be prompted to perform deductive reasoning. This indicates that LLMs lack certain kinds of symmetry in their parametric knowledge, and it is crucial to endow them with such symmetry to enable better generalization. Actually, the knowledge graph community has been a leader in this area, with works like double permutation equivariance and relational rotation. It would be interesting to see how these ideas are adapted to LLMs.

What needs to be solved in 2024?

2023 has been an exciting year for tool use and reasoning, and we expect the new year to be more exciting. Let’s wrap up this post with predictions from the authors.

Zhaocheng Zhu:

1️⃣ Reasoning with LLMs still requires ad-hoc engineering effort for each specific task. By contrast, once humans acquire the skills for a task, they can quickly adapt those skills to similar tasks with very few or even no samples (e.g. from chess to poker). If we can create LLM solutions that generalize across tasks, it will save a lot of engineering effort and boost performance in low-resource domains.

2️⃣ Solving reasoning problems usually involves a lot of commonsense knowledge, ranging from math and physics to strategies like enumeration and proof by contradiction. While LLMs may have obtained such knowledge from their training data, we lack precise control over the parametric knowledge in LLMs. We would like to see new studies on the knowledge representations of LLMs, and techniques that verbalize, inject or delete knowledge in LLMs.

Michael Galkin:

1️⃣ In 2023, we saw an increasing effort to understand the basic principles of what can be learned by Transformer-based LLMs — can we actually expect LLMs to solve arbitrary reasoning tasks? A few famous papers like faith and fate and on length generalization suggest that the autoregressive nature of LLMs might not be the optimal way to approach complex reasoning. In 2024, I’d expect more efforts on understanding algorithmic alignment with LLMs.

2️⃣ It is likely that in 2024, most open and closed foundation models will be multi-modal, supporting vision, text, audio, and other inputs. Incorporating other modalities into reasoning is the natural next step.

Abulhair Saparov:

1️⃣ I anticipate there will be more efforts to find a mechanistic understanding of reasoning in LLMs. What algorithms do they use when performing reasoning tasks? More precisely, to what extent do they exploit shortcuts or heuristics that hurt robustness/generalization?

2️⃣ Relatedly, I would expect researchers to make progress in answering whether increasing the scale of LLMs and/or their training will resolve their limitations in reasoning, or whether these limitations are fundamental, e.g. inherent to the architecture.

Shibo Hao:

1️⃣ Over the past year, the primary focus of LLM reasoning research has been on prompting and supervised fine-tuning, with some approaches, like STaR, Reflexion, and RAP, already drawing inspiration from RL. However, we have yet to witness a breakthrough method that effectively employs RL to enhance an LLM’s reasoning capabilities, especially when compared to the advancements in RLHF for alignment.

2️⃣ On the flip side, in the future, language could become the primary medium of expression in RL systems. The key advantage lies in the rich information carried by language compared with traditional scalar rewards and values. The prospect of an LLM agent that can autonomously improve its reasoning abilities with RL (no need for supervision data or prompt engineering) is not only exciting but may also indicate a significant leap towards AGI.

Yihong Chen:

1️⃣ Structured & unstructured. I assume LLMs will gradually eat into the pie of traditional products, which are mostly based on large databases, rules and piles of small classifiers. In this case, what we are referring to as “LLM reasoning” is probably the hope that we will have a method “X” that can bridge the structured world, where most product data currently lives, and the unstructured world, where most LLMs live. Knowledge graphs more or less champion the structured world, and there has been fruitful research on how to reason well over a knowledge graph, while LLMs champion the unstructured world, though we are still unclear about how they do the reasoning. They have different advantages and limitations. I’d expect that a nice bridge between the two would lead us to more pragmatic solutions for products.

2️⃣ Sample efficiency. As Zhaocheng mentioned, current LLM reasoning struggles to generalize across a large number of tasks. Instead of ad-hoc efforts, which are usually customized for specific problems, I would be interested in whether we can simply pretrain an LLM that generalizes with less data, similar to what’s done for generalizing across multiple languages.

3️⃣ Reasoning inside LLMs. As Abulhair and Michael mentioned, the community does not have a crystal-clear understanding of how LLMs perform reasoning, if they are indeed reasoning at all. I’d expect more efforts on reverse-engineering LLMs’ reasoning processes, either through mechanistic interpretability or other interpretability approaches.

Meme Time

Following the tradition of Michael Galkin, no blog post is truly complete without a meme. DALL·E 3 would almost be a meme wizard, if only it could spell words correctly. Guess what prompts were used for each panel?

What an LLM learned and what it can reason about. Image by authors & DALL·E 3.

Read More

If this blog left you wanting to learn more about LLM reasoning, take a look at the following awesome blog posts.
