
Large language models (LLMs) have shown astonishing progress in natural language understanding through few-shot prompting, in which a model accomplishes difficult tasks after seeing only a few examples that demonstrate how to solve a given problem. However, the same LLMs often stumble on tasks that require complex or multi-step logic (e.g., the BIG-Bench Hard benchmark) and have difficulty propagating rules or constraints to subsequent steps. For humans, these kinds of tasks require logical deduction and reasoning. Although these models are incapable of either (in a human sense), researchers at Microsoft hope to make LLMs increasingly better at exhibiting such behavior. To that end, Xu et al. propose "Reprompting," an automated approach to prompt optimization for multi-step problem-solving.
Prior research involving engineered prompts has shown that providing LLMs with chain-of-thought (CoT) prompting can improve performance along dimensions like deduction and perceived reasoning. Chain-of-thought prompting is a technique that enables large language models to tackle complex arithmetic and symbolic reasoning tasks by guiding a model with intermediate steps (Wei et al., 2022).
As an evolution of CoT, this research introduces Reprompting, an iterative sampling algorithm that automatically discovers the most effective CoT prompts for a model from a given set of question-answer pairs (i.e., few-shot in-context examples). The research promises to improve performance for state-of-the-art LLMs and to transfer gains from one model to the next (Xu et al., 2023). However, before diving into a deconstruction of Reprompting, we should highlight a few of the concepts that led to this novel approach.
Few-Shot Prompting
In practice, the concept of few-shot prompting (i.e., in-context learning) is straightforward. Supplying prompts that contain example questions along with their correct answers allows the model to pick up both the task context and the expected answer format at once. As a result, LLMs generalize better and adapt to new tasks more efficiently, requiring relatively little input and supervision compared to traditional (and often costly) fine-tuning (i.e., additional supervised model training).
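To make this concrete, here is a minimal sketch of how such a prompt might be assembled; the arithmetic task and the Q/A format are invented for illustration and are not tied to any particular model or API.

```python
# A few-shot prompt is just text: a handful of worked question-answer
# exemplars followed by the new question we want answered. The task and
# examples below are invented purely for illustration.

exemplars = [
    ("What is 17 + 25?", "42"),
    ("What is 9 * 8?", "72"),
]

new_question = "What is 31 + 46?"

prompt = ""
for question, answer in exemplars:
    prompt += f"Q: {question}\nA: {answer}\n\n"
prompt += f"Q: {new_question}\nA:"

print(prompt)  # an LLM would continue this text, ideally producing "77"
```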

A standard LLM is pre-trained to optimize the probability of generating the correct next token (word or subword) in a sequence given the context (Brown et al., 2020). Generally, the model learns an approximate probability distribution P(y|x) of the next token y given the context x.
Additionally, the model can be conditioned on a tokenized sequence containing the example question and answer pairs. Then, during inference, the model uses its learned parameters θ to generate an output sequence of tokens y* by conditioning on the additional context from the exemplar Exmp:
P(y_t | y_1, …, y_{t-1}, Exmp; θ)
where P gives the probability distribution for the t-th output token y_t, conditioned on the previously generated tokens (y_1, …, y_{t-1}) and the exemplar sequence Exmp. Typically, at inference, autoregressive transformers sample a token y_t from this distribution at each step, and the process repeats (token by token) until the model generates a stop token or reaches a predefined maximum output length, yielding a response that applies the context and constraints learned from the provided examples (Wei et al., 2022; Vaswani et al., 2017; Xu et al., 2023).
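The sketch below imitates this token-by-token decoding loop conditioned on an exemplar prefix; `next_token_distribution` is a hypothetical stand-in for the model's learned distribution, not a real API.

```python
import random

# Illustrative sketch of autoregressive decoding as described above.
# `next_token_distribution` stands in for P(y_t | y_1, …, y_{t-1}, Exmp; θ);
# a real model would compute it with a transformer forward pass.

def next_token_distribution(exemplar_tokens, generated_tokens):
    # Toy distribution over a tiny vocabulary, fixed for this sketch.
    return {"yes": 0.5, "no": 0.3, "<stop>": 0.2}

def generate(exemplar_tokens, max_len=20):
    generated = []
    for _ in range(max_len):
        dist = next_token_distribution(exemplar_tokens, generated)
        tokens, probs = zip(*dist.items())
        token = random.choices(tokens, weights=probs, k=1)[0]
        if token == "<stop>":       # stop token ends the generation loop
            break
        generated.append(token)     # the next step conditions on this token
    return generated

print(generate(["Q:", "…", "A:"]))
```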
Chain-of-Thought Prompting
Chain-of-thought prompting evolves the idea of few-shot prompting, concentrating on tasks that require multi-step logic by guiding the model through a sequence of intermediate logical steps. This approach emulates human-like problem-solving and, in some ways, common-sense reasoning (Wei et al., 2022). Each generated token y_t now becomes part of the larger formulation needed to answer correctly, which enables the model to solve the given problem, and others like it, more efficiently. An oversimplified formulation of inference with CoT is as follows:
P(y_t | y_1, …, y_{t-1}, {Exmp_1, Exmp_2, …, Exmp_N}; θ)
where the model generates token y_t by also conditioning on the concatenated tokenized exemplar sequences {Exmp_1, Exmp_2, …, Exmp_N}, each containing distinct intermediate steps (as illustrated).
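The difference from plain few-shot prompting lies entirely in how the exemplars are written. Below is a small sketch, with invented exemplars, of assembling such a prompt.

```python
# Sketch of a chain-of-thought prompt: each exemplar now contains
# intermediate reasoning steps before the final answer. The exemplar
# content below is invented for illustration.

cot_exemplars = [
    {
        "question": "A train travels 60 km in 1.5 hours. What is its speed?",
        "steps": "Speed = distance / time = 60 / 1.5 = 40.",
        "answer": "40 km/h",
    },
    {
        "question": "If 3 pens cost $6, how much do 7 pens cost?",
        "steps": "One pen costs 6 / 3 = 2 dollars, so 7 pens cost 7 * 2 = 14 dollars.",
        "answer": "$14",
    },
]

new_question = "A car uses 8 liters of fuel per 100 km. How much fuel does it need for 250 km?"

prompt = ""
for ex in cot_exemplars:
    prompt += f"Q: {ex['question']}\nReasoning: {ex['steps']}\nA: {ex['answer']}\n\n"
prompt += f"Q: {new_question}\nReasoning:"

print(prompt)  # the model is nudged to produce its own intermediate steps
```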

Reprompting
With that context, we can discuss the proposed Reprompting, an iterative sampling algorithm that automatically discovers effective CoT prompts without human intervention. The primary goal of the algorithm is to infer a set of "recipes" that consistently perform well as few-shot examples for solving problems that typically require deductive reasoning.
The researchers primarily focus on the problem of resampling from the joint distribution of chain-of-thought recipes. Recall that, at inference, the model samples the next token y_t from the probability distribution at each step until it reaches a stopping condition. With CoT, however, the model is effectively sampling from a joint distribution that combines its learned probabilities with the contextual information provided by the CoT prompt. Because this distribution cannot be characterized directly, the researchers employ a Gibbs sampling strategy to approximate it (Xu et al., 2023). In this way, the sampling process is influenced both by the previously generated tokens and by the prompts designed to guide subsequent token generation. With each iteration, the algorithm optimizes for solutions from the training set that serve as effective CoT recipes for solving test-set problems.
An Aside on Gibbs Sampling
The Gibbs sampler (introduced by Geman & Geman in 1984) provides an alternative approach for obtaining marginal distributional characteristics (e.g., mean or variance) when direct calculation is difficult. For example, given a joint distribution f(x, y_1, …, y_n), instead of computing the marginal f(x) directly, the Gibbs sampler generates samples from f(x) without requiring its explicit form. After generating a sufficiently large sample, the Gibbs strategy can approximate the marginal distribution without ever computing f(x) (Casella & George, 1992; Geman & Geman, 1984).
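As a toy illustration (not tied to Reprompting), the sketch below runs a Gibbs sampler on a standard bivariate normal with correlation ρ = 0.8: each full-conditional draw is simple, yet the resulting samples approximate the marginal of x without its explicit form ever being written down.

```python
import random

# Gibbs sampling for a standard bivariate normal with correlation rho.
# We never use the marginal f(x) directly; we only alternate draws from
# the two full conditionals, then summarize the resulting x samples.

rho = 0.8
x, y = 0.0, 0.0
samples = []

for i in range(20000):
    x = random.gauss(rho * y, (1 - rho ** 2) ** 0.5)   # draw x | y
    y = random.gauss(rho * x, (1 - rho ** 2) ** 0.5)   # draw y | x
    if i >= 2000:                                      # discard burn-in
        samples.append(x)

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"estimated marginal mean ≈ {mean:.3f}, variance ≈ {var:.3f}")
# The true marginal of x here is N(0, 1), so both estimates should land
# close to 0 and 1 respectively.
```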
Automatic Discovery of CoT Recipes
Reprompting uses Gibbs sampling to approximate the joint distribution of CoT recipes that perform well on problems that, for humans, require logical deduction. The process initially samples recipes via zero-shot prompting and then iteratively resamples recipes by concatenating a few prior recipes as the prompt, eventually converging on a set of recipes that share similar chains of thought and include intermediate instructions or a step-by-step formulation of the problem. Xu et al. characterize the algorithm as follows:

Ideally, the algorithm converges such that the probability of generating a step-by-step solution z_j followed by the correct answer y_j is high and largely agnostic to the choice of S_j, where S_j is a subset of indices selecting the CoT recipe tuples {x_i, z_i, y_i} used in the prompt:
p_LLM(z_j, y_j | {x_i, z_i, y_i}_{i ∈ S_j}, x_j, m)
This will result in a set of {z_j} that works as prompts for solving similar problems in the test set (Xu et al., 2023).
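Putting the pieces together, here is a simplified sketch of the Reprompting loop described above; `llm_generate` is a hypothetical prompt-to-text callable for whatever model you use, the zero-shot wording is illustrative, and the actual acceptance/rejection details of Xu et al. (2023) are omitted.

```python
import random

# Simplified sketch of the Reprompting loop: zero-shot initialization,
# then Gibbs-style resampling of one recipe at a time, conditioned on a
# random subset of the other recipes used as CoT exemplars.

def reprompt(train_set, llm_generate, iterations=1000, k=3):
    # train_set: list of (question x_j, known correct answer y_j) pairs.
    # recipes[j] holds the current step-by-step solution z_j for question j.

    # Initialization: sample each recipe by zero-shot prompting.
    recipes = {
        j: llm_generate(f"{x}\nLet's think step by step.")
        for j, (x, _) in enumerate(train_set)
    }

    # Iteration: resample recipe j conditioned on a random subset S_j
    # of the other recipes concatenated into the prompt.
    for _ in range(iterations):
        j = random.randrange(len(train_set))
        x_j, y_j = train_set[j]
        s_j = random.sample([i for i in recipes if i != j], k)
        prompt = "".join(
            f"Q: {train_set[i][0]}\n{recipes[i]}\n\n" for i in s_j
        ) + f"Q: {x_j}\n"
        z_j = llm_generate(prompt)
        if y_j in z_j:          # keep the resampled recipe only if it reaches
            recipes[j] = z_j    # the correct answer for this training problem
    return recipes              # recipes that survive become test-time prompts
```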
Combining Models
Additionally, Reprompting facilitates combining models by utilizing different LLMs for initialization and sampling. Empirically, using ChatGPT to generate initial recipe samples for InstructGPT led to meaningful improvement compared to using InstructGPT or ChatGPT alone on specific tasks. However, results also indicated that performant CoT recipes for one model could perform poorly on another, despite the latter achieving similar performance using human-optimized prompts. This suggests that CoT recipes must be composed with model combinations in mind.
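A rough sketch of how that combination might be wired, reusing the `reprompt` sketch above; `chatgpt` and `instructgpt` in the usage comment are hypothetical prompt-to-text wrappers, and the routing heuristic is specific to this sketch.

```python
# Routes the two phases of the sketched algorithm to different models:
# one writes the zero-shot initial recipes, the other performs the
# iterative resampling. Both callables are hypothetical stand-ins.

def make_combined_llm(init_llm, sample_llm):
    def combined(prompt):
        # In the sketch above, only resampling prompts begin with exemplars
        # ("Q: ..."), so this crude check separates the two phases.
        is_zero_shot = not prompt.startswith("Q:")
        return (init_llm if is_zero_shot else sample_llm)(prompt)
    return combined

# Usage (both wrappers are placeholders for real model clients):
# recipes = reprompt(train_set, make_combined_llm(chatgpt, instructgpt))
```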
Benchmark Results
A comparison of Reprompting against prior state-of-the-art prompting techniques confirms that, with Reprompting, LLMs can achieve better performance (without human intervention) than with existing chain-of-thought prompts. For example, Reprompting combined with ChatGPT often achieves higher scores across tasks than human-written CoT prompts (Suzgun et al., 2022).

In practice, we can observe the evolution of CoT recipes through Reprompting as follows:

Initially, ChatGPT prioritizes constraints, focusing on absolute ranking positions first (in dark blue). Next, the model attempts to deduce objects at specific positions but makes a mistake (in red). However, the recipe still provides a helpful strategy for solving similar problems. When applied to a new problem, the model adopts the same reordering strategy and proposes an alternative method to handle constraints (in orange). Despite some errors, this recipe improves the solution for this specific problem. Finally, when used as a new prompt, the model follows the same formula and correctly deduces the answer for a new problem.
The introduction of Reprompting likely marks another milestone in the development of large language models, particularly for tasks requiring multi-step logic and constraint propagation. By leveraging chain-of-thought prompting and Gibbs sampling, Reprompting can automatically discover effective CoT prompts without human intervention. As a result, LLMs can achieve better performance on complex tasks than with zero-shot or traditional few-shot prompting techniques. Additionally, with optimization, Reprompting has shown the potential to transfer gains between different LLMs. Ultimately, this approach may bring us closer to LLMs that exhibit human-like logical deduction and a semblance of reasoning.
References
Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, et al. 2020. Language Models are Few-Shot Learners. arXiv [csCL]. http://arxiv.org/abs/2005.14165
Casella G, George EI. 1992. Explaining the Gibbs Sampler. Duke.edu. [accessed 2023 May 29]. http://www2.stat.duke.edu/~scs/Courses/Stat376/Papers/Basic/CasellaGeorge1992.pdf
Geman S, Geman D. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans Pattern Anal Mach Intell. PAMI-6(6):721–741. doi:10.1109/tpami.1984.4767596. [accessed 2023 May 29]. http://image.diku.dk/imagecanon/material/GemanPAMI84.pdf
Suzgun M, Scales N, Schärli N, Gehrmann S, Tay Y, Chung HW, Chowdhery A, Le QV, Chi EH, Zhou D, et al. 2022. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv [csCL]. http://arxiv.org/abs/2210.09261
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. 2017. Attention is all you need. arXiv [csCL]. http://arxiv.org/abs/1706.03762
Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, Chi E, Le Q, Zhou D. 2022. Chain-of-thought prompting elicits reasoning in large language models. arXiv [csCL]. http://arxiv.org/abs/2201.11903
Xu W, Banburski-Fahey A, Jojic N. 2023. Reprompting: Automated Chain-of-Thought prompt inference through Gibbs sampling. arXiv [csLG]. http://arxiv.org/abs/2305.09993