Prompt Engineering for Arithmetic Reasoning Problems

Explore various prompt engineering techniques for arithmetic reasoning problems, best practices, and rapid experimentation for production-grade prompts with Vellum.ai

Kaustubh Bhavsar
Towards Data Science


Architecture of four different prompting techniques: Input-Output, Chain-of-Thought (CoT), Self-Consistency with Chain-of-Thought (CoT), and Tree-of-Thought (ToT) (Image from Yao et al. (2023))

Introduction

Large Language Models (LLMs) have been drawing increasing attention from both academic researchers and industry experts due to their proficiency in understanding and generating language. Their text comprehension skills stem from their training process, which involves exposure to vast amounts of data with the primary goal of predicting subsequent words. To adapt these models to specific tasks, they need additional steering. This can be achieved through two methods: ‘pre-training and fine-tuning’ or ‘prompting’.

In the conventional ‘pre-training and fine-tuning’ approach, the LLM is fine-tuned on a dataset pertinent to the tasks it will perform later, updating its parameters during the fine-tuning phase. Conversely, ‘prompting’ guides the model with a text segment to execute the task, without updating any parameters.

A prompt is an input provided by the user that the model is designed to respond to. A prompt can contain an instruction, context, input data (such as a question), or an output indicator. Prompt engineering is an emerging field dedicated to the development and refinement of prompts for the effective utilization of language models.

However, a significant challenge lies in ensuring the model can tackle problems necessitating arithmetic and commonsense reasoning. In this article, our focus is specifically on the engineering of prompts for arithmetic problems.

Prerequisites

No prior knowledge is required. All the demonstrations provided below can be executed either in the OpenAI Playground or through the OpenAI API. Although this article primarily utilizes OpenAI models, the techniques themselves are model-agnostic, and you’re free to apply them with any LLM available in the market.
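If you prefer to script these experiments rather than use the Playground, a small helper around the OpenAI completions endpoint is enough to reproduce every example in this article. The sketch below is illustrative: it assumes the openai Python package (v1.x), an OPENAI_API_KEY environment variable, and a helper name (complete) chosen only for readability.

from openai import OpenAI

# Assumes OPENAI_API_KEY is set in the environment.
client = OpenAI()

def complete(prompt, model="gpt-3.5-turbo-instruct"):
    """Send a plain-text prompt to a completion model and return its text."""
    response = client.completions.create(
        model=model,
        prompt=prompt,
        max_tokens=256,  # leave room for step-by-step reasoning
    )
    return response.choices[0].text.strip()

print(complete("Q: What is 12 * 7?\nA:"))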

Prompt Engineering for Arithmetic Problems

All the problems mentioned henceforth are taken from the GSM8K dataset and are tested using OpenAI’s GPT-3.5 Turbo Instruct model with default configurations.

We will be testing various techniques specifically on the following arithmetic problem:

Jennifer purchased 40 cans of milk at the store before meeting her 
classmate Mark, who was also buying milk. Jennifer bought 6 additional
cans for every 5 cans Mark bought. If Mark purchased 50 cans, how many
cans of milk did Jennifer bring home from the store?

The correct answer to the above problem is 100 cans.

Zero-Shot Prompting

A shot essentially refers to an example. Zero-shot is a basic prompting technique in which a question is posed to the model without providing any demonstrations. This technique, in general, produces favourable results in larger models that have been trained on extensive amounts of data.

Prompt:

Q: {{ question }}
A:

Output:

 Jennifer brought home 76 cans of milk from the store.

In the preceding prompt, we didn’t prepend any demonstrations to the question, and the model failed to provide the correct answer to the arithmetic problem. Reasoning and commonsense problems seldom yield satisfactory results with zero-shot prompts, so it’s essential to include examples when dealing with such problems.
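Throughout this article, {{ question }} is a placeholder that gets substituted with the arithmetic problem before the prompt is sent to the model. A minimal way to do that substitution in Python, reusing the complete helper sketched earlier (the variable names are illustrative):

question = (
    "Jennifer purchased 40 cans of milk at the store before meeting her "
    "classmate Mark, who was also buying milk. Jennifer bought 6 additional "
    "cans for every 5 cans Mark bought. If Mark purchased 50 cans, how many "
    "cans of milk did Jennifer bring home from the store?"
)

# Zero-shot: the bare question, with no demonstrations prepended.
zero_shot_prompt = f"Q: {question}\nA:"
print(complete(zero_shot_prompt))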

Few-Shot Prompting

With few-shot prompting, we provide the model with a small number of demonstrations, also known as exemplars. These exemplars serve to guide the model so that the subsequent questions can be similarly handled. The number of demonstrations needed varies based on the complexity of the problem and the specific model in use.

Prompt:

Q: Ken created a care package to send to his brother, who was away at boarding school. Ken placed a box on a scale, and then he poured into the box enough jelly beans to bring the weight to 2 pounds. Then, he added enough brownies to cause the weight to triple. Next, he added another 2 pounds of jelly beans. And finally, he added enough gummy worms to double the weight once again. What was the final weight of the box of goodies, in pounds?
A: 16

Q: Mark has a garden with flowers. He planted plants of three different colors in it. Ten of them are yellow, and there are 80% more of those in purple. There are only 25% as many green flowers as there are yellow and purple flowers. How many flowers does Mark have in his garden?
A: 35

Q: James writes a 3-page letter to 2 different friends twice a week. How many pages does he write a year?
A: 624

Q: Albert is wondering how much pizza he can eat in one day. He buys 2 large pizzas and 2 small pizzas. A large pizza has 16 slices and a small pizza has 8 slices. If he eats it all, how many pieces does he eat that day?
A: 48

Q: Betty is saving money for a new wallet which costs $100. Betty has only half of the money she needs. Her parents decided to give her $15 for that purpose, and her grandparents twice as much as her parents. How much more money does Betty need to buy the wallet?
A: 5

Q: {{ question }}
A:

Output:

76

Although this technique works well on many difficult problems, the model may still produce unsatisfactory results on arithmetic reasoning problems. This is because the model fails to reason through the intermediate steps.
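When scripting these experiments, the few-shot prompt can be assembled from a list of (question, answer) exemplar pairs rather than pasted by hand. A minimal sketch, reusing the complete helper and question variable from earlier (the exemplar list is truncated and the function name is illustrative):

exemplars = [
    ("James writes a 3-page letter to 2 different friends twice a week. "
     "How many pages does he write a year?", "624"),
    ("Albert is wondering how much pizza he can eat in one day. He buys 2 large "
     "pizzas and 2 small pizzas. A large pizza has 16 slices and a small pizza "
     "has 8 slices. If he eats it all, how many pieces does he eat that day?", "48"),
]

def build_few_shot_prompt(exemplars, question):
    """Prepend Q/A demonstrations to the target question."""
    demos = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{demos}\n\nQ: {question}\nA:"

print(complete(build_few_shot_prompt(exemplars, question)))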

Few-Shot Chain-of-Thought (CoT) Prompting

Few-shot chain-of-thought (CoT) prompting equips the model with a few examples that demonstrate how to reach a solution. These intermediate reasoning steps guide the model to work through the problem and generate the desired output logically.

Prompt:

Q: James writes a 3-page letter to 2 different friends twice a week. How many pages does he write a year? 
A: He writes each friend 3*2=6 pages a week So he writes 6*2=12 pages every week That means he writes 12*52=624 pages a year. The answer is 624.

Q: {{ question }}
A:

Output:

 Jennifer bought 40+6/5*50=40+12*10=40+120=160 cans of milk. The answer is 160.

In the above output, the model attempted to reason through the steps but arrived at the incorrect final answer. We observe that this discrepancy may be attributed to the irrelevant example provided. There are two ways to address this issue: either provide the model with more examples or ensure that the examples given are relevant. Below, we attempt to rectify this with a relevant example.

Prompt:

Q: Mark has a garden with flowers. He planted plants of three different colors in it. Ten of them are yellow, and there are 80% more of those in purple. There are only 25% as many green flowers as there are yellow and purple flowers. How many flowers does Mark have in his garden?
A: There are 80/100*10=8 more purple flowers than yellow flowers. So in Mark's garden, there are 10+8=18 purple flowers. Purple and yellow flowers sum up to 10+18=28 flowers. That means in Mark's garden there are 25/100*28=7 green flowers. So in total Mark has 28+7=35 plants in his garden. The answer is 35.

Q: {{ question }}
A:

Output:

 For every 5 cans Mark bought, Jennifer bought 6 additional cans. So for 50 cans, Jennifer bought 6/5*50=60 additional cans. In total, Jennifer bought 40+60=100 cans of milk. Therefore, Jennifer brought home 100 cans of milk from the store.

The final answer obtained is now correct. The nature of the task, the choice of model, and the relevance of the examples all contribute to arriving at the correct solution.

Program-Aided Language Model (PAL)

A program-aided language model (PAL) is not a prompting technique; rather, it is an application of prompt engineering. Like CoT, PAL aims to reason out the intermediate steps. However, instead of expressing those steps as free-form text, PAL expresses them as programmatic steps and offloads the computation to a programmatic runtime such as a Python interpreter.

Prompt:

Q: James writes a 3-page letter to 2 different friends twice a week. How many pages does he write a year? 
A: #James has
number_of_friends=2

#He writes each friend 3-page letter each time.
page_letter_size=3

#He writes twice to each friend in a week.
number_of_times_written_to_friend=2

#Total pages written in a week to each friend
pages_written_to_each_friend_per_week=page_letter_size*number_of_times_written_to_friend=6

#So, total pages written by him per week,
total_pages_each_week=pages_written_to_each_friend_per_week*number_of_friends=12

#Number of weeks in a year
total_weeks_in_a_year=52

#That means he writes,
total_pages_written_in_year=total_pages_each_week*total_weeks_in_a_year=624
#pages per year.

The answer is 624.

Q: {{ question }}
A:

Output:

#Jennifer initially bought 40 cans of milk.
initial_cans_bought=40

#For every 5 cans Mark bought, Jennifer bought 6 additional cans.
additional_cans_bought=6
mark_cans_bought=50

#So, for every 5 cans Mark bought, Jennifer bought 6 additional cans.
additional_cans_bought=6
mark_cans_bought=50

#Number of times Mark bought 5 cans
number_of_times_mark_bought=mark_cans_bought/5=10

#Total additional cans bought by Jennifer
total_additional_cans_bought=additional_cans_bought*number_of_times_mark_bought=60

#Total cans of milk brought home by Jennifer
total_cans_brought_home=initial_cans_bought+total_additional_cans_bought=40+60=100

The answer is 100 cans of milk.

Even with the same, less relevant example that failed to yield a correct answer under the few-shot CoT technique, PAL successfully found the correct final solution.
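In a full PAL setup, the model’s output is not read as prose at all: the generated program is executed by a Python interpreter, and the program’s result becomes the final answer. For that to work, the exemplar is usually written so the model emits directly runnable code (for example, assigning the result to a variable named answer), which differs slightly from the commented arithmetic shown above. A minimal, unsandboxed sketch of the execution step with illustrative names, shown for demonstration only:

# Hypothetical model output, written as runnable Python.
generated_code = """
initial_cans_bought = 40
additional_cans_per_group = 6
mark_cans_bought = 50
groups_bought_by_mark = mark_cans_bought / 5
answer = initial_cans_bought + additional_cans_per_group * groups_bought_by_mark
"""

namespace = {}
exec(generated_code, namespace)  # run the generated program
print(namespace["answer"])       # 100.0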

Zero-Shot Chain-of-Thought (CoT) Prompting

In situations where demonstrations for few-shot chain-of-thought (CoT) prompting are not available, a straightforward and effective approach is to ask the model itself to reason through the intermediate steps. This is called zero-shot chain-of-thought (CoT) prompting. You can achieve it by appending text such as ‘Let’s think step by step…’ to the question. This simple technique has shown exceptional results.

Prompt:

Q: {{ question }}
A: Let's think step by step...

Output:


1. Jennifer initially bought 40 cans of milk.
2. For every 5 cans Mark bought, Jennifer bought 6 additional cans.
3. Since Mark bought 50 cans, Jennifer bought 6 additional cans for every 5 cans, which means she bought (6/5) x 50 = 60 cans.
4. Therefore, Jennifer brought home a total of 40 + 60 = 100 cans of milk from the store.

From the above example, we observe that the model successfully reasons through the steps and arrives at the correct solution without requiring any examples or guidance.
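Programmatically, zero-shot CoT amounts to appending the reasoning trigger to the question; for automated scoring, a simple heuristic such as taking the last number in the response can recover the final answer. A small sketch, reusing the complete helper and question variable from earlier (the extraction rule is a convenience, not part of the technique):

import re

cot_prompt = f"Q: {question}\nA: Let's think step by step..."
reasoning = complete(cot_prompt)

# Heuristic: take the last number that appears in the reasoning chain.
numbers = re.findall(r"\d+(?:\.\d+)?", reasoning)
final_answer = numbers[-1] if numbers else None
print(final_answer)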

Tree-of-Thought (ToT) Prompting

The tree-of-thought (ToT) framework involves exploring potential solutions in a manner akin to navigating a tree structure of thoughts, similar to human problem-solving. This approach allows steps to be retraced when needed, mirroring the way humans reassess and adjust their thinking while solving a problem. In essence, ToT aims to replicate the adaptive and iterative nature of human reasoning through trial and error.

Based on the ToT framework, Dave Hulbert proposed tree-of-thought prompting, a straightforward single-prompt approach that asks the model to evaluate its own intermediate steps.

Prompt:

Identify and behave as three different experts that are appropriate to answering this question.
All experts will write down the step and their thinking about the step, then share it with the group.
Then, all experts will go on to the next step, etc.
At each step all experts will score their peers response between 1 and 5, 1 meaning it is highly unlikely, and 5 meaning it is highly likely.
If any expert is judged to be wrong at any point then they leave.
After all experts have provided their analysis, you then analyze all 3 analyses and provide either the consensus solution or your best guess solution.

Following is the question:
Q: {{ question }}
A:

Output:


Expert 1: Math Expert
Step 1: Calculate the total number of cans of milk purchased by Mark.
Mark bought 50 cans of milk.
Score: 5 (highly likely)

Step 2: Calculate the number of cans of milk purchased by Jennifer before meeting Mark.
Jennifer purchased 40 cans of milk.
Score: 5 (highly likely)

Step 3: Calculate the number of additional cans of milk purchased by Jennifer for every 5 cans Mark bought.
For every 5 cans Mark bought, Jennifer bought 6 additional cans.
Score: 5 (highly likely)

Step 4: Calculate the total number of cans of milk purchased by Jennifer.
Jennifer bought 40 + (6/5)*50 = 40 + 60 = 100 cans of milk.
Score: 5 (highly likely)

Step 5: Conclusion
Jennifer brought home 100 cans of milk from the store.
Score: 5 (highly likely)

Expert 2: Retail Expert
Step 1: Determine the average number of cans of milk purchased by customers at the store.
Based on previous sales data, the average number of cans of milk purchased by customers is 10.
Score: 4 (likely)

Step 2: Calculate the total number of cans of milk purchased by Mark.
Mark bought 50 cans of milk.
Score: 5 (highly likely)

Step 3: Calculate the number of cans of milk purchased by Jennifer before meeting Mark.
Jennifer purchased 40 cans of milk.
Score: 5 (highly likely)

Step 4: Calculate the total number of cans of milk purchased by Jennifer.
Jennifer bought 40 + (6/5)*50 = 40 + 60 = 100 cans of milk.
Score: 5 (highly likely)

Step 5: Conclusion
Jennifer brought home 100 cans of milk from the store.
Score: 5 (highly likely)

Expert 3: Psychology Expert
Step 1: Analyze Jennifer's behavior and decision-making process.
Jennifer may have purchased 40 cans of milk based on her own needs and preferences.
Score: 4 (likely)

Step 2: Analyze Mark's behavior and decision-making process.
Mark may have purchased 50 cans of milk based on his own needs and preferences.
Score: 4 (likely)

Step 3: Consider the influence of social norms and peer pressure.
Jennifer may have felt pressured to buy more cans of milk after seeing Mark's purchase.
Score: 3 (somewhat likely)

Step 4: Analyze the impact of Jennifer's decision to buy additional cans for every 5 cans Mark bought.
Jennifer may have made this decision based on a desire to keep up with Mark or to appear generous.
Score: 3 (somewhat likely)

Step 5: Conclusion
Jennifer brought home 100 cans of milk from the store, possibly influenced by social norms and peer pressure.
Score: 3 (somewhat likely)

Consensus Solution:
Based on the analysis of all three experts, it is highly likely that Jennifer brought home 100 cans of milk from the store. The math expert's analysis provides a solid calculation, while the retail expert's analysis takes into account the average number of cans purchased by customers. The psychology expert's analysis adds a different perspective by considering the influence of social norms and peer pressure. Overall, the consensus solution is that Jennifer brought home 100 cans of milk from the store.

The ToT technique employed in the above example yields an accurate solution. It’s worth mentioning that the ToT prompt used is taken from Dave Hulbert’s ToT prompts. However, one drawback of this technique is a significant increase in the prompt size.

Besides the engineering techniques mentioned above, a few additional methods can be adopted for solving arithmetic reasoning problems, such as Automatic Chain-of-Thought (Auto-CoT), Self-Consistency, Automatic Reasoning and Tool-Use (ART), and Active-Prompt; however, these are beyond the scope of this article.

Best Practices

Models perform effectively when guided with well-designed prompts. Elvis Saravia noted that a prompt may consist of four elements, though not all are necessary every time:

  1. Instruction: This specifies a particular task or action you want the model to undertake.
  2. Context: External information or additional details that guide the model to generate more accurate responses.
  3. Input Data: The specific input or question for which we seek a response.
  4. Output Indicator: The expected type or format of the output.

It is advisable to place the instruction at the beginning, separated from the rest of the prompt with a clear separator such as ‘###’.
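Putting these elements together might look like the sketch below; the element values and the assemble_prompt name are illustrative rather than a prescribed format, and the question variable is reused from earlier.

def assemble_prompt(instruction, context, input_data, output_indicator):
    """Compose a prompt from the four elements, separated by '###' headers."""
    return (
        f"### INSTRUCTION ###\n{instruction}\n\n"
        f"### CONTEXT ###\n{context}\n\n"
        f"### INPUT DATA ###\n{input_data}\n\n"
        f"### OUTPUT INDICATOR ###\n{output_indicator}"
    )

prompt = assemble_prompt(
    instruction="Answer the arithmetic reasoning question, showing numbered steps.",
    context="The question comes from a grade-school math word-problem dataset.",
    input_data=f"Q: {question}\nA:",
    output_indicator="End with a line of the form: Answer is <final answer>.",
)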

Santu and Feng (2023) introduced TELeR (Turn, Expression, Level of Details, Role), a comprehensive taxonomy designed to enable more meaningful comparisons of how multiple LLMs perform across various tasks. The taxonomy categorizes prompt directives into six levels based on the amount of detail provided. A prompt at the most detailed level can be used to get the best performance out of the model and should:

  1. Provide a brief description of the high-level goal.
  2. Present a detailed bulleted list of sub-tasks.
  3. Specify the criteria for evaluating the LLM’s output, or include few-shot examples.
  4. Include additional pertinent information obtained through retrieval-based techniques.
  5. Request an explicit statement from the LLM, explaining its output.

In general, it’s advisable to be specific, offer clear details, continuously refine the prompt through iteration, format it properly with clear separations, and state what the model should do rather than what it shouldn’t.

Let’s revisit the prompt, considering the above-mentioned practices and applying the zero-shot prompting technique.

Prompt:

### INSTRUCTION ###
You are required to answer Arithmetic Reasoning questions. Please follow the steps below as you go about solving the problem.

1. All problems should be solved in as many steps as required and each step should be numbered.
2. All calculations should be clear and simple to understand.
3. Provide an accurate final answer at the end of your reasoning in the following format: 'Answer is <final answer>'.

You'll be evaluated on:
- How easy are the calculations to understand?
- How accurate is the final answer?

### INPUT DATA ###
Q: {{ question }}
A:

Output:

1. Jennifer bought 40 cans of milk at the store.
2. Mark bought 50 cans of milk.
3. Jennifer bought 6 additional cans for every 5 cans Mark bought.
4. For every 5 cans Mark bought, Jennifer bought 6 additional cans.
5. Therefore, for 50 cans Mark bought, Jennifer bought (6/5)*50 = 60 additional cans.
6. Total cans of milk Jennifer bought = 40 + 60 = 100 cans.
Answer is 100 cans.

Initially, the model failed to provide the correct solution with a plain zero-shot prompt. With appropriate instructions and formatting, however, the same model generates the correct response while still using the zero-shot technique, without a single example.
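Because the instruction pins down an explicit answer format, checking outputs against the GSM8K gold answers can be automated with a simple parser. A sketch of that check follows; the parsing rule mirrors the ‘Answer is’ format requested above, and the function name is illustrative.

import re

def extract_answer(output):
    """Pull the numeric value from a line of the form 'Answer is <value>'."""
    match = re.search(r"Answer is\s+([\d,\.]+)", output)
    return match.group(1).rstrip(".").replace(",", "") if match else None

model_output = "6. Total cans of milk Jennifer bought = 40 + 60 = 100 cans.\nAnswer is 100 cans."
print(extract_answer(model_output) == "100")  # True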

Disclaimer: The following section provides information about Vellum, a developer platform useful for the rapid productionization of prompts. The mention of Vellum is for informational purposes only and does not constitute an endorsement or promotion. I have no affiliation with Vellum and receive no compensation for its inclusion.

Vellum

Identifying the right prompt can be a challenge. The process begins with a basic prompt, which is then incrementally refined over many iterations, and that process quickly becomes hard to keep track of. Spreadsheets might be adequate for a handful of experiments, but they falter when an intricate problem demands a multitude of iterations. So how can we efficiently track our prompts? And what if we want to seamlessly share our prompt experiments with our team? Vellum is one answer to these questions.

Vellum provides a suite of tools designed for prompt engineering, semantic search, version control, quantitative testing, and performance monitoring to help develop production-grade LLM applications. In the context of prompt engineering, it enables testing and evaluating a given prompt across all major LLMs available in the market. It also facilitates collaboration on prompts.

Screenshot of Vellum sandbox (Screenshot by author)

The image above is a representative screenshot from the Vellum sandbox. Vellum enables easy switching between text and chat models and allows for effortless adjustment of model parameters. It also offers the advantage of latency tracking and quantitative assessment across four evaluation metrics: exact match, regex match, semantic similarity, and webhook.

Summary

This article begins by introducing prompts and then delves into prompt engineering for arithmetic reasoning problems. We explored various prompt engineering techniques, including zero-shot, few-shot, few-shot chain-of-thought, zero-shot chain-of-thought, program-aided language model, and tree-of-thought.

Later, we covered some best practices for crafting better prompts. In general, it’s recommended to be specific, provide clear details, refine the prompt continuously through iteration, format it properly with clear separations, and state what the model should do rather than what it shouldn’t.

Finally, to address the challenge of tracking and evaluating prompts, as well as sharing them among team members, we investigated the Vellum tool.

If you like this article, make sure to follow me here. You can connect and reach out to me via LinkedIn and X (formerly Twitter).
