
Now that the dust has settled, the weaknesses of LLMs are known.
Even the powerful GPT-4 struggles with math operations.
Also, the training cut-off date is an inherent weakness of every LLM: they struggle to answer queries about recent events.
A loose fix is to use external Plugins (e.g. ChatGPT plugins). Still, the user has to manually specify some actions, and these plugins are sometimes unreliable.
What if there was a model that knew its weaknesses – and was trained to natively call the optimal external tool when uncertain?
That’s what Meta did by creating ToolFormer [1]. In this article, we discuss:
- What is ToolFormer and why is it a breakthrough?
- How the model works.
- How ToolFormer’s methodology can be applied to any LLM.
- Why AI research is heading towards ToolFormer’s vision.
Let’s dive in.
Weaknesses of Large Language Models
Before describing ToolFormer, let’s explore the issues modern LLMs face:
- Progression of Time: Every LLM has a training cutoff date. Hence, they can’t access up-to-date information and recent events.
- Incorrect Facts: LLMs are infamous for making up facts, places, events, products, and even research papers.
- Arithmetic operations: LLMs struggle with mathematical calculations.
- Rare languages: LLMs cannot handle low-resource languages, usually due to a lack of training data.
Notably, these issues have little to do with language mechanics themselves. An ideal solution would be to combine text generation with external tools.
Here comes ToolFormer.
What is ToolFormer?
ToolFormer is an LLM trained to decide which APIs to call, when to call them, and what arguments to pass.
ToolFormer is amazing because of:
- Best of both worlds: ToolFormer is an LLM, like GPT-3. But when uncertain, it learns to call external APIs – thus avoiding common mistakes.
- Portability: The methodology of training ToolFormer can be applied to any LLM.
- Superior Performance: ToolFormer is smaller, but outperforms much larger models like OPT and GPT-3.
- Open Source: While Meta has not released the original version yet, the community has created a few great open-source implementations.
ToolFormer provides the following tools. These are shown in Figure 1:

According to Figure 1, ToolFormer provides:
- QA: A question-answering system
- Calculator
- MT: a machine translation system
- WikiSearch: a Wikipedia search engine API
- Calendar: a calendar API that returns the current date (not shown in Figure 1)
How ToolFormer generates text
In each case, the model decides which API to call and what arguments to use. The tool names, like QA and Calculator, are special tokens.

If the model generates `Calculator(400/1400)`, then the model is ready to call the **Calculator API** with `(400/1400)` as the argument.
The `->` token signifies that the model next expects the response to the API call.
When that happens, decoding (inference) is interrupted, and the model inserts the answer returned by the corresponding API. Decoding then continues, if necessary, until the answer is complete (…passed the test.).
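Here is a minimal sketch of what such a decoding loop could look like. Everything below is an assumption for illustration: `next_token` stands in for the model’s token-by-token generation, and the toy `calculator` stands in for the real Calculator script.

```python
import re

def calculator(expression: str) -> str:
    # Toy stand-in for the Calculator tool (a simple script in the paper)
    value = eval(expression, {"__builtins__": {}})   # "400/1400" -> 0.2857...
    return f"{value:.2f}"

TOOLS = {"Calculator": calculator}

def decode_with_tools(next_token, prompt: str, max_steps: int = 200) -> str:
    """Greedy-style decoding that pauses when the model has just emitted '->'
    inside an API call, executes the tool, splices in the result, and resumes."""
    text = prompt
    for _ in range(max_steps):
        token = next_token(text)          # assumed: returns the next token as a string
        if token is None:                 # assumed: None signals end of generation
            break
        text += token
        # Detect a completed call such as "[Calculator(400/1400) ->"
        match = re.search(r"\[(\w+)\((.*?)\)\s*->$", text)
        if match:
            tool, argument = match.groups()
            result = TOOLS[tool](argument)   # decoding is interrupted here
            text += f" {result}]"            # the API response is inserted, then decoding resumes
    return text
```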
How ToolFormer is Built
It’s time to delve into the technical details.
The key innovation of ToolFormer isn’t the base pretrained model – it’s the dataset used for training and particularly the unique way the authors augmented it.
Fundamentally, ToolFormer is a GPT-J pretrained model:
ToolFormer, a small pretrained GPT-J 6.7B model, beats the much larger GPT-3 and OPT on numerous tasks.
The authors used a subset of the CCNet dataset as the training corpus (abbreviated as C). Then, they augmented that dataset with API calls – and called the result C*.

The process of augmenting C to C* is the real novelty of ToolFormer; this dataset can be used to teach any model how to effectively use API calls.
However, augmenting the training dataset is not an easy feat. We discuss this next.
Augmenting the Training Dataset
This is the most crucial part. The authors have 3 goals here:
- No human intervention. We don’t expect that a human will manually perform the process shown in Figure 2.
- Top-quality data: The augmentation should be meaningful and helpful for the model. For example, the augmentation Pittsburgh is also known as [QA("What type of material characterizes Pittsburgh?") -> Steel] the Steel City is wrong and not meaningful.
- No loss of generality: With the new dataset, ToolFormer will still function as an LLM (able to optimally predict the next word).
Now, it’s time to zoom in on Figure 2 and reveal the intermediate steps between dataset C and dataset C*.
A more detailed view is shown in Figure 3:

Figure 3: The annotation process for a sentence x, using the QA tool. (Source)
There are 3 steps – let’s decompose them. Given a sentence x, we have:
- Sample API Calls: Sample a position i in sentence x that is likely to be used for an API call. Then, generate candidate API calls [c1, c2, …, ck].
- Execute API Calls: Execute those API calls and collect the responses [r1, r2, …, rk].
- Filter API Calls: Not all pairs (ci -> ri) are useful or correct. We filter out the API calls that don’t reduce the loss function L over the next tokens.
Don’t worry if you don’t fully understand the steps. In the next section, we will delve deeper into each step.
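For orientation, here is a minimal sketch of how the three steps might fit together; `sample_api_calls`, `execute_call`, and `is_helpful` are hypothetical placeholders for the procedures described in the next sections.

```python
def insert_call(x: str, position: int, call: str, response: str) -> str:
    # Splice "[Call(args) -> response]" into the sentence at `position`
    return x[:position] + f"[{call} -> {response}] " + x[position:]

def augment_dataset(sentences, sample_api_calls, execute_call, is_helpful):
    """Minimal sketch of turning C into C*. The three helpers are stand-ins
    for Step 1 (sampling), Step 2 (execution), and Step 3 (filtering)."""
    augmented = []                                    # becomes the dataset C*
    for x in sentences:
        # Step 1: candidate calls c_1..c_k at likely positions in x
        for position, call in sample_api_calls(x):
            # Step 2: execute each call and collect its response r_i
            response = execute_call(call)
            # Step 3: keep the pair (c_i -> r_i) only if it reduces the loss
            # over the tokens that follow the position
            if is_helpful(x, position, call, response):
                augmented.append(insert_call(x, position, call, response))
    return augmented
```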
Step 1: Sample API Calls
In this step, we generate possible API calls – from the dataset C.
The prerequisites are i) no manual intervention and ii) the API calls should be as meaningful as possible.
The best way to automate this task is to ask the GPT-J model to make the annotations itself!
Specifically, we will write a prompt P(x) with instructions and a few examples – and encourage the model to annotate a sentence x with API calls. For the QA tool, the authors use the following prompt:

Note: The process of including a few examples in the prompt to help the model better understand the given task is called in-context learning.
But remember, language models tend to hallucinate or produce errors – that’s why we need the filtering process in Step 3.
Next, let’s examine how Step 1 works in practice:
- We will use the prompt P(x) from Figure 4 to annotate a sentence x with some candidate API calls.
- Let’s try the sentence x = Pittsburgh is also known as the Steel City.

Figure 5: Annotating the sentence x = Pittsburgh is also known as the Steel City. (Image by author)
We got 3 annotated sentences as output. Obviously, only the 2nd candidate API call is meaningful here.
The purpose of this step is to generate multiple annotated sentences without human effort. We will address how to filter out the incorrect ones later.
Note: The authors also impose a minimal filtering process here, to save costs. For more info, check the Appendix at the end of the article.
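As a sketch of Step 1, the annotation can be produced with a single prompted call to the pretrained model. The prompt below is a simplified paraphrase of the idea in Figure 4 (not the authors’ exact wording), and `lm_sample` is a hypothetical sampling function wrapping GPT-J.

```python
QA_PROMPT = """Your task is to add calls to a Question Answering API to a piece of text.
You can call the API by writing "[QA(question)]", where "question" is the question you want to ask.
Example:
Input: Joe Biden was born in Scranton.
Output: Joe Biden was born in [QA("Where was Joe Biden born?")] Scranton.
Input: {x}
Output:"""

def sample_qa_annotations(lm_sample, x: str, k: int = 3):
    """Ask the pretrained LM itself to annotate sentence x with candidate QA
    calls via in-context learning; returns k candidate annotated sentences."""
    prompt = QA_PROMPT.format(x=x)
    return [lm_sample(prompt) for _ in range(k)]
```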
Step 2: Execute API Calls
This is straightforward – given the candidate calls from Step 1, we ask APIs for responses:

In reality, not all tools are backed by actual APIs – only the Calculator and the Calendar are, and these are simple scripts.
For the other cases, the authors use specialized LMs. For example, for the QA tool, they use Atlas (Izacard et al., 2022), a retrieval-augmented LM finetuned on Natural Questions.
Feel free to check the paper for further details on each tool.
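As a rough sketch of how execution might be wired up (the `qa_answer` stub below stands in for the Atlas-based QA model; only the Calculator and Calendar are simple, real scripts):

```python
import datetime

def qa_answer(question: str) -> str:
    # Placeholder: in the paper, QA is served by Atlas, a retrieval-augmented LM
    return "<answer from the QA model>"

def execute_call(tool: str, argument: str) -> str:
    """Route a candidate API call to its tool and return the response."""
    if tool == "Calculator":
        # A simple script: evaluate the arithmetic expression
        return str(round(eval(argument, {"__builtins__": {}}), 2))
    if tool == "Calendar":
        # Another simple script: return the current date
        return datetime.date.today().strftime("Today is %A, %B %d, %Y.")
    if tool == "QA":
        return qa_answer(argument)
    raise ValueError(f"Unknown tool: {tool}")
```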
Step 3: Filter API Calls
While Step 1 generated numerous API Calls, Step 3 keeps only the meaningful ones.
Meaningful API calls: those whose response actually helps the model predict the tokens that follow.
This improvement is measured with a loss function and a formula – called the helpfulness ratio.
The whole process is displayed in Figure 7:
The goal is to check if the following example is meaningful enough to be included in the augmented C* dataset:

Let’s break down what happens here:
We calculate the negative log-likelihood of the phrase "the Steel City" given each case as a prefix:
- No API: We calculate p(the Steel City | Pittsburgh is also known as). Here, we don’t help the model much – that’s why the loss is high.
- Input Only: We calculate p(the Steel City | Pittsburgh is also known as [QA("What other name is Pittsburgh known by?") -> ?]).
- Full API: We calculate p(the Steel City | Pittsburgh is also known as [QA("What other name is Pittsburgh known by?") -> Steel City]).
The Full API case is the most helpful: the prefix contains the correct answer ‘Steel City’. That’s why we obtain the lowest loss here!
However, we have to quantify how helpful an annotation is. Here comes the helpfulness ratio. The higher the ratio, the more helpful the annotation is for training ToolFormer.
In Figure 7, we achieve a high helpfulness ratio, so our annotated example Pittsburgh is also known as [QA … city] goes into the augmented dataset C*.
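A minimal sketch of this filter, assuming a hypothetical `loss_fn(prefix, target)` that returns the model’s loss on `target` given `prefix`; the `threshold` argument plays the role of the τf value discussed below.

```python
def keep_api_call(loss_fn, prefix, target, call, response, threshold=1.0):
    """Keep an annotated call only if conditioning on the full call (with its
    response) lowers the loss on the following tokens enough, compared with
    the better of the "no API" and "input only" cases."""
    loss_no_api     = loss_fn(prefix, target)
    loss_input_only = loss_fn(prefix + f"[{call} -> ?] ", target)
    loss_full_api   = loss_fn(prefix + f"[{call} -> {response}] ", target)

    # The "helpfulness" of the annotation: how much the full call reduces the loss
    helpfulness = min(loss_no_api, loss_input_only) - loss_full_api
    return helpfulness >= threshold
```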
The role of the helpfulness ratio
Now, consider this:
A more powerful model like GPT-4 probably knows that Pittsburgh is also known as the "Steel City". In that case, the loss of the No API case would be low and almost similar to the Full API case. That leads to a helpfulness ratio close to 0.
But GPT-J, being a smaller model, doesn’t know the answer.
Hence, the GPT-J model benefits from being finetuned on the annotated example Pittsburgh is…city, while GPT-4 doesn’t. GPT-4 would probably require more complex examples.
Thus, the process of training ToolFormer can be applied to any LM – thanks to the 3-step pipeline and the helpfulness ratio formula.
A rejection example
Remember, for our case, we sampled 3 API calls (Figure 5). Only the 2nd one is meaningful – the other 2 should be rejected.
Let’s see a rejection example. We will use as an example the following API call (the 1st in Figure 5):
Pittsburgh is also known as [QA(In which state is Pittsburgh?) -> Pennsylvania] the Steel City.

Here, we make an API call for a completely irrelevant question – "In which state is Pittsburgh?". This does not help our model answer "how else is Pittsburgh known".
Hence, we get a high loss, which means a negative helpfulness ratio. Thus, this annotated API call is not inserted into the C* dataset.
The threshold τf for the helpfulness ratio
So far so good, but how high should the helpfulness ratio of an example be for it to be considered meaningful – and eligible to enter C*?
The authors found this threshold τf experimentally – by considering the number of training examples per category in C* for different values of the helpfulness ratio. The results are shown in Figure 9:

Obviously, by increasing the threshold τf, fewer examples go into C*.
However, there were fewer Machine Translation and Calculator examples than in the other categories. Hence, to avoid a serious imbalance, the authors set:
- For QA, Wiki Search, and Calendar: τf = 1
- For Machine Translation and Calculator: τf = 0.5
Final Step: Finetuning ToolFormer
The augmented dataset C* is now ready.
We finetune GPT-J on C* – and voila, we get ToolFormer!
Finetuning is straightforward. The authors use perplexity as the finetuning objective.
Perplexity is a standard metric for evaluating how uncertain a language model is. For instance, a perplexity of 32 means the model is as uncertain about the next word as if it were throwing a 32-sided die. Hence, lower is better.
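As a quick sanity check of the dice intuition, here is a tiny sketch computing perplexity from per-token log-probabilities:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token)."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Toy example: four tokens, each assigned probability 1/32 by the model
print(perplexity([math.log(1 / 32)] * 4))   # ~32.0, i.e. as uncertain as a 32-sided die
```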
Evaluation
Next, the authors evaluate ToolFormer.
In total, the authors use the following models:
- GPT-J: The original GPT-J pretrained model.
- GPT-J on C: Here, GPT-J fine-tuned on the C dataset (the one without API calls).
- ToolFormer: GPT-J fine-tuned on the C* dataset.
- ToolFormer (disabled): ToolFormer, but with API calls disabled. (This is done by setting the probability of generating the [ token during inference to zero.)
The goal is to evaluate the model for each separate task: QA, Calculator, Calendar, and so on. Let’s start:
QA Evaluation (LAMA Benchmark)
The authors evaluate ToolFormer on 3 subsets of the LAMA Benchmark: SQuAD, Google-RE, and T-REx.
For each of these subsets, the task is to complete a short statement with a missing fact – e.g. The theory of relativity was developed by ___
and the model should fill in the correct fact.
The results of this benchmark are shown in Figure 10.
Note: The performance scores below represent metrics that are evaluated differently for each dataset. To avoid getting into details, consider that higher is better. This holds for the other benchmarks throughout this article as well.

The results are particularly interesting: ToolFormer outperforms the much larger OPT and GPT-3 on all benchmarks.
The power of ToolFormer comes from its ability to call external APIs in challenging situations.
Specifically, the model decided to call the QA API in 98.1% of all cases. For only very few examples, it uses a different tool (0.7%) or no tool at all (1.2%).
Calculator Evaluation (Math Benchmark)

- ToolFormer again outperforms OPT and GPT-3 by a large margin.
- The model decides to call the Calculator API in 97.9% of all cases.
Wiki Search Evaluation (Search Benchmark)
Here, ToolFormer is not the best model:

ToolFormer outperforms OPT but loses to GPT-3. The authors provide the following reasons:
- ToolFormer’s Wiki Tool searches only Wikipedia, instead of the whole Web (GPT-3 was trained on a huge portion of online content).
- ToolFormer would have outperformed GPT-3 if it had been able to call the QA Tool as well. The authors disabled the QA Tool on purpose – because the datasets used to train the QA system potentially overlap with the data in these benchmarks.
- An additional layer on top of the Wiki Tool would be necessary – to reformulate the returned results and provide clearer answers.
Translation Evaluation (MLQA Benchmark)
Here, the results are very interesting:

ToolFormer easily outperformed OPT and GPT-3 in all languages (except Chinese (Zh), where GPT-3 is better).
However, ToolFormer was surpassed by the original GPT-J. The authors explained this was because GPT-J was also pretrained on multilingual data – while C had very few multilingual examples.
Calendar Evaluation (Temporal Datasets)
Finally, we evaluate ToolFormer’s ability to extract dates and recent information.
The authors used 2 datasets:
- TempLAMA: Contains masked facts that change over time (e.g., "Kylian Mbappé plays for ___").
- DATESET: Contains random queries about dates (e.g., "What day of the week was it 30 days ago?").
Figure 14 displays the results:

Again, ToolFormer outperforms the much larger models. The difference is huge in DATESET – this is expected since finding dates is an inherent weakness of LLMs.
Scaling Laws
An integral part of training LLMs is whether the model obeys the scaling laws.
Scaling laws are empirical rules that describe the relationship between an LM’s parameter count, number of training tokens (dataset size), training compute, and performance.
Scaling laws were first introduced in [2], but were later re-examined in [3], where DeepMind researchers discovered that many LMs were significantly undertrained.
Here, the authors explored ToolFormer’s ability to scale, compared to the other models of the benchmark. The results are shown in Figure 15:

Evidently, ToolFormer displays excellent signs of scalability – following scaling laws.
Smaller models (less than 775M parameters) achieve similar performance to the baselines – they don’t gain an advantage by calling external APIs. Beyond 775M parameters, ToolFormer starts to scale dramatically.
Decoding Strategy
An interesting part about ToolFormer is how the authors implemented the decoding strategy.
In truth, [ is a special token – it signifies the start of an API call.
During generation, an LM computes a probability for every token in the vocabulary and, roughly speaking, generates the one with the highest probability. This is known as greedy decoding.
The authors found experimentally that ToolFormer performs better if [ is generated whenever it is among the top `k=10` most probable tokens, instead of only when it is the single most likely token (`k=1`).
The results for different values of k are shown in Figure 16:

Clearly, the model performs best, on average, when `k=10`. Figure 16 displays only 2 datasets; however, the pattern holds for the other benchmarks as well.
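A sketch of this relaxed rule (the token id for `[` and the logits below are made up for illustration):

```python
import numpy as np

def should_start_api_call(logits, api_start_id, k=10):
    """Begin an API call whenever '[' ranks among the top-k most probable
    next tokens, rather than only when it is the single most probable one."""
    top_k_ids = np.argsort(logits)[-k:]      # indices of the k highest-scoring tokens
    return api_start_id in top_k_ids

# Toy usage: '[' (id 2) is only the 3rd most likely token, but still triggers a call
logits = np.array([0.1, 2.0, 1.5, 3.0, -1.0])
print(should_start_api_call(logits, api_start_id=2, k=3))   # True
print(should_start_api_call(logits, api_start_id=2, k=1))   # False
```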
The Tools LLMs ecosystem
ToolFormer uses external APIs to solve maths and reasoning problems. But these problems can also be addressed by other approaches.
The most popular approach is called ‘chain-of-thoughts’ [4]:
In ‘chain of thoughts’, the LLM learns to break a prompt into intermediate steps – solving each step individually before giving the final answer.
Here, we don’t call external APIs. Instead, we teach the model to decompose a prompt into smaller parts – which helps the model with arithmetic tasks. An example is shown in Figure 17:
![Figure 17: Using chain-of-thought prompting (right) the model can figure out the correct answer. Chain-of-thought reasoning processes are highlighted in green [Wei et al.]](https://towardsdatascience.com/wp-content/uploads/2023/10/1AYtC9aGdVAzs1FmMCglQLg.png)
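To make the format concrete, here is a hedged sketch of what a chain-of-thought prompt looks like (a paraphrase of the canonical example from [4], not the authors’ exact prompt):

```python
# The single demonstration contains the intermediate reasoning steps,
# nudging the model to reason step by step before giving its own answer.
COT_PROMPT = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A:"""
```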
The ‘chain-of-thoughts’ paradigm has been improved in the latest research.
Program-aided Language models (PAL) [5] achieve even better results by breaking down the prompts into both textual intermediate steps and Python code (Figure 18):
![Figure 18: Chain-of-thought (left) gives a wrong answer, while PAL (right) is correct. PAL combines Chain-of-thought reasoning (highlighted in blue) with programming annotations (highlighted in pink) [Luyu Gao et al.]](https://towardsdatascience.com/wp-content/uploads/2023/10/128W1afyNxTnlg87bEtvGjA.png)
Lastly, we can use LangChain, an application framework for LLMs. LangChain uses agents to integrate with various search APIs capable of searching the web. Figure 19 shows the SerpAPI tool:

What is the difference between ToolFormer and Langchain agents?
- LangChain agents first have to use the appropriate API (which a human specifies) and then combine the results with the prompt to get a correct answer.
- In contrast, ToolFormer was explicitly trained to call and integrate API tools (no manual intervention).
Closing Remarks
This article explored ToolFormer, a model capable of calling external Tools.
Essentially, ToolFormer is a process that can teach any LLM to call external APIs.
With the adoption of LLMs, the necessity to call external resources will become apparent. Even ChatGPT now allows users to enrich their prompts with search results from the web.
Thank you for reading!
I write an in-depth analysis of an impactful AI paper once a month. Stay connected!
- Follow me on Linkedin!
- Subscribe to my newsletter, AI Horizon Forecast!
Appendix
The authors also impose a minimal filtering process on the 1st step of the data augmentation process, to save costs. For example, sentences that are not annotated with any special tokens ([QA…, etc.) are rejected from the next steps.
Also, the authors calculate the positions in the sentence that are most likely to initiate an API call. The symbols [ and ] are also special tokens and signify the start and the end of an API call. So, the authors calculate the position i where the token [ has the highest probability of appearing. Hence, only the sentences where the start of the API call (the token [) is generated at the most probable position i are passed to the next step.
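A rough sketch of this position pre-filter, assuming a hypothetical `next_token_prob(prefix_tokens, token)` helper that returns p(token | prefix) under the pretrained LM:

```python
def most_likely_call_position(next_token_prob, tokens, api_start_token="["):
    """Score every position i in the tokenized sentence by the probability of
    emitting '[' right there, and return the highest-scoring position. Only
    annotations whose API call starts at this position move on to the next step."""
    scores = [next_token_prob(tokens[:i], api_start_token) for i in range(1, len(tokens))]
    return scores.index(max(scores)) + 1   # position i with the highest p('[')
```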
References
[1] Timo Schick et al. Toolformer: Language Models Can Teach Themselves to Use Tools
[2] Jared Kaplan et al. Scaling Laws for Neural Language Models
[3] Jordan Hoffmann et al. Training Compute-Optimal Large Language Models
[4] Jason Wei et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (January 2023)
[5] Luyu Gao et al. PAL: Program-aided Language Models (January 2023)