
NLP & Numeracy (Pt 1) – Calculators for Natural Language

What is numeracy in NLP and why does it matter?

Original image from [Chen, et al. 2020], DeepArt image by DeepArt

We humans never really liked math. It’s why we built computers in the first place. One of my first maths teachers once said: "Science & invention, in most cases, is driven by laziness. We don’t want to do something ourselves, so we invent things that do it for us." That really spoke to me.

Numbers are important. They are particularly important in finance, trading, health, conversations, and decision-making in general. For a long time, however, we have been looking for ways to offload the actual numeric operations to machines (moving arithmetic and algebra to calculators) or to use machines to find the numbers we need more efficiently (filtering and sorting in Excel).

As soon as we knew how to arrange marbles, we invented the abacus. We have calculators and more advanced tools for most forms of computation. In NLP, we use information extraction to pull out the relevant numbers from text. But how do we contextually understand numerical relations in text?

  • Instead of reading financial news, can we just ask: Which company increased revenue the most last quarter? Which companies exceeded their target?
  • In science and academic research: instead of scouring through dozens of papers, can we just ask: Which models perform task A with accuracy 85% or above?
  • In politics: How many more votes did candidate B get compared to candidate D?

When scoping out an NLP solution, some numeric capabilities are likely required of your model: this may mean anything from adding a whole bundle of models to handle the numeric part of the solution to no change at all.

Can the current best in class NLP models deal with such questions? If so, how well? Do they need to?

In this article, I will arm myself with the findings of about a dozen recent NLP numeracy papers and try to answer those questions.

This is a non-technical intro to the topic and the state of Numeracy in NLP. Part 2 will explore some of the details of the different solutions and bring up more technical terms. Here is my agenda:

Part 1 (this article):

  • Does numeracy in NLP matter?
  • Why can’t we just use a calculator?
  • What is numeracy in NLP?
  • How to measure NLP numeracy?
  • What can we do about it?

Part 2: How to be good at math if you are a Transformer:

  • Numeracy and Embeddings
  • Language Models and Numeracy
  • Calculators for Natural Language

Does numeracy in NLP matter?

Does any of this matter? In almost all applications, there are cases where numbers play a crucial role. These are cases where numeric quantities and operations provide crucial information about the content/outcome of the text without being expressed in the semantics of the text. Let us look at some use cases:

Search – Question-Answering (QA) with numeracy

So far, for question answering, the answer needed to be pretty much spelled out in a document for a QA system to give it back to us. If we ask: "What are financial services companies with market value over USD 100mn?", we would likely need an article titled "List of companies by market value" and would receive back the whole list without any further reasoning over it. The systems I describe below provide techniques for obtaining an explicit list as an answer. This has profound implications for domain-specific datasets, where QA systems need to do a lot more heavy lifting than when searching the web – the internet has a lot of redundancy in information, which makes finding answers easier there than in a domain-specific set of documents where information may only occur once.

Fact checking

Typically, we might want to compare a set of statements to determine whether they reinforce each other, contradict each other, or are independent. Different sources can express the same statement very differently, in ways that simple comparisons of words or embeddings often fail to match. The task of entailment can help here; some examples:

  • Statement Equivalence: Revenue for the quarter is 500mn vs 450mn expected. Revenue beats expectations.
  • Statement comparison: New subscriptions exceed 5mn. 5.2mn new subscribers
  • Error detection: Quarterly sales for Germany, France and Italy were up €5mn, €3mn and €1bn respectively.
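As a minimal sketch of the numeric layer such checks require – the parsing helper, the mn/bn scaling and the example figures below are my own illustrative assumptions, not taken from any of the cited systems:

```python
import re

def parse_amount(text):
    """Pull a quantity like '500mn' or '5.2mn' out of a snippet (toy helper)."""
    match = re.search(r"([\d.]+)\s*(mn|bn)?", text)
    value, unit = float(match.group(1)), match.group(2)
    return value * {"mn": 1e6, "bn": 1e9}.get(unit, 1)

# Statement equivalence: "Revenue is 500mn vs 450mn expected" entails
# "Revenue beats expectations" exactly when actual > expected.
assert parse_amount("500mn") > parse_amount("450mn")

# Statement comparison: "New subscriptions exceed 5mn" vs "5.2mn new subscribers"
# are compatible because the reported figure clears the stated bound.
assert parse_amount("5.2mn") > parse_amount("5mn")
```

The hard part, of course, is recovering the operands and the direction of comparison from free text: surface word or embedding similarity does not encode the `>` relation.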

Chatbots

Commonly, chatbots aim either to collect information from the user to perform a certain task (e.g. a restaurant reservation) or to provide information back to the user (e.g. finding a good answer to a question using an FAQ database or the customer history). Again, if these involve numeric operations, traditional chatbots may struggle. Examples where the tools in this article will help:

  • Chatbot (restaurant reservation): How many people would you like to make a reservation for? Customer: myself, my wife and son → 3
  • Customer (phone contract): How long do I have until contract renewal? Chatbot: Three months (example background calculation: 4 months until contract expiry – contract renews 1 month before expiry)

From the above, you might notice a range of numerical concepts in use: sorting, addition, counting, and so on. It turns out NLP models perform significantly differently on tasks of different complexity. As you might anticipate, different tasks have different specialist solutions; we also have a few generalist solutions on call…


Why can’t we just use a calculator?

Punching 3 + 4.1 − 1 into a calculator for an answer is easy. It is a bit more complicated to do the same with the sentence:

"Company Retail Sales for Germany have added EUR 3mn revenue in the first quarter, but were approximately EUR 1mn short of the expected EUR 4.1mn for Q2"

Notice that there are a number of complications here vs using a calculator:

  • The text blurb lends itself to a number of different questions: What was Q2 revenue? What was Q2 expectation? etc.
  • In addition to figuring out a specific number, in an NLP context we often want to be able to assign the knowledge from the text to a specific entity: here, the specific business unit and region this information applies to.
  • In dollars, please, the rest of my table is in dollars!
  • Also, what does the word approximately mean???

In short, understanding numeric relationships and performing operations on text involves a number of complications beyond the actual computation, and can lead to different results depending on the quantity we are after.
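To make the gap concrete, here is a naive extraction sketch (the regex and the millions scaling are my own assumptions): pulling the numbers out of the blurb is the easy part; which arithmetic applies depends entirely on the question being asked.

```python
import re

sentence = ("Company Retail Sales for Germany have added EUR 3mn revenue "
            "in the first quarter, but were approximately EUR 1mn short "
            "of the expected EUR 4.1mn for Q2")

# Extracting the raw quantities is straightforward: 3mn, 1mn and 4.1mn...
amounts = [float(v) * 1e6 for v in re.findall(r"EUR ([\d.]+)mn", sentence)]

# ...but the computation to perform depends on the question:
q2_expected = amounts[2]              # "What was the Q2 expectation?"
q2_revenue = amounts[2] - amounts[1]  # "What was Q2 revenue?" (and only roughly,
                                      # given the word "approximately")
```

Note that none of this resolves the remaining complications: attaching the figures to the right entity (Retail Sales, Germany), converting currencies, or quantifying "approximately".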


What is numeracy in NLP?

When using language, we need to express a variety of numerical concepts as part of natural text. The challenge for NLP is to translate the vagueness and frivolous nature of language to the stern and concrete numeric form.

Here is a list of common numeracy tasks NLP needs to deal with. This is a summary of the main types covered by the DROP [Dua, et al., 2019] and EQUATE [Ravichander, et al., 2019] datasets, which we will look at in more detail later. Note that this is an oversimplification, trying to capture the main items pertaining to numeracy in text.

  • Comparison: number scales / relative size – the ability to understand which number in a pair is higher, and when numbers are significantly different from each other
  • Finding min/max, argmax/argmin: finding the largest number from several mentions, e.g. "Which was the best quarter?", or the argmax: "Which business had the best quarter?"
Some examples from DROP dataset, picture from [Wallace, et al., 2019]
  • Fractions & Percentages: information given as % which we might want to compare to an absolute number to then compare with others, e.g. is 10% of $10 more than $1.5
  • Ranges and Approximations: ability to understand that "more than 5" and "7" are compatible statements, but "more than 5" and "4" are not, or that, "ca 5mn" and "5.02mn" are essentially the same
  • Performing individual arithmetic operations: sentences requiring a single addition or multiplication, e.g. "I have 3 apples and John has 5 apples; how many apples do we have together?"
  • Solving equations / multiple operations: the ability to both derive the arithmetic operations implied in text and perform them successfully
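Stripped of the language, here is roughly what several of these tasks reduce to once the numbers have been extracted – the hard part for an NLP model is doing this implicitly from raw text (the quarter values and the 1% approximation threshold are my own assumptions):

```python
# Toy reductions of a few numeracy tasks, once the numbers are out of the text.
quarters = {"Q1": 3.0, "Q2": 3.1, "Q3": 2.8}  # illustrative revenue, EUR mn

# Comparison / relative size: which number is higher?
assert quarters["Q2"] > quarters["Q1"]

# argmax: "Which was the best quarter?"
best_quarter = max(quarters, key=quarters.get)
assert best_quarter == "Q2"

# Fractions & percentages: is 10% of $10 more than $1.5? (No: $1.0 < $1.5)
assert not (0.10 * 10 > 1.5)

# Ranges & approximations: "more than 5" is compatible with 7, and
# "ca 5mn" vs 5.02mn are essentially the same (within, say, 1%).
assert 7 > 5
assert abs(5.02e6 - 5e6) / 5e6 < 0.01
```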

Original image from [Chen, et al. 2020], DeepArt image by DeepArt

How to Measure NLP Numeracy?

Before we can answer whether we are good or bad at numeracy, we need to establish a framework for testing that. In this article, we will focus on two types of reasoning over text: Question answering and Natural Language Inference.

Question Answering (QA)

The task of answering questions (typically reading comprehension questions), but abstaining when presented with a question that cannot be answered based on the provided context. (definition taken from paperswithcode)

General Comprehension QA SQuAD 1.1 & 2 – The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles […] SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. Definition from the SQuAD explorer page; consider checking out the paper too

Two Examples of QA sets: Paragraph, question and possible answers. Objective is to find the "span" of text that matches any of the ground truths. Source: SQuAD 2 dataset

QA for numeracy The equivalent dataset for numeracy is DROP [Dua, et al., 2019]. It follows the same format as SQuAD, but the focus is on numeracy. AllenNLP provides the dedicated dataset and maintains a leaderboard; we will go into some of the best-performing models later.

DROP distinguishes four types of outcomes. These are the types of answers we look for when answering questions, and equally the targets when performing other tasks. Their proportions within DROP are given in parentheses:

  • Numeric outcomes (63%) – the answer is a number, whether the question is stated in words or digits. For instance, arithmetic questions will mostly require a numeric outcome.
  • Single span (32%) – entities which are the answer to the question posed; in a numeric context this can be the answer to "Which group is the largest?" or "Who scored the most points in the game?" While the answer itself is not numeric, numeracy is required to reach it
  • Multi-span (4.8%) – equivalent to the single-span problem, but with multiple answers, e.g. "Which countries have a population larger than X?", "Which players scored more than 1 goal?"
  • Dates, periods (1.6%) – most of the operations above apply to dates too, which adds complications, as dates and times can be expressed in a number of ways: Q1, Winter 2020, 5 Apr 2020, etc
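To make the format concrete, here is a simplified mock of a DROP-style example with a numeric outcome (the field names are my simplification, not the exact released JSON schema):

```python
# A DROP-style example with a numeric outcome: note that the answer "24"
# never appears verbatim in the passage, so a pure span-extraction model
# cannot return it by copying text.
example = {
    "passage": "The Bears scored 14 points in the first half "
               "and 10 in the second.",
    "question": "How many points did the Bears score in total?",
    "answer": {"type": "number", "value": "24"},
}

# A numeric answer head must instead compute the result from the operands:
operands = [14, 10]
assert sum(operands) == int(example["answer"]["value"])
```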

Natural language inference (NLI)

The task of determining whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) given a "premise". (definition taken from paperswithcode)

General Comprehension NLI One commonly used dataset to measure general NLI skills is MNLI. The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. Definition taken from the dataset page; consider checking out the paper too

Four Examples of NLI: premise, label, hypothesis sets. Objective is to correctly predict the label, given the other two. Source: MNLI dataset

NLI for Numeracy The equivalent dataset for numeric reasoning here is EQUATE [Ravichander, et al., 2019]. It focuses specifically on numerical tasks by selecting statements from several different sources: RTE (existing NLI datasets), CNN and Reddit, plus synthetically generated examples (StressTest and AWPNLI). The authors pick cases where numerical operations are crucial to the statement, then ask crowdworkers or experts to provide a paraphrased or contradictory statement. Several datasets are created; here is the final breakdown.

Table from EQUATE github

As with traditional NLI tasks, the outcome is one of three options, with some exceptions where only entailment and non-entailment are available.

Here are some examples from the paper:

Table from EQUATE [Ravichander, et al., 2019]

Original image from [Chen, et al. 2020], DeepArt image by DeepArt

General Comprehension vs Numeracy-only Results

To demonstrate how we can measure numeracy, we will use some results from using BERT [Devlin, et al., 2019] for prediction. BERT is now commonly used as a baseline architecture for measuring NLP performance, a choice driven mostly by its accessibility and name recognition.

Table by the author, results are collected from DROP & EQUATE papers

Consider BERT’s performance on the above two tasks:

  • For QA, a widely used dataset for measuring general comprehension is SQuAD 1.1; for numeracy, the equivalent QA dataset is DROP (paper, dataset). BERT shows a drop of roughly 50 points on the numeracy task: 88.5% F1 on SQuAD vs. 32% F1 on DROP
  • For NLI, one commonly used general comprehension dataset is MNLI, part of the GLUE benchmark, where BERT achieves 84.4% accuracy (taken from the original paper). However, on the NLI alternative for numeracy we use, the EQUATE dataset, BERT achieves anywhere between 36% and 72% accuracy (EQUATE is composed of 6 different domain tasks, hence the range)

In short, standard NLP solutions do not fare well in numerical challenges without help. Fear not, though, help is coming…


So what can we do about it?

Here, I will provide a brief overview of the results achieved since BERT. Please note that the next article will explain the differences between the mentioned models; this one just outlines some conclusions.

Performance by type of numeracy task and trade-offs associated

Table by the author, scores taken from the published sources. References: GenBERT [Geva, et al., 2020], QDGAT [Chen, et al., 2020](https://openreview.net/forum?id=ryxjnREFwH), BERT+Calculator [Andor et al., 2019], NeRd [Chen, et al., 2020], TASE [Segal, et al., 2020]. Note: only results from openly available papers, some with available code, are listed

The results also differ markedly by the type of operations needed. When matching objectives with solutions, therefore, one can consider which types of tasks are relevant. Please note that I use results from DROP as proxies here, as it is the most widely used dataset. However, this means these are purely QA-based outcomes and may not translate exactly to other tasks.

Evidence suggests that (see table above for summary results):

  • Common tasks: basic notions of the relative size of single- and double-digit numbers, addition and comparing numbers seem to work out of the box with traditional model choices (see [Wallace, et al., 2019]). Some models available here can solve both numeric and general QA tasks, or are simply language models with a better understanding of numbers. My choice of models here retains relative flexibility to perform other tasks while still achieving 75–84% F1 vs ca. 15% for BERT
  • Medium difficulty: more advanced operations like max/min, argmax/argmin, ranges, etc. need additional modifications. Typically, this means standard models can be used as a starting point, with dedicated "calculator" modules trained for specific numeric tasks; I will discuss this more in the Calculators for natural language section of the next article. As a proxy, I use the overall DROP scores in the table, as those operations appear across answer types. The best model here, QDGAT, reaches almost 90% F1 vs 33.6% for BERT. Unfortunately, it is a tool that can only solve numeric tasks, which means some extra overhead to incorporate it with other models
  • Harder tasks: providing lists of answers or solving formulas presents a more significant challenge. Additional model extensions have helped some models perform better than others here. The best scores I could find were about 80% F1 vs 25% for BERT; once more, these are numeracy-specific tools

As the evidence shows, the answer to what can be done is rather nuanced, so further consideration is needed based on the use case at hand.


Conclusion

This was the non-technical part of a two part series on NLP numeracy. We have seen that:

  • Numeracy presents a number of complications in natural language, from the ambiguity of the numeric outcome targeted to the variety of operations needed to produce an answer
  • Numeracy can be important in a number of use cases where traditional NLP models do not deal with it well out of the box. We saw question answering ability deteriorate from an impressive 89% on general reasoning questions to just 33% on numeric ones
  • Finally, we have glimpsed how different types of numeracy challenges can have different solutions depending on our actual need. We saw that by using the right specialised models we can get numeracy performance back to 80–90% F1, depending on the task we want to perform

In Part 2 we will spend more time trying to understand why and how they are different.

Thank you for reading


Special thanks to Rich Knuszka for valuable feedback.


Hopefully, this was useful or … curious (either the good or bad curious). Thank you for reading. If you feel like saying hi, do reach out via LinkedIn

