Fastcoref — A Practical Package for Coreference Resolution

Understand the main work behind the F-coref model for coreference resolution in NLP and how to use it via the intuitive fastcoref package

Noa Lubin
Towards Data Science


Fastcoref — Photo by Ryan Stone on Unsplash

Introduction

This week Shon Otmazgin, Arie Cattan and Prof. Yoav Goldberg from Bar Ilan University, Israel released a coreference resolution package named ‘fastcoref’ [2] that is fast, accurate and easy to use. In this article we’ll understand the main work behind this package and how to use it. You can also try a demo of this great package on Hugging Face, created by Ari Bronstein.

Coreference resolution is the task of identifying textual mentions (aka mention detection) and linking entities that refer to one another (aka coreference decision) in documents. Let’s look at this comic by Laura Yang and try to understand which entities are the same.

comics by Laura Yang on yangstercomics (with permission)

‘Daddy’, ‘He’ and ‘His’ all refer to the same entity. Identifying those mentions and linking them to their antecedent is what we call coreference resolution in NLP. I assume you are familiar with the coreference resolution task and models; if not, here is a good article by Gal Hever to give you more background.

Coreference is a basic task that feeds additional NLP tasks such as information extraction, question answering, summarisation, machine translation and more. Despite its importance as a basic NLP task, research had so far failed to produce decent models that run fast or work with limited resources. The model the authors introduced, ‘F-coref’ [1], achieves fast speed and works with limited resources thanks to a combination of hard-target distillation and an efficient batching implementation (which they named ‘leftover batching’).

Background — S2E and LINGMESS Models

A general neural coreference model is composed of three blocks: a contextualised encoder that embeds the words into vectors, a mention scorer, and, once we have the mentions, a coreference scorer.

General Co-reference model pipeline
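As a rough sketch, the pipeline looks like this (illustrative pseudocode of mine, not any particular model’s code):

# Schematic skeleton of a neural coreference pipeline (illustration only).
def coref_pipeline(tokens, encoder, mention_scorer, coref_scorer):
    vectors = encoder(tokens)                   # 1. contextualised token embeddings
    mentions = mention_scorer(vectors)          # 2. detect and score candidate mention spans
    clusters = coref_scorer(vectors, mentions)  # 3. link mentions to their antecedents
    return clusters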

The F-coref model is based on the S2E model [3]. S2E is a neural model that represents each span as a function of its start and end tokens. Its architecture includes: Longformer (a contextualised encoder), a parameterised mention scoring function (f_m) and a parameterised pairwise antecedent scoring function (f_a).

Notice that in order to save computation time, the antecedent function scores only the λT spans with the highest mention scores, where T is the number of tokens and λ is a pruning hyperparameter, set to 0.4 in order to achieve high mention recall.
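For intuition, here is a minimal sketch of that pruning step (my own illustration, with made-up sizes):

import torch

# Keep only the λT top-scoring candidate spans before pairwise scoring.
T = 1000                             # number of tokens in the document
lam = 0.4                            # pruning hyperparameter from the paper
mention_scores = torch.randn(5000)   # f_m scores for every candidate span

k = int(lam * T)                     # 400 spans survive pruning
top_scores, top_indices = torch.topk(mention_scores, k)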

The final score of a candidate mention pair is calculated as the sum of the two mention scores and the pairwise antecedent score: f(c, q) = f_m(c) + f_m(q) + f_a(c, q).

The score is calculated as the sum of the two mention scores and the pair antecedent score [1]

In total, the model is composed of 26 layers and 494M parameters.

A recent work, the LINGMESS model [4], improves coreference accuracy by introducing multiple scorers, one for each linguistic type of mention pair being scored. This results in SOTA accuracy, but the model is less efficient than S2E.

F-coref Model

Now let’s understand which elements of F-coref allow it to reach such efficiency. We’ll first talk about knowledge distillation and later about maximising parallelism.

Knowledge Distillation — Reducing the Model Size

Knowledge distillation is the process of transferring knowledge from a large model to a smaller one. When we are aiming to build fast models, we need fewer parameters. Knowledge distillation is a great method to reduce the number of parameters in a model while still maintaining most of its accuracy.

‘Teacher — Student Distillation’ photo by sofatutor on Unsplash

Teacher — Student Model

In order to train F-coref the authors used LINGMESS as the teacher model. The student model was built similarly to the S2E model, with a few modifications: fewer layers (and therefore fewer parameters), and Longformer, which is relatively slow, was replaced with DistilRoBERTa. In total, the number of parameters was reduced from 494M to 91M and the number of layers from 26 to 8.
They used hard-target knowledge distillation, meaning the teacher model acts as an annotator for the unlabelled data and the student model learns from these annotations, instead of the soft version where the student learns from the logits of the teacher model. The reason soft distillation did not work for the coreference task is mention pruning and the violation of transitivity (this is interesting and can help you when choosing your own distillation setup; you can read more about it in the paper [1]).
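To make the hard-target idea concrete, here is a toy sketch (my own illustration, not the authors’ training code) that uses the package’s LingMessCoref as a teacher to pseudo-label unlabelled text:

from fastcoref import LingMessCoref

# Teacher: the accurate-but-slower LingMess model annotates unlabelled text.
teacher = LingMessCoref(device='cuda:0')

unlabelled_texts = [
    'Alice lost her keys, so she retraced her steps to find them.',
    'Bob called his mother because he missed her.',
]

# Hard-target distillation: the teacher's predicted clusters become the
# training labels ('hard targets') for the smaller student model.
preds = teacher.predict(texts=unlabelled_texts)
pseudo_labels = [p.get_clusters() for p in preds]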

Maximising Parallelism — Make it Even Quicker

The authors used a lower λ for pruning (0.25 instead of 0.4), which they claim decreased the number of candidate pairs by a factor of 2.56 without hurting performance (indeed, (0.4/0.25)² = 2.56, since the number of pairs grows quadratically with the number of kept spans). They also used a new version of dynamic batching, which batches documents up to a certain token threshold, and which they named ‘leftovers batching’.

‘Leftover Batching’ Photo by Richad Bell on Unsplash

Leftovers batching

The most time-consuming part of the coreference model is the encoder. Traditionally, long documents are split into non-overlapping segments of max_length, and each segment is encoded. If a document is longer than max_length, it is split into two or more segments and the last segment is padded to max_length. This results in a high number of padded tokens.

The change the authors suggested is to create two separate batches: one for the full segments, which need no padding, and one for the leftover segments. Then, they pad the second batch to max_leftovers_length rather than padding the leftover segments to max_length. Notice that leftover batching can be useful for other ML tasks as well, as the sketch below illustrates.

Comparison between traditional batching and leftover batching [1]
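Here is a minimal sketch of the idea (my own illustration, not the package’s actual implementation), operating on lists of token ids:

def leftover_batching(docs, max_length):
    # Split tokenised documents into full segments and leftover segments.
    full_segments, leftovers = [], []
    for tokens in docs:
        # Non-overlapping segments of exactly max_length need no padding.
        for i in range(0, len(tokens) - max_length + 1, max_length):
            full_segments.append(tokens[i:i + max_length])
        remainder = len(tokens) % max_length
        if remainder:
            leftovers.append(tokens[-remainder:])
    # Pad leftovers only to the longest leftover, not to max_length.
    pad_to = max((len(seg) for seg in leftovers), default=0)
    padded_leftovers = [seg + [0] * (pad_to - len(seg)) for seg in leftovers]
    return full_segments, padded_leftovers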

F-coref Results

Now we want to test the new model’s performance in terms of inference time and F1 score. We expect the F1 score to be slightly lower than the SOTA coreference models’, but the inference time to be much shorter.

The experimental setup used the Multi-News and OntoNotes datasets to train the teacher-student model. The average F1 score F-coref reaches is 78.5. If we compare it to our two previously discussed models, F-coref degrades by 2.9 F1 points compared to LINGMESS (the teacher), and by 1.8 F1 points compared to the S2E model.

Next, the authors compared the inference time of each coreference model on 2.8K documents, averaged over three runs. We see that F-coref is an order of magnitude faster than the previous fastest model. The most significant reduction in time comes from the distilled model; batching and leftover batching then further reduce the time, down to only 25 seconds.

F-coref is an order of magnitude faster than the previous fastest model [1]

This tradeoff looks great for applied data science, where one has limited resources and needs quick inference time.

Fastcoref — an Intuitive Package

The authors also released a very intuitive package for you to use. Let’s see how to use it.

1. Install the package

The package is pip installable:

pip install fastcoref

2. Import the model

Import the F-coref model:

from fastcoref import FCoref

model = FCoref(device='cuda:0')

3. Predict coreference

preds = model.predict(
    texts=['Fastcoref is a great package. I am glad Noa introduced it to me.']
)

preds[0].get_clusters()
>> [['Fastcoref', 'it'], ['I', 'me']]

When comparing the inference time of our simple example on CPU between the F-coref and LingMessCoref models, we see that F-coref needs about 1/5 of the wall time and 1/18 of the CPU time.

F-coref

CPU times: user 595 ms, sys: 22 ms, total: 617 ms
Wall time: 494 ms

LingMessCoref

CPU times: user 10.7 s, sys: 617 ms, total: 11.3 s
Wall time: 2.42 s
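If you want to reproduce such a comparison yourself, here is a minimal sketch (in a notebook you would typically use %%time; note the first call also pays a one-time model download and load cost):

import time
from fastcoref import FCoref, LingMessCoref

text = 'Fastcoref is a great package. I am glad Noa introduced it to me.'

for model_cls in (FCoref, LingMessCoref):
    model = model_cls(device='cpu')   # CPU, to match the numbers above
    start = time.perf_counter()       # time only the prediction itself
    model.predict(texts=[text])
    print(model_cls.__name__, f'{time.perf_counter() - start:.2f}s')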

Notice that you can also distill and train your own model (see the GitHub repo for instructions).
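At the time of writing, the package exposes a trainer for this; the sketch below follows the spirit of the repo’s README, but treat the exact class names, arguments and file paths as assumptions to verify against the repo:

from fastcoref import TrainingArgs, CorefTrainer

# Hypothetical paths; the train file should contain texts annotated with
# clusters (e.g. produced by a teacher model, for hard-target distillation).
args = TrainingArgs(
    output_dir='my-fcoref-model',
    model_name_or_path='distilroberta-base',  # the student's backbone
    device='cuda:0',
)
trainer = CorefTrainer(
    args=args,
    train_file='train_with_clusters.jsonl',
)
trainer.train()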

Conclusion

If you have a coreference task you need to run at scale, I believe fastcoref is the right package to use. It is easy to use, fast, and stays close to SOTA accuracy.

Resources

1. F-COREF: Fast, Accurate and Easy to Use Coreference Resolution (Otmazgin et al.)

2. Fastcoref Package (Otmazgin et al.)

3. Coreference Resolution without Span Representations (Kirstain et al., ACL 2021)

4. LingMess: Linguistically Informed Multi Expert Scorers for Coreference Resolution (Otmazgin et al.)


Data science manager, AI researcher, space enthusiast and social entrepreneur. I hope this blog helps you navigate your way into the incredible world of AI.