How to Set up a Machine Learning Model for Legal Contract Review — Part 2

Overcoming the infamous 512 token limit

Heiko Hotz
Towards Data Science


Photo by Raphael Schaller on Unsplash

What is this about?

In a previous blog post we looked at how to get started with the newly released CUAD dataset, which helps automate contract reviews. We loaded the model and ran a first prediction on a short extract (the first 100 words) of a contract. As mentioned in that article, the kind of NLP models we use for this task usually have a 512 word limit. This means that the model we have set up cannot scan the entire contract for information; it is limited to an extract of the contract that is shorter than 512 words.

In this blog post we will have a look at how to overcome this limitation so that the model can search the entire contract for key information lawyers are interested in.

Side note: Technically speaking, the limit is not 512 words but 512 tokens. Tokens are the building blocks of natural language and therefore of NLP models. They are usually created by splitting words into subwords:

Tokens for the word “Highlight”. The first and last token are special tokens to identify the start and end of a text.

For our purpose this technical distinction is largely irrelevant and I will use the terms tokens and words interchangeably in this article.
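To make this concrete, here is a minimal sketch of subword tokenization with the Hugging Face transformers library. Using the roberta-base tokenizer is an illustrative assumption on my part (the CUAD models are RoBERTa-based); it is not necessarily the exact tokenizer from the previous post:

```python
from transformers import AutoTokenizer

# Load a subword tokenizer; "roberta-base" is an illustrative choice
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# A single word can be split into several subword tokens
print(tokenizer.tokenize("Highlight"))

# Encoding also wraps the text in special start and end tokens
ids = tokenizer("Highlight")["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
```

Counting tokens this way (rather than words) is what determines whether a text fits within the model's limit.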

What is the problem we are trying to solve?

Last time we searched for a particular piece of information (the date of the contract) within the first 100 words of a contract. This worked because we were well within the 512 word limit. However, if we tried to find this information in the entire contract, we would exceed that limit. In that case we receive an error message like the one below:

Error message when the contract is too long

The error message informs us that the number of tokens for the contract (7,548) is greater than the maximum length allowed for this model, which is 512.

For many similar question answering (Q&A) models this is not a problem, because the related paragraphs in those Q&A pairs are either shorter than 512 words or can be truncated to fit within the word limit without losing crucial information. Examples of those types of Q&A tasks can be found in the Stanford Question Answering Dataset (SQuAD). For example, these paragraphs about Southern California are all shorter than 512 words.

How many words in a contract?

For legal contracts, the situation is quite different. Looking at the contracts included in the CUAD dataset, we find that only 3.1% are shorter than 512 words.

Code to identify how many contracts are shorter than 512 words. For the CUAD dataset this will be 3.1%.

We can also see that we will run into the 512 word limit with most contracts by plotting a histogram of contracts by length:

Contracts by length
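As a rough sketch of the check above, assuming the contracts have already been loaded as plain-text strings (the actual notebook reads them from the CUAD files), we can count whitespace-separated words and compute the share under 512. The same list of lengths also feeds the histogram:

```python
# Hypothetical list of contract texts; in practice these come from the CUAD dataset
contracts = ["Short agreement ...", "word " * 7548]

# Approximate the length of each contract by splitting on whitespace
lengths = [len(text.split()) for text in contracts]

# Fraction of contracts that fit within the 512-word limit
share_short = sum(1 for n in lengths if n < 512) / len(lengths)
print(f"{share_short:.1%} of contracts are shorter than 512 words")

# Histogram of contracts by length (requires matplotlib)
# import matplotlib.pyplot as plt
# plt.hist(lengths, bins=50)
# plt.xlabel("Words per contract")
# plt.ylabel("Number of contracts")
# plt.show()
```

Note that splitting on whitespace undercounts tokens (subwords), so the true share of contracts that fit within the model's limit is even smaller.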

Overcoming the 512 word limitation

To overcome this limitation, a contract has to be split into several parts of at most 512 words. The model can then analyse each part individually, and the results can be aggregated to derive a final prediction. Luckily, when the model was trained on the contracts in the CUAD dataset, its developers had to overcome the same challenge. This means that by identifying the relevant pieces of the training code we can reuse the same logic for our prediction task. I have compiled the relevant code snippets in a Python script, and in this section I will walk through its crucial parts.
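Conceptually, the splitting works like a sliding window over the token sequence: consecutive chunks of at most max_len tokens that overlap by a stride, so that an answer straddling a chunk boundary is still fully contained in at least one chunk. Here is a minimal sketch; the function name and parameters are my own for illustration (squad_convert_examples_to_features() handles this internally):

```python
def split_into_chunks(tokens, max_len=512, stride=128):
    """Split a token list into overlapping windows of at most max_len tokens."""
    chunks = []
    step = max_len - stride  # how far the window advances each time
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks

tokens = [f"tok{i}" for i in range(1000)]
chunks = split_into_chunks(tokens)
print(len(chunks))                      # 3 overlapping chunks for 1,000 tokens
print(len(chunks[0]), len(chunks[-1]))  # 512 232
```

The overlap means some tokens are scored twice, which is exactly why the per-chunk predictions later have to be aggregated.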

The function that converts the questions and the contract into features the model accepts is squad_convert_examples_to_features(). Its documentation and implementation can be found on the Hugging Face website. As we can see from the documentation, this function converts a list of questions and contracts into features for the model:

Converting the questions and contract into features for the model

The resulting features are then loaded into a DataLoader and fed into the model in batches. The model predicts start and end logits just like the example we saw in the first blog post:
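In outline, the batched prediction step looks something like the sketch below. The random tensors and logits are stand-ins of my own: in the actual script the features come from squad_convert_examples_to_features() and the logits from the fine-tuned Q&A model:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in features: 10 contract chunks of 512 token ids each
input_ids = torch.randint(0, 1000, (10, 512))
attention_mask = torch.ones_like(input_ids)
dataset = TensorDataset(input_ids, attention_mask)
loader = DataLoader(dataset, batch_size=4)

all_start_logits, all_end_logits = [], []
for batch_ids, batch_mask in loader:
    with torch.no_grad():
        # A real model would be called here, e.g.:
        # outputs = model(input_ids=batch_ids, attention_mask=batch_mask)
        # Stand-in logits with the right shape: one score per token position
        start_logits = torch.randn(batch_ids.shape)
        end_logits = torch.randn(batch_ids.shape)
    all_start_logits.append(start_logits)
    all_end_logits.append(end_logits)

print(torch.cat(all_start_logits).shape)  # torch.Size([10, 512])
```

Each chunk ends up with its own row of start and end logits, which is the input to the aggregation step below.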

Making predictions on the contract chunks

The resulting start and end logits belong to the individual parts of the contract and have to be aggregated into one final model prediction. For this we can utilise the function compute_predictions_logits():
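The idea behind the aggregation can be sketched in plain Python: for each chunk, score every candidate span by the sum of its start and end logits, then keep the best-scoring span across all chunks. This is a deliberate simplification of my own; compute_predictions_logits() additionally handles n-best lists, null answers, and mapping token positions back to the original text:

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (score, start, end) of the highest-scoring valid span in one chunk."""
    best = (float("-inf"), 0, 0)
    for i, s in enumerate(start_logits):
        # Only consider spans that end after they start and are not too long
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = s + end_logits[j]
            if score > best[0]:
                best = (score, i, j)
    return best

# Hypothetical logits for two chunks of a contract
chunks = [
    ([0.1, 2.0, 0.3], [0.2, 0.1, 1.5]),
    ([0.0, 0.4, 3.0], [0.1, 0.2, 2.5]),
]

# The final prediction is the best span over all chunks
results = [(best_span(s, e), idx) for idx, (s, e) in enumerate(chunks)]
(score, start, end), chunk_idx = max(results)
print(chunk_idx, start, end)  # 1 2 2 — the best span lives in chunk 1
```

Because the chunks overlap, the same answer may be found in two chunks; taking the maximum over all candidate spans resolves this naturally.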

Bringing everything together for the final prediction

Conclusion

In this blog post we overcame the 512 word limitation of the NLP model for contract reviews. This is important because the key information lawyers are looking for in a contract could be anywhere in the document and, as we have seen, most contracts are much longer than 512 words.

The code to overcome this limitation is encapsulated in this script. An example on how to utilise this script to answer all 41 questions of the CUAD dataset on a sample contract can be found in this notebook. The output of this notebook contains all 41 questions and the respective model predictions. It is rather long, so I have saved the model predictions in this text file for easy review.

The first 5 questions and model predictions

With these resources you will now be able to spin up a model for legal contract review yourself. You can upload a contract, provided it’s in text format, and run the model to find key information in the contract, just like we did here.

In a future blog post we will have a look at how to set up a demo website for contract review with Streamlit. This will make it even easier to upload a contract and run the model through an easy-to-use web interface, very similar to this example.
