Using NLP to understand laws

An unsupervised analysis of the Accessibility for Ontarians with Disabilities Act

Serena Peruzzo
Towards Data Science


The process of legal reasoning and decision making is heavily reliant on information stored in text. Tasks like due diligence, contract review, and legal discovery are traditionally time-consuming and can benefit from NLP models: automating them saves a huge amount of time. But can NLP also be leveraged to improve the public's understanding of laws?

The idea is to obtain an abstract representation of laws that makes it easier to extract the rules and obligations defined in the text, understand which entities are responsible for compliance, highlight patterns of similarity across industries and differences between public and private responsibilities, and even identify parts of the text that are unclear.

Challenges

Working with laws and regulations adds a few layers of difficulty to the analysis:

  • Language parsing and tokenization are made harder by the use of formatting, abbreviations, and references that are specific to legal documents.
  • The vocabulary is relatively limited and very specialized, but the interpretation is highly sensitive to the context and there are no industry-specific pre-trained models that incorporate semantic analysis.
  • The syntax of sentences is often complex and non-linear, making information extraction more difficult.
Extract from the Accessibility for Ontarians with Disabilities Act (AODA): bullet points, references, and general formatting break out-of-the-box tokenization algorithms.

Framework

To overcome these challenges, and in the absence of a labelled set, we developed a methodology that, instead of training a single model, combines a rule-based system with elements of a standard NLP pipeline and unsupervised machine learning, defining a framework for analysis that can be generalized to various domains.

Modules in the NLP pipeline that are relevant to the framework are listed below, with a short code sketch after the list:

  • Tokenizer: splits a document into units called tokens, while discarding non-informative characters like spaces and punctuation.
  • Lemmatizer: reduces inflectional forms, mapping words to their dictionary form.
  • Part-of-speech tagger: assigns each token to a group of words that share similar grammatical properties (parts of speech).
  • Dependency parser: identifies the grammatical structure of a sentence by finding head words and the words that modify those heads, building a tree of grammatical dependencies.
Source: spaCy
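For reference, this is roughly what those modules look like in code, sketched here with spaCy (the specific model name is an assumption, not necessarily the one used in the project):

```python
# A minimal sketch of the pipeline modules using spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")  # bundles tokenizer, lemmatizer, POS tagger, parser

doc = nlp("Obligated organizations shall keep a record of the training provided.")
for token in doc:
    # text (tokenizer), lemma (lemmatizer), POS tag, dependency label and head (parser)
    print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)
```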

Within this framework, our goals are to extract the rules defined in the legislation, identify the entities responsible for compliance, and organize the rules into homogeneous groups.

The three stages of the analysis

In order to test this approach, we produced a proof of concept based on the Accessibility for Ontarians with Disabilities Act (AODA), a bill passed in 2005 that defines rules and requirements for accessibility and sets out processes for eliminating barriers for people with disabilities in Ontario.

Rules extraction

Our first objective is to automate the process of scanning the text of a law and extracting sentences that define a rule. In the context of AODA, we’re particularly interested in burdens — i.e. requirements or obligations that organizations have to comply with.

To get around the problem of not having a labelled set to train on, we build a lightweight ontology by identifying the verbs that express a rule or obligation. This can be done by querying WordNet for synonyms of verbs that express obligation, e.g. shall, must, oblige. Sentences that contain one of these verbs are labelled as burdens.
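As a rough illustration, here is a hedged sketch of that heuristic using NLTK's WordNet interface; the seed list and helper function are illustrative assumptions, not the exact ontology used in the project:

```python
# Expand a small seed list of obligation verbs through WordNet synonyms, then flag
# any sentence containing one of them as a burden.
# Requires: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

seed_verbs = ["oblige", "require", "compel"]   # illustrative seed list
obligation_verbs = {"shall", "must"}           # modal verbs handled separately
for verb in seed_verbs:
    for synset in wn.synsets(verb, pos=wn.VERB):
        obligation_verbs.update(lemma.name().replace("_", " ") for lemma in synset.lemmas())

def is_burden(sentence_lemmas):
    """Label a sentence as a burden if any of its lemmatized tokens is an obligation verb."""
    return any(tok in obligation_verbs for tok in sentence_lemmas)

print(is_burden(["obligated", "organization", "shall", "keep", "record"]))  # True
```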

This is a coarse classification rule, but in this case the fact that sentences follow a well-defined template and use a somewhat limited vocabulary works in our favour. On AODA, it achieves 0.89 overall accuracy and 0.97 recall on burdens.

Subjects extraction

The next step is to identify the entities responsible for complying with the burdens extracted. This is equivalent to identifying the grammatical subject of each sentence, where the subject is the word or phrase that indicates who or what performs the action of the verb.

The dependency parser can be used to identify the token that acts as the subject of the verb. However, the parser alone won’t be enough to identify the subject when it’s a phrase. Here’s an example:

Obligated organizations that are school boards or educational or training institutions shall keep record of the training provided.

In this sentence, the subject is the phrase obligated organizations that are school boards or educational or training institutions, but the dependency parser only tags organizations as the noun subject (nsubj).

In practice, the subject of the sentence includes all the dependencies of the nsubj "organizations".

A possible solution here is to use the dependency tree to find the subject of the sentence, and then use breadth-first search to navigate the tree and find all the tokens that are related to the subject by a parent-child relationship. This gives us all the words involved in defining the entities responsible for complying with the rules extracted at the previous step.
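A minimal sketch of this idea with spaCy, under the assumption that the parse matches the example above (the model name and helper function are illustrative):

```python
# Find the token tagged nsubj, then breadth-first search its dependency subtree
# to recover the full subject phrase.
from collections import deque
import spacy

nlp = spacy.load("en_core_web_sm")

def subject_phrase(doc):
    subj = next((tok for tok in doc if tok.dep_ == "nsubj"), None)
    if subj is None:
        return None
    collected, queue = [], deque([subj])
    while queue:                      # BFS over the parent-child relations
        node = queue.popleft()
        collected.append(node)
        queue.extend(node.children)
    return " ".join(tok.text for tok in sorted(collected, key=lambda t: t.i))

doc = nlp("Obligated organizations that are school boards or educational or "
          "training institutions shall keep record of the training provided.")
print(subject_phrase(doc))
# parser permitting, this prints the phrase starting with "Obligated organizations ..."
```

spaCy also exposes Token.subtree, which yields the same subtree in document order; the explicit BFS simply mirrors the description above.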

Subjects clustering

Once the entities responsible have been identified, we want to arrange them into homogeneous groups. This is done by passing the subjects as input to k-means clustering, but before we can proceed with the clustering, the subjects need to be projected into a vector space. This pre-processing includes:

  • Normalization: tokens are lemmatized and stop words are removed.
  • Vector representation: word embeddings (GloVe) are used to project the normalized words into a semantic space; the word vectors for each subject are then averaged to obtain a single-vector representation for each of them.
  • Dimensionality reduction: spectral decomposition is used to reduce the number of dimensions, keeping only the first two components.

Finally, we run the subject-vectors through k-means and extract three groups.
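A sketch of this stage under stated assumptions: spaCy's medium model vectors stand in for GloVe embeddings, PCA stands in for the spectral decomposition, and the subject strings are placeholders rather than text taken from AODA.

```python
import numpy as np
import spacy
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

nlp = spacy.load("en_core_web_md")  # medium model ships with word vectors

def subject_vector(text):
    """Drop stop words and punctuation, then average the word vectors
    (lemmatization omitted for brevity)."""
    doc = nlp(text)
    vectors = [tok.vector for tok in doc if not (tok.is_stop or tok.is_punct)]
    return np.mean(vectors, axis=0)

subjects = [
    "obligated organizations",
    "the transportation service provider",
    "the minister responsible for accessibility",
    "every employer with more than fifty employees",
]  # placeholder subjects

X = np.vstack([subject_vector(s) for s in subjects])
X_2d = PCA(n_components=2).fit_transform(X)            # keep the first two components
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X_2d)
print(labels)
```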

Results

K-means partitions the data into groups such that each data point is assigned to the cluster with the nearest mean, which means the averages of the clusters (their centroids) can be used as prototypes for the groups.

We're looking for a representation that simplifies the interpretation and understanding of the rules we've extracted, and we are particularly interested in differentiating between private and public responsibilities, i.e. burdens that fall on private businesses and burdens that are the responsibility of government agencies.

The 2D plot of the subject vectors indicates that the groups are well separated, but to really understand what the clusters represent, we look at the tf-idf of the centroids. This represents the average frequency of a word in a subject, weighted by a quantity that shrinks the more widely the word is used across all subjects.
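As a rough illustration, this inspection can be sketched as follows; the helper below (names and the top-20 cutoff are illustrative) averages the tf-idf vectors within each cluster and ranks the centroid's terms:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_terms_per_cluster(subjects, labels, top_k=20):
    """Return the top-weighted tf-idf terms of each cluster centroid."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(subjects).toarray()
    terms = np.array(vectorizer.get_feature_names_out())
    labels = np.array(labels)
    top_terms = {}
    for cluster in np.unique(labels):
        centroid = tfidf[labels == cluster].mean(axis=0)   # average tf-idf = centroid
        top_terms[cluster] = terms[np.argsort(centroid)[::-1][:top_k]].tolist()
    return top_terms
```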

The distribution of tf-idf for the top 20 words of the centroid in Group #1 is very skewed to the right and dominated by the words transportation, service, and provider, so it's not surprising that ~90% of these burdens come from the section on transportation standards. This group is about 20% of all the burdens we have extracted and makes a lot of sense, as we're talking about accessibility.

The skew in the top 20 words is smaller in Group #2, with the distribution being a bit closer to uniform. Most of the words in this group refer to physical objects, with surface and ramp having the highest scores and many others like trail and stair following. Of the total burdens we extracted in the first stage of the analysis, 25% are in this group, and 80% of them come from a section on the design of public spaces. Once again, there is no indication of a distinction between requirements that fall on public or private entities.

In the last group, the highest tf-idf score goes, by a long shot, to organization, while the differences between all the others are much smaller. These other words are a mix of pointers to government and non-government entities: we have minister and municipality but also employer and person. These burdens are 50% of the total, come from a variety of sections, and primarily point at administration, compliance, and standards, but it's unclear whether there's a distinction between public and private obligations.

Summary

We started out without a labelled set but were still able to build a generic approach that allowed us to automate the extraction of rules and find the burdens defined by the legislation with good accuracy. By extracting the grammatical subjects of the sentences, we were able to identify the entities that are affected by the legislation, and finally we organized the rules into homogeneous groups that helped us understand the focus of this legislation and even find parts of the text that are ambiguous and need clarification.

This is a first step towards an abstract representation of laws and can serve the purpose of improving law interpretability in at least a couple of ways.

On one hand, it helps extract and summarize information so that rules and requirements can be made more accessible to anyone who needs to follow them, whether they're an organization with a legal department or a regular person. On the other hand, it can help lawmakers by highlighting parts of legislation that are ambiguous and could be rewritten or adapted to be clearer and more accessible.

Ultimately, this framework can be an instrument both for helping people understand existing legislation and for improving the way laws are written, so that going forward our legislators can write laws that are easier to understand and interpret, and therefore more accessible to everyone.
