A Bigram Analysis of the EU General Data Protection Regulation

Visualised with a word cloud

H Lee
Towards Data Science


Image by Mohamed Hassan from Pixabay

In effect since May 2018, the General Data Protection Regulation (GDPR) is by far the most comprehensive data protection law in the EU, and possibly the entire world. In this tutorial intended for beginners, I will visualise the full text of the GDPR as a bigram word cloud in R using RWeka.

Word cloud and its weakness

A word cloud is a simple yet powerful tool for text visualisation. It prints keywords in sizes proportional to how often they appear in the data set. It is intuitive and visually pleasing. But fatally, it takes words out of context.

For example, in the representation above, it is easy to understand that “authority” is a prominent word in our data set. But because it has been stripped of its surrounding words, we don’t know if it’s used in the sense of a right as in “authority to declare war,” a government body as in “local authority,” or a source of citation as in “binding authority.”

Bigram and the GDPR

One way to mitigate this problem is to make a two-word word cloud. Of course, even bigrams (as two-word compounds are technically known) don’t fully capture the nuance and context of the original, full-sentenced text data. But luckily for us, many of the central concepts in the GDPR consist of two words: “personal data,” “data subject,” “supervisory authority,” and so on.

It is therefore desirable to conduct a bigram analysis instead of the default unigram analysis, which would semantically rip apart those notions that make sense only in their compound form.

Data import and corpus creation

Let us begin by installing and loading the necessary packages.
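The exact set of packages is a matter of preference; the sketch below assumes the word cloud itself will be drawn with the wordcloud package, and note that RWeka needs a working Java installation.

# Install once, then load at the start of every session
install.packages(c("tm", "RWeka", "ggplot2", "wordcloud"))
library(tm)        # text mining infrastructure
library(RWeka)     # n-gram tokenisation (requires Java)
library(ggplot2)   # exploratory bar graph
library(wordcloud) # word cloud rendering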

The tm package is a widely used text mining tool for refining raw text data. RWeka is an R interface to Weka, a collection of machine learning algorithms that also covers otherwise complicated pre-processing techniques. In this tutorial, I will use it specifically to group pairs of words into bigrams. We will then draw an exploratory graph with ggplot2 and finally move on to print our word cloud.

First, load the text data and check if everything is intact. The text begins with the official long title of the legislation and ends with the name of then-EU Council President.

It is always a good habit to inspect the first few lines and the last few lines of the data.
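A sketch of this step; the file name gdpr.txt is a placeholder for wherever you have saved the plain-text version of the regulation.

# Read the regulation line by line (file name is illustrative)
gdpr_raw <- readLines("gdpr.txt", encoding = "UTF-8")
head(gdpr_raw)  # should begin with the official long title
tail(gdpr_raw)  # should end with the signatories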

Then, we create a corpus using the VCorpus() function. A corpus is the abstract data structure in which the tm package holds documents. In terms of storage, the VCorpus (volatile corpus) keeps documents in memory, whereas the PCorpus (permanent corpus) uses a database outside of R. For our purposes, the VCorpus is adequate; just bear in mind that the corpus object will be lost upon clearing the memory or shutting down RStudio.
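Assuming the gdpr_raw character vector from the previous step, the corpus is created in one line.

# Wrap the character vector in a source object and build a volatile corpus
gdpr_corpus <- VCorpus(VectorSource(gdpr_raw))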

We’ve now created a corpus from the text file and are ready to mine.

Preprocessing

As we’ve seen from the results of the head() and tail() functions above, the original text data is full of orthographic inconsistencies in the eyes of a machine. There are numbers, punctuation marks, and capital letters, and the text also appears to be double-spaced. We will homogenise it with the tm package to make it easier to work with.

The tm_map() function is the silver bullet for simple text refinement. By passing arguments like removeNumbers and removePunctuation as its second parameter, the function literally removes numbers and punctuation from the corpus. To alter the content rather than simply delete it, e.g. to convert uppercase to lowercase, we wrap the transformation in content_transformer().
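In outline, and assuming the gdpr_corpus object created above, these first transformations might look like this:

# Strip numbers and punctuation, then fold everything to lowercase
gdpr_corpus <- tm_map(gdpr_corpus, removeNumbers)
gdpr_corpus <- tm_map(gdpr_corpus, removePunctuation)
gdpr_corpus <- tm_map(gdpr_corpus, content_transformer(tolower))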

Then there are stop words. Stop words are words that should be filtered out because of their semantic negligibility in the natural language source. Despite their lack of substantive meaning, they often occur numerous times in a text and skew the results of a frequency analysis. Examples include the articles “the” and “a/an” and prepositions like “on,” “of,” and “about.”

The tm package provides sets of collated stop words in various languages. To access the English set, we pass "english" to stopwords(). Because this only contains the most common examples, I’ve also added a handful of custom stop words, as in the sketch below. With removeWords, the selected stop words are removed. Finally, with stripWhitespace, we rid the corpus of empty lines and superfluous spaces.
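A sketch of this step; the custom stop words below are illustrative placeholders rather than the exact list used in the original analysis.

# Remove standard English stop words plus a few custom ones (illustrative)
custom_stopwords <- c("shall", "whereas", "thereof", "pursuant")
gdpr_corpus <- tm_map(gdpr_corpus, removeWords, stopwords("english"))
gdpr_corpus <- tm_map(gdpr_corpus, removeWords, custom_stopwords)
# Collapse the extra spacing left behind by the removals
gdpr_corpus <- tm_map(gdpr_corpus, stripWhitespace)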

The twenty-first line from the same text now looks “clean” enough for the RWeka package to bundle into meaningful bigrams.
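To verify, we can print that document from the corpus, assuming each line of the original file became one document when the corpus was built.

# Inspect the twenty-first document after cleaning
writeLines(as.character(gdpr_corpus[[21]]))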

Before we proceed, however, we must also perform basic lemmatisation to prevent counting different forms of the same word as separate words. Lemmatisation is the process of consolidating the inflections of a word into its standard dictionary form known as the lemma. It helps our machine treat, for example, “runs,” “ran,” and “running” all as the canonical instance of “run.”

There are R packages like textstem to facilitate this process. Feel free to experiment with them, but keep in mind that some lemmatisation functions can do more harm than good. For instance, they might butcher “General Data Protection Regulation” into “General Datum Protect Regulate.” In our case, all we want is to treat the plural forms of certain bigrams as singular, so it might not be worth risking such a destructive makeover.

To replace the plural form with the singular form, we provide the gsub function as the argument for content_transformer(). The pattern parameter receives the pattern of characters to be replaced, and the replacement parameter defines how that pattern is to be altered.
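For instance, folding the plural “data subjects” into its singular form might look like this; the pattern is illustrative, and the same call can be repeated for any other plural bigram of interest.

# Replace an illustrative plural bigram with its singular form
gdpr_corpus <- tm_map(gdpr_corpus, content_transformer(gsub),
                      pattern = "data subjects", replacement = "data subject")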

Unfortunately, the RWeka package does not evaluate the semantic viability of a bigram and simply returns every word pair it can identify in the corpus. It is essential to check the top results and eradicate nonsensical groupings.

Tokenisation

In order to generate bigrams with RWeka, the corpus must be tokenised. Tokenisation is the process of splitting a string into smaller units called tokens. A token can be a word, several words, a sentence, or any other logical segment. In our bigram analysis, a token should consist of exactly two words, so we set both the min and max parameters of Weka_control() inside NGramTokenizer() to 2.
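A bigram tokeniser under those settings might be defined as follows.

# A tokeniser that splits each document into two-word tokens
BigramTokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}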

We then shoehorn the data into a document-term matrix, a mathematical matrix that records the frequency of each term for each document in its rows and columns. Next, we sort the term frequencies in descending order to arrange the bigrams from the most frequent to the least, and print the top ten and bottom ten results.
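Under the assumptions above, building and sorting the matrix might look like this.

# Build a document-term matrix whose terms are bigrams
gdpr_dtm <- DocumentTermMatrix(gdpr_corpus,
                               control = list(tokenize = BigramTokenizer))
# Sum the frequencies across documents and sort from most to least frequent
freq <- sort(colSums(as.matrix(gdpr_dtm)), decreasing = TRUE)
head(freq, 10)  # ten most frequent bigrams
tail(freq, 10)  # ten least frequent bigrams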

We’ve successfully extracted bigrams from the corpus.

Exploratory visualisation with ggplot2

Let us briefly visualise this in a bar graph with ggplot2.

This is going to be a bar graph displaying the fifteen most frequent bigrams. I’ve labelled the x-axis “Bigrams” and the y-axis “Frequency,” and adjusted the position and angle of the terms along the x-axis.
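One way to draw such a graph, assuming the freq vector from the previous step:

# Put the fifteen most frequent bigrams into a data frame for ggplot2
top15 <- data.frame(bigram = names(freq)[1:15], frequency = freq[1:15])

ggplot(top15, aes(x = reorder(bigram, -frequency), y = frequency)) +
  geom_col() +
  labs(x = "Bigrams", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))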

The graph shows the overwhelming frequency of the top two bigrams. This means our word cloud is going to be neither balanced nor pretty, but it is perhaps unsurprising that there are disproportionately many invocations of “personal data” in the world’s most extensive piece of personal data legislation.

Word cloud

Let’s move on to printing the word cloud itself. First, we print everything as it is to examine the shape of our data; next, we set the minimum frequency to 2 to discard the statistically inconsequential bigrams that appear only once. Finally, we can apply some colour formatting to accentuate the major bigrams that appear more than 20 times in dark blue while leaving the rest in light blue. The three calls below are a sketch of those steps, assuming the freq vector built earlier.

# Print all bigrams
wordcloud(names(freq), freq, min.freq = 1, random.order = FALSE)
# Minimum frequency 2
wordcloud(names(freq), freq, min.freq = 2, random.order = FALSE)
# Colour-coded: dark blue for bigrams appearing more than 20 times, light blue otherwise
wordcloud(names(freq), freq, min.freq = 2, colors = ifelse(freq > 20, "darkblue", "lightblue"), ordered.colors = TRUE, random.order = FALSE)

If we were to treat the first two data points as outliers for the moment, we could also come up with a more balanced representation.
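One way to produce that variant is simply to drop the two dominant bigrams before plotting, for example:

# Exclude the two most frequent bigrams and redraw the cloud
freq_trimmed <- freq[-(1:2)]
wordcloud(names(freq_trimmed), freq_trimmed, min.freq = 2, random.order = FALSE)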

As acknowledged, bigram word clouds do not overcome the inherent shortcomings of word clouds or frequency analysis. They could, however, offer doubly interesting insight into a text.
