Stylometry is the quantitative study of literary style through computational text analysis. It’s based on the idea that we all have a unique, consistent, and recognizable style in our writing. This includes our vocabulary, our use of punctuation, the average length of our words and sentences, and so on.
A typical application of stylometry is authorship attribution. This is the process of identifying the author of a document, such as when investigating plagiarism or resolving disputes on the origin of a historical document.
In this Quick Success Data Science project, we’ll use Python, seaborn, and the Natural Language Toolkit (NLTK) to see if Sir Arthur Conan Doyle left behind a linguistic fingerprint in his novel, The Lost World. More specifically, we’ll use semicolons to determine whether Sir Arthur or his contemporary, H.G. Wells, is the likely author of the book.
The Hound, The War, and The Lost World
Sir Arthur Conan Doyle (1859–1930) is best known for the Sherlock Holmes stories. H. G. Wells (1866–1946) is famous for several groundbreaking science fiction novels, such as The Invisible Man.
In 1912, Strand Magazine published The Lost World, a serialized version of a science fiction novel. Although its author is known, let’s pretend it’s in dispute and it’s our job to solve the mystery. Experts have narrowed the field down to two authors: Doyle and Wells. Wells is slightly favored because The Lost World is a work of science fiction and includes troglodytes similar to the Morlocks in his 1895 book, The Time Machine.
To solve this problem, we’ll need representative works for each author. For Doyle, we’ll use The Hound of the Baskervilles, published in 1901. For Wells, we’ll use The War of the Worlds, published in 1898.
Fortunately for us, all three novels are in the public domain and available through Project Gutenberg. For convenience, I’ve downloaded them to this Gist and stripped out the licensing information.
The Process
Authorship attribution requires the application of Natural Language Processing (NLP). NLP is a branch of linguistics and artificial intelligence concerned with giving computers the ability to derive meaning from written and spoken words.
The most common NLP tests for authorship analyze the following features of a text:
- Word/Sentence length: A frequency distribution plot of the length of words or sentences in a document.
- Stop words: A frequency distribution plot of stop words (short, noncontextual function words like the, but, and if).
- Parts of speech: A frequency distribution plot of words based on their syntactic functions (such as nouns and verbs).
- Most common words: A comparison of the most commonly used words in a text.
- Jaccard similarity: A statistic for gauging the overlap between two sets of tokens (see the short sketch after this list).
- Punctuation: A comparison of the use of commas, colons, semicolons, and so on.
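To make that last statistic concrete, here's a minimal sketch of a Jaccard similarity check. The jaccard() helper and the two sample word lists are illustrations only, not part of this project's code:

def jaccard(tokens_a, tokens_b):
    """Return the Jaccard similarity of two token collections."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical example comparing two short passages as word lists:
sample_1 = "the moor was silent and the hound was not".split()
sample_2 = "the martians were silent but the heat ray was not".split()
print(jaccard(sample_1, sample_2))  # closer to 1.0 means more shared vocabulary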
For this toy project, we’ll focus on the punctuation test. Specifically, we’ll look at the use of semicolons. Because semicolons are largely "optional" grammatical elements, how often an author reaches for them is potentially distinctive. They’re also agnostic to the type of book, unlike question marks, which may be more abundant in detective novels than in classic science fiction.
The high-level process will be to:
- Load the books as text strings,
- Use NLTK to tokenize (break out) each word and punctuation mark,
- Extract the punctuation marks,
- Assign semicolons a value of 1 and all other marks a value of 0,
- Create 2D NumPy arrays of the numerical values,
- Use seaborn to plot the results as heat maps.
We’ll use each heat map as a digital "fingerprint." Comparing the fingerprints for the known books with the unknown book (The Lost World) will hopefully suggest one author over the other.

The Natural Language Toolkit
Multiple third-party libraries can help you perform NLP with Python. These include NLTK, spaCy, Gensim, Pattern, and TextBlob.
We’re going to use NLTK, which is one of the oldest, most powerful, and most popular. It’s open source and works on Windows, macOS, and Linux. It also comes with highly detailed documentation.
Installing Libraries
To install NLTK with Anaconda, use:
conda install -c anaconda nltk
To install it with pip, use:
pip install nltk
(or see https://www.nltk.org/install.html)
We’ll also need seaborn for plotting. Built on matplotlib, this open-source visualization library provides an easier-to-use interface for drawing attractive and informative statistical graphs such as bar charts, scatterplots, heat maps, and so on. Here are the installation commands:
conda install seaborn
or
pip install seaborn
The Code
The following code was written in JupyterLab and is presented cell by cell.
Importing Libraries
Among the imports, we’ll need the punctuation characters from the string module. We’ll turn this string of characters into a set datatype for faster membership checks.
import math
from string import punctuation
import urllib.request
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns
import nltk
PUNCT_SET = set(punctuation)
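One note before moving on: NLTK’s word_tokenize() function relies on the Punkt tokenizer models, which are downloaded separately from the library itself. If you’ve never used NLTK before, you may need a one-time download (the exact resource name can vary between NLTK versions):

nltk.download('punkt')  # one-time download of the Punkt tokenizer models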
Defining a Function to Load Text Files
Since we’ll be working with multiple files, we’ll start by defining some reusable functions. The first one uses Python’s urllib library to open a file stored at a URL. The first line opens the file, the second reads it as bytes data, and the third converts the bytes to string (str) format.
def text_to_string(url):
    """Read a text file from a URL and return a string."""
    with urllib.request.urlopen(url) as response:
        data = response.read()  # in bytes
        txt_str = data.decode('utf-8')  # converts bytes to string
    return txt_str
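As a quick sanity check, you can call the function on one of the files used later (the bit.ly link below is the same War of the Worlds URL from the loading cell) and confirm that you get back a long string:

war_text = text_to_string('https://bit.ly/3QnuTPX')
print(type(war_text), len(war_text))  # expect <class 'str'> and several hundred thousand characters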
Defining a Function to Tokenize Text
The next function accepts a dictionary where the keys are the authors’ names and the values are their novels in string format. It then uses NLTK’s word_tokenize() function to reorganize the strings into tokens.
Python sees strings as a collection of characters: letters, spaces, punctuation, and so on. Tokenization groups these characters into meaningful elements such as words and punctuation marks (or whole sentences). Each element is referred to as a token, and these tokens permit the use of sophisticated NLP analyses.
The function ends by returning a new dictionary with the authors’ names as keys and the punctuation tokens as values.
def make_punct_dict(author_book_dict):
    """Accept author/text dict and return dict of punctuation by author."""
    punct_by_author = {}
    for author, text in author_book_dict.items():
        tokens = nltk.word_tokenize(text)
        punct_by_author[author] = [token for token in tokens
                                   if token in PUNCT_SET]
        print(f"Number punctuation marks in {author} = {len(punct_by_author[author])}")
    return punct_by_author
I should note here that it’s possible to use basic Python to search the text file for semicolons and complete this project. Tokenization, however, brings two things to the table.
Firstly, NLTK’s default tokenizer (word_tokenize()) doesn’t count the apostrophes in contractions or possessives as punctuation. Instead, it splits the contraction into two tokens, with the apostrophe attached to the second one (such as I + ’ve), in accordance with grammatical usage.
The tokenizer also treats special marks, such as a double hyphen (--) or an ellipsis (...), as a single mark, as the author intended, rather than as separate marks. With basic Python, each character in these sequences would be counted separately.
The impact of this can be significant. For example, the total punctuation count for The Lost World using NLTK is 10,035. With basic Python, it’s 14,352!
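You can see the tokenizer’s handling of these marks for yourself with a quick check; the sample sentence below is just an illustration, and the comment shows the output I’d expect from NLTK’s Treebank-based tokenizer:

print(nltk.word_tokenize("I've seen it -- truly..."))
# ['I', "'ve", 'seen', 'it', '--', 'truly', '...']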
Secondly, all of the other NLP attribution tests require the use of tokenization, so using NLTK here gives you the ability to expand the analysis later.
Defining a Function to Convert Punctuation Marks to Numbers
Seaborn’s heatmap() method requires numerical data as input. The following function accepts a dictionary of punctuation tokens by author, assigns semicolons a value of 1 and all other marks a value of 0, and returns the results as a list.
def convert_punct_to_number(punct_by_author, author):
    """Return list of punctuation marks converted to numerical values."""
    heat_vals = [1 if char == ';' else 0 for char in punct_by_author[author]]
    return heat_vals
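A tiny, hypothetical example shows what the function returns:

demo = {'demo_author': [';', ',', '.', ';', '!']}
print(convert_punct_to_number(demo, 'demo_author'))  # [1, 0, 0, 1, 0]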
Defining a Function to Find the Next Lowest Square Value
The list of heat values is in 1D, but we want to plot it in 2D. To accomplish this, we’ll convert the list into a square NumPy array.
This must be a perfect square, and it’s unlikely that the number of samples in each list will have an integer square root. So, we take the square root of each list’s length, truncate it to an integer, and square it. The result is the largest number of tokens that can be displayed as a true square.
def find_next_lowest_square(number):
    """Return the largest perfect square less than or equal to the given number."""
    return int(math.sqrt(number)) ** 2
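For example, The Hound of the Baskervilles yields 6,704 punctuation tokens, and the largest perfect square that fits is 81 x 81 = 6,561:

print(find_next_lowest_square(6704))   # 6561 (81 * 81)
print(find_next_lowest_square(10035))  # 10000 (100 * 100)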
Loading and Preparing the Data for Plotting
With the functions defined, we’re ready to apply them. First, we load the text data for each book into a dictionary named strings_by_author. For The Lost World, we’ll enter 'unknown' for the author’s name, as this represents a document of unknown origin.
We then turn this into a punctuation dictionary named punct_by_author. From this dictionary, we find the next lowest square for each list and choose the minimum value going forward. This ensures that all the datasets can be converted into a square array and also normalizes the number of samples per dataset (The Lost World, for example, has 10,035 punctuation tokens compared to only 6,704 for The Hound of the Baskervilles).
war_url = 'https://bit.ly/3QnuTPX'
hound_url = 'https://bit.ly/44Gdc2a'
lost_url = 'https://bit.ly/3QhTfKJ'

# Load text files into dictionary by author:
strings_by_author = {'wells': text_to_string(war_url),
                     'doyle': text_to_string(hound_url),
                     'unknown': text_to_string(lost_url)}

# Tokenize text strings preserving only punctuation marks:
punct_by_author = make_punct_dict(strings_by_author)

# Find the largest square that fits all datasets:
squarable_punct_sizes = [find_next_lowest_square(len(punct_by_author[author]))
                         for author in punct_by_author]
perfect_square = min(squarable_punct_sizes)
print(f"Array size for perfect square: {perfect_square}\n")

The output shows a perfect square of 6,561 (81 x 81), which means that each punctuation list will be truncated to a length of 6,561 tokens before plotting.
Plotting the Heat Maps
We’ll use a for loop to plot a heat map for each author’s punctuation. The first step is to convert the punctuation tokens to numbers. Remember, semicolons will be represented by 1 and everything else by 0.
Next, we use our perfect_square value along with NumPy’s array() function and reshape() method to both truncate the heat list and turn it into a square NumPy array. At this point, we only need to set up a matplotlib figure and call seaborn’s heatmap() method. The loop will plot a separate figure for each author.
Semicolons will be colored blue. You can reverse this by changing the order of the colors in the cmap argument. You can also enter new colors if you don’t like yellow and blue.
# Convert punctuation marks to numerical values and plot heatmaps:
for author in punct_by_author:
    heat = convert_punct_to_number(punct_by_author, author)
    arr = np.array(heat[:perfect_square]).reshape(int(math.sqrt(perfect_square)),
                                                  int(math.sqrt(perfect_square)))
    fig, ax = plt.subplots(figsize=(5, 5))
    sns.heatmap(arr,
                cmap=ListedColormap(['yellow', 'blue']),
                cbar=False,
                xticklabels=False,
                yticklabels=False)
    ax.set_title(f'Heatmap Semicolons: {author.title()}')
    plt.show();
[Heat maps of semicolon positions for Wells, Doyle, and Unknown]
Outcome
It should be clear from the previous plots that – based strictly on semicolon usage – Doyle is the most likely author of The Lost World.
Having a visual display of these results is important. For example, the total semicolon counts for each book look like this:
Wells: 243
Doyle: 45
Unknown: 103
And as a fraction of the total punctuation, they look like this:
Wells: 0.032
Doyle: 0.007
Unknown: 0.010
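If you'd like to reproduce these numbers, a short follow-up cell reusing the punct_by_author dictionary from earlier does the job:

for author, tokens in punct_by_author.items():
    semicolons = tokens.count(';')
    print(f"{author}: {semicolons} semicolons "
          f"({semicolons / len(tokens):.3f} of all punctuation)")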
From this textual data, is it immediately clear to you that the Unknown author is most likely Doyle? Imagine presenting this to a jury. Would the jurors be swayed more by the plots or the text? You can’t beat visualizations for communicating information!
Of course, for a robust determination of authorship, you would want to run all the NLP attribution tests listed previously, and you would want to use all of Doyle’s and Wells’s novels.
Thanks
Thanks for reading, and please follow me for more Quick Success Data Science projects in the future. And if you want to see a more complete application of NLP tests on this dataset, see Chapter 2 of my book, Real World Python: A Hacker’s Guide to Solving Problems with Code.