Master Dispersion Plots in 6 Minutes

Learn graphical text analysis with NLTK

Quick Success Data Science

Sherlock Holmes (by DALL-E3)

The Natural Language Toolkit (NLTK) ships with a fun feature called a dispersion plot that lets you plot the location of a word in a text. More specifically, it plots each occurrence of a word against its offset, in words, from the beginning of the corpus.

Here’s an example dispersion plot for the main characters in the Sherlock Holmes novel, The Hound of the Baskervilles:

Dispersion plot for major characters in "The Hound of the Baskervilles" (by author)

The vertical blue tick marks represent the locations of the target words in the text. Each row covers the corpus from beginning to end.
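
To make the idea of a word offset concrete, here’s a minimal sketch in plain Python that collects the positions a dispersion plot would mark. The function name word_offsets and the sample sentence are illustrative assumptions, not part of NLTK.

```python
def word_offsets(tokens, target_words):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {word: [] for word in target_words}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

# Hypothetical sample text, split on whitespace for simplicity.
tokens = "the cat sat near the dog while the cat slept".split()
print(word_offsets(tokens, ["cat", "dog"]))
# {'cat': [1, 8], 'dog': [5]}
```

Each list of positions corresponds to one row of tick marks in the plot.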

If you’re familiar with The Hound of the Baskervilles – and I won’t spoil it if you’re not – then you’ll appreciate the sparse occurrence of Holmes in the middle, the late return of Mortimer, and the overlap of Barrymore, Selden, and the hound.

Dispersion plots can have more practical applications. For example, imagine you’re a data scientist working with paralegals on a criminal case involving insider trading. To find out whether the accused contacted board members just before making the illegal trades, you can load the subpoenaed emails of the accused as a continuous string and generate a dispersion plot to check for the juxtapositions of names.

Social scientists analyze dispersion plots to study language trends related to specific topics. By tracking the occurrence of terms like "climate change" or "gun control" in news articles, they can gain insights into society’s priorities over specific timeframes.

In this Quick Success Data Science project, we’ll write the Python code that generated The Hound of the Baskervilles dispersion plot shown previously.


Downloading "The Hound of the Baskervilles"

We’ll use a copy of the novel stored in this Gist. It originally came from Project Gutenberg, a great source for public domain literature. As recommended for natural language processing, I’ve stripped it of extraneous material such as the table of contents, chapter titles, copyright information, indexes, and so on.


Installing NLTK

To generate the plot, we’ll use the Natural Language Toolkit (NLTK), a free package for working with human language data in Python.

To install NLTK with pip use:

pip install nltk

For Anaconda, use:

conda install anaconda::nltk

For additional installation instructions see this site.

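NOTE: A full installation of NLTK requires downloading additional datasets and models. The dispersion plot itself doesn’t need them, but the word_tokenize() function used below relies on the Punkt tokenizer data. If you see a LookupError, follow the nltk.download() instruction shown in the error message.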


Checking for the Y-label Bug

For almost a year now, NLTK’s dispersion_plot() method has been plagued with a bug that reverses the order of the y-labels. There’s an easy workaround, but first, we need to confirm that the bug hasn’t been corrected. The following code generates a simple dispersion plot that should make this obvious.

# Test if y-labels are reversed:
import nltk

corpus = 'cat cat cat cat dog dog'
tokens = nltk.word_tokenize(corpus)
tokens = nltk.Text(tokens)  # NLTK wrapper for automatic text analysis.
target_words = ['cat', 'dog']

nltk.draw.dispersion.dispersion_plot(tokens, target_words);

Here, we import nltk and then assign a simple string to the variable, corpus. This string repeats the word cat four times and dog two times.

Next, we use NLTK’s word_tokenize() function to break out each word in the string as a discrete item. We pass the resulting tokens list to the NLTK Text() class, which includes methods for performing automated analyses such as counting occurrences of specific words or phrases, finding lines where a given word occurs, identifying words that frequently occur together, and more.
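
The word_tokenize() function depends on NLTK’s Punkt tokenizer data, which is a separate download. If that data isn’t available, a rough stand-in can be sketched with a regular expression. The simple_tokenize name is a hypothetical helper, and its output will differ from NLTK’s in places (for example, it splits "Mr." into two tokens and breaks contractions apart).

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Mr. Holmes, the hound!"))
# ['Mr', '.', 'Holmes', ',', 'the', 'hound', '!']
```

Because the Text() class accepts a list of tokens, the result should be usable in place of the word_tokenize() output for simple plots.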

Next, we prepare a list, named target_words, of the words we want to include in the dispersion plot. Then we call the dispersion_plot() method and pass it tokens and target_words. This generates the following plot.

The dispersion plot for cats and dogs (by the author)

Right away, we can see that the "dog" and "cat" labels are posted in the wrong order on the y-axis. In the next section, we’ll add code to correct this.
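
To see how this kind of mix-up can arise, here’s a toy model in plain Python. It is an illustration only, not NLTK’s actual source: it assumes that tick marks are placed with the first target word on the top row, and then compares two ways of labeling the rows. All function names are hypothetical.

```python
def assign_rows(target_words):
    """Give the first target word the top row (highest y value)."""
    return {word: row for row, word in enumerate(reversed(target_words))}

def labels_in_list_order(target_words):
    """Mismatched labeling: rows labeled bottom-to-top in the original order."""
    return list(target_words)

def labels_in_row_order(target_words):
    """Matched labeling: rows labeled bottom-to-top in reversed order."""
    return list(reversed(target_words))

words = ['cat', 'dog']
rows = assign_rows(words)                    # {'dog': 0, 'cat': 1}
cat_row = rows['cat']                        # Row where the 'cat' ticks are drawn.
print(labels_in_list_order(words)[cat_row])  # dog (wrong label on the cat row)
print(labels_in_row_order(words)[cat_row])   # cat
```

Under this assumption, labeling the rows with the reversed word list puts each label back on the row that holds its tick marks.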


Creating a Dispersion Plot with Correct Y-labeling

The following code generates a dispersion plot for The Hound of the Baskervilles. It also uses recommendations from the NLTK GitHub repository Issues tab to correct the problem with the y-labels.

import urllib.request
import nltk

def text_to_string(url):
    """Read a text file from a url and return a string."""
    with urllib.request.urlopen(url) as infile:
        return infile.read().decode('utf-8')

def custom_dispersion_plot(text, target_words):
    """Return a dispersion plot with corrected y-axis label order."""
    ax = nltk.draw.dispersion.dispersion_plot(text, target_words)
    ax.set_yticks(list(range(len(target_words))))
    ax.set_yticklabels(reversed(target_words))
    return ax

target_url = 'https://bit.ly/4bRSMY3'  # Gist location.
corpus = text_to_string(target_url)
tokens = nltk.Text(nltk.word_tokenize(corpus))
words = ['Holmes',
         'Watson',
         'Mortimer',
         'Henry',
         'Barrymore',
         'Stapleton',
         'Selden',
         'hound']

custom_dispersion_plot(tokens, words);
# Outside a notebook, also run: import matplotlib.pyplot as plt; plt.show()

NLTK uses Matplotlib under the hood, so the key to handling the bug is to capture the Axes object (ax) that dispersion_plot() returns and then manually call its set_yticks() and set_yticklabels() methods. This is encapsulated in the custom_dispersion_plot() function. The rest of the process is the same as before.
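
If you’d rather not depend on the bug’s status at all, one option is to draw the plot directly with Matplotlib. The following is a minimal sketch under that assumption, not NLTK’s implementation. The manual_dispersion_plot name is hypothetical, and the rows and labels come from the same mapping so they can’t drift apart.

```python
import matplotlib.pyplot as plt

def manual_dispersion_plot(tokens, target_words, title="Lexical Dispersion Plot"):
    """Draw a dispersion plot with the first target word on the top row."""
    # Row 0 sits at the bottom of the y-axis, so the first word gets the highest row.
    row_of = {word: len(target_words) - 1 - i
              for i, word in enumerate(target_words)}
    xs, ys = [], []
    for offset, token in enumerate(tokens):
        if token in row_of:
            xs.append(offset)
            ys.append(row_of[token])
    fig, ax = plt.subplots()
    ax.plot(xs, ys, "|")  # Vertical tick markers only, no connecting line.
    # Label each row from the same mapping used to place the tick marks.
    ax.set_yticks(sorted(row_of.values()))
    ax.set_yticklabels(sorted(row_of, key=row_of.get))
    ax.set_ylim(-1, len(target_words))
    ax.set_xlabel("Word Offset")
    ax.set_title(title)
    return ax

ax = manual_dispersion_plot("cat cat cat cat dog dog".split(), ["cat", "dog"])
```

As with the NLTK version, call plt.show() afterward if you’re running outside a notebook.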

By default, NLTK adds the title "Lexical Dispersion Plot" to the figure. To use a custom title, add a call to the ax object’s set_title() method in the custom_dispersion_plot() function:

def custom_dispersion_plot(text, target_words):
    """Generate a dispersion plot with corrected y-axis label order."""
    ax = nltk.draw.dispersion.dispersion_plot(text, target_words)
    ax.set_yticks(list(range(len(target_words))))
    ax.set_yticklabels(reversed(target_words))
    ax.set_title('Lexical Dispersion Plot for The Hound of the Baskervilles')
    return ax

This produces the following figure:

The dispersion plot with a custom title (by the author)

NOTE: If the y-label bug has been corrected, you’ll want to remove the calls to the ax.set_yticks() and ax.set_yticklabels() methods before using the previous code.


The Concordance and Frequency Methods

NLTK also comes with a concordance() method that shows every occurrence of a target word. To provide context, some surrounding text is also included.

As with the dispersion plot, you have to instantiate a Text object before calling the method. Here’s an example using the word, "hound" (this builds on the previous code):

tokens.concordance('hound')

Here’s the (truncated) response:

The truncated output of the concordance() method (by the author)
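
To show what a concordance does conceptually, here’s a simplified keyword-in-context sketch in plain Python. It is not NLTK’s implementation: the simple_concordance name, the bracket formatting, the case-insensitive matching, and the sample sentence are all assumptions made for illustration.

```python
def simple_concordance(tokens, target, width=4):
    """Return each occurrence of target with `width` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Hypothetical sample text, split on whitespace for simplicity.
sample = "the hound ran across the moor and the Hound howled".split()
for line in simple_concordance(sample, "hound", width=2):
    print(line)
# the [hound] ran across
# and the [Hound] howled
```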

To count the total number of occurrences of a word, use the count() method:

frequency = tokens.count('hound')
print("Frequency of 'hound':", frequency)

The output of the count() method (by the author)
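
Counting matches tokens exactly, so "Hound" and "hound" are tallied separately. If you want totals for several words at once, or a case-insensitive tally, a sketch with Python’s collections.Counter might look like this. The count_targets helper and the sample sentence are hypothetical.

```python
from collections import Counter

def count_targets(tokens, target_words, ignore_case=False):
    """Count each target word, optionally folding case before comparing."""
    counts = Counter(t.lower() if ignore_case else t for t in tokens)
    return {w: counts[w.lower() if ignore_case else w] for w in target_words}

# Hypothetical sample text, split on whitespace for simplicity.
sample = "Holmes saw the hound and the Hound saw Holmes".split()
print(count_targets(sample, ["Holmes", "hound"]))
# {'Holmes': 2, 'hound': 1}
print(count_targets(sample, ["Holmes", "hound"], ignore_case=True))
# {'Holmes': 2, 'hound': 2}
```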

Dealing with Chronological Data

It’s also possible to create dispersion plots that use dates and times, rather than word offsets, to position the occurrences of words. Here’s an example that builds the dispersion plot manually with a seaborn strip plot:

Tutorial: Plotting Lexical Dispersion (Conspiracy Lies from the Left-of-Center)
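
The data preparation behind a date-based plot can be sketched in plain Python. This is not the linked tutorial’s code: the term_dates helper and the dated snippets are made up for illustration, and the sketch assumes each document arrives as a (date, text) pair.

```python
from datetime import date

def term_dates(dated_docs, target_words):
    """Map each target word to the sorted dates of documents that mention it."""
    hits = {word: [] for word in target_words}
    for day, text in dated_docs:
        tokens = set(text.lower().split())
        for word in target_words:
            if word.lower() in tokens:
                hits[word].append(day)
    return {word: sorted(days) for word, days in hits.items()}

# Hypothetical dated documents.
docs = [
    (date(2024, 1, 5), "storm warnings issued"),
    (date(2024, 1, 9), "flood follows storm"),
    (date(2024, 2, 1), "flood damage assessed"),
]
print(term_dates(docs, ["storm", "flood"]))
```

The resulting dates can then serve as x-values, with one row per term, in whatever plotting library you prefer.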


Summary

The NLTK dispersion plot shows the locations of target words in a text, measured from the start of the text. This plot makes it easy to see the distribution of the words and whether any words tend to cluster together.

The dispersion_plot() method currently contains a bug that causes the y-axis labels to plot in reverse order. This can be easily corrected by manually posting the y-labels. You’ll first want to run a simple program to test that the bug is still in play before applying the fix.


Thanks!

Thanks for reading and please follow me for more Quick Success Data Science projects in the future. And if you enjoy natural language processing, check out my article on attributing authorship with NLTK:

Use Stylometry to Identify Authors

