Quick Success Data Science

The Natural Language Toolkit (NLTK) ships with a fun feature called a dispersion plot that lets you see where a word occurs in a text. More specifically, it plots each occurrence of a word against its offset, in words, from the beginning of the corpus.
Here’s an example dispersion plot for the main characters in the Sherlock Holmes novel, The Hound of the Baskervilles:

The vertical blue tick marks represent the locations of the target words in the text. Each row covers the corpus from beginning to end.
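To make the idea of a word offset concrete, here is a minimal sketch in plain Python (no NLTK needed). The function name word_offsets and the sample sentence are assumptions made up for illustration. Each returned position corresponds to one tick mark in a row of the plot.

```python
def word_offsets(tokens, target_words):
    """Map each target word to the token positions where it occurs."""
    offsets = {word: [] for word in target_words}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


# Illustrative sample text, split on whitespace for simplicity.
sample_tokens = "the hound chased Holmes and Holmes chased the hound".split()
sample_offsets = word_offsets(sample_tokens, ["Holmes", "hound"])
print(sample_offsets)
```

The match here is exact and case-sensitive, so "Hound" and "hound" would be counted separately.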
If you’re familiar with The Hound of the Baskervilles – and I won’t spoil it if you’re not – then you’ll appreciate the sparse occurrence of Holmes in the middle, the late return of Mortimer, and the overlap of Barrymore, Selden, and the hound.
Dispersion plots can have more practical applications. For example, imagine you’re a data scientist working with paralegals on a criminal case involving insider trading. To find out whether the accused contacted board members just before making the illegal trades, you can load the subpoenaed emails of the accused as a continuous string and generate a dispersion plot to check for the juxtapositions of names.
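As a rough companion to eyeballing the plot, the sketch below lists the places where two names occur within a given number of tokens of each other. The function name close_pairs, the window size, and the names "Smith" and "Jones" are all hypothetical, chosen only for illustration.

```python
def close_pairs(tokens, name_a, name_b, window=50):
    """Return (offset_a, offset_b) pairs that lie within `window` tokens."""
    positions_a = [i for i, token in enumerate(tokens) if token == name_a]
    positions_b = [i for i, token in enumerate(tokens) if token == name_b]
    return [(a, b)
            for a in positions_a
            for b in positions_b
            if abs(a - b) <= window]


# Invented sample text; a real case would use the tokenized emails.
email_tokens = "Smith emailed Jones about lunch . Later Smith sold shares".split()
nearby = close_pairs(email_tokens, "Smith", "Jones", window=3)
print(nearby)
```

A small window flags only tight juxtapositions, while a larger one behaves more like reading neighboring tick marks on the plot.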
Social scientists analyze dispersion plots to study language trends related to specific topics. By tracking the occurrence of terms like "climate change" or "gun control" in news articles, they can gain insights into priorities that are important to society over specific timeframes.
In this Quick Success Data Science project, we’ll write the Python code that generated The Hound of the Baskervilles dispersion plot shown previously.
Downloading "The Hound of the Baskervilles"
We’ll use a copy of the novel stored in this Gist. It originally came from Project Gutenberg, a great source for public domain literature. As recommended for natural language processing, I’ve stripped it of extraneous material such as the table of contents, chapter titles, copyright information, indexes, and so on.
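If you start from a raw Project Gutenberg file instead of the cleaned Gist, a first pass at removing the license text might look like the sketch below. It assumes the file contains the "*** START OF" and "*** END OF" marker lines found in many Project Gutenberg texts, and the function name is hypothetical. It does not remove the table of contents or chapter titles, which still need manual editing.

```python
def strip_gutenberg_boilerplate(raw_text):
    """Keep only the text between the START and END marker lines."""
    start = raw_text.find("*** START OF")
    if start != -1:
        # Move to the line after the START marker.
        start = raw_text.find("\n", start) + 1
    else:
        start = 0
    end = raw_text.find("*** END OF", start)
    if end == -1:
        end = len(raw_text)
    return raw_text[start:end].strip()


# Tiny invented sample that mimics the marker layout.
sample = ("Header\n"
          "*** START OF THE PROJECT GUTENBERG EBOOK X ***\n"
          "Body text.\n"
          "*** END OF THE PROJECT GUTENBERG EBOOK X ***\n"
          "License")
cleaned = strip_gutenberg_boilerplate(sample)
print(cleaned)
```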
Installing NLTK
To generate the plot, we’ll use the Natural Language Toolkit (NLTK), a free package for working with human language data in Python.
To install NLTK with pip use:
pip install nltk
For Anaconda, use:
conda install anaconda::nltk
For additional installation instructions see this site.
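NOTE: A full installation of NLTK includes additional datasets and models that you download separately. The dispersion plot itself doesn’t need them, but the word_tokenize() function used below relies on NLTK’s Punkt tokenizer data. If you get a LookupError, download the data with nltk.download('punkt') (newer NLTK releases may ask for 'punkt_tab' instead; the error message names the missing resource).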
Checking for the Y-label Bug
For almost a year now, NLTK’s dispersion_plot() method has been plagued by a bug that reverses the order of the y-labels. There’s an easy workaround, but first, we need to confirm that the bug hasn’t been corrected. The following code generates a simple dispersion plot that should make this obvious.
# Test if y-labels are reversed:
import nltk
corpus = 'cat cat cat cat dog dog'
tokens = nltk.word_tokenize(corpus)
tokens = nltk.Text(tokens) # NLTK wrapper for automatic text analysis.
target_words = ['cat', 'dog']
nltk.draw.dispersion.dispersion_plot(tokens, target_words);
Here, we import nltk and then assign a simple string to the variable corpus. This string repeats the word cat four times and dog two times.
We then use NLTK’s word_tokenize() function to split the string into a list of individual tokens. We pass this tokens list to NLTK’s Text() class, which includes methods for performing automated analyses such as counting occurrences of specific words or phrases, finding lines where a given word occurs, identifying words that frequently occur together, and more.
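To show roughly what tokenizing means, here is a simplified stand-in written with the standard library’s re module. It is not NLTK’s algorithm: word_tokenize() handles contractions and punctuation differently (for example, it splits "didn't" into two tokens). The function name simple_tokenize is an assumption for illustration.

```python
import re


def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # A word may contain internal straight apostrophes; anything else
    # that is not whitespace becomes its own token.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


corpus_tokens = simple_tokenize("cat cat cat cat dog dog")
sentence_tokens = simple_tokenize("Holmes didn't see the hound.")
print(corpus_tokens)
print(sentence_tokens)
```

For the simple cat and dog corpus, both approaches give the same six tokens.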
Next, we prepare a list, named target_words, of the words we want to include in the dispersion plot. Then we call the dispersion_plot() method and pass it tokens and target_words. This generates the following plot.

Right away, we can see that the "dog" and "cat" labels are posted in the wrong order on the y-axis. In the next section, we’ll add code to correct this.
Creating a Dispersion Plot with Correct Y-labeling
The following code generates a dispersion plot for The Hound of the Baskervilles. It also uses recommendations from the NLTK GitHub repository Issues tab to correct the problem with the y-labels.
import urllib.request

import matplotlib.pyplot as plt  # Only needed if you call plt.show() below.
import nltk


def text_to_string(url):
    """Read a text file from a url and return a string."""
    with urllib.request.urlopen(url) as infile:
        return infile.read().decode('utf-8')


def custom_dispersion_plot(text, target_words):
    """Return a dispersion plot with corrected y-axis label order."""
    ax = nltk.draw.dispersion.dispersion_plot(text, target_words)
    ax.set_yticks(list(range(len(target_words))))
    ax.set_yticklabels(reversed(target_words))
    return ax


target_url = 'https://bit.ly/4bRSMY3'  # Gist location.
corpus = text_to_string(target_url)
tokens = nltk.Text(nltk.word_tokenize(corpus))
words = ['Holmes',
         'Watson',
         'Mortimer',
         'Henry',
         'Barrymore',
         'Stapleton',
         'Selden',
         'hound']
custom_dispersion_plot(tokens, words);
# plt.show()  # Uncomment when running as a script rather than in a notebook.
NLTK uses Matplotlib under the hood, so the key to handling the bug is to capture the Axes object (ax) that dispersion_plot() returns and then manually call set_yticks() and set_yticklabels() on it. This is encapsulated in the custom_dispersion_plot() function. The rest of the process is the same as before.
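The sketch below illustrates why reversing the labels lines things up. It assumes, as the workaround implies, that the plot puts the first target word on the top row, so rows are numbered bottom-to-top in reverse list order. This is a toy model in plain Python, not NLTK’s source code, and both function names are hypothetical.

```python
def row_assignments(target_words):
    """Assumed layout: first word in the list gets the highest y value."""
    return {word: y for y, word in enumerate(reversed(target_words))}


def matching_tick_labels(target_words):
    """Labels for ticks 0..n-1 that agree with the assumed row layout."""
    return list(reversed(target_words))


target_words = ['cat', 'dog']
rows = row_assignments(target_words)
labels = matching_tick_labels(target_words)

# Every word's row should carry that word's label.
labels_match = all(labels[y] == word for word, y in rows.items())
print(rows, labels, labels_match)
```

Under that assumption, labeling the ticks with the unreversed list would attach "cat" to the row holding the "dog" tick marks, which is the symptom seen in the test plot.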
By default, NLTK adds the title "Lexical Dispersion Plot" to the figure. To use a custom title, add a call to the ax object’s set_title() method in the custom_dispersion_plot() function:
def custom_dispersion_plot(text, target_words):
    """Generate a dispersion plot with corrected y-axis label order."""
    ax = nltk.draw.dispersion.dispersion_plot(text, target_words)
    ax.set_yticks(list(range(len(target_words))))
    ax.set_yticklabels(reversed(target_words))
    ax.set_title('Lexical Dispersion Plot for The Hound of the Baskervilles')
    return ax
This produces the following figure:

NOTE: If the y-label bug has been corrected, you’ll want to remove the calls to the ax.set_yticks() and ax.set_yticklabels() methods before using the previous code.
The Concordance and Frequency Methods
NLTK also comes with a concordance() method that shows every occurrence of a target word. To provide context, some surrounding text is also included.
As with the dispersion plot, you have to instantiate a Text object before calling the method. Here’s an example using the word "hound" (this builds on the previous code):
tokens.concordance('hound')
Here’s the (truncated) response:

Output of the concordance() method (by the author)
To count the total number of occurrences of a word, use the count() method:
frequency = tokens.count('hound')
print("Frequency of 'hound':", frequency)
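To see what these two calls report without loading the novel, here is a small plain-Python sketch of a keyword-in-context listing and a count. It only approximates the idea: the function name keyword_in_context, the bracket formatting, and the context width are assumptions, and the matching here is exact and case-sensitive, which may differ from NLTK’s behavior.

```python
def keyword_in_context(tokens, target, width=3):
    """Return one line per occurrence with `width` tokens on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Invented sample sentence, split on whitespace.
demo_tokens = "the cry of the hound came across the moor".split()
context_lines = keyword_in_context(demo_tokens, "hound", width=2)
hound_count = demo_tokens.count("hound")
print(context_lines, hound_count)
```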

Dealing with Chronological Data
It’s also possible to create dispersion plots that use dates and times, rather than word offsets, to plot the occurrence of words. Here’s an example that manually creates the dispersion plot using a seaborn strip plot:
Tutorial: Plotting Lexical Dispersion (Conspiracy Lies from the Left-of-Center)
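The data preparation behind a date-based plot can be sketched with the standard library alone. This is not the linked tutorial’s code. The function name word_dates and the dated messages are invented for illustration, and a word is recorded at most once per message.

```python
from datetime import date


def word_dates(messages, target_words):
    """Map each target word to the dates of the messages containing it."""
    hits = {word: [] for word in target_words}
    for sent_on, body in messages:
        tokens = set(body.split())
        for word in target_words:
            if word in tokens:
                hits[word].append(sent_on)
    return hits


# Hypothetical dated messages, invented for illustration only.
messages = [
    (date(2024, 3, 1), "storm warning issued for the coast"),
    (date(2024, 3, 2), "coast roads closed after the storm"),
    (date(2024, 3, 5), "roads reopen along the coast"),
]
dated_hits = word_dates(messages, ["storm", "roads"])
print(dated_hits)
```

The resulting dates can go on the x-axis of a strip or scatter plot, with one row per target word, in place of the word offsets used earlier.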
Summary
The NLTK dispersion plot shows the locations of target words in a text, measured from the start of the text. This plot makes it easy to see the distribution of the words and whether any words tend to cluster together.
The dispersion_plot() method currently contains a bug that causes the y-axis labels to plot in reverse order. This can be easily corrected by manually setting the y-labels. You’ll first want to run a simple program to test that the bug is still in play before applying the fix.
Thanks!
Thanks for reading and please follow me for more Quick Success Data Science projects in the future. And if you enjoy natural language processing, check out my article on attributing authorship with NLTK: