
When I was still a student, I read articles saying that linguists could use text analysis techniques to determine the author of an anonymous book. I thought it was cool at the time.
Looking back, I still think this technique is cool. But nowadays, with the help of NLTK and Python, you and I can be "real" linguists with just a few lines of code.
Prepare analysis target
You don’t need to write a crawler to scrape a corpus for analysis. For learning and research purposes, a large, well-maintained collection of texts is already available through the NLTK package. If you don’t have the package installed, simply install it with pip.
pip install nltk
Then download the Gutenberg corpus, a small selection of texts from the Project Gutenberg electronic text archive.
import nltk
nltk.download("gutenberg")
The download should complete in a second or two. Let’s list the names of the downloaded books.
from nltk.corpus import gutenberg
gutenberg.fileids()
You will see books like Shakespeare’s Julius Caesar, Austen’s Emma, the King James Bible, and so on. Let’s see how many words the King James Version (KJV) of the Bible contains.
bible = gutenberg.words('bible-kjv.txt')
len(bible)
That is 1,010,654 in total, counting punctuation tokens as well.
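To count only actual words and leave out punctuation tokens such as [ or a comma, something like the following should work. This is a minimal sketch: the helper name count_alpha_words is my own and not part of NLTK, and the toy list stands in for what gutenberg.words() returns.

```python
def count_alpha_words(tokens):
    """Count tokens made up only of letters, skipping punctuation and numbers."""
    return sum(1 for token in tokens if token.isalpha())


# Toy token list in the shape that gutenberg.words() returns.
sample_tokens = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
print(len(sample_tokens))                # all tokens, punctuation included
print(count_alpha_words(sample_tokens))  # alphabetic tokens only
```

In the notebook you could pass the bible variable from above instead of the toy list.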
Metrics to measure an author’s writing pattern
I am going to detect an author’s writing pattern with the following three metrics.
- Average number of characters per word. This metric reflects the author’s word preference: long words or short ones.
- Average number of words per sentence. This metric reflects the author’s sentence preference: long sentences or short ones.
- Ratio of distinct words to total words in a book. This metric reflects the breadth of the author’s vocabulary, which is hard to fake.
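To make the three metrics concrete, here is a small sketch that computes them on a toy text with plain Python. It assumes a much cruder tokenization than NLTK uses (a sentence ends at a period, exclamation mark, or question mark, and a word is a run of letters or apostrophes), and the function name writing_metrics is only for illustration.

```python
import re


def writing_metrics(text):
    """Return (chars per word, words per sentence, vocabulary rate) for a text."""
    # Simplified tokenization, not NLTK's: split sentences on . ! ? and
    # treat every run of letters or apostrophes as a word.
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    vocabulary = {w.lower() for w in words}  # distinct words, case-insensitive
    chars_per_word = sum(len(w) for w in words) / len(words)
    words_per_sentence = len(words) / len(sentences)
    vocabulary_rate = len(vocabulary) / len(words)
    return chars_per_word, words_per_sentence, vocabulary_rate


sample = "The cat sat. The dog sat too. Cats and dogs sit!"
cpw, wps, rate = writing_metrics(sample)
print(round(cpw, 2), round(wps, 2), round(rate, 2))  # 3.18 3.67 0.82
```

The toy text has 11 words in 3 sentences with 9 distinct words, which is where the three numbers come from.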
You may ask: why not capture the words and phrases each author uses most frequently? Yes, that would be nice and might produce more interesting results. But frequent-word detection would also bring in additional logic and could hurt the readability of this piece. (Maybe it is worth writing another article on how to detect keywords automatically.)
Apply the 3 metrics to Gutenberg books
Now let’s apply these three metrics to the Gutenberg books in Python. The logic is quite simple: use NLTK’s raw(), words(), and sents() to get the number of characters, words, and sentences. Then use a Python comprehension with the set data container to get the vocabulary size in one line of code.
You can copy and paste it into your Jupyter notebook to see the result if you want.
import pandas as pd
from nltk.corpus import gutenberg

# Note: sents() may need the Punkt sentence tokenizer data,
# e.g. nltk.download('punkt') (or 'punkt_tab' on newer NLTK versions).
data = []
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))  # raw() also counts spaces and punctuation
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    # get total vocabulary used in this book
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    data.append([
        fileid.split('.')[0],  # remove .txt from file name
        round(num_chars / num_words),
        round(num_words / num_sents),
        # total vocabulary used divided by total words used
        round(num_vocab / num_words, 2),
    ])
pattern_metrics = pd.DataFrame(
    data,
    columns=['author-book', 'chars_per_word', 'words_per_sentence', 'vocabulary_rate'],
)
pattern_metrics
From the result, we can easily see that Austen uses 25 to 28 words per sentence, while Shakespeare uses shorter sentences, a consistent 12.

Shakespeare has a great reputation as the one who laid the foundation of English. Sure enough, his works use an apparently richer vocabulary than the others, except for Blake’s poems.
The result is striking and clear, and you can draw more conclusions from this table. If the numbers are not clear enough, a bar chart shows the result more intuitively.
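One possible way to draw such a chart is pandas’ built-in plotting, which relies on matplotlib being installed. The sketch below uses a tiny stand-in table with illustrative values so it runs on its own; in the notebook you would use the pattern_metrics frame instead.

```python
import pandas as pd

# Illustrative stand-in rows; in the notebook, use the pattern_metrics frame instead.
sample_metrics = pd.DataFrame(
    [['austen-emma', 25], ['blake-poems', 19], ['shakespeare-hamlet', 12]],
    columns=['author-book', 'words_per_sentence'],
)

# One horizontal bar per book, sorted so the longest sentences end up on top.
ax = (
    sample_metrics
    .set_index('author-book')['words_per_sentence']
    .sort_values()
    .plot.barh(title='Words per sentence')
)
ax.set_xlabel('words per sentence')
```

Horizontal bars keep the long book names readable; swapping in chars_per_word or vocabulary_rate gives the other two charts.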

With this kind of chart, you can also build a reading list for your kids: start with books that use a smaller vocabulary and shorter sentences. Hm, Edgeworth’s The Parent’s Assistant looks like a good start. And when you can’t fall asleep in the middle of the night, reading a few of Blake’s poems may be a nice choice.
Conclusion
Next time J.K. Rowling publishes another novel under a pseudonym, you can run a few lines of Python code to check whether the new novel was written by J.K. Rowling or not. Publish the result in a magazine and be a "linguist". Maybe another young mind will read it and become a data scientist.
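As a rough sketch of what those few lines might look like: compare the three metrics of the unknown book with the profiles of known authors and pick the closest one. Everything here is an assumption for illustration: the author names and numbers are placeholders rather than measured results, closest_author is a made-up helper, and three metrics are far too few for a serious attribution.

```python
import math

# Hypothetical profiles: (chars_per_word, words_per_sentence, vocabulary_rate).
# The numbers are placeholders for illustration, not measured results.
known_profiles = {
    'author_a': (5, 26, 0.05),
    'author_b': (4, 12, 0.13),
}


def closest_author(candidate, profiles):
    """Return the profile name with the smallest scaled Euclidean distance."""
    n_metrics = len(candidate)
    # Scale each metric by its largest value so words_per_sentence does not dominate.
    scales = [
        max(abs(p[i]) for p in list(profiles.values()) + [candidate]) or 1.0
        for i in range(n_metrics)
    ]

    def distance(profile):
        return math.sqrt(sum(
            ((profile[i] - candidate[i]) / scales[i]) ** 2
            for i in range(n_metrics)
        ))

    return min(profiles, key=lambda name: distance(profiles[name]))


print(closest_author((4, 13, 0.12), known_profiles))
```

With these placeholder numbers, the candidate with short sentences lands next to author_b.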