
Natural Language Processing (NLP) with NLTK and spaCy: Sir Arthur Conan Doyle and Agatha Christie’s…

As a data scientist, I am curious to know if there are any similarities in the writing styles of Arthur Conan Doyle and Agatha Christie.

Photo by Markus Winkler on Unsplash

I am sure all of us enjoy reading the evergreen, thrilling stories of the famous detectives Sherlock Holmes and Hercule Poirot. As a data scientist, I am curious to know if there are any similarities in the writing styles of Arthur Conan Doyle and Agatha Christie. It would be great if I could extract some insight into the recipe for successful detective-story writing with the help of Natural Language Processing (NLP).

Objective

In this article, I will perform a side-by-side analysis and comparison of "The Hound of the Baskervilles" by Sir Arthur Conan Doyle and "The Murder on the Links" by Agatha Christie.

In my full analysis, I have considered ten stories from each author to draw parallels, but for the sake of simplicity, I have explained the process with one story from each author in this article. The framework and code mentioned below are easily scalable, and I highly encourage you to try it with the full sets of stories from Sir Arthur Conan Doyle and Agatha Christie.
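As a taste of that scaling, here is a minimal, hypothetical sketch: it assumes the nlp object built in Step 1 below and a folder of plain-text stories per author (the folder name is a placeholder).

from pathlib import Path

# Hypothetical scaling sketch: run the same pipeline over every story
# in a folder ("doyle_stories" is a placeholder name)
for path in Path("doyle_stories").glob("*.txt"):
    doc = nlp(path.read_text(encoding="utf-8"))
    # ... repeat each analysis below for this doc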

The main objective is to show how easy it is to perform natural language processing (NLP) with packages like NLTK and spaCy, and to provide an initial framework for exploring and diving deep into other authors’ writings on your own.

Getting Started

I have downloaded the text versions of "The Hound of the Baskervilles" by Sir Arthur Conan Doyle and "The Murder on the Links" by Agatha Christie from Project Gutenberg.

I have used the "en_core_web_lg" general-purpose pre-trained model to predict named entities, part-of-speech tags and syntactic dependencies.

spaCy provides the pre-trained model in three sizes, viz. small, medium and large, with increasing size and prediction accuracy.

As a prerequisite, we need to download the large model with the command below.

python -m spacy download en_core_web_lg
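Once installed, you can get a quick feel for what the model predicts (named entities, part-of-speech tags and syntactic dependencies) with a small illustrative snippet; the sample sentence is my own.

import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("Sherlock Holmes lived at 221B Baker Street in London.")

# Named entities predicted by the model
for ent in doc.ents:
    print(ent.text, ent.label_)

# Part-of-speech tags and syntactic dependencies
for token in doc:
    print(token.text, token.pos_, token.dep_)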

Step 1: We will be using the NLP packages NLTK and spaCy for the analysis. First, all the required packages are imported, and an NLP object is created by loading the pre-trained model "en_core_web_lg". As the name suggests, the Matcher package helps in performing pattern-based searches; we will discuss it in detail later in the article.

import spacy
from spacy.matcher import Matcher

# Load the large pre-trained English model and set up a pattern matcher
nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)

Step 2: In the code below, the text files of "The Hound of the Baskervilles" and "The Murder on the Links", downloaded from Project Gutenberg, are read.

# Open the plain-text files downloaded from Project Gutenberg
Sherlock = open("Sherlock.txt", "r", encoding="utf-8")
Poirot = open("Poirot.txt", "r", encoding="utf-8")

Step 3: We would like to know the number of nouns, verbs and sentences in both stories. We will write a small custom pipeline component, which can be added to the standard out-of-the-box spaCy pipeline to get this information.

In the main program (explained later), we will divide the full text into individual word tokens. In the code below, we check whether a token’s part of speech is a noun or a verb, and increase the respective counter based on that test.

Further, we count the number of sentences in the text using the standard doc.sents attribute in spaCy.

def length(doc):
    # Custom pipeline component: count nouns, verbs and sentences
    nou = 0
    ver = 0
    sentence = 0
    for token in doc:
        if token.pos_ == "NOUN":
            nou = nou + 1
        elif token.pos_ == "VERB":
            ver = ver + 1
    print("Number of Verbs in the text = ", ver)
    print("Number of Nouns in the text = ", nou)
    for sent in doc.sents:
        sentence = sentence + 1
    print("Number of Sentences in the text = ", sentence)
    return doc

Step 4: We will add the custom pipeline component "length" to the standard spaCy pipeline using the add_pipe method.

nlp.add_pipe(length)
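A note in passing: the line above uses the spaCy v2 API. If you are on spaCy v3, the component must be registered under a name first and then added by that name, roughly like this:

from spacy.language import Language

# spaCy v3: register the component by name, then add it by name
@Language.component("length")
def length(doc):
    # ... same body as in Step 3 ...
    return doc

nlp.add_pipe("length", last=True)

The Matcher API we use later changed similarly: in v3 it is matcher.add("Action", [holmes_pattern]), with the patterns in a list and no callback argument.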

Step 5: In the code below, we pass the texts read in Step 2 to create the doc objects, Arthur and Agatha. A doc object is a sequence of tokens with linguistic annotations, and a set of actions is performed in sequence in the background while creating it.

Arthur=nlp(Sherlock.read())
Agatha=nlp(Poirot.read())

It starts with tokenisation, followed by tagging, parsing, etc., as shown below. We have included the custom pipeline component at the end to count the nouns, verbs and sentences.

Image drawn by the author
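If you want to verify the pipeline order programmatically, spaCy exposes it via pipe_names; the output should look roughly like this:

# Inspect the pipeline; our custom component should appear last
print(nlp.pipe_names)
# e.g. ['tagger', 'parser', 'ner', 'length']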

For the scope of this article, we are interested in the output of the custom pipeline component we wrote in Step 3.

Custom Pipeline Output with number of Verbs, Nouns and Sentences in "The Hound of the Baskervilles"
Custom Pipeline Output with number of Verbs, Nouns and Sentences in "The Murder on the Links"
Parts-of-Speech Analysis based on program output

Step 6: Next, we will analyse the number of times the lead detectives, Sherlock Holmes and Hercule Poirot, took any action. As an approximation, we will look for patterns where the detective’s name is immediately followed by a verb.

In the code below, we define a pattern, holmes_pattern, to find the number of occurrences of "Holmes" immediately followed by a verb.

The add method is used to include the pattern in the matcher. Finally, the doc object (i.e. Arthur) is passed to the matcher to collect all the instances in the text matching the pattern of "Holmes" followed by a verb.

matcher = Matcher(nlp.vocab)
# Pattern: the token "Holmes" immediately followed by a verb
holmes_pattern = [{"TEXT": "Holmes"}, {"POS": "VERB"}]
matcher.add("Action", None, holmes_pattern)
matches = matcher(Arthur)
print("Sherlock Acted:", len(matches))

The length of matches gives us the number of times the pattern is found in the text.

If we loop over the matches as shown in the code below, we can print each individual pattern match.

for match_id, start, end in matches:
    print("Sherlock Action Found:", Arthur[start:end].text)
Program Output

Similar logic is written for the detective Poirot in the story "The Murder on the Links" by Agatha Christie.

# Use a fresh matcher so the Holmes pattern from above is not reused
matcher = Matcher(nlp.vocab)
poirot_pattern = [{"TEXT": "Poirot"}, {"POS": "VERB"}]
matcher.add("Action", None, poirot_pattern)
matches = matcher(Agatha)
print("Poirot Acted:", len(matches))
for match_id, start, end in matches:
    print("Hercule Action Found:", Agatha[start:end].text)
Program Output

Looks like Poirot is much more active than Sherlock Holmes 🙂
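One caveat: the pattern is deliberately strict and misses phrases like "Holmes quietly smiled", where an adverb sits between the name and the verb. As a hypothetical refinement, Matcher operators can allow optional tokens in between:

# Looser (hypothetical) pattern: allow any number of adverbs
# between the detective's name and the verb
loose_pattern = [{"TEXT": "Holmes"},
                 {"POS": "ADV", "OP": "*"},
                 {"POS": "VERB"}]
matcher = Matcher(nlp.vocab)
matcher.add("LooseAction", None, loose_pattern)  # "LooseAction" is a made-up key
print("Looser count:", len(matcher(Arthur)))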

Step 7: Next, we will see how many times the lead detectives’ and their associates’ names are mentioned in the respective stories.

We learned earlier that during the tokenisation process each word is separated and classified. We will compare the text of each token to count the instances of Sherlock, Watson, Poirot and Hastings in the respective stories.

# Count mentions of the detective and his associate in each story
sherlock_count = 0
watson_count = 0
for token in Arthur:
    if token.text == "Sherlock":
        sherlock_count = sherlock_count + 1
    elif token.text == "Watson":
        watson_count = watson_count + 1
print(sherlock_count)
print(watson_count)

poirot_count = 0
hasting_count = 0
for token in Agatha:
    if token.text == "Poirot":
        poirot_count = poirot_count + 1
    elif token.text == "Hastings":
        hasting_count = hasting_count + 1
print(poirot_count)
print(hasting_count)
Chart from program output
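As an aside, the same counts can be obtained more compactly with the standard library’s Counter; a small equivalent sketch:

from collections import Counter

# Equivalent, more compact counting of name mentions
arthur_counts = Counter(token.text for token in Arthur)
print(arthur_counts["Sherlock"], arthur_counts["Watson"])
agatha_counts = Counter(token.text for token in Agatha)
print(agatha_counts["Poirot"], agatha_counts["Hastings"])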

Step 8: Finally, we will look at the 15 most frequent words used by each author in the stories. We will use the NLTK package for this analysis and plot a cumulative graph.

from nltk import FreqDist

# Build the frequency distribution over the tokens of the spaCy doc
# (the file handle was already consumed by .read() in Step 5)
Arthur_plot = FreqDist(token.text for token in Arthur)
Arthur_plot.plot(15, cumulative=True)

In the code above, we pass the tokens to NLTK’s built-in FreqDist and then use the plot method to draw a cumulative chart of the most-used words. We can see that the 15 most frequently used words in "The Hound of the Baskervilles" add up to close to 1,700 occurrences.

Cumulative chart of most frequent words in "The Hound of the Baskervilles"

We did the same analysis for the story "The Murder on the Links" by Agatha Christie.

Agatha_plot = FreqDist(token.text for token in Agatha)
Agatha_plot.plot(15, cumulative=True)
Cumulative chart of most frequent words in "The Murder on the Links"
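Unsurprisingly, the tops of both charts are dominated by function words such as "the" and "of". If you would rather see content words, a minimal sketch (assuming the doc object Arthur from Step 5) is to filter with spaCy’s token attributes before counting:

from nltk import FreqDist

# Keep only alphabetic, non-stop-word tokens before counting
content_words = [token.text.lower() for token in Arthur
                 if token.is_alpha and not token.is_stop]
FreqDist(content_words).plot(15, cumulative=True)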

We have covered a lot of ground here, so to quickly recap: we started by analysing the number of verbs, nouns and sentences in each story. As a self-exploration exercise, I recommend you expand the part-of-speech analysis to adjectives, prepositions, etc. to gain better insight, as sketched below.
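A minimal sketch of that expansion, assuming the doc object Arthur from Step 5 (tag names follow the Universal POS scheme spaCy uses, where prepositions fall under "ADP"):

from collections import Counter

# Tally every part-of-speech tag in one pass over the doc
pos_counts = Counter(token.pos_ for token in Arthur)
print("Adjectives:", pos_counts["ADJ"])
print("Adpositions (incl. prepositions):", pos_counts["ADP"])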

We then learned to use pattern search and found the number of times the lead character’s name is followed by a verb. Next, we analysed the number of times the lead detectives’ and their associates’ names are mentioned in the respective stories. We concluded our analysis with the 15 most frequently used words in the stories.

As mentioned at the outset, this article intends to show how easy it is to perform natural language processing (NLP) with packages like NLTK and spaCy, and to provide an initial framework for exploring and diving deep into other authors’ writings on your own.

