Simple Python script without the use of heavy text processing libraries to extract most common words from a corpus.

What is the most used word in all of Shakespeare plays? Was ‘king’ more often used than ‘Lord’ or vice versa?
To answer these type of fun questions, one often needs to quickly examine and plot most frequent words in a text file (often downloaded from open source portals such as Project Gutenberg). However, if you search on the web or on Stackoverflow, you will most probably see examples of nltk and use of CountVectorizer. While they are incredibly powerful and fun to use, the matter of the fact is, you don’t need them if the only thing you want is to extract most common words appearing in a single text corpus.
Below, I am showing a very simple Python 3 code snippet to do just that – using only a dictionary and simple string manipulation methods.
Feel free to copy the code and use your own stopwords to make it better!
import collections
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Read input file, note the encoding is specified here
# It may be different in your text file
file = open('PrideandPrejudice.txt', encoding="utf8")
a= file.read()
# Stopwords
stopwords = set(line.strip() for line in open('stopwords.txt'))
stopwords = stopwords.union(set(['mr','mrs','one','two','said']))
# Instantiate a dictionary, and for every word in the file,
# Add to the dictionary if it doesn't exist. If it does, increase the count.
wordcount = {}
# To eliminate duplicates, remember to split by punctuation, and use case demiliters.
for word in a.lower().split():
word = word.replace(".","")
word = word.replace(",","")
word = word.replace(":","")
word = word.replace(""","")
word = word.replace("!","")
word = word.replace("“","")
word = word.replace("‘","")
word = word.replace("*","")
if word not in stopwords:
if word not in wordcount:
wordcount[word] = 1
else:
wordcount[word] += 1
# Print most common word
n_print = int(input("How many most common words to print: "))
print("nOK. The {} most common words are as followsn".format(n_print))
word_counter = collections.Counter(wordcount)
for word, count in word_counter.most_common(n_print):
print(word, ": ", count)
# Close the file
file.close()
# Create a data frame of the most common words
# Draw a bar chart
lst = word_counter.most_common(n_print)
df = pd.DataFrame(lst, columns = ['Word', 'Count'])
df.plot.bar(x='Word',y='Count')
An example of the code output and plot of the 10 most frequently used words in the corpus. The text is ‘Pride and Prejudice‘ and you can see the familiar names of Elizabeth and Mr. Darcy! 🙂

Loved the article? Become a Medium member to continue learning without limits. I’ll receive a portion of your membership fee if you use the following link, with no extra cost to you.