Generating Text in Shakespearean English with Markov Chains
This is a gentle introduction to Markov chains for text generation. We are going to train a model with the Markovify library on three of William Shakespeare's most famous tragedies and see whether the text we generate is legible and coherent. I am aware that these two terms are somewhat nebulous, but I think most people will understand my intention, especially once they see the variation in the generated text.
Markovify is a Python library that brands itself as "a simple, extensible Markov chain generator. Uses include generating random semi-plausible sentences based on an existing text." And I must admit, it is incredibly easy and fast to use. Markov chains themselves are nifty creations that give 'remain' and 'change' probabilities for multi-state processes. I will not go deep into the mathematics of Markov chains here, but feel free to reference [this](https://setosa.io/ev/markov-chains/) and this for a comprehensive overview and visualizations, respectively. For our purposes, I will explain Markov chains visually.

Looking at the above image, we can see we have three possible states: Cloudy, Rainy, and Sunny. Markov chains rely only on the current state to predict a future outcome. If we observe that today is Rainy, our probabilities are as follows: the likelihood it is still raining tomorrow is 60%, the likelihood it is cloudy is 30%, and the likelihood it is sunny is 10%. The same logic applies when we start from the Cloudy or Sunny state.
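The weather chain above can be sketched in a few lines of Python. Note that only the Rainy row of probabilities is given in the text; the Cloudy and Sunny rows below are made-up numbers purely for illustration.

```python
import random

# Transition table: each row gives P(tomorrow's state | today's state).
# The Rainy row matches the example above; the other rows are invented.
transitions = {
    'Rainy':  {'Rainy': 0.6, 'Cloudy': 0.3, 'Sunny': 0.1},
    'Cloudy': {'Rainy': 0.4, 'Cloudy': 0.3, 'Sunny': 0.3},
    'Sunny':  {'Rainy': 0.1, 'Cloudy': 0.3, 'Sunny': 0.6},
}

def next_state(current):
    """Sample tomorrow's weather given only today's state."""
    states = list(transitions[current])
    weights = [transitions[current][s] for s in states]
    return random.choices(states, weights=weights, k=1)[0]

# Simulate a week of weather starting from a Rainy day.
state = 'Rainy'
week = [state]
for _ in range(6):
    state = next_state(state)
    week.append(state)
print(week)
```

The key property is that `next_state` looks only at `current`; the rest of the history is irrelevant, which is exactly the Markov assumption.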
So how the hell does this work with text? Essentially, every word in our corpus is 'connected' to every other word with varying probabilities. So if our initial word (state) is 'Thou', Markovify assigns every other word in the corpus a probability of following it. It might have 'Shall' at 65% likely to follow 'Thou', along with 'is' at 20% and 'may' at 10%, with the remaining 5% spread across the rest of the corpus. Note that the likelihood of 'Thou' following itself should be close to 0%, since a word immediately repeating itself like that wouldn't make much sense, and the same is true for almost all words. For a deeper dive, check out Movies, Metrics, and Musings' breakdown here.
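Under the hood, those word-to-word probabilities are just normalized counts of which word follows which. Here is a minimal sketch using a toy corpus (the sentences are invented for illustration; the real model is trained on the Shakespeare text):

```python
from collections import Counter, defaultdict

# Toy corpus; the real probabilities would come from the full corpus.
corpus = "thou shall not kill . thou shall not steal . thou may rest .".split()

# Count how often each word follows each other word (bigram counts).
follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

# Convert counts into conditional probabilities P(next | current).
def probabilities(word):
    counts = follows[word]
    total = sum(counts.values())
    return {nxt: n / total for nxt, n in counts.items()}

print(probabilities('thou'))  # 'shall' follows 'thou' 2 of 3 times, 'may' 1 of 3
```

Generating text is then just repeated sampling from these conditional distributions, starting from some seed word.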
Generating Text

So we are finally ready to implement Markovify to generate text. You can find my Colab notebook here on Github. First we need to install our libraries and packages.
!pip install nltk
!pip install spacy
!pip install markovify
!python -m spacy download en
We will be using NLTK and spaCy for text preprocessing, since they are the most common choices and our model will generate better text if we parse the corpus first. Now we can import our libraries.
import spacy
import re
import markovify
import nltk
from nltk.corpus import gutenberg
import warnings
warnings.filterwarnings('ignore')
nltk.download('gutenberg')
!python -m spacy download en
For this demo, we are going to use three of Shakespeare's tragedies from the Project Gutenberg corpus that ships with NLTK. We will first print all the documents in the Gutenberg corpus so you can mix and match these as you please.
#inspect Gutenberg corpus
print(gutenberg.fileids())
We will use Macbeth, Julius Caesar, and Hamlet, so next we will import them and inspect the text.
#import novels as text objects
hamlet = gutenberg.raw('shakespeare-hamlet.txt')
macbeth = gutenberg.raw('shakespeare-macbeth.txt')
caesar = gutenberg.raw('shakespeare-caesar.txt')
#print first 100 characters of each
print('\nRaw:\n', hamlet[:100])
print('\nRaw:\n', macbeth[:100])
print('\nRaw:\n', caesar[:100])
Next we will build a utility function to clean our text using the re library. This function will remove unneeded spaces and indentation, punctuation, and the like.
#utility function for text cleaning
def text_cleaner(text):
    text = re.sub(r'--', ' ', text)
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b', '', text)
    text = ' '.join(text.split())
    return text
Next we will continue cleaning by removing the chapter headings and indicators, then apply our text cleaning function.
#remove chapter indicator
hamlet = re.sub(r'Chapter \d+', '', hamlet)
macbeth = re.sub(r'Chapter \d+', '', macbeth)
caesar = re.sub(r'Chapter \d+', '', caesar)
#apply cleaning function to corpus
hamlet = text_cleaner(hamlet)
caesar = text_cleaner(caesar)
macbeth = text_cleaner(macbeth)
We now want to use spaCy to parse our documents. More on the spaCy text processing pipeline can be found here.
#parse cleaned novels
nlp = spacy.load('en')
hamlet_doc = nlp(hamlet)
macbeth_doc = nlp(macbeth)
caesar_doc = nlp(caesar)
Now that our texts are cleaned and processed, we can create sentences and combine our documents.
hamlet_sents = ' '.join([sent.text for sent in hamlet_doc.sents if len(sent.text) > 1])
macbeth_sents = ' '.join([sent.text for sent in macbeth_doc.sents if len(sent.text) > 1])
caesar_sents = ' '.join([sent.text for sent in caesar_doc.sents if len(sent.text) > 1])
shakespeare_sents = hamlet_sents + ' ' + macbeth_sents + ' ' + caesar_sents
#inspect our text
print(shakespeare_sents)
Our text pre-processing is done and we can start using Markovify to generate sentences.
#create text generator using markovify
generator_1 = markovify.Text(shakespeare_sents, state_size=3)
And now for the fun part. We just need to write a loop to generate as many sentences as we want. Below, we will create three sentences of unbounded length and three more capped at 100 characters.
#We will randomly generate three sentences
for i in range(3):
    print(generator_1.make_sentence())
#We will randomly generate three more sentences of no more than 100 characters
for i in range(3):
    print(generator_1.make_short_sentence(max_chars=100))
Some example text:
"He will stay till ye come K. Hamlet , this Pearle is thine , Here ‘s to thy health ."
"My Honourable Lord , I will speake to him ."
Not bad for Shakespearean English, but I think we can do better. We will implement POSifiedText, a subclass of markovify.Text that uses spaCy's part-of-speech tags, to try to improve our generated text.
#next we will use spacy's part of speech to generate more legible text
class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        return ['::'.join((word.orth_, word.pos_)) for word in nlp(sentence)]
    def word_join(self, words):
        sentence = ' '.join(word.split('::')[0] for word in words)
        return sentence
#Call the class on our text
generator_2 = POSifiedText(shakespeare_sents, state_size=3)
And finally, print more sentences using our new generator.
#now we will use the above generator to generate sentences
for i in range(5):
    print(generator_2.make_sentence())
#print sentences of 100 characters or less
for i in range(5):
    print(generator_2.make_short_sentence(max_chars=100))
Some examples:
"He ha ‘s kill’d me Mother , Run away I pray you Oh this is Counter you false Danish Dogges ."
"Thy selfe do grace to them , we rest your Ermites King ."
In Closing
In this article, we walked through how to quickly and easily implement Markovify for text generation with Markov chains. You can see how easy it is to get up and running once you have a cleaned text. I plan on publishing more NLP/text-generation models using neural networks, transformers, and other architectures on this same corpus, with the goal of comparing their complexity and performance.