Simulating Text With Markov Chains in Python

Towards Data Science
3 min read · Dec 22, 2017

In my last post, I introduced Markov chains in the context of Markov chain Monte Carlo methods. This post is a small addendum to that one, demonstrating one fun thing you can do with Markov chains: simulate text.

A Markov chain is a sequence of events in which each event is drawn from a fixed set of possible outcomes, and each outcome determines which outcomes are likely to occur next. The defining property is that all of the information needed to predict the next event is contained in the most recent event: knowing the full history of a Markov chain doesn’t help you predict the next outcome any better than knowing only the last outcome.
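
To make that property concrete, here’s a minimal sketch of a two-state chain. The states and transition probabilities below are invented purely for illustration; the point is that each new state is sampled using only the current one:

import numpy as np

# Hypothetical transition probabilities, made up for illustration.
transitions = {
    'sunny': (['sunny', 'rainy'], [0.8, 0.2]),
    'rainy': (['sunny', 'rainy'], [0.4, 0.6]),
}

state = 'sunny'
for _ in range(10):
    states, probs = transitions[state]
    state = np.random.choice(states, p=probs)  # depends only on the current state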

Markov chains aren’t generally reliable predictors of events in the near term, since most real-world processes depend on more than just the most recent state. They are, however, useful for examining the long-run behavior of a series of events that are related to one another by fixed probabilities.

For any sequence of non-independent events with a limited number of possible outcomes, we can compute conditional probabilities relating each outcome to the outcomes that follow it. Often this simply takes the form of counting how often certain outcomes follow one another in an observed sequence.

To generate a simulation based on a certain text, record, for every word that appears, the words that are used next. Those lists of followers give the distribution of words in that text conditional on the preceding word.
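
To see what that conditional distribution looks like, here’s a toy sentence of my own (not the speech data) and the distribution of words following ‘the’:

from collections import Counter

words = "the dog saw the cat and the dog ran".split()

# Count the words that follow 'the', then normalize into probabilities.
followers = Counter(w2 for w1, w2 in zip(words, words[1:]) if w1 == 'the')
total = sum(followers.values())
probs = {w: n / total for w, n in followers.items()}
# probs is roughly {'dog': 0.667, 'cat': 0.333}, the distribution conditional on 'the'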

In order to simulate some text from Donald Trump, let’s use a collection of his speeches from the 2016 campaign available here.

First, import numpy and read in the text file containing Trump’s speeches:

import numpy as np

with open('speeches.txt', encoding='utf8') as f:
    trump = f.read()

Then, split the text file into single words. Note we’re keeping all the punctuation in, so our simulated text has punctuation:

corpus = trump.split()

Then we define a function to give us all the pairs of words in the speeches. We’re using lazy evaluation: the function returns a generator, so we don’t fill up memory with every pair of words at once:

def make_pairs(corpus):
    for i in range(len(corpus) - 1):
        yield (corpus[i], corpus[i + 1])

pairs = make_pairs(corpus)

Then we instantiate an empty dictionary and fill it with words from our pairs. If the first word of a pair is already a key in the dictionary, we simply append the next word to the list of words that follow it. Otherwise, we initialize a new entry with the key equal to the first word and the value a list of length one:

word_dict = {}

for word_1, word_2 in pairs:
    if word_1 in word_dict:
        word_dict[word_1].append(word_2)
    else:
        word_dict[word_1] = [word_2]
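
As an aside, collections.defaultdict expresses the same logic without the if/else, since it creates the empty list on first access. (Note that the pairs generator was consumed by the loop above, so we build a fresh one here.)

from collections import defaultdict

word_dict = defaultdict(list)
for word_1, word_2 in make_pairs(corpus):
    word_dict[word_1].append(word_2)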

Next we pick a random word to kick off the chain, and choose the number of words we want to simulate:

first_word = np.random.choice(corpus)

chain = [first_word]
n_words = 30

After the first word, every word in the chain is sampled randomly from the list of words which have followed that word in Trump’s actual speeches:

for i in range(n_words):
    chain.append(np.random.choice(word_dict[chain[-1]]))
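
One caveat: if the chain ever reaches a word that only appears as the very last token of the corpus, that word has no recorded followers and the lookup raises a KeyError. A defensive variant of the loop (my addition, not part of the original recipe) falls back to a fresh random word at a dead end:

for i in range(n_words):
    followers = word_dict.get(chain[-1])
    if not followers:  # dead end: no word ever followed this one
        followers = corpus
    chain.append(np.random.choice(followers))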

The final join command returns the chain as a string:

' '.join(chain)

When I run this code, my first result is:

'I will be able to vote. Got them back. We’re going to make a total lie, proven out right after. "During the opposite. We have some turnout. My patients are really'

The nice thing here is that we’re using a dictionary for fast lookup of candidate next words, and because each list stores repeated followers, sampling from it automatically weights each word by how often it follows the previous one. Of course, you can wrap this all up in a function, which I leave as an exercise for the reader.

There are a lot of tools out there to ‘Markovify’ text, and I encourage you to look them up. But for someone just learning Markov chains, the code here is an easy place to start, and there are endless possibilities for improvement. For example, you might require the first word to be capitalized, so your text doesn’t begin mid-sentence:

while first_word.islower():
    first_word = np.random.choice(corpus)
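
In the same spirit, you could keep extending the chain until it lands on a word that ends a sentence, so the output doesn’t cut off mid-thought. This sketch (my own variation) assumes sentence-final punctuation survived the split, which it did since we kept punctuation attached to words, and caps the length so the loop can’t run forever:

terminators = ('.', '!', '?')
while not chain[-1].endswith(terminators) and len(chain) < 2 * n_words:
    chain.append(np.random.choice(word_dict[chain[-1]]))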

I hope this is helpful for those of you getting started in the wide world of Markov chains. If this code can be improved without sacrificing clarity, leave a comment!
