
Using a transition matrix to generate text in Python

A simpler (and less efficient) way to generate text than using neural nets.

Generated text based on Tolstoy’s text (image by the author)

Currently, neural networks represent the state of the art in the field of text generation. Deep neural nets like GPT-3, with billions of parameters and trained on terabytes of data, are truly impressive. But using neural nets is not the only way to generate text. In this post, we demonstrate how transition matrices can help with text generation.

What is a transition matrix?

Imagine having a system with a finite set of different states. The system can move from one state to another with a certain probability. A transition matrix encodes all these probabilities in a matrix P = [p_ij], where p_ij is the probability of moving from state i to state j; each row of the matrix sums to 1.
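To make this concrete, here is a toy illustration (my own example, not from the original article): a two-state weather model written as a plain Python list of lists, with invented state names and probabilities.

# Toy transition matrix for a two-state system: "sunny" and "rainy".
# Row i holds the probabilities of moving from state i to each state;
# every row sums to 1.
P = [
    [0.9, 0.1],  # from "sunny": stay sunny with 0.9, turn rainy with 0.1
    [0.5, 0.5],  # from "rainy": equal chance of either state
]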

Classic examples are:

  • Monopoly, where the states are the squares of the board,
  • the SIR model from epidemiology, used in the study of infectious diseases, where the states are susceptible, infectious, and removed (immune/deceased),
  • the movement of an internet user from one link to another, with states representing the various links.

A more detailed explanation, along with references, can be found on Wikipedia.

The probability of moving from one square to another on the Monopoly board can be encoded by a transition matrix (image from the original patent, source: Wikipedia https://commons.wikimedia.org/wiki/File:BoardGamePatentMagie.png)

Using a transition matrix to generate text

In our case, the states will be pairs of words, and the only possible moves will be from a state (word_1, word_2) to a state (word_2, word_3), so each transition corresponds to a triplet of consecutive words. Practically, this means that our model will use the last two words of a text to decide what the next word will be. The states, as well as the probabilities of moving between them, are the parameters of the model. To define them, a text corpus is used. You can find a working example as a Python Jupyter notebook on GitHub; there is also a link that allows you to run the notebook in Google Colab. The code uses two different corpora:

  • one that includes "Aesop’s Fables" by Aesop and the "Odyssey" and "Iliad" by Homer,
  • one that uses "War and Peace", "Anna Karenina" and "The Cossacks" by Leo Tolstoy.

All the texts are freely available from Project Gutenberg.

As a first step, we define a class that will store our transition matrix. In practice, we are going to use a dictionary to implement what is called a sparse matrix. For example, the table of counts

            a   happy   thing
 i,am       1     2       0
 a,nice     0     0       3

will be represented by the dictionary:

{ "i,am": {"a":1, "happy":2}     , "a,nice": {"thing":3} }

Note that:

  • we do not add entries for cells with zero value,
  • for the column keys, we use only the next word.

The careful reader will notice that the matrix does not store probabilities but integers that count how many times a transition between two states occurred. We will use these counts to calculate the corresponding probabilities.
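Using the example dictionary above, converting counts to probabilities is a one-liner (a small illustrative snippet, not code from the notebook):

# Counts of the words observed after the pair "i,am" (from the example above)
counts = {"a": 1, "happy": 2}
total = sum(counts.values())
probabilities = {word: count / total for word, count in counts.items()}
print(probabilities)  # {'a': 0.333..., 'happy': 0.666...}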

The class has two methods (see the sketch below):

  • "add_tripple", which updates the count of occurrences of a transition between two states,
  • "next_word", which, given a pair of words, randomly selects a next word based on the transition matrix.

For this example, we want to keep things simple, hence we convert all our text inputs to lowercase and remove any special characters.
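A possible cleaning function (a sketch; the exact set of characters the notebook keeps is an assumption — here letters, digits, full stops, and whitespace, so that sentence boundaries survive):

import re

def clean_text(text):
    # lowercase everything, then replace special characters with spaces
    text = text.lower()
    return re.sub(r"[^a-z0-9.\s]", " ", text)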

Next, we read each text file, split it into triplets of successive words, and use them to update the transition matrix.
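This step could look as follows (the filenames are placeholders; clean_text and TransitionMatrix are the sketches above):

tm = TransitionMatrix()
for path in ["war_and_peace.txt", "anna_karenina.txt", "the_cossacks.txt"]:  # placeholder filenames
    with open(path, encoding="utf-8") as f:
        words = clean_text(f.read()).split()
    # every triplet (w1, w2, w3) of consecutive words updates the matrix
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        tm.add_tripple(w1, w2, w3)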

Finally, we use the transition matrix to generate text. We start by providing two words ("it" and "is" in our example) and then use the matrix to randomly select the words that follow.
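The generation loop itself can be as short as this (a sketch built on the class above; the cutoff of 60 words is arbitrary):

def generate(tm, word_1="it", word_2="is", n_words=60):
    output = [word_1, word_2]
    for _ in range(n_words):
        nxt = tm.next_word(word_1, word_2)
        if nxt is None:  # dead end: the pair never appeared in the corpus
            break
        output.append(nxt)
        word_1, word_2 = word_2, nxt  # slide the two-word window forward
    return " ".join(output)

print(generate(tm))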

An example of the output for the Greek corpus is:

it is enchanted and you can easily find another seat near telemachus he said to her ships and shelters there .
 the writer as stretching all and you do or cause to fight for the soil .
 he fenced the raft .
 at last they shall have plenty of it but by

while for the Tolstoy corpus:

it is really ended ? i am an exception .
 .
 but being with nature seeing her patient smiling face and rigid .
 only when they came to a longstanding impression related to the war of 1815 alexander possesses all .
 he could not make out something black .
 pierre received one

A final step would be to remove the space before punctuation marks, capitalize the first letter of words following them, and convert "i" to "I". This is left as an exercise. (A more challenging task would be to capitalize the first letter of names and places.)

Conclusion

Chimpanzee seated at a typewriter – Public Domain

a monkey hitting keys at random on a typewriter will almost surely type any given text, such as the complete works of William Shakespeare – Infinite Monkey Theorem

One likely question might be "is this method of any value?"

The answer is that it depends. The general consensus is that neural network methods are the state of the art when it comes to generating text, and it is clear that the text we generated is far from perfect. But it is worth noting that it took us seconds to train the model and that we used less than 6 MB of data. Using more data would provide better results. In my opinion, the method might come in handy in cases where not-so-perfect text needs to be generated quickly and from a small amount of input data.

Have fun generating text using the Jupyter notebook!

