Summarizing The Great Gatsby using Natural Language Processing

An Introduction to Automatic Summarization using NLP, Markov Chains, Adjacency Matrices, and Machine Learning

Andrew Oliver
Towards Data Science

--

If there is one consistent assignment across 10ᵗʰ grade English classes, it is writing a summary of The Great Gatsby (TGG). TGG is an engaging and exciting work of literature. It has central themes that weave themselves throughout the novel: themes of society and class, wealth and status, the present and the past. A cogent and concise summary is no easy task for a 10ᵗʰ grader.

In this article, we’re going to summarize TGG using some methods from Natural Language Processing (NLP). NLP is a subfield of Artificial Intelligence and Machine Learning that studies how computers process and understand organic forms of communication (e.g. written language). While these buzzwords can be intimidating, the underlying concepts are not.

Cleaning and Aggregating TGG

Before we can do any summarizing, we need the text of the book itself. We begin with a copy provided by Project Gutenberg Australia. First, we need to remove stop words. Stop words are words like the, and, in, for, an, etc. They are necessary to create well-formed sentences, but they don’t add easily discernible meaning and they can distort word frequency analysis. We’ll use the list of stop words here. Second, we’ll break TGG into a map of (key, value) pairs. The keys in our map will be the full sentences (stop words and all). The values will be arrays representing the cleaned version of each sentence.
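As a rough sketch of what this step might look like in code (the names here, like build_sentence_map and gatsby.txt, are illustrative rather than taken from the actual repo), assuming the Gutenberg text is saved locally and NLTK’s sentence tokenizer and English stop-word list are available:

```python
import re
from nltk.corpus import stopwords           # requires: nltk.download('stopwords')
from nltk.tokenize import sent_tokenize     # requires: nltk.download('punkt')

STOP_WORDS = set(stopwords.words("english"))

def build_sentence_map(path="gatsby.txt"):
    """Map each full sentence to the cleaned list of its lowercase, non-stop-word tokens."""
    with open(path, encoding="utf-8") as f:
        text = f.read()

    sentence_map = {}
    for sentence in sent_tokenize(text):
        tokens = re.findall(r"[a-z']+", sentence.lower())      # keep only word-like tokens
        cleaned = [t for t in tokens if t not in STOP_WORDS]   # drop stop words
        if cleaned:                                            # skip sentences that are all stop words
            sentence_map[sentence] = cleaned
    return sentence_map
```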

Markov Chains with Cosine Similarity

For those unfamiliar with graph theory, it’s very simple. Elementary graphs have two parts: nodes and edges. A node represents a real-world entity such as a person, a phone number, or a city. An edge is a connection between two nodes. One common example of a graph is Facebook: the nodes are Facebook users, and there is an edge between two nodes if those two users are friends. There’s more on graph theory here.

For our purposes, we’re going to represent TGG as a graph. We’ll have a node for every sentence, and two nodes will be connected by an edge whose weight equals their sentence similarity (we’ll get to this in a second). But before that, why is this representation helpful?

This allows us to represent TGG as a Markov Chain. A Markov Chain is a probabilistic model that describes a sequence of states by defining the probability of transitioning from one state to another.

Suppose I want to represent the places I drive as a Markov Chain. Let’s say that I only drive to and from four places: home, work, the store, and the gym. From each location, there is some probability that I will drive to each of the other locations. This is illustrated graphically below. If two nodes are not connected, the transition probability between them is 0%. In the graph below, I never drive between the store and the gym without coming home first.

We can use this Markov Chain to find the stationary probability that I will be in any given location. In the above graph, it intuitively makes sense that I will most likely be at home at any given time. This is because there are many nodes that point to home with a high probability.
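To make this concrete, here is a small sketch with invented transition probabilities for those four locations. The specific numbers are made up for illustration; the point is that repeatedly applying the transition matrix settles into a stationary distribution, and home comes out on top:

```python
import numpy as np

# Hypothetical transition matrix for the four locations (each row sums to 1).
# Order: home, work, store, gym. The store<->gym entries are 0 because I
# never drive directly between them.
P = np.array([
    [0.10, 0.50, 0.25, 0.15],  # from home
    [0.70, 0.10, 0.20, 0.00],  # from work
    [0.80, 0.10, 0.10, 0.00],  # from store
    [0.90, 0.00, 0.00, 0.10],  # from gym
])

# Power iteration: keep applying the transition matrix until the
# probabilities stop changing.
state = np.full(4, 0.25)       # start in each location with equal probability
for _ in range(100):
    state = state @ P

print(dict(zip(["home", "work", "store", "gym"], state.round(3))))
# 'home' ends up with the largest stationary probability.
```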

Now, back to Gatsby! Let’s define the transition probability between two sentences as the cosine similarity between them. We’ll then find the stationary probability distribution for our Markov Chain. The sentences with the highest stationary probability are the nodes that are best connected within our graph. In the example below, Node A would likely have the highest stationary probability.

Highly connected nodes will have a high stationary probability. These nodes should represent summaries of a key topic because they are related to many other sentences. But, before we get too far ahead of ourselves, we need to define cosine similarity.

Suppose we have two sentences: “Jack and Jill went up the hill” and “Jill and Jack ran down the hill”. Cosine similarity treats these sentences as vectors of word counts and measures their overlap: it calculates the dot product of the two word vectors and divides it by the product of each vector’s magnitude.
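Here is a minimal sketch of that calculation on the two example sentences, using raw word counts (in the actual pipeline we would feed in the cleaned, stop-word-free token lists):

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two bags of words: dot product over the product of magnitudes."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[word] * b[word] for word in a)                   # dot product of word-count vectors
    mag_a = math.sqrt(sum(count ** 2 for count in a.values()))   # vector magnitudes
    mag_b = math.sqrt(sum(count ** 2 for count in b.values()))
    return dot / (mag_a * mag_b) if mag_a and mag_b else 0.0

s1 = "jack and jill went up the hill".split()
s2 = "jill and jack ran down the hill".split()
print(cosine_similarity(s1, s2))  # ~0.71, since the two sentences share five of seven words
```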

And now we are all set. We’re going to represent our graph as a matrix. The value at index (X, Y) will be the cosine similarity between Sentence X and Sentence Y, which serves as the transition probability from Sentence X to Sentence Y. We’ll use these transition probabilities to find the stationary probability of each node.
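A sketch of that construction, reusing the hypothetical sentence_map and cosine_similarity helpers from the earlier snippets and row-normalizing so that each row forms a valid probability distribution:

```python
import numpy as np

def build_transition_matrix(sentence_map):
    """Adjacency matrix of pairwise cosine similarities, row-normalized into transition probabilities."""
    cleaned = list(sentence_map.values())
    n = len(cleaned)
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                matrix[i, j] = cosine_similarity(cleaned[i], cleaned[j])

    row_sums = matrix.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0     # avoid dividing by zero for isolated sentences
    return matrix / row_sums          # each row now sums to 1
```

This pairwise loop is O(n²) in the number of sentences, which is why the article notes that computing the adjacency matrix takes a little while.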

Finding stationary probabilities is relatively straightforward in a Markov Chain. We can repeatedly multiply the transition probability matrix by itself until we reach a steady state, the point at which the probabilities stop changing. A more efficient solution is to use left eigenvectors. For those interested in this method, see here.
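Both routes are only a few lines of NumPy. Here is a sketch of the eigenvector version, which looks for the left eigenvector of the transition matrix whose eigenvalue is 1 (stationary_distribution is an illustrative helper, not the repo’s actual function):

```python
import numpy as np

def stationary_distribution(transition_matrix):
    """Stationary distribution as the left eigenvector of the transition matrix with eigenvalue 1."""
    eigenvalues, eigenvectors = np.linalg.eig(transition_matrix.T)
    idx = np.argmin(np.abs(eigenvalues - 1.0))    # the eigenvalue closest to 1
    vec = np.real(eigenvectors[:, idx])
    return vec / vec.sum()                        # normalize into a probability distribution

# Putting the pieces together and pulling out the top five sentences:
# sentences = list(sentence_map.keys())
# probs = stationary_distribution(build_transition_matrix(sentence_map))
# top_five = [sentences[i] for i in np.argsort(probs)[::-1][:5]]
```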

Now that we have the steady state, we can look for the highest probabilities. The sentences with the highest steady-state probabilities are below.

"I'm Gatsby," he said suddenly. 
--------------------------------------------------------------------
"You two start on home, Daisy," said Tom. "In Mr. Gatsby's car."
--------------------------------------------------------------------
"I told you I went there," said Gatsby.
--------------------------------------------------------------------
"I want you and Daisy to come over to my house," he said, "I'd like to show her around."
--------------------------------------------------------------------
She had told him that she loved him, and Tom Buchanan saw. He was astounded. His mouth opened a little and he looked at Gatsby and then back at Daisy as if he had just recognized her as some one he knew a long time ago.

Now, here comes the most fun part of data science — making conclusions that our data does not support. Let’s evaluate our summary below.

In our last sentence, Daisy tells Gatsby she loves him and Tom Buchanan, her husband, sees. This sentence captures the complex relationship between Gatsby, Daisy, and Tom. In our fourth sentence, we see Gatsby wanting to show Daisy around his house. He’s convinced Daisy will want to be with him if she sees that he is now rich and successful. This captures Gatsby’s struggle to obscure his past with his current success, a central theme in the novel. And our first sentence captures the iconic moment of Gatsby introducing himself. Our model has done it! We have summarized The Great Gatsby!

Credit: The Great Gatsby, Warner Brothers, 2013

There’s an easy way to get from our analysis to the paragraph above: it only requires one hop, one skip, and one jump. In other words, our data in no way implies those conclusions. Our methods were sound and our analysis was thoughtful, but I brought in a lot of outside knowledge to reach them.

We highlight this point not to shortchange this approach but to recognize the limitations of our method. We can reasonably conclude that Gatsby, Daisy, and Tom are relevant characters and that there is some relationship between Gatsby and Daisy. We have certainly found some key ideas but we are far from a well-formed and thorough summary.

Looking Forward

There are certainly some things we can do to improve our method, primarily around how we determine sentence similarity. We can use TF-IDF to see which words are most relevant in a sentence and weight them accordingly. When measuring cosine similarity, we do not need to require exact word matches; we can also count words that are similar in meaning but not in spelling (e.g. happy and elated). If we want to get more intense, we can use advanced topic models like Latent Dirichlet Allocation (LDA).
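As one illustration of the TF-IDF idea, scikit-learn can produce TF-IDF weighted sentence vectors and their pairwise cosine similarities directly. This is a sketch of how that weighting could replace the raw-count similarity above, not part of the original code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_similarity_matrix(sentences):
    """Pairwise cosine similarity on TF-IDF vectors instead of raw word counts."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    return cosine_similarity(vectors)   # n_sentences x n_sentences similarity matrix

# sims = tfidf_similarity_matrix(list(sentence_map.keys()))
```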

Automated summarization breaks down into two main areas: extractive methods and abstractive methods. Everything we have talked about here is an extractive method; we are pulling the most relevant sentences from the text itself. But no human being writes a summary that way. Humans take concepts, generalize them, consider patterns, and produce a result. That is an abstractive method. And for that, we’ll need the three buzziest words in computer science: deep neural network.

We’re going to use that approach soon, so keep an eye out for that article.

Code

For those interested, all the code required to run this can be found in the GitHub repo here. There is Python code and a Jupyter Notebook. Cleaning the data and calculating the adjacency matrix take a little while, but with the Jupyter Notebook you only have to run those methods once. The method definitions and code structure parallel this article, so it’s easy to follow along.

Thanks for reading!

Questions? Comments? Email me at andrew.oliver.medium@gmail.com. I’d love to hear from you!
