
Text Summarization in Python with Jaro-Winkler and PageRank

Image taken by nadi borodina from Unsplash


Table of Contents

  • Introduction to Text Summarization
  • Difficulties of Text Summarization
  • Architecture Overview
  • Problem Statement
  • Installation Requirements
  • Data
  • Implementation
  • Load Text
  • Clean Text
  • Create Network
  • Generate Summary
  • Concluding Remarks
  • Resources

Introduction to Text Summarization

In modern society, with the need to ingest information in a rapid format, text summarization is an essential tool used by many on a daily basis. Prominent startups like Blinkist provide short, detailed summaries of a large corpus of books, saving their users hours if not days of reading. Many other companies also use proprietary text summarization software to save their users time. Other forms of summarization (albeit in a different medium) come in the form of highlight reels of sports events or movie recaps.

There are two main approaches to text summarization, namely, extractive and abstractive. The extractive approach selects specific sentences from the body of text to generate the final summary; the general strategy is to rank sentences by their importance in the body of text and return the most important ones to the user. The abstractive approach, on the other hand, involves creating new sentences to capture the context and meaning behind the original body of text. Although this is how humans approach text summarization, it’s quite difficult to generalize and teach to a machine. Text compression techniques are commonly used in abstractive approaches to text summarization [4].

In this article I will show you how to build an algorithmic text summarizer using the extractive approach. It relies on two main algorithms: the Jaro-Winkler distance, which measures the distance between a pair of sentences, and the PageRank algorithm, which ranks the sentences based on their influence in the network.

Difficulties of Text Summarization

Summarization problems are quite tricky to solve. There are a lot of caveats, and it is quite difficult to generalize a solution to work for any body of text. Various factors change the impact the summary has relative to the original story. I’ve outlined a few components below which make summarization a very difficult task in NLP:

  1. What is the ideal number of sentences that the summary must hold? A number too large might be useless, since you’re practically reading the entire body of text. A number too small might have large consequences for the summary, skipping over important details and plot lines. The shorter the summary, the less information it can hold.
  2. How can you give context in the summary? When providing summaries of subplots within the story, context is the most important part; it’s what makes the summary useful. Keeping context in the summary is a very difficult task, and this problem is more prominent in extractive solutions than abstractive ones.

Architecture Overview

The extractive solution I’m proposing will have the following architecture:

Extractive summarization architecture proposed for this article. Image provided by author.

It will be a 6-step solution (sketched in code after the list):

  1. Get the input text
  2. Clean the input text of certain punctuation, stopwords, etc.
  3. Generate a sentence similarity adjacency matrix
  4. Create the sentence network (nodes are sentences and edges hold the similarity of a pair of sentences)
  5. Rank the nodes in the network using PageRank or other ranking measures
  6. Generate the summary based on the top N ranked nodes (sentences)
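
Here is a minimal sketch of how these six steps fit together; the helper function names are hypothetical placeholders that the Implementation sections below flesh out.

# Illustrative skeleton of the 6-step pipeline; the helper functions are
# hypothetical placeholders defined in the Implementation sections below.
def summarize(text: str, top_n: int = 25) -> str:
    original, cleaned = load_and_clean(text)          # steps 1-2
    sim_matrix = build_similarity_matrix(cleaned)     # step 3
    graph = build_network(sim_matrix)                 # step 4
    scores = rank_sentences(graph)                    # step 5
    return generate_summary(scores, original, top_n)  # step 6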

Problem Statement

Given a body of text, we will create a pipeline that generates a summary of it. For the approach outlined above, this pipeline requires the following Python modules and versions.

Installation Requirements

Python=3.8.8
jaro-winkler=2.0.1
networkx=2.5
nltk=3.6.1
numpy=1.20.1

You can install the Jaro package with the following command, !pip install jaro-winkler, or you can reference the installation documentation here.

Data

I’ve set up this tutorial so that any body of text can be input to the pipeline. Continue with the text I’ve selected or use your own. Below is a set of books you can easily download through the NLTK library and the Gutenberg Project. All of these books are free to use, as they are part of the public domain.

For the purposes of this tutorial, I will be using Shakespeare’s play Julius Caesar. Do note that this and all other bodies of text written by William Shakespeare are in the public domain, as they were created prior to the existence of any copyright restrictions [2].

Implementation

Load Text
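
A minimal sketch of the loading step, assuming the book is pulled from NLTK’s Gutenberg corpus; any plain-text source would work equally well.

# A minimal sketch of the loading step, assuming NLTK's Gutenberg corpus
import nltk
nltk.download("gutenberg", quiet=True)
from nltk.corpus import gutenberg

# Raw text of Julius Caesar from the Gutenberg corpus
book = gutenberg.raw("shakespeare-caesar.txt")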

Below is a sample output of how the book variable looks:

Subset of the imported book. Image provided by the author.

Clean Text

In this section, we’re going to clean up the text we’ve just imported. We will primarily remove stopwords and certain punctuation so the procedures later in the pipeline can be more efficient and accurate in their calculations.
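
Here is a minimal sketch of that cleaning step, assuming NLTK’s punkt sentence tokenizer and English stopword list; the exact punctuation handling is an illustrative choice.

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

stop_words = set(stopwords.words("english"))

def clean_sentence(sentence):
    # Lowercase, strip non-alphabetic characters, and drop stopwords
    words = re.sub(r"[^a-zA-Z\s]", " ", sentence).lower().split()
    return " ".join(w for w in words if w not in stop_words)

# Keep the original sentences (for the final summary) index-aligned with
# the cleaned copies (used for the similarity calculation)
original_sentences = sent_tokenize(book)
cleaned_sentences = [clean_sentence(s) for s in original_sentences]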

After cleaning up the body of text, we’re left with 1289 sentences to create a network from.

Create Network

In this section we will create an adjacency matrix based on the similarity of each pair of sentences, calculated using the Jaro-Winkler distance [3]. The output of this distance measure is a floating point number between 0 and 1: the closer the number is to 1, the more similar the pair of sentences; the closer it is to 0, the less similar they are.
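
A minimal sketch of the similarity matrix computation, assuming the jaro package’s jaro_winkler_metric function and the cleaned_sentences list from the previous step:

import jaro
import numpy as np

n = len(cleaned_sentences)
similarity_matrix = np.zeros((n, n))

for i in range(n):
    for j in range(i + 1, n):
        # Jaro-Winkler similarity is symmetric, so fill both entries
        score = jaro.jaro_winkler_metric(cleaned_sentences[i], cleaned_sentences[j])
        similarity_matrix[i, j] = score
        similarity_matrix[j, i] = score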

This adjacency matrix will then allow us to create a weighted network (through NetworkX). Please keep in mind the size of your input body of text: since the similarity matrix needs to assign a score to each pair of sentences, the process will be extensive and exhausting for your computer if the body of text is very large (> 2000 sentences).

In the created network, the nodes will be indices associated with the sentences in the book, and the edges connecting the sentences will be weighted based on the similarity of each pair of sentences. We can then run the PageRank algorithm on that weighted network to identify nodes with a large rank associated with them.
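
A minimal sketch of that step, assuming the similarity_matrix computed above:

import networkx as nx

# Nodes are sentence indices; edge weights are the pairwise similarities
G = nx.from_numpy_array(similarity_matrix)

# PageRank scores keyed by sentence index
scores = nx.pagerank(G, weight="weight")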

At a very high level, the intuition behind the PageRank algorithm is to identify highly popular nodes through the use of random walks. It’s a very popular network-based algorithm developed by Larry Page and was famously used by Google to rank web pages in their search engine results [1]. In the context of our example, PageRank will tell us the most influential sentences in the network. We can then take the top N most influential sentences, join them together, and return them to the user as a summary of the original body of text. Do note that there are many ways to rank a node; although PageRank is the one I’m proposing we use, ranking measures like eigenvector centrality, degree centrality, betweenness centrality, and others are also valid for the purposes of this pipeline.

It took ~2 minutes to create the similarity matrix for 1289 sentences. The creation of the network and the calculation of the PageRank scores were almost instantaneous. The following is a summary of the created network:

Sentence graph stats. Image provided by the author.

The built-in PageRank function outputs a dictionary where the keys are the nodes (sentence indices in our case) and the values are the PageRank scores associated with those nodes.

PageRank calculation output. Keys are nodes and values are the PageRank scores associated with each node. Image provided by the author.

Do keep in mind that we want to map each sentence index back to the original sentence, not the cleaned one used to create the network. This way, the resulting summary is much easier to interpret, since it retains the stopwords, punctuation, etc.

Generate Summary

Now that we have ranked the sentences based on their PageRank scores, we can generate a summary given the number of sentences the user wants in the summary.
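
A minimal sketch of this final step, assuming the scores dictionary and the original_sentences list kept earlier; top_n is the user’s requested summary length.

def generate_summary(scores, original_sentences, top_n=25):
    # Indices of the top_n highest-ranked sentences
    top_indices = sorted(scores, key=scores.get, reverse=True)[:top_n]
    # Map back to the original (uncleaned) sentences, in story order
    return " ".join(original_sentences[i] for i in sorted(top_indices))

summary = generate_summary(scores, original_sentences, top_n=25)
print(summary)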

For the purposes of this article, I chose 25 as the number of sentences to summarize this body of text. This was the associated output:

Shakespeare’s Julius Caesar summary in 25 sentences. Image provided by the author.

Concluding Remarks

The quality of the summary will vary from text to text; certain bodies of text lend themselves to this task better than others. Overall, the summary this pipeline produced for Julius Caesar wasn’t too poor. It outlines the main plot line of Brutus (Caesar’s friend) conspiring to assassinate Caesar.

You can find the Jupyter Notebook associated with the implementation of this pipeline in my GitHub here.

Resources


If you liked this article, you might also enjoy:

Link Prediction Recommendation Engines with Node2Vec

Word2Vec Explained

Recommendation Systems Explained

Optimize Training and Predicting SK-Learn Models in Python with Sparse Matrices

Dynamic Time Warping Explained

Mining & Modelling Character Networks – Part II

Monte Carlo Method Explained

Bayesian A/B Testing Explained

