Table of Contents
- Introduction to Text Summarization
- Difficulties of Text Summarization
- Architecture Overview
- Problem Statement
- Installation Requirements
- Data
- Implementation
- Load Text
- Clean Text
- Create Network
- Generate Summary
- Concluding Remarks
- Resources
Introduction to Text Summarization
In modern society, with the need to ingest information rapidly, text summarization is an essential tool used by many on a daily basis. Prominent startups like Blinkist provide short, detailed summaries of a large corpus of books, saving their users hours if not days of reading. Many other companies also use proprietary text summarization software to save their users time. Other forms of summarization (albeit in a different medium) come in the form of highlight reels of sports events or movie recaps.
There are two main approaches to text summarization, namely, extractive and abstractive. The extractive approach selects specific sentences from the body of text to generate the final summary; the general strategy is to rank sentences by their importance in the body of text and return the most important ones to the user. Abstractive approaches, in contrast, involve creating new sentences to capture the context and meaning behind the original body of text. Although this is how humans approach text summarization, it’s quite a difficult thing to generalize and teach to a machine. Text compression techniques are commonly used in abstractive approaches to text summarization [4].
In this article I will show you how to build an algorithmic text summarizer using the extractive approach. I will rely on two main algorithms: first, the Jaro-Winkler distance, to measure the similarity between a pair of sentences; and second, the PageRank algorithm, which will rank the sentences based on their influence in the network.
Difficulties of Text Summarization
Solving summarization problems is quite tricky; there are a lot of caveats, and it is difficult to generalize a solution to work for any body of text. There are various factors in text summarization which change the impact the summary might have relative to the original story. I’ve outlined a few components below which make summarization a very difficult task in NLP:
- What is the ideal number of sentences the summary should hold? A number too large might be useless, since you’re practically reading the entire body of text. A number too small might have large consequences for the summary, skipping over important details and plot lines. The shorter the summary, the less information it can hold.
- How can you preserve context in the summary? When providing summaries of subplots within the story, context is the most important part; it’s what makes the summary useful. Keeping context in the summary is a very difficult task, and this problem is more prominent in extractive solutions than abstractive ones.
Architecture Overview
The extractive solution I’m proposing will have the following architecture:

It will be a 6 step solution:
- Get the input text
- Clean the input text of certain punctuations, stopwords, etc.
- Generate a sentence similarity adjacency matrix
- Create the sentence network (nodes are sentences and edges hold the similarity of a pair of sentences)
- Rank the nodes in the network using PageRank or other ranking measures
- Generate the summary based on the top N ranked nodes (sentences).
Problem Statement
Given a body of text, we will create a pipeline which generates a summary of that input. For the approach to solving this problem outlined above, the pipeline requires the following Python modules and versions.
Installation Requirements
Python=3.8.8
jaro-winkler=2.0.1
networkx=2.5
nltk=3.6.1
numpy=1.20.1
You can install the Jaro package through the following command: !pip install jaro-winkler, or you can reference the installation documentation here.
Data
I’ve set up this tutorial in a manner that any body of text can be input to the pipeline. Continue on with the text I’ve selected or use your own. Below is a quick way to see the set of books you can easily download through the NLTK library and the Gutenberg Project. All of these books are free to use, as they are part of the public domain.
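As a minimal sketch (assuming you have NLTK installed per the requirements above, and an internet connection for the one-time corpus download), you can list the bundled books like so:

```python
import nltk

# One-time download of the bundled Gutenberg corpus
nltk.download('gutenberg')

from nltk.corpus import gutenberg

# Print the file identifier of every book available in the corpus
for file_id in gutenberg.fileids():
    print(file_id)
```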
For the purposes of this tutorial, I will be using Shakespeare’s play Julius Caesar. Do note that this and all other bodies of text written by William Shakespeare are in the public domain, as they were created prior to the existence of any copyright restrictions [2].
Implementation
Load Text
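The loading step is a one-liner when pulling from the NLTK Gutenberg corpus; the sketch below assumes you are using shakespeare-caesar.txt, NLTK’s file identifier for Julius Caesar:

```python
from nltk.corpus import gutenberg

# Load the raw text of Julius Caesar into a single string
book = gutenberg.raw('shakespeare-caesar.txt')

# Preview the first few hundred characters
print(book[:500])
```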
Below is a sample of what the book variable looks like:

Clean Text
In this section we’re going to clean up the text we’ve just imported. We will primarily remove stop words and certain punctuation so that the procedures later in the pipeline can be more efficient and accurate in their calculations.
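The exact cleaning steps are a design choice; the sketch below assumes NLTK’s punkt sentence tokenizer and English stop word list, and it keeps the original sentences alongside the cleaned ones so we can map back to them when generating the summary:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize

nltk.download('punkt')      # sentence tokenizer models
nltk.download('stopwords')  # stop word lists

stop_words = set(stopwords.words('english'))

def clean_sentence(sentence):
    """Lowercase a sentence, strip punctuation/digits and drop stop words."""
    sentence = re.sub(r'[^a-z\s]', '', sentence.lower())
    return ' '.join(w for w in sentence.split() if w not in stop_words)

# Keep both versions: originals for the final summary,
# cleaned copies for the similarity calculations.
original_sentences = sent_tokenize(book)
cleaned_sentences = [clean_sentence(s) for s in original_sentences]

# Drop sentences that are empty after cleaning, filtering both
# lists together so the indices stay aligned.
pairs = [(o, c) for o, c in zip(original_sentences, cleaned_sentences) if c]
original_sentences, cleaned_sentences = (list(t) for t in zip(*pairs))

print(len(cleaned_sentences))
```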
After cleaning up the body of text, we’re left with 1289 sentences to create a network from.
Create Network
In this section we will create an adjacency matrix based on the similarity of sentence pairs. The similarity of two sentences will be calculated using the Jaro-Winkler distance [3]. The output of this distance measure is a floating point number between 0 and 1: the closer the number is to 1, the more similar the pair of sentences; the closer it is to 0, the less similar they are.
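As a quick illustration, and assuming the jaro module exposed by the jaro-winkler package from the requirements, scoring a pair of (cleaned) sentences looks like this:

```python
import jaro  # provided by the jaro-winkler package

s1 = 'friends romans countrymen lend ears'
s2 = 'romans countrymen lend ears'

# Returns a float in [0, 1]; values near 1 mean near-identical strings
print(jaro.jaro_winkler_metric(s1, s2))
```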
This adjacency matrix will then allow us to create a weighted network (through NetworkX). Please keep in mind the size of your input body of text: since the similarity matrix needs to assign a score to each pair of sentences, the work grows quadratically with the number of sentences, and it will be an extensive and exhausting process for your computer if your body of text is very large (> 2000 sentences).
In the created network, the nodes will be indices associated with the sentences in the book, and the edges connecting the sentences will be weighted based on the similarity of each pair of sentences. We can then run the PageRank algorithm on that weighted network to identify nodes with a large rank associated with them.
At a very high level, the intuition behind the PageRank algorithm is to identify highly popular nodes through the use of random walks. It’s a very popular network-based algorithm created by Larry Page and was famously used by Google to rank web pages in its search engine results [1]. In the context of our example, PageRank will identify highly influential sentences in the network. We can then take the top N most influential sentences, join them together, and return them to the user in the form of a summary of the original body of text. Do note that there are many ways to rank a node; although PageRank is the one I’m proposing we use, ranking measures like eigenvector centrality, degree centrality, betweenness centrality, and others are also valid for the purposes of this pipeline.
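A minimal sketch of this step is below; it reuses the cleaned_sentences list from the cleaning step, the variable names are my own, and NetworkX’s built-in pagerank function does the ranking:

```python
import jaro
import networkx as nx
import numpy as np

n = len(cleaned_sentences)
similarity_matrix = np.zeros((n, n))

# Score every pair of sentences; the matrix is symmetric,
# so compute the upper triangle only and mirror it.
for i in range(n):
    for j in range(i + 1, n):
        score = jaro.jaro_winkler_metric(cleaned_sentences[i], cleaned_sentences[j])
        similarity_matrix[i][j] = score
        similarity_matrix[j][i] = score

# Weighted network: nodes are sentence indices, edge weights are similarities
G = nx.from_numpy_array(similarity_matrix)
print(nx.info(G))  # summary of the created network

# Rank each node (sentence) by its influence in the network
pagerank_scores = nx.pagerank(G, weight='weight')
```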
It took ~2 minutes to create the similarity matrix given 1289 sentences. The creation of the network and the calculation of the PageRank scores were almost instantaneous. The following is a summary of the created network:

The built-in PageRank function will output a dictionary where the keys are the nodes (sentence indices in our case) and the values are the PageRank scores associated with those nodes.

Do keep in mind that we want to map each sentence index back to the original sentence, not the cleaned one used to create the network. This way, the resulting summary will be much easier to interpret, since it retains the stop words, punctuation, etc.
Generate Summary
Now that we have ranked the sentences based on their PageRank scores, we can generate a summary given the user’s input for the number of sentences they want in the summary.
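A sketch of that final step is below; the generate_summary name and the choice to re-sort the chosen indices back into their original narrative order are my own assumptions rather than requirements of the pipeline:

```python
def generate_summary(pagerank_scores, original_sentences, n_sentences=25):
    """Join the top-N ranked sentences into a single summary string."""
    # Sort sentence indices from highest to lowest PageRank score
    ranked = sorted(pagerank_scores, key=pagerank_scores.get, reverse=True)
    # Keep the top N, then restore their original narrative order
    top_indices = sorted(ranked[:n_sentences])
    # Map indices back to the ORIGINAL sentences, not the cleaned ones
    return ' '.join(original_sentences[i] for i in top_indices)

summary = generate_summary(pagerank_scores, original_sentences, n_sentences=25)
print(summary)
```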
For the purposes of this article, I chose 25 as the number of sentences with which to summarize this body of text. This was the associated output:

Concluding Remarks
The quality of the summary will vary from text to text; certain bodies of text might be better suited than others for this task. Overall, the summary this pipeline produced for Julius Caesar wasn’t too poor. It captures the main plot line that Brutus (Caesar’s friend) conspired to assassinate Caesar.
You can find the Jupyter Notebook associated with the implementation of this pipeline in my GitHub here.
Resources
- [1] https://en.wikipedia.org/wiki/PageRank
- [2] https://en.wikipedia.org/wiki/Public_domain
- [3] https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
- [4] https://arxiv.org/pdf/1707.02268.pdf
If you liked this article, you might also enjoy:
Link Prediction Recommendation Engines with Node2Vec
Recommendation Systems Explained
Optimize Training and Predicting SK-Learn Models in Python with Sparse Matrices
Dynamic Time Warping Explained