
Have you ever had one too many reports to read and you just want a quick summary of each report? Were you ever in a situation where everybody just wanted to read a summary instead of a full-blown report?
Summarization has become a very helpful way of tackling information overload in the 21st century. In this story, I will show you how to create your personal text summarizer using Natural Language Processing (NLP) in Python.
Foreword: a personal text summarizer is not hard to create – a beginner can easily do it!
What is text summarization?
Text summarization is the task of generating an accurate and concise summary of a longer text while preserving its key information and overall meaning.
There are two general types of summarization:
- Abstractive summarization >> generates new sentences that paraphrase the original text.
- Extractive summarization >> identifies the important sentences in the original text and builds the summary from them.
Which summarization method should I use, and why?
I use extractive summarization because this method can be applied to many documents without a lot of (daunting) machine learning model training.
Besides that, extractive summarization often gives a better summary than abstractive summarization, because an abstractive method has to generate new sentences from the original text, which is harder than the data-driven approach of extracting the most important sentences.
How to create your own Text Summarizer?
We will use a word histogram to rank the importance of sentences and, subsequently, build the summary from the top-ranked ones. The benefit of this approach is that you don’t need to train a model before applying it to your document.
Text Summarization Workflow
Below is the workflow that we will be following…
import text >> clean text and split into sentences >> remove stop words >> build word histogram >> rank sentences >> select top N sentences for summary
(1) Sample Text
I used the text from a news article entitled Apple Acquires AI Startup For $50 Million To Advance Its Apps. You can find the original news article here.
You can also download the text document from my Github.
(2) Import libraries
# Natural Language Tool Kit (NLTK)
import nltk
nltk.download('stopwords')
nltk.download('punkt')
# Regular Expression for text preprocessing
import re
# Heap (priority) queue algorithm to get the top sentences
import heapq
# NumPy for numerical computing
import numpy as np
# pandas for creating DataFrames
import pandas as pd
# matplotlib for plot
from matplotlib import pyplot as plt
%matplotlib inline
(3) Import text and perform preprocessing
There are many ways to do this. The goal here is to have a clean text that we can feed into our model.
# load text file
with open('Apple_Acquires_AI_Startup.txt', 'r') as f:
    file_data = f.read()
Here, we use regular expressions for text preprocessing. We will (A) replace reference numbers, e.g. [1], [10], [20], with a space (if any), and (B) replace one or more whitespace characters with a single space.
text = file_data
# replace reference numbers with a space, if any
text = re.sub(r'\[[0-9]*\]', ' ', text)
# replace one or more whitespace characters with a single space
text = re.sub(r'\s+', ' ', text)
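A quick sanity check on a made-up snippet shows what these two substitutions do (the sentence is just an illustration, not from the article):
# toy example: strip a reference number and collapse the extra spaces
sample = 'Apple acquired a startup [1] for  $50 million.'
sample = re.sub(r'\[[0-9]*\]', ' ', sample)
sample = re.sub(r'\s+', ' ', sample)
print(sample)  # Apple acquired a startup for $50 million.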
Next, we form a clean, lowercase text (without special characters, digits, and extra spaces) and split it into individual words, for the word score computation and the formation of the word histogram.
The reason for forming a clean text is so that the algorithm won’t treat, e.g., "understanding" (with quotation marks) and understanding as two different words.
# convert all uppercase characters into lowercase characters
clean_text = text.lower()
# replace non-word characters, digits, and runs of whitespace with a single space
regex_patterns = [r'\W', r'\d', r'\s+']
for regex in regex_patterns:
    clean_text = re.sub(regex, ' ', clean_text)
(4) Split (tokenize) text into sentences
We split the text into sentences using the NLTK sent_tokenize() method. We will evaluate the importance of each sentence and then decide whether to include it in our summary.
sentences = nltk.sent_tokenize(text)
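We can peek at the result to confirm the split worked; the first sentence below comes straight from the sample article:
# quick check: how many sentences, and what does the first one look like?
print(len(sentences))
print(sentences[0])
# In an attempt to scale up its AI portfolio, Apple has acquired Spain-based ...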
(5) Remove stop words
Stop words are English words that do not add much meaning to a sentence; they can be safely ignored without sacrificing the meaning of the sentence. We already downloaded the English stop words in the ‘(2) Import libraries’ section.
Here, we will get the list of stop words and store it in the stop_words variable.
# get stop words list
stop_words = nltk.corpus.stopwords.words('english')
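If you are curious what this list contains, you can peek at the first few entries (the exact contents depend on your NLTK data version):
# peek at a few of the 'english' stop words
print(stop_words[:8])
# ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves']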
(6) Build word histogram
Let’s evaluate the importance of each word based on how many times it appears in the entire text.
We will do so by (1) splitting the words in clean_text, (2) removing the stop words, and then (3) checking the frequency of each word as it appears in the text.
# create an empty dictionary to house the word count
word_count = {}
# loop through tokenized words, remove stop words and save word count to dictionary
for word in nltk.word_tokenize(clean_text):
    # skip stop words
    if word not in stop_words:
        # save or update the word count in the dictionary
        if word not in word_count.keys():
            word_count[word] = 1
        else:
            word_count[word] += 1
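As a side note, the same histogram can be built more compactly with Python’s built-in collections.Counter; this sketch is behaviorally equivalent to the loop above:
from collections import Counter
# count every non-stop-word in one pass
word_count = Counter(
    word for word in nltk.word_tokenize(clean_text)
    if word not in stop_words
)
Since Counter is a dict subclass, the rest of the code (e.g. word_count[word]) works unchanged.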
Let’s plot the word histogram and see the results.
plt.figure(figsize=(16,10))
plt.xticks(rotation = 90)
plt.bar(word_count.keys(), word_count.values())
plt.show()

Ahhh… it’s a bit difficult to read the plot. Let’s convert it to a horizontal bar plot and display only the top 20 words, with the helper function below.
# helper function for plotting the top words.
def plot_top_words(word_count_dict, show_top_n=20):
    word_count_table = pd.DataFrame.from_dict(word_count_dict, orient='index').rename(columns={0: 'score'})
    word_count_table.sort_values(by='score').tail(show_top_n).plot(kind='barh', figsize=(10,10))
    plt.show()
Let’s display the top 20 words.
plot_top_words(word_count, 20)

From the plot above, we can see that the words ‘ai’ and ‘apple’ appear at the top. This makes sense, because the article is about Apple acquiring an AI startup.
(7) Rank sentences based on scores
Now, we are going to rank the importance of each sentence based on sentence score. We will:
- skip sentences that have 30 or more words, recognizing that long sentences are not always meaningful**;
- then, add up the score (count) of each word that forms the sentence to get the sentence score.
Sentences that have high scores will form our top sentences. The top sentences will form our summary later.
**Note: In my experience, any word count between 25 and 30 should give you a good summary.
# create empty dictionary to house sentence score
sentence_score = {}
# loop through the tokenized sentences; only take sentences with fewer than 30 words, then add up the word scores to form the sentence score
for sentence in sentences:
    # check whether each word in the sentence is in the word_count dictionary
    for word in nltk.word_tokenize(sentence.lower()):
        if word in word_count.keys():
            # only take sentences that have fewer than 30 words
            if len(sentence.split(' ')) < 30:
                # add the word score to the sentence score
                if sentence not in sentence_score.keys():
                    sentence_score[sentence] = word_count[word]
                else:
                    sentence_score[sentence] += word_count[word]
We convert the sentence_score dictionary to a DataFrame and display the sentences and scores.
Note: a dictionary has no built-in way to sort and display its entries by value, so converting the data stored in the dictionary to a DataFrame makes it easy to sort and inspect the sentences by score.
df_sentence_score = pd.DataFrame.from_dict(sentence_score, orient = 'index').rename(columns={0: 'score'})
df_sentence_score.sort_values(by='score', ascending = False)

(8) Select top sentences for summary
We use the heap queue algorithm to select the top 3 sentences and store them in the best_sentences variable.
Usually 3–5 sentences will be enough. Depending on the length of your document, feel free to change the number of top sentences to be displayed.
In this case, I chose 3 because our text is a relatively short article.
# display the best 3 sentences for summary
best_sentences = heapq.nlargest(3, sentence_score, key=sentence_score.get)
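For comparison, a plain sort gives the same top sentences; heapq.nlargest() simply avoids sorting the entire list when you only need a few items:
# equivalent result via a full sort (less efficient for long texts)
best_sentences = sorted(sentence_score, key=sentence_score.get, reverse=True)[:3]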
Let’s display our summarized text using print() and a for loop.
print('SUMMARY')
print('------------------------')
# display the top sentences in their original order in the text
for sentence in sentences:
    if sentence in best_sentences:
        print(sentence)
Here is the link to my Github to get the Jupyter notebook for this.
Below is the complete Python script that you can use right away to summarize your text.
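This version simply stitches together the steps above into a single file (the file name and the top-3 setting follow the earlier examples; adjust them for your own document, and note the plotting steps are omitted):
# summarizer.py -- the full pipeline from the steps above
import re
import heapq
import nltk

nltk.download('stopwords')
nltk.download('punkt')

# (1) load the text file
with open('Apple_Acquires_AI_Startup.txt', 'r') as f:
    text = f.read()

# (2) remove reference numbers and collapse whitespace
text = re.sub(r'\[[0-9]*\]', ' ', text)
text = re.sub(r'\s+', ' ', text)

# (3) build a lowercase, letters-only copy for word counting
clean_text = text.lower()
for regex in [r'\W', r'\d', r'\s+']:
    clean_text = re.sub(regex, ' ', clean_text)

# (4) split the original text into sentences
sentences = nltk.sent_tokenize(text)

# (5) count word frequencies, skipping stop words
stop_words = nltk.corpus.stopwords.words('english')
word_count = {}
for word in nltk.word_tokenize(clean_text):
    if word not in stop_words:
        word_count[word] = word_count.get(word, 0) + 1

# (6) score each short sentence by summing its word counts
sentence_score = {}
for sentence in sentences:
    if len(sentence.split(' ')) < 30:
        for word in nltk.word_tokenize(sentence.lower()):
            if word in word_count:
                sentence_score[sentence] = sentence_score.get(sentence, 0) + word_count[word]

# (7) pick the top 3 sentences and print them in original order
best_sentences = heapq.nlargest(3, sentence_score, key=sentence_score.get)
print('SUMMARY')
print('------------------------')
for sentence in sentences:
    if sentence in best_sentences:
        print(sentence)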
Let’s see the algorithm in action!
Below is the original text from a news article entitled Apple Acquires AI Startup For $50 Million To Advance Its Apps (the original news article can be found here):
In an attempt to scale up its AI portfolio, Apple has acquired Spain-based AI video startup - Vilynx for approximately $50 million.
Reported by Bloomberg, the AI startup - Vilynx is headquartered in Barcelona, which is known to build software using computer vision to analyse a video's visual, text, and audio content with the goal of "understanding" what's in the video. This helps it categorising and tagging metadata to the videos, as well as generate automated video previews, and recommend related content to users, according to the company website.
Apple told the media that the company typically acquires smaller technology companies from time to time, and with the recent buy, the company could potentially use Vilynx's technology to help improve a variety of apps. According to the media, Siri, search, Photos, and other apps that rely on Apple are possible candidates as are Apple TV, Music, News, to name a few that are going to be revolutionised with Vilynx's technology.
With CEO Tim Cook's vision of the potential of augmented reality, the company could also make use of AI-based tools like Vilynx.
The purchase will also advance Apple's AI expertise, adding up to 50 engineers and data scientists joining from Vilynx, and the startup is going to become one of Apple's key AI research hubs in Europe, according to the news.
Apple has made significant progress in the space of Artificial Intelligence over the past few months, with this purchase of UK-based Spectral Edge last December, Seattle-based Xnor.ai for $200 million and Voysis and Inductiv to help it improve Siri. With its habit of quietly purchasing smaller companies, Apple is making a mark in the AI space. In 2018, CEO Tim Cook said in an interview that the company had bought 20 companies over six months, while only six were public knowledge.

… and the summarized text is as follows:
In an attempt to scale up its AI portfolio, Apple has acquired Spain-based AI video startup - Vilynx for approximately $50 million.
With CEO Tim Cook's vision of the potential of augmented reality, the company could also make use of AI-based tools like Vilynx.
With its habit of quietly purchasing smaller companies, Apple is making a mark in the AI space.
Conclusion… and one last tip
Congratulations!
You have created your personal text summarizer in Python. The summary, I should hope, looks pretty decent.
It is important to note that we used word frequency in a document to rank the sentences. The advantage of using this method is that it does not require any prior training and can work on any piece of text. As another tip, you can further tweak the summarizer to your liking based on:
(1) The number of top sentences: The simple rule-of-thumb here is that the length of a summary should not be more than ¼ of the original text – it can be one sentence, one paragraph or multiple paragraphs depending on the length of the original text and your purpose of getting the summary. If the text you want to summarize is long, then you can increase the number of top sentences; or
(2) Sentence length: On average, sentence length today ranges between 15 and 20 words. Therefore, limiting your summarizer to sentences shorter than 25–30 words is enough; however, feel free to increase or decrease the word count.
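Both knobs map naturally onto function parameters. Here is a minimal sketch of how you might expose them, assuming the sentences list and word_count dictionary from the steps above are already built (the function name and defaults are my own, not from the notebook):
def top_sentences(sentences, word_count, top_n=3, max_words=30):
    # score sentences shorter than max_words by summing their word counts
    scores = {}
    for sentence in sentences:
        if len(sentence.split(' ')) < max_words:
            for word in nltk.word_tokenize(sentence.lower()):
                if word in word_count:
                    scores[sentence] = scores.get(sentence, 0) + word_count[word]
    best = heapq.nlargest(top_n, scores, key=scores.get)
    # keep the sentences in their original order in the text
    return [s for s in sentences if s in best]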
Thank you for reading this story. Follow me on Medium for more of my sharing on data science and machine learning.