A Simple Text Summarizer written in Rust

Published in

Towards Data Science

8 min readNov 24, 2020

Starfield Library in Seoul (*Image by author*)

Motivation

I read a lot of online news sites to help me catch up with the latest stories. From time to time, I found that some news stories from different sources are simply rephrasing each other. This happens sometimes with technical articles too, but I digress.

This love of reading news stories lead me into the idea of text summarization. Wouldn’t it be great if I can summarize across multiple news sources and let me gather all the information?

I have another goal in mind. I love computer languages. Rust has been on my radar for a long time but I have never come up with a good excuse to learn to use it. I figured that a data process program like text summarization could be a good fit to learn about the gory details of memory ownership.

Text Summarization in Short

There are basically two techniques when it comes to text summarization: Abstractive and Extractive.

Abstraction-based summarization takes words from the original article based on semantic meaning. It can sometimes pick words from another source if the words fit the meaning. The idea is not unlike how a human would have summarize a piece of text. As you can imagine, this is not an easy problem and would almost require some form of machine understanding.

Extraction-based summarization takes a different approach. Instead of trying to understand the underlying text. It uses some mathematical formulas to rank each sentence from the article and output only sentences that are above certain score. This way, the meaning of the original text is mostly preserved without coding the machine to understand.

In this article, we will use an extraction based summarization technique. It’s perfect for individual developer, like me (or you), to experiment with and to appreciate the problem.

Let’s Start Programming!

Programming is fun. Learning a new language along the way is extra fun. It’s no secret that Rust can be intimidating. That’s okay. It only means that more people are there to help because we’ve all been there.

I don’t claim to be a Rust expert. In fact, this is my first meaningful Rust program. So, if you find anything unorthodox, please let me know! With that said, let’s start programming.

Breaking Down the Problem

As with any problem in the world, it is always a good idea to break down the problem into smaller ones and conquer each one separately (and savour successes along the way). With text summarization, I can think of the followings subproblems:

Turning a paragraph into sentences and words
Calculate sentence similarity
Sentence ranking
Putting it all together

Turning a paragraph into sentences and words

Now, this seems like an easy problem to solve… until you realize that maybe you want to also consider languages other than English. When it comes to non-English language, you really have only one option — Use a Unicode library. Luckily Rust has exactly that. It’s called unicode-segmentation . Let’s go ahead and add that into our Cargo.toml file:

Once the library is installed, we can use the following code fragment to get the sentences and words out of a paragraph.

In the code above, you should notice a few Rust features that stand out:

Memory Ownership

Did you notice the little & symbols? In Rust, memory is managed by a set of ownership rules. At any point in time, each value (a piece of memory) can only be owned by one variable. If you need to share the value with other parts of the program, you have to let them borrow it. The syntax for borrowing is the & sign. For more information, refer to the Rust Documentation.

Mutability

The mut keyword means that the sentences_and_words vector (yes, vec! is a macro to create a Vector object) is mutable. I like languages where the default for a variable is immutable. Mutability should be an opt-in, not the default.

Closure

The second feature is the use of closure in|&sentence| {...} . Finally a language that doesn’t use => or -> for closure. I do hope that Rust would adopt either -> or => to reduce the transition efforts but hey I am not a language designer.

String vs str

Do you also notice that we are using str (or specifically &str ) instead of the more popular String in other languages? In fact, Rust also have a String type. The comparison between str and String is similar to C++’s char* vs std::string . It takes some getting used to for developers from Java or JavaScript where a single String type is used. Luckily, converting between the two types is relatively pain free (but not automatic, unfortunately).

Trait

Oh, one more thing. Did you notice that once we have installed the library, the Rust string suddenly got a couple of new methods, e.g. text.unicode_sentences() and sentence.unicode_words() ? This is a really powerful feature called trait. Considering that Rust is a zero cost abstraction system. This is quite an achievement.

Calculate sentence similarity

Now that we have the basic data structure, we can move on to the next problem: Calculate sentence similarity. Before we dive into the code, let’s introduce a simple concept: cosine similarity.

Cosine Similarity

Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It measures the cosine of the angle between two vectors projected in a multi-dimensional space. The smaller the angle, higher the cosine similarity.

Now, what does it mean to text summarization? What is in this vector? Given that we are trying to find how similar are two sentences, it probably makes sense to store in the vector the frequency of each word as it appears in each sentence.

Sentence vector

If you are OK with that, let’s go ahead and build the sentence vector. Instead of going through every line of code, let’s focus on the following function:

This function takes 3 arguments:

A sentence (which is a list (or slice in Rust) of words &str )
A set of words in lower case all_words_lc . This is a set of words that are collected from the 2 sentences we are going to compare.
A list of stop words (words that are not important, in English, they are like ‘a’, ‘this’, ‘is’, etc.).

Inside the function, it calculates the frequency of occurrence of a word inside the sentence. The resulting vector has the same size as the all_words_lc set. The number in each element represents the number of times the word at that index in the all_words_lc set appears in the sentence.

Sentence similarity

Using the sentence vectors, we can finally calculate the sentence similarity:

We didn’t talk about the cosine_distance function yet. Don’t worry. It’s pure arithmetics and isn’t too hard to implement it yourself. (See Appendix)

Sentence ranking

Next, it is the most important part of the program: Sentence ranking.

Using the outputs of the sentence_similarity function, we can build a similarity matrix for each sentence pair in a paragraph. This matrix is then fed into a ranking algorithm to provide the score for each sentence, which is the basis of our summarization output.

PageRank in Rust

The PageRank algorithm is popularized by Google. It is used to rank the importance of a web page based on its outgoing links and its incoming links.

We are going to use the same algorithm to rank our sentences. Instead of web pages, we have sentences. Instead of the probability of a user moving from one web page to the other, we use the similarity of one sentence to the other.

Yes, we are taking a huge leap of faith here and assume that the importance of a sentence for summarization correlates to the similarity score of this sentence in relation to other sentences. Interestingly, this seems to work reasonably well but feel free to customize the algorithm to suit your needs and experiment with it.

There are ready made PageRank algorithm in Python in the networkx library. I don’t know if there is one for Rust. In any case, I wrote my own routine by following Wikipedia. It’s actually quite enjoyable.

Notice that in the above code, I also used the ndarray library in Rust. It provides important linear algebra model such as 2D Array (Array2) and arithmetics with overloaded operators (e.g. the - and * in the code above are overloaded to work with these new types).

Putting it all together

Finally, we can put everything together. The summarize function below makes use of our code above to create a summary of the input text. The arguments are the input text, the list of stop words, and the number of resulting sentences.

Output

Let’s give our summarizer a test, given the following text:

As of Sunday, there were more than 58.2 million reported cases of COVID-19 worldwide, with more than 37.2 million of those cases listed as recovered, according to a COVID-19 tracking tool maintained by Johns Hopkins University. The global death toll stood at more than 1.3 million. In Asia, the daily tally of reported cases in Japan hit a record for the fourth day in a row, with 2,508 people confirmed infected, the Health Ministry said Sunday. A flurry of criticism has erupted, from opposition legislators and the public, slamming the government as having acted too slowly in halting its \"GoTo\" campaign, which encouraged travel and dining out with discounts. In Europe, French authorities ordered the culling of all minks at a farm after analysis showed a mutated version of the coronavirus was circulating among the animals. The move follows virus developments in mink farms in Denmark and other countries, including the Netherlands, Sweden and Greece. In the Americas, Chile says it will open its main border crossing and principal airport to foreign visitors on Monday after an eight-month pandemic shutdown. Arrivals will have to present evidence of a recent negative test for the novel coronavirus, as well as health insurance. They'll also have to report their whereabouts and health status for a two-week watch period. Those coming from high-risk countries will have to quarantine for 14 days. In Africa, Sudan's minister of cabinet affairs on Sunday tested positive for the coronavirus, the prime minister's office said, the latest in a string of senior officials to be infected as the country shows an increase of confirmed cases of COVID-19. Over the past month, acting ministers of finance and health, the central bank governor and two associates to Prime Minister Abdalla Hamdok have tested positive.

The output with 5 sentences are:

The global death toll stood at more than 1.3 million. A flurry of criticism has erupted, from opposition legislators and the public, slamming the government as having acted too slowly in halting its "GoTo" campaign, which encouraged travel and dining out with discounts. In Europe, French authorities ordered the culling of all minks at a farm after analysis showed a mutated version of the coronavirus was circulating among the animals. The move follows virus developments in mink farms in Denmark and other countries, including the Netherlands, Sweden and Greece. In the Americas, Chile says it will open its main border crossing and principal airport to foreign visitors on Monday after an eight-month pandemic shutdown.

Although there are obvious shortcomings in the output, the resulting text looks quite promising.

Conclusion

We talked about text summarization and we walked through an extraction based text summarizer written in Rust. This article neither goes deep into text summarization nor programming in Rust. It is my hope that you will do your own research into these topics if they pique your interest.

Appendix

Wait a minute… You want to see the program in its entirety? Sure, here you go: