Building a news aggregator from scratch: news filtering, classification, grouping in threads and ranking

Published in

Towards Data Science

14 min readJan 27, 2020

Fake news headlines from https://www.designboom.com/design/the-fake-newsstand-tbwa-chiat-day-columbia-journalism-review-11-05-2018/

The idea behind this post is to show a reasonably simple approach one can implement in a couple of weeks to solve a real-world problem of creating a news aggregator like Google news or Yandex news showing the top news threads out of millions of news scraped all over the web.

Problem statement and restrictions

So this is another post about NLP where I shall describe a few algorithms for texts filtering, classification, grouping and ranking developed during the Telegram Data Clustering contest. The motivation behind this post is to demonstrate that you can build a decent texts processing system and run it on your laptop without even a GPU.

The contest included five tasks — detecting news languages, filtering news from other texts (such as encyclopedic articles, some random entertainment posts, blogposts, etc), news classification in one of seven categories (Society, Economy, Sports, Science, Technologies, Entertainment and Other), grouping news in threads and ranking these threads by importance. The full contest rules are avaliable here.

Selection of the particular algorithms and instruments was largely dependant on the the contest rules putting some constraints on the implementation — each task had to be executed within 60 sec per 1000 articles on a Debian machine with 8 CPUs and 16 Gb of RAM, there should be no external services or APIs used, the algorithm even sould not assume there is an internet connection (to download some pretrained models, for example). Hence no SOTA Transformer models like BERT, ALBERT or GPT-2 should have been involved.

The high-level solution architecture described in this article looks the following way:

Text preprocessing and vectorization
Texts classification with a custom Deep Neural Network (DNN) with a LSTM and an Attention layers
Grouping texts in threads with the nearest neighbors search algorithm controlled by the Levenshtein distance within each group.
Ranking news threads by importance

Note: all the code and ideas described in this post have been developed during the contest period (2 weeks), though some fine-tuning of the grouping algorithm as well as code styling have been performed afterwards during the New Year holidays.

Raw data parsing

Contest participants have been provided with hundreds of thousands of publications saved as html files containing news title, text, sometimes an image, a publication date, an author and media source. I used the Beatifulsoup library as a handy html parser to extract all the needed data into a pandas dataframe.

Language detection

This part was pretty straightforward — I used a fast langdetect implementation as language detector and feeded it with titles by default to speed up detection — language detection on texts with 10x larger average length is slower.

Language detection block

This step took 12 sec per 1000 files, the detection accuracy is over 99%. The resulting data looked the following way:

Sample of news (titles only) with detected language

Text preprocessing logic

Text vectorisation logic is one of the core algorithm decisions contest participants had to come up with. We had a large and variative enough corpus of around 1M articles to use pretrained word embeddings instead of the basic TF-IDF approach. But first we had to perform common tokenization and stemming procedures. I used stopwords list and Porter Stemmer from the nltk library.

Tokenization function

After this step each text was represented by a list of word tokens.

The next step was the replacement of each token in a list with a vector from one of the pretrained language models — Glove or fasttext.

The result of this operation was that each text was now represented by a list of semantically rich word vectors. I have put a restriction on the maximum length of the list — 50 words, the headline was concatenated with the beginning of the article’s body.

In order to equalise the lenght of all vectors the padding operation has been performed. We’ve now got the feature tensor of our text corpus, each row represents a sequence of pretrained word vectors of the chosen dimension.

Vectorization function using pretrained word embeddings

Deep neural network architecture

Since we had quite enough data for a neural network training I decided to use a deep neural network (DNN) classifier in order to tell news from not news and to define news topics.

Taking into account it was a algorithm contest an obvious choice maximising the solution’s accuracy would be a SOTA NLP model architecture, namely some kind of large Transformer like BERT, but as I have mentioned before, these kinds of models would be too large and too slow to pass the restrictions imposed on hardware and on text processing speed. The other drawback is that such model’s training would take a few days leaving me no time for model’s fine-tuning.

I had to come up with a simpler architecture so I implemented a lightweight neural net with an RNN (LSTM) layer taking the sequence of words embeddings representing texts and an Attention layer on top of it. The output of the network should be the class probabilities (binary classification for news filtering step and multiclass for news categories detection), so the upper part of the network was comprised by a set of fully connected layers.

News topic detection — multiclass classification

I shall be explaining the selection of particular DNN architecture using the multiclass classifier (detecting news categories) as an example because its objective is more challenging and its performance is independent of the threshold value used in the binary classifier to draw the margin between positive and negative classes.

The upper part of our neural net was made up of three Dense layers, the output layer had 7 units corresponding to the number of classes with the softmax activation function and categorical crossentropy as a loss function.

To train it I used the News_Category_Dataset and applied a mapping logic in order to fit the initial 31 news categories into one of 7 categories: Society, Economy, Sports, Technology, Entertainment, Science and Other to comply with the contest objectives.

DNN model architecture — multiclass classifier

I would like to share the logic behind the NN model’s architecture selection and hyperparameters fine-tuning.

The Dropout layers increase the generalisation ability of our model and prevent it from overfitting (Dropout layer randomly zeroes out the given percentage of weights at each update of the training phase). Model’s overfitting without the Dropout layers is clearly demonstrated by the learning curves — the accuracy on the training set grows up to 95% while the accuracy on the validation dataset barely grows during training.
A BatchNormalisation layer applied on top of the LSTM-Attention construction normalized the activations after the dropout at each batch keeping the activation mean close to 0 and the activation standard deviation close to 1. It tends to increase test accuracy on the larger batch sizes and to decrease it on the smaller ones.
Regularization applied to Dense layers penalizes the extreme values of layer parameters.
Batch size selection can also affect a model’s performance. The larger the size is the faster your model trains and the more precisely the gradient vector is calculated on each step. This results in noise reduction which makes the model more prone to converging to a local minimum so the batch size selection is usually a tradeoff between speed, memory consumption and model’s performance. Values between 32 and 256 are the common choice, in our case the model showed top accuracy with batch size 64. Increasing batch size up to 512 or 1024 significantly decreases model’s accuracy (by 2% and 4% accordingly)

Model’s performance depending on architecture and hyperparameters

Attention layer explained

It is worth saying a few words about the attention mechanism used in our model. The theory behind the applied approach is described in the arXiv paper by Raffel et al. This is a simplified model of attention for feed-forward neural networks addressing the RNN information flow problem for long sequences. Particularly, the attention layer provides an optimal transition to the fully connected layer, creating a context vector (an embedding for the sequence of input word vectors) as the weighted average of the hidden states of the input sequence with the weights representing the importance of the elements of the sequence. The explicit notation looks the following way:

where T is the length of sequence and a is a learnable function, namely a single hidden layer feed-forward network with the tanh activation function, jointly trained with the global model.

Attention layer class, implementation by qqgeogor on Kaggle

The drawback of this model is that it does not take the order within an input sequence into account but for our task of news headlines vectorization this is not as significant as for some sequence-to-sequence problems such as phrase translation.

There is a number of publications describing the more complicated attention mechanisms designed for seqence-to-sequence problems, a good start would be the Neural Machine Translation with Attention tutorial by Google Research, I would also recommend to check the Attention? Attention! post by Lilian Weng with an overview of attention mechanisms and their evolution and make sure you have read the original Bahdanau et al, 2015 paper. Among the more up-to-date LSTM + Attention papers of the post-Transformer era I would recommend reading recent Single-headed attention RNN by Stephen Merity, proving that the huge Transformers are not the only possible approach.

P.S. Do not forget to add a saving config in the Attention class as it will enable seamless loading of the saved model with the custom layer.

An alternative simpler transition to the Dense layers would be the Flatten layer used for tensor reshaping.

Other ideas & results

In order to increase the classifier quality one could take advantage of the other information provided like the publication source and the author and add them to the network as one-hot encoded features, but the contest rules explicitly said that the algorithms would be evaluated on different datasets with other specific source list.

To speed up the training phase I ran the code in a colab notebook with TeslaK80 GPU runtime.

News sample with both news categories and binary news / not news scores

The classification step takes 13 sec per 1000 texts.

News / not news — binary classification

To filter only news from the given dataset of publications we had to implement a binary classifier. Actually this step has been performed before news categories classification. The binary calssifier had pretty much similar architecture with the obvious difference in the output layer — the last layer had 1 unit with the sigmoid activation function and binary crossentropy as a loss function.

The data provided by contest organizers had over 90% of news publications in it so I decided to filter 100% news by source (The Guardian, Bloomberg, CNN, etc) and then to use an english Wikipedia atricles (the good ones, not promotional) to represent the NOT NEWS class.

The trained binary classifier model outputs the probability of the object being a positive (1) class, in order to interpret these probabilities we needed to impose a treshold wich would tell news from not news. This selection has been performed manually after a long careful look on the marginal cases has been taken.

The classification step takes 14 sec per 1000 texts. The classification results can be observed in the previous table.

Grouping news in threads

The news grouping task was solved by building a Ball Tree on the text embedding vectors and then searching this tree with an adaptive search radius controlled by the normalized Levenshtein distance within each group of neighbors.

Text vectorization

In order to group news first we should introduce some kind on metrics on the dataset. Since we used DNNs for texts processing and classification the most obvious way to get a text’s vector would be taking the embedding obtained by passing the text through the pretrained multiclass DNN without the last layer. I used the 128 unit dense layer as the output to create texts embedding vectors.

As I found later this approach was not the optimal one — a significantly better performance in news grouping was demonstrated when I switched to a simpler TF-IDF vectorization. This is quite explainable — while in news category classification we needed to generalize the whole news semantic meaning regardless of particular politician’s names, gadget, celebrities, and even particular circumstances, the sequence of pretrained word vectors fed to the DNN was a suitable choice for the task. In news grouping we are dealing with a different setting — each thread should be describing a very particular event — something has happened to somebody, the characters names, even verbs and adverbial modifiers should be the same across the whole news thread, so the classic TF-IDF (calculating the vector of n-gram frequencies in each text normalized by the n-gram frequencies in the whole corpus) is the approach loosing less valuable information. To fight the TF-IDF output matrix sparsity I have applied an SVD decomposition compressing each text’s vector to the selected dimension (1000 in our case).

TF-IDF texts vectorization with SVD on tom to create dense embeddings

A fast nearest neighbor search

The next step is actually texts grouping. The basic unsupervised approach for this task would be clustering, but since we are unlikely to get a perfect clustering from the first shot our algorithm requires an iterative approach and the clustering of the whole dataset is expensive. Besides that we do not have a proper metrics of grouping quality to run an automated hyperparameters tuning procedure, neither we should tune them by hands since it is highly possible that we’ll need a different set of hyperparameters for different news datasets.

I decided that a more precise and flexible approach would be the construction of a fast search index on texts embedding vectors and then querying this index. The efficient approach for fast high-dimensional nearest-neighbor search is to build a binary space partitioning tree, namely a Ball Tree or a KD Tree (a k-dimensional binary search tree is a space-partitioning data structure for organizing points in a k-dimensional space) on the text embeddings without explicitly calculating all the distances within a dataset. These algorithms scale with O(N log(N)) complexity for tree construction. My particular choice was a scikit-learn Ball Tree implementation as it has lower query time O(D log(N)) than a KD Tree in high dimensions and we had to optimize query times as far as we intend to perform iterative searches with various search radius for each element in our dataset. For more details on the differences between KDtree and BallTree data structures and the performance benchmarking please refer to this great post by Jake VanderPlasof, scikit-learn contributor.

Initialising a tree structure

The grouping algorithm

Ok, now we’ve finally have got an index of all papers and can group them in threads by distance between different samples.

I did not have enough time to reflect on some more sophisticated approaches and just created a cycle over all papers in each news category (we can take advantage of the pervious algorithm step to partition our dataset by news categories) checking for their neighbors in radius r_start, r_start was selected empirically in such a way that it took a little more papers than the actual thread contained. Then I calculated an empirical functional variative_criterium_norm — the normalized Levenstein distance between the texts within the group — and decreased the search radius r_curr iteratively until this functional became less than the empirically found constraint or the search radius size hit r_min constraint. If there were no news in the given area, I increased the search radius until we found some neighbors and then switched to the radius decreasing branch. The idea behind this iterative search was that similar news have partly similar headlines (some entities like subject, object and sometimes verbs are invariant over the news thread). The other details of the developed algoritms are easier to see from the code snippet.

News grouping logic

There are 7 hyperparameters controlling the algorithm: r_min, r_max — these control the size of the query area, r_start, r_step — controlling the query area dynamics and vc_min, vc_max, delta_max, controlling the value of normalized Levenshtein distance within a group—this defines the variance of news headlines in a group. These hyperparameters should be tuned after you have chosen the vectorization parameters like the final text embedding size (n_components in SVD) and the range of n_grams used (I used 1-3).

It is not really obvious how we can estimate a “proper” news grouping, so I did not spend too much time playing with hyperparameters after I got a reasonable grouping result — my intention was to suggest a working approach to solve the problem with the given constraints. Clearly there is some space for improvement like checking if we can merge some of the groups and filtering out some occasional noise. Actually in order to get a reasonable number and density of groups the grouping hyperparameters should be fine tuned to each dataset, so there could be an outer cycle implementing a kind of Randomized search on them. One of the ways to get an idea of the particular grouping we got and to estimate its quality is to check the groups size distribution histogram.

Groups size distribution histogram obtained on a test dataset

In fact, the described approach could be regarded as a relative of the DBSCAN clustering.

Execution time varies depending on the hyperparameters chosen for the dataset and the structure of data, the typical values are from 8.5 sec / 1000 papers to 25 sec / 1000 papers including the vectorization time defined by the expensive SVD operation.

Sorry for the long listing, here are the full results for the SOCIETY category ranked by groups size

Grouping results — top 3 groups (ranked by size)

Threads ranking

Actually there are lots of features to use when it comes to threads ranking, the problem is that this ranking may be quite subjective depending on the region and the interests of a particular user. We may agree that there are global and local news, some speculations/opinions on ploitical issues and mere facts, but some new information about Trump’s impeachment hearings may outweigh the global but distant tragedies like forest fires in Australia — the importance of news is a perceived, not an objective characteristic.

In the view of the foregoing I shall just describe the approach to handle this task: the most obvious features for threads ranking would be the number of publications in a thread, the ranking of the sources (in the Society category for example Bloomberg and Financial Times are the most respectable sources) and the semantic meaning of the thread like ‘international politics’, ‘global economy’, ‘war’, ‘global accident’, ‘local accident/crime’ etc. This approach presupposes manual introduction and ranking of these semantic categories and manual ranking of the news sources. If we do not want to create features manually, another possible approach would be to manually rank some selection of threads, to calculate their mean semantic vectors (average any vectorization we have used) and then to use these vectors as the predictors to train out ranking model (a kind of regression model) based on the scores we have assigned to threads importance.

In fact, a quite reasonable ranking can be obtained just by sorting the threads by size. Given this fact and the fact that I have not implemented the ranking part during the contest, I shall leave the enthusiastic ones free to try out any of the approaches discussed above.

One important issue is that the contest presupposed a static ranking — we are given all news for the same date, but in reality news are time-dependent and loose their novelty as time passes, this relevance decay could be fairly enough described with the exp(-t) function.

Thank you for reading this long post, I hope you have found some ideas to use in your NLP projects.