News Aggregator in 2 weeks

A simple and fast approach for news categorization and clustering

Ilya Gusev
Towards Data Science


Image by Pezibear via Pixabay

Data Clustering Contest was a two-week data science competition held by Telegram in November. The Telegram team wanted participants to build a simple yet effective news aggregator that consolidates thousands of articles from various publishers and websites into a single page, which shows the latest news and top stories, exactly like Google News or Bing News does.

A public testing screen for a submission. Image by Author

Although the competition finished on Dec 2, 2019, the organizers announced the results only a week ago. Our team, “Mindful Squirrel”, placed 3rd on the final leaderboard. In this post, we give a brief overview of our approach and discuss the main challenges we encountered. The full solution is available on GitHub, and all training scripts are shared in Colab.

Task overview

The contest included five subtasks which were expected to be run sequentially on the raw data:

  1. Language identification (only English and Russian texts were needed)
  2. News filtering from other texts
  3. Categorization into 7 topics
  4. Clustering articles related to the same event into threads
  5. Ranking threads within each topic
Tasks of the contest. Image by Author, created with draw.io

The language detection step was clearly specified, while the other tasks raised a number of questions: no strict definition of “news” was given, the topic criteria were not well-defined, and the clustering granularity and ranking criteria were also missing.

The contest rules required the final submission to run locally, with no network access, on a Debian machine with 8 cores and 16 GB of RAM. Moreover, the app had to process any batch of 1,000 articles within 60 seconds. The first version of the restrictions also included a penalty for solutions taking more than 200 MB of disk space.

Solution

These restrictions heavily constrained the range of applicable tools and algorithms. Our first idea was to adopt SOTA models such as ELMo or BERT, but they are too large to meet the requirements. We therefore relied extensively on fastText, a library for learning word embeddings and training text classification models created by Facebook AI Research in 2016. The entire submission was written in C++, and the only complex model was trained in Python with Keras.

Language detection

The solution to this task was straightforward: we used the pretrained language identification model provided by the fastText team.

Filtering & topic classification

We merged these two tasks into a single multi-class classification problem. Since any supervised classifier requires some amount of training data, we used Yandex.Toloka, a Russian crowdsourcing platform similar to Amazon Mechanical Turk, to label a fraction of the input data. We spent $60 on all the labeling for both Russian and English. For English texts, we used machine translation into Russian to make the job easier for Russian-speaking workers.

We also extended the dataset with publicly available data: the Lenta.ru news dataset for Russian and the BBC and HuffPost datasets for English.

Categorization datasets. Image by Author, created with draw.io

With all the labeled data, we trained a simple fastText classifier. The default fastText classification algorithm learns word embeddings on the provided data and averages them to predict a label. Using rich pretrained word representations usually boosts classifier performance when labeled data is scarce, so we trained such unsupervised vectors on the previous datasets extended with the RIA and the “All the news” corpora.
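The core of the fastText classification scheme can be sketched in a few lines: average the word embeddings of a document, then apply a linear layer with a softmax. The vocabulary, embedding matrix, and weights below are made up for illustration; the real model is trained, not random.

```python
import numpy as np

# Hypothetical toy setup: a tiny vocabulary with 4-dimensional word
# embeddings and a linear layer mapping the averaged embedding to
# 3 topic classes. All weights here are random, purely for illustration.
rng = np.random.default_rng(0)
vocab = {"team": 0, "wins": 1, "match": 2, "stocks": 3, "rally": 4}
embeddings = rng.normal(size=(len(vocab), 4))  # one vector per word
W = rng.normal(size=(4, 3))                    # linear classifier weights
b = np.zeros(3)

def classify(tokens):
    """Average the word vectors, then apply a linear layer + softmax."""
    ids = [vocab[t] for t in tokens if t in vocab]
    doc_vec = embeddings[ids].mean(axis=0)     # fastText-style averaging
    logits = doc_vec @ W + b
    probs = np.exp(logits - logits.max())      # numerically stable softmax
    return probs / probs.sum()

probs = classify(["team", "wins", "match"])
```

This averaging is exactly why pretrained vectors help: the classifier itself is linear, so nearly all of the model's capacity lives in the quality of the word embeddings.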

News clustering

News clustering requires two essential ingredients: rich text embeddings and a robust clustering algorithm.

Simple averaging of word vectors did not produce good results, forcing us to try something more sophisticated. We trained the final text embeddings using a Siamese neural network with a dot product, a dense layer, and a sigmoid on top. Each branch consisted of a single dense layer over the concatenation of average-, min-, and max-pooling of the word vectors. We split every text into two halves and trained the network to predict whether the two halves belong to the same article. Negatives were mined from other texts of the same publisher that were close in time to the positive examples. As an objective, we used log loss, though in hindsight a triplet loss would probably have been a better choice.

Siamese network for building text embeddings. Image by Author, created with draw.io
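The forward pass of one branch, and the comparison of two branches, can be sketched in plain numpy. The weights, the tanh nonlinearity, and the exact stacking order are assumptions for illustration; the real model was trained with Keras and had a single trainable matrix.

```python
import numpy as np

rng = np.random.default_rng(42)
dim, hidden = 8, 16
# Hypothetical dense-layer weights; in the real model this was the only
# trainable parameter, learned from same-article / different-article pairs.
W = rng.normal(size=(3 * dim, hidden)) * 0.1

def branch(word_vectors):
    """One Siamese branch: concat of avg-, min- and max-pooling, then a dense layer."""
    pooled = np.concatenate([
        word_vectors.mean(axis=0),
        word_vectors.min(axis=0),
        word_vectors.max(axis=0),
    ])
    return np.tanh(pooled @ W)  # assumed nonlinearity

def same_article_prob(half_a, half_b):
    """Dot product of the two branch embeddings, squashed by a sigmoid."""
    score = branch(half_a) @ branch(half_b)
    return 1.0 / (1.0 + np.exp(-score))

# Two "halves" of an article, as matrices of word vectors (toy data).
p = same_article_prob(rng.normal(size=(5, dim)), rng.normal(size=(7, dim)))
```

At inference time only `branch` is needed: it maps a document to a fixed-size vector, which is exactly what the clustering step consumes.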

The model was still as simple as possible and had only one trainable matrix. We didn’t use any neural network frameworks in the actual C++ app, only the Eigen library for matrix multiplications.

For clustering, we used SLINK, an agglomerative single-linkage algorithm. Its most important drawback is transitive (chained) linkage: the first point may be linked to the second and the second to the third, so the first and third end up in the same cluster even though they may be far from each other.

Agglomerative clustering. Image by Author, created with draw.io
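The chaining effect is easy to reproduce with a toy single-linkage implementation (a naive union-find sketch, not the actual SLINK code): with a distance threshold of 1.5, point A links to B and B links to C, so A and C land in the same cluster even though they are 2.0 apart.

```python
# Three 1-D "documents"; A and C are farther apart than the threshold.
points = {"A": 0.0, "B": 1.0, "C": 2.0}

def single_linkage(points, threshold):
    """Naive single-linkage clustering: union any pair closer than the threshold."""
    parent = {k: k for k in points}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    names = list(points)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(points[a] - points[b]) <= threshold:
                parent[find(a)] = find(b)  # merge the two clusters
    return {k: find(k) for k in points}

clusters = single_linkage(points, threshold=1.5)
```

Despite |A − C| = 2.0 > 1.5, the chain A–B–C pulls all three points into one cluster, which is exactly the drawback described above.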

SLINK’s O(n²) time complexity prevented us from clustering all the documents at once. To overcome this, we assumed that documents distant in time should not belong to the same cluster. This assumption allowed us to split the whole timeline into chunks of 10,000 documents with an overlap of 2,000 items, apply clustering within each chunk, and finally link consecutive chunks through the overlapping documents.

Clustering by chunks. Image by Author, created with draw.io
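The window arithmetic for this scheme is simple; here is a sketch with sizes scaled down from the article's 10,000 / 2,000 (the clustering inside each window is omitted):

```python
def make_chunks(n_docs, chunk_size=10, overlap=2):
    """Yield (start, end) index ranges over a time-sorted document list,
    where each chunk shares `overlap` trailing documents with the next one.
    Clusters found in two consecutive chunks can then be merged whenever
    they share an overlapping document."""
    chunks, start = [], 0
    while start < n_docs:
        end = min(start + chunk_size, n_docs)
        chunks.append((start, end))
        if end == n_docs:
            break
        start = end - overlap  # step back to create the overlap
    return chunks

chunks = make_chunks(25, chunk_size=10, overlap=2)
```

With n documents and chunk size k, each window costs O(k²) instead of O(n²), so the total work grows linearly in the number of chunks.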

Thread naming & ranking

We based title selection within a thread on the product of three components: freshness, similarity to the other documents in the thread, and the weight of the document’s source.

We defined freshness as a scaled sigmoid over the difference between the document date and the date of the freshest document in the thread.
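One possible form of such a score (the scaling convention and the `scale` parameter here are assumptions, not the contest values):

```python
import math

def freshness(doc_time, newest_time, scale=3600.0):
    """Sigmoid over the document's age relative to the newest document
    in its thread, scaled so the newest document scores exactly 1.0.
    The `scale` constant controlling the decay rate is made up here."""
    age = (newest_time - doc_time) / scale  # age >= 0 within a thread
    return 2.0 / (1.0 + math.exp(age))      # 1.0 for the newest doc, -> 0 for old ones

f_new = freshness(1000.0, 1000.0)               # the freshest document
f_old = freshness(0.0, 1000.0, scale=100.0)     # a much older document
```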

The inner similarity between a thread’s documents was calculated using the same text vectors as in clustering: for each document in every thread, we computed the mean distance to the other documents in the thread.

Running the PageRank algorithm over the links found inside the documents allowed us to estimate the importance of each source.
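A minimal PageRank by power iteration looks like this; the publisher graph below is invented for illustration, and the damping factor 0.85 is the common default rather than the contest setting:

```python
# Toy link graph between publishers: an edge u -> v means u links to v.
links = {
    "agency": [],                  # a widely cited source (links to no one)
    "siteA": ["agency", "siteB"],
    "siteB": ["agency"],
    "siteC": ["agency", "siteA"],
}

def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over an adjacency-list graph."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - damping) / n for u in nodes}
        for u, outs in links.items():
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:  # dangling node: spread its rank uniformly
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank

weights = pagerank(links)  # "agency" ends up with the highest weight
```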

Freshness, source weight, and cluster size formed the basis of thread ranking. Because of possible outliers in publish dates, the 99th percentile of all dates was chosen as the “current moment” for the whole dataset, and the age of each thread was calculated from it.
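The outlier-robust “current moment” and one possible way to combine the three signals can be sketched as follows; the exact combination formula and the decay scale are illustrative assumptions, not the contest code:

```python
import math

def percentile(values, q):
    """Naive q-th percentile by sorting (no interpolation)."""
    vals = sorted(values)
    idx = min(int(q / 100.0 * (len(vals) - 1)), len(vals) - 1)
    return vals[idx]

def rank_threads(threads, all_dates):
    # The 99th percentile guards against a few future-dated outliers
    # dominating the "current moment".
    now = percentile(all_dates, 99)

    def score(t):
        age = max(now - t["date"], 0.0)
        fresh = math.exp(-age / 3600.0)  # assumed decay scale
        return fresh * t["source_weight"] * math.log1p(t["size"])

    return sorted(threads, key=score, reverse=True)

threads = [
    {"date": 95.0, "source_weight": 1.0, "size": 3},
    {"date": 10.0, "source_weight": 1.0, "size": 3},
]
# 100 normal dates plus one future-dated outlier.
all_dates = [float(d) for d in range(100)] + [10_000.0]
ranked = rank_threads(threads, all_dates)  # the fresher thread ranks first
```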

Conclusion

The most significant part of the challenge was building a system that meets the strict hardware limitations. These constraints, as well as the short contest duration, forced us to rely on simple machine learning algorithms (such as fastText) instead of sophisticated SOTA models. We did not have enough time to conduct all the experiments we wanted or to polish the final solution, and there is still room for improvement, though we have implemented some modifications after the end of the contest. We also built a CI workflow that compares the output of every version of the code against canonical results and publishes the current top stories to GitHub Pages.

Finally, we would like to thank the competition organizers for a challenging task. It was definitely a great problem to solve.

Links:

  1. GitHub repository: Telegram Data Clustering contest solution by Mindful Squirrel
  2. This article in Russian: Новостной агрегатор за две недели (“News Aggregator in Two Weeks”)
  3. Another article about Data Clustering contest: Building a news aggregator from scratch: news filtering, classification, grouping in threads and ranking
