Researching Content-Based Filtering for News Feeds

Exploring the effects of personalizing news feeds using classification algorithms

Michel Wijkstra
Towards Data Science

--

Figure by Madison Inouye on Pexels (2020)

With news consumption becoming increasingly digital, news platforms have to work hard to attract and retain their user base. A proven method of maintaining and growing the user base of any digital platform is to apply personalization techniques. Currently, news platforms tend to prefer manually selecting the articles that are promoted on the front page. This is partly because they have always worked this way, but also because they are simply unaware of the real-world effects of applying personalization techniques to their news feeds. As these news feeds are becoming their primary interaction with their customers, this hesitation to try the technology is understandable. With this research we aim to take some of that hesitation away by providing valuable insights into the effects of content-based filtering on news feeds.

This blog provides a look into research conducted for my bachelor thesis. It is written in collaboration with Max Knobbout, Lead Artificial Intelligence at Triple.

We will explain the technical concept, the experimental setup and the most important conclusions. The original research was done using a dataset containing articles from a Dutch regional news platform; in the technical analysis we will be using an open-source alternative.

Table of Contents

  • Technical Concept: Recommendation as Classification
  • Real World Experiment
  • Conclusions
  • Final Word

Technical Concept: Recommendation as Classification

In this section we will explain how we can view the problem of recommending relevant news articles as a classification problem. Generally speaking, there are two families of recommendation algorithms:

  • Collaborative Filtering based: In this way of recommending items to users, we use the interactions of all the users with all the items on a given platform to predict what an individual user may like. For example, to determine whether or not you might like a specific item, we can look for users with a similar taste and see what they thought about that item.
  • Content based: In content-based recommendation, we only use the characteristics (features) of the items to determine whether or not you might like an item. We do not use the interactions of other users.

The advantage of collaborative filtering is that it needs very little data about a new user to create a fairly accurate profile. The disadvantage, however, is that it requires user-item interaction data; in other words, collaborative filtering suffers from the so-called “cold start” problem. In this blog, we are going to explore how we can pose the problem of recommendation as a classification problem. This approach is content-based, and thus does not suffer from the cold-start problem. More specifically, we are going to create a dataset and train an ML model for each user on a set of pre-defined articles. This model is then able to predict that user’s rating for any unseen article.

In this section we are going to explain and perform the following steps to create our recommendation-as-classification framework:

  1. We are going to extract relevant features from the set of news articles.
  2. From the set of all news articles, we are going to extract a small subset of training articles. In our experiment, we opted to select 30 unique articles.
  3. These articles are then presented to a user, for which the user selects a ‘like’ or ‘dislike’ (a Tinder-like swipe flow). This is then used as a labeled training set for our supervised classification problem.
  4. We will use these articles to train a model that is able to predict like/dislike for unseen articles. In order to demonstrate what we did, we are going to use a subset of the “All the news” dataset found on Kaggle, which can be downloaded here: https://www.kaggle.com/snapcrack/all-the-news

1. Feature extraction

The first major challenge in our approach is the problem of feature extraction. Within the domain of Natural Language Processing, a popular approach to model a large body of text is still the Bag-of-Words (BoW) representation. In essence, a BoW representation of a text can be acquired by simply counting all the words present in the document. The count of each word is then used as a feature, allowing us to represent each document as a vector in some N-dimensional space (where N is the number of unique words found in all news articles). In other words, we lose some meaning because we do not remember the order of the words, but we gain the ability to represent an article as a numerical vector. There are many textbooks and online repositories which explain the BoW model in more depth.

Once such a BoW vector of a document is acquired, it is often a good idea, especially in the domain of NLP, to weigh its features in some relevant way. The intuition behind this is that some words are less relevant than others to describe the topic of an article. For example, words like ‘the’, ‘a’ and ‘an’ are very common, but typically do not convey any meaning. On the contrary, a word that is common in some documents but rare in others (for example names like ‘Trump’) typically conveys a lot of meaning for a document. This observation is the basis for the TF-IDF weighting scheme. The general idea is the following (a short code sketch follows the list):

  1. We scale down features which are common over the entire dataset. These are words which typically convey little meaning.
  2. We scale up features which are common in certain documents, but rare over the entirety of the dataset. These are words which typically convey a lot of meaning.
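
To make the TF-IDF idea concrete, here is a minimal sketch using scikit-learn’s CountVectorizer and TfidfTransformer on a few made-up mini documents (the sentences are purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Three made-up mini "articles", purely for illustration.
docs = [
    "the president visited the white house",
    "the weather will be sunny and the weather will be warm",
    "the president talked about the weather",
]

# Bag-of-Words: count how often each word occurs in each document.
bow = CountVectorizer()
counts = bow.fit_transform(docs)

# TF-IDF: scale down words that occur in every document (like 'the')
# and scale up words that are frequent in one document but rare overall.
tfidf = TfidfTransformer()
weighted = tfidf.fit_transform(counts)

print(bow.get_feature_names_out())
print(weighted.toarray().round(2))
```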

Once the vectors are weighted with our TF-IDF weighting scheme, we are still faced with a major challenge: having too many features in the dataset. Simply creating a BoW model, weighting it according to a TF-IDF scheme, and removing some of the rare features still nets us a vector representation of over 5000 dimensions for each news article. That is to say, if we are going to train our classification model on only 30 news articles, and each news article has over 5000 features, how can any classification model decide which features are relevant and which ones are not? Whenever the number of features (5000+) is close to or larger than the number of instances (30), we are almost inevitably going to overfit the resulting model. In order to circumvent this, we need to perform some feature reduction, in Machine Learning commonly called dimensionality reduction.

Suppose you, without having any prior knowledge of Machine Learning, are tasked with performing some relevant dimensionality reduction of the input space. Each news article is now represented as a TF-IDF vector with a length greater than 5000, and we want to reduce this length to some small number (let’s say 16) without losing too much of the meaning of the contents of these articles. That is to say, if two articles are about similar topics, we want the resulting vectors to be close together, and if the articles are about completely different topics, we want them to be far apart.

How would you tackle such a problem? By looking at the TF-IDF vectors, you would probably come to the conclusion that some terms are often seen together in some documents, while other terms are often seen in different documents. For example, the term ‘Donald’ is probably often seen together with ‘Trump’, but both these terms are probably not often seen when talking about ‘weather’. The idea you might come up with is to model some higher-order terms, called “components”, as linear combinations of individual terms. For example, one component might assign high weights to terms like “Donald”, “Trump”, “White”, “House”, etc., while another component might assign high weights to terms like “Weather”, “Sunny” and “Rainy”.

Given these components with their respective weights, we can then represent each article as a vector of components. This is the algebraic intuition behind ‘singular value decomposition’ (SVD) (there are also explanations of the technique from a more geometrical point of view). Although we have not talked about how these components are determined, we do not need to in order to grasp the intuition behind the technique.

A BoW extraction, together with a TF-IDF weighting scheme, followed by selecting the best components using SVD to describe the topics in the articles, is together often described as the ‘Latent Semantic Analysis’ (LSA) algorithm. More about the technique can be read in the Wikipedia article: https://en.wikipedia.org/wiki/Latent_semantic_analysis

Let’s perform this task in code! Let us first download the CSV file and load it with pandas.
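
A minimal sketch of that step, assuming one of the CSV files from the Kaggle dataset (articles1.csv) with a ‘content’ column containing the article text:

```python
import pandas as pd

# One CSV file from the "All the news" Kaggle dataset.
df = pd.read_csv("articles1.csv")

# We only need the article text for the feature-extraction steps below.
documents = df["content"].fillna("").tolist()
print(len(documents))
```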

We are then going to perform the steps described above using scikit-learn in Python. For the purpose of this demonstration, we are going to create a pipeline which consists of a TfidfVectorizer followed by a TruncatedSVD. The TfidfVectorizer combines a BoW representation (in scikit-learn called CountVectorizer) with a TF-IDF weighting scheme (in scikit-learn called TfidfTransformer).

As a hyperparameter for the TF-IDF step we pass a set of stop words (acquired from the nltk library) which we do not want to include as features (just to be on the safe side), and we tell the vectorizer that we want a minimum document frequency (min_df) of 0.5%. This means that we are only interested in words which appear in at least 0.5% of all the articles, which allows us to get rid of super-rare words and spelling mistakes.

The TruncatedSVD step creates lower-dimensional vectors, where each dimension (called a “component”) is one of the linear combinations of terms which we described earlier. The hyperparameter n_components determines how many of these components we want to keep; the more we keep, the more information we retain of the original data (note: the amount of information that is retained is often calculated by computing the “explained variance”: the variance present in the original dataset which is still explained by the lower-dimensional representation). In our code example, we only select 16 components. For the purposes of LSA, this is particularly low, and we are definitely going to lose some information. However, since we only want to build our recommendation system on the basis of 30 articles, and we do not want to overfit this model, we need the number of features to be low.
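
A sketch of this pipeline could look as follows (the random_state is our own addition for reproducibility; the other hyperparameters match the values discussed above):

```python
from nltk.corpus import stopwords
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Stop words we do not want to include as features
# (requires a one-time nltk.download("stopwords")).
stop_words = stopwords.words("english")

pipe = Pipeline([
    # BoW + TF-IDF in one step; min_df=0.005 drops words that appear
    # in fewer than 0.5% of the articles (super-rare words, typos).
    ("tfidf", TfidfVectorizer(stop_words=stop_words, min_df=0.005)),
    # Reduce the 5000+ TF-IDF features to 16 LSA components.
    ("svd", TruncatedSVD(n_components=16, random_state=42)),
])

lsa_vectors = pipe.fit_transform(documents)
print(lsa_vectors.shape)
```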

(50000, 16)

As can be seen in the output above, we have successfully transformed each article into a 16-dimensional component vector. It is often not a bad idea to verify what these components represent. Remember that each component is computed as a linear combination of terms. By checking which of these terms have high weights (mathematically we refer to these weights as “coefficients”) for a given component, we can get a grasp of what these components represent. The coefficients of these terms are stored in the variable pipe.named_steps.svd.components_, so by examining this matrix we can extract the terms with the highest coefficients with our custom-built function.
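
A possible implementation of such a helper (the name get_top_terms is ours):

```python
import numpy as np

def get_top_terms(pipe, component, n_terms=5):
    """Return the n_terms terms with the highest coefficients for a component."""
    terms = pipe.named_steps["tfidf"].get_feature_names_out()
    coefficients = pipe.named_steps["svd"].components_[component]
    top = np.argsort(coefficients)[::-1][:n_terms]
    return [terms[i] for i in top]

print(get_top_terms(pipe, component=1))
```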

['trump', 'clinton', 'hillary', 'donald', 'campaign']

In other words, component 1 (of the 16) is definitely about Donald Trump and Hillary Clinton, and their campaigns. Let’s examine another one:
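
Using the same helper on component 7:

```python
print(get_top_terms(pipe, component=7))
```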

['comey', 'fbi', 'house', 'investigation', 'white']

Component 7 is about the dismissal of James Comey. What this means is that a large and significant portion of the news articles are about this specific topic. Note that this is all computed automatically for us; we did not have to engineer any features. That is the power of unsupervised Machine Learning!

2. Extracting 30 relevant articles

So we have successfully extracted the relevant features of our news articles. Remember that we want to present 30 unique articles to each user to determine their taste. But which articles do we select? In general, we want to select 30 articles which are a good representation of the entire dataset. In other words, we want articles that “cover” the entirety of the dataset. If we have 30 articles which cover the dataset closely, we can make a better estimate of the general taste of a user.

In order to do this, we can perform the following two steps:

  1. First, we perform a K-means clustering (with K=30) on the 16-dimensional LSA vectors. This gives us 30 points in the 16-dimensional space around which the news articles are centered (called the “centroids”).
  2. For each centroid, we compute a best-matching article: the article with the lowest distance to that centroid.

The resulting 30 articles are then shown to the user in order to create a dataset. Let’s compute it now!
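
A sketch of these two steps using scikit-learn’s KMeans and pairwise distances (the variable name best_matching_articles is ours):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

# Step 1: cluster the 16-dimensional LSA vectors into 30 clusters.
kmeans = KMeans(n_clusters=30, random_state=42)
kmeans.fit(lsa_vectors)

# Step 2: for each centroid, find the article with the smallest distance to it.
distances = pairwise_distances(kmeans.cluster_centers_, lsa_vectors)
best_matching_articles = distances.argmin(axis=1)

print(best_matching_articles.shape)  # (30,)
```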

Thus, the result is a 30-length array where at each index i, the article best_matching_articles[i] has the minimum distance to centroid i. The corresponding articles are:

Figure by Michel Wijkstra (2020)

3. Creating a training-set

We now show each user these 30 articles, and the user has to either select like (swipe right) or dislike (swipe left). We will assume that the result is a 30-length list. For the sake of demonstration, a random choice can be generated as follows:
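
For instance, a purely random stand-in for real swipe data:

```python
import numpy as np

# Random like/dislike ratings, one for each of the 30 shown articles.
ratings = np.random.choice(["like", "dislike"], size=30)
print(ratings)
```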

4. Training the model

We are now ready to train a classifier whose inputs are the 16-dimensional LSA vectors of the 30 best-matching articles, and whose labels are either like or dislike. In other words, we now leave the area of unsupervised ML and enter the area of supervised ML. We have many choices for our model, but we generally want to select one which can handle a low number of instances. For this purpose, we select a Nearest Neighbor classification model. What this model does is, for each unseen instance, check which known articles are closest (the "nearest neighbors"). Based on these nearest neighbors, the classifier then decides whether to rate the unseen article as a like or as a dislike.

For example, if all neighbors are like, clearly we would want to rate the unseen article as like too. This is the intuition behind a Nearest Neighbor classification method. Let's see it in code!
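
A sketch of this step with scikit-learn’s KNeighborsClassifier; the value of n_neighbors is illustrative, and the weights="distance" setting is explained below:

```python
from sklearn.neighbors import KNeighborsClassifier

# Train on the LSA vectors of the 30 shown articles,
# using the (here randomly generated) ratings as labels.
X_train = lsa_vectors[best_matching_articles]
y_train = ratings

clf = KNeighborsClassifier(n_neighbors=5, weights="distance")
clf.fit(X_train, y_train)
```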

There are two interesting hyperparameters which we can select to counteract overfitting. The parameter n_neighbors tells us how many neighbors we want to consider to classify an unseen instance. Clearly, if we make it very small (e.g. 1), we are prone to overfit, because we base the decision on only 1 neighbor, which may be an "outlier" compared to the neighbors around it. For example, the closest neighbor may say 'like', but if we looked at more neighbors we would see that most of them say 'dislike'. On the contrary, if we make the number of neighbors too high, we may underfit the classifier, since we are simply taking averages over all the neighbors.

Setting the weights hyperparameter to 'distance' means that the closer a neighbor is, the more weight it gets in determining whether we want to rate an unseen article as like/dislike.

Using this classifier, we can predict what the rating will be for the entirety of the dataset of news articles:
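
In code, this is a single predict call on all LSA vectors:

```python
predictions = clf.predict(lsa_vectors)
print(predictions)
```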

array(['dislike', 'like', 'like', ..., 'like', 'like', 'like'],
dtype='<U7')

One final question remains: how good is this classifier? A way to validate this would be to use some form of cross-validation. Since the number of instances is very low, a good way to cross-validate is to use the “Leave One Out” (LOO) cross-validation policy. What this does in our case is train the model on 29 articles and check whether the remaining article is classified correctly. This is repeated 30 times, once for each article, which nets us a score (% correct predictions, also called the “accuracy”). Since we do not have the actual ratings of the user, we cannot do that here, but it is always a good idea to verify the resulting model. In scikit-learn, LOO is implemented in sklearn.model_selection.LeaveOneOut.
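
A sketch of such a check (with our random ratings the resulting accuracy is of course meaningless, but it illustrates the mechanics):

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Train on 29 articles, predict the held-out one, repeat for every article.
loo = LeaveOneOut()
scores = cross_val_score(clf, X_train, y_train, cv=loo)
print(scores.mean())  # fraction of correct predictions (accuracy)
```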

Real World Experiment

Now that we have the technical implementation in place, we can conduct the experiment. The experiment consisted of two phases: in the first phase we gathered information about each respondent’s interests to train the algorithm, and in the second phase we used the trained algorithm to recommend articles. In total, 126 respondents helped us with this.

Phase One

In order to recommend articles using our algorithm, we first need a profile for each respondent. This is where we faced an important decision: using lots of data points would improve the quality of the profile, but spending too much time on this phase might cause the respondents to lose interest. We eventually settled on using 30 articles, which felt like a good trade-off between model quality and time spent collecting data.

In the original research it was determined that performing the experiment should not take longer than 6 minutes for each respondent. This is the average time spent consuming digital news in the Netherlands.

In order to create the best possible profile using only 30 articles, we needed a selection that was as diverse as possible. To achieve this, we utilized K-means to create 30 clusters, and from each cluster we selected the article closest to the centroid.

The selected articles were presented to the respondents as swipe cards, a proven concept borrowed from apps like Tinder. For those who are unfamiliar with this concept, the articles are shown like a deck of playing cards. The respondent can swipe the card to the left to dislike the article, or to the right to like it. We’ve chosen this design because of its simplicity and speed. After swiping through all 30 articles the necessary information is gathered and phase one is concluded.

Phase Two

Now that we have a profile for the respondent, we can start gathering behavioural data. During this phase the respondents were exposed to 30 news overviews, containing varying amounts of personalized articles; the amount varied between 10 and 70 percent personalized content. Each respondent was instructed to simply pick a single article they would like to read. For each served news overview we stored the articles, whether or not they were personalized, and the choice of the respondent. We based our conclusions on this stored data.

Left: Phase One Swipe Interface, Right: Phase 2 News Feed | Figure by Michel Wijkstra (2020)

Conclusions

From the collected data we managed to draw some interesting conclusions. First, we should note that we split the respondents into 3 groups based on the ‘width’ of their interests. This resulted in groups of respondents with narrow, average and wide interests. The categorization was done by counting the number of articles liked during Phase One and splitting the set at each 33rd percentile. The research resulted in two interesting conclusions, which are addressed below.

Increased effectiveness between 20 and 40% personalized content

The first interesting discovery is the increased consumption of personalized content between 20 and 40 percent personalized content. The group with narrow interests showed the highest improvement between 20 and 30 percent personalization, while the groups with average and wide interests spiked at 40 percent. In the graph we also see a peak around 60 percent for the group with wide interests, but we have been unable to discover why this effect appeared.

Consumption of personalized content | Figure by Michel Wijkstra (2020)

Reduced effectiveness from 50 to 70% personalized content

The second effect we discovered was more unexpected. During the development of the experiment, the general expectation was that at some point only personalized content would be consumed, leaving all non-personalized content untouched. The results, however, disproved this. To our surprise, the amount of personalized content being consumed started decreasing from 50 percent onwards and even dropped below our expected baseline when we hit 70 percent personalized content. This seems to indicate that some sort of oversaturation starts to occur when too much personalized content is pushed into the news feed, which eventually appears to lead to respondents looking for diversification, dropping the consumption of personalized content below the baseline. The effects are displayed below.

The way the effectiveness is calculated was part of the original research, but is beyond the scope of this blog post.

Effectiveness for each level of personalization | Figure by Michel Wijkstra (2020)

Final Word

Our conclusions suggest that limited amounts of personalization can be beneficial for your website. If you are part of a news platform, we hope this post has inspired you to give personalization a chance. If you do decide to implement it, we’d be very interested in hearing your results!

Otherwise we hope you’ve learnt something and enjoyed the read. If you have any questions, feel free to contact Max or myself.
