
Content-Based Recommender for NYT Articles

Data Products. Data Science. NLP

Introduction

We will create a content-based recommender for New York Times articles. This recommender is an example of a very simple data product.

We’ll be recommending new articles that a user should read based on the article that they are currently reading. We will do this by recommending similar articles based on the text data of that article.


Check out the code on my GitHub profile: DataBeast03/DataBeast


Inspect the Data

The following is an excerpt from the first NYT article in our dataset. We are, obviously, dealing with text data.

'TOKYO -  State-backed Japan Bank for International Cooperation [JBIC.UL] will lend about 4 billion yen ($39 million) to Russia's Sberbank, which is subject to Western sanctions, in the hope of advancing talks on a territorial dispute, the Nikkei business daily said on Saturday, [...]'

So the first question we must answer is: how should we go about vectorizing this text? And which new features should we engineer, such as part-of-speech tags, n-grams, sentiment scores, or named entities?

Obviously the NLP tunnel is a deep one, and we could spend days experimenting with the different options we have. But all good science starts by trying the simplest viable solution and then iterating towards the more complex.

In this article, we will implement one such simple viable solution.


Split Your Data

First we need to decide which features from the data set are of interest, shuffle them, and then split the data into a train and test set – all standard data preprocessing.

# shuffle utility from scikit-learn
from sklearn.utils import shuffle

# move articles to an array
articles = df.body.values
# move article section names to an array
sections = df.section_name.values
# move article web_urls to an array
web_url = df.web_url.values
# shuffle these three arrays in unison
articles, sections, web_url = shuffle(articles, sections, web_url, random_state=4)
# split the shuffled articles into two arrays
n = 10
# one will have all but the last 10 articles -- think of this as your training set/corpus
X_train = articles[:-n]
X_train_urls = web_url[:-n]
X_train_sections = sections[:-n]
# the other will have those last 10 articles -- think of this as your test set/corpus
X_test = articles[-n:]
X_test_urls = web_url[-n:]
X_test_sections = sections[-n:]

Text Vectorizer

We can choose from several different text vectorizers, such as Bag-of-Words (BoW), Tf-Idf, Word2Vec, and so on.

Here is one reason why we should choose Tf-Idf:

Unlike BoW, Tf-Idf weighs the importance of a word not merely by how often it appears in a document (term frequency), but also by how rare it is across the corpus (inverse document frequency).

So, for example, a word like "Obama" that appears several times in an article but only in a small fraction of the articles in the corpus will be given a high weight, while stop words like "a" or "the", which appear in virtually every article and don’t convey much information, will be down-weighted.

This makes sense because "Obama" isn’t a stop word, and it isn’t mentioned without good reason (i.e. it’s highly relevant to the article’s topic).
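
As a minimal sketch (the original walkthrough doesn’t show this step), here is how the training corpus and the held-out articles could be vectorized with scikit-learn’s TfidfVectorizer. With the default l2 normalization, the dot product of two Tf-Idf rows is exactly their cosine similarity.

from sklearn.feature_extraction.text import TfidfVectorizer

# fit the vectorizer on the training corpus only, then transform both sets
vectorizer = TfidfVectorizer(stop_words='english')
X_train_tfidf = vectorizer.fit_transform(X_train)   # shape: (n_train_articles, n_tokens)
X_test_tfidf = vectorizer.transform(X_test)         # shape: (n_test_articles, n_tokens)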


Similarity Metric

We have several different options when selecting a similarity metric, such as Jaccard and Cosine, to name a couple.

Jaccard similarity works by comparing two sets and measuring their overlap (the size of the intersection relative to the union). Jaccard doesn’t make sense as an option given that we’ve chosen Tf-Idf as our vectorizer; it might make more sense had we selected BoW vectorization and treated each article as a set of tokens.

Cosine, on the other hand, is a natural fit for Tf-Idf, because it operates directly on the weighted vectors that Tf-Idf produces.

Since Tf-Idf assigns a weight to each token in each article, we can take the dot product between the weight vectors of different articles.

If article A has high weights for tokens like "Obama" and "White House" and so does article B, then their dot product will yield a larger similarity score than if article B had low weights for those same tokens (for simplicity, assume all other token weights are held constant).
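
To make that concrete, here is a tiny worked example with toy weight vectors (not real Tf-Idf output), computing cosine similarity with NumPy:

import numpy as np

# hypothetical Tf-Idf weights for the tokens ["obama", "white house", "yen"]
article_a = np.array([0.8, 0.6, 0.0])
article_b = np.array([0.7, 0.5, 0.1])   # similar topic -> similar weights
article_c = np.array([0.0, 0.1, 0.9])   # different topic

def cosine(u, v):
    # cosine similarity: dot product divided by the product of the vector norms
    return u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(article_a, article_b))  # high score, ~0.99
print(cosine(article_a, article_c))  # low score, ~0.07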


Building the Recommender

This section is where the magic happens!

Here you will build a function that outputs the top N articles to recommend to your user based on the similarity scores between the article they’re currently reading and all other articles in the corpus (i.e. "train" data).

import numpy as np

def get_top_n_rec_articles(X_train_tfidf, X_train, test_article, X_train_sections, X_train_urls, n=5):
    '''This function calculates similarity scores between a document and a corpus

       INPUT: vectorized document corpus, 2D array
              text document corpus, 1D array
              vectorized user article, 1D array
              article section names, 1D array
              article URLs, 1D array
              number of articles to recommend, int

       OUTPUT: top n recommendations, 1D array
               top n corresponding section names, 1D array
               top n corresponding URLs, 1D array
               similarity scores between user article and entire corpus, 1D array
              '''
    # calculate similarity between the corpus (i.e. the "train" data) and the user's article
    similarity_scores = np.asarray(X_train_tfidf.dot(test_article.toarray().T)).ravel()
    # get indices that sort the similarity scores from highest to lowest
    sorted_indices = np.argsort(similarity_scores)[::-1]
    # get sorted similarity scores
    sorted_sim_scores = similarity_scores[sorted_indices]
    # get top n most similar documents
    top_n_recs = X_train[sorted_indices[:n]]
    # get top n corresponding document section names
    rec_sections = X_train_sections[sorted_indices[:n]]
    # get top n corresponding urls
    rec_urls = X_train_urls[sorted_indices[:n]]

    # return recommendations and corresponding article meta-data
    return top_n_recs, rec_sections, rec_urls, sorted_sim_scores

The above function works in the following order (an example call follows the list):

  1. Calculate the similarity between the user’s article and our corpus
  2. Sort scores from highest to lowest similarity
  3. Get the top N most similar articles
  4. Get the corresponding top N article section names and urls
  5. Return top N articles, section names, urls, and scores
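
The original walkthrough doesn’t show the call itself; a hypothetical invocation, assuming the TfidfVectorizer sketch from earlier and k as the index of the article the user is currently reading, might look like this:

# index of the article the user is currently reading (hypothetical choice)
k = 0

# vectorize the user's article with the already-fitted vectorizer
test_article = X_test_tfidf[k]

# get the top 5 recommendations and their meta-data
top_n_recs, rec_sections, rec_urls, sorted_sim_scores = get_top_n_rec_articles(
    X_train_tfidf, X_train, test_article, X_train_sections, X_train_urls, n=5)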

Validate the results

Now that we have recommended articles for the user to read (based on the article they are currently reading), let’s check whether the results make sense.

Let’s compare the user’s article and corresponding section name with the recommended articles and corresponding section names.

First let’s take a look at the similarity scores.

# similarity scores
sorted_sim_scores[:5]
# OUTPUT:
# 0.566
# 0.498
# 0.479
# .
# . 

The scores aren’t very high (note that cosine similarity ranges from 0 to 1). How can we improve them? Well, we could select a different vectorizer, like Doc2Vec. We could also explore different similarity metrics. Even so, let’s check out the results.
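
As a rough sketch of what that first option could look like (Doc2Vec via gensim is not used anywhere in this article, and the hyperparameters below are purely illustrative):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tag each training article so Doc2Vec can learn one vector per document
tagged_docs = [TaggedDocument(words=article.lower().split(), tags=[i])
               for i, article in enumerate(X_train)]

# train a small Doc2Vec model (illustrative hyperparameters)
model = Doc2Vec(tagged_docs, vector_size=100, window=5, min_count=2, epochs=20)

# infer a vector for the user's article and retrieve the most similar training articles
user_vec = model.infer_vector(X_test[k].lower().split())
top_5 = model.dv.most_similar([user_vec], topn=5)   # model.docvecs in gensim < 4.0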

# user's article's section name
X_test_sections[k]
# OUTPUT:
'U.S'
# corresponding section names for top n recs 
rec_sections
# OUTPUT:
'World'
'U.S'
'World'
'World'
'U.S.'

Ok, so the recommended section names seem very appropriate. That’s good!

# user's article
X_test[k]
'LOS ANGELES -  The White House says President Barack Obama has told the Defense Department that it must ensure service members instructed to repay enlistment bonuses are being treated fairly and expeditiously.\nWhite House spokesman Josh Earnest says the president only recently become aware of Pentagon demands that some soldiers repay their enlistment bonuses after audits revealed overpayments by the California National Guard.  If soldiers refuse, they could face interest charges, wage garnishments and tax liens.\nEarnest says he did not believe the president was prepared to support a blanket waiver of those repayments, but he said "we're not going to nickel and dime" service members when they get back from serving the country. He says they should not be held responsible for fraud perpetrated by others.'

Ok, the user’s article is about overpayments made to National Guard service members.

Now let’s check some excerpts from the top N recommended articles:

# article one
'WASHINGTON -  House Speaker Paul Ryan on Tuesday called for the Pentagon to immediately suspend efforts to recover enlistment bonuses paid to thousands of soldiers in California, [...]'
# article two
'WASHINGTON -  The Latest on enlistment bonuses the Pentagon is ordering California National Guard soldiers to repay [...]'
# article three
'SACRAMENTO, Calif. -  Nearly 10,000 California National Guard soldiers have been ordered to repay huge enlistment bonuses a decade after signing up to serve in Iraq and Afghanistan [...]'
# article four
'WASHINGTON -  The Pentagon worked Wednesday to stave off a public relations nightmare, suspending efforts to force California National Guard troops who served in Iraq and Afghanistan to repay their enlistment bonuses [...]'
# article five 
'SACRAMENTO, Calif. -  The Latest on enlistment bonuses the Pentagon is ordering California National Guard soldiers to repay [...]'

Well, well! Looks like our recommender worked out very well indeed. All top 5 recommended articles were entirely relevant to what the reader was currently reading. Not bad.


Note on Validation

Our ad-hoc validation process of comparing the recommended texts and section names showed that our recommender works as intended.

Manually combing through results is fine for our purposes here; however, what we ultimately want is a completely automated validation process, so that we can move our recommender into production and have it be self-validating.
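
As one very crude sketch of what such a check could look like (section_match_rate is a hypothetical helper, not part of the original code), we could treat section-name agreement as a proxy label and measure how often the recommendations share the user’s article’s section:

def section_match_rate(rec_sections, user_section):
    # fraction of recommended articles whose section name matches the user's article's section
    return sum(sec == user_section for sec in rec_sections) / len(rec_sections)

# e.g. section_match_rate(rec_sections, X_test_sections[k])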

Moving this recommender into production is outside of the scope of this article. This article is meant to show how to prototype such a recommender on a real-world dataset.


Conclusion

In this article we have shown how to construct a simple content-based recommender for New York Times articles. We saw that we simply need to choose a text vectorizer and similarity metric to build one. However, which vectorizer and similarity metric we choose greatly influences performance.

You might have noticed that there isn’t any actual machine learning in our recommender. Really, what we have is memorization and ranking of similarity scores.


Additional Resources

Stanford’s Infolab doc on Recommendation Systems

Spotify’s Deep Learning Music Recommender

