
Introduction
News organizations have increasingly come to rely on media analytics as a way to attract and retain readers. This has become especially true with ad revenue falling drastically this year due to COVID-19 and other pre-pandemic trends.
It’s become vital for media companies to know which news articles resonate with readers, and which articles don’t. With this in mind, I wanted to find out what makes a news article popular or unpopular. I decided to look at online news articles from the New York Times and predict news popularity based on various features including word count and headline & abstract length.
In this post, I’ll focus mainly on exploratory data analysis, where I’ll pick out trends and patterns while figuring out what might be the most important predictors for a machine learning model.
The Data
For this project, I accessed the New York Times API and retrieved metadata for articles and comments published between January 1 and December 31, 2020. You can follow the steps here for a tutorial on how to do this.
In total, I managed to obtain over 16,000 articles with 11 features and nearly 5,000,000 comments with 23 features. I’ve uploaded the full dataset on Kaggle for anyone who wants to give it a spin.
Let’s take a look at our target variable:

Our comments are pretty unevenly distributed across our articles. In particular, 50% of our articles have fewer than 90 comments. Here, you could either go forward with this as a regression project or as a classification project.

For my project, I created a binary classification variable, where I chose to classify articles with more than 90 comments as popular, and articles with fewer than 90 comments as unpopular. As you can see, our data is almost evenly split between our positive and negative classes.
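As a rough illustration of the labelling step, here is a minimal sketch in plain Python. The comment counts are made up, and the names `n_comments` and `label_popular` are hypothetical. The post doesn't say how an article with exactly 90 comments is treated, so counting it as unpopular here is an assumption.

```python
from statistics import median

# Hypothetical comment counts per article (illustrative, not from the dataset).
n_comments = [3, 12, 45, 90, 150, 800, 2400, 60]

THRESHOLD = 90  # cutoff used in this post


def label_popular(count, threshold=THRESHOLD):
    """Return 1 for articles with more than `threshold` comments, else 0.

    Assumption: an article with exactly `threshold` comments is labelled 0.
    """
    return int(count > threshold)


is_popular = [label_popular(c) for c in n_comments]

# Share of the positive class, to check how balanced the two classes are.
positive_share = sum(is_popular) / len(is_popular)

print("median comments:", median(n_comments))
print("labels:", is_popular)
print("share popular:", positive_share)
```

Picking a threshold close to the median is what keeps the two classes roughly the same size.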
Exploratory Data Analysis
So what are the most important factors when it comes to article popularity? Let’s find out.
Word Count
You might be able to intuitively guess that one of the most important predictors for article popularity is word count, which actually has a decent positive correlation with our target class. The only problem here is that our word count variable is pretty skewed.
In general, many machine learning algorithms tend to perform better when a variable’s distribution is close to normal, so heavily skewed variables are often transformed first. Below, I tried various transformation methods to try and achieve a normal distribution.

You’ll notice that a group of articles have a word count of 0. These are interactive articles that don’t contain words. This data isn’t exactly ‘missing at random’ (MAR) so I decided not to go for any kind of imputation here.
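The post doesn't list which transformations were tried, so the following is only a sketch of one common option for right-skewed counts: `log(1 + x)`. The word counts are made up and the `skewness` helper is hypothetical. Measuring skewness on the non-zero articles only is also a choice made for this sketch, since the zero-word interactive pieces form their own group.

```python
import math


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n


# Hypothetical word counts, including a 0 for an interactive article.
word_counts = [0, 350, 600, 800, 950, 1200, 1500, 2500, 6000, 12000]

# log1p(x) = log(1 + x), so a word count of 0 stays defined (it maps to 0.0).
log_counts = [math.log1p(w) for w in word_counts]

# Compare skewness before and after the transform on the non-zero articles.
nonzero_raw = [w for w in word_counts if w > 0]
nonzero_log = [math.log1p(w) for w in nonzero_raw]

raw_skew = skewness(nonzero_raw)
log_skew = skewness(nonzero_log)

print("skewness before:", round(raw_skew, 2))
print("skewness after: ", round(log_skew, 2))
```

A skewness closer to 0 means a more symmetric distribution. On this toy data the log transform pulls the long right tail in considerably, though it doesn't remove the skew entirely.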
To demonstrate the positive correlation between the number of comments and word count, I made a scatterplot that compares word count versus the number of comments for articles from the top five news desks.

Interestingly, the word count of articles from the OpEd news desk doesn’t have a positive correlation with the number of comments per article.
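One way to put a number on what the scatterplot shows is a Pearson correlation computed separately for each news desk. This is a minimal sketch with made-up rows. The `pearson` helper and the data are hypothetical, and the post doesn't say which correlation measure was used.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical (news_desk, word_count, n_comments) rows.
articles = [
    ("Foreign", 500, 40), ("Foreign", 900, 85),
    ("Foreign", 1400, 160), ("Foreign", 2000, 210),
    ("OpEd", 700, 900), ("OpEd", 900, 400),
    ("OpEd", 1100, 950), ("OpEd", 1300, 420),
]

# Collect word counts and comment counts per desk.
by_desk = defaultdict(lambda: ([], []))
for desk, words, comments in articles:
    by_desk[desk][0].append(words)
    by_desk[desk][1].append(comments)

correlations = {desk: pearson(xs, ys) for desk, (xs, ys) in by_desk.items()}

for desk, r in correlations.items():
    print(f"{desk}: r = {r:.2f}")
```

Computing the correlation per desk matters because a single pooled number would hide a desk that behaves differently from the rest.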
News Desk
Below are the biggest news desks from within my dataset. These news desks represent more than 50% of my dataset. Unsurprisingly, articles from the OpEd news desk have the highest average popularity, followed by Foreign and Business news desks. In 2017, the NYT implemented a new commenting system that opened up OpEd articles and other selected news articles for 24 hours. This is likely part of the reason why OpEd articles seem to draw a higher frequency of comments.

Headline & Abstract Length
The length of an article’s headline and abstract also has some bearing on popularity. Below, you can see that there’s a bit of a U-shaped distribution. Articles with less wordy headlines and abstracts tend to be more popular than articles with wordier headlines and abstracts.

However, there’s also a bit of a reverse trend where articles with really long headlines and abstracts tend to do well. We see that at the extreme upper and lower ends, article popularity is pretty unpredictable and tends to fluctuate greatly.
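A curve like this can be produced by grouping articles into length bins and taking the share of popular articles in each bin. Here is a minimal sketch, assuming a bin width of 50. The data, the bin width, and the names `bin_start` and `stats_by_bin` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical (combined headline + abstract length, is_popular) pairs.
rows = [
    (40, 1), (45, 0), (70, 1), (90, 1), (120, 1), (125, 0),
    (160, 0), (180, 0), (190, 1), (240, 1), (260, 1), (270, 0),
]

BIN_WIDTH = 50  # assumed bin width, chosen for illustration only


def bin_start(length, width=BIN_WIDTH):
    """Lower edge of the bin that `length` falls into."""
    return (length // width) * width


totals = defaultdict(int)
popular = defaultdict(int)
for length, is_pop in rows:
    b = bin_start(length)
    totals[b] += 1
    popular[b] += is_pop

# Keep the article count next to each rate so thin bins are easy to spot.
stats_by_bin = {b: (totals[b], popular[b] / totals[b]) for b in sorted(totals)}

for b, (count, rate) in stats_by_bin.items():
    print(f"{b}-{b + BIN_WIDTH - 1}: n={count}, share popular={rate:.2f}")
```

Showing the count per bin helps explain the noise at the extremes: a bin with only a handful of articles can swing between very high and very low popularity.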
Type of Material
We can see that Editorials and OpEd pieces tend to do tremendously well. News analysis and interactive features tend to be more popular, while regular news tends to be a bit less popular. Reviews, obituaries, and news briefings generally don’t do well.

Day of Week
It turns out that articles posted on weekends actually do better – we can infer that this is due to the inverse relationship between the number of articles published in a day and average popularity. In other words, articles tend to do better on the weekend as there is less ‘competition’ between articles.
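To check the ‘competition’ idea, it helps to look at the number of articles and the share of popular articles side by side for each day. The sketch below does that with made-up publication timestamps. The rows and the names `stats_by_day` and `is_weekend` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical (publication timestamp, is_popular) rows.
rows = [
    ("2020-03-02T09:00:00", 0), ("2020-03-02T14:30:00", 1),
    ("2020-03-02T18:00:00", 0), ("2020-03-04T11:00:00", 0),
    ("2020-03-04T20:15:00", 1), ("2020-03-07T10:00:00", 1),
    ("2020-03-08T23:30:00", 1), ("2020-03-08T08:00:00", 0),
]


def is_weekend(ts):
    """True for Saturday and Sunday (weekday() is 0 for Monday, 6 for Sunday)."""
    return ts.weekday() >= 5


totals = defaultdict(int)
popular = defaultdict(int)
weekend = [0, 0]  # [article count, popular count]
weekday = [0, 0]
for stamp, is_pop in rows:
    ts = datetime.fromisoformat(stamp)
    day = DAY_NAMES[ts.weekday()]
    totals[day] += 1
    popular[day] += is_pop
    bucket = weekend if is_weekend(ts) else weekday
    bucket[0] += 1
    bucket[1] += is_pop

# Day name -> (articles published, share popular).
stats_by_day = {d: (totals[d], popular[d] / totals[d])
                for d in DAY_NAMES if d in totals}
weekend_rate = weekend[1] / weekend[0]
weekday_rate = weekday[1] / weekday[0]

for day, (count, rate) in stats_by_day.items():
    print(f"{day}: n={count}, share popular={rate:.2f}")
print("weekend:", round(weekend_rate, 2), "weekday:", round(weekday_rate, 2))
```

The same grouping works for time of day by swapping `ts.weekday()` for `ts.hour`.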

Time of Day
Lastly, time of day also has an impact on popularity. Articles published at night seem to have a greater tendency to be popular, as compared to articles that are published during the day – except at 9 am.

Keywords / Topics
This isn’t going to be surprising to anyone, but certain topics do much better than other topics. Using our article keywords as a proxy for topics within our articles, we can compare the popularity of topics across time.

I’ll start with the (orange) elephant in the room here – Donald Trump has been an extremely hot topic for the New York Times ever since 2016. This was no different in 2020, though we can see that the news cycle hit a climax in October and began to taper off slightly towards December with the end of the Presidential Election debacle.
Judging by the frequency of popular articles per month, we can clearly see that COVID-19 news pretty much hit its peak in April before tapering off towards the end of the year.
Race & ethnicity became a hot topic this year with the George Floyd protests in May and June.

If we dig down a bit further and look at average popularity, there’s a stark contrast in popularity between topics. ~80% of articles mentioning Donald Trump were popular, while only ~30% of articles mentioning real estate were popular.
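The per-topic numbers above amount to asking, for each keyword, what share of the articles tagged with it are popular. Here is a minimal sketch of that calculation. The rows are made up and the keyword strings are hypothetical stand-ins, not necessarily the exact tags used by the NYT.

```python
from collections import defaultdict

# Hypothetical (keywords, is_popular) rows; an article can carry several keywords.
articles = [
    (["Trump, Donald J", "Presidential Election of 2020"], 1),
    (["Trump, Donald J"], 1),
    (["Trump, Donald J", "Coronavirus"], 1),
    (["Trump, Donald J"], 1),
    (["Trump, Donald J"], 0),
    (["Real Estate"], 0),
    (["Real Estate"], 1),
    (["Real Estate", "Coronavirus"], 0),
]

totals = defaultdict(int)
popular = defaultdict(int)
for keywords, is_pop in articles:
    # Each article counts once towards every keyword it is tagged with.
    for kw in set(keywords):
        totals[kw] += 1
        popular[kw] += is_pop

popular_share = {kw: popular[kw] / totals[kw] for kw in totals}

# Sort topics from most to least popular.
for kw, share in sorted(popular_share.items(), key=lambda item: -item[1]):
    print(f"{kw}: {share:.0%} popular (n={totals[kw]})")
```

Because one article can have several keywords, the per-keyword counts overlap and won't sum to the number of articles.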
Conclusion
There are a few key trends here that are worth paying attention to:
- Longer articles tend to do better overall (except for OpEd articles)
- Articles from the OpEd, Politics, Games, and Washington news desks are likely to be popular, while articles from the Sports, Culture, and Podcast news desks are likely to be unpopular.
- Articles with shorter headlines & abstracts (between 50–130 words) are more likely to be popular.
- News analysis and interactive articles tend to be more popular than obituaries and regular news reports.
- The date & time of publication can affect an article’s popularity – articles seem to do better when published during the weekend or at night between 11 pm and 2 am.
- Certain topics do much better than other topics throughout the year (e.g. Donald Trump), while other topics only do well during certain points of the year.
Thanks for reading!
That’s all for this post – hope you enjoyed it! I might make another post covering feature engineering & modeling at a later point, but if you’re interested in seeing the source code for this project, you can find it here. You can also find a slightly more technical write-up on my website.
Feel free to connect with me on LinkedIn as well.