Data Science Mini-project on Online News Popularity of Mashable articles

Richard Han
Towards Data Science
4 min read · Jul 12, 2019


We have a dataset of articles published by Mashable, and we want to predict the popularity of a given article. This is a classification problem. The features fall into two broad groups: quantitative information about the article, such as the number of images and the number of videos, and qualitative information, such as the day it was published and the topic it falls under.

Comparing the number of images with the number of shares, we get the following bar graph:

As you can see, articles that have 1 image do best overall. Articles that have no images are second-best, and articles that have 2 images come in third place. The number of shares for articles that have more than 2 images is negligible by comparison. Given this insight, it would be wise to include either 0 or 1 image in an article.

Comparing the number of videos with the number of shares, we get the following chart:

Notice that articles that have no videos tend to do best, with articles that have 1 video doing second-best and articles that have 2 videos doing third-best. Shares for articles that have more than 2 videos are negligible in comparison.

We can also see which days of the week have the highest number of shares. Here is a bar chart for days of the week:

As you can see, the weekdays have the highest numbers of shares. Wednesday, Monday, Tuesday, and Thursday lead, with a significant drop on Friday.

We can also see which category or topic does the best. Here is a pie chart for the six categories Tech, Entertainment, World, Business, Social Media, and Lifestyle:

The best performing category is Tech, followed by Entertainment, World, and Business. The least popular categories are Social Media and Lifestyle.

Using Tableau, we can create the above visualizations and do some basic data mining. The insights we’ve derived can tell us how many images and videos to include in an article, on which day to publish the article, and which category the article should be about.
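For readers without Tableau, the same kind of aggregation can be sketched in pandas. This is only an illustration: the column names `num_imgs` and `shares` are assumptions about the CSV, and the small DataFrame below is made-up data standing in for the real dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration.
# Column names `num_imgs` and `shares` are assumed and may differ in your CSV.
articles = pd.DataFrame({
    "num_imgs": [0, 1, 1, 2, 0, 1],
    "shares": [500, 1200, 3000, 800, 700, 1500],
})

# Total shares per image count (what a summed bar chart would show), next to
# the mean per article and the number of articles in each group.
summary = (
    articles.groupby("num_imgs")["shares"]
    .agg(total_shares="sum", mean_shares="mean", n_articles="count")
    .sort_values("total_shares", ascending=False)
)
print(summary)
```

A total can be large simply because a group contains many articles, so it may be worth checking the mean column before concluding that a given image count performs better per article.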

Next, using Python, I applied some machine learning to the dataset. First, to formulate a classification problem, I used the threshold of 1400 shares to create two classes: if the number of shares is greater than 1400, then the article is classified as popular; if the number of shares is less than or equal to 1400, then the article is classified as unpopular.

To prepare the CSV file, I created a new popularity column after the shares column using the IF function: if the number of shares is greater than 1400, the class is 4 (popular); otherwise, the class is 2 (unpopular).
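The same labeling step could also be done in pandas rather than with a spreadsheet formula. Here is a minimal sketch, assuming the shares column is named `shares`; the four rows are toy values chosen to show the behavior at the threshold.

```python
import pandas as pd

THRESHOLD = 1400  # share cutoff used in this article

# Toy rows for illustration only; `shares` is an assumed column name.
df = pd.DataFrame({"shares": [593, 1400, 1401, 5200]})

# Class 4 = popular (shares > 1400), class 2 = unpopular (shares <= 1400).
df["popularity"] = df["shares"].gt(THRESHOLD).map({True: 4, False: 2})
print(df)
```

Note that an article with exactly 1400 shares lands in the unpopular class, matching the rule above.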

Our goal is to create a machine learning model that will classify articles as popular or unpopular. To do this, I used the gradient boosting classifier.

First, I split the dataset into a training set and a test set, taking 80% of the dataset for training and 20% for testing.

We fit the gradient boosting classifier to our training set. Then, we apply the fitted gradient boosting classifier to our test set.

According to the confusion matrix obtained by comparing the predicted classes with the actual test classes, our model classified 5277 of the 7929 test articles correctly. This gives an accuracy of about 67%.
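As a quick check of that arithmetic, accuracy is the sum of the confusion matrix diagonal divided by the total number of predictions. The helper name `accuracy_from_confusion_matrix` and the toy matrix below are illustrative assumptions; only the totals 5277 and 7929 come from the results above.

```python
import numpy as np

def accuracy_from_confusion_matrix(cm):
    """Fraction of correct predictions: diagonal sum over total count."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum()

# Toy 2x2 matrix with made-up counts, just to show the calculation.
toy_cm = [[3, 1],
          [2, 4]]
toy_accuracy = accuracy_from_confusion_matrix(toy_cm)  # (3 + 4) / 10

# The totals reported above: 5277 correct out of 7929 test articles.
reported_accuracy = 5277 / 7929
print(f"{reported_accuracy:.1%}")
```

With the real matrix, passing `cm` from `confusion_matrix(Y_test, Y_pred)` to the helper should give the same ratio.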

Here is the Python code:

# Classification for Online News Popularity

# Importing the libraries
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('OnlineNewsPopularity(classification with nominal).csv')
X = dataset.iloc[:, 2:-2].values
Y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# Fitting Gradient Boosting to the Training set
from sklearn.ensemble import GradientBoostingClassifier
classifier = GradientBoostingClassifier()
classifier.fit(X_train, Y_train)

# Predicting the Test set results
Y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)

The dataset can be found here: http://archive.ics.uci.edu/ml/datasets/Online+News+Popularity

K. Fernandes, P. Vinagre and P. Cortez. A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News. Proceedings of the 17th EPIA 2015 — Portuguese Conference on Artificial Intelligence, September, Coimbra, Portugal.
