Analyzing and Predicting Consumer Engagement
We will use the Internet News and Consumer Engagement dataset from Kaggle to analyze consumer data and predict top articles and popularity scores.
Introduction
In this project, we will use the Internet News and Consumer Engagement dataset from Kaggle to predict top articles and popularity scores. We will explore the data to discover patterns through correlation, distribution, mean, and time series analysis. We will use both text regression and text classification models to predict the engagement score and top-article status from the title.
Text classification is common in the applications that we use on a daily basis. For example, email providers use text classification to filter spam out of your inbox. Another common use is in customer care, where sentiment analysis is used to differentiate bad reviews from good reviews (ADDI AI 2050). We are going to train our model on titles so that it can predict whether an article is a top article or not. Text regression is similar: we take vectorized text data and predict the popularity score, which is a decimal value.
Our key focus will be on an article title and how it affects other features.
Dataset
This dataset (source) is released under the Creative Commons CC0 1.0 Universal License; for more information, check the metadata information here. The dataset contains news articles collected from Sept. 3, 2019 until Oct. 4, 2019. It was afterwards enriched with Facebook engagement data, such as the number of shares, comments, and reactions.
- Sourceid column value indicates the publisher's unique identifier, usually presented as the lowercase source_name with spaces replaced by underscores.
- Source_name column value indicates publisher name.
- Author column value indicates article author. Some publishers do not share information about authors of their news, in this case usually source_name replaces that information.
- Title column value indicates headline of an article.
- Description column value indicates short article description usually visible in popups or recommendation boxes on the publisher’s website. This field is a version of the content column shortened to a few sentences.
- Url column value indicates URL (Uniform Resource Locator) for article located on the publisher website.
- Urltoimage column value indicates a URL to the main image associated with the article.
- Published_at column value indicates the exact date and time of publishing the article. Date and time are presented in UTC (+000) time format.
- Content column value indicates the unformatted content of the article. This field is truncated to 260 characters.
- Top_article column value indicates article listed as a top article on publisher website. This field can have only two values, 1 when the article is contained in the popular/top articles group and 0 otherwise.
Installing required packages
We will install vaderSentiment for sentiment analysis, wordcloud to display the most common words, lightgbm for the machine learning models, and imblearn for imbalanced classification.
Loading Required Libraries
Exploring Data
In this section, we will explore our data and visualize key features to make sense of consumer engagement.
Loading Data
An initial review shows that the dataset contains 10,437 records and 14 columns.
We will be focusing on:
- Title of article
- Source name
- Author
- Top Articles
- User Engagement features.
Missing Values
Using .isna().sum() we can count the missing values in each column. To make it fancy, we have converted our findings into a table with the NA percentage and a color gradient background. As we can observe, the content and author columns have the most missing values.
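As a rough illustration of the missing-value table described above, the sketch below builds it on a tiny made-up frame. The column names and values are placeholders, not the actual Kaggle data, and the styling call is only mentioned in a comment because it is meant for a notebook.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are invented).
df = pd.DataFrame({
    "title": ["A", "B", None, "D"],
    "author": [None, "Reuters", None, "BBC News"],
    "content": ["x", None, None, None],
})

na_count = df.isna().sum()  # missing values per column
na_table = pd.DataFrame({
    "na_count": na_count,
    "na_percent": (na_count / len(df) * 100).round(2),
}).sort_values("na_percent", ascending=False)

print(na_table)
# In a notebook, a gradient can be added with:
# na_table.style.background_gradient(cmap="Reds")
```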
Top Ten Author and Source Name
Using a seaborn bar plot, we have displayed the top 10 authors and source names. We can see that some of the authors' names are the same as publications/sources. The three leading sources are Reuters, BBC News, and Irish Times. The top three authors are The Associated Press, Reuters, and CBS News.
Top Article
Using a matplotlib pie chart, we can see that only 12 percent of the articles are top articles.
Engagement Boxplots
It was pretty hard to analyze the distribution of the engagement data, as it has extreme outliers. We can use sns.kdeplot and np.log1p to analyze each engagement column, but a better way is to use a boxplot with yscale set to symlog.
- Engagement Reaction count has a median of 1 and multiple outliers, with most values between 0 and 60.
- Engagement Comments count has a median of 0 and multiple outliers, with most values between 0 and 1.
- Engagement Share count has a median of 10 and multiple outliers, with most values between 0 and 50.
- Engagement Comments Plugin count has a median of 0 and multiple outliers, with a mean of 0.
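To show why a symlog axis helps with this kind of data, here is a minimal sketch on synthetic counts (the numbers are generated, not taken from the dataset). A symlog scale is linear near zero and logarithmic further out, so zero counts stay on the plot while extreme outliers are compressed.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic heavy-tailed counts: mostly small values plus a few huge outliers.
shares = np.concatenate([rng.poisson(10, 500), rng.integers(1_000, 50_000, 10)])

fig, ax = plt.subplots(figsize=(4, 5))
ax.boxplot(shares)
ax.set_yscale("symlog")  # linear around 0, logarithmic for large values
ax.set_ylabel("share count (symlog)")
fig.tight_layout()
```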
Comment Plugin
Let’s check out the comment plugin count, as it has the oddest data, with a mean and median of 0. As we can see, 99 percent of the data is zero and the rest are outliers between 1 and 15.
Clean title
Let’s clean our titles, as we will be using them in our machine learning model. The punctuation marks and capitalized words in our text data would make our model perform worse.
The clean_title function removes brackets, hyperlinks, punctuation, and words containing numbers.
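The original notebook cell is not shown here, so the following is only a sketch of what such a cleaning function might look like. It assumes that "brackets" means dropping square-bracketed segments, and it lowercases the text because capitalization was named as a problem earlier; both choices are assumptions.

```python
import re
import string


def clean_title(text: str) -> str:
    """Lowercase a title and strip brackets, links, punctuation, and words with digits."""
    text = str(text).lower()
    text = re.sub(r"\[.*?\]", "", text)                                # bracketed segments
    text = re.sub(r"https?://\S+|www\.\S+", "", text)                  # hyperlinks
    text = re.sub(r"[%s]" % re.escape(string.punctuation), "", text)   # punctuation
    text = re.sub(r"\w*\d\w*", "", text)                               # words containing digits
    return re.sub(r"\s+", " ", text).strip()                           # collapse whitespace


print(clean_title("Trump's 2020 plan [VIDEO] https://t.co/abc WINS!"))
# → trumps plan wins
```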
Adding new columns containing clean titles
Creating sentiment polarity
Using the VADER Sentiment Intensity Analyzer, we are going to extract scores from the clean titles and divide them into three categories: Positive, Negative, and Neutral.
Applying Compound score
Applying Sentiment
As we can see, we have added two columns to our dataset: the sentiment score and the emotion based on that score.
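The bucketing step can be sketched as below. The cutoffs of ±0.05 on the compound score follow the convention suggested in the VADER documentation; the article does not state which thresholds were used, so treat them as an assumption. The helper names are hypothetical.

```python
def sentiment_label(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to one of three categories."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


def compound_score(text: str) -> float:
    """Return VADER's compound score; needs the vaderSentiment package installed."""
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer().polarity_scores(text)["compound"]


print([sentiment_label(s) for s in (0.6, -0.4, 0.0)])
# → ['Positive', 'Negative', 'Neutral']
```

With a dataframe, the two helpers could be applied in sequence, for example `df["clean_title"].apply(compound_score).apply(sentiment_label)`, where the column name is assumed.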
Countplot on sentiment categories
News titles are mostly neutral or negative in sentiment.
Word Cloud
We will use the word cloud library to display the most common words used in both Title and Description.
The most common words are say, new, said, will, and Trump. We have used English stopwords to remove the common words present in every sentence.
Time series
We will plot consumer engagement by date from Sept. 3, 2019, until Oct. 3, 2019.
Splitting Date Time
We will split the date into day of week, month, and year, then add them to the dataframe.
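A minimal sketch of this split with pandas is shown below. The `published_at` column name follows the dataset description, while the new column names and the two sample timestamps are made up for illustration.

```python
import pandas as pd

# Two invented timestamps inside the collection window.
df = pd.DataFrame({"published_at": ["2019-09-03T10:15:00Z", "2019-10-01T23:59:00Z"]})
df["published_at"] = pd.to_datetime(df["published_at"], utc=True)

# The .dt accessor exposes the individual date parts.
df["day_of_week"] = df["published_at"].dt.day_name()
df["month"] = df["published_at"].dt.month
df["year"] = df["published_at"].dt.year

print(df[["day_of_week", "month", "year"]])
```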
Number of Engagements over the month
We use a seaborn line plot to display the consumer engagement pattern over the month. There is a spike in consumer engagement on 1st October, perhaps due to a major event. Other than that, there are smaller peaks in reaction engagement on 3rd, 7th, and 12th September.
Correlation Heatmap
There is a high positive correlation between reaction, comment, and share engagement: a consumer who likes a post is also likely to share and comment on it. There is no significant correlation between the engagement features and top articles, which suggests that top articles are selected on other criteria, such as quality, rather than on engagement.
Preprocessing
- Replace missing titles
- Convert titles into vectors
- Replace missing values in top_article
- Over sampling using SMOTE
- Creating popularity score
Tfidf Vectorizer
Our machine learning models understand only numerical values, so in order to train our models on text data we will convert it to a matrix of TF-IDF features using SKlearn.
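For readers unfamiliar with the step, here is a small sketch of turning titles into a TF-IDF matrix with scikit-learn. The three titles are invented, and the `stop_words="english"` setting is an assumption rather than the configuration used in the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented, already-cleaned titles.
titles = [
    "trump says new deal is close",
    "new study says coffee is good",
    "markets fall after trade talks stall",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(titles)  # sparse matrix: one row per title

print(X.shape)
print(sorted(vectorizer.vocabulary_)[:5])
```

The fitted vectorizer should be kept alongside the models, since new titles have to be transformed with the same vocabulary before prediction.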
Oversampling
Our top_article data is unbalanced, as only 12 percent of the values are 1's. In order to make our model perform better, we will use the oversampling method SMOTE (Synthetic Minority Over-sampling Technique). I have also tried other oversampling and undersampling methods, but SMOTE performed better.
Popularity Score
We will add all four engagement columns and then take np.log1p(X), which is equivalent to np.log(X+1). Many articles have zero engagement, and the logarithm of zero is negative infinity, so adding 1 before taking the logarithm avoids disaster.
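The calculation can be sketched as follows. The engagement column names are assumed from the Kaggle dataset and the three rows are invented; `np.expm1` is shown only to illustrate that the transform can be inverted.

```python
import numpy as np
import pandas as pd

# Assumed column names; the values are made up.
engagement_cols = [
    "engagement_reaction_count",
    "engagement_comment_count",
    "engagement_share_count",
    "engagement_comment_plugin_count",
]
df = pd.DataFrame(
    [[0, 0, 0, 0], [1, 0, 10, 0], [60, 1, 50, 0]], columns=engagement_cols
)

total = df[engagement_cols].sum(axis=1)
df["popularity_score"] = np.log1p(total)  # log(1 + total); zero engagement maps to 0

print(df["popularity_score"].round(3).tolist())
print(np.expm1(df["popularity_score"]).round().tolist())  # recovers the totals
```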
We can clearly see that the majority of the distribution lies between 0 and 3.
We can clearly see the top 2 popular publishing companies are The New York Times and CNN.
Text Classifier Model
Let’s build a model that will take titles and predict whether an article will become a top article or not.
- X : title
- y : top_article
Build Model
We experimented with SGD and Random Forest classifiers, but the light gradient boosting model performed best by far. After hyperparameter tuning, our model is ready to be trained on the dataset.
- Splitting into train and test.
- Stratify on top_article so the classes are equally distributed.
- Using LGBMClassifier with learning_rate=0.5, max_depth=20, num_leaves=50, n_estimators=120, max_bin=2000
- Cross Validation
- Training and Testing
- Confusion Matrix
Cross Validation
After cross-validation of our model, we can observe that our f1 score is quite stable. 0.9 is the best we could achieve.
Train/Test Model
After fitting our model on the training dataset, we can see that both the f1 and accuracy scores are above 0.9 on our test dataset.
Save model
Confusion Matrix
We have a few false positives and false negatives. In general, our LGBM model performed better than expected.
Text Regression Model
- X : title
- y : Popularity_score
We are going to use the vectorized title to predict the popularity score. I have experimented with logistic regression and random forest regressor, but Light Gradient Boosting performed well in our case.
- Filling missing values with 1.
- Train and Test split
- LGBMRegressor with learning_rate=0.01, max_depth=20, num_leaves=50, n_estimators=150
- Cross Validation
- Training and Testing
Train Test Split
Build Model
Validation Score
The model performed well: without hyperparameter tuning, and with logistic regression, the RMSE was above 8.
Training and Testing
After fitting the model on the training dataset, it seems that our model performs quite well on the test dataset too. Now we will use both models to build a function for title scoring.
Save model
Title Scoring
The title_score function takes a title, outputs the popularity score, and classifies whether it is a top article. It also reports the total engagement, which includes reactions, shares, and comments. We can use this function to create the best possible titles for our blogs.
Function Tasks:
- Clean the text
- Vectorize the text
- Predict top article and popularity score
- Print top article, the popularity score, and total engagement.
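The tasks above could be wired together roughly as follows. This is a sketch, not the notebook's function: it passes the fitted objects in as arguments, uses a simplified cleaner, trains on four invented titles, and substitutes LogisticRegression and Ridge for the LightGBM models to keep the example light. Total engagement is derived with `np.expm1` on the assumption that the popularity score is the `log1p` of summed engagement.

```python
import re
import string

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Ridge


def clean_title(text):
    """Simplified cleaner: lowercase, drop punctuation and words with digits."""
    text = re.sub(r"[%s]" % re.escape(string.punctuation), "", str(text).lower())
    return re.sub(r"\s+", " ", re.sub(r"\w*\d\w*", "", text)).strip()


def title_score(title, vectorizer, classifier, regressor):
    """Clean and vectorize a title, then predict top-article flag and popularity."""
    features = vectorizer.transform([clean_title(title)])
    is_top = bool(classifier.predict(features)[0])
    popularity = float(regressor.predict(features)[0])
    # Assumes popularity = log1p(total engagement), so expm1 inverts it.
    total_engagement = max(float(np.expm1(popularity)), 0.0)
    print(f"Top article: {is_top}")
    print(f"Popularity score: {popularity:.2f}")
    print(f"Total engagement: {total_engagement:.0f}")
    return is_top, popularity, total_engagement


# Invented training data; the two linear models are stand-ins for LightGBM.
titles = ["trump says new deal", "local bakery opens",
          "markets fall sharply", "cat show this weekend"]
top = [1, 0, 1, 0]
score = [3.0, 0.5, 2.5, 0.2]

vec = TfidfVectorizer().fit(titles)
X = vec.transform(titles)
clf = LogisticRegression().fit(X, top)
reg = Ridge().fit(X, score)

result = title_score("Trump says markets fall", vec, clf, reg)
```

Passing the vectorizer and models explicitly keeps the function easy to test and makes it clear that the same fitted vectorizer must be reused at prediction time.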
Testing
It’s time to test both our function and our predictive models. We will first test a random title from our dataset and then use titles from today’s news to determine the score.
As we can clearly see our function and both models work.
Latest News
Testing on random latest News
Testing on Top News
As we can see, it got a high popularity score.
We have tested again on a top article from the news and, as we can see, it is a top article according to our model.
Conclusion
We had fun exploring the data and playing around with different machine learning techniques and models. In short, we have explored our data and presented information that can help news agencies create better content that gets high traction from consumers. We have developed machine learning models that will help writers and bloggers write better titles. We have also discovered that high popularity doesn’t mean that an article is going to be a top article.
For future work, I would like to explore multiple clusters within the data and create a model using images of the article to predict popularity scores. I have experimented with deep learning text generation models but due to memory constraints, I was limited to simple tabular models.
To learn more about Data Analysis, Natural Language Processing, and Machine Learning in general, I suggest you take a DataCamp course and practice the exercises on your own.
Code
Code is available at:
Author
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models and research on the latest AI technologies. Currently testing AI Products at PEC-PITC, their work later gets approved for human trials, such as the Breast Cancer Classifier.