Analyzing and Predicting Consumer Engagement

We will use the Internet News and Consumer Engagement dataset from Kaggle to analyze consumer data and predict the top article and popularity score.

Abid Ali Awan
Towards Data Science



Introduction

In this project, we will use the Internet News and Consumer Engagement dataset from Kaggle to predict the top article and the popularity score. We will explore the data to discover patterns through correlation, distribution, mean, and time series analysis, and we will use both text regression and text classification models to predict the engagement score and the top article from the title.

Text classification is common in the applications we use on a daily basis. For example, email providers use text classification to filter spam out of your inbox. Another common use is in customer care, where sentiment analysis separates bad reviews from good ones (ADDI AI 2050). We are going to train our model on titles so that it can predict whether an article is a top article or not. Text regression is similar: we take vectorized text data and predict the popularity score, which is a decimal value.

Our key focus will be on the article title and how it affects the other features.

Dataset

This dataset (source) is under the Creative Commons CC0 1.0 Universal License; for more information, check the metadata information here. The dataset covers news articles collected from Sept. 3, 2019 until Oct. 4, 2019, enriched with Facebook engagement data such as the number of shares, comments, and reactions.

  • source_id: the publisher's unique identifier, usually the lowercase source_name with spaces replaced by underscores.
  • source_name: the publisher's name.
  • author: the article's author. Some publishers do not share author information; in that case, source_name usually takes its place.
  • title: the headline of the article.
  • description: a short description of the article, usually visible in popups or recommendation boxes on the publisher's website. It is a few-sentence, shortened version of the content column.
  • url: the URL (Uniform Resource Locator) of the article on the publisher's website.
  • url_to_image: a URL to the main image associated with the article.
  • published_at: the exact date and time the article was published, in UTC (+000) format.
  • content: the unformatted content of the article, truncated to 260 characters.
  • top_article: whether the article is listed as a top article on the publisher's website. It takes only two values: 1 when the article is in the popular/top articles group and 0 otherwise.

Installing required packages

We will install vaderSentiment for sentiment analysis, wordcloud to display the most common words, lightgbm for the machine learning models, and imblearn for imbalanced classification.
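A minimal install cell for a Jupyter notebook (imblearn is published on PyPI as imbalanced-learn):

```python
# Install the extra packages used in this project
!pip install vaderSentiment wordcloud lightgbm imbalanced-learn
```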

Loading Required Libraries
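The imports below cover everything used in the rest of the walkthrough; a sketch, assuming the packages installed above:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from wordcloud import WordCloud, STOPWORDS
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import f1_score, accuracy_score, mean_squared_error

from imblearn.over_sampling import SMOTE
from lightgbm import LGBMClassifier, LGBMRegressor
```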

Exploring Data

In this section, we will explore our data and visualize key features to make sense of consumer engagement.

Loading Data

An initial review shows that the dataset contains 10,437 records and 14 columns.
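A minimal loading sketch; the CSV file name is an assumption, so adjust it to match your Kaggle download:

```python
# "news_articles.csv" is an assumed file name for the Kaggle download
df = pd.read_csv("news_articles.csv")
print(df.shape)  # expected: (10437, 14)
```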

We will be focusing on:

  • Title of article
  • Source name
  • Author
  • Top Articles
  • User Engagement features.

Missing Values

Using .isna().sum() we can check each column for missing values. To make the output easier to read, we converted the findings into a table with the NA percentage and a color-gradient background. We can observe that the content and author columns have the most missing values.
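A sketch of that table in plain pandas:

```python
# Missing values per column, as counts and percentages, with a color gradient
na = df.isna().sum().to_frame("missing")
na["percent"] = (na["missing"] / len(df) * 100).round(2)
na.sort_values("percent", ascending=False).style.background_gradient(cmap="Reds")
```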

Top Ten Authors and Source Names

Using a seaborn bar plot, we display the top 10 authors and source names. We can see that some author names are the same as publications/sources. The top three sources are Reuters, BBC News, and Irish Times, and the top three authors are The Associated Press, Reuters, and CBS News.
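A sketch of the two bar plots:

```python
# Top 10 source names and authors by article count, side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
top_sources = df["source_name"].value_counts().head(10)
top_authors = df["author"].value_counts().head(10)
sns.barplot(x=top_sources.values, y=top_sources.index, ax=axes[0])
axes[0].set_title("Top 10 Source Names")
sns.barplot(x=top_authors.values, y=top_authors.index, ax=axes[1])
axes[1].set_title("Top 10 Authors")
plt.tight_layout()
plt.show()
```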

Top Article

Using a matplotlib pie chart, we can see that only 12 percent of articles are top articles.
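A minimal sketch of the chart:

```python
# Share of top articles (1) versus the rest (0)
counts = df["top_article"].value_counts()
plt.pie(counts, labels=["Not top", "Top"], autopct="%1.0f%%")
plt.title("Top Articles")
plt.show()
```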

Engagement Boxplots

It was quite hard to analyze the distribution of the engagement data, as it contains extreme outliers. We could use sns.kdeplot with np.log1p to analyze each engagement column, but a better way is a boxplot with the y-scale set to symlog, as sketched after the list below.

  • Engagement reaction count has a median of 1, with the bulk of values between 0 and 60 and many outliers above that.
  • Engagement comment count has a median of 0, with the bulk of values between 0 and 1 and many outliers above that.
  • Engagement share count has a median of 10, with the bulk of values between 0 and 50 and many outliers above that.
  • Engagement comment plugin count has a median of 0, with almost every value at 0 and only a handful of outliers.
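A sketch of the boxplots; the engagement column names are assumptions based on the dataset description:

```python
# The four Facebook engagement columns (names assumed from the dataset)
eng_cols = [
    "engagement_reaction_count",
    "engagement_comment_count",
    "engagement_share_count",
    "engagement_comment_plugin_count",
]

fig, ax = plt.subplots(figsize=(10, 5))
sns.boxplot(data=df[eng_cols], ax=ax)
ax.set_yscale("symlog")  # symlog tolerates the many zeros that a plain log scale cannot
plt.xticks(rotation=30)
plt.show()
```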

Comment Plugin

Let’s check out the comment plugin column, as it has the oddest data, with both mean and median at 0. As we can see, 99 percent of the values are zero and the rest are outliers between 1 and 15.

Clean title

Let’s clean our titles, since we will be using them in our machine learning models. The punctuation marks and capitalized words in the text would make our models perform worse.

The clean_title function removes brackets, hyperlinks, punctuation, and words containing numbers.
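A sketch of such a function, using the usual regex-based cleaning recipe:

```python
import re
import string

def clean_title(text):
    """Lowercase text and strip brackets, hyperlinks, punctuation, and words with digits."""
    text = str(text).lower()
    text = re.sub(r"\[.*?\]", "", text)                # bracketed text
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # hyperlinks
    text = re.sub(r"[%s]" % re.escape(string.punctuation), "", text)  # punctuation
    text = re.sub(r"\w*\d\w*", "", text)               # words containing numbers
    return text
```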

Adding new columns containing clean titles
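Applying it over the title column (missing titles become empty strings):

```python
df["clean_title"] = df["title"].fillna("").apply(clean_title)
```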

Creating sentiment polarity

Using the VADER Sentiment Intensity Analyzer, we will extract a score from each clean title and divide the results into three categories: Positive, Negative, and Neutral.

Applying Compound score

Applying Sentiment
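A sketch covering both steps, the compound score and the label; the ±0.05 cut-offs are the usual VADER convention:

```python
analyzer = SentimentIntensityAnalyzer()

# Compound score in [-1, 1] for each cleaned title
df["compound"] = df["clean_title"].apply(
    lambda t: analyzer.polarity_scores(t)["compound"]
)

# Map the compound score onto three sentiment categories
def to_sentiment(score):
    if score >= 0.05:
        return "Positive"
    if score <= -0.05:
        return "Negative"
    return "Neutral"

df["sentiment"] = df["compound"].apply(to_sentiment)
```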

As we can see, we have added two columns to our dataset: the sentiment score and the sentiment label based on that score.

Countplot on sentiment categories

The news titles are mostly neutral in sentiment, followed by negative ones.

Word Cloud

We will use the wordcloud library to display the most common words used in both the title and the description.
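A sketch for the title cloud; the same call works on the description column:

```python
text = " ".join(df["clean_title"].dropna())
wc = WordCloud(stopwords=STOPWORDS, background_color="white",
               width=800, height=400).generate(text)
plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```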

The most common words are say, new, said, will, and Trump. We used English stopwords to remove the common words that appear in almost every sentence.

Time series

We will plot consumer engagement by date, from Sept. 3, 2019 until Oct. 3, 2019.

Splitting Date Time

We will split the date into day of the week, month, and year, then add these as new columns to the dataframe.
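A sketch of the split:

```python
# Parse the UTC timestamp, then pull out the parts used for grouping
df["published_at"] = pd.to_datetime(df["published_at"], errors="coerce")
df["day"] = df["published_at"].dt.day
df["month"] = df["published_at"].dt.month
df["year"] = df["published_at"].dt.year
df["day_of_week"] = df["published_at"].dt.day_name()
```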

Number of Engagements over the Month

We use a seaborn line plot to display the consumer engagement pattern over the month. There is a spike in consumer engagement on October 1st, perhaps due to a major event. Other than that, there are smaller peaks in reaction engagement on September 3rd, 7th, and 12th.
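A sketch of the plot, resampling to daily totals (a seaborn line plot over the melted frame gives the same picture):

```python
# Daily totals for each engagement column over the collection window
daily = df.set_index("published_at")[eng_cols].resample("D").sum()
daily.plot(figsize=(12, 5))
plt.ylabel("Engagement count")
plt.show()
```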

Correlation Heatmap
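A minimal sketch of the heatmap, reusing the eng_cols list from the boxplot section:

```python
corr = df[eng_cols + ["top_article"]].corr()
plt.figure(figsize=(7, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```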

There is a high positive correlation between reaction, comment, and share engagement: a consumer who likes a post is also likely to share it and comment on it. There is no other significant correlation between engagement and top articles, which suggests that the selection of top articles is based on quality rather than raw engagement.

Preprocessing

  • Replace missing titles
  • Convert titles into vectors
  • Replace missing values in top_article
  • Over sampling using SMOTE
  • Creating popularity score

Tfidf Vectorizer

Our machine learning models understand only numerical values, so to train them on text data we convert it into a matrix of TF-IDF features using scikit-learn's TfidfVectorizer.
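A sketch; max_features is an assumed cap to keep the matrix manageable:

```python
vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(df["clean_title"].fillna(""))
print(X.shape)
```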

Oversampling

Our top_article data is imbalanced, as only 12 percent of the labels are 1s. To make our model perform better, we will use the oversampling method SMOTE (Synthetic Minority Over-sampling Technique). I also tried other oversampling and undersampling methods, but SMOTE performed best.
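A sketch of the resampling step; missing top_article values are filled with 0 first, per the preprocessing list:

```python
y = df["top_article"].fillna(0).astype(int)
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
```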

Popularity Score

We will add all four engagement columns and then take np.log1p(X), which is equivalent to np.log(X + 1). Taking the plain log of the many zero-engagement rows would produce negative infinity, so the +1 inside log1p avoids that.
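A one-liner, reusing the eng_cols list from the boxplot section:

```python
# Sum the four engagement columns, then compress the scale with log1p
df["popularity_score"] = np.log1p(df[eng_cols].fillna(0).sum(axis=1))
```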

We can clearly see that the majority of the distribution lies between 0 and 3.

We can also see that the top two publishers by popularity are The New York Times and CNN.

Text Classification Model

Let’s build a model that takes a title and predicts whether the article will become a top article or not.

  • X : title
  • y : top_article

Build Model

We experimented with SGD and random forest classifiers, but the light gradient boosting model performed best by far. After hyperparameter tuning, our model was ready to be trained on the dataset; a sketch follows the list below.

  • Splitting into train and test.
  • Stratify on top_article so the classes are equally distributed.
  • Using LGBMClassifier
  • learning_rate=0.5,
  • max_depth=20,
  • num_leaves=50,
  • n_estimators=120,
  • max_bin=2000
  • Cross Validation
  • Training and Testing
  • Confusion Matrix
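A sketch of the split and the tuned classifier; the 80/20 ratio and random_state are assumptions:

```python
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, stratify=y_res, random_state=42
)
clf = LGBMClassifier(
    learning_rate=0.5,
    max_depth=20,
    num_leaves=50,
    n_estimators=120,
    max_bin=2000,
)
```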

Cross Validation
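A sketch; the fold count is an assumption:

```python
scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="f1")
print("f1 per fold:", scores.round(3), "mean:", scores.mean().round(3))
```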

After cross-validating our model, we can observe that the f1 score is quite stable; 0.9 is the best we could achieve.

Train/Test Model
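Fitting and scoring, as a sketch:

```python
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print("f1:", round(f1_score(y_test, pred), 3))
print("accuracy:", round(accuracy_score(y_test, pred), 3))
```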

After fitting our model on the training dataset, we can see that both the f1 and accuracy scores are above 0.9 on the test dataset.

Save model
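A minimal save with joblib (the file name is arbitrary):

```python
import joblib

joblib.dump(clf, "lgbm_classifier.pkl")
```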

Confusion Matrix
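A sketch using scikit-learn's display helper:

```python
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(y_test, pred)
plt.show()
```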

We have a few false positives and false negatives. Overall, our LGBM model performed better than expected.

Text Regression Model

  • X : title
  • y : Popularity_score

We are going to use the vectorized titles to predict the popularity score. I experimented with linear regression and a random forest regressor, but light gradient boosting performed best in our case.

  • Filling missing values with 1.
  • Train and Test split
  • LGBMRegressor
  • learning_rate=0.01,
  • max_depth=20,
  • num_leaves=50,
  • n_estimators=150
  • Cross Validation
  • Training and Testing

Train Test Split
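A sketch; note the regressor is trained on the original (not SMOTE-resampled) title vectors:

```python
yr = df["popularity_score"].fillna(1)  # fill missing values with 1, per the list above
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X, yr, test_size=0.2, random_state=42
)
```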

Build Model
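The tuned regressor from the list above, as a sketch:

```python
reg = LGBMRegressor(
    learning_rate=0.01,
    max_depth=20,
    num_leaves=50,
    n_estimators=150,
)
```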

Validation Score
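A sketch of the cross-validated RMSE; scikit-learn reports negated MSE, so we flip the sign:

```python
mse = -cross_val_score(reg, Xr_train, yr_train, cv=5,
                       scoring="neg_mean_squared_error")
print("RMSE per fold:", np.sqrt(mse).round(3))
```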

The model performed well: without hyperparameter tuning, using linear regression, the score was above 8 RMSE.

Training and Testing
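Fitting and scoring on the held-out set, as a sketch:

```python
reg.fit(Xr_train, yr_train)
rmse = np.sqrt(mean_squared_error(yr_test, reg.predict(Xr_test)))
print("Test RMSE:", round(rmse, 3))
```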

After fitting the model on the training dataset, it performs quite well on the test dataset too. Now we will use both models to build a title-scoring function.

Save model

Title Scoring

The title_score function takes a title and outputs the popularity score, the top-article classification, and the total engagement, which includes reactions, shares, and comments. We can use this function to craft the best possible titles for our blogs; a sketch follows the task list below.

Function Tasks:

  • Clean the text
  • Vectorize the text
  • Predict the top article and the popularity score
  • Print the top-article prediction, the popularity score, and the total engagement
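A sketch of the function, wiring together the pieces defined above (clean_title, vectorizer, clf, and reg):

```python
def title_score(title):
    """Clean and vectorize a raw title, then report both models' predictions."""
    vec = vectorizer.transform([clean_title(title)])
    top = clf.predict(vec)[0]
    score = reg.predict(vec)[0]
    total_engagement = int(np.expm1(score))  # invert the np.log1p used for the score
    print("Top article:", "Yes" if top == 1 else "No")
    print("Popularity score:", round(score, 2))
    print("Approximate total engagement:", total_engagement)
```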

Testing

It’s time to test both our function and our predictive models. We will first test a random title from within our dataset and then use titles from today’s news to determine the score.
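For example, with a made-up headline (the string below is purely hypothetical):

```python
title_score("Global markets rally as inflation cools")
```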

As we can see, our function and both models work.

Latest News

Testing on Random Latest News

Testing on Top News

As we can see, it got a high popularity score.

We tested again on a top article from the news, and, as we can see, our model classifies it as a top article.

Conclusion

We had fun exploring the data and playing around with different machine learning techniques and models. In short, we explored the data and surfaced insights that can help news agencies create content that gets more traction from consumers. We developed machine learning models that can help writers and bloggers write better titles. We also discovered that high popularity doesn't mean an article will become a top article.

For future work, I would like to explore clusters within the data and build a model that uses article images to predict popularity scores. I experimented with deep learning text-generation models, but due to memory constraints I was limited to simpler tabular models.

To learn more about data analysis, natural language processing, and machine learning in general, I suggest taking a DataCamp course and practicing the exercises on your own.

Code

Code is available at:

Author

Abid Ali Awan (@1abidaliawan) is a certified data scientist who loves building machine learning models and researching the latest AI technologies. He is currently testing AI products at PEC-PITC, where his work later gets approved for human trials, such as the Breast Cancer Classifier.
