Imaginary scene: I am a data scientist working at an online news company. The marketing manager wants to spend a budget promoting articles with high potential, so she asks me to build a predictive model to predict whether an article will get a good number of shares.
The articles in this dataset were published by Mashable (www.mashable.com), and their content, as well as the rights to reproduce it, belongs to them. Hence, this dataset does not share the original content, only some statistics associated with it. The original content can be publicly accessed and retrieved using the URLs provided inside the dataset. (Feel free to download the dataset from Kaggle.)
It contains around 40k rows and 61 columns (58 predictive attributes, 2 non-predictive attributes, and 1 goal field).
After assessing the data, I found this dataset is pretty clean and tidy, with zero missing values. The only issue is that every column name begins with a space, except for "url". (See the column names below.)
[Image by author: the dataset's column names, each starting with a space except "url"]
Step 2: Data Cleaning
Therefore, I used the following code to address the above problem.
The lambda function runs through every column name and, wherever there is a " " in it, replaces it with "".
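The original gist isn't reproduced here, but a minimal sketch of that cleaning step might look like this (the CSV file name is an assumption; adjust it to your local copy):

```python
import pandas as pd

# Load the dataset (file name is an assumption)
df = pd.read_csv("OnlineNewsPopularity.csv")

# Run through every column name and replace each " " with ""
df = df.rename(columns=lambda name: name.replace(" ", ""))
```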
Step 3: Exploratory Data Analysis
In the EDA phase, I found that most articles get fewer than 15000 shares, so I filtered out the articles with more than 15000 shares to focus on the majority. I then plotted the empirical cumulative distribution function (ECDF) and the distribution of the number of shares to see how I could classify the number of shares into different groups, so that I could build a classification model to predict which group an article belongs to.
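A rough sketch of those two plots, assuming the cleaned df from above, could be:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Keep the majority of articles (at most 15000 shares)
majority = df[df["shares"] <= 15000]

# ECDF: each sorted value plotted against its cumulative fraction
x = np.sort(majority["shares"])
y = np.arange(1, len(x) + 1) / len(x)
plt.plot(x, y, marker=".", linestyle="none")
plt.xlabel("shares")
plt.ylabel("ECDF")
plt.show()

# Distribution of the number of shares
sns.histplot(majority["shares"], bins=50)
plt.show()
```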
[Images by author: ECDF and distribution of the number of shares]
After seeing both the ECDF and the distribution, it seems that most articles get between 500 and 3000 shares. Therefore, I decided to divide the number of shares into three levels (a sketch of the binning code follows the list below):
Extremely Bad: The number of shares is lower than 500.
Majority: The number of shares is between 500 and 3000.
Extremely Good: The number of shares is higher than 3000.
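A minimal sketch of the binning with pd.cut, assuming a new share_level column (the exact boundary handling is an assumption):

```python
# Bin shares into the three levels described above
bins = [0, 500, 3000, df["shares"].max()]
labels = ["Extremely Bad", "Majority", "Extremely Good"]
df["share_level"] = pd.cut(df["shares"], bins=bins, labels=labels)

# Percentage of each level in the dataset
print(df["share_level"].value_counts(normalize=True) * 100)
```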
[Image by author: the percentage of each level in the dataset]
Secondly, I found that the original dataset contains two features, weekday and data_channel, that are already one-hot encoded.
[Images by author: the data_channel and weekday columns, already one-hot encoded]
One-hot encoding categorical variables with high cardinality can make tree-based ensembles inefficient: the algorithm tends to give continuous variables more importance than the dummy variables, which obscures the order of feature importance and results in poorer performance (Ravi, "One-Hot Encoding is making your Tree-Based Ensembles worse, here’s why?").
However, in the last step of this project, optimizing the accuracy, I will test whether skipping one-hot encoding really performs better. So here, to prepare for that test, I reverted all the one-hot encoded variables, the weekday-related and data_channel-related columns, back to single columns.
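One way to sketch that reversal, assuming the one-hot columns follow the dataset's weekday_is_* and data_channel_is_* naming scheme:

```python
# Column prefixes are an assumption based on the dataset's naming scheme
weekday_cols = [c for c in df.columns if c.startswith("weekday_is_")]
channel_cols = [c for c in df.columns if c.startswith("data_channel_is_")]

# idxmax returns, for each row, the name of the column holding the 1
df["weekday"] = df[weekday_cols].idxmax(axis=1).str.replace("weekday_is_", "")
df["data_channel"] = df[channel_cols].idxmax(axis=1).str.replace("data_channel_is_", "")

# Drop the original dummy columns
df = df.drop(columns=weekday_cols + channel_cols)
```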
The converted versions of the weekday and data_channel variables look like this:
[Image by author: the converted weekday and data_channel columns]
Step 4: Random Forest Model Building
We come to the most important part: building the model. The model-building process has a few mini-steps that must be followed in order to perform it correctly (a minimal sketch follows the list):
Convert the categorical columns into numeric ones.
Split the dataset into a training set (70%) and a test set (30%).
Build and test the model.
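A minimal sketch of those three mini-steps, assuming url and timedelta are the two non-predictive columns and share_level is the goal field:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Convert the categorical columns into numeric codes
X = df.drop(columns=["url", "timedelta", "shares", "share_level"])
for col in ["weekday", "data_channel"]:
    X[col] = X[col].astype("category").cat.codes
y = df["share_level"]

# 2. Split into a training set (70%) and a test set (30%)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Build and test the model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```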
Beautiful! The accuracy is actually higher than I expected!
Next, I tried different values of n_estimators to see how many trees we should use.
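A sketch of that sweep, reusing the split from above (the exact range of tree counts is an assumption):

```python
import matplotlib.pyplot as plt

# Accuracy across a range of tree counts
tree_counts = range(10, 210, 10)
scores = []
for n in tree_counts:
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test)))

plt.plot(tree_counts, scores)
plt.xlabel("n_estimators")
plt.ylabel("accuracy")
plt.show()
```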
[Image by author: accuracy vs. number of trees]
This graph shows that we get an accuracy of ~76% once we use more than 60 trees (n_estimators = 60).
Step 5: Optimizing the accuracy rate
For optimizing the accuracy rate, I have three directions to explore:
Which gives higher accuracy: using one-hot encoding or not using it?
Which variables are least important to the dependent variable? Drop them to see if the accuracy rate increases.
Which model gets the best result: Random Forest Classifier, Linear Support Vector Classifier, or RBF Support Vector Classifier?
[Image by author: accuracy with vs. without one-hot encoding across n_estimators]
For the first direction, after calculating the accuracies at different n_estimators values for each case, we can see that the version without one-hot encoding scored nearly 1% higher than the other. Therefore, I decided to keep not using one-hot encoding.
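The comparison itself can be sketched roughly like this, rebuilding a one-hot version of the features with pd.get_dummies:

```python
# Compare the two encodings: integer codes (X) vs. one-hot dummies
X_onehot = pd.get_dummies(X, columns=["weekday", "data_channel"])

for name, features in [("integer codes", X), ("one-hot", X_onehot)]:
    Xtr, Xte, ytr, yte = train_test_split(
        features, y, test_size=0.3, random_state=42
    )
    model = RandomForestClassifier(n_estimators=60, random_state=42)
    model.fit(Xtr, ytr)
    print(name, accuracy_score(yte, model.predict(Xte)))
```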
[Image by author: the 10 least important variables; "n_non_stop_words" has the lowest feature importance score]
For direction 2, after getting the feature importance scores, only one variable has an obviously low score: "n_non_stop_words". The graph below compares the accuracy after dropping "n_non_stop_words" with the accuracy before dropping it.
[Image by author: accuracy before vs. after dropping "n_non_stop_words"]
Weirdly, removing "n_non_stop_words" didn’t increase the accuracy. Instead, it decreased by nearly 1.5%. Therefore, I decided not to remove it.
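For reference, a rough sketch of this check, ranking the importances from the fitted model above and re-testing without the weakest feature:

```python
# Rank features by importance and show the 10 least important
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values().head(10))

# Drop the weakest feature and re-test
X_dropped = X.drop(columns=["n_non_stop_words"])
Xtr, Xte, ytr, yte = train_test_split(
    X_dropped, y, test_size=0.3, random_state=42
)
model = RandomForestClassifier(n_estimators=60, random_state=42)
model.fit(Xtr, ytr)
print(accuracy_score(yte, model.predict(Xte)))
```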
For the last direction: which model performs best, the Random Forest Classifier, the Linear Support Vector Classifier, or the RBF Support Vector Classifier?
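A sketch of that comparison; note that the StandardScaler step for the two SVMs is my assumption, since SVMs are scale-sensitive and the original code isn't shown here:

```python
from sklearn.svm import LinearSVC, SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Fit all three candidates on the same split and compare test accuracy
models = {
    "Random Forest": RandomForestClassifier(n_estimators=60, random_state=42),
    "Linear SVC": make_pipeline(StandardScaler(), LinearSVC(random_state=42)),
    "RBF SVC": make_pipeline(StandardScaler(), SVC(kernel="rbf", random_state=42)),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```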
[Image by author: accuracy comparison of the three models]
After about 5 minutes of computing the accuracies of the three models, the Random Forest Classifier won with an accuracy of 76%!
Thank you for reading to the end! If you are interested in the full code of this project, please check out my GitHub. I also love feedback: if any part is unclear or could be done better, please reach out to me. You can find me on LinkedIn or contact me by email ([email protected]).