Machine Learning Classification with Python for Direct Marketing
How can a business become more time-efficient, cut costs and drive up sales? The question is timeless but not rhetorical. In the next few minutes of your reading time, I will apply a few classification algorithms to demonstrate how a data-analytic approach can contribute to that end. Together we’ll create a predictive model that will help us customise the client databases we hand over to the telemarketing team, so that they can concentrate resources on the more promising clients first.
Along the way, we’ll perform a number of actions on the dataset. First, we’ll clean, encode, standardize and resample the data. Once the data is ready, we’ll train four different classifiers on the training subset, make predictions, compute the F1 score to select the best model, and visualise its predictions with a confusion matrix. These steps are summarised in the following schema:
The dataset we’ll be using here is well known, and you have probably come across it before. The data sample of 41,188 records was collected by a Portuguese bank between 2008 and 2013 and contains the results of a telemarketing campaign, including each customer’s response to the bank’s offer of a deposit contract (the binary target variable ‘y’). That response is exactly what we are going to predict with the model. The dataset is available from the UCI Machine Learning Repository of the University of California, Irvine. So let’s get started!
Data Cleaning, Feature Selection, Feature Transformation
“I like to make a clean sweep of things”, Friedrich Nietzsche said. This is the data cleaning part!
With the df.isnull().sum() query we make sure there are no missing values in the dataset (if there were any, df.dropna(subset=['feature_name'], inplace=True) would drop the rows in which the respective column is empty).
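As a minimal sketch of that check, here is the same pair of calls on a tiny made-up frame. The column values are hypothetical stand-ins, not the bank data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the bank dataset
df = pd.DataFrame({
    "age": [34.0, 51.0, np.nan, 42.0],
    "job": ["admin.", "technician", "services", "admin."],
    "y": ["no", "yes", "no", "no"],
})

# Count the missing values in every column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Drop only the rows where 'age' is missing
df_clean = df.dropna(subset=["age"])
print(len(df), "rows before,", len(df_clean), "rows after")
```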
In this example, I used the Tableau Prep Builder data cleaning tool to trace and drop outliers, to make sure values in numerical features aren’t strings, to rename some columns and to get rid of a few irrelevant ones, i.e. 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous' and 'poutcome' (these columns describe a telephone call that has happened already, therefore they should not be used in our predictive model).
Next we shall transform the non-numerical labels of the categorical variables to numerical ones and convert them to integers. We do it like this:
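One common way to do this is scikit-learn's LabelEncoder. The sketch below assumes that approach and uses a small made-up frame; the list of categorical columns is illustrative only:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame; the real dataset has more columns
df = pd.DataFrame({
    "job": ["admin.", "technician", "admin.", "services"],
    "marital": ["married", "single", "single", "divorced"],
    "y": ["no", "yes", "no", "no"],
})

# Assumed list of categorical columns to encode
categorical_cols = ["job", "marital", "y"]

encoders = {}
for col in categorical_cols:
    # Keep one encoder per column so labels can be decoded later
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

print(df.dtypes)
print(df)
```

LabelEncoder assigns integers in sorted label order, so here 'no' becomes 0 and 'yes' becomes 1.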
It makes sense to run the df.dtypes query just to ensure that the labels have turned into integers. We then apply the sklearn.preprocessing toolbox to standardize the numerical values of the other features that we expect to use in the model. The method standardizes features by removing the mean and scaling to unit variance:
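The description matches StandardScaler from sklearn.preprocessing, so the sketch below assumes that class. The column names and values are placeholders chosen for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical features
df = pd.DataFrame({
    "age": [25.0, 35.0, 45.0, 55.0],
    "euribor3m": [4.9, 4.1, 1.3, 0.7],
})
numeric_cols = ["age", "euribor3m"]

# Fit the scaler and build a new frame with the standardized values
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(df[numeric_cols]),
    columns=numeric_cols,
    index=df.index,
)

# Each column should now have mean ~0 and (population) std ~1
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```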
Let’s now rank the features of the dataset with the recursive feature elimination (RFE) method, using a Random Forest classifier as its estimator:
Num Features: 6
Selected Features: [ True True False True False False True False True True]
Feature Ranking: [1 1 3 1 2 5 1 4 1 1]
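Output in this shape could come from a call along the following lines. This is a sketch on synthetic data generated with make_classification, so the selected features will differ from those listed above; the number of trees and the random seeds are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for the ten candidate features
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)

# RFE repeatedly fits the estimator and drops the weakest feature
estimator = RandomForestClassifier(n_estimators=50, random_state=0)
rfe = RFE(estimator=estimator, n_features_to_select=6)
fit = rfe.fit(X, y)

print("Num Features:", fit.n_features_)
print("Selected Features:", fit.support_)
print("Feature Ranking:", fit.ranking_)
```

A rank of 1 marks a selected feature; higher ranks were eliminated earlier.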
At a later stage, when we build the predictive model, we will make use of this feature ranking. We will experiment with the model, trying to find the combination of highest-ranked features that yields predictions with a satisfactory F1 score. Looking ahead, that optimal combination will be:
Class imbalance is a problem that often accompanies classification tasks such as detecting fraudulent credit card transactions or predicting the results of online campaigns. After executing the df['y'].value_counts() query, we see that the two classes of the variable ‘y’ are not represented equally in our dataset either. After data cleaning there are 35584 records belonging to the class ‘0’ and only 4517 records of the class ‘1’ in the target variable ‘y’. Prior to splitting the data into the training and testing samples, we should think of oversampling or undersampling the data.
To resample the data, let’s apply the SMOTE oversampling method from the imblearn.over_sampling toolbox (for this step you may need to install the imbalanced-learn package, which is imported as imblearn, with pip or conda first):
It is as simple as that. Now the data is balanced, with 35584 entries in each class:
Building a predictive model
Now that the data has been prepared, we are ready to train our model and make predictions. Let’s first split the data into the training and testing sets:
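A minimal sketch of the split with train_test_split follows. The 30% test share is an assumption, though it would be consistent with the 21,351 test records reported further down; the data here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the resampled features and target
X, y = make_classification(n_samples=1000, n_features=6, random_state=0)

# Hold out 30% of the rows for testing (assumed share)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
```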
We will try four classification algorithms, i.e. Logistic Regression, Support Vector Machine, Decision Tree and Random Forest, and then compute their F1 scores using a user-defined scorer function to choose the classifier with the highest score:
LogisticRegression F1 score = 0.71245481432799659
SVC F1 score = 0.753674137005029
DecisionTreeClassifier F1 score = 0.7013983920155255
RandomForestClassifier F1 score = 0.923286257213907
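Scores of this kind could be produced with a loop like the one sketched here. The body of the scorer helper is my guess at what such a function might do, the hyperparameters are defaults plus assumed seeds, and the data is synthetic, so the numbers will not match those above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)


def scorer(model, X_train, X_test, y_train, y_test):
    """Fit the model on the training set and return its F1 on the test set."""
    model.fit(X_train, y_train)
    return f1_score(y_test, model.predict(X_test))


models = [
    LogisticRegression(max_iter=1000),
    SVC(),
    DecisionTreeClassifier(random_state=0),
    RandomForestClassifier(random_state=0),
]

scores = {}
for model in models:
    name = type(model).__name__
    scores[name] = scorer(model, X_train, X_test, y_train, y_test)
    print(f"{name} F1 score = {scores[name]}")

# Pick the classifier with the highest F1 score
best = max(scores, key=scores.get)
print("Best model:", best)
```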
The F1 score is the harmonic mean of precision and recall. You can read about how to interpret the precision and recall scores in my post here.
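As a quick worked example, F1 = 2 · precision · recall / (precision + recall). Plugging in the rounded precision and recall that the Random Forest report further down shows for class ‘1’ (0.93 and 0.91) gives about 0.92. The helper name below is my own:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values taken from the classification report for class '1'
f1_class_1 = f1_from(0.93, 0.91)
print(round(f1_class_1, 2))  # 0.92
```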
Now, let’s print a full classification report with the precision and recall for the Random Forest algorithm which has demonstrated the highest F1 score:
              precision    recall  f1-score   support
           0       0.91      0.93      0.92     10594
           1       0.93      0.91      0.92     10757
   micro avg       0.92      0.92      0.92     21351
   macro avg       0.92      0.92      0.92     21351
weighted avg       0.92      0.92      0.92     21351
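A report in this layout is what sklearn.metrics.classification_report prints. Here is a minimal sketch on synthetic data with an assumed seed, so the figures will differ from those above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the forest and predict the held-out labels
model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall, F1 and support
report = classification_report(y_test, y_pred)
print(report)
```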
Finally, we can visualize the result with a confusion matrix:
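The matrix itself can be computed with sklearn.metrics.confusion_matrix. The sketch below uses synthetic data and prints the counts as a labelled table instead of drawing a plot; the row and column labels are my own:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
cm_table = pd.DataFrame(
    cm,
    index=["actual 0", "actual 1"],
    columns=["predicted 0", "predicted 1"],
)
print(cm_table)
```

For a graphical version, recent scikit-learn releases also provide ConfusionMatrixDisplay, which draws the same matrix with matplotlib.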
Great! We’ve cleaned and transformed the data, selected the most relevant features, chosen the best model and made predictions with a decent score. Now we have a model that should help us customise the client databases we hand over to the telemarketing team, so that they can focus their efforts first on the clients most likely to respond positively to the campaign.
Thank you for reading!