
Practical Machine Learning with Scikit-Learn

EDA, feature engineering and preprocessing, pipelines

Photo by Joshua Hoehne on Unsplash

Customer churn is an important issue for every business. While looking for ways to expand their customer portfolio, businesses also focus on keeping their existing customers. Thus, it is crucial to learn the reasons why existing customers churn (i.e. leave).

Churn prediction is a common task in predictive analytics. In this article, we will try to predict whether a customer will leave the credit card services of a bank. The dataset is available on Kaggle.

We will first try to understand the dataset and explore the relationships among variables. After that, we will create pipelines to transform features to an appropriate format for model training. In the final step, we will combine the feature transformation pipeline and a Machine Learning model in a new pipeline.

The first step is to read the dataset into a pandas dataframe.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='darkgrid')
churn = pd.read_csv("/content/BankChurners.csv", usecols=list(np.arange(1,21)))
print(churn.shape)
(10127, 20)

I have excluded the redundant columns in the dataset using the usecols parameter.

(image by author)

There are 20 columns. The screenshot above only includes 7 columns for demonstration purposes. We can view the entire list of columns with the "columns" attribute.
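For instance, the full list of column names can be printed with:

churn.columns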

We should always check for missing values in the dataframe.

churn.isna().sum().sum()
0

The isna function of Pandas returns True if a value is missing. We can apply the sum function to count the missing values in each column or in the entire dataframe. Since there are no missing values, we can move on.
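Dropping the second sum gives the number of missing values per column instead of the overall total:

churn.isna().sum()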


Target Variable

The first column (Attrition_Flag) is the target variable which indicates if a customer is attrited (i.e. churned or left the company).

churn['Attrition_Flag'].value_counts(normalize=True)
Existing Customer    0.83934
Attrited Customer    0.16066

About 16 percent of the customers churned, which I think is high. This is the reason why the bank is investigating the issue and trying to understand why these customers churned.

It is easier for analysis to have numbers instead of strings as the values of the target variable. Thus, we will replace the "Attrited Customer" values with 1 and "Existing Customer" with 0.

churn['Attrition_Flag'].replace({'Existing Customer':0, 'Attrited Customer':1}, inplace=True)
churn['Attrition_Flag'].value_counts(normalize=True)
0    0.83934
1    0.16066

Categorical Features

The dataset contains both categorical and numerical features. We will first try to explore the relationship between the categorical features and the target variable.

churn.select_dtypes(include='object').columns
Index(['Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category'], dtype='object')

The select_dtypes function can be used to filter columns based on the data types.

One way to analyze categorical features is the groupby function of pandas. It gives us an overview of how a numerical value changes across the groups of a categorical variable.

churn[['Attrition_Flag', 'Gender', 'Marital_Status']].groupby(
    ['Gender', 'Marital_Status']).mean().round(2)
(image by author)

The churned customers are indicated by 1 in the attrition flag column. Thus, the higher the average value, the more likely the customer churn is.

It seems like females are slightly more likely to churn than males. The marital status does not seem to be a significant factor in customer churn.

Let’s also check the average churn rate based on education level.

churn[['Attrition_Flag', 'Education_Level']].groupby(
    ['Education_Level']).mean().round(2)
(image by author)

Customers with a doctorate degree have the highest average churn rate.

We can check the other categorical variables in a similar way.
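As a quick sketch, the same groupby logic can be applied to the remaining categorical columns in a loop (column names taken from the select_dtypes output above):

for col in ['Income_Category', 'Card_Category']:
    print(churn[['Attrition_Flag', col]].groupby(col).mean().round(2))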


Numerical Features

One measure that gives us an idea about the relationships between numerical variables is the correlation coefficient.

The corr function of Pandas creates a dataframe of correlation coefficients between variables. We can check the correlations on the dataframe or visualize them using a heatmap. I prefer the latter because it provides a more structured and informative overview.

corr = churn.corr().round(2)
plt.figure(figsize=(12,8))
sns.heatmap(corr, annot=True, cmap="YlGnBu")
Heatmap (image by author)

The first row of the heatmap is what we are mostly interested in. It shows the correlation coefficients between the target variable (Attrition_Flag) and other variables.

Correlation with target (image by author)

I have sorted the correlation coefficients based on the absolute value because we are interested in all correlations, not only the positive ones.
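One way to reproduce this sorted view from the corr dataframe computed above is the following sketch:

# correlations with the target, ordered by absolute value but keeping the sign
target_corr = corr['Attrition_Flag'].drop('Attrition_Flag')
target_corr.reindex(target_corr.abs().sort_values(ascending=False).index)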

The correlation coefficient between customer age and churn is very low. We can also check the average age of churned and not-churned customers.

churn[['Attrition_Flag', 'Customer_Age']].groupby('Attrition_Flag').mean()
(image by author)

The difference is so small that we can ignore the customer age variable and not include it in the model.

"Months on book" and "average open to buy" variables also have very small correlation with the target. Scatter plot is a useful plot to compare and analyze numerical variables.

sns.relplot(data=churn, kind='scatter', x='Avg_Open_To_Buy', y='Months_on_book', hue='Attrition_Flag', height=6)
(image by author)

We do not see any pattern that distinguishes churned and not-churned customers based on these two variables. We can also exclude them from the model.

In the heatmap, we also see that total transaction amount and total transaction count are highly correlated. We can visualize these two variables on a scatter plot for further analysis.

sns.relplot(data=churn, kind='scatter', x='Total_Trans_Amt', y='Total_Trans_Ct',hue='Attrition_Flag', height=6)
(image by author)

Both of these variables are important in separating churned and not-churned customers. However, including both would be redundant because they are highly correlated. We can just use one of them.


Feature Preprocessing and Model Training

We will use scikit-learn pipelines to do feature preprocessing steps and model training. Before creating the pipelines, we need to drop the features that will not be included in the model.

churn.drop(['Total_Trans_Ct', 'Months_on_book', 'Avg_Open_To_Buy',
            'Customer_Age'], axis=1, inplace=True)

The numerical and categorical features require different kinds of preprocessing techniques. Thus, we will create separate pipelines for numerical and categorical features and then combine them in a column transformer. The final pipeline will include the combined transformer and a machine learning model.

Let’s first import the dependencies.

#feature engineering and preprocessing
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
#machine learning model and evaluation
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

Since different transformations will be applied on numerical and categorical features, we will create lists for two different kinds of features.

numeric = churn.iloc[:,1:].select_dtypes(exclude='object').columns
categoric = churn.iloc[:,1:].select_dtypes(include='object').columns

The numerical transformer will scale the feature values using the MinMaxScaler and the categorical transformer will encode the categories using the OneHotEncoder.

num_transformer = Pipeline(steps=[('scaler', MinMaxScaler())])
cat_transformer = Pipeline(steps=[('encode', OneHotEncoder(drop='first'))])

We will use the ColumnTransformer to apply each transformer to the relevant features.

preprocess = ColumnTransformer(
    transformers=[
        ('numeric', num_transformer, numeric),
        ('categorical', cat_transformer, categoric)
    ]
)

The next step is to combine the column transformer with a machine learning model in a new pipeline.


Note: We do not actually need these feature transformations for a tree-based model (e.g. random forests). However, it is better to create a general pipeline in case we want to use other types of machine learning models.


rf = RandomForestClassifier(n_estimators=100, max_depth=8)
clf = Pipeline([('preprocess', preprocess), ('model', rf)])

The clf is a pipeline that transforms the features using the transformer pipelines in the column transformer (preprocess) and then trains a random forest classifier on the transformed features.

The next step is to separate training and test sets and train the clf pipeline.

X = churn.drop(['Attrition_Flag'], axis=1)
y = churn['Attrition_Flag']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)

We now have a pipeline that contains a trained random forest classifier. We can make predictions and evaluate the performance.

train_pred = clf.predict_proba(X_train)
test_pred = clf.predict_proba(X_test)
log_loss(y_train, train_pred).round(3)
0.173
log_loss(y_test, test_pred).round(3)
0.196

When it comes to a classification task, log loss is one of the most commonly used metrics. Log loss (i.e. cross-entropy loss) evaluates the performance by comparing the actual class labels and the predicted probabilities. The comparison is quantified using cross-entropy.
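As a sanity check on that definition, binary log loss can be computed directly from the predicted probabilities of the positive class. The helper below is only a sketch that mirrors what log_loss does in the binary case:

def binary_log_loss(y_true, p_pred, eps=1e-15):
    # clip probabilities to avoid log(0)
    p = np.clip(p_pred, eps, 1 - eps)
    # average cross-entropy between actual labels and predicted probabilities
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# test_pred[:, 1] is the predicted probability of churn (class 1)
binary_log_loss(y_test.to_numpy(), test_pred[:, 1])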

The performance is fairly good. However, there is always room for improvement. The model is slightly overfit on the training set since there is a small difference between the losses on training and test sets.

We can apply hyperparameter tuning and cross validation to improve the performance of our model. We can also try different types of machine learning models.
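For instance, hyperparameter tuning with cross validation can be wired into the same pipeline. The sketch below assumes an arbitrary parameter grid; the parameters of the model step are addressed with the 'model__' prefix:

from sklearn.model_selection import GridSearchCV

param_grid = {'model__n_estimators': [100, 200],
              'model__max_depth': [6, 8, 10]}

search = GridSearchCV(clf, param_grid, cv=5, scoring='neg_log_loss')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)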


Conclusion

We have covered a typical machine learning model creation process. We first tried to explore the data and understand the relationships among variables. Some features were found to be redundant and were dropped.

We have used pipelines to combine feature transformations and model creation. The main advantage of using pipelines is to simplify the preprocessing by combining many different operations in a single pipeline.

We can use pipelines for model selection as well. For instance, the clf pipeline we have created can be used to try different models. We can iterate over a list of models and use each model in the pipeline.
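A rough sketch of that idea is shown below; logistic regression is just an example alternative model, and each candidate reuses the same preprocess transformer:

from sklearn.linear_model import LogisticRegression

models = {'random_forest': RandomForestClassifier(n_estimators=100, max_depth=8),
          'logistic_regression': LogisticRegression(max_iter=1000)}

for name, model in models.items():
    candidate = Pipeline([('preprocess', preprocess), ('model', model)])
    candidate.fit(X_train, y_train)
    print(name, round(log_loss(y_test, candidate.predict_proba(X_test)), 3))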

Thank you for reading. Please let me know if you have any feedback.

