Predicting and Preventing the Churn of High Value Customers Using Machine Learning

Jeremy Silva
Towards Data Science
10 min read · Sep 9, 2020


**Short code snippets and visualizations will be shared throughout this post, but all the code for this project can be found in this GitHub Repo.**

Companies put a lot of emphasis on customer acquisition, and rightfully so: companies need customers. However, customer acquisition costs are usually fairly high, and companies sometimes focus too much on acquisition and not enough on retention. A company may have a great acquisition rate, but if it also has a high Churn rate, it is essentially flushing those acquisition costs down the drain. The goal of this project is to demonstrate how machine learning can be used to identify customers who are likely to Churn in the coming months. If customers at high risk of Churn can be identified early, interventions can be made to retain them.

What is Churn?

If a customer cancels their service with your company, that is considered a Churn case.


The Data

For this project, I chose to use a dataset from a mobile phone carrier. The dataset contains information on 3,100 randomly selected customers. After cleaning, the dataset looks like this…

Mobile Carrier Dataset Snapshot (image by author)

Each row represents a customer and each customer has 12 attributes associated with them as well as a 13th attribute (‘Churn’) that describes whether or not that customer canceled their service. Except for Churn, each attribute is an aggregate of 9 months of data (months 1–9). Churn is a record of whether or not that customer canceled service during months 9 through 12.

Why do we need to predict Churn if there is already a Churn column?

We only have a Churn column because this is past data. We will use this past data to build a model, which can then be used on current data to predict Churn before it happens.

Visualizing the Data

How Attributes Relate to Churn (0=Non-Churn, 1=Churn). (image by author)

As we can see, a number of variables differ significantly between the Churn and Non-Churn group, so this dataset likely holds a good deal of useful intelligence.

Training and Testing Data

As with all supervised machine learning models, we will split the dataset into two parts: training and testing. We will use the training data to build the model and then evaluate the model's performance on the test data (which the model has never seen before). We will give the model the Churn values during the training process; then, during the testing process, we will withhold the Churn values and have the model predict them. This is how we simulate how the model would perform if deployed and used on current data to predict Churn.
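Assuming the cleaned data lives in a pandas DataFrame called `df` (the split ratio and random seed below are illustrative), a minimal sketch of this split with scikit-learn might look like:

```python
from sklearn.model_selection import train_test_split

# Separate the 12 predictor attributes from the Churn label
X = df.drop('Churn', axis=1)
y = df['Churn']

# Hold out 25% of customers for testing; stratify so both sets
# keep the same Churn/Non-Churn ratio as the full dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```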

Imbalanced Classes

The problem we are trying to solve is a Classification Problem: we are trying to take each customer and classify them into either the Non-Churn Class (denoted as a 0) or the Churn Class (denoted as a 1).

Whenever we have a classification problem, it is important to note how frequently we observe each class in the dataset. In our case, we have significantly more Non-Churn cases than we do Churn Cases, as we’d hope would be the case for any business.

For every Churn case we see about 5.5 Non-Churn Cases (image by author)

Noting class imbalances is important for evaluating modeling accuracy results.

For example, if I told you the model I built classifies 84% of cases correctly, that might seem pretty good. But in this case, Non-Churn cases represent 84% of the data, so, in theory, the model could just classify everything as Non-Churn and achieve an 84% accuracy score.

Therefore, when we are evaluating model performance on an imbalanced dataset, we want to judge the model mainly on how it performs on the minority class.
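As a quick sanity check, the class balance, and the accuracy of a do-nothing model that always predicts Non-Churn, can be read straight off the label column. This sketch reuses the hypothetical `df` and `'Churn'` column from above:

```python
# Proportion of each class: roughly 0.84 Non-Churn vs 0.16 Churn
print(df['Churn'].value_counts(normalize=True))

# A naive model that predicts Non-Churn for everyone scores
# exactly the majority-class share (~84%)
baseline_accuracy = (df['Churn'] == 0).mean()
print(f"Always-predict-Non-Churn accuracy: {baseline_accuracy:.2%}")
```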

Modeling Results

As always, I will skip right to the results of the final model before diving into the modeling process, so those of you less interested in the whole process can get an understanding of the ultimate outcome and its implications.

By using Synthetic Minority Over-sampling (SMOTE) on the training data to oversample the Churn class, I was able to build a Random Forest Classifier with an overall accuracy of 96%. In other words, on unseen data (the test data), the model correctly predicted whether or not a customer would Churn during the next 3 months 96% of the time.

However, as we just discussed, overall accuracy does not tell the whole story!

Because we have imbalanced classes (84%:16%) we want to focus more on how well the model performed on the Churn cases (the minority class).

The below Classification Report gives us this information…

Churn Class Performance Underlined (image by author)

Let’s unpack those results a little bit…

Recall
A Churn class Recall of 0.91 means that the model was able to catch 91% of the actual Churn cases. This is the measure we really care about, because we want to miss as few of the true Churn cases as possible.

Precision
Precision of the Churn class measures, of all the customers the model flagged as Churn cases, how many actually churned; it penalizes the model for misclassifying Non-Churn cases as Churn cases. In this case, a Churn Precision of 0.84 is not a problem, because there are no significant consequences of identifying a customer as a Churn risk when she isn't.

F1 Score
The F1 Score is the harmonic mean of Precision and Recall. It gives us a balanced idea of how the model is performing on the Churn class. In this case, a Churn Class F1 Score of 0.87 is pretty good. There is usually a trade-off between Precision and Recall: I could have played around with the probability thresholds of the model and gotten Churn Class Recall up to 97%, but Churn Class Precision would have gone down, since the model would be classifying a bunch of actual Non-Churn cases as Churn cases. The F1 Score helps keep us honest.

We can visualize this relationship with a Precision Recall Curve of the Churn Class…

Precision Recall Curve of the Churn Class (image by author)

The more the curve bows out toward the top-right corner, the better the model, so this model is doing well!
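For reference, a curve like this can be generated from the fitted model's predicted probabilities. This is a sketch assuming a fitted classifier `rf` and the test split from earlier:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Probability of the Churn class (column 1) for each test customer
churn_probs = rf.predict_proba(X_test)[:, 1]

# Precision and recall at every possible probability threshold
precision, recall, thresholds = precision_recall_curve(y_test, churn_probs)

plt.plot(recall, precision)
plt.xlabel('Recall (Churn class)')
plt.ylabel('Precision (Churn class)')
plt.title('Precision Recall Curve of the Churn Class')
plt.show()
```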

Explaining the Final Model

For those of you interested in what a Random Forest Classifier utilizing SMOTE means…

SMOTE (Synthetic Minority Oversampling)

SMOTE is a method for dealing with the class imbalance issue. Because our data contained only 1 Churn case for every 5.5 Non-Churn cases, the model wasn't seeing enough Churn cases and therefore wasn't performing well in classifying those cases.

With SMOTE, we can synthesize examples of the minority class so that the classes become more balanced. It is important to note that we only do this to the training data, so the model can see more examples of the minority class. We do not manipulate the testing data in any way, since that is what we use to evaluate the model's performance.

How does SMOTE create new data points out of thin air?

SMOTE plots each example of the minority class in 12-dimensional space, since we have twelve attributes for each customer. It picks a random minority-class point, draws a line to one of that point's nearest minority-class neighbors, then plots a new synthetic point somewhere along that line. It does this over and over again until it reaches the class ratio that you asked for at the start.

In this case, I used SMOTE on the training data to generate enough Churn class samples such that there was 1 Churn case for every 2 Non-Churn cases. This helped improve performance greatly.
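In code, that oversampling step is a one-liner with the imbalanced-learn library; `sampling_strategy=0.5` expresses the 1-to-2 ratio described above (variable names carried over from the earlier sketches):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

# sampling_strategy=0.5 -> synthesize Churn cases until there is
# 1 Churn case for every 2 Non-Churn cases in the training data
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

# Class counts before and after resampling
print('Before:', Counter(y_train))
print('After: ', Counter(y_train_sm))
```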

Random Forest Classifier

This is the actual model used for the Classification. A Random Forest is a collection of individual Decision Trees, and the way it works is pretty cool!

Most of us are familiar with the concept of a decision tree, even if we don't know it. A decision tree searches through the available features and picks the feature which, when the data is split on its value, produces resulting groups that are as different from each other as possible. A picture will offer a clearer explanation…

Simple Decision Tree Example (image by author)
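To make that splitting idea concrete, here is a minimal, hypothetical example of fitting and drawing a shallow decision tree on our training data with scikit-learn:

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# A shallow tree keeps the picture readable
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X_train, y_train)

plt.figure(figsize=(10, 6))
plot_tree(tree, feature_names=list(X_train.columns),
          class_names=['Non-Churn', 'Churn'], filled=True)
plt.show()
```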

A Random Forest is a collection of hundreds of different decision trees. Each individual decision tree spits out a classification, and whichever classification gets the most “votes” wins.

Visualization of the voting process (image by author)

If each decision tree is splitting on the most efficient feature, then shouldn't every decision tree in the forest be identical?
The Random Forest introduces randomness in this way: every time a tree decides which feature to split on, it has to choose from a random subset of features rather than the whole feature set. Thus, each individual tree in the forest is unique!

Here is the code for how to implement and evaluate a Random Forest Classifier with SMOTE…
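The original gist is no longer embedded here, but a minimal reconstruction of that workflow, assuming scikit-learn, imbalanced-learn, and the train/test split from earlier, might look like:

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Oversample only the training data to a 1:2 Churn:Non-Churn ratio
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)

# Fit the forest on the resampled training data
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_sm, y_train_sm)

# Evaluate on the untouched test data
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
```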

The Model In Action

Accuracy is great but how could this model be used in real life?

With the Random Forest model we can actually generate probabilities for each class prediction. In other words, for each customer, the model can output its estimated probability that the customer will Churn over the next three months.

For example, we could return a list of all customers who have a greater than 65% chance of Churning. Since we really only care about high value customers, we could make it so that list contains only those customers with a higher than average Customer Value.

In doing so, we are able to generate a list of customers who are of high value to the company and at high risk of Churning. These are the customers with whom the company would want to intervene in some way to get them to stay.

When run on testing data, such a list looks like this (the index is the Customer ID)…

All of our High Risk, High Value Customers (image by author)

Here is the code for generating such a list after fitting the model like we did above…
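That gist is also missing from the extracted post; a sketch of the idea, assuming the fitted `rf` from above and a 'Customer Value' column in the data (the 0.65 threshold comes from the text), could be:

```python
# Churn probability for each test-set customer
results = X_test.copy()
results['Churn Probability'] = rf.predict_proba(X_test)[:, 1]

# Keep customers who are both high risk (>65% Churn probability)
# and high value (above-average Customer Value)
high_risk_high_value = results[
    (results['Churn Probability'] > 0.65) &
    (results['Customer Value'] > X['Customer Value'].mean())
].sort_values('Churn Probability', ascending=False)

print(high_risk_high_value[['Customer Value', 'Churn Probability']])
```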

Understanding How Each Feature Contributes to Churn

Being able to predict which customers are likely to Churn is great, but, if this were my company, I'd also want to know how each feature contributes to Churn.

Just by visualizing the data as we did in the Visualizing the Data section, we can see how each individual feature relates to Churn. Unfortunately, it's tough to compare the effects of the variables based solely on visualization.

We can actually use another Machine Learning technique, called Logistic Regression, to compare the effects of each feature. Logistic Regression is also a classification model. In our case, it didn’t give us prediction performance that was as good as Random Forest, but it does give us good, interpretable information on how each feature affects Churn.

The Logistic Regression Coefficients are visualized below…

Logistic Regression Coefficients (image by author)

Features with positive coefficients mean that an increase in that feature leads to an increased chance of that observation being a Churn case.

Features with negative coefficients mean that an increase in that feature leads to a decreased chance of that observation being a Churn case.

So the more Call Failures recorded for a customer, the more likely that customer is to be a Churn case. The more SMS messages sent by a customer (‘Frequency of SMS’), the less likely that customer is to be a Churn case.

Here is the code for computing and plotting those coefficients…
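Again, the embedded gist did not survive extraction. A hedged sketch follows, assuming the features are standardized first (logistic regression coefficients are only comparable when the features share a common scale):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Standardize features so coefficient magnitudes are comparable
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train_scaled, y_train)

# One coefficient per feature; the sign shows the direction of
# that feature's effect on the probability of Churn
coefs = pd.Series(logreg.coef_[0], index=X_train.columns).sort_values()
coefs.plot(kind='barh')
plt.title('Logistic Regression Coefficients')
plt.show()
```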

Modeling Iterations

As always, I tried multiple different models and oversampling techniques before settling on Random Forest with SMOTE.

Models Tried:
- Logistic Regression
- Random Forest
- Gradient Boosting Classifier
- AdaBoost Classifier

Oversampling Techniques Tried:
- No oversampling (imbalanced classes)
- SMOTE w/ 0.25, 0.5, 0.75, and 1 as class ratios
- ADASYN w/ 0.25, 0.5, 0.75, and 1 as class ratios

To compare the models, I tried each model with each sampling strategy. I added all results to a dataframe that looked like this…

Summary of Modeling Iterations (image by author)

This dataframe was put together with the following two code blocks (except for Logistic Regression models, which had to be computed separately due to the need to scale the X variables)…

Generating Ensemble Models with Each Sampling Technique
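Those code blocks were lost when the post was extracted; a compact sketch of the comparison loop, with model and sampler choices taken from the lists above and illustrative metric columns, might be:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE, ADASYN
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier,
                              AdaBoostClassifier)
from sklearn.metrics import precision_score, recall_score, f1_score

models = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'AdaBoost': AdaBoostClassifier(random_state=42),
}

# None = no oversampling; one sampler per technique/ratio combination
samplers = {'None': None}
for ratio in [0.25, 0.5, 0.75, 1.0]:
    samplers[f'SMOTE {ratio}'] = SMOTE(sampling_strategy=ratio, random_state=42)
    samplers[f'ADASYN {ratio}'] = ADASYN(sampling_strategy=ratio, random_state=42)

rows = []
for model_name, model in models.items():
    for sampler_name, sampler in samplers.items():
        # Resample only the training data, never the test data
        if sampler is None:
            X_tr, y_tr = X_train, y_train
        else:
            X_tr, y_tr = sampler.fit_resample(X_train, y_train)
        model.fit(X_tr, y_tr)
        y_pred = model.predict(X_test)
        rows.append({'Model': model_name, 'Sampling': sampler_name,
                     'Recall': recall_score(y_test, y_pred),
                     'Precision': precision_score(y_test, y_pred),
                     'F1': f1_score(y_test, y_pred)})

results_df = pd.DataFrame(rows)
print(results_df.sort_values('F1', ascending=False))
```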

So that is how you can use Machine Learning to predict and prevent the Churn of High Value Customers. To recap, we were able to accomplish two major things using machine learning that we couldn’t have done with “traditional” visualization based analytics. We were able to:
1) Generate a model that can identify 91% of Churn cases before they happen
2) Rank order how each factor contributes to the probability of a customer Churning.

Thanks for Reading!!


Jeremy is a data specialist focusing on Python, SQL, and Machine Learning implementation across the data science pipeline.