
Customer churn prediction is a very common Data Science use case in Marketing. The idea is to estimate which customers are likely to cancel a subscription or stop using your service.
Over the past years, many competitions have been held and new approaches have been developed to better predict customer churn. A few months ago I came across the great paper "A survey on machine learning methods for churn prediction" by Geiler et al. (2022), which benchmarks and then analyzes common churn prediction approaches. As with most papers, no GitHub repository or source code is available.
So I decided to build a custom benchmarking pipeline myself, for two reasons:
- Transferring methods described in papers into code is a crucial skill in Data Science.
- I was wondering whether I would get similar results when using only a subset of their data sets (the ones with a high class imbalance), since some of their sets have a high churn rate (e.g., 50%).
What you will learn in this article:
- How to create a scikit-learn pipeline to benchmark different churn-prediction approaches on different data sets
- How to implement a feedforward neural network with the KerasClassifier wrapper to be compatible with our pipeline
- How to implement a neural network as a custom classifier in scikit-learn
What you will not find in this article:
- The benchmarking results. They are worth their own article, which can now be found here.
- An extensive explanation of the models and pipeline functions used.
Please note:
- If you are interested in a more detailed explanation of how to create pipelines with scikit-learn, check out my previous article "Advanced Pipelines with scikit-learn".
- To focus on the important parts of the code, the snippets below do not contain any import statements. You can find a link to the complete code at the end of this article.
Overview
Figure 1 provides a brief overview of each benchmarking step in the pipeline. Data Loading and Pre-Cleaning are summarized in chapter 0 "Pre-Pipeline steps". The Benchmarking itself is covered in chapters 1 to 6. Finally, chapter 7 "Visualization" closes this article.

0. Pre-pipeline steps
Since each data set differs in its structure (e.g., different names or data types of the target variable), it makes sense to bring them into a consistent format first.
In this case, the target column should always be named churn and its values should be boolean. Columns that do not provide any value (e.g., the user id) should be dropped as well.
The snippet above shows an example of loading and manipulating one data set (ibm hr). Lines 7-12 can be repeated for any other data set you want to use for benchmarking.
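For illustration, a minimal sketch of such a loading step could look like the following (the file name ibm_hr.csv is an assumption; the column names Attrition and EmployeeNumber come from the public IBM HR data set):

```python
import pandas as pd

# Load the raw IBM HR data set (file name is an assumption)
df_ibm_hr = pd.read_csv("ibm_hr.csv")

# Rename the target column to "churn" and cast its values to bool
df_ibm_hr = df_ibm_hr.rename(columns={"Attrition": "churn"})
df_ibm_hr["churn"] = df_ibm_hr["churn"].map({"Yes": True, "No": False})

# Drop columns that provide no value, e.g. the employee id
df_ibm_hr = df_ibm_hr.drop(columns=["EmployeeNumber"], errors="ignore")
```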
After each data set is in the right format, minor pre-cleaning steps (see code below) can be applied.
Lines 4-11 convert the column names to lower case (4) and remove columns with more than 20% missing values (6-7). After the helper function df_precleaning is applied to all data sets, a simple pipeline (18-23) can be used to remove duplicate columns and columns with constant values. Finally, the pre-cleaned data sets are stored as .csv files (31).
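A condensed sketch of this pre-cleaning step is shown below. The duplicate and constant column removal is illustrated with feature-engine's transformers, which is an assumption about the implementation; the helper name df_precleaning follows the text above.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from feature_engine.selection import DropConstantFeatures, DropDuplicateFeatures


def df_precleaning(df: pd.DataFrame) -> pd.DataFrame:
    """Lower-case all column names and drop columns with more than 20% missing values."""
    df = df.copy()
    df.columns = df.columns.str.lower()
    missing_share = df.isna().mean()
    return df.drop(columns=missing_share[missing_share > 0.2].index)


# Simple pipeline that removes constant and duplicate columns
cleaning_pipeline = Pipeline(steps=[
    ("drop_constant", DropConstantFeatures(tol=1, missing_values="ignore")),
    ("drop_duplicates", DropDuplicateFeatures(missing_values="ignore")),
])

# df_ibm_hr comes from the previous snippet; extend the dict with further data sets
for name, df in {"ibm_hr": df_ibm_hr}.items():
    cleaned = cleaning_pipeline.fit_transform(df_precleaning(df))
    cleaned.to_csv(f"{name}_cleaned.csv", index=False)
```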
1. Initial configuration and loading data
After ensuring that the data sets are pre-cleaned and persisted, the actual benchmarking can be constructed. As an initial step, configurations like suppressing warnings (2-6), enabling logging (9-16), and ensuring that our pipeline is visualized when called (23) have to be set.
After this step, each data set that ends with _cleaned.csv (26-30) is loaded into the datasets dictionary.
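A sketch of what this configuration and loading step can look like (the logging format and file paths are assumptions):

```python
import glob
import logging
import os
import warnings

import pandas as pd
from sklearn import set_config

# Suppress warnings to keep the benchmark logs readable
warnings.filterwarnings("ignore")

# Enable basic logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

# Render the pipeline as a diagram when it is called in a notebook
set_config(display="diagram")

# Load every pre-cleaned data set into a dictionary keyed by its name
datasets = {
    os.path.basename(path).replace("_cleaned.csv", ""): pd.read_csv(path)
    for path in glob.glob("*_cleaned.csv")
}
```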
2. Defining sampling approaches
Churn data sets usually suffer from a high class imbalance, meaning that churners are in the minority. To deal with this class imbalance, the imbalanced-learn package comes with a battery of different sampling approaches. I will focus here on the ones that were also used by Geiler et al. (2022). However, feel free to extend the list below.
To use these sampling methods later in the pipeline, they first have to be brought into the right format (tuples). Approaches that combine multiple sampling methods (20-31) have to be wrapped in an imbPipeline object.
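The list below is a sketch of such sampler tuples; the exact selection and the sampling strategies are assumptions and cover only part of the approaches used by Geiler et al. (2022):

```python
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as imbPipeline

sampling_approaches = [
    ("no_sampling", None),
    ("rnd_oversampling", RandomOverSampler(random_state=42)),
    ("rnd_undersampling", RandomUnderSampler(random_state=42)),
    ("smote", SMOTE(random_state=42)),
    # Combined approaches are wrapped in an imbPipeline object
    ("smote_rnd_undersampling", imbPipeline(steps=[
        ("smote", SMOTE(sampling_strategy=0.5, random_state=42)),
        ("rnd_undersampling", RandomUnderSampler(sampling_strategy=0.8, random_state=42)),
    ])),
]
```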
3. Define models
There are many machine learning models out there for Churn Prediction. For this benchmarking I decided to stick to the following:
- GEV-NN (gev_nn)
- Feedforward Neural Network (ffnn)
- Logistic Regression (lr)
- Random Forest (rf)
- XGB Classifier (xgb)
- KNeighborsClassifier (knn)
- SVC (svc)
- LGBMClassifier (lgb)
- Gaussian Naive Bayes (gnb)
- Two VotingClassifiers (soft voting) made of lr, xgb, rf, and ffnn
The first two deep learning models are (winning) solutions from past customer churn prediction competitions. Their code cannot be used directly in a scikit-learn Pipeline, so I had to make their solutions "pipeline-compatible" first. The rest of the models are default scikit-learn implementations or provide a scikit-learn wrapper (lgb).
3.1 GEV-NN
GEV-NN is a deep learning framework for imbalanced classification. The authors, Munkhdalai et al. (2020), claim that it outperforms state-of-the-art baseline algorithms by up to around 2%. Their GitHub code can be found here.
To make their architecture work in my scikit-learn pipeline, I implemented a custom classifier and called the relevant functions (MLP_AE) from their code.
The code for the MLP_AE class is stored in a separate file and is almost identical to their code in Gev_network.py. The authors take the size of the given (train) set into account when determining the batch size, so I made sure to provide the batch_size as a fit_param (34-36).
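Below is a sketch of the custom-classifier pattern. The network inside it is a minimal Keras stand-in and not the actual GEV-NN architecture; in the real pipeline, the build step would call the MLP_AE code from the authors' repository instead. Note how batch_size is accepted as a fit-time argument so it can be routed via fit_params.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.multiclass import unique_labels
from sklearn.utils.validation import check_is_fitted
from tensorflow import keras


class GevNNClassifier(BaseEstimator, ClassifierMixin):
    """scikit-learn compatible wrapper; the network below is only a placeholder."""

    def __init__(self, epochs=50, learning_rate=1e-3):
        self.epochs = epochs
        self.learning_rate = learning_rate

    def _build_model(self, n_features):
        # Placeholder network; replace with the MLP_AE construction from the GEV-NN code
        model = keras.Sequential([
            keras.layers.Input(shape=(n_features,)),
            keras.layers.Dense(64, activation="relu"),
            keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer=keras.optimizers.Adam(self.learning_rate),
                      loss="binary_crossentropy")
        return model

    def fit(self, X, y, batch_size=64):
        # batch_size arrives at fit time, e.g. via fit_params in cross_validate
        self.classes_ = unique_labels(y)
        X = np.asarray(X, dtype="float32")
        y_num = np.asarray(y, dtype="float32")
        self.model_ = self._build_model(X.shape[1])
        self.model_.fit(X, y_num, epochs=self.epochs, batch_size=batch_size, verbose=0)
        return self

    def predict_proba(self, X):
        check_is_fitted(self)
        proba = self.model_.predict(np.asarray(X, dtype="float32"), verbose=0).ravel()
        return np.column_stack([1.0 - proba, proba])

    def predict(self, X):
        return self.classes_[(self.predict_proba(X)[:, 1] >= 0.5).astype(int)]
```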
3.2 Feedforward Neural Network (FFNN)
Similar to the GEV-NN model, the FFNN model was also developed for a competition (the WSDM – KKBox’s Churn Prediction Challenge). Naomi Fridman’s code can be found here. Since her code follows a simpler approach, I did not have to write a custom classifier. Instead, I could build a KerasClassifier, which is a wrapper for using the scikit-learn API with Keras models.
The wrapper (36-41) needs a function that returns a compiled model (5-24). Based on the original code, I also ensured that all callback functions, except the one for storing the model, are implemented.
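A sketch of such a wrapper, shown here with the scikeras implementation of KerasClassifier; the layer sizes and callbacks are illustrative and not necessarily identical to the original FFNN solution:

```python
from scikeras.wrappers import KerasClassifier
from tensorflow import keras


def build_ffnn(meta):
    """Return a compiled feedforward network; `meta` is filled in by scikeras."""
    n_features = meta["n_features_in_"]
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[keras.metrics.AUC()])
    return model


ffnn = KerasClassifier(
    model=build_ffnn,
    epochs=100,
    verbose=0,
    # Callbacks analogous to the original solution, except model checkpointing
    callbacks=[
        keras.callbacks.EarlyStopping(monitor="loss", patience=10),
        keras.callbacks.ReduceLROnPlateau(monitor="loss", patience=5),
    ],
)
```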
3.3 Bringing everything together
Now that these custom solutions are compatible with the scikit-learn pipeline, the other default scikit-learn models can be defined quickly and stored as tuples in a list (48-58), so that they can be called one by one during the benchmarking (see step 6).
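A sketch of this model list; hyperparameters are left at (mostly) default values, and the exact composition of the two voting classifiers is an assumption:

```python
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

lr, rf, xgb = LogisticRegression(max_iter=1000), RandomForestClassifier(), XGBClassifier()

models = [
    ("gev_nn", GevNNClassifier()),        # custom classifier from 3.1
    ("ffnn", ffnn),                       # KerasClassifier wrapper from 3.2
    ("lr", lr),
    ("rf", rf),
    ("xgb", xgb),
    ("knn", KNeighborsClassifier()),
    ("svc", SVC(probability=True)),
    ("lgb", LGBMClassifier()),
    ("gnb", GaussianNB()),
    # One of the two soft-voting ensembles; the second one additionally includes the ffnn
    ("voting_lr_xgb_rf", VotingClassifier(
        estimators=[("lr", lr), ("xgb", xgb), ("rf", rf)], voting="soft")),
]
```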
4. Initial pipeline
The pipeline’s elements are very dynamic. In each iteration, a combination of a sampling approach and an ML model is benchmarked on a given data set. However, some parts of the pipeline remain constant. These parts are defined as the initial pipeline (figure 2, left).

The dynamic part represents the combination of a sampling approach and an ML model that changes in each iteration. An example of this pipeline "extension" can be seen in figure 2 on the right-hand side. After the cross-validation scoring of the given approach is done, the pipeline is set back to its initial state (part 2 is removed from the pipeline) and a new combination is attached to it.
The code below shows the implementation of this initial pipeline.
The pipeline distinguishes between numerical (4-8) and categorical features (9-13). Missing values in numerical features are replaced by the feature’s mean (5), while for categorical features, the most frequent value is used (10). After the imputation step, a MinMaxScaler is applied to the numerical columns (6) and a OneHotEncoder to the categorical ones (11).
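A sketch of this initial, static part of the pipeline. Selecting columns via make_column_selector and building the outer pipeline with imbalanced-learn's Pipeline (so that sampling steps can be attached later) are assumptions about the original implementation:

```python
from imblearn.pipeline import Pipeline as imbPipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

numerical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),           # replace NaNs with the mean
    ("scaler", MinMaxScaler()),
])

categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),  # replace NaNs with the mode
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(
    transformers=[
        ("numerical", numerical_pipeline, make_column_selector(dtype_include="number")),
        ("categorical", categorical_pipeline, make_column_selector(dtype_exclude="number")),
    ],
    sparse_threshold=0,  # force dense output so the Keras-based models can consume it
)

initial_pipeline = imbPipeline(steps=[("preprocessing", preprocessor)])
```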
5. Scores to track and the right batch size
5.1. Scores
The scores that are being tracked during the cross-validation are the following:
- Lift Score (compares model predictions to randomly generated predictions)
- ROC AUC
- F1 Score (for true class and macro)
- F2 Score
- Recall (in churn prediction we usually have higher costs for false negatives)
- Precision
- Average Precision (PR AUC)
The implementation can be found below.
Since the lift score is not a default scikit-learn score, I used the make_scorer function (3) to make it compatible.
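A sketch of the scoring dictionary; the lift score is taken from mlxtend here, which is an assumption about its origin:

```python
from mlxtend.evaluate import lift_score
from sklearn.metrics import fbeta_score, make_scorer

scoring = {
    "lift_score": make_scorer(lift_score),       # not built into scikit-learn
    "roc_auc": "roc_auc",
    "f1": "f1",                                  # F1 for the true (churn) class
    "f1_macro": "f1_macro",
    "f2": make_scorer(fbeta_score, beta=2),
    "recall": "recall",
    "precision": "precision",
    "average_precision": "average_precision",    # PR AUC
}
```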
5.2 Determine the right batch size
Before running the benchmarking, I created a helper function that determines the right batch size based on the (training) data. This function follows the approach of Munkhdalai et al. (2020) for setting an appropriate batch size.
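The snippet below only illustrates the idea of such a helper; the thresholds are made up and do not reproduce the concrete rule from Munkhdalai et al. (2020):

```python
def determine_batch_size(n_train_samples: int) -> int:
    """Return a batch size that grows with the training set size (illustrative thresholds)."""
    if n_train_samples < 1_000:
        return 32
    if n_train_samples < 10_000:
        return 128
    return 512
```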
6. The benchmark loops
The benchmarking (see code below) consists of three nested loops:
- Level 1: Data sets (3)
- Level 2: Models (11)
- Level 3: Sampling approaches (14)
Before each loop, a dictionary is created (9, 13) to store the scores of the respective benchmarked combination. Together with the scores defined in the previous step (5.1), the structure of the bnchmrk_results dictionary should then look like this:
{
  'data set ds': {
    'model m': {
      'sampling approach sa': {
        'lift_score': [],
        'roc_auc': [],
        'f1_macro': [],
        'recall': []
      }, ...
    }, ...
  }, ...
}
In the first loop (data sets), X and y are defined (5-7) by dropping/assigning the target variable churn.
Within the inner loop (level 3, starting from 14), the respective sampling approach is attached to the initial pipeline (21-25). Since some sampling approaches have multiple steps (e.g., SMOTE + RND), a loop is necessary to attach each single step. After appending the sampling approach(es), the model is finally attached as well (28).
As mentioned earlier, a helper function was created to determine the right batch size. This function is called when the current model is either the FFNN or the GEV-NN (31-39). Its output is then passed to the respective deep learning model via the fit_params parameter of the cross-validation function.
In lines 44-53, the cross_validate function is called with RepeatedStratifiedKFold as its splitting strategy. After the results are written into the sampling_results dictionary (55), the (extended) pipeline is set back to its initial state (59).
Since some benchmarks run for quite a long time, the results for each data set are stored as a pickle file (68-69).
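A condensed sketch of these three loops, reusing the objects defined in the snippets above; the number of folds/repeats, the pickle file names, and the fit_params routing details are assumptions:

```python
import copy
import pickle

from imblearn.pipeline import Pipeline as imbPipeline
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
bnchmrk_results = {}

for ds_name, df in datasets.items():                      # level 1: data sets
    X, y = df.drop(columns=["churn"]), df["churn"]
    model_results = {}

    for model_name, model in models:                      # level 2: models
        sampling_results = {}

        for smpl_name, sampler in sampling_approaches:    # level 3: sampling approaches
            # Start from a fresh copy of the initial pipeline
            pipeline = copy.deepcopy(initial_pipeline)

            # Attach the sampling step(s): combined approaches are appended step by step
            if isinstance(sampler, imbPipeline):
                for step_name, step in sampler.steps:
                    pipeline.steps.append((step_name, step))
            elif sampler is not None:
                pipeline.steps.append((smpl_name, sampler))

            # Finally, attach the model itself
            pipeline.steps.append((model_name, model))

            # Deep learning models receive their batch size as a fit parameter
            fit_params = {}
            if model_name in ("ffnn", "gev_nn"):
                fit_params = {f"{model_name}__batch_size": determine_batch_size(len(X))}

            scores = cross_validate(pipeline, X, y, cv=cv, scoring=scoring,
                                    fit_params=fit_params)
            sampling_results[smpl_name] = scores

        model_results[model_name] = sampling_results

    bnchmrk_results[ds_name] = model_results
    with open(f"{ds_name}_results.pkl", "wb") as f:        # persist results per data set
        pickle.dump(model_results, f)
```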
7. Visualization
A comprehensive way to visualize the performance of the different approaches is to use box plots. For this type of visualization, the model names are plotted together with their respective (overall) performance on the axes. This means the data set level in our bnchmrk_results dictionary (see step 6) can be skipped.
But first things first, the script below loads all pickle files and adds their content (the benchmark on each data set) to the results dictionary.
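A sketch of this loading step, following the file naming assumed in the benchmarking snippet above:

```python
import glob
import pickle

bnchmrk_results = {}
for path in glob.glob("*_results.pkl"):
    ds_name = path.replace("_results.pkl", "")
    with open(path, "rb") as f:
        bnchmrk_results[ds_name] = pickle.load(f)  # benchmark results of one data set
```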
As mentioned at the beginning of this section, the data set level will not be considered in this visualization. Therefore, some transformations have to be applied first to bring the data into the following shape:
{
  'sampling approach sa': {
    'model m': {
      'lift_score': [],
      'roc_auc': [],
      'f1': [],
      'recall': []
    }, ...
  }, ...
}
Unfortunately, I was not able to find a better way in time to bring the data into the right structure than the code below:
The code creates a new dictionary (12) to store the reshaped data from the original one. The helper function metric_merger (5-9) concatenates the values of each error metric.
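A sketch of this reshaping step; the defaultdict-based merging is a simplification, and the helper name metric_merger follows the text above:

```python
from collections import defaultdict

import numpy as np


def metric_merger(target: dict, scores: dict) -> None:
    """Concatenate the cross-validation values of each metric into the target dict."""
    for metric, values in scores.items():
        target.setdefault(metric, [])
        target[metric].extend(np.ravel(values).tolist())


# New structure: sampling approach -> model -> metric -> concatenated values
reshaped_results = defaultdict(lambda: defaultdict(dict))

for ds_name, model_results in bnchmrk_results.items():
    for model_name, sampling_results in model_results.items():
        for smpl_name, scores in sampling_results.items():
            metric_merger(reshaped_results[smpl_name][model_name], scores)
```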
Finally, the reshaped data can be visualized. The code below consists of two parts. The first one (lines 1-21) is a helper function that creates a single box plot. The second part (lines 24-46) loops through each sampling approach and then plots the respective box plots.
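A sketch of these two parts; the figure size and the metric shown on the y-axis are assumptions:

```python
import matplotlib.pyplot as plt


def plot_boxplot(ax, model_scores: dict, metric: str, title: str) -> None:
    """Draw one box per model for the given metric on the provided axis."""
    labels = list(model_scores.keys())
    data = [model_scores[m].get(f"test_{metric}", []) for m in labels]
    ax.boxplot(data, labels=labels)
    ax.set_title(title)
    ax.set_ylabel(metric)
    ax.tick_params(axis="x", rotation=45)


# One figure per sampling approach, with a box per model
metric = "f1_macro"
for smpl_name, model_scores in reshaped_results.items():
    fig, ax = plt.subplots(figsize=(10, 5))
    plot_boxplot(ax, model_scores, metric, f"Sampling approach: {smpl_name}")
    fig.tight_layout()
    plt.show()
```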
As mentioned in the introduction, the benchmark outcomes deserve a separate article. That’s why I limited the visualization below to the IBM HR data set only (figure 3) and will leave the results uncommented.

Conclusion
Creating a more complex pipeline with custom classifiers or wrappers can be challenging. However, it was also a lot of fun and a steep learning curve. One of the most time-consuming parts was definitely integrating the custom deep learning models (gev_nn, ffnn) into the scikit-learn pipeline.
Also, the computation time needed to evaluate each approach on several data sets can be intense. This can be further amplified by the initial pipeline steps: like Geiler et al. (2022), I used one-hot encoding. If a data set with a lot of categorical features enters the pipeline, a lot of new columns will be created (curse of dimensionality). An alternative would be to add a dimensionality reduction step here.
I hope the code helps you with your next project:
Sources
- Geiler, L., Affeldt, S., Nadif, M., 2022. A survey on machine learning methods for churn prediction. Int J Data Sci Anal. https://doi.org/10.1007/s41060-022-00312-5
- Munkhdalai, L., Munkhdalai, T., Ryu, K.H., 2020. GEV-NN: A deep neural network architecture for class imbalance problem in binary classification. Knowledge-Based Systems. https://doi.org/10.1016/j.knosys.2020.105534
Used data
- IBM HR Analytics Employee Attrition & Performance (Database Contents License (DbCL) v1.0), https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset
GitHubs
- Munkhdalai, L., 2020. GEV-NN-A-deep-neural-network-architecture-for-class-imbalance-problem-in-binary-classification. https://github.com/lhagiimn/GEV-NN-A-deep-neural-network-architecture-for-class-imbalance-problem-in-binary-classification
- Fridman, N., 2019. Neural-Network, Churn-Prediction. https://github.com/naomifridman/Neural-Network-Churn-Prediction