
Building and keeping a loyal clientele can be challenging for any company, especially when customers are free to choose from a variety of providers within a product category. Moreover, retaining existing customers is generally more cost-effective than acquiring new ones.
For these reasons, evaluating client retention is crucial for businesses. It is essential not only to measure customer satisfaction but also to measure the number of clients that stop doing business with a company or service.
Customer churn, also known as customer attrition, is the percentage of customers that stopped using a company’s service during a particular period. Keeping churn rates as low as possible is what every business pursues, and understanding these metrics can help companies identify potential churners in time to prevent them from leaving the client base.
In this article, we will build churn prediction models based on a telecommunication company dataset.
About the data
The dataset was provided by the IBM Developer Platform and is available here. Some information, such as the company name and private client data, was kept anonymous for the sake of confidentiality and will not affect the model’s performance.
We’ll be working with data from 7043 clients, described by the 20 features below:
Demographic customer information
- gender
- SeniorCitizen
- Partner
- Dependents
Services that each customer has signed up for
- PhoneService
- MultipleLines
- InternetService
- OnlineSecurity
- OnlineBackup
- DeviceProtection
- TechSupport
- StreamingTV
- StreamingMovies
Customer account information
- tenure
- Contract
- PaperlessBilling
- PaymentMethod
- MonthlyCharges
- TotalCharges
Customers who left within the last month (this is the feature our model is going to predict)
- Churn
Let’s check the value distribution for our target variable Churn.

With a churn rate of 26.54%, we are dealing with an imbalanced dataset: the number of churners is substantially smaller than the number of non-churners.
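As a sketch of this check, the snippet below computes the churn rate with pandas. The DataFrame here is a tiny synthetic stand-in (the real dataset has 7043 rows and a `Churn` column with "Yes"/"No" values); only the shape of the computation matches the article.

```python
import pandas as pd

# Tiny synthetic stand-in for the dataset's "Churn" column;
# the real data has 7043 rows and a 26.54% churn rate.
df = pd.DataFrame({"Churn": ["No"] * 73 + ["Yes"] * 27})

# Count each class, then express churners as a percentage of all clients
counts = df["Churn"].value_counts()
churn_rate = counts["Yes"] / len(df) * 100
print(counts)
print(f"Churn rate: {churn_rate:.2f}%")  # 27.00% on this toy sample
```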
Data Preparation
An initial observation shows that we are dealing with 10 categorical, 7 binary, and 3 numerical variables. However, this study will treat some of these categorical features as binary. To illustrate, the columns StreamingTV and TechSupport have the values "No", "Yes", and "No internet service". In these cases, "No internet service" will be treated as "No". The final model will include 4 categorical, 13 binary, and 3 numerical variables.
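One way to collapse "No internet service" into "No" is a simple pandas replace. The mini-frame below is hypothetical; in the real dataset the same call would cover StreamingTV, TechSupport, and the other internet-dependent service columns.

```python
import pandas as pd

# Hypothetical mini-frame mimicking two of the service columns
df = pd.DataFrame({
    "StreamingTV": ["Yes", "No", "No internet service"],
    "TechSupport": ["No internet service", "Yes", "No"],
})

# Collapse "No internet service" into "No" so these columns become binary
service_cols = ["StreamingTV", "TechSupport"]
df[service_cols] = df[service_cols].replace("No internet service", "No")
print(df["StreamingTV"].unique())  # only "Yes" and "No" remain
```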
Before moving any further, let’s check for the occurrence of outliers in our numerical variables, more specifically in MonthlyCharges and TotalCharges.

At first glance, everything looks ok with the numerical features. Examining the boxplots, there is no evidence of outliers.
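Alongside the boxplots, the same check can be done numerically with Tukey’s IQR rule (the rule boxplot whiskers are based on). The charge values below are made up for illustration; on the real MonthlyCharges and TotalCharges columns the call is identical.

```python
import pandas as pd

# Synthetic charges standing in for the real MonthlyCharges column
charges = pd.Series([20.0, 35.5, 50.0, 64.8, 70.1, 89.9, 99.5, 110.0])

# Tukey's rule: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers
q1, q3 = charges.quantile(0.25), charges.quantile(0.75)
iqr = q3 - q1
outliers = charges[(charges < q1 - 1.5 * iqr) | (charges > q3 + 1.5 * iqr)]
print(len(outliers))  # 0 — no outliers in this sample
```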
As established earlier, adjustments were made to the initial dataset: some features with 3 unique values were converted into binary features, in an attempt to improve our model. With all features settled, let’s view an example of the churn distributions for some of these features.

Looking at the example above, we can interpret that gender probably won’t be a meaningful variable to the model, as the churn rate is quite similar for both male and female customers. On the other hand, clients with dependents are less prone to stop doing business with the company.
As for internet service, customers with fiber optic plans are more likely to quit. Their churn rate is more than double that of DSL and no internet users.
Turning to protection services, clients with device protection and online security plans are more likely to maintain their contracts.
Lastly, the type of contract might be a valuable feature for the model. Notice that the churn rate for month-to-month contracts is considerably higher than that of one year and two-year contracts.
Before setting up the ML algorithms, we need to perform some preprocessing. Considering that most Machine Learning algorithms work better with numerical inputs, we’ll preprocess our data using Scikit-learn’s LabelEncoder and pandas’ get_dummies. Here is an example of what the dataset looks like:
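The encoding step can be sketched as follows. The two columns are hypothetical examples of the telco features: a binary column gets label-encoded to 0/1, while a multi-valued categorical column gets one-hot encoded.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical rows mimicking two telco columns
df = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],  # binary -> label encode
    "Contract": ["Month-to-month", "One year", "Two year"],  # -> one-hot
})

# LabelEncoder maps the binary column to 0/1 ("No" -> 0, "Yes" -> 1)
df["Partner"] = LabelEncoder().fit_transform(df["Partner"])

# get_dummies one-hot encodes the remaining categorical column
df = pd.get_dummies(df, columns=["Contract"])
print(df.columns.tolist())
```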

Machine Learning Models
The first thing that needs to be done is splitting the data into training and test sets. After that, we’ll standardize the features of the training set using StandardScaler and, to address the class imbalance, apply RandomUnderSampler, which is a "way to balance the data by randomly selecting a subset of data for the targeted classes", according to the official documentation.
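The split-scale-undersample pipeline can be sketched as below. To keep the snippet self-contained it uses synthetic data and implements the undersampling manually with NumPy, which is what imblearn’s RandomUnderSampler does under the hood: keep all minority samples and draw an equal-sized random subset of the majority class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the telco features
X, y = make_classification(n_samples=1000, weights=[0.73], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Standardize using statistics fitted on the training set only
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Manual random undersampling (what RandomUnderSampler automates):
# all minority samples plus an equal-sized random majority subset
rng = np.random.default_rng(42)
minority = np.where(y_train == 1)[0]
majority = rng.choice(np.where(y_train == 0)[0],
                      size=len(minority), replace=False)
idx = np.concatenate([minority, majority])
X_res, y_res = X_train[idx], y_train[idx]
```

Note that the test set is scaled with the training-set statistics but is never undersampled; resampling only the training data avoids leaking an artificially balanced distribution into the evaluation.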
With the data standardized and balanced, the following models will be used, and we’ll determine which one shows the best results:
- SVC (Support Vector Classifier)
- Logistic Regression
- XGBoost
To evaluate the effectiveness of these models, we could use Precision or Recall. Precision gives us the proportion of positive identifications that were indeed correct, while recall determines the proportion of actual positives that were correctly identified.
Considering the problem we are trying to solve, Recall will be more suitable for this study, as the objective here is to identify the maximum number of clients who are actually prone to stop doing business with the company, even if some "non-churners" are wrongly identified as "churners". That is to say, in our case, it is better to keep the number of False Negatives as small as possible.
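A tiny worked example makes the precision/recall trade-off concrete. The labels below are made up: of three actual churners, two are caught and one is missed (a False Negative), and one loyal customer is flagged by mistake (a False Positive).

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels: 1 = churner, 0 = non-churner
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 3
```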
We also used Cross-Validation to get more reliable results. Instead of simply splitting the data into train and test sets, the cross_validate method splits our training data into k folds, making better use of the data. In this case, we performed a 5-fold cross-validation.
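The cross-validation call can be sketched as below, on synthetic data and with Logistic Regression as the estimator; the same call works unchanged for SVC and the XGBoost classifier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic data standing in for the preprocessed training set
X, y = make_classification(n_samples=500, random_state=0)

# 5-fold cross-validation scored on recall, as in the article
cv_results = cross_validate(LogisticRegression(max_iter=1000),
                            X, y, cv=5, scoring="recall")
print(cv_results["test_score"])  # one recall value per fold
```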

Notice that all 3 models provided similar results, with a recall of about 80%. We’ll now tune some hyperparameters on the models to see if we can achieve higher recall values. The method utilized here is GridSearchCV, which searches over specified parameter values for each estimator. Each model has a variety of parameters that can be tuned, but we are only adjusting those with the most potential to impact the prediction (specified in the param_grid parameter), while the remainder are left at their default values.
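The search itself can be sketched like this. The data is synthetic and the grid is trimmed for speed, but the structure matches the article: every parameter combination is cross-validated and the one with the highest recall wins.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Trimmed version of the article's SVC grid
param_grid = {"kernel": ["linear", "rbf"],
              "C": [0.01, 1, 10]}

# Each (kernel, C) pair is evaluated with 5-fold CV on recall
search = GridSearchCV(SVC(), param_grid, scoring="recall", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```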
Please find below the tunings for each model:
Support Vector Classifier
param_grid = {'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}
Best recall rate: 0.96 for {‘C’: 0.01, ‘kernel’: ‘poly’}
Notice how effective hyperparameter tuning can be. We searched over different values for C and kernel and obtained an increased recall of 96%, for C = 0.01 and kernel type "poly".
Logistic Regression
param_grid = {'solver': ['newton-cg', 'lbfgs', 'liblinear'],
'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}
Best recall rate: 0.88 for {‘C’: 0.0001, ‘solver’: ‘liblinear’}
Turning to the Logistic Regression, we achieved a better recall as well, with 88% for C = 0.0001 and solver = "liblinear".
Finally, let’s make some adjustments to the XGBoost estimator. XGBoost is known for being one of the most effective Machine Learning algorithms, due to its great performance on structured, tabular datasets in classification and regression problems. It is highly customizable and offers a wide range of parameters to be tuned.
XGBoost
In the first step, we are going to determine the optimal number of trees in the XGBoost model, searching over values for the n_estimators argument.
param_grid = {'n_estimators': range(0,1000,25)}
Best recall rate: 0.82 for {‘n_estimators’: 25}
We can already detect an improvement in the recall rate. Now that we have determined the best n_estimators value, we can move on and search over two relevant parameters, max_depth and min_child_weight.
param_grid = {'max_depth': range(1,8,1),
'min_child_weight': np.arange(0.0001, 0.5, 0.001)}
Best recall rate: 0.86 for {‘max_depth’: 1, ‘min_child_weight’: 0.0001}
In the following step, we’ll determine the best value for gamma, an important parameter used to control the model’s tendency to overfit.
param_grid = {'gamma': np.arange(0.0, 20.0, 0.05)}
Best recall rate: 0.86 for {‘gamma’: 0.0}
Finally, we’ll search for the optimal learning_rate value.
param_grid = {'learning_rate': [0.0001, 0.01, 0.1, 1]}
Best recall rate: 0.88 for {‘learning_rate’: 0.0001}
Evaluating the models on the test set
After tuning parameters for SVC, Logistic Regression, and XGBoost, we noticed improvements in all three models. It’s crucial to run the tuned versions of each model on the test set, to check their performance.
Now, let’s draw a confusion matrix for each of these algorithms to visualize their performance on the test set.
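The confusion-matrix cells map directly onto recall. The predictions below are hypothetical, not the article’s actual test-set results; they only illustrate how the four cells are read off and how recall follows from them.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical test-set predictions (1 = churner)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Rows are true classes, columns predicted: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall = tp / (tp + fn)
print(tn, fp, fn, tp, recall)  # 4 2 1 3 0.75
```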
Support Vector Classifier

Logistic Regression

XGBoost

After running the algorithms on the test set, we can see how a model’s performance improves when we adjust some parameters. All three models had gains in recall rate after tuning, with XGBoost presenting the best recall among them.
Conclusion
The purpose of this project was to develop a model able to identify churning clients of a telecom company as efficiently as possible.
Being able to identify potential churners in advance allows the company to develop strategies to prevent customers from leaving the client base. With this data in hand, companies can offer incentives, like discounts or loyalty programs, or provide additional services in an attempt to reduce the churn rate.
Another point that is worth mentioning is the importance of tuning hyperparameters, adjusting the ML algorithm to achieve better results. All three models improved their recall rate after parameter tuning.
XGBoost has been proving its effectiveness in Data Science projects for a while, and in this project it provided the best results among the models. For that reason, the XGBoost algorithm would be our choice for the problem presented here.
For the full code, please refer to the notebook.