Customer churn, also known as customer attrition, occurs when customers stop doing business with a company. Companies are interested in identifying segments of these customers because the cost of acquiring a new customer is usually higher than that of retaining an existing one. For example, if Netflix knew which segment of customers was at risk of churning, it could proactively engage them with special offers instead of simply losing them.
In this post, we will create a simple customer churn prediction model using the Telco Customer Churn dataset. We chose a decision tree to model churned customers, pandas for data crunching and matplotlib for visualizations, all in Python. The code can be reused with another dataset with a few minor adjustments to train the baseline model. We also provide a few references and share ideas for new features and improvements.
You can run this code by downloading this Jupyter notebook.
Data Preprocessing
We use pandas to read the dataset and preprocess it. The Telco dataset has one customer per line, with many columns (features). There aren’t any rows with all values missing and there are no duplicates (which rarely happens with real-world datasets). There are 11 samples that have TotalCharges set to " ", which looks like a mistake in the data. We remove those samples and convert the column to a numeric type (float).
import pandas as pd

df = pd.read_csv('data/WA_Fn-UseC_-Telco-Customer-Churn.csv')
df = df.dropna(how="all")  # remove samples with all missing values
df = df[~df.duplicated()]  # remove duplicates
total_charges_filter = df.TotalCharges == " "
df = df[~total_charges_filter]  # remove samples with " " in TotalCharges
df.TotalCharges = pd.to_numeric(df.TotalCharges)  # convert to float

Exploratory Data Analysis
We have two types of features in the dataset: categorical (two or more values, without any order) and numerical. Most of the feature names are self-explanatory, except for:
- Partner: whether the customer has a partner or not (Yes, No),
- Dependents: whether the customer has dependents or not (Yes, No),
- OnlineBackup: whether the customer has an online backup or not (Yes, No, No internet service),
- tenure: number of months the customer has stayed with the company,
- MonthlyCharges: the amount charged to the customer monthly,
- TotalCharges: the total amount charged to the customer.
There are 7032 customers in the dataset and 19 features, excluding customerID (non-informative) and the Churn column (target variable). Most of the categorical features have four or fewer unique values.
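As a quick sanity check, these counts can be verified directly with a small sketch:
print(df.shape)  # (7032, 21) after removing the 11 bad TotalCharges rows
print(df.drop(columns=["customerID", "Churn"]).nunique())  # unique values per feature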

We combine features into two lists so that we can analyze them jointly.
categorical_features = [
"gender",
"SeniorCitizen",
"Partner",
"Dependents",
"PhoneService",
"MultipleLines",
"InternetService",
"OnlineSecurity",
"OnlineBackup",
"DeviceProtection",
"TechSupport",
"StreamingTV",
"StreamingMovies",
"Contract",
"PaperlessBilling",
"PaymentMethod",
]
numerical_features = ["tenure", "MonthlyCharges", "TotalCharges"]
target = "Churn"
Numerical features distribution
Numeric summary statistics (mean, standard deviation, etc.) don’t show us spikes or the shapes of distributions, and it is hard to spot outliers with them. That is why we use histograms.
df[numerical_features].describe()

At first glance, there aren’t any outliers in the data. No data point is disconnected from the distribution or too far from the mean value. To confirm this, we would need to calculate the interquartile range (IQR) and show that the values of each numerical feature are within 1.5 IQR of the first and third quartiles.
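Here is a minimal sketch of that check, using pandas quantiles:
# Flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] for each numerical feature
q1 = df[numerical_features].quantile(0.25)
q3 = df[numerical_features].quantile(0.75)
iqr = q3 - q1
outliers = (df[numerical_features] < q1 - 1.5 * iqr) | (df[numerical_features] > q3 + 1.5 * iqr)
print(outliers.sum())  # count of flagged values per feature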
We could convert numerical features to ordinal intervals. For example, tenure is numerical, but often we don’t care about small numeric differences and instead group customers into those with short, medium and long tenure. One reason to convert would be to reduce noise; small fluctuations are often just noise.
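For illustration, such a grouping could be done with pd.cut; the bin edges and labels here are arbitrary choices on our part:
# Group tenure (in months) into three ordinal buckets
tenure_group = pd.cut(df.tenure, bins=[0, 12, 36, df.tenure.max()], labels=["short", "medium", "long"])
print(tenure_group.value_counts())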
df[numerical_features].hist(bins=30, figsize=(10, 7))

We look at the distributions of numerical features in relation to the target variable. We can observe that the greater TotalCharges and tenure are, the lower the probability of churn.
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 3, figsize=(14, 4))
df[df.Churn == "No"][numerical_features].hist(bins=30, color="blue", alpha=0.5, ax=ax)
df[df.Churn == "Yes"][numerical_features].hist(bins=30, color="red", alpha=0.5, ax=ax)

Categorical feature distribution
To analyze categorical features, we use bar charts. We observe that senior citizens and customers without phone service are less represented in the data.
ROWS, COLS = 4, 4
fig, ax = plt.subplots(ROWS, COLS, figsize=(18, 18))
for i, categorical_feature in enumerate(categorical_features):
    row, col = i // COLS, i % COLS  # position in the 4x4 grid
    df[categorical_feature].value_counts().plot(kind="bar", ax=ax[row, col]).set_title(categorical_feature)

The next step is to look at categorical features in relation to the target variable. We do this only for the Contract feature. Users with a month-to-month contract are more likely to churn than users with long-term contracts.
feature = 'Contract'
fig, ax = plt.subplots(1, 2, figsize=(14, 4))
df[df.Churn == "No"][feature].value_counts().plot(kind="bar", ax=ax[0]).set_title('not churned')
df[df.Churn == "Yes"][feature].value_counts().plot(kind="bar", ax=ax[1]).set_title('churned')

Target variable distribution
The target variable distribution shows that we are dealing with an imbalanced problem, as there are many more non-churned than churned users. A model trained on this data could achieve high accuracy simply by mostly predicting the majority class, which in our example is users who didn’t churn.
A few things we can do to minimize the influence of the imbalanced dataset:
- resample the data (e.g. with imbalanced-learn; see the sketch after the plot below),
- collect more samples,
- use precision and recall as evaluation metrics instead of accuracy alone.
df[target].value_counts().plot(kind="bar").set_title('churned')
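As a sketch of the first option, random oversampling of the minority class with imbalanced-learn could look like this (assuming the package is installed; it is not part of the baseline model below):
from imblearn.over_sampling import RandomOverSampler

# Duplicate minority-class rows until both classes are equally frequent
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(df[categorical_features + numerical_features], df[target])
print(y_resampled.value_counts())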

Features
The Telco dataset is already aggregated by customerID, so it is difficult to add new features. When working on churn prediction, we usually get a dataset that has one entry per customer session (customer activity in a certain time period). Then we could add features like:
- number of sessions before buying something,
- average time per session,
- time difference between sessions (frequent or less frequent customer),
- whether a customer is active in only one country.
Sometimes we even have customer event data, which enables us to find patterns of customer behavior in relation to the outcome (churn).
Encoding features
To prepare the dataset for modeling churn, we need to encode categorical features as numbers so that the algorithm can work with the data. For example, "Yes" and "No" become separate binary (0/1) columns. This process is called one-hot encoding.
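For illustration, here is a quick pandas sketch of one-hot encoding a single feature (in the model itself this step happens inside the sklearn pipeline described below):
# One binary column per category of Contract; illustration only
pd.get_dummies(df["Contract"]).head()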

Classifier
We use sklearn, a machine learning library for Python, to create the classifier. The sklearn way is to use a pipeline that defines the feature processing and the classifier. In our example, the pipeline takes the dataset as input, preprocesses the features and trains the classifier. Once trained, it takes the same form of input and returns predictions as output.
In the pipeline, we process categorical and numerical features separately. We one-hot encode the categorical features and scale the numerical features by removing the mean and scaling them to unit variance. We chose a decision tree model because of its interpretability and set the maximum depth to 3 (arbitrarily).
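Below is a sketch of such a pipeline, consistent with the description above; the exact construction may differ in your setup, and random_state is our choice for reproducibility:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# One-hot encode categorical features, standardize numerical ones
preprocessor = ColumnTransformer(transformers=[
    ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_features),
    ("numerical", StandardScaler(), numerical_features),
])
# A shallow decision tree, as described above (max depth 3)
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", DecisionTreeClassifier(max_depth=3, random_state=42)),
])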

Training the model
We split the dataset into train (75% of samples) and test (25% of samples) sets. We train (fit) the pipeline on the train set and make predictions on the test set.
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.25, random_state=42)
pipeline.fit(df_train, df_train[target])
pred = pipeline.predict(df_test)
Testing the model
With classification_report we calculate precision and recall from the actual and predicted values.
For class 1 (churned users) the model achieves 0.67 precision and 0.37 recall. Precision tells us what fraction of the users the classifier predicted as churned actually churned. Recall tells us what fraction of the actually churned users the classifier found, so a low recall means it missed many of them.
In layman’s terms, the classifier is not very accurate for churned users.
from sklearn.metrics import classification_report
print(classification_report(df_test[target], pred))

Model interpretability
The decision tree model uses the Contract, MonthlyCharges, InternetService, TotalCharges, and tenure features to decide whether a customer will churn. Based on the split criteria in the decision tree, these features separate churned customers from the others well.
Each customer sample traverses the tree and the final node gives the prediction. For example, if Contract_Month-to-month is:
- equal to 0, continue traversing the tree along the True branch,
- equal to 1, continue traversing the tree along the False branch,
- not defined, output class 0.
This is a great way to see how the model makes its decisions, or whether any features sneaked into our model that shouldn’t be there.
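One way to produce such a tree diagram is sklearn’s plot_tree. The sketch below assumes the pipeline defined earlier and a recent sklearn (get_feature_names_out needs version 1.0 or later):
from sklearn.tree import plot_tree

# Recover the one-hot encoded feature names from the fitted preprocessor
feature_names = pipeline.named_steps["preprocessor"].get_feature_names_out()
plt.figure(figsize=(20, 10))
plot_tree(pipeline.named_steps["classifier"], feature_names=feature_names, class_names=["No", "Yes"], filled=True)
plt.show()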

Further reading
- Handling class imbalance in customer churn prediction – how can we better handle class imbalance in churn prediction.
- A Survey on Customer Churn Prediction using Machine Learning Techniques – this paper reviews the most popular machine learning algorithms used by researchers for churn prediction.
- Telco customer churn on Kaggle – Churn analysis on Kaggle.
- WTTE-RNN-Hackless-churn-modeling – Event based churn prediction.