Modeling Telecom Customer Churn with Variational Autoencoder

How to apply deep convolutional neural networks and auto-encoders for building a churn prediction model.

Published in Towards Data Science · 9 min read · Feb 19, 2019

An autoencoder is deep learning’s answer to dimensionality reduction. The idea is simple: pass the input through a series of hidden layers, but make the final output layer the same dimension as the input layer. The intervening hidden layers have progressively fewer nodes, which compresses the input into a lower-dimensional representation. If the output reconstructs the input closely, the activations of the smallest hidden layer can be taken as a valid reduced-dimension representation of the data.

A variational autoencoder (VAE) resembles a classical autoencoder: it is a neural network consisting of an encoder, a decoder and a loss function. VAEs let us design complex generative models of data and fit them to large data sets.

After reading an article on using convolutional networks and autoencoders to provide insights into user churn, I decided to apply a VAE to a telecom churn data set that can be downloaded from IBM Sample Data Sets. It is a bit of overkill to apply a VAE to a relatively small data set like this, but for the sake of learning about VAEs, I am going to do it anyway.

The Data

Each row represents a customer, and each column contains one of the customer’s attributes, as described in the column metadata.

The data set includes the following information:

Customers who left within the last month — the column is called Churn
Services that each customer has signed up for — phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
Customer account information — how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
Demographic info about customers — gender, age range, and if they have partners and dependents.

import collections

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_recall_curve, auc,
                             roc_curve, recall_score, classification_report, f1_score,
                             precision_recall_fscore_support)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from keras.layers import Input, Dense, Lambda
from keras.models import Model
from keras.losses import binary_crossentropy  # keras.objectives in older Keras releases
from keras.callbacks import Callback, EarlyStopping, LearningRateScheduler, ModelCheckpoint
from keras.utils.vis_utils import model_to_dot
import keras.backend as K

matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)

df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.info()

TotalCharges should be converted to numeric.

df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
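
The conversion matters because a column that contains even one non-numeric entry (for example a blank string) is read as text. The sketch below uses made-up values, not the real data, to show what `errors='coerce'` does with such an entry:

```python
import pandas as pd

# Toy values standing in for a raw TotalCharges column (illustrative only).
raw = pd.Series(['29.85', '1889.5', ' ', '108.15'])

# errors='coerce' turns anything that cannot be parsed into NaN
# instead of raising an exception.
converted = pd.to_numeric(raw, errors='coerce')

print(converted.dtype)         # float64
print(converted.isna().sum())  # 1
```

The resulting NaN values are what the fill step in the pre-processing section later takes care of.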

Most features in the data set are categorical. We are going to visualize them first then create dummy variables.

Visualize and Analyze Categorical Features

Gender

gender_plot = df.groupby(['gender', 'Churn']).size().reset_index().pivot(columns='Churn', index='gender', values=0)
gender_plot.plot(kind='bar', stacked=True);
print('Gender', collections.Counter(df['gender']))

Gender does not seem to have an effect on churn.
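
Stacked bars show counts, so groups of different sizes are hard to compare by eye. As a complement, here is a minimal sketch that turns the same counts into churn rates per category. The helper name `churn_rate_by` and the toy data are assumptions made for illustration; on the real data you would pass `df` and a column such as `'gender'`.

```python
import pandas as pd

def churn_rate_by(df, column, target='Churn'):
    """Return the share of churned / not-churned customers per category."""
    # normalize='index' makes each row (category) sum to 1.
    return pd.crosstab(df[column], df[target], normalize='index')

# Tiny made-up sample, only to show the shape of the result.
toy = pd.DataFrame({
    'gender': ['Female', 'Female', 'Male', 'Male', 'Male', 'Female'],
    'Churn':  ['No', 'Yes', 'No', 'Yes', 'No', 'No'],
})
rates = churn_rate_by(toy, 'gender')
print(rates)
```

Two categories with similar values in the `Yes` column churn at similar rates, regardless of how many customers each category has.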

Partner

partner_plot = df.groupby(['Partner', 'Churn']).size().reset_index().pivot(columns='Churn', index='Partner', values=0)
partner_plot.plot(kind='bar', stacked=True);
print('Partner', collections.Counter(df['Partner']))

Whether the customer has a partner or not does seem to have some effect on churn.

Dependents

dependents_plot = df.groupby(['Dependents', 'Churn']).size().reset_index().pivot(columns='Churn', index='Dependents', values=0)
dependents_plot.plot(kind='bar', stacked=True);
print('Dependents', collections.Counter(df['Dependents']))

Customers who have no dependents are more likely to churn than customers who have dependents.

PhoneService

phoneservice_plot = df.groupby(['PhoneService', 'Churn']).size().reset_index().pivot(columns='Churn', index='PhoneService', values=0)
phoneservice_plot.plot(kind='bar', stacked=True);
print('PhoneService', collections.Counter(df['PhoneService']))

Few customers did not sign up for phone service, and whether a customer has phone service does not seem to have an effect on churn.

MultipleLines

multiplelines_plot = df.groupby(['MultipleLines', 'Churn']).size().reset_index().pivot(columns='Churn', index='MultipleLines', values=0)
multiplelines_plot.plot(kind='bar', stacked=True);
print('MultipleLines', collections.Counter(df['MultipleLines']))

Whether a customer signed up for MultipleLines or not does not seem to have an effect on churn.

InternetService

internetservice_plot = df.groupby(['InternetService', 'Churn']).size().reset_index().pivot(columns='Churn', index='InternetService', values=0)
internetservice_plot.plot(kind='bar', stacked=True);
print('InternetService', collections.Counter(df['InternetService']))

Customers who signed up for Fiber optic are the most likely to churn: more than 40% of them churned.

OnlineSecurity

onlinesecurity_plot = df.groupby(['OnlineSecurity', 'Churn']).size().reset_index().pivot(columns='Churn', index='OnlineSecurity', values=0)
onlinesecurity_plot.plot(kind='bar', stacked=True);
print('OnlineSecurity', collections.Counter(df['OnlineSecurity']))

Customers who did not sign up for OnlineSecurity are most likely to churn.

OnlineBackup

onlinebackup_plot = df.groupby(['OnlineBackup', 'Churn']).size().reset_index().pivot(columns='Churn', index='OnlineBackup', values=0)
onlinebackup_plot.plot(kind='bar', stacked=True);
print('OnlineBackup', collections.Counter(df['OnlineBackup']))

Customers who did not sign up for OnlineBackup are most likely to churn.

DeviceProtection

deviceprotection_plot = df.groupby(['DeviceProtection', 'Churn']).size().reset_index().pivot(columns='Churn', index='DeviceProtection', values=0)
deviceprotection_plot.plot(kind='bar', stacked=True);
print('DeviceProtection', collections.Counter(df['DeviceProtection']))

Customers who did not sign up for DeviceProtection are most likely to churn.

TechSupport

techsupport_plot = df.groupby(['TechSupport', 'Churn']).size().reset_index().pivot(columns='Churn', index='TechSupport', values=0)
techsupport_plot.plot(kind='bar', stacked=True);
print('TechSupport', collections.Counter(df['TechSupport']))

Customers who did not sign up for TechSupport are most likely to churn.

StreamingTV

streamingtv_plot = df.groupby(['StreamingTV', 'Churn']).size().reset_index().pivot(columns='Churn', index='StreamingTV', values=0)
streamingtv_plot.plot(kind='bar', stacked=True);
print('StreamingTV', collections.Counter(df['StreamingTV']))

StreamingMovies

streamingmovies_plot = df.groupby(['StreamingMovies', 'Churn']).size().reset_index().pivot(columns='Churn', index='StreamingMovies', values=0)
streamingmovies_plot.plot(kind='bar', stacked=True);
print('StreamingMovies', collections.Counter(df['StreamingMovies']))

From the seven plots above, we can see that customers without internet service have a very low churn rate.

Contract

contract_plot = df.groupby(['Contract', 'Churn']).size().reset_index().pivot(columns='Churn', index='Contract', values=0)
contract_plot.plot(kind='bar', stacked=True);
print('Contract', collections.Counter(df['Contract']))

Contract term clearly has an effect on churn. Very few customers with a two-year contract churned, and most churn occurred among customers with a month-to-month contract.

PaperlessBilling

paperlessbilling_plot = df.groupby(['PaperlessBilling', 'Churn']).size().reset_index().pivot(columns='Churn', index='PaperlessBilling', values=0)
paperlessbilling_plot.plot(kind='bar', stacked=True);
print('PaperlessBilling', collections.Counter(df['PaperlessBilling']))

PaymentMethod

paymentmethod_plot = df.groupby(['PaymentMethod', 'Churn']).size().reset_index().pivot(columns='Churn', index='PaymentMethod', values=0)
paymentmethod_plot.plot(kind='bar', stacked=True);
print('PaymentMethod', collections.Counter(df['PaymentMethod']))

PaymentMethod does seem to have an effect on churn; in particular, customers who pay by electronic check have the highest churn rate.

SeniorCitizen

seniorcitizen_plot = df.groupby(['SeniorCitizen', 'Churn']).size().reset_index().pivot(columns='Churn', index='SeniorCitizen', values=0)
seniorcitizen_plot.plot(kind='bar', stacked=True);
print('SeniorCitizen', collections.Counter(df['SeniorCitizen']))

We do not have many senior citizens in the data, so from the raw counts it is hard to tell whether being a senior citizen affects the churn rate.

Explore Numeric Features

Tenure

sns.kdeplot(df['tenure'].loc[df['Churn'] == 'No'], label='not churn', shade=True);
sns.kdeplot(df['tenure'].loc[df['Churn'] == 'Yes'], label='churn', shade=True);

df['tenure'].loc[df['Churn'] == 'No'].describe()

df['tenure'].loc[df['Churn'] == 'Yes'].describe()

Customers who did not churn have a much longer average tenure (about 20 months longer) than customers who churned.

Monthly Charges

sns.kdeplot(df['MonthlyCharges'].loc[df['Churn'] == 'No'], label='not churn', shade=True);
sns.kdeplot(df['MonthlyCharges'].loc[df['Churn'] == 'Yes'], label='churn', shade=True);

df['MonthlyCharges'].loc[df['Churn'] == 'No'].describe()

df['MonthlyCharges'].loc[df['Churn'] == 'Yes'].describe()

On average, churned customers paid a monthly fee over 20% higher than that of customers who did not churn.

TotalCharges

sns.kdeplot(df['TotalCharges'].loc[df['Churn'] == 'No'], label='not churn', shade=True);
sns.kdeplot(df['TotalCharges'].loc[df['Churn'] == 'Yes'], label='churn', shade=True);
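
The per-group `describe()` calls above can be condensed into a single group-by, which also makes relative differences such as the "over 20% higher" figure easy to compute. This is a minimal sketch on made-up numbers; the variable names `means` and `rel_diff` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Made-up numbers, only to illustrate the computation.
toy = pd.DataFrame({
    'Churn': ['No', 'No', 'Yes', 'Yes'],
    'tenure': [40, 30, 10, 20],
    'MonthlyCharges': [50.0, 70.0, 70.0, 80.0],
})

# Mean of each numeric feature per churn group.
means = toy.groupby('Churn')[['tenure', 'MonthlyCharges']].mean()

# Relative difference of the churned group versus the not-churned group.
rel_diff = means.loc['Yes'] / means.loc['No'] - 1
print(means)
print(rel_diff)
```

On the real data, the same two lines applied to `df` (before `Churn` is label-encoded) give the group means for `tenure`, `MonthlyCharges` and `TotalCharges` side by side.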

Data Pre-processing

Encode the label as 0 or 1.

le = preprocessing.LabelEncoder()
df['Churn'] = le.fit_transform(df.Churn.values)

Fill NaN values with the mean of the column.

df = df.fillna(df.mean(numeric_only=True))

Encode categorical features.

categorical =  ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
for f in categorical:
    dummies = pd.get_dummies(df[f], prefix = f, prefix_sep = '_')
    df = pd.concat([df, dummies], axis = 1)
# drop original categorical features
df.drop(categorical, axis = 1, inplace = True)
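
As an aside, pandas can do the encode-and-drop step in a single call by passing `columns=` to `get_dummies`. The sketch below runs on a toy frame rather than the real data; `dtype=int` is an optional choice here to get 0/1 integers instead of booleans.

```python
import pandas as pd

toy = pd.DataFrame({
    'tenure': [1, 34, 2],
    'Contract': ['Month-to-month', 'One year', 'Month-to-month'],
    'PaperlessBilling': ['Yes', 'No', 'Yes'],
})

# `columns=` encodes the listed columns and drops the originals in one step;
# the new column names use the original name as prefix, as in the loop above.
encoded = pd.get_dummies(toy, columns=['Contract', 'PaperlessBilling'], dtype=int)
print(encoded.columns.tolist())
```

Columns that are not listed, such as `tenure` here, are passed through unchanged.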

Split the data into train, validation and test sets, and create batches to send through our network.

autoencoder_preprocessing.py
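
The pre-processing script itself is not reproduced in this article. As a rough sketch of what such a step could look like, the code below splits a data frame into train, validation and test sets and scales the features to [0, 1]. The function name `split_and_scale`, the 60/20/20 ratios, the stratification and the random seed are all assumptions, and the demo runs on random data. With the real data, the non-numeric `customerID` column would need to be dropped first.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def split_and_scale(df, target='Churn', seed=42):
    """Split into train/validation/test (60/20/20) and scale features to [0, 1]."""
    X = df.drop(columns=[target]).values.astype('float32')
    y = df[target].values

    # First carve out the test set, then split the remainder into train/validation.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=seed)

    # Fit the scaler on the training set only, so no information leaks
    # from the validation or test sets.
    scaler = MinMaxScaler().fit(X_train)
    return (scaler.transform(X_train), scaler.transform(X_val),
            scaler.transform(X_test), y_train, y_val, y_test)

# Synthetic stand-in for the encoded data frame (random values, illustrative only).
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((100, 5)), columns=list('abcde'))
toy['Churn'] = rng.integers(0, 2, size=100)

X_train, X_val, X_test, y_train, y_val, y_test = split_and_scale(toy)
print(X_train.shape, X_val.shape, X_test.shape)  # (60, 5) (20, 5) (20, 5)
```

Scaling to [0, 1] fits a decoder with sigmoid outputs and a binary cross-entropy reconstruction loss, which is why `MinMaxScaler` is used in this sketch rather than `StandardScaler`.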

VAE Implementation in Keras

The following code snippets are largely adapted from Agustinus Kristiadi’s blog post: Variational Autoencoder: Intuition and Implementation.

Define input layer.
Define encoder layer.
Encoder model, to encode input into latent variable.
We use the mean as the output as it is the center point, the representative of the Gaussian.
We sample from the output of the 2 dense layers.
Define decoder layer in VAE model.
Define overall VAE model, for reconstruction and training.
Define generator model, generate new data given latent variable z.
Translate our loss into Keras code.
Start training.

VAE.py
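
To make two of the steps listed above concrete, here is a NumPy-only sketch of the sampling step (the reparameterization trick) and of a VAE loss made of a binary cross-entropy reconstruction term plus a KL divergence term. It is not the Keras model from the notebook and it is not trainable; it only shows the forward computations. The function names `sample_z` and `vae_loss`, and the use of a log-variance parameterization, are assumptions made for illustration.

```python
import numpy as np

def sample_z(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def vae_loss(x, x_recon, mu, log_var, tiny=1e-7):
    """Per-sample loss: binary cross-entropy reconstruction term plus KL term."""
    x_recon = np.clip(x_recon, tiny, 1 - tiny)  # avoid log(0)
    recon = -np.sum(x * np.log(x_recon) + (1 - x) * np.log(1 - x_recon), axis=1)
    # KL divergence between N(mu, sigma^2) and the standard normal prior.
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)
    return recon + kl

rng = np.random.default_rng(0)
mu = np.zeros((4, 2))        # encoder mean for 4 customers, 2 latent dimensions
log_var = np.zeros((4, 2))   # log-variance 0 means sigma = 1
z = sample_z(mu, log_var, rng)

x = rng.random((4, 6))       # inputs scaled to [0, 1]
loss = vae_loss(x, x, mu, log_var)
print(z.shape, loss.shape)   # (4, 2) (4,)
```

When the encoder output matches the prior exactly (mean 0, variance 1), the KL term is zero and only the reconstruction term remains. Moving the mean away from zero adds a penalty, which is what keeps the latent space compact.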

The model stopped training after 55 epochs with a batch size of 100 samples.

Evaluation

plt.plot(vae_history.history['loss'])
plt.plot(vae_history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show();

From the above loss plot, we can see that the model has comparable performance on both train and validation data sets, and it seems to converge nicely at the end.

We use reconstruction error to measure how well the decoder performs. Autoencoders are trained to minimize reconstruction error, which we plot below:

x_train_encoded = encoder.predict(X_train)
pred_train = decoder.predict(x_train_encoded)
mse = np.mean(np.power(X_train - pred_train, 2), axis=1)
error_df = pd.DataFrame({'recon_error': mse,
                        'churn': y_train})
plt.figure(figsize=(10,6))
sns.kdeplot(error_df.recon_error[error_df.churn==0], label='not churn', shade=True, clip=(0,10))
sns.kdeplot(error_df.recon_error[error_df.churn==1], label='churn', shade=True, clip=(0,10))
plt.xlabel('reconstruction error');
plt.title('Reconstruction error - Train set');

x_val_encoded = encoder.predict(X_val)
pred = decoder.predict(x_val_encoded)
mseV = np.mean(np.power(X_val - pred, 2), axis=1)
error_df = pd.DataFrame({'recon_error': mseV,
                        'churn': y_val})
plt.figure(figsize=(10,6))
sns.kdeplot(error_df.recon_error[error_df.churn==0], label='not churn', shade=True, clip=(0,10))
sns.kdeplot(error_df.recon_error[error_df.churn==1], label='churn', shade=True, clip=(0,10))
plt.xlabel('reconstruction error');
plt.title('Reconstruction error - Validation set');

Latent Space Visualization

We can plot customers in the 2D latent space and color them by churn status. If churned and not-churned customers are separable in the latent space, they will show up as distinct clusters.

x_train_encoded = encoder.predict(X_train)
plt.scatter(x_train_encoded[:, 0], x_train_encoded[:, 1],
            c=y_train, alpha=0.6)
plt.title('Train set in latent space')
plt.show();

x_val_encoded = encoder.predict(X_val)
plt.scatter(x_val_encoded[:, 0], x_val_encoded[:, 1],
            c=y_val, alpha=0.6)
plt.title('Validation set in latent space')
plt.show();

Prediction on the validation set

x_val_encoded = encoder.predict(X_val)
fpr, tpr, thresholds = roc_curve(y_val, clf.predict(x_val_encoded))
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, label='AUC = %0.4f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.001, 1])
plt.ylim([0, 1.001])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show();

print('Accuracy:')
print(accuracy_score(y_val, clf.predict(x_val_encoded)))
print("Confusion Matrix:")
print(confusion_matrix(y_val,clf.predict(x_val_encoded)))
print("Classification Report:")
print(classification_report(y_val,clf.predict(x_val_encoded)))
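
The snippets above call `clf`, which is not defined in this article. Purely as an illustration of one way such a classifier could be obtained, the sketch below fits a logistic regression on 2-D latent codes. The choice of `LogisticRegression` is an assumption, not the author's model, and the latent codes are synthetic stand-ins for `encoder.predict(X_train)` and `encoder.predict(X_val)`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic 2-D "latent codes" standing in for the encoder outputs.
rng = np.random.default_rng(0)
z_train = rng.normal(size=(200, 2))
y_train_toy = (z_train[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
z_val = rng.normal(size=(80, 2))
y_val_toy = (z_val[:, 0] + 0.5 * rng.normal(size=80) > 0).astype(int)

# Hypothetical classifier: the article does not say which model `clf` is.
clf = LogisticRegression().fit(z_train, y_train_toy)

# Probabilities give a full ROC curve; hard labels from predict() give
# only a single operating point.
val_scores = clf.predict_proba(z_val)[:, 1]
print('AUC: %.3f' % roc_auc_score(y_val_toy, val_scores))
```

If the classifier exposes `predict_proba`, passing those scores to `roc_curve` instead of the output of `predict` yields a smoother and more informative curve.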

Prediction on the test set

x_test_encoded = encoder.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, clf.predict(x_test_encoded))
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, label='AUC = %0.4f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.001, 1])
plt.ylim([0, 1.001])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show();

print('Accuracy:')
print(accuracy_score(y_test, clf.predict(x_test_encoded)))
print("Confusion Matrix:")
print(confusion_matrix(y_test,clf.predict(x_test_encoded)))
print("Classification Report:")
print(classification_report(y_test,clf.predict(x_test_encoded)))

That was it! The Jupyter notebook can be found on GitHub. Happy Monday!

References:

Variational autoencoders. (www.jeremyjordan.me)
Tutorial - What is a variational autoencoder? - Jaan Altosaar (jaan.io)
Variational Autoencoders Explained (kvfrans.com)
Credit Card Fraud Detection using Autoencoders in Keras — TensorFlow for Hackers (Part VII) (medium.com)
Basic Autoencoder- Anomaly Detection Using Reconstruction Error | Deeplearning4j (deeplearning4j.org)
naomifridman/Deep-VAE-prediction-of-churn-customer (github.co)