
Modeling Telecom Customer Churn with Variational Autoencoder

How to apply a variational autoencoder to build a churn prediction model.


An autoencoder is deep learning’s answer to dimensionality reduction. The idea is pretty simple: transform the input through a series of hidden layers, but ensure that the final output layer has the same dimension as the input layer. The intervening hidden layers, however, have a progressively smaller number of nodes (and hence reduce the dimension of the input matrix). If the output reproduces the input closely, then the activations of the smallest hidden layer can be taken as a valid, dimension-reduced representation of the data.
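As a minimal illustration of that idea (the layer sizes here are arbitrary and not tied to the data set used later), a plain autoencoder in Keras might look like this:

# A toy autoencoder: the output layer matches the input dimension,
# while the hidden layers shrink toward a 2-node bottleneck.
from keras.layers import Input, Dense
from keras.models import Model

n_features = 30  # assumed input dimension, purely illustrative
inputs = Input(shape=(n_features,))
encoded = Dense(8, activation='relu')(inputs)               # shrinking...
bottleneck = Dense(2, activation='relu')(encoded)           # ...to the bottleneck
decoded = Dense(8, activation='relu')(bottleneck)           # growing back...
outputs = Dense(n_features, activation='sigmoid')(decoded)  # ...to input size

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# After autoencoder.fit(X, X, ...), the bottleneck activations give a
# 2-dimensional reduced representation of X.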

A variational autoencoder (VAE) resembles a classical autoencoder: it is a neural network consisting of an encoder, a decoder and a loss function. VAEs let us design complex generative models of data and fit them to large data sets.

After reading an article on using convolutional networks and autoencoders to provide insights into user churn, I decided to apply a VAE to a telecom churn data set that can be downloaded from IBM Sample Data Sets. It is a bit of overkill to apply a VAE to a relatively small data set like this, but for the sake of learning VAE, I am going to do it anyway.

The Data

Each row represents a customer, and each column contains one of the customer’s attributes, as described in the column metadata.

The data set includes the following information:

  • Customers who left within the last month — the column is called Churn
  • Services that each customer has signed up for — phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
  • Customer account information — how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
  • Demographic info about customers — gender, age range, and if they have partners and dependents.
import pandas as pd
import numpy as np
import collections
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import (confusion_matrix, precision_recall_curve, auc,
                             roc_curve, recall_score, classification_report,
                             f1_score, precision_recall_fscore_support,
                             accuracy_score)
from sklearn.model_selection import train_test_split
from keras.layers import Input, Dense, Lambda
from keras.models import Model
from keras.losses import binary_crossentropy
from keras.callbacks import (LearningRateScheduler, EarlyStopping,
                             ModelCheckpoint, Callback)
from keras.utils.vis_utils import model_to_dot
import keras.backend as K
matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.info()
Figure 1

TotalCharges is read in as a string and should be converted to numeric; errors='coerce' turns any non-numeric entries into NaN.

df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

Most features in the data set are categorical. We will visualize them first and then create dummy variables.

Visualize and Analyze Categorical Features

Gender

gender_plot = df.groupby(['gender', 'Churn']).size().reset_index().pivot(columns='Churn', index='gender', values=0)
gender_plot.plot(x=gender_plot.index, kind='bar', stacked=True);
print('Gender', collections.Counter(df['gender']))
Figure 2

Gender does not seem to have an effect on churn.

Partner

partner_plot = df.groupby(['Partner', 'Churn']).size().reset_index().pivot(columns='Churn', index='Partner', values=0)
partner_plot.plot(x=partner_plot.index, kind='bar', stacked=True);
print('Partner', collections.Counter(df['Partner']))
Figure 3

Whether the customer has a partner does seem to have some effect on churn.

Dependents

dependents_plot = df.groupby(['Dependents', 'Churn']).size().reset_index().pivot(columns='Churn', index='Dependents', values=0)
dependents_plot.plot(x=dependents_plot.index, kind='bar', stacked=True);
print('Dependents', collections.Counter(df['Dependents']))
Figure 4

Customers who have no dependents are more likely to churn than customers who have dependents.

PhoneService

phoneservice_plot = df.groupby(['PhoneService', 'Churn']).size().reset_index().pivot(columns='Churn', index='PhoneService', values=0)
phoneservice_plot.plot(x=phoneservice_plot.index, kind='bar', stacked=True);
print('PhoneService', collections.Counter(df['PhoneService']))
Figure 5

Not many customers went without phone service, and whether a customer has phone service does not seem to have an effect on churn.

MultipleLines

multiplelines_plot = df.groupby(['MultipleLines', 'Churn']).size().reset_index().pivot(columns='Churn', index='MultipleLines', values=0)
multiplelines_plot.plot(x=multiplelines_plot.index, kind='bar', stacked=True);
print('MultipleLines', collections.Counter(df['MultipleLines']))
Figure 6

Whether a customer signed up for MultipleLines does not seem to have an effect on churn.

InternetService

internetservice_plot = df.groupby(['InternetService', 'Churn']).size().reset_index().pivot(columns='Churn', index='InternetService', values=0)
internetservice_plot.plot(x=internetservice_plot.index, kind='bar', stacked=True);
print('InternetService', collections.Counter(df['InternetService']))
Figure 7

Customers who signed up for fiber optic internet are the most likely to churn; almost 50% of them churned.

OnlineSecurity

onlinesecurity_plot = df.groupby(['OnlineSecurity', 'Churn']).size().reset_index().pivot(columns='Churn', index='OnlineSecurity', values=0)
onlinesecurity_plot.plot(x=onlinesecurity_plot.index, kind='bar', stacked=True);
print('OnlineSecurity', collections.Counter(df['OnlineSecurity']))
Figure 8

Customers who did not sign up for OnlineSecurity are more likely to churn.

OnlineBackup

onlinebackup_plot = df.groupby(['OnlineBackup', 'Churn']).size().reset_index().pivot(columns='Churn', index='OnlineBackup', values=0)
onlinebackup_plot.plot(x=onlinebackup_plot.index, kind='bar', stacked=True);
print('OnlineBackup', collections.Counter(df['OnlineBackup']))
Figure 9

Customers who did not sign up for OnlineBackup are more likely to churn.

DeviceProtection

deviceprotection_plot = df.groupby(['DeviceProtection', 'Churn']).size().reset_index().pivot(columns='Churn', index='DeviceProtection', values=0)
deviceprotection_plot.plot(x=deviceprotection_plot.index, kind='bar', stacked=True);
print('DeviceProtection', collections.Counter(df['DeviceProtection']))
Figure 10

Customers who did not sign up for DeviceProtection are more likely to churn.

TechSupport

techsupport_plot = df.groupby(['TechSupport', 'Churn']).size().reset_index().pivot(columns='Churn', index='TechSupport', values=0)
techsupport_plot.plot(x=techsupport_plot.index, kind='bar', stacked=True);
print('TechSupport', collections.Counter(df['TechSupport']))
Figure 11

Customers who did not sign up for TechSupport are more likely to churn.

StreamingTV

streamingtv_plot = df.groupby(['StreamingTV', 'Churn']).size().reset_index().pivot(columns='Churn', index='StreamingTV', values=0)
streamingtv_plot.plot(x=streamingtv_plot.index, kind='bar', stacked=True);
print('StreamingTV', collections.Counter(df['StreamingTV']))
Figure 12

StreamingMovies

streamingmovies_plot = df.groupby(['StreamingMovies', 'Churn']).size().reset_index().pivot(columns='Churn', index='StreamingMovies', values=0)
streamingmovies_plot.plot(x=streamingmovies_plot.index, kind='bar', stacked=True);
print('StreamingMovies', collections.Counter(df['StreamingMovies']))
Figure 13

From the seven plots above, we can see that customers without internet service have a very low churn rate.

Contract

contract_plot = df.groupby(['Contract', 'Churn']).size().reset_index().pivot(columns='Churn', index='Contract', values=0)
contract_plot.plot(x=contract_plot.index, kind='bar', stacked=True);
print('Contract', collections.Counter(df['Contract']))
Figure 14

It is obvious that contract term has an effect on churn: very few customers on a two-year contract churned, and most churn occurred among customers on a month-to-month contract.

PaperlessBilling

paperlessbilling_plot = df.groupby(['PaperlessBilling', 'Churn']).size().reset_index().pivot(columns='Churn', index='PaperlessBilling', values=0)
paperlessbilling_plot.plot(x=paperlessbilling_plot.index, kind='bar', stacked=True);
print('PaperlessBilling', collections.Counter(df['PaperlessBilling']))
Figure 15

PaymentMethod

paymentmethod_plot = df.groupby(['PaymentMethod', 'Churn']).size().reset_index().pivot(columns='Churn', index='PaymentMethod', values=0)
paymentmethod_plot.plot(x=paymentmethod_plot.index, kind='bar', stacked=True);
print('PaymentMethod', collections.Counter(df['PaymentMethod']))
Figure 16

PaymentMethod does seem to have an effect on churn; in particular, customers who pay by electronic check have the highest churn rate.

SeniorCitizen

seniorcitizen_plot = df.groupby(['SeniorCitizen', 'Churn']).size().reset_index().pivot(columns='Churn', index='SeniorCitizen', values=0)
seniorcitizen_plot.plot(x=seniorcitizen_plot.index, kind='bar', stacked=True);
print('SeniorCitizen', collections.Counter(df['SeniorCitizen']))
Figure 17

We do not have many senior citizens in the data. Whether customers are senior citizens does not seem to have an effect on the churn rate.

Explore Numeric Features

Tenure

sns.kdeplot(df['tenure'].loc[df['Churn'] == 'No'], label='not churn', shade=True);
sns.kdeplot(df['tenure'].loc[df['Churn'] == 'Yes'], label='churn', shade=True);
Figure 18
df['tenure'].loc[df['Churn'] == 'No'].describe()
Figure 19
df['tenure'].loc[df['Churn'] == 'Yes'].describe()
Figure 20

Customers who did not churn have a much longer average tenure (about 20 months longer) than churned customers.

Monthly Charges

sns.kdeplot(df['MonthlyCharges'].loc[df['Churn'] == 'No'], label='not churn', shade=True);
sns.kdeplot(df['MonthlyCharges'].loc[df['Churn'] == 'Yes'], label='churn', shade=True);
Figure 21
df['MonthlyCharges'].loc[df['Churn'] == 'No'].describe()
Figure 22
df['MonthlyCharges'].loc[df['Churn'] == 'Yes'].describe()
Figure 23

On average, churned customers paid over 20% more in monthly fees than customers who did not churn.

TotalCharges

sns.kdeplot(df['TotalCharges'].loc[df['Churn'] == 'No'], label='not churn', shade=True);
sns.kdeplot(df['TotalCharges'].loc[df['Churn'] == 'Yes'], label='churn', shade=True);
Figure 24

Data Pre-processing

Encode the target labels as 0 and 1.

le = preprocessing.LabelEncoder()
df['Churn'] = le.fit_transform(df.Churn.values)

Fill NaN values with the mean of the column.

df = df.fillna(df.mean())

Encode categorical features.

categorical = ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
for f in categorical:
    dummies = pd.get_dummies(df[f], prefix=f, prefix_sep='_')
    df = pd.concat([df, dummies], axis=1)
# drop the original categorical features
df.drop(categorical, axis=1, inplace=True)

Split the data into train, validation and test sets, and create batches to send through our network; a sketch of this step follows.

autoencoder_preprocessing.py
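Since the embedded autoencoder_preprocessing.py gist is not reproduced here, below is a minimal sketch of this step. The 80/10/10 split, the stratification on the target, and the MinMax scaling to [0, 1] (which suits the sigmoid reconstruction used later) are my assumptions, not necessarily the exact choices in the original gist.

# A hypothetical reconstruction of the preprocessing step: split ratios,
# stratification and scaling are assumptions.
y = df['Churn'].values
X = df.drop(['customerID', 'Churn'], axis=1).values  # drop the ID and target

# Hold out 10% of the data for testing, then 10% of the rest for validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.1, random_state=42, stratify=y_train)

# Scale every feature to [0, 1] so the reconstruction can use a sigmoid.
scaler = MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)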

VAE Implementation in Keras

The following code steps were largely adapted from Agustinus Kristiadi’s blog post: Variational Autoencoder: Intuition and Implementation. A sketch implementing them appears after the list.

  • Define input layer.
  • Define encoder layer.
  • Encoder model, to encode input into latent variable.
  • We use the mean as the output as it is the center point, the representative of the Gaussian.
  • We sample from the output of the 2 dense layers.
  • Define decoder layer in VAE model.
  • Define overall VAE model, for reconstruction and training.
  • Define generator model, generate new data given latent variable z.
  • Translate our loss into Keras code.
  • Start training.
VAE.py
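The VAE.py gist is embedded in the original article; below is a minimal sketch that follows the steps above (and Kristiadi’s post). The hidden-layer size, the 2-D latent dimension, and the training settings are illustrative assumptions rather than the exact values in the gist.

# A minimal VAE sketch following the steps above; layer sizes and
# training settings are assumptions.
original_dim = X_train.shape[1]  # number of input features
intermediate_dim = 12            # hidden layer size (assumption)
latent_dim = 2                   # 2-D latent space, handy for plotting
epsilon_std = 1.0

# Encoder: input -> hidden -> (z_mean, z_log_var)
x = Input(shape=(original_dim,))
h = Dense(intermediate_dim, activation='relu')(x)
z_mean = Dense(latent_dim)(h)
z_log_var = Dense(latent_dim)(h)

# Reparameterization trick: sample z = mu + sigma * epsilon
def sampling(args):
    z_mean, z_log_var = args
    epsilon = K.random_normal(shape=(K.shape(z_mean)[0], latent_dim),
                              mean=0., stddev=epsilon_std)
    return z_mean + K.exp(z_log_var / 2) * epsilon

z = Lambda(sampling, output_shape=(latent_dim,))([z_mean, z_log_var])

# Decoder layers are shared between the VAE and the standalone generator.
decoder_h = Dense(intermediate_dim, activation='relu')
decoder_out = Dense(original_dim, activation='sigmoid')
x_decoded = decoder_out(decoder_h(z))

# Loss = reconstruction error + KL divergence from the unit Gaussian prior.
def vae_loss(x_true, x_pred):
    recon = original_dim * binary_crossentropy(x_true, x_pred)
    kl = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var),
                      axis=-1)
    return recon + kl

vae = Model(x, x_decoded)
vae.compile(optimizer='adam', loss=vae_loss)

# Encoder model: maps an input to the mean of its latent Gaussian.
encoder = Model(x, z_mean)

# Generator model: maps a latent point z back to feature space.
decoder_input = Input(shape=(latent_dim,))
decoder = Model(decoder_input, decoder_out(decoder_h(decoder_input)))

vae_history = vae.fit(X_train, X_train,
                      shuffle=True, epochs=100, batch_size=100,
                      validation_data=(X_val, X_val),
                      callbacks=[EarlyStopping(monitor='val_loss',
                                               patience=5)])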

The model stopped training after 55 epochs with a batch size of 100 samples.

Evaluation

plt.plot(vae_history.history['loss'])
plt.plot(vae_history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show();
Figure 25

From the loss plot above, we can see that the model performs comparably on the train and validation sets, and it seems to converge nicely at the end.

We use the reconstruction error to measure how well the decoder performs. Autoencoders are trained to minimize reconstruction error, which we plot below:

x_train_encoded = encoder.predict(X_train)
pred_train = decoder.predict(x_train_encoded)
mse = np.mean(np.power(X_train - pred_train, 2), axis=1)
error_df = pd.DataFrame({'recon_error': mse,
                         'churn': y_train})
plt.figure(figsize=(10,6))
sns.kdeplot(error_df.recon_error[error_df.churn==0], label='not churn', shade=True, clip=(0,10))
sns.kdeplot(error_df.recon_error[error_df.churn==1], label='churn', shade=True, clip=(0,10))
plt.xlabel('reconstruction error');
plt.title('Reconstruction error - Train set');
Figure 26
x_val_encoded = encoder.predict(X_val)
pred = decoder.predict(x_val_encoded)
mseV = np.mean(np.power(X_val - pred, 2), axis=1)
error_df = pd.DataFrame({'recon_error': mseV,
                         'churn': y_val})
plt.figure(figsize=(10,6))
sns.kdeplot(error_df.recon_error[error_df.churn==0], label='not churn', shade=True, clip=(0,10))
sns.kdeplot(error_df.recon_error[error_df.churn==1], label='churn', shade=True, clip=(0,10))
plt.xlabel('reconstruction error');
plt.title('Reconstruction error - Validation set');
Figure 27

Latent Space Visualization

We can project customers into the 2D latent space and visualize churned and not-churned customers, to see whether they are separable there and whether distinct clusters form.

x_train_encoded = encoder.predict(X_train)
plt.scatter(x_train_encoded[:, 0], x_train_encoded[:, 1],
            c=y_train, alpha=0.6)
plt.title('Train set in latent space')
plt.show();
Figure 28
x_val_encoded = encoder.predict(X_val)
plt.scatter(x_val_encoded[:, 0], x_val_encoded[:, 1],
            c=y_val, alpha=0.6)
plt.title('Validation set in latent space')
plt.show();
Figure 29

Prediction on the validation set
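The classifier clf used below is not defined in the snippets shown in the article; as a stand-in (my assumption, not necessarily the author’s choice), a simple logistic regression can be fit on the encoded training set:

# Hypothetical stand-in for clf: a logistic regression trained on the
# 2-D latent representation of the training set.
from sklearn.linear_model import LogisticRegression

x_train_encoded = encoder.predict(X_train)
clf = LogisticRegression()
clf.fit(x_train_encoded, y_train)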

x_val_encoded = encoder.predict(X_val)
fpr, tpr, thresholds = roc_curve(y_val, clf.predict(x_val_encoded))
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, label='AUC = %0.4f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.001, 1])
plt.ylim([0, 1.001])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show();
Figure 30
print('Accuracy:')
print(accuracy_score(y_val, clf.predict(x_val_encoded)))
print("Confusion Matrix:")
print(confusion_matrix(y_val,clf.predict(x_val_encoded)))
print("Classification Report:")
print(classification_report(y_val,clf.predict(x_val_encoded)))
Figure 31

Prediction on the test set

x_test_encoded = encoder.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, clf.predict(x_test_encoded))
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, label='AUC = %0.4f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.001, 1])
plt.ylim([0, 1.001])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show();
Figure 32
print('Accuracy:')
print(accuracy_score(y_test, clf.predict(x_test_encoded)))
print("Confusion Matrix:")
print(confusion_matrix(y_test,clf.predict(x_test_encoded)))
print("Classification Report:")
print(classification_report(y_test,clf.predict(x_test_encoded)))
Figure 33

That was it! The Jupyter notebook can be found on GitHub. Happy Monday!

References:

Agustinus Kristiadi, Variational Autoencoder: Intuition and Implementation (blog post).