Modeling Telecom Customer Churn with Variational Autoencoder
How to apply deep convolutional neural networks and auto-encoders for building a churn prediction model.
An autoencoder is deep learning’s answer to dimensionality reduction. The idea is pretty simple: transform the input through a series of hidden layers, but ensure that the final output layer has the same dimension as the input layer. The hidden layers in between narrow to a bottleneck with fewer nodes than the input, and hence reduce the dimension of the input. If the output reconstructs the input closely, then the activations of the smallest hidden layer can be taken as a valid dimension-reduced representation of the data.
A variational autoencoder (VAE) resembles a classical autoencoder: it is a neural network consisting of an encoder, a decoder, and a loss function. VAEs let us design complex generative models of data and fit them to large data sets.
After reading an article on using convolutional networks and autoencoders to provide insights into user churn, I decided to apply a VAE to a telecom churn data set that can be downloaded from IBM Sample Data Sets. It is a bit of overkill to apply a VAE to a relatively small data set like this, but for the sake of learning about VAEs, I am going to do it anyway.
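For reference, the loss a VAE minimizes is usually written as a reconstruction term plus a Kullback-Leibler (KL) term that pulls the encoder's distribution toward a prior. The notation below (q_phi for the encoder, p_theta for the decoder, a standard normal prior p(z)) is the conventional one and is an assumption here, since the post itself does not fix any symbols.

```latex
\mathcal{L}(x) = -\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
                 + D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)

% Closed form of the KL term when q_\phi(z \mid x) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))
% and p(z) = \mathcal{N}(0, I):
D_{\mathrm{KL}} = \tfrac{1}{2} \sum_{j} \big(\sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2\big)
```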
The Data
Each row represents a customer, and each column contains a customer attribute described in the column metadata.
The data set includes the following information:
- Customers who left within the last month — the column is called Churn
- Services that each customer has signed up for — phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information — how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers — gender, age range, and if they have partners and dependents.
import collections
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
# accuracy_score is used in the prediction sections further down
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_recall_curve, auc,
                             roc_curve, recall_score, classification_report, f1_score,
                             precision_recall_fscore_support)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from keras.layers import Input, Dense, Lambda
from keras.models import Model
from keras.losses import binary_crossentropy  # named keras.objectives in older Keras releases
from keras.callbacks import Callback, EarlyStopping, LearningRateScheduler, ModelCheckpoint
from keras.utils.vis_utils import model_to_dot
import keras.backend as K

matplotlib.rcParams['figure.figsize'] = (10.0, 6.0)

df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
df.info()
TotalCharges should be converted to numeric.
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
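To see what errors='coerce' does, here is a small sketch on a made-up Series (the values are illustrative, not taken from the data set): any entry that cannot be parsed as a number becomes NaN instead of raising an error.

```python
import pandas as pd

# Toy stand-in for a column read as strings; the blank entry cannot be parsed.
raw = pd.Series(["29.85", " ", "1889.5"])

# errors="coerce" turns unparseable entries into NaN rather than raising.
converted = pd.to_numeric(raw, errors="coerce")

n_missing = int(converted.isna().sum())
print(converted.dtype, n_missing)
```

Those NaN values are the ones that get filled with the column mean in the pre-processing step.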
Most features in the data set are categorical. We are going to visualize them first then create dummy variables.
Visualize and Analyze Categorical Features
Gender
gender_plot = df.groupby(['gender', 'Churn']).size().reset_index().pivot(columns='Churn', index='gender', values=0)
gender_plot.plot(kind='bar', stacked=True);
print('Gender', collections.Counter(df['gender']))
Gender does not seem to have an effect on the churn.
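A stacked bar chart shows counts, which can make it hard to compare churn rates between groups of different sizes. As a complement, here is a minimal sketch that computes the churn share per category with pd.crosstab. The helper name churn_rate_by and the toy frame are assumptions for illustration; on the real data you would pass df and a column name, before Churn is label-encoded.

```python
import pandas as pd

def churn_rate_by(frame, column, target="Churn"):
    """Return the share of churned customers within each level of `column`."""
    # normalize="index" makes each row of the table sum to 1.
    table = pd.crosstab(frame[column], frame[target], normalize="index")
    return table["Yes"]

# Toy data standing in for the Telco frame (hypothetical values).
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Male"],
    "Churn": ["Yes", "No", "Yes", "No"],
})
rates = churn_rate_by(toy, "gender")
print(rates)
```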
Partner
partner_plot = df.groupby(['Partner', 'Churn']).size().reset_index().pivot(columns='Churn', index='Partner', values=0)
partner_plot.plot(kind='bar', stacked=True);
print('Partner', collections.Counter(df['Partner']))
Whether or not the customer has a partner does seem to have some effect on churn.
Dependents
dependents_plot = df.groupby(['Dependents', 'Churn']).size().reset_index().pivot(columns='Churn', index='Dependents', values=0)
dependents_plot.plot(kind='bar', stacked=True);
print('Dependents', collections.Counter(df['Dependents']))
Customers who have no dependents are more likely to churn than customers who have dependents.
PhoneService
phoneservice_plot = df.groupby(['PhoneService', 'Churn']).size().reset_index().pivot(columns='Churn', index='PhoneService', values=0)
phoneservice_plot.plot(kind='bar', stacked=True);
print('PhoneService', collections.Counter(df['PhoneService']))
Few customers did not sign up for phone service, and whether or not a customer has phone service does not seem to have an effect on churn.
MultipleLines
multiplelines_plot = df.groupby(['MultipleLines', 'Churn']).size().reset_index().pivot(columns='Churn', index='MultipleLines', values=0)
multiplelines_plot.plot(kind='bar', stacked=True);
print('MultipleLines', collections.Counter(df['MultipleLines']))
Whether or not a customer signed up for MultipleLines does not seem to have an effect on churn.
InternetService
internetservice_plot = df.groupby(['InternetService', 'Churn']).size().reset_index().pivot(columns='Churn', index='InternetService', values=0)
internetservice_plot.plot(kind='bar', stacked=True);
print('InternetService', collections.Counter(df['InternetService']))
Customers who signed up for Fiber optic seem to be the most likely to churn: almost 50% of them churned.
OnlineSecurity
onlinesecurity_plot = df.groupby(['OnlineSecurity', 'Churn']).size().reset_index().pivot(columns='Churn', index='OnlineSecurity', values=0)
onlinesecurity_plot.plot(kind='bar', stacked=True);
print('OnlineSecurity', collections.Counter(df['OnlineSecurity']))
Customers who did not sign up for OnlineSecurity are most likely to churn.
OnlineBackup
onlinebackup_plot = df.groupby(['OnlineBackup', 'Churn']).size().reset_index().pivot(columns='Churn', index='OnlineBackup', values=0)
onlinebackup_plot.plot(kind='bar', stacked=True);
print('OnlineBackup', collections.Counter(df['OnlineBackup']))
Customers who did not sign up for OnlineBackup are most likely to churn.
DeviceProtection
deviceprotection_plot = df.groupby(['DeviceProtection', 'Churn']).size().reset_index().pivot(columns='Churn', index='DeviceProtection', values=0)
deviceprotection_plot.plot(kind='bar', stacked=True);
print('DeviceProtection', collections.Counter(df['DeviceProtection']))
Customers who did not sign up for DeviceProtection are most likely to churn.
TechSupport
techsupport_plot = df.groupby(['TechSupport', 'Churn']).size().reset_index().pivot(columns='Churn', index='TechSupport', values=0)
techsupport_plot.plot(kind='bar', stacked=True);
print('TechSupport', collections.Counter(df['TechSupport']))
Customers who did not sign up for TechSupport are most likely to churn.
StreamingTV
streamingtv_plot = df.groupby(['StreamingTV', 'Churn']).size().reset_index().pivot(columns='Churn', index='StreamingTV', values=0)
streamingtv_plot.plot(kind='bar', stacked=True);
print('StreamingTV', collections.Counter(df['StreamingTV']))
StreamingMovies
streamingmovies_plot = df.groupby(['StreamingMovies', 'Churn']).size().reset_index().pivot(columns='Churn', index='StreamingMovies', values=0)
streamingmovies_plot.plot(kind='bar', stacked=True);
print('StreamingMovies', collections.Counter(df['StreamingMovies']))
From the seven plots above, we can see that customers without internet service have a very low churn rate.
Contract
contract_plot = df.groupby(['Contract', 'Churn']).size().reset_index().pivot(columns='Churn', index='Contract', values=0)
contract_plot.plot(kind='bar', stacked=True);
print('Contract', collections.Counter(df['Contract']))
It is obvious that contract term does have an effect on churn. There were very few churns among customers with a two-year contract, and most churns occurred among customers with a month-to-month contract.
PaperlessBilling
paperlessbilling_plot = df.groupby(['PaperlessBilling', 'Churn']).size().reset_index().pivot(columns='Churn', index='PaperlessBilling', values=0)
paperlessbilling_plot.plot(kind='bar', stacked=True);
print('PaperlessBilling', collections.Counter(df['PaperlessBilling']))
PaymentMethod
paymentmethod_plot = df.groupby(['PaymentMethod', 'Churn']).size().reset_index().pivot(columns='Churn', index='PaymentMethod', values=0)
paymentmethod_plot.plot(kind='bar', stacked=True);
print('PaymentMethod', collections.Counter(df['PaymentMethod']))
PaymentMethod does seem to have an effect on churn; in particular, customers who pay by electronic check have the highest churn rate.
SeniorCitizen
seniorcitizen_plot = df.groupby(['SeniorCitizen', 'Churn']).size().reset_index().pivot(columns='Churn', index='SeniorCitizen', values=0)
seniorcitizen_plot.plot(kind='bar', stacked=True);
print('SeniorCitizen', collections.Counter(df['SeniorCitizen']))
We do not have many senior citizens in the data. It seems that whether or not customers are senior citizens does not have an effect on the churn rate.
Explore Numeric Features
Tenure
sns.kdeplot(df['tenure'].loc[df['Churn'] == 'No'], label='not churn', shade=True);
sns.kdeplot(df['tenure'].loc[df['Churn'] == 'Yes'], label='churn', shade=True);
df['tenure'].loc[df['Churn'] == 'No'].describe()
df['tenure'].loc[df['Churn'] == 'Yes'].describe()
Customers who did not churn have a much longer average tenure (about 20 months longer) than customers who churned.
Monthly Charges
sns.kdeplot(df['MonthlyCharges'].loc[df['Churn'] == 'No'], label='not churn', shade=True);
sns.kdeplot(df['MonthlyCharges'].loc[df['Churn'] == 'Yes'], label='churn', shade=True);
df['MonthlyCharges'].loc[df['Churn'] == 'No'].describe()
df['MonthlyCharges'].loc[df['Churn'] == 'Yes'].describe()
On average, churned customers paid a monthly fee over 20% higher than that of not-churned customers.
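A relative difference like this can be computed directly from the group means. The sketch below uses made-up numbers (not the Telco data), so only the arithmetic carries over; on the real frame you would group df by Churn in the same way.

```python
import pandas as pd

# Hypothetical monthly charges for four customers.
toy = pd.DataFrame({
    "Churn": ["No", "No", "Yes", "Yes"],
    "MonthlyCharges": [50.0, 70.0, 70.0, 80.0],
})

# Mean monthly charge per churn group.
means = toy.groupby("Churn")["MonthlyCharges"].mean()

# How much higher the churned mean is, relative to the not-churned mean.
relative_gap = means["Yes"] / means["No"] - 1
print(f"{relative_gap:.0%}")
```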
TotalCharges
sns.kdeplot(df['TotalCharges'].loc[df['Churn'] == 'No'], label='not churn', shade=True);
sns.kdeplot(df['TotalCharges'].loc[df['Churn'] == 'Yes'], label='churn', shade=True);
Data Pre-processing
Encode the Churn label as an integer: 0 for No and 1 for Yes.
le = preprocessing.LabelEncoder()
df['Churn'] = le.fit_transform(df.Churn.values)
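LabelEncoder assigns integers in the sorted order of the class names, which is why No maps to 0 and Yes to 1. A quick check on a toy list (the list is illustrative, not the real column):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(["No", "Yes", "No"])

# classes_ holds the original labels in the order of their integer codes.
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)
```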
Fill NaN values with the mean of the column.
# numeric_only=True skips the string columns, which have no mean
df = df.fillna(df.mean(numeric_only=True))
Encode categorical features.
categorical = ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod']
for f in categorical:
    dummies = pd.get_dummies(df[f], prefix=f, prefix_sep='_')
    df = pd.concat([df, dummies], axis=1)
# drop original categorical features
df.drop(categorical, axis=1, inplace=True)
Split the data into train, validation, and test sets, and create batches to send through our network.
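A minimal sketch of such a split, using synthetic data in place of the real feature matrix. The 70/15/15 proportions, the stratification, and the random_state are assumptions for illustration, not values from this post. The features are scaled to [0, 1] with MinMaxScaler, fitted on the training set only, which suits a reconstruction loss based on binary cross-entropy.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-ins for the feature matrix and the 0/1 churn label.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# First carve off 30%, then split that part in half: 70/15/15 overall.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

# Fit the scaler on the training data only, then apply it everywhere.
scaler = MinMaxScaler().fit(X_train)
X_train, X_val, X_test = (scaler.transform(a) for a in (X_train, X_val, X_test))

print(X_train.shape, X_val.shape, X_test.shape)
```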
VAE Implementation in Keras
The following code snippets are largely adapted from Agustinus Kristiadi’s blog post, Variational Autoencoder: Intuition and Implementation.
- Define input layer.
- Define encoder layer.
- Encoder model, to encode input into latent variable.
- We use the mean as the output as it is the center point, the representative of the Gaussian.
- We sample from the output of the 2 dense layers.
- Define decoder layer in VAE model.
- Define overall VAE model, for reconstruction and training.
- Define generator model, generate new data given latent variable z.
- Translate our loss into Keras code.
- Start training.
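The two least obvious steps in the list above are sampling the latent variable and writing down the loss. Here is a framework-free NumPy sketch of both. It is not the Keras code from the referenced post: the names sample_z, vae_loss and log_var are illustrative assumptions, and log_var is assumed to be the log-variance produced by the second dense layer of the encoder.

```python
import numpy as np

def sample_z(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def vae_loss(x, x_recon, mu, log_var, tiny=1e-7):
    """Per-sample loss: binary cross-entropy reconstruction term plus KL term."""
    x_recon = np.clip(x_recon, tiny, 1 - tiny)  # avoid log(0)
    recon = -np.sum(x * np.log(x_recon) + (1 - x) * np.log(1 - x_recon), axis=1)
    # KL divergence between N(mu, diag(exp(log_var))) and N(0, I).
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)
    return recon + kl

rng = np.random.default_rng(0)
mu = np.zeros((4, 2))        # batch of 4, 2D latent space
log_var = np.zeros((4, 2))   # unit variance
z = sample_z(mu, log_var, rng)

x = np.array([[0.0, 1.0, 1.0]] * 4)
loss = vae_loss(x, np.full_like(x, 0.5), mu, log_var)
print(z.shape, loss)
```

With mu = 0 and log_var = 0 the KL term vanishes, so the loss in this toy call is the reconstruction term alone.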
The model stopped training after 55 epochs with a batch size of 100 samples.
Evaluation
plt.plot(vae_history.history['loss'])
plt.plot(vae_history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show();
From the above loss plot, we can see that the model has comparable performance on both train and validation data sets, and it seems to converge nicely at the end.
We use reconstruction error to measure how well the decoder is performing. Autoencoders are trained to minimize reconstruction error, which we plot below:
x_train_encoded = encoder.predict(X_train)
pred_train = decoder.predict(x_train_encoded)
mse = np.mean(np.power(X_train - pred_train, 2), axis=1)
error_df = pd.DataFrame({'recon_error': mse,
                         'churn': y_train})
plt.figure(figsize=(10,6))
sns.kdeplot(error_df.recon_error[error_df.churn==0], label='not churn', shade=True, clip=(0,10))
sns.kdeplot(error_df.recon_error[error_df.churn==1], label='churn', shade=True, clip=(0,10))
plt.xlabel('reconstruction error');
plt.title('Reconstruction error - Train set');
x_val_encoded = encoder.predict(X_val)
pred = decoder.predict(x_val_encoded)
mseV = np.mean(np.power(X_val - pred, 2), axis=1)
error_df = pd.DataFrame({'recon_error': mseV,
                         'churn': y_val})
plt.figure(figsize=(10,6))
sns.kdeplot(error_df.recon_error[error_df.churn==0], label='not churn', shade=True, clip=(0,10))
sns.kdeplot(error_df.recon_error[error_df.churn==1], label='churn', shade=True, clip=(0,10))
plt.xlabel('reconstruction error');
plt.title('Reconstruction error - Validation set');
Latent Space Visualization
We can plot customers in the 2D latent space, colored by churn status, to see whether churned and not-churned customers separate into distinct clusters.
x_train_encoded = encoder.predict(X_train)
plt.scatter(x_train_encoded[:, 0], x_train_encoded[:, 1],
            c=y_train, alpha=0.6)
plt.title('Train set in latent space')
plt.show();
x_val_encoded = encoder.predict(X_val)
plt.scatter(x_val_encoded[:, 0], x_val_encoded[:, 1],
            c=y_val, alpha=0.6)
plt.title('Validation set in latent space')
plt.show();
Prediction on the validation set
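The code in this section relies on a fitted classifier named clf, whose definition is not shown in the text. As a sketch under that gap, one plausible choice is a simple classifier trained on the 2D latent codes. LogisticRegression and the synthetic latent codes below are assumptions, not the model from the notebook. The sketch also computes an AUC from predict_proba scores, since an ROC curve built from hard 0/1 predictions has only a single operating point.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins for encoder.predict(X_train) and y_train.
rng = np.random.default_rng(0)
x_train_encoded = np.vstack([
    rng.normal(loc=-1.0, size=(200, 2)),   # "not churn" cluster
    rng.normal(loc=1.0, size=(200, 2)),    # "churn" cluster
])
y_train = np.array([0] * 200 + [1] * 200)

# Hypothetical classifier on the latent codes.
clf = LogisticRegression()
clf.fit(x_train_encoded, y_train)

labels = clf.predict(x_train_encoded)               # hard 0/1 predictions
scores = clf.predict_proba(x_train_encoded)[:, 1]   # probability of churn
auc_from_scores = roc_auc_score(y_train, scores)
print(round(auc_from_scores, 3))
```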
x_val_encoded = encoder.predict(X_val)
fpr, tpr, thresholds = roc_curve(y_val, clf.predict(x_val_encoded))
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, label='AUC = %0.4f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.001, 1])
plt.ylim([0, 1.001])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show();
print('Accuracy:')
print(accuracy_score(y_val, clf.predict(x_val_encoded)))
print("Confusion Matrix:")
print(confusion_matrix(y_val,clf.predict(x_val_encoded)))
print("Classification Report:")
print(classification_report(y_val,clf.predict(x_val_encoded)))
Prediction on the test set
x_test_encoded = encoder.predict(X_test)
fpr, tpr, thresholds = roc_curve(y_test, clf.predict(x_test_encoded))
roc_auc = auc(fpr, tpr)
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, label='AUC = %0.4f'% roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.001, 1])
plt.ylim([0, 1.001])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show();
print('Accuracy:')
print(accuracy_score(y_test, clf.predict(x_test_encoded)))
print("Confusion Matrix:")
print(confusion_matrix(y_test,clf.predict(x_test_encoded)))
print("Classification Report:")
print(classification_report(y_test,clf.predict(x_test_encoded)))
That's it! The Jupyter notebook can be found on GitHub. Happy Monday!
References: