
How To Predict Customer Churn From Your Website Logs?


The Practical Guide for Churn Analysis & Prediction of the Music Streaming Service At Scale (PySpark and Plotly)

Photo by bruce mars on Unsplash

· Introduction · How to Predict Customer Churn? · Step 1: Clean the Website Log Data · Step 2: Transform Your Website Logs to a User-Log Dataset · Step 3: Explore the Data for Two User Groups (Churned vs. Stayed) · Step 4: Predict Customer Churn Using ML Algorithms · Conclusion

Introduction

Many online-based companies earn a huge portion of their revenue from the subscription model, so it is very important to track how many customers stop using their products or services. Customer churn occurs when existing customers cancel their subscription. If we can build a model to predict whether a current customer will churn (1) or not (0), companies can prevent the loss of subscribers by offering more promotions or improving their services. If you want to learn more about customer churn, please check this blog.

Among many customer-facing businesses, music streaming services like Spotify can use their huge user-interaction data across their websites and applications to predict customer churn. Since website logs are "big data" in general, it is essential to analyze them and predict churn at scale using a big data library like Spark. In this article, I would like to guide you through the churn prediction project at a high level and share some visualization and prediction results. If you want to jump into the practical tutorial, please check my Kaggle post (you can play with the notebook online by copying it) or my GitHub repository (fork or download the code to your local system).

Do you want to see my code first? Please check my Kaggle Notebook or my GitHub repository (suhongkim/Churn-Prediciton-At-Scale).



How to Predict Customer Churn?

To show how to build the prediction model in practice, we are going to use the fabricated website logging data of a virtual music streaming company called "Sparkify," provided by Udacity’s Data Science Nanodegree. First, we need to transform these messy website logs into a clean, aggregated user-log dataset. Then, we’ll analyze the features in the dataset related to customer churn using the Plotly visualization libraries. Lastly, we’ll extract meaningful features and select proper machine learning algorithms to predict customer churn. Let’s dive in!

Step 1: Clean the Website Log Data

Let’s look at the schema of the website-log dataset and check the descriptions to understand what kinds of information are collected from the website. Note that the bold column names are related to customer churn, while the others describe website logging information.

| Column        | Type   | Description                              |
|---------------|--------|------------------------------------------|
| **ts**        | long   | the timestamp of the user log            |
| sessionId     | long   | an identifier for the current session    |
| auth          | string | the authentication log                   |
| itemInSession | long   | the number of items in a single session  |
| method        | string | the HTTP request method (PUT/GET)        |
| status        | long   | the HTTP status code                     |
| userAgent     | string | the name of the HTTP user agent          |
| **userId**    | string | the user ID                              |
| **gender**    | string | the user gender (F/M)                    |
| **location**  | string | the user location in the US              |
| firstName     | string | the user's first name                    |
| lastName      | string | the user's last name                     |
| **registration** | long | the timestamp when the user registered  |
| **level**     | string | the subscription level (free, paid)      |
| **artist**    | string | the name of the artist played by users   |
| **song**      | string | the name of the song played by users     |
| **length**    | double | the length of a song in seconds          |
| **page**      | string | the page name visited by users           |
** the categories of page: Home, Login, Logout, Settings, Save Settings, About, NextSong, Thumbs Up, Thumbs Down, Add to Playlist, Add Friend, Roll Advert, Upgrade, Downgrade, Help, Submit Downgrade, Cancel, Cancellation Confirmation
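
For reference, here is a minimal way to load the raw logs and inspect this schema with PySpark (the file name is an assumption based on the Udacity mini dataset, so adjust the path for your environment):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Sparkify-Churn").getOrCreate()

# Assumed file name for the 128MB mini dataset; replace with your own path.
df = spark.read.json("mini_sparkify_event_data.json")
df.printSchema()
print((df.count(), len(df.columns)))  # (286500, 18) for the mini dataset
```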

I dropped the records where userId is empty or null, then added a target column (churn) and three additional columns, as listed below (a short PySpark sketch follows the list).

  • churn (integer): the churn status (1: churned, 0: stayed), defined using Cancellation Confirmation events for both paid and free users
  • ds (date): the date stamp converted from the timestamp data ts
  • dsRegistration (date): the date stamp converted from the timestamp data registration
  • locCity (string): the city name extracted from the location data
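
Here is a minimal PySpark sketch of these cleaning and labeling steps (the expressions are simplified; the exact versions are in my notebook):

```python
from pyspark.sql import functions as F, Window

# Drop logs without a valid userId
df_clean = df.filter((F.col("userId").isNotNull()) & (F.col("userId") != ""))

# Mark every log of a user who ever reached "Cancellation Confirmation" as churned
user_window = Window.partitionBy("userId")
df_clean = (
    df_clean
    .withColumn(
        "churn",
        F.max(F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0))
         .over(user_window))
    # ts and registration are epoch milliseconds -> date stamps
    .withColumn("ds", F.to_date(F.from_unixtime(F.col("ts") / 1000)))
    .withColumn("dsRegistration", F.to_date(F.from_unixtime(F.col("registration") / 1000)))
    # location looks like "Bridgeport-Stamford-Norwalk, CT" -> keep the leading city name
    .withColumn("locCity", F.split(F.col("location"), "[-,]").getItem(0))
)
```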
Below is a sample of the new columns:
+------+-----+----------+--------------+--------------+
|userId|churn|        ds|dsRegistration|       locCity|
+------+-----+----------+--------------+--------------+
|100010|    0|2018-10-08|    2018-09-27|    Bridgeport|
|200002|    0|2018-10-01|    2018-09-06|       Chicago|
|   125|    1|2018-10-12|    2018-08-01|Corpus Christi|
+------+-----+----------+--------------+--------------+

Step 2: Transform Your Website Logs to a User-Log Dataset

In order to predict the churn status for each user, the website-log data needs to be transformed into one row per user. First, we need to discard the columns that are not related to customer churn events, such as session logs and user names. Then we can transform the data based on userId; there are two types of data: user information and user activities. The user information columns in our data are churn, gender, level, and locCity, which must be the same for each user.

+------+-----+------+-----+--------------+
|userId|churn|gender|level|       locCity|
+------+-----+------+-----+--------------+
|100010|    0|     F| free|    Bridgeport|
|200002|    0|     M| free|       Chicago|
|   125|    1|     M| free|Corpus Christi|
+------+-----+------+-----+--------------+

For user activity data, we need to aggregate the logging data to create some meaningful features. I listed the new columns that I added to the user-log dataset below, followed by a sample and a code sketch of the aggregation.

  • lifeTime (long): the user lifetime, i.e., how long a user has been active on the website, measured in days from the registration date to the last active log date
  • playTime (double): the song playtime, the average time (sec) of the total songs played by a user on the NextSong page
  • numSongs (long): the total number of song names for each user
  • numArtists (long): the total number of artist names for each user
  • numPage_* (long): the total number of page visits for each page and each user. Note that the Cancel and Cancellation Confirmation pages are not included in this feature group because they are used for generating the churn labels. Also, Login and Register have zero counts for all users in our dataset, so they are automatically dropped.
+--------------+-----------------+--------+----------+-------------+
|lifeTime(days)|    PlayTime(sec)|numSongs|numArtists|    numPage_*|
+--------------+-----------------+--------+----------+-------------+
|            55|318224.4166666667|     269|       252|            1|
|            70|187044.0476190476|     378|       339|            3|
|            72|           1762.0|       8|         8|            0|
+--------------+-----------------+--------+----------+-------------+
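
A rough sketch of this per-user aggregation is shown below (simplified to a single page-count column; the version in my repository builds one numPage_* column per page, and the exact playTime formula may differ):

```python
from pyspark.sql import functions as F

df_user = (
    df_clean.groupBy("userId")
    .agg(
        F.first("churn").alias("churn"),
        F.first("gender").alias("gender"),
        F.last("level").alias("level"),        # simplified: latest level in log order
        F.first("locCity").alias("locCity"),
        # days between the registration date and the last active log date
        F.datediff(F.max("ds"), F.first("dsRegistration")).alias("lifeTime"),
        # seconds of songs played on the NextSong page (a plain sum for illustration)
        F.sum(F.when(F.col("page") == "NextSong", F.col("length"))).alias("playTime"),
        F.countDistinct("song").alias("numSongs"),
        F.countDistinct("artist").alias("numArtists"),
        # one example of a numPage_* feature
        F.sum(F.when(F.col("page") == "Thumbs Down", 1).otherwise(0)).alias("numPage_ThumbsDown"),
    )
)
```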

Before jumping into the visualization, let’s summarize what we’ve done so far with a few numbers!

The shape of the raw data: (286500, 18)
The shape of the clean data: (278154, 18)
The shape of the website-log data: (278154, 22)
The shape of the User-Log data(df_user): (225, 26)
The number of users (unique userId): 225
The count of churned users (1): 52
The count of Not-Churned users (0): 173
The logging period: 2018-10-01 - 2018-12-03

Step 3: Explore the Data for Two User Groups (Churned vs. Stayed)

It is very important to visualize the features related to our target value so that we can build better intuition about the mechanisms behind the customer churn event. Let’s start by revealing how the number of churned users changes per month (the logging period of this mini-dataset is two months long).

The total number of users decreased from 213 in October to 187 in November. To be specific, the number of stayed users slightly increased by 4, while the number of churned users decreased by more than half (55 in October vs. 22 in November), showing that the Sparkify service succeeded in retaining its existing customers. If we had more data about Sparkify’s business and activities, we could analyze which factors led to fewer customer churns based on this observation.

We can observe that male subscribers tend to churn more.

Users at the free subscription level can effectively be in a churned state when they request Submit Downgrade but never reach the Cancellation Confirmation page (note that we defined the churn status only when a user visits the Cancellation Confirmation page). Thus, the past subscribers in this category can be prime targets for marketers who want to retain more users on this website.

The churn rate shown in the scatter plot above is the proportion of churned users to the population of each city. The churn rate per city shows many extreme values like 0% or 100% since this dataset is synthesized, which leads to the conclusion that we need to exclude this feature from our prediction model.

So far, we’ve explored the categorical features such as gender, subscription level, and location (the first graph, "time analysis", is just for understanding the trend of churned users). As you saw in Step 2, we generated new numerical features to describe user activities on the website. Let’s visualize them!
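
As an example of this kind of comparison, here is a small Plotly sketch that overlays the lifeTime distribution for the two groups (it assumes the user-log DataFrame df_user is small enough to collect with toPandas; the charts in my notebook may look different):

```python
import plotly.graph_objects as go

pdf = df_user.toPandas()  # 225 users, so collecting to pandas is cheap

fig = go.Figure()
for churn_value, name in [(0, "Stayed"), (1, "Churned")]:
    fig.add_trace(go.Histogram(
        x=pdf.loc[pdf["churn"] == churn_value, "lifeTime"],
        name=name,
        opacity=0.6,
    ))
fig.update_layout(
    barmode="overlay",
    title="User lifetime (days) by churn group",
    xaxis_title="lifeTime (days)",
    yaxis_title="number of users",
)
fig.show()
```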

The churned users tend to have both a shorter lifetime and a shorter song playtime than the stayed users.

The churned group tends to choose a slightly smaller variety of songs and artists (both graphs look similar, so these two features likely have a high correlation).

Since we made 17 different features from the page column in the raw dataset, I picked some important ones to visualize in the graph above. Among them, I noticed that numPage_SubmitDowngrade seems to have a discrete distribution compared to the others, so I decided to change this feature from a numerical variable to a categorical variable for the prediction model.

Step 4: Predict Customer Churn Using ML Algorithms

From the visualization, we can finally select our features for the prediction model by modifying the user-log dataset as below (a code sketch follows the list).

  • locCity will be dropped from the feature set since it has many extreme values
  • numSongs and numArtists are highly correlated (0.99), so only numSongs will be kept in the feature set
  • numPage_SubmitDowngrade will be converted to the categorical feature page_SubmitDowngrade, having only two values: ‘visited’ or ‘none’
  • The features related to the page column are highly correlated with each other, so I selected only 7 of them: numPage_About, numPage_Error, numPage_RollAdvert, numPage_SaveSettings, numPage_SubmitUpgrade, numPage_ThumbsDown, numPage_Upgrade
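
A minimal sketch of these modifications, assuming the column names from the tables above (the list of dropped numPage_* columns is abbreviated):

```python
from pyspark.sql import functions as F

df_feat = (
    df_user
    .withColumnRenamed("churn", "label")
    # discrete page count -> a two-level categorical feature
    .withColumn(
        "page_SubmitDowngrade",
        F.when(F.col("numPage_SubmitDowngrade") > 0, "visited").otherwise("none"))
    # drop locCity, the correlated numArtists, the original count column,
    # and (not shown here) the unselected numPage_* columns
    .drop("locCity", "numArtists", "numPage_SubmitDowngrade")
)
```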
Image By Suhong Kim
The schema of df_feat
 |-- userId: string, User Identification Info (not used for feature)
 |-- label: integer, the target(Churn) for the prediction model
 <Categorical Features>
 |-- gender: string ('F' or 'M')
 |-- level: string ('paid' or 'free')
 |-- page_SubmitDowngrade: string ('visited' or 'none')
 <Numerical Features> 
 |-- lifeTime: integer 
 |-- playTime: double 
 |-- numSongs: long 
 |-- numPage_About: long 
 |-- numPage_Error: long 
 |-- numPage_RollAdvert: long 
 |-- numPage_SaveSettings: long 
 |-- numPage_SubmitUpgrade: long 
 |-- numPage_ThumbsDown: long 
 |-- numPage_Upgrade: long 

Let’s start building the pipeline for the prediction model using the ML libraries in Spark. For better cross-validation, I combined all feature transformations and an estimator into one pipeline and fed it into the CrossValidator. There are three main parts of the pipeline, sketched in code after the list.

  1. Feature Transformation: categorical variables are transformed into one-hot encoded vectors by StringIndexer and OneHotEncoder. Then, the categorical vectors and numerical variables are assembled into a dense feature vector using VectorAssembler.
  2. Feature Importance Selection: I built a custom FeatureSelector class to extract only the important features using a tree-based estimator. This step is optional, so I didn’t use it for the Logistic Regression or LinearSVC models.
  3. Estimator: The final step uses ML algorithms to estimate the churn label for each user.
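
A condensed sketch of such a pipeline, using the Random Forest estimator as an example (the custom FeatureSelector stage is omitted, the stage parameters are illustrative, and train_df/test_df stand for a hypothetical split of df_feat):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

cat_cols = ["gender", "level", "page_SubmitDowngrade"]
num_cols = ["lifeTime", "playTime", "numSongs", "numPage_About", "numPage_Error",
            "numPage_RollAdvert", "numPage_SaveSettings", "numPage_SubmitUpgrade",
            "numPage_ThumbsDown", "numPage_Upgrade"]

# 1. Feature transformation: index + one-hot encode categoricals, then assemble
indexers = [StringIndexer(inputCol=c, outputCol=c + "Idx", handleInvalid="keep")
            for c in cat_cols]
encoder = OneHotEncoder(inputCols=[c + "Idx" for c in cat_cols],
                        outputCols=[c + "Vec" for c in cat_cols])
assembler = VectorAssembler(inputCols=[c + "Vec" for c in cat_cols] + num_cols,
                            outputCol="features")

# 3. Estimator (step 2, the feature-importance selector, is skipped in this sketch)
rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=20)

pipeline = Pipeline(stages=indexers + [encoder, assembler, rf])
model = pipeline.fit(train_df)

# F1 score on the held-out split, since the classes are imbalanced
f1_eval = MulticlassClassificationEvaluator(labelCol="label", metricName="f1")
print("test_f1:", f1_eval.evaluate(model.transform(test_df)))
```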

With this pipeline, I tried five different estimators with their default parameters and selected the best algorithm based on the F1 score, because our data is imbalanced. As the results below show, I chose RandomForestClassifier for our prediction model since it had the highest validation score (0.78).

<The result of the model selection>
--------------------
LogisticRegressionModel: numClasses=2, numFeatures=16
train_f1: 0.8275, test_f1: 0.7112
--------------------
LinearSVCModel: numClasses=2, numFeatures=16
train_f1: 0.8618, test_f1: 0.7472
--------------------
DecisionTreeClassificationModel: depth=5, numNodes=31, numFeatures=7
idx                name     score
0    0            lifeTime  0.439549
1    1            numSongs  0.207649
2    2  numPage_ThumbsDown  0.137043
3    3  numPage_RollAdvert  0.096274
4    4            playTime  0.062585
5    5       numPage_Error  0.045681
6    6       numPage_About  0.011218
train_f1: 0.9373, test_f1: 0.7667
--------------------
RandomForestClassificationModel:numTrees=20, numFeatures=12
    idx                   name     score
0     0               lifeTime  0.315996
1     1               playTime  0.174795
2     2     numPage_ThumbsDown  0.101804
5     5               numSongs  0.089395
3     3     numPage_RollAdvert  0.080125
6     6        numPage_Upgrade  0.053891
4     4          numPage_About  0.053557
7     7          numPage_Error  0.051073
8     8   numPage_SaveSettings  0.033237
9     9  numPage_SubmitUpgrade  0.024314
11   11            genderVec_M  0.012589
10   10          levelVec_paid  0.009225
train_f1: 0.9086, test_f1: 0.7788
--------------------
GBTClassificationModel: numTrees=20, numFeatures=11
    idx                   name     score
0     0     numPage_ThumbsDown  0.276418
1     1               lifeTime  0.191477
2     2               numSongs  0.104416
4     4     numPage_RollAdvert  0.080323
5     6          numPage_About  0.074554
9     5          levelVec_free  0.068573
3     3               playTime  0.067631
6     7        numPage_Upgrade  0.050553
7     8          numPage_Error  0.042485
10    9            genderVec_M  0.029921
8    10  numPage_SubmitUpgrade  0.013649
train_f1: 1.0, test_f1: 0.7615

Finally, I ran cross-validation to tune the hyper-parameters of the RandomForestClassifier. Since our dataset is very small, you can observe an almost perfect training score, indicating that the model is overfitted. Thus, I selected cross-validation parameter maps that make the model less complex than the default model used for the model selection above (numTrees=20). The result shows that the model with 10 trees and 16 max bins performs slightly better, but it didn’t fully overcome the overfitting problem. I assume this problem can be solved by adding more data.
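
A sketch of this tuning step with CrossValidator, reusing the pipeline and rf objects from the sketch above (the grid values mirror the parameters discussed here, but the exact grid and the number of folds are assumptions):

```python
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

param_grid = (ParamGridBuilder()
              .addGrid(rf.numTrees, [10, 20])
              .addGrid(rf.maxBins, [16, 32])
              .build())

cv = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=param_grid,
    evaluator=MulticlassClassificationEvaluator(labelCol="label", metricName="f1"),
    numFolds=3,  # assumed fold count
)
cv_model = cv.fit(train_df)
best_rf = cv_model.bestModel.stages[-1]  # the fitted RandomForestClassificationModel
```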

RandomForestClassificationModel: numTrees=10, numFeatures=11
Best parameters:[('bootstrap', True), ('cacheNodeIds', False), ('checkpointInterval', 10), ('featureSubsetStrategy', 'auto'), ('impurity', 'gini'), ('maxBins', 16), ('maxDepth', 5), ('numTrees', 10), ('maxMemoryInMB', 256), ('minInfoGain', 0.0), ('minInstancesPerNode', 1), ('minWeightFractionPerNode', 0.0), ('seed', -5400988877677221036), ('subsamplingRate', 1.0)]
    idx                             name     score
0     0                         lifeTime  0.346846
1     1                         playTime  0.160187
2     2                         numSongs  0.104921
5     5               numPage_ThumbsDown  0.102716
6     6                    numPage_About  0.075681
4     4                  numPage_Upgrade  0.075326
3     3               numPage_RollAdvert  0.048552
7     7                    numPage_Error  0.044421
8     8             numPage_SaveSettings  0.028647
9     9                    levelVec_free  0.009174
10   10  page_SubmitDowngradeVec_visited  0.003529
train_f1: 0.9287, valid_f1: 0.7608 test_f1: 0.7255

Conclusion

In this article, we tackled one of the most challenging and common business problems: how to predict customer churn. From the nasty, huge website logs, we extracted several meaningful features for each user and visualized them for two user groups (churned vs. stayed) for further analysis. Finally, we built an ML pipeline, including the feature transformations and an estimator, and fed it into the cross-validator for model selection and hyper-parameter tuning. The final model shows a fairly high testing score (F1 score: 0.73), but it also suffers from overfitting due to the small dataset size (128MB). Since Udacity provides the full dataset (12GB) on the AWS cloud, I plan to deploy this project on a Spark cluster soon to handle the overfitting problem.

Of course, there are many things we can do to improve this model without adding more data. First, most features are aggregated without regard to the time factor. The logs are collected over two months, so it would be better to emphasize more recent activity with methods like a weighted sum. Also, we can apply some strategies to handle the data imbalance (this blog will help you get some ideas). Furthermore, we could model this problem as a time series, because churn rates should be reported periodically to business stakeholders.

I hope this project gives you a tutorial on how to deal with large data to solve a real-world problem using data science and machine learning skills. Plus, it can be good practice for playing with the Spark and Plotly libraries. Thanks for reading, and I’d love to connect with you anytime via my LinkedIn!

