Customer churn occurs when a customer ends their relationship with a company. It has a considerable impact on a business's health and long-term success, as it can significantly lower revenue and profit. According to Forrester research, it costs 5 times more to acquire a new customer than to keep an existing one. It is therefore worth companies investing time and resources in identifying customers who are likely to churn, and addressing them before they decide to stop using the companies' services.
In this article, we will explore the user log data of a fictitious music streaming service called Sparkify, and build supervised machine learning models to predict whether a customer is likely to churn.

Our Datasets
We have two datasets for this project: the full dataset, with a size of 12GB, and a small subset of it, with a size of 128MB. In this article we will use the 128MB version, which is small enough to fit into the memory of a local machine. In order to train on the full dataset, we would need to deploy a cluster on a cloud service like AWS.
Let’s take a quick look at all the fields in the dataset:
df.printSchema()
root
|-- artist: string (nullable = true)
|-- auth: string (nullable = true)
|-- firstName: string (nullable = true)
|-- gender: string (nullable = true)
|-- itemInSession: long (nullable = true)
|-- lastName: string (nullable = true)
|-- length: double (nullable = true)
|-- level: string (nullable = true)
|-- location: string (nullable = true)
|-- method: string (nullable = true)
|-- page: string (nullable = true)
|-- registration: long (nullable = true)
|-- sessionId: long (nullable = true)
|-- song: string (nullable = true)
|-- status: long (nullable = true)
|-- ts: long (nullable = true)
|-- userAgent: string (nullable = true)
|-- userId: string (nullable = true)
The page field gives us more details about user behaviors:
df.select('page').dropDuplicates().sort('page').show()
+--------------------+
| page|
+--------------------+
| About|
| Add Friend|
| Add to Playlist|
| Cancel|
|Cancellation Conf...|
| Downgrade|
| Error|
| Help|
| Home|
| Logout|
| NextSong|
| Roll Advert|
| Save Settings|
| Settings|
| Submit Downgrade|
| Submit Upgrade|
| Thumbs Down|
| Thumbs Up|
| Upgrade|
+--------------------+
Since userId is the field that uniquely identifies users, logs with an empty userId cannot be tied to a customer and are not helpful for predicting who will churn. We will drop records with a missing userId from our dataset.
df = df.where(col('userId').isNotNull())
df.count()
278154
Exploratory Data Analysis
Before we can move on to comparing users who churned with those who did not, we first need to define churn. A narrow definition would probably only include people who have deleted their account, which is captured in our data as cases where the page field takes on the value “Cancellation Confirmation”.
We will use the Cancellation Confirmation page event to define churn, and create a new column, churn, to label churned users for our model.
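The labeling code itself is not shown above; here is a minimal sketch of one way to do it, assuming the df loaded earlier (churn_event is a helper column introduced here for illustration):
from pyspark.sql import functions as F
from pyspark.sql import Window

# flag the Cancellation Confirmation events, then propagate the flag to all of a user's rows
df = df.withColumn('churn_event',
                   F.when(F.col('page') == 'Cancellation Confirmation', 1).otherwise(0))
df = df.withColumn('churn', F.max('churn_event').over(Window.partitionBy('userId')))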
Number of users who have churned: 52
Number of users who have not churned: 173
We can see that the churn and non-churn classes are imbalanced in this dataset. We will therefore use the F1 score, the harmonic mean of precision and recall, to evaluate our models later, as it is less sensitive to class imbalance than accuracy.
Impact of gender
19 percent of female users churned, while 26 percent of male users churned.
Impact of level (paid/free)
23 percent of free users churned, while 21 percent of paid users churned.
Impact of time users spent listening to songs

Impact of number of songs played

Impact of number of days since registration

Impact of pages viewed

Feature Engineering
Now that we are familiar with the data, we can build the features that look promising, combining per-user aggregates with counts of each page event, and use them to train the model. A sketch of how these features might be computed is shown below; the resulting schema follows.
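A minimal sketch of the per-user aggregation, assuming the labeled df from above (the pivot produces one count column per page event; renaming those columns to the n_ prefix and encoding level/gender are omitted for brevity):
from pyspark.sql import functions as F

# one count column per page event, missing combinations filled with 0
page_counts = df.groupBy('userId').pivot('page').count().fillna(0)
# per-user listening statistics
user_stats = df.groupBy('userId').agg(
    F.first('churn').alias('churn'),
    F.sum('length').alias('playTime'),
    F.countDistinct('song').alias('numSongs'),
    F.countDistinct('artist').alias('numArtist'),
    F.countDistinct('sessionId').alias('numSession'))
user_df = user_stats.join(page_counts, on='userId')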
user_df.printSchema()
root
|-- userId: string (nullable = true)
|-- churn: long (nullable = true)
|-- n_act: long (nullable = false)
|-- n_about: long (nullable = true)
|-- n_addFriend: long (nullable = true)
|-- n_addToPlaylist: long (nullable = true)
|-- n_cancel: long (nullable = true)
|-- n_downgrade: long (nullable = true)
|-- n_error: long (nullable = true)
|-- n_help: long (nullable = true)
|-- n_home: long (nullable = true)
|-- n_logout: long (nullable = true)
|-- n_rollAdvert: long (nullable = true)
|-- n_saveSettings: long (nullable = true)
|-- n_settings: long (nullable = true)
|-- n_submitDowngrade: long (nullable = true)
|-- n_submitUpgrade: long (nullable = true)
|-- n_thumbsDown: long (nullable = true)
|-- n_thumbsUp: long (nullable = true)
|-- n_upgrade: long (nullable = true)
|-- playTime: double (nullable = true)
|-- numSongs: long (nullable = false)
|-- numArtist: long (nullable = false)
|-- active_days: double (nullable = true)
|-- numSession: long (nullable = false)
|-- encoded_level: double (nullable = true)
|-- encoded_gender: double (nullable = true)
Multicollinearity inflates the standard errors of the coefficients, which in turn means that the coefficients of some independent variables may be found not to be significantly different from 0 when they should be. In other words, by over-inflating the standard errors, multicollinearity can make variables appear statistically insignificant that would be significant with lower standard errors.

We can see that some variable pairs have correlation coefficients over 0.8, which means those variables are highly correlated. To deal with the multicollinearity, we will try two approaches here (the correlation computation behind the heat map is sketched after the list):
- Remove correlated features manually
- PCA
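For reference, the pairwise correlations can be computed with pyspark.ml.stat.Correlation; a minimal sketch, assuming the user_df features above (corr_vec is a name chosen here for illustration):
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

num_cols = [c for c in user_df.columns if c not in ('userId', 'churn')]
vec_df = VectorAssembler(inputCols=num_cols, outputCol='corr_vec').transform(user_df)
# Correlation.corr returns a DataFrame with a single Matrix cell; convert it to a numpy array
corr_matrix = Correlation.corr(vec_df, 'corr_vec').head()[0].toArray()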
Remove correlated features manually
We manually remove the variables with high correlation coefficients in the heat map above, and retain the rest of the features:

user_df_m.printSchema()
root
|-- userId: string (nullable = true)
|-- churn: long (nullable = true)
|-- n_about: long (nullable = true)
|-- n_error: long (nullable = true)
|-- n_rollAdvert: long (nullable = true)
|-- n_saveSettings: long (nullable = true)
|-- n_settings: long (nullable = true)
|-- n_submitDowngrade: long (nullable = true)
|-- n_submitUpgrade: long (nullable = true)
|-- n_thumbsDown: long (nullable = true)
|-- n_upgrade: long (nullable = true)
|-- active_days: double (nullable = true)
|-- numSession: long (nullable = false)
|-- encoded_level: double (nullable = true)
|-- encoded_gender: double (nullable = true)
Transform features with PCA
Principal component analysis (PCA) is a statistical technique that transforms a set of possibly correlated variables into orthogonal, linearly uncorrelated components. It can be used for applications such as data compression, identifying dominant patterns, and extracting important features for machine learning, and it reduces the number of features in the original data without losing much information.
When applying PCA, we first need to combine the features into a vector and standardize our data:
from pyspark.ml.feature import VectorAssembler, StandardScaler, PCA

# Vector Assembler
user_df_pca = user_df
cols = user_df_pca.drop('userId', 'churn').columns
assembler = VectorAssembler(inputCols=cols, outputCol='Features')
user_df_pca = assembler.transform(user_df_pca).select('userId', 'churn', 'Features')

# Standard Scaler
scaler = StandardScaler(inputCol='Features', outputCol='scaled_features_1', withStd=True)
scalerModel = scaler.fit(user_df_pca)
user_df_pca = scalerModel.transform(user_df_pca)
user_df_pca.select(['userId', 'churn', 'scaled_features_1']).show(5, truncate=False)

# PCA: keep the first 10 principal components
pca = PCA(k=10, inputCol=scaler.getOutputCol(), outputCol='pcaFeatures')
pca = pca.fit(user_df_pca)
pca_result = pca.transform(user_df_pca).select('userId', 'churn', 'pcaFeatures')
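To check how much information the 10 components retain, the fitted PCA model exposes the per-component explained variance (a quick check, not part of the original code):
# fraction of variance captured by each of the k=10 components
print(pca.explainedVariance)
print('total variance explained: {:.3f}'.format(sum(pca.explainedVariance.toArray())))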

Model Selection
In order to select a good model for the final tuning, we will compare three different classifier models in Spark’s ML:
- Logistic Regression
- Decision Tree
- Random Forest
Split Data into Train and Test sets
From the distribution of churn shown in the previous section, we know that this is an imbalanced dataset, with only about a quarter of users labeled as churned. To avoid an imbalanced random split, we first build the train set by sampling by label, then subtract it from the whole dataset to get the test set.
# prepare training and test data, sample by label
ratio = 0.7
train_m = m_features_df.sampleBy('churn', fractions={0: ratio, 1: ratio}, seed=123)
test_m = m_features_df.subtract(train_m)
train_pca = pca_features_df.sampleBy('churn', fractions={0: ratio, 1: ratio}, seed=123)
test_pca = pca_features_df.subtract(train_pca)
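As a quick sanity check on the stratified split (not in the original code), we can compare the label counts in each set:
train_m.groupBy('churn').count().show()
test_m.groupBy('churn').count().show()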
Logistic Regression
# imports used for all three classifiers below
from pyspark.ml.classification import LogisticRegression, DecisionTreeClassifier, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# initialize classifier
lr = LogisticRegression(maxIter=10)
# evaluator
evaluator_1 = MulticlassClassificationEvaluator(metricName='f1')
# paramGrid (empty: compare models with default hyperparameters)
paramGrid = ParamGridBuilder().build()
crossval_lr = CrossValidator(estimator=lr,
                             evaluator=evaluator_1,
                             estimatorParamMaps=paramGrid,
                             numFolds=3)

Evaluate with test dataset:
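The fitting and scoring code is not shown here; a minimal sketch, assuming the manually reduced split from above carries the features/label columns the classifier expects (the PCA split is analogous):
# fit with 3-fold cross-validation, then score the held-out test set with F1
cv_model_lr = crossval_lr.fit(train_m)
f1_lr = evaluator_1.evaluate(cv_model_lr.transform(test_m))
print('Logistic Regression test F1: {:.4f}'.format(f1_lr))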

Decision Tree
# initialize classifier
dtc = DecisionTreeClassifier()
# evaluator
evaluator_2 = MulticlassClassificationEvaluator(metricName='f1')
# paramGrid
paramGrid = ParamGridBuilder().build()
crossval_dtc = CrossValidator(estimator=dtc,
                              evaluator=evaluator_2,
                              estimatorParamMaps=paramGrid,
                              numFolds=3)

Evaluate with test dataset:
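The same sketched pattern applies for the decision tree:
cv_model_dtc = crossval_dtc.fit(train_m)
f1_dtc = evaluator_2.evaluate(cv_model_dtc.transform(test_m))
print('Decision Tree test F1: {:.4f}'.format(f1_dtc))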

Random Forest
# initialize classifier
rfc = RandomForestClassifier()
# evaluator
evaluator_3 = MulticlassClassificationEvaluator(metricName='f1')
# paramGrid
paramGrid = ParamGridBuilder().build()
crossval_rfc = CrossValidator(estimator=rfc,
                              evaluator=evaluator_3,
                              estimatorParamMaps=paramGrid,
                              numFolds=3)

Evaluate with test dataset:
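And for the random forest:
cv_model_rfc = crossval_rfc.fit(train_m)
f1_rfc = evaluator_3.evaluate(cv_model_rfc.transform(test_m))
print('Random Forest test F1: {:.4f}'.format(f1_rfc))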

Conclusion
Among all three models, the PCA features outperform the manually reduced features. Although the Logistic Regression model gave us a perfect F1 score and accuracy on the test dataset, it performs much less satisfyingly on the training dataset. Since the Logistic Regression model also performs less well on the full dataset, this perfect F1 score could be due to chance or to the simplicity of the model. Therefore, I would choose the Random Forest classifier for future implementation, whose test and train F1 scores are both around 97%.
Although there are many other models we could try in the future, such as Naive Bayes and Linear SVM, the Random Forest model performs quite well in this case (a 97% F1 score on the small dataset, and a 99.992% F1 score on the full dataset).
There are some other improvements that we could work on in the future:
- A more automated and robust approach to training and evaluating models
- Taking into account users who downgraded their service
- More focus on user behavior features
Thanks for reading!

