Introduction
Sparkify is a popular digital music service, similar to Spotify or Pandora, created by Udacity. The goal of the project is to predict which users are at risk of churning, i.e. cancelling their service. If we can identify these users before they leave, we can offer them discounts or incentives to stay.
Different page actions a user can take are, for example, playing a song, liking or disliking a song, logging in or out, adding a friend or, in the worst case, cancelling the service. What we know about the user is their id, first and last name, gender, whether they are logged in, whether they are a paid user, their location, the timestamp of their registration and which browser they are using. What we know about the songs they are listening to is the name of the artist and the length of the song.

Part I: Example of a user story
Let’s take a single user to better understand the interactions a user can have with the service. Let’s investigate a few actions of user id 2, Natalee Charles from Raleigh, NC.
Row(artist=None, auth='Logged In', firstName='Natalee', gender='F', itemInSession=1, lastName='Charles', length=None, level='paid', location='Raleigh, NC', method='GET', page='Home', sessionId=1928, song=None, status=200, userAgent='"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"', userId='2', timestamp='2018-11-21 18:04:37', registration_timestamp='2018-09-13 02:49:30'),
Row(artist='Octopus Project', auth='Logged In', firstName='Natalee', gender='F', itemInSession=2, lastName='Charles', length=184.42404, level='paid', location='Raleigh, NC', method='PUT', page='NextSong', sessionId=1928, song='What They Found', status=200, userAgent='"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"', userId='2', timestamp='2018-11-21 18:04:45', registration_timestamp='2018-09-13 02:49:30'),
Row(artist='Carla Bruni', auth='Logged In', firstName='Natalee', gender='F', itemInSession=3, lastName='Charles', length=123.48036, level='paid', location='Raleigh, NC', method='PUT', page='NextSong', sessionId=1928, song="l'antilope", status=200, userAgent='"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"', userId='2', timestamp='2018-11-21 18:07:49', registration_timestamp='2018-09-13 02:49:30'),
...
Row(artist=None, auth='Logged In', firstName='Natalee', gender='F', itemInSession=106, lastName='Charles', length=None, level='paid', location='Raleigh, NC', method='PUT', page='Add Friend', sessionId=1928, song=None, status=307, userAgent='"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36"', userId='2', timestamp='2018-11-21 23:30:04', registration_timestamp='2018-09-13 02:49:30')
The first action is also the first action within the session: visiting the home page. The second action is listening to a song, which allows us to gather information about the song itself: artist, title and length. The third action is listening to a different song (Carla Bruni: l’antilope). The last action of the whole session is adding a friend.
Part II: Identifying features
The definition used for a churned user is a user who submitted a cancellation confirmation. We need to find features which show a big difference between the two user types: users who churned and users who did not. The difficulty is that the data contains far more non-churn events (233290) than churn events (44864), which means we have to find features that remain comparable despite the imbalanced data.
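As a hedged sketch (assuming the event log is loaded into a PySpark DataFrame called df with the columns shown in Part I), the churn label could be derived like this:

from pyspark.sql import Window
from pyspark.sql import functions as F

# 1 for a "Cancellation Confirmation" event, 0 otherwise
cancellation_flag = F.when(
    F.col("page") == "Cancellation Confirmation", 1
).otherwise(0)

# Propagate the flag to every event of the same user:
# churn = 1 if the user ever submitted a cancellation confirmation.
df = df.withColumn(
    "churn", F.max(cancellation_flag).over(Window.partitionBy("userId"))
)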

The first idea is to compare items per session, because the duration of a session is not influenced by the number of users, only by the number of items in a session. The most important item on a music service platform is playing a song, which leads to the decision to compare the number of NextSong events between the two user types. Churned users listen to at most 708 songs while non-churned users listen to up to 992 songs. The mean number of songs listened to is about 70 for a churned user and 105 for a non-churned user. This is a good feature because the two user types differ clearly.
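A minimal sketch of this comparison, reusing the hypothetical df and churn column from above:

from pyspark.sql import functions as F

# Count the NextSong events per user ...
songs_per_user = (
    df.filter(F.col("page") == "NextSong")
      .groupBy("userId", "churn")
      .count()
)

# ... and compare the distributions of the two user types.
songs_per_user.groupBy("churn").agg(
    F.mean("count").alias("mean_songs"),
    F.max("count").alias("max_songs"),
).show()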

The second idea is to inspect proportional differences between the two user types. There is already a user type definition in the data: the level (free or paid user). We want to investigate whether the proportional distribution of levels differs between churned and non-churned users. We can see that non-churned users have a greater proportion of paid users, while churned users have a greater proportion of free users. This is a great feature, too, because a free user is more likely to churn than a paid user in this case.
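Again as a sketch under the same assumptions, the proportions could be computed like this:

from pyspark.sql import functions as F

# Events per user type and level ...
level_counts = df.groupBy("churn", "level").count()

# ... divided by the total number of events per user type.
totals = df.groupBy("churn").agg(F.count("*").alias("total"))
(level_counts.join(totals, on="churn")
             .withColumn("proportion", F.col("count") / F.col("total"))
             .orderBy("churn", "level")
             .show())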
Part III: Modeling
We use Logistic Regression to find the best model for predicting churn. Because we have a binary decision, we use the BinaryClassificationEvaluator to evaluate the results. Since the churned users are a fairly small subset, we use the F1 score (in combination with accuracy) as the metric to optimize. We use three different input data sets to test whether, for example, weighting the data before training improves the F1 score. Another test is removing duplicates.
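As an illustration, the weighted variant could look like the following sketch; train and test are assumed to already contain a features vector and a label column (1 = churn), and the F1/accuracy numbers below can be computed with Spark's MulticlassClassificationEvaluator:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import functions as F

# Give the minority class (churn) more influence during training.
balance = train.filter("label = 1").count() / train.count()
train = train.withColumn(
    "weight", F.when(F.col("label") == 1, 1 - balance).otherwise(balance)
)

lr = LogisticRegression(weightCol="weight", maxIter=10)
model = lr.fit(train)

# F1 on the test set; metricName="accuracy" works the same way.
evaluator = MulticlassClassificationEvaluator(metricName="f1")
print(evaluator.evaluate(model.transform(test)))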

In the first round, the best accuracy and F-measure were achieved by the full and the weighted data sets, which both produced exactly the same result. The unique (deduplicated) data set, however, achieved the best F-measure for label 1.
All data
Accuracy: 0.9023548583351629
F-measure: 0.8560382799176114
F-measure by label:
label 0: 0.9486714367527147
label 1: 0.0
Weighted data
Accuracy: 0.9023548583351629
F-measure: 0.8560382799176114
F-measure by label:
label 0: 0.9486714367527147
label 1: 0.0
Unique data
Accuracy: 0.6666666666666666
F-measure: 0.5859374999999999
F-measure by label:
label 0: 0.7906666666666665
label 1: 0.18229166666666666
In the second round, we used the best threshold, chosen as the maximum of the F score over all thresholds (a sketch of this selection follows the results below). The total accuracy and F-measure dropped, but the F-measure for label 1 rose. The best label 1 F-measure was achieved by the unique data set, and the best total accuracy and F-measure by the other two data sets.
All data
Accuracy: 0.7611244603498374
F-measure: 0.7955698747928712
F-measure by label:
label 0: 0.8594335689341074
label 1: 0.20539494445808643
Weighted data
Accuracy: 0.7611244603498374
F-measure: 0.7955698747928712
F-measure by label:
label 0: 0.8594335689341074
label 1: 0.20539494445808643
Unique data
Accuracy: 0.5859872611464968
F-measure: 0.5914187270302529
F-measure by label:
label 0: 0.6044624746450304
label 1: 0.5657015590200445
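The threshold selection mentioned above could be sketched as follows, assuming model is the fitted LogisticRegressionModel from the previous round; its training summary exposes the F-measure achieved at every candidate threshold:

from pyspark.sql import functions as F

# Find the threshold that maximises the F-measure on the training summary.
best = (model.summary.fMeasureByThreshold
             .orderBy(F.col("F-Measure").desc())
             .first())

# Re-score the test data with that threshold.
model.setThreshold(best["threshold"])
predictions = model.transform(test)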
The last round (parameter tuning) used different parameters to optimize the average metrics (a sketch of the tuning setup follows the results below). After parameter tuning, the unique data set predicted 597 label 1 values against 317 expected ones, with an accuracy of 0.5754 on the test data, while the full data set predicted 134437 label 1 values against 73990 expected ones, with an accuracy of 0.7786 on the test data.
All data
Average Metric Max: 0.5703188574707733
Average Metric Min: 0.5
Accuracy on test data: 0.7786
Total count of Label 1 prediction: 134437
Total count of Label 1 expectations: 73990
Unique data
Average Metric Max: 0.6750815894510674
Average Metric Min: 0.5
Accuracy on test data: 0.5754
Total count of Label 1 prediction: 597
Total count of Label 1 expectations: 317
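As a sketch of such a tuning setup (the grid values here are assumptions, not the exact ones used in the project), the average metrics come from a cross-validated grid search:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression()
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.0, 0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)
cv_model = cv.fit(train)

# avgMetrics holds one averaged metric per parameter combination,
# which is where the "Average Metric Max/Min" values above come from.
print(max(cv_model.avgMetrics), min(cv_model.avgMetrics))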
Conclusion

The best model is the one which uses all data with the features we have chosen. Its total accuracy and F score are lower than those of the other tested models, but it delivers the best prediction of label 1 values, which is the main use case of the model.
Further improvements could be made by using other algorithms, e.g. a Random Forest Classifier or a Gradient-Boosted Tree Classifier. Another source of improvement could be additional features that improve the prediction of the final model.
To see more about this project, see the link to my GitHub available here.