
Machine Learning with R: Churn Prediction

A practical guide using the R programming language

Photo by Tim Mossholder on Unsplash (Source: https://unsplash.com/photos/WE_Kv_ZB1l0)

R is one of the predominant languages in the data science ecosystem. It is primarily designed for statistical computing and graphics, and it makes it simple to implement statistical techniques efficiently. This makes R an excellent choice for machine learning tasks.

In this article, we will create a random forest model to solve a typical machine learning problem: churn prediction.

Note: If you’d like to read this article in Spanish, it is published on Planeta Chatbot.

Customer churn is an important issue for every business. While looking for ways to expand their customer portfolio, businesses also focus on keeping their existing customers. Thus, it is crucial to learn the reasons why existing customers churn (i.e. leave).

The dataset is available on Kaggle. We will use the randomForest library for R. The first step is to install and load the library.

install.packages("randomForest")
library(randomForest)

I use the RStudio IDE, but there are other alternatives too. The next step is to read the dataset into a table.

> churn <- read.table("/home/soner/Downloads/datasets/BankChurners.csv", sep = ",", header = TRUE)

Some of the columns are redundant or highly correlated with another column. Thus, we will drop 7 columns.

> churn2 <- churn[-c(1,3,10,16,19,22,23)]

The code above selects columns of the table by their indices. The minus sign before the index vector indicates that these columns will be dropped.

I wrote a separate article on the exploratory data analysis of this dataset. I suggest reading it if you’d like to know why we drop some of the columns.

The remaining columns are listed below.
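Since names is a base R function that returns the column names of a data frame, we can print them directly:

> names(churn2)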

The attrition flag column (Attrition_Flag) is our target variable and indicates whether a customer churned (i.e. left the company). The remaining columns carry information about the customers and their activities with the bank.

The next step is to split the dataset into train and test subsets. Before splitting, we convert the target variable (Attrition_Flag) to a factor so that the model knows this is a classification task. We then create a random partition and use it to split the data.

> churn2$Attrition_Flag = as.factor(churn2$Attrition_Flag)
> set.seed(42)
> train <- sample(nrow(churn2), 0.8*nrow(churn2), replace = FALSE)
> train_set <- churn2[train,]
> test_set <- churn2[-train,]

We randomly select 80% of the observations (i.e. rows) for training. The remaining ones are stored in the test set.
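We can verify the split and, since it will matter later, check the class distribution of the target in the training set; nrow, table, and prop.table are all base R:

> nrow(train_set)
> nrow(test_set)
> prop.table(table(train_set$Attrition_Flag))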

The next step is to create a random forest model and train it.

> model_rf <- randomForest(Attrition_Flag ~ ., data = train_set, importance = TRUE)

We create a random forest model and indicate the target variable. The dot after the tilde (~) operator tells the model to use all other columns as independent variables in training.

Printing the fitted model object displays a summary of the random forest model:
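> print(model_rf)

For a classification forest, this summary reports the number of trees, the number of variables tried at each split (mtry), the OOB estimate of the error rate, and the confusion matrix.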

The number of trees used in the forest is 500 by default. We can change it using the ntree parameter.
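For instance, a hypothetical model_rf_200 fit with 200 trees (all other arguments unchanged) would look like this:

> model_rf_200 <- randomForest(Attrition_Flag ~ ., data = train_set, ntree = 200, importance = TRUE)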

The critical metric for evaluation is the OOB (out of bag) estimate of the error rate. I would like to briefly explain how the random forest algorithm works before going into detail about the OOB estimate of the error.

Random forest uses bootstrap sampling, which means randomly selecting observations from the training data with replacement. Each tree in the forest is trained on its own bootstrap sample. The observations that are not in a tree’s bootstrap sample are referred to as its out-of-bag data. Since each observation can be evaluated by the trees that never saw it during training, the out-of-bag error provides an unbiased and more accurate evaluation of the model.

The OOB error rate is 5.33%, which means the accuracy of the model is approximately 95%.
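The per-tree error rates are also stored on the fitted object; the last row of its err.rate matrix holds the final OOB estimate alongside the per-class error rates:

> tail(model_rf$err.rate, 1)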

The confusion matrix is also an important tool to evaluate a classification model. It shows the number of correct and incorrect predictions for each class.
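For the training run, the OOB-based confusion matrix is stored on the fitted object:

> model_rf$confusion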

In order to evaluate the model on the test set, we first make predictions.

> predTest <- predict(model_rf, test_set, type = "class")
> mean(predTest == test_set$Attrition_Flag)  

[1] 0.9343534

We compare the predictions with the target variable of the test set (Attrition_Flag) and take the mean; since the comparison yields a logical vector, the mean is the fraction of correct predictions. The classification accuracy of the model on the test set is 93.4%, which is a little lower than the accuracy on the train set.

We can also generate the confusion matrix of the predictions on the test set.
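One way to build it is with base R’s table function, cross-tabulating predictions against the actual labels (the Predicted and Actual names are just labels for readability):

> table(Predicted = predTest, Actual = test_set$Attrition_Flag)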

It seems like the model performs worse at predicting the churned customers (i.e. attrited) than the existing customers. The main reason for this issue is the imbalanced class distribution: the number of churned customers is much smaller than the number of existing customers. One way to overcome this issue is to use upsampling to increase the number of observations in the attrited customer class, as sketched below.
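A minimal upsampling sketch in base R, using only the training set built above: it duplicates randomly chosen minority-class rows (sampling with replacement) until the two classes are the same size. The names tab, minority, extra, and train_up are illustrative, and train_up would then replace train_set in the randomForest call:

> tab <- table(train_set$Attrition_Flag)
> minority <- names(tab)[which.min(tab)]
> extra <- sample(which(train_set$Attrition_Flag == minority), max(tab) - min(tab), replace = TRUE)
> train_up <- rbind(train_set, train_set[extra, ])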

Hyperparameter tuning is an important part of model building, especially for complex models. For instance, changing the number of trees in the random forest or the maximum depth of an individual tree counts as hyperparameter tuning.

Hyperparameter tuning requires a comprehensive understanding of the hyperparameters of an algorithm; it would not be efficient to just try out random values.
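As a starting point, the randomForest package includes a helper, tuneRF, which searches over mtry (the number of variables tried at each split) using the OOB error. A brief sketch, assuming the predictors and target are separated as below:

> x <- train_set[, names(train_set) != "Attrition_Flag"]
> y <- train_set$Attrition_Flag
> tuneRF(x, y, ntreeTry = 200, stepFactor = 1.5, improve = 0.01)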


Conclusion

We have covered the randomForest library in the R programming language. The syntax is fairly simple and the model evaluation tools are highly intuitive.

Python and R are the two most commonly used programming languages in data science and machine learning. Both of them have a rich selection of libraries and frameworks that make life easier for data scientists and machine learning engineers.

I don’t think one is superior to the other with regard to data science tasks. I suggest learning both in order to benefit from the special features of each.

Thank you for reading. Please let me know if you have any feedback.

