Modelling Customer Churn When Churns Are Not Explicitly Observed, with R

Susan Li
Towards Data Science
6 min read · Apr 22, 2018


Photo credit: Pexels

Customer churn can be characterized as either contractual or non-contractual. It can also be characterized as voluntary or involuntary, depending on the cancellation mechanism.

We will only discuss non-contractual and voluntary churn today.

Non-Contractual

  • Customers are free to buy (or not) at any time
  • The churn event is not explicitly observed

Voluntary

  • Customers make the choice to leave the service

In general, customer churn is a classification problem. However, in a non-contractual business such as Amazon (for non-Prime members), every purchase could be that customer’s last, or just one in a long sequence of purchases. Churn modelling in a non-contractual business is therefore not a classification problem but an anomaly detection problem: to determine when customers are churning or likely to churn, we need to know when they are displaying anomalously large between-purchase times.

Anomaly Detection

Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behaviour, known as outliers. It has many applications in business, from credit card fraud detection (based on “amount spent”) to system health monitoring.

And here we are, using anomaly detection to model customer churn for non-contractual business. We want to be able to make claims like “9 times out of 10, Customer X will make his next purchase within Y days”. If Customer X does not make another purchase within Y days, we know that there is only a 1 in 10 chance of this happening, and that this behaviour is anomalous.

To do this, we will need each customer’s between-purchase-time distribution. This may be difficult to estimate, especially if the distribution is multimodal or irregular. To avoid this difficulty, we will take a non-parametric approach and use the Empirical Cumulative Distribution Function (ECDF) to approximate the quantiles of each customer’s between-purchase-time distribution. Once we have the ECDF, we can approximate the 90th percentile and obtain estimates of the kind described above. Let’s get started!
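To make the idea concrete before we touch the retail data, here is a toy sketch in base R (the gap values are made up for illustration) of reading a quantile off an ECDF:

```r
# Hypothetical between-purchase gaps (in days) for one customer
dt <- c(1, 2, 2, 3, 3, 4, 5, 7, 10, 21)

# ecdf() returns the empirical CDF as a function
e_cdf <- ecdf(dt)
e_cdf(5)   # proportion of gaps of 5 days or fewer: 0.7

# Inverting the ECDF (quantile type = 1) gives the 90th-percentile gap
quantile(dt, probs = 0.9, type = 1)   # 10 days
```

With the real data we will build the ECDF by hand, per customer, but the logic is the same: sort the gaps, assign cumulative proportions, and invert at 0.9.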

The Data

We will use the Online Retail data set from the UCI Machine Learning Repository.

Load the libraries and read the data:

library(tidyverse)   # includes dplyr and ggplot2
library(lubridate)
library(XLConnect)   # for reading the .xlsx file
theme_set(theme_minimal())

raw.data <- readWorksheet(loadWorkbook("Online_Retail.xlsx"), sheet = 1)
data <- raw.data

Create a “Total” column to show how much each customer spent on each purchase. Then we can create a new data frame of each customer’s total spend per invoice.

data$Total <- data$Quantity * data$UnitPrice

txns <- data %>%
  mutate(CustomerID = as.factor(CustomerID)) %>%
  group_by(CustomerID, InvoiceNo, InvoiceDate) %>%
  summarise(Spend = sum(Total)) %>%
  ungroup() %>%
  filter(Spend > 0)

Next, we can calculate the time between purchases for each customer.

time_between <- txns %>%
  arrange(CustomerID, InvoiceDate) %>%
  group_by(CustomerID) %>%
  mutate(dt = as.numeric(InvoiceDate - lag(InvoiceDate), units = 'days')) %>%
  ungroup() %>%
  na.omit()

At this time, we are only interested in customers who have made more than 20 purchases in the data.

Ntrans <- txns %>%
  group_by(CustomerID) %>%
  summarise(N = n()) %>%
  filter(N > 20)

Create a little function for randomly sampling customers.

sample_n_groups <- function(tbl, size, replace = FALSE, weight = NULL) {
  # Sample `size` of the table's groups, keeping all rows in those groups
  grps <- tbl %>% groups() %>% lapply(as.character) %>% unlist()
  keep <- tbl %>% summarise() %>% ungroup() %>% sample_n(size, replace, weight)
  tbl %>% right_join(keep, by = grps) %>% group_by(across(all_of(grps)))
}

After this bit of data wrangling, we can now visualize the between-purchase-time distribution for 20 randomly selected customers.

ecdf_df <- time_between %>%
  group_by(CustomerID) %>%
  arrange(dt) %>%
  mutate(e_cdf = 1:length(dt) / length(dt))

sample_users <- ecdf_df %>% inner_join(Ntrans) %>% sample_n_groups(20)

ggplot(data = time_between %>% inner_join(Ntrans) %>% filter(CustomerID %in% sample_users$CustomerID), aes(dt)) +
  geom_histogram(aes(y = ..count.. / sum(..count..)), bins = 15) +
  facet_wrap(~CustomerID) +
  labs(x = 'Time Since Last Purchase (Days)', y = 'Frequency')
Figure 1

Interpretation:

  • Most of CustomerID 12748’s between-purchase times were under 5 days; occasionally they exceeded 5 or even 10 days.
  • CustomerID 13102 is an infrequent customer, and most of his (or her) between-purchase times range from 5 to 30 days.

After calculating the ECDF for every customer, we visualize the ECDFs of the same customers. The red line marks the approximate 90th percentile: if a customer’s ECDF crosses the red line at 20 days, then 9 times out of 10 that customer will make another purchase within 20 days.

ggplot(data = ecdf_df %>% inner_join(Ntrans) %>% filter(CustomerID %in% sample_users$CustomerID), aes(dt, e_cdf)) +
  geom_point(size = 0.5) +
  geom_line() +
  geom_hline(yintercept = 0.9, color = 'red') +
  facet_wrap(~CustomerID) +
  labs(x = 'Time Since Last Purchase (Days)')
Figure 2

Create a function to calculate the 90th percentile.

getq <- function(x, a = 0.9) {
  # Approximate the a-th quantile of x by inverting the ECDF
  if (a > 1 | a < 0) {
    stop('Check your quantile')
  }
  X <- sort(x)
  e_cdf <- 1:length(X) / length(X)
  aprx <- approx(e_cdf, X, xout = a)
  return(aprx$y)
}

percentiles <- time_between %>%
  inner_join(Ntrans) %>%
  group_by(CustomerID) %>%
  summarise(percentile.90 = getq(dt)) %>%
  arrange(percentile.90)

Looking at CustomerID 12748:

percentiles[ which(percentiles$CustomerID==12748), ]
Figure 3

The model tells us: 9 times out of 10, CustomerID 12748 will make another purchase within 4.74 days. If CustomerID 12748 does not make another purchase within 4.74 days, we know that there is only a 1 in 10 chance of this happening, and that this behaviour is anomalous. At this point, we know that CustomerID 12748 has begun to act “anomalously”.

Let’s have a quick snapshot of CustomerID 12748’s purchase history to see whether our model makes sense:

txns[ which(txns$CustomerID==12748), ]
Figure 4

Most of CustomerID 12748’s purchases happened within 1 to 4 days of each other. It makes sense that we should be concerned if he (or she) does not make another purchase within 4.74 days.

Looking at CustomerID 13102:

percentiles[ which(percentiles$CustomerID==13102), ]
Figure 5

The model tells us: 9 times out of 10, CustomerID 13102 will make another purchase within 31.6 days. If CustomerID 13102 does not make another purchase within 31.6 days, we know that there is only a 1 in 10 chance of this happening, and that this behaviour is anomalous. At this point, we know that CustomerID 13102 has begun to act “anomalously”.

Again, we take a quick snapshot of CustomerID 13102’s purchase history to see whether our model makes sense:

txns[ which(txns$CustomerID==13102), ]
Figure 6

By looking at CustomerID 13102’s purchase history, we agree with the model!

There you have it! We now know the point at which each customer begins to act “anomalously”.

Churn is very different for non-contractual businesses. The challenge lies in the absence of a clearly defined churn event, which means taking a different approach to modelling churn. When a customer has churned, his (or her) time between purchases is anomalously large, so we need an idea of what “anomalous” means for each customer. Using the ECDF, we have estimated the 90th percentile of each customer’s between-purchase-time distribution in a non-parametric way. If the time since a customer’s last purchase is near their 90th percentile, we can call them “at risk of churn” and take appropriate action to prevent them from churning. Best of all, our approach improves with more data, since the ECDF converges on the underlying Cumulative Distribution Function (CDF) of the population.
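As a closing sketch, here is one way that flagging step could look. The toy data frames stand in for the post’s `txns` and `percentiles` tables; the “as of” date (taken as the end of the data) and the column name `flag` are assumptions for illustration:

```r
library(dplyr)

# Toy stand-ins for the post's `txns` and `percentiles` tables
txns <- data.frame(
  CustomerID  = c("A", "A", "B", "B"),
  InvoiceDate = as.Date(c("2018-01-01", "2018-01-05",
                          "2018-01-01", "2018-03-01"))
)
percentiles <- data.frame(
  CustomerID    = c("A", "B"),
  percentile.90 = c(4.7, 31.6)
)

# "Now" is assumed to be the end of the data set
as_of <- max(txns$InvoiceDate)

# A customer is at risk when their current gap reaches their 90th percentile
at_risk <- txns %>%
  group_by(CustomerID) %>%
  summarise(last_purchase = max(InvoiceDate)) %>%
  inner_join(percentiles, by = "CustomerID") %>%
  mutate(days_since = as.numeric(as_of - last_purchase, units = "days"),
         flag = days_since >= percentile.90)
```

Here customer A (a frequent buyer who went quiet) would be flagged, while customer B, whose long gaps are normal, would not.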

In addition, when we implement the above model, we might want to take seasonality into consideration.

Source code can be found on GitHub. I welcome any questions.

