Notes from Industry

1. The Insights We Are Looking for
Today’s article will set its focus on non-contractual business settings. We want to predict the transaction frequency of customers and their churn risks.
A customer who used to purchase once every 20 days on average, but has remained inactive for 50 days, must be rated an elevated churn risk. She may be on the brink of turning away from our business, towards a competitor’s product range. When our app points out customers with elevated churn risks, the marketing team can consider promotions, discounts, and other outreach actions.
What we want is a model that flags churn-risk customers for us. We also want it to predict every customer’s purchase volume. And while we’re at it, it should calculate the lifetime values of all our customers.
Data science methods that meet these requirements are called Buy ‘Til You Die (BTYD) models (Wikipedia): they trace a customer from her "birth" (when she places her first order with our business) to the day she "dies" (when she turns to a competitor and thus is dead to us, the enterprise she’s spurned).
The Customer Churn models we most frequently encounter in data science are applicable to contractual settings. Customers sign term-limited contracts with internet providers, mobile phone networks, insurance companies, and subscription-based service providers. Enterprises in contractual settings can develop a much clearer view of customers that have a propensity to churn. Their raw data will show an elevated risk whenever a contract term nears its end. Sales and marketing teams tend to reach out to a customer just before she begins to think about renewing or cancelling her subscription. The IBM Telco dataset, for instance, has become the ubiquitous example for contractual churn problems, to which we can apply a battery of well-known models such as logistic regression, random forests, gradient boosting approaches, and other classifiers.
In a non-contractual business environment, though, circumstances are more murky. The sales and marketing teams cannot foresee a customer’s elevated risk in their calendars. Most classification models we typically apply to customer churn problems in contractual business settings are not applicable to non-contractual settings. Churn dates are not transparent.
Instead, we need to analyze the purchase behavior of customers by making distributional assumptions. A probability model will also enable us to predict the purchase volume.

The model we are going to build will enable marketing teams to set the focus of their efforts on those customers for whom the model identifies a turning point.
In today’s article, we will implement 2 ½ mouthfuls:
- the Beta-Geometric/Negative Binomial (BG/NBD) model,
- combined with the Gamma-Gamma model for estimating the Customer Lifetime Value.
This sounds scarier than it will turn out to be in practice. The math is complex – four distributions and Bayesian inference are involved. Most of the math will remain safely tucked away, though. The Lifetimes package will carry out the number crunching for us. While I’ll skip the equations, we should understand the principles and assumptions the two models apply when they derive their predictions. That’s why I’ll add some explanations when we set up and run the models.
The screenshot below shows an example of the output we want to obtain from our model: a dataframe that lists all our customers, together with the following performance indicators.

- CLV denotes the customer lifetime value: in this case, the revenues over a chosen period of time, for instance 12 months.
- The four predict columns forecast the number of purchase transactions we can expect over the next 10, 30, 60, and 90 days from each customer.
- prob_alive estimates the customer’s probability of being alive. Its complement (1 – p) is equivalent to the customer’s churn risk: of turning away from our business and favoring a competitor.
2. Dependencies
In addition to our core libraries at the top, we install and import the Lifetimes library.
- Quickstart – lifetimes 0.11.2 documentation
- Lifetimes :: Anaconda.org
- Lifetimes · PyPI
- conda install -c conda-forge lifetimes
- pip install lifetimes
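For reference, here is a minimal set of imports that covers the notebook cells in this article. The aliases and the exact selection of plotting helpers are my choice, not a requirement of the package:

```python
# Core libraries plus the Lifetimes components used in the sections below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from lifetimes import BetaGeoFitter, GammaGammaFitter
from lifetimes.utils import (
    summary_data_from_transaction_data,
    calibration_and_holdout_data,
)
from lifetimes.plotting import (
    plot_period_transactions,
    plot_calibration_purchases_vs_holdout_purchases,
    plot_frequency_recency_matrix,
    plot_probability_alive_matrix,
    plot_history_alive,
)
```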
3. Data Wrangling
The data source we are going to use is available for download as an Excel file from UC Irvine’s Machine Learning Repository: "Online Retail. (2015). UCI Machine Learning Repository. Creative Commons Attributions."
It contains 542,000 records for the years 2010 and 2011 from a British online retailer.

First, we look for records without an identifiable customer. We find as many as 135,000 of them, spread out over 3,710 invoice numbers.

Hypothetically, we could consider using the invoice numbers as dummy variables that each represent an otherwise unknown customer. But this would distort our model. We are interested in repeat customers and their demand patterns. By interpreting each of these 3,700 invoice numbers as a dummy customer, we would multiply the number of one-time buyers. They would not provide us with a recognizable trend and would instead dilute the patterns we can infer for the identifiable customers, of which we find 4,300. Thus, we need to remove these 135,000 records that lack a customer ID.
Before we begin to modify the source data, we copy the dataframe to leave the original source data unchanged.
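A sketch of these first wrangling steps, assuming the UCI Excel file sits in the working directory and keeps its original column names (InvoiceNo, CustomerID, and so on). The variable names df0 and df1 are illustrative, with df1 as the working copy referenced later in the article:

```python
# Load the UCI Online Retail workbook (file name/path is illustrative)
df0 = pd.read_excel("Online Retail.xlsx")

# Records without an identifiable customer: ~135,000 rows, ~3,710 invoices
no_id = df0[df0["CustomerID"].isna()]
print(len(no_id), no_id["InvoiceNo"].nunique())

# Work on a copy so the original source data remain unchanged,
# then drop the records that lack a customer ID
df1 = df0.copy()
df1 = df1.dropna(subset=["CustomerID"])
```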
Let’s review the numerical variables.

We observe negative quantities. These must have been product returns or cancellations. We need to remove their records.

After restricting the dataset to positive quantities, we use the opportunity to adjust some datatypes.
- The dt.date method removes the time component from the invoice dates.
- Pandas interprets the customer ID as a numerical variable. Let’s redefine it as an object/string type.
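The corresponding cell might look as follows (a sketch, assuming df1 from above and the original UCI column names):

```python
# Keep only positive quantities: drop returns and cancellations
df1 = df1[df1["Quantity"] > 0]

# dt.date strips the time component from the invoice timestamps
df1["InvoiceDate"] = df1["InvoiceDate"].dt.date

# Treat the customer ID as a label (string), not as a number
df1["CustomerID"] = df1["CustomerID"].astype(int).astype(str)
```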
The dataset comprises 4,339 unique customer IDs and 305 invoice dates within the time horizon.
The invoice numbers, stock codes, and descriptions do not provide meaningful information for our analysis, therefore we delete them.
The "Country" column could be useful for customer segmentation. But while the dataset contains a relatively high number of records, 400,000, it does not cover a long period of time: just 2 years. If we tried to split it between 20 different countries – some of them with few customers and transactions – we would dilute the raw material for our analysis and risk getting less reliable results. Therefore, we also remove the "Country" column and will analyze the global customer behavior.

To determine the monetary values of transactions – and later also the lifetime values of all customers – we insert a new column "Revenues" by multiplying the quantities with their corresponding unit prices.
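In code, these two steps could look like this (again a sketch against the UCI column names):

```python
# Drop columns that do not inform the analysis; keep a single global segment
df1 = df1.drop(columns=["InvoiceNo", "StockCode", "Description", "Country"])

# Monetary value of each record: quantity times unit price
df1["Revenues"] = df1["Quantity"] * df1["UnitPrice"]
```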



4. The RFM Model in Lifetimes: Recency, Frequency, and Monetary Value
The predictions made by the Beta Geometric / Negative Binomial Distribution model rely on the metrics of the RFM concept: recency, frequency, monetary value, and the so-called customer age or longevity.

The RFM concept rests on the axiom that customers who have placed an order more recently, have purchased more frequently, and tend to spend more on their transactions are likely to be returning customers as well as the most valuable customers. While this sounds like a self-evident classification, the RFM concept provides the building blocks from which the BG/NBD model can derive its predictions. BG/NBD adds distributional assumptions to the RFM metrics.
4.1 Definitions
- Recency: the number of time periods between a customer’s first and last (latest) transactions
- Frequency: the number of time periods, after the initial purchase, when the customer buys again
- Monetary Value: the average value of the customer’s transactions (revenues or profit per transaction)
Additionally,
- Customer age T: the number of time periods since the customer’s first purchase
The Lifetimes package will compute these values for us. To understand their definitions, let’s demonstrate how they can be determined for a chosen customer: ID = 14527. We create a small dataframe that will hold the customer’s transactions.
- The customer’s "birth date", 2010–12–05, was the invoice date of her first purchase.
- She made her latest purchase on 2011–12–07.
- The full dataset, comprising all customers, extends to 2011–12–09. We take this as the end date of our analytical horizon.
- Her recency is the number of days between her first and her latest transaction: 367 days. A bit counter-intuitively, it does not measure the days since her latest purchase.
- Her age or longevity T is the number of days between her "birth" – the first invoice – and the end of our analytical horizon on 2011–12–09: 369 days.
- Her frequency is defined as the number of time periods in which one or more repeat purchases were recorded. That happened on 53 days. Line 18 groups all her transactions in the dataset by their invoice dates, then subtracts 1 because the RFM and BG/NBD models focus on repeat purchases and omit the initial transaction.
Note that this definition almost certainly differs from what you understand as frequency in other fields. Elsewhere, frequency denotes the number of occurrences per unit of time. Our dataset consists of daily invoice records. But the model will not define frequency as the average number of invoices per day, nor per month or year. Instead, it just counts the days that have seen a transaction since the second invoice. In a way, the model defines the "unit of time" as the aggregate number of all active days (whenever an invoice was recorded) and sets this artificial time unit to 1.0. It’s a customer-specific unit of time that omits all the calendar days when the customer was inactive.
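To make the definitions concrete, here is how the three metrics for customer 14527 can be computed by hand from the cleaned dataframe (a sketch; it assumes the invoice dates already hold plain dates, as prepared above):

```python
# Transactions of customer 14527
cust = df1[df1["CustomerID"] == "14527"]

first_purchase = cust["InvoiceDate"].min()   # the customer's "birth": 2010-12-05
last_purchase = cust["InvoiceDate"].max()    # latest purchase: 2011-12-07
horizon_end = df1["InvoiceDate"].max()       # end of the dataset: 2011-12-09

recency = (last_purchase - first_purchase).days   # 367 days
T = (horizon_end - first_purchase).days           # age/longevity: 369 days
frequency = cust["InvoiceDate"].nunique() - 1     # active invoice days minus the first: 53

print(recency, T, frequency)
```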

The Lifetimes package calculates the same values for customer 14527 as we just did.
To implement a more comprehensive, traditional RFM analysis in our application, we could group each of the recency, frequency, and monetary values in five bins, the so-called quintiles. A customer may have a recency value in bin #2, a frequency value in bin #3, and a monetary value in bin #4. Then the customer’s RFM score is equal to 234. The maximum score is 555. This approach can serve as a customer segmentation heuristic for the marketing team, similar to an ABC inventory analysis in supply chain management.
The RFM concept was developed 4 decades ago, before data science and computational resources contributed more advanced techniques for customer segmentation. We will use the RFM metrics for BG/NBD, but refrain from trying out the RFM scores.

4.2 Split of Training and Test Dataset
Lifetimes uses the terms calibration and holdout periods for what we would label the training and test datasets in most other applications.
Its function calibration_and_holdout_data() takes as its arguments the chosen end of the calibration period and the end of the observation period. The notebook cell below reserves the final 240 days in the dataset for the holdout period. It calculates the end date of the calibration period, max_cal_date, in line 7. We want to use the full dataset, which does not cover many time periods in any case. Therefore we set the observation_period_end to the latest date in the dataset.
Note that the split between calibration and holdout period can be finicky. This particular dataset comprises 400,000 records, but they are spread over 4,000 customers and cover just two years. Many smaller customers submit their repeat orders a couple of months apart. If we set the holdout period to a small number of time periods, it will contain only a few repeat transactions for small customers. We risk that it does not fully reflect their demand patterns. I found that a holdout period shorter than 100 days does not work well in this case. Ideally it should be close to 200 days. Yet if we reserve too many days for the holdout period, the same problem can afflict the training phase: a training period as short as 365 days may not catch enough customer-specific transactions to pin down their demand patterns.
It would be preferable to create the holdout dataset by randomly drawing customers from the entire dataset and then check whether the purchase behavior of the customers in the training dataset differs from the behavior of the holdout customers. But many of Lifetimes’ evaluation functions take a time period t as their parameter and would not work if calibration and holdout transactions overlap in the same period of time, even if the customers are kept separate.
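A sketch of the split, assuming df1 and the 240-day holdout mentioned above; the helper variables are illustrative, but the keyword arguments follow the Lifetimes API:

```python
import datetime

holdout_days = 240
max_date = df1["InvoiceDate"].max()                               # latest date in the dataset
max_cal_date = max_date - datetime.timedelta(days=holdout_days)   # end of the calibration period

df_ch = calibration_and_holdout_data(
    df1,
    customer_id_col="CustomerID",
    datetime_col="InvoiceDate",
    calibration_period_end=max_cal_date,
    observation_period_end=max_date,
    freq="D",
    monetary_value_col="Revenues",
)
df_ch.head()
```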

The calibration_and_holdout_data() function splits the dataset into a training and a test dataset. In parallel, it calculates the recency, frequency, monetary value, and age T for each customer. It creates a new dataframe, which we assign to the variable df_ch ("dataframe calibration and holdout"). The transactional source dataframe df1 holds one row per invoice record, whereas df_ch uses the unique customer IDs as its index and summarizes the invoice records by customer.

The list below shows the distribution of frequencies (number of active invoice days minus 1) across customers. Most customers make a single purchase. Remember that a frequency of 0 marks a customer with zero repeat transactions – he just completed his single initial purchase but has not returned since. A tiny fraction of customers has come back for 6 or more purchases. This pattern underlines the difficulty I mentioned earlier: splitting the dataset into training and test periods that both adequately reflect the demand patterns of customers who, in their majority, feel only a modest urge to buy, once every couple of months.

To visualize the patterns of recency, frequency, and age T, we can use Seaborn’s distplots.
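For instance (a sketch; distplot is the Seaborn function available at the time of writing):

```python
# Distributions of frequency, recency, and age T in the calibration period
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, ["frequency_cal", "recency_cal", "T_cal"]):
    sns.distplot(df_ch[col], kde=False, ax=ax)
    ax.set_title(col)
plt.show()
```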

We observe again that only a few customers, within the less than 1.5 years covered by the training dataset, return for a second or third purchase.

Recency measures the number of days between a customer’s first and latest transactions. Since most of them are one-time buyers, we see the peak at 0 days.

The customer age T represents the number of days since the buyer’s "birth", their first day in their role as a new customer of the business: beginning with their first purchase and extending to the end of the dataset. Most customers came on board 120 or more days in the past. Over its short horizon of 4 months, the chart shows a fairly even level of new customers who made their initial purchases 20, 40, 60, or 80 days ago. The chart suggests that the business does not find itself in a downward trend. It has attracted new customers at a stable rate over the horizon of 120 days.
5. The Beta-Geometric / Negative Binomial Distribution (BG/NBD) Model
5.1 BG/NBD Concept
In non-contractual business settings, the purchase behavior of customers does not follow a deterministic trend. Both demand levels and churn rates are random variables. A distributional model like BG/NBD describes the random processes that influence the customer behavior, individually and in aggregate.
The method relies on four distributional assumptions to model the uncertainties:
- The number of orders a customer will place in a time period follows a Poisson distribution with transaction rate lambda. This Poisson count distribution is equivalent to the assumption that the time between transactions follows an exponential distribution with the same transaction rate lambda.
- The demand varies independently between customers: heterogeneity in lambda. The variation in lambda is a random variable that follows a Gamma distribution with shape parameter r and scale alpha.
- After any purchase, a customer may become inactive with probability p and turn away from the business. This dropout process follows a Geometric distribution.
- The churn risk varies independently between customers: heterogeneity in p. The variation of the churn or dropout probability p is a random variable that follows a Beta distribution.
("Counting Your Customers" the Easy Way: An Alternative to the Pareto/NBD Model – Fader, Hardie, Lee, brucehardie.com)
Where does "NBD", the Negative Binomial Distribution, come into the picture? Is a fifth distribution involved? No, the four bullet points above have not left a gap. The first two assumptions combine to form a Poisson-Gamma mixture distribution. If the rate lambda is a Gamma random variable, then the mixture is equal to a Negative Binomial distribution (Wikipedia).
5.2 Fitting the BG/NBD Model to the Observations
The Lifetimes package fits the distributions to the observational data and derives their parameters for us.
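Fitting the model to the calibration data takes two lines; a small penalizer coefficient is optional but can help the optimizer converge (the value below is my assumption, not taken from the article):

```python
# Fit BG/NBD to the calibration portion of the data
bgf_cal = BetaGeoFitter(penalizer_coef=0.001)
bgf_cal.fit(df_ch["frequency_cal"], df_ch["recency_cal"], df_ch["T_cal"])

print(bgf_cal.summary)   # fitted parameters r, alpha, a, b with standard errors
```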



The plot_period_transactions method draws a chart that compares the predictions with the actual observations in the training dataset. The prediction accuracy looks quite good.
Lifetimes’ method plot_calibration_purchases_vs_holdout_purchases compares the predicted frequency with the actual purchases in the testing period.
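Both evaluation charts come from one-line calls against the fitted calibration model (a sketch, reusing bgf_cal and df_ch from above):

```python
# Model vs. actual repeat transactions in the calibration period
plot_period_transactions(bgf_cal)
plt.show()

# Predicted vs. actual purchase frequency in the holdout period
plot_calibration_purchases_vs_holdout_purchases(bgf_cal, df_ch)
plt.show()
```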

We see that the orange line of model predictions closely hugs the blue actual line, up to 3 transactions per customer. For higher transaction levels, the model tends to underestimate the frequency, owing to the rarity of customers who made more than 3 repeat purchases within the short training time window.
6. BG/NBD Model for the Full Dataset
The dataset in our example covers a short time period with few transactions by most customers. After reviewing the training and test results in chapter 5, I prefer to re-run the model on the full dataset.
6.1 RFM: Recency, Frequency, Monetary Value, and Longevity T
We use Lifetimes’ function summary_data_from_transaction_data() to obtain the RFM profile of every customer in the full dataset. The function collects them in a new, customer-centric dataframe df_rft.
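A sketch of the call, using the cleaned transaction dataframe df1:

```python
# RFM summary (frequency, recency, T, monetary_value) for every customer
df_rft = summary_data_from_transaction_data(
    df1,
    customer_id_col="CustomerID",
    datetime_col="InvoiceDate",
    monetary_value_col="Revenues",
    observation_period_end=df1["InvoiceDate"].max(),
    freq="D",
)
df_rft.head()
```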


6.2 Fitting the BG/NBD Model
We fit the Beta-Geometric/Negative Binomial Distribution model to the full dataset in df_rft.
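The fitting call mirrors the one for the calibration data (the penalizer value is again my assumption):

```python
# Fit BG/NBD to the RFM summary of the full dataset
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(df_rft["frequency"], df_rft["recency"], df_rft["T"])
print(bgf.summary)
```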

To review the prediction accuracy, we call Lifetimes’ plot_period_transactions method. The chart suggests that the model’s predictions for the number of purchases are closely aligned with the actual observations.

6.3 Predicted Purchases
The frequency/recency matrix visualizes how these two attributes in a customer’s profile predict the number of purchases she will make in a chosen period of time.
We will draw two matrices, for forecast periods of 10 and 90 days. The list comprehension in line 15 feeds the two forecast horizons to Lifetimes’ method plot_frequency_recency_matrix in line 6.
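A sketch that produces the two matrices with a plain loop instead of the article’s list comprehension:

```python
# Frequency/recency matrices for a 10-day and a 90-day forecast horizon
for horizon in [10, 90]:
    plt.figure(figsize=(8, 6))
    plot_frequency_recency_matrix(bgf, T=horizon)
    plt.show()
```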


The frequency/recency matrices for 10 and 90 days show the same shape of the probability distribution, with a reddish hot zone in the lower right corner. They only differ in their scales on the right-hand side: the number of purchases expected over a longer or shorter number of days.
Each matrix demonstrates that a customer with a high frequency (80 or more transactions), combined with a long recency (more than 300 days between first and latest transactions), will exhibit the highest propensity for future purchases: orange to red, implying 15 to 25 transactions over the next 90 days.
This, of course, is of limited usefulness even though it looks appealing. We need to translate the visual clues to concrete, quantified prediction values for individual customers.
Let’s randomly pick a customer from the dataset, for instance ID = 12748, and predict her purchases over the next 30 days.
Lifetimes’ predict function evaluates the customer’s recency, frequency, and age; and concludes that, based on this profile, we should expect her to make 7.7 transactions over the next 30 days.
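The call for a single customer (a sketch; the index lookup assumes the customer IDs were stored as strings):

```python
# Expected transactions of customer 12748 over the next 30 days
t = 30
cust = df_rft.loc["12748"]
n_pred = bgf.predict(t, cust["frequency"], cust["recency"], cust["T"])
print(round(n_pred, 1))   # about 7.7 in the article's run
```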

The next notebook cell expands this single-buyer forecast to all customers in the database. The list comprehension in row 12 asks for the expected purchase volume over 10, 30, 60, and 90 days. The helper function it calls, predict_purch(), will insert its customer-specific predictions as four new columns into the dataframe.
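A sketch of such a helper; the function and column names are reconstructions of the article’s predict_purch(), not verbatim code:

```python
def predict_purch(df, t):
    """Add a column with the expected number of purchases within t days."""
    df["predict_purch_" + str(t)] = bgf.predict(
        t, df["frequency"], df["recency"], df["T"]
    )

# Four forecast horizons, four new columns
for t in [10, 30, 60, 90]:
    predict_purch(df_rft, t)
```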

Next, we can sort and filter the dataframe based on this forecast and, for instance, study the top 10 customers, ranked by their predicted number of purchases over the next 30 days.

Seaborn’s distplot shows, of course, a heavily right-skewed distribution. Only a few big spenders have the inclination or the means to place an order more than once a month.

The Lifetimes method probability_of_n_purchases_up_to_time computes the probability that a randomly chosen customer will complete exactly n purchases within a period of time t.
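For example (a sketch):

```python
# Probability that a randomly chosen customer makes exactly n purchases within 90 days
for n in range(4):
    p = bgf.probability_of_n_purchases_up_to_time(90, n)
    print(f"P(n = {n} purchases within 90 days) = {p:.3f}")
```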

6.4 Customer Churn Probability
In the previous chapter, we have predicted each customer’s purchase volume. In the next step, we will estimate a customer’s churn risk – or conversely, her probability of being alive (in her role as a customer) at time t.
Lifetimes’ method plot_probability_alive_matrix visualizes the survival probability based on the customer’s frequency and recency profile. A customer whose transactions cover a long stretch of time (high recency), with a high number of transactions, exhibits a solid, red probability of being alive.

We translate the visual clues into metrics to obtain actionable information.
The function conditional_probability_alive computes the survival probability for every customer. We insert the array as a new column into the dataframe.
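A sketch of both steps, the matrix plot and the per-customer probabilities:

```python
# Probability-alive matrix over frequency and recency
plot_probability_alive_matrix(bgf)
plt.show()

# Survival probability for every customer, inserted as a new column
df_rft["prob_alive"] = bgf.conditional_probability_alive(
    df_rft["frequency"], df_rft["recency"], df_rft["T"]
)
df_rft["prob_alive"].describe()
```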
Its descriptive statistics show that, given the current recency/frequency profiles, the survival probability of the least solid customer is still as high as 75%.

Only a few customers are ranked with a survival chance of less than 99%. If we limit the analysis to the not perfectly robust candidates, those with a survival rating of less than 90%, we find four customers.

Let’s examine customer ID 15107, with a rating of 75%.
Lifetimes’ method plot_history_alive shows us how the customer’s transactions and her probability of being alive have developed over the two years of observations. She completed five purchases in the early months, but has been inactive between March and December 2011. The model still rates her 75% likely to be alive as a customer because our dataset contains many customers who are barely more active over the course of 10 months. A different dataset with more active customers, who tend to order something every month or every week, would lead the model to lower the probability for a customer who has been inactive for several months.
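A sketch of the call; the number of days to plot is an illustrative choice, not taken from the article:

```python
# Transaction history and P(alive) over time for customer 15107
trans_15107 = df1[df1["CustomerID"] == "15107"]

days_since_birth = 365   # length of the history to plot (illustrative)
plt.figure(figsize=(10, 4))
plot_history_alive(bgf, days_since_birth, trans_15107, "InvoiceDate")
plt.show()
```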


7. Customer Lifetime Value

7.1 The Gamma-Gamma Model
The BG/NBD model enables us to predict both the number of transactions and the churn risk (conversely, the probability of being alive) for each customer.
Now we want to estimate the monetary values that correspond to the count data. We will feed the BG/NBD results into the Gamma-Gamma model.
The GG model rests on the assumptions that
- customers differ in their average transaction values: for instance, some big spenders may have a propensity to spend twice as much per transaction as others
- the average monetary transaction value of an individual customer is time-invariant within the analytical horizon; the values of the customer’s transactions vary around the mean while the mean itself remains stable over time
The time-invariant average transaction value is the random variable Z we want to pin down.
In 1994, Schmittlein and Peterson proposed that the transaction values follow a normal distribution. But the normal distribution extends to negative infinity and is symmetric around its mean – whereas real-world frequency curves of spendings are non-negative and right-skewed, often distinctively so. Fader/Hardie suggested the Gamma distribution as a more appropriate alternative to mirror a customer’s average transaction value. They define its scale parameter as a Gamma random variable of its own, hence the label Gamma-Gamma model. (The Gamma-Gamma Model of Monetary Value – Fader, Hardie, brucehardie.com)
Using Bayes’ theorem, they derive the conditional expectation of a customer’s mean transaction value. The more transactions have been observed, the less weight the model puts on the prior distribution and the more on the customer’s observed mean.
The GG model only works for positive monetary values. Thus, we exclude the few transactions with a unit price of zero.

The GG model also requires that the frequency of purchases be independent of their monetary value. The correlation matrix confirms that Pearson’s correlation coefficient amounts to a modest 0.016 in our example, low enough that we can proceed.
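Both preconditions can be checked on the RFM summary. The name df_rftv is one I introduce here for the filtered frame; only repeat customers with positive monetary values qualify for the Gamma-Gamma fit:

```python
# Restrict to customers with repeat purchases and positive monetary value
df_rftv = df_rft[df_rft["monetary_value"] > 0].copy()

# Pearson correlation between purchase frequency and monetary value
print(df_rftv[["frequency", "monetary_value"]].corr())   # ~0.016 in the article's run
```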

7.2 Fitting the Gamma-Gamma Model to the RFT Data
We instantiate Lifetimes’ GammaGammaFitter and fit it to the frequency and monetary values we derived in the RFM summary.
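The fit itself (a sketch; the penalizer value is my assumption):

```python
# Fit the Gamma-Gamma model to frequency and monetary value of the repeat customers
ggf = GammaGammaFitter(penalizer_coef=0.001)
ggf.fit(df_rftv["frequency"], df_rftv["monetary_value"])
print(ggf.summary)   # fitted parameters p, q, v
```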

7.3 GG Predictions
After fitting the GG model, we can compute every customer’s expected monetary value using Lifetimes’ function conditional_expected_average_profit. Our source dataset does not offer profit margins, therefore the function, despite its name, returns the expected average revenues.
We insert the results as a new column "exp_avg_rev" into our dataframe, together with the deviation of the predicted average revenues from the actual average monetary value.
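A sketch of this step; the column names are illustrative:

```python
# Expected average revenue per transaction for each customer
df_rftv["exp_avg_rev"] = ggf.conditional_expected_average_profit(
    df_rftv["frequency"], df_rftv["monetary_value"]
)
# Deviation of the prediction from the observed average transaction value
df_rftv["error_rev"] = df_rftv["exp_avg_rev"] - df_rftv["monetary_value"]
df_rftv.head()
```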

7.4 Customer Lifetime Value CLV
In our final code cell, we will again link up the BG/NBD results to the fitted Gamma-Gamma model.
- A customer’s longevity represents the time period between her first purchase and the end date of the observational data.
- Similarly, a customer’s "lifetime" denotes the time from her first transaction to her unknown future churn date, when she will turn away from our business. For practical purposes, the "lifetime" horizon should be limited to a user-defined time period, for instance the next 12 months I set as a hyperparameter in line 3, below.
Lifetimes’ customer_lifetime_value function will add up the predicted revenues over the chosen horizon of 12 months.
The customer lifetime value is the outcome of a discounted cash flow (DCF) calculation. It rests on the time-value-of-money concept. A sum of money is worth more in the near future than in the far future. Money we earn now, we can invest immediately and begin to accrue interest or dividends. A customer whose proceeds flow to the business now is more valuable than a customer who will contribute revenues in the more remote future. The time value of money is expressed by the discount rate which the customer_lifetime_value function requests from us. The discount rate should reflect our company’s cost of capital: a weighted average of
- the interest rate the company owes on its loans and
- the return on equity (dividend rate) the company’s shareholders expect to receive.
The two rates should be weighted by the capital amounts which the shareholders and the lenders have contributed to the balance sheet of the company.
In this example, I set a low annual rate of 6% in line 2. Note that Lifetimes actually expects a monthly discount rate. Line 5 translates the annual rate to a monthly rate.
The customer_lifetime_value function takes all four metrics we obtained from the function summary_data_from_transaction_data: recency, frequency, longevity T, and monetary value.
It returns an array of customer lifetime values, one per customer, which we insert as a new column "CLV" in our dataframe.
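Putting it together (a sketch; the variable names follow the earlier cells, and the conversion from the annual to the monthly discount rate mirrors the description above):

```python
# Customer lifetime value over the next 12 months, discounted monthly
annual_discount_rate = 0.06
monthly_discount_rate = (1 + annual_discount_rate) ** (1 / 12) - 1

df_rftv["CLV"] = ggf.customer_lifetime_value(
    bgf,                            # the fitted BG/NBD model predicts the transaction counts
    df_rftv["frequency"],
    df_rftv["recency"],
    df_rftv["T"],
    df_rftv["monetary_value"],
    time=12,                        # horizon in months
    freq="D",                       # the RFM summary was built on daily data
    discount_rate=monthly_discount_rate,
)

print(df_rftv["CLV"].mean())
df_rftv.sort_values(by="CLV", ascending=False).head(5)
```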

The mean CLV across all customers amounts to £ 2,847.
When we sort the dataframe by customer value, we can see our top 5 customers, with lifetime values above £ 100,000 – obviously wholesalers; or extremely dedicated private consumers with a concerning addiction to souvenirs and gifts.

8. Conclusions
Most classification tools we typically use for customer churn problems do not cope well with non-contractual settings. The churn dates in a non-contractual business environment are not transparent.
The BG/NBD model applies distributional assumptions to the observational data. It enables us to estimate churn risks and future purchases, based on a customer’s profile defined by her recency, frequency, and longevity. In a second step, the count data of the BG/NBD model can be linked up to a Gamma-Gamma model that estimates monetary values.
The Jupyter notebook is available for download on GitHub: h3ik0th/clv: customer lifetime value BG/NBD model (github.com)
Dataset: Online Retail. (2015). UCI Machine Learning Repository. This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license, which allows for the sharing and adaptation of the dataset for any purpose, provided that appropriate credit is given.