Customer Lifetime Value Estimation via Probabilistic Modeling

A deep-dive into BG-NBD, an influential hierarchical model that facilitates an understanding of customers’ purchase behavior

Meraldo Antonio
Towards Data Science



Importance of customer lifetime value

Customer lifetime value (CLV) is the total worth of a customer to a company over the course of their relationship. In practice, this “worth” can be defined as revenue, profit, or any other metric of the analyst’s choosing.

CLV is an important metric to track for two reasons. First, the sum of CLVs across a company’s entire customer base gives a rough idea of its market value. Thus, a company with a high total CLV will appear attractive to investors. Second, a CLV analysis can guide the formulation of customer acquisition and retention strategies. For example, special attention could be given to high-value customers to ensure that they stay loyal to the company.

Many CLV models have been developed with different levels of sophistication and accuracy, ranging from rough heuristics to complex probabilistic frameworks. In this article series, we delve into one of them: the Beta-Geometric Negative Binomial Distribution (BG-NBD) model. This model, developed by Fader, Hardie, and Lee in 2005¹, has been one of the most influential models in the domain, thanks to its interpretability and accuracy.

This series consists of three articles that build on top of one another. Our game plan is as follows:

  1. In Part 1 (this one), we’ll achieve an ELI5 understanding of the BG-NBD model and its assumptions.
  2. In Part 2, we’ll look into the Python library lifetimes that allows us to conveniently fit a BG-NBD model to a dataset in a scikit-learn-like fashion and almost immediately receive the maximum likelihood estimates of the model’s parameters. We’ll also explore the various downstream analyses that lifetimes enables.
  3. In Part 3, we’ll look at an alternative way to implement the BG-NBD model, this time from a Bayesian perspective. We’ll see how the Bayesian hierarchical BG-NBD model allows us to inject our prior intuition of customer behavior into the model. To this end, we will be using the Python library PyMC3.

The scope of BG-NBD

Before digging deeper into the mathematics of BG-NBD, we need to understand what it can and cannot do. There are two major limitations to keep in mind:

  1. The model is only applicable to non-contractual, continuous purchases.
  2. The model only tackles one component of the CLV calculation: the prediction of the number of purchases.

Let’s learn about these limitations in greater detail.

BG-NBD applies to non-contractual, continuous purchases

Depending on the relationship between the sellers and the buyers, a business can either be a contractual business or a non-contractual business.

  • A contractual business, as its name suggests, is one where the buyer-seller relationship is governed by contracts. When either party no longer wants to continue this relationship, the contract is terminated. Thanks to the contract, there is no ambiguity as to whether someone is a customer of the business at a given point.
  • In a non-contractual business, on the other hand, purchases are made on a per-need basis without any contract.

We can further distinguish between continuous and discrete settings:

  • In a continuous setting, purchases can occur at any given moment. The majority of purchase situations (e.g. grocery purchases) fall under this category.
  • In a discrete setting, purchases usually occur periodically with some degree of regularity. An example of this is weekly magazine purchases.

The BG-NBD model tackles the non-contractual, continuous situation, which is the most common yet most challenging of the four settings to analyze. Under such a setting, customer attrition is not explicitly observable and can happen at any time. This makes it harder to differentiate between customers who have churned for good and those who will return in the future. As we will see later, the BG-NBD model is capable of assigning probabilities to each of these two options.

BG-NBD focuses on predicting the number of transactions

A customer’s CLV for a given period can be calculated by multiplying two numbers:

  1. The customer's predicted number of transactions within this period.
  2. The predicted value of each purchase.
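In symbols (the expectation notation here is mine, not from the BG-NBD paper):

$$\text{CLV} \approx \mathbb{E}[\text{number of transactions}] \times \mathbb{E}[\text{transaction value}]$$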

Usually these two components are tackled and modeled separately. The BG-NBD model addresses the first — predicting the number of transactions, which in many regards is the more difficult of the two.

The second component, the expected value of the purchases, can be found either by using simple heuristics, such as taking the average of all past purchases, or by a more sophisticated probabilistic model, such as the Gamma-Gamma model² (which was also created by the authors of BG-NBD).

Intuition

Before going into the mathematics of the model, let’s try to understand how the model works on a conceptual level.

Let’s imagine the following scenario. The date is 31 December 2021 and you’re the manager of a cake shop. You’ve carefully kept track of all transactions that happened this year and you want to predict how many transactions you can expect your customers to make in 2022.


You also happen to be a skilled data scientist and you plan to achieve this prediction by fitting a model to your data. This model should be able to describe your customers’ purchasing behavior in an interpretable way.

There are some assumptions you could consider when developing the model.

Each customer has a different purchasing rate

You’ve noted that some people buy cakes every day and some every weekend. Others only buy them on special occasions that take place every six months on average. Your model will need a way to assign a different purchasing rate to each customer.

Each customer can stop being your customer at any time

In the fiercely competitive cake business, loyalty isn’t guaranteed. At any point, your customer can leave your business for another one. Let’s refer to this departure as the “deactivation” of a previously active customer.

To conveniently model deactivation, we can assume that it can only happen after a successful purchase. That is to say, after every purchase in your store, a customer will decide whether to continue buying at your shop or to abandon it. Deactivation happens when the customer chooses the latter.

We’ll assume that a deactivation is both permanent and latent. Permanent, because once a customer decides to churn, he will never return. Latent, because he won’t explicitly let you know that he will no longer be your customer.

Illustration

With these assumptions in place, let’s consider the following scenario, where we have two customers, A and B. Each of them has made some transactions in 2021, and each transaction is indicated by a red dot.

Can we tell which customers have deactivated and which ones will still frequent your store and contribute to your future revenue?

The answer is yes — to a certain extent. For example, looking at A’s pattern above, we see that he used to shop pretty frequently, but it’s been a while since we’ve seen him. Because his inter-transaction time is so much shorter than the time that’s passed since his last transaction, it is quite likely that A has deactivated.

On the other hand, B is an infrequent shopper and her most recent purchase is not that long ago compared to her average between-purchase period. It is pretty likely that she’ll come back.

Let’s now develop these assumptions and intuitions into a more sophisticated model!

Mathematical model

Probabilistic modeling: an introduction

Traditionally, CLV was calculated using a simple function of the past data. For example, we can estimate the value of future transactions by taking a fixed fraction of the value of past transactions. Such a calculation, unsurprisingly, is simplistic, unreliable, and uninterpretable.


The BG-NBD model, on the other hand, is a probabilistic model. In a probabilistic model, we assume that our observations (i.e. the transactions) are generated by a physical process that we can model using probability distributions. Our task is to estimate the parameters that best explain our existing observations. One commonly used option is to find the maximum likelihood estimators of these parameters. We can then use these estimated parameters to make predictions about the future. Compared to the simple heuristic above, this probabilistic framework is usually more robust, accurate, and interpretable.

With that introduction, let’s now convert the assumptions qualitatively described above into a solid probabilistic framework.

Poisson process to model transactions and exponential distribution to model time between purchases

First, let’s focus on the repeat purchasing behaviors of active customers. We can assume that as long as a customer is still active, their transactions follow a Poisson process with a constant purchase rate 𝜆. With this assumption, we can model the time-to-next-purchase Δt as an exponential distribution parameterized by 𝜆. The PDF of this distribution is as follows:
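$$f(\Delta t \mid \lambda) = \lambda e^{-\lambda \Delta t}, \qquad \Delta t \ge 0$$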

Each active customer will have his own exponential distribution that we can use to predict the probability of the time of the next purchase.

The graph above shows the PDF of two exponential distributions associated with two customers. The first customer (the blue curve) generally buys a cake every day (his purchasing rate 𝜆 is 1 cake/day). The probability that his next purchase takes place within one day of his current purchase can be found by taking the area under the blue curve between 0 and 1; it comes out to 0.63.

The second customer only buys a cake every week (his 𝜆 is 1/7 cake/day). After performing the same integration, we see that it is much less probable (P = 0.13) that his next purchase will happen sometime before tomorrow.
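Both of these numbers are easy to verify with a few lines of scipy; here is a minimal sketch (the two customer names are mine, invented for illustration):

```python
from scipy.stats import expon

# Time-to-next-purchase is exponential with rate lambda; scipy
# parameterizes the exponential by its scale, which is 1/lambda.
daily_buyer = expon(scale=1 / 1.0)       # lambda = 1 cake/day
weekly_buyer = expon(scale=1 / (1 / 7))  # lambda = 1/7 cake/day

# P(next purchase within 1 day) = area under the PDF between 0 and 1.
print(daily_buyer.cdf(1.0))   # ~0.632
print(weekly_buyer.cdf(1.0))  # ~0.133
```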

Gamma distribution to describe the variation in buying behavior across population

It is useful to think that all these customers, with their differing 𝜆’s, contribute to a store-wide 𝜆 distribution. Our task now is to model this 𝜆 distribution. In doing so, we’ll need to comply with the following requirements:

  • The distribution should preferably be one that is well-studied.
  • Since 𝜆 can only take positive real values, the chosen distribution must have support on the positive reals only.
  • The distribution needs to be flexible enough to model different customer bases with different purchasing behaviors.

The Gamma distribution ticks all those boxes and is the one used in BG-NBD to model 𝜆. It is parameterized by the shape parameter r and the scale parameter α; different combinations of these two parameters result in the gamma distribution taking distinct shapes. Here is the PDF of the distribution:
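$$f(\lambda \mid r, \alpha) = \frac{\alpha^{r} \lambda^{r-1} e^{-\lambda \alpha}}{\Gamma(r)}, \qquad \lambda > 0$$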

It is important to note that this Gamma distribution is not just some theoretical mumbo-jumbo. In fact, a particular Gamma distribution quantitatively describes the collective purchasing behavior of a specific customer base and carries important business implications.

For example, the blue line in the graph below shows a downward-sloping Gamma distribution that results from setting both r and α to equal 1. If this distribution corresponded to my customer base, I wouldn’t be too happy: the heavy concentration of mass near zero means that the bulk of my customers have purchasing rates 𝜆 that are close to zero. That is, they barely purchase any cake!

Another Gamma distribution is shown in orange. This is a healthier distribution, in which the density of 𝜆 peaks around 2, meaning that a considerable chunk of this customer base buys around two cakes per day. Not too shabby!
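Here is a minimal matplotlib sketch of these two shapes. Note that the article doesn’t spell out the parameters behind its orange curve, so r = 3 and α = 1 (which put the mode at (r-1)/α = 2) are my assumption:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gamma

lam = np.linspace(0.01, 6, 500)

# scipy's gamma takes the shape `a` and a `scale`; under the rate
# parameterization used above, scale = 1 / alpha.
flat = gamma(a=1, scale=1)    # r = 1, alpha = 1: mass piled up near zero
peaked = gamma(a=3, scale=1)  # hypothetical r = 3, alpha = 1: mode at (r-1)/alpha = 2

plt.plot(lam, flat.pdf(lam), label="r = 1, alpha = 1 (unhappy shop)")
plt.plot(lam, peaked.pdf(lam), label="r = 3, alpha = 1 (healthier shop)")
plt.xlabel("purchase rate lambda (cakes/day)")
plt.ylabel("density")
plt.legend()
plt.show()
```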

Now, a little nerdy note — the combination of Poisson/Gamma distributions, which we’ve been using to model our customers’ purchasing behavior, is also known as the Negative Binomial Distribution (NBD). Yep, this is where the name of our model comes from.

Deactivation of a customer is modeled as a geometric process

Now let’s deal with the deactivation process. As mentioned earlier, after each purchase, a customer will make a decision on whether or not to deactivate. We can assign a probability p for this deactivation. Consequently, the transaction after which a customer deactivates is distributed according to the shifted geometric distribution. The PMF of this discrete distribution is shown below:
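$$P(\text{deactivation after the } x^{\text{th}} \text{ transaction}) = p\,(1-p)^{x-1}, \qquad x = 1, 2, 3, \ldots$$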

This PMF is very intuitive: it comes from noting that (1) if a customer deactivates after the xᵗʰ transaction, he must have survived the preceding x-1 transactions, and (2) each of these survivals carries the probability (1-p). Do note that by definition, a customer must have performed at least one transaction before deactivating (otherwise he wouldn't have become our customer in the first place!).

The graph below compares two customers with p = 0.01 and p = 0.1.

We can see that the higher the p, the more likely it is that the deactivation happens earlier. The customer with p = 0.01 (blue) has a much lower probability of deactivating early compared to the customer with p = 0.1 (orange).
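If you’d like to reproduce this comparison, scipy’s geom implements exactly this shifted geometric distribution; a quick sketch:

```python
from scipy.stats import geom

# scipy's geom is the shifted geometric used above: P(X = x) = p * (1-p)**(x-1).
loyal, flighty = geom(p=0.01), geom(p=0.1)

# Probability of deactivating within the first 5 transactions.
print(loyal.cdf(5))    # ~0.049
print(flighty.cdf(5))  # ~0.410
```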

Beta distribution to describe the variation in deactivation probability

Similar to 𝜆, it is useful to consider that a population of customers is associated with a distribution of p. This time, however, we can’t use the Gamma distribution, which has no upper bound. We’ll need another distribution that is equally flexible but whose values range from 0 to 1 (because p can only be between 0 and 1).

This time, the Beta distribution fits our needs. Here is the PDF of the Beta distribution:
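$$f(p \mid a, b) = \frac{p^{a-1} (1-p)^{b-1}}{B(a, b)}, \qquad 0 \le p \le 1$$

where $B(a, b)$ is the Beta function.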

We can see that the distribution is parameterized by two positive shape parameters, a and b. Here are some examples of Beta distributions:

Similar to the Gamma distribution, this Beta distribution also carries business implications. You’d want to see a Beta distribution that puts most of its weight near 0, which suggests that most of your customers have a low p and aren’t likely to deactivate early. The orange line in the graph above is an example.

Another quick note — it is this combination of Beta/Geometric distributions that gives rise to the “BG” in the BG-NBD model. Now you know!

Tying everything together: mathematical model of likelihood on an individual level

We’ve looked at all the distributions with which we quantitatively describe the behavior of our customers. How do we then obtain the best parameters for these distributions?

One way is to get the maximum likelihood estimators (MLE), which are parameter estimators that maximize the likelihood that the model produced the data that were actually observed.

Let’s make it more concrete. Suppose that we are currently at time T and we’re looking back at the historical transactions of a particular customer who has a purchase rate 𝜆. He made his first transaction at t₁ and his last at tₓ. These points, drawn on a timeline, look like this:

We can derive the individual-level likelihood function of this person by following the steps below:

  • The likelihood of the first transaction occurring at t₁ is described using the exponential distribution we elaborated earlier: $\lambda e^{-\lambda t_1}$.
  • The likelihood of the second transaction occurring at t₂ is the probability of the customer remaining active after t₁, which is $(1-p)$, multiplied by the standard exponential likelihood component: $(1-p)\,\lambda e^{-\lambda (t_2 - t_1)}$.
  • Such a likelihood pattern is repeated for each subsequent transaction; that is, the likelihood of the xᵗʰ transaction happening at tₓ is $(1-p)\,\lambda e^{-\lambda (t_x - t_{x-1})}$.
  • Now, let’s analyze what happens after the last transaction at tₓ. We didn’t observe any transaction between tₓ and T; this absence can be due to either of the following two scenarios:
  1. The customer deactivated after his last transaction at tₓ. As we know, the probability of this happening is p.
  2. He remained active yet didn’t make any transaction in this interval. The probability of this happening is $(1-p)\,e^{-\lambda (T - t_x)}$.
  • The likelihood of observing the transaction pattern that we observed is simply the product of the likelihoods of all earlier transactions times the sum of the likelihoods of the two scenarios:

$$L(\lambda, p \mid t_1, \ldots, t_x, T) = (1-p)^{x-1} \lambda^{x} e^{-\lambda t_x} \left[\, p + (1-p)\, e^{-\lambda (T - t_x)} \,\right]$$

  • The likelihood formula defined above is applicable to customers who made at least one purchase in the observation period. Since we assume that all customers were active to begin with, the likelihood that a customer makes no purchase in the interval [0, T] is the standard exponential survival term: $e^{-\lambda T}$.
  • Lastly, combining the above two likelihood formulas, we obtain a generalized formula for all customers regardless of the number of transactions they made (or lack thereof):

$$L(\lambda, p \mid x, t_x, T) = (1-p)^{x} \lambda^{x} e^{-\lambda T} + \delta_{x>0}\; p\, (1-p)^{x-1} \lambda^{x} e^{-\lambda t_x}$$

where $\delta_{x>0}$ equals 1 when x > 0 and 0 otherwise.

We can then programmatically try out different values of p and 𝜆 and choose a (p, 𝜆) combination that maximizes this likelihood. These parameter values, referred to as Maximum Likelihood Estimators (MLE), represent the “best” parameters that describe the purchasing behavior and the deactivation probability of this individual.
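As a minimal sketch of this idea, the snippet below plugs the generalized likelihood above into scipy’s optimizer for a single hypothetical customer (x = 8 repeat purchases, tₓ = 60 days, T = 90 days; all numbers are made up for illustration):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, x, t_x, T):
    """Negative log of the individual-level BG-NBD likelihood derived above."""
    lam, p = params
    still_active = (1 - p) ** x * lam ** x * np.exp(-lam * T)
    deactivated = p * (1 - p) ** (x - 1) * lam ** x * np.exp(-lam * t_x) if x > 0 else 0.0
    return -np.log(still_active + deactivated)

# Hypothetical customer: 8 repeat purchases, the last one on day 60, observed for 90 days.
result = minimize(
    neg_log_likelihood,
    x0=[0.1, 0.1],                              # initial guesses for (lambda, p)
    args=(8, 60.0, 90.0),
    bounds=[(1e-6, None), (1e-6, 1 - 1e-6)],    # lambda > 0, 0 < p < 1
)
lam_mle, p_mle = result.x
print(f"lambda_MLE = {lam_mle:.3f} purchases/day, p_MLE = {p_mle:.3f}")
```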

Do note that this individual-level likelihood function involves only three variables, all of which are supplied by the data:

  • x: the number of repeat transactions. This is also called the (repeat) frequency.
  • tₓ: the age of the customer at his last transaction time, that is, the time that had passed between his first and his last transaction. This is also called recency.
  • T: the age of the customer at the point of analysis, that is, the time that had passed between his first transaction and the time of analysis.

Interestingly enough, we can see that the times of the earlier transactions are not part of the formula.

A dataset whose rows correspond to different customer IDs and whose columns indicate the x, tₓ, and T of each customer is referred to as being in the “RFM format”. The "R" and "F" here stand for recency and (repeat) frequency, respectively. Meanwhile, "M" stands for monetary value; this is a column we won't use in our analysis since we're not concerned about the transaction values. The RFM format is the canonical format used frequently in CLV analysis.
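To make the format concrete, here is a hedged pandas sketch that collapses a raw transaction log into these three columns; the toy data and the to_rfm helper are invented for illustration (lifetimes, which we’ll meet in Part 2, also ships a utility for this conversion):

```python
import pandas as pd

# A hypothetical transaction log: one row per purchase.
transactions = pd.DataFrame({
    "customer_id": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2021-01-05", "2021-03-10", "2021-06-01", "2021-02-20", "2021-09-15"]
    ),
})

T_END = pd.Timestamp("2021-12-31")  # time of analysis

def to_rfm(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse a transaction log into the (x, t_x, T) summary per customer."""
    grouped = df.groupby("customer_id")["date"]
    first, last, count = grouped.min(), grouped.max(), grouped.count()
    return pd.DataFrame({
        "frequency": count - 1,             # x: repeat transactions only
        "recency": (last - first).dt.days,  # t_x: first-to-last purchase
        "T": (T_END - first).dt.days,       # customer age at analysis time
    })

print(to_rfm(transactions))
```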

Zooming out: mathematical model of likelihood on a population level

As a company with (hopefully) many customers, oftentimes, we’re not too interested in looking at individual customers. Rather, we’d like to analyze our customer base as a whole. Specifically, we’re interested in obtaining the best Gamma and Beta distributions that describe the performance of our entire business.

Just like how we can use MLE to get the best p and 𝜆 for an individual, we can also use MLE to get the best r, α, a, and b for the population. I won’t be deriving the population-level likelihood equation in this article; it’s long enough as it is. However, if you’ve understood the math above, you should be in good shape to dive into the derivation that is clearly explained in the BG-NBD paper.

Outro

Other applications of BG-NBD

We’ve so far framed our discussion around CLV calculation, which is what BG-NBD was initially intended for. However, BG-NBD is more versatile than that. It can in fact be used to model any phenomenon that involves different “users” making repeated “transactions”, and to predict (1) how many future “transactions” those “users” will make, provided that they are still “active”, and (2) the probability that they are still “active” at the time of analysis. For example:

  • Predicting the future usage frequency of a mobile app by exploring users’ usage history.
  • Calculating the probability that your distant relative who used to call you periodically is still alive, literally, by analyzing her call pattern.
  • Checking if your Tinder dates have become disinterested in you by looking at their texting frequency.

Going forward

Alright, I know that we’ve gone through a lot of math, which can be challenging. You might be thinking: is there a way to skip all these equations and use a ready-made implementation of this model to start deriving business value out of it?

I hear you! In Part 2 of the series, we’ll look at the Python library lifetimes, which allows us to use a couple of lines of code to get the MLEs of r, α, a, and b from a given record of past transactions. This library also contains other useful analytical and plotting functions that will allow us to derive business insights from the BG-NBD model and other related models.

Afterwards, in Part 3, we'll check out an alternative implementation of BG-NBD, which approaches the parameter estimation from the Bayesian perspective. This Bayesian framework will allow us to "inject" our domain knowledge and/or beliefs into the modeling process.

I hope to see you there!

References

[1] Fader, P. S., Hardie, B. G. S., and Lee, K. L. (2005). “Counting Your Customers” the Easy Way: An Alternative to the Pareto/NBD Model. Marketing Science.

[2] Fader, P. S. and Hardie, B. G. S. (2013). The Gamma-Gamma Model of Monetary Value.

Note: All images, diagrams, tables and equations belong to me unless indicated otherwise.

If you have any comments about the article or would like to reach out to me, feel free to send me a connection through LinkedIn. Also, I’d be very grateful if you could support me by becoming a Medium member through my referral link. As a member, you’ll be able to read all my writings on data science and personal development and have full access to all stories on Medium.
