The Paper: “A Deep Probabilistic Model for Customer Lifetime Value Prediction”

A run through the paper’s neural network architecture and loss function

Maja Pavlovic
Towards Data Science


About

Predicting a customer’s lifetime value (LTV) can be quite a challenging task. Wang, Liu and Miao propose using a neural network with a mixture loss to handle the intricacies of churn and lifetime value modelling of new customers.

In this blogpost we’ll take a look at their proposed solution and go through the reasoning behind their ideas. The first section briefly summarises the importance and challenges of lifetime value modelling. It also looks at how Wang, Liu and Miao tackle these problems and quickly runs through the results of their paper. The second section then takes a deeper look at their chosen architecture, specifically the output layer. Lastly, the final section goes through the proposed loss function in more detail.

This post is not intended to be an in-depth mathematical explanation; rather, it is meant to provide a high-level, more approachable explanation of the paper.

Paper Overview

Why are lifetime value and churn important? Having accurate insights into customers’ future purchase behaviour can support a variety of operational and strategic business activities. For example, it can help companies with customer retention by segmenting customers according to their LTV, and targeting specific customer segments with different offers and loyalty schemes, resulting in more efficient resource spending.

Figure 1 shows a typical skewed LTV distribution. The large count at LTV = 0 shows the huge proportion of customers who have churned (they bought a product once and then never returned). To visualise the returning customers a bit better, the x-axis is displayed as the logarithm of (LTV + 1). This shows that the lifetime spend of returning customers varies considerably. A small proportion of high spenders can sometimes account for a large share of business revenue. This distribution, with a large proportion of one-time spenders and a few very big spenders, poses a challenge to the traditional mean-squared-error approach. You will see this in more detail in the section that dives into the loss function below.

Figure 1 | image by author, inspired by figure 1 in the paper

Commonly, LTV and churn modelling is done separately (figure 2, left). Wang, Liu and Miao’s approach, however, allows them to tackle Churn and LTV prediction simultaneously (figure 2, right).

Figure 2 | image by author

Instead of looking at the LTV and Churn distributions separately, they propose using “a mixture of zero-point mass and a lognormal distribution”, calling it the zero-inflated lognormal distribution (ZILN). They then derive a mixture loss by taking the negative log-likelihood of that ZILN distribution and then use it to train their neural network on both tasks simultaneously. Too much information? Don’t worry about it. We will look at what loss functions are and how the ZILN works in more detail below. What you should take away for now is that you can use one model to predict the two tasks: Churn and LTV.

Wang, Liu and Miao measure the performance of their approach on the two subtasks. While the Churn prediction achieves a comparable performance with a classical binary classification loss, their LTV prediction task outperforms the traditional mean-squared-error (MSE) approach on three different evaluation metrics. We won’t dissect their evaluation metrics in this post, but you can take a look at their paper if you’d like to understand their results more.

Deep Dive: Architecture — Output Layer

As mentioned above, Wang, Liu and Miao use a neural network for their model. They opt for a simple feed-forward neural network with 64 units in the first hidden layer and 32 in the second; see figure 3 below. The hidden units use the ReLU activation function. Numerical inputs are fed in directly, while categorical inputs are encoded as embeddings. We will focus on the output layer, as its units feed directly into the loss function that we will discuss below.

Figure 3 | image by author
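To make the architecture concrete, here is a minimal Keras sketch of the shared trunk. The feature names, embedding sizes and input dimensions are made up for illustration; the paper does not prescribe them.

```python
import tensorflow as tf

# Hypothetical inputs: a block of numeric features and one categorical feature.
numeric_in = tf.keras.Input(shape=(10,), name="numeric_features")
country_in = tf.keras.Input(shape=(1,), dtype="int32", name="country_id")

# Categorical inputs are encoded as embeddings; numeric inputs are fed in directly.
country_emb = tf.keras.layers.Embedding(input_dim=200, output_dim=8)(country_in)
country_emb = tf.keras.layers.Flatten()(country_emb)

x = tf.keras.layers.Concatenate()([numeric_in, country_emb])
x = tf.keras.layers.Dense(64, activation="relu")(x)       # first hidden layer
hidden = tf.keras.layers.Dense(32, activation="relu")(x)  # second hidden layer
```

The output layer that sits on top of `hidden` is what we look at next.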

The output layer has one unit p to predict the probability of someone having churned or not. The activation function used to represent that probability is the sigmoid, which works well for outputting probabilities as its output range lies between 0 and 1. A typical threshold is 0.5: a prediction below 0.5 means the customer is predicted to still be active, while a prediction above 0.5 means the customer is predicted to have churned. See figure 4 below.

Figure 4 | image by author

LTV, on the other hand, needs two units: one to predict a location parameter μ and another to predict a scale parameter σ. We then use these two parameters to define a full prediction distribution, which in turn provides an uncertainty measure for our prediction (the uncertainty can be estimated using the quantiles of the predicted distribution). These two outputs μ and σ should not be confused with the more familiar mean and standard deviation of a normal distribution.
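As a small illustration of that last point: once the network has produced μ and σ for a customer, a lognormal distribution can be built from them and its quantiles used as a prediction interval. The parameter values below are made up.

```python
import numpy as np
from scipy.stats import lognorm

mu, sigma = 3.0, 0.5                        # hypothetical outputs of the two LTV units

dist = lognorm(s=sigma, scale=np.exp(mu))   # lognormal parametrised by mu and sigma
median_ltv = dist.median()                  # exp(mu) ~ 20.1
low, high = dist.ppf([0.05, 0.95])          # 90% prediction interval
print(median_ltv, low, high)                # ~20.1, ~8.8, ~45.7
```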

The μ unit has no constraints in terms of range or sign. Thus, Wang, Liu and Miao use an identity (linear) activation for μ: the unit's raw output is used directly, without being passed through a non-linear activation function.

The scale parameter σ should only take positive values. A common choice like the exponential function would satisfy this positivity constraint, but its steep growth can easily lead to exploding gradients. Wang, Liu and Miao instead opted for a softplus activation, which grows far less steeply than the exponential. However, the softplus alone didn't seem enough to avoid instability when training their model, so they also applied gradient clipping.

Figure 5 | image by author

The output layer then ends up looking like figure 6 below. Churn just needs the one probability unit p, whereas LTV has both μ and σ. By learning to represent multiple tasks (both churn and LTV) with one model, the middle layers of the network learn representations that generalise better on each subtask. If you want to understand more about this, have a read about multi-task learning.

Figure 6 | image by author
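Here is a sketch of what these three output heads could look like in Keras, assuming `hidden` is the 32-unit representation from the trunk sketched earlier (replaced here by an Input layer so the snippet is self-contained). In practice you might also output raw logits and apply the activations inside the loss, as in the loss sketch at the end of this post.

```python
import tensorflow as tf

hidden = tf.keras.Input(shape=(32,), name="shared_hidden")   # stand-in for the trunk output

p = tf.keras.layers.Dense(1, activation="sigmoid", name="churn_probability")(hidden)
mu = tf.keras.layers.Dense(1, activation="linear", name="mu")(hidden)           # identity
sigma = tf.keras.layers.Dense(1, activation="softplus", name="sigma")(hidden)   # positive

outputs = tf.keras.layers.Concatenate(name="p_mu_sigma")([p, mu, sigma])
model = tf.keras.Model(inputs=hidden, outputs=outputs)

# The paper reports that softplus alone was not enough to stabilise training;
# clipnorm is one common way to add gradient clipping in Keras.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
```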

Deep Dive: ZILN Loss

To train this network we minimise the error between the neural network’s predictions and the actual target value from the data. In our example this means we compare our predicted churn probability to the actual churn value. This performance measure is called the loss.

Depending on what you are trying to achieve, you choose a different loss function to optimise your neural network against. Given that the architecture above combines two tasks, we will need a loss function tailored to each subtask.

Two common loss functions are the Binary Cross Entropy (BCE) loss and the Mean Squared Error (MSE) loss. BCE is used for binary classification tasks where you are trying to predict whether something is true or false, i.e. there are only two possible states; in our case that is whether someone has churned or not. MSE, on the other hand, is a commonly used loss for regression problems where you are trying to predict a continuous value such as a salary, a stock price or, in our case, LTV.

Binary Cross Entropy (BCE) Loss — Churn

For the churn task of the neural network Wang, Liu and Miao use the BCE loss:
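L_BCE = -(1/N) Σᵢ [ yᵢ · log(pᵢ) + (1 - yᵢ) · log(1 - pᵢ) ]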

where yᵢ is the true target value (1 for churn, 0 for no churn), pᵢ is the predicted probability of churn and (1 - pᵢ) is the predicted probability of no churn

The aim in our case is to correctly classify churn (class 1) and no churn (class 0). We need a way to handle both classes in one function and to put a higher penalty on wrong predictions. The first part is achieved by activating only one term of the loss function at a time, while the other term is deactivated. The much higher error on wrong predictions comes from the logarithm. Let's look at an example to clarify this.

To keep it simple, we will just look at a single observation whose true class is 1 (churn, y₁ = 1). This removes the summation, and we are left with:
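L_BCE = -[ 1 · log(p₁) + (1 - 1) · log(1 - p₁) ] = -log(p₁)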

Now let’s say our model predicts this observation to be churn with a probability of 0.95 (p₁=0.95). That means that we get the following error for our observation:
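L_BCE = -log(0.95) ≈ 0.051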

This shows that only the first part of the equation is activated; the other part is multiplied by 0 and cancels out. We are left with the log() of our predicted probability, which gives tiny loss values when pᵢ is close to 1 and grows larger and larger as the probability approaches 0 (i.e. as we predict the wrong class), see figure 7 (left) below. In our example the probability is 0.95, so we end up with a small error value of 0.051 for this observation. If the true value is instead 0, the other part of the loss function activates, and we end up with a large error of about 2.996 if we still predict churn with a probability of 0.95, see figure 7 (right) below.

Figure 7 | image by author

This example is summarised in the first two rows of table 1 below. The table additionally shows how the loss function behaves in the opposite scenario, where the predicted churn probability is low (pᵢ = 0.01).

Table 1 | image by author
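These values are easy to verify directly; a quick numpy check of the four scenarios described above (the exact layout of table 1 is not reproduced here):

```python
import numpy as np

for y_true, p in [(1, 0.95), (0, 0.95), (0, 0.01), (1, 0.01)]:
    loss = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    print(f"y={y_true}, p={p}: loss={loss:.3f}")
# y=1, p=0.95: 0.051   (confident and correct -> small loss)
# y=0, p=0.95: 2.996   (confident and wrong   -> large loss)
# y=0, p=0.01: 0.010   (confident and correct -> small loss)
# y=1, p=0.01: 4.605   (confident and wrong   -> large loss)
```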

This example should give you a general idea of why BCE works well for tasks where you are predicting one class or the other. Having such a large error on wrong predictions forces the model to learn a better representation, which should reduce the overall prediction error.

Lognormal Loss — LTV

As mentioned above, rather than predicting a single point estimate for the LTV task of the network it is preferable to have a full prediction distribution. The distribution can give us an idea about the spread of observation points around our predicted LTV value. Wang, Liu and Miao propose using the lognormal distribution. The two parameters μ and σ from above describe the distribution’s probability density function (PDF):
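f(x; μ, σ) = 1 / (x · σ · √(2π)) · exp( -(log(x) - μ)² / (2σ²) ),   for x > 0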

See figure 8 below. This lognormal distribution resembles typical LTV data quite well: it is right-skewed, starting at zero and going into positive infinity. Hence, it makes sense for the neural network to learn the parameters of the lognormal distribution.

Figure 8 | image by author

We obtain the lognormal loss by taking the negative log-likelihood of the lognormal distribution; see the loss function below. We won't go through the steps of the derivation here. The main idea is simply to compare the lognormal loss to the classic MSE loss and see how the two behave differently.
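L_Lognormal = (1/N) Σᵢ [ log(xᵢ · σᵢ · √(2π)) + (log(xᵢ) - μᵢ)² / (2σᵢ²) ]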

where xᵢ is the actual true value, μᵢ and σᵢ are the estimated parameters of the lognormal distribution for that observation, and N is the number of samples

Versus the MSE loss:
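L_MSE = (1/N) Σᵢ (xᵢ - x̂ᵢ)²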

where xᵢ is the actual true value, x̂ᵢ is the predicted value and N is the number of samples

Let’s look at a few examples. The first example will show us how these losses differ on small and large prediction values. The second example will show us how they differ when predictions are over- or under-estimated.

Lognormal takes more of a relative look

We have one case where the true value (the median of the lognormal) is 20 and another case where it is 20,000. The table below compares the MSE loss to the lognormal loss (with two different σ² values). We can see that the MSE penalises the high spender (case 2, loss: 16,000,000) far more harshly than the lower spender (case 1, loss: 16). The loss in case 2 is 1,000,000 times larger than in case 1, despite the relative deviation from the true value being the same (20% in both cases).

Table 2 | image by author

Given that a high-value customer typically spends orders of magnitude more than an average customer, the MSE is less suited to this problem: it would over-penalise the prediction errors for our high-value customers. The lognormal loss, on the other hand, treats a given relative error for an average-spending customer in much the same way as the same relative error for a high-value customer.
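To make this behaviour concrete, here is a rough numeric sketch. The exact predictions and σ² values behind table 2 are not spelled out in the text, so the snippet assumes a 20% overestimate in both cases (predicting 24 for a true value of 20, and 24,000 for 20,000), σ = 1, and the prediction treated as the lognormal median (μ = log(prediction)).

```python
import numpy as np

def lognormal_loss(x_true, mu, sigma):
    # Per-observation negative log-likelihood of a lognormal with parameters mu, sigma.
    return np.log(x_true * sigma * np.sqrt(2 * np.pi)) + (np.log(x_true) - mu) ** 2 / (2 * sigma ** 2)

sigma = 1.0                                            # assumed for illustration
for x_true, x_pred in [(20, 24), (20_000, 24_000)]:    # both predictions 20% above the true value
    mse = (x_true - x_pred) ** 2
    lnl = lognormal_loss(x_true, np.log(x_pred), sigma)
    print(f"true={x_true:>6}: MSE={mse:>10,}  lognormal={lnl:.3f}")

# The MSE grows by a factor of a million for the high spender, while the
# prediction-dependent part of the lognormal loss, (log(x) - mu)^2 / (2 sigma^2),
# is identical in both cases; only the constant offset log(x * sigma * sqrt(2 pi)) changes.
```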

Lognormal penalises underestimates more than overestimates

Here both cases have a true value of 20, but in case 1 we under-predict by 6, whereas in case 2 we over-predict by 6. The table below compares the two losses on both cases again. We can see that the MSE penalises underestimates and overestimates equally, while the lognormal loss penalises underestimates more than overestimates. Intuitively, the lognormal loss is symmetric on a log scale: under-predicting by a factor k costs the same as over-predicting by the same factor. An under-prediction of 6 corresponds to a factor of 20/14 ≈ 1.43, whereas an over-prediction of 6 is only a factor of 26/20 = 1.30, so the under-prediction is penalised more.

Table 3 | image by author
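Again a small numeric sketch of this behaviour, assuming σ = 1 and treating the prediction as the lognormal median (the exact σ² used in table 3 is not given in the text):

```python
import numpy as np

def lognormal_loss(x_true, mu, sigma):
    # Per-observation negative log-likelihood of a lognormal with parameters mu, sigma.
    return np.log(x_true * sigma * np.sqrt(2 * np.pi)) + (np.log(x_true) - mu) ** 2 / (2 * sigma ** 2)

x_true, sigma = 20, 1.0
for x_pred in (14, 26):                                # under- and over-prediction by 6
    print(x_pred,
          (x_true - x_pred) ** 2,                      # MSE: 36 in both cases
          round(lognormal_loss(x_true, np.log(x_pred), sigma), 3))

# The lognormal loss is larger for the under-prediction (14) than for the
# over-prediction (26), while the MSE cannot tell the two apart.
```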

This asymmetric behaviour is visualised in figure 9 below¹. While the MSE is symmetric around its minimum, the lognormal loss becomes more asymmetric as σ² increases.

Figure 9 | image by author, inspired by figure 2 in the paper

So far, we've treated the predicted LTV as the median of the lognormal distribution, which is exp(μ) (so a perfect prediction corresponds to μ = log(x)). Wang, Liu and Miao, however, suggest predicting LTV as the mean of the lognormal distribution, which is:
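LTV = E[X] = exp(μ + σ²/2)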

This means a perfect prediction now corresponds to μ = log(x) - σ²/2 in the loss function. The loss still behaves in the same way as above; the only difference is a bias in our predictions towards higher numbers. The higher σ², the higher our predicted LTV will be. If, for instance, our true value is 20 and we perfectly predict the μ unit but have a σ² of 0.3, then our predicted LTV will be about 23.2. The same μ with a σ² of only 0.01 would give an LTV prediction of about 20.1 (much closer to the true value). We can see this shift of the loss minimum in figure 10 below².

Figure 10 | image by author, inspired by figure 2 in the paper

Both the mean and the median are viable choices when using the lognormal. But when do we use the lognormal instead of the MSE? The lognormal loss should be preferred over the MSE when the range of our true LTV values is large, since we don't want errors on large values to be penalised disproportionately.

Zero-Inflated-Lognormal (ZILN) Loss — Churn & LTV together

Now that we've looked at the two tasks individually, the final step to obtaining the ZILN mixture loss is straightforward. It is simply a linear combination of the two losses from above. See figure 11 below.

Figure 11 | image by author

Wang, Liu and Miao assume an equal weighting for each subtask, so the loss function ends up looking as follows:
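L_ZILN = L_BCE + 1{LTV > 0} · L_Lognormal

where 1{LTV > 0} is 1 for returning customers and 0 for churned customers, so the lognormal term only contributes when there is a non-zero lifetime value to fit.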

This means we end up training the neural network by simply minimising the sum of the two loss functions (BCE and Lognormal loss).
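For completeness, here is a minimal TensorFlow sketch of how such a loss could be written. It is not the authors' reference implementation (their code is linked from the paper), and it assumes the network outputs three raw values per customer, [churn logit, μ, raw σ], with the activations applied inside the loss.

```python
import math
import tensorflow as tf

def ziln_loss(y_true, outputs):
    """Sketch of a zero-inflated lognormal loss.

    y_true:  actual LTV per customer, shape (batch, 1); 0 means the customer churned.
    outputs: raw model outputs, shape (batch, 3) -> [churn logit, mu, raw sigma].
    """
    y_true = tf.cast(y_true, tf.float32)
    churned = tf.cast(tf.equal(y_true, 0.0), tf.float32)             # 1 for one-time buyers
    returned = 1.0 - churned

    churn_logit = outputs[..., 0:1]
    mu = outputs[..., 1:2]                                            # identity activation
    sigma = tf.maximum(tf.math.softplus(outputs[..., 2:3]), 1e-6)     # keep sigma positive

    # Churn part: BCE between the observed churn label and the predicted churn probability.
    bce = tf.nn.sigmoid_cross_entropy_with_logits(labels=churned, logits=churn_logit)

    # LTV part: lognormal negative log-likelihood, only counted for returning customers.
    safe_ltv = tf.maximum(y_true, 1e-6)                               # avoid log(0) on churned rows
    lognormal_nll = (
        tf.math.log(safe_ltv * sigma * math.sqrt(2.0 * math.pi))
        + tf.square(tf.math.log(safe_ltv) - mu) / (2.0 * tf.square(sigma))
    )

    return tf.reduce_mean(bce + returned * lognormal_nll)
```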

Summary

In this blog post you had a look at the paper A Deep Probabilistic Model for Customer Lifetime Value Prediction, which proposes a multi-task approach to LTV modelling. You explored the proposed output units and the associated BCE and lognormal loss functions that, when combined, result in the proposed ZILN loss. The benefits of this approach are:

  • Having an LTV prediction distribution instead of a single point prediction (which gives you insights into the uncertainty of your predictions)
  • Better generalisation on both churn and LTV modelling, due to the multitask learning approach
  • Reduction of the engineering effort around building, training and maintaining two separate models

What this post didn't cover is how the results compare to other models or how to apply the model to a specific use case. Given that their code is available online, you should be able to use it as is for modelling new customers' expected LTV, or adapt it to model lifetime value over time using a sequential neural network architecture such as an RNN or LSTM.

References

[1] X. Wang, T. Liu and J. Miao, A Deep Probabilistic Model for Customer Lifetime Value Prediction (2019), Google Research Publications

Footnotes

¹ To make all losses start at the same minimum y-value, the minimum of each loss was subtracted, which is why the y-axis values differ from the direct loss calculations in the table above.

² The figure in the paper appears to scale each of the loss functions to show the effect of the shift more clearly.
