
Often in real-world regression problems the target variable may be a probability or a proportion. You can think of modelling the read ratios of your articles, the win ratio of your favorite soccer team, or the customer churn rate of your enterprise. All of these problems have in common that the target variable is bounded between 0 and 1.
In this post we will explore what kind of difficulties may arise in such a setting, why a standard linear regression may not be applicable and how to solve the problem.
The Problem
Let’s say you want to model the click-through rate of an advertisement on your website, depending on several variables like its size and position. Here, we generate some mock-up data:
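A minimal sketch of how such mock-up data can be generated, assuming scikit-learn’s `make_regression` with five features and the scaling factor of 50 mentioned later in the post (seed, noise level and sample size are arbitrary choices, so the exact coefficients will differ from the ones quoted below):

```python
import numpy as np
from scipy.special import expit
from sklearn.datasets import make_regression

# Generate a standard regression problem on the real numbers
# (5 features, Gaussian noise) and keep the true coefficients.
X, y, coef = make_regression(
    n_samples=1000,
    n_features=5,
    n_informative=3,
    noise=10.0,
    coef=True,
    random_state=42,
)

# Scale the target down and squash it into [0, 1] with the expit function.
y = expit(y / 50)
```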

After generating the regression problem, the target variable y is transformed by the expit function so that it lies in the interval between 0 and 1 (this will be explained below). Please note that, for convenience, we generate the regression data not within the bounded interval, but on the real numbers and with Gaussian errors. This is not optimal and serves illustrative purposes only.
This is what the data looks like when plotting one of the observables vs. y.

We can see that all y-values are in the interval [0,1] and some correlation with the observable is visible.
Perfect, let’s build a linear regression using ordinary least squares to model the data!
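A minimal sketch of the fit, assuming statsmodels’ OLS on the X and y generated above (any other OLS implementation would work just as well):

```python
import statsmodels.api as sm

# Ordinary least squares fit directly on the bounded target variable.
X_const = sm.add_constant(X)
ols_model = sm.OLS(y, X_const).fit()
print(ols_model.summary())

# Nothing constrains these predictions to the interval [0, 1].
y_pred_ols = ols_model.predict(X_const)
```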
Everything looks fine. So we will have a look at the predictions. Here we plot the X4 observable vs. the predicted y values:
![Prediction of the linear regression model. The data lies outside the [0,1] interval. Plot by author.](https://towardsdatascience.com/wp-content/uploads/2021/02/1NuissMXpH1EI8gtweuRrXQ.png)
As you can see, the model predicts negative values as well as values above one. As we want to interpret our target variable as click-through rates this does not make sense.
Additionally, random variables that represent probabilities or rates are often beta distributed. Ordinary least squares regression, however, assumes data that follows a Gaussian distribution.
Is there a way to build a regression model that can guarantee that the predicted values lie within the interval [0,1] and at the same time account for beta-distributed data? Indeed there is. In the following we will investigate the beta regression model which is described by Ferrari and Cribari-Neto [1].
Note that we will discuss data only on the interval [0,1], as this is the most common case. Nevertheless, the model also works for bounded data on an arbitrary interval [a,b]. To do so, just transform the target variable y → (y-a)/(b-a).
Prerequisites
Before we dive right into our problem, we will discuss a few functions that we need on the way.
A crucial role in performing a regression on bounded data is played by the so-called link function g, which maps from the bounded interval [0,1] to the real numbers: g: [0,1] → R.
A prominent function that does this job is the logit function. We can do the opposite using the inverse of the logit, which is called the expit or sigmoid function:
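$$
\operatorname{logit}(p) = \log\frac{p}{1-p}, \qquad \operatorname{expit}(x) = \operatorname{logit}^{-1}(x) = \frac{1}{1 + e^{-x}}
$$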
This is what both functions look like:

The logit function is steep at the edges of the [0,1] interval and thus maps values close to 1 to large real numbers (e.g., logit(0.99) = 4.60). The expit function, however, is flat for very high or very low values, slowly approaching 0 and 1 (e.g., expit(5) = 0.993).
These two functions are our toolbox for mapping from the [0,1] interval to the real numbers and back.
Besides the logit and expit functions, we will work with the beta distribution. The beta distribution is especially useful for modelling probabilities as it is defined on the interval [0,1]. It takes two shape parameters α and β and may look very different for different parameter values:

This is the definition of the beta distribution:
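$$
f(y;\, \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\; y^{\alpha - 1} (1 - y)^{\beta - 1}, \qquad 0 < y < 1
$$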

However, we will use it with a slightly different parametrization, substituting µ = α/(α+β) and ϕ = α+β. In this case µ can be interpreted as the mean of the distribution and ϕ as its precision. With this parametrization the beta distribution looks as follows:
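$$
f(y;\, \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma\big((1-\mu)\phi\big)}\; y^{\mu\phi - 1} (1 - y)^{(1-\mu)\phi - 1}, \qquad 0 < y < 1
$$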

Beta Regression
Now we are equipped with the tools necessary to tackle our problem.
The general idea of beta regression is that we use a link function g (e.g., the logit) to map from our bounded space [0,1] to the real numbers. There we perform a regression, assuming our data is beta distributed, by maximizing the corresponding likelihood. In order to interpret the results in our bounded space [0,1], we need to map them back using the expit function. Because we perform the regression in the space of the real numbers and only later map back to our bounded space, we can guarantee that any predictions from our model are also bounded to the interval [0,1].
In the following we will go through the different steps, including the corresponding Python implementation.
First we set up the regression model. For this we assume that, for each data point, our target variable y (e.g., the click-through rate of our advertisement) is represented by the mean µ of a beta distribution. Further, we assume that we can model µ as:
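$$
g(\mu_t) = \sum_{i} x_{ti}\,\beta_i
$$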

where βᵢ are the regression parameters and the index t indicates our data points. g is our link function, as discussed above. This means that we transform µ (the mean of the beta distribution describing y) by the logit function from the bounded interval to the real numbers. There, we model g(µ) by a standard linear regression model.
But how do we obtain the parameters βᵢ? We maximize the likelihood of our data under this model. The likelihood is defined as:
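$$
\mathcal{L}(\beta, \phi) = \prod_{t=1}^{n} f(y_t;\, \mu_t, \phi), \qquad \mu_t = g^{-1}\Big(\sum_{i} x_{ti}\,\beta_i\Big)
$$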

This means that we multiply, over all data points, the probability of our data given our model. Here f is the beta distribution defined above. We want to find the parameters βᵢ such that the likelihood is maximal. As usual, in practice it is easier to minimize the negative logarithm of the likelihood (the negative log-likelihood), but this is only a technical detail. You can look up the analytical form of the log-likelihood in [1].
We can think of our analysis as consisting of two different worlds: a probability world bounded to the interval [0,1] and the world of the real numbers. The logit and expit functions are the link back and forth between these two worlds. Here is an overview of which world the different parameters live in.

Our problem is defined on the bounded interval [0,1]. This is where the target variable y is drawn from. We assume that y can be modelled by a beta distribution with mean µ and precision ϕ. Further, we define our regression model in the space of the real numbers, as we assume that g(µ) can be modelled by Xβ. We then maximize the likelihood of our data y with respect to the beta distributions with mean µ and precision ϕ, finding optimal parameters βᵢ. Remember that µ depends on βᵢ via µ = g⁻¹(Xβ). In order to interpret the predictions made by our model we need to map them back to the bounded interval [0,1] with g⁻¹(Xβ).
Below is Python code implementing the negative log-likelihood:
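A minimal sketch of such a function, assuming the µ/ϕ parametrization introduced above, the expit link, and scipy’s beta distribution for the density; the parameter vector packs the regression coefficients followed by ϕ:

```python
import numpy as np
from scipy.special import expit
from scipy.stats import beta as beta_dist

def neg_log_likelihood(params, X, y):
    """Negative log-likelihood of y under a beta distribution with
    mean mu = expit(X @ coefs) and precision phi."""
    coefs, phi = params[:-1], params[-1]
    # Map the linear predictor from the real numbers back to (0, 1).
    mu = expit(X @ coefs)
    # Convert (mu, phi) back to the standard shape parameters.
    a, b = mu * phi, (1.0 - mu) * phi
    return -np.sum(beta_dist.logpdf(y, a, b))
```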
Now we can run an optimizer to minimize the negative log-likelihood. We want to find optimal parameters µ = g⁻¹(Xβ) and ϕ such that the likelihood of the data y is maximal under our model. Note that ϕ is a regression parameter as well and is included in the optimization. This is done to find the best-fitting shape of the beta distributions.
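A minimal sketch of the optimization, assuming scipy.optimize.minimize with the L-BFGS-B method and the neg_log_likelihood function from above:

```python
from scipy.optimize import minimize

n_features = X.shape[1]

# Initial guess: zero coefficients and a moderate precision phi.
start = np.append(np.zeros(n_features), 10.0)

# The coefficients are unbounded, but phi has to stay strictly positive.
bounds = [(None, None)] * n_features + [(1e-6, None)]

result = minimize(
    neg_log_likelihood,
    start,
    args=(X, y),
    method="L-BFGS-B",
    bounds=bounds,
)

beta_hat, phi_hat = result.x[:-1], result.x[-1]
print(result.message)
print(beta_hat, phi_hat)
```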
The bounds for the minimizer are necessary in order to constrain ϕ > 0, which is important as the beta distribution is not defined for ϕ ≤ 0. The output of the minimizer states that the convergence was successful and returns the regression parameters βᵢ as well as ϕ:

We can check whether the regression recovers the parameters that were used to generate the data. For this we need to multiply them by 50, as we used this factor to map the target variable to the [0,1] interval (see the first code snippet). If we do so we get (true values in parentheses):
X0 = 10.8 (10.7), X1 = 0.1 (0.0), X2 = -0.1 (0.0), X3 = 66.3 (66.5), X4 = 41.5 (41.2).
As you can see we are able to reproduce the regression parameters with high accuracy.
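In code, this check amounts to undoing the scaling of 50 from the data-generation sketch above and comparing the result with the true coefficients (here assumed to be the coef array returned by make_regression):

```python
# Undo the scaling applied during data generation and compare.
print(np.round(beta_hat * 50, 1))  # recovered regression coefficients
print(np.round(coef, 1))           # true coefficients from make_regression
```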
Finally, let’s have a look at the predictions made by our model. Remember that we need to map them back from the space of the real numbers to our bounded interval using the expit function:
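A minimal sketch, assuming beta_hat from the optimization above:

```python
from scipy.special import expit

# Map the linear predictor back from the real numbers to [0, 1].
y_pred_beta = expit(X @ beta_hat)

print(y_pred_beta.min(), y_pred_beta.max())
```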

We can see that all values are in the interval [0,1] as they should be.
Conclusion
We saw that performing a regression on bounded data such as rates, probabilities or proportions has several pitfalls. First, a linear regression model may predict values that do not lie within the interval, thus making unrealistic predictions. Second, data like rates or probabilities are often not Gaussian distributed, which makes the ordinary least squares approach unsuitable. Beta regression takes care of both points. The regression is performed in a transformed space and the results are then mapped back to the bounded interval. Furthermore, the model assumes that the data is beta distributed. This post includes the code necessary to perform a beta regression in Python.
Resources
[1] Ferrari, S., & Cribari-Neto, F. (2004). Beta regression for modelling rates and proportions. Journal of Applied Statistics, 31(7), 799–815.