Binary Classification with Logistic Regression

Estimating the CTR in Online Advertising

Dirk Hornung
Towards Data Science


In performance marketing, an important Key Performance Indicator (KPI) is the Click-Through Rate (CTR). The CTR is the ratio of users who click on a specific link to the total number of users who view a page, email, or advertisement (ad).

Estimating the CTR is a binary classification problem. When a user views an ad, they either click (y=1) or do not click (y=0). Having only two possible outcomes lets us use logistic regression as our model. Logistic regression can estimate any number of discrete classes, in contrast to linear regression, which is used to infer continuous variables. The simple visualization below matches the right model to three of the major data science problem types:

How to choose a model

In this story, I want to guide you through the technical details of logistic regression first, before applying everything learned to the “Click-Through Rate Prediction” challenge on Kaggle¹.

Binary Logistic Regression: The Theory

Logistic regression is characterized by a logistic function that models the conditional probability of the label Y given the variables X

The conditional probability.

In our case Y takes the state clicked or not clicked, and X will be a vector of observed features we want to select (e.g. device type).

We will work with m observations, each containing n features. Each observation is represented by a row vector xᵢ of dimension n+1 (the extra entry is a constant 1 for the intercept). Our labels yᵢ can only be zero or one. The parameters are collected in a column vector Θ of dimension n+1.

Definitions of Y, X and Θ.
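
Since the original equation image is not reproduced here, a LaTeX reconstruction of the definitions described above:

x_i = (1, x_{i1}, \dots, x_{in}), \qquad y_i \in \{0, 1\}, \qquad \Theta = (\theta_0, \theta_1, \dots, \theta_n)^T, \qquad i = 1, \dots, m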

The conditional probability of a user who clicks given an observation X can then be modeled as the sigmoid function.

The conditional probability modeled with the sigmoid logistic function.
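
Written out (reconstructed from the definitions above), the model is

P(Y = 1 \mid X = x_i) = \sigma(\Theta^T x_i) = \frac{1}{1 + e^{-\Theta^T x_i}}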

The core of logistic regression is the sigmoid function. The sigmoid function maps a continuous variable to the open interval (0, 1), which can then be interpreted as a probability. Every data point falling on the right-hand side of the plot is interpreted as y=1 and every data point on the left-hand side as y=0.

A plot of the sigmoid function with labeled sample data.

Derivation (Optional)

The sigmoid function appears naturally when deriving the conditional probability. We can express P(Y|X) with Bayes’ theorem

Bayes’ theorem
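
In symbols, Bayes’ theorem reads

P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}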

From a Bayesian interpretation we have

  • P(Y|X) as the posterior,
  • P(Y) as the prior,
  • and P(X) as a normalization factor.

We will fit the posterior and the prior to our data and have to get rid of the unknown probability P(X). This can be done by using the complement conditional probability.

Complement conditional probability.

When dividing the posterior by the complement conditional probability and taking the logarithm we get the log-odds (logit)

The logit can be modeled as a linear function of X.

Here we assumed that the logit is a linear function in X! Now we just have to undo the logarithm and solve for the posterior to derive the sigmoid function

Sigmoid derivation.
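
Spelled out (a reconstruction, writing p for the posterior P(Y=1|X=x)):

P(Y = 0 \mid X = x) = 1 - P(Y = 1 \mid X = x) = 1 - p

\log \frac{p}{1 - p} = \Theta^T x \;\;\Longrightarrow\;\; p = \frac{1}{1 + e^{-\Theta^T x}} = \sigma(\Theta^T x)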

Maximum Likelihood Estimation

So far we have modeled the posterior with a set of parameters Θ. How do we determine the best choice of Θ? The conditional probability of a user who clicks is given by the sigmoid function, and the probabilities of all cases have to add up to one. As we have only two cases, we can express both probabilities within a single elegant expression:

The probability mass function of the Bernoulli distribution.
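
With σ(Θᵀxᵢ) as the click probability, both cases combine into (a reconstruction)

P(Y = y_i \mid X = x_i) = \sigma(\Theta^T x_i)^{\,y_i} \bigl(1 - \sigma(\Theta^T x_i)\bigr)^{1 - y_i}, \qquad y_i \in \{0, 1\}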

The right-hand side is referred to as the probability mass function (PMF) of the Bernoulli distribution. The Bernoulli distribution describes a random variable that can take one of two outcomes, like our labels clicked or not clicked. Now, to determine our parameters Θ, we need to maximize the probability of reproducing the distribution of our population while only being given a sample. This method is called maximum likelihood estimation (MLE). In essence, we join the probabilities of every single event in our sample. This joint probability is called the likelihood; it has much in common with a probability but is viewed as a function of the parameters

The likelihood.
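
In symbols (a reconstruction):

L(\Theta) = \prod_{i=1}^{m} P(Y = y_i \mid X = x_i;\, \Theta) = \prod_{i=1}^{m} \sigma(\Theta^T x_i)^{\,y_i} \bigl(1 - \sigma(\Theta^T x_i)\bigr)^{1 - y_i}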

We could maximize the above function directly, but for convenience (to obtain prettier derivatives) we apply the logarithm to the likelihood. We can do so because the logarithm is monotonically increasing and thus preserves the position of the maximum. By applying the logarithm the product turns into a sum

The log-likelihood.
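
Reconstructed in LaTeX:

\ell(\Theta) = \log L(\Theta) = \sum_{i=1}^{m} \Bigl[ y_i \log \sigma(\Theta^T x_i) + (1 - y_i) \log \bigl(1 - \sigma(\Theta^T x_i)\bigr) \Bigr]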

To maximize the log-likelihood we can use calculus: at an extremum the first derivative has to be equal to zero

The first derivative of the log-likelihood.
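
The resulting condition, reconstructed in LaTeX:

\frac{\partial \ell(\Theta)}{\partial \Theta} = \sum_{i=1}^{m} x_i \bigl( y_i - \sigma(\Theta^T x_i) \bigr) \overset{!}{=} 0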

The derivative of the sigmoid function (Optional)

In the last result, we have used the derivative with respect to Θ of the sigmoid function. The derivation is as follows

Derivation of the derivative of the sigmoid function.
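
A sketch of the calculation, writing z for Θᵀx:

\frac{d\sigma(z)}{dz} = \frac{d}{dz} \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{(1 + e^{-z})^2} = \sigma(z)\bigl(1 - \sigma(z)\bigr),
\qquad \text{so} \qquad
\frac{\partial}{\partial \Theta} \sigma(\Theta^T x_i) = \sigma(\Theta^T x_i)\bigl(1 - \sigma(\Theta^T x_i)\bigr)\, x_i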

Newton-Raphson

To perform MLE we have to find the root of the first derivative of the log-likelihood. We can use the Newton-Raphson³ root-finding algorithm for this task. Newton-Raphson is the standard method to maximize the log-likelihood. It requires the second derivative, which in our case we can determine analytically. In other cases, where the second derivative is computationally expensive, we could use gradient descent (ascent) for the optimization instead. The second derivative is given by

The second derivative of the log-likelihood.

and the Newton-Raphson method then tells us how to update the parameters for each iteration.

Newton-Raphson iteration.
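
Reconstructed in LaTeX, the second derivative (Hessian) and the resulting update rule are

H(\Theta) = \frac{\partial^2 \ell(\Theta)}{\partial \Theta\, \partial \Theta^T} = -\sum_{i=1}^{m} x_i x_i^T\, \sigma(\Theta^T x_i) \bigl(1 - \sigma(\Theta^T x_i)\bigr)

\Theta_{\text{new}} = \Theta_{\text{old}} - H(\Theta_{\text{old}})^{-1}\, \frac{\partial \ell(\Theta_{\text{old}})}{\partial \Theta}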

Estimating the CTR

In the second part of this story, we want to code our own logistic regression implementation. The Jupyter notebook I built has been published as a Gist. We will work with data from the “Click-Through Rate Prediction” Kaggle competition¹. After downloading the data we unpack it and prepare a sample of 10,000 rows to work with before training on the complete set.

unzip avazu-ctr-prediction.zip
gunzip train.gz
head -n10000 train > train_sample.csv

We then load the CSV into a pandas DataFrame and split it into a training and a test set

import numpy as np
import pandas as pd

df = pd.read_csv('train_sample.csv')
msk = np.random.rand(len(df)) < 0.8
train = df[msk]
test = df[~msk]

Normally you would now focus on feature exploration, but to keep things simple I selected the columns device_type, C1, C15 and C16 as feature columns. I can then prepare my feature matrix X (with a leading column of ones for the intercept) and use the click column as the label

m = len(train)
X_train = np.ones((m, 5))
X_train[:,1] = train.device_type.to_numpy()
X_train[:,2] = train.C1.to_numpy()
X_train[:,3] = train.C15.to_numpy()
X_train[:,4] = train.C16.to_numpy()
y_train = train.click.to_numpy()

For our algorithm to work we need the sigmoid function as well as the previously derived first and second derivatives of the log-likelihood, which can be coded as follows

def sigmoid(z):
    # Sigmoid function used throughout (not defined in the original excerpt)
    return 1 / (1 + np.exp(-z))

def DLogLikelihood(X, y, theta):
    # First derivative (gradient) of the log-likelihood
    res = np.zeros(theta.shape[0])
    for i in range(0, X.shape[0]):
        x_i = X[i]
        y_i = y[i]
        res += x_i * (y_i - sigmoid(np.dot(theta, x_i)))
    return res

def DDLogLikelihood(X, theta):
    # Second derivative (Hessian) of the log-likelihood
    res = np.zeros((theta.shape[0], theta.shape[0]))
    for i in range(0, X.shape[0]):
        x_i = X[i]
        sigma = sigmoid(np.dot(theta, x_i))
        res += np.outer(x_i, x_i) * sigma * (1 - sigma)
    return -res

The iterative Newton-Raphson steps and our logistic regression algorithm are then

def NewtonRaphsonTheta(X, y, theta):
    # One Newton-Raphson step: theta - H^{-1} * gradient
    return theta - np.dot(
        np.linalg.inv(DDLogLikelihood(X, theta)),
        DLogLikelihood(X, y, theta))

def logisticRegression(X, y, epochs=100):
    theta = np.zeros(X.shape[1])
    for i in range(epochs):
        theta = NewtonRaphsonTheta(X, y, theta)
    return theta

By calling logisticRegression(X, y) we iteratively calculate the parameters Θ, which can then be used to predict the click probability of a user

def predict(X, theta):
    res = np.zeros(X.shape[0])
    for i in range(len(res)):
        x = X[i]
        res[i] = sigmoid(np.dot(theta, x))
    return res
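
The test run below uses a feature matrix X_test that is not constructed in the snippets above; a minimal sketch, assuming it is built from the test split in the same way as X_train:

m_test = len(test)
X_test = np.ones((m_test, 5))
X_test[:,1] = test.device_type.to_numpy()
X_test[:,2] = test.C1.to_numpy()
X_test[:,3] = test.C15.to_numpy()
X_test[:,4] = test.C16.to_numpy()
y_test = test.click.to_numpy()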

For a test run, we get the following probabilities

theta = logisticRegression(X_train, y_train, epochs=100)
y_pred = predict(X_test, theta)
print(y_pred)
[0.18827126 0.16229901 … 0.16229901 0.16229901 0.16229901]

To evaluate the model I compared the predictions on the test set to the actual values, which showed that the model is rather poor. To improve it we could spend more time on feature selection and train on more data, while constantly measuring the model's performance with evaluation metrics like the logarithmic loss or the ROC curve.
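
As an illustration of one such metric (not part of the original notebook), the logarithmic loss can be computed from the predictions with a few lines of NumPy, assuming y_test holds the test labels as sketched above:

def logLoss(y_true, y_pred, eps=1e-15):
    # Clip predictions to avoid log(0)
    p = np.clip(y_pred, eps, 1 - eps)
    # Mean negative log-likelihood of the true labels
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(logLoss(y_test, y_pred))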

Summary

  • Logistic regression can be used for multi-class classification problems
  • Binary logistic regression is used if we have only two classes
  • P(Y|X) is modeled by the sigmoid function, which maps from (-∞, ∞) to (0, 1)
  • We assumed that the logit can be modeled as a linear function
  • To estimate the parameters Θ we maximize the log-likelihood
  • The Bernoulli distribution is a discrete distribution having two possible outcomes, which is used in binary classification
  • We use Newton-Raphson as a root finder because we can easily compute the second derivative of the log-likelihood

[1]: Kaggle, Click-Through Rate Prediction https://www.kaggle.com/c/avazu-ctr-prediction

[2]: The Elements of Statistical Learning, T. Hastie, R. Tibshirani, J. Friedman https://web.stanford.edu/~hastie/ElemStatLearn/

[3]: Newton-Raphson Method https://en.wikipedia.org/wiki/Newton%27s_method
