Binary Logistic Regression — Understanding Explainable AI

Grant Holtes
Towards Data Science
5 min read · Aug 21, 2018


“Explainable AI or XAI is a sub-category of AI where the decisions made by the model can be interpreted by humans, as opposed to “black box” models. As AI moves from correcting our spelling and targeting ads to driving our cars and diagnosing patients, the need to verify and justify the conclusions being reached is beginning to be prioritised.”

After the apparent popularity of my last story on explainable AI (XAI) it’s time to tackle a similarly transparent but useful algorithm, logistic regression. Logistic regression belongs to the happy family of “generalised linear models”, which add a layer of complexity to the otherwise straight lines of linear regression.

A Plain / Vanilla Linear Model

Fitting a straight line through data points works well in many situations where both x and y can take any value within a reasonable range, such as predicting height from weight or income from education. However, there are cases where the variable of interest is binary, taking only two values. For example, it may be useful to predict whether somebody will make a claim on their insurance, or whether a customer will purchase your product if advertised to. In this case, y takes the value of 0 (the event will not occur) or 1 (the event does occur).

When we use linear regression on a problem of this type, we predict values for y that fall outside its domain of [0, 1], rendering any interpretation of the results confusing, if not meaningless.
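To make the issue concrete, here is a small simulated illustration (the data and variable names are invented for this sketch): an ordinary least squares fit to a binary outcome happily produces fitted values below 0 and above 1.

```python
import numpy as np

# Simulated example: a binary outcome driven by a single feature x.
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = (x + rng.normal(scale=0.5, size=100) > 0).astype(float)  # y is 0 or 1

# Ordinary least squares fit of y on x (intercept + slope).
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

preds = X @ beta
print(preds.min(), preds.max())  # typically below 0 and above 1 at the extremes
```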

A better model would be able to adjust for the distribution of y and only provide predictions that make sense. This is the purpose of generalised linear models.
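In the article's notation, a generalised linear model takes a form along these lines, with f() mapping the linear predictor X'B onto the scale of y:

E[y] = f(X'B)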

The function f() is aptly named the ‘link’ function, linking the linear specification of X’B to the non-linear variable y. The obvious limitation here is that it is assumed that a linear combination of the X variables is sufficient to explain y.

For the binary outcome case we will use the logit link function, the results of which are shown below:
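The inverse of the logit link is the familiar S-shaped logistic function, which keeps every prediction inside (0, 1):

S(z) = 1 / (1 + e^{-z}), \quad \hat{y} = S(X'B) \in (0, 1)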

Logistic regression with the same data as above

Unlike simple linear models, there is no closed-form solution for the parameters B in logistic regression. Instead we must fit them iteratively, much as we fit the weights in a deep learning model. This requires two things: a measure of how “good” B is, and a way to improve B.

How Good is B?

We take a “likelihood” approach to evaluating B, estimating the likelihood or probability of our model being the “true model” which generated the data. Mathematically, we find the probability of observing the data we have, given the parameters. For a single observation of y and the logistic (sigmoid) function S(), this probability is given by:
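In the article's notation, this is the standard Bernoulli form:

P(y \mid x, B) = S(x'B)^{y} (1 - S(x'B))^{1 - y}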

When the parameters B give a more accurate prediction for y, the probability will be higher. The intuition of this equation is given below:
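Written out case by case, the same expression says:

P = S(x'B) when y = 1, so a prediction close to 1 scores well
P = 1 - S(x'B) when y = 0, so a prediction close to 0 scores well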

dodgy truth table

By assuming each observation is independent, we can calculate the probability of the entire dataset as the product of all the observation-level probabilities, which we call the likelihood, L().
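Under that independence assumption, the likelihood is the product of the observation-level Bernoulli probabilities:

L(B) = \prod_i S(x_i'B)^{y_i} (1 - S(x_i'B))^{1 - y_i}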

Finding the derivative of L directly for optimisation purposes would be impractical, as a product of many observation-level probabilities is messy to differentiate, so it's common practice to take the log of the likelihood function to yield l().
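Taking logs turns the product into a sum:

l(B) = \sum_i [ y_i \log S(x_i'B) + (1 - y_i) \log(1 - S(x_i'B)) ]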

The nicely simplified version

Taking the log is valid as we predominantly care about the relative levels of likelihood for different B specifications, rather than the value itself. As log(x) is always increasing in x, if one specification's likelihood is larger than another's, its log-likelihood will also be larger. This form allows us to define the first and second derivatives of l, l' and l'' respectively, while still giving us a way to quantify how “good” a set of parameters is at fitting the data.
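For reference, with p_i = S(x_i'B), the standard first and second derivatives of l are:

l'(B) = \sum_i (y_i - p_i) x_i = X'(y - p)

l''(B) = -\sum_i p_i (1 - p_i) x_i x_i' = -X'WX, \quad W = \mathrm{diag}(p_i(1 - p_i))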

How to Make B Better

There are a number of algorithms for finding the optimal B value, most of which aim to locate the B that maximises l, the log-likelihood function. These are “hill climbing” algorithms, as they seek the maximum of a function, as opposed to deep learning techniques, which minimise a loss function.

One such technique is the Newton-Raphson method, which finds the roots of a function F (the points where the function equals zero). At each iteration t we move slightly closer to a root of F by applying the following rule:
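For a one-dimensional function F, the update rule is:

x_{t+1} = x_t - F(x_t) / F'(x_t)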

Using some high school calculus, we know that our log-likelihood function l will have its maximum where its derivative is equal to zero. Substituting l' for F, we are left with our hill-climbing algorithm:
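In the multivariate case the division becomes multiplication by the inverse Hessian, giving the update:

B_{t+1} = B_t - [l''(B_t)]^{-1} l'(B_t)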

Thankfully all of these formulas, including the derivatives of l, can be expressed in matrix form, allowing for faster computation. (The code, including the fun linear algebra, for this project can be accessed here.)
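As a concrete sketch of the matrix form, here is a minimal NumPy implementation of the Newton-Raphson update; the function names and simulated data are my own for illustration, not the code linked above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=25, tol=1e-8):
    """Illustrative Newton-Raphson fit of a logistic regression.

    X : (n, k) design matrix (include a column of ones for an intercept).
    y : (n,) binary outcomes in {0, 1}.
    """
    B = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ B)                   # predicted probabilities S(X'B)
        grad = X.T @ (y - p)                 # l'(B) = X'(y - p)
        W = p * (1 - p)                      # Bernoulli variances
        hessian = -(X * W[:, None]).T @ X    # l''(B) = -X'WX (negative definite)
        step = np.linalg.solve(hessian, grad)
        B = B - step                         # Newton-Raphson: B - [l'']^{-1} l'
        if np.max(np.abs(step)) < tol:       # stop once the update is negligible
            break
    return B

# Toy usage on simulated data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_B = np.array([-0.5, 2.0])
y = (rng.uniform(size=200) < sigmoid(X @ true_B)).astype(float)
print(fit_logistic(X, y))   # estimates should land roughly near [-0.5, 2.0]
```

Solving the k-by-k linear system at each step avoids explicitly inverting the Hessian, which is the usual numerically stable choice.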

Results

The results from logistic models are easy to interpret: the predicted value, S(X'B), is the probability that y takes the value 1, and as such the probability of the event occurring.
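Because the model is linear on the log-odds scale, the coefficients themselves are also directly interpretable:

\log(p / (1 - p)) = X'B

so a one-unit increase in a feature x_j multiplies the odds of the event by exp(B_j).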
