Logistic Regression Explained from Scratch (Visually, Mathematically and Programmatically)

Hands-on Vanilla Modelling Part III

Abhibhav Sharma
Towards Data Science

--

Image By Author

A plethora of results appear on a small google search “Logistic Regression”. Sometimes it gets very confusing for beginners in data science, to get around the main idea behind logistic regression. And why wouldn't they be confused!!? Every different tutorial, article, or forum has a different narration on Logistic Regression (not including the legit verbose of textbooks because that would kill the entire purpose of these “quick sources” of mastery). Some sources claim it a “Classification algorithm” and some more sophisticated ones call it a “Regressor”, however, the idea and utility remain unrevealed. Remember that Logistic regression is the basic building block of artificial neural networks and no/fallacious understanding of it could make it really difficult to understand the advanced formalisms of data science.

Here, I will try to shed some light on and inside the Logistic Regression model and its formalisms in a very basic manner in order to give a sense of understanding to the readers (hopefully without confusing them). Now the simplicity offered here is at a cost of capering the in-depth details of some crucial aspects, and to get into the nitty-gritty of each aspect of Logistic regression would be like diving into the fractal (there will be no end to the discussion). However, for each such concept, I will provide eminent readings/sources that one should refer to.

For there are two major branches in the study of Logistic regression (i) Modelling and (ii) Post Modelling analysis (using the logistic regression results). While the latter is the measure of effect from the fitted coefficients, I believe that the black-box aspect of logistic regression has always been in its Modelling.

My aim here is to:

  1. To elaborate Logistic regression in the most layman way.
  2. To discuss the underlying mathematics of two popular optimizers that are employed in Logistic Regression (Gradient Descent and Newton Method).
  3. To create a logistic-regression module from scratch in R for each type of optimizer.

One last thing before we proceed, this entire article is designed by keeping the binary classification problem in mind in order to avoid complexity.

1. The Logistic Regression is NOT A CLASSIFIER

Yes, it is not. It is rather a regression model in the core of its heart. I will depict what and why logistic regression while preserving its resonance with a linear regression model. Assuming that my readers are somewhat aware of the basics of linear regression, it is easy to say that the linear regression predicts a “value” of the targeted variable through a linear combination of the given features, while on the other hand, a Logistic regression predicts “probability value” through a linear combination of the given features plugged inside a logistic function (aka inverse-logit) given as eq(1):

Equation 1
Logistic Function (Image by author)

Hence the name logistic regression. This logistic function is a simple strategy to map the linear combination “z”, lying in the (-inf,inf) range to the probability interval of [0,1] (in the context of logistic regression, this z will be called the log(odd) or logit or log(p/1-p)) (see the above plot). Consequently, Logistic regression is a type of regression where the range of mapping is confined to [0,1], unlike simple linear regression models where the domain and range could take any real value.

A small sample of the data (Image by author)

Consider simple data with one variable and its corresponding binary class either 0 or 1. The scatter plot of this data looks something like (Fig A left). We see that the data points are in the two extreme clusters. Good, now for our prediction modeling, a naive regression line in this scenario will give a nonsense fit (red line in Fig A right) and what we actually require to fit is something like a squiggly line (or a curvy “S” shaped blue rule in Fig A right) to explain (or to correctly separate) a maximum number of data points.

(Image by author) Fig A

Logistic regression is a scheme to search this most optimum blue squiggly line. Now first let's understand what each point on this squiggly line represents. given any variable value projected on this line, this squiggly line tells the probability of falling in Class 1 (say “p”) for that projected variable value. So accordingly the line tells that all the bottom points that lie on this blue line have zero chances (p=0) of being in class 1 and the top points that lie on it have the probability of 1(p=1) for the same. Now, remember that I have mentioned that the logistic (aka inverse-logit) is a strategy to map infinitely stretching space (-inf, inf) to a probability space of [0,1], a logit function could transform the probability space of [0,1] to a space stretching to (-inf, inf) eq(2)&(Fig B).

Equation 2
Fig B. The logit function is given by log(p/1-p) that maps each probability value to the point on the number line {} stretching from -infinity to infinity (Image by author)

Keeping this in mind, here comes the mantra of logistic regression modeling:

Logistic Regression starts with first transforming the space of class probability[0,1] vs variable{ℝ} (as in fig A right) to the space of Logit{} vs variable{} where a “regression like” fitting is performed by adjusting the coefficient and slope in order to maximize the Likelihood (a very fancy stuff that I will elaborated this part in coming section). Once tweaking and tuning are done, the Logit{} vs variable{} space is remapped to class probability[0,1] vs variable{ℝ} using inverse-Logit (aka Logistic function). Performing this cycle iteratively (Ⓐ →Ⓑ →Ⓐ ) would eventually result in the most optimum squiggly line or the most discriminating rule.

WOW!!!

Well, you may (should) ask (i) Why and how to do this transformation ??, (ii) what the heck is Likelihood?? and (iii) How this scheme would lead to the most optimum squiggle?!!.

So for (i), the idea to the transformation from a confined probability space [0,1] to an infinitely stretching real space (-inf, inf) is because it will make the fitting problem very close to solving a linear regression, for which we have a lot of optimizers and techniques to fit the most optimum line. The latter questions will be answered eventually.

Now coming back to our search for the best classifying blue squiggly line, the idea is to plot an initial linear regression line with the arbitrary coefficient on ⚠️logit vs variable space⚠️ coordinates first and then adjust the coefficients of this fit to maximize the likelihood (relax!! I will explain the “likelihood” when it is needed).

In our one variable case, we can write equation 3:

logit(p) = log(p/1-p) = β₀+ β₁*v ……………………………….(eq 3)

(Image by author) Fig C

In Fig C (I), the red line is our arbitrary chosen regression line fitted for the data points, mapped in a different coordinate system with β₀ (intercept) as -20 and β₁(slope) as 3.75.

⚠️ Note that the coordinate space is not class{0,1} vs variable{ℝ} but its the Logit{} vs variable{}. Also, notice that the transformation from Fig A(right) to Fig C(I) has no effects on the positional preferences of the points i.e. the extremes as in equation 2 above, the logit(0)=-infinity, and logit(1)=+infinity.

At this point, let me reiterate our objective: We want to fit the straight line for the data points in logit vs variable plot in such a way that it explains (correctly separates) the maximum number of data points when it gets converted to the blue squiggly line through inverse-logit (aka logistic function) eq(1). So to achieve the best regression, a similar strategy of simple linear regression comes into play but despite minimizing the squared residual, the idea is to maximize the likelihood (relax!!). Since the points scattered on the infinity make it difficult to proceed in an orthodox linear regression method, the trick is to project these points on the logit (the initial chosen/fitted line with the arbitrary coefficient) Fig C(II). In this way, each data point projected on the logit corresponds to a logit value. When these logit values are plugged into the logistic function eq(1), we get their probability of falling in class 1 Fig C(III).

Note: This also can be proven mathematically as well that: logit(p)=logistic⁻¹(p)

This probability can be represented mathematically as equation 4, which is very close to a Bernoulli distribution, isn't it?.

P(Y = y|X = x) = σ(βᵀ x)ʸ · [1 − σ(βᵀx)]⁽¹⁻ʸ⁾ ; where y is either 0 or 1..eq(4)

The equation reads that for a given data instance x, the probability of the label Y being y(where y is either 0 or 1) is equal to the logistic of logit when y=1 and is equal to (1-logistic of logit) if y=0. These new probability values are illustrated in our class{0,1} vs variable{ℝ} space as blue dots in Fig C(III). This new probability value for a datapoint is what we call the LIKELIHOOD of that data point. So in simple terms, likelihood is the probability value of the datapoint where the probability value indicates that how LIKELY the point is to be falling in the class 1 category. And the likelihood of the training label for the fitted weight vector β is nothing but the product of each of these newfound probability values equation 5&6.

L(β) = ⁿ∏ᵢ₌₁ P(Y = y⁽ⁱ⁾ | X = x⁽ⁱ⁾ )………………….……….eq(5)

Substituting equation 4 in equation 5 we get,

L(β) = ⁿ∏ᵢ₌₁ σ(βᵀ x⁽ⁱ⁾)ʸ⁽ⁱ⁾ · [1 − σ(βᵀx⁽ⁱ⁾)]⁽¹⁻ʸ⁽ⁱ⁾⁾ ………………eq(6)

The idea is to estimate the parameters (β) such that it maximizes the L(β). However, due to the mathematical convenience, we maximize the log of L(β) and call its log-likelihood equation 7.

LL(β) = ⁿ∑ᵢ₌₁ y⁽ⁱ⁾log σ(βᵀ x⁽ⁱ⁾) + (1− y⁽ⁱ⁾) log[1 − σ(βᵀx⁽ⁱ⁾)]……..eq(7)

So at this point, I hope that our earlier stated objective is much understandable i.e. to find the best fitting parameters β in logit vs variable space such that LL(β) in probability vs variable space is maximum. For this, there is no close form and so in the next section, I will touch upon two optimization methods (1) Gradient descent and (2) Newton’s method to find the optimum parameters.

2. Optimizers

Gradient Ascent

Our optimization first requires the partial derivative of the log-likelihood function. So let me shamelessly share the snap from a very eminent lecture note that beautifully elucidate the steps to derive the partial derivative of LL(β). (Note: the calculations shown here use θ in place of β to represent the parameters.)

To update the parameter, the steps toward the global maximum is:

Where η is the learning rate

so the algorithm is:

Initialize β and set likelihood=0

While likelihood≤max(likelihood){

Calculate logit(p) = xβᵀ

Calculate P=logistic(Xβᵀ)= 1/(1+exp(-Xβᵀ))

Calculate Likelihood L(β) = ⁿ∏ᵢ₌₁ifelse( y(i)=1, p(i), (1-p(i)))

Calculate first_derivative LL(β) = X(Y-P)

Update: β(new) = β (old) + ηLL(β)

}

Newton Method

Newton’s Method is another strong candidate among the all available optimizers. We have learned Newton’s Method as an algorithm to stepwise find the maximum/minimum point on the concave/convex functions in our early lessons:

xₙ₊₁=xₙ + ∇f(x). ∇∇f⁻¹(x)

In the context of our log-likelihood function, the f(x) will be replaced by the gradient of LL(β) (i.e ∇LL(β)) and the ∇∇f⁻¹(x) would be the Hessian H i.e. the second-order partial derivative of LL(β)). Well, I will caper the details here, but your curious brain should refer to this. So, the ultimate expression to update the parameter, in this case, is given by:

βₙ₊₁=βₙ+ H⁻¹. ∇LL(β)

Here in the case of logistic regression, the calculation of H is super easy because:

H= ∇∇LL(β) = ∇ⁿ∑ᵢ₌₁[y − σ(βᵀx⁽ⁱ⁾)].x⁽ⁱ⁾

= ∇ⁿ∑ᵢ₌₁[ypᵢ].x⁽ⁱ⁾

= −ⁿ∑ᵢ₌₁ x⁽ⁱ⁾(∇pᵢ) = −ⁿ∑ᵢ₌₁ x⁽ⁱ⁾ pᵢ(1-pᵢ) (x⁽ⁱ⁾)

thus, H= -XWXᵀ, where W=(P*(1-P))I

So the algorithms are:

Initialize β and set likelihood=0

While likelihood≤max(likelihood){

Calculate logit(p) = xβᵀ

Calculate P=logistic(Xβᵀ)= 1/(1+exp(-Xβᵀ))

Calculate Likelihood L(β) = ⁿ∏ᵢ₌₁ifelse( y(i)=1, p(i), (1-p(i)))

Calculate first_derivative LL(β) = X(Y-P)

Calculate Second_derivative Hessian H = -Xᵀ (P*(1-P))ᵀI X

β(new) = β (old) + H⁻¹. ∇LL(β)

}

3. The Scratch Code

The data

The dataset that I am going to use for training and testing my binary classification model can be downloaded from here. Originally this dataset is an Algerian Forest Fires Dataset. You can check out the details of the dataset here.

The Code

For the readers who hopped the entire article above to play around with code, I would recommend having a quick eyeballing through the second section as I have given a spet-wise algorithm for both the optimizer and my code will strictly follow that order.

The training Function

setwd("C:/Users/Dell/Desktop")
set.seed(111) #to generate the same results as mine
#-------------Training Function---------------------------------#logistic.train<- function(train_data, method, lr, verbose){

b0<-rep(1, nrow(train_data))
x<-as.matrix(cbind(b0, train_data[,1:(ncol(train_data)-1)]))
y<- train_data[, ncol(train_data)]

beta<- as.matrix(rep(0.5,ncol(x))); likelihood<-0; epoch<-0 #initiate

beta_all<-NULL
beta_at<-c(1,10,50,100,110,150,180,200,300,500,600,800,1000,
1500,2000,4000,5000,6000,10000) #checkpoints (the epochs at which I will record the betas)

#-----------------------------Gradient Descent---------------------#
if(method=="Gradient"){
while( (likelihood < 0.95) & (epoch<=35000)){

logit<-x%*%beta #Calculate logit(p) = xβᵀ

p <- 1/( 1+ exp(-(logit))) #Calculate P=logistic(Xβᵀ)= 1/(1+exp(-Xβᵀ))

# Likelihood: L(x|beta) = P(Y=1|x,beta)*P(Y=0|x,beta)
likelihood<-1
for(i in 1:length(p)){
likelihood <- likelihood*(ifelse( y[i]==1, p[i], (1-p[i]))) #product of all the probability
}

first_d<- t(x) %*% (y-p)#first derivative of the likelihood function

beta <- beta + lr*first_d #updating the parameters for a step toward maximization

#to see inside the steps of learning (irrelevant to the main working algo)
if(verbose==T){
ifelse(epoch%%200==0,
print(paste0(epoch, "th Epoch",
"---------Likelihood=", round(likelihood,4),
"---------log-likelihood=", round(log(likelihood),4),
collapse = "")), NA)}

if(epoch %in% beta_at){beta_all<-cbind(beta_all, beta)}

epoch<- epoch+1
}
}

#--------------Newton second order diff method-------------#

else if(method=="Newton"){
while((likelihood < 0.95) & (epoch<=35000)){

logit<-x%*%beta #Calculate logit(p) = xβᵀ
p <- 1/( 1+ exp(-(logit))) #Calculate P=logistic(Xβᵀ)= 1/(1+exp(-Xβᵀ))

# Likelihood: L(x|beta) = P(Y=1|x,beta)*P(Y=0|x,beta)
likelihood<-1
for(i in 1:length(p)){
likelihood <- likelihood*(ifelse( y[i]==1, p[i], (1-p[i])))
}

first_d<- t(x) %*% (y-p)#first derivative of the likelihood function

w<-matrix(0, ncol= nrow(x), nrow = nrow(x)) #initializing p(1-p) diagonal matrix
diag(w)<-p*(1-p)
hessian<- -t(x) %*% w %*% x #hessian matrix

hessian<- diag(ncol(x))-hessian #Levenberg-Marquardt method: Add a scaled identity matrix to avoid singularity issues

k<- solve(hessian) %*% (t(x) %*% (y-p)) #the gradient for newton method

beta <- beta + k #updating the parameters for a step toward maximization
if(verbose==T){
ifelse(epoch%%200==0,
print(paste0(epoch, "th Epoch",
"---------Likelihood=", round(likelihood,4),
"---------log-likelihood=", round(log(likelihood),4),
collapse = "")), NA)}

if(epoch %in% beta_at){beta_all<-cbind(beta_all, beta)} #just to inside the learning
epoch<- epoch+1
}
}

else(break)

beta_all<-cbind(beta_all, beta)
colnames(beta_all)<-c(beta_at[1:(ncol(beta_all)-1)], epoch-1)

mylist<-list(as.matrix(beta), likelihood, beta_all)
names(mylist)<- c("Beta", "likelihood", "Beta_all")
return(mylist)
} # Fitting of logistic model

The Prediction Function

logistic.pred<-function(model, test_data){

test_new<- cbind( rep(1, nrow(test_data)), test_data[,-ncol(test_data)]) #adding 1 to fit the intercept

beta<-as.matrix(model$Beta) #extract the best suiting beta (the beta at final epoch)
beta_all<-model$Beta_all #extract all the betas at different checkpoints
ll<- model$likelihood #extract the highest likelihood obtained

log_odd<-cbind(as.matrix(test_new)) %*% beta #logit(p)

probability<- 1/(1+ exp(-log_odd)) # p=logistic(logit(p))
predicted_label<- ifelse(probability >= 0.5, 1, 0) #discrimination rule

k<-cbind(test_data[,ncol(test_data)], predicted_label) # actual label vs predicted label
colnames(k)<- c("Actual", "Predicted")
k<- as.data.frame(k)

tp<-length(which(k$Actual==1 & k$Predicted==1)) #true positive
tn<-length(which(k$Actual==0 & k$Predicted==0)) #true negative
fp<-length(which(k$Actual==0 & k$Predicted==1)) #false positive
fn<-length(which(k$Actual==1 & k$Predicted==0)) #false negative

cf<-matrix(c(tp, fn, fp, tn), 2, 2, byrow = F) #confusion matrix
rownames(cf)<- c("1", "0")
colnames(cf)<- c("1", "0")

p_list<-list(k, cf, beta, ll, beta_all)
names(p_list)<- c("predticted", "confusion matrix","beta", "liklihood", "Beta_all")
return(p_list)

} # to make prediction from the trained model

Data Parsing

#importing data
data<-read.csv("fire.csv", header = T) #import
data$Classes<-as.numeric(ifelse(1:nrow(data)%in%grep("not",data$Classes), 0, 1)) # one hot encoding ; numeric conversion from label to 1 or 0
data<-rbind(data[which(data$Classes==0),],
data[sample(size=length(which(data$Classes==0)),which(data$Classes==1)),]) #balancing the classes
data<-data[sample(1:nrow(data)),] #shuffling
data<-as.data.frame(data) #data to data frame
data<-lapply(data, as.numeric)
data<-as.data.frame(data)
#missing data handling
if(!is.null(is.na(data))){
data<-data[-unique(which(is.na(data), arr.ind = T)[,1]),]
}
#test train partition
partition<-sample(c(0,1), size=nrow(data), prob = c(0.8,0.2), replace = T)
train<-data[which(partition==0),]
test<-data[which(partition==1),]

Training → Testing → Results

#-------------------------TRAINING---------------------------------#mymodel_newton<- logistic.train(train, "Newton", 0.01, verbose=T) # Fitting the model using Newton method
mymodel_gradient<- logistic.train(train, "Gradient", 0.01, verbose=T) # Fitting the model using Gradient method
#------------------------TESTING-------------------------------------#
myresult1<-logistic.pred( mymodel_newton, test_data = test) #prediction using Newton trained model
myresult2<-logistic.pred( mymodel_gradient, test_data = test) #prediction using Gradient trained model
#------------------------Results----------------------------------#
myresult1$`confusion matrix`
myresult2$`confusion matrix`

The Results

(Image by author)

The confusion matrix obtained by both methods is the same. The accuracies obtained by both methods on the independent test set are 95.2% (quite good!!).

(Image by author)

However, the best fitting coefficients β obtained by both methods are very different in terms of values. Newton’s method took 3,566 epochs to obtain a likelihood of 1, while Gradient descent took 3,539 to read the maximum likelihood of 0.999.

In terms of time taken, Newton’s method took more time to reach the optimum in comparison to the gradient method because, in Newton's method, the solving inverse of Hessian makes it a little computationally extensive and time-consuming algorithm.

As you have noticed that I have captured the betas at various checkpoints during the training. This will allow us to peep into the training process that is carried out while maximizing the likelihood (see ⇩⇩⇩⇩)

(Image by author)

We can also peep into the fitting on the test set by the fully trained model

(Image by author)

So at this point, I think I can reiterate to the reader that the fundamental nature of Logistic Regression is not of classification, rather it is of regression. Through substantiating a regression in its core functioning, The Logistic regression gives output as probability attached to a given instance. It is when a rule of >or≤ 0.5 or something is employed, the assignment of an instance to a particular discrete class is carried out.

gif

Conclusion

If you are here then go get yourself a fine treat, you are a real MVP. I hope my very casual elaboration on logistic regression gave you slightly better insights into the logistic regression. This article encompasses the concept, the underlying mathematics, and the programming of logistic regression. While the ideas here depict the actual scheme, there are some out-of-scope aspects of the optimizers discussed here, in which the optimizing algorithm might fail to achieve an optimum, more details can be found here.

Feel free to download the entire code (Model and plots) from my git.

--

--