At the very beginning of my journey into the fundamentals of machine learning, I remember spending a lot of time getting the basics of logistic regression straight. Hopefully this meditation will leave you with more answers and clear concepts than confusions about logistic regression.
In this post I will attempt to cover –
- Odds and Odds ratio
- Understanding logistic regression, starting from linear regression.
- Logistic function as a classifier; Connecting Logit with Bernoulli Distribution.
- An example with a cancer data-set, and setting up a probability threshold to classify malignant and benign samples.
Odds and Odds ratio
Before we dig deep into logistic regression, we need to clear up some fundamentals of probability. For simplicity, we will consider a data-set that tells us, depending on the gender, whether a customer purchased a product or not. We import and check the data-set
import pandas as pd
gender_df = pd.read_csv('gender_purchase.csv')
print(gender_df.head(3))
>>> Gender Purchase
0 Female Yes
1 Female Yes
2 Female No
We will create a table of the frequency of ‘yes’ and ‘no’ depending on the gender, using the crosstab feature of pandas. The table will be of great use to understand odds and the odds ratio later on.
table = pd.crosstab(gender_df['Gender'], gender_df['Purchase'])
print(table)
>>> Purchase No Yes
Gender
Female 106 159
Male 125 121
We’re now ready to define odds, which is the ratio of the probability of success to the probability of failure. Considering the female group, the probability that a female will purchase the product (success) is 159/265 (yes/total number of females). The probability of failure (no purchase) for a female is 106/265. In this case the odds are (159/265)/(106/265) = 1.5. The higher the odds, the better the chance of success. The odds can be any number in the range [0, ∞). What happens to that range if we take the natural logarithm of such numbers? log(x) is defined for x > 0, but its range is (−∞, ∞). You can check with a snippet of code
from random import uniform
import math
import matplotlib.pyplot as plt

random = []
xlist = []
for i in range(100):
    x = uniform(0, 10)  # choose numbers between 0 and 10
    xlist.append(x)
    random.append(math.log(x))
plt.scatter(xlist, random, c='purple', alpha=0.3, label=r'$\log x$')
plt.ylabel(r'$\log \, x$', fontsize=17)
plt.xlabel(r'$x$', fontsize=17)
plt.legend(fontsize=16)
plt.show()

So far we have understood odds. Let’s describe the odds ratio, which, as the name suggests, is a ratio of odds. Considering the example above, the odds ratio tells us which group (male/female) has better odds of success, and it is given by the ratio of the odds of the two groups. So the odds ratio for females = odds of a successful purchase by a female / odds of a successful purchase by a male = (159/106)/(121/125). The odds ratio for males is the reciprocal of this number.
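As a quick check, we can compute these numbers directly from the crosstab table above; this small sketch simply reuses the table variable from the earlier snippet.
odds = table['Yes'] / table['No']  # odds of a successful purchase for each gender
print(odds)  # Female ~1.5, Male ~0.97
odds_ratio_female = odds['Female'] / odds['Male']
print(odds_ratio_female)  # ~1.55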
We can clearly appreciate that while the odds ratio can vary between 0 and positive infinity, log(odds ratio) varies over (−∞, ∞). Specifically, when the odds ratio lies between 0 and 1, log(odds ratio) is negative.
Linear to Logistic Regression
Since, confusingly, the term ‘regression’ appears in logistic regression, let’s spare a few seconds to review regression. Regression usually refers to continuity, i.e. predicting continuous variables (medicine price, taxi fare etc.) from features. Logistic regression, however, is about predicting binary variables, i.e. cases where the target variable is categorical. Logistic regression is probably the first thing a budding data scientist should try to get a hang of classification problems. We will start from the linear regression model and build up to the logistic model step by step.
In linear regression the feature variables can take any values, so the output (label) can be continuous from negative to positive infinity.
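In its simplest one-feature form, with intercept a and coefficient b, the linear model is
Y = a + bX,   Y ∈ (−∞, ∞)   (eq. 1.1)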

Logistic regression, on the other hand, is about classification, i.e. Y is a categorical variable. It’s clearly not possible to achieve such an output with the linear regression model (eq. 1.1), since the ranges on the two sides do not match. Our aim is to transform the LHS so that it matches the range of the RHS, which is governed by the range of the feature variables, (−∞, ∞).
We will follow some intuitive steps to see how such an outcome can be achieved.

- For linear regression, both X and Y range from minus infinity to positive infinity. Y in logistic regression is categorical; for the problem above it takes either of the two distinct values 0, 1. First, we try to predict the probability of success using the regression model. Instead of two distinct values, the LHS can now take any value from 0 to 1, but the ranges still differ from the RHS.

- I discussed above that odds (and the odds ratio) vary within [0, ∞). This is better than probability (which is limited between 0 and 1) and one step closer to matching the range of the RHS.
- Many of you have already understood that if we now take the natural logarithm of the LHS of (eq. 1.3), the ranges on both sides match; the steps are summarized below.
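Written out for a single feature X with intercept a and coefficient b, the three steps above look like this:
P = a + bX   (eq. 1.2, ranges still do not match)
P / (1 − P) = a + bX   (eq. 1.3, odds on the LHS)
ln [P / (1 − P)] = a + bX   (eq. 1.4, log of odds)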

With this, we have achieved a regression model where the output is the natural logarithm of the odds, also known as the logit. The base of the logarithm is not important, but taking the logarithm of the odds is.
We can retrieve the probability of success from eq. 1.4 as below.
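Solving eq. 1.4 for P (again for a single feature X) gives
P = e^(a + bX) / (1 + e^(a + bX)) = 1 / (1 + e^−(a + bX))   (eq. 1.5)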

If we know the coefficients of the independent variables (the Xs) and the intercept a, we can predict the probability. We will use software (sklearn) for that optimization. Depending on the problem, from the probability value we can choose whether the output falls in class A or class B. This will become clearer when we go through an example.
Logistic Function
The RHS of equation 1.5, which is also known as the logistic function, is very similar to the sigmoid function, 1 / (1 + e^−x). We can check the behaviour of such a function with a snippet of python code.
# uses the same imports as the previous snippet (uniform, math, plt)
random1 = []
random2 = []
random3 = []
xlist = []
theta = [10, 1, 0.1]
for i in range(100):
    x = uniform(-5, 5)
    xlist.append(x)
    logreg1 = 1/(1 + math.exp(-(theta[0]*x)))
    logreg2 = 1/(1 + math.exp(-(theta[1]*x)))
    logreg3 = 1/(1 + math.exp(-(theta[2]*x)))
    random1.append(logreg1)
    random2.append(logreg2)
    random3.append(logreg3)
plt.scatter(xlist, random1, marker='*', s=40, c='orange', alpha=0.5, label=r'$\theta = %3.1f$' % (theta[0]))
plt.scatter(xlist, random2, c='magenta', alpha=0.3, label=r'$\theta = %3.1f$' % (theta[1]))
plt.scatter(xlist, random3, c='navy', marker='d', alpha=0.3, label=r'$\theta = %3.1f$' % (theta[2]))
plt.axhline(y=0.5, label='P=0.5')
plt.ylabel(r'$P=\frac{1}{1+e^{-\theta \, x}}$', fontsize=19)
plt.xlabel(r'$x$', fontsize=18)
plt.legend(fontsize=16)
plt.show()

From the plot above (Figure 2), notice that the higher the value of the coefficient (orange stars) of the independent variable (here X), the better it can represent the two distinct probabilities 0 and 1. For a low value of the coefficient it is essentially a straight line, resembling a simple linear regression function. Comparing with equation (1.5), in Figure 2 the constant term a is taken as 0. The effect of the constant term on the logistic function can also be understood using the plot below.

Just like in linear regression, where the constant term denotes the intercept on the Y axis (hence a shift along the Y axis), for the logistic function the constant term shifts the S-curve along the X axis; a snippet reproducing this is given below. The figures above (Fig. 2, 3) should convince you that it is indeed possible to optimize a model using logistic regression that can classify data, i.e. predict 0 or 1.
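To play with this shift yourself, here is a small variation of the previous snippet (assuming the same imports, with the coefficient fixed at 1 and a few arbitrary intercepts chosen just for illustration):
a_list = [-3, 0, 3]  # arbitrary intercepts, just for illustration
xs = []
curves = {a: [] for a in a_list}
for i in range(200):
    x = uniform(-8, 8)
    xs.append(x)
    for a in a_list:
        curves[a].append(1/(1 + math.exp(-(a + x))))  # P = 1/(1 + e^-(a + x))
for a in a_list:
    plt.scatter(xs, curves[a], alpha=0.4, label=r'$a = %d$' % a)
plt.axhline(y=0.5, label='P=0.5')
plt.xlabel(r'$x$', fontsize=18)
plt.ylabel(r'$P$', fontsize=18)
plt.legend(fontsize=14)
plt.show()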
Bernoulli and Logit
The aim of logistic regression is to predict an unknown probability P of a successful event, for any given linear combination of independent variables (features). So, as the heading suggests, how are the logit and the Bernoulli distribution connected? Recall that the binomial distribution is the probability distribution of having n successes out of N trials, given that each trial succeeds with probability P and fails with probability Q = 1 − P. The Bernoulli distribution, on the other hand, is a discrete distribution with two possible outcomes labelled n = 0 and n = 1, in which n = 1 (a successful event) occurs with probability P and failure, i.e. n = 0, occurs with probability 1 − P. So the Bernoulli distribution can be written as shown below.
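In this notation the binomial distribution (eq. 1.6) and the Bernoulli distribution are:
P(n | N) = [N! / (n! (N − n)!)] P^n (1 − P)^(N − n)   (eq. 1.6, binomial)
P(n) = P^n (1 − P)^(1 − n),   n ∈ {0, 1}   (Bernoulli)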

It is understandable that the Bernoulli distribution is a special case of the binomial distribution for a single trial (N = 1 in eq. 1.6). Most importantly, we see that the dependent variable in logistic regression follows a Bernoulli distribution with an unknown probability P. Therefore the logit, i.e. the log of odds, links the independent variables (the Xs) to the Bernoulli distribution. In the logit case P is unknown, whereas in the Bernoulli distribution written above we know it. Let’s plot the logit function.

We see that the domain of the function lies between 0 and 1 and that the function ranges from minus to positive infinity. For logistic regression we want the probability P on the y axis, and that can be done by taking the inverse of the logit function. If you have noticed the sigmoid function curves before (Figures 2 and 3), you can already see the link: indeed, the sigmoid function is the inverse of the logit (check eq. 1.5).
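If you want to reproduce such a plot, here is a quick sketch in the same style as the earlier snippets (assuming the same imports):
plist = []
logitlist = []
for i in range(200):
    p = uniform(0.001, 0.999)  # avoid exactly 0 and 1, where the logit diverges
    plist.append(p)
    logitlist.append(math.log(p/(1 - p)))  # logit(P) = ln(P/(1-P))
plt.scatter(plist, logitlist, c='green', alpha=0.3, label=r'$\log \frac{P}{1-P}$')
plt.xlabel(r'$P$', fontsize=18)
plt.ylabel('logit', fontsize=17)
plt.legend(fontsize=14)
plt.show()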
Example with Cancer Data-set and Probability Threshold
Without further delay let’s see an application of logistic regression on a cancer data-set. Here we will concentrate on how we can set the probability threshold used to classify with our model. I will use all the features of the data-set just for simplicity, but you can read in detail about selecting the best features using the RFE method, which I have described in a separate post.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

cancer = load_breast_cancer()
cancer_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
X_trainc, X_testc, y_trainc, y_testc = train_test_split(cancer.data, cancer.target, test_size=0.3, stratify=cancer.target, random_state=30)
cancerclf = LogisticRegression()
cancerclf.fit(X_trainc, y_trainc)
# print("Logreg score on cancer data set", cancerclf.score(X_testc, y_testc))  # you can check the score if you want; it is not the main purpose here
We will use the predict_proba method of logistic regression, which, to quote scikit-learn, "returns probability estimates for all classes which are ordered by the label of the classes". We call this method on the test data set.
probac = cancerclf.predict_proba(X_testc)
print(probac[1:10])
>>> [[5.86216203e-02 9.41378380e-01]
[7.25210884e-03 9.92747891e-01]
[9.99938102e-01 6.18983128e-05]
[4.75502091e-02 9.52449791e-01]
[9.66861480e-01 3.31385203e-02]
[3.09660805e-01 6.90339195e-01]
[9.99687981e-01 3.12018784e-04]
[6.80759215e-04 9.99319241e-01]
[9.99998223e-01 1.77682663e-06]]
As our target is either 0 or 1, printing predict_proba gives us a probability matrix of dimension (N, 2), where N is the number of instances. The first column is the probability that the data belong to class 0, and the second is the probability that they belong to class 1. By default, if the probability of class 1 is more than 0.5, the prediction is categorized as a positive outcome. For each row the two columns add up to 1, as the probabilities of success (P) and failure (1 − P) must sum to 1.
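We can quickly verify the last point on the array from above (numpy is also needed a bit later):
import numpy as np
print(np.allclose(probac.sum(axis=1), 1.0))  # each row of predict_proba sums to 1, so this prints True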
We can now turn to the predict method, which predicts class labels; in the default case for binary classification it categorizes probabilities less than 0.5 as 0 and vice versa.
predict = cancerclf.predict(X_testc)
print(predict)
>>> [1 1 1 0 1 0 1 0 1 0 1 1 1 1 .....]# didn't show the complete list
Now we consider the first column of the probac = cancerclf.predict_proba(X_testc) array, which consists of the probabilities for class 0 (in the cancer data-set this is the malignant class). We make a mini data-frame with this array.
probability = probac[:, 0]
prob_df = pd.DataFrame(probability)
print(prob_df.head(10))  # this should match the probac 1st column
>>> 0
0 0.005366
1 0.058622
2 0.007252
3 0.999938
4 0.047550
5 0.966861
6 0.309661
7 0.999688
8 0.000681
9 0.999998
We modify this data-frame a bit more to understand the effect of changing the threshold.
prob_df['predict'] = np.where(prob_df[0] >= 0.90, 1, 0)  # create a new column
print(prob_df.head(10))
>>> 0 predict
0 0.005366 0
1 0.058622 0
2 0.007252 0
3 0.999938 1
4 0.047550 0
5 0.966861 1
6 0.309661 0
7 0.999688 1
8 0.000681 0
9 0.999998 1
We set ≥ 90% as the threshold for selecting the malignant class. In the example printout we see a value of 0.966861, so changing the threshold to 97% should exclude that sample from the malignant class.
prob_df['predict'] = np.where(prob_df[0] >= 0.97, 1, 0)
print(prob_df.head(10))
>>> 0 predict
0 0.005366 0
1 0.058622 0
2 0.007252 0
3 0.999938 1
4 0.047550 0
5 0.966861 0 # here is the change
6 0.309661 0
7 0.999688 1
8 0.000681 0
9 0.999998 1
One can also check the effect on the total number of test samples classified as malignant.
prob_df['predict'] = np.where(prob_df[0] >= 0.50, 1, 0)
print(len(prob_df[prob_df['predict'] == 1]))
>>> 56
prob_df['predict'] = np.where(prob_df[0] >= 0.97, 1, 0)
print(len(prob_df[prob_df['predict'] == 1]))
>>> 45
We have seen how one can change the probability threshold to select or reject a sample from a particular class.
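To see this trend over several cut-offs at once, a small sketch looping over a few arbitrary thresholds does the job (reusing prob_df from above):
for thresh in [0.5, 0.7, 0.9, 0.97]:
    n_malignant = (prob_df[0] >= thresh).sum()  # samples classified as malignant at this threshold
    print("threshold =", thresh, "-> malignant count:", n_malignant)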
LogisticRegression in scikit-learn uses L2 regularization by default, and the result of changing the regularization parameter can be checked and compared, much as with linear regression; I have discussed this before with ridge regression, so interested readers can check that post. Choosing the best features with RFE is an important part of logistic regression, as it is better to have little to no multicollinearity, and RFE is one way to select a few relevant features that describe the model.
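For completeness, here is a minimal sketch of how RFE can be wrapped around the same classifier (the number of features to keep is an arbitrary choice for illustration, not a recommendation):
from sklearn.feature_selection import RFE

rfe = RFE(estimator=LogisticRegression(), n_features_to_select=10)  # keep 10 features, chosen arbitrarily
rfe.fit(X_trainc, y_trainc)
print(cancer.feature_names[rfe.support_])  # names of the selected features
print(rfe.score(X_testc, y_testc))  # accuracy on the test set using only the selected features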
So to wrap up, we have learned some of the fundamental ideas about developing a regression model that can be used for classification.
I recommend you check Andrew Ng’s lecture notes or the lectures available on YouTube. The basic idea of this post is influenced by the book "Learning Predictive Analytics with Python" by Kumar, A., which clearly describes the connection between linear and logistic regression. Relating the connection between the Bernoulli distribution and the logit function is motivated by the presentation slides by B. Larget (UW-Madison), which are publicly available.
Stay strong and cheers!