
Introduction
In simple logistic regression, we fit the probability of success of the response variable against a predictor variable, which can be either categorical or continuous. We also need to quantify how good the model is, and there are several goodness-of-fit statistics for that purpose.
A discrete random variable can often take only two values: 1 for success and 0 for failure. The distribution of this type of random variable is known as the Bernoulli distribution. The probability function, or probability mass function, of the Bernoulli distribution can be expressed as
P(Y = y) = p^y × (1 – p)^(1–y)
Here, y can be either 1 or 0, and p is the probability of success.
Using this function, we can calculate the probability for each observation. Sometimes a discrete random variable can take more than two values, i.e., values other than 0 and 1; the same kind of distribution function lets us calculate the probability at each level of the random variable. The joint probability function is obtained by multiplying the probability values at each discrete level across all the observations.
The above definition of joint probability leads us to a new entity called the likelihood function. The formulation is the same as the joint probability, but the interpretation is a bit different. In the case of the probability function, the parameter is known and we determine the probability of the data at each level. With the likelihood function, on the other hand, the data are observed and we need to estimate the unknown parameter.
Examples
Let us assume that the probability of an event occurring, i.e., the probability of success, is 0.7, and the probability of the event not occurring, i.e., the probability of failure, is 0.3. This leads to the following Bernoulli distribution.
With the formula in hand, we would like to determine the probability of the event occurring. We plug the values into the equation above:
P(Y = y) = (0.7)^y × (1 – 0.7)^(1–y)
For y = 1, we get P = (0.7)¹(1–0.7)⁰ = 0.7
Bingo. We obtained the probability of success using the Bernoulli distribution, which here involves only a single trial.
Let’s move to the concept of likelihood. Let’s also assume the following outcomes from 10 trials.
[1,1,1,0,1,1,1,0,1,0]
The likelihood function becomes:
L(p) = p^7 × (1 – p)^3
since there are 7 successes and 3 failures among the 10 outcomes.
The likelihood function is often transformed to the log scale. Once this is done, the log-likelihood becomes:
ln L(p) = 7 ln(p) + 3 ln(1 – p)
Here, p is unknown and we need to estimate it. To find the maximum likelihood estimate of p, we set the derivative of the log-likelihood function to 0:
d/dp [7 ln(p) + 3 ln(1 – p)] = 7/p – 3/(1 – p) = 0, which gives p = 7/10
Therefore, the maximum likelihood estimate of the probability of success is 7/10, the value that makes the above trial outcomes most probable.
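If you would like to verify this numerically, here is a minimal R sketch that evaluates the log-likelihood over a grid of candidate values of p for the outcomes listed above (the grid resolution is an arbitrary choice):

```r
# Outcomes from the 10 Bernoulli trials listed above
y <- c(1, 1, 1, 0, 1, 1, 1, 0, 1, 0)

# Log-likelihood of p for independent Bernoulli trials:
# sum(y)*log(p) + sum(1 - y)*log(1 - p)
log_lik <- function(p) sum(y) * log(p) + sum(1 - y) * log(1 - p)

# Evaluate over a grid of candidate probabilities and pick the maximum
p_grid    <- seq(0.01, 0.99, by = 0.01)
ll_values <- sapply(p_grid, log_lik)
p_grid[which.max(ll_values)]   # returns 0.7, matching the analytical result 7/10
```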
Statistics for Goodness-of-Fit
Here, we will discuss the following four statistics for goodness-of-fit.
- Deviance
- Log-likelihood ratio
- Pseudo R²
- AIC and BIC statistics
Let’s go through the details.
Deviance
Using deviance, we can compare the current model with the saturated model. A saturated model is one that provides a perfect fit for the data. Deviance is defined as
deviance = -2*(log-likelihood for the current model – log-likelihood for the saturated model)
In a saturated model, the number of parameters equals the sample size since it contains one parameter for each observation. The likelihood of a saturated model is 1, which means a saturated model can provide perfect predictions from the predictor variables. Therefore, the log-likelihood for a saturated model is 0, and as a result the deviance becomes -2*(log-likelihood for the current model).
In linear regression, the best model minimizes the error across all the data points. In logistic regression, the comparable statistic is deviance. In other words, in logistic regression the best model tries to reduce the deviance (i.e., -2 times the difference in log-likelihood between the model under study and the saturated model).
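As a small illustration, the sketch below fits an intercept-only logistic model to the 10 outcomes from the earlier example. Since the saturated log-likelihood is 0 for ungrouped binary data, the residual deviance reported by glm() equals -2 times the model's log-likelihood:

```r
# Same 10 Bernoulli outcomes as before
y <- c(1, 1, 1, 0, 1, 1, 1, 0, 1, 0)

# Intercept-only logistic regression
fit <- glm(y ~ 1, family = binomial)

deviance(fit)                  # residual deviance of the fitted model
-2 * as.numeric(logLik(fit))   # identical, since the saturated log-likelihood is 0
```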
Log-likelihood ratio
Ideally, deviance is a comparison between the current model and the saturated model, but we can also compare the current model with a model that has more parameters. The latter model is called the full model, and the current model the reduced model. Calculating the deviance difference between the reduced model and the full model provides some insight into whether adding more parameters actually improves the model's performance.
Log-likelihood ratio
= Deviance of the reduced model – Deviance of the full model
= –2*(log-likelihood for the reduced model – log-likelihood for the saturated model) – [–2*(log-likelihood for the full model – log-likelihood for the saturated model)]
= –2*(log-likelihood for the reduced model – log-likelihood for the saturated model) + 2*(log-likelihood for the full model – log-likelihood for the saturated model)
= –2*(log-likelihood for the reduced model – log-likelihood for the full model)
= –2*ln(likelihood for the reduced model / likelihood for the full model)
To determine the significance of the log-likelihood ratio, a chi-square test is performed; if the null hypothesis is rejected, we conclude that the full model fits the data better than the reduced model, which in turn means the added variables do contribute to the model. When the reduced model is the intercept-only model, the log-likelihood ratio tests the level of improvement of the fitted model over the intercept-only model.
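Here is a minimal sketch of the idea using simulated data (the variable names and coefficient values are made up purely for illustration): fit a reduced and a full logistic model and compare them with a chi-square test via anova():

```r
set.seed(1)

# Simulated data: x1 affects the outcome, x2 does not (values chosen arbitrarily)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x1))

reduced <- glm(y ~ x1,      family = binomial)
full    <- glm(y ~ x1 + x2, family = binomial)

# Likelihood ratio statistic = deviance(reduced) - deviance(full)
deviance(reduced) - deviance(full)

# Chi-square test with df = difference in number of parameters
anova(reduced, full, test = "Chisq")
```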
Pseudo R²
More or less, we are all familiar with the interpretation of R² in linear regression, but in logistic regression its interpretation is different. McFadden's Pseudo R² is defined as
McFadden’s R²
= 1 – (deviance of the fitted model / deviance of the null model)
= 1 – (log-likelihood for the fitted model / log-likelihood for the null model)
When a model’s likelihood is small, its log-likelihood is a large negative number. If the null model is much less likely than the fitted model, the ratio in the second line of the equation above becomes small. In the perfect case, the log-likelihood of the fitted model is 0, the ratio becomes 0, and the Pseudo R² becomes 1. A Pseudo R² of 1 indicates that we can predict the probability of success or failure perfectly.
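As a sketch, McFadden's Pseudo R² can be computed by hand for any fitted glm by comparing its log-likelihood (or deviance) with that of the intercept-only null model; the data below are simulated just for illustration:

```r
set.seed(1)

# Simulated binary outcome with one predictor (arbitrary illustrative values)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

fitted_model <- glm(y ~ x, family = binomial)
null_model   <- glm(y ~ 1, family = binomial)

# McFadden's Pseudo R-squared = 1 - logLik(fitted) / logLik(null)
1 - as.numeric(logLik(fitted_model)) / as.numeric(logLik(null_model))

# Equivalently, 1 - deviance(fitted) / deviance(null)
1 - deviance(fitted_model) / deviance(null_model)
```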
AIC and BIC statistics
The likelihood ratio test and Pseudo R² are used to compare models that are nested, meaning one model's parameters are a subset of the other's. When the two models are not nested, i.e., they contain different sets of parameters, we cannot use the likelihood ratio test or Pseudo R² to compare them. That is when the AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) statistics come into the picture. AIC and BIC are defined as:
AIC = -2*(Log-likelihood of the current model – k)
BIC = -2*(Log-likelihood of the current model) + ln(n)*k
Here, k = the total number of parameters in the model (including the intercept) and n = the sample size.
Likelihoods are between 0 and 1, so their log is less than or equal to zero. If a model is more likely, its log-likelihood is closer to zero, and the "-2*log-likelihood" value is positive but small. AIC and BIC can be used to compare both nested and non-nested models. The model with the lower AIC/BIC value provides the better fit.
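As a sketch with simulated data (arbitrary values, for illustration only), the built-in AIC() and BIC() functions in R agree with the formulas above, where k counts the intercept as a parameter:

```r
set.seed(1)

# Small simulated example
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)

k  <- length(coef(fit))   # number of parameters, including the intercept
ll <- as.numeric(logLik(fit))

AIC(fit); -2 * (ll - k)           # both give the same value
BIC(fit); -2 * ll + log(n) * k    # both give the same value
```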
Implementation in R
I have previously gone through the basics of simple logistic regression in the article below, where I discussed the syntax and how to interpret the output.
Here, I will use the same dataset and compare the statistics described above.
Dataset
For demonstration, I will use the General Social Survey (GSS) data collected in 2016. The data were downloaded from the Association of Religion Data Archives and were collected by Tom W. Smith. This dataset has responses from nearly 3,000 respondents and covers several socio-economic features, such as marital status, educational background, working hours, and employment status. Let’s dive into this dataset to understand it a bit more.
The DEGREE column provides the education level of each respondent, and MADEG provides the education level of each respondent’s mother. Our goal is to find out whether the mother’s having a bachelor’s degree is a good predictor of the child’s having a bachelor’s degree. The categorical data in the dataset are encoded ordinally.
![DEGREE data [Image by Author]](https://towardsdatascience.com/wp-content/uploads/2022/10/0fiJeh93XY1uDA7Gm.png)
![MADEG data [Image by Author]](https://towardsdatascience.com/wp-content/uploads/2022/10/0mgBBYDgc1WrsZj60.png)
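Below is a rough sketch of how such a model can be fit. I assume the data have already been read into a data frame named gss and that DEGREE and MADEG follow the standard GSS coding in which a value of 3 indicates a bachelor's degree; the recoding should be adjusted to match your copy of the data.

```r
# Assumes the GSS 2016 data are already loaded into a data frame named `gss`,
# and that DEGREE / MADEG use the standard GSS coding where 3 = bachelor's degree.
gss$child_bachelor  <- ifelse(gss$DEGREE == 3, 1, 0)
gss$mother_bachelor <- ifelse(gss$MADEG  == 3, 1, 0)

# Logistic regression: child's bachelor's degree vs. mother's bachelor's degree
model <- glm(child_bachelor ~ mother_bachelor, data = gss, family = binomial)
summary(model)   # reports coefficients, null deviance, and residual deviance
```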
The deviance of the current model (residual deviance) and of the null model are shown in the summary output. The smaller deviance of the current model indicates that it is a better fit compared to the null, intercept-only model.
![Output window in R [Image by Author]](https://towardsdatascience.com/wp-content/uploads/2022/10/1Epbb6XMVQfNxAvgd_tGtYw.png)
To obtain the likelihood ratio, we need to come up with another nested model and compare the likelihoods.
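Continuing from the sketch above, one option is to add the father's education as a second predictor (PADEG is the corresponding GSS column; its recoding below follows the same assumption about the coding) and compare the nested models with a chi-square test:

```r
# Continuing from the sketch above; PADEG (father's degree) is recoded the same way,
# with 3 assumed to mean a bachelor's degree.
gss$father_bachelor <- ifelse(gss$PADEG == 3, 1, 0)

# Keep rows where both recoded predictors are observed, so the reduced and
# full models are fitted to the same observations.
gss_cc <- subset(gss, !is.na(mother_bachelor) & !is.na(father_bachelor))

reduced_model <- glm(child_bachelor ~ mother_bachelor, data = gss_cc, family = binomial)
full_model    <- glm(child_bachelor ~ mother_bachelor + father_bachelor,
                     data = gss_cc, family = binomial)

# Likelihood ratio test: does adding the father's education improve the fit?
anova(reduced_model, full_model, test = "Chisq")
```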
To obtain the Pseudo R², we need to install the "rcompanion" package and use the "nagelkerke" function. For AIC/BIC, the commands are simply AIC(model name) and BIC(model name).
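Using the model from the earlier sketch (I named it model there), the calls might look like this:

```r
# install.packages("rcompanion")   # run once if the package is not installed
library(rcompanion)

nagelkerke(model)   # reports McFadden, Cox and Snell, and Nagelkerke Pseudo R-squared

AIC(model)
BIC(model)
```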


The nagelkerke output provides three different Pseudo R² values, but we can use McFadden’s Pseudo R², since many statisticians recommend it the most. A model with a greater likelihood has a higher McFadden’s R² when compared with another model.
Conclusion
In this article, we have gone through the basic goodness-of-fit statistics for logistic regression. We discussed deviance, the log-likelihood ratio, Pseudo R², and the AIC/BIC statistics, and implemented them in R. For a deeper understanding, readers may find a statistics textbook or reference book more informative.
Thanks for reading.