
Simple Logistic Regression for Dichotomous Variables in R

Statistics in R Series

Image from Unsplash

Introduction

Logistic regression is one of the fundamental statistical techniques for performing regression analysis with categorical variables. Often we have variables with ordinal values that do not necessarily represent numbers but instead represent categories. We have previously discussed simple linear regression and multiple linear regression. In this article, I am going to cover the implementation of logistic regression in R and interpret the results.

Sometimes we have a variable that can only take binary values, for example gender, employment status, and other yes/no type responses. We can also have more than two categories. In simple logistic regression, we have a dependent variable that is binary and one independent variable that can be either continuous or categorical. We could try to use linear regression to predict a binary dependent variable, but there are several limitations. We are going to discuss those assumption violations here.

Assumption Violations

The first assumption for linear regression is normality of the data. In simple linear regression, we assume that the dependent variable is normally distributed, with the mean coinciding with the median. In the case of logistic regression, the dependent variable has a dichotomous outcome, which is nowhere near a normal distribution; in fact, it follows a Bernoulli distribution. That means we cannot use linear regression to predict a binary variable.

Another limitation of using linear regression to predict a binary variable is the violation of the homoscedasticity assumption. Under this assumption, the variance of the error is uniform across all values of the predictor variable. In reality that is not the case here: the variance of the error depends on the value of the predictor variable, which is called heteroscedasticity.

Lastly, I would like to point out that applying linear regression in this way also violates the linearity assumption. We cannot obtain a linear relationship between a dichotomous response variable and a continuous predictor. In fact, the relationship follows an S-shaped curve, which is the trademark of logistic regression. This curve shows that the response variable can only take values at two levels. If linear regression is applied here, the predicted outcome may sometimes be less than 0 or greater than 1, which violates the fundamental assumption of probability theory.
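To visualize that S-shaped curve, here is a minimal sketch (my addition) using base R's plogis(), the standard logistic function:

```r
# Minimal sketch: plot the logistic (sigmoid) curve with base R.
# plogis(x) computes 1 / (1 + exp(-x)), the standard logistic function.
curve(plogis(x), from = -6, to = 6,
      xlab = "linear predictor", ylab = "probability of success",
      main = "Logistic (sigmoid) curve")
abline(h = c(0, 1), lty = 2)  # predicted probabilities stay between 0 and 1
```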

Solution

To avoid the violations stated above, we need to use logistic regression instead of linear regression when the response variable is binary. We need a logistic transformation of the probability of success of the outcome variable. The simplest form of the simple logistic regression equation is

logit(p) = β₀ + β₁X

where p is the probability of the outcome variable equaling 1, and logit(p) represents the logistic transformation of that probability of success. The right-hand side has the same form as simple linear regression. The basic difference between this logistic transformation equation and simple linear regression is that instead of directly modeling the response variable, we model the probability of success of that response variable. Since p / (1 − p) is the odds of success, the left side of the equation can be rewritten as follows:

logit(p) = ln(p / (1 − p))

therefore, "logit" is "natural logarithm of odds for success".

The theory behind logistic regression is discussed briefly above. Next, I am going to implement an example of logistic regression in R and interpret all the outputs to gain insight.

Dataset

For demonstration, I will use the General Social Survey (GSS) data collected in 2016. The data were downloaded from the Association of Religion Data Archives and were collected by Tom W. Smith. This dataset has responses collected from nearly 3,000 respondents and it has data related to several socio-economic features. For example, it has data related to marital status, education background, working hours, employment status, and many more. Let’s dive into this dataset to understand it a bit more.

GSS 2016 data [Image by Author]

The DEGREE column gives the education level of each respondent, and MADEG gives the education level of each respondent's mother. Our goal is to find out whether the mother's bachelor-level education is a good predictor of the child's bachelor-level education. The categorical data in the dataset are encoded ordinally.

DEGREE data [Image by Author]
MADEG data [Image by Author]

Let's import the data into R and use the glm() command to answer our question. First we need to modify the dataset a little: participants with less than a bachelor's degree are labeled 0 and all others 1, and the same recoding is applied to the mother's education level. The new columns are named DEGREE1 and MADEG1.
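Below is a minimal sketch of that workflow (my addition); the file name and the numeric cutoff used to define "bachelor's degree or higher" are assumptions, so adjust them to match your copy of the data:

```r
# Sketch of the recoding and model fit described above.
# Assumptions: the GSS extract is saved as "GSS2016.csv" and DEGREE/MADEG use
# the usual GSS coding where 3 = bachelor and 4 = graduate degree.
# Missing-value codes in the raw data may also need to be cleaned first.
gss <- read.csv("GSS2016.csv")

# 1 if the respondent (or the mother) holds at least a bachelor's degree, else 0
gss$DEGREE1 <- ifelse(gss$DEGREE >= 3, 1, 0)
gss$MADEG1  <- ifelse(gss$MADEG  >= 3, 1, 0)

# Simple logistic regression: child's degree status predicted by mother's
model <- glm(DEGREE1 ~ MADEG1, data = gss, family = binomial(link = "logit"))
summary(model)
```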

Output window in R [Image by Author]

The output window above is quite similar to the linear regression output discussed in the following article, where we used the lm() function.

Total Interpretation of Regression and ANOVA Commands in R

Interpretation of Results

As mentioned before, to implement logistic regression we model the log-odds (logit) of the outcome's success, and the intercept and the coefficient of the predictor variable are then estimated on that scale. The interpretations are below.

  1. From the output window, we observe that there are residuals similar to the linear regression scenario. The estimate for the intercept is 0.257 and the coefficient estimate for MADEG1 is 0.31598, which tells us that for every one-unit increase in the predictor variable (mother's education level), the log-odds of the child's education level taking the value 1 increase by about 0.316. This is a positive slope, indicating that the response increases with the predictor. In other words, a mother's bachelor's degree increases the odds of the child holding a bachelor's degree (a sketch further below shows how to convert this coefficient into an odds ratio).
  2. The associated p-value is less than 0.05, which tells us to reject the null hypothesis. The null hypothesis here is that the predictor variable has a coefficient of 0 and therefore does not impact the response variable. We can conclude that the mother's bachelor-level education significantly impacts the child's bachelor-level degree, at least according to this data.
  3. The z-values (labeled "z value" in the glm() output, since the dispersion is fixed for a binomial model) are calculated by dividing the estimates by their standard errors. Lastly, the null deviance shows the deviance for the null model, which contains only the intercept. The residual deviance is defined as

residual deviance = -2(log likelihood of current model – log likelihood of saturated model)

The difference between the null deviance and the residual deviance is used to determine the significance of the current model.
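As a follow-up to points 1 and 3, here is a hedged sketch (my addition, assuming the fitted glm object from earlier is named model): exp() converts the log-odds coefficient into an odds ratio, and the drop from the null deviance to the residual deviance can be tested against a chi-squared distribution.

```r
# Odds ratio: exponentiate the log-odds coefficient of the (assumed) MADEG1 term.
exp(coef(model)["MADEG1"])   # e.g. exp(0.316) is roughly 1.37

# Likelihood-ratio test: compare the fitted model against the intercept-only model.
dev_drop <- model$null.deviance - model$deviance   # drop in deviance
df_drop  <- model$df.null - model$df.residual      # degrees of freedom used by the predictor
pchisq(dev_drop, df = df_drop, lower.tail = FALSE) # small p-value => model improves on the null
```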

Conclusion

We have discussed simple logistic regression and its implementation in R, and we have walked through the R output to interpret the results from the General Social Survey. The positive coefficient for the predictor variable indicates that when the mother's bachelor's degree indicator moves from 0 to 1, the log-odds of the child holding a bachelor's degree increase by 0.31598. In other words, the mother's education significantly impacts the child's education in our dataset.

Thanks for reading.

Website: Learning from Data


