The world’s leading publication for data science, AI, and ML professionals.

Prediction in Various Logistic Regression Models (Part 1)

Statistics in R Series

Photo by Jen Theodore on Unsplash

Introduction

We have covered various types of Logistic Regression in the past several articles. The goal of all these models is to predict future data points as well as intermediate data points as accurately as possible. In this article, we will go through how this prediction analysis can be done in R for simple and multiple logistic regression using both binary and ordinal data.

Dataset

The Adult Data Set, available in the UCI Machine Learning Repository, will be used as a case study. It contains demographic data for more than 30,000 individuals, including each individual's race, education, occupation, gender, hours worked per week, and income.

Adult Data Set from UCI Machine Learning Repository

A refresher on the dataset:

  • Bachelors: 1 means the person has a bachelor's degree and 0 means the person doesn't have a bachelor's degree
  • Income_greater_than_50k_code: 1 means the total family income is greater than $50k and 0 means it is less than $50k
  • Marital_status_code: 1 means the person is married and 0 means the person is never married or divorced
  • Race_code: 1 refers to non-white and 2 refers to white individuals

Prediction in simple logistic regression for binary data

Using the dataset described above, we will use education level to predict the binary outcome of income, which can be either greater than $50K or less than $50K. This study asks the following question:

What is the impact of education level on income?

To perform prediction analysis, we first need the ggeffects package installed, which provides the ggpredict() function. The first command will provide the predicted probabilities for the binary "Bachelors" variable. We know that the Bachelors variable can take two values: 0 and 1. R will provide the probability of the family income (also a binary variable) being greater than $50k for both cases.
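As a minimal sketch of this step (assuming the recoded dataset has been read into a data frame called `adult`; the data frame name and exact column names are assumptions based on the refresher above), the model fit and prediction might look like:

```r
library(ggeffects)

# Simple logistic regression: binary income outcome ~ bachelor's degree
model_simple <- glm(Income_greater_than_50k_code ~ Bachelors,
                    data = adult, family = binomial)

# Predicted probability of income > $50k for Bachelors = 0 and 1
ggpredict(model_simple, terms = "Bachelors")
```

ggpredict() returns a table of predicted probabilities with confidence intervals for each level of the variable named in `terms`.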

The first output provides tabular data for the predicted income probabilities. Here, if Bachelors = 1, the probability that the family income is greater than $50k is 0.47, whereas if the person doesn't possess a bachelor's degree, the probability drops to 0.16. This tells us that education has a substantial effect on family income.

Prediction in multiple logistic regression for binary data

Using the dataset mentioned above, we will take two predictor variables: education level and marital status to predict the binary outcome of income which can be either greater than $50K or less than $50k. The study question here is:

What is the combined impact of education level and marital status on income?

The implementation of multiple logistic regression with binary data is very similar to that of simple logistic regression.

Here we want to include marital status as a second predictor of family income. Using a similar ggpredict() command, we obtain the following result.
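A sketch of the two-predictor model (again assuming a data frame called `adult` with the column names from the refresher above; these names are assumptions):

```r
library(ggeffects)

# Multiple logistic regression: add marital status as a second predictor
model_multi <- glm(Income_greater_than_50k_code ~ Bachelors + Marital_status_code,
                   data = adult, family = binomial)

# Predicted probabilities by Bachelors status; ggpredict() holds the
# other covariate (Marital_status_code) at its mean
ggpredict(model_multi, terms = "Bachelors")
```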

For the second model, which has two predictors, the probabilities are adjusted for the mean value of the other variable. Here the second predictor is Marital_status_code, which has a mean value of 0.47. This tells us that 47% of the individuals in our dataset are married and 53% are either never married or divorced. Keeping that value constant, the probability of having a family income greater than $50k is 0.44 if the person has a bachelor's degree. If not, the probability drops to 0.17.

Prediction in simple logistic regression for ordinal data

Sometimes the outcome variable is ordinal, with more than two response levels. The family income variable in our dataset has only two outcome levels, but if the response variable has more than two, the same approach can be followed.

The purpose of the regression model is to provide a quantitative explanation of the following question from the dataset:

What is the individual impact of education level, gender and race on income?

To define a logistic regression model where the response variable is ordinal, we can use the clm() command from the ordinal package. First, we need to convert the predictor and response variables to factors. When the response variable has more than two ordered categories, this model is generally called a Proportional Odds (PO) model.
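A minimal sketch of the factor conversion and model fit (the data frame name `adult`, the column names, and the object name `po_model` are assumptions for illustration):

```r
library(ordinal)

# clm() requires the response to be a factor; we also treat each
# education code as a discrete factor level
adult$Income_level    <- factor(adult$Income_greater_than_50k_code)
adult$Education_code  <- factor(adult$Education_code)

# Proportional odds (PO) model with education level as the predictor
po_model <- clm(Income_level ~ Education_code, data = adult)
summary(po_model)
```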

We will use the same dataset here for the PO model. For prediction, we will again use the ggpredict() command from the ggeffects library.

Let's first predict the family income for Education_code 5, 10, and 13, which represent 9th grade, high school graduate, and doctorate respectively. Here response level 1 indicates the group with a family income of less than $50k and response level 2 indicates the group with a family income greater than $50k. If the individual has a 9th-grade education, the probabilities that the family income will be less than and greater than $50k are 0.98 and 0.02 respectively. If the individual has a doctorate degree, those probabilities are 0.35 and 0.65 respectively. Therefore, the higher the education level, the higher the family income in general.
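This prediction can be reproduced by listing the education codes of interest in brackets inside `terms` (a sketch, assuming the PO model described above has been fitted with clm() and stored in an object called `po_model`; both names are assumptions):

```r
library(ggeffects)

# Predicted probabilities of each income level for education codes
# 5 (9th grade), 10 (high school graduate), and 13 (doctorate)
ggpredict(po_model, terms = "Education_code [5, 10, 13]")
```

For an ordinal model, ggpredict() returns one block of predicted probabilities per response level, and within each education code the probabilities across the response levels sum to 1.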

If we apply the same prediction using "Bachelors" as the predictor, we observe that for an individual who doesn't have a bachelor's degree, the probability that the family income will be less than $50k is 0.84 and the probability that it will be greater than $50k is 0.16.

Taking "Education_yrs" as the predictor, we can conclude that the more years of education, the higher the probability of a higher income. If an individual has 16 years of education, the probability that his/her family income is greater than $50k is 0.69. One thing to observe is that the probabilities across all the response levels sum to 1.
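With a continuous predictor, a specific value can be requested in the same bracket notation (a sketch, again assuming the `adult` data frame and its column names; the object name `po_years` is hypothetical):

```r
library(ordinal)
library(ggeffects)

# PO model with years of education as a continuous predictor
po_years <- clm(factor(Income_greater_than_50k_code) ~ Education_yrs,
                data = adult)

# Predicted probabilities of each income level at 16 years of education
ggpredict(po_years, terms = "Education_yrs [16]")
```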

Prediction in multiple logistic regression for ordinal data

In ordinal logistic regression, the predictor variables can be ordinal, binary, or continuous, while the response variable is ordinal.

Consider the example of predicting income from the ordinal levels of education. To carry out the regression analysis, we can order the education levels from 1st grade all the way up to a doctorate degree and assign an ordered number to each level. It is also possible to predict income using binary variables: for instance, we could assign 1 to people with bachelor's degrees and 0 to those without, which is effectively an ordinal variable with two levels. Last but not least, we can also predict income levels with continuous variables such as years of education. Here we are trying to answer the following question quantitatively.

What is the combined impact of education level, gender and race on income?

Here, in the first model, we have taken all three predictor variables to predict income and the result is shown below.

There are two income levels (income over $50,000 and income less than $50,000), so there are also two response levels. In the second column, you will find the predicted probabilities for each education level. With an education level of 3 (5th-6th grade), the probability of earning less than $50,000 is 0.99, whereas with an education level of 13 (doctorate), that probability is 0.36. The same conclusion can be drawn from the second response level's prediction results. Therefore, it is evident that family income has a positive correlation with education level: a higher level of education is associated with a higher family income. As usual, these results are adjusted using the mean values of the other two variables, namely Gender_code and Race_code.

When gender is included in the second ggpredict command, we obtain the following result.

Since gender, like income level, has two values, we now have four prediction tables. If a person has gender code 1 (female) and a doctorate (education code 13), the predicted probability of an income over $50,000 is 0.42, while for a person with gender code 2 (male) and a doctorate, it is 0.74. In other words, this indicates that women are not compensated equally for the same level of education as men. As usual, the result has been adjusted using the mean value of Race_code only.
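The two ggpredict() calls for this section might look like the following sketch (the data frame name `adult`, the column names, and the object name `po_multi` are assumptions; adding a variable to `terms` stratifies the predictions by that variable instead of averaging over it):

```r
library(ordinal)
library(ggeffects)

# PO model with all three predictors
adult$Income_level <- factor(adult$Income_greater_than_50k_code)
po_multi <- clm(Income_level ~ Education_code + Gender_code + Race_code,
                data = adult)

# First call: predictions by education level, adjusted for the mean
# values of Gender_code and Race_code
ggpredict(po_multi, terms = "Education_code")

# Second call: predictions stratified by gender as well; only
# Race_code is now held at its mean
ggpredict(po_multi, terms = c("Education_code", "Gender_code"))
```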

Conclusion

In this article, we have performed prediction analysis for both binary and ordinal logistic regression models using single and multiple predictor variables. We have covered the use of the ggpredict() command in all four models, and the results are discussed quantitatively. As a reminder, the overall performance of a model depends on how well the data has been cleaned: unnecessary, repeated, or wrong data will lead to misleading outcomes from the models.

Acknowledgment for Dataset

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Thanks for reading.
