Multinomial Logistic Regression in R

Statistics in R Series

Photo by Edge2Edge Media on Unsplash

Introduction

If you have followed my previous articles in the "Statistics in R Series", you should already have a good understanding of how logistic regression is implemented in R and of the different types of logistic regression models. We have come a long way in the regression world and have covered binary, proportional odds (PO), generalized, and partial proportional odds (PPO) models. In this article, I am going to discuss multinomial logistic regression, which is the appropriate model when the response variable has multiple unordered categories. A typical example is the affiliation of people to political parties, for instance, Republicans, Democrats, independents, etc.

Brief refresher

Let’s first recap what we have covered so far. We started with binary logistic regression, where the response has only two categories, e.g. healthy or not healthy, so ordering is not an issue. Then we deep-dived into the ordinal logistic regression model, which handles multiple ordered responses. Here, we defined more ordered categories of health status and fed them into the ordinal model. One fundamental assumption of this model is that the coefficients do not depend on the category, i.e. they don’t vary across the responses. For example, if we had four categories of health status ordered as 1, 2, 3, 4, we assumed that the coefficient of each independent variable remains the same across all four categories. This is called the proportional odds (PO) assumption.

Later we moved on to the generalized ordinal logistic model and allowed all the coefficients to vary across the categories. For example, the coefficient of a specific independent variable when we consider the health status change from 1 to 2 will differ from its coefficient when we consider the change from 2 to 3.

There is another type of model that was discussed previously, the partial proportional odds (PPO) model. In this type of model, only those variables which violate the proportional odds assumption are allowed to vary across the categories. For example, if we have health status as the response variable and education years, marital status, and family income as predictors, and we find that only education years violates the PO assumption, we can propose a PPO model instead of a generalized ordinal regression model. I know this may sound a little complicated, but if you go through my previous articles in the "Statistics in R Series", I hope things will become a lot more understandable. For now, the following table shows what we have covered so far in terms of model definition and execution.

What is multinomial regression?

In simple terms, a multinomial regression model estimates the likelihood of an individual falling into a specific category relative to a baseline category using a logit, or log odds, approach. It works like an extension of binary (binomial) logistic regression to a nominal response variable with more than two outcomes. In multinomial regression, we need to define a reference category, and the model estimates a separate set of logit coefficients for each remaining category with respect to that reference category.

In our example, we will set a specific health status as the baseline category and perform multinomial logistic regression with reference to that. In brief, multinomial regression is like the execution of several binary regression models in one shot.
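The description above corresponds to the usual baseline-category logit formulation. As a sketch, for a response Y with J categories, reference category 1, and a single predictor x, it can be written as follows. The symbols \alpha_j and \beta_j are notation introduced here for illustration and do not come from the model output.

```latex
\log\frac{P(Y = j \mid x)}{P(Y = 1 \mid x)} = \alpha_j + \beta_j x,
\qquad j = 2, \dots, J

P(Y = 1 \mid x) = \frac{1}{1 + \sum_{k=2}^{J} e^{\alpha_k + \beta_k x}},
\qquad
P(Y = j \mid x) = \frac{e^{\alpha_j + \beta_j x}}{1 + \sum_{k=2}^{J} e^{\alpha_k + \beta_k x}}
```

With J = 4 health categories this gives three logit equations, one per non-reference category, which is why the model resembles several binary regressions fitted together.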

Dataset

The adult dataset from the UCI repository will again be used for the implementation of multinomial logistic regression. A quick look at the dataset is below.

Adult dataset from UCI repository
  • Education: numeric and continuous
  • Marital status: binary (0 for unmarried and 1 for married)
  • Gender: binary (0 for female and 1 for male)
  • Family income: binary (0 for average or less than average and 1 for more than average)
  • Health status: ordinal (1 for poor, 2 for average, 3 for good and 4 for excellent)

Here, although the health status is ordinal, we don’t need to use that ordering, since the multinomial model treats the categories as unordered and simply compares each of them with a reference category.

Implementation in R

To implement multinomial logistic regression, I will use the vglm() function from the VGAM package. The code snippet is below.

As mentioned earlier, we need to define a reference category. Here, I have defined health status at level = 1 as the reference category.
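A minimal sketch of such a fit is shown here. It is not the original snippet: the data are simulated, the column names `educ` and `health`, the object names `toy` and `fit`, and the coefficient values used to generate the data are all assumptions made for illustration. The only parts taken from the text are the use of vglm() from VGAM and health status level 1 as the reference category.

```r
library(VGAM)

set.seed(42)
n <- 1000

# Simulated predictor: years of education (hypothetical values)
educ <- sample(8:20, n, replace = TRUE)

# Simulated response: four health categories drawn from a baseline-category
# logit model with made-up coefficients (category 1 is the reference)
eta  <- cbind(0, -1 + 0.09 * educ, -2.5 + 0.22 * educ, -4 + 0.30 * educ)
prob <- exp(eta) / rowSums(exp(eta))
health <- apply(prob, 1, function(p) sample(1:4, size = 1, prob = p))

toy <- data.frame(educ = educ, health = factor(health, levels = 1:4))

# Multinomial logistic regression with level 1 as the reference category
fit <- vglm(health ~ educ, family = multinomial(refLevel = 1), data = toy)

summary(fit)  # one intercept and one educ coefficient per non-reference category
coef(fit)
```

With four categories, the fit returns six coefficients: three intercepts and three slopes for educ, each comparing one category with the reference.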

Interpretation of result

Model summary

Alright. We have the summary of the model that we defined above. It is a one-predictor model with multiple outcome categories. The predictor is education in years (continuous) and the outcome is categorical. The outcome is actually ordinal in nature, but the multinomial model does not use the ordering, so the numeric values simply act as category labels.

The summary is similar to binary logistic regression but we have multiple binary regression models all of which are compared with the reference category. Let’s refer to the coefficient block above.

The first model, with intercept1 and educ1, compares category 2 with the reference category 1 (set in the model definition). The second model, with intercept2 and educ2, compares category 3 with the reference category 1. The third model, with intercept3 and educ3, compares category 4 with the reference category 1. Since there are four categories, only three comparisons against the reference category are needed.

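model1: log(P(health = 2) / P(health = 1)) = intercept1 + educ1 × educ

model2: log(P(health = 3) / P(health = 1)) = intercept2 + educ2 × educ

model3: log(P(health = 4) / P(health = 1)) = intercept3 + educ3 × educ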

The same binary logistic interpretation can be applied here. For example, in model1, for every unit increase in the educ variable, the logit or log odds of being in category 2 compared to category 1 increases by 0.08969, and the remaining two models are read the same way. The slope is positive in all cases, which means the log odds of being in a better health category rather than category 1 increase with education. This is pretty intuitive, but we have now quantified it with the given dataset.
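To make the link between the three logits and the category probabilities concrete, here is a small base R sketch. The intercepts are hypothetical, since the text does not report them. The first slope is the educ1 value quoted above, and the other two are backed out from the percentage changes quoted later in the article, so treat them as approximations.

```r
# Convert baseline-category logits (each vs. category 1) into probabilities
logits_to_probs <- function(eta) {
  expo <- c(1, exp(eta))  # the reference category has logit 0
  expo / sum(expo)
}

intercepts <- c(-0.5, -2.0, -3.5)                   # HYPOTHETICAL values
slopes     <- c(0.08969, log(1.2523), log(1.3582))  # educ1 from the text; others backed out

# Probabilities of health categories 1 to 4 at a given education level
probs_at <- function(educ) logits_to_probs(intercepts + slopes * educ)

p12 <- probs_at(12)
p13 <- probs_at(13)

# Odds of category 2 vs. category 1 at 12 and 13 years of education
odds_12 <- p12[2] / p12[1]
odds_13 <- p13[2] / p13[1]

odds_13 / odds_12  # equals exp(0.08969), whatever the intercepts are
```

The ratio of the two odds depends only on the slope, which is why a one-unit increase in educ can be summarised by a single number for each comparison.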

R also ran a hypothesis test on each coefficient. The null hypothesis states that the coefficient is zero, i.e. that the predictor has no effect on the corresponding log odds. The associated p-value for educ1 is 0.00853, which is < 0.01. Therefore, we can reject the null hypothesis at the 1% level and conclude that education plays a significant role in determining the log odds of category 2 with respect to category 1, which is the reference category.
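Assuming the reported test is the usual two-sided Wald z test (estimate divided by its standard error), the relationship between the z statistic and the p-value can be sketched in base R. The standard error is not given in the text, so the z statistic and standard error computed here are inferred from the quoted p-value rather than taken from the model output.

```r
estimate <- 0.08969  # educ1 coefficient quoted in the text
p_value  <- 0.00853  # p-value quoted in the text

# Two-sided p-value for a Wald z statistic
wald_p <- function(z) 2 * pnorm(-abs(z))

# z statistic and standard error implied by the quoted numbers
# (inferred for illustration, not reported in the article)
z_implied  <- qnorm(1 - p_value / 2)
se_implied <- estimate / z_implied

wald_p(z_implied)         # recovers the quoted p-value
wald_p(z_implied) < 0.01  # the null hypothesis is rejected at the 1% level
```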

We can also determine the odds ratios in R as below.

Odds ratio

These are simply the exponentials of the coefficients. They tell us that for every unit increase in educ, the odds of being in category 2 rather than category 1 are multiplied by a factor of 1.09350. To put it another way, the odds of being in category 2 rather than category 1 increase by 9.35% with every one-unit increase in educ.

The percentages are larger for the other models. With respect to reference category 1, the odds of being in category 3 increase by 25.23% and the odds of being in category 4 increase by 35.82% with every one-unit increase in educ.
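The conversion from odds ratios to percentage changes can be sketched in a few lines of base R. The value 1.09350 is quoted in the text, while the other two odds ratios are implied by the quoted percentages. The names `cat2_vs_1` and so on are labels chosen for this example. For a fitted VGAM model, the odds ratios themselves would typically come from applying exp() to coef() of the fitted object.

```r
# Odds ratios for a one-unit increase in educ, each vs. reference category 1
odds_ratio <- c(cat2_vs_1 = 1.09350, cat3_vs_1 = 1.2523, cat4_vs_1 = 1.3582)

# Percentage change in the odds per additional unit of educ
pct_change <- (odds_ratio - 1) * 100
round(pct_change, 2)

# Odds ratios compound multiplicatively over several units of educ
k <- 4
round(odds_ratio^k, 3)
```

Note that exp(0.08969) evaluates to roughly 1.0938 rather than 1.09350, so the coefficient and odds ratio quoted in the article presumably come from slightly different output.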

Conclusion

In this article, I have discussed the need for a multinomial logistic regression model and executed it in R. This type of regression is similar to binary logistic regression, except that several comparisons against a reference category are made at once. When the response has unordered categories, like political party affiliation, we can perform multinomial logistic regression.

Acknowledgement for Dataset

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science (CC BY 4.0)
