
Generalized Ordinal Regression Model in R

Statistics in R Series

Photo by Antoine Dautry on Unsplash

Introduction

We have gone through several logistic regression models that can perform both simple and multiple regression analysis for binary and ordinal response variables. The outcomes of these logistic regression models are typically the coefficients of the predictor variables. One of the fundamental assumptions we made is the proportional odds assumption. Under this assumption, the coefficients of the predictor variables do not vary across the levels of the outcome. For example, if the response variable has three levels, we assume that the same coefficient applies to the transition from the first level to the second level and to the transition from the second level to the third level. In practice, the coefficients may differ across levels, and that is the case for many real-world data sets. We therefore need models that do not rely on the proportional odds assumption.

If we allow the coefficients of all predictors to vary across the levels of the response variable, the model is typically called a generalized ordinal regression model. If we allow the coefficients of only some of the predictors to vary, the model is called a partial proportional odds model. In this article we will go through the basics and the implementation of the generalized ordinal regression model in R.
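The distinction can be written compactly. The notation below is an assumption introduced here for illustration: Y is the ordinal response with J categories, x_1, ..., x_p are the predictors, and the logit is taken on the probability of being at or below category j, with a plus sign on the coefficients. Other software may use a different sign convention.

```latex
% Proportional odds model: one coefficient per predictor, shared by all J-1 cutpoints
\operatorname{logit} P(Y \le j \mid x) = \alpha_j + \beta_1 x_1 + \dots + \beta_p x_p,
\qquad j = 1, \dots, J-1

% Generalized ordinal model: every coefficient may differ from cutpoint to cutpoint
\operatorname{logit} P(Y \le j \mid x) = \alpha_j + \beta_{1j} x_1 + \dots + \beta_{pj} x_p,
\qquad j = 1, \dots, J-1
```

In a partial proportional odds model, only some of the coefficients carry the extra subscript j, while the rest stay common to all cutpoints.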

Simplification of the Idea

The proportional odds assumption can be clarified with the help of an example. Let’s assume that we have collected data on hundreds of people. The data contain their education, age, marital status, health status, gender, family income and full-time working status. We want to build a regression model for health status with education, gender, marital status and family income as predictor variables. All the predictor variables except education are binary, meaning that they take the value 0 or 1. Education is continuous and represents the number of years of education an individual has completed.

  • Education: numeric and continuous
  • Marital status: binary (0 for unmarried and 1 for married)
  • Gender: binary (0 for female and 1 for male)
  • Family income: binary (0 for average or less than average and 1 for more than average)
  • Health status: ordinal (1 for poor, 2 for average, 3 for good and 4 for excellent)

If we perform an ordinal logistic regression under the proportional odds assumption, we end up with a single coefficient for each predictor variable. Let’s say the coefficient for family income is ‘x’. This means that for a one-unit increase in family income (in this case from 0 to 1, since the variable is binary), the log odds (logit) of being above a given health status category, rather than at or below it, change by ‘x’. The same ‘x’ applies at every category boundary, so all of the statements below hold for this model.

  1. The log odds of being at average health or better, rather than at poor health, change by ‘x’ if family income increases to above-average status.
  2. The log odds of being at good health or better, rather than at average health or below, change by ‘x’ if family income increases to above-average status.
  3. The log odds of being at excellent health, rather than at good health or below, change by ‘x’ if family income increases to above-average status.

This is the essence of the proportional odds model: the change in log odds remains the same across all levels of the outcome. This assumption is often violated in real-world data, and in that case we cannot proceed with the proportional odds model.
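The statements above can be checked numerically with a small sketch in base R. The baseline log odds and the value chosen for ‘x’ are hypothetical numbers picked only for illustration; they are not estimates from any data set.

```r
# Hypothetical baseline log odds of being ABOVE category 1, 2 and 3
# when family income = 0 (values chosen only for illustration)
base_logit <- c(above1 = 1.0, above2 = -0.5, above3 = -2.0)
x_coef <- 0.7  # hypothetical single coefficient 'x' for family income

# Probabilities of being above each category for income = 0 and income = 1
p_above_income0 <- plogis(base_logit)
p_above_income1 <- plogis(base_logit + x_coef)

# Change in log odds at each boundary when family income goes from 0 to 1
log_odds_shift <- qlogis(p_above_income1) - qlogis(p_above_income0)
print(round(log_odds_shift, 3))

# Convert "above category j" probabilities into the four category probabilities
cat_probs <- function(p_above) -diff(c(1, unname(p_above), 0))
probs <- rbind(income0 = cat_probs(p_above_income0),
               income1 = cat_probs(p_above_income1))
colnames(probs) <- c("poor", "average", "good", "excellent")
print(round(probs, 3))
```

Under these assumptions the shift in log odds is the same at all three boundaries, even though the category probabilities themselves change by different amounts. A generalized ordinal model would instead allow a different shift at each boundary.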

Solution

There are two possible ways to handle variables that violate the proportional odds assumption. One approach is the generalized ordinal logit model, in which the effects of all the predictor variables are allowed to vary across the levels of the outcome. Alternatively, only some of the explanatory variables may be allowed to vary; this model is known as the partial proportional odds model.

  • Generalized ordinal regression model -> the effects of all predictors can vary across the levels of the outcome
  • Partial proportional odds model -> the effects of only some predictors are allowed to vary across the levels of the outcome

In the partial proportional odds model, only the variables that violate the proportional odds assumption are allowed to vary. The others are held constant.

Dataset

The data source for this case study is the UCI Machine Learning Repository’s Adult Data Set. The dataset describes approximately 30000 individuals by their demographic information, which includes, but is not limited to, race, education, occupation, gender, salary, working hours per week, employment status and income.

Adult Data Set from UCI Machine Learning Repository

To implement the model in R, we will make some modifications to the raw data. The modified variables are exactly those described in the "Simplification of the Idea" section above.

  • Education: numeric and continuous. Education can have a great impact on health status.
  • Marital status: binary (0 for unmarried and 1 for married). I expect this variable to have a smaller impact, but it is included nevertheless.
  • Gender: binary (0 for female and 1 for male). Its impact may also be small, but it will be interesting to find out.
  • Family income: binary (0 for average or less than average and 1 for more than average). This may have an impact on health condition.
  • Health status: ordinal (1 for poor, 2 for average, 3 for good and 4 for excellent)
Modified data to be used

Implementation in R

First of all, we need to import the required libraries and load the data. Here, I have used the vglm() command from the VGAM package to implement both the proportional odds model and the generalized ordinal model. The only difference between the two commands is the parallel argument:

  • For the PO (proportional odds) model: parallel = TRUE
  • For the generalized ordinal model: parallel = FALSE

Using vglm() with parallel = TRUE fits the same proportional odds model as the clm() command that we used before, although the two packages parameterize the model differently, so the coefficients can appear with opposite signs.
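As a rough sketch of what the two calls can look like, the block below fits both models with vglm() and the cumulative() family from VGAM. It runs on simulated data rather than on the modified Adult data: the data frame `dat`, the column names `marital`, `gender`, `faminc` and `health`, the object name `model_po`, and all simulation settings are assumptions made for illustration, so the estimates will not match the results discussed in this article.

```r
library(VGAM)  # provides vglm() and the cumulative() family

set.seed(123)
n <- 2000

# Simulated stand-in for the modified data (hypothetical column names and values)
dat <- data.frame(
  educ    = sample(8:20, n, replace = TRUE),  # years of education
  marital = rbinom(n, 1, 0.5),                # 1 = married
  gender  = rbinom(n, 1, 0.5),                # 1 = male
  faminc  = rbinom(n, 1, 0.4)                 # 1 = above-average family income
)

# Latent health score: larger values mean better health
latent <- 0.15 * dat$educ + 0.4 * dat$marital + 0.1 * dat$gender +
  0.5 * dat$faminc + rlogis(n)

# Cut the latent score into four ordered categories (1 = poor ... 4 = excellent)
dat$health <- cut(latent, breaks = c(-Inf, 1, 2.5, 4, Inf),
                  labels = 1:4, ordered_result = TRUE)

# Proportional odds model: one slope per predictor, shared by all cutpoints
model_po <- vglm(health ~ educ + marital + gender + faminc,
                 family = cumulative(parallel = TRUE), data = dat)

# Generalized ordinal model: a separate slope per predictor at each cutpoint
model1 <- vglm(health ~ educ + marital + gender + faminc,
               family = cumulative(parallel = FALSE), data = dat)

# Coefficients arranged with one column per cumulative logit
print(coef(model_po, matrix = TRUE))
print(coef(model1, matrix = TRUE))
```

In the proportional odds fit, each predictor row shows the same value in all three columns, while the generalized fit reports three separate values per predictor. Calling summary(model1) adds standard errors and test statistics.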

Interpretation of Result

Generalized Ordinal Model Summary

From the result summary, we observe three coefficient estimates for each predictor, and the coefficients for all the predictors are negative. This model estimates the log odds (logit) of being at or below a specific category. The categories are also shown in the last line of the result above. Since health status has four possible outcomes (1 for poor, 2 for average, 3 for good and 4 for excellent), there are three underlying binary models working behind the scenes.

  • logitlink(P[Y≤1]) is the logit or log odds of being at or below the first category
  • logitlink(P[Y≤2]) is the logit or log odds of being at or below the second category
  • logitlink(P[Y≤3]) is the logit or log odds of being at or below the third category

For example, if we consider marital status:

  • The log odds of being at category 1 of health status, rather than above category 1, change by -0.572 for every unit change in marital status (in this case from 0 to 1, or from unmarried to married).
  • The log odds of being at or below category 2 of health status, rather than above category 2, change by -0.404 for every unit change in marital status (in this case from 0 to 1, or from unmarried to married).
  • The log odds of being at or below category 3 of health status, rather than above category 3, change by -0.12 for every unit change in marital status (in this case from 0 to 1, or from unmarried to married).

We can obtain the odds ratios by executing the code below:

exp(coef(model1, matrix = TRUE))

Odds ratios

The result above shows that the three odds ratios for marital status are 0.564, 0.667 and 0.886 respectively. From here, we can conclude that:

  • The odds of being at category 1 (poorer health) versus being above category 1 (better health) for married persons are 0.564 times the odds for unmarried persons.
  • The odds of being at or below category 2 (poorer health) versus being above category 2 (better health) for married persons are 0.667 times the odds for unmarried persons.
  • The odds of being at or below category 3 (poorer health) versus being above category 3 (better health) for married persons are 0.886 times the odds for unmarried persons.

We can say that being married decreases the odds of being at or below a health status category. Alternatively, being married is associated with a higher chance of being in a better health status category.
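The link between the coefficients and the odds ratios can be verified by hand. The sketch below exponentiates the three marital status coefficients quoted earlier; the vector names are labels chosen here for readability. Because the quoted coefficients are rounded to three decimals, the results may differ from the reported odds ratios in the last digit.

```r
# Marital status coefficients as quoted in the text (rounded)
marital_coef <- c("P[Y<=1]" = -0.572, "P[Y<=2]" = -0.404, "P[Y<=3]" = -0.12)

# An odds ratio is the exponentiated coefficient
marital_or <- exp(marital_coef)
print(round(marital_or, 3))

# Percent change in the odds of being at or below each category
# for married versus unmarried persons (negative = lower odds)
pct_change <- (marital_or - 1) * 100
print(round(pct_change, 1))
```

All three odds ratios are below 1, and they move closer to 1 at the higher cutpoints, which is exactly the kind of variation that a single proportional odds coefficient could not capture.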

We can also draw conclusions for the ‘educ’ predictor:

  • The odds of being at category 1 (poorer health) rather than above it are multiplied by 0.839, a decrease of about 16%, for every additional year of education.
  • The odds of being at or below category 2 (poorer health) rather than above it are multiplied by 0.857, a decrease of about 14%, for every additional year of education.
  • The odds of being at or below category 3 (poorer health) rather than above it are multiplied by 0.878, a decrease of about 12%, for every additional year of education.

Therefore, education is also positively associated with health status.
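Since education is measured in years, the per-year odds ratios compound over several years. The short sketch below uses the three odds ratios quoted above; the choice of four additional years is an arbitrary illustration.

```r
# Per-year odds ratios for 'educ' as quoted in the text
educ_or <- c("P[Y<=1]" = 0.839, "P[Y<=2]" = 0.857, "P[Y<=3]" = 0.878)

# Back on the log-odds scale these are the (negative) model coefficients
educ_coef <- log(educ_or)
print(round(educ_coef, 3))

# Percent decrease in the odds of poorer health per additional year
pct_decrease <- (1 - educ_or) * 100
print(round(pct_decrease, 1))

# Effect of four additional years: coefficients add up, so odds ratios multiply
extra_years <- 4
or_four_years <- exp(extra_years * educ_coef)
print(round(or_four_years, 3))
```

Under the model, the odds ratio for several additional years is the per-year odds ratio raised to that power, so small per-year effects can add up to a substantial difference.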

Conclusion

We have gone through the fundamental idea behind the generalized ordinal logistic regression model and the partial proportional odds model. In this article, the implementation of the generalized ordinal logistic regression model in R was demonstrated and the results were interpreted. In a future article, we will implement the partial proportional odds model, where only the variables that violate the PO assumption are allowed to vary.

Acknowledgement for Dataset

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science (CC BY 4.0)

Thanks for reading.
