
How a data scientist goes on a blind date

Ordinal regression in a Bayesian framework


Photo by Polina Tankilevitch on Pexels

When a data scientist prepares for a blind date, he or she cannot help pondering love, existential loneliness, and mutual trust. Well… not all data scientists probably think this way… but some might. Anyway, I would like to use the last item, mutual trust, to introduce the topic of ordinal regression in a Bayesian framework to you.

Since the upcoming date will be a blind date, we cannot be sure about the opinions and beliefs of the other person. However, in psychology a lot of research has been done on charting personalities. Standardized questionnaires and tests are applied to assess psychological traits. Frequently such questionnaires consist of rating scales: the respondents answer by choosing from a discrete set of ordered options, like ‘fully disagree’, ‘disagree’, ‘no opinion’, ‘agree’, and ‘fully agree’. This type of rating scale is called a Likert scale. In this blog we will use an example of such a questionnaire to analyze the answers to a particular question, and how those answers depend on characteristics such as gender and age.

Machiavelli

The data in this blog comes from the MACH-IV questionnaire by Christie and Geis, which provides a measure of Machiavellianism. Named after Niccolò Machiavelli, Machiavellianism is used in modern psychology to describe a lack of empathy and morality, and a strong focus on personal gain. The MACH-IV questionnaire consists of twenty statements taken from Machiavelli’s work, and respondents rate their agreement on a five-point scale from ‘strongly disagree’ to ‘strongly agree’. The data set can be downloaded here and is really large; for the purpose of this report I will use only the data that was submitted from a Dutch network location. Additionally, I focus on only one of the twenty questions: "All in all, it is better to be humble and honest than to be important and dishonest". The figure below is based on Heiberger and Robbins and shows how the outcome is associated with age and gender.
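
As an illustration, a minimal R sketch of how such a subset could be prepared. The file name and all column names (country, gender, age, and humble_honest_item for the chosen statement) are placeholders and need to be matched to the actual download.

```{r}
library(dplyr)

# Column names below are placeholders; check the codebook of the download
mach_raw <- read.csv("mach_data.csv")

mach <- mach_raw %>%
  filter(country == "NL",                   # responses from a Dutch network location
         gender %in% c(1, 2)) %>%           # assumed coding: 1 = male, 2 = female
  transmute(answer = humble_honest_item,    # answer to the chosen statement (1-5)
            age    = age,
            gender = gender - 1)            # recode to 0 = male, 1 = female
```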

Diverging stacked bar graph showing the answers to the chosen question for different genders and age categories. The total row counts are plotted on the right-hand axis. Find more details on this type of plot in my previous blog.

The figure shows that older people and females tend to agree more often with this statement. When trying to model the outcomes of this Likert scale as a function of the independent variables age and gender, it is best to use ordinal regression. Therefore, in this report I would like to explore this type of regression and how to perform it in a Bayesian framework.

Model

Since the outcome we want to predict is on an ordinal scale, we need to create an ordinal regression model in the Bayesian framework. That can be done as follows. The answers yᵢ ∈ {1, . . . , K} fall into K ordered answer categories for i = 1, . . . , n samples. The model assumes that yᵢ is the observed realization of a latent – unobserved – continuous quantity yᵢ*. Additionally, the model defines cut-off points αₖ such that yᵢ = k whenever αₖ₋₁ < yᵢ* ≤ αₖ.

The latent quantity is linked to the independent variables through a linear predictor μᵢ = β∗xᵢ, so that the probability that yᵢ* falls at or below cut point αₖ equals logit⁻¹(αₖ − μᵢ). Note that we assume that β does not depend on k in this model: the effect of the predictors is the same for each cut point. Furthermore we define α₀ = −∞ and α_K = ∞, which translates into the following description: yᵢ ~ Cat(pᵢₖ), with pᵢₖ = logit⁻¹(αₖ − μᵢ) − logit⁻¹(αₖ₋₁ − μᵢ),

where Cat(pᵢₖ) is the categorical distribution, for which pᵢₖ ≥ 0 and ∑ₖ pᵢₖ = 1. The variable pᵢₖ indicates the probability that sample i falls into category k. The cut points αₖ and the regression coefficients β need to be estimated from the data. In a Bayesian framework we need to set normal priors for the cut points αₖ, k = 1, . . . , K − 1, and for the regression coefficients β. In our example we have K = 5 answer options, n = 1062 samples, and two regression variables (age and gender). Let’s use fairly uninformative priors with σ² = 1000.
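
To make this construction concrete, here is a small R sketch (with made-up values for α and β, chosen purely for illustration) that computes the category probabilities pᵢₖ for a single respondent; it mirrors the cumulative-logit calculation used in the JAGS model below.

```{r}
alpha <- c(-2, -1, 0.5, 2)      # illustrative ordered cut points (K = 5 categories)
beta  <- c(0.02, 0.5)           # illustrative effects of age and gender

x  <- c(age = 30, gender = 1)   # one hypothetical respondent
mu <- sum(beta * x)             # linear predictor

Q <- plogis(alpha - mu)         # cumulative probabilities P(y <= k)
p <- diff(c(0, Q, 1))           # category probabilities; non-negative and summing to 1
round(p, 3)
```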

The numerical implementation of this model can be done in many different ways. In this blog I describe an implementation in R using the rjags package. For a proper numerical estimation of the variables it is important to sort the cut-offs α. To help the model get started we also need to supply ordered initial values for α. The model description in JAGS looks as follows:

```{r model, message=FALSE, cache=TRUE, results='hide'}
jags_model <- " model {
  for(i in 1:length(answer)) {
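    # each answer follows a categorical distribution over the five options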
    answer[i]  ~ dcat(p[i, 1:5])

    logit(Q[i, 1]) <- alpha[1] - mu[i]   # Q[i, j] = P(answer[i] <= j)
    p[i, 1] <- Q[i, 1]
    for (j in 2:4) {
      logit(Q[i, j]) <- alpha[j] - mu[i]
      p[i, j] <- Q[i, j] - Q[i, j-1]
    }
    p[i, 5] <- 1 - Q[i, 4] 

    mu[i] <- beta[1] * age[i] + beta[2] * gender[i] # no intercept
  }

  ## priors over thresholds
  for(j in 1:4) {
    alpha0[j] ~ dnorm(0, 1.0 / 1.0e3)
  }
  alpha <- sort(alpha0)   # impose the ordering of the cut points

  # Priors for regression coefficients
  for(j in 1:2) {
    beta[j] ~ dnorm(0, 1.0 / 1.0e3)
  }
} "
```
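
Once the model string is defined, it can be compiled and sampled with rjags. The sketch below shows one way to do this; the data list builds on the `mach` data frame sketched earlier, the ordered (random) initial values for the unsorted cut points help the sampler get started, and the thinning factor of 10 matches the results discussed below. The numbers of chains and iterations are illustrative choices.

```{r}
library(rjags)

jags_data <- list(answer = mach$answer, age = mach$age, gender = mach$gender)

# Ordered initial values for the unsorted cut points, different for each chain
jags_inits <- function() list(alpha0 = sort(rnorm(4)), beta = rnorm(2, 0, 0.1))

model <- jags.model(textConnection(jags_model), data = jags_data,
                    inits = jags_inits, n.chains = 3)
update(model, n.iter = 1000)                        # burn-in

samples <- coda.samples(model, variable.names = c("alpha", "beta"),
                        n.iter = 10000, thin = 10)  # thinning factor of 10
```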

Results

Space does not allow showing all diagnostic figures here. Instead, the table below shows convergence, autocorrelation, effective sample sizes, and point estimates for the variables. The Gelman statistics are close to one, which indicates appropriate convergence. The autocorrelation coefficients are rather high, even though a thinning factor of 10 has been applied. This high correlation is also reflected in the low effective sample size values. The point estimates show an ordered α, and positive associations with age and gender: older persons and females tend to agree more with the statement.
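
These diagnostics can be obtained directly from the coda samples produced above; a minimal sketch:

```{r}
library(coda)

gelman.diag(samples)      # Gelman-Rubin statistics; values close to 1 indicate convergence
autocorr.diag(samples)    # autocorrelation at increasing lags
effectiveSize(samples)    # effective sample sizes
summary(samples)          # point estimates and credible intervals
```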

For ordinal regression it is rather difficult to define and interpret residuals. Following Liu and Zhang, we calculate surrogate residuals based on the latent variable. The plots below show little pattern in the residuals as a function of the mean response. The qq-plot shows deviations from the expected straight line at both ends, suggesting non-linear behaviour. There is no autocorrelation in the residuals, which is good.
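
A sketch of how such surrogate residuals can be computed by hand from the posterior: sample the latent variable from its logistic distribution, truncated to the interval belonging to the observed answer, and subtract its expectation. The code uses posterior means extracted from `samples` and the `mach` data frame from the earlier sketches.

```{r}
post      <- summary(samples)$statistics[, "Mean"]
alpha_hat <- post[grep("^alpha", names(post))]
beta_hat  <- post[grep("^beta",  names(post))]

mu_hat <- beta_hat[1] * mach$age + beta_hat[2] * mach$gender
cuts   <- c(-Inf, alpha_hat, Inf)

# Inverse-CDF draw from the logistic distribution, truncated to the observed category
lower <- plogis(cuts[mach$answer]     - mu_hat)
upper <- plogis(cuts[mach$answer + 1] - mu_hat)
surrogate_resid <- qlogis(runif(length(mu_hat), lower, upper))

plot(mu_hat, surrogate_resid)                              # residuals vs. linear predictor
qqplot(rlogis(length(surrogate_resid)), surrogate_resid)   # qq-plot against the logistic
```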

Frequently, effect plots are created to visualize the results of ordinal regression. Using the outcomes of the JAGS simulation we can recalculate the pᵢₖ for each entry in the simulation. These values can be plotted against age and gender to create the plot below. The plot shows the predictions from the simulation. As can be seen from the plot, older people tend to agree more often with the statement, and females also agree more often than men do.
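
One way to compute such an effect plot from the posterior draws is sketched below: for a grid of ages and both genders, the category probabilities are calculated for every draw and then averaged (gender coded 0 = male, 1 = female, as in the earlier sketches).

```{r}
draws <- as.matrix(samples)    # one row per posterior draw; columns alpha[1..4], beta[1..2]
ages  <- 18:70

# Posterior mean probability of each answer category for a given age and gender
effect_probs <- function(age, gender) {
  mu <- draws[, "beta[1]"] * age + draws[, "beta[2]"] * gender
  Q  <- sapply(1:4, function(k) plogis(draws[, paste0("alpha[", k, "]")] - mu))
  p  <- cbind(Q, 1) - cbind(0, Q)     # p_k = Q_k - Q_{k-1}
  colMeans(p)
}

p_male   <- t(sapply(ages, effect_probs, gender = 0))
p_female <- t(sapply(ages, effect_probs, gender = 1))

# One line per answer category; repeat with p_male to compare the genders
matplot(ages, p_female, type = "l", lty = 1, xlab = "age", ylab = "probability")
```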

Predictions for the data can be made by calculating the pᵢₖ for each entry in the data set. Subsequently, a draw from the categorical distribution Cat(pᵢₖ) can be made. Since this involves a random draw, we repeat the process 100 times. The result can be used to create a contingency table. The table below shows that the performance is not very good, with an accuracy of only 23%.
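
A sketch of this posterior predictive check, reusing the posterior means `alpha_hat` and `beta_hat` and the `mach` data frame from the earlier sketches:

```{r}
set.seed(1)

# Category probabilities per respondent, based on the posterior mean estimates
mu_hat <- beta_hat[1] * mach$age + beta_hat[2] * mach$gender
Q      <- sapply(1:4, function(k) plogis(alpha_hat[k] - mu_hat))
p      <- cbind(Q, 1) - cbind(0, Q)

# Repeat the random draw from Cat(p_i) 100 times, then tabulate against the data
pred <- replicate(100, apply(p, 1, function(pr) sample(1:5, 1, prob = pr)))
table(observed = rep(mach$answer, 100), predicted = as.vector(pred))
mean(pred == mach$answer)    # overall accuracy
```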

Applying the model

Suppose a 30-year-old boy meets a 28-year-old girl. What is the probability that he answered at least as positively as she did on this question? We can calculate the probabilities pᵢₖ for each entry in the simulation using the characteristics of the boy and the girl separately. Subsequently, a draw from the categorical distribution Cat(pᵢₖ) can be made. The plot below shows the distributions of the latent variable, which are substantially different. There is only a 0.5% probability that the latent variable is higher for the boy. When looking at the answer categories, simulated by the draws from the categorical distribution, we find a 55% probability that the boy answered at least as positively as the girl.
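
One way to carry out this comparison from the posterior draws (again with gender coded 0 = male, 1 = female):

```{r}
mu_boy  <- draws[, "beta[1]"] * 30 + draws[, "beta[2]"] * 0   # 30-year-old male
mu_girl <- draws[, "beta[1]"] * 28 + draws[, "beta[2]"] * 1   # 28-year-old female

# Posterior distributions of the location of the latent variable for both persons
plot(density(mu_girl), main = "", xlab = "latent variable")
lines(density(mu_boy), lty = 2)

mean(mu_boy > mu_girl)   # probability that the latent variable is higher for the boy

# Simulate answer categories from the categorical distribution for each draw
draw_answer <- function(mu) {
  Q <- sapply(1:4, function(k) plogis(draws[, paste0("alpha[", k, "]")] - mu))
  p <- cbind(Q, 1) - cbind(0, Q)
  apply(p, 1, function(pr) sample(1:5, 1, prob = pr))
}
mean(draw_answer(mu_boy) >= draw_answer(mu_girl))   # boy at least as positive as the girl
```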

Closing words

Ordinal regression in the Bayesian framework requires some effort. The resulting model suffers quite a bit from high autocorrelation, and is only slightly able to predict the correct response. Quite likely, other factors not considered in this model also have an influence. However, the Bayesian framework does make it fairly easy to estimate the probabilities of the answers and to compare them.

