
Introduction
Logistic regression is one of the most popular forms of the generalized linear model. It comes in handy when you want to predict a binary outcome from a set of continuous and/or categorical predictor variables. In this article, I will walk through how to use logistic regression in R with an example dataset.
We will use infidelity data as our example dataset, known as Fair’s Affairs, which is based on a cross-sectional survey conducted by Psychology Today in 1969 and is described in Greene (2003) and Fair (1978). The data contains 9 variables collected on 601 respondents: how often they had affairs during the past year, as well as their age, gender, education, years married, whether they have children (yes/no), how religious they are (on a 5-point scale from 1=anti to 5=very), occupation (7-point classification), and a self-rating of happiness in their marriage (from 1=very unhappy to 5=very happy). A preview of the first few observations gives you a good overview of the data.
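Once the data is loaded (see the code block below), you can preview the raw rows yourself with standard base-R commands:
# Quick preview of the data (optional, standard base R)
> head(Affairs)  # first six respondents
> str(Affairs)   # variable types at a glance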

Applying logistic regression, we can find which factors contribute the most to infidelity. Then, you can use the model to check which one of you and your partner is more likely to have an affair 😜
But before that, we will run through some descriptive statistics with the code below to get a better understanding of our data.
# Created by Michaelino Mervisiano
> install.packages("AER")
> library("AER")
> data(Affairs, package="AER")
> View(Affairs)
> summary(Affairs)
    affairs        gender          age         yearsmarried     children
 Min.   : 0.0    female:315   Min.   :17.50   Min.   : 0.125   no :171
 1st Qu.: 0.0    male  :286   1st Qu.:27.00   1st Qu.: 4.000   yes:430
 Median : 0.0                 Median :32.00   Median : 7.000
 Mean   : 1.456               Mean   :32.49   Mean   : 8.178
 3rd Qu.: 0.0                 3rd Qu.:37.00   3rd Qu.:15.000
 Max.   :12.0                 Max.   :57.00   Max.   :15.000
 religiousness     education      occupation      rating
 Min.   :1        Min.   : 9.0   Min.   :1.0     Min.   :1.000
 1st Qu.:2        1st Qu.:14.0   1st Qu.:3.0     1st Qu.:3.000
 Median :3        Median :16.0   Median :5.0     Median :4.000
 Mean   :3.116    Mean   :16.17  Mean   :4.195   Mean   :3.932
 3rd Qu.:4        3rd Qu.:18.0   3rd Qu.:6.0     3rd Qu.:5.000
 Max.   :5        Max.   :20.0   Max.   :7.0     Max.   :5.000
> table(Affairs$affairs)
  0   1   2   3   7  12
451  34  17  19  42  38
From the summary above, we can see that there are 286 male respondents (48% of the overall sample), 430 respondents have children (72% of the sample), and the average age of our respondents is 32.5 years. In addition, we find that 451 respondents claimed not to have engaged in an affair in the past year. That means 25% of our respondents had at least one affair, with the largest number reported being 12. In other words, roughly 6% of respondents reported having an affair every month 😏 .
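If you want to double-check these proportions yourself, a few base-R one-liners will reproduce them:
# Sanity-checking the percentages quoted above
> prop.table(table(Affairs$gender))  # ~48% male
> mean(Affairs$affairs > 0)          # ~0.25 had at least one affair
> mean(Affairs$affairs == 12)        # ~0.06 reported 12 affairs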
As we are interested in a binary outcome for our response variable (had an affair/didn’t have an affair), we can transform affairs into a binary variable called ynaffair with the following code.
> Affairs$ynaffair[Affairs$affairs > 0] <- 1
> Affairs$ynaffair[Affairs$affairs == 0] <- 0
> Affairs$ynaffair <- factor(Affairs$ynaffair,levels=c(0,1), labels=c("No","Yes"))
> table(Affairs$ynaffair)
No Yes
451 150
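As a side note, the same transformation can be written more compactly with ifelse(), if you prefer a one-liner:
# Equivalent one-line version of the recoding above
> Affairs$ynaffair <- factor(ifelse(Affairs$affairs > 0, "Yes", "No"),
                             levels=c("No","Yes"))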
Fit the model with Logistic Regression
Now, we can run the logistic regression to measure the relationship between the response variable (affair) and the explanatory variables (age, gender, education, occupation, children, self-rating, etc.) in R.
> fit.full <- glm(ynaffair ~ gender + age + yearsmarried + children +
                  religiousness + education + occupation + rating,
                  data=Affairs, family=binomial())
> summary(fit.full)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.571 -0.750 -0.569 -0.254 2.519
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3773 0.8878 1.55 0.12081
gendermale 0.2803 0.2391 1.17 0.24108
age -0.0443 0.0182 -2.43 0.01530 *
yearsmarried 0.0948 0.0322 2.94 0.00326 **
childrenyes 0.3977 0.2915 1.36 0.17251
religiousness -0.3247 0.0898 -3.62 0.00030 ***
education 0.0211 0.0505 0.42 0.67685
occupation 0.0309 0.0718 0.43 0.66663
rating -0.4685 0.0909 -5.15 2.6e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 675.38 on 600 degrees of freedom
Residual deviance: 609.51 on 592 degrees of freedom
AIC: 627.5
Number of Fisher Scoring iterations: 4
If we look at Pr(>|z|), the p-values for the regression coefficients, we find that gender, presence of children, education, and occupation do not contribute significantly to the response variable. Therefore, we can fit a second model that includes only the significant variables (age, years married, religiousness, and rating) instead.
> fit.reduced <- glm(ynaffair ~ age + yearsmarried + religiousness
+ rating, data=Affairs, family=binomial())
> summary(fit.reduced)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.628 -0.755 -0.570 -0.262 2.400
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.9308 0.6103 3.16 0.00156 **
age -0.0353 0.0174 -2.03 0.04213 *
yearsmarried 0.1006 0.0292 3.44 0.00057 ***
religiousness -0.3290 0.0895 -3.68 0.00023 ***
rating -0.4614 0.0888 -5.19 2.1e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 675.38 on 600 degrees of freedom
Residual deviance: 615.36 on 596 degrees of freedom
AIC: 625.4
Number of Fisher Scoring iterations: 4
For the second model, we can see that the p-value for each regression coefficient is statistically significant. Next, we can run a chi-square test with the anova() function in R to compare the first and second models and see which one explains our response variable better.
> anova(fit.reduced, fit.full, test="Chisq")
Analysis of Deviance Table
Model 1: ynaffair ~ age + yearsmarried + religiousness + rating
Model 2: ynaffair ~ gender + age + yearsmarried + children +
    religiousness + education + occupation + rating
Resid.Df Resid.Dev Df Deviance P(>|Chi|)
1 596 615
2 592 610 4 5.85 0.21
The output above shows a nonsignificant chi-square value with p-value = 0.21. It means that the second model with only four predictors fits as well as the full model with eight predictors. This supports our initial belief that gender, children, education, and occupation add nothing to the prediction of infidelity (our response variable). Thus, we will continue the analysis with the second model, since interpretation is easier on the simpler model.
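As a quick cross-check, you can also compare the two fits by AIC; the lower value (already printed in the summaries above) again favors the reduced model:
# Compare models by AIC: fit.reduced (~625.4) beats fit.full (~627.5)
> AIC(fit.full, fit.reduced)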
Interpret the model parameters
Based on the regression coefficients from the second model, we see that the odds of having an affair increase with years married and decrease with age, religiousness, and happiness self-rating. We can read this off the positive or negative sign of each regression coefficient. In other words, the longer you are married, the more likely you are to have an affair. On the other hand, the happier you are in the marriage, the less likely you are to have an affair.
> coef(fit.reduced)
(Intercept) age yearsmarried religiousness rating
1.931 -0.035 0.101 -0.329 -0.461
Next, we want to know the impact of each of these variables on the affair outcome. First, we need to remember that logistic regression models the log(odds) that Y = 1. This implies that each regression coefficient is the change in log(odds) for a one-unit change in the corresponding predictor variable, holding all other predictor variables constant.
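Written out, the reduced model has the standard logistic-regression form:
log(P(Y=1) / (1 − P(Y=1))) = β₀ + β₁·age + β₂·yearsmarried + β₃·religiousness + β₄·rating
where the β values are the coefficients printed above.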
Since log(odds) are hard to interpret, we transform them by exponentiating the coefficients as follows.
> exp(coef(fit.reduced))
(Intercept) age yearsmarried religiousness rating
6.895 0.965 1.106 0.720 0.630
We observe that the odds of having an affair are multiplied by a factor of 1.106 for every one-year increase in years married (holding age, religiousness, and happiness rating constant). Conversely, the odds of having an affair are multiplied by a factor of 0.965 for every one-year increase in age; in other words, the odds of having an affair drop by about 3.5% for each additional year of age.
Furthermore, for an n-unit change in a predictor variable, the odds of the higher value on the response variable change by a factor of exp(βj)^n. For example, a 15-year increase in years married would increase the odds by a factor of 1.106¹⁵≈4.5, holding the other predictor variables constant.
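Both of these numbers can be reproduced directly from the fitted model as a quick sanity check:
# Verify the odds-ratio arithmetic from the coefficients
> exp(coef(fit.reduced)["age"])              # ~0.965, odds multiplier per year of age
> (exp(coef(fit.reduced)["age"]) - 1) * 100  # ~ -3.5% change in odds per year
> exp(coef(fit.reduced)["yearsmarried"])^15  # ~4.5 for a 15-year increase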
Predict the outcome using new data
In this section, we use the model we built to predict the outcome for new data. First, we create a new data frame containing the values of the predictor variables we are interested in. Second, we apply the predict() function in R to estimate the probability of the outcome event for the values in the new data.
Consider the new data below, where we have 5 new respondents with different self-ratings, holding the other variables at the averages of the overall data. Then, we apply the predict function to get the probability of having an affair for each of these new respondents.
> newdata1 <- data.frame(rating=c(1,2,3,4,5),age=mean(Affairs$age),
yearsmarried=mean(Affairs$yearsmarried),
religiousness=mean(Affairs$religiousness))
> newdata1
rating age yearsmarried religiousness
1 1 32.5 8.18 3.12
2 2 32.5 8.18 3.12
3 3 32.5 8.18 3.12
4 4 32.5 8.18 3.12
5 5 32.5 8.18 3.12
> newdata1$prob <- predict(fit.reduced, newdata=newdata1,
type="response")
> newdata1
rating age yearsmarried religiousness prob
1 1 32.5 8.18 3.12 0.530
2 2 32.5 8.18 3.12 0.416
3 3 32.5 8.18 3.12 0.310
4 4 32.5 8.18 3.12 0.220
5 5 32.5 8.18 3.12 0.151
Clearly, we see that the probability of having an affair declines from 0.53 when the marriage is rated 1="very unhappy" to 0.15 when the marriage is rated 5="very happy" (holding the other predictor variables constant). This indicates that the unhappy couple is more than three times as likely to have an affair as the happy one.
Let’s create another set of new data to observe the impact of age on infidelity.
> newdata2 <- data.frame(rating=mean(Affairs$rating),
age=c(17,27,37,47,57),
yearsmarried=mean(Affairs$yearsmarried),
religiousness=mean(Affairs$religiousness))
> newdata2
rating age yearsmarried religiousness
1 3.93 17 8.18 3.12
2 3.93 27 8.18 3.12
3 3.93 37 8.18 3.12
4 3.93 47 8.18 3.12
5 3.93 57 8.18 3.12
> newdata2$prob <- predict(fit.reduced, newdata=newdata2,
type="response")
> newdata2
rating age yearsmarried religiousness prob
1 3.93 17 8.18 3.12 0.335
2 3.93 27 8.18 3.12 0.262
3 3.93 37 8.18 3.12 0.199
4 3.93 47 8.18 3.12 0.149
5 3.93 57 8.18 3.12 0.109
Here, we see that as age increases from 17 to 57, the probability of having an affair declines from 0.34 to 0.11, holding the other variables constant. If you are interested in exploring the impact of other predictor variables, or in predicting for other new data, you can use this same approach.
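For example, you could build a third data frame (let’s call it newdata3) that varies religiousness from 1 to 5 in the same way:
# Probability of an affair across religiousness levels, other predictors at their means
> newdata3 <- data.frame(rating=mean(Affairs$rating), age=mean(Affairs$age),
                         yearsmarried=mean(Affairs$yearsmarried),
                         religiousness=c(1,2,3,4,5))
> newdata3$prob <- predict(fit.reduced, newdata=newdata3, type="response")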
Evaluate overdispersion
In logistic regression, we also need to check for overdispersion. The expected variance for data drawn from a binomial distribution is σ² = nπ(1 − π), where n is the number of observations and π is the probability of belonging to the Y = 1 group.
Overdispersion occurs when the data exhibit more variability than expected under the assumed distribution. If overdispersion is present in a dataset, the estimated standard errors and test statistics for the overall goodness of fit will be distorted, and adjustments must be made. One solution is to use the quasibinomial distribution rather than the binomial distribution in the glm() function in R.
There are two ways to verify whether we have an overdispersion issue or not:
With the first method, we check overdispersion by dividing the residual deviance by the residual degrees of freedom of our binomial model:
φ = Residual deviance / Residual degrees of freedom
If this ratio is considerably larger than 1, it indicates an overdispersion issue. Calculating the ratio for our example data, we find that it is close to 1, meaning there is no overdispersion problem in our model.
> deviance(fit.reduced)/df.residual(fit.reduced)
[1] 1.032
With the second method, we use two model fits to check for overdispersion. Basically, we fit the same logistic regression twice, once with the binomial distribution and once with the quasibinomial distribution. Then, we test whether the dispersion parameter estimated from the quasibinomial fit is significantly different from 1.
> fit <- glm(ynaffair ~ age + yearsmarried + religiousness +
rating, family = binomial(), data = Affairs)
> fit.od <- glm(ynaffair ~ age + yearsmarried + religiousness +
rating, family = quasibinomial(), data = Affairs)
> pchisq(summary(fit.od)$dispersion * fit$df.residual,
fit$df.residual, lower = F)
[1] 0.34
We find that the p-value = 0.34 is clearly not significant (p > 0.05), strengthening our belief that overdispersion isn’t a problem in our model.
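Had the test come out significant, the practical fix (as noted earlier) would be to base the inference on the quasibinomial fit instead, since its standard errors are scaled by the estimated dispersion:
# Use the quasibinomial fit for inference when overdispersion is present
> summary(fit.od)  # coefficients with dispersion-adjusted standard errors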
I hope you found this article useful. Kindly share it with others!
Cheers,
Michaelino Mervisiano