
Comparison of the Logistic Regression, Decision Tree, and Random Forest Models to Predict Red Wine Quality

Comparison of supervised machine learning models to predict red wine quality in R

Photo by Kym Ellis on Unsplash

In the following project, I applied three different machine learning algorithms to predict the quality of red wine. The dataset I used is the Wine Quality Data Set (specifically the "winequality-red.csv" file) from the UCI Machine Learning Repository.

The dataset contains 1,599 observations and 12 attributes related to the red variants of the Portuguese "Vinho Verde" wine. Each row describes the physicochemical properties of one bottle of wine. The first 11 attributes are numeric independent variables describing these characteristics, and the last one is the dependent variable, which reveals the quality of the wine on a scale from 0 (bad quality wine) to 10 (good quality wine) based on sensory data.

Since the outcome variable is ordinal, I chose logistic regression, decision trees, and random forest classification algorithms to answer the following questions:

  1. Which machine learning algorithm will enable the most accurate prediction of wine quality from its physicochemical properties?
  2. What physicochemical properties of red wine have the highest impact on its quality?

For the following project, I used the R programming language to explore, prepare, and model the data.

Importing the dataset

Once the working directory is set and the dataset is downloaded to our computer, I imported the data.

#Importing the dataset
data <- read.csv('winequality-red.csv', sep = ';')
str(data)
Results from the str() function
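
As a side note, if you prefer not to download the file manually, the UCI repository also serves the CSV directly over HTTP. A minimal sketch (the URL reflects the repository layout at the time of writing and may change):

#Optional: fetch the dataset straight from the UCI repository
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
data <- read.csv(url, sep = ';')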

With the str() function, we can see that all the variables are numeric, which is the correct format for the predictors but not for the outcome variable. I proceeded to transform the dependent variable into a binary categorical response.

#Format outcome variable
data$quality <- ifelse(data$quality >= 7, 1, 0)
data$quality <- factor(data$quality, levels = c(0, 1))

The arbitrary criterion I selected to recode the levels of the outcome variable is as follows:

  1. Values greater than or equal to seven will be changed to 1, meaning a good quality wine.
  2. On the other hand, values below seven will be converted to 0, indicating a bad or mediocre quality wine.

Furthermore, I converted the variable "quality" to a factor, indicating that the variable is categorical.

Exploratory data analysis (EDA)

Next, I performed an EDA on the data to find essential insights and to identify specific relationships between the variables.

First, I performed a descriptive analysis, collecting the summary statistics of the data with the summary() function.

#Descriptive statistics
summary(data)
Summary values of each of the variables in the data

The image shows the summary values of each of the variables in the data. In other words, with this function, I obtained the minimum and maximum, the 1st and 3rd quartiles, and the mean and median of the numerical variables. Additionally, the summary shows the frequency of each level of the dependent variable.

Next, I developed a univariate analysis, which consists of examining each of the variables separately. First, I analyzed the dependent variable.

To analyze the outcome variable, I created a bar plot to visualize the frequency of the categorical levels. I also generated frequency tables to know the exact count and percentage of values in each level.

#Univariate analysis
  #Dependent variable
    #Frequency plot
par(mfrow=c(1,1))
barplot(table(data[[12]]), 
        main = sprintf('Frequency plot of the variable: %s', 
                       colnames(data[12])),
        xlab = colnames(data[12]),
        ylab = 'Frequency')
#Check class imbalance
table(data$quality)
round(prop.table((table(data$quality))),2)
Frequency plot to analyze the dependent variable

Analyzing the plot, we can see that the dataset has a considerably higher number of 0 values, indicating that most rows represent bad quality wines. In other words, the classes are imbalanced.

Further, the tables show that the data has 1,382 rows qualified as bad quality wine and 217 as good quality wine. Likewise, the dataset contains approximately 86% of 0 outcome values and 14% of 1 outcome values.

In that sense, it is necessary to take this class imbalance into consideration, which is why it is essential to follow a stratified sampling method when splitting the data into the train and test sets.

Now, I proceed to analyze the independent variables. For this analysis, I created boxplots and histograms for each variable. These visualizations help us identify the location of the summary values, the outliers each variable possesses, and the distribution each variable follows.

#Independent variable
    #Boxplots
par(mfrow=c(3,4))
for (i in 1:(length(data)-1)){
  boxplot(x = data[i], 
          horizontal = TRUE, 
          main = sprintf('Boxplot of the variable: %s', 
                         colnames(data[i])),
          xlab = colnames(data[i]))
}
#Histograms
par(mfrow=c(3,4))
for (i in 1:(length(data)-1)){
  hist(x = data[[i]], 
       main = sprintf('Histogram of the variable: %s',
                    colnames(data[i])), 
       xlab = colnames(data[i]))
}
Boxplots to analyze the numeric independent variables

As we can see, the boxplots show where the median and quartile measurements are located for each variable, as well as the range of values each variable takes.

By analyzing the boxplots, I concluded that all the variables have outliers, and that "residual sugar" and "chlorides" have the greatest number of them. For these two variables there is a concentration of values near the mean and median, which is reflected in a very slim interquartile range (IQR).

This information will come in handy at the data preparation step when I proceed to assess the outlier values.

Histogram plots to analyze the numeric independent variables

Visualizing the histograms, I identified the pattern of each variable. As we can see, most of the distributions are right-skewed, while "density" and "pH" appear approximately normally distributed. The variables "residual sugar" and "chlorides" span a wide range of values, with most observations grouped on the left side of the graph, which indicates a large number of outlier values.
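
To back up this visual impression with numbers, we could also compute the skewness of each predictor. A minimal sketch, assuming the e1071 package is installed (any skewness implementation would do):

#Quantifying the skewness of each numeric predictor
library(e1071)
sapply(data[1:11], skewness)

Positive values confirm the right skew seen in the histograms.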

Finally, I developed a bivariate analysis to understand the relationship that the variables have with each other.

#Bivariate analysis
  #Correlation matrix
library(ggcorrplot)
ggcorrplot(round(cor(data[-12]), 2), 
           type = "lower", 
           lab = TRUE, 
           title = 
             'Correlation matrix of the red wine quality dataset')
Correlation matrix to analyze the relationship between the numeric variables

In the image, we can visualize the positive and negative relationships between the independent variables. As the matrix shows, "fixed acidity" has a positive correlation of 0.67 with both "citric acid" and "density". In other words, as "fixed acidity" increases, "citric acid" tends to increase as well. The same applies to the relationship between the "free sulfur dioxide" and "total sulfur dioxide" variables.

Moreover, the variables "fixed acidity" and "pH" have a negative linear correlation of -0.68, indicating that when the fixed acidity of a wine increases, its pH decreases. This makes sense, because a lower pH value means a more acidic solution.
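
As a quick spot-check of the specific pairs discussed above (the column names assume read.csv's default conversion of spaces to dots, e.g. fixed.acidity):

#Spot-checking the pairwise correlations mentioned above
cor(data$fixed.acidity, data$citric.acid)  #~0.67
cor(data$fixed.acidity, data$density)      #~0.67
cor(data$fixed.acidity, data$pH)           #~-0.68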

Data preparation

Once I finished the EDA, I proceeded to prepare the data for the prediction models. In this step of the project, I focused on finding missing data and assessing the outlier values.

#Missing values
sum(is.na(data))
Result of the number of missing values in the data

Now that I have identified that the dataset does not contain any missing values, I will proceed to work with the outliers.

First, I identified the number of outliers in each variable. To complete this step, I created a function that flags outliers using the 1.5 * IQR rule, set up a data frame to store the results, and used a for-loop to gather and store the counts.

#Outliers
  #Identifying outliers with the 1.5 * IQR rule
is_outlier <- function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | 
           x > quantile(x, 0.75) + 1.5 * IQR(x))
}
outlier <- data.frame(variable = character(), 
                      sum_outliers = integer(),
                      stringsAsFactors = FALSE)
for (j in 1:(length(data)-1)){
  variable <- colnames(data)[j]
  sum_outliers <- sum(is_outlier(data[[j]]))
  outlier <- rbind(outlier, data.frame(variable, sum_outliers))
}
Outlier data frame

As we can see, all of the variables in the data have outliers. To assess these values, I followed a criterion: variables whose outliers make up less than 5% of the observations are accepted as they are, while variables at or above that threshold will have their outliers imputed.

It is essential to mention that I did not simply drop the outlier values, because they carry necessary information about the dataset. Deleting the outliers could bias the results of our models in a significant way.

#Identifying the percentage of outliers
for (i in 1:nrow(outlier)){
  if (outlier[i,2]/nrow(data) * 100 >= 5){
    print(paste(outlier[i,1], 
                '=', 
                round(outlier[i,2]/nrow(data) * 100, digits = 2),
                '%'))
  }
}
Variables with a percentage of outliers equal to or greater than 5%

With the code displayed above, I identified that the variables "residual sugar" and "chlorides" have approximately 10% and 7% of outlier values, respectively.

Further, I proceeded to impute the outlier values of these variables. I chose to replace the outliers with the mean of each variable because, as we can see in the histograms, both variables have a large concentration of values near the mean. For that reason, imputing with the mean will not affect the essence of the data in a significant manner.

#Imputing outlier values
  #Columns 4 and 5 are "residual sugar" and "chlorides"
for (i in 4:5){
  #Compute the upper fence and the replacement value once, before replacing,
  #so they do not drift as values are overwritten
  upper_fence <- as.numeric(quantile(data[[i]], 0.75) + 
                              1.5 * IQR(data[[i]]))
  impute_value <- round(mean(data[[i]]), digits = ifelse(i == 4, 2, 3))
  #Only values above the upper fence are replaced
  for (j in 1:nrow(data)){
    if (data[[j, i]] > upper_fence){
      data[[j, i]] <- impute_value
    }
  }
}

Modeling

Now that I have correctly arranged the dataset, I proceed to develop the machine learning models that will predict red wine quality. The first step is to split the data into train and test sets. Since the data is imbalanced, I used stratified sampling: I took 80% of the observations that represent a good quality wine (quality = 1) and an equal number of bad quality observations for the training set. In other words, the dependent variable has the same number of 0 and 1 observations in the training set.

#Splitting the dataset into the Training set and Test set
  #Stratified sample
data_ones <- data[which(data$quality == 1), ]
data_zeros <- data[which(data$quality == 0), ]
#Train data
set.seed(123)
train_ones_rows <- sample(1:nrow(data_ones), 0.8*nrow(data_ones))
#Sample the same number of zeros as ones to keep the training set balanced
train_zeros_rows <- sample(1:nrow(data_zeros), 0.8*nrow(data_ones))
train_ones <- data_ones[train_ones_rows, ]  
train_zeros <- data_zeros[train_zeros_rows, ]
training_set <- rbind(train_ones, train_zeros)
table(training_set$quality)
#Test Data
test_ones <- data_ones[-train_ones_rows, ]
test_zeros <- data_zeros[-train_zeros_rows, ]
test_set <- rbind(test_ones, test_zeros)
table(test_set$quality)
Tables of the dependent variable in the train and test sets

As we can see in the image, the training set contains fewer observations than the test set because of the undersampling. However, the training set is balanced, which allows the models to be trained efficiently.
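
A quick sanity check of the resulting class proportions confirms the split behaves as intended:

#Verify the class balance of the two sets
round(prop.table(table(training_set$quality)), 2)  #balanced, ~50/50
round(prop.table(table(test_set$quality)), 2)      #remains imbalanced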

Now that I have completed this step, I proceed to develop the models and determine which model can accurately predict the quality of red wine.

Logistic regression

#Logistic Regression
lr = glm(formula = quality ~.,
         data = training_set,
         family = binomial)
#Predictions
prob_pred = predict(lr, 
                    type = 'response', 
                    newdata = test_set[-12])
library(InformationValue)
optCutOff <- optimalCutoff(test_set$quality, prob_pred)[1]
y_pred = ifelse(prob_pred > optCutOff, 1, 0)

Once the model was created with the training set, I proceeded to predict the values of the test set.

Since the logistic regression delivers probability values, I calculated the optimal cut-off point, which categorizes the predicted outcomes into 1 or 0.
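
If the InformationValue package is unavailable (it has been archived from CRAN in the past), a hand-rolled alternative is to pick the cutoff that maximizes Youden's J statistic (sensitivity + specificity - 1). A minimal sketch, using the objects defined above:

#Hypothetical replacement for optimalCutoff(): maximize Youden's J
cutoffs <- seq(0.01, 0.99, by = 0.01)
j_stats <- sapply(cutoffs, function(thr){
  pred <- ifelse(prob_pred > thr, 1, 0)
  tp <- sum(pred == 1 & test_set$quality == 1)
  tn <- sum(pred == 0 & test_set$quality == 0)
  fp <- sum(pred == 1 & test_set$quality == 0)
  fn <- sum(pred == 0 & test_set$quality == 1)
  tp / (tp + fn) + tn / (tn + fp) - 1
})
optCutOff <- cutoffs[which.max(j_stats)]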

Then, with the predicted values obtained, I proceed to develop a confusion matrix where we can visualize the test set values with the predicted values for the logistic regression model.

#Making the confusion matrix
cm_lr = table(test_set[, 12], y_pred)
cm_lr
#Accuracy: correct predictions (the diagonal) over all predictions
accuracy_lr = (cm_lr[1,1] + cm_lr[2,2]) / sum(cm_lr)
accuracy_lr
Confusion matrix of the Logistic Regression

Looking at the table, the model accurately predicted 1,208 values, meaning that it misclassified 45 observations. Additionally, I concluded that the model has an accuracy of 96.41%.
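
Because the test set is heavily imbalanced, accuracy alone can be flattering. The per-class rates from the same confusion matrix (rows are the actual classes, columns the predictions) make the picture clearer:

#Per-class rates from the confusion matrix
sensitivity_lr <- cm_lr[2,2] / (cm_lr[2,1] + cm_lr[2,2])  #true positive rate
specificity_lr <- cm_lr[1,1] / (cm_lr[1,1] + cm_lr[1,2])  #true negative rate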

#ROC curve
library(ROSE)
par(mfrow = c(1, 1))
roc.curve(test_set$quality, y_pred)
ROC Curve of the Logistic Regression

Further, I developed a ROC curve to assess the capability of the model to distinguish between the outcome classes. I found that the area under the curve (AUC) is 51.1%.
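
One caveat worth flagging: roc.curve() above receives the already-thresholded y_pred, so the curve is built from a single cutoff. Passing the raw probabilities instead traces the curve over all cutoffs and typically yields a higher AUC:

#ROC curve computed from the raw predicted probabilities instead
roc.curve(test_set$quality, prob_pred)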

Decision tree

I followed the same steps as before: once the model was created with the training set, I predicted the values of the test set.

#Decision Tree
library(rpart)
dt = rpart(formula = quality ~ .,
           data = training_set,
           method = 'class')
#Predictions
y_pred = predict(dt, 
                 type = 'class', 
                 newdata = test_set[-12])

Further, I proceed to generate a confusion matrix where we can see the test set values with the predicted values for the decision tree model.

#Making the confusion matrix
cm_dt = table(test_set[, 12], y_pred)
cm_dt
#Accuracy: correct predictions (the diagonal) over all predictions
accuracy_dt = (cm_dt[1,1] + cm_dt[2,2]) / sum(cm_dt)
accuracy_dt
Confusion matrix for the Decision Tree

Looking at the table, the model accurately predicted 873 observations, indicating that it misclassified 380 values. I also found that the model has an accuracy of 69.67%.

#ROC curve
library(ROSE)
roc.curve(test_set$quality, y_pred)
ROC Curve for the Decision Tree

Then, from the ROC curve, I obtained an area under the curve (AUC) of 81%.

Random forest

Finally, I created the random forest model with the training set and, as before, predicted the values of the test set.

#Random forest
library(randomForest)
rf = randomForest(x = training_set[-12],
                  y = training_set$quality,
                  ntree = 10)
#Predictions
y_pred = predict(rf, 
                 type = 'class', 
                 newdata = test_set[-12])
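
Note that ntree = 10 is very low for a random forest; the randomForest default is 500. A quick sketch for checking whether more trees would help, using the model's out-of-bag (OOB) error:

#Optional: inspect how the OOB error evolves with more trees
set.seed(123)
rf_500 <- randomForest(x = training_set[-12],
                       y = training_set$quality,
                       ntree = 500)
plot(rf_500)  #OOB error rate as a function of the number of trees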

Now, I proceed to visualize the test set values with the predicted values for the random forest model by creating a confusion matrix.

#Making the confusion matrix
cm_rf = table(test_set[, 12], y_pred)
cm_rf
#Accuracy: correct predictions (the diagonal) over all predictions
accuracy_rf = (cm_rf[1,1] + cm_rf[2,2]) / sum(cm_rf)
accuracy_rf
Confusion matrix of the Random Forest

Evaluating the table, the model accurately predicted 991 values, which means it misclassified 262 observations. The model's accuracy is 79.09%.

#ROC curve
library(ROSE)
roc.curve(test_set$quality, y_pred)
ROC Curve of the Random Forest

Finally, with the ROC curve, I obtained an AUC of 83.7%.

Variable importance

Moreover, I proceed to answer the second question of the project by calculating the variable importance of the model with the highest accuracy. In other words, I calculated the variable importance of the logistic regression model.

#Variable importance
library(caret)
varImp(lr)
Variable importance of the Logistic Regression

Analyzing the results, the most significant variable for this model is "alcohol", followed by the variables "sulphates" and "fixed acidity".
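
For reference, varImp() applied to a glm essentially ranks the predictors by the absolute value of their z-statistics, which can also be read directly from the model summary:

#The test statistics behind the importance ranking
summary(lr)$coefficients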

Further, I investigated the performance and impact of these components on red wine quality. I found that sulphates are the component responsible for the freshness of the drink: wines that contain no sulphates, or only a low amount, generally have a shorter shelf life. In other words, sulphates give more control over the life of the wine, helping to ensure it will be fresh and clean when opened. The alcohol also plays a meaningful part in the wine's quality: it helps balance the firmer, more acid taste of the wine, creating an interrelationship between the hard and soft characters of the wine.

As we can see, the logistic regression model agrees with these domain facts: the components identified as essential for good wine quality are the same variables the model flags as important. For this reason, "alcohol" and "sulphates" are very significant to the model, since these elements are essential in indicating whether a wine has good or bad quality.


Conclusion

After obtaining the results of the different machine learning algorithms, the Logistic Regression model displayed the highest accuracy in predicting the quality of red wine. With an accuracy of 96.41%, this model correctly predicted 1,208 values, meaning that its misclassification error was 3.59%.

On the other hand, by analyzing the ROC curve, the model's performance is not as good as expected. With an area under the curve of only 51.1%, the ROC curve is barely better than random guessing: the model is not capable of telling the two classes apart, which indicates low performance. For this reason, I concluded that even though the model has good accuracy in predicting the test set values, it has a poor rate of identifying true positive values.

Moreover, by analyzing the other ROC curves, the random forest had the best performance, obtaining an area under the curve of 83.7%. This means that even though the random forest did not display the highest accuracy of the three models, it is the best at distinguishing the classes of the dependent variable.

Further, with the logistic regression model, I identified which of the physicochemical properties have the highest impact on the quality of red wine. The variables "alcohol", "sulphates", and "fixed acidity" have the most influence on the model; if one of these variables changes, the results of the model will be strongly affected.

