Classifying Images with Feature Transformations

Exploring the MNIST and Fashion MNIST datasets with Logistic Regression and Random Forest


By Gavin Smith and XuanKhanh Nguyen

This semester, I took a Machine Learning class at Tufts University. It was one of my favorite data science courses I have taken so far. It taught me how to tell whether machine learning is actually solving a problem, and, most importantly, it made me a better data scientist.

We were given three projects throughout the semester. Each project had a structured problem and an open-ended problem. The open-ended problem had the loosest specification you could imagine: in the spirit of not stifling any sparks of creativity, we were told to build a machine learning model of our choice that would reach the top of the leaderboard.

For the first project, we were given two datasets: a subset of the popular MNIST dataset containing the handwritten digits 8 and 9 for Part 1, and images of sandals and sneakers from the Fashion MNIST dataset for Part 2. For Part 1, our task was to explore logistic regression models and determine the effect of various hyperparameter choices on the model's accuracy on the digit images. For Part 2, our task was to find the best possible model for distinguishing sneakers from sandals in the Fashion MNIST data. Methodologies are explained in both sections, along with the corresponding figures.

Part 1: MNIST Handwritten Digit Dataset

1.1. Dataset Exploration

The dataset used for this part of the project was a subset of the MNIST dataset that only includes the digits 8 and 9. As seen in Table 1, the training set contained 11,800 examples, each represented by 784 pixel values (a flattened 28x28 grid), and the validation set contained 1,983 examples. A logistic regression model was fit to these data using sklearn, and various parameter tuning was performed.

1.2. Assess Loss and Error vs. Training Iterations

For this section, logistic regression models were fit to the training data with sklearn, using the lbfgs solver. All other settings were left at their defaults, and the primary exploration was the effect of the number of training iterations, controlled through the max_iter parameter.
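
A minimal sketch of how such an iteration sweep might look with sklearn is below. The CSV file names are placeholders for however the course data is actually stored.

```python
# Sketch of the max_iter sweep behind Figure 1. File names are placeholders
# for the course-provided training and validation data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, zero_one_loss

x_tr = np.loadtxt('x_train.csv', delimiter=',')   # (11800, 784)
y_tr = np.loadtxt('y_train.csv', delimiter=',')
x_va = np.loadtxt('x_valid.csv', delimiter=',')   # (1983, 784)
y_va = np.loadtxt('y_valid.csv', delimiter=',')

history = []
for n_iter in [1, 2, 5, 10, 20, 40]:
    model = LogisticRegression(solver='lbfgs', max_iter=n_iter)
    model.fit(x_tr, y_tr)                         # may warn about non-convergence
    history.append({
        'iters': n_iter,
        'train_loss': log_loss(y_tr, model.predict_proba(x_tr)),
        'valid_loss': log_loss(y_va, model.predict_proba(x_va)),
        'train_err': zero_one_loss(y_tr, model.predict(x_tr)),
        'valid_err': zero_one_loss(y_va, model.predict(x_va)),
    })
```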

Figure 1: Loss and Error vs. Training Iterations

The left plot shows log loss (y-axis) vs. iteration (x-axis), with two lines (blue for training, red for validation). From i = 0 to i = 5, the training loss decreases roughly linearly, then flattens out from i = 5 to i = 40. The validation loss decreases at a similar rate for the first seven iterations and then starts increasing. We see overfitting after about 10 iterations, when the training loss keeps shrinking toward zero while the validation loss rises. The logistic loss penalizes wrong predictions at a high cost: it grows as the predicted probability deviates from the true class label, and a perfect model would have a logistic loss of zero.

The right plot shows error rate (y-axis) vs. iteration (x-axis), again with blue for training and red for validation. We used sklearn.metrics.zero_one_loss to compute the error rate; it counts a sample as wrong when its predicted label does not match the true label, so for this binary task it is simply the fraction of misclassified examples. For the first seven iterations, the error rate decreases on both the training and validation sets. After that, the validation error starts increasing while the training error keeps decreasing. This is the classic sign of overfitting; one explanation is that the model is being trained for too long.

1.3. Hyperparameter selection

In this section, we use the same logistic regression model with the lbfgs solver described previously and vary the inverse regularization strength C. The best value we found was C = 0.01.
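
A sketch of the C sweep behind Figure 2 might look like the following, continuing from the data-loading sketch above.

```python
# Sketch of the regularization sweep behind Figure 2, reusing the arrays
# loaded in the previous sketch. C is the inverse regularization strength.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import zero_one_loss

C_grid = np.logspace(-4, 4, 9)                    # 0.0001 ... 10000
valid_err = []
for C in C_grid:
    model = LogisticRegression(solver='lbfgs', C=C, max_iter=1000)
    model.fit(x_tr, y_tr)
    valid_err.append(zero_one_loss(y_va, model.predict(x_va)))

best_C = C_grid[int(np.argmin(valid_err))]        # ~0.01 in our runs
```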

Figure 2: Error rate as a function of C

1.4. Analysis of Mistakes

In this section, we analyzed the mistakes made by the best model (as selected by the value of C, its logistic loss, and its accuracy) by finding false positives and false negatives in the model's validation predictions and plotting them as images with matplotlib.
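
A minimal sketch of this mistake analysis, continuing from the sketches above, is shown below. It assumes the labels are stored as the digit values 8 and 9, with 9 treated as the positive class; if they are encoded as 0/1, the comparisons change accordingly.

```python
# Sketch of the mistake analysis: refit the tuned model, collect the
# validation examples it gets wrong, and render them as 28x28 images.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs', C=0.01, max_iter=1000)
model.fit(x_tr, y_tr)
y_hat = model.predict(x_va)

# With 9 as the positive class: false positives are true 8s predicted as 9,
# false negatives are true 9s predicted as 8 (assumes raw 8/9 labels).
false_pos = np.where((y_va == 8) & (y_hat == 9))[0]
false_neg = np.where((y_va == 9) & (y_hat == 8))[0]

def show_examples(indices, n_show=8):
    """Plot up to n_show validation images by index."""
    picked = indices[:n_show]
    fig, axes = plt.subplots(1, len(picked), figsize=(1.5 * len(picked), 2),
                             squeeze=False)
    for ax, idx in zip(axes[0], picked):
        ax.imshow(x_va[idx].reshape(28, 28), cmap='gray')
        ax.axis('off')
    plt.show()

show_examples(false_pos)   # Figure 3
show_examples(false_neg)   # Figure 4
```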

Figure 3: False positives on the validation set.

Figure 3 shows images whose true label is 8 but which the model predicted to be 9.

Figure 4: False negatives on the Validation set.

Figure 4 shows images whose true label is 9 but which the model predicted to be 8.

The classifier misidentifies an image when it contains key features of the other digit. For example, if a 9 is written with a slanted lower stroke like "/" rather than a straight "|", or has a horizontal line at the bottom, it gets classified as an 8. Because this is a linear weighted model, once those key features outweigh the rest, the model makes the wrong prediction. The mistakes can also be explained in terms of the 0.5 decision threshold used to separate 8s from 9s: any example whose predicted probability lands on the wrong side of that threshold becomes a false positive or a false negative. Another possible explanation is that some training images were labeled as 8s when they should have been 9s. That theory could only be checked by rendering all of the feature vectors as images and labeling them by hand.

1.5. Interpretation of learned weights

Figure 5: Weight Coefficients of Logistic Regression Model

We noticed that pixels that push the prediction toward an 8 (negative weights) are red in hue, while pixels that push toward a 9 (positive weights) are more blue. Red and blue pixels outside the regions of high red/blue density correspond to noise introduced in the preprocessing of the MNIST dataset, and pixels that are neither red nor blue but yellowish correspond to empty space within the images. To check this, we set vmin and vmax explicitly and used the RdYlBu colormap: with vmin = -0.5, negative weights closer to -0.5 appear more red, and with vmax = 0.5, positive weights closer to 0.5 appear more blue. Weights near zero take on yellowish hues, corresponding to white space in the images.
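
A sketch of how the weight map in Figure 5 can be drawn, continuing from the tuned model above; which sign maps to 8 versus 9 depends on how sklearn orders the two classes.

```python
# Sketch of the weight-map plot: reshape the 784 learned coefficients into a
# 28x28 grid and color them with the RdYlBu colormap (low = red, high = blue).
import matplotlib.pyplot as plt

weights = model.coef_.reshape(28, 28)             # tuned model from the sketch above
plt.imshow(weights, cmap='RdYlBu', vmin=-0.5, vmax=0.5)
plt.colorbar()
plt.title('Logistic regression weights')
plt.show()
```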

Part 2: Sneaker vs. Sandal Image Classification

Section 1: Methods for Sneaker-Sandal

2.1. Design

We decided to split the data using fixed validation rather than cross-validation. To decide this, we measured the error of our baseline classifier for several numbers of cross-validation folds and for several fixed-validation splits. The results can be seen in Figures 6 and 7: for cross-validation, the validation log loss decreased as the number of folds increased, and for fixed validation, the validation log loss was minimized with 30% of the data held out. Both methods yielded a log loss of about 0.1. Given these results, we chose fixed validation over cross-validation because it has a much shorter runtime, especially compared with cross-validation over many folds, and because with a dataset this large, the risk that a single random validation split introduces large variance in our error estimate is minimal.
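
A minimal sketch of this comparison is below; x_all and y_all are placeholder names for the full sneaker/sandal training arrays.

```python
# Sketch of the validation-strategy comparison: k-fold cross-validation log loss
# for the baseline versus a single fixed 70/30 split. x_all and y_all are
# placeholders for the full sneaker/sandal arrays.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

baseline = LogisticRegression(max_iter=1000)
cv_log_loss = -cross_val_score(baseline, x_all, y_all, cv=10,
                               scoring='neg_log_loss').mean()

# Fixed validation: hold out 30% of the data once
x_tr2, x_va2, y_tr2, y_va2 = train_test_split(
    x_all, y_all, test_size=0.3, random_state=0, stratify=y_all)
```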

Figure 6: Error Rate on Baseline Models as the Number of Cross-Validation Splits Increases

Figure 6 shows that as the number of splits is increased with cross-validation, the validation log loss and error decrease.

Figure 7: Error Rate on Baseline Model as the Size of the Test Set in Fixed-Validation Increases

Figure 7 shows that as the test set's size increases with fixed validation, the validation log loss and error slightly decrease, while the training log loss and error increase.

2.2 Base classifier, model fitting, and the hyperparameter selection process.

For the base classifier, we looked at three parameters: C, the maximum number of iterations, and the penalty. The first was C, the inverse of the regularization strength: the lower the value of C, the more regularized the model is, and the more heavily overfitting is penalized. We picked a value of C using a grid search with candidate values ranging from small decimals to ten thousand. We chose this range because C generally works best when it is neither too large nor too small: if C is too large, there is very little regularization and the model can become very complex and overfit; if C is too small, the regularization is so heavy that the model underfits. Using this grid search, we found that C = 1 was optimal for minimizing both the log loss and the zero-one error on the validation set, as shown in Figure 13.
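
Evaluated on the fixed split above, the grid search over C might be sketched as follows.

```python
# Sketch of the C grid search for the Part 2 baseline on the fixed validation
# split from the previous sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, zero_one_loss

scores = {}
for C in np.logspace(-3, 4, 8):                   # small decimals up to 10,000
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(x_tr2, y_tr2)
    scores[C] = (log_loss(y_va2, clf.predict_proba(x_va2)),
                 zero_one_loss(y_va2, clf.predict(x_va2)))

best_C = min(scores, key=lambda C: scores[C][0])  # minimize validation log loss
```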

The second hyperparameter we looked at was the maximum number of iterations the solver is allowed to take to converge. From the data (seen in Figure 8), we saw that once the model was given more than a few hundred iterations it always converged, so the exact value was not critical.

Figure 8: Error Rate on Baseline Model as the Maximum Iterations for Convergence Increases.

Figure 8 shows that the error rate stays constant as the maximum iterations increase on the Logistic Regression classifier.

The last parameter we looked at was whether an L1 or L2 penalty improved the performance of the model. The L1 (lasso) penalty gave us a lower error rate on the validation data than L2. This is probably because L1 regularization can drive some weights exactly to zero: since many pixels in these images are always off (black), the corresponding features can safely be ignored.
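
A short sketch of the penalty comparison is below; note that sklearn's default lbfgs solver only supports L2, so a solver such as liblinear (or saga) is needed for the L1 penalty.

```python
# Sketch of the L1 vs. L2 comparison on the fixed validation split. liblinear
# is used for L1 because lbfgs only supports the L2 penalty.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import zero_one_loss

for penalty, solver in [('l1', 'liblinear'), ('l2', 'lbfgs')]:
    clf = LogisticRegression(penalty=penalty, solver=solver, C=1.0, max_iter=1000)
    clf.fit(x_tr2, y_tr2)
    err = zero_one_loss(y_va2, clf.predict(x_va2))
    n_zeroed = int((clf.coef_ == 0).sum())        # L1 typically zeroes many pixel weights
    print(f'{penalty}: validation error = {err:.4f}, zeroed weights = {n_zeroed}')
```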

2.3 Logistic Regression with Feature Transform, model fitting, and the hyperparameter selection process.

The feature transform we used for our second model is the Box-Cox transform, a type of power transform. The Box-Cox transformation reshapes each feature toward a more normal (Gaussian-like) distribution before the linear model is fit, which reduces skewness while preserving the monotonic relationship between the predictors and the response. Because the transformed features are closer to normally distributed, the linear model's assumptions are more reasonable and extreme values have less influence, which makes predictions on the normalized data more reliable. This suits our dataset, where a large fraction of the pixel values are 0 (black). We also found that using this transform improved the baseline classifier's performance on the validation set.
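
One way to wire the transform into the model with sklearn is sketched below. sklearn's PowerTransformer Box-Cox implementation requires strictly positive, non-constant inputs, so this sketch drops constant pixels and shifts the values by 1 first; the exact preprocessing we used may differ.

```python
# Sketch of a Box-Cox + logistic regression pipeline. Constant (always-black)
# pixels are dropped and values are shifted by 1 because Box-Cox requires
# strictly positive, non-constant inputs; Yeo-Johnson would handle zeros directly.
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, PowerTransformer

boxcox_lr = Pipeline([
    ('drop_constant', VarianceThreshold()),
    ('shift', FunctionTransformer(lambda X: X + 1.0)),
    ('boxcox', PowerTransformer(method='box-cox')),
    ('clf', LogisticRegression(penalty='l1', solver='liblinear', C=1.0,
                               max_iter=1000)),
])
boxcox_lr.fit(x_tr2, y_tr2)
```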

Since we are still using a logistic regression classifier, we must again tune C. Just as with the baseline classifier, we varied C over a wide range of values and again found that C = 1 was optimal (see Figure 9). A value of C in this range ensures that regularization causes neither overfitting nor underfitting. We kept the same maximum number of iterations so the model could converge, and we kept the L1 penalty because of the performance boost it provided. When selecting C, we again optimized the log loss, since it upper-bounds the error rate.

Figure 9: Error Rate on Logistic Regression with Box-Cox Transform as the Strength of Regularization Changes.

Figure 9 shows that the value of C that minimizes the error and log loss is around 1, because it applies an appropriate amount of regularization without over- or under-compensating for overfitting.

2.4: Random Forest with Feature Transform, model fitting, and the hyperparameter selection process.

For this new classifier, we again used the Box-Cox power transform because it provides a more normal distribution of our data so that our model will be more stable and accurate.

The classifier we chose for our last model is the random forest classifier. We thought a random forest would work well for this dataset because it is an ensemble of decision trees: each tree is an independent base classifier trained on a sample drawn with replacement from the original training set, and at each split it considers only a random subset of the available features when choosing the best split. The forest then aggregates the individual tree predictions into a final prediction by majority vote. This process makes the result more robust and accurate and less prone to overfitting. Another reason the random forest classifier was preferable is that it handles high-dimensional data and large datasets well. Runtime and memory were major roadblocks for us with other preprocessing methods and classifiers, so being able to run this classifier easily on our data was a large benefit.

One hyperparameter we can vary with this classifier is the number of trees in the forest. Increasing the number of trees generally gives more stable, accurate predictions and decreases the chance of overfitting. We varied the number of trees from small values like 10 up to 1000 (see Figure 10) and found that, beyond a certain point, adding more trees had little impact on the validation accuracy. Using a grid search, we found the optimal number of trees to be 600, though above roughly 100 trees the improvement was very small.
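
The tree-count sweep behind Figure 10 might be sketched as follows, reusing the Box-Cox preprocessing from the pipeline above so the transform is only fit once.

```python
# Sketch of the n_estimators sweep behind Figure 10, on Box-Cox-transformed
# features reused from the pipeline above.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss, zero_one_loss
from sklearn.pipeline import Pipeline

preprocess = Pipeline(boxcox_lr.steps[:-1])       # everything except the classifier
x_tr_bc = preprocess.fit_transform(x_tr2)
x_va_bc = preprocess.transform(x_va2)

for n_trees in [10, 50, 100, 200, 400, 600, 1000]:
    rf = RandomForestClassifier(n_estimators=n_trees, n_jobs=-1, random_state=0)
    rf.fit(x_tr_bc, y_tr2)
    print(n_trees,
          log_loss(y_va2, rf.predict_proba(x_va_bc)),
          zero_one_loss(y_va2, rf.predict(x_va_bc)))
```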

Figure 10: Error Rate on Random Forest Classifier as the Number of Trees Increases.

Figure 10 shows that increasing the number of estimators (trees) slightly decreases the validation log loss and error rate of the random forest. This is because averaging over more trees reduces the variance of the ensemble's predictions, which in turn reduces overfitting.

The other hyperparameters we considered were the minimum number of samples required to split an internal node and the minimum number of samples required to form a leaf node. We only tried values in the range 1 to 10 because their default values are 2 and 1, respectively (seen in Figures 11 and 12). A grid search showed that the default values were already optimal, so they were left alone.
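
A sketch of that grid search is below (sklearn requires min_samples_split to be at least 2, so the grid starts there); a smaller forest is used here just to keep the search cheap.

```python
# Sketch of the grid search over the two split-size hyperparameters on the
# fixed validation split; 100 trees keeps the search cheap.
import itertools
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

best = None
for split, leaf in itertools.product(range(2, 11), range(1, 11)):
    rf = RandomForestClassifier(n_estimators=100, min_samples_split=split,
                                min_samples_leaf=leaf, n_jobs=-1, random_state=0)
    rf.fit(x_tr_bc, y_tr2)
    loss = log_loss(y_va2, rf.predict_proba(x_va_bc))
    if best is None or loss < best[0]:
        best = (loss, split, leaf)                # defaults (2, 1) won in our runs
```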

Figure 11: Error Rate on Random Forest Classifier as the Minimum Number of Samples to Split a Node Increases.

Figure 11 shows that as the minimum number of samples required to split an internal node increases, the training and validation log loss slightly increase.

Figure 12: Error Rate on Random Forest Classifier as the Minimum Number of Samples to be a Leaf Node Increases.

Figure 12 shows that as the minimum number of samples required to form a leaf node increases, the training and validation log loss increase.

We minimized the binary cross-entropy (log loss) when searching for optimal hyperparameters because it upper-bounds the error rate, so driving it down also limits the worst-case error on the test set. In practice, the zero-one error rate usually moved together with the binary cross-entropy, so minimizing either value would have given similar results.

Section 2: Results for Sneaker-Sandal dataset

Figure 13: Hyperparameter selection on the baseline model.

To look for evidence of overfitting, we varied the C hyperparameter of the logistic regression classifier. Since C is the inverse of the regularization strength, larger values mean less regularization and smaller values mean more. Large C values therefore correspond to a model that overfits, and small C values to one that underfits. The plot shows that a C value of about 1 is optimal: it penalizes the weights enough to avoid overfitting without regularizing so heavily that the model underfits.

Figure 14: ROC Curve for Training and Heldout Set Performance on All Models.

In this figure, all three models classify the held-out data well. Each ROC curve rises almost immediately toward the upper left-hand corner, meaning the models achieve nearly all true positives at thresholds that produce almost no false positives, which is also reflected in their very high AUROC values. The random forest classifier extends slightly further toward the upper left-hand corner on both plots, meaning it gives a more accurate prediction at a given threshold. The other two models perform so similarly that the baseline classifier's curve hides the Box-Cox transformation classifier's curve.
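
A sketch of how the ROC comparison can be produced is below; baseline_lr, boxcox_lr, and boxcox_rf are placeholder names for the three fitted models from the earlier sketches, and the labels are assumed to be 0/1.

```python
# Sketch of the held-out ROC comparison in Figure 14. The model variables are
# placeholder names for the three fitted classifiers from the earlier sketches.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

models = {'Baseline LR': baseline_lr,
          'Box-Cox LR': boxcox_lr,
          'Box-Cox RF': boxcox_rf}
for name, mdl in models.items():
    proba = mdl.predict_proba(x_va2)[:, 1]        # probability of the positive class
    fpr, tpr, _ = roc_curve(y_va2, proba)
    plt.plot(fpr, tpr, label=f'{name} (AUROC = {roc_auc_score(y_va2, proba):.3f})')

plt.plot([0, 1], [0, 1], 'k--', linewidth=0.5)    # chance line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
```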

Figure 15: False Positives and False Negatives on Heldout Set for Random Forest and Baseline Models.

This figure shows some false positives and false negatives made by the random forest classifier (top) and the baseline classifier (bottom). We can see that both classifiers had trouble with sneakers that have contrasting colors in the middle of the shoe: sandals are usually marked by empty space over the top of the foot, so a mostly white sneaker with a black logo can be mistaken for a sandal. The random forest classifier, on the other hand, had a harder time with sandals that are mostly closed-toed or solid in form, possibly because, without obvious openings, they look like sneakers. The baseline classifier made similar mistakes, although some of its errors have more obvious openings.

References:

  1. https://towardsdatascience.com/optimization-loss-function-under-the-hood-part-ii-d20a239cde11
  2. https://en.wikipedia.org/wiki/Power_transform
