Using Stochastic Gradient Descent to Train Linear Classifiers

You can tame data sets with large numbers of training examples or features

Lindo St. Angel
Towards Data Science



Introduction

You are probably aware that Stochastic Gradient Descent (SGD) is one of the key algorithms used in training deep neural networks. However, you may be less familiar with its use as an optimizer for training linear classifiers such as Support Vector Machines and Logistic Regression, or with when and how to apply it.

In this article, you will learn that the approach can be effectively used on large-scale data sets (> 10⁵ samples) or with a large number (> 10⁵) of features, where other methods may lead to extremely long fit times or be infeasible beyond a few thousand samples or features. Additionally, SGD allows for online learning, letting the algorithm quickly fit new data to an existing classifier.

You will learn how to use the relevant Python APIs from scikit-learn, see an example of a well-suited data set captured from radar samples, review test results from a classifier fitted to that data using SGD, and understand some drawbacks of the method, including the need for rather extensive hyperparameter tuning.

You can use other optimizers to train linear classifiers and, depending on the size of your data set and feature space, the SGD method and linear classifiers in general may not be the best solution. For an overall guide to using SVMs, see A Practical Guide to Support Vector Classification (Hsu et al., 2016). Some techniques described in that excellent paper were used here.

Data Set

To help you understand the techniques and code used in this article, a short walk-through of the data set is provided in this section. The data set was gathered from radar samples as part of the radar-ml project and can be found here. This project employs autonomous supervised learning, whereby standard camera-based object detection techniques are used to automatically label radar scans of people and objects.

The data set is a Python dict of the form:

{'samples': samples, 'labels': labels}

samples is a list of N radar projection numpy.array tuple samples in the form:

[(xz_0, yz_0, xy_0), (xz_1, yz_1, xy_1), …, (xz_N-1, yz_N-1, xy_N-1)]

Where a radar projection is the maximum return signal strength of a scanned target object in 3-D space projected onto the x–z, y–z and x–y planes. These 2-D representations are typically sparse since a projection occupies only a small part of the scanned volume.

labels is a list of N numpy.array class labels corresponding to each radar projection sample of the form:

[class_label_0, class_label_1, …, class_label_N-1]
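The structure above can be sketched with a tiny stand-in data set; the projection shapes below are illustrative, not the actual radar-ml dimensions, and the pickle round-trip mirrors how such a dict would be stored on disk:

```python
import pickle
import numpy as np

# Build a tiny stand-in data set with the same structure (shapes are illustrative).
xz, yz, xy = (np.zeros((80, 110)) for _ in range(3))
data = {"samples": [(xz, yz, xy)], "labels": [np.array(0)]}

# Round-trip through pickle, as the real data set would be serialized on disk.
blob = pickle.dumps(data)
loaded = pickle.loads(blob)

samples = loaded["samples"]   # list of (xz, yz, xy) projection tuples
labels = loaded["labels"]     # list of class labels, one per sample
```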

Projections from a typical single sample are shown in the heat map visualization below. Red indicates where the return signal is strongest.

Visualization of a data set sample of a dog (my pet Polly)

This data was captured in my house in various locations designed to maximize the variation in detected objects (currently only people, dogs and cats), distance and angle from the radar sensor.

The data set contains known labeling errors, mostly stemming from the object detector mistaking my cat for my dog, which happens at (subjectively) about a 10% error rate. A future effort will attempt to fine-tune the object detector to reduce the error. These errors propagate to the radar classifier trained on this data set.

Training

You can use the steps below to train the model on the radar data. The complete Python code that implements these steps can be found in the train.py module of the radar-ml project.

  1. Scale data set sample features to the [0, 1] range.
  2. Encode data set labels as integers.
  3. Split samples and labels up into train, validation and test sets.
  4. Generate feature vectors from the radar projections in each set above by concatenating all or selected projections. The result is a large but sparse feature space which is a function of the radar scan volume. In this example, the feature vector has length 10,010.
  5. Augment the training set. This increases accuracy at the expense of training time. Fortunately, using SGD as an optimizer for linear classifiers scales extremely well on large data sets.
  6. Balance the training set.
  7. Use the training set and Stratified K-Folds cross-validation to fit a linear classifier using SGD as an optimization technique and a grid search to find the best hyperparameters.
  8. Calibrate the best classifier using the validation set to obtain an accurate probability estimate of the predictions. This step may not be needed for probabilistic classifiers.

The Python snippet below from radar-ml’s train.py shows the actual fitting function. This uses the sklearn linear_model.SGDClassifier API with ‘log’ loss, which gives Logistic Regression. You can see the online-learning aspect, which is used to do partial fits on the optimal classifier with augmented data as well as novel data sets, a very computationally efficient process. You can also see that the grid search tries fits over a number of hyperparameters; getting these values right is key to an accurate classifier.

SGD Fitting Function

Note: the sklearn.svm.LinearSVC API can optimize the same cost function as the SGDClassifier by adjusting the penalty and loss parameters. However, LinearSVC does not allow for online learning. LinearSVC uses the LIBLINEAR library (Fan et al., 2008).
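For instance, with hinge loss and an L2 penalty the SGDClassifier targets the same linear-SVM objective that LinearSVC solves in batch mode. A small sketch of the two equivalent parameterizations (not radar-ml code):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC

# SGD with hinge loss + L2 penalty optimizes a linear-SVM cost function...
sgd_svm = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4)

# ...which LinearSVC solves in batch mode via LIBLINEAR. Note it has no
# partial_fit, so it cannot be used for online learning.
batch_svm = LinearSVC(C=1.0)
```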

The train.py module will also fit a model using LIBSVM (Chang and Lin, 2011) via the sklearn svm.SVC API; you can see it used in the Python snippet below from radar-ml’s train.py. LIBSVM implements the Sequential Minimal Optimization (SMO) algorithm for kernelized Support Vector Machines, which is a very powerful method but does not scale well to large data sets or feature vectors from a fit-time perspective.

SVC Fitting Function
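The kernelized fit in train.py uses the sklearn svm.SVC API; a minimal sketch of such a fit follows, with toy data and an illustrative hyperparameter grid:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(2, 1, (50, 10))])
y = np.array([0] * 50 + [1] * 50)

# RBF-kernel SVM via LIBSVM; C and gamma are the key hyperparameters.
# probability=True enables predict_proba at extra fit cost.
search = GridSearchCV(
    SVC(kernel="rbf", probability=True),
    {"C": [1, 10], "gamma": ["scale", 0.1]},
    cv=3)
search.fit(X, y)
svc = search.best_estimator_
```

Because SMO's fit time grows superlinearly with the number of samples, this approach becomes impractical on the augmented data set sizes discussed below.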

Evaluation

Using the test set that was split from the data set in the step above, evaluate the performance of the final classifier. The test set was not used for either model training or calibration validation so these samples are completely new to the classifier. The evaluation function is shown in the Python snippet below which is part of radar-ml’s train.py.

Model Evaluation Function

The evaluation results from using SGDClassifier are shown below.

SGD Training Result Summary

The evaluation results from using SVC are shown below.

SVC Training Result Summary

You can see that the SGD method gives better overall accuracy (89% vs. 84%) on the test set and moreover completes training in about seven minutes (including four epochs of augmentation) vs. about 75 minutes for SVC. These results are not apples-to-apples since the SGD classifier's accuracy benefits from the data augmentation. Using augmentation with SVC is basically infeasible on my i5 3.4 GHz machine since the training times are a non-linear function of the training set size and would run for many days. Note that some of the inaccuracies are likely due to the labeling errors highlighted above.

The resulting SGD-trained linear classifier takes about 250 KB of disk space, whereas SVC results in a classifier (RBF kernel) more than two orders of magnitude larger, around 40 MB. This can be an advantage if you use the SGD classifier in a resource-limited embedded system.

You can find the fitted classifiers and training results here.

Prediction

Using the classifier to make predictions on new data is straightforward as you can see from the Python snippet below. This is taken from radar-ml’s predict.py.

Prediction Function

Conclusion

You should consider using Stochastic Gradient Descent as an optimizer to efficiently train linear classifiers if you have a large number (many thousands) of training examples or features. Also consider using it for online learning, for example in situations where it is necessary for the algorithm to dynamically adapt to new patterns in the data. Different optimization methods or classifiers may be better in other cases.

SGD classifiers are sensitive to feature scaling and require fine-tuning of a number of hyperparameters, including the regularization parameter and the number of iterations, for good performance. You should always use feature normalization and a technique like grid search to find optimal hyperparameters when using this method. If you intend to use the classifier to predict both a class and a confidence level, you should calibrate it first on a data set disjoint from the training set. Always evaluate the final classifier on a test set disjoint from both the training and validation sets.

You will find prediction with the SGD classifier straightforward via the sklearn APIs, and its compact size makes it well suited to resource-limited embedded systems.
