How to make SGD Classifier perform as well as Logistic Regression using parfit

Vinay Patlolla
Towards Data Science
5 min read · Nov 29, 2017


For large datasets, using hyper-parameters optimised by parfit, we can get equivalent performance from SGDClassifier in a third of the time taken by LogisticRegression.

What is SGD Classifier?

SGD Classifier implements regularised linear models with Stochastic Gradient Descent.

So, what is stochastic gradient descent?

Stochastic gradient descent updates the weights using only one randomly chosen training point at a time, unlike gradient descent, which uses the whole training set for every update. As such, stochastic gradient descent is much faster than gradient descent when dealing with large datasets. Here is a nice answer on Quora which explains in detail the difference between gradient descent and stochastic gradient descent.
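To make the difference concrete, here is a minimal NumPy sketch (not from the original post; the toy data and variable names are illustrative) contrasting one full-batch gradient-descent update with one stochastic update on the logistic loss:

```python
import numpy as np

# Toy data: 1,000 points, 2 features, roughly linearly separable labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(2)
lr = 0.1

# Full-batch gradient descent: one update touches all 1,000 points.
grad_full = X.T @ (sigmoid(X @ w) - y) / len(X)
w_gd = w - lr * grad_full

# Stochastic gradient descent: one update touches a single random point.
i = rng.integers(len(X))
grad_one = X[i] * (sigmoid(X[i] @ w) - y[i])
w_sgd = w - lr * grad_one
```

Per epoch, SGD performs one such cheap update per point, while batch gradient descent performs a single expensive update over the whole dataset.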

Why do we even care about SGD Classifier when we already have Logistic Regression?

Logistic Regression in sklearn uses full-batch solvers (such as liblinear or lbfgs) that process the entire training set on every iteration, so it would be better to use SGD Classifier on larger datasets. Another reason you might want to use SGD Classifier is that logistic regression, in its vanilla sklearn form, won’t work if you can’t hold the dataset in RAM, but SGD will still work.

How do we make SGD Classifier perform as well as Logistic Regression?

By default, the SGD Classifier does not perform as well as Logistic Regression; it requires some hyper-parameter tuning.

Hyper-Parameter optimisation using parfit:

Parfit is a new package written by my colleague and fellow MSAN student at the University of San Francisco, Jason Carpenter. Using parallel processing, the package lets the user perform an exhaustive grid search on a model, with the flexibility to specify the validation set and scoring metric, and optionally to plot the scores over the grid of hyper-parameters entered. You can read more about this package in this post on Medium.

A key reason you might want to use parfit is that it lets you evaluate your metric on a separate validation set, unlike GridSearchCV, which uses cross-validation instead. Cross-validation is not a good idea in all cases; one reason to avoid it is when there is a time component in the data. In that case, you might want to create a validation set from the most recent 20% of the observations. Rachel Thomas from fast.ai has written a really good blog post on ‘How (and why) to create a good validation set’.

In this article, I will treat performance on the validation set as the indicator of how well a model performs. The metric used is sklearn.metrics.roc_auc_score.

Parfit on Logistic Regression:

We will use Logistic Regression with ‘l2’ penalty as our benchmark here. For Logistic Regression, we will be tuning 1 hyper-parameter, C.

C = 1/λ, where λ is the regularisation parameter. Smaller values of C specify stronger regularisation. Since parfit fits the models in parallel, we can give a wide range of values for C without worrying too much about the overhead of finding the best model.
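As a quick illustration of the C = 1/λ relationship (a hypothetical example, not from the original post), a smaller C applies stronger shrinkage to the fitted coefficients:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Synthetic data just to show the effect of C on the coefficients.
X, y = make_classification(n_samples=500, random_state=0)

# Smaller C means larger lambda, i.e. stronger regularisation.
small_C = LogisticRegression(C=1e-4).fit(X, y)
large_C = LogisticRegression(C=100.0).fit(X, y)

norm_small = np.linalg.norm(small_C.coef_)
norm_large = np.linalg.norm(large_C.coef_)
```

With C = 1e-4 the coefficient vector is pulled much closer to zero than with C = 100.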

How to use parfit:

bestFit takes in the following parameters:

  1. Model: in our case, Logistic Regression. Notice that the function takes the model class as input, not an instance of it.
  2. paramGrid: a sklearn ParameterGrid object of the hyper-parameters to run your model on
  3. X_train, y_train, X_val, y_val: training and validation sets
  4. metric: the metric used to evaluate your model
  5. bestScore: pass ‘max’ to return the model with the highest score

It returns not only the bestModel with its bestScore, but also allModels with their allScores.

Hyper Parameter Optimisation for Logistic Regression using parfit
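The original code gist is not reproduced here. As a stand-in, here is a minimal scikit-learn-only sketch of what the parfit call does per the description above: fit one Logistic Regression per point of a ParameterGrid over C, score each on a held-out validation set with roc_auc_score, and keep the best (the dataset and variable names are illustrative, not the article's data):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.datasets import make_classification

# Synthetic stand-in for the article's dataset.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

# A wide grid over C, as described in the text.
grid = ParameterGrid({'C': [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]})

def fit_and_score(params):
    model = LogisticRegression(penalty='l2', **params).fit(X_train, y_train)
    return model, roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

# Keep the model with the highest validation score ('max' in bestFit's terms).
best_model, best_score = max((fit_and_score(p) for p in grid),
                             key=lambda r: r[1])
```

parfit additionally fits the grid points in parallel and can plot the scores; this sketch only shows the fit-score-select logic.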

Output:

LogisticRegression took around 26 minutes to find the best model. This long duration is one of the primary reasons why it’s a good idea to use SGDClassifier instead of LogisticRegression. The best roc_auc_score we get is 0.712 for C = 0.0001.

Let’s look at roc_curve for our best model:

Code to plot ROC curve
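The plotting gist itself is not shown here. A typical sketch with sklearn.metrics.roc_curve and matplotlib looks like the following; the validation labels and probabilities below are synthetic stand-ins for y_val and the best model's predict_proba output:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # render off-screen
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Synthetic stand-ins for the validation labels and predicted probabilities.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=500)
probs = np.clip(y_val * 0.6 + rng.normal(0.2, 0.25, size=500), 0, 1)

fpr, tpr, _ = roc_curve(y_val, probs)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f'ROC curve (area = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], linestyle='--')   # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
```

With the real model, `probs` would be `best_model.predict_proba(X_val)[:, 1]`.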
AUC curve for Logistic Regression’s best model

Parfit on SGD Classifier:

As with Logistic Regression, we will use the ‘l2’ penalty for SGD Classifier. One important hyper-parameter to note here is n_iter, which the sklearn documentation defines as

‘The number of passes over the training data (aka epochs).’

n_iter in sklearn is None by default. We set it here to a sufficiently large value (1000). A recently added alternative to n_iter is max_iter; the same advice applies to max_iter.

The alpha hyper-parameter serves a dual purpose. It is both a regularisation parameter and, under the default schedule, the initial learning rate. This means that, in addition to regularising the model coefficients, the output of the model depends on an interaction between alpha and the number of epochs (n_iter) that the fitting routine performs. Specifically, as alpha becomes very small, n_iter must be increased to compensate for the slow learning rate. This is why it is safer (but slower) to set n_iter sufficiently large, e.g. 1000, when searching over a wide range of alphas.

Hyper Parameter Optimisation for SGD Classifier using parfit
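Again the original gist is not shown. Here is a scikit-learn-only sketch of the equivalent search, this time over alpha for SGDClassifier with a logistic loss and a large max_iter (names, grid values, and dataset are illustrative; note that this loss was called 'log' before scikit-learn 1.1):

```python
from sklearn import __version__ as skl_version
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.datasets import make_classification

# Synthetic stand-in for the article's dataset.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)

# The logistic loss was renamed from 'log' to 'log_loss' in scikit-learn 1.1.
loss = 'log_loss' if tuple(map(int, skl_version.split('.')[:2])) >= (1, 1) else 'log'

# Search over alpha with a fixed, sufficiently large number of epochs.
grid = ParameterGrid({'alpha': [1e-4, 1e-3, 1e-2, 1e-1, 1.0],
                      'max_iter': [1000]})

def fit_and_score(params):
    model = SGDClassifier(loss=loss, penalty='l2', random_state=0, **params)
    model.fit(X_train, y_train)
    return model, roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

best_model, best_score = max((fit_and_score(p) for p in grid),
                             key=lambda r: r[1])
```

Each grid point fits independently, which is why parfit can parallelise this search so effectively.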

Output:

Notice that SGD Classifier took only 8 minutes to find the best model, whereas Logistic Regression took 26 minutes, and we ran the SGD Classifier at n_iter = 1000. SGD Classifier gives its best model at α = 0.1. The roc_auc_score of the best model is 0.712, which matches what we got from Logistic Regression to the third decimal place.

Now, let’s take a look at AUC curve on the best model.

AUC curve for SGD Classifier’s best model

We can see that the AUC curve is similar to what we have observed for Logistic Regression.

Summary

And just like that, by using parfit for hyper-parameter optimisation, we were able to find an SGDClassifier that performs as well as Logistic Regression while taking only a third of the time to find the best model.

For sufficiently large datasets, it is best to use SGD Classifier instead of Logistic Regression to produce similar results in much less time.

Bio: I am currently pursuing my Master’s in Analytics (Data Science) at the University of San Francisco and doing my internship at Manifold.ai. Previously, I worked as a Software Storage Engineer at Hewlett Packard Enterprise in the Cloud Division.

Linkedin: https://www.linkedin.com/in/vpatlolla/
