What is the difference between Optimization and Deep Learning and why should you care

Dima Shulga
Towards Data Science
Aug 24, 2019


The most common way to train a neural network today is with gradient descent or one of its variants, like Adam. Gradient descent is an iterative optimization algorithm for finding the minimum of a function. Simply put, in optimization problems we are interested in some metric P, and we want to find a function (or the parameters of a function) that maximizes (or minimizes) this metric on some data (or distribution) D. This sounds just like Machine (or Deep) Learning: we have some metric, like accuracy, or even better precision/recall or F1 score, we have a model with learnable parameters (our network), and we have our data (the training and test sets). Using gradient descent we are “searching” for, or “optimizing”, our model’s parameters in a way that will eventually maximize our metric (accuracy) on our data, both on the training and the test sets.
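To make the iterative nature of gradient descent concrete, here is a minimal sketch on a toy one-dimensional problem (the function, learning rate, and step count are illustrative choices, not anything from the article):

```python
# Minimal gradient descent sketch: minimize f(w) = (w - 3)^2.
# The gradient is f'(w) = 2 * (w - 3), so each step moves w toward the minimum at 3.

def gradient_descent(lr=0.1, steps=100):
    w = 0.0  # arbitrary initial parameter value
    for _ in range(steps):
        grad = 2 * (w - 3)   # gradient of the loss at the current w
        w -= lr * grad       # step in the direction that reduces the loss
    return w

print(round(gradient_descent(), 4))  # converges to the minimum at w = 3.0
```

A real network does exactly this, just over millions of parameters, with the gradient computed by backpropagation.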

From The general inefficiency of batch training for gradient descent learning

There are (at least) two major differences between Optimization and Deep Learning, and understanding those differences is important for achieving better results in Deep Learning.

The first difference is the metric function. In optimization, we have a single well-defined metric that we want to minimize (or maximize). Unfortunately, in Deep Learning we often care about metrics that are impossible or very hard to optimize directly. For example, in classification problems, we may be interested in the “Accuracy” or “F1 score” of our model. The problem with accuracy and F1 score is that they are not differentiable functions, so we can’t use gradient descent because we can’t calculate the gradient. For that reason, we use proxy metrics like negative log-likelihood (or cross-entropy) in the hope that minimizing the proxy function will maximize our original metric. Those proxy metrics are not always bad and may have some advantages, but we need to remember the real value we care about, not the proxy metric.
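The distinction between the non-differentiable metric and its smooth proxy can be seen in a small sketch (the probabilities and labels below are made up for illustration):

```python
import math

def accuracy(probs, labels):
    # Thresholding makes this a step function of the model's outputs:
    # its gradient is zero almost everywhere, so gradient descent can't use it.
    preds = [1 if p >= 0.5 else 0 for p in probs]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def cross_entropy(probs, labels):
    # Smooth, differentiable proxy that gradient descent can actually minimize.
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(probs, labels)) / len(labels)

probs  = [0.9, 0.6, 0.2, 0.4]  # predicted P(class = 1) for four examples
labels = [1, 1, 0, 0]

print(accuracy(probs, labels))                    # 1.0 -- all four are correct
print(round(cross_entropy(probs, labels), 4))     # 0.3375 -- still room to improve
```

Note that the accuracy is already perfect while the cross-entropy is not: pushing the loss lower makes the model more confident without changing the metric we actually report.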

One of the ways to make sure we care about the original metric is by using Early Stopping. Every epoch, we evaluate our model using the original metric (the accuracy or F1 score) on some validation set and stop the training once we start to overfit. It is also a good practice to print the accuracy (or any other metric) every epoch to better understand the performance of our model.
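A common way to implement early stopping is with a “patience” counter: stop once the validation metric has failed to improve for a fixed number of epochs. The sketch below simulates this with a hardcoded list of per-epoch validation scores standing in for a real validation pass (both the scores and the function name are illustrative):

```python
# Early stopping with patience. val_scores simulates the validation
# accuracy (or F1 score) measured at the end of each epoch.

def train_with_early_stopping(val_scores, patience=3):
    best_score, best_epoch, waited = float("-inf"), 0, 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs: stop training
    return best_epoch, best_score

# Validation accuracy peaks at epoch 3, then the model starts to overfit.
scores = [0.70, 0.78, 0.82, 0.85, 0.84, 0.83, 0.81]
print(train_with_early_stopping(scores))  # (3, 0.85)
```

In practice you would also save a checkpoint whenever the best score improves, and restore that checkpoint after stopping.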

The second important difference is the data. In optimization, we care only about the data at hand. We know that finding the maximum value is the best solution to our problem. In Deep Learning, we mostly care about generalization, i.e., the data we don’t have. It means that even if we find the maximum (or minimum) value for the data we do have (the training set), we may still get poor results on the data we don’t have. It is very important to split our data into different parts and treat the test set as the “data we don’t have”. We can’t make any decision based on the test set. To make decisions about hyperparameters, model architecture, or early stopping criteria we can use the validation set, but never the test set.
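A minimal sketch of such a three-way split (the fractions and the seeded shuffle are illustrative choices; in practice a library utility would typically do this):

```python
import random

# Split data into train/validation/test. The test set is held out and only
# touched once, at the very end; all tuning decisions use the validation set.

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=42):
    data = data[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(data)   # shuffle before splitting
    n = len(data)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = data[:n_test]
    val = data[n_test:n_test + n_val]
    train = data[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```

The three parts are disjoint by construction, which is what lets the test set play the role of “data we don’t have”.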

It doesn’t end there. We’re training our model using gradient descent by pushing the parameters in the “right” direction. But what is “right”? Is it right for all the data, or only for our training set? This is relevant when we’re choosing the batch size, for example. Some may claim that by using the whole training set (what’s called Batch Gradient Descent) we get the “true” gradient. But it is true only for the data we have. To push our model in the “right” direction, we need to approximate the gradient of the data we don’t have. This may be achieved by using a much smaller batch size (what’s called Mini-batch or Stochastic Gradient Descent). The paper cited below shows that the best results may be achieved using a batch size of only 1 (what’s sometimes called On-line Training). By using a smaller batch size, we introduce noise into our gradients, which may improve generalization and reduce overfitting. The table below shows the performance of “batch” vs “on-line” training on more than 20 datasets. We can see that “on-line” is better on average.
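The batch-size choice can be sketched on a toy one-parameter regression problem; the data, learning rate, and epoch count below are made up for illustration (with noiseless toy data both settings reach the same optimum, so this shows only the mechanics of full-batch vs per-example updates, not the generalization effect):

```python
import random

# Batch vs mini-batch gradient descent on a 1-D least-squares problem:
# loss(w) = mean over the dataset of (w * x - y)^2, with true parameter w = 2.

xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x for x in xs]

def gradient(w, batch):
    # d/dw of the mean squared error over one batch
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def sgd(batch_size, lr=0.001, epochs=50, seed=0):
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            w -= lr * gradient(w, data[i:i + batch_size])
    return w

print(round(sgd(batch_size=len(xs)), 3))  # full-batch: one update per epoch
print(round(sgd(batch_size=1), 3))        # "on-line": one noisy update per example
```

With batch size 1, each update follows the gradient of a single example, which is where the noise that can help generalization on real data comes from.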

From The general inefficiency of batch training for gradient descent learning

Machine Learning problems are sometimes referred to as optimization problems, but they are not quite the same thing. It is important to know the differences and to address them.

Reference: “Deep Learning” by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
