Lecture Notes: Regularization for Deep Learning

Daniel Wu
Jan 17, 2020 · 7 min read

Learn from my notes for the Deep Learning course on DeepLearning.ai

The following are my Week 1 lecture notes covering the topic of regularization. These are notes for the second of five courses in the Deep Learning Specialization on DeepLearning.ai. If you want more background on neural networks and deep learning, read this introduction.

Week 1 — Practical aspects of Deep Learning

Applied machine learning is iterative and experimental. The iterations involve changing the number of layers, hidden units, learning rates, activation functions, and so on.

Train/Dev/Test Sets

When a dataset has up to around 10,000 examples, a simple split is suitable. Common practice is a 70/30 train/test split or a 60/20/20 train/dev/test split. The dev set is the hold-out (cross-validation) set used to see which model works best, while the final model is evaluated on the test set so that the performance estimate is not biased.

In a significantly larger, “big data” dataset with millions of examples, the dev and test sets can each be as small as 1% of the data; 1% of 1 million examples is 10,000. For example, a 98/1/1% split gives dev and test sets of 10,000 examples each.

In practice, the datasets can be mismatched, where training is done on one distribution (e.g., cat pictures from webpages) and the dev/test sets come from another (e.g., cat pictures from users of the app).

The recommendation is to keep the dev and test sets from the same distribution of data.

If an unbiased estimate of final performance is not needed, it may be suitable to have only a train/dev split without a test set. A minimal sketch of a 98/1/1 split follows.
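As an illustration (my own, not from the lecture), here is a minimal sketch of a 98/1/1 split on a large, already-shuffled dataset, assuming X and Y are numpy arrays with one example per row:

import numpy as np

# Minimal sketch (my own): split an already-shuffled dataset 98/1/1.
def split_98_1_1(X, Y):
    m = X.shape[0]                      # number of examples
    n_dev = n_test = m // 100           # 1% each; 10,000 when m = 1,000,000
    n_train = m - n_dev - n_test        # remaining ~98% for training
    X_train, Y_train = X[:n_train], Y[:n_train]
    X_dev, Y_dev = X[n_train:n_train + n_dev], Y[n_train:n_train + n_dev]
    X_test, Y_test = X[n_train + n_dev:], Y[n_train + n_dev:]
    return (X_train, Y_train), (X_dev, Y_dev), (X_test, Y_test)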

Bias/Variance

In model optimization, the relationship between the train-set and dev-set error rates tells us about the bias and variance levels. The train-set error indicates the bias level, while the gap between the train-set and dev-set errors indicates the variance level.

Assuming that human error is about 0% in determining whether a picture is a cat (y=1) or not (y=0), four scenarios of bias and variance are illustrated below. Based on the scenarios, we can prescribe ways to make improvements.

  • High Variance: Dev set error >> Train set error. For example, an 11% dev set error versus a 1% train set error indicates high variance, meaning the model did not generalize well from the training data to the dev set.
  • High Bias: Train set error >> Human error, with train and dev set errors comparable. Given that human error is about 0%, a 15% train error and a 16% dev error indicate high bias. The model does not even fit the training data well.
  • High Bias and High Variance: Train set error >> Human error and Dev set error >> Train set error. The model neither fits the training data nor generalizes well.
  • Low Bias and Low Variance: Train, dev, and human errors are all comparably low. The model achieves a good fit and generalizes well. (A rough diagnosis of these four cases is sketched in code after the figure.)
Bias and Variance (Source: DeepLearning.ai / Andrew Ng)
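To make the four scenarios concrete, here is a rough sketch (my own, not from the course) that labels the regime from the train and dev errors, assuming human error is about 0% and using an arbitrary threshold:

# Rough sketch (my own): label bias/variance from error rates given as fractions
# (e.g., 0.01 = 1%). The 0.05 threshold is an arbitrary choice for illustration.
def diagnose(train_err, dev_err, human_err=0.0, tol=0.05):
    high_bias = (train_err - human_err) > tol      # poor fit on the training set
    high_variance = (dev_err - train_err) > tol    # poor generalization to the dev set
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias and low variance"

print(diagnose(0.01, 0.11))   # -> high variance
print(diagnose(0.15, 0.16))   # -> high bias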

Visually, imagine a dataset of x’s and o’s distributed along the x1 and x2 axes. If we try to fit this data with a straight, solid blue diagonal line, the result shows high bias: the line underfits, and certain x’s to its left are misclassified. If we instead fit a dotted blue, roughly circular boundary around the o’s, this describes low bias, because fewer o’s and x’s are miscategorized.

Starting from the solid blue line, suppose we add a squiggly detour in the middle so that it also captures the stray x on the left and the o on the right. This part-diagonal, part-squiggly boundary represents high variance: it would not generalize well to other data because it is overfitting.

High Bias and High Variance Example (Source: DeepLearning.ai / Andrew Ng)

Regularizing your neural network

To compensate for variance issues when the neural network is overfitting, different approaches to reducing overfitting can be applied.

  • Logistic Regression: L2 regularization adds a lambda-weighted penalty on ‘w’ to the cost function to reduce overfitting. The cost function gains the term “(lambda / 2m) * ||w||²”, where ||w||² is the squared norm of w. (Note: parameter ‘b’ is usually not regularized, since ‘b’ is a single real number whereas ‘w’ is high dimensional. Another approach, called L1 regularization, results in sparse ‘w’ vectors and in practice is not very helpful here. In python, lambda is a reserved word, so lambd is used instead.)

J(w, b) = (1/m) * sum( Loss(y-hat, y) ) + (lambda / 2m) * ||w||²

In python, the L2 term ||w||² can be computed as np.dot(w.T, w), i.e., the transpose of ‘w’ times ‘w’.

Logistic regression using L2 regularization (Source: DeepLearning.ai / Andrew Ng)
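To make the formula concrete, here is a minimal numpy sketch (my own illustration) of the regularized cost, assuming A holds the predictions y-hat and both A and Y have shape (1, m):

import numpy as np

# Minimal sketch: cross-entropy cost plus the L2 penalty (lambda / 2m) * ||w||².
def l2_regularized_cost(A, Y, w, lambd):
    m = Y.shape[1]
    cross_entropy = -(1.0 / m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    l2_penalty = (lambd / (2.0 * m)) * np.sum(np.square(w))   # equals w.T @ w for a column vector
    return cross_entropy + l2_penalty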
  • Neural net regularization: The cost equation includes (lambda / 2m) times the sum of the squared weight matrices W over all layers. This is the same idea as L2 regularization (a.k.a. weight decay), but the matrix norm is traditionally called the Frobenius norm. In gradient descent, dW for layer ‘l’ then includes the lambda term.

J(W, b) = (1/m) * sum( Loss(y-hat, y) ) + (lambda / 2m) * sum over layers l of ||W[l]||²_F

dW[l] = (from back propagation) + (lambda / m) * W[l]
W[l] = W[l] - alpha * dW[l], where alpha is the learning rate

The reason L2 is also called weight decay regularization is that each update effectively multiplies W by a factor slightly smaller than 1, namely (1 - alpha * lambda / m), before applying the usual gradient step.

Neural Network using L2 regularization (Source: DeepLearning.ai / Andrew Ng)
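Here is a minimal sketch (my own, assuming parameters and grads dictionaries keyed as W1, dW1, …) of one regularized gradient-descent update, showing where the lambda term enters:

# Minimal sketch: one gradient-descent update per layer with L2 (Frobenius-norm)
# regularization. grads["dW" + str(l)] is the unregularized gradient from backprop.
def update_with_weight_decay(parameters, grads, lambd, alpha, m, num_layers):
    for l in range(1, num_layers + 1):
        W = parameters["W" + str(l)]
        dW = grads["dW" + str(l)] + (lambd / m) * W      # add the regularization term
        # Equivalent view: W is first scaled by (1 - alpha * lambd / m), hence "weight decay".
        parameters["W" + str(l)] = W - alpha * dW
    return parameters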

Why does regularization reduce overfitting? In the cost equation, when lambda is large, the weights W are driven toward small values. As the regularization part of the equation becomes more dominant, the variance is reduced. Below, the right graph with high variance gets pulled leftward towards more bias, landing on the well-fit middle graph. Note that the larger the lambda, the more the neural network behaves like a much simpler model (closer to logistic regression), effectively ignoring many nodes in the network.

How does regularization prevent overfitting (Source: DeepLearning.ai / Andrew Ng)

Similarly, when using tanh(z), a large lambda keeps W (and therefore z) small, so z stays in the roughly linear region of tanh. A more linear network is like the left, higher-bias graph and is less prone to overfit.
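A quick numeric check (my own, not from the lecture) shows that tanh(z) stays close to the straight line z when z is small, which is why small weights keep the layer nearly linear:

import numpy as np

z_small = np.array([-0.10, -0.05, 0.05, 0.10])
z_large = np.array([-3.0, -1.5, 1.5, 3.0])
print(np.tanh(z_small) - z_small)   # tiny differences: tanh is nearly linear here
print(np.tanh(z_large) - z_large)   # large differences: tanh saturates for large |z|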

  • Dropout regularization: Dropout literally drops nodes at random based on a probability, with results similar to L2 regularization. For each example, dropout uses the probability to decide whether each node is kept or dropped, resulting in a much simpler network. Each training example is run with a different set of dropped nodes.
Dropout regularization (Source: DeepLearning.ai / Andrew Ng)

Implementing (inverted) dropout in python is a three-step process. First, create a matrix d3 of random values with the shape of a3 and compare it against the keep_prob value; keep_prob is the probability that a node is kept. For example, if keep_prob = 0.8, there is a 20% chance a node will be dropped, and d3 becomes a matrix of 0s and 1s. Second, a3 is multiplied by d3, which zeroes out roughly 20% of the activations. Lastly, a3 is divided by keep_prob to ensure that the expected total contribution to z stays the same; this inverted scaling makes test time easier. During test, we do not apply dropout. (A short sketch follows the figure below.)

Implemented dropout (Source: DeepLearning.ai / Andrew Ng)
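Here is a minimal sketch of those three steps for layer 3, following the lecture’s a3 / d3 / keep_prob naming (the activation values themselves are made up for illustration):

import numpy as np

np.random.seed(1)
a3 = np.random.randn(5, 4)                     # hypothetical activations for layer 3
keep_prob = 0.8                                # probability that a node is kept

d3 = np.random.rand(*a3.shape) < keep_prob     # step 1: mask of 0s and 1s
a3 = a3 * d3                                   # step 2: drop ~20% of the activations
a3 = a3 / keep_prob                            # step 3: invert the scaling so the expected value of z stays the same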

With dropout, the weights get spread out across units so the network cannot rely on any specific unit; this has the effect of shrinking the squared norm of the weights, which regularizes much like L2. Also, keep_prob can vary by layer: for example, a large layer 2 might use keep_prob of 0.5, whereas smaller later layers may want a higher keep_prob. A disadvantage of dropout is that the cost function J is less well defined, so you may not see the cost curve decrease cleanly. One can turn dropout off (keep_prob = 1) to verify that gradient descent is working, then turn it back on.

Understanding dropout (Source: DeepLearning.ai / Andrew Ng)
  • Data augmentation: One can include flipped, translated, or subtly distorted images to add to the training data.
  • Early stopping: As gradient descent runs, plot the training error (or cost J) and the dev error. Based on the progress of the iterations, the neural network may perform best on the dev set at a specific point, so training is stopped early there. This works because the weights w start small and grow larger as training continues, so stopping early keeps w at a mid-size value, which has a similar effect to a larger lambda in regularization. However, the downside is that it prevents the independent treatment of the optimization problem and the not-overfitting (variance) problem, breaking what is called orthogonalization. Instead, the recommendation is L2 regularization, even though trying multiple values of lambda is more computationally intensive. (A minimal sketch follows this list.)
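As referenced in the early-stopping bullet, here is a minimal sketch (my own, with hypothetical train_step and dev_error hooks) that stops once the dev error has not improved for a few epochs:

# Minimal sketch: stop when dev error has not improved for `patience` epochs.
# train_step(model) runs one epoch; dev_error(model) returns the dev-set error.
# Both hooks are hypothetical and stand in for your own training code.
def train_with_early_stopping(model, train_step, dev_error, max_epochs=100, patience=5):
    best_err, best_epoch, best_model = float("inf"), 0, None
    for epoch in range(max_epochs):
        train_step(model)
        err = dev_error(model)
        if err < best_err:
            best_err, best_epoch, best_model = err, epoch, model   # in practice, snapshot the weights here
        elif epoch - best_epoch >= patience:
            break                                                   # dev error stopped improving
    return best_model, best_err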

Next, we will continue with setting up optimization problems and algorithms. If you missed my last post on a neural network and deep learning introduction, read here.

If you have any comments/suggestions, please share your feedback below. Hope this has been helpful to your learning. Thanks for reading and learning along with me.
