
Deep learning concepts – PART 2

PART1: https://medium.com/towards-data-science/deep-learning-concepts-part-1-ea0b14b234c8

Loss function choice: MSE versus Cross-Entropy:

Cross-entropy is the preferred loss function for many models, including classifiers, segmentation networks, and generative models. Here we explain the differences in behavior:

Suppose you have just three training data items. Your neural network uses softmax activation for the output neurons, so the three output values can be interpreted as probabilities. For example, suppose the neural network's computed outputs and the target (aka desired) values are as follows:

computed       | targets              | correct?
-----------------------------------------------
0.3  0.3  0.4  | 0  0  1 (democrat)   | yes
0.3  0.4  0.3  | 0  1  0 (republican) | yes
0.1  0.2  0.7  | 1  0  0 (other)      | no

This neural network has a classification error of 1/3 = 0.33, or equivalently a classification accuracy of 2/3 = 0.67. Notice that the NN just barely gets the first two training items correct and is way off on the third training item. But now consider the following neural network:

computed       | targets              | correct?
-----------------------------------------------
0.1  0.2  0.7  | 0  0  1 (democrat)   | yes
0.1  0.7  0.2  | 0  1  0 (republican) | yes
0.3  0.4  0.3  | 1  0  0 (other)      | no

This NN also has a classification error of 1/3 = 0.33. But this second NN is better than the first because it nails the first two training items and just barely misses the third training item. To summarize, classification error is a very crude measure of error.

Now consider cross-entropy error. The cross-entropy error for the first training item in the first neural network above is:

-( (ln(0.3)*0) + (ln(0.3)*0) + (ln(0.4)*1) ) = -ln(0.4)

Notice that in the case of neural network classification, the computation is a bit odd because all terms but one will go away. (There are several good explanations of how to compute cross-entropy on the Internet.) So, the average cross-entropy error (ACE) for the first neural network is computed as:

-(ln(0.4) + ln(0.4) + ln(0.1)) / 3 = 1.38

The average cross-entropy error for the second neural network is:

-(ln(0.7) + ln(0.7) + ln(0.3)) / 3 = 0.64

Notice that the average cross-entropy error for the second, superior neural network is smaller than the ACE error for the first neural network. The ln() function in cross-entropy takes into account the closeness of a prediction and is a more granular way to compute error.
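The two ACE values can be checked with a few lines of NumPy. This is a minimal sketch that just plugs in the example outputs and one-hot targets from the tables above:

import numpy as np

# Computed softmax outputs and one-hot targets for the two example networks
nn1_outputs = np.array([[0.3, 0.3, 0.4],
                        [0.3, 0.4, 0.3],
                        [0.1, 0.2, 0.7]])
nn2_outputs = np.array([[0.1, 0.2, 0.7],
                        [0.1, 0.7, 0.2],
                        [0.3, 0.4, 0.3]])
targets = np.array([[0, 0, 1],
                    [0, 1, 0],
                    [1, 0, 0]])

def average_cross_entropy(outputs, targets):
    # Only the term for the target class survives; the other targets are 0
    return -np.mean(np.sum(targets * np.log(outputs), axis=1))

print(average_cross_entropy(nn1_outputs, targets))  # ~1.38
print(average_cross_entropy(nn2_outputs, targets))  # ~0.64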

Now consider the advantage of cross-entropy error over mean squared error. Briefly, during back-propagation training, you want to drive output node values to either 1.0 or 0.0 depending on the target values. If you use MSE, the weight adjustment factor (the gradient) contains a term of (output) * (1 - output). As the computed output gets closer and closer to either 0.0 or 1.0, the value of (output) * (1 - output) gets smaller and smaller. For example, if output = 0.6 then (output) * (1 - output) = 0.24, but if output = 0.95 then (output) * (1 - output) = 0.0475. As the adjustment factor gets smaller, the change in weights gets smaller, and training can stall out, so to speak.

But if you use cross-entropy error, the (output) * (1 - output) term cancels out (the math is very cool). So the weight changes don't keep shrinking, and training isn't as likely to stall out. Note that this argument assumes you're doing neural network classification with softmax output node activation.
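A small sketch of that effect for a single output node with target 1.0. The output values below are illustrative; the point is that the MSE gradient keeps the (output)(1 - output) factor, while cross-entropy with softmax reduces to (output - target):

import numpy as np

# Output-layer gradient factors as the computed output approaches saturation
outputs = np.array([0.6, 0.8, 0.95, 0.99])
target = 1.0

mse_grad = (outputs - target) * outputs * (1 - outputs)  # MSE keeps the (output)(1 - output) term
ce_grad = outputs - target                               # cross-entropy + softmax: the term cancels

for o, g_mse, g_ce in zip(outputs, mse_grad, ce_grad):
    print(f"output={o:.2f}  MSE grad={g_mse:+.4f}  cross-entropy grad={g_ce:+.4f}")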

Debugging deep learning models:

It is always a good idea to look at the top few incorrect predictions (the ones with the largest loss) for each label class in the validation set. This usually gives great insight into how your model behaves and into how good or clean your labeled data is.
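A minimal sketch of that inspection step, assuming a trained Keras classifier model and validation arrays val_x, val_y (one-hot). The names are placeholders, not from the post:

import numpy as np

probs = model.predict(val_x)
# Per-example cross-entropy loss, clipped to avoid log(0)
per_example_loss = -np.sum(val_y * np.log(np.clip(probs, 1e-15, 1)), axis=1)

true_classes = np.argmax(val_y, axis=1)
for cls in np.unique(true_classes):
    idx = np.where(true_classes == cls)[0]
    worst = idx[np.argsort(per_example_loss[idx])[::-1][:5]]  # 5 highest-loss examples of this class
    print(f"class {cls}: worst indices {worst}, losses {per_example_loss[worst]}")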

Pseudo-Labeling:

http://deeplearning.net/wp-content/uploads/2013/03/pseudo_label_final.pdf

A simple pseudo-labeling implementation in Keras

Basic idea: train a model on the labeled data, use that model to predict labels for the unlabeled data, and then include the pseudo-labeled data in training alongside the labeled data, treating the pseudo-labels as if they were true labels.

From the blog above: an efficient approach to pseudo-labeling, as mentioned by the winner of the 2015 National Data Science Bowl, is to blend original data and pseudo-labeled data in each mini-batch at a ratio of roughly 67:33. This is also mentioned in fast.ai's lesson video. A generic sketch of the basic loop is shown below, followed by the fast.ai implementation.
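This is a minimal sketch of the loop described above, assuming a Keras-style classifier model and arrays x_train, y_train, x_unlabeled (placeholder names). Instead of mixing each mini-batch, it approximates the 67:33 ratio by making the pseudo-labeled subset about half the size of the labeled set before shuffling:

import numpy as np

model.fit(x_train, y_train)                 # 1. train on labeled data
pseudo_labels = model.predict(x_unlabeled)  # 2. predict soft labels for the unlabeled data

n_pseudo = len(x_train) // 2                # roughly a 67:33 labeled/pseudo mix
pick = np.random.choice(len(x_unlabeled), n_pseudo, replace=False)

x_mixed = np.concatenate([x_train, x_unlabeled[pick]])
y_mixed = np.concatenate([y_train, pseudo_labels[pick]])
model.fit(x_mixed, y_mixed)                 # 3. retrain on the blended set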

Another implementation of pseudo-labeling in fast.ai lesson can be found in lesson 7 notebook here.

# Predict pseudo-labels for the (unlabeled) test set
preds = model.predict([conv_test_feat, test_sizes], 
                      batch_size=batch_size*2)
gen = image.ImageDataGenerator()
# Batches of pseudo-labeled test data, labeled validation data, and labeled training data
test_batches = gen.flow(conv_test_feat, preds, batch_size=16)
val_batches = gen.flow(conv_val_feat, val_labels, batch_size=4)
batches = gen.flow(conv_feat, trn_labels, batch_size=44)
# MixIterator (the custom iterator from the lesson) interleaves the three generators
mi = MixIterator([batches, test_batches, val_batches])
bn_model.fit_generator(mi, mi.N, nb_epoch=8, 
                       validation_data=(conv_val_feat, val_labels))

In the code above, Jeremy Howard, the lecturer of the fast.ai course, wrote a custom iterator (MixIterator) for mixing training data and pseudo-labeled data.

Log Loss

Log loss measures the performance of a classification model whose prediction is a probability value between 0 and 1. The goal of machine learning models is to minimize this value: a perfect model would have a log loss of 0, and log loss increases as the predicted probability diverges from the actual label.

Log Loss vs Accuracy

  • Accuracy is the count of predictions where the predicted value equals the actual value. Accuracy is not always a good indicator because of its yes-or-no nature.
  • Log Loss takes into account the uncertainty of your prediction based on how much it varies from the actual label. This gives us a more nuanced view into the performance of our model.

Consider the range of possible log loss values given a true observation (isDog = 1). As the predicted probability approaches 1, log loss slowly decreases. As the predicted probability decreases, however, log loss increases rapidly. Log loss penalizes both types of errors, but especially predictions that are confident and wrong!

import numpy as np

def logloss(true_label, predicted, eps=1e-15):
  # Clip the prediction to avoid log(0)
  p = np.clip(predicted, eps, 1 - eps)
  if true_label == 1:
    return -np.log(p)
  else:
    return -np.log(1 - p)
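For example, calling the function above shows the confident-and-wrong penalty directly:

# Confident and correct vs. confident and wrong for a positive example (true label = 1)
print(logloss(1, 0.95))  # ~0.05: small penalty
print(logloss(1, 0.05))  # ~3.00: large penalty for a confident wrong prediction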

In binary classification (M = 2), the formula equals:

-( y*log(p) + (1-y)*log(1-p) )

where y is the true label (0 or 1) and p is the predicted probability that the observation is of class 1.

Multi-class Classification

In multi-class classification (M>2), we take the sum of log loss values for each class prediction in the observation.

-sum over c = 1..M of y_c * log(p_c)

that is, the sum of all log loss values across classes, where y_c is 1 if the observation belongs to class c and 0 otherwise, and p_c is the predicted probability for class c.
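A minimal NumPy sketch of the multi-class computation, assuming one-hot true labels and a row of predicted probabilities per observation:

import numpy as np

def multiclass_logloss(y_true_onehot, y_pred, eps=1e-15):
  # Sum the per-class terms for each observation, then average over observations
  p = np.clip(y_pred, eps, 1 - eps)
  return -np.mean(np.sum(y_true_onehot * np.log(p), axis=1))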

Why the Negative Sign?

Log loss uses the negative log to provide an easy metric for comparison. It takes this approach because the log of a number less than 1 is negative, which is confusing to work with when comparing the performance of two models.

Log Loss vs Cross-Entropy

Log loss and cross-entropy are slightly different depending on the context, but in machine learning, when calculating error rates between 0 and 1, they resolve to the same thing. As a demonstration, where p and q are the sets p ∈ {y, 1−y} and q ∈ {ŷ, 1−ŷ}, we can rewrite cross-entropy in terms of the following quantities:

  • p = set of true labels
  • q = set of predictions
  • y = true label
  • ŷ = predicted probability

Substituting these into the cross-entropy -sum over x of p(x)*log(q(x)) gives

-( y*log(ŷ) + (1-y)*log(1-ŷ) )

which is exactly the same as log loss!
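A quick numeric check of that equivalence, using an illustrative label and probability:

import numpy as np

y, y_hat = 1, 0.8                                            # true label and predicted probability
cross_entropy = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
log_loss = -np.log(y_hat) if y == 1 else -np.log(1 - y_hat)  # the piecewise form from logloss() above
print(cross_entropy, log_loss)                               # both ~0.223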

Data leakage:

Here are good blog posts explaining data leakage; they apply to machine learning and data mining in general:

http://machinelearningmastery.com/data-leakage-machine-learning/

https://medium.com/machine-intelligence-report/new-to-machine-learning-avoid-these-three-mistakes-73258b3848a4

