Reasons why surrogate loss functions are pivotal for classification in machine learning

Jaime Dantas
Towards Data Science
8 min read · Nov 12, 2020


Image by Author — Canoa Quebrada, Brazil

This post will dive deep into the concepts and theory behind hinge loss, logistic loss and binary loss for classification in machine learning. We'll implement the perceptron algorithm in MATLAB, and see how to select the best classifier based on surrogate loss functions.

By the end of this post, you’ll know how to perform classification using the perceptron and the benefits and downsides of each loss function.

Dataset

We’ll use a 3-dimensional dataset with N = 200 data points. This dataset was originally proposed by Dr. Ruth Urner on one of her assignments for a machine learning course. In the repository below, you’ll find two TXT files: fg_inputs.txt and fg_outputs.txt.

These files contain the input and output vectors. Using MATLAB, we can import them via Home > Import Data. Now, let's plot this dataset.

Dataset FG
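The import step can also be scripted rather than done through the wizard. A minimal sketch, assuming the files are plain numeric text in the current folder (readmatrix needs MATLAB R2019a or newer; older releases could use dlmread instead):

% Load the inputs and labels from the two text files.
% input is an N x d matrix, output is an N x 1 vector of labels (assumed +1 / -1).
input  = readmatrix('fg_inputs.txt');
output = readmatrix('fg_outputs.txt');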

Perceptron

The perceptron algorithm is one of the oldest machine learning algorithms still widely used today. It is used in linear classification for different types of datasets. The perceptron is one of the many algorithms used for binary classification, and it is very simple to understand. I recommend reading this short article so you can learn in detail how it works. If you really want to learn the math behind it, these lectures may help you.

Long story short, the perceptron works by analyzing the sign of the dot product between x and w multiplied by the label t.

Every time h(x) is less than or equal to 0, we update w with a new value.
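In symbols (a standard way of writing the rule, with the bias b absorbed into w through the column of ones described below): for a training point (x, t), h(x) = t⟨w, x⟩, and whenever h(x) ≤ 0 the weight vector is updated as w ← w + t·x.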

Enough said, let's implement it in MATLAB! I created a MATLAB function that implements the perceptron algorithm. It receives the matrix of inputs x and the vector of labels t. The first thing I did was to add a column of ones to x [4].

Then, I concatenated x with t. The reason is that this makes the data easier to shuffle. I used MATLAB's randperm() function to shuffle the dataset. After that, I initialized the vector w with zeros. Since we are adding the term b, i.e. w0, I initialized it with b = 0 [2].

I also kept track of the number of passes the algorithm makes over the dataset in the variable updates. Finally, at the end of the algorithm, I performed a Euclidean normalization of the vector w [3]. The algorithm is shown below.
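A minimal sketch of what such a function could look like is shown below. The cap on the number of passes is my own addition, since a plain perceptron only terminates on its own when the data is linearly separable, which this dataset is not:

function [w, updates] = perceptron(x, t)
% Perceptron with the bias term folded into w via a column of ones.
% Returns the Euclidean-normalized weight vector w and the number of
% passes made over the dataset in updates.
    max_passes = 100;             % cap on passes (my assumption, since the data need not be separable)
    [N, d] = size(x);
    x = [ones(N, 1), x];          % add a column of ones so that w0 plays the role of b
    data = [x, t];                % concatenate x and t so they shuffle together
    data = data(randperm(N), :);  % shuffle the dataset
    x = data(:, 1:d + 1);
    t = data(:, end);

    w = zeros(d + 1, 1);          % initialize w (including b = w0) with zeros
    updates = 0;
    for pass = 1:max_passes
        mistakes = 0;
        for i = 1:N
            if t(i) * (x(i, :) * w) <= 0    % misclassified or on the boundary
                w = w + t(i) * x(i, :)';    % perceptron update rule
                mistakes = mistakes + 1;
            end
        end
        updates = updates + 1;              % one more pass over the dataset
        if mistakes == 0
            break;                          % converged: a full pass with no mistakes
        end
    end
    w = w / norm(w);              % Euclidean normalization of w
end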

In order to run our function, we have to execute the command [w, updates] = perceptron(input, output) in the command window of MATLAB.

The perceptron function will return the normalized vector w and the number of updates performed.

Surrogate loss function analysis

To compute the losses, I executed the perceptron function 20 times and stored the vector w_i from each run. I also added a column of ones to the matrix x.
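A sketch of this step, keeping the variable names input and output from before (the matrix W, one weight vector per column, is my own naming):

runs = 20;
W = zeros(size(input, 2) + 1, runs);      % one weight vector per column
for i = 1:runs
    [W(:, i), ~] = perceptron(input, output);
end
X = [ones(size(input, 1), 1), input];     % inputs with the added column of ones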

Let's first compute the empirical binary loss for each of our predictors and store the average over the data points in the vector binary. This loss is simply the fraction of the N data points whose label t disagrees with the sign of ⟨w, x⟩ [4].
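A sketch in MATLAB, using the W and X from above (points lying exactly on the boundary count as errors here, matching the perceptron's update condition):

% Empirical binary (0-1) loss for each of the 20 runs.
binary = zeros(runs, 1);
for i = 1:runs
    margins = output .* (X * W(:, i));   % t * <w, x> for every data point
    binary(i) = mean(margins <= 0);      % fraction of misclassified points
end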

The same process was done for the hinge loss. This time, however, the loss on each data point is max(0, 1 - t⟨w, x⟩), averaged over the dataset.
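In MATLAB, following the same pattern:

% Empirical hinge loss for each run.
hinge = zeros(runs, 1);
for i = 1:runs
    margins = output .* (X * W(:, i));
    hinge(i) = mean(max(0, 1 - margins));   % max(0, 1 - t<w, x>) per point
end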

Finally, I calculated the logistic loss.
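The logistic loss of w on a data point is log(1 + exp(-t⟨w, x⟩)), again averaged over the dataset. I use the base-2 logarithm below (an assumption on my part), so that a point sitting exactly on the decision boundary contributes a loss of 1, which matches the behaviour described later on:

% Empirical logistic loss for each run (base-2 log).
logistic = zeros(runs, 1);
for i = 1:runs
    margins = output .* (X * W(:, i));
    logistic(i) = mean(log2(1 + exp(-margins)));
end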

In order to analyze the training losses for this dataset, I created the compare_losses.m script to plot the three losses for the 20 executions.
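Its core could be as simple as the following, building on the vectors computed above (the right panel plots the per-point losses of the first run, w1, against the margin t⟨w, x⟩):

% Left panel: the three empirical losses for each of the 20 executions.
subplot(1, 2, 1);
plot(1:runs, binary, '-o', 1:runs, hinge, '-s', 1:runs, logistic, '-^');
xlabel('execution i'); ylabel('empirical loss');
legend('binary', 'hinge', 'logistic');

% Right panel: per-point losses of a single run (w1) versus t * <w, x>.
subplot(1, 2, 2);
m = output .* (X * W(:, 1));
plot(m, double(m <= 0), '.', m, max(0, 1 - m), '.', m, log2(1 + exp(-m)), '.');
xlabel('t<w, x>'); ylabel('loss per data point');
legend('binary', 'hinge', 'logistic');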

The left chart of the figure below shows the losses of the 20 executions. From it, we can see that the binary loss is the smallest of the three. More importantly, the surrogate losses are a much clearer indicator of the "goodness" of our classifier. We can see why when we compare executions i = 17 and i = 18. While they have roughly the same binary loss, the latter has a significantly smaller hinge and logistic loss than the former. This is because the surrogate losses reflect how far the data points lie from the decision boundary: points that sit deep on the wrong side get a very large hinge and logistic loss, and points that are correctly classified but very close to the line still get a nonzero surrogate loss, unlike the binary loss. So, the classifier of i = 18 is better than the one of i = 17.

Binary loss, hinge loss and logistic loss for 20 executions of the perceptron algorithm on the left, and the binary loss, hinge loss and logistic loss for one single execution (w1) of the perceptron algorithm over the 200 data points. Plot from the compare_losses.m script.

Another good comparison can be made when we look at the points i = 4 and i = 5. While the former has a smaller binary loss, the latter has smaller hinge and logistic losses. From this analysis, the classifier of i = 5 may actually be better than that of i = 4. This shows that the logistic and hinge losses do not simply track the binary loss. The two surrogate losses do not always track each other either, as we can see from the opposite directions they take between i = 12 and i = 13.

Moving on to the right chart of the figure above, we can see that both the logistic and the hinge losses give us large losses when we misclassify a data point, and a loss close to zero for the logistic and exactly zero for the hinge (when t⟨w, x⟩ ≥ 1) whenever we classify it correctly. We can also see that the hinge loss is not zero when t⟨w, x⟩ is between 0 and 1; this happens when the data point is correctly classified but too close to the line that separates the two classes [5].

In addition, when the data point is correctly classified but too close to the line, the hinge loss gives us a small value. Similarly, when the data point is wrongly classified and lies far away from the line, the hinge and the logistic loss both give us a very large value. In fact, for large negative values of t⟨w, x⟩, the two losses become parallel lines [5].

The logistic loss follows the trend of the hinge loss for wrongly classified data points. However, it is never zero for correctly classified data points, as shown in the right chart of the figure above. Similarly, the logistic loss assigns an error close to 1 to data points that are too close to the line.

As a final point, the hinge loss is an upper bound on the binary loss, and it is a much more informative indicator of the real quality of the classifier than the binary loss. Below we can see the minimum value obtained for each loss type.

Minimum losses

The results showed that the smallest binary loss was indeed found in execution i = 1. The minimum hinge and logistic losses, however, were found in execution i = 18. This highlights the importance of using the surrogate loss functions for binary classification: the classifier that gives us the smallest binary loss is not always the ideal classifier.

Using MATLAB

Just out of curiosity, I decided to compare our results with a ready-to-use implementation in MATLAB. I ran this dataset through MATLAB's linear classification model with 20 iterations. The figure below shows the classification error for the learner.

Minimum classification error of the linear classification done by MATLAB over 20 iterations on the dataset FG.

As we can see, MATLAB fit a classifier with an error of 0.135, which was coincidentally the same as the one from our perceptron algorithm. It is important to keep in mind that the learning algorithm MATLAB used in this comparison was not the perceptron; I just wanted to illustrate that our perceptron implementation estimated a pretty good classifier for this dataset.

Conclusion

We saw that, unlike the hinge and logistic losses, the binary loss is always 0 when we classify all the data points correctly. The logistic and hinge losses indicate "how correct" or "how mistaken" the predictor actually is. This is why the distance between the closest data point and the line of the classifier influences the overall loss.

Therefore, analyzing the surrogate loss functions proved to be of the utmost importance when selecting the best classifier, especially when the binary loss is not 0.

Finally, we saw how to implement the perceptron algorithm for a (d + 1)-dimensional space in MATLAB.

About Me

I'm an M.A.Sc. student at York University and a software engineer at heart. Over the past decade, I've worked in several industries in areas such as software development, cloud computing and systems engineering. Currently, I'm doing research on cloud computing and distributed systems.

You can check out my work on my website if you want.

Thanks for reading it!

References

[1] Yossi Keshet. Multiclass Classification. 2014. URL: https://u.cs.biu.ac.il/~jkeshet/teaching/aml2016/multiclass.pdf

[2] Dorian Lazar. Perceptron: Explanation, Implementation and a Visual Example. 2020. URL: https://towardsdatascience.com/perceptron-explanation-implementation-and-a-visual-example-3c8e76b4e2d1

[3] Marina Santini. Machine Learning for Language Technology Lecture 9: Perceptron. 2014. URL: http://santini.se/teaching/ml/2014/Lecture09_Perceptron.pdf

[4] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014. DOI: 10.1017/CBO9781107298019. URL: https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/understanding-machine-learning-theory-algorithms.pdf

[5] Kilian Weinberger. Empirical Risk Minimization — Cornell University. 2020. URL: https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote10.html
