
A framework for prototyping machine learning algorithms

A general framework for prototyping classical machine learning algorithms by exploiting their loss functions with scipy in Python


Introduction

Reading scientific articles and trying out promising new algorithms is a core activity for many data scientists. Unfortunately, these algorithms often live in small repositories and contain several bugs, so they are hard to use and you will spend a considerable amount of time debugging (or resolving dependency conflicts). Luckily, these algorithms can often be simplified and represented by a loss function. In this article I will use principal component analysis (PCA) and support vector machines (SVM) as leading examples, but the approach applies to many more techniques.

This approach has multiple advantages:

  • Learn the underlying behavior of machine learning algorithms.
  • Avoid mistakes due to certain default settings in available implementations.
  • Easy to customize for certain peculiarities of the data (in case common assumptions are violated).
  • Don’t lose time on finding a good implementation of the algorithm for your problem.
  • Easy to build in robustness for outliers by adjusting the loss function (e.g. Huber or Hampel).
  • Most algorithms assume normally distributed error terms; however, this is not always valid. By adjusting the loss function, deviations from this assumption can be accounted for.

Possible disadvantages:

  • Performance/speed
  • Numerical instability

Loss function representation of PCA

Principal component analysis (PCA) can be represented as finding the projection that maximizes the variance of the projected data. For mean-centered data, the first principal component can be obtained as:

max_w  (1/N) Σ_{k=1}^{N} (w^T x_k)^2

where w is the loading vector of the first principal component (p x 1), p is the dimensionality of the data, x_k is sample k (p x 1) and there are N samples in total. More information can be found in "Least Squares Support Vector Machines: Methods and Applications" (Suykens et al., 2002).

In its current form, this loss function can be made arbitrarily large by letting w grow to infinity, therefore the norm of w is constrained to 1 (normalization):

max_w  (1/N) Σ_{k=1}^{N} (w^T x_k)^2    subject to    ||w|| = 1

This can be translated to Python code as follows:
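A minimal sketch of such an implementation could look as follows (the class name LossPCA, the centering inside fit and the starting point are illustrative choices, not fixed by the method):

```python
import numpy as np
from scipy.optimize import minimize

class LossPCA:
    """First principal component by directly maximizing the variance
    of the projected data, subject to ||w|| = 1."""

    def fit(self, X):
        Xc = X - X.mean(axis=0)                       # mean-center the data
        p = Xc.shape[1]

        def neg_variance(w):                          # minimize the negative variance
            return -np.mean((Xc @ w) ** 2)

        # normalization constraint ||w||^2 = 1
        constraint = {'type': 'eq', 'fun': lambda w: w @ w - 1}
        w0 = np.ones(p) / np.sqrt(p)                  # feasible starting point
        res = minimize(neg_variance, w0, constraints=[constraint])
        self.components_ = res.x / np.linalg.norm(res.x)
        return self
```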

This results in a constrained optimization problem, which we can easily solve with scipy.optimize.minimize. It supports many different solvers depending on your needs and automatically selects a suitable one based on the inputs (for example, it falls back to SLSQP when constraints are supplied).

Some random Gaussian data will be generated (dimensionality p = 2) and the result will be benchmarked against the PCA implementation from sklearn.decomposition.
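A comparison along these lines could look as follows, reusing the LossPCA sketch from above (the covariance matrix, sample size and random seed are arbitrary choices for illustration):

```python
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 1.0], [1.0, 1.0]], size=1000)

pca = PCA(n_components=1).fit(X)
loss_pca = LossPCA().fit(X)

print('PC1 using sklearn:', pca.components_[0])
print('PC1 using loss function:', loss_pca.components_)
```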

PC1 using sklearn:       [-0.42388046 -0.90571814]
PC1 using loss function: [-0.42388046 -0.90571814]

The loading of the first principal component is exactly the same for both implementations, which shows that the framework can represent algorithms faithfully in only a few lines of code.

Finally, we will benchmark the speed of both implementations for obtaining the first principal component on an i7-8750H CPU (2.2 GHz, 6 cores / 12 threads). The scikit-learn implementation computes the singular value decomposition using the LAPACK library (Fortran). We will compare the two for several numbers of samples in the X dataset. The timings shown in the bar chart include creating the instances and defining the methods.
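A benchmark along these lines could be set up as follows, building on the snippets above (the sample sizes and the number of repetitions are illustrative; the exact setup behind the bar chart may differ):

```python
import timeit

for n in (1_000, 10_000, 100_000):
    X = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 1.0], [1.0, 1.0]], size=n)
    t_sklearn = timeit.timeit(lambda: PCA(n_components=1).fit(X), number=10)
    t_loss = timeit.timeit(lambda: LossPCA().fit(X), number=10)
    print(f'n={n}: sklearn {t_sklearn:.3f}s, loss function {t_loss:.3f}s')
```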

In this case the loss function implementation is also significantly faster than the singular value decomposition for calculating the first principal component. The scikit-learn implementation catches up for large sample sizes, but only because it then switches to a randomized, approximate SVD solver (Halko et al., 2011). The loss function implementation is therefore both faster and more accurate in this benchmark, with only a few lines of code.


Loss function representation of SVM

Support vector machines can be solved in both the primal and the dual space (if the kernel that is used can be represented in the primal space). For simplicity, the SVM implemented here is solved in the primal space. For more background on the primal and dual formulations, see "Least Squares Support Vector Machines: Methods and Applications" (Suykens et al., 2002).

The loss function of a linear SVM in the primal space is defined as (with squared slack variables, consistent with the squared hinge loss used below):

min_{w, b, zeta}  (1/2) w^T w + gamma Σ_{k=1}^{N} zeta_k^2

with the following constraints:

y_k (w^T x_k + b) >= 1 - zeta_k    and    zeta_k >= 0,    for k = 1, ..., N

where w is the weight vector that we want to obtain (p x 1), gamma is a hyperparameter that specifies the trade-off between the second-order (L2) regularization term and the number of misclassifications (if gamma is large, there will be more overfitting), x_k is sample k (p x 1) and there are N samples in total. zeta_k is the slack variable for sample k, which grows the further sample k lies on the wrong side of the decision boundary. y_k is the label of sample k, which is either positive (+1) or negative (-1), and b is the offset of the decision boundary.

Translating this to Python code (using a squared hinge loss) yields the following class:
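One possible minimal sketch of such a class poses the problem as a constrained optimization over z = [w, b, zeta] with scipy.optimize.minimize (the class name LossSVM and the default gamma are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

class LossSVM:
    """Linear SVM in the primal space, solved as a constrained
    optimization problem with a squared hinge loss."""

    def __init__(self, gamma=1.0):
        self.gamma = gamma   # trade-off between regularization and misclassification

    def fit(self, X, y):
        N, p = X.shape

        # decision variables: z = [w (p values), b (1 value), zeta (N values)]
        def loss(z):
            w, zeta = z[:p], z[p + 1:]
            return 0.5 * w @ w + self.gamma * np.sum(zeta ** 2)

        # margin constraints y_k (w^T x_k + b) >= 1 - zeta_k, written here as
        # generic (callable) inequality constraints g(z) >= 0
        def margin(z):
            w, b, zeta = z[:p], z[p], z[p + 1:]
            return y * (X @ w + b) - 1 + zeta

        constraints = [{'type': 'ineq', 'fun': margin},
                       {'type': 'ineq', 'fun': lambda z: z[p + 1:]}]  # zeta_k >= 0

        res = minimize(loss, np.zeros(p + 1 + N), constraints=constraints)
        self.w_, self.b_ = res.x[:p], res.x[p]
        return self

    def predict(self, X):
        return np.sign(X @ self.w_ + self.b_)
```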

This loss-function-based class will now be compared with the sklearn implementation (like we did for PCA). This results in the following Python code:
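Building on the LossSVM sketch above, a comparison could look like this (the toy data generation and the regularization values are illustrative; sklearn's LinearSVC uses a squared hinge loss by default):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=0)
y = 2 * y - 1                                   # relabel classes to -1 / +1

svc = LinearSVC(C=1.0).fit(X, y)                # squared hinge loss by default
loss_svm = LossSVM(gamma=1.0).fit(X, y)

print('Weights sklearn:', svc.coef_.ravel())
print('Weights loss:', loss_svm.w_)
```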

Weights sklearn: [ 0.349575   -0.17398369]
Weights loss:    [ 0.34916208 -0.17402776]

There is a small difference between the weights, but if we plot the decision boundaries we can see that they are almost identical.
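A quick way to visualize this, reusing the fitted models from above (the styling choices are arbitrary):

```python
import matplotlib.pyplot as plt

# a linear decision boundary satisfies w[0]*x1 + w[1]*x2 + b = 0
def boundary(w, b, x1):
    return -(w[0] * x1 + b) / w[1]

x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', s=10)
plt.plot(x1, boundary(svc.coef_.ravel(), svc.intercept_[0], x1), label='sklearn')
plt.plot(x1, boundary(loss_svm.w_, loss_svm.b_, x1), '--', label='loss function')
plt.legend()
plt.show()
```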

Now the performance will be benchmarked in the same way as for PCA; both implementations are allowed to run until convergence (no cut-off on the number of iterations). This gives the following results:

For 1,000 samples the custom implementation is about 10 times slower; for 50,000 samples this decreases to around 3 times slower. All things considered, this is not bad for a prototype that is only a couple of lines long. Note that the constraints were defined in a nonlinear manner, whereas they can actually be written as linear constraints; improving this would also improve the timings of the custom implementation, as sketched below.
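For example, the margin and slack constraints from the LossSVM sketch are linear in z = [w, b, zeta] and could be passed to scipy.optimize.minimize as LinearConstraint objects instead of generic callables (the variable names follow the sketch above):

```python
from scipy.optimize import LinearConstraint

def linear_constraints(X, y):
    N, p = X.shape
    # margin block: y_k * (w^T x_k + b) + zeta_k >= 1   ->   A_margin @ z >= 1
    A_margin = np.hstack([y[:, None] * X, y[:, None], np.eye(N)])
    # slack block:  zeta_k >= 0                          ->   A_slack  @ z >= 0
    A_slack = np.hstack([np.zeros((N, p + 1)), np.eye(N)])
    return [LinearConstraint(A_margin, lb=1, ub=np.inf),
            LinearConstraint(A_slack, lb=0, ub=np.inf)]
```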


Conclusion

Prototyping machine learning algorithms by writing a small class that optimizes the loss function is fast, accurate and easy. Doing this will increase your knowledge of the problem, which in turn allows you to improve the algorithm. If the prototype satisfies your needs but its numerical stability or speed is not sufficient for your application, then it is advisable to look for faster implementations in the literature or in existing repositories.


References

Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., and Vandewalle, J. (2002). "Least Squares Support Vector Machines: Methods and Applications". World Scientific.

Halko, N., Martinsson, P. G., and Tropp, J. A. (2011). "Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions". SIAM review, 53(2), 217–288.

Martinsson, P. G., Rokhlin, V., and Tygert, M. (2011). "A randomized algorithm for the decomposition of matrices". Applied and Computational Harmonic Analysis, 30(1), 47–68.

