
Explicit feature maps for non-linear kernel functions

Konstantin Kutzkov
Towards Data Science
Sep 2, 2021


It is all about finding the right space

At a high level, many machine learning algorithms can be thought of as consisting of the following steps.

  1. We are given an input dataset where examples are described by n-dimensional feature vectors. In supervised problems we are also given a target label.
  2. Map the feature vectors to a new m-dimensional space using some function. (Usually, m is larger than n.)
  3. Train a basic machine learning model in the new feature space.

For example, you can think of many deep learning models in the above framework. You convert your original input by passing it through several layers that create a new feature representation. Then you train a basic machine learning model in the new feature space. A multilayer perceptron used for regression tasks can be seen as a linear regression model applied to the features generated by the hidden layers. Similarly, by adding a sigmoid as an activation function for the output layer, we train a logistic regression classifier in the new feature space.

In fact, the huge success of neural networks can be attributed to the fact that the mapping function is complex and learnable and thus able to represent problem specific patterns.

But in many other instances we work with a fixed function that maps the original data features to a new domain. A standard example is to create new features by considering the product of pairs of the existing features. Look at the example below. By simply adding the product of the two dimensions as a third dimension, we obtain a linearly separable dataset.

The dataset on the left is not linearly separable. By adding x*y as a third feature the data clearly becomes linearly separable.
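As a quick illustration, here is a minimal sketch of this effect (a toy example, not part of the linked repository): a linear SVM is near chance level on an XOR-like dataset, but separates it easily once the product feature is added.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(500, 2))          # points in the square [-1, 1]^2
y = (X[:, 0] * X[:, 1] > 0).astype(int)        # the label depends on the sign of x*y

# A linear model on the original two features: typically close to chance level.
print(LinearSVC(max_iter=10000).fit(X, y).score(X, y))

# Adding the product x*y as a third feature makes the classes linearly separable.
X3 = np.hstack([X, (X[:, 0] * X[:, 1]).reshape(-1, 1)])
print(LinearSVC(max_iter=10000).fit(X3, y).score(X3, y))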

Kernel functions and Support-Vector Machines

At a high level, a kernel function measures the similarity between objects represented by numerical vectors. Kernel functions are the basis of support-vector machines, where the objective is to learn a separating hyperplane between objects. Kernel functions should satisfy certain mathematical requirements which, somewhat simplified, can be stated as follows: there must exist a map f from the original n-dimensional space to an m-dimensional space such that for all vectors x, y

k(x, y) = ⟨f(x), f(y)⟩.

The above says that we must be able to rewrite the kernel function between two vectors as the inner product between vectors in a new (high-dimensional) space.

Let us consider a few examples that will make the above formal definition clear:

  • Linear kernel. Here m = n and f is the identity map, i.e., k(x, y) = ⟨x, y⟩.

A variation of the linear kernel is the cosine kernel where the vectors are normalized to have unit norm.

  • Polynomial kernel. For a degree p, the polynomial kernel is defined as k(x, y) = ⟨x, y⟩^p.

Observe that for p=2 we can map the vector x to an n²-dimensional vector. For example, let x = (x1, x2, x3) and y = (y1, y2, y3) be two 3-dimensional vectors and p=2. Then for the polynomial kernel of degree 2 it holds

k(x, y) = (x1*y1 + x2*y2 + x3*y3)² = Σ_{i,j} (xi*xj)*(yi*yj) = ⟨f(x), f(y)⟩, where f(x) = (x1*x1, x1*x2, x1*x3, x2*x1, x2*x2, x2*x3, x3*x1, x3*x2, x3*x3),

i.e., we have a 9-dimensional vector for the explicit feature map.

More generally, the explicit feature map for the polynomial kernel is n^p-dimensional.
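As a quick sanity check of the p=2 example above, the following small sketch (a toy illustration, not from the accompanying repository) verifies numerically that the squared inner product equals the inner product of the flattened outer products.

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])

k = np.dot(x, y) ** 2                      # the kernel evaluated directly
f_x = np.outer(x, x).ravel()               # the 9-dimensional explicit feature map of x
f_y = np.outer(y, y).ravel()
print(k, np.dot(f_x, f_y))                 # the two values coincide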

  • RBF kernel. The Radial Basis Function is used to define the RBF kernel:

k(x, y) = exp(-γ‖x − y‖²), for a parameter γ > 0.

The use of the exponential function makes the dimensionality of the explicit feature map infinite.
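For concreteness, here is a minimal sketch comparing a direct evaluation of the formula above with scikit-learn's rbf_kernel (the gamma value is an arbitrary choice for the illustration).

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0, 3.0]])
y = np.array([[0.5, -1.0, 2.0]])
gamma = 0.1

direct = np.exp(-gamma * np.sum((x - y) ** 2))      # exp(-gamma * ||x - y||^2)
print(direct, rbf_kernel(x, y, gamma=gamma)[0, 0])  # both give the same value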

The kernel trick

The obvious problem with evaluating the explicit feature map is the high dimensionality of the mapping for most kernels. This has led to the discovery of the famous kernel trick. We refer the reader to this excellent explanation of the kernel trick, but at a high level it achieves the following. By rewriting the objective function that we want to optimize, we can learn an implicitly defined separating hyperplane in a high-dimensional space without ever generating high-dimensional vectors. The separating hyperplane is defined by a set of vectors, the so-called support vectors, not by exact coordinates. When we need to decide on which side of the hyperplane a point x lies, we evaluate the kernel function k(x, y) against all support vectors y. This is much more efficient. Consider 100-dimensional input vectors and a polynomial kernel of degree p=4. Explicitly computing the feature representation would require us to work with vectors with 100⁴ = 10⁸, or 100 million, entries. But evaluating the kernel against a support vector requires only the inner product of two 100-dimensional vectors and then raising the result to the power 4.

To get a better idea of the computational savings we can achieve, assume we have detected that the hyperplane is implicitly defined by 1,000 support vectors. Evaluating the kernel for all support vectors will require a total of less than 1,000*100*4 multiplications (the last term 4 comes from the number of multiplications needed to evaluate the power of the inner product). This is a fraction of 0.004 of the number of multiplications required to evaluate the explicit feature map of a single example.
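The following small sketch (an illustration only) makes the contrast tangible: evaluating the kernel against a single support vector boils down to one 100-dimensional inner product plus a power, while the explicit feature map would live in a space with 100⁴ entries.

import numpy as np

rng = np.random.RandomState(0)
x, support_vector = rng.normal(size=100), rng.normal(size=100)

# Kernel trick: one inner product of two 100-dimensional vectors, then one power.
k = np.dot(x, support_vector) ** 4
print(k)

# The explicit feature map for p = 4 would instead have 100**4 = 100 million entries.
print(100 ** 4)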

Big data has changed the picture

However, in the era of Big Data it turns out that SVMs are not really scalable. First, in order to compute a set of support vectors we need to compute the Gram matrix consisting of the kernel values k(x, y) for all pairs (x, y) of input instances. Thus, for t examples in our training dataset we need a matrix with t² entries. Even if we decide to store a sparsified version of the Gram matrix by keeping only the largest entries, for massive datasets simply computing the kernel function for all pairs can be infeasible.

Another problem is that for massive complex datasets the number of support vectors grows linearly with the dataset size [1]. So, even if we can afford to train an SVM model, the prediction time will be very slow as we would need to evaluate the kernel function many times for a single prediction.

Explicit but compact

The above limitations of support vector machines have motivated researchers to consider the problem of obtaining a compact representation of explicit feature maps. This means that the original n-dimensional vectors x, y are replaced by new m-dimensional vectors f(x), f(y), for a reasonably large m, such that

⟨f(x), f(y)⟩ ≈ k(x, y).

Here m is a user-defined parameter. The larger m, the better the approximation. Thus, m is a trade-off between accuracy and scalability.
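To see this trade-off in action, the sketch below (an illustration, not part of the linked repository) compares the exact RBF Gram matrix of a small random dataset with the approximation induced by scikit-learn's RBFSampler for growing m; the approximation error typically shrinks as m grows.

import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
K_exact = rbf_kernel(X, gamma=0.5)

for m in [10, 100, 1000, 10000]:
    Z = RBFSampler(gamma=0.5, n_components=m, random_state=1).fit_transform(X)
    K_approx = Z @ Z.T                     # inner products of the explicit maps
    err = np.abs(K_exact - K_approx).mean()
    print(f"m={m:6d}  mean absolute error={err:.4f}")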

Please feel free to skip the below description of the underlying mechanism of constructing explicit feature maps. It is not crucial for understanding the next sections.

How does it work?

Next we consider two approaches to constructing explicit feature maps for the two most widely used kernels, the RBF and the polynomial kernel, and provide some intuition for how they work:

RBF sampler.

Using some advanced mathematical concepts from probability theory, we can obtain the following representation for the kernel:

k(x, y) = E_{w∼p(w)} [ cos(⟨w, x⟩)·cos(⟨w, y⟩) + sin(⟨w, x⟩)·sin(⟨w, y⟩) ]

where p(w) is a probability distribution, a Gaussian in the case of the RBF kernel [2].

The above formulation allows us to design a sampling procedure that samples vectors w according to p(w). Observe that the expression in brackets can be written as an inner product of two vectors with two coordinates each, namely (cos(⟨w, x⟩), sin(⟨w, x⟩)) and (cos(⟨w, y⟩), sin(⟨w, y⟩)). A large number of samples will then result in an approximate explicit feature map. For a collection of vectors, we sample the values w for each coordinate of the approximate explicit feature map using the same random seed. In this way we guarantee that the samples are coordinated, and vectors that are similar to each other, i.e., have a small Euclidean distance, are likely to end up with similar values in the respective coordinates.
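The following is a minimal sketch of this idea (a simplified re-implementation for illustration, not scikit-learn's RBFSampler code): it samples w from a Gaussian, maps each vector to cos/sin coordinates, and checks that inner products of the resulting maps approximate the RBF kernel.

import numpy as np

def rff_map(X, gamma, m, seed=0):
    """Map each row of X to a 2*m-dimensional approximate explicit feature map."""
    rng = np.random.RandomState(seed)      # the same seed gives the same mapping
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], m))
    proj = X @ W
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

rng = np.random.RandomState(1)
x = rng.normal(size=10)
y = x + 0.1 * rng.normal(size=10)          # a nearby vector
gamma = 0.2

exact = np.exp(-gamma * np.sum((x - y) ** 2))
Z = rff_map(np.vstack([x, y]), gamma, m=5000)
print(exact, Z[0] @ Z[1])                  # close up to sampling noise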

Polynomial CountSketch.

The polynomial CountSketch, or TensorSketch [3], builds upon the CountSketch algorithm for detecting heavy hitters in massive data streams. We will give an intuitive explanation of how the approach works. Imagine we are given a vector x and construct its n²-dimensional explicit feature map, call it z. Then each index i is placed at random by a hash function h into one of m buckets, denoted by B and initialized with zeros. Also, the value z[i] is multiplied by a sign s(i), chosen to be -1 or 1 at random with equal probability. Precisely, we update a bucket as

B[h(i)] = B[h(i)] + s(i)·z[i].

Assume that some of the z[i] values are much larger than the rest, the so-called heavy hitters. Let z[j] be such a heavy hitter. The bucket that contains it, B[h(j)], also contains the other entries z[k] with h(k) = h(j). The content of the bucket is

B[h(j)] = s(j)·z[j] + Σ_{k ≠ j: h(k) = h(j)} s(k)·z[k].

However, the signs are random and if the summation contains many small entries, they are likely to cancel out.

Now assume we have the explicit feature maps z and w of two vectors such that z[j] and w[j] are both large. (Ultimately, this is what we are looking for in a kernel: to detect subsets of features that are significant in both input vectors.) We replace the product z[j]·w[j] by

(s(j)·z[j] + Rz)·(s(j)·w[j] + Rw) = z[j]·w[j] + s(j)·z[j]·Rw + s(j)·w[j]·Rz + Rz·Rw,

where Rz and Rw are the above sums of the other entries hashed to the same bucket. Because the signs in front of z[j] and w[j] are the same, so they cancel out when multiplied, and Rz and Rw are small, we obtain a good approximation of z[j]·w[j].

The above assumes that we have direct access to the entries z[i] in the explicit feature map. In [3] the authors design an efficient algorithm that can perform the above by avoiding the generation of the explicit feature maps. It runs in nearly linear time in the length of the input vectors.
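Here is a minimal sketch of the plain CountSketch building block described above (an illustration only; the TensorSketch algorithm of [3] additionally avoids materializing z). It implements the bucket update B[h(i)] += s(i)·z[i] and checks that inner products are approximately preserved when two sketches share the same hash function and signs.

import numpy as np

def count_sketch(z, m, seed=0):
    rng = np.random.RandomState(seed)              # shared seed -> shared h and s
    h = rng.randint(0, m, size=z.shape[0])         # bucket h(i) for each index
    s = rng.choice([-1.0, 1.0], size=z.shape[0])   # random sign s(i) for each index
    B = np.zeros(m)
    np.add.at(B, h, s * z)                         # B[h(i)] += s(i) * z[i]
    return B

rng = np.random.RandomState(1)
z = rng.normal(size=10000)
w = rng.normal(size=10000)
z[:5] += 50.0                              # a few "heavy" coordinates shared by both vectors
w[:5] += 50.0

# Inner products are approximately preserved by the sketches.
print(z @ w, count_sketch(z, 2000) @ count_sketch(w, 2000))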

How to use explicit feature maps

The good news is that explicit feature maps are already implemented in scikit-learn and are easy to use. (The code for the following example can be found at https://github.com/konstantinkutzkov/explicit_feature_maps)

First, import the necessary packages:

from sklearn import svm, datasets
from sklearn.kernel_approximation import RBFSampler
from sklearn.kernel_approximation import PolynomialCountSketch

Then we can simply convert the original vectors to a new feature representation.

poly_cs = PolynomialCountSketch(
    degree=3,
    gamma=1.0,
    random_state=1,
    n_components=1000
)

# get_train_test_split is a helper from the accompanying repository (not shown here)
X_train, y_train, X_test, y_test = get_train_test_split(X, y, train_index)

# fit the sketch on the training data and re-use the same mapping for the test data
X_train_poly, X_test_poly = poly_cs.fit_transform(X_train), poly_cs.transform(X_test)

Finally, use a linear SVM to train a model on the new representation:

model_cs = svm.LinearSVC(C=1.0, max_iter=10000).fit(X_train_poly, y_train)

We trained the above models on the digits dataset consisting of hand-written digits described by 64-dimensional vectors. The explicit feature map of a polynomial kernel of degree 3 would thus result in vectors of dimensionality 64³ = 262,144. Instead, we work with feature maps of only 1,000 entries, about 260 times fewer, and obtain essentially identical results. The results below are averaged over 10-fold cross-validation:

Linear SVM: 0.985
RBF SVM: 0.817
Polynomial SVM: 0.998
Explicit RBF: 0.818
Explicit Polynomial: 0.997

In the above, RBF SVM and Polynomial SVM were trained as dual SVM models that use the kernel trick. The polynomial kernel outperforms the other methods, and the RBF kernel does not appear to be appropriate for this problem.
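For completeness, here is a hedged end-to-end sketch of such a comparison on the digits dataset (a reconstruction for illustration; the exact scores depend on hyperparameters such as gamma, C and the random seed, so they will not necessarily match the numbers above).

from sklearn import datasets, svm
from sklearn.kernel_approximation import PolynomialCountSketch, RBFSampler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = datasets.load_digits(return_X_y=True)

models = {
    "Linear SVM": svm.LinearSVC(C=1.0, max_iter=10000),
    "RBF SVM": svm.SVC(kernel="rbf"),
    "Polynomial SVM": svm.SVC(kernel="poly", degree=3),
    "Explicit RBF": make_pipeline(
        RBFSampler(n_components=1000, random_state=1),
        svm.LinearSVC(C=1.0, max_iter=10000)),
    "Explicit Polynomial": make_pipeline(
        PolynomialCountSketch(degree=3, n_components=1000, random_state=1),
        svm.LinearSVC(C=1.0, max_iter=10000)),
}

for name, model in models.items():
    # scale the pixel values to [0, 1] and evaluate with 10-fold cross-validation
    scores = cross_val_score(make_pipeline(MinMaxScaler(), model), X, y, cv=10)
    print(f"{name}: {scores.mean():.3f}")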

Practical hints

  • Always make sure you use the same random seed if you want to apply the same mapping to different sets, be it for model deployment or for reproducibility of results.
  • Scale the features. If the features are not scaled, then in PolynomialCountSketch the feature maps will be biased towards the heaviest entries (see the discussion above on how the approach works). A short sketch illustrating these two hints follows this list.
  • Keep in mind that a smaller dimensionality of the approximate explicit feature maps acts as a regularization parameter.
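The short sketch below illustrates the first two hints on hypothetical feature arrays: scale the features, fix the random_state so that the same mapping can be re-applied later, and fit the mapping on the training data only.

import numpy as np
from sklearn.kernel_approximation import PolynomialCountSketch
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_train = rng.normal(size=(100, 20))       # hypothetical training features
X_test = rng.normal(size=(50, 20))         # hypothetical test / deployment features

mapper = make_pipeline(
    StandardScaler(),                                   # scale the features first
    PolynomialCountSketch(degree=2, n_components=500, random_state=42),
)
Z_train = mapper.fit_transform(X_train)    # fit scaler and sketch on the training data only
Z_test = mapper.transform(X_test)          # the fixed random_state re-applies the same mapping
print(Z_train.shape, Z_test.shape)         # (100, 500) (50, 500)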

Applications

The obvious application is to train a linear support vector machine using the explicit feature maps. But given that they are oblivious to the problem at hand, explicit feature maps can be used in combination with any other machine learning model, be it supervised or unsupervised. Just think about it as training your model in a new subspace formed by a combination of the coordinates of the original space.
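As one illustration (a toy example, not from the accompanying repository), running plain k-means on explicit RBF feature maps corresponds to an approximate kernel k-means in the original space.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles
from sklearn.kernel_approximation import RBFSampler

# two concentric circles: a classic example where clusters are not linearly separable
X, _ = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

# map the points to an approximate RBF feature space and cluster there with plain k-means
Z = RBFSampler(gamma=5.0, n_components=500, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
print(np.bincount(labels))                 # sizes of the two clusters found in the new space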

Discussion

Advantages

  • Accessibility. Easy to use, with a publicly available implementation in scikit-learn.
  • Scalability. Generating explicit feature maps is a highly efficient mathematical manipulation and does not require training a model.
  • Well understood. We are approximating mathematical functions with rigorously understood properties. These kernel functions have been used by the ML community for decades.

Drawbacks

There are two important drawbacks that should be kept in mind if you decide to include explicit feature maps in your project.

  • We have yet another hyperparameter that needs to be carefully tuned. Compared to the standard version of the kernel, we have an additional hyperparameter which is the dimensionality of the explicit feature maps. As discussed, this parameter is a trade-off between accuracy and scalability.
  • The explicit feature maps are no longer interpretable. The individual dimensions in the original input correspond to features that in many cases are accessible to a human expert. This makes it possible to train explainable ML models such as linear or logistic regression, decision trees, etc. In particular, SVMs using the kernel trick are more or less interpretable, as we can look at the kernel function scores between two vectors. However, explicit feature maps are the result of complex mathematical modifications of the original vectors, and the individual coordinates are no longer meaningful to a human user.

Code

https://github.com/konstantinkutzkov/explicit_feature_maps

[1] Thorsten Joachims: Training Linear SVMs in Linear Time. KDD 2006. https://www.cs.cornell.edu/people/tj/publications/joachims_06a.pdf

[2] Ali Rahimi, Benjamin Recht: Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning. NIPS 2008. https://people.eecs.berkeley.edu/~brecht/papers/08.rah.rec.nips.pdf

[3] Ninh Pham, Rasmus Pagh: Fast and scalable polynomial kernels via explicit feature maps. KDD 2013. https://dl.acm.org/doi/10.1145/2487575.2487591
