
Kernel Machine From Scratch

Learn the concepts of kernels and reproducing kernel Hilbert spaces, schematically

Photo by Benton Sherman on Unsplash

Training a neural network model may be hard; knowing what it has learned is even harder. Back in 1995, Radford M. Neal showed that a single-layer neural network with random parameters converges to a Gaussian process as its width goes to infinity. In 2018, Lee et al. further generalized the result to infinite-width networks of arbitrary depth. This connection allows us to build a Gaussian-process model that mimics an infinite-width network – casting the problem into a regime we can actually analyze. More recently, Arthur Jacot et al. proved that the evolution of an infinite-width neural network during gradient-descent training can also be described in terms of a kernel (the neural tangent kernel). Along this paved way, more and more works are studying the intriguing behaviors of neural networks with theoretical tools from kernel methods. Before diving into these great works, it’s worth reviewing the theorems behind kernels in depth.

1. Function Space in Machine Learning

To fully appreciate the theory behind the kernel method, it’s essential to first clarify the concept of a function space and the role it plays in machine learning. As we all know, in supervised learning we collect a set of tuples (x₁, y₁), …, (xₙ, yₙ) and would like to find a function that fits them best. More specifically, the target is to find a function f(.) within a set of functions ℌ such that Σᵢ l(f(xᵢ), yᵢ) is minimized w.r.t. a loss function l(.,.). Such a set of functions, known as the hypothesis set in machine learning, is exactly a function space. Since we are searching for the optimum within ℌ, the larger it is, the more admissible solutions we may find. So, how large could it be?

Think of a function as a vector over the field ℝ, with x ∈ 𝒳 as indices. For an input space 𝒳 of cardinality |𝒳|, the largest function space would be ℝ^|𝒳|, which means that once the input space has infinite cardinality (ℝ, for example), the corresponding function space has infinite dimension. If we take all possibilities into account, we end up with an infinite-dimensional space that we never have the chance to search exhaustively.
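To make the "function as a vector" picture concrete, here is a tiny sketch. The three-element input space and the values below are made up purely for illustration: a function on a finite input space is nothing more than a vector with one coordinate per input.

```python
import numpy as np

# A toy input space with three elements. A function f: X -> R is then just a
# vector with |X| = 3 real coordinates, one per input.
X = ["a", "b", "c"]
f = np.array([0.5, -1.2, 3.0])   # f("a") = 0.5, f("b") = -1.2, f("c") = 3.0

def evaluate(f, x):
    """Evaluating f at x amounts to reading off the coordinate indexed by x."""
    return f[X.index(x)]

print(evaluate(f, "b"))  # -> -1.2
```

Once 𝒳 becomes infinite, the "vector" acquires infinitely many coordinates, which is exactly why the unrestricted function space cannot be searched exhaustively.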

To make life easier, we often limit ourselves to a smaller hypothesis set. Fixing a functional form is one way of doing so. By parameterizing the functional form with parameters w, we reduce the search space to ℝ^|w| and indeed put the problem in a solvable regime. Nevertheless, there is no rose without a thorn. In real-world applications, it’s not always easy to come up with a functional form for complex data. Even if we do come up with one, the capacity of the learned function could be too limited to be useful. Now we may wonder: is there a way to keep the problem solvable without sacrificing much capacity in the meantime? This is where the kernel method comes to the rescue.

Function space of f: ℝ → ℝ with linear / polynomial functional form. Image by author.

2. Beyond Theories – a Schematic View of Kernel-induced Function Space

A kernel, informally speaking, is a generalized inner product between instances in the input space. Like the inner product, a kernel function K: 𝒳 × 𝒳 → ℝ characterizes geometric properties of the input space such as norm, distance, and angle.
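Concretely, with K playing the role of an inner product, these familiar geometric quantities can be written via the standard identities:

‖x‖ = √K(x, x)
d(u, v) = √( K(u, u) − 2K(u, v) + K(v, v) )
cos∠(u, v) = K(u, v) / ( √K(u, u) · √K(v, v) )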

In addition to the ability to characterize the input space, what’s even more important is the kernel’s ability to characterize a function space, known as the reproducing kernel Hilbert space (RKHS for short).

2.1. Kernel-induced Function Space

Instead of going through the formal definition of an RKHS, let’s try to construct one from scratch. Consider a kernel function K: 𝒳 × 𝒳 → ℝ satisfying the inner-product properties. For every x ∈ 𝒳, we define Kₓ(.) ≡ K(., x), i.e., K(.,.) with its second argument fixed at x. Now, again, think of each function as a vector. Taking this set of functions as a basis, it spans a function space ℌ such that every function in ℌ is a linear combination of the Kₓ’s. If the Kₓ’s form a linearly independent set, the resulting ℌ has dim(ℌ) = |𝒳| – reaching the largest possible space mentioned above.
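To make the construction tangible, here is a minimal sketch in NumPy. The one-dimensional input space and the RBF kernel are arbitrary choices for illustration: each representer Kₓ is an ordinary function of one argument, and the induced space consists of their linear combinations.

```python
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    """One common choice of kernel; any positive-definite kernel would do."""
    return np.exp(-gamma * (u - v) ** 2)

def K_x(x, kernel=rbf_kernel):
    """The representer K_x(.) = K(., x): the kernel with its second argument fixed at x."""
    return lambda t: kernel(t, x)

def linear_combination(coeffs, centers, kernel=rbf_kernel):
    """An element of the induced space: f(.) = sum_i a_i * K_{x_i}(.)."""
    basis = [K_x(x, kernel) for x in centers]
    return lambda t: sum(a * k(t) for a, k in zip(coeffs, basis))

# f = 2*K_1(.) - 0.5*K_3(.), a member of the span of the representers
f = linear_combination([2.0, -0.5], [1.0, 3.0])
print(f(2.0))
```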

2.2. The Rose Without The Thorn – RKHS And Representer Theorem

With the construction above, we’ve built a function space of the largest possible dimension. The next question is: how can we find a function that minimizes the loss within an infinite-dimensional space? All we need is to equip ℌ with an inner product that reproduces the kernel K – for every u, v ∈ 𝒳, ⟨Kᵤ, Kᵥ⟩ = K(u, v). With this specialized inner product, ℌ comes with a most powerful result – the representer theorem:

Given a set of training samples (x₁, y₁), …, (xₙ, yₙ) ∈ 𝒳 × ℝ, every minimizer over f ∈ ℌ of the regularized empirical loss
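L( (x₁, y₁, f(x₁)), …, (xₙ, yₙ, f(xₙ)) ) + g( ‖f‖ )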

for an arbitrary loss function L: (𝒳 × ℝ²)ⁿ → ℝ and any strictly increasing function g: [0, ∞) → ℝ measuring the complexity of a function based on its norm, admits a representation of the form:
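f*(.) = Σᵢ αᵢ Kₓᵢ(.) = Σᵢ αᵢ K(., xᵢ), with coefficients α₁, …, αₙ ∈ ℝ.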

In a word, it tells us that in order to find the loss minimizer, we don’t need to search the whole infinite-dimensional space, but only the subspace spanned by the representers (the Kₓ’s) of the training data. What’s the magic behind this special inner product? How can we ignore the complement space when looking for the optimizer? The most important property originating from the inner product is that evaluating f at x, for every f ∈ ℌ and x ∈ 𝒳, is equivalent to taking the inner product between f and Kₓ, i.e., f(x) = ⟨Kₓ, f⟩. Building on this, we can show that for any function f in ℌ, the component in the orthogonal complement always contributes zero when we evaluate f at any x within the training data, and thus contributes nothing to the loss.
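Spelled out: decompose any f ∈ ℌ as f = fₛ + f⊥, where fₛ lies in span{Kₓ₁, …, Kₓₙ} and f⊥ is orthogonal to that span. Then for every training point, f(xᵢ) = ⟨Kₓᵢ, f⟩ = ⟨Kₓᵢ, fₛ⟩, so f⊥ leaves the data-fitting term untouched, while ‖f‖² = ‖fₛ‖² + ‖f⊥‖² only grows with it – dropping f⊥ can therefore never hurt.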

2.3. Hands-on Crafting an RKHS From Scratch

Now that we know the rationale behind the kernel machine, let me go through a simple example to help you get a better understanding. Suppose we are to estimate a function measuring the preference for three different fruits, and we are given a kernel matrix K measuring the pairwise relations between them. To build a function space from K, we span it with the three representers Kₓ(.) = K(., x), one per fruit, as basis, as shown in the figure below.

Now, we observe two points: (apple, 7) and (guava, 4). Based on the representer theorem, the optimal solution must admit the following form:
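f(.) = α₁·K_apple(.) + α₂·K_guava(.)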

Combining this with the square loss, the problem becomes:
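minimize over (α₁, α₂): ( α₁·K(apple, apple) + α₂·K(apple, guava) − 7 )² + ( α₁·K(guava, apple) + α₂·K(guava, guava) − 4 )²

or, compactly, min_α ‖K_S·α − y‖², where K_S is the 2×2 sub-matrix of K indexed by the observed fruits and y = (7, 4). When K_S is invertible, the loss can be driven all the way to zero by solving K_S·α = y,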

yielding the optimal coefficients (α₁, α₂), and thus the fitted preference function over all three fruits, as illustrated in the figure below.

Image by author.
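For completeness, the whole computation fits in a few lines of NumPy. This is only a sketch: the third fruit ("banana") and the kernel values below are hypothetical placeholders and need not match the numbers in the figure.

```python
import numpy as np

# Hypothetical kernel matrix over (apple, banana, guava) -- illustrative values
# only, chosen to be symmetric positive definite.
fruits = ["apple", "banana", "guava"]
K = np.array([
    [1.0, 0.6, 0.2],
    [0.6, 1.0, 0.5],
    [0.2, 0.5, 1.0],
])

# Observations: (apple, 7) and (guava, 4)
obs_idx = [0, 2]                 # indices of apple and guava
y = np.array([7.0, 4.0])

# Representer theorem: f(.) = a1*K_apple(.) + a2*K_guava(.)
# The square loss is minimized (here driven to zero) by solving K_S a = y,
# where K_S is the sub-matrix of K restricted to the observed fruits.
K_S = K[np.ix_(obs_idx, obs_idx)]
alpha = np.linalg.solve(K_S, y)

# Evaluate the fitted preference function on every fruit: f(x) = sum_i a_i * K(x, x_i)
f = K[:, obs_idx] @ alpha
for fruit, value in zip(fruits, f):
    print(f"{fruit}: {value:.2f}")
```

Running it reproduces the observed preferences exactly at apple and guava, while the preference for the remaining fruit is interpolated through the kernel.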

3. Occam’s Razor In Kernel’s Blood

So far, we’ve demonstrated how a kernel can be used to build up the hypothesis set and how we can find the minimizer of the empirical loss. As a matter of fact, the solution corresponds to the one with the minimum norm among all functions achieving the minimal loss (in some cases, passing through all the observed data). In light of the principle of Occam’s razor, it is exactly the one to favor.
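In symbols, when a perfect fit is attainable, the kernel solution can be written as f* = argmin ‖f‖ over all f ∈ ℌ satisfying f(xᵢ) = yᵢ for i = 1, …, n – the smallest-norm function consistent with the data.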

As we all know, the power of a neural network lies not in its ability to fit the training data perfectly, but in its ability to generalize to unseen data as well. In connecting neural networks (trained with gradient descent) to kernel machines, the razor in the kernel’s blood provides a possible explanation for their success.

4. Final Words

Developed since the 1960s, the theory of kernels is well matured, and algorithms utilizing the kernel trick are often among the first choices when coping with new data. Most practitioners nowadays have well-established knowledge and experience of how to construct a powerful kernel for different kinds of data and how to take advantage of the kernel trick to build a high-capacity learner, yet they may not know much about the RKHS and the theorems behind it. In this article, I tried to elaborate on the idea of the RKHS in a schematic way without much mathematics involved, hoping it brings some enlightenment to everyone who needs it.

