We explain the motivation behind the kernel method in machine learning theory and how one might go about overcoming its computational challenges.
Published
28 October 2023
The concept of a kernel plays a crucial role when dealing with massive datasets, as it provides a computationally feasible method to encapsulate non-linearity.
There are several popular types of kernels, including the Gaussian kernel (RBF), the polynomial kernel, and the sigmoid kernel, each with its own advantages.
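As a quick illustration, here is a minimal sketch of these three kernels as plain functions on vectors in $\mathbb{R}^d$; the parameter names and default values (sigma, degree, coef0, alpha) are illustrative choices, not anything prescribed in this article.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), the RBF kernel
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def polynomial_kernel(x, y, degree=3, coef0=1.0):
    # K(x, y) = (<x, y> + coef0)^degree
    return (np.dot(x, y) + coef0) ** degree

def sigmoid_kernel(x, y, alpha=0.01, coef0=0.0):
    # K(x, y) = tanh(alpha <x, y> + coef0)
    return np.tanh(alpha * np.dot(x, y) + coef0)
```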
Essentially, kernels are a valuable tool for navigating computational limitations while still capturing the intricate non-linearity inherent in data. However, approximations may be necessary in order to fully harness their power.
In this article, we aim to understand the significance of kernel methods and survey the theoretical underpinnings of Gaussian Kernel approximations, focusing on Taylor approximation and random features.
Motivation
Let’s take a look at the kernel (soft) SVM optimization problem:

$$\min_{w \in \mathcal{H}} \; \frac{1}{m}\sum_{i=1}^{m}\max\{0,\, 1 - y_i\langle w, \phi(x_i)\rangle\} + R(\|w\|),$$

where $R$ is a monotonically nondecreasing function (for the soft SVM we take $R(a) = \lambda a^2$). According to the Representer theorem (whose proof uses the orthogonal decomposition of the Hilbert space and the monotonicity of $R$), the solution takes the form $w = \sum_{i=1}^{m}\alpha_i\phi(x_i)$, so the learned predictor is $f(x) = \langle w, \phi(x)\rangle = \sum_{i=1}^{m}\alpha_i K(x_i, x)$. This implies that our minimizing function can be completely expressed in terms of $K(x_i, x_j) := \langle \phi(x_i), \phi(x_j)\rangle$. However, this might still be unsatisfactory. For instance, if each kernel evaluation $K(x_i, x)$ on inputs in $\mathbb{R}^d$ takes $\Theta(d)$ operations, then evaluating $f$ at a single new point requires $\Theta(md)$ operations in total.
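To make the $\Theta(md)$ prediction cost concrete, here is a minimal sketch of evaluating $f(x) = \sum_{i=1}^{m}\alpha_i K(x_i, x)$; the Gaussian kernel, the random data, and the coefficients $\alpha_i$ are all placeholders for whatever a real solver would produce.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, sigma = 1000, 50, 1.0
X = rng.normal(size=(m, d))     # training points x_1, ..., x_m (placeholder data)
alpha = rng.normal(size=m)      # coefficients alpha_i (placeholder for the SVM solution)

def K(x, y):
    # Gaussian kernel: each evaluation costs Theta(d)
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def f(x):
    # one prediction touches all m training points: Theta(m d) overall
    return sum(alpha[i] * K(X[i], x) for i in range(m))

print(f(rng.normal(size=d)))
```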
If we can find a way to replace $\phi:\mathcal{X}\to\mathcal{H}$ with a finite-dimensional map $\tilde{\phi}:\mathbb{R}^d\to\mathbb{R}^D$ and define

$$\tilde{f}(x) = \langle \tilde{w}, \tilde{\phi}(x)\rangle,$$

we can potentially reduce the cost of each prediction to $\Theta(D)$.
Notice how this bears a resemblance to the method of weighted sums and quadrature points commonly employed for numerically approximating integrals in partial differential equations.
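By contrast with the kernel form above, once an explicit finite-dimensional $\tilde{\phi}$ is available, a prediction is a single $D$-dimensional inner product. A minimal sketch, with `phi_tilde` standing in for any of the constructions discussed below:

```python
import numpy as np

def predict(w_tilde, phi_tilde, x):
    # f~(x) = <w~, phi~(x)>: Theta(D) once phi~(x) has been computed
    return np.dot(w_tilde, phi_tilde(x))

# shape of the call, using a trivial identity feature map as a stand-in
print(predict(np.ones(3), lambda x: x, np.array([1.0, 2.0, 3.0])))
```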
Approximation method 1: Taylor approximation
1. Motivation
The following theorem illustrates how the concept of projection can achieve this reduction in computational cost.
Theorem 1. Let $p(w)$ denote the soft SVM objective above and $p^* = \inf_{w} p(w)$ its optimal value. Let $P$ be an orthogonal projection on $\mathcal{H}$, define $\tilde{\phi}(x) = P\phi(x)$ and $\tilde{K}(x, x') := \langle \tilde{\phi}(x), \tilde{\phi}(x')\rangle$, and let $\tilde{p}^* = \inf_{\tilde{w}} \tilde{p}(\tilde{w})$ be the optimal value of the soft SVM run with $\tilde{\phi}$ in place of $\phi$. Then we have

$$p^* \;\le\; \tilde{p}^* \;\le\; p^* + \frac{1}{m\sqrt{\lambda}}\sum_{i=1}^{m}\sqrt{K(x_i, x_i) - \tilde{K}(x_i, x_i)}.$$
Proof. Note that since $P$ is an orthogonal projection, we have $\|Pw\| \le \|w\|$; moreover, any $\tilde{w}$ in the range of $P$ satisfies $\langle \tilde{w}, P\phi(x_i)\rangle = \langle \tilde{w}, \phi(x_i)\rangle$, so $\tilde{p}(\tilde{w}) = p(\tilde{w}) \ge p^*$, which gives the first inequality. Further, for the minimizer $w^*$ of $p$ we have

$$\langle Pw^*, P\phi(x_i)\rangle = \langle w^*, P\phi(x_i)\rangle = \langle w^*, \phi(x_i)\rangle - \langle w^*, P^{\perp}\phi(x_i)\rangle,$$

using the orthogonality of the Hilbert space decomposition, and by Cauchy–Schwarz the last term is at most $\|w^*\|\,\|P^{\perp}\phi(x_i)\|$ in absolute value. Now note that $\|P^{\perp}\phi(x_i)\|^2 = \|\phi(x_i)\|^2 - \|P\phi(x_i)\|^2 = K(x_i, x_i) - \tilde{K}(x_i, x_i)$, and the regularization forces $\|w^*\| \le \frac{1}{\sqrt{\lambda}}$ (since $\lambda\|w^*\|^2 \le p(w^*) \le p(0) = 1$), so, because the hinge loss is 1-Lipschitz, we get the second inequality by routine algebra. □
This suggests that making $\sum_{i=1}^{m}\|\phi(x_i) - P\phi(x_i)\|$ small yields a good approximation, and Taylor approximation gives one way to achieve this.
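Since $\|\phi(x_i) - P\phi(x_i)\|^2 = K(x_i, x_i) - \tilde{K}(x_i, x_i)$ (as in the proof above), this quantity can be monitored without touching $\phi$ explicitly. A small sketch, where the exact kernel `K` and the approximate feature map `phi_tilde` are assumed to be supplied by the caller:

```python
import numpy as np

def projection_residual(X, K, phi_tilde):
    # sum_i ||phi(x_i) - P phi(x_i)||, computed purely from kernel values:
    # ||P_perp phi(x_i)||^2 = K(x_i, x_i) - <phi~(x_i), phi~(x_i)>
    return sum(
        np.sqrt(max(K(x, x) - np.dot(phi_tilde(x), phi_tilde(x)), 0.0))
        for x in X
    )
```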
2. Taylor approximation
For the Gaussian kernel $K(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$, we first find $\phi$, truncate it, and use Taylor's theorem with remainder to obtain a bound.
First note that we have

$$K(x, x') = e^{-\frac{\|x\|^2}{2\sigma^2}}\, e^{-\frac{\|x'\|^2}{2\sigma^2}}\, \sum_{k=0}^{\infty}\frac{1}{k!}\left(\frac{\langle x, x'\rangle}{\sigma^2}\right)^k = \langle \phi(x), \phi(x')\rangle = \sum_{k=0}^{\infty}\sum_{j \in [d]^k}\phi_{k,j}(x)\,\phi_{k,j}(x'),$$

where, after expanding $\langle x, x'\rangle^k = \sum_{j \in [d]^k}\prod_{l=1}^{k} x_{j_l} x'_{j_l}$, the features are $\phi_{k,j}(x) = e^{-\frac{\|x\|^2}{2\sigma^2}}\,\frac{1}{\sigma^{k}\sqrt{k!}}\prod_{l=1}^{k} x_{j_l}$. Truncating the outer sum at $k = r$ gives the approximate kernel $\tilde{K}$, and Taylor's theorem with remainder bounds the truncation error.
This is like an inner product on an infinite-dimensional space whose coordinates are indexed by $(k, j)$ with $j \in [d]^k$, projected onto the finite-dimensional subspace of coordinates with $k \le r$.
So we have reduced the problem to a $D = \sum_{k=0}^{r} d^{k}$-dimensional feature space. However, there are duplicates; for example, we treat $j = (1, 2, 3)$ differently from $j = (3, 1, 2)$, when in fact they can be merged into a single coordinate. So $d^{k}$ can be reduced to $\binom{d+k-1}{k}$ (this is the number of nonnegative integer solutions of $\sum_{i=1}^{d} x_i = k$). Using Pascal's rule repeatedly, we have $D = \binom{d+r}{r}$.
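Here is a sketch of the truncated, duplicate-collapsed feature map described above (the choices of $\sigma$, $d$, and $r$ are arbitrary); it checks both that the feature dimension equals $\binom{d+r}{r}$ and that $\langle \tilde{\phi}(x), \tilde{\phi}(x')\rangle$ approaches the exact Gaussian kernel as $r$ grows.

```python
import numpy as np
from itertools import combinations_with_replacement
from math import comb, factorial

def taylor_features(x, r, sigma=1.0):
    # Explicit features of the Gaussian kernel up to Taylor order r,
    # one coordinate per multiset of indices (duplicates collapsed).
    d = len(x)
    feats = []
    for k in range(r + 1):
        for idx in combinations_with_replacement(range(d), k):
            counts = np.bincount(idx, minlength=d) if k > 0 else np.zeros(d, int)
            mult = factorial(k)              # number of ordered tuples j in [d]^k
            for c in counts:                 # that map to this multiset
                mult //= factorial(int(c))
            coeff = np.sqrt(mult / (factorial(k) * sigma ** (2 * k)))
            feats.append(coeff * np.prod(x[list(idx)]) if k > 0 else coeff)
    return np.exp(-np.dot(x, x) / (2.0 * sigma ** 2)) * np.array(feats)

rng = np.random.default_rng(0)
d, r, sigma = 3, 6, 1.5
x, y = rng.normal(size=d), rng.normal(size=d)

phi_x, phi_y = taylor_features(x, r, sigma), taylor_features(y, r, sigma)
K_exact = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

print(len(phi_x), comb(d + r, r))      # feature dimension D equals C(d + r, r)
print(K_exact, np.dot(phi_x, phi_y))   # truncated kernel approaches K as r grows
```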
Approximation method 2: Random features
The Gaussian kernel is shift-invariant, $K(x, x') = K(x - x')$ (overloading notation for the function of the difference $\delta = x - x'$), and this function falls into the Schwartz class. As such, we may use the Fourier inversion formula, given by

$$K(\delta) = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d}\hat{K}(\omega)\, e^{i\langle \omega, \delta\rangle}\, d\omega, \qquad \hat{K}(\omega) = \int_{\mathbb{R}^d} K(\delta)\, e^{-i\langle \omega, \delta\rangle}\, d\delta.$$

Since $K$ is positive definite, $\hat{K}$ is nonnegative (Bochner's theorem), so after normalization it can be viewed as a probability density over frequencies $\omega$; sampling frequencies from this density and averaging the corresponding cosine features gives a finite-dimensional feature map whose inner product equals $K$ in expectation.
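One common way to exploit this is random Fourier features: for the Gaussian kernel the normalized Fourier transform is itself Gaussian, so frequencies can be sampled as $\omega \sim \mathcal{N}(0, \sigma^{-2} I)$. The sketch below uses the cosine-with-random-phase variant; this specific construction is a standard choice assumed here rather than spelled out in the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 10, 2000, 1.0

Omega = rng.normal(scale=1.0 / sigma, size=(D, d))  # frequencies omega_1, ..., omega_D
b = rng.uniform(0.0, 2.0 * np.pi, size=D)           # random phases

def phi_tilde(x):
    # z(x) = sqrt(2/D) cos(Omega x + b); E[<z(x), z(x')>] = K(x - x')
    return np.sqrt(2.0 / D) * np.cos(Omega @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
K_exact = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))
print(K_exact, np.dot(phi_tilde(x), phi_tilde(y)))   # close for large D
```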
Conclusion
We surveyed the kernel method within the realm of machine learning theory and how one might go about overcoming its computational challenges by approximating the Gaussian kernel, via Taylor approximation and random features.
References
Andrew Cotter, Joseph Keshet, and Nathan Srebro (2011). Explicit Approximations of the Gaussian Kernel. arXiv:1109.4603.
Shai Shalev-Shwartz and Shai Ben-David (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.