Approximation of the Gaussian Kernel

We explain the motivation for the kernel method in machine learning theory and how one might overcome its computational challenges.

The concept of a kernel plays a crucial role when dealing with massive datasets, as it provides a computationally feasible method to encapsulate non-linearity.

There are several popular types of kernels, including the Gaussian Kernel (RBF), the polynomial kernel, and the sigmoid kernel, each with its own advantages.

Essentially, kernels serve as a valuable tool to navigate computational limitations while still capturing the intricate non-linearity inherent in data. However, it’s important to note that approximations may be necessary in order to fully harness the power of kernels.

In this article, we aim to understand the significance of kernel methods and survey the theoretical underpinnings of Gaussian Kernel approximations, focusing on Taylor approximation and random features.


Motivation

Let’s take a look at the kernel (soft) SVM optimization problem:

$$\min_{w \in \mathcal{H}} p(w) = \frac{\lambda}{2} \|w\|^2_{\mathcal{H}} + \frac{1}{m} \sum_{i=1}^m (1 - y_i \langle w, \phi(x_i) \rangle)_+$$

Or, more generally,

$$\min_w \left(f(\langle w, \phi(x_1) \rangle, \ldots, \langle w, \phi(x_m) \rangle) + R(\|w\|)\right)$$

where $R$ is a monotonically nondecreasing function. By the representer theorem (whose proof uses the orthogonal decomposition of the Hilbert space and the monotonicity of $R$), the solution takes the form $w = \sum_{i=1}^m \alpha_i \phi(x_i)$. This implies that our minimizer can be expressed entirely in terms of $K(x_i, x_j) := \langle \phi(x_i), \phi(x_j) \rangle$. However, this might still be unsatisfactory: if evaluating the kernel $K$ takes $\Theta(d)$ operations, then evaluating $f(x) = \sum_{i=1}^m \alpha_i K(x_i, x)$ at a single point takes $\Theta(md)$ operations in total.
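Concretely, substituting $w = \sum_{i=1}^m \alpha_i \phi(x_i)$ into the soft SVM objective turns it into a finite-dimensional problem in $\alpha \in \mathbb{R}^m$ that depends on the data only through the Gram matrix $G_{ij} = K(x_i, x_j)$:

$$\min_{\alpha \in \mathbb{R}^m} \frac{\lambda}{2} \alpha^\top G \alpha + \frac{1}{m} \sum_{i=1}^m \left(1 - y_i (G\alpha)_i \right)_+$$

since $\|w\|^2_{\mathcal{H}} = \alpha^\top G \alpha$ and $\langle w, \phi(x_i) \rangle = (G\alpha)_i$.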

If we can find a way to replace $\phi : \mathcal{X} \to \mathcal{H}$ with a finite-dimensional map $\tilde{\phi} : \mathbb{R}^d \to \mathbb{R}^D$, and define

$$\tilde{f}(x) = \langle \tilde{w}, \tilde{\phi}(x) \rangle$$

we can potentially reduce the evaluation cost to $\Theta(D)$.

Notice how this bears resemblance to the method of weighted sums and quadrature points that is commonly employed for numerically approximating integrals in partial differential equations.


Approximation method 1: Taylor approximation

1. Motivation

The following theorem illustrates how the concept of projection can achieve this reduction in computational cost.

Theorem 1. Let $p^* = \inf_w p(w)$ be the optimal value of the soft SVM. Let $P$ be an orthogonal projection on $\mathcal{H}$, define $\tilde{\phi}(x) = P\phi(x)$, and let $\tilde{p}^* = \inf_{\tilde{w}} \tilde{p}(\tilde{w})$ be the optimal value of the soft SVM with $\phi$ replaced by $\tilde{\phi}$. Then we have

$$p^* \leq \tilde{p}^* \leq p^* + \frac{1}{m\sqrt{\lambda}} \sum_{i=1}^m \sqrt{K(x_i, x_i) - \tilde{K}(x_i, x_i)}$$

Proof. Note that since $P$ is an orthogonal projection, we have $\|Pw\| \leq \|w\|$. Further, writing $\tilde{w} = Pw$, we have

$$\langle \tilde{w}, \phi(x_i) \rangle = \langle Pw, \phi(x_i) \rangle = \langle w, P^* \phi(x_i) \rangle = \langle w, P \phi(x_i) \rangle = \langle w, \tilde{\phi}(x_i) \rangle$$

and so $p(Pw) \leq \tilde{p}(w)$. This implies that $p^* \leq \tilde{p}^*$, which establishes the first inequality.

For the second inequality, note that

$$|(1 - y_i \langle w, \phi(x_i) \rangle)_+ - (1 - y_i \langle w, P \phi(x_i) \rangle)_+| \leq \|w\| \, \|P^\perp \phi(x_i)\|$$

using the 1-Lipschitzness of the hinge loss together with the Cauchy–Schwarz inequality. Now note that $\|P^\perp \phi(x_i)\|^2 = \|\phi(x_i)\|^2 - \|P \phi(x_i)\|^2 = K(x_i, x_i) - \tilde{K}(x_i, x_i)$, and since regularization ensures $\|w^*\| \leq \frac{1}{\sqrt{\lambda}}$, the second inequality follows by routine algebra. □
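Spelling out the routine algebra (with $w^*$ the minimizer of $p$): combining $\|Pw^*\| \leq \|w^*\|$, the hinge-loss bound above, and $\|w^*\| \leq 1/\sqrt{\lambda}$, we get

$$\tilde{p}^* \leq \tilde{p}(Pw^*) \leq p(w^*) + \frac{1}{m} \sum_{i=1}^m \|w^*\| \, \|P^\perp \phi(x_i)\| \leq p^* + \frac{1}{m\sqrt{\lambda}} \sum_{i=1}^m \sqrt{K(x_i, x_i) - \tilde{K}(x_i, x_i)}$$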

This suggests that choosing $P$ to minimize $\sum_{i=1}^m \|\phi(x_i) - P\phi(x_i)\|$ yields a good approximation, and Taylor approximation gives one such solution.

2. Taylor approximation

For the Gaussian Kernel $K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$, we first find $\phi$, truncate it, and use Taylor’s theorem with remainder to obtain a bound.

First note that we have $K(x, x') = \langle \phi(x), \phi(x') \rangle = \sum_{k=0}^{\infty} \sum_{j \in [d]^k} \phi_{k,j}(x)\phi_{k,j}(x')$, where $\phi_{k,j}(x) = e^{-\|x\|^2/(2\sigma^2)} \frac{1}{\sigma^k \sqrt{k!}} \prod_{i=1}^k x_{j_i}$ is obtained by expanding $e^{\langle x, x' \rangle / \sigma^2}$ as a power series.

This is an inner product in an infinite-dimensional coordinate system; truncating the outer sum at $k = r$ amounts to projecting onto the finite-dimensional subspace spanned by the coordinates $\phi_{k,j}$ with $k \leq r$.

This leads to the truncated Kernel

$$\tilde{K}(x, x') = \langle \tilde{\phi}(x), \tilde{\phi}(x') \rangle = \exp\left(-\frac{\|x\|^2 + \|x'\|^2}{2\sigma^2} \right) \sum_{k=0}^r \frac{1}{k!} \left( \frac{\langle x, x' \rangle}{\sigma^2} \right)^k$$

By Taylor’s theorem with remainder, we have

$$|K(x, x') - \tilde{K}(x, x')| \leq \frac{1}{(r+1)!} \left( \frac{\|x\| \|x'\|}{\sigma^2} \right)^{r+1}$$

So we have reduced the problem to a $D = \sum_{k=0}^r d^k$-dimensional feature space. However, there are duplicates; for example, we treat $j = (1, 2, 3)$ differently from $j = (3, 1, 2)$ when in fact they can be treated the same. So $d^k$ can be reduced to $\binom{d+k-1}{k}$ (this is the number of solutions to $\sum_{i=1}^d x_i = k$ with $x_i \geq 0$). Summing over $k$ and using Pascal’s rule, we get $D = \binom{d+r}{r}$.
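As a sanity check, the truncated kernel has a closed form that we can compare against the exact Gaussian Kernel. The sketch below (helper names and test points are ours, not from Cotter et al.) also prints the Taylor remainder bound and the de-duplicated feature dimension $\binom{d+r}{r}$:

```python
import math
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    """Exact Gaussian (RBF) kernel."""
    return math.exp(-np.linalg.norm(x - xp) ** 2 / (2.0 * sigma ** 2))

def truncated_kernel(x, xp, r, sigma=1.0):
    """Order-r Taylor truncation of the Gaussian kernel (closed form)."""
    prefactor = math.exp(-(x @ x + xp @ xp) / (2.0 * sigma ** 2))
    s = (x @ xp) / sigma ** 2
    return prefactor * sum(s ** k / math.factorial(k) for k in range(r + 1))

rng = np.random.default_rng(0)
x, xp = rng.normal(size=3), rng.normal(size=3)

for r in (1, 3, 7):
    err = abs(gaussian_kernel(x, xp) - truncated_kernel(x, xp, r))
    # Taylor remainder bound from the text (sigma = 1).
    bound = (np.linalg.norm(x) * np.linalg.norm(xp)) ** (r + 1) / math.factorial(r + 1)
    print(f"r={r}: error={err:.2e}, bound={bound:.2e}")

# De-duplicated feature dimension for d = 3, r = 7: C(d + r, r).
print(math.comb(3 + 7, 7))  # 120
```

The factorial in the denominator makes the truncation converge rapidly once $r$ exceeds $\|x\|\|x'\|/\sigma^2$.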


Approximation method 2: Random features

The Gaussian Kernel is shift-invariant, $K(x, x') = K(x - x')$, and falls into the Schwartz class of functions. As such, we may use the Fourier inversion formula, given by

$$K(x - x') = \int_{\mathbb{R}^d} \widehat{K}(w) \exp(2\pi i (x - x') \cdot w) \, dw$$

$$= \int_{\mathbb{R}^d} \left[ \operatorname{Re}(\widehat{K}(w)) \cos(2\pi (x - x') \cdot w) - \operatorname{Im}(\widehat{K}(w)) \sin(2\pi (x - x') \cdot w) \right] dw$$

where the second equality holds since $K$ is real-valued. Now

$$\int_{\mathbb{R}^d} \operatorname{Im}(\widehat{K}(w)) \sin(2\pi (x - x') \cdot w) \, dw = 0$$

since $K(x - x') = K(-(x - x'))$. We thus have (appropriately scaling $\widehat{K}$ into a probability density, and denoting the real part of $\widehat{K}$ by $\bar{K}$, again for ease of notation)

$$K(x - x') = \mathbb{E}_{w \sim \bar{K}(w)} [\cos(w \cdot (x - x'))]$$

$$= \mathbb{E}_{w \sim \bar{K}(w)} [\cos(w \cdot x + \theta) \cos(w \cdot x' + \theta) + \sin(w \cdot x + \theta) \sin(w \cdot x' + \theta)]$$

which holds for any fixed $\theta$ by the angle-difference formula $\cos(A - B) = \cos A \cos B + \sin A \sin B$.

On the other hand, we have

$$\mathbb{E}_{\theta \sim U} [\sin(w \cdot x + \theta) \sin(w \cdot x' + \theta)] = \frac{1}{2} \mathbb{E}_{\theta \sim U} [\cos(w \cdot (x - x'))]$$

In conclusion, we have, for properly scaled $K$ (the resulting factor of $2$ is absorbed into the scaling),

$$K(x - x') = \mathbb{E}_{w \sim \bar{K}(w), \, \theta \sim U([0, 2\pi])} [\cos(w \cdot x + \theta) \cos(w \cdot x' + \theta)]$$

We can then employ the Monte Carlo method to approximate $K$ using this equality.

The feature map is defined as $\tilde{\phi}_j(x) = \cos(w_j \cdot x + \theta_j)$ for $j = 1, \ldots, D$, where the $w_j \sim \bar{K}(w)$ and $\theta_j \sim U([0, 2\pi])$ are drawn i.i.d.
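For the Gaussian Kernel, the density $\bar{K}$ is itself Gaussian: $w \sim \mathcal{N}(0, \sigma^{-2} I)$ if we use the $e^{i w \cdot \delta}$ convention for the Fourier transform (absorbing the $2\pi$ of the inversion formula into $w$). The minimal sketch below (all names are ours) builds the feature map, with the factor of $2$ from the scaling discussion folded into a $\sqrt{2/D}$ normalization so that the dot product is an unbiased estimate of $K$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 5, 4000, 1.0

# Random Fourier features: w_j ~ N(0, I / sigma^2) is the spectral density
# of the Gaussian kernel; theta_j ~ Uniform[0, 2*pi].
W = rng.normal(scale=1.0 / sigma, size=(D, d))
theta = rng.uniform(0.0, 2.0 * np.pi, size=D)

def phi_tilde(x):
    """Feature map R^d -> R^D; sqrt(2/D) makes the dot product unbiased."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + theta)

def gaussian_kernel(x, xp):
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2.0 * sigma ** 2))

x, xp = rng.normal(size=d), rng.normal(size=d)
exact = gaussian_kernel(x, xp)
approx = phi_tilde(x) @ phi_tilde(xp)
print(f"exact={exact:.4f}, approx={approx:.4f}")
```

The Monte Carlo error decays like $O(1/\sqrt{D})$, so a few thousand features already give a close match.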


Fact 1. (Rahimi and Recht, union bound for the random features method) Let $\tilde{K}$ be the kernel defined by $D$ random Fourier features. For any $\epsilon > 0$, we have

$$P\left[ \sup_{\|x\|, \|y\| \leq R} \left| K(x, y) - \tilde{K}(x, y) \right| \geq \epsilon \right] \leq 2^8 \cdot \frac{d}{\epsilon^2} \left( \frac{R}{\sigma} \right)^2 \exp\left( - \frac{D\epsilon^2}{4(2 + d)} \right)$$
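A rough empirical illustration of this uniform bound (a sketch of ours, not Rahimi and Recht's experiment): draw many points in a ball of radius $R$ and measure the worst-case deviation over all pairs.

```python
import numpy as np

rng = np.random.default_rng(1)
d, D, sigma, R = 2, 5000, 1.0, 1.0

# Same random Fourier feature construction as before.
W = rng.normal(scale=1.0 / sigma, size=(D, d))
theta = rng.uniform(0.0, 2.0 * np.pi, size=D)

def phi_tilde(X):
    # Vectorized feature map for a batch of points: (n, d) -> (n, D).
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + theta)

# Random points pushed into the ball of radius R.
X = rng.uniform(-1.0, 1.0, size=(200, d))
X *= R / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)

F = phi_tilde(X)
K_approx = F @ F.T
sq = np.sum(X ** 2, axis=1)
K_exact = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma ** 2))
max_err = np.abs(K_exact - K_approx).max()
print(f"max |K - K_tilde| over all pairs: {max_err:.3f}")
```

The observed worst-case error over the grid shrinks as $D$ grows, consistent with the exponential tail in the bound above.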


Conclusion

We surveyed the kernel method in machine learning theory and two ways of overcoming the computational challenges of the Gaussian Kernel: Taylor approximation and random features.

References

  1. Andrew Cotter, Joseph Keshet, and Nati Srebro (2011). Explicit Approximations of the Gaussian Kernel. arXiv:1109.4603.

  2. Shai Shalev-Shwartz and Shai Ben-David (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

  3. Ali Rahimi and Benjamin Recht (2007). Random Features for Large-Scale Kernel Machines. Advances in Neural Information Processing Systems 20.