Approximation of the Gaussian Kernel

We explain the motivation for the kernel method in machine learning theory and how one might overcome its computational challenges.

The concept of a kernel plays a crucial role when dealing with massive datasets, as it provides a computationally feasible method to encapsulate non-linearity.

There are several popular types of kernels, including the Gaussian Kernel (RBF), the polynomial kernel, and the sigmoid kernel, each with its own advantages.

Essentially, kernels serve as a valuable tool to navigate computational limitations while still capturing the intricate non-linearity inherent in data. However, it’s important to note that approximations may be necessary in order to fully harness the power of kernels.

In this article, we aim to understand the significance of kernel methods and survey the theoretical underpinnings of Gaussian Kernel approximations, focusing on Taylor approximation and random features.


Motivation

Let’s take a look at the kernel (soft) SVM optimization problem:

$$\min_{w \in \mathcal{H}} p(w) = \frac{\lambda}{2} \|w\|^2_{\mathcal{H}} + \frac{1}{m} \sum_{i=1}^m (1 - y_i \langle w, \phi(x_i) \rangle)_+$$

Or, more generally,

$$\min_w \left(f(\langle w, \phi(x_1) \rangle, \ldots, \langle w, \phi(x_m) \rangle) + R(\|w\|)\right)$$

where $R$ is a monotonically nondecreasing function. By the representer theorem (whose proof uses the orthogonal decomposition of the Hilbert space and the monotonicity of $R$), the solution takes the form $w = \sum_{i=1}^m \alpha_i \phi(x_i)$. This implies that our minimizer can be expressed entirely in terms of $K(x_i, x_j) := \langle \phi(x_i), \phi(x_j) \rangle$. However, this might still be unsatisfactory: if evaluating the kernel $K$ takes $\Theta(d)$ operations, then evaluating $f(x) = \sum_{i=1}^m \alpha_i K(x_i, x)$ at a single point takes $\Theta(md)$ operations in total.
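Concretely, substituting $w = \sum_{i=1}^m \alpha_i \phi(x_i)$ into the soft SVM objective turns it into a finite-dimensional problem in $\alpha \in \mathbb{R}^m$ that depends on the data only through the Gram matrix $G_{ij} = K(x_i, x_j)$:

$$\min_{\alpha \in \mathbb{R}^m} \frac{\lambda}{2} \alpha^\top G \alpha + \frac{1}{m} \sum_{i=1}^m \left(1 - y_i (G\alpha)_i \right)_+$$

since $\|w\|^2_{\mathcal{H}} = \alpha^\top G \alpha$ and $\langle w, \phi(x_i) \rangle = (G\alpha)_i$.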

If we can find a way to replace $\phi : \mathcal{X} \to \mathcal{H}$ with a finite-dimensional map $\tilde{\phi} : \mathbb{R}^d \to \mathbb{R}^D$, and define

$$\tilde{f}(x) = \langle \tilde{w}, \tilde{\phi}(x) \rangle$$

we can potentially reduce the evaluation cost to $\Theta(D)$.

Notice how this bears resemblance to the method of weighted sums and quadrature points that is commonly employed for numerically approximating integrals in partial differential equations.


Approximation method 1: Taylor approximation

1. Motivation

The following theorem illustrates how the concept of projection can achieve this reduction in computational cost.

Theorem 1. Let $p^* = \inf_w p(w)$ be the optimal value of the soft SVM. Let $P$ be an orthogonal projection on $\mathcal{H}$, define $\tilde{\phi}(x) = P\phi(x)$, and let $\tilde{p}^* = \inf_{\tilde{w}} \tilde{p}(\tilde{w})$ be the optimal value of the soft SVM with $\phi$ replaced by $\tilde{\phi}$. Then we have

$$p^* \leq \tilde{p}^* \leq p^* + \frac{1}{m\sqrt{\lambda}} \sum_{i=1}^m \sqrt{K(x_i, x_i) - \tilde{K}(x_i, x_i)}$$

Proof. Note that since $P$ is an orthogonal projection, we have $\|Pw\| \leq \|w\|$. Further, writing $\tilde{w} = Pw$, we have

$$\langle \tilde{w}, \phi(x_i) \rangle = \langle Pw, \phi(x_i) \rangle = \langle w, P^* \phi(x_i) \rangle = \langle w, P \phi(x_i) \rangle = \langle w, \tilde{\phi}(x_i) \rangle$$

and so $p(Pw) \leq \tilde{p}(w)$. This implies that $p^* \leq \tilde{p}^*$, which establishes the first inequality.

For the second inequality, note that

$$|(1 - y_i \langle w, \phi(x_i) \rangle)_+ - (1 - y_i \langle w, P \phi(x_i) \rangle)_+| \leq \|w\| \, \|P^\perp \phi(x_i)\|$$

using the 1-Lipschitzness of the hinge loss together with the Cauchy–Schwarz inequality. Now note that $\|P^\perp \phi(x_i)\|^2 = \|\phi(x_i)\|^2 - \|P \phi(x_i)\|^2 = K(x_i, x_i) - \tilde{K}(x_i, x_i)$, and since regularization ensures $\|w^*\| \leq \frac{1}{\sqrt{\lambda}}$, the second inequality follows by routine algebra. □
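Spelling out the routine algebra (with $w^*$ the minimizer of $p$): combining $\|Pw^*\| \leq \|w^*\|$, the hinge-loss bound above, and $\|w^*\| \leq 1/\sqrt{\lambda}$, we get

$$\tilde{p}^* \leq \tilde{p}(Pw^*) \leq p(w^*) + \frac{1}{m} \sum_{i=1}^m \|w^*\| \, \|P^\perp \phi(x_i)\| \leq p^* + \frac{1}{m\sqrt{\lambda}} \sum_{i=1}^m \sqrt{K(x_i, x_i) - \tilde{K}(x_i, x_i)}$$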

This suggests that choosing $P$ to minimize $\sum_{i=1}^m \|\phi(x_i) - P\phi(x_i)\|$ yields a good approximation, and Taylor approximation gives one such solution.

2. Taylor approximation

For the Gaussian Kernel $K(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$, we first find $\phi$, truncate it, and use Taylor’s theorem with remainder to obtain a bound.

First note that we have $K(x, x') = \langle \phi(x), \phi(x') \rangle = \sum_{k=0}^{\infty} \sum_{j \in [d]^k} \phi_{k,j}(x)\phi_{k,j}(x')$, where $\phi_{k,j}(x) = e^{-\|x\|^2/(2\sigma^2)} \frac{1}{\sigma^k \sqrt{k!}} \prod_{i=1}^k x_{j_i}$ is obtained by expanding $e^{\langle x, x' \rangle / \sigma^2}$ as a power series.

This is an inner product in an infinite-dimensional coordinate system; truncating the outer sum at $k = r$ amounts to projecting onto the finite-dimensional subspace spanned by the coordinates $\phi_{k,j}$ with $k \leq r$.

This leads to the truncated Kernel

$$\tilde{K}(x, x') = \langle \tilde{\phi}(x), \tilde{\phi}(x') \rangle = \exp\left(-\frac{\|x\|^2 + \|x'\|^2}{2\sigma^2} \right) \sum_{k=0}^r \frac{1}{k!} \left( \frac{\langle x, x' \rangle}{\sigma^2} \right)^k$$

By Taylor’s theorem with remainder, we have

$$|K(x, x') - \tilde{K}(x, x')| \leq \frac{1}{(r+1)!} \left( \frac{\|x\| \|x'\|}{\sigma^2} \right)^{r+1}$$

So we have reduced the problem to a $D = \sum_{k=0}^r d^k$-dimensional feature space. However, there are duplicates; for example, we treat $j = (1, 2, 3)$ differently from $j = (3, 1, 2)$ when in fact they can be treated the same. So $d^k$ can be reduced to $\binom{d+k-1}{k}$ (this is the number of solutions to $\sum_{i=1}^d x_i = k$ with $x_i \geq 0$). Summing over $k$ and using Pascal’s rule, we get $D = \binom{d+r}{r}$.
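As a sanity check, the truncated kernel has a closed form that we can compare against the exact Gaussian Kernel. The sketch below (helper names and test points are ours, not from Cotter et al.) also prints the Taylor remainder bound and the de-duplicated feature dimension $\binom{d+r}{r}$:

```python
import math
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    """Exact Gaussian (RBF) kernel."""
    return math.exp(-np.linalg.norm(x - xp) ** 2 / (2.0 * sigma ** 2))

def truncated_kernel(x, xp, r, sigma=1.0):
    """Order-r Taylor truncation of the Gaussian kernel (closed form)."""
    prefactor = math.exp(-(x @ x + xp @ xp) / (2.0 * sigma ** 2))
    s = (x @ xp) / sigma ** 2
    return prefactor * sum(s ** k / math.factorial(k) for k in range(r + 1))

rng = np.random.default_rng(0)
x, xp = rng.normal(size=3), rng.normal(size=3)

for r in (1, 3, 7):
    err = abs(gaussian_kernel(x, xp) - truncated_kernel(x, xp, r))
    # Taylor remainder bound from the text (sigma = 1).
    bound = (np.linalg.norm(x) * np.linalg.norm(xp)) ** (r + 1) / math.factorial(r + 1)
    print(f"r={r}: error={err:.2e}, bound={bound:.2e}")

# De-duplicated feature dimension for d = 3, r = 7: C(d + r, r).
print(math.comb(3 + 7, 7))  # 120
```

The factorial in the denominator makes the truncation converge rapidly once $r$ exceeds $\|x\|\|x'\|/\sigma^2$.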


Approximation method 2: Random features

The Gaussian Kernel is shift-invariant, $K(x, x') = K(x - x')$, and falls into the Schwartz class of functions. As such, we may use the Fourier inversion formula, given by

$$K(x - x') = \int_{\mathbb{R}^d} \widehat{K}(w) \exp(2\pi i (x - x') \cdot w) \, dw$$

$$= \int_{\mathbb{R}^d} \left[ \operatorname{Re}(\widehat{K}(w)) \cos(2\pi (x - x') \cdot w) - \operatorname{Im}(\widehat{K}(w)) \sin(2\pi (x - x') \cdot w) \right] dw$$

where the second equality holds since $K$ is real-valued. Now

$$\int_{\mathbb{R}^d} \operatorname{Im}(\widehat{K}(w)) \sin(2\pi (x - x') \cdot w) \, dw = 0$$

since $K(x - x') = K(-(x - x'))$. We thus have (appropriately scaling $\widehat{K}$ into a probability density, and denoting the real part of $\widehat{K}$ by $\bar{K}$, again for ease of notation)

$$K(x - x') = \mathbb{E}_{w \sim \bar{K}(w)} [\cos(w \cdot (x - x'))]$$

$$= \mathbb{E}_{w \sim \bar{K}(w)} [\cos(w \cdot x + \theta) \cos(w \cdot x' + \theta) + \sin(w \cdot x + \theta) \sin(w \cdot x' + \theta)]$$

which holds for any fixed $\theta$ by the angle-difference formula $\cos(A - B) = \cos A \cos B + \sin A \sin B$.

On the other hand, we have

$$\mathbb{E}_{\theta \sim U} [\sin(w \cdot x + \theta) \sin(w \cdot x' + \theta)] = \frac{1}{2} \mathbb{E}_{\theta \sim U} [\cos(w \cdot (x - x'))]$$

In conclusion, we have, for properly scaled $K$ (the resulting factor of $2$ is absorbed into the scaling),

$$K(x - x') = \mathbb{E}_{w \sim \bar{K}(w), \, \theta \sim U([0, 2\pi])} [\cos(w \cdot x + \theta) \cos(w \cdot x' + \theta)]$$

We can then employ the Monte Carlo method to approximate $K$ using this equality.

The feature map is defined as $\tilde{\phi}_j(x) = \cos(w_j \cdot x + \theta_j)$ for $j = 1, \ldots, D$, where the $w_j \sim \bar{K}(w)$ and $\theta_j \sim U([0, 2\pi])$ are drawn i.i.d.
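For the Gaussian Kernel, the density $\bar{K}$ is itself Gaussian: $w \sim \mathcal{N}(0, \sigma^{-2} I)$ if we use the $e^{i w \cdot \delta}$ convention for the Fourier transform (absorbing the $2\pi$ of the inversion formula into $w$). The minimal sketch below (all names are ours) builds the feature map, with the factor of $2$ from the scaling discussion folded into a $\sqrt{2/D}$ normalization so that the dot product is an unbiased estimate of $K$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 5, 4000, 1.0

# Random Fourier features: w_j ~ N(0, I / sigma^2) is the spectral density
# of the Gaussian kernel; theta_j ~ Uniform[0, 2*pi].
W = rng.normal(scale=1.0 / sigma, size=(D, d))
theta = rng.uniform(0.0, 2.0 * np.pi, size=D)

def phi_tilde(x):
    """Feature map R^d -> R^D; sqrt(2/D) makes the dot product unbiased."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + theta)

def gaussian_kernel(x, xp):
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2.0 * sigma ** 2))

x, xp = rng.normal(size=d), rng.normal(size=d)
exact = gaussian_kernel(x, xp)
approx = phi_tilde(x) @ phi_tilde(xp)
print(f"exact={exact:.4f}, approx={approx:.4f}")
```

The Monte Carlo error decays like $O(1/\sqrt{D})$, so a few thousand features already give a close match.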


Fact 1. (Rahimi and Recht, union bound for the random features method) Let $\tilde{K}$ be the kernel defined by $D$ random Fourier features. For any $\epsilon > 0$, we have

$$P\left[ \sup_{\|x\|, \|y\| \leq R} \left| K(x, y) - \tilde{K}(x, y) \right| \geq \epsilon \right] \leq 2^8 \cdot \frac{d}{\epsilon^2} \left( \frac{R}{\sigma} \right)^2 \exp\left( - \frac{D\epsilon^2}{4(2 + d)} \right)$$
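A rough empirical illustration of this uniform bound (a sketch of ours, not Rahimi and Recht's experiment): draw many points in a ball of radius $R$ and measure the worst-case deviation over all pairs.

```python
import numpy as np

rng = np.random.default_rng(1)
d, D, sigma, R = 2, 5000, 1.0, 1.0

# Same random Fourier feature construction as before.
W = rng.normal(scale=1.0 / sigma, size=(D, d))
theta = rng.uniform(0.0, 2.0 * np.pi, size=D)

def phi_tilde(X):
    # Vectorized feature map for a batch of points: (n, d) -> (n, D).
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + theta)

# Random points pushed into the ball of radius R.
X = rng.uniform(-1.0, 1.0, size=(200, d))
X *= R / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)

F = phi_tilde(X)
K_approx = F @ F.T
sq = np.sum(X ** 2, axis=1)
K_exact = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma ** 2))
max_err = np.abs(K_exact - K_approx).max()
print(f"max |K - K_tilde| over all pairs: {max_err:.3f}")
```

The observed worst-case error over the grid shrinks as $D$ grows, consistent with the exponential tail in the bound above.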


Conclusion

We surveyed the kernel method in machine learning theory and two ways of overcoming the computational challenges of the Gaussian Kernel: Taylor approximation and random features.

References

  1. Andrew Cotter, Joseph Keshet, and Nati Srebro (2011). Explicit Approximations of the Gaussian Kernel. arXiv:1109.4603.

  2. Shai Shalev-Shwartz and Shai Ben-David (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.

  3. Ali Rahimi and Benjamin Recht (2007). Random Features for Large-Scale Kernel Machines. Advances in Neural Information Processing Systems 20.