## Subset of Regressors Approximation for GPR Models

The subset of regressors (SR) approximation method consists of replacing the kernel function $k\left(x,{x}_{r}|\theta \right)$ in the exact GPR method by its approximation ${\stackrel{^}{k}}_{SR}\left(x,{x}_{r}|\theta ,\mathcal{A}\right)$, given the active set $\mathcal{A}\subset \mathcal{N}=\left\{1,2,...,n\right\}$. You can specify the SR method for parameter estimation by using the `'FitMethod','sr'` name-value pair argument in the call to `fitrgp`. For prediction using SR, you can use the `'PredictMethod','sr'` name-value pair argument in the call to `fitrgp`.

### Approximating the Kernel Function

For the exact GPR model, the expected prediction depends on the set of $n$ functions ${\mathcal{S}}_{\mathcal{N}}=\left\{k\left(x,{x}_{i}|\theta \right),i=1,2,\dots ,n\right\}$, where $\mathcal{N}=\left\{1,2,...,n\right\}$ is the set of indices of all observations, and $n$ is the total number of observations. The idea is to approximate the span of these functions by a smaller set of functions, ${\mathcal{S}}_{\mathcal{A}}$, where $\mathcal{A}\subset \mathcal{N}=\left\{1,2,...,n\right\}$ is the subset of indices of points selected to be in the active set. Consider ${\mathcal{S}}_{\mathcal{A}}=\left\{k\left(x,{x}_{j}|\theta \right),j\in \mathcal{A}\right\}$. The aim is to approximate the elements of ${\mathcal{S}}_{\mathcal{N}}$ as linear combinations of the elements of ${\mathcal{S}}_{\mathcal{A}}$.

Suppose the approximation to $k\left(x,{x}_{r}|\theta \right)$ using the functions in ${\mathcal{S}}_{\mathcal{A}}$ is as follows:

`$\stackrel{^}{k}\left(x,{x}_{r}|\theta \right)=\sum _{j\in \mathcal{A}}{c}_{jr}k\left(x,{x}_{j}|\theta \right),$`

where ${c}_{jr}\in ℝ$ are the coefficients of the linear combination for approximating $k\left(x,{x}_{r}|\theta \right)$. Suppose $C$ is the matrix that contains all the coefficients ${c}_{jr}$. Then $C$ is a $|\mathcal{A}|×n$ matrix such that $C\left(j,r\right)={c}_{jr}$. The software finds the best approximation to the elements of ${\mathcal{S}}_{\mathcal{N}}$ using the active set $\mathcal{A}\subset \mathcal{N}=\left\{1,2,...,n\right\}$ by minimizing the error function

`$E\left(\mathcal{A},C\right)=\sum _{r=1}^{n}{‖k\left(x,{x}_{r}|\theta \right)-\stackrel{^}{k}\left(x,{x}_{r}|\theta \right)‖}_{ℋ}^{2},$`

where $ℋ$ is the reproducing kernel Hilbert space (RKHS) associated with the kernel function $k$ [1], [2].
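
The optimality of the coefficient matrix given below can be checked numerically. The following is a minimal NumPy sketch (not the `fitrgp` implementation); the squared exponential kernel, data, and active set are made-up choices for illustration. Expanding the RKHS norm via the reproducing property gives, per column $r$, the quadratic $k\left({x}_{r},{x}_{r}\right)-2{c}_{r}^{T}K\left({X}_{\mathcal{A}},{x}_{r}\right)+{c}_{r}^{T}K\left({X}_{\mathcal{A}},{X}_{\mathcal{A}}\right){c}_{r}$, which is minimized column by column:

```python
# Numerical check (NumPy illustration, not the fitrgp implementation):
# the RKHS error E(A, C) expands, via the reproducing property, to
#   sum_r [ k(x_r, x_r) - 2 c_r' K(X_A, x_r) + c_r' K(X_A, X_A) c_r ],
# whose per-column minimizer is c_r = K(X_A, X_A)^{-1} K(X_A, x_r).
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))          # 20 observations, 2 predictors
A = [0, 3, 7, 12]                         # hypothetical active-set indices

def k(a, b, sigma_l=1.0):
    """Squared exponential kernel (one possible choice of k(x, x' | theta))."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / sigma_l**2)

K_AA = k(X[A], X[A])                      # K(X_A, X_A | theta)
K_AX = k(X[A], X)                         # K(X_A, X | theta)

def rkhs_error(C):
    """E(A, C) computed via the reproducing property."""
    Kdiag = np.ones(len(X))               # k(x_r, x_r) = 1 for this kernel
    return np.sum(Kdiag
                  - 2 * np.sum(C * K_AX, axis=0)
                  + np.einsum('ir,ij,jr->r', C, K_AA, C))

C_opt = np.linalg.solve(K_AA, K_AX)       # C = K(X_A, X_A)^{-1} K(X_A, X)
C_perturbed = C_opt + 0.1 * rng.standard_normal(C_opt.shape)
assert rkhs_error(C_opt) < rkhs_error(C_perturbed)
```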

The coefficient matrix that minimizes $E\left(\mathcal{A},C\right)$ is

`${\stackrel{^}{C}}_{\mathcal{A}}=K{\left({X}_{\mathcal{A}},{X}_{\mathcal{A}}|\theta \right)}^{-1}K\left({X}_{\mathcal{A}},X|\theta \right),$`

where ${X}_{\mathcal{A}}$ is the matrix containing the rows of $X$ indexed by the active set $\mathcal{A}$, and an approximation to the kernel function using the elements in the active set $\mathcal{A}\subset \mathcal{N}=\left\{1,2,...,n\right\}$ is

`$\stackrel{^}{k}\left(x,{x}_{r}|\theta \right)=K\left({x}^{T},{X}_{\mathcal{A}}|\theta \right){\stackrel{^}{C}}_{\mathcal{A}}\left(:,r\right).$`

The SR approximation to the kernel function using the active set $\mathcal{A}\subset \mathcal{N}=\left\{1,2,...,n\right\}$ is defined as:

`${\stackrel{^}{k}}_{SR}\left(x,{x}_{r}|\theta ,\mathcal{A}\right)=K\left({x}^{T},{X}_{\mathcal{A}}|\theta \right)K{\left({X}_{\mathcal{A}},{X}_{\mathcal{A}}|\theta \right)}^{-1}K\left({X}_{\mathcal{A}},{x}_{r}^{T}|\theta \right),$`

and the SR approximation to $K\left(X,X|\theta \right)$ is:

`${\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)=K\left(X,{X}_{\mathcal{A}}|\theta \right)K{\left({X}_{\mathcal{A}},{X}_{\mathcal{A}}|\theta \right)}^{-1}K\left({X}_{\mathcal{A}},X|\theta \right).$`

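As a quick numerical illustration (a NumPy sketch, not part of `fitrgp`; the kernel, data, and active set are made up), the SR kernel matrix is a product of kernel blocks with rank at most $|\mathcal{A}|$, which is what makes the approximation cheap, and it reproduces the exact kernel on the active-set rows and columns:

```python
# NumPy illustration: the SR kernel matrix
#   K_SR(X, X) = K(X, X_A) K(X_A, X_A)^{-1} K(X_A, X)
# has rank at most |A| and is exact on the active-set block.
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
A = list(range(10))                        # hypothetical active set

def k(a, b):
    # Squared exponential kernel (illustrative choice)
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2)

K_XA = k(X, X[A])                          # K(X, X_A | theta)
K_AA = k(X[A], X[A])                       # K(X_A, X_A | theta)
K_SR = K_XA @ np.linalg.solve(K_AA, K_XA.T)

assert np.linalg.matrix_rank(K_SR) <= len(A)
# On the active-set rows and columns, the approximation is exact:
assert np.allclose(K_SR[np.ix_(A, A)], K_AA)
```

Storing `K_XA` and `K_AA` instead of the full $n×n$ kernel matrix is what reduces the memory and computation cost of the method.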
### Parameter Estimation

Replacing $K\left(X,X|\theta \right)$ by ${\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)$ in the marginal log likelihood function produces its SR approximation:

`$\begin{array}{ll}\mathrm{log}{P}_{SR}\left(y|X,\beta ,\theta ,{\sigma }^{2},\mathcal{A}\right)=\hfill & -\frac{1}{2}{\left(y-H\beta \right)}^{T}{\left[{\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}\right]}^{-1}\left(y-H\beta \right)\hfill \\ \hfill & -\frac{n}{2}\mathrm{log}2\pi -\frac{1}{2}\mathrm{log}|{\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}|.\hfill \end{array}$`

As in the exact method, the software estimates the parameters by first computing $\stackrel{^}{\beta }\left(\theta ,{\sigma }^{2}\right)$, the optimal estimate of $\beta$, given $\theta$ and ${\sigma }^{2}$. Then it estimates $\theta$ and ${\sigma }^{2}$ using the $\beta$-profiled marginal log likelihood. The SR estimate of $\beta$ for given $\theta$ and ${\sigma }^{2}$ is:

`${\stackrel{^}{\beta }}_{SR}\left(\theta ,{\sigma }^{2},\mathcal{A}\right)={\left[\underset{*}{\underbrace{{H}^{T}{\left[{\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}\right]}^{-1}H}}\right]}^{-1}\underset{**}{\underbrace{{H}^{T}{\left[{\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}\right]}^{-1}y}},$`

where

`$\begin{array}{l}{\left[{\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}\right]}^{-1}=\frac{{I}_{n}}{{\sigma }^{2}}-\frac{K\left(X,{X}_{\mathcal{A}}|\theta \right)}{{\sigma }^{2}}{A}_{\mathcal{A}}^{-1}\frac{K\left({X}_{\mathcal{A}},X|\theta \right)}{{\sigma }^{2}},\\ {A}_{\mathcal{A}}=K\left({X}_{\mathcal{A}},{X}_{\mathcal{A}}|\theta \right)+\frac{K\left({X}_{\mathcal{A}},X|\theta \right)K\left(X,{X}_{\mathcal{A}}|\theta \right)}{{\sigma }^{2}},\\ *=\frac{{H}^{T}H}{{\sigma }^{2}}-\frac{{H}^{T}K\left(X,{X}_{\mathcal{A}}|\theta \right)}{{\sigma }^{2}}{A}_{\mathcal{A}}^{-1}\frac{K\left({X}_{\mathcal{A}},X|\theta \right)H}{{\sigma }^{2}},\\ **=\frac{{H}^{T}y}{{\sigma }^{2}}-\frac{{H}^{T}K\left(X,{X}_{\mathcal{A}}|\theta \right)}{{\sigma }^{2}}{A}_{\mathcal{A}}^{-1}\frac{K\left({X}_{\mathcal{A}},X|\theta \right)y}{{\sigma }^{2}}.\end{array}$`
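
The matrix-inversion identity above (a Woodbury-type identity) is what makes the SR method efficient: only the $|\mathcal{A}|×|\mathcal{A}|$ matrix ${A}_{\mathcal{A}}$ is inverted, never the full $n×n$ matrix. The following NumPy sketch (not the `fitrgp` implementation; kernel, data, and noise level are arbitrary choices) checks the identity numerically:

```python
# Sanity check of the matrix-inversion identity (NumPy illustration):
#   [K_SR + s2*I]^{-1} = I/s2 - (K_XA/s2) A_A^{-1} (K_AX/s2),
#   A_A = K_AA + K_AX K_XA / s2.
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((40, 2))
A = list(range(8))
sigma2 = 0.25                              # noise variance (arbitrary)

def k(a, b):
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2)

n = len(X)
K_XA = k(X, X[A])
K_AA = k(X[A], X[A])
K_SR = K_XA @ np.linalg.solve(K_AA, K_XA.T)

# Direct inverse of the n-by-n matrix ...
direct = np.linalg.inv(K_SR + sigma2 * np.eye(n))
# ... versus the identity, which only solves with the |A|-by-|A| matrix A_A:
A_A = K_AA + (K_XA.T @ K_XA) / sigma2
woodbury = np.eye(n) / sigma2 \
    - (K_XA / sigma2) @ np.linalg.solve(A_A, K_XA.T / sigma2)

assert np.allclose(direct, woodbury)
```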

And the SR approximation to the $\beta$-profiled marginal log likelihood is:

`$\begin{array}{l}\mathrm{log}{P}_{SR}\left(y|X,{\stackrel{^}{\beta }}_{SR}\left(\theta ,{\sigma }^{2},\mathcal{A}\right),\theta ,{\sigma }^{2},\mathcal{A}\right)=\\ \begin{array}{ll}\hfill & -\frac{1}{2}{\left(y-H{\stackrel{^}{\beta }}_{SR}\left(\theta ,{\sigma }^{2},\mathcal{A}\right)\right)}^{T}{\left[{\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}\right]}^{-1}\left(y-H{\stackrel{^}{\beta }}_{SR}\left(\theta ,{\sigma }^{2},\mathcal{A}\right)\right)\hfill \\ \hfill & -\frac{n}{2}\mathrm{log}2\pi -\frac{1}{2}\mathrm{log}|{\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}|.\hfill \end{array}\end{array}$`

### Prediction

The SR approximation to the distribution of ${y}_{new}$ given $y$, $X$, and ${x}_{new}$ is

`$P\left({y}_{new}|y,X,{x}_{new}\right)=\mathcal{N}\left({y}_{new}|h{\left({x}_{new}\right)}^{T}\beta +{\mu }_{SR},{\sigma }_{new}^{2}+{\Sigma }_{SR}\right),$`

where ${\mu }_{SR}$ and ${\Sigma }_{SR}$ are the SR approximations to $\mu$ and $\Sigma$ shown in prediction using the exact GPR method.

${\mu }_{SR}$ and ${\Sigma }_{SR}$ are obtained by replacing $k\left(x,{x}_{r}|\theta \right)$ by its SR approximation ${\stackrel{^}{k}}_{SR}\left(x,{x}_{r}|\theta ,\mathcal{A}\right)$ in $\mu$ and $\Sigma$, respectively.

That is,

`${\mu }_{SR}={\stackrel{^}{K}}_{SR}\left({x}_{new}^{T},X|\theta ,\mathcal{A}\right){\left({\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}\right)}^{-1}\left(y-H\beta \right).$`

Since

`${\stackrel{^}{K}}_{SR}\left({x}_{new}^{T},X|\theta ,\mathcal{A}\right)=K\left({x}_{new}^{T},{X}_{\mathcal{A}}|\theta \right)K{\left({X}_{\mathcal{A}},{X}_{\mathcal{A}}|\theta \right)}^{-1}K\left({X}_{\mathcal{A}},X|\theta \right),$`

and from the fact that ${\left[{\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}\right]}^{-1}=\frac{{I}_{n}}{{\sigma }^{2}}-\frac{K\left(X,{X}_{\mathcal{A}}|\theta \right)}{{\sigma }^{2}}{A}_{\mathcal{A}}^{-1}\frac{K\left({X}_{\mathcal{A}},X|\theta \right)}{{\sigma }^{2}}$, ${\mu }_{SR}$ can be written as

`${\mu }_{SR}=\frac{K\left({x}_{new}^{T},{X}_{\mathcal{A}}|\theta \right)}{{\sigma }^{2}}{A}_{\mathcal{A}}^{-1}K\left({X}_{\mathcal{A}},X|\theta \right)\left(y-H\beta \right).$`

Similarly, ${\Sigma }_{SR}$ is derived as follows:

`${\Sigma }_{SR}=\underset{*}{\underbrace{{\stackrel{^}{k}}_{SR}\left({x}_{new},{x}_{new}|\theta ,\mathcal{A}\right)}}-\underset{**}{\underbrace{{\stackrel{^}{K}}_{SR}\left({x}_{new}^{T},X|\theta ,\mathcal{A}\right)}}\underset{***}{\underbrace{{\left({\stackrel{^}{K}}_{SR}\left(X,X|\theta ,\mathcal{A}\right)+{\sigma }^{2}{I}_{n}\right)}^{-1}}}\underset{****}{\underbrace{{\stackrel{^}{K}}_{SR}\left(X,{x}_{new}^{T}|\theta ,\mathcal{A}\right)}}.$`

Because

`$\begin{array}{l}*={\stackrel{^}{k}}_{SR}\left({x}_{new},{x}_{new}|\theta ,\mathcal{A}\right)=K\left({x}_{new}^{T},{X}_{\mathcal{A}}|\theta \right)K{\left({X}_{\mathcal{A}},{X}_{\mathcal{A}}|\theta \right)}^{-1}K\left({X}_{\mathcal{A}},{x}_{new}^{T}|\theta \right),\\ **=K\left({x}_{new}^{T},{X}_{\mathcal{A}}|\theta \right)K{\left({X}_{\mathcal{A}},{X}_{\mathcal{A}}|\theta \right)}^{-1}K\left({X}_{\mathcal{A}},X|\theta \right),\\ ***=\frac{{I}_{n}}{{\sigma }^{2}}-\frac{K\left(X,{X}_{\mathcal{A}}|\theta \right)}{{\sigma }^{2}}{A}_{\mathcal{A}}^{-1}\frac{K\left({X}_{\mathcal{A}},X|\theta \right)}{{\sigma }^{2}},\\ ****=K\left(X,{X}_{\mathcal{A}}|\theta \right)K{\left({X}_{\mathcal{A}},{X}_{\mathcal{A}}|\theta \right)}^{-1}K\left({X}_{\mathcal{A}},{x}_{new}^{T}|\theta \right),\end{array}$`

${\Sigma }_{SR}$ is found as follows:

`${\Sigma }_{SR}=K\left({x}_{new}^{T},{X}_{\mathcal{A}}|\theta \right){A}_{\mathcal{A}}^{-1}K\left({X}_{\mathcal{A}},{x}_{new}^{T}|\theta \right).$`
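
The reduced SR prediction formulas, which involve only $|\mathcal{A}|×|\mathcal{A}|$ solves, can be cross-checked against the direct forms built from the full $n×n$ SR kernel matrix. The following NumPy sketch (not the `fitrgp` implementation; kernel, data, noise level, and the residual $y-H\beta$ are made-up inputs) does that check:

```python
# Cross-check: SR predictive mean and variance computed directly from
# K_SR (n-by-n solves) versus the reduced formulas (|A|-by-|A| solves).
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((30, 2))
x_new = rng.standard_normal((1, 2))
A = list(range(6))
sigma2 = 0.1
r = rng.standard_normal(30)               # stands in for y - H*beta

def k(a, b):
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2)

n = len(X)
K_XA, K_AA = k(X, X[A]), k(X[A], X[A])
K_nA = k(x_new, X[A])                     # K(x_new^T, X_A | theta)
A_A = K_AA + (K_XA.T @ K_XA) / sigma2

# Direct form: build K_SR explicitly and invert the n-by-n system.
K_SR = K_XA @ np.linalg.solve(K_AA, K_XA.T)
K_SR_newX = K_nA @ np.linalg.solve(K_AA, K_XA.T)
inv = np.linalg.inv(K_SR + sigma2 * np.eye(n))
mu_direct = K_SR_newX @ inv @ r
Sigma_direct = (K_nA @ np.linalg.solve(K_AA, K_nA.T)
                - K_SR_newX @ inv @ K_SR_newX.T)

# Reduced form: only the |A|-by-|A| matrix A_A is solved.
mu_reduced = (K_nA / sigma2) @ np.linalg.solve(A_A, K_XA.T @ r)
Sigma_reduced = K_nA @ np.linalg.solve(A_A, K_nA.T)

assert np.allclose(mu_direct, mu_reduced)
assert np.allclose(Sigma_direct, Sigma_reduced)
```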

### Predictive Variance Problem

One of the disadvantages of the SR method is that it can give unreasonably small predictive variances when making predictions in a region far away from the chosen active set $\mathcal{A}\subset \mathcal{N}=\left\{1,2,...,n\right\}$. Consider making a prediction at a new point ${x}_{new}$ that is far away from the training set $X$. In other words, assume that $K\left({x}_{new}^{T},X|\theta \right)\approx 0$.

For exact GPR, the posterior distribution of ${f}_{new}$ given $y$, $X$ and ${x}_{new}$ would be Normal with mean $\mu =0$ and variance $\Sigma =k\left({x}_{new},{x}_{new}|\theta \right)$. This value is correct in the sense that, if ${x}_{new}$ is far from $X$, then the data $\left(X,y\right)$ does not supply any new information about ${f}_{new}$ and so the posterior distribution of ${f}_{new}$ given $y$, $X$, and ${x}_{new}$ should reduce to the prior distribution ${f}_{new}$ given ${x}_{new}$, which is a Normal distribution with mean $0$ and variance $k\left({x}_{new},{x}_{new}|\theta \right)$.

For the SR approximation, if ${x}_{new}$ is far away from $X$ (and hence also far away from ${X}_{\mathcal{A}}$), then ${\mu }_{SR}=0$ and ${\Sigma }_{SR}=0$. Thus in this extreme case, ${\mu }_{SR}$ agrees with $\mu$ from exact GPR, but ${\Sigma }_{SR}$ is unreasonably small compared to $\Sigma$ from exact GPR.
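
This collapse is easy to reproduce numerically. The following NumPy sketch (illustrative only; kernel, data, and the far-away test point are made up) shows ${\Sigma }_{SR}$ vanishing at a point far from the training set, where exact GPR would report the prior variance $k\left({x}_{new},{x}_{new}|\theta \right)$:

```python
# Demonstration of the SR predictive-variance collapse far from the data.
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 2))
x_far = np.full((1, 2), 100.0)            # far from every training point
A = list(range(6))
sigma2 = 0.1

def k(a, b):
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2)

K_AA = k(X[A], X[A])
K_XA = k(X, X[A])
A_A = K_AA + (K_XA.T @ K_XA) / sigma2
K_fA = k(x_far, X[A])                     # ~ 0, since x_far is far away

# Exact GPR posterior variance reduces to the prior variance k(x, x) = 1 here.
Sigma_exact = k(x_far, x_far)[0, 0]
# SR variance Sigma_SR = K(x^T, X_A) A_A^{-1} K(X_A, x^T) collapses to ~0.
Sigma_SR = (K_fA @ np.linalg.solve(A_A, K_fA.T))[0, 0]

assert np.isclose(Sigma_exact, 1.0)
assert Sigma_SR < 1e-10                   # unreasonably small vs. Sigma_exact
```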

The fully independent conditional approximation method can help avoid this problem.

## References

[1] Rasmussen, C. E. and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press. Cambridge, Massachusetts, 2006.

[2] Smola, A. J. and B. Schölkopf. "Sparse greedy matrix approximation for machine learning." In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.