Kernel Least-Squares
Motivations

Consider a linear auto-regressive model for time-series, where the predicted value at time $t+1$ is an affine function of the two previous values:

$$\hat{y}_{t+1} = w_1 + w_2 y_t + w_3 y_{t-1}.$$

This writes $\hat{y}_{t+1} = w^T x_t$, where $x_t := (1, y_t, y_{t-1})$ is the ‘‘feature vector’’ at time $t$. We can fit this model based on historical data via least-squares:

$$\min_w \; \sum_t \left( w^T x_t - y_{t+1} \right)^2 .$$

The associated prediction rule is

$$\hat{y}_{T+1} = w^T x_T = w_1 + w_2 y_T + w_3 y_{T-1}.$$

We can introduce a non-linear version, where the prediction is a quadratic function of the past values:

$$\hat{y}_{t+1} = w_1 + w_2 y_t + w_3 y_{t-1} + w_4 y_t^2 + w_5 y_t y_{t-1} + w_6 y_{t-1}^2 .$$

This writes $\hat{y}_{t+1} = w^T \phi(x_t)$, where $\phi(x_t)$ is the augmented feature vector

$$\phi(x_t) := \left( 1, \; y_t, \; y_{t-1}, \; y_t^2, \; y_t y_{t-1}, \; y_{t-1}^2 \right).$$

Everything is the same as before, with $x_t$ replaced by $\phi(x_t)$. It appears that the size of the least-squares problem grows quickly with the degree of the feature vectors. How do we do it in a computationally efficient manner?

The kernel trick

We exploit a simple fact: in the least-squares problem

$$\min_w \; \| X^T w - y \|_2^2 ,$$

where $X := [x_1, \ldots, x_m]$ is the $n \times m$ matrix of data points, there is always an optimal $w$ that lies in the span of the data points:

$$w = X v$$

for some vector $v \in \mathbb{R}^m$. Indeed, writing $w = Xv + r$ with $X^T r = 0$, the residual $X^T w - y = X^T X v - y$ does not depend on $r$, so we may set $r = 0$. Hence the least-squares problem depends only on the matrix of scalar products $K := X^T X$, with $K_{ij} = x_i^T x_j$:

$$\min_v \; \| K v - y \|_2^2 .$$

The prediction rule depends on the scalar products between a new point $x$ and the data points $x_1, \ldots, x_m$:

$$w^T x = v^T X^T x = \sum_{i=1}^m v_i \, x_i^T x .$$

Once $K$ is formed (which takes roughly $O(m^2 n)$ operations), the training problem involves only the $m$-dimensional variable $v$, irrespective of the dimension $n$ of the feature vectors.
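The reduction to the $m$-dimensional variable $v$ is straightforward to carry out numerically. Below is a minimal sketch in NumPy, not taken from the text: the function names and the toy auto-regressive data are illustrative. It forms $K = X^T X$, solves the reduced least-squares problem in $v$, and predicts using only scalar products with the training points.

```python
import numpy as np

def fit_kernel_ls(K, y):
    """Solve min_v ||K v - y||_2^2, where K = X^T X is the m x m kernel (Gram) matrix."""
    v, *_ = np.linalg.lstsq(K, y, rcond=None)
    return v

def predict(X_train, v, x_new):
    """Prediction w^T x = sum_i v_i <x_i, x>, without ever forming w explicitly."""
    return (X_train.T @ x_new) @ v

# Toy auto-regressive data: feature vectors x_t = (1, y_t, y_{t-1}) as columns of X.
rng = np.random.default_rng(0)
series = rng.standard_normal(50)
X = np.stack([np.ones(48), series[1:49], series[0:48]])  # shape (n, m) = (3, 48)
targets = series[2:50]                                   # the values y_{t+1} to fit

K = X.T @ X                                              # m x m matrix of scalar products
v = fit_kernel_ls(K, targets)
print(predict(X, v, np.array([1.0, series[-1], series[-2]])))
```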
Nonlinear case

In the nonlinear case, we simply replace the feature vectors $x_i$ with augmented feature vectors $\phi(x_i)$, with $\phi$ a non-linear mapping. This leads to the modified kernel matrix

$$K_{ij} = \phi(x_i)^T \phi(x_j), \quad 1 \le i, j \le m .$$

The kernel function associated with the mapping $\phi$ is

$$k(x, z) := \phi(x)^T \phi(z) .$$

It provides information about the metric in the feature space, e.g.:

$$\| \phi(x) - \phi(z) \|_2^2 = k(x, x) - 2 k(x, z) + k(z, z) .$$

The computational effort involved in solving the training problem and in making a prediction thus depends only on our ability to quickly evaluate such scalar products. We can't choose the function $k$ arbitrarily: it has to be a valid scalar product, that is, of the form $k(x, z) = \phi(x)^T \phi(z)$ for some mapping $\phi$.

Examples of kernels

A variety of kernels are available. Some are adapted to the structure of data, for example text or images. Here are a few popular choices.

Polynomial kernels

Regression with quadratic functions on $\mathbb{R}^2$ involves feature vectors of the form

$$\phi(x) = \left( 1, \; x_1, \; x_2, \; x_1^2, \; x_1 x_2, \; x_2^2 \right).$$

In fact, given two vectors $x, z \in \mathbb{R}^2$, and for an appropriately scaled $\phi$, the scalar product of the augmented feature vectors can be computed directly from the original scalar product:

$$\phi(x)^T \phi(z) = \left( 1 + x^T z \right)^2 .$$

More generally, when $\phi(x)$ collects all the monomials of degree up to $d$ in $x \in \mathbb{R}^n$,

$$\phi(x)^T \phi(z) = \left( 1 + x^T z \right)^d .$$

The computational effort grows linearly in $n$. This represents a dramatic reduction in speed over the ‘‘brute force’’ approach of first forming the vectors $\phi(x)$ and $\phi(z)$, and then evaluating their scalar product, for which the computational effort grows as $n^d$.
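As a quick sanity check of the identity above, the sketch below compares the explicit quadratic feature map on $\mathbb{R}^2$ (with $\sqrt{2}$ weights, one choice of the ‘‘appropriately scaled’’ $\phi$) against the kernel formula $(1 + x^T z)^2$. Both computations return the same number, but only the kernel formula stays cheap as the dimension $n$ and the degree $d$ grow; the vectors and function names here are illustrative.

```python
import numpy as np

def phi_quadratic(x):
    """Weighted degree-2 feature map on R^2: (1, sqrt(2) x1, sqrt(2) x2, x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, s * x1 * x2, x2**2])

def poly_kernel(x, z, d=2):
    """Polynomial kernel k(x, z) = (1 + x^T z)^d -- cost linear in the dimension of x."""
    return (1.0 + x @ z) ** d

x = np.array([0.5, -1.2])
z = np.array([2.0, 0.3])
print(phi_quadratic(x) @ phi_quadratic(z))  # scalar product in feature space
print(poly_kernel(x, z, d=2))               # same value, without ever forming phi
```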
Gaussian kernels

The Gaussian kernel function is

$$k(x, z) = \exp\left( - \frac{\| x - z \|_2^2}{2 \sigma^2} \right),$$

where $\sigma > 0$ is a scale parameter. Points that are far apart relative to $\sigma$ contribute a kernel value close to zero, so they are effectively ignored; the corresponding mapping $\phi$ sends the data into an infinite-dimensional feature space.
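The following sketch illustrates kernel least-squares with a Gaussian kernel on a toy one-dimensional regression problem; the helper `gaussian_kernel_matrix`, the data, and the value of $\sigma$ are assumptions made for the example.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all columns a of A and b of B."""
    sq = (A**2).sum(axis=0)[:, None] + (B**2).sum(axis=0)[None, :] - 2.0 * A.T @ B
    return np.exp(-sq / (2.0 * sigma**2))

# Toy nonlinear regression: learn y = sin(x) from noisy samples.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(1, 40))              # 40 one-dimensional points, as columns
y = np.sin(X[0]) + 0.05 * rng.standard_normal(40)

K = gaussian_kernel_matrix(X, X, sigma=0.7)       # m x m kernel matrix
v, *_ = np.linalg.lstsq(K, y, rcond=None)         # solve min_v ||K v - y||_2^2

X_test = np.linspace(-3, 3, 5).reshape(1, -1)
y_hat = gaussian_kernel_matrix(X_test, X, sigma=0.7) @ v   # predictions sum_i v_i k(x, x_i)
print(y_hat)
```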
Other kernels

There is a large variety (a zoo?) of other kernels, some adapted to the structure of the data (text, images, etc.).

Kernels in practice

The kernel, and its parameters (such as the scale $\sigma$ of the Gaussian kernel, or the degree $d$ of a polynomial kernel), have to be chosen by the user; cross-validation on held-out data is the usual way to make this choice and to control over-fitting.
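As one possible illustration of this practice, the sketch below selects the Gaussian scale parameter by simple k-fold cross-validation on a toy problem; the fold scheme, the candidate values of $\sigma$, and the helper functions are choices made for the example, not a prescribed recipe.

```python
import numpy as np

def gaussian_K(A, B, sigma):
    """Gaussian kernel matrix between the columns of A and the columns of B."""
    sq = (A**2).sum(axis=0)[:, None] + (B**2).sum(axis=0)[None, :] - 2.0 * A.T @ B
    return np.exp(-sq / (2.0 * sigma**2))

def cv_error(X, y, sigma, folds=5):
    """Mean held-out squared error of Gaussian-kernel least-squares for a given sigma."""
    idx = np.arange(y.size)
    errors = []
    for f in range(folds):
        test = idx[f::folds]                     # every `folds`-th point is held out
        train = np.setdiff1d(idx, test)
        K = gaussian_K(X[:, train], X[:, train], sigma)
        v, *_ = np.linalg.lstsq(K, y[train], rcond=None)
        y_hat = gaussian_K(X[:, test], X[:, train], sigma) @ v
        errors.append(np.mean((y_hat - y[test]) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(1, 60))
y = np.sin(X[0]) + 0.1 * rng.standard_normal(60)

sigmas = [0.1, 0.3, 1.0, 3.0]
best = min(sigmas, key=lambda s: cv_error(X, y, s))
print("selected sigma:", best)
```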