Main Content

Neighborhood component analysis (NCA) is a non-parametric method for selecting
features with the goal of maximizing prediction accuracy of regression and
classification algorithms. The Statistics and Machine Learning Toolbox™ functions `fscnca`

and `fsrnca`

perform NCA feature selection with regularization to learn feature weights for
minimization of an objective function that measures the average leave-one-out
classification or regression loss over the training data.

Consider a multi-class classification problem with a training set containing
*n* observations:

$$\begin{array}{l}S=\left\{\left({x}_{i},{y}_{i}\right),i=1,2,\dots ,n\right\}\hfill \end{array},$$

where $${x}_{i}\in {\mathbb{R}}^{p}$$ are the feature vectors, $${y}_{i}\in \{1,2,\dots ,c\}$$ are the class labels, and *c* is the number of
classes. The aim is to learn a classifier $$f:{\mathbb{R}}^{p}\to \{1,2,\dots ,c\}$$ that accepts a feature vector and makes a prediction $$f\left(x\right)$$ for the true label $$y$$ of $$x$$.

Consider a randomized classifier that:

Randomly picks a point, $$\text{Ref}\left(x\right)$$, from $$S$$ as the ‘reference point’ for $$x$$

Labels $$x$$ using the label of the reference point $$\text{Ref}\left(x\right)$$.

This scheme is similar to that of a 1-NN classifier where the reference point is chosen to be the nearest neighbor of the new point $$x$$. In NCA, the reference point is chosen randomly and all points in $$S$$ have some probability of being selected as the reference point. The probability $$P\left(\text{Ref}\left(x\right)=\text{}{x}_{j}|S\right)$$ that point $${x}_{j}$$ is picked from $$S$$ as the reference point for $$x$$ is higher if $${x}_{j}$$ is closer to $$x$$ as measured by the distance function $${d}_{w}$$, where

$${d}_{w}({x}_{i},{x}_{j})={\displaystyle \sum _{r=1}^{p}{w}_{r}^{2}}\left|{x}_{ir}-{x}_{jr}\right|,$$

and $${w}_{r}$$ are the feature weights. Assume that

$$\begin{array}{l}P\left(\text{Ref}\left(x\right)=\text{}{x}_{j}|S\right)\propto k\left({d}_{w}\left(x,{x}_{j}\right)\right)\hfill \end{array},$$

where $$k$$ is some kernel or a similarity function that assumes large values when $${d}_{w}\left(x,{x}_{j}\right)$$ is small. Suppose it is

$$k\left(z\right)=\mathrm{exp}\left(-\frac{z}{\sigma}\right),$$

as suggested in [1]. The reference point for $$x$$ is chosen from $$S$$, so sum of $$P\left(\text{Ref}\left(x\right)=\text{}{x}_{j}|S\right)$$ for all *j* must be equal to 1. Therefore, it is
possible to write

$$\begin{array}{l}P\left(\text{Ref}\left(x\right)=\text{}{x}_{j}|S\right)=\frac{k\left({d}_{w}\left(x,{x}_{j}\right)\right)}{{\displaystyle \sum _{j=1}^{n}k}\left({d}_{w}\left(x,{x}_{j}\right)\right)}\hfill \end{array}.$$

Now consider the leave-one-out application of this randomized classifier, that is, predicting the label of $${x}_{i}$$ using the data in $${S}^{-i}$$, the training set $$S$$ excluding the point $$\left({x}_{i},{y}_{i}\right)$$. The probability that point $${x}_{j}$$ is picked as the reference point for $${x}_{i}$$ is

$${p}_{ij}=P\left(\text{Ref}\left({x}_{i}\right)=\text{}{x}_{j}|{S}^{-i}\right)=\frac{k\left({d}_{w}\left({x}_{i},{x}_{j}\right)\right)}{{\displaystyle \sum _{j=1,j\ne i}^{n}k}\left({d}_{w}\left({x}_{i},{x}_{j}\right)\right)}.$$

The average leave-one-out probability of correct classification is the probability $${p}_{i}$$ that the randomized classifier correctly classifies observation
*i* using $${S}^{-i}$$.

$$\begin{array}{l}{p}_{i}={\displaystyle \sum _{j=1,j\ne i}^{n}P\left(\text{Ref}\left({x}_{i}\right)={x}_{j}|{S}^{-i}\right)I\left({y}_{i}={y}_{j}\right)}\hfill \end{array}={\displaystyle \sum _{j=1,j\ne i}^{n}{p}_{ij}}{y}_{ij},$$

where

$${y}_{ij}=I\left({y}_{i}={y}_{j}\right)=\{\begin{array}{cc}1& \text{if}\text{\hspace{0.17em}}{y}_{i}={y}_{j,}\\ 0& \text{otherwise}\text{.}\end{array}$$

The average leave-one-out probability of correct classification using the randomized classifier can be written as

$$F\left(w\right)=\frac{1}{n}{\displaystyle \sum _{i=1}^{n}{p}_{i}}.$$

The right hand side of $$F\left(w\right)$$ depends on the weight vector $$w$$. The goal of neighborhood component analysis is to maximize $$F\left(w\right)$$ with respect to $$w$$. `fscnca`

uses the regularized objective
function as introduced in [1].

$$\begin{array}{ll}F\left(w\right)\hfill & =\frac{1}{n}{\displaystyle \sum _{i=1}^{n}{p}_{i}-\lambda {\displaystyle \sum _{r=1}^{p}{w}_{r}^{2}}}\hfill \\ \hfill & =\frac{1}{n}{\displaystyle \sum _{i=1}^{n}\underset{{F}_{i}\left(w\right)}{\underbrace{\left[{\displaystyle \sum _{j=1,j\ne i}^{n}{p}_{ij}{y}_{ij}-\lambda {\displaystyle \sum _{r=1}^{p}{w}_{r}^{2}}}\right]}}}\hfill \\ \hfill & =\frac{1}{n}{\displaystyle \sum _{i=1}^{n}{F}_{i}\left(w\right)}\hfill \end{array},$$

where $$\lambda $$ is the regularization parameter. The regularization term drives many of the weights in $$w$$ to 0.

After choosing the kernel parameter $$\sigma $$ in $${p}_{ij}$$ as 1, finding the weight vector $$w$$ can be expressed as the following minimization problem for given $$\lambda $$.

$$\widehat{w}=\underset{w}{\text{argmin}}f\left(w\right)=\underset{w}{\text{argmin}}\frac{1}{n}{\displaystyle \sum _{i=1}^{n}{f}_{i}\left(w\right)},$$

where *f*(*w*) =
-*F*(*w*) and *f*_{i}(*w*)
=
-*F*_{i}(*w*).

Note that

$$\frac{1}{n}{\displaystyle \sum _{i=1}^{n}{\displaystyle \sum _{j=1,j\ne i}^{n}{p}_{ij}}}=1,$$

and the argument of the minimum does not change if you add a constant to an objective function. Therefore, you can rewrite the objective function by adding the constant 1.

$$\begin{array}{c}\widehat{w}=\underset{w}{\text{argmin}}\left\{1+f(w)\right\}\\ =\underset{w}{\text{argmin}}\left\{\frac{1}{n}{\displaystyle \sum _{i=1}^{n}{\displaystyle \sum _{j=1,j\ne i}^{n}{p}_{ij}}}-\frac{1}{n}{\displaystyle \sum _{i=1}^{n}{\displaystyle \sum _{j=1,j\ne i}^{n}{p}_{ij}{y}_{ij}}+\lambda {\displaystyle \sum _{r=1}^{p}{w}_{r}^{2}}}\right\}\\ =\underset{w}{\text{argmin}}\left\{\frac{1}{n}{\displaystyle \sum _{i=1}^{n}{\displaystyle \sum _{j=1,j\ne i}^{n}{p}_{ij}\left(1-{y}_{ij}\right)}+\lambda {\displaystyle \sum _{r=1}^{p}{w}_{r}^{2}}}\right\}\\ =\underset{w}{\text{argmin}}\left\{\frac{1}{n}{\displaystyle \sum _{i=1}^{n}{\displaystyle \sum _{j=1,j\ne i}^{n}{p}_{ij}l({y}_{i},{y}_{j})}+\lambda {\displaystyle \sum _{r=1}^{p}{w}_{r}^{2}}}\right\},\end{array}$$

where the loss function is defined as

$$l\left({y}_{i},{y}_{j}\right)=\{\begin{array}{cc}1& \text{if}\text{\hspace{0.17em}}{y}_{i}\ne {y}_{j,}\\ 0& \text{otherwise}\text{.}\end{array}$$

The argument of the minimum is the weight vector that minimizes
the classification error. You can specify a custom loss function using the `LossFunction`

name-value
pair argument in the call to `fscnca`

.

The `fsrnca`

function performs NCA feature
selection modified for regression. Given *n* observations

$$\begin{array}{l}S=\left\{\left({x}_{i},{y}_{i}\right),i=1,2,\dots ,n\right\}\hfill \end{array},$$

the only difference from the classification problem is that the response values $${y}_{i}\in \mathbb{R}$$ are continuous. In this case, the aim is to predict the response $$y$$ given the training set $$S$$.

Consider a randomized regression model that:

Randomly picks a point ($$\text{Ref}\left(x\right)$$) from $$S$$as the ‘reference point’ for $$x$$

Sets the response value at $$x$$ equal to the response value of the reference point $$\text{Ref}\left(x\right)$$.

Again, the probability $$P\left(\text{Ref}\left(x\right)=\text{}{x}_{j}|S\right)$$ that point $${x}_{j}$$ is picked from $$S$$ as the reference point for $$x$$ is

$$\begin{array}{l}P\left(\text{Ref}\left(x\right)=\text{}{x}_{j}|S\right)=\frac{k\left({d}_{w}\left(x,{x}_{j}\right)\right)}{{\displaystyle \sum _{j=1}^{n}k}\left({d}_{w}\left(x,{x}_{j}\right)\right)}\hfill \end{array}.$$

Now consider the leave-one-out application of this randomized regression model, that is, predicting the response for $${x}_{i}$$ using the data in $${S}^{-i}$$, the training set $$S$$ excluding the point $$\left({x}_{i},{y}_{i}\right)$$. The probability that point $${x}_{j}$$ is picked as the reference point for $${x}_{i}$$ is

$${p}_{ij}=P\left(\text{Ref}\left({x}_{i}\right)=\text{}{x}_{j}|{S}^{-i}\right)=\frac{k\left({d}_{w}\left({x}_{i},{x}_{j}\right)\right)}{{\displaystyle \sum _{j=1,j\ne i}^{n}k}\left({d}_{w}\left({x}_{i},{x}_{j}\right)\right)}.$$

Let $${\widehat{y}}_{i}$$ be the response value the randomized regression model predicts and $${y}_{i}$$ be the actual response for $${x}_{i}$$. And let $$l:{\mathbb{R}}^{2}\to \mathbb{R}$$ be a loss function that measures the disagreement between $${\widehat{y}}_{i}$$ and $${y}_{i}$$. Then, the average value of $$l\left({y}_{i},{\widehat{y}}_{i}\right)$$ is

$${l}_{i}=E\left(l\left({y}_{i},{\widehat{y}}_{i}\right)|{S}^{-i}\right)={\displaystyle \sum _{j=1,j\ne i}^{n}{p}_{ij}l\left({y}_{i},{y}_{j}\right).}$$

After adding the regularization term, the objective function for minimization is:

$$f\left(w\right)=\frac{1}{n}{\displaystyle \sum _{i=1}^{n}{l}_{i}+\lambda {\displaystyle \sum _{r=1}^{p}{w}_{r}^{2}}}.$$

The default loss function $$l\left({y}_{i},{y}_{j}\right)$$ for NCA for regression is mean absolute deviation, but you can
specify other loss functions, including a custom one, using the `LossFunction`

name-value
pair argument in the call to `fsrnca`

.

The regularization term derives the weights of irrelevant predictors to zero. In
the objective functions for NCA for classification or regression, there is only one
regularization parameter $$\lambda $$ for all weights. This fact requires the magnitudes of the weights
to be comparable to each other. When the feature vectors $${x}_{i}$$ in $$S$$ are in different scales, this might result in weights that are in
different scales and not meaningful. To avoid this situation, standardize the
predictors to have zero mean and unit standard deviation before applying NCA. You
can standardize the predictors using the `'Standardize',true`

name-value pair argument in the call to `fscnca`

or
`fsrnca`

.

It is usually necessary to select a value of the regularization parameter by calculating the accuracy of the randomized NCA classifier or regression model on an independent test set. If you use cross-validation instead of a single test set, select the $$\lambda $$ value that minimizes the average loss across the cross-validation folds. For examples, see Tune Regularization Parameter to Detect Features Using NCA for Classification and Tune Regularization Parameter in NCA for Regression.

[1] Yang, W., K. Wang, W. Zuo. "Neighborhood Component Feature Selection for High-Dimensional Data." Journal of Computers. Vol. 7, Number 1, January, 2012.

`FeatureSelectionNCAClassification`

| `FeatureSelectionNCARegression`

| `fscnca`

| `fsrnca`