Main Content

t-SNE (`tsne`

) is an algorithm
for dimensionality reduction that is well-suited to visualizing high-dimensional
data. The name stands for *t*-distributed Stochastic
Neighbor Embedding. The idea is to embed high-dimensional points in
low dimensions in a way that respects similarities between points.
Nearby points in the high-dimensional space correspond to nearby embedded
low-dimensional points, and distant points in high-dimensional space
correspond to distant embedded low-dimensional points. (Generally,
it is impossible to match distances exactly between high-dimensional
and low-dimensional spaces.)

The `tsne`

function creates
a set of low-dimensional points from high-dimensional data. Typically,
you visualize the low-dimensional points to see natural clusters in
the original high-dimensional data.

The algorithm takes the following general steps to embed the data in low dimensions.

Calculate the pairwise distances between the high-dimensional points.

Create a standard deviation

*σ*for each high-dimensional point_{i}*i*so that the*perplexity*of each point is at a predetermined level. For the definition of perplexity, see Compute Distances, Gaussian Variances, and Similarities.Calculate the

*similarity matrix*. This is the joint probability distribution of X, defined by Equation 1.Create an initial set of low-dimensional points.

Iteratively update the low-dimensional points to minimize the Kullback-Leibler divergence between a Gaussian distribution in the high-dimensional space and a

*t*distribution in the low-dimensional space. This optimization procedure is the most time-consuming part of the algorithm.

See van der Maaten and Hinton [1].

The basic t-SNE algorithm performs the following steps.

`tsne`

first removes each row of the input
data X that contains any `NaN`

values. Then, if the `Standardize`

name-value
pair is `true`

, `tsne`

centers
X by subtracting the mean of each column, and scales X by dividing
its columns by their standard deviations.

The original authors van der Maaten and Hinton [1] recommend reducing the original data X to
a lower-dimensional version using Principal Component Analysis (PCA). You can set the `tsne`

`NumPCAComponents`

name-value
pair to the number of dimensions you like, perhaps 50. To exercise
more control over this step, preprocess the data using the `pca`

function.

After the preprocessing, `tsne`

calculates
the distance *d*(*x _{i}*,

`Distance`

name-value pair.
By default, `tsne`

uses the standard Euclidean
metric. `tsne`

uses the square of the distance
metric in its subsequent calculations.Then for each row *i* of X, `tsne`

calculates
a standard deviation *σ _{i}* so
that the

`Perplexity`

name-value
pair. The perplexity is defined in terms of a model Gaussian distribution
as follows. As van der Maaten and Hinton [1] describe, “The similarity of data point Define the conditional probability of *j* given *i* as

$$\begin{array}{l}{p}_{j|i}=\frac{\mathrm{exp}\left(-d{\left({x}_{i},{x}_{j}\right)}^{2}/\left(2{\sigma}_{i}^{2}\right)\right)}{{\displaystyle \sum _{k\ne i}\mathrm{exp}\left(-d{\left({x}_{i},{x}_{k}\right)}^{2}/\left(2{\sigma}_{i}^{2}\right)\right)}}\\ {p}_{i|i}=0.\end{array}$$

Then define the joint probability *p _{ij}* by
symmetrizing the conditional probabilities:

$${p}_{ij}=\frac{{p}_{j|i}+{p}_{i|j}}{2N},$$ | (1) |

where *N* is the number of rows of X.

The distributions still do not have their standard deviations *σ _{i}* defined
in terms of the

`Perplexity`

name-value
pair. Let $$\text{perplexity}\left({P}_{i}\right)={2}^{H\left({P}_{i}\right)},$$

where *H*(*P _{i}*)
is the Shannon entropy of

$$H\left({P}_{i}\right)=-{\displaystyle \sum _{j}{p}_{j|i}{\mathrm{log}}_{2}\left({p}_{j|i}\right)}.$$

The perplexity measures the effective number of neighbors of
point *i*. `tsne`

performs a binary
search over the *σ _{i}* to
achieve a fixed perplexity for each point

To embed the points in X into a low-dimensional space, `tsne`

performs
an optimization. `tsne`

attempts to minimize the
Kullback-Leibler divergence between the model Gaussian distribution
of the points in X and a Student *t* distribution
of points Y in the low-dimensional space.

The minimization procedure begins with an initial set of points
Y. `tsne`

create the points by default as random
Gaussian-distributed points. You can also create these points yourself
and include them in the `'InitialY'`

name-value pair
for `tsne`

. `tsne`

then calculates
the similarities between each pair of points in Y.

The probability model *q _{ij}* of
the distribution of the distances between points

$$\begin{array}{l}{q}_{ij}=\frac{{\left(1+{\Vert {y}_{i}-{y}_{j}\Vert}^{2}\right)}^{-1}}{{\displaystyle \sum _{k}{\displaystyle \sum _{l\ne k}{\left(1+{\Vert {y}_{k}-{y}_{l}\Vert}^{2}\right)}^{-1}}}}\\ {q}_{ii}=0.\end{array}$$

Using this definition and the model of distances in X given
by Equation 1, the Kullback-Liebler
divergence between the joint distribution *P* and *Q* is

$$KL(P||Q)={\displaystyle \sum _{j}{\displaystyle \sum _{i\ne j}{p}_{ij}\mathrm{log}\frac{{p}_{ij}}{{q}_{ij}}}.}$$

For consequences of this definition, see Helpful Nonlinear Distortion.

To minimize the Kullback-Leibler divergence, the `'exact'`

algorithm
uses a modified gradient descent procedure. The gradient with respect
to the points in Y of the divergence is

$$\frac{\partial KL(P||Q)}{\partial {y}_{i}}=4{\displaystyle \sum _{j\ne i}Z\left({p}_{ij}-{q}_{ij}\right){q}_{ij}\left({y}_{i}-{y}_{j}\right)},$$

where the normalization term

$$Z={\displaystyle \sum _{k}{\displaystyle \sum _{l\ne k}{\left(1+{\Vert {y}_{k}-{y}_{l}\Vert}^{2}\right)}^{-1}}}.$$

The modified gradient descent algorithm uses a few tuning parameters to attempt to reach a good local minimum.

`'Exaggeration'`

— During the first 99 gradient descent steps,`tsne`

multiplies the probabilities*p*from Equation 1 by the exaggeration value. This step tends to create more space between clusters in the output Y._{ij}`'LearnRate'`

—`tsne`

uses adaptive learning to improve the convergence of the gradient descent iterations. The descent algorithm has iterative steps that are a linear combination of the previous step in the descent and the current gradient.`'LearnRate'`

is a multiplier of the current gradient for the linear combination. For details, see Jacobs [3].

To speed the t-SNE algorithm and to cut down on its memory usage, `tsne`

offers
an approximate optimization scheme. The Barnes-Hut algorithm groups
nearby points together to lower the complexity and memory usage of
the t-SNE optimization step. The Barnes-Hut algorithm is an approximate
optimizer, not an exact optimizer. There is a nonnegative tuning parameter `Theta`

that effects a tradeoff
between speed and accuracy. Larger values of `'Theta'`

give
faster but less accurate optimization results. The algorithm is relatively
insensitive to `'Theta'`

values in the range (0.2,0.8).

The Barnes-Hut algorithm groups nearby points in the low-dimensional space, and performs an approximate gradient descent based on these groups. The idea, originally used in astrophysics, is that the gradient is similar for nearby points, so the computations can be simplified.

See van der Maaten [2].

Because t-SNE often separates data clusters well, it can seem that t-SNE can classify new data points. However, t-SNE cannot classify new points. The t-SNE embedding is a nonlinear map that is data-dependent. To embed a new point in the low-dimensional space, you cannot use the previous embedding as a map. Instead, run the entire algorithm again.

t-SNE can take a good deal of time to process data. If you have *N* data
points in *D* dimensions that you want to map to *Y* dimensions,
then

Exact t-SNE takes of order

*D***N*^{2}operations.Barnes-Hut t-SNE takes of order

*D***N*log(*N*)*exp(dimension(*Y*)) operations.

So for large data sets, where *N* is greater
than 1000 or so, and where the embedding dimension *Y* is
2 or 3, the Barnes-Hut algorithm can be faster than the exact algorithm.

T-SNE maps high-dimensional distances to distorted low-dimensional
analogues. Because of the fatter tail of the Student *t* distribution
in the low-dimensional space, `tsne`

often moves
close points closer together, and moves far points farther apart than
in the high-dimensional space, as illustrated in the following figure.
The figure shows both Gaussian and Student *t* distributions
at the points where the densities are at 0.25 and 0.025. The Gaussian
density relates to high-dimensional distances, and the *t* density
relates to low-dimensional distances. The *t* density
corresponds to close points being closer, and far points being farther,
compared to the Gaussian density.

t = linspace(0,5); y1 = normpdf(t,0,1); y2 = tpdf(t,1); plot(t,y1,'k',t,y2,'r') hold on x1 = fzero(@(x)normpdf(x,0,1)-0.25,[0,2]); x2 = fzero(@(x)tpdf(x,1)-0.25,[0,2]); z1 = fzero(@(x)normpdf(x,0,1)-0.025,[0,5]); z2 = fzero(@(x)tpdf(x,1)-0.025,[0,5]); plot([0,x1],[0.25,0.25],'k-.') plot([0,z2],[0.025,0.025],'k-.') plot([x1,x1],[0,0.25],'g-',[x2,x2],[0,0.25],'g-') plot([z1,z1],[0,0.025],'g-',[z2,z2],[0,0.025],'g-') text(1.1,.25,'Close points are closer in low-D') text(2.4,.05,'Far points are farther in low-D') legend('Gaussian(0,1)','Student t (df = 1)') xlabel('x') ylabel('Density') title('Density of Gaussian(0,1) and Student t (df = 1)') hold off

This distortion is helpful when it applies. It does not apply
in cases such as when the Gaussian variance is high, which lowers
the Gaussian peak and flattens the distribution. In such a case, `tsne`

can
move close points farther apart than in the original space. To achieve
a helpful distortion,

Set the

`'Verbose'`

name-value pair to`2`

.Adjust the

`'Perplexity'`

name-value pair so the reported range of variances is not too far from`1`

, and the mean variance is near`1`

.

If you can achieve this range of variances, then the diagram
applies, and the `tsne`

distortion is helpful.

For effective ways to tune `tsne`

, see Wattenberg, Viégas and Johnson
[4].

[1] van der Maaten, Laurens, and Geoffrey Hinton. *Visualizing
Data using t-SNE*. J. Machine Learning Research 9, 2008,
pp. 2579–2605.

[2] van der Maaten, Laurens. *Barnes-Hut-SNE*. arXiv:1301.3342 [cs.LG],
2013.

[3] Jacobs, Robert A. *Increased rates
of convergence through learning rate adaptation*. Neural
Networks 1.4, 1988, pp. 295–307.

[4] Wattenberg, Martin, Fernanda Viégas, and Ian Johnson.
*How to Use t-SNE Effectively.* Distill, 2016. Available at
```
How to Use
t-SNE Effectively
```

.