Visualization of High-Dimensional Data

  • Visualization problem

  • Projecting data on a line

  • Projecting data on a plane

Visualization Problem

Consider a data set of $n$ points $x_j$, $j=1,\ldots,n$, in $\mathbf{R}^m$. Each point can represent, for example: a chemical experiment under $m$ specific conditions; the response of a specific gene to $m$ different drugs; the votes of a particular citizen on an array of $m$ issues; $m$ atmospheric readings (temperature, pressure, humidity, etc.) at a specific location; or $m$ past prices of a single asset.

We can represent this data set as an $m \times n$ matrix $X = [x_1, \ldots, x_n]$, where each $x_j$ is an $m$-vector. Simply plotting the raw matrix is often not very informative.

Example: Raw data matrix for the US Senate, 2004-2006.

We can try to visualize the data set by projecting each data point (each row or column of the matrix) on, say, a one-, two- or three-dimensional space. Each “view” corresponds to a particular projection, that is, a particular one-, two- or three-dimensional subspace on which we choose to project the data. The visualization problem consists of choosing an appropriate projection.

There are many ways to formulate the visualization problem, and none dominates the others. Here, we focus on the basics of that problem.

Projecting on a line

To simplify, let us first consider the problem of representing the high-dimensional data set on a single line, using the scoring method described below.

Specifically, we would like to assign a single number, or “score”, to each column of the matrix. We choose a direction $u \in \mathbf{R}^m$ and a scalar $v \in \mathbf{R}$. This corresponds to the affine “scoring” function $f : \mathbf{R}^m \rightarrow \mathbf{R}$, which, to a generic column $x \in \mathbf{R}^m$ of the data matrix, assigns the value

 $$ f(x) = u^T x + v. $$

We thus obtain a vector of values $f \in \mathbf{R}^n$, with $f_j = u^T x_j + v$, $j=1,\ldots,n$. It is often useful to center these values around zero. This can be done by choosing $v$ such that

 $$ 0 = \sum_{j=1}^n (u^T x_j + v) = u^T \left( \sum_{j=1}^n x_j \right) + n \cdot v, $$

that is, $v = -u^T \hat{x}$, where

 $$ \hat{x} := \frac{1}{n} \sum_{j=1}^n x_j \in \mathbf{R}^m $$

is the vector of sample averages across the columns of the matrix (that is, data points). The vector $\hat{x}$ can be interpreted as the “average response” across experiments.

The values of our scoring function can now be expressed as

 $$ f(x) = u^T (x - \hat{x}). $$

In order to be able to compare the relative merits of different directions, we can assume, without loss of generality, that the vector $u$ is normalized (so that $\|u\|_2 = 1$).
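The two forms of the scoring function can be compared numerically. The following is a minimal sketch, assuming a small made-up data matrix X and an arbitrary direction u (both hypothetical, chosen only for illustration). It normalizes u, sets v = -u'*xhat, and evaluates the scores both as u'*x_j + v and as u'*(x_j - xhat).

```matlab
% Hypothetical data: m = 3 features, n = 5 points (columns of X).
X = [1 2 3 4 5;
     2 0 1 3 4;
     0 1 1 2 6];
[m, n] = size(X);

u = [1; 2; 2];          % arbitrary, illustrative direction
u = u / norm(u);        % normalize so that the Euclidean norm of u is 1

xhat = mean(X, 2);      % m-by-1 vector of sample averages across columns
v = -u' * xhat;         % offset that centers the scores around zero

f_affine   = u' * X + v;                     % f_j = u'*x_j + v
f_centered = u' * (X - xhat * ones(1, n));   % f_j = u'*(x_j - xhat)

gap = max(abs(f_affine - f_centered));       % should be numerically zero
```

Up to rounding error, gap is zero, which reflects the algebraic identity between the two expressions.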

It is convenient to work with the “centered” data matrix, which is

 $$ X_{\rm cent} = \left( \begin{array}{ccc} x_1 - \hat{x} & \ldots & x_n - \hat{x} \end{array} \right) = X - \hat{x} \mathbf{1}_n^T, $$

where $\mathbf{1}_n$ is the vector of ones in $\mathbf{R}^n$.

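In Matlab, we can compute the centered data matrix as follows.

Matlab syntax
>> xhat = mean(X,2);          % m-by-1 vector of averages across the n columns
>> [m,n] = size(X);           % m rows (features), n columns (data points)
>> Xcent = X-xhat*ones(1,n);  % subtract xhat from every column of X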
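The product with ones(1,n) is one of several ways to subtract xhat from every column. As a rough sketch on a made-up matrix (the values are hypothetical), the snippet below compares it with bsxfun and with implicit expansion, the latter being available in Matlab R2016b and later.

```matlab
% Hypothetical 2-by-3 data matrix (columns are data points).
X = [1 2 6;
     4 0 2];
[m, n] = size(X);
xhat = mean(X, 2);                      % column vector of row means, here [3; 2]

Xcent_ones = X - xhat * ones(1, n);     % outer-product form, as in the text
Xcent_bsx  = bsxfun(@minus, X, xhat);   % same subtraction via bsxfun
Xcent_impl = X - xhat;                  % implicit expansion (R2016b or later)
```

All three variables hold the same centered matrix; which form to prefer is mostly a matter of Matlab version and taste.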

We can compute the (row) vector of scores using the simple matrix-vector product:

 $$ f = u^T X_{\rm cent} \in \mathbf{R}^{1 \times n}. $$

We can check that the average of the above row vector is zero:

 $$ f \mathbf{1}_n = u^T X_{\rm cent} \mathbf{1}_n = u^T (X - \hat{x} \mathbf{1}_n^T) \mathbf{1}_n = u^T (X \mathbf{1}_n - n \cdot \hat{x}) = 0. $$
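The zero-average property can also be checked numerically. The sketch below assumes a small hypothetical data matrix and the illustrative direction u = ones(m,1)/sqrt(m). It forms the row vector of scores, sums it, and ranks the data points by score.

```matlab
% Hypothetical data: m = 3 features, n = 4 points (columns of X).
X = [4 0 2 6;
     1 3 5 3;
     2 2 8 0];
[m, n] = size(X);

xhat  = mean(X, 2);                 % average data point
Xcent = X - xhat * ones(1, n);      % centered data matrix

u = ones(m, 1) / sqrt(m);           % illustrative unit-norm direction
f = u' * Xcent;                     % 1-by-n row vector of scores

avg_check = f * ones(n, 1);         % sum of scores; zero up to rounding
[~, order] = sort(f);               % indices of points from lowest to highest score
```

Sorting the scores gives a one-dimensional ordering of the data points along the chosen direction.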

Example: Senator scores on average bill.

Projection on a plane

We can also try to project the data on a plane, which involves assigning two scores to each data point.

This corresponds to the affine “scoring” map $f : \mathbf{R}^m \rightarrow \mathbf{R}^2$, which, to a generic column $x \in \mathbf{R}^m$ of the data matrix, assigns the two-dimensional value

 $$ f(x) = \left( \begin{array}{c} u_1^T x + v_1 \\ u_2^T x + v_2 \end{array} \right) = U^T x + v, $$

where $u_1, u_2 \in \mathbf{R}^m$ are two vectors and $v_1, v_2$ are two scalars, while $U = [u_1, u_2] \in \mathbf{R}^{m \times 2}$ and $v = (v_1, v_2) \in \mathbf{R}^2$.

The affine map $f$ allows us to generate $n$ two-dimensional data points (instead of $m$-dimensional ones), $f_j = U^T x_j + v$, $j=1,\ldots,n$. As before, we can require that the $f_j$'s be centered:

 $$ 0 = \sum_{j=1}^n f_j = \sum_{j=1}^n (U^T x_j + v), $$

by choosing the vector $v$ to be such that $v = -U^T \hat{x}$, where $\hat{x} \in \mathbf{R}^m$ is the “average response” defined above. Our (centered) scoring map takes the form

 $$ f(x) = U^T (x - \hat{x}). $$

We can encapsulate the scores in the $2 \times n$ matrix $F = [f_1, \ldots, f_n]$. The latter can be expressed as the matrix-matrix product

 $$ F = U^T X_{\rm cent} = \left( \begin{array}{c} u_1^T X_{\rm cent} \\ u_2^T X_{\rm cent} \end{array} \right), $$

with $X_{\rm cent}$ the centered data matrix defined above.
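The planar case can be sketched in the same way. The snippet below assumes a hypothetical data matrix and two illustrative directions u1 and u2. They are taken to be orthonormal here, which is a choice made for this sketch and not a requirement stated in the text. It forms F = U'*Xcent and checks that its rows match the two one-dimensional score vectors and that the scores are centered.

```matlab
% Hypothetical data: m = 3 features, n = 4 points (columns of X).
X = [4 0 2 6;
     1 3 5 3;
     2 2 8 0];
[m, n] = size(X);

xhat  = mean(X, 2);                 % average data point
Xcent = X - xhat * ones(1, n);      % centered data matrix

% Two illustrative directions (orthonormal by assumption of this sketch).
u1 = [1; 0; 0];
u2 = [0; 1; 1] / sqrt(2);
U  = [u1, u2];                      % m-by-2 matrix of directions

F = U' * Xcent;                     % 2-by-n matrix of planar scores

% Rows of F compared with the two one-dimensional score vectors.
row_gap = max(max(abs(F - [u1' * Xcent; u2' * Xcent])));
col_sum = F * ones(n, 1);           % sum of the f_j's; zero vector up to rounding
```

Each column of F is a point in the plane, so a command such as plot(F(1,:), F(2,:), 'o') would display the projected data.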

Example: Visualizing Senate voting on a plane.