Dimensionality Reduction#

The dimensionality reduction problem is as follows. You have many observations \(\mathbf{x}_{1:n}\). Each observation \(\mathbf{x}_i\) is a high-dimensional vector, say \(\mathbf{x}_i\) is in \(\mathbb{R}^D\) with \(D\gg 1\). You would like to describe this dataset with a smaller number of dimensions without losing too much information. That is, you would like to project each \(\mathbf{x}_i\) to a \(d\)-dimensional vector \(\mathbf{z}_i\) with \(d \ll D\).

Why would you like to do dimensionality reduction? First, you can take any dataset, no matter how high-dimensional, and visualize it in two dimensions. This may help you develop intuition about the dataset. Second, once you project the high-dimensional dataset to lower dimensions, it is often easier to carry out unsupervised tasks like clustering or density estimation. Third, supervised tasks involving high-dimensional data become more manageable if you reduce the dimensionality. For example, suppose you want to do regression between the high-dimensional \(\mathbf{x}\) and a scalar quantity \(y\). It will probably pay off to first reduce the dimensionality of \(\mathbf{x}\) by projecting it to a lower-dimensional vector \(\mathbf{z}\) and then do regression between \(\mathbf{z}\) and \(y\).

There are dozens of dimensionality reduction techniques; see this for an incomplete list. In this lecture, we develop the simplest and most widely used one: Principal Component Analysis (PCA). If you want to read about it on your own, I suggest Sections 12.1-12.2 of Bishop (2006).

Why is Dimensionality Reduction Possible?#

Suppose you are taking pictures of your head. Each picture is a high-dimensional object. For example, if you take a picture with a 10-megapixel camera, each image is a vector in \(\mathbb{R}^{10^7}\). Suppose you generate a dataset of such pictures with the same background, lighting, and facial expression; all you change is the position and orientation of the camera. Then you can describe this dataset with far fewer dimensions. All you need to know is where the camera is and how it is oriented. So, you can get away with six dimensions: three for the camera's position and three for its orientation. We went from \(\mathbb{R}^{10^7}\) to \(\mathbb{R}^6\). The dataset you generated is a 6-dimensional manifold embedded in \(\mathbb{R}^{10^7}\), a curved surface in \(\mathbb{R}^{10^7}\). This is the intuition behind dimensionality reduction. Now you can imagine varying more things: the background, the lighting, the facial expression, the hairstyle, the clothes, etc. This increases the number of dimensions you need to describe the dataset, but it remains much smaller than \(10^7\).

Principal Component Analysis#

Let \(\mathbf{x}_{1:n}\) be data points in \(\mathbb{R}^D\). We want to project them to a lower-dimensional space \(\mathbb{R}^d\) with \(d \ll D\). Specifically, we will find an affine map that projects the data to vectors \(\mathbf{z}_{1:n}\) in \(\mathbb{R}^d\) such that the reconstruction error is minimized.

First, we need a projection map. The projection map takes us from \(\mathbb{R}^D\) to \(\mathbb{R}^d\). We will use an affine projection map:

\[ \mathbf{z} = \mathbf{f}(\mathbf{x}) = \mathbf{W}^\top (\mathbf{x} - \mathbf{x}_0). \]

Here the matrix \(\mathbf{W}\) is a \(D\times d\) matrix and \(\mathbf{x}_0\) is the empirical mean of the data:

\[ \mathbf{x}_0 = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i. \]

Second, we need a reconstruction map. The reconstruction map takes us from \(\mathbb{R}^d\) to \(\mathbb{R}^D\). We will use an affine reconstruction map:

\[ \mathbf{x} = \mathbf{g}(\mathbf{z}) = \mathbf{V}\mathbf{z} + \mathbf{x}_0. \]

Here \(\mathbf{V}\) is also a \(D\times d\) matrix and \(\mathbf{x}_0\) is the same \(D\)-dimensional vector as before, the empirical mean.
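To make the two maps concrete, here is a minimal NumPy sketch (the function and variable names are my own, not part of the lecture); it assumes \(\mathbf{W}\), \(\mathbf{V}\), and \(\mathbf{x}_0\) are given as arrays of the shapes above:

```python
import numpy as np

def project(x, W, x0):
    """Projection map f: R^D -> R^d, z = W^T (x - x0)."""
    return W.T @ (x - x0)

def reconstruct(z, V, x0):
    """Reconstruction map g: R^d -> R^D, x = V z + x0."""
    return V @ z + x0

# Toy usage with random placeholders (D = 5, d = 2):
rng = np.random.default_rng(0)
D, d = 5, 2
W = V = rng.standard_normal((D, d))
x0 = rng.standard_normal(D)
x = rng.standard_normal(D)
x_rec = reconstruct(project(x, W, x0), V, x0)
```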

How can we find the matrices \(\mathbf{W}\), \(\mathbf{V}\), and \(\mathbf{x}_0\)? We will minimize the reconstruction error. So, we project through \(\mathbf{f}\), then reconstruct through \(\mathbf{g}\), and then measure the error. It is:

\[ L(\mathbf{W}, \mathbf{V}, \mathbf{x}_0) = \frac{1}{n}\sum_{i=1}^n \|\mathbf{x}_i - \mathbf{g}(\mathbf{f}(\mathbf{x}_i))\|^2. \]

Or, in more detail:

\[ L(\mathbf{W}, \mathbf{V}, \mathbf{x}_0) = \frac{1}{n}\sum_{i=1}^n \left\|\mathbf{x}_i - \mathbf{x}_0 - \mathbf{V}\mathbf{W}^\top (\mathbf{x}_i - \mathbf{x}_0)\right\|^2. \]
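This objective is straightforward to evaluate in code. A small sketch, assuming a data matrix `X` with one observation per row and arbitrary `W`, `V`, and `x0` (again, names of my choosing):

```python
import numpy as np

def reconstruction_loss(X, W, V, x0):
    """Mean squared reconstruction error L(W, V, x0) over the rows of X."""
    Z = (X - x0) @ W           # rows are z_i = W^T (x_i - x0)
    X_rec = Z @ V.T + x0       # rows are g(z_i) = V z_i + x0
    return np.mean(np.sum((X - X_rec) ** 2, axis=1))
```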

We now proceed as usual. We take the derivatives of \(L\) with respect to \(\mathbf{W}\), \(\mathbf{V}\), and \(\mathbf{x}_0\) and set them to zero. We get:

\[ \mathbf{W} = \mathbf{V}, \]

and the columns of \(\mathbf{V}\) are the leading eigenvectors of the empirical covariance matrix:

\[ \mathbf{C} = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i - \mathbf{x}_0)(\mathbf{x}_i - \mathbf{x}_0)^\top. \]

Specifically, let \(\mathbf{u}_i\) and \(\lambda_i\) be the \(i\)-th eigenvector and eigenvalue of \(\mathbf{C}\). Assume the eigenvectors are sorted in decreasing order of the eigenvalues:

\[ \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_D. \]

Then, the \(i\)-th column of \(\mathbf{V}\) is the \(i\)-th eigenvector:

\[ \mathbf{v}_i = \mathbf{u}_i, \quad i = 1, \ldots, d. \]
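In code, the solution boils down to an eigendecomposition of the empirical covariance matrix. A minimal sketch, assuming the data are stored in an \(n \times D\) array `X` (the function name `pca_fit` is mine):

```python
import numpy as np

def pca_fit(X, d):
    """Return the mean x0, the D x d matrix V = W, and all eigenvalues of C."""
    x0 = X.mean(axis=0)                 # empirical mean
    Xc = X - x0                         # centered data
    C = Xc.T @ Xc / X.shape[0]          # empirical covariance, D x D
    lam, U = np.linalg.eigh(C)          # eigh returns eigenvalues in ascending order
    lam, U = lam[::-1], U[:, ::-1]      # re-sort in decreasing order
    V = U[:, :d]                        # columns are u_1, ..., u_d
    return x0, V, lam
```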

So, in terms of the eigenvectors, the components of the projection map are:

\[ z_i = \mathbf{u}_i^\top (\mathbf{x} - \mathbf{x}_0), \quad i = 1, \ldots, d, \]

and the reconstruction map is:

\[ \mathbf{x} = \mathbf{g}(\mathbf{z}) = \mathbf{x}_0 + \sum_{i=1}^d z_i \mathbf{u}_i. \]

We can also show that the minimum reconstruction error is

\[ \frac{1}{n}\sum_{i=1}^n \|\mathbf{x}_i - \mathbf{g}(\mathbf{f}(\mathbf{x}_i))\|^2 = \sum_{j=d+1}^D \lambda_j. \]
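We can check this formula numerically on synthetic data. A self-contained sketch (the data and all names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 2000, 10, 3
X = rng.standard_normal((n, D)) @ rng.standard_normal((D, D))  # correlated synthetic data

x0 = X.mean(axis=0)
C = (X - x0).T @ (X - x0) / n            # empirical covariance
lam, U = np.linalg.eigh(C)
lam, U = lam[::-1], U[:, ::-1]           # eigenvalues in decreasing order
V = U[:, :d]                             # top-d eigenvectors

Z = (X - x0) @ V                         # projections z_i = V^T (x_i - x0)
X_rec = Z @ V.T + x0                     # reconstructions g(f(x_i))
err = np.mean(np.sum((X - X_rec) ** 2, axis=1))

print(err, lam[d:].sum())                # the two numbers agree up to round-off
```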

We can use this result to decide how many dimensions to keep: choose the smallest \(d\) for which the reconstruction error falls below a threshold, for example less than 1% of the total variance \(\sum_{j=1}^D \lambda_j\). We call the sum of the first \(d\) eigenvalues the variance explained by the first \(d\) dimensions; another name for it is the energy of the first \(d\) dimensions.
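In code, picking \(d\) from an explained-variance target might look like the following sketch, assuming `lam` holds the eigenvalues in decreasing order (the 99% threshold is just an example):

```python
import numpy as np

def choose_d(lam, target=0.99):
    """Smallest d whose cumulative explained variance reaches the target fraction."""
    frac = np.cumsum(lam) / np.sum(lam)
    return int(np.searchsorted(frac, target)) + 1
```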

Probabilistic Interpretation#

We can also interpret PCA probabilistically. We assume that the data points \(\mathbf{x}_{1:n}\) are generated by a linear Gaussian model:

\[ \mathbf{x}_i | \mathbf{z}_i \sim \mathcal{N}(\mathbf{x}_0 + \mathbf{V}\mathbf{z}_i, \sigma^2 \mathbf{I}). \]

The latent variables \(\mathbf{z}_{1:n}\) are generated by a Gaussian prior:

\[ \mathbf{z}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}). \]
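To get a feel for this generative model, here is a sketch that samples synthetic data from it; the values of \(\mathbf{x}_0\), \(\mathbf{V}\), and \(\sigma\) below are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
n, D, d = 500, 4, 2
x0 = rng.standard_normal(D)              # made-up mean
V = rng.standard_normal((D, d))          # made-up D x d loading matrix
sigma = 0.1                              # made-up noise scale

Z = rng.standard_normal((n, d))                          # z_i ~ N(0, I)
X = x0 + Z @ V.T + sigma * rng.standard_normal((n, D))   # x_i | z_i ~ N(x0 + V z_i, sigma^2 I)
```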

One proceeds by maximizing the marginal likelihood of the data:

\[ p(\mathbf{x}_{1:n}) = \int p(\mathbf{x}_{1:n} | \mathbf{z}_{1:n}) p(\mathbf{z}_{1:n}) d\mathbf{z}_{1:n}. \]

For problems with hidden variables like this one, we would normally use the EM algorithm. But in this case, we are lucky. Inside the integral, the Gaussian prior and the Gaussian likelihood combine into another Gaussian, and integrating out the latent variables yields yet another Gaussian: the marginal likelihood factorizes over the data points, with \(\mathbf{x}_i \sim \mathcal{N}(\mathbf{x}_0, \mathbf{V}\mathbf{V}^\top + \sigma^2\mathbf{I})\). When you write everything down, take the log, and maximize, you recover the same principal directions as before: the columns of the optimal \(\mathbf{V}\) span the subspace of the leading \(d\) eigenvectors of \(\mathbf{C}\). As a bonus, you also get a maximum likelihood estimate of the variance \(\sigma^2\) of the Gaussian likelihood. It is

\[ \sigma^2 = \frac{1}{D-d}\sum_{j=d+1}^D \lambda_j. \]
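In code, the noise variance estimate is simply the average of the discarded eigenvalues. A sketch, assuming `lam` holds the eigenvalues of \(\mathbf{C}\) in decreasing order (e.g., from the fitting sketch above); for comparison, scikit-learn's `PCA` exposes the same quantity as `noise_variance_`:

```python
import numpy as np

def ppca_noise_variance(lam, d):
    """Maximum likelihood estimate of sigma^2: average of the D - d smallest eigenvalues."""
    lam = np.asarray(lam)
    return lam[d:].mean()
```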