import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import seaborn as sns
sns.set_context("paper")
sns.set_style("ticks");

Connection Between SVD and Principal Component Analysis (PCA)#

Principal component analysis (PCA) is a linear dimensionality reduction technique. We typically use PCA in supervised or unsupervised learning to reduce the number of features in the dataset. The main idea behind PCA is to find a new set of features that are uncorrelated and ordered by the amount of variance they explain.

Let \(\mathbf{X}\) be an \(n \times m\) matrix, where \(n\) is the number of samples and \(m\) is the number of features. Depending on the application, we may have many more samples than features (\(n \gg m\)), as in the MNIST example below, or the reverse. We will use \(\mathbf{x}_i\) to denote the \(i\)-th row of \(\mathbf{X}\). So, we write:

\[\begin{split} \mathbf{X} = \begin{bmatrix} -\mathbf{x}_1- \\ -\mathbf{x}_2-\\ \vdots \\ -\mathbf{x}_n- \end{bmatrix}, \end{split}\]

where:

\[ -\mathbf{x}_j- \equiv \mathbf{x}_j^T. \]

When we do PCA, we always want to center the data. To this end, we calculate the empirical mean of the data:

\[ \bar{x} = \langle \mathbf{x}_i \rangle = \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i, \]

and we make the centered data matrix:

\[\begin{split} \mathbf{B} = \mathbf{X} - \bar{x} = \begin{bmatrix} -\mathbf{x}_1-\bar{x}- \\ -\mathbf{x}_2-\bar{x}- \\ \vdots \\ -\mathbf{x}_n-\bar{x}- \end{bmatrix}. \end{split}\]
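
Centering is a single broadcasted subtraction in NumPy. Here is a minimal sketch with a hypothetical data array X:

import numpy as np

# hypothetical data: n = 100 samples, m = 3 features
X = np.random.randn(100, 3)
x_bar = X.mean(axis=0)                  # empirical mean, one entry per feature
B = X - x_bar                           # centered data matrix
print(np.allclose(B.mean(axis=0), 0))   # every column of B now has (numerically) zero mean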

Now take an arbitrary unit vector \(\mathbf{v}\). A little visualization helps us understand PCA.

import numpy as np

# Synthetic 2D data: 500 correlated samples centered around (3, 3)
A = np.linalg.cholesky([[1, 0.9], [0.9, 1]])
X = (np.array([3, 3])[:, None] + A @ np.random.randn(2, 500)).T
bar_x = X.mean(axis=0)                        # empirical mean of the data
v = np.array([np.cos(0.5), np.sin(0.5)])      # an arbitrary unit vector
x_i = np.array([4, 5])                        # an example data point to highlight
b_i = x_i - bar_x                             # the same point after centering

fig, ax = plt.subplots()
ax.plot(X[:, 0], X[:, 1], '.', alpha=0.2)
ax.plot(bar_x[0], bar_x[1], 'o', color='red')
ax.quiver(0, 0, x_i[0], x_i[1], angles='xy', scale_units='xy', scale=1)
ax.text(3, 4, '$x_i$', fontsize=12)
ax.quiver(0, 0, bar_x[0], bar_x[1], angles='xy', scale_units='xy', scale=1, color='red')
ax.text(bar_x[0], bar_x[1], '$\\bar{x}$', fontsize=12, color='red')
ax.quiver(bar_x[0], bar_x[1], v[0], v[1], angles='xy', scale_units='xy', scale=1, color='green')
ax.text(bar_x[0] + v[0], bar_x[1] + v[1], '$v$', fontsize=12, color='green')
ax.quiver(bar_x[0], bar_x[1], b_i[0], b_i[1], angles='xy', scale_units='xy', scale=1, color='blue')
ax.text(bar_x[0] + b_i[0], bar_x[1] + b_i[1], '$b_i$', fontsize=12, color='blue')
sns.despine();

Consider the projection of \(\mathbf{b}_i\) on \(\mathbf{v}\). It is:

\[ \text{Proj}_i = \mathbf{b}_i^T \mathbf{v}. \]

What is the empirical mean of the projections?

\[ \langle \text{Proj}_i \rangle = \langle \mathbf{b}_i^T \mathbf{v} \rangle = \langle \mathbf{b}_i\rangle^T \mathbf{v} = 0, \]

because \(\mathbf{b}_i\) is centered.

What is the variance of the projections?

\[ \langle \text{Proj}_i^2\rangle = \langle \mathbf{v}^T\mathbf{b}_i \mathbf{b}_i^T \mathbf{v} \rangle = \mathbf{v}^T \langle \mathbf{b}_i \mathbf{b}_i^T \rangle \mathbf{v} = \mathbf{v}^T \mathbf{C} \mathbf{v}, \]

where \(\mathbf{C}\) is the covariance matrix of the centered data:

\[ \mathbf{C} = \frac{1}{n} \mathbf{B}^T \mathbf{B}. \]
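
These identities are easy to check numerically. Here is a minimal sketch with synthetic data (the particular matrix and direction are only for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
B = X - X.mean(axis=0)                 # centered data
n = B.shape[0]

v = np.array([1.0, 2.0, 3.0])
v /= np.linalg.norm(v)                 # arbitrary unit vector

proj = B @ v                           # projections b_i^T v
C = B.T @ B / n                        # covariance of the centered data

print(np.isclose(proj.mean(), 0.0))             # empirical mean of the projections vanishes
print(np.isclose(np.mean(proj**2), v @ C @ v))  # their variance equals v^T C v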

PCA finds the first principal direction, \(\mathbf{v}_1\), by solving the following problem:

\[ \mathbf{v}_1 = \arg\max_{\mathbf{v}} \mathbf{v}^T \mathbf{C} \mathbf{v}, \]

subject to the constraint that \(\mathbf{v}\) is a unit vector.

Using the method of Lagrange multipliers, we can show that the solution to this problem is the eigenvector of \(\mathbf{C}\) with the largest eigenvalue.
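
To sketch the argument, introduce a multiplier \(\lambda\) for the unit-norm constraint and set the gradient of the Lagrangian to zero:

\[ \mathcal{L}(\mathbf{v}, \lambda) = \mathbf{v}^T \mathbf{C} \mathbf{v} - \lambda \left(\mathbf{v}^T \mathbf{v} - 1\right), \qquad \nabla_{\mathbf{v}} \mathcal{L} = 2 \mathbf{C} \mathbf{v} - 2 \lambda \mathbf{v} = \mathbf{0} \implies \mathbf{C} \mathbf{v} = \lambda \mathbf{v}. \]

At such a stationary point the objective is \(\mathbf{v}^T \mathbf{C} \mathbf{v} = \lambda \mathbf{v}^T \mathbf{v} = \lambda\), so the maximum is attained at the eigenvector with the largest eigenvalue.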

In a similar way, we can find the second principal direction, \(\mathbf{v}_2\), by maximizing the same objective over unit vectors orthogonal to \(\mathbf{v}_1\), and so on. We obtain a sequence of orthogonal unit vectors, \(\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_m\), which are the eigenvectors of \(\mathbf{C}\) ordered by decreasing eigenvalue.

And now, here is the connection between PCA and SVD. Do the SVD of the centered matrix \(\mathbf{B}\):

\[ \mathbf{B} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T. \]

Form the covariance:

\[ \mathbf{C} = \frac{1}{n} \mathbf{B}^T \mathbf{B} = \frac{1}{n} \mathbf{V} \mathbf{\Sigma}^T \mathbf{U}^T \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T = \frac{1}{n} \mathbf{V} \mathbf{\Sigma}^2 \mathbf{V}^T. \]

So, the SVD diagonalizes the covariance matrix, and we can read off the eigenvalues and eigenvectors. The \(j\)-th column of \(\mathbf{V}\) is the \(j\)-th principal direction. The \(j\)-th singular value squared, divided by \(n\), is the variance in the \(j\)-th principal direction.

The total variance of the data is:

\[ \text{Tr}(\mathbf{C}) = \frac{1}{n} \text{Tr}(\mathbf{B}^T \mathbf{B}) = \frac{1}{n} \text{Tr}(\mathbf{V} \mathbf{\Sigma}^2 \mathbf{V}^T) = \frac{1}{n} \text{Tr}(\mathbf{\Sigma}^2) = \sum_{j=1}^m \frac{\sigma_j^2}{n}. \]
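
We can verify this correspondence numerically. Here is a quick sketch on a small synthetic matrix (any centered \(\mathbf{B}\) will do):

import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(200, 5))
B = B - B.mean(axis=0)                   # centered data
n = B.shape[0]

U, S, Vt = np.linalg.svd(B, full_matrices=False)

C = B.T @ B / n
eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns eigenvalues in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

print(np.allclose(eigvals, S**2 / n))              # variances are sigma_j^2 / n
print(np.allclose(np.abs(eigvecs), np.abs(Vt.T)))  # principal directions match, up to sign
print(np.isclose(np.trace(C), np.sum(S**2) / n))   # total variance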

The projection coefficients of the centered data on the principal directions are known as the principal components. They are:

\[ \mathbf{Z} = \mathbf{B} \mathbf{V} = \mathbf{U} \mathbf{\Sigma}. \]

The principal components are uncorrelated#

Let \(\mathbf{z}_i\) be the \(i\)-th row of \(\mathbf{Z}\). Then:

\[ \langle z_{ik}z_{il}\rangle = \langle u_{ik}\sigma_k u_{il}\sigma_l\rangle = \sigma_k \sigma_l \langle u_{ik} u_{il}\rangle = \frac{\sigma_k^2}{n} \delta_{kl}. \]

Here the angle brackets denote the empirical average over \(i\) (there is no implied summation over \(k\) or \(l\)), and we used the fact that the columns of \(\mathbf{U}\) are orthonormal, so \(\langle u_{ik} u_{il} \rangle = \frac{1}{n}\sum_{i=1}^n u_{ik} u_{il} = \frac{\delta_{kl}}{n}\).
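
Again, this is straightforward to check numerically. A short sketch with synthetic data as before:

import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(200, 5))
B = B - B.mean(axis=0)                   # centered data
n = B.shape[0]

U, S, Vt = np.linalg.svd(B, full_matrices=False)

Z = B @ Vt.T                             # principal components, Z = B V
print(np.allclose(Z, U * S))             # same as U Sigma (column-wise scaling of U)
print(np.allclose(Z.T @ Z / n, np.diag(S**2 / n)))  # empirical covariance of Z is diagonal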

Example - PCA on MNIST Dataset#

We will use the MNIST dataset to illustrate PCA. First, let’s download it.

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)

Let’s do the steps described above:

import numpy as np

# 1. Extract the relevant data
X = mnist.data
# 2. Find the empirical mean
x_bar = np.mean(X, axis=0)
# 3. Center the data
B = X - x_bar
# 4. Do the (thin) SVD
U, S, Vt = np.linalg.svd(B, full_matrices=False)
# 5. Project the data
Z = U @ np.diag(S)
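
As a sanity check, we can compare against scikit-learn's PCA, which performs the same centering and SVD internally. This is only a sketch: each principal direction is defined up to a sign flip, so we compare absolute values, and we pass svd_solver='full' so that scikit-learn uses the same deterministic SVD as above (its default solver would be randomized for data of this size).

from sklearn.decomposition import PCA

pca = PCA(n_components=2, svd_solver='full')
Z_sk = pca.fit_transform(X)              # scikit-learn centers the data internally

print(np.allclose(np.abs(Z_sk), np.abs(Z[:, :2])))           # principal components agree
print(np.allclose(np.abs(pca.components_), np.abs(Vt[:2])))  # principal directions agree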

Let’s look at the explained variance:

# The variance along principal direction j is sigma_j^2 / n, so the explained
# variance fraction is computed from the squared singular values
cum_var = np.cumsum(S**2) / np.sum(S**2)
best_k = np.argmax(cum_var > 0.9)

fig, ax = plt.subplots()
ax.plot(cum_var)
ax.axhline(0.9, color='red', linestyle='--')
ax.axvline(best_k, color='red', linestyle='--')
ax.text(best_k, 0.5, f'90% at k={best_k}', verticalalignment='center', color='red')
ax.set(yscale='log', xlabel="Singular value index", ylabel="Cumulative fraction of total variance")
sns.despine();
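
One common use of this curve is to pick a truncation rank. As a quick sketch, we can form the rank-\(k\) reconstruction of the images from the leading principal components (best_k above is a 0-based index, so we keep best_k + 1 components) and check that its relative error matches the unexplained variance:

k = best_k + 1                                   # number of components kept
B_k = Z[:, :k] @ Vt[:k]                          # rank-k approximation of the centered data
X_hat = np.asarray(x_bar) + B_k                  # reconstructed images
# e.g., plt.imshow(X_hat[0].reshape(28, 28), cmap='gray') shows the first reconstructed digit

rel_err = np.linalg.norm(np.asarray(B) - B_k) / np.linalg.norm(np.asarray(B))
print(np.isclose(rel_err**2, 1 - cum_var[k - 1]))  # squared error = unexplained variance fraction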

Let’s look at the first 25 principal directions:

fig, axes = plt.subplots(5, 5, figsize=(5, 5))
for i, ax in enumerate(axes.ravel()):
    ax.imshow(Vt[i].reshape(28, 28), cmap='gray')
    ax.axis('off')

Let’s look at the projections of all the data in two dimensions. We are going to color the points by the digit they represent.

y = mnist.target.astype(int)
fig, ax = plt.subplots()
for i in range(10):
    mask = y == i
    ax.scatter(Z[mask, 0], Z[mask, 1], label=str(i), alpha=0.5)
ax.legend(loc='best')
ax.set(xlabel='First principal component', ylabel='Second principal component')
sns.despine();