import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import seaborn as sns
sns.set_context("paper")
sns.set_style("ticks");

Connection Between SVD and Principal Component Analysis (PCA)#

Principal component analysis (PCA) is a linear dimensionality reduction technique. We typically use PCA in supervised or unsupervised learning to reduce the number of features in the dataset. The main idea behind PCA is to find a new set of features that are uncorrelated and ordered by the amount of variance they explain.

Let \(\mathbf{X}\) be an \(n \times m\) matrix, where \(n\) is the number of samples and \(m\) is the number of features. Depending on the application, we may have many more samples than features (\(n \gg m\)), as in the MNIST example below, or the reverse. We will use \(\mathbf{x}_i\) to denote the \(i\)-th row of \(\mathbf{X}\). So, we write:

\[\begin{split} \mathbf{X} = \begin{bmatrix} -\mathbf{x}_1- \\ -\mathbf{x}_2-\\ \vdots \\ -\mathbf{x}_n- \end{bmatrix}, \end{split}\]

where:

\[ -\mathbf{x}_j- \equiv \mathbf{x}_j^T. \]

When we do PCA, we always want to center the data. To this end, we calculate the empirical mean of the data:

\[ \bar{x} = \langle \mathbf{x}_i \rangle = \frac{1}{n} \sum_{i=1}^n \mathbf{x}_i, \]

and we make the centered data matrix:

\[\begin{split} \mathbf{B} = \mathbf{X} - \bar{x} = \begin{bmatrix} -\mathbf{x}_1-\bar{x}- \\ -\mathbf{x}_2-\bar{x}- \\ \vdots \\ -\mathbf{x}_n-\bar{x}- \end{bmatrix}. \end{split}\]
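
Centering is a single broadcasted subtraction in NumPy. Here is a minimal sketch with a hypothetical data array X:

import numpy as np

# hypothetical data: n = 100 samples, m = 3 features
X = np.random.randn(100, 3)
x_bar = X.mean(axis=0)                  # empirical mean, one entry per feature
B = X - x_bar                           # centered data matrix
print(np.allclose(B.mean(axis=0), 0))   # every column of B now has (numerically) zero mean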

Now take an arbitrary unit vector \(\mathbf{v}\). A little visualization helps us understand PCA.

import numpy as np

# Synthetic 2D data: 500 correlated samples centered around (3, 3)
A = np.linalg.cholesky([[1, 0.9], [0.9, 1]])
X = (np.array([3, 3])[:, None] + A @ np.random.randn(2, 500)).T
bar_x = X.mean(axis=0)                        # empirical mean of the data
v = np.array([np.cos(0.5), np.sin(0.5)])      # an arbitrary unit vector
x_i = np.array([4, 5])                        # an example data point to highlight
b_i = x_i - bar_x                             # the same point after centering

fig, ax = plt.subplots()
ax.plot(X[:, 0], X[:, 1], '.', alpha=0.2)
ax.plot(bar_x[0], bar_x[1], 'o', color='red')
ax.quiver(0, 0, x_i[0], x_i[1], angles='xy', scale_units='xy', scale=1)
ax.text(3, 4, '$x_i$', fontsize=12)
ax.quiver(0, 0, bar_x[0], bar_x[1], angles='xy', scale_units='xy', scale=1, color='red')
ax.text(bar_x[0], bar_x[1], '$\\bar{x}$', fontsize=12, color='red')
ax.quiver(bar_x[0], bar_x[1], v[0], v[1], angles='xy', scale_units='xy', scale=1, color='green')
ax.text(bar_x[0] + v[0], bar_x[1] + v[1], '$v$', fontsize=12, color='green')
ax.quiver(bar_x[0], bar_x[1], b_i[0], b_i[1], angles='xy', scale_units='xy', scale=1, color='blue')
ax.text(bar_x[0] + b_i[0], bar_x[1] + b_i[1], '$b_i$', fontsize=12, color='blue')
sns.despine();

Consider the projection of \(\mathbf{b}_i\) on \(\mathbf{v}\). It is:

\[ \text{Proj}_i = \mathbf{b}_i^T \mathbf{v}. \]

What is the empirical mean of the projections?

\[ \langle \text{Proj}_i \rangle = \langle \mathbf{b}_i^T \mathbf{v} \rangle = \langle \mathbf{b}_i\rangle^T \mathbf{v} = 0, \]

because \(\mathbf{b}_i\) is centered.

What is the variance of the projections?

\[ \langle \text{Proj}_i^2\rangle = \langle \mathbf{v}^T\mathbf{b}_i \mathbf{b}_i^T \mathbf{v} \rangle = \mathbf{v}^T \langle \mathbf{b}_i \mathbf{b}_i^T \rangle \mathbf{v} = \mathbf{v}^T \mathbf{C} \mathbf{v}, \]

where \(\mathbf{C}\) is the covariance matrix of the centered data:

\[ \mathbf{C} = \frac{1}{n} \mathbf{B}^T \mathbf{B}. \]
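
These identities are easy to check numerically. Here is a minimal sketch with synthetic data (the particular matrix and direction are only for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
B = X - X.mean(axis=0)                 # centered data
n = B.shape[0]

v = np.array([1.0, 2.0, 3.0])
v /= np.linalg.norm(v)                 # arbitrary unit vector

proj = B @ v                           # projections b_i^T v
C = B.T @ B / n                        # covariance of the centered data

print(np.isclose(proj.mean(), 0.0))             # empirical mean of the projections vanishes
print(np.isclose(np.mean(proj**2), v @ C @ v))  # their variance equals v^T C v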

PCA finds the first principal direction, \(\mathbf{v}_1\), by solving the following problem:

\[ \mathbf{v}_1 = \arg\max_{\mathbf{v}} \mathbf{v}^T \mathbf{C} \mathbf{v}, \]

subject to the constraint that \(\mathbf{v}\) is a unit vector.

Using the method of Lagrange multipliers, we can show that the solution to this problem is the eigenvector of \(\mathbf{C}\) with the largest eigenvalue.
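
To sketch the argument, introduce a multiplier \(\lambda\) for the unit-norm constraint and set the gradient of the Lagrangian to zero:

\[ \mathcal{L}(\mathbf{v}, \lambda) = \mathbf{v}^T \mathbf{C} \mathbf{v} - \lambda \left(\mathbf{v}^T \mathbf{v} - 1\right), \qquad \nabla_{\mathbf{v}} \mathcal{L} = 2 \mathbf{C} \mathbf{v} - 2 \lambda \mathbf{v} = \mathbf{0} \implies \mathbf{C} \mathbf{v} = \lambda \mathbf{v}. \]

At such a stationary point the objective is \(\mathbf{v}^T \mathbf{C} \mathbf{v} = \lambda \mathbf{v}^T \mathbf{v} = \lambda\), so the maximum is attained at the eigenvector with the largest eigenvalue.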

In a similar way, we can find the second principal direction, \(\mathbf{v}_2\), by maximizing the same objective over unit vectors orthogonal to \(\mathbf{v}_1\), and so on. We obtain a sequence of orthogonal unit vectors, \(\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_m\), which are the eigenvectors of \(\mathbf{C}\) ordered by decreasing eigenvalue.

And now, here is the connection between PCA and SVD. Do the SVD of the centered matrix \(\mathbf{B}\):

\[ \mathbf{B} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T. \]

Form the covariance:

\[ \mathbf{C} = \frac{1}{n} \mathbf{B}^T \mathbf{B} = \frac{1}{n} \mathbf{V} \mathbf{\Sigma}^T \mathbf{U}^T \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T = \frac{1}{n} \mathbf{V} \mathbf{\Sigma}^2 \mathbf{V}^T. \]

So, the SVD diagonalizes the covariance matrix, and we can read off the eigenvalues and eigenvectors. The \(j\)-th column of \(\mathbf{V}\) is the \(j\)-th principal direction. The \(j\)-th singular value squared, divided by \(n\), is the variance in the \(j\)-th principal direction.

The total variance of the data is:

\[ \text{Tr}(\mathbf{C}) = \frac{1}{n} \text{Tr}(\mathbf{B}^T \mathbf{B}) = \frac{1}{n} \text{Tr}(\mathbf{V} \mathbf{\Sigma}^2 \mathbf{V}^T) = \frac{1}{n} \text{Tr}(\mathbf{\Sigma}^2) = \sum_{j=1}^m \frac{\sigma_j^2}{n}. \]
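
We can verify this correspondence numerically. Here is a quick sketch on a small synthetic matrix (any centered \(\mathbf{B}\) will do):

import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(200, 5))
B = B - B.mean(axis=0)                   # centered data
n = B.shape[0]

U, S, Vt = np.linalg.svd(B, full_matrices=False)

C = B.T @ B / n
eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns eigenvalues in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

print(np.allclose(eigvals, S**2 / n))              # variances are sigma_j^2 / n
print(np.allclose(np.abs(eigvecs), np.abs(Vt.T)))  # principal directions match, up to sign
print(np.isclose(np.trace(C), np.sum(S**2) / n))   # total variance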

The projection coefficients of the centered data on the principal directions are known as the principal components. They are:

\[ \mathbf{Z} = \mathbf{B} \mathbf{V} = \mathbf{U} \mathbf{\Sigma}. \]

The principal components are uncorrelated#

Let \(\mathbf{z}_i\) be the \(i\)-th row of \(\mathbf{Z}\). Then:

\[ \langle z_{ik}z_{il}\rangle = \langle u_{ik}\sigma_k u_{il}\sigma_l\rangle = \sigma_k \sigma_l \langle u_{ik} u_{il}\rangle = \frac{\sigma_k^2}{n} \delta_{kl}. \]

Here the angle brackets denote the empirical average over \(i\) (there is no implied summation over \(k\) or \(l\)), and we used the fact that the columns of \(\mathbf{U}\) are orthonormal, so \(\langle u_{ik} u_{il} \rangle = \frac{1}{n}\sum_{i=1}^n u_{ik} u_{il} = \frac{\delta_{kl}}{n}\).
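
Again, this is straightforward to check numerically. A short sketch with synthetic data as before:

import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(200, 5))
B = B - B.mean(axis=0)                   # centered data
n = B.shape[0]

U, S, Vt = np.linalg.svd(B, full_matrices=False)

Z = B @ Vt.T                             # principal components, Z = B V
print(np.allclose(Z, U * S))             # same as U Sigma (column-wise scaling of U)
print(np.allclose(Z.T @ Z / n, np.diag(S**2 / n)))  # empirical covariance of Z is diagonal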

Example - PCA on MNIST Dataset#

We will use the MNIST dataset to illustrate PCA. First, let’s download it.

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)

Let’s do the steps described above:

import numpy as np

# 1. Extract the relevant data
X = mnist.data
# 2. Find the empirical mean
x_bar = np.mean(X, axis=0)
# 3. Center the data
B = X - x_bar
# 4. Do the (thin) SVD
U, S, Vt = np.linalg.svd(B, full_matrices=False)
# 5. Project the data
Z = U @ np.diag(S)
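
As a sanity check, we can compare against scikit-learn's PCA, which performs the same centering and SVD internally. This is only a sketch: each principal direction is defined up to a sign flip, so we compare absolute values, and we pass svd_solver='full' so that scikit-learn uses the same deterministic SVD as above (its default solver would be randomized for data of this size).

from sklearn.decomposition import PCA

pca = PCA(n_components=2, svd_solver='full')
Z_sk = pca.fit_transform(X)              # scikit-learn centers the data internally

print(np.allclose(np.abs(Z_sk), np.abs(Z[:, :2])))           # principal components agree
print(np.allclose(np.abs(pca.components_), np.abs(Vt[:2])))  # principal directions agree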

Let’s look at the explained variance:

# The variance along principal direction j is sigma_j^2 / n, so the explained
# variance fraction is computed from the squared singular values
cum_var = np.cumsum(S**2) / np.sum(S**2)
best_k = np.argmax(cum_var > 0.9)

fig, ax = plt.subplots()
ax.plot(cum_var)
ax.axhline(0.9, color='red', linestyle='--')
ax.axvline(best_k, color='red', linestyle='--')
ax.text(best_k, 0.5, f'90% at k={best_k}', verticalalignment='center', color='red')
ax.set(yscale='log', xlabel="Singular value index", ylabel="Cumulative fraction of total variance")
sns.despine();
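
One common use of this curve is to pick a truncation rank. As a quick sketch, we can form the rank-\(k\) reconstruction of the images from the leading principal components (best_k above is a 0-based index, so we keep best_k + 1 components) and check that its relative error matches the unexplained variance:

k = best_k + 1                                   # number of components kept
B_k = Z[:, :k] @ Vt[:k]                          # rank-k approximation of the centered data
X_hat = np.asarray(x_bar) + B_k                  # reconstructed images
# e.g., plt.imshow(X_hat[0].reshape(28, 28), cmap='gray') shows the first reconstructed digit

rel_err = np.linalg.norm(np.asarray(B) - B_k) / np.linalg.norm(np.asarray(B))
print(np.isclose(rel_err**2, 1 - cum_var[k - 1]))  # squared error = unexplained variance fraction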

Let’s look at the first 25 principal directions:

fig, axes = plt.subplots(5, 5, figsize=(5, 5))
for i, ax in enumerate(axes.ravel()):
    ax.imshow(Vt[i].reshape(28, 28), cmap='gray')
    ax.axis('off')

Let’s look at the projections of all the data in two dimensions. We are going to color the points by the digit they represent.

y = mnist.target.astype(int)
fig, ax = plt.subplots()
for i in range(10):
    mask = y == i
    ax.scatter(Z[mask, 0], Z[mask, 1], label=str(i), alpha=0.5)
ax.legend(loc='best')
ax.set(xlabel='First principal component', ylabel='Second principal component')
sns.despine();