import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import seaborn as sns
sns.set_context("paper")
sns.set_style("ticks")

def show_digit_image(data):
    """Show a digit as an image.
    
    Arguments
    data -- The image data.
    """
    fig, ax = plt.subplots()
    ax.imshow(data.reshape((28, 28)), cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_xticks([])
    ax.set_xticklabels([])
    ax.set_yticks([])
    ax.set_yticklabels([]);

Density Estimation with High-dimensional Data

We are going to build a model that can sample handwritten digits. To achieve this, we will use PCA to reduce the dimensionality of the MNIST images and then apply Gaussian mixture density estimation on the principal components. The resulting model will not be perfect, but it is very simple and a decent start. For simplicity, we are going to work only with the threes.
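Schematically, and glossing over the whitening scaling that PCA applies, the generative model is

$$
\mathbf{z} = \mathbf{W}^\top(\mathbf{x} - \boldsymbol{\mu}),
\qquad
p(\mathbf{z}) = \sum_{k=1}^{K}\pi_k\,\mathcal{N}\left(\mathbf{z}\mid \mathbf{m}_k, \mathbf{S}_k\right),
\qquad
\mathbf{x}^* = \mathbf{W}\mathbf{z}^* + \boldsymbol{\mu},
$$

where the columns of $\mathbf{W}$ are the principal components, $\boldsymbol{\mu}$ is the mean image, and $\mathbf{x}^*$ is the image decoded from a sample $\mathbf{z}^* \sim p(\mathbf{z})$.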

Start by loading the data and extracting the threes:

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Load data from https://www.openml.org/d/554
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0

# Split the data into training and test sets
np.random.seed(12345)  # fixes the global seed used by the mixture fits and samples below
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.3)

Apply PCA to the threes keeping just a few components:

from sklearn.decomposition import PCA

threes = x_train[y_train == '3']
num_components = 2
pca = PCA(
    n_components=num_components,
    whiten=True
).fit(threes)
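Before moving on, it is worth checking how much of the pixel variance just two components capture (a quick check using the explained_variance_ratio_ attribute of the fitted PCA):

# Fraction of the total pixel variance captured by each principal component
print(pca.explained_variance_ratio_)
# Total fraction captured by both components together
print(pca.explained_variance_ratio_.sum())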

Now, fit a Gaussian mixture model on the principal components. We will use the Bayesian information criterion (BIC) to select the number of mixture components.
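Recall that for a model with $k$ free parameters and maximized likelihood $\hat{L}$, fitted to $n$ data points,

$$
\text{BIC} = k\ln n - 2\ln\hat{L},
$$

so lower is better: the $k\ln n$ term penalizes the extra parameters that come with every additional mixture component. This is the quantity that scikit-learn's bic method returns.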

from sklearn.mixture import GaussianMixture

Z = pca.transform(threes)

# Fit mixtures with 1 to 10 components, recording the BIC of each
max_num_components = 11
bics = np.empty(max_num_components - 1)
models = []
for nmc in range(1, max_num_components):
    m = GaussianMixture(n_components=nmc).fit(Z)
    bics[nmc - 1] = m.bic(Z)
    models.append(m)

Here are the BIC scores:

fig, ax = plt.subplots()
ax.bar(range(1, max_num_components), bics)
ax.set_ylabel('BIC Score')
ax.set_xlabel('Number of mixture components')
sns.despine(trim=True);
[Figure: bar chart of BIC score versus the number of mixture components.]

Let’s find the mixture model with the smallest BIC:

model = models[np.argmin(bics)]
print(model)
GaussianMixture(n_components=5)
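To get a feeling for what the mixture has learned, we can decode the mean of each component back to pixel space. This is a quick sketch reusing the pca and model objects from above (model.means_ has one row per mixture component):

# Decode each mixture component mean into a 28x28 image
mean_digits = pca.inverse_transform(model.means_)
for mean_digit in mean_digits:
    show_digit_image(mean_digit)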

Now let’s sample some random threes…

for i in range(5):
    # sample() returns (samples, component labels); we only need the samples
    z = model.sample()[0]
    x = pca.inverse_transform(z)
    show_digit_image(x)
[Figures: five sampled threes.]
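Because this is a density model, we can also evaluate log-densities instead of just sampling. For example, the model should on average assign a higher log-density to test threes than to other digits. Here is a sketch using the test partition from above:

# Average log-density per sample under the fitted mixture (in PCA space)
test_threes = pca.transform(x_test[y_test == '3'])
test_ones = pca.transform(x_test[y_test == '1'])
print(model.score(test_threes))  # relatively high
print(model.score(test_ones))    # should be noticeably lower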

Questions

  • Try the same code above with ones instead of threes. You just need to change the line threes = x_train[y_train == '3'] to threes = x_train[y_train == '1']. Don't bother renaming the variables.

  • Try increasing the number of PCA components (3, 5, 10, 20). Do the results improve or get worse? What seems to be the problem? The helper sketched below makes these experiments quick to run.
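To make these experiments easier, here is one way to wrap the whole pipeline in a function. This is only a sketch: the name fit_digit_model and its arguments are our own, and it reuses x_train and y_train from above.

def fit_digit_model(digit, num_components, max_num_components=11):
    """Fit PCA + a Gaussian mixture to one digit class.

    Arguments
    digit              -- The digit label as a string, e.g., '3'.
    num_components     -- The number of principal components to keep.
    max_num_components -- Try mixtures with 1 up to this many components.
    """
    data = x_train[y_train == digit]
    pca = PCA(n_components=num_components, whiten=True).fit(data)
    Z = pca.transform(data)
    # Select the number of mixture components by BIC
    models = [GaussianMixture(n_components=nmc).fit(Z)
              for nmc in range(1, max_num_components)]
    best = models[np.argmin([m.bic(Z) for m in models])]
    return pca, best

# Example: sample a one using ten principal components
pca_1, model_1 = fit_digit_model('1', 10)
z = model_1.sample()[0]
show_digit_image(pca_1.inverse_transform(z))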