import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import seaborn as sns
sns.set_context("paper")
sns.set_style("ticks");

The Multivariate Normal - Full Covariance Case#

Consider the \(N\)-dimensional multivariate normal:

\[ \mathbf{X} \sim N\left(\boldsymbol{\mu}, \boldsymbol{\Sigma}\right), \]

where \(\boldsymbol{\mu}\) is an \(N\)-dimensional vector and \(\boldsymbol{\Sigma}\) is an \(N\times N\) positive-definite matrix.

The joint PDF of \(\mathbf{X}\) is

\[ p(\mathbf{x}) = \frac{1}{\sqrt{\left(2\pi\right)^N \det\left(\boldsymbol{\Sigma}\right)}} \exp\left(-\frac{1}{2} \left(\mathbf{x} - \boldsymbol{\mu}\right)^T \boldsymbol{\Sigma}^{-1} \left(\mathbf{x} - \boldsymbol{\mu}\right)\right). \]

The term \(\det\left(\boldsymbol{\Sigma}\right)\) is the determinant of \(\boldsymbol{\Sigma}\). Sometimes we write \(\left|\boldsymbol{\Sigma}\right|\) instead of \(\det\left(\boldsymbol{\Sigma}\right)\). Because \(\boldsymbol{\Sigma}\) is positive-definite, \(\det\left(\boldsymbol{\Sigma}\right)\) is positive so that we can take the square root.

The term \(\boldsymbol{\Sigma}^{-1}\) is the inverse of \(\boldsymbol{\Sigma}\). It exists because \(\boldsymbol{\Sigma}\) is positive-definite, and it is itself positive-definite. So the quadratic term \(\left(\mathbf{x} - \boldsymbol{\mu}\right)^T \boldsymbol{\Sigma}^{-1} \left(\mathbf{x} - \boldsymbol{\mu}\right)\) is non-negative: it is zero at \(\mathbf{x} = \boldsymbol{\mu}\) and strictly positive everywhere else. This is why the PDF attains its unique maximum at \(\mathbf{x} = \boldsymbol{\mu}\).
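Here is a quick numerical sketch of this fact (the values of \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\) below are just examples, the same ones we use later in this notebook): the quadratic form is zero at \(\mathbf{x} = \boldsymbol{\mu}\) and positive everywhere else.

import numpy as np

# Example mean vector and positive-definite covariance matrix
mu = np.array([1.0, 2.0])
Sigma = np.array(
    [
        [1.0, 0.9],
        [0.9, 1.0]
    ]
)
Sigma_inv = np.linalg.inv(Sigma)

def quad_form(x):
    """The quadratic form (x - mu)^T Sigma^{-1} (x - mu)."""
    return (x - mu) @ Sigma_inv @ (x - mu)

# Zero at the mean, strictly positive away from it
print(quad_form(mu))
print(quad_form(np.array([0.5, 0.6])))
print(quad_form(np.array([3.0, 0.0])))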

The vector \(\boldsymbol{\mu}\) is the mean vector, because it is the expected value of \(\mathbf{X}\),

\[ \mathbb{E}\left[\mathbf{X}\right] = \boldsymbol{\mu}. \]

The matrix \(\boldsymbol{\Sigma}\) is the covariance matrix, because it is the (self-)covariance of \(\mathbf{X}\),

\[ \mathbb{C}\left[\mathbf{X}\right] = \mathbb{C}\left[\mathbf{X}, \mathbf{X}\right] = \boldsymbol{\Sigma}. \]
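As a quick sanity check (a small sketch reusing the example \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\) from above), the empirical mean and covariance of many samples should be close to \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\):

import scipy.stats as st

# Draw many samples and compare the empirical mean and covariance to mu and Sigma
samples = st.multivariate_normal(mean=mu, cov=Sigma).rvs(size=100000)
print("Empirical mean:", samples.mean(axis=0))
print("Empirical covariance:")
print(np.cov(samples, rowvar=False))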

Let’s plot contours and take samples.

import numpy as np
import scipy.stats as st

# The mean vector
mu = np.array([1.0, 2.0])
# The covariance matrix
Sigma = np.array(
    [
        [1.0, 0.9],
        [0.9, 1.0]
    ]
)
# The multivariate normal random vector
X = st.multivariate_normal(mean=mu, cov=Sigma)

# CONTOURS
fig, ax = plt.subplots(dpi=150)
x1 = np.linspace(-3, 5, 64)
x2 = np.linspace(-3, 5, 64)
X1, X2 = np.meshgrid(x1, x2)
X_flat = np.hstack(
    [
        X1.flatten()[:, None],
        X2.flatten()[:, None]
    ]
)
# PDF values
pdf_X = X.pdf(X_flat).reshape(X1.shape)
c = ax.contour(X1, X2, pdf_X)
ax.clabel(c, inline=1, fontsize=10)

# SAMPLES
num_samples = 500
x_samples = X.rvs(size=num_samples)
ax.plot(
    x_samples[:, 0],
    x_samples[:, 1],
    '.',
    markersize=2,
    label='Samples'
)
plt.legend(loc='best', frameon=False)
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
sns.despine(trim=True);
[Figure: contours of the PDF of the multivariate normal along with 500 samples.]

Questions#

  • Rerun the steps above after moving \(\boldsymbol{\mu}\) to \((0, 1)\). Observe how the contours of the PDF move.

  • Rerun the steps above for \(\Sigma_{12} = \Sigma_{21} = 0.1\). Observe how the contours of the PDF change.

  • Rerun the steps above for \(\Sigma_{12} = \Sigma_{21} = -0.9\). Observe how the contours of the PDF change.

  • Rerun the steps above for \(\Sigma_{11} = 0.4\). Why does the code fail?

The covariance matrix must be positive definite so that \(p(\mathbf{x})\) has a well-defined, unique maximum#

We said that because the covariance matrix \(\boldsymbol{\Sigma}\) is positive definite, the PDF of \(\mathbf{X}\) has a unique maximum. Let’s try to understand this step by step.

First, let us numerically examine what positive definiteness means. We take two matrices. One will be positive definite; the other won't be. We will draw random vectors \(\mathbf{v}\), evaluate the expression \(\mathbf{v}^T\boldsymbol{\Sigma}\mathbf{v}\), and see what we get.

# A covariance matrix that we know works
Sigma_good = np.array(
    [
        [1.0, 0.9],
        [0.9, 1.0]
    ]
)

# A covariance matrix that we know does not work
Sigma_bad = np.array(
    [
        [0.4, 0.9],
        [0.9, 1.0]
    ]
)

# Take random vectors and compute the quantity v^T Sigma v
num_vectors = 5000
Q_good = np.empty((num_vectors,))
Q_bad = np.empty((num_vectors,))
for i in range(num_vectors):
    v = np.random.randn(2)
    Q_good[i] = v @ (Sigma_good @ v)
    Q_bad[i] = v @ (Sigma_bad @ v)

# Let's do the histograms of these quantities to see whether or not they are positive
fig, ax = plt.subplots()
ax.hist(
    Q_good,
    density=True,
    alpha=0.5,
    bins=100,
    label=r'Positive definite $\Sigma$'
)
ax.hist(
    Q_bad,
    density=True,
    alpha=0.5,
    bins=100,
    label=r'Not positive definite $\Sigma$'
)
ax.set_xlabel(r'$v^T\Sigma v$')
ax.set_ylabel('Density')
plt.legend(loc='best', frameon=False)
sns.despine(trim=True);
[Figure: histograms of \(\mathbf{v}^T\boldsymbol{\Sigma}\mathbf{v}\) for the positive definite and the non-positive definite matrix.]

Please observe that the non-positive definite matrix gives several negative values of \(\mathbf{v}^T\boldsymbol{\Sigma}\mathbf{v}\), while the positive definite one gives only positive values.

Is there a way to check if a matrix is positive definite without doing this random test? Yes, you check if all the eigenvalues of the matrix are positive. Here is how:

print("Eigenvalues of Sigma_good", np.linalg.eigh(Sigma_good)[0])
print("Eigenvalues of Sigma_bad", np.linalg.eigh(Sigma_bad)[0])
Eigenvalues of Sigma_good [0.1 1.9]
Eigenvalues of Sigma_bad [-0.2486833  1.6486833]

And you see that the second one has a negative eigenvalue.
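If you want a single yes/no answer, here is a small helper (just a sketch, not a numpy built-in) that wraps this eigenvalue test:

def is_positive_definite(A):
    """Check whether the symmetric matrix A is positive definite by testing
    whether all of its eigenvalues are positive."""
    return bool(np.all(np.linalg.eigvalsh(A) > 0.0))

print(is_positive_definite(Sigma_good))
print(is_positive_definite(Sigma_bad))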

Finally, let’s visualize the contours of the probability density and see with our eyes that it does not have a unique maximum when the matrix \(\boldsymbol{\Sigma}\) is not positive definite.

First, we define a function that evaluates the PDF up to a constant factor (the constant does not affect the shape of the contours):

def pdf_mvn(x, mu, Sigma):
    """Evaluate the PDF of the multivariate normal, up to a constant factor that
    depends only on Sigma, in a way that does not require Sigma to be positive
    definite, so that you can see what happens.
    
    Note that the normalization uses det(Sigma) instead of the usual
    log det(Sigma). This keeps the formula real-valued even when det(Sigma) is
    negative, and it only changes the constant factor, not the shape of the
    contours. Also keep in mind that this is not computationally efficient
    (or stable), but it is okay for this example.
    
    Arguments
    x     -- A 1D numpy array.
    mu    -- The mean vector.
    Sigma -- The covariance matrix.
    """
    N = Sigma.shape[0]
    return np.exp(
        -0.5 * N * np.log(2.0 * np.pi)
        - 0.5 * np.linalg.det(Sigma)
        - 0.5 * (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
    )

You can evaluate the function at a point like this:

pdf_mvn(np.array([0.5, 0.6]), mu, Sigma)
0.011880260158671472

To draw the contours, we have to evaluate the function at many points. We could do this with a for loop, but we can also vectorize our function:

vpdf_mvn = np.vectorize(
    pdf_mvn,
    excluded=[1, 2],
    signature="(n)->()"
)

In the above, excluded ensures that inputs 1 (mu) and 2 (Sigma) are not vectorized. The signature tells numpy that pdf_mvn accepts a 1D array and returns a scalar. Vectorization returns the function vpdf_mvn, which acts on 2D arrays; each row corresponds to a different point \(\mathbf{x}\) at which we evaluate the function. Here is how it works:

vpdf_mvn(
    np.array(
        [
            [0.5, 0.6],
            [-0.1, 0.3],
            [1.5, -1.5]
        ]
    ),
    mu,
    Sigma
)
array([1.18802602e-02, 2.09743291e-02, 1.88142445e-19])

And now we can draw the contours:

x1 = np.linspace(-5, 5, 64)
x2 = np.linspace(-5, 5, 64)
X1, X2 = np.meshgrid(x1, x2)
X_flat = np.hstack(
    [
        X1.flatten()[:, None],
        X2.flatten()[:, None]
    ]
)

Z_bad = vpdf_mvn(
    X_flat,
    mu,
    Sigma_bad
).reshape(X1.shape)

Z_good = vpdf_mvn(
    X_flat,
    mu,
    Sigma_good
).reshape(X1.shape)

fig, ax = plt.subplots(
    nrows=2,
    ncols=1,
    figsize=(8, 8)
)
c = ax[0].contour(
    X1,
    X2,
    Z_bad,
    levels=np.linspace(0, 0.4, 5)
)
ax[0].clabel(c, inline=1, fontsize=10);
ax[0].set_title("Bad \"covariance\"")
c = ax[1].contour(
    X1,
    X2,
    Z_good,
    levels=5
)
ax[1].clabel(c, inline=1, fontsize=10);
ax[1].set_title("Good covariance")
ax[1].set_xlabel('$x_1$')
ax[0].set_ylabel('$x_2$')
ax[1].set_ylabel('$x_2$')
sns.despine(trim=True)
[Figure: contours for the bad “covariance” (top) and the good covariance (bottom).]

Notice that, for the bad “covariance,” \(\boldsymbol{\mu}\) is not a maximum of the plotted function but a saddle point.
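We can confirm this numerically with a small sketch that reuses Sigma_good and Sigma_bad from above. Up to a constant, the exponent of the density is \(-\frac{1}{2}\left(\mathbf{x} - \boldsymbol{\mu}\right)^T \boldsymbol{\Sigma}^{-1} \left(\mathbf{x} - \boldsymbol{\mu}\right)\), so its Hessian at \(\mathbf{x} = \boldsymbol{\mu}\) is \(-\boldsymbol{\Sigma}^{-1}\). A maximum requires all the eigenvalues of the Hessian to be negative, while a saddle point has eigenvalues of both signs:

# The Hessian of the exponent at x = mu is -Sigma^{-1}
H_good = -np.linalg.inv(Sigma_good)
H_bad = -np.linalg.inv(Sigma_bad)
print("Hessian eigenvalues (good covariance):", np.linalg.eigvalsh(H_good))
print("Hessian eigenvalues (bad matrix):     ", np.linalg.eigvalsh(H_bad))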

Questions#

  • Rerun the code above for \(\Sigma_{12} = \Sigma_{21} = -0.9\) (both for the “good” and the “bad” covariance).

Sampling the multivariate normal with full covariance using the standard normal#

Let \(\mathbf{Z}\) be an \(N\)-dimensional standard normal:

\[ \mathbf{Z} \sim N(0,\mathbf{I}), \]

where \(\mathbf{I}\) is the \(N\times N\) identity matrix (all zeros except the diagonal, which is all ones). Take an \(N\times N\) matrix \(\mathbf{A}\) and define the random vector \(\mathbf{X}\) as:

\[ \mathbf{X} = \boldsymbol{\mu} + \mathbf{A}\mathbf{Z}. \]

One can show that it is a multivariate normal:

\[ \mathbf{X} \sim N\left(\boldsymbol{\mu}, \boldsymbol{\Sigma}\right), \]

with

\[ \boldsymbol{\Sigma} = \mathbf{A}\mathbf{A}^T. \]

Such a matrix \(\mathbf{A}\) is not unique and is called a “square root” of \(\boldsymbol{\Sigma}\). The most commonly used square root of \(\boldsymbol{\Sigma}\), however, is the Cholesky decomposition (pronounced KOLESKI, not TSOLESKI or SHOLESKI – at least by the people who taught me linear algebra!). In the Cholesky decomposition, \(\mathbf{A}\) is a lower triangular matrix (everything above the diagonal is zero) whose diagonal contains only positive numbers; it exists precisely because \(\boldsymbol{\Sigma}\) is positive definite.
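To see that the square root is indeed not unique, here is a small sketch (with example values for \(\boldsymbol{\Sigma}\)) that compares the Cholesky factor to the symmetric square root built from the eigendecomposition. Both satisfy \(\mathbf{A}\mathbf{A}^T = \boldsymbol{\Sigma}\):

import numpy as np

Sigma = np.array(
    [
        [1.0, 0.9],
        [0.9, 1.0]
    ]
)
# Square root no. 1: the (lower triangular) Cholesky factor
L = np.linalg.cholesky(Sigma)
# Square root no. 2: the symmetric square root from the eigendecomposition
lam, V = np.linalg.eigh(Sigma)
S = V @ np.diag(np.sqrt(lam)) @ V.T

print(np.allclose(L @ L.T, Sigma))   # both reproduce Sigma
print(np.allclose(S @ S.T, Sigma))
print(np.allclose(L, S))             # but they are different matrices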

The above theorem is useful because it allows us to sample from a multivariate normal with full covariance using the standard normal. We go through the following steps:

  1. Generate \(N\) standard normal random variables \(\mathbf{Z}\).

  2. Multiply \(\mathbf{Z}\) by the Cholesky decomposition of \(\boldsymbol{\Sigma}\).

  3. Add the mean \(\boldsymbol{\mu}\).

Let’s do this in Python. First, let’s find the Cholesky decomposition of \(\boldsymbol{\Sigma}\):

# A covariance matrix that we know works
Sigma = np.array(
    [
        [1.0, 0.9],
        [0.9, 1.0]
    ]
)
A = np.linalg.cholesky(Sigma)
print("A =")
print(A)
A =
[[1.         0.        ]
 [0.9        0.43588989]]
# As a sanity check let's see if A * A.T gives us Sigma
print("A * A.T = ")
print(A @ A.T)
print("\nCompare to Sigma =")
print(Sigma)
A * A.T = 
[[1.  0.9]
 [0.9 1. ]]

Compare to Sigma =
[[1.  0.9]
 [0.9 1. ]]

Let’s now verify that if we sample \(\mathbf{Z}\) from \(N(\mathbf{0},\mathbf{I})\) and evaluate \(\mathbf{X} = \boldsymbol{\mu} + \mathbf{A}\mathbf{Z}\), then \(\mathbf{X}\) will be distributed according to \(N(\boldsymbol{\mu},\boldsymbol{\Sigma})\).

Let’s write some code to take a single sample, and then we will wrap it in a function that takes many samples at once.

def sample_mvn(mu, A):
    """Samples from a multivariate normal.
    
    Arguments
    mu -- The mean vector.
    A  -- The Cholesky decomposition of the covariance matrix.
    """
    z = np.random.randn(mu.shape[0])
    return mu + A @ z

Here is how it works:

for i in range(10):
    print(f"sample {i} = {sample_mvn(mu, A)}")
sample 0 = [-0.18890091  0.35159494]
sample 1 = [0.97453387 2.18909632]
sample 2 = [1.80830105 2.81649643]
sample 3 = [2.37771011 3.19377202]
sample 4 = [2.22812554 3.20392518]
sample 5 = [0.55014157 1.19822046]
sample 6 = [0.40626639 1.39052606]
sample 7 = [0.08688004 1.41870291]
sample 8 = [1.91221029 3.46608438]
sample 9 = [0.94520142 1.37631589]

Let’s make a function that will help us take many samples at once:

def sample_many_mvn(size, mu, A):
    """Sample many times from a multivariate Normal.
    
    Arguments
    size -- The number of samples.
    mu   -- The mean vector.
    A    -- The Cholesky decomposition of the covariance matrix.
    """
    return np.array(
        [
            sample_mvn(mu, A)
            for _ in range(size)
        ]
    )

And it works like this:

sample_many_mvn(10, mu, A)
array([[2.59107626, 3.0731549 ],
       [2.41806169, 3.54035803],
       [1.38749675, 1.85218773],
       [0.8554794 , 2.27355883],
       [1.86912983, 3.04120868],
       [1.31770416, 1.94390974],
       [2.23897879, 3.20695567],
       [2.90815115, 4.31252856],
       [1.45768185, 3.20416916],
       [2.72066845, 3.47094575]])
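
As a quick sanity check on the sampler (a sketch reusing mu and A from above), the empirical mean and covariance of many samples should be close to \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma} = \mathbf{A}\mathbf{A}^T\):

# Take many samples and compare the empirical moments to mu and A A^T
many_samples = sample_many_mvn(100000, mu, A)
print("Empirical mean:", many_samples.mean(axis=0))
print("Empirical covariance:")
print(np.cov(many_samples, rowvar=False))
print("Compare to A @ A.T:")
print(A @ A.T)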

Let’s plot some samples along with the contours of the PDF:

mu = np.array([1.0, 2.0])

Sigma = np.array(
    [
        [1.0, 0.9],
        [0.9, 1.0]
    ]
)
# Recompute the Cholesky factor so that it always matches Sigma
A = np.linalg.cholesky(Sigma)

num_samples = 500
x_samples = sample_many_mvn(num_samples, mu, A)

x1 = np.linspace(-3, 5, 64)
x2 = np.linspace(-3, 5, 64)
X1, X2 = np.meshgrid(x1, x2)
X_flat = np.hstack(
    [
        X1.flatten()[:, None],
        X2.flatten()[:, None]
    ]
)
Z = vpdf_mvn(
    X_flat,
    mu,
    Sigma
).reshape(X1.shape)

fig, ax = plt.subplots()
c = ax.contour(X1, X2, Z)
ax.clabel(c, inline=1, fontsize=10)
ax.plot(
    x_samples[:, 0],
    x_samples[:, 1],
    '.',
    markersize=2
)
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
sns.despine(trim=True)
[Figure: contours of the PDF along with samples generated by sample_many_mvn.]

Now, in practice you do not have to use the code above to sample from a multivariate normal. Everything is already implemented in scipy.stats. Here is what you can do:

import scipy.stats as st

# A multivariate normal object
X = st.multivariate_normal(mu, Sigma)

# Evaluate the pdf
Z = X.pdf(X_flat).reshape(X1.shape)

# Take samples
x_samples = X.rvs(num_samples)

fig, ax = plt.subplots()
c = ax.contour(X1, X2, Z)
ax.clabel(c, inline=1, fontsize=10)
ax.plot(
    x_samples[:, 0],
    x_samples[:, 1],
    '.',
    markersize=2
)
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
sns.despine(trim=True)
[Figure: contours of the PDF along with samples generated by scipy.stats.]

Questions#

  • Rerun the steps above after moving \(\boldsymbol{\mu}\) to \((0, 1)\). Observe how the contours of the PDF move.

  • Rerun the steps above for \(\Sigma_{12} = \Sigma_{21} = 0.1\). Observe how the contours of the PDF change.

  • Rerun the steps above for \(\Sigma_{12} = \Sigma_{21} = -0.9\). Observe how the contours of the PDF change.