Sampling Methods#

For more information, see Chapter 11 of [Bishop, 2006].

Problem Definition#

We have seen that the Bayesian formulation of inverse problems results in intractable posterior distributions. In particular, these posteriors are known only up to a normalization constant. In the next series of lectures, we will develop methodologies that allow us to sample from these distributions. The most celebrated of these methodologies is Markov Chain Monte Carlo (MCMC), which has at its core the Metropolis algorithm. There is a long way to go, but we can immediately state the problem and outline what we plan to do.

Without loss of generality, let \(X\in\mathcal{X}\subset\mathbb{R}^d\) be a random variable with an arbitrary probability density, say \(\pi(x)\), known only up to a normalization constant. That is, we have that:

\[ \pi(x) = \frac{h(x)}{Z}, \]

where \(h(x)\) is a known function that we can evaluate at will, but \(Z\) is unknown. We aim to generate samples from \(\pi(x)\) by only evaluating \(h(x)\). The revolutionary idea of Metropolis was to construct a stochastic process, using only evaluations of \(h(x)\), whose samples resemble (in a sense we will make precise below) samples from \(\pi(x)\). To understand the details, we will have to introduce some key concepts.
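To make the setting concrete, here is a minimal sketch in Python (NumPy assumed) of an unnormalized density \(h(x)\) that we can evaluate pointwise while treating \(Z\) as unknown; the specific two-bump form is an illustrative choice, not part of the original problem statement.

```python
import numpy as np

def h(x):
    """An unnormalized density: two Gaussian bumps of different weights.

    We can evaluate h(x) anywhere, but the normalization constant
    Z (the integral of h over the real line) is treated as unknown.
    """
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

# We can query h at will...
print(h(0.0))
# ...but pi(x) = h(x) / Z cannot be evaluated without knowing Z.
```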

Markov Chains#

Definition#

Let \(X_n, n=1,2,\dots\) be a stochastic process taking values in \(\mathcal{X}\subset\mathbb{R}^d\), which could be discrete or continuous. We will refer to \(n\) as the time step. We say that this stochastic process is a Markov chain if the evolution of \(X_{n+1}\) depends only on the value of \(X_n\) and not on the history of the process. Let us define this a little bit more rigorously.

Let \(x_1,\dots,x_n\in\mathcal{X}\) be the observed values of the process up to \(n\)-th time step. The Markov property can now be expressed as:

\[ p(X_{n+1}=x_{n+1}|X_1=x_1,\dots,X_n=x_n) = p(X_{n+1}=x_{n+1}|X_n=x_n). \]

We will simplify the notation by dropping the capital letters. That is, we will be writing:

\[ p(x_{n+1}|x_1,\dots,x_n) = p(x_{n+1}|x_n). \]

To simplify things even further, we will also use the collective notation:

\[ x_{1:n} = (x_1,\dots,x_n)\in\mathcal{X}^n. \]

With this notation, we can re-write the Markov property in even simpler terms:

\[ p(x_{n+1}|x_{1:n}) = p(x_{n+1}|x_n). \]

The joint distribution of a Markov chain#

The joint distribution is defined as:

\[ p(x_{1:n}) := P(X_1=x_1,\dots,X_n=x_n). \]

If \(X_n\) is a Markov chain, then we have:

\[ p(x_{1:n}) = p(x_1)p(x_2|x_1)\dots p(x_n|x_{n-1}), \]

or

\[ p(x_{1:n}) = p(x_1)\prod_{t=2}^np(x_t|x_{t-1}). \]

So, we know the joint distribution of a Markov chain if we know the initial distribution \(p(x_1)\) and the probability of hopping from one state to the next. This hopping probability is known as the transition kernel of the Markov chain.
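As a quick illustration of this factorization, here is a sketch in Python (NumPy assumed) for a hypothetical three-state chain: a path is simulated using only the initial distribution and the transition kernel, and its joint probability is computed from the product formula above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state chain: p0 is the initial distribution and
# P[i, j] is the probability of hopping from state i to state j.
p0 = np.array([0.5, 0.3, 0.2])
P = np.array([[0.9, 0.05, 0.05],
              [0.1, 0.8,  0.1 ],
              [0.2, 0.2,  0.6 ]])

def simulate(n):
    """Simulate x_1, ..., x_n using only p0 and the transition kernel P."""
    x = [rng.choice(3, p=p0)]
    for _ in range(n - 1):
        x.append(rng.choice(3, p=P[x[-1]]))
    return x

def joint_prob(x):
    """p(x_{1:n}) = p(x_1) * prod_{t=2}^n p(x_t | x_{t-1})."""
    prob = p0[x[0]]
    for prev, curr in zip(x[:-1], x[1:]):
        prob *= P[prev, curr]
    return prob

path = simulate(5)
print(path, joint_prob(path))
```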

Transition Kernel#

To describe a Markov chain, we only need to know the transition kernel. The transition kernel gives the probability of moving from one state to any other at a given step. Mathematically, the transition kernel of the \(n\)-th step is the function:

\[ T_n:\mathcal{X}\times \mathcal{X}\rightarrow \mathbb{R}_+, \]

defined by:

\[ T_n(x_n, x_{n+1}) = P(X_{n+1}=x_{n+1}|X_n=x_n). \]

In words, \(T_n(x_n, x_{n+1})\) is the probability that at step \(n\) we jump from state \(x_n\) to state \(x_{n+1}\).

Please note that the transition kernel generally depends on the time step \(n\). We say that the Markov chain is stationary (or time-homogeneous) if the transition kernel does not depend on \(n\), i.e., if

\[ T_n(x_n, x_{n+1}) = T(x_n,x_{n+1}). \]

From now on, we will only consider stationary Markov chains. For stationary Markov chains, and when there is no ambiguity, we will be writing:

\[ T(x_n,x_{n+1}) = p(x_{n+1}|x_n). \]

Invariant Distributions#

Let \(X_1,X_2,\dots\) be a Markov chain with transition kernel \(T(x,x')\) and \(\pi(x)\) be a probability density. We say that the Markov chain leaves \(\pi(x)\) invariant if:

\[ \pi(x) = \int \pi(x')T(x',x)dx'. \]

In other words, \(\pi(x)\) is invariant if, when you start with a sample from it and take one step of the Markov chain, you still have a sample from it.

Invariance is one of the key requirements of a working MCMC algorithm. Whatever you do, the chain you construct must be invariant with respect to the distribution from which you want to sample. Once you get one sample from your distribution, you can get as many as you want by following the transition kernel.

The Detailed Balance Condition#

Checking invariance is not trivial for a generic Markov chain. However, there is a sufficient condition that guarantees invariance. This condition is known as the detailed balance condition and it is:

\[ \pi(x)T(x,x') = \pi(x')T(x',x). \]

A Markov chain that satisfies the detailed balance condition is reversible in the following sense: the probability of drawing \(x\) from \(\pi\) and transitioning to \(x'\) is the same as the probability of drawing \(x'\) and transitioning to \(x\).

If the detailed balance condition is satisfied, then \(\pi(x)\) is an invariant distribution:

\[ \int \pi(x')T(x',x)dx' = \int \pi(x)T(x,x')dx' = \pi(x)\int T(x,x')dx' = \pi(x), \]

since

\[ \int T(x,x') dx' = \int p(x'|x)dx' = 1. \]

The reverse does not necessarily hold. The key idea of Metropolis was to construct a Markov chain that satisfies the detailed balance condition for the distribution you are interested in.
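The implication "detailed balance implies invariance" is easy to verify numerically for a discrete chain. The sketch below (Python with NumPy, illustrative three-state target) builds a Metropolis-style transition matrix, checks detailed balance, and confirms that \(\pi\) is left invariant.

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])     # target distribution over 3 states
Q = np.full((3, 3), 1.0 / 3.0)     # symmetric proposal matrix

# Metropolis-style kernel: accept the jump i -> j with probability
# min(1, pi[j] / pi[i]); the rejected probability mass stays at i.
T = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            T[i, j] = Q[i, j] * min(1.0, pi[j] / pi[i])
    T[i, i] = 1.0 - T[i].sum()

# Detailed balance: pi[i] T[i, j] == pi[j] T[j, i] for all i, j.
flux = pi[:, None] * T
print(np.allclose(flux, flux.T))   # True

# Invariance follows: pi T == pi.
print(np.allclose(pi @ T, pi))     # True
```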

Ergodicity#

A Markov chain may have no invariant distribution (e.g., the random walk does not have one), exactly one, or more than one. We need ergodicity to guarantee the uniqueness of the invariant distribution. Defining ergodicity precisely for continuous-state Markov chains requires machinery we will not develop here; informally, ergodicity requires that the random variable \(X_n\) converges in distribution to \(\pi(x)\) irrespective of the starting point. This is challenging to show for a generic Markov chain. Fortunately, we know that a Markov chain is ergodic if:

  • It is aperiodic (i.e., it does not return to the same state at fixed intervals);

  • It is positive recurrent (i.e., the expected number of steps for returning to the same state is finite).

Equilibrium Distribution#

If a Markov chain is ergodic and has an invariant distribution, then that invariant distribution is unique and called the equilibrium distribution. The Metropolis algorithm constructs a Markov chain that has a desired equilibrium distribution.

The Metropolis Algorithm#

Now, let’s get back to the initial problem of sampling from:

\[ \pi(x) = \frac{h(x)}{Z}, \]

without knowing \(Z\). In 1953, Metropolis et al. demonstrated how to construct a Markov chain with \(\pi(x)\) as its equilibrium density. The algorithm is based on biasing an underlying symmetric, stationary Markov chain. Let \(T(x,x')\) be the transition kernel of this underlying Markov chain (also called the proposal distribution). The transition kernel must be symmetric, i.e.,

\[ T(x,x') = T(x',x). \]

A ubiquitous choice of the proposal distribution is the random walk transition kernel:

\[ T(x,x') = \mathcal{N}(x'|x, \Sigma). \]

However, this is just one possibility. Once we have picked a proposal, we construct the desired Markov chain as follows:

  • Initialization: Pick an arbitrary starting point \(x_0\).

  • For each time step \(n\):

    • Generation: Sample a candidate \(x\) from \(T(x_n, x)\).

    • Calculation: Calculate the acceptance ratio:

    \[ \alpha(x_n, x) = \min\left\{1, \frac{h(x)}{h(x_n)}\right\}. \]

    This is the only place where you may need to evaluate the underlying model.

    • Accept/Reject:

      • Generate a uniform number \(u\sim \mathcal{U}([0,1])\).

      • If \(u\le \alpha\), accept and set \(x_{n+1}=x\).

      • If \(u > \alpha\), reject and set \(x_{n+1} = x_n\).
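Here is a minimal sketch of the algorithm in Python (NumPy assumed), using the Gaussian random-walk proposal mentioned above. Working with \(\log h\) instead of \(h\) is just a numerically safer way to compute the ratio \(h(x)/h(x_n)\); the function names, step size, and the two-bump example target are illustrative choices.

```python
import numpy as np

def metropolis(log_h, x0, n_steps, step_size=0.5, rng=None):
    """Random-walk Metropolis targeting pi(x) proportional to exp(log_h(x)).

    Only evaluations of h (through log_h) are needed; the normalization
    constant Z never appears.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    samples = np.empty((n_steps, x.size))
    log_h_x = log_h(x)
    for n in range(n_steps):
        # Generation: propose from the symmetric random-walk kernel.
        x_prop = x + step_size * rng.standard_normal(x.size)
        log_h_prop = log_h(x_prop)
        # Accept/Reject: alpha = min{1, h(x') / h(x)}, compared in log space.
        if np.log(rng.uniform()) <= log_h_prop - log_h_x:
            x, log_h_x = x_prop, log_h_prop
        samples[n] = x
    return samples

# Example: the two-bump unnormalized density from the earlier sketch.
log_h = lambda x: np.logaddexp(-0.5 * (x[0] - 2.0) ** 2,
                               np.log(0.5) - 0.5 * (x[0] + 2.0) ** 2)
samples = metropolis(log_h, x0=0.0, n_steps=5000)
print(samples.mean(), samples.std())
```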

Why Does Metropolis Work?#

It works because it gives us a Markov chain with the desired equilibrium distribution: the chain leaves the distribution of our choice invariant, and it is ergodic.

To show that \(\pi(x)\) is the invariant distribution of the Metropolis Markov chain, we will show that the latter satisfies the detailed balance condition. To this end, we need the transition kernel of the chain. The transition kernel \(K(x,x')\) gives the probability that the Metropolis chain moves from \(x\) to \(x'\). It is:

\[ K(x,x') = T(x,x')\alpha(x,x') + (1 - r(x))\delta(x' - x), \]

where \(T(x,x')\) is the transition kernel of the proposal distribution,

\[ \alpha(x,x') = \min\left\{1, \frac{h(x')}{h(x)}\right\} \]

is the acceptance ratio,

\[ r(x) = \int T(x, y)\alpha(x, y)dy, \]

is the probability of accepting any move, i.e., \(1 - r(x)\) is the probability of not accepting any move, and \(\delta(x' - x)\) is the Dirac delta centered at \(x\).

Let’s prove that the detailed balance holds for this transition kernel. The equation holds trivially for \(x = x'\) (even though we would have to interpret it slightly differently to be 100% rigorous). For \(x\not= x'\), we have:

\[\begin{split} \begin{array}{ccc} \pi(x) K(x, x') &=& \frac{h(x)}{Z} T(x,x')\alpha(x,x') \\ &=& \frac{h(x)}{Z}T(x,x')\min\left\{1, \frac{h(x')}{h(x)}\right\}\\ &=& \frac{h(x')}{h(x')}\frac{h(x)}{Z}T(x,x')\min\left\{1, \frac{h(x')}{h(x)}\right\}\\ &=& \frac{h(x')}{Z}T(x,x')\min\left\{\frac{h(x)}{h(x')},\frac{h(x)}{h(x')}\cdot\frac{h(x')}{h(x)}\right\}\\ &=& \pi(x')T(x,x')\min\left\{\frac{h(x)}{h(x')},1\right\}\\ &=& \pi(x')T(x,x')\alpha(x',x)\\ &=& \pi(x')T(x', x)\alpha(x',x)\\ &=& \pi(x')K(x',x), \end{array} \end{split}\]

In the second-to-last step, we also used the symmetry of the proposal, \(T(x,x') = T(x',x)\).

The Metropolis-Hastings Algorithm#

The Metropolis algorithm requires that the proposal kernel \(T(x,x')\) is symmetric. Hastings (1970) created an algorithm that does not require symmetric proposal kernels. The only thing that changes is the acceptance ratio. In every other regard, the algorithm is the same:

  • Initialization: Pick an arbitrary starting point \(x_0\).

  • For each time step \(n\):

    • Generation: Sample a candidate \(x\) from \(T(x_n, x)\).

    • Calculation: Calculate the acceptance ratio:

    \[ \alpha(x_n, x) = \min\left\{1, \frac{h(x)}{h(x_n)}\frac{T(x,x_n)}{T(x_n,x)}\right\}. \]

    This is the only place where you may need to evaluate the underlying model.

    • Accept/Reject:

      • Generate a uniform number \(u\sim \mathcal{U}([0,1])\).

      • If \(u\le \alpha\), accept and set \(x_{n+1}=x\).

      • If \(u > \alpha\), reject and set \(x_{n+1} = x_n\).
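To show where the extra ratio \(T(x,x_n)/T(x_n,x)\) enters, here is a small sketch in Python (NumPy assumed) with a deliberately non-symmetric, multiplicative log-normal proposal on \((0,\infty)\); the proposal, the Gamma-like target, and the function names are illustrative assumptions.

```python
import numpy as np

def mh_lognormal(log_h, x0, n_steps, sigma=0.5, rng=None):
    """Metropolis-Hastings on (0, inf) with a log-normal random-walk proposal.

    The proposal x' = x * exp(sigma * xi), xi ~ N(0, 1), is not symmetric,
    so the Hastings correction T(x', x) / T(x, x') = x' / x must be
    included in the acceptance ratio.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = float(x0)
    samples = np.empty(n_steps)
    for n in range(n_steps):
        x_prop = x * np.exp(sigma * rng.standard_normal())
        # log alpha = log h(x') - log h(x) + log T(x', x) - log T(x, x').
        log_alpha = log_h(x_prop) - log_h(x) + np.log(x_prop) - np.log(x)
        if np.log(rng.uniform()) <= log_alpha:
            x = x_prop
        samples[n] = x
    return samples

# Illustrative target: unnormalized Gamma(2, 1) density, h(x) = x * exp(-x).
log_h = lambda x: np.log(x) - x
samples = mh_lognormal(log_h, x0=1.0, n_steps=5000)
print(samples.mean())   # should be close to the Gamma(2, 1) mean of 2
```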

Why Does Metropolis-Hastings Work?#

It works because it gives us a Markov chain with the desired equilibrium distribution: the chain leaves the distribution of our choice invariant, and it is ergodic.

To show that \(\pi(x)\) is the invariant distribution of the Metropolis-Hastings Markov chain, we will show that the latter satisfies the detailed balance condition. To this end, we need the transition kernel of the chain. The transition kernel \(K(x,x')\) gives the probability that the Metropolis-Hastings chain moves from \(x\) to \(x'\). It is:

\[ K(x,x') = T(x,x')\alpha(x,x') + (1 - r(x))\delta(x' - x), \]

where \(T(x,x')\) is the transition kernel of the proposal distribution,

\[ \alpha(x,x') = \min\left\{1, \frac{h(x')}{h(x)}\frac{T(x',x)}{T(x,x')}\right\} \]

is the acceptance ratio,

\[ r(x) = \int T(x, y)\alpha(x, y)dy, \]

is the probability of accepting any move, i.e., \(1 - r(x)\) is the probability of not accepting any move, and \(\delta(x' - x)\) is the Dirac delta centered at \(x\).

Let’s prove that the detailed balance holds for this transition kernel. The equation holds trivially for \(x = x'\) (even though we would have to interpret it slightly differently to be 100% rigorous). For \(x\not= x'\), we have:

\[\begin{split} \begin{array}{ccc} \pi(x) K(x, x') &=& \frac{h(x)}{Z} T(x,x')\alpha(x,x') \\ &=& \frac{h(x)}{Z}T(x,x')\min\left\{1, \frac{h(x')}{h(x)}\frac{T(x',x)}{T(x,x')}\right\}\\ &=& \frac{h(x')}{h(x')}\frac{T(x',x)}{T(x',x)}\frac{h(x)}{Z}T(x,x')\min\left\{1, \frac{h(x')}{h(x)}\frac{T(x',x)}{T(x,x')}\right\}\\ &=& \frac{h(x')}{Z}T(x',x)\min\left\{\frac{h(x)}{h(x')}\frac{T(x,x')}{T(x',x)},\frac{h(x)}{h(x')}\frac{T(x,x')}{T(x',x)}\cdot\frac{h(x')}{h(x)}\frac{T(x',x)}{T(x,x')}\right\}\\ &=& \pi(x')T(x',x)\min\left\{\frac{h(x)}{h(x')}\frac{T(x,x')}{T(x',x)},1\right\}\\ &=& \pi(x')T(x',x)\alpha(x',x)\\ &=& \pi(x')K(x',x). \end{array} \end{split}\]

The Metropolis-Hastings Algorithm is Not One Algorithm!#

You get a different MH algorithm for every choice of proposal \(T(x,x')\). This is exceptionally empowering since you can construct an MH algorithm that best exploits the structure of your problem. Over the years, several special cases have been developed that are extremely useful. We enumerate a few:

  • Metropolis Adjusted Langevin Dynamics: This algorithm uses gradient information to push your chain towards highly probable states.

  • Hybrid Monte Carlo: This method associates the random variables you have with the generalized coordinates of a hypothetical physical system and the negative log of the probability density you want to sample from with fictitious energy. The proposal follows the hypothetical Hamiltonian dynamics with randomly sampled (fake) velocities. This moves you to low-energy states associated with high probabilities.

  • Riemannian Manifold Hamiltonian Monte Carlo: One of the most advanced algorithms. It builds on HMC, exploiting the Riemannian structure of the parameter space to adapt to local features automatically.

Metropolis Adjusted Langevin Dynamics (MALA)#

Understanding Langevin dynamics requires familiarity with Itô calculus. The math is very advanced, but we will do our best to explain what is happening intuitively. Remember that we want to sample from:

\[ \pi(x) = \frac{h(x)}{Z}, \]

where \(Z\) is unknown.

Consider the stochastic differential equation (Itô diffusion):

\[ \dot{X}_t = -\nabla V(X_t) + \sqrt{2}\dot{W}_t, \]

where the time is continuous, and \(W_t\) is a Brownian motion. This is called the Langevin equation. Intuitively, think of \(X_t\) as the position of a particle that wants to move towards regions of low potential energy \(V(x)\), but it is bombarded continuously by random forces. We want to pick a \(V(x)\) that will force this fictitious particle to move towards regions of high \(h(x)\). This can be done in many ways, but let us take:

\[ V(x) = -\log h(x), \]

because we already know the answer! Using the theory of stochastic differential equations, one can show that the distribution of \(X_t\), say \(\rho_t\), converges to a stationary distribution \(\rho_\infty\). Well, it turns out that:

\[ \rho_\infty(x) \propto h(x), \]

i.e.,

\[ \rho_\infty = \pi. \]

Here is the idea: Simulate the Langevin equation for a long time, and you should get a sample from \( \pi \).

The only issue is that you cannot get exact sample paths from the Langevin equation. You have to use a time discretization scheme. The simplest scheme is the Euler-Maruyama method (a generalization of the Euler method from ODEs to stochastic differential equations). You fix a time step \(\Delta t > 0\) and you take:

\[ X_{n+1} = X_n + \Delta t \nabla \log h(X_n) + \sqrt{2\Delta t}Z_n, \]

where the \(Z_n\sim \mathcal{N}(0,I_d)\) are independent. This is a discrete-time Markov chain with a non-symmetric transition kernel:

\[ T(x,x') = \mathcal{N}\left(x'|x + \Delta t\nabla \log h(x), 2\Delta t I_d\right) \propto \exp\left\{-\frac{\parallel x+\Delta t\nabla \log h(x)-x'\parallel_2^2}{4\Delta t}\right\}. \]

In the \(\Delta t\rightarrow 0\) limit, you will get exact sample paths and samples from \(\pi(x)\). Unfortunately, for finite \(\Delta t\), you will converge to a perturbed version of \(\pi(x)\). Fortunately, you can use \(T(x,x')\) as the proposal kernel of a Metropolis-Hastings algorithm. In other words, you can Metropolize the discretized version of the Langevin equation. Then, your resulting Markov chain will satisfy the detailed balance for the right probability density, and you are all set.
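A minimal MALA sketch in Python (NumPy assumed) follows: the Euler-Maruyama step above serves as the proposal, and the Metropolis-Hastings correction uses the Gaussian proposal density written above (its normalization constant cancels in the acceptance ratio). The standard-normal target, step size, and function names are illustrative assumptions.

```python
import numpy as np

def mala(log_h, grad_log_h, x0, n_steps, dt=0.1, rng=None):
    """Metropolis-adjusted Langevin algorithm (sketch).

    Proposal: x' ~ N(x + dt * grad log h(x), 2 * dt * I), i.e. one
    Euler-Maruyama step of the Langevin dynamics, followed by a
    Metropolis-Hastings accept/reject so the chain targets pi = h / Z.
    """
    rng = np.random.default_rng() if rng is None else rng

    def log_q(x_to, x_from):
        # Log density of the non-symmetric Langevin proposal, up to a
        # constant that cancels in the acceptance ratio.
        mean = x_from + dt * grad_log_h(x_from)
        return -np.sum((x_to - mean) ** 2) / (4.0 * dt)

    x = np.atleast_1d(np.asarray(x0, dtype=float))
    samples = np.empty((n_steps, x.size))
    for n in range(n_steps):
        x_prop = x + dt * grad_log_h(x) + np.sqrt(2.0 * dt) * rng.standard_normal(x.size)
        log_alpha = (log_h(x_prop) - log_h(x)
                     + log_q(x, x_prop) - log_q(x_prop, x))
        if np.log(rng.uniform()) <= log_alpha:
            x = x_prop
        samples[n] = x
    return samples

# Illustrative target: standard normal, h(x) proportional to exp(-x^2 / 2).
log_h = lambda x: -0.5 * np.sum(x ** 2)
grad_log_h = lambda x: -x
samples = mala(log_h, grad_log_h, x0=np.array([3.0]), n_steps=5000)
print(samples.mean(), samples.std())
```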

Combining Transition Kernels#

Now, let’s get back to the original problem of sampling from:

\[ \pi(x) = \frac{h(x)}{Z}. \]

Assume that you have \(m\) different Markov kernels, \(K_1(x,x'), \dots, K_m(x,x')\) (which could be Metropolis-Hastings kernels) that leave \(\pi(x)\) invariant. Then, the Markov kernel that consists of applying these kernels in order also leaves \(\pi(x)\) invariant. This kernel is:

\[ K(x,x') = \int K_1(x, x_1)K_2(x_1, x_2)\dots K_m(x_{m-1}, x')dx_1dx_2\dots dx_{m-1}. \]

The proof is trivial:

\[\begin{split} \begin{array}{ccc} \int \pi(x) K(x, x') dx &=& \int \pi(x) \int K_1(x, x_1)K_2(x_1, x_2)\dots K_m(x_{m-1}, x')dx_1dx_2\dots dx_{m-1} dx\\ &=& \int \left(\int \pi(x) K_1(x, x_1) dx \right) K_2(x_1, x_2)\dots K_m(x_{m-1}, x')dx_1dx_2\dots dx_{m-1} \\ &=& \int \pi(x_1) K_2(x_1, x_2)K_3(x_2, x_3)\dots K_m(x_{m-1}, x')dx_1dx_2\dots dx_{m-1}\\ &=& \int \left(\int \pi(x_1) K_2(x_1, x_2)dx_1\right)K_3(x_2, x_3)\dots K_m(x_{m-1}, x')dx_2dx_3\dots dx_{m-1}\\ &=& \int \pi(x_2)K_3(x_2, x_3)\dots K_m(x_{m-1}, x')dx_2dx_3\dots dx_{m-1}\\ &=& \dots\\ &=& \pi(x'). \end{array} \end{split}\]

So, if you have many Metropolis-Hastings kernels, or any other kernels really, you can combine them all together in arbitrary ways. You will still be getting samples from the target distribution.
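Here is a short sketch in Python (NumPy assumed) of composing two kernels that leave the same \(\pi\) invariant: a small-step and a large-step random-walk Metropolis kernel applied in order at every iteration. The target and step sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
log_h = lambda x: -0.5 * x ** 2    # illustrative target: standard normal

def metropolis_step(x, step_size):
    """One random-walk Metropolis step; leaves pi = exp(log_h) / Z invariant."""
    x_prop = x + step_size * rng.standard_normal()
    if np.log(rng.uniform()) <= log_h(x_prop) - log_h(x):
        return x_prop
    return x

x = 0.0
samples = []
for _ in range(5000):
    # Apply the kernels in order: K_1 (small steps), then K_2 (large steps).
    x = metropolis_step(x, step_size=0.2)
    x = metropolis_step(x, step_size=2.0)
    samples.append(x)
print(np.mean(samples), np.std(samples))
```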

Gibbs Sampler#

The Gibbs sampler is based on combining kernels that each operate on a group of components of your random variable and exploit the availability of the corresponding conditional distributions. For example, assume that \(x\) consists of \(m\) groups:

\[ x = (x_{g1}, \dots, x_{gm}). \]

Note that each group may consist of more than one variable. That is, the number of groups \(m\le d\).

Let \(x_{gi}\) denote the \(i\)-th group of variables, \(i=1,\dots,m\), and

\[ x_{g,-i} = \left(x_{g1},\dots,x_{g,i-1},x_{g,i+1},\dots,x_{gm}\right), \]

all groups except \(i\). Of course, we have that:

\[ x = (x_{gi}, x_{g,-i}). \]

To implement a Gibbs sampler, we need the ability to sample from the conditional probability densities:

\[ \pi(x_{gi} | x_{g,-i}) = \frac{\pi(x_{gi},x_{g,-i})}{\pi(x_{g,-i})}. \]

If we can sample easily from this distribution, we are all set; we then say that we are using an exact Gibbs sampler. If exact sampling from the conditional is not possible, we can construct a Metropolis-Hastings kernel that targets it (you just think of \(x_{g,-i}\) as given when using this kernel). Then, we say that we are using an approximate Gibbs sampler.

There are many possible versions of the Gibbs sampler, depending on how we select which conditional to sample from next. The simplest version is the following, in which we sweep through all the conditionals in order:

  • Initialize the sampler:

\[ x_0 = (x_{0g1},\dots,x_{0gm}). \]
  • For steps \(t=1,2,\dots\) do:

    • Set:

      \[ x_{t} \leftarrow x_{t-1}. \]
    • For each group \(i=1,\dots,m\), sample from the conditional of group \(i\):

      \[ x_{tgi} \sim \pi(\cdot|x_{tg,-i}). \]

The (approximate) Gibbs sampler has various flavors, each with pros and cons. For example, another common approach is to select \(i\) at random for each step \(t\).
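As a sketch of the exact case, here is a Gibbs sampler in Python (NumPy assumed) for a zero-mean bivariate Gaussian with correlation \(\rho\), where each group is a single coordinate and both conditionals are known Gaussians; the target and the grouping are illustrative assumptions.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_steps, rng=None):
    """Exact Gibbs sampler for a zero-mean bivariate normal with correlation rho.

    The conditionals are x1 | x2 ~ N(rho * x2, 1 - rho^2) and
    x2 | x1 ~ N(rho * x1, 1 - rho^2), so each group can be sampled exactly.
    """
    rng = np.random.default_rng() if rng is None else rng
    x1, x2 = 0.0, 0.0
    samples = np.empty((n_steps, 2))
    s = np.sqrt(1.0 - rho ** 2)
    for t in range(n_steps):
        # Sweep through the groups in order, conditioning on the rest.
        x1 = rho * x2 + s * rng.standard_normal()
        x2 = rho * x1 + s * rng.standard_normal()
        samples[t] = x1, x2
    return samples

samples = gibbs_bivariate_normal(rho=0.9, n_steps=5000)
print(np.corrcoef(samples.T)[0, 1])   # should be close to 0.9
```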