import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import seaborn as sns
sns.set_context("paper")
sns.set_style("ticks")

import jax.numpy as jnp
from jax import lax, vmap
from jax.scipy.stats import multivariate_normal
import jax.random as jrandom

import numpy as np

key = jrandom.PRNGKey(123)

Metropolis-Hastings with BlackJax#

Let’s use the Metropolis-Hastings algorithm to sample from a “banana”-shaped distribution (defined in section 5.1.3 of Wang et al.).
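
Reading off the code below, the banana density comes from pushing a correlated Gaussian through a nonlinear change of variables; its log-density is

\[ \log p(x_1, x_2) = \log \mathcal{N}\!\left( \begin{bmatrix} x_1 / a \\ a \left( x_2 - b \left( x_1^2 / a^2 + a^2 \right) \right) \end{bmatrix} \;\middle|\; \mathbf{0}, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \right) \]

with \(a = 1.15\), \(b = 0.5\), and \(\rho = 0.5\). (The Jacobian of this change of variables has unit determinant, so no correction term is needed.)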

Here is the probability density of the banana distribution:

def banana_logdensity(x, a=1.15, b=0.5, rho=0.5):
    """A banana-shaped distribution. Comes from a nonlinear transformation of a correlated Gaussian."""
    x1, x2 = x
    u1 = x1/a
    u2 = a*(x2 - b*(u1**2 + a**2))
    return multivariate_normal.logpdf(jnp.array([u1, u2]), jnp.zeros(2), jnp.array([[1, rho], [rho, 1]]))

def plot_2d_function(f, alpha=1.0, plot_type="pcolormesh", ax=None, levels=None):
    x = jnp.linspace(-4, 4, 100)
    y = jnp.linspace(-2, 7, 100)
    X, Y = jnp.meshgrid(x, y)
    Z = vmap(f)(jnp.stack([X.flatten(), Y.flatten()], axis=1)).reshape(X.shape)

    if ax is None:
        _, ax = plt.subplots()
    if plot_type == "contour":
        ax.contour(X, Y, jnp.exp(Z), alpha=alpha, cmap="Greens", levels=levels)
    elif plot_type == "pcolormesh":
        ax.pcolormesh(X, Y, jnp.exp(Z), alpha=alpha)
    ax.set_xticks([-2, 0, 2])
    ax.set_yticks([0, 2, 4])
    ax.set_aspect("equal")
    sns.despine(trim=True, left=True, bottom=True)
    return ax
plot_2d_function(banana_logdensity);

Metropolis-Hastings recap#

Here is how to set up a Metropolis-Hastings sampler in BlackJax:

import blackjax

# Pick a proposal distribution
def proposal_generator(key, x, sigma=1.0):
    """A Gaussian random walk proposal."""
    return x + sigma*jrandom.normal(key, shape=(2,))

# Create a Rosenbluth-Metropolis-Hastings sampler
rmh = blackjax.rmh(logdensity_fn=banana_logdensity, proposal_generator=proposal_generator)
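
Conceptually, each step of this sampler proposes a new point \(x'\) and accepts it with probability \(\min\{1, p(x')/p(x)\}\) (this simple form holds for a symmetric proposal such as the Gaussian random walk above; asymmetric proposals need a Hastings correction). Here is a minimal sketch of one such step in plain JAX, just to illustrate the idea (not BlackJax's actual implementation):

def rw_metropolis_step(key, x, logdensity_fn, sigma=1.0):
    """One random-walk Metropolis step (illustrative sketch only)."""
    key_prop, key_accept = jrandom.split(key)
    # Propose a new point from a symmetric Gaussian random walk
    x_prop = x + sigma * jrandom.normal(key_prop, shape=x.shape)
    # Log acceptance ratio; the symmetric proposal terms cancel
    log_alpha = logdensity_fn(x_prop) - logdensity_fn(x)
    # Accept with probability min(1, exp(log_alpha))
    accept = jnp.log(jrandom.uniform(key_accept)) < log_alpha
    return jnp.where(accept, x_prop, x)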

And here is how the sampling goes. First, start at some point in the parameter space:

def plot_mcmc_samples_demo(samples):
    def plot_next_sample(x0, y0, x1, y1, color, alpha):
        ax.annotate('',
            xytext=(x0, y0),
            xy=(x1, y1),
            arrowprops=dict(arrowstyle="->", color=color, alpha=alpha),
            size=12,
        )
    def plot_point(x, y, color):
        ax.plot(x, y, 'o', color=color, ms=4, zorder=10)

    ax = plot_2d_function(banana_logdensity, alpha=0.4)
    alphas = np.linspace(0.1, 1, samples.shape[0])
    x1, y1 = None, None
    for i in range(samples.shape[0]):
        x1, y1 = samples[i]
        if i > 0:
            x0, y0 = samples[i-1]
            if (x0, y0) != (x1, y1):
                plot_next_sample(x0, y0, x1, y1, "black", alphas[i])
        plot_point(x1, y1, "tab:blue")
    # c = "green" if is_accepted[-1] else "red"
    # plot_next_sample(x1, y1, proposals[-1, 0], proposals[-1, 1], c, 1.0)
    # plot_point(proposals[-1, 0], proposals[-1, 1], c)
    proposal_logdensity = lambda x: multivariate_normal.pdf(x, jnp.array([x1, y1]), jnp.eye(2))
    ax = plot_2d_function(proposal_logdensity, alpha=0.4, plot_type="contour", ax=ax, levels=5)
# Initialize the sampler state
init_state = rmh.init(jnp.array([0.0, 0.0]))

plot_mcmc_samples_demo(init_state.position.reshape(1, -1))

The green lines are the contours of the proposal distribution. The next step is to propose a new point by sampling from this proposal distribution. The proposed point is then either accepted or rejected, and we repeat this process over and over. Let’s run the chain for a few steps:

state = init_state
prelim_samples = [init_state.position]
for i in range(10):
    key, subkey = jrandom.split(key)
    state, info = rmh.step(subkey, state)
    prelim_samples.append(state.position)
prelim_samples = jnp.stack(prelim_samples)

plot_mcmc_samples_demo(prelim_samples)

Let’s do it for real now. We’ll run 5 MCMC chains, each with 400 steps:

def step(state, _):
    """A single step of the Metropolis-Hastings sampler. Used with `lax.scan`."""
    key, kernel_state = state
    key, subkey = jrandom.split(key)
    kernel_state, info = rmh.step(subkey, kernel_state)
    return (key, kernel_state), (kernel_state.position, info)

def run_mcmc_chain(key, init_state, num_samples):
    """Run a chain of MCMC."""
    _, (samples, info) = lax.scan(step, (key, init_state), None, length=num_samples)
    return samples, info

num_chains = 5
num_samples_per_chain = 400

key, key_run, key_init = jrandom.split(key, 3)
keys = jrandom.split(key_run, num_chains)
init_state_spread = 5.0
init_state = vmap(rmh.init)(init_state_spread*jrandom.normal(key_init, (num_chains, 2)))
samples, info = vmap(run_mcmc_chain, in_axes=(0, 0, None))(keys, init_state, num_samples_per_chain)

Let’s make the trace plot (with the help of the arviz library):

import arviz as az
az.plot_trace(np.array(samples), compact=False, backend_kwargs=dict(figsize=(8,4), tight_layout=True));

Assessing convergence#

In general, it takes some time for the chains to converge to the target distribution. The samples gathered while the chain has not yet converged are called “warm-up” or “burn-in” samples.

But how many samples should we discard as burn-in, i.e., how long does it take for the chains to converge? One common diagnostic to help answer this question is the split-\(\hat{R}\) (potential scale reduction factor). It is defined as

\[ \hat{R} = \sqrt{\frac{\hat{V}}{W}} \]

where \(W\) is the average within-chain variance and \(\hat{V}\) is an estimate of the marginal posterior variance, obtained by combining the within-chain and between-chain variances. If the chains have converged and mixed well, \(\hat{R}\) should be close to 1.
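
For reference, here is the classic Gelman-Rubin construction behind these quantities (the “split” variant simply cuts each chain in half and treats the halves as separate chains). With \(M\) chains of \(N\) draws each, chain means \(\bar{x}_m\), chain variances \(s_m^2\), and overall mean \(\bar{x}\):

\[ W = \frac{1}{M} \sum_{m=1}^{M} s_m^2, \qquad B = \frac{N}{M-1} \sum_{m=1}^{M} \left(\bar{x}_m - \bar{x}\right)^2, \qquad \hat{V} = \frac{N-1}{N} W + \frac{1}{N} B \]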

Let’s see how \(\hat{R}\) evolves as we take more and more MCMC steps:

compute_diagnostics_every = 10
rhats = []
for i in range(2, num_samples_per_chain, compute_diagnostics_every):
    rhat = blackjax.diagnostics.potential_scale_reduction(samples[:, :i])
    rhats.append(rhat)
rhats = jnp.array(rhats)
fig, ax = plt.subplots(figsize=(5,4))
ax.plot(range(2, num_samples_per_chain, compute_diagnostics_every), rhats[:,0], label=r"$\hat{R}$ for $x_1$")
ax.plot(range(2, num_samples_per_chain, compute_diagnostics_every), rhats[:,1], label=r"$\hat{R}$ for $x_2$")
ax.axhline(1.0, color="black", linestyle="--")
ax.set_xlabel("Number of samples")
ax.set_ylabel(r"$\hat{R}$")
ax.legend()
sns.despine(trim=True)

This plot suggests that the chains have not converged within the first 100 or so samples. These are burn-in samples and should be discarded.

Dealing with correlated samples#

Another issue with our samples is that they are correlated: the \((i+1)^\text{th}\), \((i+2)^\text{th}\), \(\dots\) samples are not independent of the \(i^\text{th}\) sample. If we want approximately independent samples, we need to throw away most of them and keep only every \(n\)-th sample. This is called thinning.
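
A standard way to quantify this correlation is the effective sample size (ESS). For \(M\) chains of \(N\) samples each, it is commonly defined as

\[ \text{ESS} = \frac{M N}{1 + 2 \sum_{t=1}^{\infty} \hat{\rho}_t} \]

where \(\hat{\rho}_t\) is an estimate of the lag-\(t\) autocorrelation of the chain (in practice, the sum is truncated once the autocorrelation estimates become dominated by noise).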

But by how much should we thin? The ESS helps answer this question. Let’s plot the ESS as the number of samples increases:

n_effs = []
for i in range(2, num_samples_per_chain, compute_diagnostics_every):
    n_eff = blackjax.diagnostics.effective_sample_size(samples[:, :i])
    n_effs.append(n_eff)
n_effs = jnp.array(n_effs)
fig, ax = plt.subplots(figsize=(5,4))
ax.plot(range(2, num_samples_per_chain, compute_diagnostics_every), n_effs[:,0], label=r"ESS for $x_1$")
ax.plot(range(2, num_samples_per_chain, compute_diagnostics_every), n_effs[:,1], label=r"ESS for $x_2$")
ax.set_xlabel("Number of samples")
ax.set_ylabel("Effective sample size")
ax.legend()
sns.despine(trim=True)

After 400 MCMC steps per chain, the ESS tells us roughly how many independent samples our correlated draws are worth, and hence by how much we should thin them.
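
One rough heuristic (an assumption on our part, not a hard rule) is to thin by about the ratio of the total number of draws to the ESS:

# Illustrative heuristic: thin by roughly (total draws) / ESS
final_ess = blackjax.diagnostics.effective_sample_size(samples)
total_draws = num_chains * num_samples_per_chain
print("ESS per dimension:  ", final_ess)
print("Suggested thinning: ", total_draws / final_ess)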

Now that we’ve diagnosed our chains, let’s plot the “true” samples (i.e., after thinning and removing burn-in):

# The original shape of the `samples` array is (n_chains, n_samples, n_dim)
burn_in = 100  # Discard the first 100 samples of each chain
thin = 4  # Keep only every 4th remaining sample
true_samples = samples[:, burn_in::thin]

# Concatenate the chains. Final shape is (n_chains * n_true_samples_per_chain, n_dim)
true_samples = true_samples.reshape(-1, 2)

ax = plot_2d_function(banana_logdensity, alpha=0.4);
ax.set_title("MCMC samples from \nthe original distribution", fontsize=16)
ax.scatter(true_samples[:, 0], true_samples[:, 1], s=4);

Indeed, these do look like samples from the banana distribution. MCMC was successful!
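
As an extra check, we can draw exact samples from the banana distribution by inverting the transformation used in banana_logdensity, and compare summary statistics with the MCMC samples (a minimal sketch; the exact numbers depend on the random seed):

def sample_banana(key, n, a=1.15, b=0.5, rho=0.5):
    """Draw exact samples by inverting the transformation in `banana_logdensity`."""
    cov = jnp.array([[1.0, rho], [rho, 1.0]])
    u = jrandom.multivariate_normal(key, jnp.zeros(2), cov, shape=(n,))
    x1 = a * u[:, 0]
    x2 = u[:, 1] / a + b * (u[:, 0] ** 2 + a ** 2)
    return jnp.stack([x1, x2], axis=1)

key, key_exact = jrandom.split(key)
exact_samples = sample_banana(key_exact, 10_000)
print("Exact mean:", exact_samples.mean(axis=0))
print("MCMC mean: ", true_samples.mean(axis=0))

If MCMC worked, the two means should be close.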

Questions#

  • Play with the proposal distribution. Change the sigma parameter to make the proposal distribution narrower or wider. How does this affect the evolution of \(\hat{R}\) and the ESS?

  • Try a proposal distribution that is independent of the current state \(x\). What happens if this proposal distribution does not cover the high-probability region of the target distribution well?

  • Play with the starting point. Try starting the chains from points that are farther away from the mode of the distribution (e.g., by modifying the init_state_spread parameter). How does starting from a “bad” point affect the convergence? How can you tell when you’ve started at a “bad” point?