import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import seaborn as sns
sns.set_context("paper")
sns.set_style("ticks");


Posterior Predictive Checking#

Assume that we have built a model using some data, say \(x_{1:n}\). Now suppose that you repeat the experiment under the same conditions. What data does your model tell you that you would observe? The posterior predictive distribution of the replicated data \(x^{\text{rep}}_{1:n}\) is simply:

\[ p(x^{\text{rep}}_{1:n}|x_{1:n}) = \int p(x^{\text{rep}}_{1:n}|\theta)p(\theta|x_{1:n})d\theta, \]

where \(p(x^{\text{rep}}_{1:n}|\theta)\) is just the likelihood and \(p(\theta|x_{1:n})\) is the posterior. The idea of posterior predictive checking is to repeatedly sample \(x^{\text{rep}}_{1:n}\) and compare its characteristics to those of the actual data. You may reject a model that performs very poorly under predictive checking. However, you cannot accept a model simply because it performs very well; that requires more involved methods, such as Bayesian model selection. Predictive checking is best suited for identifying bugs in your code and for developing ideas on how to extend a model so that it matches the data better.

Visual Inspections of Replicated Data#

The idea here is to simply sample \(x^{\text{rep}}_{1:n}\) and compare it visually to \(x_{1:n}\). Let’s see this in the coin toss example.

Demonstration with the Coin Toss Example#

Consider \(n\) coin-tosses with probability of heads (\(1\)) \(\theta\), i.e.,

\[ x_{i}|\theta \sim \operatorname{Ber}(\theta). \]

We put a uniform prior on \(\theta\):

\[ \theta\sim U([0,1]), \]

and we have already seen that the posterior is a Beta:

\[ \theta|x_{1:n} \sim \operatorname{Beta}\left(1 + \sum_{i=1}^nx_i, 1 + n - \sum_{i=1}^nx_i\right). \]

In this case, it is actually possible to obtain \(p(x^{\text{rep}}_{1:n} | x_{1:n})\) analytically, but we will not bother doing so. Instead, we can sample from it as follows:

  • First, sample a \(\theta\) from the posterior \(p(\theta|x_{1:n})\).

  • Second, sample an \(x^{\text{rep}}_{1:n}\) from the likelihood \(p(x^{\text{rep}}_{1:n}|\theta)\).

  • Repeat steps one and two as many times as needed. (A minimal sketch of a single replication is shown below.)
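
Here is a minimal sketch of a single replication. The Beta(3, 2) posterior below is a placeholder used purely for illustration; the actual posterior for our data is constructed further down.

import scipy.stats as st

# Placeholder posterior used only for this sketch; the real posterior
# for the coin-toss data is built below.
post = st.beta(3.0, 2.0)

# Step one: sample a theta from the posterior.
theta_sample = post.rvs()
# Step two: sample replicated observations from the likelihood.
x_rep_single = st.bernoulli(theta_sample).rvs(size=10)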

Let’s start with a synthetic example.

import scipy.stats as st

theta_true = 0.5
X = st.bernoulli(theta_true)

N = 50
data = X.rvs(size=N)
data
array([1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       1, 0, 1, 0, 0, 1])

The posterior is as follows:

alpha = 1.0 + data.sum()
beta = 1.0 + N - data.sum()
Theta_post = st.beta(alpha, beta)

fig, ax = plt.subplots()
thetas = np.linspace(0, 1, 100)
ax.plot(
    [theta_true],
    [0.0],
    'o',
    markeredgewidth=2,
    markersize=10,
    label='True value')
ax.plot(
    thetas,
    Theta_post.pdf(thetas),
    label=r'$p(\theta|x_{1:N})$'
)
ax.set_xlabel(r'$\theta$')
ax.set_ylabel('Probability density')
ax.set_title(f'$N={N}$')
plt.legend(loc='best', frameon=False)
sns.despine(trim=True);
[Figure: posterior density \(p(\theta|x_{1:N})\) with the true value of \(\theta\) marked]

Now that we have fitted the model, let's draw many replicated datasets \(x^{\text{rep}}_{1:n}\) from the posterior predictive distribution and compare them to the original dataset. We are going to write a function to do this because we will use it several times later.

def replicate_experiment(post_rv, n=N, n_rep=9):
    """Replicate the experiment.
    
    Arguments
    post_rv -- The random variable object corresponding to
               the posterior from which to sample.
    n       -- The number of observations.
    n_rep   -- The number of replications.
    
    Returns:
    A numpy array of size n_rep x n.
    """
    x_rep = np.empty((n_rep, n), dtype="i")
    for i in range(n_rep):
        # Step one: sample a theta from the posterior.
        theta_post_sample = post_rv.rvs()
        # Step two: sample n observations from the likelihood.
        x_rep[i, :] = st.bernoulli(
            theta_post_sample
        ).rvs(size=n)
    return x_rep

And here is how to use it:

x_rep = replicate_experiment(Theta_post)
x_rep
array([[1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
        0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1],
       [1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1,
        0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 0, 1],
       [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1,
        0, 1, 1, 0, 0, 1],
       [1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0,
        1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,
        1, 1, 0, 1, 1, 1],
       [0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0,
        0, 1, 1, 0, 0, 1],
       [1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
        1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1,
        1, 1, 0, 1, 0, 0],
       [1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
        0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1,
        1, 1, 1, 1, 1, 0],
       [1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0,
        0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,
        0, 0, 1, 0, 0, 1],
       [0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0,
        1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1,
        1, 0, 1, 1, 0, 1]], dtype=int32)
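
As a side note, the Python loop in replicate_experiment can be avoided: one can draw all posterior samples of \(\theta\) at once and let numpy broadcasting generate the replicated observations. Here is a sketch of this alternative; it is not used in the rest of the notebook.

def replicate_experiment_vectorized(post_rv, n=N, n_rep=9, seed=None):
    """Vectorized alternative to replicate_experiment."""
    rng = np.random.default_rng(seed)
    # One posterior sample of theta per replication, shape (n_rep, 1).
    thetas = post_rv.rvs(size=n_rep)[:, None]
    # Broadcasting produces an (n_rep, n) array of Bernoulli draws.
    return rng.binomial(1, thetas, size=(n_rep, n))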

We can visualize the data as images. The first row of pixels is the observed data; each subsequent row is a replicated dataset.

fig, ax = plt.subplots(figsize=(30, 10))
ax.imshow(np.vstack([data, x_rep]));
[Figure: observed data (top row) and nine replicated datasets shown as an image]

This visual inspection does not reveal anything strange. Let's now repeat this exercise with a completely artificial dataset that does not match the model. This is a dataset that I picked by hand, trying to emulate a coin toss with a probability of heads equal to \(0.5\):

data_fake = np.array(
    [
        0, 1, 0, 0, 1, 0, 0, 1, 1, 0,
        0, 1, 0, 1, 0, 1, 1, 0, 0, 1,
        0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
        0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
        0, 0, 1, 0, 1, 0, 1, 0, 1, 0
    ]
)

Here is the posterior for this fake dataset:

alpha_2 = 1.0 + data_fake.sum()
beta_2 = 1.0 + N - data_fake.sum()
Theta_2_post = st.beta(alpha_2, beta_2)

fig, ax = plt.subplots()
thetas = np.linspace(0, 1, 100)
ax.plot(
    thetas,
    Theta_2_post.pdf(thetas),
    label=r'$p(\theta|x_{1:n})$'
)
ax.set_xlabel(r'$\theta$')
ax.set_title(f'$n={N}$')
plt.legend(loc='best', frameon=False)
sns.despine(trim=True);
[Figure: posterior density \(p(\theta|x_{1:n})\) for the hand-picked dataset]

Let’s see if the visual comparison to replicated data reveals that the data did not come from a fair coin.

x_2_rep = replicate_experiment(Theta_2_post)
    
fig, ax = plt.subplots(figsize=(30, 10))
ax.imshow(np.vstack([data_fake, x_2_rep]));
[Figure: hand-picked data (top row) and nine replicated datasets shown as an image]

If you pay close attention, you will notice that the data I picked by hand switch between heads and tails more often than the replicated data. In other words, the replicated data have longer consecutive runs of either heads or tails. How can we see this more clearly?

Test quantities and visual inspections#

We use test quantities to characterize the model and data discrepancy. In particular, test quantities help us zoom into the characteristics of the data that are of particular interest to us. Mathematically, a test quantity is a scalar function of the data and the model parameters \(T(x_{1:n},\theta)\). There are some general recipes for creating test quantities for regression and classification. However, in general, you must use common sense in selecting them. What are the essential data characteristics that your model should be capturing? We will be seeing specific examples below.

Now, assume that you have selected one or more test quantities. What do you do with them? The easiest thing to do is to plot the histogram of the test quantity over the replicated data, i.e., of \(p(T(x^{\text{rep}}_{1:n},\theta) | x_{1:n})\), and compare it to the observed value \(T(x_{1:n},\theta)\). In such a plot, you are trying to see how likely or unlikely the observed test quantity is according to your model. We will visualize this next for the coin toss example.

Test quantities and Bayesian \(p\)-values#

Another thing you can do to check your model is to evaluate the probability that the replicated data gives you a test quantity that is more extreme than the observed value. This probability is known as the posterior (or Bayesian) \(p\)-value, and it is defined by:

\[ p_B = \mathbb{P}\left(T(x^{\text{rep}}_{1:n},\theta) > T(x_{1:n},\theta) \,\middle|\, x_{1:n}\right) = \int 1_{[T(x_{1:n},\theta),\infty)}\left(T(x^{\text{rep}}_{1:n},\theta)\right) p(x^{\text{rep}}_{1:n}, \theta|x_{1:n})\,dx^{\text{rep}}_{1:n}\,d\theta. \]

Of course, you can estimate the Bayesian \(p\)-value using Monte Carlo sampling from the joint posterior of \(x^{\text{rep}}_{1:n}\) and \(\theta\) conditioned on the data \(x_{1:n}\).
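
Concretely, if T_rep holds the test quantity evaluated on each replicated dataset and T_obs is the observed value, the Monte Carlo estimate is just the fraction of replications that exceed the observation. Here is a tiny sketch with made-up numbers; the real computation appears in the function below.

import numpy as np

# Illustrative values only: in practice T_rep comes from replicated data
# and T_obs from the observed data.
T_rep = np.array([24, 31, 27, 22, 30, 26, 25, 29, 28, 23])
T_obs = 29

# Monte Carlo estimate of the Bayesian p-value.
p_B = np.mean(T_rep > T_obs)
print(p_B)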

How should you interpret the Bayesian \(p\)-value? It is not the probability that your model is correct. We will derive that probability in the Bayesian model selection lecture. The Bayesian \(p\)-value is the probability that a replication of the experiment yields a test quantity that exceeds the observed value, under the assumption that your model is correct. So, here is a good way to interpret it:

  • A Bayesian \(p\)-value close to \(0\) or \(1\) indicates that the observed test quantity is unlikely under the assumption that your model is correct. So, your model does not capture this aspect of the data. You need to expand/modify your model somehow.

Back to the coin toss example#

What are some good test quantities that we can pick for this example? An obvious one is the number of heads. It is only a function of the data:

\[ T_{h}(x_{1:n}) = \sum_{i=1}^nx_i. \]

Let’s implement this as a Python function of the data:

def T_h(x):
    """Return the sum of x."""
    return x.sum()

Remember that we have run this example with two datasets: one generated from the correct model and one generated by hand. We will see the results that we get from both.

Let’s put everything in a function because we are going to be using this repeatedly:

def perform_diagnostics(post_rv, data, test_func, n_rep=1000):
    """Calculate Bayesian p-values.
    
    Arguments
    post_rv   -- The random variable object corresponding to
                 the posterior from which to sample.
    data      -- The training data.
    test_func -- The test function.
    n_rep     -- The number of replications.
    
    Returns a dictionary that includes the observed value of
    the test function (T_obs), the Bayesian p-value (p_val),
    the replicated test statistic (T_rep),
    and all the replicated data (data_rep).
    """
    # The test quantity evaluated on the observed data.
    T_obs = test_func(data)
    n = data.shape[0]
    # Replicate the experiment n_rep times.
    data_rep = replicate_experiment(post_rv, n=n, n_rep=n_rep)
    # The test quantity evaluated on each replicated dataset.
    T_rep = np.array(
        tuple(
            test_func(x)
            for x in data_rep
        )
    )
    # Monte Carlo estimate of the Bayesian p-value.
    p_val = np.mean(T_rep > T_obs)
    return dict(
        T_obs=T_obs,
        p_val=p_val,
        T_rep=T_rep,
        data_rep=data_rep
    )

Here is how to use it:

res = perform_diagnostics(
    Theta_post,
    data,
    T_h
)

T_h_obs = res["T_obs"]
p_val = res["p_val"]

print(f'The observed test quantity is {T_h_obs}')
print(f'The Bayesian p_value is {p_val:.4f}')
The observed test quantity is 29
The Bayesian p_value is 0.4570

Let’s now write a function that makes a diagnostics plot:

def plot_diagnostics(diagnostics):
    """Make the diagnostics plot.
    
    Arguments:
    diagnostics -- The dictionary returned by perform_diagnostics()
    """
    fig, ax = plt.subplots()
    tmp = ax.hist(
        diagnostics["T_rep"],
        density=True,
        alpha=0.25,
        label='Replicated test quantity'
    )[0]
    ax.plot(
        diagnostics["T_obs"] * np.ones((50,)),
        np.linspace(0, tmp.max(), 50),
        'k',
        label='Observed test quantity'
    )
    plt.legend(loc='best', frameon=False)
    sns.despine(trim=True);

And let’s use it:

plot_diagnostics(res)
[Figure: histogram of the replicated test quantity T_h with the observed value marked]

To save us some typing, let’s combine the calculation and the plots in a single function:

def do_diagnostics(post_rv, data, test_func, n_rep=1000):
    """Calculate Bayesian p-values and make the corresponding
    diagnostic plot.
    
    Arguments
    post_rv   -- The random variable object corresponding to
                 the posterior from which to sample.
    data      -- The training data.
    test_func -- The test function.
    n_rep     -- The number of replications.
    
    Prints the observed value of the test function and the
    Bayesian p-value, and makes the diagnostics plot.
    """
    res = perform_diagnostics(
        post_rv,
        data,
        test_func,
        n_rep=n_rep
    )

    T_obs = res["T_obs"]
    p_val = res["p_val"]

    print(f'The observed test quantity is {T_obs}')
    print(f'The Bayesian p_value is {p_val:.4f}')
    
    plot_diagnostics(res)

Now, let’s look also at the other dataset (the one we picked by hand):

do_diagnostics(
    Theta_2_post,
    data_fake,
    T_h
)
The observed test quantity is 22
The Bayesian p_value is 0.4580
[Figure: histogram of the replicated test quantity T_h for the hand-picked data with the observed value marked]

It looks about the same. The number of heads is a test quantity I managed to match when picking values by hand. Can we find a better test quantity? Remember our observation when we plotted the replicated data against the actual data for the second case: the hand-picked data included more transitions between heads and tails. Let's build a test quantity that captures this. We are going to use:

\[ T_s(x_{1:n}) = \text{number of switches between 0 and 1 in the sequence } x_{1:n}. \]

It is not easy to write this test quantity in closed form, but it is straightforward to program:

def T_s(x):
    """Return the number of switches between 0s and 1s."""
    s = 0
    for i in range(1, x.shape[0]):
        if x[i] != x[i-1]:
            s += 1
    return s
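
By the way, since a switch is just a nonzero difference between consecutive entries, the same test quantity can also be computed in a vectorized way with numpy. Here is an equivalent one-liner, offered only as an alternative:

def T_s_vectorized(x):
    """Return the number of switches between 0s and 1s (numpy version)."""
    return int(np.sum(np.diff(x) != 0))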

Let’s look first at the original dataset:

do_diagnostics(
    Theta_post,
    data,
    T_s
)
The observed test quantity is 26
The Bayesian p_value is 0.2160
../_images/76f026f338ce728af590c39783ea48ba82dd599e5610ad5009386fee7f9df861.svg

It looks okay. Let’s now look at the one I picked by hand:

do_diagnostics(
    Theta_2_post,
    data_fake,
    T_s
)
The observed test quantity is 36
The Bayesian p_value is 0.0000
[Figure: histogram of the replicated test quantity T_s for the hand-picked data with the observed value marked]

The data are implausible under the assumptions of this model.

Questions#

  • Rerun all the steps above with a larger number \(N\) of coin toss observations.
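
As a starting point for this question, here is a sketch of how you might rerun the fair-coin experiment with a larger \(N\), reusing the functions defined above. The exact numbers you get will vary from run to run.

# Repeat the synthetic experiment with more coin tosses.
N_large = 500
data_large = st.bernoulli(theta_true).rvs(size=N_large)

# Posterior for the larger dataset.
Theta_post_large = st.beta(
    1.0 + data_large.sum(),
    1.0 + N_large - data_large.sum()
)

# Posterior predictive checks with both test quantities.
do_diagnostics(Theta_post_large, data_large, T_h)
do_diagnostics(Theta_post_large, data_large, T_s)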