import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import seaborn as sns
sns.set_context("paper")
sns.set_style("ticks")

import urllib.request
import os
import jax
from jax import lax, tree, grad, jit, vmap, value_and_grad, hessian
import jax.scipy.stats as jstats
import jax.numpy as jnp
import jax.random as jr
import equinox as eqx
import optax
from diffrax import diffeqsolve, Tsit5, ODETerm, SaveAt, RecursiveCheckpointAdjoint, DirectAdjoint, Event
import pandas as pd
from functools import partial

jax.config.update("jax_enable_x64", True)
colors = sns.color_palette()
key = jr.PRNGKey(0)

def download(
    url : str,
    local_filename : str = None
):
    """Download a file from a url.
    
    Arguments
    url            -- The url we want to download.
    local_filename -- The filemame to write on. If not
                      specified 
    """
    if local_filename is None:
        local_filename = os.path.basename(url)
    urllib.request.urlretrieve(url, local_filename)

Example – The catalysis problem using variational inference#

Let’s do variational inference for the catalysis problem.

Model for data generation#

The model was described in detail in the activity where we introduced the classical approach to inverse problems. We’ll briefly summarize it here.

Catalysis dynamics#

Recall that we have modeled the production of various chemicals in a catalytic reaction with the following ODE:

\[\begin{split} \begin{aligned} \dot{z} &= A(k)z \\ z(0) &= (500,0,0,0,0,0) \in \mathbb{R}^6 \end{aligned} \end{split}\]

where \(z \in \mathbb{R}^6\) are the concentrations of each chemical. The linear dynamics \(A\) depend on some (unknown) kinetic rates \(k \in \mathbb{R}_+^5\) as

\[\begin{split} A(k) = \left(\begin{array}{cccccc} -k_1 & 0 & 0 & 0 & 0 & 0\\ k_1 & -(k_2+k_4+k_5) & 0 & 0 & 0 & 0\\ 0 & k_2 & -k_3 & 0 & 0 & 0\\ 0 & 0 & k_3 & 0 & 0 & 0\\ 0 & k_4 & 0 & 0 & 0 & 0\\ 0 & k_5 & 0 & 0 & 0 & 0 \end{array}\right)\in\mathbb{R}^{6\times 6}. \end{split}\]

Inverse problem setup#

We want to infer the rates, \(k\), from data. For numerical stability, let’s work with the scaled quantities \(\tilde{z} = \frac{z}{500}\), \(\tilde t = \frac{t}{180}\), and \(\tilde k = 180 k\) so that our ODE becomes

\[\begin{split} \begin{aligned} \dot{\tilde z} &= A(\tilde k) \tilde z \\ \tilde z(0) &= (1,0,0,0,0,0) \in \mathbb{R}^6, \end{aligned} \end{split}\]

We’ll denote the solution of the scaled ODE at scaled time \(\tilde t\) as \(\tilde z(\tilde t; \tilde k)\). The full probabilistic model is

\[\begin{split} \begin{aligned} \log \tilde k &\sim \mathcal{N}(0, \gamma^2 I) &\quad \text{Lognormal prior on rates} \\ \tilde \sigma^2 &\sim \text{Exp}(\lambda) &\quad \text{Exponential prior on noise variance} \\ Y_{ij} &\sim \mathcal{N}\Big(\tilde z\big(\mathbf{\tilde t}, \tilde k\big), \tilde \sigma^2\Big) &\quad \text{Independent Gaussian measurements} \end{aligned} \end{split}\]

where \(\gamma=2\), \(\lambda=\frac{1}{0.005}\) (so the prior mean of the noise variance is \(0.005\)), and \(Y \in \mathbb{R}^{5 \times 6}\) are the observed concentrations of 5 of the chemicals at the 6 equally-spaced times \(\mathbf{\tilde t} = \left( \frac{1}{6}, \frac{1}{3}, \frac{1}{2}, \frac{2}{3}, \frac{5}{6}, 1 \right)\).
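
To make the generative model concrete, here is a minimal sketch (it only uses the imports from the setup cell; the key and variable names are illustrative and not part of the original notebook) of drawing the scaled rates and noise variance once from the prior:

key_prior = jr.PRNGKey(1)  # throwaway key, just for this illustration
key_k, key_s = jr.split(key_prior)
gamma, lambda_ = 2.0, 1 / 0.005

log_k_tilde = gamma * jr.normal(key_k, shape=(5,))   # log k-tilde ~ N(0, gamma^2 I)
k_tilde = jnp.exp(log_k_tilde)                       # scaled reaction rates
sigma2_tilde = jr.exponential(key_s) / lambda_       # sigma2-tilde ~ Exp(lambda), mean 1/lambda

Given such a draw, the measurements \(Y\) would then be independent Gaussians centered at the corresponding scaled ODE solution, with variance \(\tilde\sigma^2\).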

Finally, let’s follow the common practice of mapping all random variables to a standard Gaussian random vector \(x \in \mathbb{R}^6\). To this end, let \(T\) be the transformation such that

\[ T(x) = (\tilde k, ~ \tilde \sigma^2) \]

and

\[ x \sim \mathcal{N}(0, I). \]

Our goal is now to use variational inference to characterize the posterior

\[\begin{split} p(x|Y) \propto \underbrace{p(Y|x) p(x)}_{\substack{\text{generative model} \\ \text{for the} \\ \text{catalysis dataset}}}. \end{split}\]

Let’s import the data and define the solver:

url = 'https://raw.githubusercontent.com/PredictiveScienceLab/advanced-scientific-machine-learning/refs/heads/main/book/data/catalysis.csv'
download(url)
catalysis_data = pd.read_csv('catalysis.csv')
catalysis_data = catalysis_data[catalysis_data['Time'] > 0.0].reset_index(drop=True)
t_obs = jnp.array(catalysis_data['Time'].values, dtype=float)
Y_obs = jnp.array(catalysis_data[['NO3', 'NO2', 'N2', 'NH3', 'N2O']].values.T, dtype=float)
t_scale = 180.0
z_scale = 500.0

t_obs_scaled = t_obs / t_scale
Y_obs_scaled = Y_obs / z_scale

k_transformation = lambda x: jnp.exp(x) / t_scale

z0_scaled = jnp.array([500.0, 0.0, 0.0, 0.0, 0.0, 0.0])/z_scale
model_kwargs = dict(
    z0=z0_scaled,
    y=Y_obs_scaled,
    t=t_obs_scaled,
    gamma=2.0,
    lambda_=1/0.005,
)

# Define the linear system
def A(k):
    """
    Return the matrix of the dynamical system.
    """
    # jax.debug.print('k = {k}', k=k)
    res = jnp.zeros((6, 6))
    res = res.at[0, 0].set(-k[0])
    res = res.at[1, 0].set(k[0])
    res = res.at[1, 1].set(-(k[1] + k[3] + k[4]))
    res = res.at[2, 1].set(k[1])
    res = res.at[2, 2].set(-k[2])
    res = res.at[3, 2].set(k[2])
    res = res.at[4, 1].set(k[4])
    res = res.at[5, 1].set(k[3])
    return res

def dynamic_sys(t, z, k):
    return jnp.dot(A(k), z)

def state_has_nan(t, y, args, **kwargs):
    return jnp.isnan(y).any()

# Solve the ODE using Diffrax
def solve_catalysis(t, k, z0, make_compatible_with_hessian):

    if make_compatible_with_hessian:
        # DirectAdjoint with a bounded scan supports higher-order differentiation (e.g., Hessians).
        solver = Tsit5(scan_kind="bounded")
        adjoint = DirectAdjoint()
    else:
        # Default: memory-efficient reverse-mode differentiation via checkpointing.
        solver = Tsit5()
        adjoint = RecursiveCheckpointAdjoint()
        
    sol = diffeqsolve(
        ODETerm(dynamic_sys),
        solver=solver,
        t0=0.0,
        t1=t[-1],
        dt0=0.001,
        y0=z0,
        args=k,
        saveat=SaveAt(ts=t),
        adjoint=adjoint,
        throw=True,
        event=Event(state_has_nan)  # Here, we are telling the solver to stop once a NaN is detected.
    )
    return sol.ys
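
As a quick, illustrative check (not part of the original notebook), we can inspect the shapes of the loaded data and run a single forward solve with all scaled rates set to one:

print(Y_obs_scaled.shape)   # (5, 6): 5 observed species at 6 observation times
print(t_obs_scaled)         # scaled observation times, 1/6, 1/3, ..., 1 (per the model description above)

z_test = solve_catalysis(t_obs_scaled, jnp.ones(5), z0_scaled, make_compatible_with_hessian=False)
print(z_test.shape)         # (6, 6): 6 time points x 6 species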

As a reminder, here is what the data look like:

# Plot the raw catalysis data
fig, ax = plt.subplots(figsize=(8, 6))

# Define labels and colors
labels = ['NO3-', 'NO2-', 'N2', 'NH3', 'N2O', 'X']
data_cols = ['NO3', 'NO2', 'N2', 'NH3', 'N2O']
model_cols = [0, 1, 3, 4, 5, 2]

# Plot experimental data
for i, col in enumerate(data_cols):
    ci = 500.0 if data_cols[i] == 'NO3' else 0.0
    ax.plot(jnp.hstack([0.0, t_obs]), jnp.hstack([ci, jnp.array(catalysis_data[col])]), '*', color=colors[i], label=f'Data {labels[i]}')

ax.set_ylim(0, 600)
ax.set_xlabel('Time')
ax.set_ylabel('Concentration')
plt.legend()
sns.despine(trim=True)
plt.show()
[Figure: observed concentrations of the five measured species versus time.]

And here is the negative log unnormalized posterior, i.e., \(-\log p(Y|x) - \log p(x)\), which equals \(-\log p(x|Y)\) up to an additive constant:

def T(x, gamma, lambda_):
    """Transforms a standard normal random vector into the parameters of the model."""
    # Get reaction rates
    k = jnp.exp(x[:-1]*gamma)

    # Get measurement noise variance
    expon_icdf = lambda p: -jnp.log1p(-p)  # Inverse CDF of Exp(1)
    norm_to_expon = lambda x: expon_icdf(jstats.norm.cdf(x))  # This converts a N(0,1) to Exp(1)
    sigma2 = norm_to_expon(x[-1])/lambda_

    return k, sigma2

# Negative log posterior
def minus_log_post(x, z0, y, t, gamma, lambda_, make_compatible_with_hessian=False, solve_catalysis=solve_catalysis):
    """Negative log posterior of the catalysis model."""
    
    # Negative log prior
    prior = 0.5*jnp.sum(x**2)

    # Get reaction rates and measurement noise variance
    k, sigma2 = T(x, gamma, lambda_)

    # Simulate physical process
    states = solve_catalysis(t, k, z0, make_compatible_with_hessian)
    obs_states = jnp.hstack([states[:, :2], states[:, 3:]])  # We don't observe the third component

    # Negative log likelihood
    likelihood = 0.5*jnp.sum((obs_states.T - y)**2) / sigma2

    # Total negative log posterior
    posterior = likelihood + prior
    
    return posterior
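
As an illustrative sanity check (not in the original notebook), we can evaluate the negative log posterior at the prior mean \(x = 0\), which corresponds to all scaled rates equal to one and the prior median of the noise variance:

print(minus_log_post(jnp.zeros(6), **model_kwargs))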

Warning: Guarding against “bad” parameter values#

Since Bayesian inference algorithms (such as variational inference) try to characterize an entire probability distribution, they will often visit regions of parameter space that have extremely low posterior probability density. Unfortunately, parameter values in these “bad” regions also tend to cause numerical issues for any ODE solvers in the likelihood.

To guard against this, we’d ideally choose a prior on the parameters that excludes these “bad” regions. In practice, however, it is usually not obvious what this prior should be. In variational inference, a more practical approach is to simply catch these “bad” parameter values early and cut the ODE solve short. Since these “bad” parameter values were unlikely to begin with, skipping them will not noticeably affect the final posterior approximation.

We have already implemented this in the code above by passing event=Event(state_has_nan) to the ODE solver. This tells the solver to terminate early as soon as it detects a NaN value in the solution.
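
On the optimization side, this is complemented later in the notebook by wrapping the optimizer in optax.apply_if_finite, which ignores updates whose gradients are not finite. Here is a small standalone sketch of that behavior (the toy parameters and gradients below are made up purely for illustration):

params_toy = jnp.array([1.0, 2.0])
opt_toy = optax.apply_if_finite(optax.adam(1e-1), 3)       # ignore up to 3 consecutive non-finite updates
state_toy = opt_toy.init(params_toy)

bad_grad = jnp.array([jnp.nan, 0.0])
updates, state_toy = opt_toy.update(bad_grad, state_toy)   # non-finite gradient -> zero update
params_toy = optax.apply_updates(params_toy, updates)      # parameters left unchanged

good_grad = jnp.array([0.1, -0.2])
updates, state_toy = opt_toy.update(good_grad, state_toy)  # finite gradient -> a normal Adam step
params_toy = optax.apply_updates(params_toy, updates)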

Multivariate normal guide#

Let’s choose the guide to be a multivariate normal distribution, i.e.,

\[ q_\phi(x) = \mathcal{N}(\mu_\phi, \Sigma_\phi). \]

where \(\phi \in \mathbb{R}^{27}\) are the variational parameters: 6 for the mean, 6 for the log-diagonal of the Cholesky factor of the covariance, and 15 for its strict lower triangle. The mean of the guide is parameterized as

\[ \mu_\phi = ( \phi_1, \ldots, \phi_6 ) \in \mathbb{R}^6 \]

We’ll represent the covariance matrix, \(\Sigma_\phi\), with its Cholesky decomposition

\[ \Sigma_\phi = L_\phi L_\phi^T \]

and we’ll parameterize the Cholesky factor \(L_\phi\) as

\[\begin{split} L_\phi = \begin{pmatrix} \exp(\phi_7) & 0 & 0 & 0 & 0 & 0 \\ \phi_{13} & \exp(\phi_8) & 0 & 0 & 0 & 0 \\ \phi_{14} & \phi_{15} & \exp(\phi_9) & 0 & 0 & 0 \\ \phi_{16} & \phi_{17} & \phi_{18} & \exp(\phi_{10}) & 0 & 0 \\ \phi_{19} & \phi_{20} & \phi_{21} & \phi_{22} & \exp(\phi_{11}) & 0 \\ \phi_{23} & \phi_{24} & \phi_{25} & \phi_{26} & \phi_{27} & \exp(\phi_{12}) \end{pmatrix} \in \mathbb{R}^{6 \times 6}. \end{split}\]

Exponentiating the diagonal entries guarantees that \(L_\phi\) has a strictly positive diagonal, so the covariance matrix \(\Sigma_\phi = L_\phi L_\phi^T\) is positive definite. The remaining parameters fill the strict lower triangle row by row (matching jnp.tril_indices in the implementation below). We now have a parameterized guide \(q_\phi\), where \(\phi\) fully specifies a multivariate normal distribution. Here it is:

class MultivariateNormalGuide(eqx.Module):
    """Class that represents a multivariate normal guide with variational parameters phi."""
    phi: jnp.ndarray
    num_x: int = eqx.field(static=True, default=None)

    def __post_init__(self):
        if self.phi.shape[0] != self._get_phi_required_size(self.num_x):
            raise ValueError("The length of phi is not consistent with the number of parameters.")
        
    def logprob(self, x):
        """The log probability density of the guide."""
        return jax.scipy.stats.multivariate_normal.logpdf(x, self.mu, self.Sigma)
    
    def sample(self, key, num_samples):
        """Samples from the guide."""
        return jr.multivariate_normal(key, self.mu, self.Sigma, shape=(num_samples,))
    
    def forward(self, xi):
        """Transforms a multivariate normal sample to a sample from the guide, as per the reparameterization trick."""
        return self.mu + jnp.dot(self.L, xi)

    @property
    def mu(self):
        """The mean of the guide."""
        return self.phi[:self.num_x]

    @property
    def Sigma(self):
        """The covariance of the guide."""
        L = self.L
        return jnp.dot(L, L.T)
    
    @property
    def L(self):
        """The Cholesky decomposition of the covariance of the guide."""
        # The diagonal
        ell = jnp.exp(self.phi[self.num_x:2*self.num_x])
        L = jnp.diag(ell)

        # The lower triangular part
        L = L.at[jnp.tril_indices(self.num_x, -1)].set(self.phi[2*self.num_x:])
        return L

    @classmethod
    def from_mean_covariance(cls, mu, Sigma):
        """Constructs a guide with a given mean and covariance. Useful for initializing phi to a reasonable value."""
        L = jnp.linalg.cholesky(Sigma)
        ell = jnp.diag(L)
        tri = L[jnp.tril_indices(L.shape[0], -1)]
        phi = jnp.hstack([mu, jnp.log(ell), tri])
        return cls(phi, mu.shape[0])
    
    @staticmethod
    def _get_phi_required_size(num_x):
        return num_x + num_x*(num_x + 1)//2

# Test
_q_test = MultivariateNormalGuide(phi=jnp.arange(27, dtype=float), num_x=6)
assert jnp.all(jnp.isclose(_q_test.phi, MultivariateNormalGuide.from_mean_covariance(_q_test.mu, _q_test.Sigma).phi))
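
Here is a brief, illustrative example of the guide’s interface (the numbers are arbitrary and not part of the original notebook):

_q = MultivariateNormalGuide.from_mean_covariance(mu=jnp.ones(6), Sigma=0.25 * jnp.eye(6))
_xi = jr.normal(jr.PRNGKey(1), shape=(6,))
_x = _q.forward(_xi)                            # one reparameterized draw, mu + L xi
_lp = _q.logprob(_x)                            # its log density under the guide
_xs = _q.sample(jr.PRNGKey(2), num_samples=3)   # direct sampling, shape (3, 6)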

Here is how we can initialize the guide to match the prior distribution:

# Set the guide to a standard Gaussian
q_init = MultivariateNormalGuide.from_mean_covariance(mu=jnp.zeros(6), Sigma=jnp.eye(6))

Let’s plot some concentration-vs-time trajectories, sampled from the prior:

def plot_guide_samples(q, num_samples, key):
    times_scaled = jnp.linspace(0.0, t_obs_scaled[-1], 1000)
    x_samples = q.sample(key, num_samples)
    k_samples = vmap(partial(T, gamma=model_kwargs['gamma'], lambda_=model_kwargs['lambda_']))(x_samples)[0]

    # Compute the sampled concentration trajectories (no measurement noise is added)
    z_samples_scaled = jnp.array([solve_catalysis(times_scaled, k, model_kwargs['z0'], False) for k in k_samples])

    times = times_scaled*t_scale
    z_samples = z_samples_scaled*z_scale

    # Compute the median trajectory across the samples
    median_models_noise = jnp.median(z_samples, axis=0)

    # Plot the data and the sampled trajectories
    fig, ax = plt.subplots(figsize=(8, 6))

    # Define labels and colors
    labels = ['NO3-', 'NO2-', 'N2', 'NH3', 'N2O', 'X']
    data_cols = ['NO3', 'NO2', 'N2', 'NH3', 'N2O']
    model_cols = [0, 1, 3, 4, 5, 2]

    # Plot experimental data
    for i, col in enumerate(data_cols):
        ax.plot(t_obs, catalysis_data[col], '*', color=colors[i], label=f'Data {labels[i]}')

    # Plot the median trajectories
    for i, col in enumerate(model_cols):
        ax.plot(times, median_models_noise[:, col], color=colors[i], label=f'Model {labels[i]}')

    for i in range(num_samples):
        for j, col in enumerate(model_cols):
            ax.plot(times, z_samples[i, :, col], color=colors[j], alpha=0.2)
    ax.set_ylim(0, 600)
    ax.set_xlabel('Time')
    ax.set_ylabel('Concentration')
    plt.legend()
    sns.despine(trim=True)
    plt.show()
key, subkey = jr.split(key)
plot_guide_samples(q=q_init, num_samples=50, key=subkey)
[Figure: prior samples of the concentration trajectories overlaid on the data.]

The prior seems reasonable. If we didn’t have the data, any of these trajectories could be plausible.

Maximizing the ELBO#

The goal of variational inference is to find optimal parameters \(\phi\) so that the guide is as close as possible to the true posterior \(p(x|Y).\) We do this by maximizing the Evidence Lower Bound (ELBO):

\[ \text{ELBO}(\phi) = \mathbb{E}_{x \sim q_\phi}\underbrace{\left[\log p(Y|x) + \log p(x) - \log q_\phi(x)\right]}_{\equiv f(x)} = \mathbb{E}_{x \sim q_\phi}\left[f(x)\right] \]

We’ll use a stochastic optimization algorithm (e.g., Adam) to maximize the ELBO. This requires estimating the gradient of the ELBO with respect to the variational parameters \(\phi\), i.e., \(\nabla_\phi \text{ELBO}(\phi)\). Using the reparameterization trick, we remove the expectation operator’s dependence on \(\phi\) so that

\[ \nabla_\phi \text{ELBO}(\phi) = \nabla_\phi \mathbb{E}_{x \sim q_\phi}\left[ f(x)\right] = \nabla_\phi \mathbb{E}_{\xi \sim \mathcal{N}(0, I)}\bigg[ f\Big(g_\phi(\xi)\Big)\bigg] = \mathbb{E}_{\xi \sim \mathcal{N}(0, I)}\bigg[ \nabla_\phi f\Big(g_\phi(\xi)\Big)\bigg] \]

where \(g_\phi(\xi) = \mu_\phi + L_\phi \xi\) transforms a sample from \(\mathcal{N}(0,I)\) to one from \(q_\phi\). The ELBO gradient can now be estimated by Monte Carlo sampling:

\[ \widehat{\nabla_\phi \text{ELBO}}(\phi) = \frac{1}{N} \sum_{i=1}^N \nabla_\phi f\Big(g_\phi(\xi_i)\Big) \]

where \(\xi_i \sim \mathcal{N}(0, I)\) and \(N\) is the batch size.
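
Before applying this to the catalysis model, here is a small self-contained sketch of the reparameterized estimator on a made-up one-dimensional integrand, where the exact gradient is known and can be used as a check (everything in this snippet is illustrative):

# Toy setup: f(x) = x^2 / 2 and a 1-D Gaussian guide with phi = (mu, log_sigma),
# so E[f(g_phi(xi))] = (mu^2 + sigma^2) / 2 and the exact gradient is (mu, sigma^2).
def g_toy(phi, xi):
    mu, log_sigma = phi
    return mu + jnp.exp(log_sigma) * xi

f_toy = lambda x: 0.5 * x**2

def estimate_gradient(phi, key, N=2000):
    xi = jr.normal(key, shape=(N,))
    per_sample = vmap(grad(lambda p, e: f_toy(g_toy(p, e))), in_axes=(None, 0))(phi, xi)
    return jnp.mean(per_sample, axis=0)

phi_toy = jnp.array([0.3, -1.0])                  # mu = 0.3, sigma = exp(-1)
print(estimate_gradient(phi_toy, jr.PRNGKey(3)))  # should be close to (0.3, exp(-2) ~ 0.135)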

We will minimize the negative ELBO (equivalent to maximizing the ELBO):

def neg_elbo(phi, xi, model_kwargs, num_x):
    """The integrand of the negative reparameterized ELBO, f(g_phi(xi))."""
    q = MultivariateNormalGuide(phi=phi, num_x=num_x)
    x = q.forward(xi)
    return minus_log_post(x, **model_kwargs) + q.logprob(x)
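
As an aside (illustrative, not from the original notebook), the integrand can be evaluated for a batch of \(\xi\) draws with vmap, exactly as the optimization step below does. Draws that land in a numerically “bad” region of parameter space may come back non-finite, which is what the guards discussed above are for:

xi_batch = jr.normal(jr.PRNGKey(123), shape=(4, 6))   # throwaway key so the notebook's random stream is untouched
vals = vmap(neg_elbo, in_axes=(None, 0, None, None))(q_init.phi, xi_batch, model_kwargs, 6)
print(vals)   # entries may be inf/nan if a draw hits a bad region; the optimizer below skips such updates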

We are now ready to do variational inference:

def step(carry, _, optim, batch_size, model_kwargs, num_x):
    """A single optimization step."""
    phi, opt_state, key = carry
    
    # Sample latent variable xi
    key, key_xi = jr.split(key)
    xi = jr.normal(key_xi, shape=(batch_size, num_x))

    # Estimate the negative ELBO and its gradient (summed over the batch of xi draws)
    neg_elbo_val, neg_elbo_grad = tree.map(
        lambda x: jnp.sum(x, axis=0), 
        vmap(value_and_grad(neg_elbo), in_axes=(None, 0, None, None))(phi, xi, model_kwargs, num_x)
    )

    # Update variational parameters phi
    updates, opt_state = optim.update(neg_elbo_grad, opt_state)
    phi = optax.apply_updates(phi, updates)

    return (phi, opt_state, key), neg_elbo_val

# Setup
optim = optax.apply_if_finite(optax.adam(5e-2), 10)
step_frozen_args = jit(partial(step, optim=optim, batch_size=5, model_kwargs=model_kwargs, num_x=6))
key, subkey = jr.split(key)
vi_state = (q_init.phi, optim.init(q_init.phi), subkey)
neg_elbo_vals = []
print_every = 100
num_iter = 3000

# Run variational inference loop
for i in range(num_iter):
    vi_state, neg_elbo_val = step_frozen_args(vi_state, None)
    neg_elbo_vals.append(neg_elbo_val)
    if (i + 1) % print_every == 0:
        print(f"Iteration {(i + 1):4d}/{num_iter:4d} \t ELBO: {jnp.mean(jnp.array(neg_elbo_vals[-print_every:])):.2f}")

# Extract optimized guide parameters and ELBO values
phi_opt, _, _ = vi_state
neg_elbo_vals = jnp.array(neg_elbo_vals)
Iteration  100/3000 	 ELBO: 427.27
Iteration  200/3000 	 ELBO: 39.87
Iteration  300/3000 	 ELBO: 34.05
Iteration  400/3000 	 ELBO: 32.47
Iteration  500/3000 	 ELBO: 31.74
Iteration  600/3000 	 ELBO: 31.15
Iteration  700/3000 	 ELBO: 30.76
Iteration  800/3000 	 ELBO: 30.81
Iteration  900/3000 	 ELBO: 30.64
Iteration 1000/3000 	 ELBO: 30.86
Iteration 1100/3000 	 ELBO: 30.64
Iteration 1200/3000 	 ELBO: 29.88
Iteration 1300/3000 	 ELBO: 31.05
Iteration 1400/3000 	 ELBO: 31.79
Iteration 1500/3000 	 ELBO: 31.27
Iteration 1600/3000 	 ELBO: 31.18
Iteration 1700/3000 	 ELBO: 30.29
Iteration 1800/3000 	 ELBO: 30.71
Iteration 1900/3000 	 ELBO: 31.42
Iteration 2000/3000 	 ELBO: 31.99
Iteration 2100/3000 	 ELBO: 31.31
Iteration 2200/3000 	 ELBO: 33.40
Iteration 2300/3000 	 ELBO: 30.80
Iteration 2400/3000 	 ELBO: 31.50
Iteration 2500/3000 	 ELBO: 33.25
Iteration 2600/3000 	 ELBO: 31.19
Iteration 2700/3000 	 ELBO: 30.68
Iteration 2800/3000 	 ELBO: 32.73
Iteration 2900/3000 	 ELBO: 31.79
Iteration 3000/3000 	 ELBO: 33.83
fig, ax = plt.subplots(figsize=(3, 2))
sliding_window = 20
elbo_vals_averaged = jnp.convolve(neg_elbo_vals - neg_elbo_vals.min(), jnp.ones(sliding_window)/sliding_window, mode='valid')
iters_averaged = jnp.arange(neg_elbo_vals.shape[0])[:elbo_vals_averaged.shape[0]]
ax.plot(iters_averaged, elbo_vals_averaged)
ax.set_title('Variational inference convergence')
ax.set_xlabel('Iteration')
ax.set_ylabel('Negative ELBO')
ax.set_yscale('log')
sns.despine(trim=True)
[Figure: variational inference convergence, smoothed negative ELBO versus iteration.]

We now have a good approximation to the posterior, i.e., we’ve found \(\phi\) such that \(q_\phi(x) \approx p(x|Y)\). Nice!

Let’s plot concentration-vs-time trajectories, sampled from the posterior:

q = MultivariateNormalGuide(phi=phi_opt, num_x=6)
key, subkey = jr.split(key)
plot_guide_samples(q=q, num_samples=50, key=subkey)
[Figure: posterior samples of the concentration trajectories overlaid on the data.]
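
Before looking at correlations, here is an illustrative bit of post-processing (not in the original notebook): sample from the optimized guide and summarize the implied scaled rates and noise variance.

x_post = q.sample(jr.PRNGKey(42), num_samples=2000)   # throwaway key so the notebook's random stream is untouched
k_post, sigma2_post = vmap(partial(T, gamma=model_kwargs['gamma'], lambda_=model_kwargs['lambda_']))(x_post)
print('Posterior mean of the scaled rates:', jnp.mean(k_post, axis=0))
print('Posterior std of the scaled rates: ', jnp.std(k_post, axis=0))
print('Posterior mean of the noise variance:', jnp.mean(sigma2_post))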

It looks reasonable. We can also visualize the posterior correlations between the parameters with a heatmap:

var_names = [r'{k_1}', r'{k_2}', r'{k_3}', r'{k_4}', r'{k_5}', r'{\sigma^2}']

fig, ax = plt.subplots(figsize=(3, 3))
std = jnp.sqrt(jnp.diag(q.Sigma))
corr = q.Sigma / jnp.outer(std, std)  # correlation matrix from the posterior covariance
sns.heatmap(corr, ax=ax, cmap='coolwarm', center=0, vmin=-1, vmax=1)
ax.set_title('Correlation matrix for the posterior')
ax.set_aspect('equal')
ax.set_xticklabels([rf'$x_{i}$' for i in var_names])
ax.set_yticklabels([rf'$x_{i}$' for i in var_names]);
[Figure: heatmap of the posterior correlation matrix.]

Questions#

  • Try playing with the prior.

    • Increase the prior mean of the measurement noise variance, \(1/\lambda\), to 0.1 (i.e., set lambda_ to 1/0.1). How does this affect the posterior approximation? What happens if you decrease the prior mean instead?

    • Increase the prior standard deviation of the log-rates, \(\gamma\), to 3 (i.e., set gamma to 3). Does this change the posterior approximation? What about if you decrease \(\gamma\)?

  • Play with the starting point. Start variational inference from different points by modifying q_init. Does the optimization algorithm always converge to the same solution?