Variational Inference#

References#

The notes are not exhaustive. Variational inference represents the state of the art in Bayesian inference and is still evolving. Please consult the papers above for more details.

Note: This document was originally developed by Dr. Rohit Tripathy.

Bayesian Inference#

Quick Review#

Once again, let’s begin with a review of Bayesian inference.

Our goal is to derive a probability distribution over unknown quantities (or latent variables), conditional on any observed data (i.e. a posterior distribution). Without loss of generality, we denote all unknown quantities in our model as \(\theta\) and the observed data as \(\mathcal{D}\).

We start with a description of our prior state of knowledge over \(\theta\) - \(p(\theta)\).

We then specify a conditional probabilistic model that links the observed data to the unknown quantities, \(p(\mathcal{D}|\theta)\) (the likelihood). We want the posterior \(p(\theta|\mathcal{D})\), which we know from Bayes' rule to be: \( p(\theta | \mathcal{D}) \propto p(\mathcal{D}|\theta)\,p(\theta) = p(\theta, \mathcal{D}). \)

The posterior distribution \(p(\theta | \mathcal{D})\) captures our state of knowledge about \(\theta\) conditional on all the information available to us \(\mathcal{D}\).

In the Bayesian framework, predictions about unseen data (or test data), are posed as expectations over this posterior distribution.
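
For example, the predictive density of unseen data \(\mathcal{D}^*\) is the expectation of the likelihood of \(\mathcal{D}^*\) under the posterior:

\[ p(\mathcal{D}^*|\mathcal{D}) = \int p(\mathcal{D}^*|\theta)\, p(\theta|\mathcal{D})\, \mathrm{d}\theta = \mathbb{E}_{p(\theta|\mathcal{D})}\left[ p(\mathcal{D}^*|\theta) \right]. \]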

What is the problem?#

Unfortunately, as you already know, the posterior distribution is more often than not unavailable in closed form. This is due to the intractability of the evidence (or marginal likelihood), i.e., the denominator in Bayes' rule, \(Z = \int p(\theta, \mathcal{D}) \mathrm{d}\theta\). In fact, there is only a small class of prior-likelihood pairs that admit closed-form expressions for the posterior distribution (conjugate models).
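
For a concrete example of conjugacy, take coin-flip data \(\mathcal{D} = (x_1, \dots, x_N)\), \(x_i \in \{0, 1\}\), with a Bernoulli likelihood and a Beta prior on the probability of heads \(\theta\). The posterior is again a Beta distribution, available in closed form:

\[ \theta \sim \mathrm{Beta}(\alpha, \beta), \quad x_i|\theta \sim \mathrm{Bernoulli}(\theta) \implies \theta | \mathcal{D} \sim \mathrm{Beta}\left(\alpha + \textstyle\sum_{i} x_i,\ \beta + N - \textstyle\sum_{i} x_i\right). \]

Outside such conjugate pairs, we must resort to approximations.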

Approximating the posterior#

There are several approaches to approximating the posterior:

  1. The posterior density \(p(\theta | \mathcal{D})\) is approximated with a point mass density, i.e., \(p(\theta | \mathcal{D}) \approx \delta_{\theta^*}(\theta)\), where \(\delta_{\theta^*}(\theta) = \delta(\theta - \theta^*)\) is a Dirac point mass centered at \(\theta^*\). This is the well-known maximum a posteriori (MAP) estimation procedure. The parameter \(\theta^*\) is obtained as the solution of the optimization problem, \(\theta^* = \underset{\theta}{\mathrm{argmax}}\ p(\theta, \mathcal{D})\). The MAP approximation is often justified by the assumption that the true posterior distribution \(p(\theta|\mathcal{D})\) has a single, sharply peaked mode. In practice this approach often provides reasonable predictive accuracy but is unable to capture any of the epistemic uncertainty induced by limited data. We saw this very early in the class when we introduced basic supervised and unsupervised learning techniques (a minimal sketch follows this list).

  2. The posterior distribution is approximated with a finite number of particles, i.e., \(p(\theta | \mathcal{D}) \approx \sum_{i=1}^{N} w^{(i)} \delta (\theta - \theta^{(i)})\). The most popular class of techniques that approximates the posterior distribution this way is Markov Chain Monte Carlo (MCMC). Recall that the general idea of MCMC is to construct a discrete-time, reversible and ergodic Markov chain whose equilibrium distribution is the target posterior distribution. The goal is to simulate the Markov chain long enough that it reaches its equilibrium phase, i.e., that its states are distributed according to the target posterior density. Once this is accomplished, sampling from the Markov chain is the same as sampling from the target posterior density. Since MCMC samples (in theory) directly from the posterior, the weights of the approximation are simply set to \(w^{(i)} = 1/N\). There are several other approaches to approximate probability densities with particle distributions, such as Sequential Monte Carlo (SMC) (which developed primarily as a tool for inferring latent variables in state-space models but can be used for general-purpose inference) and Stein Variational Gradient Descent (SVGD). We covered everything except SVGD in the previous lecture.

  3. Set up a parameterized family of densities over the latent variables, \(q_{\phi}(\theta)\), and infer the parameters \(\phi\) by solving an optimization problem of the form:

\[ \phi^{*} = \underset{\phi}{\mathrm{argmin}} \ \mathrm{D}[ p(\theta| \mathcal{D}) , q_{\phi}(\theta)], \]

where \(\mathrm{D}[\cdot, \cdot]\) is some measure of discrepancy between the approximate (or variational) posterior and the true posterior. Needless to say, we want to set up this optimization problem such that we only need to know \(p(\theta | \mathcal{D})\) up to a multiplicative constant. Variational Inference (VI) is the name given to this general class of methods that seek to approximate the posterior this way. This is our topic today.
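
To make the first approach concrete, here is a minimal sketch of MAP estimation for a toy model (the model and data are made up for illustration): a standard normal prior on an unknown mean \(\theta\) and Gaussian observations.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up data: noisy observations of an unknown mean theta.
rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=1.0, size=20)

def neg_log_joint(theta_vec):
    """-log p(theta, D) for a N(0, 1) prior and N(theta, 1) likelihood."""
    theta = theta_vec[0]
    log_prior = -0.5 * theta ** 2
    log_like = -0.5 * np.sum((data - theta) ** 2)
    return -(log_prior + log_like)

# theta* = argmax p(theta, D) = argmin -log p(theta, D).
theta_map = minimize(neg_log_joint, x0=np.array([0.0])).x[0]
print(theta_map)  # a single point estimate -- no uncertainty attached
```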

Variational Inference#

Different VI procedures are obtained from different choices of the approximating family \(q_{\phi}(\cdot)\) and the discrepancy functional \(\mathrm{D}[\cdot, \cdot]\). The most common choice for \(\mathrm{D}\) is the Kullback-Leibler (KL) divergence. The KL divergence between two densities \(q(\theta)\) and \(p(\theta)\) is defined as follows:

\[ \mathrm{KL}[q(\theta)|| p(\theta)] = \int q(\theta) \log \left( \frac{q(\theta)}{p(\theta)} \right) \mathrm{d}\theta = \mathbb{E}_{q(\theta)} \left[ \log \left( \frac{q(\theta)}{p(\theta)} \right) \right]. \]

The KL divergence is always non-negative, i.e., \(\mathrm{KL}[q(\theta)|| p(\theta)] \ge 0\), with \(\mathrm{KL}[q(\theta)|| p(\theta)] = 0\) implying that \(q(\theta) = p(\theta)\) almost everywhere. Our inference goal can, therefore, be stated as follows - given a choice of a family of densities \(q(\cdot)\), parameterized by \(\phi\), what is the setting of \(\phi\) that will return the closest match, i.e. minimum KL divergence, between the approximate posterior \(q(\theta)\) and the true posterior \(p(\theta|\mathcal{D})\)?
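
To build some intuition, here is a quick sketch (plain NumPy/SciPy; the two Gaussians are chosen arbitrarily) comparing the closed-form KL divergence between two univariate Gaussians to its Monte Carlo estimate \(\mathbb{E}_{q}[\log q - \log p]\):

```python
import numpy as np
from scipy.stats import norm

# Two arbitrary univariate Gaussians q and p.
mu_q, sig_q = 0.0, 1.0
mu_p, sig_p = 1.0, 2.0

# Closed-form KL[q || p] between two Gaussians.
kl_exact = (np.log(sig_p / sig_q)
            + (sig_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sig_p ** 2)
            - 0.5)

# Monte Carlo estimate: average of log q - log p over samples from q.
rng = np.random.default_rng(0)
samples = rng.normal(mu_q, sig_q, size=100_000)
kl_mc = np.mean(norm.logpdf(samples, mu_q, sig_q)
                - norm.logpdf(samples, mu_p, sig_p))

print(kl_exact, kl_mc)  # the two values should agree closely
```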

This brings us to \(q\) - the approximate posterior. Notice that we have made no assumptions on \(q\) thus far. We can, of course, pick any family of distributions we want to approximate the posterior. However, in practice, we pick \(q\) such that it satisfies some desirable properties:

  1. If we know that a latent variable has restricted support (the positive reals, for instance), we pick \(q\) such that \(q\) itself has support on the same set only.

  2. We would also like \(q\) to be easy to sample from, and its log probability to be easy to evaluate, since the variational objective requires computing an expectation over log probability ratios. A common simplifying assumption that enables easier sampling and log probability computation is the mean-field assumption - i.e., setting up the approximation such that the individual latent variables are independent. If \(\theta = (\theta_1, \theta_2, \dots, \theta_M)\) is the vector of latent variables, the mean-field assumption implies an approximation of the form,

\[ p(\theta|\mathcal{D}) \approx q_{\phi}(\theta) = \prod_{i=1}^{M} q_{\phi_i}(\theta_i), \]

where \(q_{\phi_i}(\cdot)\) is the approximate marginal posterior over the latent variable \(\theta_i\) parameterized by \(\phi_i\).
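
As a sketch of what the mean-field assumption buys us computationally, consider a factorized Gaussian approximation: both sampling and log-probability evaluation decompose across the \(M\) latent variables (this is an illustration only, not a complete VI routine):

```python
import numpy as np
from scipy.stats import norm

M = 3  # number of latent variables

# Variational parameters phi = (means, log standard deviations).
mu = np.zeros(M)
log_sigma = np.zeros(M)

def sample_q(rng, n):
    """Draw n samples from the factorized Gaussian q (shape (n, M))."""
    return mu + np.exp(log_sigma) * rng.standard_normal((n, M))

def log_q(theta):
    """log q(theta) = sum of the M independent marginal log densities."""
    return np.sum(norm.logpdf(theta, mu, np.exp(log_sigma)), axis=-1)

rng = np.random.default_rng(0)
print(log_q(sample_q(rng, 5)))
```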

Evidence Lower Bound (ELBO)#

So, to recap, the generic VI strategy is to pose a suitable parameterized family of densities \(q_{\phi}(\theta)\) to approximate the true posterior \(p(\theta|\mathcal{D})\) and to minimize the KL divergence from \(q\) to \(p\):

\[ \phi^* = \underset{\phi}{\mathrm{argmin}}\ \mathrm{KL}\left[ q_{\phi}(\theta) || p(\theta|\mathcal{D}) \right]. \]

We cannot actually optimize the KL divergence directly because of its dependence on the true posterior \(p(\theta | \mathcal{D})\). Instead, we will solve an equivalent, tractable optimization problem. Define the function \(\mathcal{L}(\phi)\) as \(\mathcal{L}(\phi) = \mathbb{E}_{q(\theta)}[\log p(\theta, \mathcal{D})] + \mathbb{H}[q(\theta)]\), where \(\mathbb{H}[q(\theta)] =\mathbb{E}_{q(\theta)}[-\log q(\theta)] \) is the entropy of \(q\). With some simple algebra you can show that solving the optimization problem:

\[ \phi^* = \underset{\phi}{\mathrm{argmax}}\ \mathcal{L}(\phi), \]

is equivalent to minimizing the KL divergence between \(q\) and \(p(\theta|\mathcal{D})\).

Proof:

\[\begin{split} \begin{align} \mathrm{KL}\left[ q_{\phi}(\theta) || p(\theta|\mathcal{D}) \right] &= \mathbb{E}_q \left[ \log \left( \frac{q_{\phi}(\theta)}{p(\theta|\mathcal{D})} \right) \right], \\ &= \mathbb{E}_q \left[ \log \left( \frac{q_{\phi}(\theta) Z}{p(\theta, \mathcal{D})} \right) \right], \ \text{where $Z$ is the evidence,} \\ &= \underset{=-\mathbb{H}[q(\theta)]}{\underbrace{\mathbb{E}_q [\log q_{\phi}(\theta)]}} - \mathbb{E}_q [\log p(\theta, \mathcal{D})] + \underset{\text{this is a constant}}{\underbrace{\log Z}},\\ &= - \mathcal{L}(\phi) + \log Z. \end{align} \end{split}\]

Therefore,

\[ \log Z = \mathrm{KL}\left[ q_{\phi}(\theta) || p(\theta|\mathcal{D}) \right] + \mathcal{L}(\phi). \]

We see that the log evidence (which is a constant) is the sum of the objective function \(\mathcal{L}(\phi)\) and the KL divergence between the true and approximate posteriors. Since the KL divergence is non-negative, the objective \(\mathcal{L}(\phi)\) is a lower bound on the log evidence. The bound is tight, i.e., \(\log Z = \mathcal{L}(\phi)\), if \(q_{\phi}\) matches the true posterior perfectly. Minimizing the KL divergence with respect to the variational parameters \(\phi\) is equivalent to maximizing the objective \(\mathcal{L}(\phi)\) with respect to \(\phi\). Since \(\mathcal{L}(\phi)\) depends only on terms that we can compute or approximate, we use it as the objective function for our VI optimization problem. \(\mathcal{L}(\phi)\) is also known as the Evidence Lower Bound or ELBO.
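
In practice, the expectation in the ELBO is rarely available in closed form and is instead estimated by Monte Carlo. Here is a minimal sketch on a toy conjugate model (standard normal prior, one Gaussian observation) for which \(\log Z\) is known, so we can verify that the bound is tight when \(q\) equals the exact posterior:

```python
import numpy as np
from scipy.stats import norm

def elbo_estimate(log_joint, sample_q, log_q, n_samples, rng):
    """Monte Carlo ELBO: (1/S) sum_s [log p(theta_s, D) - log q(theta_s)]."""
    thetas = sample_q(rng, n_samples)
    return np.mean(log_joint(thetas) - log_q(thetas))

# Toy model: theta ~ N(0, 1), y | theta ~ N(theta, 1), with y observed.
y = 0.5
log_joint = lambda th: norm.logpdf(th, 0.0, 1.0) + norm.logpdf(y, th, 1.0)

# The exact posterior here is N(y/2, 1/2); use it as q.
mu_q, sig_q = y / 2.0, np.sqrt(0.5)
sample_q = lambda rng, n: rng.normal(mu_q, sig_q, size=n)
log_q = lambda th: norm.logpdf(th, mu_q, sig_q)

rng = np.random.default_rng(0)
print(elbo_estimate(log_joint, sample_q, log_q, 50_000, rng))
print(norm.logpdf(y, 0.0, np.sqrt(2.0)))  # log Z: marginally y ~ N(0, 2)
```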

One of the nice things about the ELBO is that it has a neat interpretation. The ELBO is a sum of two terms:

  1. \(\mathbb{E}_{q(\theta)}[\log p(\theta, \mathcal{D})]\) is a measure of the expected model fit under the approximate posterior density.

  2. \(\mathbb{H}[q(\theta)]\) - the entropy of the approximate posterior - acts as a regularizer. The entropy of a distribution is a measure of how “diffuse” it is. In maximizing the entropy, we try to construct our posterior approximation such that it accounts for the maximum amount of uncertainty in the latent variables conditional on the observed data.

The two terms in the objective function \(\mathcal{L}(\phi)\) therefore encode a trade-off - in optimizing the ELBO we seek the best possible fit to the data without committing to any more certainty than the data supports (see the principle of maximum entropy for assigning probability distributions).

Another nice by-product of doing Bayesian inference by maximizing the ELBO is that we can perform approximate Bayesian model selection. Bayesian model selection relies on the estimation and comparison of the model evidence \(Z\) (or its log), and in VI the ELBO provides a lower-bound approximation to this quantity.

Automatic Differentiation Variational Inference (ADVI)#

In what follows, we will discuss a practical way of carrying out VI. The details can be found in the ADVI paper (Kucukelbir et al., 2017). Suppose you have put together the joint probability model \(p(\theta, \mathcal{D})\). The latent variables that have to be inferred are \(\theta = (\theta_1, \theta_2, \dots, \theta_M)\). Variational inference in generic probability models can become extremely tedious and complicated due to the fact that the individual \(\theta_i\)s may come from different probability spaces and have different supports. This means that the user must pose appropriate variational distributions for each \(\theta_i\) and derive gradients of the probability model \(p\) with respect to the individual latent variables separately. Furthermore, taking the gradient of the ELBO with respect to the variational parameters requires differentiating through a sampling procedure for approximating the data-fit term, \(\mathbb{E}_{q(\theta)}[\log p(\theta, \mathcal{D})] \approx \frac{1}{S} \sum_{s=1}^{S}\log p(\theta^{(s)}, \mathcal{D}), \ \theta^{(s)} \sim q(\theta)\). It turns out that naive estimators of the ELBO gradient obtained this way have very high variance. This high-variance problem is alleviated by means of the reparameterization trick (see Kingma and Welling's paper, Auto-Encoding Variational Bayes).
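
To see the reparameterization trick in action, here is a minimal sketch for the same toy conjugate model used above, with a single Gaussian variational distribution. Writing \(\theta = \mu + \sigma \epsilon\), \(\epsilon \sim N(0, 1)\), moves the variational parameters out of the sampling distribution, so low-variance pathwise gradients of the ELBO become available. The gradients are written out by hand here; ADVI obtains them by automatic differentiation:

```python
import numpy as np

# Toy conjugate model: theta ~ N(0, 1), y | theta ~ N(theta, 1).
y = 0.5
grad_log_joint = lambda th: -th + (y - th)  # d/dtheta log p(theta, D)

rng = np.random.default_rng(0)
mu, log_sig = 0.0, 0.0  # variational parameters; sigma = exp(log_sig)
lr, S = 0.05, 64        # step size and Monte Carlo batch size

for step in range(2000):
    eps = rng.standard_normal(S)
    sig = np.exp(log_sig)
    theta = mu + sig * eps             # reparameterized samples
    g = grad_log_joint(theta)          # pathwise gradients through theta
    mu += lr * np.mean(g)              # d ELBO / d mu
    # d ELBO / d log sigma; the +1.0 is the gradient of the Gaussian
    # entropy term, H[q] = log sigma + const.
    log_sig += lr * (np.mean(g * eps) * sig + 1.0)

# The exact posterior is N(0.25, sqrt(0.5)); the iterates should be close.
print(mu, np.exp(log_sig))
```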

To the greatest extent possible, we would like to automate the variational inference procedure, and for this we will explore the ADVI approach. ADVI requires the user to specify only two things:

  1. the joint probability model \(p(\theta, \mathcal{D})\), and,

  2. the dataset \(\mathcal{D}\).

How does ADVI work?

  1. First, ADVI transforms all latent variables, i.e., all \(\theta_i\)s, into new variables \(\zeta_i\)s by means of suitable invertible transformations, \(\zeta_i = \mathcal{T}_i(\theta_i)\), such that each \(\zeta_i\) has support on the entire real line (recall from our discussion on MCMC with PyMC3 that this transformation happened by default when specifying PyMC3 probability models).

  2. Now that all latent variables have the same support, ADVI proceeds to specify a common family of distributions over all latent variables. The usual choice is a multivariate Gaussian approximation:

\[ q_{\phi}(\theta) = \mathrm{MVN}(\theta| \mu , \Sigma), \]

where, \(\phi = \{ \mu, \Sigma \}\) denotes the variational parameters.

  3. The approximate posterior is further reparameterized in terms of a standard Gaussian to remove the dependence of the sampling procedure on \(\phi\).

  4. Finally, standard stochastic optimization techniques are used to obtain estimates of the variational parameters.
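
Putting everything together, here is a sketch of what this workflow looks like in PyMC3 (the model and data below are made up for illustration):

```python
import numpy as np
import pymc3 as pm

# Made-up data for illustration.
data = np.random.default_rng(0).normal(1.0, 0.5, size=50)

with pm.Model() as model:
    # sigma has positive support; ADVI transforms it to the real line.
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)
    y = pm.Normal("y", mu=mu, sigma=sigma, observed=data)

    # Fit a mean-field Gaussian approximation by maximizing the ELBO.
    approx = pm.fit(n=30_000, method="advi")

# Draw samples from q, mapped back to the original parameterization.
posterior_samples = approx.sample(2000)
```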