# Variational Inference
## References

- These notes.
- Variational Inference: A Review for Statisticians (Blei et al., 2017).
- Automatic Differentiation Variational Inference (Kucukelbir et al., 2017).
- Variational Inference with Normalizing Flows (Rezende and Mohamed, 2015).
The notes are not exhaustive. Variational inference represents the state of the art in Bayesian inference and it is still evolving. Please consult the papers above for more details.
Note: This document was originally developed by Dr. Rohit Tripathy.
## Bayesian Inference

### Quick Review
Once again, let’s begin with a review of Bayesian inference.
Our goal is to derive a probability distribution over unknown quantities (or latent variables), conditional on any observed data (i.e. a posterior distribution).
Without loss of generality, we denote all unknown quantities in our model by $\theta$.

We start with a description of our prior state of knowledge over $\theta$, the prior density $p(\theta)$.

We then specify a conditional probabilistic model that links the observed data $\mathcal{D}$ with the unknown quantities, the likelihood $p(\mathcal{D}|\theta)$.

The posterior distribution then follows from Bayes' rule:

$$
p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)\,p(\theta)}{p(\mathcal{D})}.
$$
In the Bayesian framework, predictions about unseen (test) data are posed as expectations over this posterior distribution:

$$
p(x^*|\mathcal{D}) = \mathbb{E}_{p(\theta|\mathcal{D})}\left[p(x^*|\theta)\right] = \int p(x^*|\theta)\,p(\theta|\mathcal{D})\,d\theta.
$$
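It helps to first see the rare case where all of these quantities are available in closed form. The sketch below uses a conjugate Beta-Bernoulli model with a made-up coin-flip dataset; the posterior and the posterior predictive are exact:

```python
import math

# Conjugate Beta-Bernoulli example (toy data, chosen for illustration):
# Prior: theta ~ Beta(a, b); Likelihood: x_i ~ Bernoulli(theta).
a, b = 2.0, 2.0
data = [1, 0, 1, 1, 0, 1, 1, 1]   # hypothetical observations
n_heads = sum(data)
n_tails = len(data) - n_heads

# Conjugacy: the posterior is again a Beta distribution.
a_post, b_post = a + n_heads, b + n_tails

# Posterior predictive P(x* = 1 | data) = E_posterior[theta]
p_next = a_post / (a_post + b_post)
print(round(p_next, 4))  # -> 0.6667
```

For almost any model richer than this, no such closed form exists, which is what motivates everything that follows.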
### What is the problem?
Unfortunately, as you already know, the posterior distribution is more often than not unavailable in closed form.
This is due to the intractability of the evidence (or marginal likelihood), i.e., the denominator in Bayes' rule,

$$
p(\mathcal{D}) = \int p(\mathcal{D}|\theta)\,p(\theta)\,d\theta,
$$

a high-dimensional integral with no closed form for most models of interest.
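To see the difficulty concretely, consider a one-dimensional non-conjugate model (invented for illustration): a Bernoulli likelihood with a Gaussian prior on the logit. The evidence integral has no closed form and must be computed numerically, which is only feasible here because the latent space is one-dimensional:

```python
import math

# Toy non-conjugate model: theta ~ N(0, 1), x_i ~ Bernoulli(sigmoid(theta)).
# The evidence p(D) = \int p(D|theta) p(theta) dtheta has no closed form.
data = [1, 1, 0, 1]   # hypothetical observations

def log_joint(theta):
    p = 1.0 / (1.0 + math.exp(-theta))
    log_lik = sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in data)
    log_prior = -0.5 * theta**2 - 0.5 * math.log(2 * math.pi)
    return log_lik + log_prior

# Trapezoidal quadrature over a wide grid -- workable in 1D,
# hopeless in the high-dimensional settings we actually care about.
grid = [-10 + 20 * i / 2000 for i in range(2001)]
h = grid[1] - grid[0]
vals = [math.exp(log_joint(t)) for t in grid]
evidence = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
print(evidence)
```

The cost of such grid-based quadrature grows exponentially with the number of latent variables, which is why we turn to approximations.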
### Approximating the posterior
There are several approaches to do this:

1. **Point estimate (MAP).** The posterior density $p(\theta|\mathcal{D})$ is approximated with a point mass density, i.e., $p(\theta|\mathcal{D}) \approx \delta(\theta - \theta_{\text{MAP}})$. This is the well-known maximum a posteriori (MAP) estimation procedure. The parameter $\theta_{\text{MAP}}$ is obtained as the solution of the optimization problem

$$
\theta_{\text{MAP}} = \arg\max_{\theta}\,\left[\log p(\mathcal{D}|\theta) + \log p(\theta)\right].
$$

The MAP approximation is often justified by the assumption that the true posterior distribution has a single, sharply peaked mode. In practice this approach often provides reasonable predictive accuracy, but it is unable to capture any of the epistemic uncertainty induced by limited data. We saw this very early in the class when we introduced basic supervised and unsupervised learning techniques.

2. **Particle approximation.** The posterior distribution is approximated with a finite number of particles, i.e.,

$$
p(\theta|\mathcal{D}) \approx \sum_{s=1}^{S} w_s\,\delta(\theta - \theta_s).
$$

The most popular class of techniques that approximates the posterior distribution this way is Markov chain Monte Carlo (MCMC). Recall that the general idea of MCMC is to construct a discrete-time, reversible, and ergodic Markov chain whose equilibrium distribution is the target posterior distribution. The goal is to simulate the Markov chain long enough that it enters its equilibrium phase (i.e., the target posterior density). Once this is accomplished, sampling from the Markov chain is the same as sampling from the target posterior density. Since MCMC samples (in theory) directly from the posterior, the weights of the approximation are simply uniform, $w_s = 1/S$. There are several other approaches to approximating probability densities with particle distributions, such as Sequential Monte Carlo (SMC) (which developed primarily as a tool for inferring latent variables in state-space models but can be used for general-purpose inference) and Stein Variational Gradient Descent (SVGD). We covered everything except SVGD in the previous lecture.

3. **Variational approximation.** Set up a parameterized family of densities $q_\phi(\theta)$ over the latent variables, and infer the parameters $\phi$ by solving an optimization problem of the form

$$
\phi^* = \arg\min_{\phi}\,\mathrm{KL}\left(q_\phi(\theta)\,\|\,p(\theta|\mathcal{D})\right),
$$

where

$$
\mathrm{KL}\left(q\,\|\,p\right) = \mathbb{E}_{q}\left[\log \frac{q(\theta)}{p(\theta|\mathcal{D})}\right].
$$
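To make the KL objective concrete, here is a small check (with made-up Gaussian parameters): for two univariate Gaussians the KL divergence has a closed form, and a Monte Carlo estimate of $\mathbb{E}_q[\log q - \log p]$ agrees with it:

```python
import math, random

# KL(q || p) between two 1D Gaussians: closed form vs. Monte Carlo.
# All parameter values below are illustrative.
mu_q, s_q = 0.5, 0.8
mu_p, s_p = 0.0, 1.0

# Closed form for KL(N(mu_q, s_q^2) || N(mu_p, s_p^2))
kl_exact = (math.log(s_p / s_q)
            + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5)

def log_normal(x, mu, s):
    return -0.5 * math.log(2 * math.pi * s * s) - (x - mu)**2 / (2 * s * s)

random.seed(0)
n = 200_000
# Monte Carlo: draw from q, average the log-probability ratio
kl_mc = sum(
    log_normal(z, mu_q, s_q) - log_normal(z, mu_p, s_p)
    for z in (random.gauss(mu_q, s_q) for _ in range(n))
) / n
print(kl_exact, kl_mc)
```

Note that the expectation is taken under $q$, which is what makes this quantity computable by sampling from the approximation alone.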
## Variational Inference
Different VI procedures are obtained based on different choices of the approximating family $\mathcal{Q} = \{q_\phi(\theta) : \phi \in \Phi\}$.

The KL divergence is always non-negative, i.e., $\mathrm{KL}(q\,\|\,p) \ge 0$, with equality if and only if $q = p$ (almost everywhere).
This brings us to the question of how to choose the approximating family:

- If we know that a latent variable has restricted support (the positive reals, for instance), we pick $q_\phi$ such that it has support on the same set only.
- We would also like $q_\phi$ to be easy to sample from and to have an easy-to-evaluate log probability, since the variational objective requires computing an expectation over log-probability ratios. A common simplifying assumption that enables easier sampling and log-probability computation is the *mean-field* assumption, i.e., setting up the approximation such that the individual latent variables are independent. If $\theta = (\theta_1, \ldots, \theta_d)$ is the vector of latent variables, the mean-field assumption implies an approximation of the form

$$
q_\phi(\theta) = \prod_{i=1}^{d} q_{\phi_i}(\theta_i),
$$

where $\phi = (\phi_1, \ldots, \phi_d)$ and each factor $q_{\phi_i}$ is a univariate density.
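A minimal sketch of a mean-field Gaussian approximation (parameter values invented): both sampling and log-density evaluation factorize across dimensions, which is exactly what makes the assumption convenient:

```python
import math, random

# Mean-field Gaussian over d = 3 latent variables: q(theta) = prod_i N(mu_i, s_i^2).
# Parameter values are illustrative only.
mus    = [0.0, 1.0, -2.0]
sigmas = [1.0, 0.5, 2.0]

def sample_q():
    # Each coordinate is sampled independently under the mean-field assumption.
    return [random.gauss(m, s) for m, s in zip(mus, sigmas)]

def log_q(theta):
    # log q(theta) = sum_i log q_i(theta_i)
    return sum(
        -0.5 * math.log(2 * math.pi * s * s) - (t - m)**2 / (2 * s * s)
        for t, m, s in zip(theta, mus, sigmas)
    )

random.seed(0)
theta = sample_q()
print(theta, log_q(theta))
```

The price of this convenience is that the approximation cannot represent posterior correlations between the latent variables.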
## Evidence Lower Bound (ELBO)
So, to recap, the generic VI strategy is to pose a suitable parameterized family of densities $q_\phi(\theta)$ and to minimize the KL divergence $\mathrm{KL}(q_\phi(\theta)\,\|\,p(\theta|\mathcal{D}))$.

We cannot actually optimize the KL divergence directly because of its dependence on the true posterior. Instead, we maximize an equivalent objective, the *evidence lower bound* (ELBO):

$$
\mathcal{L}(\phi) = \mathbb{E}_{q_\phi}\left[\log p(\theta, \mathcal{D})\right] - \mathbb{E}_{q_\phi}\left[\log q_\phi(\theta)\right].
$$

Maximizing the ELBO is equivalent to minimizing the KL divergence between $q_\phi(\theta)$ and $p(\theta|\mathcal{D})$.

Proof: Expanding the KL divergence and applying Bayes' rule,

$$
\mathrm{KL}\left(q_\phi\,\|\,p\right) = \mathbb{E}_{q_\phi}\left[\log q_\phi(\theta)\right] - \mathbb{E}_{q_\phi}\left[\log p(\theta|\mathcal{D})\right] = \mathbb{E}_{q_\phi}\left[\log q_\phi(\theta)\right] - \mathbb{E}_{q_\phi}\left[\log p(\theta, \mathcal{D})\right] + \log p(\mathcal{D}).
$$

Therefore,

$$
\log p(\mathcal{D}) = \mathcal{L}(\phi) + \mathrm{KL}\left(q_\phi\,\|\,p\right).
$$

We see that the log evidence (which is a constant with respect to $\phi$) is the sum of the objective function $\mathcal{L}(\phi)$ and the KL divergence. Since the KL divergence is non-negative, $\mathcal{L}(\phi) \le \log p(\mathcal{D})$ (hence "evidence *lower bound*"), and maximizing $\mathcal{L}(\phi)$ is the same as minimizing the KL divergence.
One of the nice things about the ELBO is that it has a neat interpretation. The ELBO is a sum of two terms:

- $\mathbb{E}_{q_\phi}\left[\log p(\theta, \mathcal{D})\right]$ is a measure of the expected model fit under the approximate posterior density.
- $-\mathbb{E}_{q_\phi}\left[\log q_\phi(\theta)\right] = \mathbb{H}\left[q_\phi\right]$, the entropy of the approximate posterior, acts as a regularizer. The entropy of a distribution is a measure of how "diffuse" it is. In maximizing the entropy, we try to construct our posterior approximation such that it accounts for the maximum amount of uncertainty in the latent variables conditional on the observed data.

The two terms in the objective function thus balance fitting the observed data against retaining uncertainty about the latent variables.
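As a numerical sanity check of the decomposition of the log evidence, consider a toy conjugate Gaussian model (numbers invented) where the evidence is known exactly. Evaluating a Monte Carlo ELBO at the exact posterior recovers the log evidence, while any other choice of $q$ gives a strictly smaller value:

```python
import math, random

# Toy conjugate model: theta ~ N(0, 1), x | theta ~ N(theta, 1), one datum.
x = 1.3

# Marginally x ~ N(0, 2), so the log evidence is exact.
log_evidence = -0.5 * math.log(2 * math.pi * 2.0) - x**2 / (2 * 2.0)

def log_joint(theta):
    # log p(theta) + log p(x | theta)
    return (-0.5 * math.log(2 * math.pi) - theta**2 / 2
            - 0.5 * math.log(2 * math.pi) - (x - theta)**2 / 2)

def elbo(mu, s, n=10_000):
    # Monte Carlo ELBO: E_q[log p(theta, D) - log q(theta)], q = N(mu, s^2)
    total = 0.0
    for _ in range(n):
        t = random.gauss(mu, s)
        log_q = (-0.5 * math.log(2 * math.pi * s * s)
                 - (t - mu)**2 / (2 * s * s))
        total += log_joint(t) - log_q
    return total / n

random.seed(0)
# The exact posterior here is N(x/2, 1/2); at that q the ELBO equals
# the log evidence, and the Monte Carlo estimator has zero variance.
print(elbo(x / 2, math.sqrt(0.5)), elbo(0.0, 1.0), log_evidence)
```

Note the zero-variance property at the optimum: when $q$ equals the posterior, every sample of the log-ratio equals $\log p(\mathcal{D})$ exactly.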
Another nice by-product of doing Bayesian inference by maximizing the ELBO is that we can perform (approximate) Bayesian model selection. Bayesian model selection relies on the estimation and comparison of the model evidence $p(\mathcal{D})$ across competing models; since the ELBO is a lower bound on $\log p(\mathcal{D})$, it provides a tractable surrogate for this comparison.
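As a small illustration of evidence-based model selection, here is a conjugate model (toy numbers) where the evidence is exact rather than an ELBO estimate; note how a needlessly diffuse prior is penalized:

```python
import math

# Model: x | theta ~ N(theta, 1), prior theta ~ N(0, tau2).
# Marginally x ~ N(0, 1 + tau2), so the evidence is exact.
x = 1.3  # hypothetical single observation

def log_evidence(tau2):
    var = 1.0 + tau2
    return -0.5 * math.log(2 * math.pi * var) - x**2 / (2 * var)

# Compare a reasonably tight prior against a very diffuse one:
# the diffuse model spreads its probability mass too thin and loses.
print(log_evidence(1.0), log_evidence(100.0))
```

This automatic penalty on unnecessarily complex (diffuse) models is exactly what evidence-based model comparison buys us.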
## Automatic Differentiation Variational Inference (ADVI)
In what follows, we will discuss a practical way of carrying out VI.
The details can be found in this paper.
Suppose you have put together the joint probability model $p(\theta, \mathcal{D})$ of the latent variables and the data.

To the greatest extent possible, we would like to automate the variational inference procedure, and for this we will explore the ADVI approach to variational inference. ADVI requires the user to specify only two things:

- the joint probability model $p(\theta, \mathcal{D})$, and
- the dataset $\mathcal{D}$.
How does ADVI work?
- First, ADVI transforms all latent variables $\theta$ into new variables $\zeta$ by means of a suitable invertible transformation $T$, i.e., $\zeta = T(\theta)$, such that $\zeta$ has support on the entire real space (recall from our discussion on MCMC with `PyMC3` that this transformation happened by default when specifying `PyMC3` probability models).
- Now that all latent variables have the same (unconstrained) support, ADVI proceeds to specify a common family of distributions over all latent variables. The usual choice is a multivariate Gaussian approximation,

$$
q_\phi(\zeta) = \mathcal{N}\left(\zeta \,\middle|\, \mu, \Sigma\right),
$$

where $\phi = (\mu, \Sigma)$; the mean-field variant restricts $\Sigma$ to be diagonal.
- The approximate posterior is further reparameterized in terms of a standard Gaussian, $\zeta = \mu + L\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$ and $LL^{\top} = \Sigma$, to remove the dependence of the sampling procedure on $\phi$.
- Finally, ADVI uses standard stochastic optimization techniques to obtain estimates of the variational parameters.
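The steps above can be sketched in a few lines. The following is a minimal, illustrative ADVI-style loop in pure Python (not the actual ADVI implementation): it fits a Gaussian $q$ to the posterior of a toy conjugate model via stochastic gradient ascent on the reparameterized ELBO, with hand-coded gradients standing in for automatic differentiation. The latent variable is already unconstrained, so the transformation step is the identity here:

```python
import math, random

# Toy model: theta ~ N(0, 1), x | theta ~ N(theta, 1), observed x = 1.3.
# Exact posterior: N(0.65, 0.5). Variational family: q = N(mu, s^2),
# with s parameterized as exp(log_s) to keep it positive.
x = 1.3
random.seed(0)

mu, log_s = 0.0, 0.0          # variational parameters (unconstrained)
lr = 0.01
mu_avg = s_avg = 0.0

for step in range(6000):
    s = math.exp(log_s)
    eps = random.gauss(0.0, 1.0)
    theta = mu + s * eps                    # reparameterization trick
    dlogp = x - 2.0 * theta                 # d/dtheta log p(theta, x) for this model
    mu += lr * dlogp                        # pathwise ELBO gradient w.r.t. mu
    log_s += lr * (dlogp * eps * s + 1.0)   # pathwise gradient + entropy term
    if step >= 4000:                        # average late iterates to tame noise
        mu_avg += mu / 2000.0
        s_avg += math.exp(log_s) / 2000.0

# The averages should land near the exact posterior mean 0.65
# and standard deviation sqrt(0.5) ~ 0.707.
print(mu_avg, s_avg)
```

In a real ADVI implementation, the gradient `dlogp` is produced by automatic differentiation of the user's model, and the optimizer is typically an adaptive method rather than plain SGD.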