Basics of Variational Inference#
Our problem is to approximate the posterior distribution of the parameters of a model given some observed data. As usual, the parameters are \(x\) with prior \(p(x)\) and the data are \(y\) with likelihood \(p(y|x)\).
The posterior is given by Bayes’ theorem:
\[ p(x|y) = \frac{p(y|x)p(x)}{p(y)} = \frac{p(x,y)}{p(y)}. \]
Here, for later use, we defined the joint distribution \(p(x,y) = p(y|x)p(x)\) and the marginal distribution, or evidence, \(p(y) = \int p(x,y) dx\).
The idea in variational inference (VI) is to approximate the posterior \(p(x|y)\) with a simpler distribution \(q_\phi(x)\) that depends on some parameters \(\phi\). We often call \(\phi\) the variational parameters and the distribution \(q_\phi(x)\) the variational distribution or the guide.
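To identify the best parameters \(\phi\) for the guide, we want to minimize some sort of distance between the posterior and the guide. We use the KL divergence for this purpose:
\[ \text{KL}(q_\phi(x) || p(x|y)) = \int q_\phi(x) \log \frac{q_\phi(x)}{p(x|y)} dx = \mathbb{E}_{q_\phi(x)}\left[\log \frac{q_\phi(x)}{p(x|y)}\right]. \]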
We should mention that the KL divergence is not a distance, but a divergence. It is not symmetric and does not satisfy the triangle inequality.
So, the problem that VI solves is:
\[ \phi^* = \arg\min_\phi \text{KL}(q_\phi(x) || p(x|y)). \]
Solving this problem is easier said than done. There are a lot of issues to resolve. We will need to come up with some good choices for the guide. We will have to show that the optimization problem does indeed make sense. That is, that the KL divergence has a minimum and that if we achieve it we do get closer to the posterior. Finally, we will have to come up with a scalable algorithm that actually converges to the minimum.
On the choice of the guide#
Guide example 1: Gaussian with diagonal covariance#
One of the simplest guides we can use is a Gaussian distribution with a diagonal covariance matrix:
\[ q_\phi(x) = \mathcal{N}(x | \mu, \text{diag}(\sigma^2)). \]
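When implementing this guide, we prefer to work with unconstrained parameters. Since \(\sigma\) must be positive, we express it as a positive function of an unconstrained parameter, for example through its logarithm.
The variational parameters are then \(\mu\) together with the unconstrained parameters that determine \(\sigma\).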
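As a minimal sketch of a diagonal Gaussian guide, the Python below stores the standard deviation through its logarithm. That is one common unconstrained choice and an assumption here, not something fixed by the text. The names `log_sigma`, `sample_diag_gaussian`, and `log_density_diag_gaussian` are illustrative.

```python
import math
import random

def sample_diag_gaussian(mu, log_sigma, rng):
    """Draw one sample from N(mu, diag(sigma^2)) with sigma = exp(log_sigma)."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_sigma)]

def log_density_diag_gaussian(x, mu, log_sigma):
    """Log density of N(mu, diag(sigma^2)) at x; the dimensions are independent."""
    total = 0.0
    for xi, m, ls in zip(x, mu, log_sigma):
        z = (xi - m) / math.exp(ls)
        total += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
    return total

rng = random.Random(0)
mu = [0.0, 1.0]            # unconstrained
log_sigma = [0.0, -1.0]    # unconstrained; sigma = exp(log_sigma) is always positive
x = sample_diag_gaussian(mu, log_sigma, rng)
print(x, log_density_diag_gaussian(x, mu, log_sigma))
```

Any real value of `log_sigma` gives a valid guide, so an optimizer can update it without constraints.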
Guide example 2: Gaussian with low-rank covariance#
We can extend the previous example to a Gaussian with a low-rank covariance matrix:
where the covariance matrix has a low-rank structure:
Here, \(k\) is much smaller than the dimension of \(x\). The variational parameters are:
Guide example 3: Gaussian with full covariance#
For a more flexible guide, we can use a Gaussian with a full covariance matrix:
\[ q_\phi(x) = \mathcal{N}(x | \mu, \Sigma). \]
To ensure that \(\Sigma\) is positive definite, we parameterize it using a Cholesky decomposition:
\[ \Sigma = L L^T, \]
where \(L\) is a lower triangular matrix:
The diagonal entries of \(L\) are parameterized as exponentials of unconstrained parameters to ensure they’re positive, while the \(d(d-1)/2\) entries below the diagonal (marked with *) are unconstrained. We’ll denote these unconstrained entries as \(u\). The variational parameters are \(\mu\), the unconstrained parameters of the diagonal of \(L\), and \(u\).
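The Cholesky parameterization can be sketched in a few lines of Python. The function names and the argument `log_diag` (the unconstrained parameters whose exponentials form the diagonal) are hypothetical, and the strictly lower entries `u` are filled row by row, which is one possible ordering.

```python
import math

def build_cholesky(log_diag, u):
    """Build a lower triangular L with exp(log_diag) on the diagonal and u below it."""
    d = len(log_diag)
    if len(u) != d * (d - 1) // 2:
        raise ValueError("u must have d(d-1)/2 entries")
    L = [[0.0] * d for _ in range(d)]
    k = 0
    for i in range(d):
        for j in range(i):          # strictly lower entries, row by row
            L[i][j] = u[k]
            k += 1
        L[i][i] = math.exp(log_diag[i])  # positive diagonal
    return L

def covariance(L):
    """Return Sigma = L L^T."""
    d = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(d)) for j in range(d)]
            for i in range(d)]

L = build_cholesky([0.0, 0.0], [2.0])
Sigma = covariance(L)
print(L, Sigma)
```

Because the diagonal of `L` is strictly positive, `Sigma` is positive definite for every choice of the unconstrained inputs.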
Transformed Gaussian guides#
Sometimes, the parameter space has constraints that make a direct Gaussian approximation inappropriate. In these cases, we can use a transformation approach:
- Define a one-to-one transformation \(T\) of the parameters 
- Define the transformed parameters: \(z = T(x)\) 
- Put a Gaussian guide on \(z\): \(\tilde{q}_\phi(z) = \mathcal{N}(z|\mu, \Sigma)\) 
The guide for \(x\) is then:
\[ q_\phi(x) = \tilde{q}_\phi(T(x)) \left|\det J_T(x)\right|, \]
where \(J_T(x) = \frac{\partial z}{\partial x}\) is the Jacobian of the transformation.
Example 4: Guide for positive variables#
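If \(x\) is a scalar and must be positive, we can use the log transformation:
\[ z = T(x) = \log x. \]
The Jacobian is:
\[ J_T(x) = \frac{dz}{dx} = \frac{1}{x}. \]
The guide for \(x\) is then:
\[ q_\phi(x) = \mathcal{N}(\log x | \mu, \sigma^2) \frac{1}{x}. \]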
Many probabilistic programming frameworks like Pyro handle these transformations automatically.
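To make the change of variables concrete, here is a small sketch that assumes the log transformation \(z = \log x\), whose Jacobian is \(1/x\). It evaluates the resulting density for \(x\) and checks numerically that it integrates to one. The function names and the midpoint-rule integrator are illustrative choices.

```python
import math

def normal_pdf(z, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at z."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def guide_pdf_positive(x, mu, sigma):
    """q(x) = N(log x | mu, sigma^2) * (1 / x) for x > 0 (assumes z = log x)."""
    return normal_pdf(math.log(x), mu, sigma) / x

def midpoint_integral(f, a, b, n=100_000):
    """Midpoint rule; the midpoints never touch the endpoint x = 0."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# The Jacobian factor 1/x is what keeps the density normalized.
mass = midpoint_integral(lambda x: guide_pdf_positive(x, 0.0, 0.5), 0.0, 50.0)
print(mass)
```

Dropping the `1 / x` factor would give a function that no longer integrates to one, which is a quick way to see why the Jacobian is needed.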
Example 5: Guide for variables in [0,1]#
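If \(x\) is a scalar constrained to the interval \([0, 1]\), we can use the logit transformation:
\[ z = T(x) = \log \frac{x}{1 - x}. \]
The Jacobian is:
\[ J_T(x) = \frac{dz}{dx} = \frac{1}{x(1 - x)}. \]
The guide for \(x\) is then:
\[ q_\phi(x) = \mathcal{N}\left(\log \frac{x}{1 - x} \,\Big|\, \mu, \sigma^2\right) \frac{1}{x(1 - x)}. \]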
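A sketch of the logit-transformed guide follows. Sampling goes through the inverse transformation (the sigmoid), and the density uses the Jacobian \(1/(x(1-x))\) of the logit. The helper names and the specific values of \(\mu\) and \(\sigma\) are assumptions made for illustration.

```python
import math
import random

def logit(x):
    return math.log(x / (1.0 - x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def guide_pdf_unit(x, mu, sigma):
    """q(x) = N(logit(x) | mu, sigma^2) / (x (1 - x)) for 0 < x < 1."""
    z = logit(x)
    gauss = math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    return gauss / (x * (1.0 - x))

def sample_guide_unit(mu, sigma, rng):
    """Sample z ~ N(mu, sigma^2), then map back with x = sigmoid(z)."""
    return sigmoid(rng.gauss(mu, sigma))

rng = random.Random(0)
samples = [sample_guide_unit(0.5, 1.0, rng) for _ in range(1000)]

# Midpoint rule on (0, 1) to check that the density is normalized.
n = 100_000
mass = sum(guide_pdf_unit((i + 0.5) / n, 0.5, 1.0) for i in range(n)) / n
print(min(samples), max(samples), mass)
```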
Example 6: Non-Gaussian guides#
The guide doesn’t have to be Gaussian. We can choose distributions that match the constraints of our parameters:
For a positive scalar \(x\), we might use a Gamma distribution:
\[ q_\phi(x) = \text{Gamma}(x | \alpha, \beta), \]
with variational parameters \(\phi = (\alpha, \beta)\).
For a scalar \(x\) between 0 and 1, a Beta distribution might be appropriate:
\[ q_\phi(x) = \text{Beta}(x | \alpha, \beta), \]
with variational parameters \(\phi = (\alpha, \beta)\).
Example 7: Composite guides#
For multivariate parameters with different constraints, we can combine different distributions:
If \(x = (x_1, x_2)\) with \(x_1\) positive and \(x_2\) between 0 and 1, we might use:
\[ q_\phi(x) = \text{Gamma}(x_1 | \alpha_1, \beta_1) \, \text{Beta}(x_2 | \alpha_2, \beta_2), \]
with variational parameters \(\phi = (\alpha_1, \beta_1, \alpha_2, \beta_2)\).
Structured guides#
The guide can be adapted to match the structure of the model:
- If the model has a hierarchical structure, the guide can have the same structure 
- This approach can capture dependencies between parameters more effectively 
- We’ll explore this further when discussing hierarchical models 
The optimization problem to fit the guide#
The “distance” between the guide and the posterior#
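We need a notion of distance between the guide and the posterior. We use the Kullback-Leibler (KL) divergence:
\[ \text{KL}(q_\phi(x) || p(x|y)) = \mathbb{E}_{q_\phi(x)}\left[\log \frac{q_\phi(x)}{p(x|y)}\right] = \int q_\phi(x) \log \frac{q_\phi(x)}{p(x|y)} dx. \]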
It measures how much information is lost when we use \(q_\phi(x)\) to approximate \(p(x|y)\).
Some basic properties of the KL divergence#
The KL divergence has several important properties:
- It is non-negative: \(\text{KL}(q_\phi(x) || p(x|y)) \geq 0\) 
- It is zero if and only if \(q_\phi(x) = p(x|y)\) 
- But it is not a proper distance, because it is not symmetric: \(\text{KL}(q_\phi(x) || p(x|y)) \neq \text{KL}(p(x|y) || q_\phi(x))\) 
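These properties are easy to check numerically for two univariate Gaussians, where the KL divergence has a well-known closed form. The sketch below uses that closed form; the function name `kl_gauss` and the particular means and standard deviations are illustrative.

```python
import math

def kl_gauss(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ) in closed form."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2.0 * s2 ** 2) - 0.5

kl_qp = kl_gauss(0.0, 1.0, 1.0, 2.0)   # KL(q || p)
kl_pq = kl_gauss(1.0, 2.0, 0.0, 1.0)   # KL(p || q)
kl_qq = kl_gauss(0.0, 1.0, 0.0, 1.0)   # KL(q || q)

print(kl_qp, kl_pq, kl_qq)
```

Both cross terms are positive but differ from each other, and the divergence of a distribution from itself is zero.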
Proof of the KL properties#
Recall Jensen’s inequality:
\[ f(\mathbb{E}[x]) \leq \mathbb{E}[f(x)], \]
if \(f\) is convex. If \(f\) is strictly convex, then equality holds if and only if \(x\) is a constant.
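Equipped with Jensen’s inequality, we can prove that the KL divergence is non-negative. Here is the proof:
\[ \text{KL}(q_\phi(x) || p(x|y)) = \mathbb{E}_{q_\phi(x)}\left[-\log \frac{p(x|y)}{q_\phi(x)}\right] \geq -\log \mathbb{E}_{q_\phi(x)}\left[\frac{p(x|y)}{q_\phi(x)}\right] = -\log \int p(x|y) dx = -\log 1 = 0. \]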
We used Jensen’s inequality with the convex function \(f(x) = -\log x\) and the fact that \(p(x|y)\) is a probability distribution and thus integrates to 1.
The other property of the KL divergence is that it is zero if and only if \(q_\phi(x) = p(x|y)\). We can see this from the fact that \(f(x) = -\log x\) is strictly convex and thus Jensen’s inequality is an equality if and only if the ratio \(\frac{p(x|y)}{q_\phi(x)}\) is constant. This means that the two terms are equal up to a constant factor. But the constant factor has to be one because they are both probability distributions.
Derivation of the evidence lower bound (ELBO)#
In practice, we don’t minimize the KL divergence directly. Instead, we maximize the evidence lower bound (ELBO). Minimizing the KL divergence is equivalent to maximizing the ELBO.
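Start with the fact that the KL divergence is non-negative:
\[ \text{KL}(q_\phi(x) || p(x|y)) \geq 0. \]
Use the definition of the KL divergence and the fact that \(p(x|y) = \frac{p(x,y)}{p(y)}\):
\[ \text{KL}(q_\phi(x) || p(x|y)) = \mathbb{E}_{q_\phi(x)}\left[\log \frac{q_\phi(x)}{p(x|y)}\right] = \mathbb{E}_{q_\phi(x)}[\log q_\phi(x)] - \mathbb{E}_{q_\phi(x)}[\log p(x,y)] + \log p(y). \]
Since this is non-negative, we can rearrange it to get:
\[ \log p(y) \geq \mathbb{E}_{q_\phi(x)}[\log p(x,y)] - \mathbb{E}_{q_\phi(x)}[\log q_\phi(x)] =: \text{ELBO}(\phi). \]
From the long equation above, we see that the KL divergence and the ELBO are related by:
\[ \log p(y) = \text{ELBO}(\phi) + \text{KL}(q_\phi(x) || p(x|y)). \]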
Now suppose that you are maximizing the ELBO:
\[ \max_\phi \text{ELBO}(\phi). \]
What would this do to the KL divergence? The log evidence \(\log p(y)\) does not depend on \(\phi\), and the gap between it and the ELBO is exactly the KL divergence. Pushing the ELBO up closes that gap, which reduces the KL divergence. So, indeed, maximizing the ELBO minimizes the KL divergence.
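The relation between the log evidence, the ELBO, and the KL divergence can be checked on a toy model where everything is available in closed form. The sketch below assumes a conjugate model chosen purely for illustration: prior \(x \sim \mathcal{N}(0, 1)\), likelihood \(y | x \sim \mathcal{N}(x, 1)\), and guide \(q_\phi(x) = \mathcal{N}(x | m, s^2)\). For this model the posterior is \(\mathcal{N}(y/2, 1/2)\) and the evidence is \(\mathcal{N}(y | 0, 2)\). All function names are hypothetical.

```python
import math

def log_evidence(y):
    """log p(y) with p(y) = N(y | 0, 2) for the toy model."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - y ** 2 / 4.0

def elbo(m, s, y):
    """E_q[log p(x, y)] - E_q[log q(x)] for q = N(m, s^2), in closed form."""
    e_log_prior = -0.5 * math.log(2.0 * math.pi) - 0.5 * (m ** 2 + s ** 2)
    e_log_lik = -0.5 * math.log(2.0 * math.pi) - 0.5 * ((y - m) ** 2 + s ** 2)
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * s ** 2)
    return e_log_prior + e_log_lik + entropy

def kl_to_posterior(m, s, y):
    """KL( N(m, s^2) || N(y/2, 1/2) ), the divergence from the exact posterior."""
    post_mean, post_std = y / 2.0, math.sqrt(0.5)
    return (math.log(post_std / s)
            + (s ** 2 + (m - post_mean) ** 2) / (2.0 * post_std ** 2) - 0.5)

y = 1.3
gap_bad = log_evidence(y) - elbo(0.0, 1.0, y)                  # a poor guide
gap_best = log_evidence(y) - elbo(y / 2.0, math.sqrt(0.5), y)  # the exact posterior
print(gap_bad, kl_to_posterior(0.0, 1.0, y), gap_best)
```

The gap for the poor guide matches its KL divergence from the posterior, and the gap vanishes when the guide equals the posterior.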
The reparameterization trick#
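To optimize the ELBO with respect to the variational parameters, we will need gradients like:
\[ \nabla_\phi \mathbb{E}_{q_\phi(x)}[\log p(x,y)] = \int \nabla_\phi q_\phi(x) \log p(x,y) dx. \]
But there is a problem: the right-hand side is not an expectation over \(q_\phi(x)\), because the gradient acts on the density itself. So we cannot use standard Monte Carlo methods to estimate the gradient.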
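To overcome this, we use the reparameterization trick. This is an idea found in Kingma and Welling, 2014. The idea is to express the variable \(x\) as a deterministic function of a random variable \(\epsilon\) drawn from a fixed distribution (without parameters). Like this:
\[ x = g_\phi(\epsilon), \]
where \(\epsilon \sim p(\epsilon)\), a fixed distribution, and \(g_\phi\) is a one-to-one transformation. Then we can express the expectation over \(q_\phi(x)\) as an expectation over \(p(\epsilon)\):
\[ \mathbb{E}_{q_\phi(x)}[\log p(x,y)] = \mathbb{E}_{p(\epsilon)}[\log p(g_\phi(\epsilon), y)]. \]
Now, there is no problem taking the gradient:
\[ \nabla_\phi \mathbb{E}_{q_\phi(x)}[\log p(x,y)] = \mathbb{E}_{p(\epsilon)}[\nabla_\phi \log p(g_\phi(\epsilon), y)]. \]
And we can easily construct a sampling average approximation:
\[ \nabla_\phi \mathbb{E}_{q_\phi(x)}[\log p(x,y)] \approx \frac{1}{S} \sum_{s=1}^S \nabla_\phi \log p(g_\phi(\epsilon_s), y), \quad \epsilon_s \sim p(\epsilon). \]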
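A small numerical sketch of the reparameterized gradient estimator follows. To keep the answer checkable, it replaces the log joint by the toy integrand \(f(x) = x^2\) (an assumption made only for this illustration) with \(x = \mu + \sigma \epsilon\) and \(\epsilon \sim \mathcal{N}(0, 1)\). Since \(\mathbb{E}[x^2] = \mu^2 + \sigma^2\), the exact gradient is \((2\mu, 2\sigma)\).

```python
import random

def reparam_grad_estimate(mu, sigma, num_samples, rng):
    """Estimate the gradient of E_{x ~ N(mu, sigma^2)}[x^2] w.r.t. (mu, sigma)."""
    grad_mu, grad_sigma = 0.0, 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # eps ~ p(eps) = N(0, 1), no parameters
        x = mu + sigma * eps        # x = g_phi(eps)
        df_dx = 2.0 * x             # derivative of f(x) = x^2
        grad_mu += df_dx * 1.0      # chain rule with dx/dmu = 1
        grad_sigma += df_dx * eps   # chain rule with dx/dsigma = eps
    return grad_mu / num_samples, grad_sigma / num_samples

rng = random.Random(0)
g_mu, g_sigma = reparam_grad_estimate(1.0, 0.5, 20_000, rng)
print(g_mu, g_sigma)   # should be close to the exact values 2.0 and 1.0
```

In practice an automatic differentiation library would supply the chain-rule terms that are written out by hand here.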
ELBO maximization as a stochastic optimization problem#
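We like minimizing things instead of maximizing things. So we define a “loss” function:
\[ \mathcal{L}(\phi) = -\text{ELBO}(\phi) = -\mathbb{E}_{q_\phi(x)}[\log p(x,y)] + \mathbb{E}_{q_\phi(x)}[\log q_\phi(x)]. \]
Use the reparameterization trick to approximate the expectations over \(q_\phi(x)\) by expectations over \(p(\epsilon)\). Then construct a sampling average approximation of the loss:
\[ \mathcal{L}(\phi) \approx -\frac{1}{S} \sum_{s=1}^S \left[\log p(g_\phi(\epsilon_s), y) - \log q_\phi(g_\phi(\epsilon_s))\right], \]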
where \(\epsilon_s \sim p(\epsilon)\) independently. At this point, we can use any standard stochastic optimization method. We can use Adam, for example.
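Here is a minimal end-to-end sketch of this stochastic optimization, under several assumptions made for illustration: the toy model is \(x \sim \mathcal{N}(0, 1)\), \(y | x \sim \mathcal{N}(x, 1)\); the guide is \(\mathcal{N}(\mu, \sigma^2)\) with \(\sigma = e^{\rho}\); gradients are written by hand; and plain stochastic gradient descent with a decaying step size stands in for Adam. The exact posterior is \(\mathcal{N}(y/2, 1/2)\), so the result can be checked.

```python
import math
import random

def fit_guide(y, num_steps=3000, num_samples=10, seed=0):
    """Minimize the negative ELBO for the toy model with reparameterized gradients."""
    rng = random.Random(seed)
    mu, rho = 0.0, 0.0                      # unconstrained variational parameters
    for t in range(num_steps):
        sigma = math.exp(rho)
        g_mu, g_rho = 0.0, 0.0
        for _ in range(num_samples):
            eps = rng.gauss(0.0, 1.0)
            x = mu + sigma * eps            # reparameterized sample
            dlogp_dx = y - 2.0 * x          # d/dx log p(x, y) for this model
            g_mu += -dlogp_dx               # dx/dmu = 1
            g_rho += -dlogp_dx * sigma * eps  # dx/drho = sigma * eps
        g_mu /= num_samples
        # The entropy of q equals rho plus a constant, so it contributes -1 to the loss gradient.
        g_rho = g_rho / num_samples - 1.0
        step = 0.1 / (1.0 + 0.01 * t)       # decaying step size
        mu -= step * g_mu
        rho -= step * g_rho
    return mu, math.exp(rho)

mu_hat, sigma_hat = fit_guide(y=1.3)
print(mu_hat, sigma_hat)   # exact posterior: mean 0.65, std sqrt(0.5)
```

The entropy term is handled in closed form here, which is a variance-reduction choice; sampling \(\log q_\phi\) as in the loss above gives the same gradient in expectation.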
Let’s go over some specific examples of how to apply the reparameterization trick.
Example 1: The reparameterization trick for a univariate Gaussian guide#
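Suppose
\[ q_\phi(x) = \mathcal{N}(x | \mu, \sigma^2), \]
with variational parameters \(\phi\) that determine \(\mu\) and \(\sigma > 0\).
We can take:
\[ x = g_\phi(\epsilon) = \mu + \sigma \epsilon, \]
where \(\epsilon \sim \mathcal{N}(0, 1)\).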
Example 2: The reparameterization trick for a multivariate Gaussian guide#
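Suppose
\[ q_\phi(x) = \tilde{q}_\phi(T(x)) \left|\det J_T(x)\right|, \]
where
\[ \tilde{q}_\phi(z) = \mathcal{N}(z | \mu, L L^T), \]
\(T\) is a one-to-one transformation, and \(J_T(x)\) is the Jacobian.
Then we can take:
\[ z = \mu + L \epsilon, \]
and thus:
\[ x = g_\phi(\epsilon) = T^{-1}(\mu + L \epsilon), \]
where \(\epsilon \sim \mathcal{N}(0, I)\).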
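As a sketch of this multivariate case, the code below draws \(\epsilon \sim \mathcal{N}(0, I)\), forms \(z = \mu + L\epsilon\) with a lower triangular \(L\), and maps back with \(T^{-1}\). The choice of an elementwise exponential for \(T^{-1}\) (that is, \(T = \log\), suitable for positive variables) and the specific values of `mu` and `L` are assumptions made for illustration.

```python
import math
import random

def sample_transformed_guide(mu, L, inverse_transform, rng):
    """Return x = T^{-1}(mu + L eps) with eps ~ N(0, I) and L lower triangular."""
    d = len(mu)
    eps = [rng.gauss(0.0, 1.0) for _ in range(d)]
    # Only entries j <= i of row i are nonzero in a lower triangular matrix.
    z = [mu[i] + sum(L[i][j] * eps[j] for j in range(i + 1)) for i in range(d)]
    return [inverse_transform(zi) for zi in z]

rng = random.Random(0)
mu = [0.0, 1.0]
L = [[1.0, 0.0],
     [0.5, 0.3]]
draws = [sample_transformed_guide(mu, L, math.exp, rng) for _ in range(1000)]

mean_log_x1 = sum(math.log(x[0]) for x in draws) / len(draws)
print(draws[0], mean_log_x1)   # mean of log x_1 should be near mu[0] = 0
```

Every draw respects the positivity constraint by construction, and all the randomness enters through `eps`, so the samples are differentiable functions of `mu` and `L`.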
