Posterior Approximation for Variational Inference

We explore normalizing flows (planar flow, radial flow, inverse autoregressive flow) and Gaussian autoregressive flow as flexible posterior estimators for variational inference.

1. Variational inference: motivation and goals

Motivation

Suppose that, given a person's gender, their height follows an unknown Gaussian distribution. With height data from males only, it is easy to fit a Gaussian model that explains the data. However, given heights without knowing the gender, it is no longer easy to fit such a well-defined model. For data as simple as in this example, an EM-style algorithm (e.g., fitting a Gaussian mixture, of which K-means is a special case) is sufficient. However, as the data goes from low dimension to high dimension, the modeling becomes increasingly intractable, and this is exactly the case for the real-world high-dimensional data that we collect, such as MNIST, CIFAR, etc. The reason is that the data generation process is most often guided by some latent factors, such as the gender in the previous example. Therefore, in order to explain the data, we must also take the latent factors into consideration. This is the direct motivation for variational inference.

Goals

Mathematically, we formalize a variational inference problem as below. We use \( x \) (a single variable, or multiple variables arranged as a vector or tensor) to represent the observed data and \( z \) for the latent variables. The goals of variational inference are to:
  • learn the posterior distribution \( p_{\theta}(z|x) \) of the latent variables \( z \);
  • derive a lower bound for the marginal likelihood \( p(x) \) of the observed data \( x \).
Usually the data \( x \) is complex and high dimensional, and the posterior distribution \( p_{\theta}(z|x) \) is intractable. Computing \( p_{\theta}(z|x) \) analytically is nontrivial, and therefore we approximate it with a flexible estimator \( q_{\phi}(z|x) \). We now derive the lower bound of \( log(p(x)) \) by analyzing how well the posterior estimator \( q_{\phi}(z|x) \) approximates the true posterior distribution \( p_{\theta}(z|x) \): $$KL[q_{\phi}(z|x)||p_{\theta}(z|x)] = E_{z \sim q_{\phi}(z|x)}[log(q_{\phi}(z|x))-log(p_{\theta}(z|x))]$$ $$=E_{z \sim q_{\phi}(z|x)}[log(q_{\phi}(z|x))-log(p(z))-log(p_{\theta}(x|z))+log(p(x))]$$ $$=E_{z \sim q_{\phi}(z|x)}[log(q_{\phi}(z|x))-log(p(z))-log(p_{\theta}(x|z))]+log(p(x))$$
Rearranging the terms: $$log(p(x)) - KL[q_{\phi}(z|x)||p_{\theta}(z|x)] = L(x,\theta,\phi)$$ $$=-KL[q_{\phi}(z|x)||p(z)]+E_{z \sim q_{\phi}(z|x)}[log(p_{\theta}(x|z))]\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ (1)$$ $$=-E_{z \sim q_{\phi}(z|x)}[log(q_{\phi}(z|x))]+E_{z \sim q_{\phi}(z|x)}[log(p(z)p_{\theta}(x|z))]\ \ \ \ \ \ (2)$$
The R.H.S. term \(L(x,\theta,\phi)\) is a lower bound for \( log(p(x)) \) because the KL-divergence term \( KL[q_{\phi}(z|x)||p_{\theta}(z|x)] \geqslant 0\). Ideally, if the posterior estimator \( q_{\phi}(z|x) \) is powerful enough to approximate the true posterior distribution \( p_{\theta}(z|x) \), then \( KL[q_{\phi}(z|x)||p_{\theta}(z|x)] \rightarrow 0\). In that case \( L(x,\theta,\phi) \) is a tight lower bound, and maximizing it leads us to a close approximation of \( log(p(x)) \).
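
As a sanity check, here is a minimal numerical sketch (NumPy/SciPy) of this identity on a toy conjugate-Gaussian model where \( log(p(x)) \) is tractable; the toy model and the deliberately imperfect estimator \( q \) are illustrative choices, not part of the derivation above.

```python
import numpy as np
from scipy.stats import norm

# Toy model where everything is tractable:
#   p(z) = N(0, 1),  p(x|z) = N(z, 1)  =>  p(x) = N(0, 2),  p(z|x) = N(x/2, 1/2)
x = 1.3
log_px = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))

# A (deliberately imperfect) Gaussian posterior estimator q(z|x) = N(mu, sigma^2)
mu, sigma = 0.4, 0.9

# Monte Carlo estimate of the lower bound
#   L = E_q[ log p(z) + log p(x|z) - log q(z|x) ]
rng = np.random.default_rng(0)
z = mu + sigma * rng.standard_normal(200_000)          # re-parameterized samples
elbo = np.mean(norm.logpdf(z, 0.0, 1.0)                # log p(z)
               + norm.logpdf(x, z, 1.0)                # log p(x|z)
               - norm.logpdf(z, mu, sigma))            # -log q(z|x)

print(f"log p(x) = {log_px:.4f}, ELBO = {elbo:.4f}, gap (= KL) = {log_px - elbo:.4f}")
# The gap equals KL[q(z|x) || p(z|x)] >= 0, so L is indeed a lower bound on log p(x).
```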

2. A typical variational autoencoder architecture as an implementation

Let's focus on the lower bound \(L(x,\theta,\phi)\) in the form of eq. (1). For those who are familiar with the autoencoder architecture, you can treat the first R.H.S. term as a regularizer on the latent variables, and the second term as the reconstruction error. In the simplest case, we can just assume that both the latent variables' marginal distribution \( p(z) \) and the posterior estimator \( q_{\phi}(z|x) \) are normal distributions. However, in fact both of them can be arbitrarily complex distributions depending on how complex the data is. Therefore, people propose normalizing flows, Gaussian autoregressive flow, etc. as methods to derive a more flexible posterior estimator \( q_{\phi}(z|x) \) so as to better approximate \( p_{\theta}(z|x) \). More discussion of these flexible models is provided in the next section.

By assuming that \( p(z) \) and \( q_{\phi}(z|x) \) are diagonal Gaussians, the KL-divergence term \(KL[q_{\phi}(z|x)||p(z)]\) can be explicitly expressed in a form differentiable w.r.t. the parameters \( \phi \) as below: $$KL[q_{\phi}(z|x)||p(z)] = \frac{1}{2}[log\frac{|\Sigma_{2}|}{|\Sigma_{1}|}-d+tr(\Sigma_{2}^{-1}\Sigma_{1})+(\mu_{2}-\mu_{1})^{T}\Sigma_{2}^{-1}(\mu_{2}-\mu_{1})]$$ $$with:\ q_{\phi}(z|x)\sim N(\mu_{1},\Sigma_{1}) = N(\mu_{\phi}(x),\Sigma_{\phi}(x))$$ $$\ p(z)\sim N(\mu_{2},\Sigma_{2}) = N(0,I_{d})$$
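
For the special case \(\mu_{2}=0\), \(\Sigma_{2}=I_{d}\) and diagonal \(\Sigma_{1}\), the formula above collapses to an element-wise expression. A minimal NumPy sketch follows; the array shapes and example numbers are illustrative assumptions.

```python
import numpy as np

def kl_diag_gaussian_vs_standard_normal(mu, log_var):
    """KL[ N(mu, diag(exp(log_var))) || N(0, I) ], computed per data point.

    mu, log_var: arrays of shape (batch, d) produced by the encoder.
    Specializing the general formula with mu2 = 0, Sigma2 = I gives
    0.5 * sum_j (sigma_j^2 + mu_j^2 - 1 - log sigma_j^2).
    """
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

# Example: a 2-D latent for a batch of 3 (numbers are illustrative only)
mu = np.array([[0.0, 0.0], [0.5, -0.5], [1.0, 2.0]])
log_var = np.log(np.array([[1.0, 1.0], [0.25, 0.25], [1.0, 4.0]]))
print(kl_diag_gaussian_vs_standard_normal(mu, log_var))  # first row is exactly 0
```
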
As for \(E_{z\sim q_{\phi}(z|x)}[log(p_{\theta}(x|z))]\) in eq. (1), we use two tricks to build a form differentiable w.r.t. the parameter sets \(\phi\) and \(\theta\) for optimization. First, we re-parameterize \(z\sim q_{\phi}(z|x)\) as: $$z = \mu_{\phi}(x) + \varepsilon*sqrt(\Sigma_{\phi}(x))$$ $$with:\ \varepsilon\sim N(0,I_{d})$$
Since \(\Sigma_{\phi}(x)\) is diagonal, the square root is computed element-wise. Second, because we use minibatch SGD as the optimizer, we also use stochastic sampling of \(z\) to approximate \(E_{z\sim q_{\phi}(z|x)}[log(p_{\theta}(x|z))]\). Empirically, we often set the minibatch size to 100 and the number of samples per data point to 1. Also, \(p_{\theta}(x|z)\) is assumed to be Gaussian, so that it outputs a scalar value as the reconstruction error for each of the high-dimensional data points observed. One thing to notice is that the decoder only outputs the mean vector, and the covariance matrix is therefore treated as a hyper-parameter. A sketch of this estimator is given below.
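
The sketch below implements this single-sample estimator under the stated Gaussian-decoder assumption; the encoder outputs and the linear map standing in for the decoder are random placeholders, not a real model.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Draw z ~ N(mu, diag(exp(log_var))) as z = mu + eps * sigma, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + eps * np.exp(0.5 * log_var)

def gaussian_recon_log_lik(x, x_mean, sigma2=1.0):
    """log p(x|z) for p(x|z) = N(x; x_mean, sigma2 * I): a scaled squared error plus a constant.
    The decoder outputs only x_mean; sigma2 is treated as a hyper-parameter."""
    d = x.shape[-1]
    return -0.5 * (np.sum((x - x_mean) ** 2, axis=-1) / sigma2
                   + d * np.log(2.0 * np.pi * sigma2))

# Single-sample Monte Carlo estimate of E_{z~q(z|x)}[log p(x|z)] for a minibatch of 100.
# The "encoder" outputs and the linear "decoder" W below are random stand-ins for real networks.
x = rng.standard_normal((100, 784))
mu_phi, log_var_phi = rng.standard_normal((100, 20)), np.zeros((100, 20))
W = rng.standard_normal((20, 784)) * 0.05
z = reparameterize(mu_phi, log_var_phi)        # one sample of z per data point
recon = gaussian_recon_log_lik(x, z @ W)       # shape (100,), averaged over the batch in the loss
```
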
So far, we have derived a differentiable form of the lower bound based on eq. (1). Therefore, we can now deploy the whole lower-bound maximization problem in an autoencoder architecture, which we refer to as a variational autoencoder (VAE). Usually, we call the encoder the recognition model, and the decoder the generative model. Both are implemented as neural networks, such as a CNN for image data or an MLP for data with simpler structure.
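
A compact PyTorch sketch of such a VAE is shown below; the MLP layer sizes, the latent dimension and the fixed decoder variance are illustrative assumptions rather than choices prescribed by the text.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal MLP recognition (encoder) + generative (decoder) model for eq. (1)."""
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_log_var = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))   # decoder outputs the mean only

    def forward(self, x):
        h = self.enc(x)
        mu, log_var = self.enc_mu(h), self.enc_log_var(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)   # re-parameterization
        return self.dec(z), mu, log_var

def neg_elbo(x, x_mean, mu, log_var, sigma2=1.0):
    """Negative of eq. (1): analytic KL to N(0, I) plus a Gaussian reconstruction term (up to a constant)."""
    kl = 0.5 * torch.sum(torch.exp(log_var) + mu**2 - 1.0 - log_var, dim=-1)
    recon = 0.5 * torch.sum((x - x_mean) ** 2, dim=-1) / sigma2
    return (kl + recon).mean()

# One optimization step on a random minibatch (illustrative data only):
model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(100, 784)
loss = neg_elbo(x, *model(x))
opt.zero_grad(); loss.backward(); opt.step()
```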

3. More flexible posterior estimators for better variational inference

Though variational inference is a powerful method for Bayesian inference on complex high-dimensional data, it is often criticized for how poorly a simple estimator can approximate the true posterior distribution \(p_{\theta}(z|x)\), which loosens the lower bound \(L(x,\theta,\phi)\) that we optimize. Therefore, as mentioned in the previous section, researchers have proposed normalizing flows, Gaussian autoregressive flow, etc. as more flexible posterior estimators.

The basic principle of a normalizing flow is to apply invertible transformations to variables with a simple distribution so as to obtain a complicated distribution for the transformed variables. For instance, let \(z_{0}\sim N(0,I)\) and \(z_{k} = f_{k}\circ \cdot \cdot \cdot \circ f_{1}(z_{0})\). Then \(z_{k}\) can have a rather complex distribution depending on \(z_{0}\) and the invertible transformations \(f_{i}\) in use. Specifically, based on \(z_{0}\sim N(0,I)\) we can derive \(q(z_{k})\) using the change-of-variables formula and the Jacobian determinants as below: $$log(q(z_{k}))=log\{q(z_{0})\prod_{i=1}^{k}|det(\frac{\partial f_{i}}{\partial z_{i-1}})|^{-1}\}=log(q(z_{0})) - \sum_{i=1}^{k}log|det(\frac{\partial f_{i}}{\partial z_{i-1}})|$$
In practice, we most often take \(z_{0}\) to be a simple diagonal Gaussian. However, if the latent variable \(z_{i}\) has dimension A and the transformation depth \(k\) is B, then computing \(\sum_{i=1}^{k}log|det(\frac{\partial f_{i}}{\partial z_{i-1}})|\) for general transformations scales as \(O(A^{3}B)\) with standard determinant algorithms. Therefore, we prefer each \(f_{i}\) to be both invertible and equipped with a Jacobian determinant of a computationally efficient form, such as the planar flow introduced below.

To compute the expectation of a function \(h(z_{k})\) of the transformed variables \(z_{k}=f_{k}\circ \cdot \cdot \cdot \circ f_{1}(z_{0})\), we do not even need to explicitly derive \(q(z_{k})\). The rule used here is: $$E_{z_{k}\sim q(z_{k})}[h(z_{k})] = E_{z_{0}\sim q(z_{0})}[h(f_{k}\circ \cdot \cdot \cdot \circ f_{1}(z_{0}))]$$ Therefore, the lower bound \(L(x,\theta,\phi)\) in the form of eq. (2) can be re-written as: $$L(x,\theta,\phi,\lambda) = -E_{z_{k}\sim q_{\phi,\lambda}(z_{k}|x)}[log(q_{\phi,\lambda}(z_{k}|x))]+E_{z_{k}\sim q_{\phi,\lambda}(z_{k}|x)}[log(p(z_{k})p_{\theta}(x|z_{k}))]$$ $$=-E_{z_{0}\sim q_{\phi}(z_{0}|x)}[log(q_{\phi}(z_{0}|x))]+E_{z_{0}\sim q_{\phi}(z_{0}|x)}[log(p(z_{k})p_{\theta}(x|z_{k}))]+E_{z_{0}\sim q_{\phi}(z_{0}|x)}[\sum_{i=1}^{k}log|det(\frac{\partial f_{i}}{\partial z_{i-1}})|]\ \ \ \ \ \ \ (3)$$ where inside the expectations over \(z_{0}\) we substitute \(z_{k}=f_{k}\circ \cdot \cdot \cdot \circ f_{1}(z_{0})\) and use \(log(q_{\phi,\lambda}(z_{k}|x)) = log(q_{\phi}(z_{0}|x)) - \sum_{i=1}^{k}log|det(\frac{\partial f_{i}}{\partial z_{i-1}})|\).
The invertible transformations \(f_{k}\circ \cdot \cdot \cdot \circ f_{1}\) are parameterized by \(\lambda\). As introduced in Sec. 2, we usually take \( z_{0}\sim q_{\phi}(z_{0}|x) = N(\mu_{\phi}(x),\Sigma_{\phi}(x))\) to be a diagonal Gaussian (or a conditional Gaussian in more complex settings), and we take the prior \( p(z_{k}) \) to be \(N(0,I)\). The first term is then the negative entropy of a diagonal Gaussian, and by sampling \(z_{0}\sim q_{\phi}(z_{0}|x)\) with the re-parameterization trick, the second term can be estimated in a form differentiable w.r.t. the parameter sets \(\phi\), \(\theta\) and \(\lambda\). In practice, the sampling size of \(z_{0}\) is often set to 1.

For different normalizing flows, the difference shows up in the second and third terms, depending on which invertible functions \(f_{k}\circ \cdot \cdot \cdot \circ f_{1}\) are used. Some transformations yield a more flexible distribution \(q_{\phi,\lambda}(z_{k}|x)\) than others, and are therefore more likely to give a tighter lower bound \(L(x,\theta,\phi,\lambda)\) due to a smaller KL-divergence \(KL[q_{\phi,\lambda}(z_{k}|x)||p_{\theta}(z_{k}|x)]\). In the following parts, we introduce a few commonly used normalizing flows.

Planar normalizing flow

A planar normalizing flow is a function of the form: $$z_{i}=f_{i}(z_{i-1})=z_{i-1}+u_{i}h(w_{i}^{T}z_{i-1}+b_{i}),\ i = \{1,..,k\}$$ where \(u_{i} \) and \( w_{i} \) are vectors with the same dimension as \( z_{i-1} \), \(b_{i}\) is a scalar and \(h\) is a smooth element-wise activation function (e.g. tanh). An invertible transformation of this form has a det-Jacobian term that can be computed in \(O(AB)\), compared with the \(O(A^{3}B)\) cost explained before. Thus the third term of the lower bound \(L(x,\theta,\phi,\lambda)\) in eq. (3) can be efficiently computed as: $$ E_{z_{0}\sim q_{\phi}(z_{0}|x)}[\sum_{i=1}^{k}log|det(\frac{\partial f_{i}}{\partial z_{i-1}})|] = E_{z_{0}\sim q_{\phi}(z_{0}|x)}[\sum_{i=1}^{k}log|det(\frac{\partial (z_{i-1} + u_{i}h(w_{i}^{T}z_{i-1}+b_{i}))}{\partial z_{i-1}})|]$$ $$ = E_{z_{0}\sim q_{\phi}(z_{0}|x)}[\sum_{i=1}^{k}log|det(I + h^{'}u_{i}w_{i}^{T})|] = E_{z_{0}\sim q_{\phi}(z_{0}|x)}[\sum_{i=1}^{k}log|1 + h^{'}u_{i}^{T}w_{i}|]$$ where \(h^{'}\) is shorthand for \(h^{'}(w_{i}^{T}z_{i-1}+b_{i})\). Planar flows can be viewed as applying a series of contractions and expansions, along the direction perpendicular to the hyperplane \(w_{i}^{T}z+b_{i}=0\), to the initial probability distribution of \(z_{0}\). A small sketch is given below. Some experimental results demonstrating the effect of planar normalizing flows are shown in fig. 1.
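
A minimal NumPy sketch of a stack of planar flows with \(h = tanh\), accumulating exactly the \(\sum_{i}log|det|\) term above; the parameter values are random placeholders, and the invertibility constraint on \(u_{i}\) from the original paper is noted but omitted.

```python
import numpy as np

def planar_flow_step(z, u, w, b):
    """One planar transformation f(z) = z + u * tanh(w^T z + b), applied to a batch of z.

    z: (batch, d);  u, w: (d,);  b: scalar.
    Returns the transformed z and log|det df/dz| = log|1 + h'(w^T z + b) u^T w| per sample.
    (Invertibility additionally requires constraining u so that u^T w >= -1,
    as done in the planar-flow paper; omitted here for brevity.)
    """
    a = z @ w + b                        # (batch,)
    z_new = z + np.tanh(a)[:, None] * u
    h_prime = 1.0 - np.tanh(a) ** 2      # derivative of tanh
    log_det = np.log(np.abs(1.0 + h_prime * (u @ w)))
    return z_new, log_det

# Stack k flows and accumulate the log-det sum used in eq. (3):
rng = np.random.default_rng(0)
d, k = 2, 8
z = rng.standard_normal((1000, d))       # samples of z_0 ~ N(0, I)
sum_log_det = np.zeros(1000)
for _ in range(k):
    u, w, b = 0.5 * rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal()
    z, log_det = planar_flow_step(z, u, w, b)
    sum_log_det += log_det
# log q(z_k) = log q(z_0) - sum_log_det
```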

Radial normalizing flow

Another alternative transformation family is the radial normalizing flow. The transformation function can be expressed as: $$ z_{i}=f_{i}(z_{i-1}) = z_{i-1} + \beta_{i} h(\alpha_{i},r(z_{i-1}))(z_{i-1}-{z_{0}}_{i}),\ i = \{1,...,k\}$$ $$with:\ r(z_{i-1}) = ||z_{i-1}-{z_{0}}_{i}||,\ h(\alpha_{i},r(z_{i-1})) = 1/(\alpha_{i}+r(z_{i-1}))$$ where the reference point \({z_{0}}_{i}\) is a vector with the same dimension as \(z_{i-1}\), and \(\beta_{i}\) and \(\alpha_{i}\) are scalars. The radial normalizing flow also provides a linear-time computation of the det-Jacobian, and therefore the third term of the lower bound \(L(x,\theta,\phi,\lambda)\) in eq. (3) can be derived as: $$ E_{z_{0}\sim q_{\phi}(z_{0}|x)}[\sum_{i=1}^{k}log|det(\frac{\partial f_{i}}{\partial z_{i-1}})|] = E_{z_{0}\sim q_{\phi}(z_{0}|x)}[\sum_{i=1}^{k}log|det(\frac{\partial (z_{i-1} + \beta_{i} h(\alpha_{i},r(z_{i-1}))(z_{i-1}-{z_{0}}_{i}))}{\partial z_{i-1}})|] $$ $$ = E_{z_{0}\sim q_{\phi}(z_{0}|x)}[\sum_{i=1}^{k}log|[1+\beta_{i}h(\alpha_{i},r(z_{i-1}))]^{d-1}[1+\beta_{i}h(\alpha_{i},r(z_{i-1}))+\beta_{i}h^{'}(\alpha_{i},r(z_{i-1}))r(z_{i-1})]|]$$ It applies contractions and expansions around the reference point \({z_{0}}_{i}\) of the current distribution of \(z_{i-1}\), and is therefore referred to as a radial flow; see the sketch below. Experiments showing the effect of the radial normalizing flow are also shown in fig. 1.
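
A corresponding NumPy sketch of one radial-flow step and its closed-form log-det; the parameter values are illustrative, and the invertibility constraint on \(\beta_{i}\) from the original paper is omitted.

```python
import numpy as np

def radial_flow_step(z, z_ref, alpha, beta):
    """One radial transformation f(z) = z + beta * h(alpha, r) * (z - z_ref),
    with r = ||z - z_ref|| and h(alpha, r) = 1 / (alpha + r), applied to a batch of z.

    Returns the transformed z and log|det df/dz| from the closed form
    [1 + beta*h]^(d-1) [1 + beta*h + beta*h'*r], where h' = dh/dr = -1/(alpha + r)^2.
    (The radial-flow paper also constrains beta >= -alpha for invertibility; omitted here.)
    """
    d = z.shape[-1]
    diff = z - z_ref
    r = np.linalg.norm(diff, axis=-1, keepdims=True)      # (batch, 1)
    h = 1.0 / (alpha + r)
    z_new = z + beta * h * diff
    h_prime = -1.0 / (alpha + r) ** 2
    r, h, h_prime = r[:, 0], h[:, 0], h_prime[:, 0]
    log_det = (d - 1) * np.log(np.abs(1.0 + beta * h)) \
              + np.log(np.abs(1.0 + beta * h + beta * h_prime * r))
    return z_new, log_det

# Example: one radial step on samples of z_0 ~ N(0, I)
rng = np.random.default_rng(0)
z = rng.standard_normal((1000, 2))
z, log_det = radial_flow_step(z, z_ref=np.zeros(2), alpha=1.0, beta=0.5)
```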

Figure 1. Effect of normalizing flow on two distributions.

Gaussian autoregressive flow

Different from the normalizing flows introduced previously, the Gaussian autoregressive flow is applied by optimizing the lower bound in the form of eq. (1), instead of eq. (2). Though eq. (1) is also what we optimized in Sec. 2 with a typical variational autoencoder architecture, we now model the posterior distribution \(z\sim q_{\phi}(z|x)\) as a more complex conditional diagonal Gaussian, not just a simple diagonal Gaussian. For instance, the latent variables \(z\) can be partitioned into \(z=(z_{0},z_{1},z_{2})\), and \(q_{\phi}(z_{0}|x), q_{\phi}(z_{1}|x,z_{0}), q_{\phi}(z_{2}|x,z_{0},z_{1})\) are modeled as separate diagonal Gaussian distributions. A conditional diagonal Gaussian estimator is more powerful than the commonly used single diagonal Gaussian in approximating the true posterior distribution \(p_{\theta}(z|x)\).

Since \(z\sim q_{\phi}(z|x)\) is no longer a diagonal Gaussian, computing the KL-divergence term in eq. (1) analytically is a bit different from before. Assuming that the latent variables \(z\) are partitioned into \((z_{0},z_{1},z_{2})\), the term decomposes as: $$KL[q_{\phi}(z|x)||p(z)] = KL[q_{\phi}(z_{0}|x)q_{\phi}(z_{1}|x,z_{0})q_{\phi}(z_{2}|x,z_{0},z_{1})||p(z_{0})p(z_{1})p(z_{2})]$$ $$=E_{z_{0}z_{1}z_{2}\sim q_{\phi}(z_{0}|x)q_{\phi}(z_{1}|x,z_{0})q_{\phi}(z_{2}|x,z_{0},z_{1})}[log\frac{q_{\phi}(z_{0}|x)}{p(z_{0})}+log\frac{q_{\phi}(z_{1}|x,z_{0})}{p(z_{1})}+log\frac{q_{\phi}(z_{2}|x,z_{0},z_{1})}{p(z_{2})}]$$ $$=KL[q_{\phi}(z_{0}|x)||p(z_{0})]+KL[q_{\phi}(z_{1}|x,z_{0})||p(z_{1})]+KL[q_{\phi}(z_{2}|x,z_{0},z_{1})||p(z_{2})]$$ where the second and third KL terms are understood as averaged over the sampled conditioning variables \(z_{0}\) and \((z_{0},z_{1})\). The rest is straightforward using the multivariate Gaussian KL-divergence formula derived in Sec. 2. In practice, an RNN is often used to output a separate set of means and covariances for each partition of the latent variables \(z\); a sketch follows below.
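
A small sketch of this decomposition, reusing the diagonal-Gaussian KL from Sec. 2; the per-partition means and log-variances below are random placeholders standing in for the autoregressive inference network's outputs given \(x\) and the sampled prefix of \(z\).

```python
import numpy as np

def kl_diag_vs_standard(mu, log_var):
    """KL[ N(mu, diag(exp(log_var))) || N(0, I) ] per data point (formula from Sec. 2)."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

# Suppose an autoregressive inference network produced, for each partition z_t,
# a conditional mean/log-variance given x and the sampled z_{<t}.  Placeholders:
rng = np.random.default_rng(0)
partition_params = [(rng.standard_normal((100, 8)), rng.standard_normal((100, 8)) * 0.1)
                    for _ in range(3)]                    # (mu_t, log_var_t) for t = 0, 1, 2

# KL[q(z|x) || p(z)] decomposes into a sum of per-partition diagonal-Gaussian KLs:
kl_total = sum(kl_diag_vs_standard(mu_t, log_var_t) for mu_t, log_var_t in partition_params)
print(kl_total.shape)   # (100,): one KL value per data point in the minibatch
```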

The difficulty in computing the second term \(E_{z\sim q_{\phi}(z|x)}[log(p_{\theta}(x|z))]\) in eq. (1) is to find a way of sampling \(z\sim q_{\phi}(z|x)\), which is a conditional diagonal Gaussian. Still assuming that the latent variables \(z\) are partitioned into \((z_{0},z_{1},z_{2})\), samples of \(z\) are obtained by sequentially sampling each partition \(z_{i}\), which is a diagonal Gaussian, using the re-parameterization trick for each one (see the sketch below). With this, we have a full derivation of a differentiable form of the lower bound \(L(x,\theta,\phi)\) in the form of eq. (1).
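
A sketch of this sequential sampling; conditional_params is a hypothetical stand-in for the autoregressive inference network, not a real model.

```python
import numpy as np

rng = np.random.default_rng(0)

def conditional_params(x, z_prefix, dim=8):
    """Placeholder for the inference network: returns (mu_t, log_var_t) for the next
    partition given x and the partitions sampled so far.  (Hypothetical stand-in.)"""
    context = np.concatenate([x] + z_prefix, axis=-1)
    mu_t = 0.1 * context[:, :dim]                 # any deterministic function of the context
    log_var_t = np.zeros((x.shape[0], dim))
    return mu_t, log_var_t

def sample_autoregressive_posterior(x, num_partitions=3, dim=8):
    """Sample z = (z_0, ..., z_T) by re-parameterized ancestral sampling."""
    z_prefix = []
    for _ in range(num_partitions):
        mu_t, log_var_t = conditional_params(x, z_prefix, dim)
        eps = rng.standard_normal(mu_t.shape)
        z_prefix.append(mu_t + eps * np.exp(0.5 * log_var_t))   # re-parameterization trick
    return np.concatenate(z_prefix, axis=-1)

z = sample_autoregressive_posterior(np.random.default_rng(1).standard_normal((100, 32)))
print(z.shape)   # (100, 24)
```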

Inverse autoregressive flow

The idea of inverse autoregressive flow (IAF) comes from the Gaussian autoregressive flow: IAF acts as its inverse function. One way to understand the Gaussian autoregressive flow is to think of it as a transformation taking a diagonal standard normal distribution \(\varepsilon\sim N(0,I_{d})\) to a conditional diagonal normal distribution \(z=(z_{0},...,z_{T})\), with \(q(z_{t}|x,z_{0},...,z_{t-1})\sim N(\mu_{t}(x,z_{0:t-1}),\sigma_{t}(x,z_{0:t-1})^{2}*I_{d_{t}})\). This transformation is what we referred to as the re-parameterization trick before. We partition \(\varepsilon\) into \((\varepsilon_{0},...,\varepsilon_{T})\) in the same way as the latent variables \(z\). Thus the Gaussian autoregressive transformation is: $$ z_{t} = \mu_{t}(x,z_{0:t-1}) + \varepsilon_{t}*\sigma_{t}(x,z_{0:t-1}),\ \ \ t\in\{0,..,T\}$$ This is a powerful group of transformations in the sense that they can transform a simple multivariate standard normal distribution \(\varepsilon=(\varepsilon_{0}...\varepsilon_{T})\) into a complex conditional diagonal normal distribution \(z=(z_{0}...z_{T})\). Therefore, we would be interested to see what kind of distribution we can reach by feeding a simple diagonal Gaussian distribution into its inverse function, namely the inverse autoregressive flow (IAF). The IAF is given as below: $$ \varepsilon_{t} = \frac{z_{t}-\mu_{t}(x,z_{0:t-1})}{\sigma_{t}(x,z_{0:t-1})},\ \ \ t\in\{0,...,T\}$$ This transformation takes \(z\mapsto \varepsilon\). For convenience, we replace \(z\) by \(y\), and \(\varepsilon\) by \(z\). Then the det-Jacobian of the transformation \(y\mapsto z\) can be efficiently expressed as: $$ log|det(\frac{\partial f}{\partial y})| = log|\frac{1}{\sigma_{0}*...*\sigma_{T}}| = -\sum_{t=0}^{T}log(\sigma_{t}(x,y_{0:t-1}))$$ Therefore, again using eq. (2) as the lower bound, we are able to optimize \(L(x,\theta,\phi)\) for variational inference.

So far we have explained IAF for a single transformation step. However, as with the other normalizing flows, we often apply several rounds of transformation in order to obtain a sufficiently complex posterior estimator \(q(z_{k}|x)\). Thus, once again, we rename \(y\) as \(z_{i-1}\) and \(z\) as \(z_{i}\). The transformation can then be expressed as: $$ z_{i} = f_{i}(z_{i-1}) = \frac{z_{i-1}-\mu_{i}(z_{i-1})}{\sigma_{i}(z_{i-1})},\ \ \ i\in\{1,...,k\} $$ It is important to notice that \(z_{i}\) here no longer represents a partition of \(z\), but the current latent variables \(z\) as a whole, and that all operations in the equation above are element-wise. In practice, we often take \(z_{0}\sim q(z_{0}|x)=N(\mu(x),\sigma(x)^{2}*I_{d})\) as a diagonal Gaussian. It is also important to realize that the \((\mu(x),\sigma(x))\) used for sampling \(z_{0}\) are not equal to \((\mu_{1}(z_{0}),\sigma_{1}(z_{0}))\) in the equation above, or else \(z_{1}\) would simply be standard diagonal normal. All the \((\mu_{i}(z_{i-1}),\sigma_{i}(z_{i-1}))\), including \((\mu(x),\sigma(x))\), are outputs of the inference network. A sketch of a stack of IAF steps is given below.
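
A minimal NumPy sketch of stacked IAF steps; the strictly triangular masks below are a simple stand-in for the masked autoregressive networks (e.g. MADE) used in practice, and the parameter values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def iaf_step(z, W_mu, W_s, context):
    """One IAF transformation z_i = (z_{i-1} - mu(z_{i-1})) / sigma(z_{i-1}).

    mu and log(sigma) are made autoregressive by strictly upper-triangular weight
    matrices, so element t depends only on elements < t (a toy stand-in for the masked
    networks used in practice).  `context` plays the role of the encoder output h(x).
    Returns the transformed z and log|det| = -sum_t log sigma_t per sample.
    """
    d = z.shape[-1]
    mask = np.triu(np.ones((d, d)), k=1)           # strictly upper-triangular: i < j only
    mu = z @ (W_mu * mask) + context
    log_sigma = z @ (W_s * mask)
    z_new = (z - mu) / np.exp(log_sigma)
    log_det = -np.sum(log_sigma, axis=-1)          # triangular Jacobian with diagonal 1/sigma_t
    return z_new, log_det

# Stack a few IAF steps on top of a diagonal-Gaussian z_0 = mu(x) + eps * sigma(x):
d, k = 8, 4
z = rng.standard_normal((100, d))                  # plays the role of samples of z_0 ~ q(z_0|x)
sum_log_det = np.zeros(100)
for _ in range(k):
    W_mu, W_s = 0.1 * rng.standard_normal((d, d)), 0.1 * rng.standard_normal((d, d))
    z, log_det = iaf_step(z, W_mu, W_s, context=0.0)
    sum_log_det += log_det
# log q(z_k|x) = log q(z_0|x) - sum_log_det, exactly as in the normalizing-flow formula.
```
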
At the time IAF was introduced, generative models built on it reported some of the best results on CIFAR and MNIST.