A Variational Autoencoder (VAE) blends the principles of deep learning with Bayesian inference via sampling. It operates by encoding input data $x$ into a distribution over a compressed latent space with prior $p(z)$, then reconstructing the input $x$ from a sample $z$ drawn from this latent distribution, effectively learning the distribution $p(x)$ of the input data.
The Artist and the Painting Process: Imagine an artist attempting to replicate a stunning artwork. The initial step involves an intensive analysis of the original painting, denoted as $x$. This includes dissecting the artwork to grasp its fundamental elements, $z$, such as genre, color palette, brushwork, and overall composition. Identifying these key components is crucial, as they serve as the blueprint for the recreation process. The artist then gathers these elements, aiming to mimic the original with as much accuracy as possible. This approach highlights how the elements $z$ serve as the guiding principles in faithfully reconstructing the painting $x$.
The Goal of Faithful Replication: The success of the replication is measured by comparing the recreated painting, $\hat{x}$, with the original $x$. The primary objective is to minimize discrepancies, mirroring the concept of reconstruction loss—the smaller the differences, the more successful the replication. Concurrently, the artist endeavors to maintain the simplicity and efficiency of the elements $z$, steering clear of unnecessary complexities. This balancing act between precision (minimizing reconstruction loss) and simplicity (ensuring effective regularization) is critical, ensuring that the new painting not only faithfully replicates the original but is also created with efficiency.
Uncertainty in the Creative Process: Despite adhering to the same foundational elements, the paintings the artist recreates at different points in time may not be identical, introducing an element of randomness. A well-defined and meaningful deconstruction of the fundamental components $z$, however, can guide the artist towards more consistent results.
At its core, a VAE (Variational Autoencoder) is a generative model based on a directed probabilistic graphical model. Its goal is to jointly model $x$ and $z$, thereby enabling probabilistic inference in both directions. Note that this bidirectional probabilistic inference is not the invertible mapping used in normalizing flows; as a rough analogy, an invertible function can be viewed as a special case of bidirectional probabilistic inference with zero variance.
Concretely, the VAE assumes a conditional probability model $p(x|z)$ that describes the generative process from $z$ to $x$, and it maximizes the likelihood $p_\theta(x)$ of the observed data $x$, i.e., it performs Maximum Likelihood Estimation (MLE):
$$ \log p_\theta(x) = \log\int p_\theta(x|z)p(z) dz. $$
However, because the integral form of $\log p_\theta(x)$ is generally intractable, we introduce a tractable auxiliary distribution (the variational posterior) to construct a lower bound, the ELBO. By optimizing this lower bound, we indirectly maximize $\log p_\theta(x)$.
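For reference, the standard construction of this lower bound applies Jensen's inequality with a variational posterior $q_\phi(z|x)$ (the encoder introduced below); this is the usual textbook form of the ELBO rather than a derivation specific to this article:

$$ \log p_\theta(x) = \log \mathbb{E}_{q_\phi(z|x)}\!\left[\frac{p_\theta(x|z)\,p(z)}{q_\phi(z|x)}\right] \geq \mathbb{E}_{q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p(z)\right). $$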
The image below illustrates the architecture of a VAE, which involves three stages: encoding, sampling, and decoding.
Encoder Stage: In this stage, the encoder transforms the input data $x$ from the data space into a parametric space via an encoder neural network. The outputs are the parameters that configure the distribution of the latent variable $z$. Specifically, the encoder outputs two values for each input $x$: the mean $\mu_z(x)$, which is the mean of the latent distribution conditioned on $x$, and the variance $\sigma^2_z(x)$, which encodes the uncertainty, or spread, of the latent codes around that mean.
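A minimal PyTorch sketch of such an encoder, assuming flattened inputs of size `input_dim` and a latent dimension `latent_dim` (both hypothetical names); outputting the log-variance instead of the variance is a common choice for numerical stability:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an input x to the parameters (mean, log-variance) of the latent distribution."""
    def __init__(self, input_dim: int = 784, hidden_dim: int = 400, latent_dim: int = 20):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim, latent_dim)       # mean mu_z(x)
        self.logvar = nn.Linear(hidden_dim, latent_dim)   # log of the variance sigma^2_z(x)

    def forward(self, x: torch.Tensor):
        h = torch.relu(self.hidden(x))
        return self.mu(h), self.logvar(h)
```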
Sampler Stage: The sampler utilizes the mean and variance provided by the encoder to sample from the latent distribution. This is typically accomplished using the reparameterization trick, where the latent variable $z$ is generated by the equation $z = \mu_z(x) + \epsilon \sqrt{\sigma^2_z(x)}$. Here, $\epsilon$ is a noise variable sampled from a standard normal distribution $\mathcal{N}(0,1)$. This technique allows the model to backpropagate the gradients through the random sampling process, facilitating the use of gradient descent.
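A short sketch of the reparameterization step; it assumes the encoder returns the log-variance `logvar` (as in the encoder sketch above), so the standard deviation is recovered with `exp(0.5 * logvar)`:

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + eps * sigma with eps ~ N(0, I), keeping gradients w.r.t. mu and logvar."""
    std = torch.exp(0.5 * logvar)   # sigma = sqrt(sigma^2)
    eps = torch.randn_like(std)     # noise drawn from a standard normal
    return mu + eps * std
```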
Decoder Stage: The decoder receives the sampled latent codes $z$ and reconstructs the input data $x$. The output $\hat{x}$ is the reconstruction of the original input, aimed at closely resembling $x$. The quality of the reconstruction is used to calculate part of the VAE’s loss, specifically the reconstruction error, encoded in $p_\theta(x|z)$, which measures the effectiveness of the decoder in recreating the input from the latent code.
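A matching decoder sketch with the same hypothetical dimensions as the encoder sketch; the sigmoid output together with binary cross-entropy corresponds to choosing a Bernoulli form for $p_\theta(x|z)$, which is one common option rather than the only one:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Maps a latent code z back to a reconstruction x_hat in data space."""
    def __init__(self, latent_dim: int = 20, hidden_dim: int = 400, output_dim: int = 784):
        super().__init__()
        self.hidden = nn.Linear(latent_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, output_dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.hidden(z))
        return torch.sigmoid(self.out(h))   # values in [0, 1], e.g. pixel intensities

def reconstruction_loss(x_hat: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Reconstruction error: negative log-likelihood of x under a Bernoulli p_theta(x|z)."""
    return F.binary_cross_entropy(x_hat, x, reduction="sum")
```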
Note: the neural network structure above is not tied to a particular use. The same structure can be paired with different loss functions as needed. Next, we derive the loss function of the VAE.
The core idea of the VAE is to model a complex data distribution $p(x)$ with the help of a latent variable $z$, which captures the underlying factors of variation in the data. Instead of modeling $p(x)$ directly, the VAE assumes a generative process: $z$ follows a prior distribution $p(z)$ (usually a standard Gaussian), and given $z$, the data $x$ follows an unknown distribution $p(x|z)$. Here, $p(x|z)$ can be implemented by a neural network, called the decoder, which maps the latent variable $z$ back to the data space. The model's objective is to maximize the likelihood of the observed data $x$ by marginalizing over the latent variable $z$:
$$ p(x) = \int p(x|z) p(z) \, dz, $$
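To make this generative view concrete, here is a minimal ancestral-sampling sketch; it reuses the hypothetical `Decoder` module from the earlier sketch and assumes a 20-dimensional standard Gaussian prior:

```python
import torch

# Ancestral sampling: draw z ~ p(z), then decode it back to the data space.
decoder = Decoder()          # hypothetical Decoder from the earlier sketch
z = torch.randn(16, 20)      # 16 latent codes drawn from the prior N(0, I)
x_generated = decoder(z)     # outputs parameterize p_theta(x|z), e.g. Bernoulli means
```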
In the VAE, the decoder neural network is used to approximate the unknown distribution $p(x|z)$; the approximation is denoted $p_\theta(x|z)$. Substituting it into the likelihood of the data $x$ gives the parameterized data likelihood $p_\theta(x)$:
$$ p(x) \approx p_\theta(x) = \int p_\theta(x|z) p(z) \, dz, $$
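In principle this integral could be estimated by naive Monte Carlo, drawing $z$ from the prior. The sketch below (with a hypothetical `decoder` and a Bernoulli likelihood, matching the earlier sketches) illustrates why this is impractical: for a high-dimensional $z$, almost all prior samples explain $x$ poorly, so the estimate has enormous variance. This intractability is what motivates the variational posterior and the ELBO:

```python
import torch

def naive_log_marginal_likelihood(x: torch.Tensor, decoder, latent_dim: int = 20,
                                  n_samples: int = 1000) -> torch.Tensor:
    """Log of a Monte Carlo estimate: p_theta(x) ~= (1/N) * sum_i p_theta(x | z_i), z_i ~ p(z).

    Illustrative only: the estimator's variance explodes as latent_dim grows,
    which is why VAEs optimize the ELBO instead.
    """
    z = torch.randn(n_samples, latent_dim)     # z_i ~ N(0, I)
    x_hat = decoder(z)                         # parameters of p_theta(x | z_i)
    # Bernoulli log-likelihood of x under each z_i (assumes x lies in [0, 1]).
    log_px_given_z = (x * torch.log(x_hat + 1e-8) +
                      (1 - x) * torch.log(1 - x_hat + 1e-8)).sum(dim=-1)
    # log( (1/N) * sum_i exp(log p(x|z_i)) )
    return torch.logsumexp(log_px_given_z, dim=0) - torch.log(torch.tensor(float(n_samples)))
```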