Normalizing flows are a class of generative models that represent complex data distributions by starting from a simple, known distribution (e.g., a Gaussian) and applying a sequence of invertible transformations. Through these transformations, the model “warps” an easy-to-sample distribution into one that matches the target real-world data distribution. Unlike GANs, which rely on adversarial training, normalizing flows are trained by maximizing the exact likelihood of the data, enabling direct probability evaluation and stable, non-adversarial training.
The Skilled Pastry Chef (Flow Model): Imagine a pastry chef starting with a simple, uniform dough (representing a simple distribution like a Gaussian). The dough is initially shapeless and uniform. With skillful, reversible shaping steps—rolling, folding, pressing—the chef transforms this simple dough into intricate pastries (the target data distribution). Each step is measured, controlled, and invertible (you can, in principle, unfold each layer back to the original dough). Over time, by perfecting the sequence of transformations, the chef can produce pastries that match the textures, shapes, and flavors that customers (the dataset) desire.
Here, the "dough" is like your initial known distribution (e.g., a Gaussian), and the step-by-step shaping operations are the layers of the normalizing flow. Each transformation must be invertible and differentiable, ensuring we can track how probability mass changes at every step.
The idea behind normalizing flows is that, since the true distribution $p(x)$ is difficult to express explicitly, a neural network $f_\theta$ (sometimes called the "decoder" of the flow) is used to map samples $z$ drawn from a simple base distribution to samples $x$ in the complex target distribution. This shifts direct maximum likelihood estimation (MLE) over the unknown $p(x)$ to MLE under a parameterized model: we adjust $\theta$ so that the observed $x$ values are assigned as much probability as possible under the learned distribution.
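For intuition, here is a minimal PyTorch sketch of the two directions involved, with a single learnable elementwise affine map standing in for a full invertible network (purely illustrative):

```python
import torch
import torch.nn as nn

# Illustrative stand-in for f_theta: a single learnable elementwise affine map.
# A real flow composes many such invertible layers parameterized by neural networks.
log_scale = nn.Parameter(torch.zeros(2))
shift = nn.Parameter(torch.zeros(2))

z = torch.randn(16, 2)                      # z ~ N(0, I), the simple base distribution
x = z * log_scale.exp() + shift             # x = f_theta(z): samples from the model
z_back = (x - shift) * (-log_scale).exp()   # f_theta^{-1}(x) recovers z exactly
```

The forward direction is how the model generates samples; the inverse direction is what we will use next to evaluate likelihoods on observed data.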
As we already saw in the pre-requisite, the function $f_\theta$ mapping $z$ to $x$ is not the same as the mapping from the density $p(z)$ to the density $p(x)$, and it is this density mapping that we need for the MLE calculation. According to the change-of-variables formula, the two densities are related as follows:
$$ p_x(x) = p_z(z) \left|\det\left(\frac{d z}{dx}\right)\right| $$
The subscripts $z$ and $x$ in $p_x$ and $p_z$ indicate that they represent two different distributions, while the $x$ and $z$ in parentheses refer to the specific variables.
(See the pre-requisite: Change of Variables Techniques.)
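To see the formula in action, consider a simple one-dimensional case: let $z \sim \mathcal{N}(0, 1)$ and $x = f(z) = 2z + 1$, so $z = f^{-1}(x) = (x - 1)/2$ and $dz/dx = 1/2$. The formula then gives

$$ p_x(x) = p_z\!\left(\frac{x-1}{2}\right) \cdot \frac{1}{2}, $$

which is exactly the density of $\mathcal{N}(1, 4)$: stretching the variable by a factor of 2 dilutes the probability density by the same factor.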
Now the question is, how do we use this formula? We cannot evaluate $p_x$ on the left directly because it is unknown. On the right-hand side, the value of $z$ corresponding to a given data point $x$ is also unknown, which presents a challenge. This is where the power of normalizing flows comes into play. Normalizing flows require the function $f_\theta: z \rightarrow x$ to be invertible, so we can use the inverse function $f_\theta^{-1}$ to map $x$ back to $z$. Since $p_z$ is a standard Gaussian distribution, the first factor on the right-hand side, $p_z(z) = p_z(f_\theta^{-1}(x))$, can be computed in closed form. The second factor, although more complex, is simply the absolute determinant of the Jacobian of the output $z = f_\theta^{-1}(x)$ with respect to the input $x$. Because the determinant of a general Jacobian is expensive to compute (roughly cubic in the data dimension), the invertible function $f_\theta$ is usually designed carefully so that this determinant is cheap to evaluate. We will set that aside for now and first take a look at what the loss function looks like.
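To see why careful design helps, consider the special case where $f_\theta^{-1}$ acts elementwise, $z_i = g(x_i)$. Its Jacobian is then diagonal, so the log-determinant collapses from a full $d \times d$ determinant into a sum of $d$ scalar derivatives:

$$ \log\left|\det\left(\frac{dz}{dx}\right)\right| = \sum_{i=1}^{d} \log\left|\frac{\partial g(x_i)}{\partial x_i}\right| $$

Coupling layers, discussed at the end of this section, achieve a similar saving with a triangular Jacobian while being far more expressive than purely elementwise maps.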
Specifically, the MLE objective can be expressed as:
$$ \argmax_\theta \prod_i p_x(x_i) = \argmax_\theta \prod_i p_z(f_\theta^{-1}(x_i)) \left|\det\left(\frac{d f_\theta^{-1}(x_i)}{dx_i}\right)\right| $$
The goal is to find the optimal parameters $\theta$ that maximize the likelihood of observing all the given image samples $x_i$. By using the negative log-likelihood, we can simplify the computation by converting the product of probabilities into a sum of logarithms, which leads to the following loss function to minimize:
$$ \small \argmin_\theta \sum_i -\log p_x(x_i) = \argmin_\theta \sum_i \left( -\log p_z(f_\theta^{-1}(x_i)) - \log \left|\det\left(\frac{d f_\theta^{-1}(x_i)}{dx_i}\right)\right| \right) $$
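Putting the pieces together, here is a minimal PyTorch sketch of this loss for the same kind of toy elementwise affine flow (class and variable names are illustrative; a practical model would stack coupling layers instead):

```python
import math
import torch
import torch.nn as nn

# Toy elementwise affine flow (illustrative, not a library API):
# f_theta(z) = z * exp(log_scale) + shift, so
# f_theta^{-1}(x) = (x - shift) * exp(-log_scale),
# and the Jacobian of the inverse is diagonal with entries exp(-log_scale).
class AffineFlow(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def inverse_and_logdet(self, x):
        z = (x - self.shift) * (-self.log_scale).exp()
        logdet = (-self.log_scale).sum()  # log|det(d f^{-1}(x)/dx)|, identical for every sample
        return z, logdet

def nll_loss(flow, x):
    z, logdet = flow.inverse_and_logdet(x)
    # log p_z(z) for a standard Gaussian base distribution
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=1)
    return -(log_pz + logdet).mean()

# Fit the toy flow to synthetic data drawn from N(1, 3^2) in each dimension.
flow = AffineFlow(dim=2)
optimizer = torch.optim.Adam(flow.parameters(), lr=1e-2)
data = torch.randn(256, 2) * 3.0 + 1.0
for _ in range(500):
    loss = nll_loss(flow, data)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# The learned shift and log_scale drift toward 1 and log(3), respectively.
```

Because the inverse map here is elementwise, the log-determinant term reduces to a sum of per-dimension log scales, exactly the kind of cheap Jacobian structure discussed above.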
An invertible function is one in which each input maps to a unique output, and the mapping can be reversed to recover the original input. In normalizing flows, the neural network's role is to parameterize such invertible functions, yielding flexible, learnable transformations that map a simple distribution to a complex target distribution.
The affine coupling layer is one of the most widely used transformations, introduced in models like RealNVP. It splits the input into two parts and applies an affine (scaling and shifting) transformation to one part, conditioned on the other part.
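Below is a minimal PyTorch sketch of such a layer in the spirit of RealNVP (the network size, the `tanh` stabilization, and the fixed half-half split are illustrative choices, not the exact architecture of the paper):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Illustrative RealNVP-style coupling layer: split the input into two halves,
    leave the first half unchanged, and scale/shift the second half using values
    predicted from the first half."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        # Small network that predicts log-scale s and shift t from the first half
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, x):
        # x -> z direction (used for the likelihood); also returns log|det(dz/dx)|
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(x1).chunk(2, dim=1)
        s = torch.tanh(s)                  # keep scales in a numerically stable range
        z2 = (x2 - t) * torch.exp(-s)
        logdet = -s.sum(dim=1)             # Jacobian is triangular: det = prod_i exp(-s_i)
        return torch.cat([x1, z2], dim=1), logdet

    def inverse(self, z):
        # z -> x direction (used for sampling); exactly undoes forward()
        z1, z2 = z[:, :self.d], z[:, self.d:]
        s, t = self.net(z1).chunk(2, dim=1)
        s = torch.tanh(s)
        x2 = z2 * torch.exp(s) + t
        return torch.cat([z1, x2], dim=1)

layer = AffineCoupling(dim=4)
x = torch.randn(8, 4)
z, logdet = layer(x)
x_rec = layer.inverse(z)   # recovers x up to floating-point precision
```

Because the second half is transformed elementwise given the (unchanged) first half, the Jacobian is triangular and its log-determinant is just the sum computed in `forward`; stacking several coupling layers, swapping which half is transformed at each layer, yields an expressive flow whose total log-determinant is simply the sum of the per-layer terms.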