Normalizing flows are a class of generative models that represent complex data distributions by starting from a simple, known distribution (e.g., a Gaussian) and applying a sequence of invertible transformations. Through these transformations, the model “warps” an easy-to-sample distribution into one that matches the target real-world data distribution. Unlike GANs, which rely on adversarial training, normalizing flows are trained by maximizing the exact likelihood of the data, enabling direct probability evaluation and stable, non-adversarial training.


Prerequisite - The Change of Variables Technique

Analogy

The Skilled Pastry Chef (Flow Model): Imagine a pastry chef starting with a simple, uniform dough (representing a simple distribution like a Gaussian). The dough is initially shapeless and uniform. With skillful, reversible shaping steps—rolling, folding, pressing—the chef transforms this simple dough into intricate pastries (the target data distribution). Each step is measured, controlled, and invertible (you can, in principle, unfold each layer back to the original dough). Over time, by perfecting the sequence of transformations, the chef can produce pastries that match the textures, shapes, and flavors that customers (the dataset) desire.

Here, the "dough" is like your initial known distribution (e.g., a Gaussian), and the step-by-step shaping operations are the layers of the normalizing flow. Each transformation must be invertible and differentiable, ensuring we can track how probability mass changes at every step.

Intuition and Design

Normalizing flows address the challenge of modeling complex, intractable distributions $p(x)$. Instead of modeling $p(x)$ directly, we construct a "decoder" function $f_\theta: z \rightarrow x$ that transforms samples from a simple, tractable distribution $p_z(z)$ (such as a standard Gaussian) into samples from the target complex distribution $p_x(x)$. Here, $z$ represents the latent space with the simple distribution, while $x$ represents the observable space with the complex distribution we wish to model.

This offloads the direct maximum likelihood estimation (MLE) of $x$ under $p(x)$ to MLE under a parameterized model: we choose the parameters so that the probability the learned distribution assigns to the observed $x$ values is maximized.
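
To make the sampling direction concrete, here is a minimal sketch in which a hand-picked, hypothetical affine map stands in for the learned $f_\theta$ (in a real flow its parameters would be trained): we draw $z$ from the base Gaussian and push it through $f_\theta$ to obtain samples of $x$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "decoder" f_theta: an invertible affine map z -> x.
# In a real flow, these parameters (2.0 and 1.0) would be learned.
def f_theta(z):
    return 2.0 * z + 1.0

def f_theta_inv(x):
    return (x - 1.0) / 2.0

# Sampling direction: draw z from the simple base distribution p_z
# (a standard Gaussian) and transform it into x = f_theta(z).
z = rng.standard_normal(size=5)
x = f_theta(z)

print(x)                               # samples from the induced p_x
print(np.allclose(f_theta_inv(x), z))  # invertibility check: True
```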


As we already saw in the prerequisite, the function $f_\theta$ mapping $z$ to $x$ is not the same as the mapping from the density $p_z(z)$ to the density $p_x(x)$, and it is this density mapping that we need for the MLE calculation. According to the change of variables theorem, the densities are related as follows:

$$ p_x(x) = p_z(z) \left|\det\left(\frac{d z}{dx}\right)\right| $$

where the subscripts on $p_x$ and $p_z$ indicate which density is meant, and the $x$ and $z$ in parentheses refer to the specific variable values related by $x = f_\theta(z)$.
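
As a quick numerical sanity check of this formula (an illustration, not part of the derivation), take the hypothetical affine map $x = 2z + 1$ from the sketch above with $z \sim \mathcal{N}(0, 1)$; then $x \sim \mathcal{N}(1, 4)$, and the change-of-variables density matches the known closed form.

```python
import numpy as np
from scipy.stats import norm

# x = f_theta(z) = 2z + 1 with z ~ N(0, 1), so z = (x - 1) / 2 and dz/dx = 1/2.
x = np.linspace(-5.0, 7.0, 101)
z = (x - 1.0) / 2.0

# Change of variables: p_x(x) = p_z(z) * |det(dz/dx)|
p_x_flow = norm.pdf(z) * abs(0.5)

# Closed form: x ~ N(1, 2^2)
p_x_true = norm.pdf(x, loc=1.0, scale=2.0)

print(np.allclose(p_x_flow, p_x_true))  # True
```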

Now the question is, how do we use this formula?

This is where the power of normalizing flows comes into play. Normalizing flows require that the function $f_\theta: z \rightarrow x$ be invertible, so we can use the inverse function $f_\theta^{-1}: x \rightarrow z$ to map $x$ back to $z$. Since $p_z$ is a standard Gaussian distribution, the first factor on the right-hand side, $p_z(z) = p_z(f_\theta^{-1}(x))$, can be computed in closed form. The second factor, although more complex, is the absolute value of the determinant of the Jacobian of $z = f_\theta^{-1}(x)$ with respect to $x$.
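
The same recipe extends to higher dimensions. The sketch below (again with a hypothetical, fixed affine map in place of a learned network) evaluates $\log p_x(x) = \log p_z(f_\theta^{-1}(x)) + \log\left|\det\left(\partial f_\theta^{-1}(x) / \partial x\right)\right|$, computing the Jacobian by automatic differentiation. Forming the full Jacobian and its determinant this way costs roughly $O(D^3)$ per sample, which is exactly why the next point matters.

```python
import torch
from torch.autograd.functional import jacobian

# Hypothetical invertible "decoder" f_theta: x = A z + b with A invertible.
A = torch.tensor([[2.0, 0.5],
                  [0.0, 1.5]])
b = torch.tensor([1.0, -1.0])
A_inv = torch.linalg.inv(A)

def f_theta_inv(x):
    # z = A^{-1} (x - b)
    return A_inv @ (x - b)

base = torch.distributions.MultivariateNormal(torch.zeros(2), torch.eye(2))

def log_px(x):
    z = f_theta_inv(x)
    # Jacobian of the inverse map, dz/dx, via autograd (O(D^3) in general).
    J = jacobian(f_theta_inv, x)
    log_abs_det = torch.linalg.slogdet(J).logabsdet
    return base.log_prob(z) + log_abs_det

x = torch.tensor([0.3, 2.0])
print(log_px(x))

# Closed-form check: x = A z + b with z ~ N(0, I)  =>  x ~ N(b, A A^T).
reference = torch.distributions.MultivariateNormal(b, A @ A.T)
print(reference.log_prob(x))  # should match the value above
```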

Since this determinant is often expensive to compute, the invertible function $f_\theta$ must be carefully designed to keep the computation cheap. We will set that aside for now and first take a look at what the loss function looks like.

Loss Function in Normalizing Flows

Specifically, the MLE objective can be expressed as:

$$ \argmax_\theta \prod_i p_x(x_i) = \argmax_\theta \prod_i p_z(f_\theta^{-1}(x_i)) \left|\det\left(\frac{d f_\theta^{-1}(x_i)}{dx_i}\right)\right| $$
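
Plugging the hypothetical 1-D affine flow from earlier into this objective gives a concrete picture: for a small batch of observations, the right-hand side is the product of base-density values at $f_\theta^{-1}(x_i)$ times the absolute derivative of the inverse map (a sketch of evaluating the objective, not a full training loop).

```python
import numpy as np
from scipy.stats import norm

# Reuse the hypothetical affine flow x = 2z + 1, so f_inv(x) = (x - 1) / 2
# and d f_inv / dx = 1/2 everywhere.
x_batch = np.array([0.5, 1.2, 3.0, -0.7])

z_batch = (x_batch - 1.0) / 2.0
per_sample = norm.pdf(z_batch) * abs(0.5)

# The MLE objective above: a product over the samples in the batch.
likelihood = np.prod(per_sample)
print(likelihood)
```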

The goal is to find the optimal parameters $\theta$ that maximize the likelihood of observing all the given image samples $x_i$. By using the negative log-likelihood framework, we can simplify the computation by converting the product of probabilities into a sum of logarithms, which leads to the following loss function to minimize: