PixelRNN is a generative model that combines deep learning with autoregressive conditional probability modeling, focusing on generating images pixel by pixel. It learns the conditional dependencies between pixels, treating the generation of each pixel as the result of sampling from a conditional distribution based on previously generated pixels.

Specifically, PixelRNN models the pixel sequence of an input image $x$, decomposing the joint probability distribution $p(x)$ into a product of conditional distributions $p(x_i \mid x_1, x_2, \ldots, x_{i-1})$, thereby generating a complete image step by step. This approach models the distribution directly in pixel space, without relying on latent-variable representations, and effectively captures complex spatial correlations in images.


Analogy

Raster Scanning: Recall how old televisions worked, where the image on the screen was generated by an electron beam scanning line by line.

Starting from the top-left corner of the screen, the electron beam draws pixels row by row, filling each row from left to right until the entire picture is completely displayed.

The color and brightness of each pixel can be adjusted based on previously drawn pixels to ensure that the entire image is smooth and consistent.

Problem Definition and Intuition

Raster scanning captures the core idea of PixelRNN: predict each not-yet-generated pixel conditioned on the part of the image already generated. As the completed region gradually extends over the entire image, PixelRNN works like solving a puzzle, recursively generating a complete image pixel by pixel.

Given an image $x$ (which can be viewed as a pixel matrix), PixelRNN's approach is to autoregressively model the joint probability distribution $p(x)$ of image pixels. Autoregression refers to breaking down a multidimensional (e.g., two-dimensional image) or sequential joint distribution into a series of conditional distributions:

$$ p(x) = \prod_{t} p\bigl(x[t] \mid x[1], x[2], \ldots, x[t-1]\bigr), $$


where $x[t]$ denotes the $t$-th pixel (or sequence element). When sampling the $t$-th pixel, we predict its distribution from the previously sampled $t-1$ pixels. Note that in practice, the prediction of $x[t]$ is based on an aggregate hidden representation $h[t]$ of the previous $t-1$ pixels.

This is a straightforward design: the value of any pixel can be viewed as depending on the pixels generated before it, and not on later pixels (where "before" and "after" depend on the chosen ordering). Therefore, we only need to learn to predict the conditional distribution of the current pixel from previously generated pixels, rather than modeling the joint distribution of the entire image directly. Compared to handling the high-dimensional joint distribution all at once, this step-by-step decomposition greatly reduces the modeling complexity.
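To make this concrete, here is a minimal sketch of raster-scan sampling in PyTorch. The `model` here is a hypothetical stand-in for any network that maps the pixels generated so far to logits over the 256 possible values of the next pixel; it is not the actual PixelRNN architecture.

```python
import torch

def sample_image(model, height, width):
    """Sample an image pixel by pixel in raster-scan order.

    `model` is assumed (hypothetically) to map the flattened prefix of
    already-sampled pixels to logits over 256 possible pixel values.
    """
    pixels = torch.zeros(height * width, dtype=torch.long)
    for t in range(height * width):
        # p(x[t] | x[1], ..., x[t-1]): condition only on earlier pixels.
        logits = model(pixels[:t])               # shape: (256,)
        probs = torch.softmax(logits, dim=-1)
        pixels[t] = torch.multinomial(probs, num_samples=1)
    return pixels.view(height, width)
```

Note how the loop mirrors the factorization above: each iteration samples one factor $p(x[t] \mid x[1],\ldots,x[t-1])$, so generating a full image requires $H \times W$ sequential model evaluations.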

A Brief Row LSTM Example

  1. Treat the image as a one-dimensional sequence from top to bottom, left to right:

    $$ (x[1], x[2], \dots, x[W], x[W+1], x[W+2], \dots, x[H\times W]) $$

  2. Use an RNN (such as LSTM) to process this sequence sequentially, outputting parameter estimates for the distribution of the next pixel. For example:

    $$ h[t] = \text{LSTM}(h[t-1], x[t-1]), \quad p(x[t] \mid x[1],\ldots,x[t-1]) = f_\theta(h[t]) $$

    where $h[t]$ is the hidden state of the RNN, and $f_\theta(\cdot)$ is a function that outputs the parameters of the pixel distribution (e.g., a softmax over the 256 discrete values 0-255). A code sketch of this recurrence follows the list.
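The following is a minimal PyTorch sketch of this recurrence, using a plain LSTM over the flattened pixel sequence (a simplification of the actual Row LSTM layer; all class and variable names are illustrative):

```python
import torch
import torch.nn as nn

class PixelLSTM(nn.Module):
    """Plain LSTM over the flattened pixel sequence (simplified sketch)."""

    def __init__(self, hidden_size=128, num_values=256):
        super().__init__()
        self.embed = nn.Embedding(num_values, hidden_size)   # x[t-1] -> vector
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_values)       # f_theta(h[t])

    def forward(self, x):
        # x: (batch, H*W) integer pixel values in [0, 255].
        # Shift the sequence right so position t only sees x[1..t-1].
        inp = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        h, _ = self.lstm(self.embed(inp))    # h[t] summarizes x[1..t-1]
        return self.head(h)                  # logits for p(x[t] | x[<t])

# Training maximizes log-likelihood, i.e. per-pixel cross-entropy:
model = PixelLSTM()
x = torch.randint(0, 256, (4, 28 * 28))      # e.g. four flattened 28x28 images
logits = model(x)
loss = nn.functional.cross_entropy(logits.reshape(-1, 256), x.reshape(-1))
```

During training, all the conditional distributions can be computed in one parallel pass over the ground-truth sequence (teacher forcing); only sampling has to proceed one pixel at a time.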

You might ask: why can we predict $p(x[t] \mid x[1],\ldots,x[t-1])$ through $f_\theta(h[t])$, even though $h[t]$ is not strictly equivalent to the full history $x[1],\ldots,x[t-1]$?

Here is a simple example: let $v = g(u) = u + 1$ with $u \in \mathbb{Z}$. In this case, $p(x \mid u)$ and $p(x \mid v)$ are the same distribution.

Because $v$ is obtained from $u$ by a shift, i.e., $v = u + 1$, this transformation does not change the form of the conditional distribution $p(x \mid u)$: a bijective relabeling of a variable does not affect its conditional dependency relationships with other variables. That is, for any given $x$ and corresponding $u$ and $v$, the following holds:

$$ p(x \mid u = 1) = p(x \mid v = 2), \quad p(x \mid u = 2) = p(x \mid v = 3), \ \dots $$

This indicates that $p(x \mid u)$ and $p(x \mid v)$ agree both in value and in form.
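A tiny numerical check of this point (the probabilities below are made-up toy values, for illustration only):

```python
# Conditional distribution p(x | u) for a toy binary x, indexed by the
# conditioning value u (illustrative numbers only).
p_x_given_u = {1: [0.3, 0.7], 2: [0.9, 0.1]}

# v = u + 1 is a bijective relabeling of u, so conditioning on v = u + 1
# selects exactly the same distribution as conditioning on u.
p_x_given_v = {u + 1: dist for u, dist in p_x_given_u.items()}

assert p_x_given_u[1] == p_x_given_v[2]   # p(x | u=1) == p(x | v=2)
assert p_x_given_u[2] == p_x_given_v[3]   # p(x | u=2) == p(x | v=3)
```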

Note: Mathematically, the two probability expressions $p(x[t] \mid h[t])$ and $p(x[t] \mid x[1],\ldots,x[t-1])$ may not always be equivalent; they coincide only when $h[t]$ retains all information in the history that is relevant to $x[t]$.