A Generative Adversarial Network (GAN) is a machine learning framework developed by Ian Goodfellow and his team in June 2014, notable for its application in generative AI. It functions through a competitive process involving two components: a generator ($G$) that creates data and a discriminator ($D$) that evaluates the data's authenticity. The discriminator serves as a dynamic loss function, determining the quality of the generator's output, thus pushing both components to improve iteratively in a game-like setting.
The Forger (Generator): Imagine an aspiring art forger who attempts to replicate the works of renowned painters. In the beginning, their imitations might be far from perfect, with noticeable flaws in brushstrokes, colors, or proportions. However, as they study famous paintings, learn about various techniques, and practice their craft, they become better at mimicking the style and precision of the original artists. The forger's ultimate goal is to create replicas so convincing that even the most skilled art expert cannot tell them apart from genuine masterpieces. Here, the forger represents the generator, striving to produce outputs (artworks) that are indistinguishable from real ones.
The Art Expert (Discriminator): On the other side is a seasoned art expert, trained to spot even the smallest inconsistencies in paintings. Initially, the expert can easily detect flaws in the forger's work, such as mismatched details or unnatural brushstrokes. However, as the forger improves and creates increasingly convincing imitations, the expert must deepen their knowledge, studying finer nuances of texture, technique, and historical style to keep up. The art expert, like the discriminator, evolves over time, sharpening their ability to identify authentic works from forgeries.
This dynamic mirrors the relationship between the generator and discriminator in GANs. The forger continuously improves their skill based on the feedback provided by the art expert, while the art expert simultaneously becomes more adept at identifying fakes.
Notably, the discriminator acts as a unique form of a loss function. Unlike conventional loss functions, which remain fixed, the discriminator evolves as it learns. This means that even for a given output (prediction), the evaluation changes over time, becoming increasingly precise as both components improve.
This continuous "push and pull" between the forger and the expert reflects the iterative, competitive learning process in GANs, driving both components to excel in their respective roles.
In image generation, it is difficult to quantify whether an image looks real. If we could compute such a realism score, its gradient could be used to guide an image generation network $G$ to improve the realism of the generated image. However, such a discriminator, denoted as $D$, is unknown and often complex.
So, why not learn a discriminator $D$? This is feasible because, if we have a neural network $G_\theta$ generating synthetic images, we know those images are fake, whereas images from the original dataset (such as photographs taken by a camera) are real.
Specifically, the loss can be expressed as:
$$ \arg\min_\phi \text{NLL}\left(c, D_\phi(x)\right) $$
Here, $D_\phi$ represents a learnable fake-detector model parameterized by $\phi$, $c$ is a binary label indicating True (Real) or False (Fake), and $x$ is an image sampled either from the generated set $x^{\text{fake}}$ (Fake) or from the real set $x^{\text{real}}$ (Real).
The objective is to minimize the classification loss—the Negative Log-Likelihood (NLL)—of the predicted output, so that $D_\phi$ learns to distinguish generated images from real ones. This allows us to approximate the realism evaluation function $D$ without explicitly knowing what it is, leveraging data-driven learning to guide the process.
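To make the NLL objective concrete, here is a minimal sketch in plain Python. The function name `nll` and the example probabilities are hypothetical; a real implementation would use a framework's binary cross-entropy loss over batches, but the per-sample computation reduces to exactly this:

```python
import math

def nll(c, p_real):
    """Negative log-likelihood for a single binary label.

    c: 1 if x is real, 0 if x is fake (the label c in the text).
    p_real: D_phi(x), the discriminator's predicted probability that x is real.
    """
    # For a real image we penalize -log D_phi(x); for a fake one, -log(1 - D_phi(x)).
    return -math.log(p_real) if c == 1 else -math.log(1.0 - p_real)

# Hypothetical discriminator outputs:
loss_on_real = nll(1, 0.9)  # D is fairly confident the real image is real -> small loss
loss_on_fake = nll(0, 0.2)  # D assigns low realness to the fake -> small loss
```

Minimizing this loss over $\phi$ pushes $D_\phi(x)$ toward 1 on real images and toward 0 on fakes.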
The loss above does not make the real and fake terms explicit. We separate them as shown below.
$$ \mathcal{L}_D(\phi) = \text{NLL}\left(c, D_\phi(x)\right) = - \mathbb{E}_{x \sim p_{\text{real}}} \left[ \log D_\phi(x) \right] - \mathbb{E}_{x \sim p_{\text{fake}}} \left[ \log \left(1 - D_\phi(x)\right) \right], $$
where: $D_\phi(x)$ is the discriminator's predicted probability that $x$ is real; $p_{\text{real}}$ is the distribution of real data (e.g., from the true dataset); $p_{\text{fake}}$ is the distribution of fake data (e.g., produced by the generator).
The expression above is written in expectation form; in actual code, we estimate it by sampling. The empirical formula is as follows:
$$ \mathcal{L}_D(\phi) = - \frac{1}{N_{\text{real}}} \sum_{i=1}^{N_{\text{real}}} \log D_\phi(x_i^{\text{real}}) - \frac{1}{N_{\text{fake}}} \sum_{j=1}^{N_{\text{fake}}} \log \left( 1 - D_\phi(x_j^{\text{fake}}) \right), $$
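The empirical formula maps directly to code. Below is a minimal sketch in plain Python; the function name `discriminator_loss` and the input lists are hypothetical stand-ins for the discriminator's outputs on a batch of real and fake images:

```python
import math

def discriminator_loss(d_real, d_fake):
    """Empirical discriminator loss L_D(phi).

    d_real: list of D_phi(x_i^real) values, one per real image in the batch.
    d_fake: list of D_phi(x_j^fake) values, one per generated image in the batch.
    """
    # First sum: average -log D_phi(x) over the N_real real images.
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    # Second sum: average -log(1 - D_phi(x)) over the N_fake fake images.
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical batch: D is mostly right about the reals and the fakes.
loss = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
```

In practice this is the standard binary cross-entropy loss, which deep learning frameworks provide in a numerically stable form (e.g., computed from logits rather than probabilities), but the arithmetic is the same as above.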