Adopting a probabilistic perspective provides a unified framework for understanding how we measure the discrepancy between ground truth labels and predictions. It not only offers valuable insights for selecting and designing loss functions tailored to different tasks but also serves as the "sea level" in our exploration of deeper concepts.
Maximum Likelihood Estimation (MLE) acts as this foundation, enabling us to dive beneath the surface and uncover the hidden complexities of advanced neural network designs, particularly in generative AI.
This section will explore how viewing loss functions through a probabilistic lens can transform our understanding, allowing us to navigate and assess their impacts across diverse predictive models, much like exploring the vast depths of an iceberg below the waterline.
Before discussing likelihood estimation, it's important to first clarify what "likelihood" actually means, particularly within the framework of probabilistic inference. In this framework, we often focus on the relationship between causes and effects. Inference, at its core, involves estimating the potential causes based on observed outcomes.
Let’s break down the key components:

- **Prior** $p(q)$: our belief about the possible causes before observing any data.
- **Likelihood** $p(\mathcal{Y} \mid q)$: how probable the observed outcomes are under a given cause.
- **Evidence** $p(\mathcal{Y})$: the overall probability of the observed outcomes, averaged over all possible causes.
- **Posterior** $p(q \mid \mathcal{Y})$: our updated belief about the causes after observing the data.
In essence, probabilistic inference, particularly through Bayes’ theorem, allows us to combine these components: we move from prior assumptions about causes (the prior) to refined conclusions (the posterior) by incorporating observed data (the evidence and the likelihood).
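Concretely, Bayes’ theorem relates these four quantities (writing $q$ for the cause, or parameter, and $\mathcal{Y}$ for the observed data):

$$ \underbrace{p(q \mid \mathcal{Y})}_{\text{posterior}} = \frac{\overbrace{p(\mathcal{Y} \mid q)}^{\text{likelihood}} \; \overbrace{p(q)}^{\text{prior}}}{\underbrace{p(\mathcal{Y})}_{\text{evidence}}} $$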
In MLE, we do not consider Bayesian terms such as priors or posteriors. Instead, MLE focuses solely on the likelihood, aligning with the frequentist perspective that emphasizes how well the data is explained by the model.
The likelihood function $\mathcal{L}$ is defined as the probability of observing the given data $\mathcal{Y}$ under the distribution parameterized by $q$ (this parameter is usually written as $\theta$, but to distinguish it from the machine learning model parameters $\theta$, we use $q$ here instead). Mathematically, it can be expressed as:
$$ \mathcal{L}(q) = p(\mathcal{Y} = c|q) $$
Where:

- $\mathcal{Y}$ is the random variable representing the observed data, and $c$ is the particular value actually observed;
- $q$ is the parameter of the assumed distribution.
Example: Suppose we have a sequence of observed coin tosses, denoted as:
$$ \mathcal{Y}=[\text{head}, \text{head}, \text{tail}] $$
Assuming the coin is biased with $q = 0.8$ in favor of heads (and therefore $1 - q = 0.2$ in favor of tails), we can compute the likelihood of observing this particular sequence $\mathcal{Y}$ as follows:
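As a quick numerical sketch: assuming the tosses are independent, the likelihood of the sequence is simply the product of the per-toss probabilities. The helper function below is illustrative, not part of any library:

```python
def likelihood(q, tosses):
    """Likelihood of an i.i.d. coin-toss sequence under a Bernoulli model.

    q: probability of heads; tosses: list of 'head'/'tail' outcomes.
    """
    result = 1.0
    for toss in tosses:
        # Each independent toss multiplies in q (heads) or 1 - q (tails).
        result *= q if toss == "head" else (1 - q)
    return result

Y = ["head", "head", "tail"]
print(likelihood(0.8, Y))  # 0.8 * 0.8 * 0.2
```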