In machine learning, Maximum Likelihood Estimation (MLE) is a fundamental technique for estimating the parameters of a given distribution. For instance, it can be used to estimate the mean and variance of a normal distribution.
The likelihood function $\mathcal{L}$ is defined as the probability of observing the given data $x$ under the model parameterized by $\theta$. Mathematically, it can be expressed as:
$$ \mathcal{L}(\theta) = P(X = x|\theta) $$
To build intuition, imagine a box of balls. The box represents our statistical model, and the balls inside it represent the possible outcomes of a random event. The proportion of blue balls in the box is unknown; this proportion, $\theta$, is what we are trying to estimate from our observations. The probability $P(X = x|\theta)$ is then the likelihood of observing a specific outcome, in this case drawing a ball $X$ and finding it blue ($x$), given our assumed proportion of blue balls $\theta$.
The goal of MLE in parameter estimation is to find the value of $\theta$ that maximizes $\mathcal{L}(\theta)$. This value, denoted as $\hat{\theta}$, is called the maximum likelihood estimate and is calculated as:
$$ \hat{\theta} = \underset{\theta}{\text{arg max}} \ \mathcal{L}(\theta) $$
The likelihood function can vary based on the nature of the data and the model. For independent and identically distributed (i.i.d.) data $x_1, \cdots, x_n$, the likelihood of observing them all is the product of the probabilities of observing each individual point $x_i$:
$$ \mathcal{L}(\theta) = \prod_{i=1}^{n} P(X_i = x_i|\theta) $$
Here, $n$ is the number of data points.
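As a minimal sketch of this product-form likelihood, the code below evaluates the i.i.d. likelihood of some data under a normal model, echoing the mean/variance example from the introduction. The function names, the sample data, and the parameter values are hypothetical, chosen only for illustration:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def likelihood(data, mu, sigma):
    """i.i.d. likelihood: the product of per-point densities."""
    L = 1.0
    for x in data:
        L *= normal_pdf(x, mu, sigma)
    return L

data = [1.2, 0.8, 1.1, 0.9]  # hypothetical observations clustered near 1.0
# The likelihood is larger for parameters that fit the data better:
print(likelihood(data, mu=1.0, sigma=0.2) > likelihood(data, mu=2.0, sigma=0.2))  # True
```

In practice one usually maximizes the log-likelihood (a sum rather than a product) for numerical stability, since products of many small densities underflow quickly.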
To exemplify the likelihood calculation, consider a coin-toss scenario. Suppose our data is the sequence of tosses $\mathcal{D}=[\text{head}, \text{head}, \text{tail}]$, and we assume a coin with bias $\theta = 0.8$ toward heads (and hence $0.2$ toward tails). The likelihood of observing this sequence $\mathcal{D}$ can be calculated as:
$$ \mathcal{L}(\theta) = P(\mathcal{D}|\theta) = P(x^{(1)}=\text{head}|\theta) \cdot P(x^{(2)}=\text{head}|\theta) \cdot P(x^{(3)}=\text{tail}|\theta) $$
Because the coin's bias $\theta$ is itself the probability of heads, we have $P(x^{(i)}=\text{head}|\theta) = \theta$ and $P(x^{(i)}=\text{tail}|\theta) = 1 - \theta$. Substituting gives $\mathcal{L}(0.8) = 0.8 \cdot 0.8 \cdot 0.2 = 0.128$.
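The coin-toss likelihood above, together with the arg max from the MLE definition, can be sketched in a few lines of Python. The grid search is a hypothetical illustration of finding $\hat{\theta}$, not a method prescribed by the text:

```python
def coin_likelihood(theta, data):
    """Likelihood of an i.i.d. sequence of coin tosses given bias theta toward heads."""
    L = 1.0
    for toss in data:
        L *= theta if toss == "head" else (1 - theta)
    return L

data = ["head", "head", "tail"]
print(round(coin_likelihood(0.8, data), 3))  # 0.128, matching the hand calculation

# Grid search over theta in [0, 1] to approximate the maximum likelihood estimate:
grid = [i / 1000 for i in range(1001)]
theta_hat = max(grid, key=lambda t: coin_likelihood(t, data))
print(theta_hat)  # close to 2/3, the fraction of heads in the data
```

Note that $\hat{\theta}$ comes out near $2/3$, the empirical frequency of heads, which is what the analytic MLE for a Bernoulli model gives.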