In this section, we extend the concept of Maximum Likelihood Estimation (MLE) to a general machine learning model.

Unlike the straightforward application of MLE in the previous section, where we estimated a parameter such as the probability $q$ of a Bernoulli distribution, here MLE is applied to find the set of model parameters $\theta$ that maximizes the likelihood of observing the given labels $y^{(i)}$ for the corresponding inputs $\mathbf{x}^{(i)}$.

You can contrast them in this way:

(Figure: contrasting the two estimation settings.)

Note: the magician here is not omnipotent and cannot dictate the outcome of the coin toss. They can only interact with the coin in a way that influences the probability of it landing heads, and each magician's effect is different. You might view this as evaluating the magician's skill.

Therefore, the focus shifts from estimating the parameter $q$ to estimating the machine learning model's parameters $\theta$, which, together with the given input $\mathbf{x}^{(i)}$, determine the likelihood of observing $y^{(i)}$.

$$ \arg\max_{q} \prod_{i=1}^{N} p\left(y^{(i)} \mid q\right) \Longrightarrow\arg\max_{\theta} \prod_{i=1}^{N} p\left(y^{(i)} \mid \theta, \mathbf{x}^{(i)}\right). $$
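As a minimal sketch of the right-hand objective, the code below evaluates this likelihood for a hypothetical logistic-style model with $p(y^{(i)}=1 \mid \theta, \mathbf{x}^{(i)}) = \sigma(\theta^\top \mathbf{x}^{(i)})$; the model form and the toy data are assumptions for illustration, not something fixed by the text above.

```python
import numpy as np

def likelihood(theta, X, y):
    """Likelihood of the observed binary labels y given inputs X and parameters theta.

    Assumes, for illustration only, a logistic-style model:
    p(y = 1 | theta, x) = sigmoid(theta @ x).
    """
    p = 1.0 / (1.0 + np.exp(-X @ theta))       # p(y_i = 1 | theta, x_i) for every sample
    per_sample = np.where(y == 1, p, 1.0 - p)  # p(y_i | theta, x_i)
    return np.prod(per_sample)                 # product over all N samples

# Hypothetical toy data: 4 samples with 2 features each.
X = np.array([[0.5, 1.0], [1.5, -0.2], [-1.0, 0.3], [0.2, 0.8]])
y = np.array([1, 1, 0, 1])

print(likelihood(np.array([0.1, 0.4]), X, y))  # likelihood of y under one particular theta
```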

Simplifying the Derivative Calculation: The NLL Framework

Our goal is to find the parameters that make the observed data as probable as possible.

$$ \arg\max_{\theta} \prod_{i=1}^{N} p\left(y^{(i)} \mid \theta, \mathbf{x}^{(i)}\right) $$

A general-purpose way to solve this optimization problem is to use gradient-based methods, such as gradient descent or its variants. However, directly computing the gradient of a product, denoted by $\prod$, can be quite challenging due to the intricacies of the product rule in calculus.
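To see the difficulty concretely, write $p_i(\theta) = p(y^{(i)} \mid \theta, \mathbf{x}^{(i)})$. The product rule then gives

$$ \nabla_\theta \prod_{i=1}^{N} p_i(\theta) = \sum_{i=1}^{N} \left( \nabla_\theta\, p_i(\theta) \right) \prod_{j \neq i} p_j(\theta), $$

so every one of the $N$ terms drags along a product of the remaining $N-1$ factors.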

By convention, we simplify the computation by taking the logarithm of the objective function and solving the resulting $\log$ problem instead. This is valid because the logarithm is a strictly increasing function, so it does not change the location of a function's maxima or minima.

Why Logarithms: We use logarithms because they transform a problem involving products into one involving sums, as illustrated below:

$$ \arg\max_{\theta} \color{red}\log \prod_{i=1}^{N} \color{black}p(y^{(i)} \mid \theta, \mathbf{x}^{(i)})\quad \iff\quad \arg\max_{\theta} \color{red}\sum_{i=1}^{N} \log\color{black} p(y^{(i)} \mid \theta, \mathbf{x}^{(i)}). $$

Here, the logarithm of the product becomes a sum of logarithms, while all other terms remain unchanged. This step is especially helpful as the problem scale grows (i.e., $N$ is large), because sums are much easier to differentiate than products.
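A quick numerical sketch of this point, using randomly generated stand-ins for the per-sample likelihoods (purely for illustration): the raw product of many probabilities underflows to zero in floating point, while the sum of logarithms stays finite and is straightforward to work with.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the per-sample likelihoods p(y_i | theta, x_i).
p = rng.uniform(0.1, 0.9, size=1000)

product = np.prod(p)             # the raw product underflows for large N
sum_of_logs = np.sum(np.log(p))  # the log-domain objective stays finite

print(product)      # 0.0  (floating-point underflow)
print(sum_of_logs)  # a finite value, easy to work with and to differentiate
```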

Gradient Descent: In machine learning, we typically address the task of minimizing a function with respect to the parameters $\theta$. To transform a maximization problem into a minimization problem, we introduce a negative sign, as follows:

$$ \arg\max_{\theta} \sum_{i=1}^{N} \log p(y^{(i)} \mid \theta, \mathbf{x}^{(i)})\quad\iff\quad\arg\color{red}\min_{\theta}\color{black} \sum_{i=1}^{N} \color{red}{-} \color{black}\log p(y^{(i)} \mid \theta, \mathbf{x}^{(i)}). $$

The minimization objective on the right is commonly referred to as the Negative Log Likelihood (NLL). This approach is versatile and applicable to a variety of probability distributions.
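As a small sketch of the NLL in action, we can revisit the coin example from the earlier section and recover $q$ by gradient descent on the NLL rather than through the closed-form solution; the toss data below and the parameterization $q = \sigma(\theta)$ are assumptions for illustration.

```python
import numpy as np

# Hypothetical coin-toss data in the spirit of the earlier example: 1 = heads, 0 = tails.
tosses = np.array([1, 0, 1, 1, 0, 1, 1, 1])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta):
    """Negative log likelihood of the tosses, with q = sigmoid(theta)."""
    q = sigmoid(theta)
    return -np.sum(tosses * np.log(q) + (1 - tosses) * np.log(1 - q))

def nll_grad(theta):
    """Derivative of the NLL above with respect to theta."""
    return np.sum(sigmoid(theta) - tosses)

# Plain gradient descent on the NLL.
theta = 0.0
for _ in range(200):
    theta -= 0.1 * nll_grad(theta)

print(sigmoid(theta))  # approaches 6/8 = 0.75, the closed-form MLE for q
```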

MLE for Binary Classification

We generally treat the model's prediction $\hat{y}^{(i)} = f_\theta(\mathbf{x}^{(i)})$ as playing the role of the distribution parameter $q$ discussed in earlier sections. In binary classification, the model output is represented as follows: