In machine learning, Maximum Likelihood Estimation (MLE) is a fundamental technique for estimating the parameters of an assumed distribution. For instance, it can be used to estimate the mean and variance of a normal distribution from observed data.
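
As a quick, hedged sketch of that introductory example (the synthetic data, the seed, and the choice of `scipy.stats.norm.fit` are my own illustrative assumptions, not something prescribed here), fitting a normal distribution by MLE can look like this:

```python
import numpy as np
from scipy.stats import norm

# Illustrative example: draw synthetic data from a known normal distribution,
# then recover its mean and standard deviation by maximum likelihood.
# scipy's norm.fit returns the MLE of (loc, scale) for a normal model.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)

mu_hat, sigma_hat = norm.fit(data)
print(mu_hat, sigma_hat)  # expected to land close to 5.0 and 2.0
```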

Likelihood in Statistics

The likelihood function $\mathcal{L}$ is defined as the probability of observing the given data $x$ under the model parameterized by $\theta$. Mathematically, it can be expressed as:

$$ \mathcal{L}(\theta) = P(X = x|\theta) $$

Where:

- $X$ is the random variable representing the outcome we observe,
- $x$ is the particular value actually observed, and
- $\theta$ denotes the parameters of the assumed model.

The formula can be read through a simple analogy: given a box whose assumed proportion of blue balls is $\theta$, we ask how probable it is that the next ball drawn ($X$) turns out to be blue ($x$).

The box represents our statistical model, and the balls inside it represent the possible outcomes of a random event. The true proportion of blue balls in the box is unknown; our guess for it is $\theta$, and this unknown proportion is what we are trying to estimate from our observations. The probability $P(X = x|\theta)$ is then the likelihood of observing a specific outcome (here, that a randomly drawn ball $X$ is blue, $x$) given our assumption about the proportion of blue balls in the box ($\theta$).
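
As a minimal sketch of this idea (the function name and the value $\theta = 0.3$ below are illustrative assumptions, not part of any standard API), the probability of a single draw under the box model is simply $\theta$ for a blue ball and $1-\theta$ otherwise:

```python
def likelihood_single_draw(x: str, theta: float) -> float:
    """P(X = x | theta) for a single draw from the box: theta is the assumed
    proportion of blue balls, so a blue draw has probability theta and any
    other colour has probability 1 - theta."""
    return theta if x == "blue" else 1.0 - theta

print(likelihood_single_draw("blue", theta=0.3))      # 0.3
print(likelihood_single_draw("not blue", theta=0.3))  # 0.7
```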

The goal of MLE in parameter estimation is to find the value of $\theta$ that maximizes $\mathcal{L}(\theta)$. This value, denoted as $\hat{\theta}$, is called the maximum likelihood estimate and is calculated as:

$$ \hat{\theta} = \underset{\theta}{\text{arg max}} \ \mathcal{L}(\theta) $$

The likelihood function can vary based on the nature of the data and the model. For independent and identically distributed (i.i.d.) data $x_1, \cdots, x_n$, the likelihood of observing them all is the product of the probabilities of observing each one $x_i$:

$$ \mathcal{L}(\theta) = \prod_{i=1}^{n} P(X_i = x_i|\theta) $$

Here, $n$ is the number of data points.
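
Putting the two pieces together, a rough grid-search sketch (the data, the grid resolution, and the variable names are assumptions made for illustration) multiplies the per-observation probabilities and picks the $\theta$ with the largest product:

```python
import numpy as np

# Illustrative i.i.d. Bernoulli data: 1 = blue ball drawn, 0 = any other colour.
data = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# Candidate values of theta on a fine grid (avoiding the exact endpoints 0 and 1).
thetas = np.linspace(0.001, 0.999, 999)

# L(theta) = product over i of P(X_i = x_i | theta), evaluated for each candidate.
likelihoods = np.array([np.prod(np.where(data == 1, t, 1.0 - t)) for t in thetas])

theta_hat = thetas[np.argmax(likelihoods)]
print(theta_hat)  # close to the sample proportion 6/8 = 0.75
```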

Evaluating Likelihood: Toss a Coin

![Coin toss](https://images.squarespace-cdn.com/content/v1/54905286e4b050812345644c/1609607725338-2QQ9FXKDTS7LZHVRREO1/CoinToss.jpg)

To illustrate the likelihood calculation, consider a coin-toss scenario. Suppose our data consist of a sequence of coin tosses, say $\mathcal{D}=[\text{head}, \text{head}, \text{tail}]$. If we assume a coin with bias $\theta = 0.8$ towards heads (and therefore $0.2$ towards tails), the likelihood of observing this sequence $\mathcal{D}$ can be calculated as:

$$ \mathcal{L}(\theta) = P(\mathcal{D}|\theta) = P(x^{(1)}=\text{head}|\theta) \cdot P(x^{(2)}=\text{head}|\theta) \cdot P(x^{(3)}=\text{tail}|\theta) $$

Because the coin's bias $\theta$ is itself the probability of heads, $P(x^{(i)}=\text{head}|\theta) = \theta$ and $P(x^{(i)}=\text{tail}|\theta) = 1 - \theta$, so this likelihood evaluates to $0.8 \cdot 0.8 \cdot 0.2 = 0.128$.
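
As a sanity check on that arithmetic (the helper function and the small grid comparison at the end are illustrative extras, not part of the original derivation), the likelihood of $[\text{head}, \text{head}, \text{tail}]$ under $\theta = 0.8$ works out to $0.128$:

```python
def coin_likelihood(tosses, theta):
    """L(theta) for an i.i.d. sequence of tosses: P(head) = theta, P(tail) = 1 - theta."""
    likelihood = 1.0
    for toss in tosses:
        likelihood *= theta if toss == "head" else 1.0 - theta
    return likelihood

data = ["head", "head", "tail"]
print(coin_likelihood(data, 0.8))  # 0.8 * 0.8 * 0.2 = 0.128...

# Comparing a few candidate biases suggests the sequence is most likely under
# theta = 2/3, the sample proportion of heads.
for theta in [0.5, 2 / 3, 0.8]:
    print(theta, coin_likelihood(data, theta))
```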