In machine learning, Maximum Likelihood Estimation (MLE) is a fundamental technique for estimating the parameters of a given distribution. For instance, it can be used to estimate the mean and variance of a normal distribution.
The likelihood function $\mathcal{L}$ is defined as the probability of observing the given data $x$ under the model parameterized by $\theta$. Mathematically, it can be expressed as:
$$ \mathcal{L}(\theta) = P(X = x|\theta) $$
To build intuition, imagine a box of balls. The box represents our statistical model, and the balls inside it represent the possible outcomes of a random event. The proportion of blue balls in the box is unknown; this proportion, $\theta$, is what we are trying to estimate from our observations. The probability $P(X = x|\theta)$ is then the likelihood of observing a specific outcome, in this case drawing a ball $X$ and finding it blue ($x$), given our assumed proportion of blue balls $\theta$.
The goal of MLE in parameter estimation is to find the value of $\theta$ that maximizes $\mathcal{L}(\theta)$. This value, denoted as $\hat{\theta}$, is called the maximum likelihood estimate and is calculated as:
$$ \hat{\theta} = \underset{\theta}{\text{arg max}} \ \mathcal{L}(\theta) $$
The likelihood function can vary based on the nature of the data and the model. For independent and identically distributed (i.i.d.) data $x_1, \cdots, x_n$, the likelihood of observing them all is the product of the probabilities of observing each individual point $x_i$:
$$ \mathcal{L}(\theta) = \prod_{i=1}^{n} P(X_i = x_i|\theta) $$
Here, $n$ is the number of data points.
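As a minimal sketch of this product-form likelihood, the code below evaluates the i.i.d. likelihood of some data under a normal model, echoing the mean/variance example from the introduction. The function names, the sample data, and the parameter values are hypothetical, chosen only for illustration:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def likelihood(data, mu, sigma):
    """i.i.d. likelihood: the product of per-point densities."""
    L = 1.0
    for x in data:
        L *= normal_pdf(x, mu, sigma)
    return L

data = [1.2, 0.8, 1.1, 0.9]  # hypothetical observations clustered near 1.0
# The likelihood is larger for parameters that fit the data better:
print(likelihood(data, mu=1.0, sigma=0.2) > likelihood(data, mu=2.0, sigma=0.2))  # True
```

In practice one usually maximizes the log-likelihood (a sum rather than a product) for numerical stability, since products of many small densities underflow quickly.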
To exemplify the likelihood calculation, consider a coin-toss scenario. Suppose our data is the sequence of tosses $\mathcal{D}=[\text{head}, \text{head}, \text{tail}]$, and we assume a coin with bias $\theta = 0.8$ toward heads (and hence $0.2$ toward tails). The likelihood of observing this sequence $\mathcal{D}$ can be calculated as:
$$ \mathcal{L}(\theta) = P(\mathcal{D}|\theta) = P(x^{(1)}=\text{head}|\theta) \cdot P(x^{(2)}=\text{head}|\theta) \cdot P(x^{(3)}=\text{tail}|\theta) $$
Because the coin's bias $\theta$ is itself the probability of heads, we have $P(x^{(i)}=\text{head}|\theta) = \theta$ and $P(x^{(i)}=\text{tail}|\theta) = 1 - \theta$. Substituting gives $\mathcal{L}(0.8) = 0.8 \cdot 0.8 \cdot 0.2 = 0.128$.
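The coin-toss likelihood above, together with the arg max from the MLE definition, can be sketched in a few lines of Python. The grid search is a hypothetical illustration of finding $\hat{\theta}$, not a method prescribed by the text:

```python
def coin_likelihood(theta, data):
    """Likelihood of an i.i.d. sequence of coin tosses given bias theta toward heads."""
    L = 1.0
    for toss in data:
        L *= theta if toss == "head" else (1 - theta)
    return L

data = ["head", "head", "tail"]
print(round(coin_likelihood(0.8, data), 3))  # 0.128, matching the hand calculation

# Grid search over theta in [0, 1] to approximate the maximum likelihood estimate:
grid = [i / 1000 for i in range(1001)]
theta_hat = max(grid, key=lambda t: coin_likelihood(t, data))
print(theta_hat)  # close to 2/3, the fraction of heads in the data
```

Note that $\hat{\theta}$ comes out near $2/3$, the empirical frequency of heads, which is what the analytic MLE for a Bernoulli model gives.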