The example above deals with discrete (0/1) events, such as the coin-toss experiment. In many scenarios, however, we encounter events with continuous values. Let us use the normal distribution as an example to illustrate this case.
Suppose we have a set of data that follows a normal distribution with an unknown mean $\mu$ and a known standard deviation $\sigma$. Our goal is to estimate the unknown mean $\mu$, and maximum likelihood estimation lets us do exactly that.
For example, suppose we observe the data $[2.1, 2.3, 2.5]$ and we know that these data come from a normal distribution with standard deviation $\sigma=0.5$. Our goal is then to find the mean $\mu$ that maximizes the likelihood of observing this set of data.
The probability density function (PDF) formula for the normal distribution is:
$$ f(x | \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} $$
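To make the formula concrete, here is a minimal sketch of this density in plain Python (the helper name `normal_pdf` and the sample values are just for illustration):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density f(x | mu, sigma) of a normal distribution at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# For example, the density of observing x = 2.1 under mu = 2, sigma = 0.5:
print(normal_pdf(2.1, 2.0, 0.5))  # approximately 0.7821
```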
Suppose we pick several candidate mean values $\mu$, say 1, 2, or 3; we can then compute the likelihood for each of them. The likelihood function is defined as:
$$ \mathcal{L}(\mu) = f(2.1|\mu, 0.5) \cdot f(2.3|\mu, 0.5) \cdot f(2.5|\mu, 0.5) $$
The value of $\mu$ that maximizes this likelihood is the maximum likelihood estimate, i.e., our best estimate of the mean of the normal distribution that the data follow.
We use the probability density function of the normal distribution to compute the density of each observed value. For each candidate $\mu$, the computation goes as follows:
When $\mu=1$:
$$ \mathcal{L}(1) = f(2.1|1, 0.5) \cdot f(2.3|1, 0.5) \cdot f(2.5|1, 0.5)= 1.70839 \times 10^{-5} $$
When $\mu=2$:
$$ \mathcal{L}(2) = f(2.1|2, 0.5) \cdot f(2.3|2, 0.5) \cdot f(2.5|2, 0.5) = 0.25224 $$
When $\mu=3$:
$$ \mathcal{L}(3) = f(2.1|3, 0.5) \cdot f(2.3|3, 0.5) \cdot f(2.5|3, 0.5)=0.02288 $$
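These three likelihoods can be reproduced with a short script; this sketch simply reuses the `normal_pdf` helper defined above and multiplies the densities of the three observations:

```python
data = [2.1, 2.3, 2.5]
sigma = 0.5

def likelihood(mu):
    """L(mu): product of the densities of all observations."""
    result = 1.0
    for x in data:
        result *= normal_pdf(x, mu, sigma)
    return result

for mu in (1, 2, 3):
    print(f"L({mu}) = {likelihood(mu):.6g}")
# Approximately: L(1) = 1.708e-05, L(2) = 0.25224, L(3) = 0.02288,
# matching the values above.
```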
From these calculations, the likelihood is largest at $\mu=2$ among our candidates. However, in this scenario we are only choosing the parameter from a predefined set of candidates. In practical situations, the parameter should be optimized over its whole range to obtain the best value; for a normal model with known $\sigma$, that optimum is in fact the sample mean, here $\bar{x}=2.3$.
In practice, when we solve the MLE problem with optimization, it is more convenient to work with the natural logarithm of the likelihood function, known as the log-likelihood. This transformation turns products (the joint probability of a series of independent events happening together) into sums (this is why, in practice, we sum errors up!), simplifying the optimization process:
$$ \log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log P( x^{(i)}|\theta) $$
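As a sketch of how this optimization could be done in practice (assuming NumPy and SciPy are available; the search bounds are arbitrary illustrative choices), we can minimize the negative log-likelihood numerically. For a normal model with known $\sigma$, the resulting estimate coincides with the sample mean, $\bar{x}=2.3$:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

data = np.array([2.1, 2.3, 2.5])
sigma = 0.5

def neg_log_likelihood(mu):
    """Negative log-likelihood: -sum_i log f(x_i | mu, sigma)."""
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

# Minimizing the negative log-likelihood is the same as maximizing the likelihood.
result = minimize_scalar(neg_log_likelihood, bounds=(0.0, 5.0), method="bounded")
print(result.x)      # approximately 2.3 -- the maximum likelihood estimate
print(data.mean())   # 2.3 -- the closed-form MLE for the mean of a normal
```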