Our overarching goal is to find the set of network parameters, $\theta$, that makes the observed data as probable as possible. Assuming the $N$ examples in our dataset are independent, the likelihood factorizes into a product of the conditional probabilities:
$$ \arg\max_{\theta} \prod_{i=1}^{N} p\left(y^{(i)} \mid \theta, {x}^{(i)}\right) $$
A standard way to solve this optimization problem is with gradient-based algorithms such as gradient ascent. However, directly computing the gradients (the derivatives needed to update the model) of a long chain of multiplied probabilities, denoted by $\prod$, is awkward for two reasons: the product rule in calculus makes the derivatives unwieldy, and multiplying many probabilities, each less than one, risks numerical underflow on a computer.
To make the computation tractable, we take the logarithm of the objective function and solve the resulting $\log$ problem instead. This is valid because the logarithm is strictly increasing: it does not move the locations of the maximizers of the original function; it only reshapes the curve.
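This monotonicity is quick to verify numerically. A toy sketch with a one-parameter Bernoulli likelihood (7 successes in 10 trials; the grid is illustrative) shows the argmax is identical before and after taking the log:

```python
import numpy as np

# One-parameter likelihood: Bernoulli with 7 successes in 10 trials.
theta = np.linspace(0.01, 0.99, 99)
likelihood = theta**7 * (1 - theta)**3

# Taking the log reshapes the curve but leaves the argmax untouched:
best_raw = theta[np.argmax(likelihood)]
best_log = theta[np.argmax(np.log(likelihood))]
print(best_raw, best_log)  # → both ≈ 0.7, the maximum-likelihood estimate
```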
We use logarithms because they possess a highly useful mathematical property: they transform a messy problem involving products into a much simpler one involving sums, as illustrated below:
$$ \arg\max_{\theta} {\color{red}\log \prod_{i=1}^{N}}\, p(y^{(i)} \mid \theta, x^{(i)})\quad \iff\quad \arg\max_{\theta} {\color{red}\sum_{i=1}^{N} \log}\, p(y^{(i)} \mid \theta, x^{(i)}). $$
Here, the logarithm of the product is cleanly converted to the sum of logarithms, while all other terms remain unchanged. This step is especially helpful as the problem scales (e.g., when $N$ is millions of data points): the gradient of a sum is simply the sum of per-example gradients, which is far easier to compute than the gradient of a product.
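The gradient convenience can be checked directly. In this minimal sketch with a Bernoulli model (the labels and the value of $\theta$ are made up), the gradient of the summed log-likelihood is just the sum of simple per-example terms, verified against a finite-difference estimate:

```python
import numpy as np

# Toy Bernoulli model p(y | theta) = theta^y (1 - theta)^(1 - y);
# the labels and theta value below are made up for illustration.
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
theta = 0.6

# Gradient of each log p(y_i | theta) with respect to theta:
per_example = y / theta - (1 - y) / (1 - theta)

# Because the objective is a sum, its gradient is just the sum of the
# per-example gradients; no product rule needed:
total_grad = per_example.sum()

# Sanity check against a finite-difference estimate of the same derivative:
log_lik = lambda t: np.sum(y * np.log(t) + (1 - y) * np.log(1 - t))
eps = 1e-6
fd = (log_lik(theta + eps) - log_lik(theta - eps)) / (2 * eps)
print(total_grad, fd)  # both ≈ 0: theta = 0.6 is the MLE for these labels
```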
In machine learning convention, optimization algorithms are typically designed to minimize an error or loss function, rather than maximize a probability.
To transform our maximization problem into a standard minimization problem, we simply introduce a negative sign. Finding the peak of a mountain is the exact same task as finding the lowest point of that mountain if you flip it upside down:
$$ \arg\max_{\theta} \sum_{i=1}^{N} \log p(y^{(i)} \mid \theta, x^{(i)})\quad\iff\quad\arg{\color{red}\min_{\theta}} \sum_{i=1}^{N} {\color{red}-}\log p(y^{(i)} \mid \theta, x^{(i)}). $$
The minimization objective on the right is commonly referred to as the Negative Log-Likelihood (NLL). By minimizing the NLL, we perform Maximum Likelihood Estimation in a form that is much more computationally friendly. The recipe carries over directly to whatever probability distribution your model needs to predict.
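As a minimal end-to-end sketch (a Bernoulli model with a single parameter $\theta$, toy labels, and a made-up learning rate, not any particular library's API), minimizing the NLL by gradient descent recovers the maximum-likelihood estimate, which here is the sample mean:

```python
import numpy as np

# Toy labels for a model that predicts p(y = 1) = theta for every example;
# both the data and the learning rate are made up for illustration.
y = np.array([1, 0, 1, 1, 1, 0, 1, 0], dtype=float)  # sample mean = 0.625

def nll(theta):
    """Negative log-likelihood of the Bernoulli model over the dataset."""
    return -np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

# Plain gradient descent on the NLL:
theta = 0.5
for _ in range(2000):
    grad = -np.sum(y / theta - (1 - y) / (1 - theta))
    theta -= 0.01 * grad

print(theta)  # converges to the MLE, the sample mean 0.625
```

In practice, deep learning frameworks compute this same objective through built-in losses (for a categorical distribution, cross-entropy loss is exactly the NLL), but the mechanics are identical.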