To recap binary classification: we categorize an outcome into two classes, $y = 0$ or $y = 1$, and model the likelihood of observing class $y$ given input $\mathbf{x}$ and parameters $\theta$ as:
$$ p(y|\theta, \mathbf{x}) = (\hat{y})^y (1 - \hat{y})^{(1-y)} $$
Here, $\hat{y}$ is the prediction for the given input $\mathbf{x}$. For example, in a logistic regression model, $\hat{y}(\mathbf{w}, b) = \sigma(\mathbf{w} \cdot \mathbf{x} + b)$, which predicts the probability that $y = 1$, as shown in the left part of the figure below.
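As a quick illustration, here is a minimal NumPy sketch of this Bernoulli likelihood under a logistic regression prediction; the weights, bias, and input are made-up values used only for demonstration:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative (made-up) parameters and input.
w = np.array([0.8, -0.4])
b = 0.1
x = np.array([1.5, 2.0])

y_hat = sigmoid(np.dot(w, x) + b)   # predicted p(y = 1 | x)

# Bernoulli likelihood: p(y | theta, x) = y_hat^y * (1 - y_hat)^(1 - y)
for y in (0, 1):
    likelihood = y_hat**y * (1 - y_hat)**(1 - y)
    print(f"y = {y}: likelihood = {likelihood:.3f}")
```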
The right part of the figure depicts a multi-class classification model with $K$ classes: the model outputs $K$ probabilities $\hat{\mathbf{y}} = [\hat{y}_1, \cdots, \hat{y}_K]$. These probabilities form a categorical distribution rather than a Bernoulli distribution. Each probability $\hat{y}_k$ represents the likelihood of belonging to class $k$:
$$ \hat{y}_k = p(y=k)\,\,\text{OR}\,\,\hat{y}_k = p(y_k=1) $$
To ensure the output probabilities sum to 1, we use the softmax function, $\hat{\mathbf{y}} = \text{softmax}(\mathbf{W} \cdot \mathbf{x} + \mathbf{b})$, which guarantees $\sum_{k=1}^{K} \hat{y}_k = 1$.
The softmax function is a "soft" version of the max function. Instead of just picking the largest value, it converts all scores into a probability distribution. Each output is computed as
$$ \hat{y}_k = \frac{\exp(z_k)}{\sum_{j}\exp(z_j)}, $$
where $z_k$ is the $k$-th component of the score vector $\mathbf{z} = \mathbf{W} \cdot \mathbf{x} + \mathbf{b}$ produced by the model. Taking the exponential of each score emphasizes the largest one: when one score is much larger than the rest, its output approaches 1 while the others approach 0. Dividing by the sum of all exponentials ensures the resulting values are probabilities that sum to 1.
This sum-to-1 property makes softmax ideal for multi-class classification: it transforms the model's linear outputs into a valid probability distribution, with each value between 0 and 1.
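Here is a minimal NumPy sketch of the softmax function; the scores are made-up values, and the max-subtraction is a standard numerical-stability trick rather than part of the definition above:

```python
import numpy as np

def softmax(z):
    """Convert a vector of scores into a probability distribution."""
    z = z - np.max(z)           # shift for numerical stability; softmax is shift-invariant
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

# Illustrative scores z_k for K = 3 classes (made-up values).
z = np.array([2.0, 1.0, 0.1])
y_hat = softmax(z)

print(y_hat)         # roughly [0.659, 0.242, 0.099]; the largest score dominates
print(y_hat.sum())   # 1.0: a valid probability distribution
```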
In the multi-class setting, with the label one-hot encoded as $\mathbf{y} = [y_1, \cdots, y_K]$ (the true class has $y_k = 1$ and all other entries are 0), the likelihood is:
$$ p(\mathbf{y}|\theta, \mathbf{x}) = (\hat{y}_1)^{y_1}(\hat{y}_2)^{y_2} \cdots (\hat{y}_K)^{y_K} = \prod_{k=1}^{K} (\hat{y}_k)^{y_k} $$
Each sample $\mathbf{x}^{(i)}$ has exactly one true class, so every factor with $y_k^{(i)} = 0$ equals 1 and the product reduces to the predicted probability of the true class. We optimize $\theta$ by minimizing the negative log-likelihood over all samples $i = 1, \cdots, N$:
$$ \argmin_{\theta} -\log\prod_{i=1}^{N} \prod_{k=1}^{K} (\hat{y}_k^{(i)})^{y_k^{(i)}} \iff \argmin_{\theta} -\sum_{i=1}^{N} \sum_{k=1}^{K} y_k^{(i)} \log(\hat{y}_k^{(i)}) $$
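To make the double sum concrete, here is a small NumPy sketch of this negative log-likelihood for one-hot labels; the labels and predicted probabilities are made-up numbers, and the clipping constant is only an implementation detail to avoid $\log(0)$:

```python
import numpy as np

def neg_log_likelihood(y_true, y_pred, eps=1e-12):
    """Negative log-likelihood of one-hot labels y_true under predictions y_pred.

    y_true: (N, K) one-hot encoded labels
    y_pred: (N, K) predicted probabilities (each row sums to 1)
    """
    y_pred = np.clip(y_pred, eps, 1.0)      # avoid log(0)
    return -np.sum(y_true * np.log(y_pred))

# Illustrative batch of N = 2 samples and K = 3 classes (made-up numbers).
y_true = np.array([[0, 1, 0],
                   [1, 0, 0]])
y_pred = np.array([[0.2, 0.7, 0.1],
                   [0.6, 0.3, 0.1]])

# Only the log-probability of each sample's true class contributes to the sum.
print(neg_log_likelihood(y_true, y_pred))   # -(log 0.7 + log 0.6), roughly 0.87
```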
This framework effectively extends the binary model to accommodate multiple classes.