Machine learning tasks can be interpreted in terms of probability distributions assigned to different types of target variables. In other words, when we train a model, we assume there is some random process generating our observed outputs, and we try to learn the parameters of that process from data. In practice, this viewpoint is rooted in generalized linear models but also applies more broadly. Below is a tutorial that explores these perspectives, highlighting how the nature of the response (binary, real-valued, count, rank, and so on) guides us toward certain distributions and learning approaches. The discussion is organized in continuous prose, with illustrative examples for each task type.

Classification - Bernoulli / Multinomial

Starting with classification, imagine you want to decide whether an incoming email is spam or not. In this situation, the output is a binary label, "spam" or "not spam." Because there are only two possible outcomes, a Bernoulli distribution neatly captures the probability of the label being "spam" versus "not spam." Logistic regression is a classic approach that models this probability via a sigmoid function applied to a linear combination of the input features. Now extend this beyond two classes to, say, three or more topics (e.g., classifying news articles as politics, business, or sports). In that case, the response is a discrete category among several possibilities.

$$ \text{Loss}_{\text{Bernoulli}}(\theta)= -\sum_{i=1}^N\Bigl[\,y^{(i)} \,\log\bigl(\pi^{(i)}\bigr)\;+\;(1 - y^{(i)})\,\log\bigl(1 - \pi^{(i)}\bigr)\Bigr] \\ \small{\text{where} \quad \pi^{(i)} = \sigma\bigl(\mathrm{NN}_\theta(x^{(i)})\bigr), \quad \sigma(z) = \frac{1}{1 + e^{-z}}}. $$
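As a minimal sketch in plain NumPy (assuming an array of raw model outputs, `logits`, stands in for $\mathrm{NN}_\theta(x^{(i)})$), the Bernoulli loss can be computed directly from the formula above:

```python
import numpy as np

def bernoulli_nll(y, logits):
    """Negative log-likelihood of binary labels y in {0, 1} given raw
    model outputs (logits), with pi = sigmoid(logits)."""
    pi = 1.0 / (1.0 + np.exp(-logits))       # sigma(NN_theta(x))
    eps = 1e-12                              # guard against log(0)
    return -np.sum(y * np.log(pi + eps) + (1 - y) * np.log(1 - pi + eps))

# Toy usage: three emails, one scalar logit per email
y = np.array([1.0, 0.0, 1.0])                # 1 = spam, 0 = not spam
logits = np.array([2.3, -1.1, 0.4])
print(bernoulli_nll(y, logits))
```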

A multinomial distribution generalizes Bernoulli to accommodate multiple possible outcomes, and we often use a softmax function to estimate the probability for each category.

$$ \small\text{Loss}_{\text{Multinomial}}(\theta)= -\sum_{i=1}^N \sum_{k=1}^K y^{(i)}_{k} \,\log\bigl(\pi^{(i)}_{k}\bigr) \quad \text{where} \quad \pi^{(i)}_{k} = \frac{\exp\bigl(\mathrm{NN}_\theta^k(x^{(i)})\bigr)}{\sum_{j=1}^K \exp\bigl(\mathrm{NN}_\theta^j(x^{(i)})\bigr)}. $$
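A corresponding sketch (again assuming a matrix of per-class scores `logits` plays the role of $\mathrm{NN}_\theta^k(x^{(i)})$, and labels are one-hot encoded) looks like this:

```python
import numpy as np

def multinomial_nll(Y, logits):
    """Negative log-likelihood for one-hot labels Y (N x K) given raw
    per-class scores logits (N x K); pi is the row-wise softmax."""
    z = logits - logits.max(axis=1, keepdims=True)       # stabilise exp
    pi = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.sum(Y * np.log(pi + 1e-12))

# Toy usage: two articles, three classes (politics / business / sports)
Y = np.array([[1, 0, 0], [0, 0, 1]], dtype=float)
logits = np.array([[2.0, 0.1, -1.0], [0.3, 0.2, 1.5]])
print(multinomial_nll(Y, logits))
```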

General Regression - Gaussian

When your target output is a continuous quantity, such as predicting the price of a house or the temperature on a given day, regression models are the go-to solution. Classical linear regression assumes the data comes from a Gaussian (Normal) distribution. In this setting, you posit that for each input, the output is normally distributed around some mean with constant variance. This assumption makes sense for many measurements that are influenced by a large number of small, random factors. Of course, if data exhibits different characteristics—like heteroscedasticity (variance not constant across inputs)—extensions such as weighted least squares or other specialized models become useful.

$$ \text{Loss}_{\text{Gaussian}}(\theta)= \sum_{i=1}^N\frac{\bigl(y^{(i)} - \mu^{(i)}\bigr)^2}{2\sigma^2}\;+\;\frac{N}{2}\log\bigl(2\pi\sigma^2\bigr) \quad \text{where} \quad \mu^{(i)}=\mathrm{NN}_\theta(x^{(i)}). $$
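A minimal sketch of this loss, assuming predicted means `mu` from the model and a shared, fixed standard deviation `sigma` (with `sigma` fixed, minimising it reduces to ordinary least squares):

```python
import numpy as np

def gaussian_nll(y, mu, sigma=1.0):
    """Negative log-likelihood of real-valued targets y given predicted
    means mu = NN_theta(x) and a shared, fixed standard deviation sigma."""
    N = y.shape[0]
    return (np.sum((y - mu) ** 2 / (2 * sigma ** 2))
            + 0.5 * N * np.log(2 * np.pi * sigma ** 2))

# Toy usage: e.g. house prices (in $1000s) and model predictions
y = np.array([200.0, 310.0, 150.0])
mu = np.array([195.0, 300.0, 160.0])
print(gaussian_nll(y, mu, sigma=10.0))
```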

Counting Regression - Poisson

Not all numerical targets are well-approximated by a Gaussian. In particular, count data (like the number of customer support calls per hour, or the number of website clicks per day) often appears in scenarios where counts cannot be negative, and the mean tends to be linked to the variance in characteristic ways. The Poisson distribution is often the first choice here: it models the probability of a certain number of events in a fixed interval of time (or space), assuming events occur independently at a certain average rate. The Poisson regression setup allows you to link the expected count $\lambda$ to input features. When real-world data is “over-dispersed” (variance exceeding the mean), a negative binomial distribution can step in and handle cases where Poisson assumptions are violated.

$$ \text{Loss}_{\text{Poisson}}(\theta)\;=\; \sum_{i=1}^{N} \Bigl[ \lambda^{(i)} - y^{(i)} \log\bigl(\lambda^{(i)}\bigr) \Bigr]\quad\text{where}\quad\lambda^{(i)} = \exp\bigl(\mathrm{NN}_\theta(x^{(i)})\bigr). $$
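A sketch of the Poisson loss, assuming the model outputs log-rates (so $\lambda = \exp(\cdot)$ is guaranteed positive); note that, like the formula above, it drops the constant $\log(y!)$ term since it does not depend on $\theta$:

```python
import numpy as np

def poisson_nll(y, log_rate):
    """Negative log-likelihood (up to the constant log(y!)) of counts y
    given log-rates log_rate = NN_theta(x), so lambda = exp(log_rate)."""
    lam = np.exp(log_rate)
    return np.sum(lam - y * log_rate)     # y * log(lambda) == y * log_rate

# Toy usage: hourly support-call counts and predicted log-rates
y = np.array([3.0, 0.0, 7.0])
log_rate = np.array([1.2, -0.5, 1.9])
print(poisson_nll(y, log_rate))
```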

(Optional) Normalized Regression - Beta

Next, consider scenarios where the target variable is a ratio or proportion bounded between zero and one. Examples include the fraction of a product that is defective in a batch, or the click-through rate of an advertisement. A beta distribution is often employed for these proportions because it is naturally supported on the interval $(0,1)$ and can flexibly capture different shapes of distributions. Beta regression models the parameters of the beta distribution (often denoted $\alpha$ and $\beta$) as functions of the input features, letting you account for how the distribution of a proportion changes under different conditions.

$$ \text{Loss}_{\text{Beta}}(\theta)= -\sum_{i=1}^N \Bigl[(\alpha^{(i)} - 1)\,\ln\bigl(y^{(i)}\bigr)+ (\beta^{(i)} - 1)\,\ln\bigl(1 - y^{(i)}\bigr)- \ln \mathrm{B}\bigl(\alpha^{(i)}, \beta^{(i)}\bigr)\Bigr] \\ \small{\text{where} \quad \mu^{(i)} = \sigma\bigl(\mathrm{NN}_\theta^\mu(x^{(i)})\bigr), \quad \phi^{(i)} = \exp\bigl(\mathrm{NN}_\theta^\phi(x^{(i)})\bigr), \quad \alpha^{(i)} = \mu^{(i)} \,\phi^{(i)}, \quad \beta^{(i)} = (1 - \mu^{(i)})\,\phi^{(i)}}. $$
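A sketch of the beta loss under the mean/precision parameterisation above, assuming two model heads produce `mu_logit` and `log_phi` for each example (the names are illustrative only):

```python
import numpy as np
from scipy.special import betaln          # log of the Beta function B(alpha, beta)

def beta_nll(y, mu_logit, log_phi):
    """Negative log-likelihood of proportions y in (0, 1), with
    mu = sigmoid(mu_logit), phi = exp(log_phi),
    alpha = mu * phi, beta = (1 - mu) * phi."""
    mu = 1.0 / (1.0 + np.exp(-mu_logit))
    phi = np.exp(log_phi)
    alpha, beta = mu * phi, (1.0 - mu) * phi
    return -np.sum((alpha - 1) * np.log(y) + (beta - 1) * np.log(1 - y)
                   - betaln(alpha, beta))

# Toy usage: observed click-through rates and the two heads' raw outputs
y = np.array([0.02, 0.10, 0.35])
print(beta_nll(y,
               mu_logit=np.array([-3.0, -2.0, -0.5]),
               log_phi=np.array([2.0, 2.0, 2.0])))
```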

Ranking - Distributions over Permutations

Ranking tasks bring a different flavor of learning. Instead of predicting a single numeric or categorical value, the goal is to produce an ordering over a set of items. A familiar example is a search engine that ranks webpages based on their relevance to a query. Ranking problems require distributions defined on permutations.

Models such as Bradley–Terry–Luce focus on pairwise comparisons (the probability that one item outranks another), while the Plackett–Luce distribution characterizes the probability of observing a particular full ranking of items from first place to last. Another family, the Mallows model, places a distribution over all permutations by measuring how different each possible ranking is from a central “consensus” ordering. These models quantify how likely one ranking is versus another, providing a principled framework for learning rankings from data.
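To make the Plackett–Luce case concrete, here is a minimal sketch (assuming each item receives a real-valued score from some scoring model) of the negative log-likelihood of one observed ranking:

```python
import numpy as np

def plackett_luce_nll(scores, ranking):
    """Negative log-likelihood of an observed ranking (item indices from
    first place to last) under the Plackett-Luce model, where each item
    has a real-valued score (e.g. the output of a scoring network)."""
    s = scores[ranking]                    # scores reordered by rank position
    nll = 0.0
    for j in range(len(s) - 1):            # last position is deterministic
        # probability that the item at position j wins among the remaining items
        nll -= s[j] - np.log(np.sum(np.exp(s[j:])))
    return nll

# Toy usage: four items, observed ranking "item 2 first, then 0, 3, 1"
scores = np.array([1.0, -0.5, 2.0, 0.3])
print(plackett_luce_nll(scores, ranking=np.array([2, 0, 3, 1])))
```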

Conclusion

In summary, the heart of the matter is to align the type of target you observe with a corresponding statistical distribution: Bernoulli or multinomial for classification, Gaussian for continuous regression, Poisson or negative binomial for count data, beta for proportions, and permutation distributions like Plackett–Luce for ranking. Even specialized tasks like object tracking or matching can often be recast as a simpler classification or regression problem once carefully formulated. This distribution-centric viewpoint provides a common language to analyze diverse machine learning tasks and informs how we fit and evaluate models in practice.