The loss function in logistic regression is given as follows:

$$ J(\mathbf{y}, \mathbf{\hat{y}})= -\sum_i\left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right] $$

Here, $\mathbf{y}=(y^{(1)}, \dots, y^{(N)})$ and $\mathbf{\hat{y}}=(\hat{y}^{(1)}, \dots, \hat{y}^{(N)})$. This loss function is the Negative Log Likelihood (NLL) loss; in PyTorch, it is known as BCE (Binary Cross-Entropy) loss. The loss on a binary prediction is visualized in the figure below:

(Figure: visualization of the BCE loss on a binary prediction.)
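As a quick sanity check, the sum in the formula above can be compared against PyTorch's built-in `BCELoss`. This is a minimal sketch; the labels and predicted probabilities below are made-up example values:

```python
import torch
import torch.nn as nn

# Made-up example: labels y and predicted probabilities y_hat for N = 4 samples.
y = torch.tensor([1.0, 0.0, 1.0, 0.0])
y_hat = torch.tensor([0.9, 0.2, 0.6, 0.4])

# Manual negative log likelihood, summed over samples (matches the formula above).
manual = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).sum()

# PyTorch's BCELoss; reduction='sum' matches the sum in the formula
# (the default reduction is 'mean', i.e. divided by N).
bce = nn.BCELoss(reduction="sum")(y_hat, y)

print(manual.item(), bce.item())  # the two values agree
```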

MLE for Discriminative Models

Let's have a look at the optimization objective function of maximum likelihood estimation for discriminative models:

$$ \arg\max_\theta{p(y|\theta, x)} $$

In logistic regression, the model output is denoted as $f_\theta(x)$, which equals $\sigma(wx + b)$, where $w$ and $b$ are the parameters collectively represented by $\theta$. This function, $f_\theta(x)$, maps the input data $x$ to an output $\hat{y}$, which is interpreted as the probability that the input $x$ belongs to the positive class ($y=1$).

Note: In logistic regression, the model's output $f_\theta(x)$ is just $p(y=1|\theta, x)$. The model never explicitly outputs $p(y=0|\theta, x)$. Instead, $p(y=0|\theta, x)$ is considered only for negative samples in the loss function.

Conversely, the probability that sample $x$ belongs to the negative class ($y = 0$) is $p(y=0|\theta,x) = 1 - f_\theta(x)$.
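A minimal sketch of this mapping (the values of $w$, $b$, and $x$ below are made-up placeholders; in practice they come from training and data):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 1-D example: theta = (w, b) and a single input x.
w, b = 2.0, -1.0
x = 0.8

y_hat = sigmoid(w * x + b)   # f_theta(x) = p(y=1 | theta, x)
p_neg = 1.0 - y_hat          # p(y=0 | theta, x)

print(y_hat, p_neg)
```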

Naive Analogy:

We can use the analogy of flipping a coin to understand this model. In this analogy, $y^{(i)} = 1$ indicates that the $i$-th observed outcome is heads, and $\hat{y}^{(i)} = f_\theta(x^{(i)})$ is the predicted probability that the $i$-th outcome is heads, given the parameters $\theta$ (you can view $\theta$ as the coin itself) and the condition $x^{(i)}$ (imagine that each condition exerts a different influence, pushing the coin toward heads or tails). This differs from a standard coin-toss experiment: if the coin's property $\theta$ were fixed and no additional condition $x^{(i)}$ were considered, the outcome probability $\hat{y}^{(i)}$ would be constant.

Suppose we have three observation labels [True, True, False], along with corresponding inputs $[x^{(1)}, x^{(2)}, x^{(3)}]$ and a given model $f_\theta$. Then, the joint probability of these three observational outputs can be expressed as:

$$ \begin{aligned}p(y^{(1)}=1, y^{(2)}=1, y^{(3)}=0|\theta, x^{(1)}, x^{(2)}, x^{(3)}) =&\ p(y^{(1)}=1|\theta, x^{(1)})\times\\&\ p(y^{(2)}=1|\theta, x^{(2)})\times\\&\ p(y^{(3)}=0|\theta, x^{(3)}) \\=&\ f_\theta(x^{(1)})\, f_\theta(x^{(2)})\, (1-f_\theta(x^{(3)})) \end{aligned} $$

In this formula, the first and second terms represent the probabilities of the model classifying samples $x^{(1)}$ and $x^{(2)}$ as True, and the third term represents the probability of the model classifying $x^{(3)}$ as False. The process mirrors a coin-toss experiment, except that in logistic regression the fixed heads probability is replaced by the input-dependent probability $f_\theta(x)$.
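For concreteness, here is a small numeric sketch of that product; the three model outputs are made-up placeholders for $f_\theta(x^{(1)})$, $f_\theta(x^{(2)})$, and $f_\theta(x^{(3)})$:

```python
# Hypothetical model outputs f_theta(x^(i)) for the three inputs.
f1, f2, f3 = 0.8, 0.7, 0.3

# Observed labels [True, True, False] -> y = [1, 1, 0].
# Joint probability of the three observations, assuming independence.
likelihood = f1 * f2 * (1 - f3)
print(likelihood)  # 0.8 * 0.7 * 0.7 = 0.392
```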

Negative Log Likelihood Loss for Binary Classification

With this basic formulation of logistic regression, we can already estimate the model parameters $\theta$. However, the current expression treats $y = 0$ and $y = 1$ with separate formulas. For such a binary classification problem, we usually use the Bernoulli distribution to combine the two cases into a single unified expression.

The Bernoulli distribution is a common distribution for handling binary random variables (i.e., outcomes of 0 or 1), which suits the logistic regression scenario well. According to the Bernoulli distribution, we can express the output probabilities of the logistic regression model as:

$$ p(y=1|\theta,x)=f_\theta(x) \\ p(y=0|\theta,x)=1-f_\theta(x) $$

Integrating these two probabilities, we get a unified formula:

$$ p(y|\theta,x)=(f_\theta(x))^y(1-f_\theta(x))^{1-y}=(\hat{y})^y(1-\hat{y})^{1-y} $$
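Taking the product of this unified probability over all $N$ independent samples and applying the negative logarithm recovers exactly the loss $J(\mathbf{y}, \mathbf{\hat{y}})$ introduced at the beginning of this section:

$$ -\log \prod_i p(y^{(i)}|\theta, x^{(i)}) = -\sum_i\left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right] = J(\mathbf{y}, \mathbf{\hat{y}}) $$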