The loss function in Logistic Regression is given as follows:
$$ J(\mathbf{y}, \mathbf{\hat{y}})= -\sum_i\left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right] $$
Here, $\mathbf{y}=(y^{(1)}, \ldots, y^{(N)})$ and $\mathbf{\hat{y}}=(\hat{y}^{(1)}, \ldots, \hat{y}^{(N)})$. This loss function is the Negative Log-Likelihood (NLL) loss; in PyTorch, it is implemented as BCE (Binary Cross-Entropy) loss. This loss on a binary prediction is visualized in the figure below:
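As a sanity check, the summed NLL above can be computed directly. The following is a minimal sketch in plain Python; the labels and predicted probabilities are made-up values for illustration.

```python
import math

def bce_loss(y, y_hat):
    """Negative log-likelihood (binary cross-entropy), summed over samples."""
    return -sum(yi * math.log(p) + (1 - yi) * math.log(1 - p)
                for yi, p in zip(y, y_hat))

# Illustrative labels and predicted probabilities
y = [1, 1, 0]
y_hat = [0.9, 0.8, 0.2]
loss = bce_loss(y, y_hat)
```

With `reduction='sum'`, PyTorch's `torch.nn.BCELoss` computes this same summed quantity (its default `reduction='mean'` divides by $N$).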
Let's have a look at the optimization objective function of maximum likelihood estimation for discriminative models:
$$ \arg\max_\theta{p(y|\theta, x)} $$
In logistic regression, the model output is denoted as $f_\theta(x)$, which equals $\sigma(wx + b)$, where $w$ and $b$ are the parameters represented by $\theta$. This function, $f_\theta(x)$, maps the input data $x$ to an output $\hat{y}$, which is interpreted as the probability that the input $x$ belongs to the positive class ($y=1$).
Note: In logistic regression, the model's output $f_\theta(x)$ is just $p(y=1|\theta, x)$. The model never explicitly outputs $p(y=0|\theta, x)$. Instead, $p(y=0|\theta, x)$ is considered only for negative samples in the loss function.
Conversely, the probability that sample $x$ is classified as the negative class $y = 0$ is $1 - f_\theta(x)$, i.e., $p(y=0|\theta,x)$.
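To make this concrete, here is a minimal sketch of the logistic-regression forward pass for a 1-D input; the weight, bias, and input values below are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def f_theta(x, w, b):
    """Model output: p(y=1 | theta, x), with theta = (w, b)."""
    return sigmoid(w * x + b)

# Hypothetical parameters and input
w, b = 2.0, -1.0
p_pos = f_theta(1.5, w, b)   # p(y=1 | theta, x)
p_neg = 1.0 - p_pos          # p(y=0 | theta, x)
```

The two probabilities always sum to one, which is exactly why the model only needs to output $p(y=1|\theta,x)$.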
Naive Analogy:
We can use the analogy of flipping a coin to understand this model. In this analogy, $y^{(i)} = 1$ indicates that the $i$-th observed outcome is heads, and $\hat{y}^{(i)} = f_\theta(x^{(i)})$ is the predicted probability that the $i$-th outcome is heads, given the parameters $\theta$ (you can think of $\theta$ as the coin itself) and the condition $x^{(i)}$ (imagine different conditions exert different influences pushing the coin toward heads or tails). This differs from a standard coin-toss experiment, where the coin's property $\theta$ is fixed and no additional condition $x^{(i)}$ is considered, so the outcome probability $\hat{y}^{(i)}$ would be constant.
Suppose we have three observed labels [True, True, False], along with corresponding inputs $[x^{(1)}, x^{(2)}, x^{(3)}]$ and a given model $f_\theta$. Then the joint probability of these three observed outputs can be expressed as:
$$ \begin{aligned}p(y^{(1)}=1, y^{(2)}=1, y^{(3)}=0|\theta, x^{(1)}, x^{(2)}, x^{(3)}) =&p(y^{(1)}=1|\theta, x^{(1)})\times\\&p(y^{(2)}=1|\theta, x^{(2)})\times\\&p(y^{(3)}=0|\theta, x^{(3)}) \\=& f_\theta(x^{(1)}) f_\theta(x^{(2)}) (1-f_\theta(x^{(3)})) \end{aligned} $$
In this formula, the first and second terms respectively represent the probabilities of the model classifying samples $x^{(1)}$ and $x^{(2)}$ as True, and the third term represents the probability of the model classifying $x^{(3)}$ as False. This process is similar to a coin-toss experiment, except that in logistic regression, the constant heads probability is replaced by the input-dependent probability $f_\theta(x)$.
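The joint probability above can be sketched numerically. The inputs and parameters below are hypothetical; the point is that each sample contributes either $f_\theta(x)$ or $1 - f_\theta(x)$ depending on its label.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 1-D inputs, labels [True, True, False], and parameters
xs = [0.5, 1.0, -0.5]
ys = [1, 1, 0]
w, b = 2.0, 0.0

preds = [sigmoid(w * x + b) for x in xs]   # f_theta(x) per sample

# Joint probability: product of per-sample Bernoulli terms
likelihood = 1.0
for y, p in zip(ys, preds):
    likelihood *= p if y == 1 else (1 - p)
```

In practice one maximizes the log of this product (equivalently, minimizes the NLL loss above), since the product of many probabilities quickly underflows.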
With this setup, we could already estimate the model parameters $\theta$. However, the expression above treats $y = 0$ and $y = 1$ with separate formulas. For such a binary classification problem, the Bernoulli distribution lets us merge both cases into a single unified expression.
The Bernoulli distribution is a common distribution for handling binary random variables (i.e., outcomes of 0 or 1), which suits the logistic regression scenario well. According to the Bernoulli distribution, we can express the output probabilities of the logistic regression model as:
$$ p(y=1|\theta,x)=f_\theta(x) \\ p(y=0|\theta,x)=1-f_\theta(x) $$
Integrating these two probabilities, we get a unified formula:
$$ p(y|\theta,x)=(f_\theta(x))^y(1-f_\theta(x))^{1-y}=(\hat{y})^y(1-\hat{y})^{1-y} $$
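A quick check that this unified Bernoulli form reduces to the two separate cases above; the probability $\hat{y} = 0.8$ is an arbitrary example value.

```python
def bernoulli_p(y, y_hat):
    """Unified Bernoulli form: p(y | theta, x) = y_hat^y * (1 - y_hat)^(1 - y)."""
    return (y_hat ** y) * ((1 - y_hat) ** (1 - y))

# For y = 1 the second factor is 1, leaving y_hat;
# for y = 0 the first factor is 1, leaving 1 - y_hat.
p1 = bernoulli_p(1, 0.8)   # -> 0.8
p0 = bernoulli_p(0, 0.8)   # -> approximately 0.2
```

Taking the negative log of this unified expression and summing over samples recovers exactly the NLL (BCE) loss stated at the top of this section.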