In logistic regression, we model the probability of a binary outcome $y \in \{0, 1\}$ given a feature vector $\mathbf{x}$. The model is expressed through the sigmoid (logistic) function $\sigma(z) = \frac{1}{1 + e^{-z}}$. Given parameters $\mathbf{w}, b$ and input $\mathbf{x}$, the predicted probability is $\hat{y} = \sigma(\mathbf{w}\cdot \mathbf{x} + b)$. We then define the likelihood of observing the actual labels under this model and, from it, derive the logistic loss. For a single training example $(\mathbf{x}, y)$, this loss is:

$$ J = L(\hat{y}, y) = -\left[ y \log \left( \hat{y} \right) + (1 - y) \log \left( 1 - \hat{y} \right) \right], \quad \text{where } \hat{y}=\sigma(\mathbf{w}\cdot\mathbf{x}+b). $$
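
To make the notation concrete, here is a minimal NumPy sketch of the prediction and the single-example loss (the function names `sigmoid` and `single_example_loss` are illustrative choices, and no numerical safeguards are included):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def single_example_loss(w, b, x, y):
    """Logistic loss -[y*log(y_hat) + (1 - y)*log(1 - y_hat)] for one example (x, y)."""
    y_hat = sigmoid(np.dot(w, x) + b)   # predicted probability that y = 1
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```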

When we average over all $N$ training examples, we obtain the total loss function $J(\mathbf{w}, b)$:

$$ J(\mathbf{w}, b) = \frac{1}{N}\sum_{i=1}^{N}\left(-\left[ y^{(i)} \log \left( \sigma(\mathbf{w} \cdot \mathbf{x}^{(i)}+b) \right) + (1 - y^{(i)}) \log \left( 1 - \sigma(\mathbf{w} \cdot \mathbf{x}^{(i)}+b) \right) \right]\right). $$

This total loss is what we aim to minimize to find the best parameters $\mathbf{w}, b$.
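
Continuing the sketch above, the averaged loss can be written in vectorized form (assuming the same `sigmoid` helper, an `(N, d)` feature matrix `X`, and an `(N,)` label vector `y`):

```python
def total_loss(w, b, X, y):
    """Mean logistic loss J(w, b) over all N examples.

    X is an (N, d) matrix of feature vectors, y an (N,) vector of 0/1 labels.
    """
    y_hat = sigmoid(X @ w + b)   # predicted probabilities, shape (N,)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```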

Compute the Gradients

We aim to compute the gradients of the loss function with respect to $\mathbf{w}$ and $b$. The loss function is:

$$ \small J(\mathbf{w}, b) = \frac{1}{N} \sum_{i=1}^{N} \left( - \left[ y^{(i)} \log \sigma(\mathbf{w} \cdot \mathbf{x}^{(i)} + b) + (1 - y^{(i)}) \log \big( 1 - \sigma(\mathbf{w} \cdot \mathbf{x}^{(i)} + b) \big) \right] \right). $$

The argument of $\sigma$ is $z^{(i)} = \mathbf{w} \cdot \mathbf{x}^{(i)} + b$. The prediction is $\hat{y}^{(i)} = \sigma(z^{(i)})$; the derivative of $\sigma(z)$ is: $\frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z)).$
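
The derivative identity is easy to verify numerically; the small check below (reusing the `sigmoid` sketch from earlier, at an arbitrarily chosen point $z = 0.3$) compares a central finite difference against $\sigma(z)(1 - \sigma(z))$:

```python
# Finite-difference check of d(sigma)/dz = sigma(z) * (1 - sigma(z))
z, eps = 0.3, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # central difference
analytic = sigmoid(z) * (1 - sigmoid(z))
print(numeric, analytic)   # both roughly 0.2445; they agree to several decimal places
```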

Gradient for weights:

  1. The derivative of the loss $J^{(i)}$ for a single sample with respect to $\mathbf{w}$ is $\frac{\partial J^{(i)}}{\partial \mathbf{w}} = \frac{\partial J^{(i)}}{\partial z^{(i)}} \cdot \frac{\partial z^{(i)}}{\partial \mathbf{w}}$, where $\frac{\partial z^{(i)}}{\partial \mathbf{w}} = \mathbf{x}^{(i)}$.
  2. The derivative of the loss with respect to $z^{(i)}$ follows from the chain rule: $\frac{\partial J^{(i)}}{\partial \hat{y}^{(i)}} = -\frac{y^{(i)}}{\hat{y}^{(i)}} + \frac{1 - y^{(i)}}{1 - \hat{y}^{(i)}}$ and $\frac{\partial \hat{y}^{(i)}}{\partial z^{(i)}} = \hat{y}^{(i)}\left(1 - \hat{y}^{(i)}\right)$, and their product simplifies to $\frac{\partial J^{(i)}}{\partial z^{(i)}} = \hat{y}^{(i)} - y^{(i)}$.
  3. Combining these, the gradient for $\mathbf{w}$ is $\frac{\partial J}{\partial \mathbf{w}} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}^{(i)} - y^{(i)}) \mathbf{x}^{(i)}.$

Gradient for bias:

  1. The derivative of the loss $J^{(i)}$ with respect to $b$ is $\frac{\partial J^{(i)}}{\partial b} = \frac{\partial J^{(i)}}{\partial z^{(i)}} \cdot \frac{\partial z^{(i)}}{\partial b},$ where $\frac{\partial z^{(i)}}{\partial b} = 1$.
  2. As before, $\frac{\partial J^{(i)}}{\partial z^{(i)}} = \hat{y}^{(i)} - y^{(i)}$.
  3. Thus, the gradient for $b$ is $\frac{\partial J}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}^{(i)} - y^{(i)})$ .
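
Both gradient formulas translate directly into a vectorized sketch (again assuming the NumPy `sigmoid` helper and an `(N, d)` feature matrix `X`):

```python
def gradients(w, b, X, y):
    """Gradients of the mean logistic loss with respect to w and b.

    Implements dJ/dw = (1/N) * sum_i (y_hat_i - y_i) * x_i
          and  dJ/db = (1/N) * sum_i (y_hat_i - y_i).
    """
    N = X.shape[0]
    y_hat = sigmoid(X @ w + b)   # predictions, shape (N,)
    error = y_hat - y            # y_hat - y, shape (N,)
    grad_w = X.T @ error / N     # shape (d,)
    grad_b = np.mean(error)      # scalar
    return grad_w, grad_b
```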

Perform Gradient Descent

With the gradients of the loss function $J(\mathbf{w}, b)$ computed, we can now perform gradient descent to iteratively update the parameters $\mathbf{w}$ and $b$.

The gradient descent update rule for each parameter is:

$$ \mathbf{w} \leftarrow \mathbf{w} - \eta\frac{\partial J}{\partial \mathbf{w}}, \\b \leftarrow b - \eta\frac{\partial J}{\partial b}, $$

where $\eta > 0$ is the learning rate, a hyperparameter chosen in advance that controls how aggressively the parameters are updated.
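
Putting the pieces together, a minimal batch gradient-descent loop might look like the sketch below; the hyperparameters `eta` and `n_iters`, the zero initialization, and the toy data are arbitrary illustrative choices, and the `gradients` and `total_loss` helpers are reused from the earlier sketches:

```python
def fit(X, y, eta=0.1, n_iters=1000):
    """Batch gradient descent on the mean logistic loss, starting from zero parameters."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        grad_w, grad_b = gradients(w, b, X, y)
        w -= eta * grad_w   # w <- w - eta * dJ/dw
        b -= eta * grad_b   # b <- b - eta * dJ/db
    return w, b

# Toy usage on two synthetic clusters (purely illustrative data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
w, b = fit(X, y)
print(total_loss(w, b, X, y))   # should be much smaller than the loss at w = 0, b = 0
```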