In logistic regression, we model the probability of a binary outcome $y \in \{0, 1\}$ given a feature vector $\mathbf{x}$. The model is expressed through the sigmoid (logistic) function, $\sigma(z) = \frac{1}{1 + e^{-z}}$ . Given parameters $\mathbf{w}, b$ and input $\mathbf{x}$, the predicted probability is $\hat{y} = \sigma(\mathbf{w}\cdot \mathbf{x} + b)$. We then define the likelihood of observing the actual labels under this model, and from it derive the logistic loss. This loss for a single training example $(\mathbf{x}, y)$ is:
$$ J=L(\hat{y}, {y})=-\left[ y \log \left( \hat{y} \right) + (1 - y) \log \left( 1 - \hat{y} \right) \right]\\ \hat{y}=\sigma(\mathbf{w}\cdot\mathbf{x}+b). $$
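To make this concrete, here is a minimal NumPy sketch of the sigmoid and the per-example loss. The helper names (`sigmoid`, `logistic_loss`) and the clipping constant `eps` are illustrative choices for this sketch, not part of the derivation above.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(y_hat, y, eps=1e-12):
    """Per-example loss -[y log(y_hat) + (1 - y) log(1 - y_hat)].

    y_hat is clipped away from 0 and 1 to avoid log(0).
    """
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Predicted probability for a single example x with parameters w, b:
# y_hat = sigmoid(np.dot(w, x) + b)
```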
When we average over all $N$ training examples, we obtain the total loss function $J(\mathbf{w}, b)$:
$$ J(\mathbf{w}, b) = \frac{1}{N}\sum_{i=1}^{N}\left(-\left[ y^{(i)} \log \left( \sigma(\mathbf{w} \cdot \mathbf{x}^{(i)}+b) \right) + (1 - y^{(i)}) \log \left( 1 - \sigma(\mathbf{w} \cdot \mathbf{x}^{(i)}+b) \right) \right]\right). $$
This total loss is what we minimize to find the best parameters: $\argmin_{\mathbf{w}, b} J(\mathbf{w}, b)$.
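Continuing the sketch above (and reusing its `sigmoid` helper), the total loss can be computed in a vectorized way; the name `total_loss` and the shapes assumed for `X` and `y` are illustrative.

```python
def total_loss(w, b, X, y, eps=1e-12):
    """Average logistic loss J(w, b) over N examples.

    X: (N, d) feature matrix; y: (N,) array of 0/1 labels.
    """
    y_hat = sigmoid(X @ w + b)                 # predictions for all N rows
    y_hat = np.clip(y_hat, eps, 1.0 - eps)     # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
```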
We aim to compute the gradients of the loss function with respect to $\mathbf{w}$ and $b$. The loss function is:
$$ \small J(\mathbf{w}, b) = \frac{1}{N} \sum_{i=1}^{N} \left( - \left[ y^{(i)} \log \sigma(\mathbf{w} \cdot \mathbf{x}^{(i)} + b) + (1 - y^{(i)}) \log \big( 1 - \sigma(\mathbf{w} \cdot \mathbf{x}^{(i)} + b) \big) \right] \right), $$
where $z^{(i)} = \mathbf{w} \cdot \mathbf{x}^{(i)} + b$ is the argument of $\sigma$ and $\hat{y}^{(i)} = \sigma(z^{(i)})$ is the prediction. A key fact we will use is the derivative of the sigmoid: $\frac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z)).$
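As a quick sanity check of this identity (not part of the derivation), one can compare the analytic derivative against a central finite difference, reusing the `sigmoid` helper from the earlier sketch; the test point and tolerance below are arbitrary.

```python
# Central finite difference vs. the analytic derivative sigma(z) * (1 - sigma(z)).
z, h = 0.7, 1e-6                                    # arbitrary test point and step
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid(z) * (1 - sigmoid(z))
assert abs(numeric - analytic) < 1e-8
```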
Gradient for weights: combining $\frac{\partial L^{(i)}}{\partial \hat{y}^{(i)}}$ with $\frac{d\sigma}{dz}$ gives $\frac{\partial L^{(i)}}{\partial z^{(i)}} = \hat{y}^{(i)} - y^{(i)}$, and since $\frac{\partial z^{(i)}}{\partial \mathbf{w}} = \mathbf{x}^{(i)}$,
$$ \frac{\partial J}{\partial \mathbf{w}} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}^{(i)} - y^{(i)} \right) \mathbf{x}^{(i)}. $$
Gradient for bias: since $\frac{\partial z^{(i)}}{\partial b} = 1$,
$$ \frac{\partial J}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}^{(i)} - y^{(i)} \right). $$
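Translating these two formulas into code, a possible vectorized implementation (again reusing `sigmoid`; the function name `gradients` is an illustrative choice) is:

```python
def gradients(w, b, X, y):
    """Gradients of J(w, b) with respect to w and b, averaged over N examples."""
    N = X.shape[0]
    y_hat = sigmoid(X @ w + b)       # (N,) predicted probabilities
    error = y_hat - y                # (N,) residuals  y_hat - y
    grad_w = X.T @ error / N         # (d,)   = (1/N) * sum_i (y_hat_i - y_i) x_i
    grad_b = np.mean(error)          # scalar = (1/N) * sum_i (y_hat_i - y_i)
    return grad_w, grad_b
```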
With the gradients of the loss function $J(\mathbf{w}, b)$ computed, we can now perform gradient descent to iteratively update the parameters $\mathbf{w}$ and $b$.
The gradient descent update rule for each parameter is:
$$ \mathbf{w} \leftarrow \mathbf{w} - \eta\frac{\partial J}{\partial \mathbf{w}}, \\b \leftarrow b - \eta\frac{\partial J}{\partial b}, $$
where $\eta > 0$ is the learning rate, a hyperparameter chosen in advance that controls how aggressive the updates are.
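Putting the pieces together, a minimal batch gradient-descent loop might look like the following sketch; the function name `fit`, the zero initialization, and the default values of `eta` and `n_steps` are assumptions for illustration, not prescriptions.

```python
def fit(X, y, eta=0.1, n_steps=1000):
    """Batch gradient descent on J(w, b) with a fixed learning rate eta."""
    w = np.zeros(X.shape[1])         # start from w = 0, b = 0
    b = 0.0
    for _ in range(n_steps):
        grad_w, grad_b = gradients(w, b, X, y)
        w = w - eta * grad_w         # w <- w - eta * dJ/dw
        b = b - eta * grad_b         # b <- b - eta * dJ/db
    return w, b
```

In practice, a smaller $\eta$ gives slower but more stable convergence, while a larger one risks overshooting the minimum.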