Let’s take a look at the geometric interpretation of logistic regression and discuss how we can extend it from its original linear decision boundary to support nonlinear decision boundaries.
Since we cannot visualize high-dimensional data, we only consider two-dimensional input data $\mathbf{x}=[x_0, x_1]$ (encoded as position) and a binary classification label (encoded as color). Logistic regression first applies a linear transformation to $\mathbf{x}$, producing a scalar value $z$; this can be written as:
$$ z = \mathbf{w}\cdot\mathbf{x} + b $$
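For concreteness, here is a minimal NumPy sketch of this linear transformation. The parameter values `w`, `b` and the sample matrix `X` below are made-up illustrations, not values taken from the figure.

```python
import numpy as np

# Hypothetical trained parameters of a logistic regression model.
w = np.array([1.5, -2.0])   # weight vector
b = 0.5                     # bias term

# A small batch of 2D input samples x = [x_0, x_1].
X = np.array([[0.2, 0.1],
              [-1.0, 0.8],
              [1.3, -0.4]])

# Linear transformation: one scalar z per sample.
z = X @ w + b
print(z)   # [0.6, -2.6, 3.25]
```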
To understand the meaning of $z$ in the context of logistic regression, we visualize all data samples $\mathbf{x}$ and color them according to their labels $y \in \{0, 1\}$. In particular, positive samples ($y=1$) are colored yellow while negative samples ($y=0$) are colored purple, as shown below.
The variable $z$ itself is not directly visible in the figure, but $z/\lVert\mathbf{w}\rVert$ is the signed perpendicular distance between each sample $\mathbf{x}$ and the decision boundary $\mathbf{w}\cdot\mathbf{x}+b=0$.
Once the model is fully trained, $\mathbf{w}$ is a constant, so $z$ is simply this distance scaled by $\lVert\mathbf{w}\rVert$; the scaling does not change the relative ordering of the samples (order is preserved).
The distance can be either positive or negative: a positive distance means the sample $\mathbf{x}$ is predicted as positive, while a negative distance means it is predicted as negative.
In summary, to understand $z$ we only need to care about two things: the absolute value of $z$ and its sign. The absolute value of $z$ reflects the distance from the corresponding sample to the decision boundary, and the sign of $z$ tells us which side of the boundary the sample lies on. The multi-class case is an extension of this picture: each class has its own decision plane, and the samples of that class should lie on the corresponding side of that plane.
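A minimal sketch of this interpretation, reusing the illustrative `w`, `b`, and `X` from the snippet above: the signed perpendicular distance is $z/\lVert\mathbf{w}\rVert$, and the predicted side follows the sign of $z$.

```python
import numpy as np

w = np.array([1.5, -2.0])
b = 0.5
X = np.array([[0.2, 0.1],
              [-1.0, 0.8],
              [1.3, -0.4]])

z = X @ w + b
distance = z / np.linalg.norm(w)   # signed perpendicular distance to w.x + b = 0
side = np.sign(z)                  # +1: positive side, -1: negative side
print(distance, side)
```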
In this process, each 2D point is mapped onto a position along the z-axis. The dots (purple) and crosses (yellow) in the figure illustrate this mapping. Next, these z-values are passed through the sigmoid function $\sigma(z)=\frac{1}{1+e^{-z}}$, assigning each sample a probability between 0 and 1. A z-value above zero yields a probability greater than 0.5, suggesting the point belongs to the positive class. Conversely, a z-value below zero yields a probability under 0.5, indicating the negative class.
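The sigmoid step can be sketched in the same way; the `z` values below are the illustrative ones computed above, not data from the figure.

```python
import numpy as np

def sigmoid(z):
    """Map a raw score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([0.6, -2.6, 3.25])   # illustrative z-values from above
p = sigmoid(z)                    # probability of the positive class
pred = (p > 0.5).astype(int)      # equivalent to checking z > 0
print(p, pred)
```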
The critical decision point in z-space, where the class probability shifts, is at $z = 0$. This corresponds to the decision boundary $\mathbf{w}\cdot\mathbf{x} + b = 0$ in the original feature space. Here, it manifests as a decision line (the dashed line in the diagram), dividing the space into regions of positive and negative classifications.
In logistic regression, the decision boundary is a straight line in 2D space. However, for nonlinear classification tasks such as the one illustrated below, a linear decision boundary (a straight line) is insufficient for separating the purple and yellow samples.
In this section, we will explore how we can stack activation functions $\sigma$ with linear transformations $\mathbf{w}\cdot\mathbf{x}+b$ to create a nonlinear model that can effectively separate the yellow and purple samples.
The concept is similar to folding a piece of paper. A workable folding can be visualized as shown below.
This folding operation takes place along the line $x_1=0$ (the y-axis), effectively transferring all data samples from the left half of the space onto the right half. The fold can be described mathematically as an absolute-value function applied to $x_1$:
$$ \hat{x}_1=\text{abs}(x_1) $$
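To see why this fold helps, here is a small self-contained sketch. The synthetic data and the use of scikit-learn's `LogisticRegression` are my own illustration, not the data from the figure: the label depends on $|x_1|$, so no single straight line in the original space separates the classes, but after replacing $x_1$ with $\hat{x}_1=\text{abs}(x_1)$ a plain linear boundary works.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic, illustrative data: the label depends on |x_1|, so no single
# straight line in the original (x_0, x_1) space can separate the classes.
X = rng.uniform(-2.0, 2.0, size=(400, 2))        # columns: x_0, x_1
y = (np.abs(X[:, 1]) > 1.0).astype(int)

# Linear boundary in the original space: poor accuracy.
linear_acc = LogisticRegression().fit(X, y).score(X, y)

# "Fold" the space along x_1 = 0, then fit the same linear model.
X_folded = X.copy()
X_folded[:, 1] = np.abs(X_folded[:, 1])          # x1_hat = abs(x_1)
folded_acc = LogisticRegression().fit(X_folded, y).score(X_folded, y)

print(f"accuracy with linear boundary: {linear_acc:.2f}")
print(f"accuracy after folding:        {folded_acc:.2f}")
```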