The perceptron, also known as the single-layer perceptron, is a type of machine learning model that can be traced back to the 1950s and 1960s. Like logistic regression, it is a simple algorithm for binary classification. The reason we discuss perceptrons is that their core structural components, namely linear transformations and activation functions, are the fundamental building blocks of modern neural networks.

Vintage Perceptron

The original perceptron was developed for classification tasks. A perceptron with 3 inputs and 3 weights is visualized below.

Figure: a perceptron with 3 inputs and 3 weights.

The complete equation for the diagram above is as follows:

$$ \hat{y} = \sigma(\mathbf{w} \cdot \mathbf{x}+b)= \sigma\left(\begin{bmatrix} w_1 & w_2 & w_3 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}+ b\right)=\sigma\left(\sum_{i=1}^3{w_i x_i} + b\right) $$

The activation function $\sigma$ used in the perceptron is a step function, defined as follows:

$$ \hat{y}=\sigma(\mathbf{w} \cdot \mathbf{x}+b)=\text{step}(\mathbf{w} \cdot \mathbf{x}+b) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x}+b > 0 \\ 0 & \text{if } \mathbf{w} \cdot \mathbf{x}+b < 0 \end{cases} $$

In essence, the step activation function outputs 1 once $\mathbf{w} \cdot \mathbf{x}+b$ is greater than zero, and 0 when it is less than zero. When $\mathbf{w} \cdot \mathbf{x}+b=0$, the output is usually defined manually (for example, as 0.5); however, the case $\mathbf{w} \cdot \mathbf{x}+b=0$ is quite rare in practice during numerical optimization. This output reflects the binary classification of the input data $\mathbf{x}$.
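
As a concrete sketch, the forward pass above can be written in a few lines of NumPy; the weights, bias, and input values here are made up purely for illustration:

```python
import numpy as np

def step(z):
    """Step activation: 1 if the pre-activation is positive, else 0."""
    return np.where(z > 0, 1, 0)

def perceptron_forward(x, w, b):
    """Compute y_hat = step(w . x + b) for a single input vector."""
    return step(np.dot(w, x) + b)

# Hypothetical 3-input example matching the diagram above.
w = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 2.0, 0.5])
print(perceptron_forward(x, w, b))  # prints 0 for these particular values
```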

Optimization: The perceptron algorithm adjusts its weights $\mathbf{w}$ and bias $b$ iteratively to minimize classification errors (the discrepancy between $y$ and $\hat{y}$) on the training dataset. For each input $\mathbf{x}$, the perceptron predicts $\hat{y} = \sigma(\mathbf{w} \cdot \mathbf{x} + b)$, which outputs either 0 or 1. If $\hat{y}$ matches the actual label $y$, the weights and bias remain unchanged; otherwise, the weights are updated by adding or subtracting the input vector $\mathbf{x}$, scaled by a learning rate, depending on the sign of the misclassification, and the bias is adjusted similarly.
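
A minimal sketch of this update rule is shown below; the toy data, learning rate, and epoch count are illustrative assumptions rather than part of the original algorithm description:

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=10):
    """Perceptron learning rule: update w and b only on misclassified points."""
    n_features = X.shape[1]
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) + b > 0 else 0
            error = y_i - y_hat          # +1, -1, or 0
            if error != 0:               # misclassified: move the decision boundary
                w += lr * error * x_i
                b += lr * error
    return w, b

# Toy linearly separable data (illustrative only).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])
w, b = train_perceptron(X, y)
```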

Despite its intuitive appeal, these rule-based updates do not guarantee convergence unless the data are linearly separable, so training often simply runs for a fixed number of epochs. This limitation is easier to appreciate after studying more advanced optimization algorithms such as sub-gradient methods.

Logistic Regression as an Improved Perceptron

Logistic regression and the original perceptron share similarities but differ in two key aspects. First, the perceptron uses a step function as its activation function, while logistic regression employs the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$, which serves as a soft activation function. Second, the perceptron relies on a rule-based binary update mechanism for training, whereas logistic regression minimizes the negative log-likelihood loss $-\left(y\log(\hat{y}) + (1 - y)\log(1 - \hat{y})\right)$ using gradient-based optimization methods.
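
To make the contrast concrete, here is a minimal gradient-descent sketch of logistic regression; the synthetic data and hyperparameters are assumptions chosen for illustration:

```python
import numpy as np

def sigmoid(z):
    """Soft activation used by logistic regression."""
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=100):
    """Minimize the average negative log-likelihood with batch gradient descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        y_hat = sigmoid(X @ w + b)          # predicted probabilities in (0, 1)
        grad_z = (y_hat - y) / n_samples    # gradient of the NLL w.r.t. w.x + b
        w -= lr * (X.T @ grad_z)
        b -= lr * grad_z.sum()
    return w, b

# Same toy data as the perceptron example (illustrative only).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])
w, b = train_logistic_regression(X, y)
```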
