The perceptron, also known as the single-layer perceptron, is a machine learning model that can be traced back to the 1950s and 1960s. Like logistic regression, it is a simple algorithm for binary classification. The reason we discuss perceptrons is that their core structural components, namely linear transformations and activation functions, are the fundamental building blocks of modern neural networks.
The original perceptron was developed for classification tasks. A perceptron with 3 inputs and 3 weights is visualized below.
Inputs $x_1,x_2,x_3$: These represent the features or input variables. In the Iris dataset, these could be the petal and sepal lengths and widths.
Weights $w_1, w_2, w_3$: These are the parameters that the perceptron learns during the training process. Each feature is multiplied by its corresponding weight.
Bias $b$: This is an additional learned parameter that shifts the decision boundary away from the origin.
Sum $Σ$: The perceptron calculates a weighted sum of the inputs and the bias:
$$ z = w_1 \cdot x_1 + w_2 \cdot x_2 + w_3 \cdot x_3 + b $$
Activation (Step Function $\sigma$): This is a threshold function that decides whether the perceptron outputs a "1" or a "0" based on the value of $z$: if $z$ exceeds the threshold (zero in the formulation below), the output is 1; otherwise, it is 0.
Output $\hat{y}$: This is the final prediction. In the case of the Iris dataset, this could represent whether the flower belongs to one of two categories (e.g., Setosa or Versicolor).
The full equation corresponding to the diagram above is:
$$ \hat{y} = \sigma(\mathbf{w} \cdot \mathbf{x}+b)= \sigma\left(\begin{bmatrix} w_1 & w_2 & w_3 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}+ b\right)=\sigma\left(\sum_{i=1}^3{w_ix_i} + b\right) $$
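As a quick illustration, the weighted sum in this equation can be computed in a few lines of NumPy. This is only a sketch; the weight, bias, and feature values below are arbitrary placeholders, not learned parameters.

```python
import numpy as np

# Linear part of the perceptron, z = w · x + b, for the 3-input diagram above.
w = np.array([0.4, -0.2, 0.7])   # w_1, w_2, w_3 (made-up values)
x = np.array([5.1, 3.5, 1.4])    # e.g. three Iris features
b = -1.0                         # bias (made-up value)

z = np.dot(w, x) + b             # w_1*x_1 + w_2*x_2 + w_3*x_3 + b
print(z)
```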
The activation function $\sigma$ used in the perceptron is a step function, defined as follows:
$$ \hat{y}=\sigma(\mathbf{w} \cdot \mathbf{x}+b)=\text{step}(\mathbf{w} \cdot \mathbf{x}+b) = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x}+b > 0 \\ 0 & \text{if } \mathbf{w} \cdot \mathbf{x}+b < 0 \end{cases} $$
In essence, the step activation function outputs 1 whenever $\mathbf{w} \cdot \mathbf{x}+b$ is greater than zero, and 0 whenever it is less than zero. When $\mathbf{w} \cdot \mathbf{x}+b=0$, the output is usually defined manually (often as 0.5); however, this boundary case rarely occurs in practice during numerical optimization. The output reflects the binary classification of the input data $\mathbf{x}$.
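The full forward pass can be sketched as follows. The parameter values are again arbitrary placeholders, and the handling of the $z=0$ boundary follows the convention stated above.

```python
import numpy as np

def step(z):
    """Step activation: 1 if z > 0, 0 if z < 0, 0.5 at the boundary z == 0."""
    if z > 0:
        return 1
    if z < 0:
        return 0
    return 0.5  # rarely-hit boundary case discussed above

def perceptron_predict(w, x, b):
    """Forward pass of a single perceptron: step(w · x + b)."""
    return step(np.dot(w, x) + b)

# Example with arbitrary (untrained) parameters.
w = np.array([0.4, -0.2, 0.7])
b = -1.0
x = np.array([5.1, 3.5, 1.4])
print(perceptron_predict(w, x, b))
```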
Optimization: The perceptron algorithm adjusts its weights $\mathbf{w}$ and bias $b$ iteratively to minimize classification errors (the discrepancy between $y$ and $\hat{y}$) on the training dataset. For each input $\mathbf{x}$, the perceptron predicts $\hat{y} = \sigma(\mathbf{w} \cdot \mathbf{x} + b)$, which is either 0 or 1. If $\hat{y}$ matches the true label $y$, the weights and bias remain unchanged; otherwise, the input vector $\mathbf{x}$, scaled by a learning rate, is added to the weights when the example is misclassified as 0 (a false negative) and subtracted when it is misclassified as 1 (a false positive), with the bias adjusted in the same direction.
Despite its intuitive appeal, this rule-based update does not guarantee convergence unless the data are linearly separable, so training typically runs for a fixed number of epochs; this limitation is easier to appreciate after studying more advanced optimization algorithms such as sub-gradient methods.
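As a concrete reference, the update rule described above can be sketched as follows. The function name, learning rate, and toy dataset are illustrative assumptions rather than part of the original perceptron specification.

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=50):
    """Minimal sketch of the perceptron learning rule.

    X: (n_samples, n_features) feature matrix; y: array of 0/1 labels.
    Runs for a fixed number of epochs, since convergence is only
    guaranteed when the data are linearly separable.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = 1 if np.dot(w, xi) + b > 0 else 0
            error = yi - y_hat          # 0 if correct, +1 or -1 if misclassified
            w += lr * error * xi        # add or subtract the scaled input
            b += lr * error             # bias adjusted in the same direction
    return w, b

# Toy usage with a hypothetical, linearly separable dataset.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])
w, b = train_perceptron(X, y)
print(w, b)
```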
Logistic regression and the original perceptron are similar but differ in two key aspects. First, the perceptron uses a step function as its activation, while logistic regression employs the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$, which acts as a soft activation. Second, the perceptron relies on a rule-based binary update mechanism for training, whereas logistic regression minimizes the negative log-likelihood loss $-\left(y\log(\hat{y}) + (1 - y)\log(1 - \hat{y})\right)$ using gradient-based optimization methods.
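To make the contrast concrete, here is a minimal sketch of one logistic-regression update under the loss above. The function names and values are illustrative; the update uses the standard fact that the derivative of the negative log-likelihood with respect to $z = \mathbf{w} \cdot \mathbf{x} + b$ is $\hat{y} - y$.

```python
import numpy as np

def sigmoid(z):
    """Soft activation used by logistic regression."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_step(w, b, x, y, lr=0.1):
    """One gradient step on the negative log-likelihood for a single example.

    The update has the same shape as the perceptron rule but uses a
    continuous error (y_hat - y) instead of a binary one.
    """
    y_hat = sigmoid(np.dot(w, x) + b)   # soft prediction in (0, 1)
    grad = y_hat - y                    # gradient of the NLL w.r.t. z
    w_new = w - lr * grad * x
    b_new = b - lr * grad
    return w_new, b_new

# Illustrative single update with made-up values.
w, b = np.zeros(3), 0.0
x, y = np.array([5.1, 3.5, 1.4]), 1
w, b = logistic_regression_step(w, b, x, y)
print(w, b)
```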