Activation functions are a crucial component of neural networks, serving as gatekeepers that decide whether a neuron should be activated based on the input it receives. They introduce non-linearity into the network, enabling it to learn and model complex patterns in data. Without activation functions, a stack of layers collapses into a single linear transformation, no more expressive than plain linear regression and incapable of solving problems like image recognition, language translation, or speech synthesis. An activation function is typically applied to each neuron's weighted sum of inputs, and its choice plays a critical role in the network's ability to converge and perform well. By transforming raw signals into meaningful outputs, activation functions are the driving force behind the deep learning capabilities that power modern AI applications.

A Bit About the Name

Historically, the initial concept of a neural network was inspired by the human brain's structure and function, aiming to replicate its ability to learn and make decisions. This inspiration is evident in the design of activation functions within artificial neural networks. In the human brain, neurons receive signals through their dendrites and, once a certain threshold of excitation is reached, they 'fire' a signal along their axon. This behavior is known as the "all-or-nothing" law: a neuron either sends a full signal or none at all.

Activation functions play the analogous role in artificial networks: they determine whether, and to what extent, a signal should be passed forward. They act as a gate, similar to the firing mechanism of biological neurons, deciding when a neuron should be activated based on the weighted sum of its inputs.


Recap: $\text{abs}$ Activation

One common example we've studied before is the folding process, which can be understood as an absolute value function: it maps every negative input to its positive counterpart. It is closely related to the ReLU (Rectified Linear Unit) activation function, which outputs zero for negative inputs and passes positive values through unchanged. In fact, the folding operation can be decomposed into a sum of ReLUs, making it a foundational building block in neural networks.


$$ h_0 = \text{abs}(x_0) \rightarrow h_0 = \text{relu}(x_0)+\text{relu}(-x_0) $$

By breaking down the absolute value function into simpler ReLU components, we gain a better understanding of how non-linearity is applied to enhance model learning.
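
To see this identity in action, here is a tiny NumPy sketch (the helper names `relu` and `abs_via_relu` are just illustrative) that checks $\text{relu}(x) + \text{relu}(-x)$ against `np.abs` on a few sample values:

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: keeps positive values, zeroes out negatives."""
    return np.maximum(0.0, x)

def abs_via_relu(x):
    """Absolute value written as a sum of two ReLUs: |x| = relu(x) + relu(-x)."""
    return relu(x) + relu(-x)

x = np.array([-3.0, -0.5, 0.0, 2.0, 7.5])
print(abs_via_relu(x))  # [3.  0.5 0.  2.  7.5]
print(np.abs(x))        # identical values
```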

Gen 0 - Basic Functions

The earliest neural networks drew inspiration directly from biology, using step functions to model the behavior of neurons. In biological neurons, signals are transmitted only when they surpass a certain threshold, much like an "all or nothing" response. This concept was foundational in early artificial neural networks, where a step function was used to decide whether a neuron would fire or remain inactive.

In this early approach, the step function was mathematically simple: if the input was at or above a certain threshold, the neuron activated (outputting a 1), and if it was below, the neuron remained inactive (outputting a 0). This binary output captured the idea that neurons either respond to input or don't, much like neurons firing in the brain.

$$ f(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases} $$

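A minimal NumPy sketch of this step activation, with the threshold fixed at 0 as in the formula above, could look like this:

```python
import numpy as np

def step(x):
    """Binary step activation: fire (1) once the input reaches the threshold 0, stay silent (0) otherwise."""
    return np.where(x >= 0, 1.0, 0.0)

x = np.array([-2.0, -0.1, 0.0, 0.1, 3.0])
print(step(x))  # [0. 0. 1. 1. 1.]
```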

While this method was groundbreaking at the time, it had limitations. The step function is discontinuous at the threshold and its derivative is zero everywhere else, so it provides no useful gradient for modern optimization techniques like gradient descent and backpropagation. As neural networks grew more complex, there was a need for activation functions that allowed gradual changes and better learning mechanisms.

This challenge led to the development of more advanced and continuous activation functions like Sigmoid, but the step function holds an important place as one of the earliest building blocks in neural network history.


The sigmoid function was one of the first major improvements over the step function, designed to address its limitations. Its smooth, S-shaped curve transitions gradually from 0 to 1, making it both continuous and differentiable across all possible inputs.

This was a significant advancement, as it allowed neural networks to adjust their weights using methods like gradient descent, making learning more effective. Unlike the binary output of the step function, the sigmoid provided a well-defined gradient everywhere, which backpropagation could use to attribute error and update the weights.

$$ f(x) = \frac{1}{1 + e^{-x}} $$

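To make the shape of the curve concrete, here is a small NumPy sketch of the sigmoid (for illustration only):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: maps any real input smoothly into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-6, 6, 7)       # [-6, -4, -2, 0, 2, 4, 6]
print(np.round(sigmoid(x), 4))  # ≈ [0.0025 0.018 0.1192 0.5 0.8808 0.982 0.9975]

# Unlike the step function, the output changes gradually around 0, and the
# derivative f'(x) = f(x) * (1 - f(x)) exists everywhere -- exactly what
# gradient descent needs.
```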

However, the sigmoid function is not without its drawbacks. One of the most significant issues is the vanishing gradient problem. When the input values become very large (either positive or negative), the function saturates and the gradient approaches zero, effectively halting the learning process in those neurons. This problem becomes especially pronounced in deeper networks, where the signal diminishes as it travels through multiple layers.
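
The saturation is easy to see numerically. The sketch below (toy values only) evaluates the sigmoid's derivative, $f'(x) = f(x)(1 - f(x))$, at increasingly large inputs and shows how quickly the chained per-layer factors shrink the gradient:

```python
import numpy as np

def sigmoid_grad(x):
    """Derivative of the sigmoid: f'(x) = f(x) * (1 - f(x))."""
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

for x in (0.0, 5.0, 10.0, 20.0):
    print(f"x = {x:4.0f}  ->  gradient ≈ {sigmoid_grad(x):.2e}")
# x =    0  ->  gradient ≈ 2.50e-01
# x =    5  ->  gradient ≈ 6.65e-03
# x =   10  ->  gradient ≈ 4.54e-05
# x =   20  ->  gradient ≈ 2.06e-09

# Backpropagation multiplies one such factor per layer; even at the best-case
# value of 0.25 per layer, ten stacked sigmoid layers scale the gradient by at most:
print(0.25 ** 10)  # ≈ 9.5e-07
```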

Another challenge is that the sigmoid is non-zero-centered. Since the output is always positive, this can cause issues during optimization, as gradient updates may push the weights predominantly in one direction, slowing convergence. Additionally, the sigmoid function’s reliance on an exponential calculation makes it more computationally expensive, which can increase the time required for training large-scale models.
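
To make the zero-centering point concrete, here is a small illustrative sketch (the toy error term `delta` is an assumption, not part of any real training loop) showing that when a layer's inputs are all positive, as sigmoid outputs always are, every weight gradient of a downstream neuron shares the same sign, which is why updates tend to push the weights in one direction at a time:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sigmoid outputs always lie in (0, 1), so the activations feeding the next
# layer are strictly positive.
a = 1.0 / (1.0 + np.exp(-rng.normal(size=8)))

# For a downstream neuron with pre-activation z = w . a + b, the weight
# gradients are dL/dw_i = delta * a_i, so every component inherits the sign
# of the single upstream error term delta.
delta = -0.7  # toy error signal (assumed value)
grad_w = delta * a
print(np.sign(grad_w))  # [-1. -1. -1. -1. -1. -1. -1. -1.] -> all pushed the same way
```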