Historically, the initial concept of a neural network was inspired by the human brain's structure and function, aiming to replicate its ability to learn and make decisions. This inspiration is evident in the development of activation functions within artificial neural networks. In the human brain, neurons receive signals through their dendrites and, once a certain threshold of excitation is reached, they 'fire' a signal along their axon. This process is known as the "all-or-nothing" law, where a neuron either sends a full signal or none at all.

The concept of activation functions in artificial neural networks parallels this biological phenomenon. These functions determine whether and to what extent a signal should be passed forward through the network. They act as a gate, similar to the firing mechanism of biological neurons, deciding when a neuron should be activated based on the weighted sum of its inputs. Without activation functions, a neural network would simply perform linear transformations, lacking the ability to model complex, non-linear relationships inherent in real-world data.

Nonlinearity and Activation Functions

Activation functions serve as critical nonlinear elements within neural networks, positioned after layers that perform parameterized linear or convolutional operations. Their fundamental role is to inject nonlinearity into the model. In the absence of such functions, neural networks would be confined to executing linear transformations from input to output, severely constraining their capacity to capture and represent complex data patterns and relationships.
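
To make this concrete, here is a minimal NumPy sketch (the layer sizes and random values are purely illustrative) showing that two stacked linear layers collapse into a single linear map, and that inserting a nonlinearity such as ReLU breaks that collapse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked "layers" with no activation function: y = W2 (W1 x).
# The shapes (3 -> 4 -> 2) are arbitrary, chosen only for illustration.
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x)

# The composition collapses into a single linear map W = W2 @ W1,
# so extra depth adds no expressive power without a nonlinearity.
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))  # True

# Inserting a nonlinearity (here ReLU) between the layers breaks the collapse.
relu = lambda z: np.maximum(0.0, z)
print(np.allclose(W2 @ relu(W1 @ x), one_layer))  # generally False
```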

Consider the illustrative example of differentiating between purple-dot and yellow-cross samples. A linear model proves inadequate for this classification task. However, introducing a nonlinear activation function, such as the absolute function, transforms the model: applying the absolute function effectively folds the feature space, yielding a nonlinear model that can separate the two sets of samples. Please refer to the "Intuition" section for a deeper sense of why activation functions are useful.
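
As a rough sketch of the folding idea, one can imagine a one-dimensional version of this example in which the purple dots cluster near the origin and the yellow crosses sit on both sides of it; the coordinates and threshold below are invented purely for illustration.

```python
import numpy as np

# Toy 1-D stand-in for the example: purple dots cluster near the origin,
# yellow crosses lie on both sides of it, so no single linear threshold
# on x separates the two classes. (Coordinates are invented.)
purple = np.array([-0.5, -0.2, 0.1, 0.4])
yellow = np.array([-2.1, -1.8, 1.7, 2.2])

# The absolute function folds the negative half of the feature space onto
# the positive half; after the fold, one threshold separates the classes.
threshold = 1.0
print(np.all(np.abs(purple) < threshold))  # True  -> all purple below the cut
print(np.all(np.abs(yellow) > threshold))  # True  -> all yellow above the cut
```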

Ancestor Activation Function

Initial iterations of perceptrons, a foundational model in neural network theory, were designed to utilize a step function to mimic the binary response of biological neurons—activating or not activating based on the input signal. This binary, all-or-nothing approach, while simple and intuitive, quickly encountered significant limitations in the burgeoning field of artificial neural networks.

The primary challenge with the step function, illustrated by its abrupt jump from 0 to 1 at a threshold value, is its discontinuity and lack of differentiability. These characteristics of the step function presented a fundamental obstacle to advancing neural network models, as they precluded the possibility of using gradient-based optimization methods. Without the ability to calculate gradients, it was challenging and inefficient to adjust the weights of the network in a nuanced and continuous manner, severely limiting the learning capability and adaptability of early neural network models.
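
A small NumPy sketch (with illustrative input values and a simple finite-difference estimate) shows the problem: the step function's gradient is zero everywhere except at the jump, so gradient-based optimization receives no usable signal.

```python
import numpy as np

def step(x):
    # Heaviside step: fires 1 above the threshold (here 0), otherwise 0.
    return np.where(x >= 0.0, 1.0, 0.0)

def numerical_grad(f, x, eps=1e-4):
    # Symmetric finite-difference approximation of the derivative.
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

xs = np.array([-2.0, -0.5, 0.5, 2.0])
print(numerical_grad(step, xs))               # [0. 0. 0. 0.] -> no learning signal
print(numerical_grad(step, np.array([0.0])))  # [5000.] -> a spike at the jump
```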

Gen 1 Activation Functions

Introduced in the context of neural networks as early as the 1980s, the Sigmoid function (denoted as $f(x) = \frac{1}{1 + e^{-x}}$) became a staple for early neural network research and applications. Its continuous nature allows for smooth transitions between output values, unlike the step function, which jumps abruptly from 0 to 1. The differentiability of the Sigmoid function across the entire input space permits the computation of gradients, a cornerstone for the backpropagation algorithm, enabling the network to adjust its weights and biases to minimize error.
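
A minimal NumPy sketch of the Sigmoid and its closed-form derivative, $f'(x) = f(x)(1 - f(x))$, evaluated at a few illustrative points:

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^{-x}): smooth and differentiable everywhere.
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Closed-form derivative: f'(x) = f(x) * (1 - f(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
print(sigmoid(xs))       # smooth transition from ~0 to ~1, no abrupt jump
print(sigmoid_grad(xs))  # peaks at 0.25 for x = 0, shrinks toward the tails
```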

The shift to the Sigmoid function in neural networks marked a crucial evolution, offering continuity and differentiability over the step function, both essential for applying gradient descent. However, the Sigmoid function introduced three primary limitations: the vanishing gradient problem, where output saturation at extreme values causes gradients to shrink and hampers learning in deep networks; non-zero-centered output, which forces the gradients on a layer's weights to share the same sign and leads to inefficient, zig-zagging updates; and computational cost, since the exponential calculation slows down training. Despite the Sigmoid function's foundational role, these challenges have led modern deep learning architectures to prefer alternative activation functions such as ReLU, which sidestep these issues while improving computational efficiency and network performance.
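
The vanishing gradient problem can be made concrete with a back-of-the-envelope sketch: the Sigmoid's derivative never exceeds 0.25, so from the activation alone the chain rule shrinks the gradient by at least a factor of four per Sigmoid layer, even in the best case (ignoring the weight matrices, which may partially compensate).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The sigmoid's derivative peaks at 0.25 (at x = 0) and decays toward zero in
# the saturated regions, so the chain rule scales the gradient reaching early
# layers by at most 0.25 per sigmoid layer, even in the best case.
peak = sigmoid_grad(np.array([0.0]))[0]
for depth in (1, 5, 10, 20):
    print(depth, peak ** depth)  # 0.25, ~1e-3, ~1e-6, ~1e-12
```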

Gen 2 Activation Functions

To address the sigmoid function's issues of vanishing gradients, non-zero-centered output, and computational inefficiency, a new set of activation functions, including ReLU, Tanh, Softplus, Hard Sigmoid, and ELU, was developed, each providing specific enhancements.

ReLU (Rectified Linear Unit): Popularized by Nair and Hinton in 2010, ReLU has become one of the most widely used activation functions in neural networks, especially in deep learning models. It is defined mathematically as $f(x) = \max(0, x)$: it outputs the input directly if it is positive and zero otherwise. The simplicity of ReLU lies in its linear, non-saturating form, which helps mitigate the vanishing gradient problem that plagues training in deep networks. The vanishing gradient problem occurs when gradients become too small, severely slowing down the training process or stopping it altogether. ReLU's constant gradient for positive inputs ensures that the network continues to learn as long as some inputs are positive. However, it can lead to 'dead' neurons: a unit whose pre-activation is negative for every input outputs zero and receives zero gradient, so it never activates and stops learning.
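
A minimal NumPy sketch of ReLU and its (sub)gradient, with invented pre-activation values to illustrate the dead-neuron failure mode:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs, 0 otherwise (subgradient 0 taken at x = 0).
    return (x > 0.0).astype(float)

xs = np.array([-3.0, -0.5, 0.5, 3.0])
print(relu(xs))       # [0.  0.  0.5 3. ]
print(relu_grad(xs))  # [0. 0. 1. 1.] -> constant gradient on the positive side

# "Dead" neuron: if the pre-activation is negative for every input, both the
# output and the gradient are zero, so no weight updates ever flow back.
pre_activations = np.array([-4.2, -1.3, -0.7])
print(relu_grad(pre_activations).sum())  # 0.0
```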

Tanh (Hyperbolic Tangent): The tanh function, which outputs values between -1 and 1, is a rescaled version of the sigmoid function. Mathematically, it is expressed as $f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1$. By outputting negative as well as positive values, tanh addresses the sigmoid's lack of zero-centered output, and this zero-centered nature helps convergence during training. Although tanh appeared earlier in the neural network literature, it also suffers from the vanishing gradient problem for inputs with large magnitudes, though to a lesser extent than the sigmoid. In practice it is more effective than sigmoid for hidden layers because its outputs are centered around zero, making it easier for the next layer to learn.
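
A short NumPy sketch (with illustrative, symmetric inputs) showing both the zero-centered output and the rescaling identity $\tanh(x) = 2\,\sigma(2x) - 1$ that links tanh to the sigmoid:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

# tanh outputs are centered around zero; sigmoid outputs are pushed into (0, 1).
print(np.mean(np.tanh(xs)))   # ~0.0 for this symmetric input
print(np.mean(sigmoid(xs)))   # ~0.5

# The rescaling identity linking the two: tanh(x) = 2 * sigmoid(2x) - 1,
# which matches the formula 2 / (1 + e^{-2x}) - 1 given above.
print(np.allclose(np.tanh(xs), 2.0 * sigmoid(2.0 * xs) - 1.0))  # True
```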