In this section, we will explore neural networks as universal approximators. Unlike polynomials, neural networks primarily rely on linear computations, which avoids numerical instability issues. Moreover, since they do not involve high-order terms, they avoid the explosion in the number of parameters caused by cross-terms.
In particular, we will explain why neural networks are capable of approximating any function. Our focus will be on demonstrating this capability through two basic scenarios: using a simple, wide (or 'fat') shallow neural network for a straightforward regression task, and employing a basic deep neural network for a nonlinear classification problem.
NOTE: The following explanation does not delve into the learning or training aspects of the model. Instead, we manually craft parameters and functions, and show how they can be composed via neural networks.
Our goal in this regression task is to approximate a sine wave within a specific domain (a limited range of input values, e.g., 3 periods), because a sine wave, being infinitely long, cannot be captured by a piecewise linear function with finitely many pieces.
First, it is important to understand that a target function can be approximated by combining multiple simpler functions. In this fat neural network, the activation function is the ingredient from which the target function is assembled.
The perceptron $\hat{y}=\sigma(\mathbf{w}\cdot \mathbf{x}+b)$ consists of the weights $\mathbf{w}$, the input $\mathbf{x}$, the bias $b$, and the activation function $\sigma$. To better visualize the function's composition, we consider a one-dimensional input $x$ and a one-dimensional weight $w$. We will use the ReLU function $\sigma_{\text{ReLU}}(x) = \max(0, x)$ to highlight the slope change at its corner point. The shape of ReLU is depicted in the following image, showing a distinct sharp corner at $x=0$.
The overall mathematical expression for such a perceptron is then
$$ \sigma_{\text{ReLU}}(wx+b) $$
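As a quick sketch (plain NumPy; the helper name `perceptron` is our own, not part of the model), this 1-D perceptron is only a couple of lines, and varying $w$ and $b$ merely stretches the slope and moves the corner to $x = -b/w$:

```python
import numpy as np

def relu(z):
    """ReLU activation: max(0, z)."""
    return np.maximum(0.0, z)

def perceptron(x, w, b):
    """A 1-D ReLU perceptron: sigma_ReLU(w * x + b)."""
    return relu(w * x + b)

x = np.linspace(-2, 2, 5)
print(perceptron(x, w=1.0, b=0.0))   # corner at x = 0
print(perceptron(x, w=2.0, b=-2.0))  # steeper slope (w=2), corner moved to x = -b/w = 1
```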
Additionally, no matter how we change $w$ (stretch) and $b$ (shift), a single ReLU function can never turn into a sine- or triangle-like shape on its own. Hence, we need a way to stack differently scaled and shifted versions of ReLU together. This results in the regression multilayer perceptron (MLP) structure shown below:
NOTE: In the diagram, we've illustrated only two perceptrons in the first layer, but in reality more ReLU units are needed to realize such a fit. Additionally, we do not apply an activation function after the hidden layer because we're tackling a regression problem.
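As a rough sketch of this structure (the function `fat_mlp` and its parameter layout are our own illustration, not a prescribed implementation), the forward pass of a one-hidden-layer regression MLP looks like the following; even with just the two hidden units from the diagram, it can already bend a straight line once:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def fat_mlp(x, w1, b1, w2, b2):
    """One hidden ReLU layer, then a plain weighted sum (no output activation).
    x: (n,) inputs; w1, b1: (h,) hidden weights/biases; w2: (h,) output weights; b2: scalar bias."""
    hidden = relu(np.outer(x, w1) + b1)   # shape (n, h): column i is ReLU unit i evaluated on x
    return hidden @ w2 + b2               # linear combination of the ReLU pieces

# With only the two hidden units of the diagram, the network already produces one "bend":
# it rises with slope +1 until x = 1, then falls with slope -1 (a triangular bump).
x = np.linspace(0, 2, 5)
print(fat_mlp(x, w1=np.array([1.0, 1.0]), b1=np.array([0.0, -1.0]),
              w2=np.array([1.0, -2.0]), b2=0.0))
```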
You may have already noticed that, because of the sharp bend in the ReLU function at $x=0$, fitting a sine wave perfectly is quite challenging. Nevertheless, ReLU's distinctive sharp turning point makes it an instructive example for our analysis. In this context, we approximate the overall up-and-down pattern of the sine wave in a piecewise, triangular manner. To achieve a better fit for the sine wave, it is advisable to employ smoother activation functions such as ELU (Exponential Linear Unit).
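For comparison, here is a minimal sketch of ELU next to ReLU (plain NumPy; `alpha` is the usual ELU scale parameter), showing how ELU bends smoothly through zero where ReLU has its kink:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    """ELU: alpha * (exp(z) - 1) for z < 0, z otherwise -- smooth where ReLU has its corner."""
    return np.where(z > 0, z, alpha * np.expm1(z))

z = np.linspace(-1, 1, 5)
print(relu(z))  # zero for all negative inputs, sharp corner at 0
print(elu(z))   # negative inputs decay smoothly toward -alpha, no corner at 0
```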
To facilitate processing, we first manually shift the entire target function by adding 1 to its $y$-values, ensuring all function values are non-negative. This adjustment will be beneficial for the subsequent ReLU processing, since ReLU outputs are themselves non-negative.
As a result, we need to set $b_2 = -1$. This ensures that if the preceding layers can build the shifted piecewise function (the left plot in the upper figure), $b_2$ then shifts it back down to the function in the right plot. The neural network can now be configured as shown below.
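To make this concrete, below is a hand-crafted sketch of one possible configuration (the parameter values and the helper `fat_net` are our own choices, not learned ones): one hidden unit supplies the constant $+1$, one supplies the initial rising slope, and one unit per extremum of $\sin(x)$ over three periods flips the slope, while the output bias $b_2 = -1$ shifts the result back down:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hidden layer: ReLU(w1[i] * x + b1[i]) for each unit i, over x in [0, 6*pi] (3 periods).
# Unit 0 (w=0, b=1) outputs the constant 1 -- the "+1" shift of the target.
# Unit 1 (w=1, b=0) supplies the initial rising slope.
# Units 2..7 each place a corner at an extremum of sin(x): pi/2, 3*pi/2, ..., 11*pi/2.
corners = np.pi / 2 + np.pi * np.arange(6)
w1 = np.concatenate(([0.0, 1.0], np.ones(6)))
b1 = np.concatenate(([1.0, 0.0], -corners))

# Output layer: weighted sum of the pieces; the slope flips by 4/pi at each extremum.
slope = 2.0 / np.pi
w2 = np.concatenate(([1.0, slope], 2.0 * slope * (-1.0) ** np.arange(1, 7)))
b2 = -1.0                                  # shifts the non-negative approximation back down

def fat_net(x):
    hidden = relu(np.outer(x, w1) + b1)    # (n, 8) hidden activations
    return hidden @ w2 + b2                # triangular approximation of sin(x)

x = np.linspace(0, 6 * np.pi, 13)
print(np.round(fat_net(x), 2))  # matches sin(x) at these half-period samples (up to rounding)
```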