In the past, due to limited computational power, neural networks were relatively simple. For instance, a typical Multilayer Perceptron (MLP) would have only 3 to 4 layers, each containing just a few dozen neurons. As computational capabilities improved, however, neural networks became both deeper, with significantly more layers, and wider, with more neurons per layer.
This raises an important question:
Why is there so much emphasis on the "depth" of neural networks rather than their "width"? In other words, why do we prioritize increasing depth over expanding width?
The key reason is that deep neural networks excel at analyzing data hierarchically and extracting structured patterns, a capability that shallow networks lack. The distinction between deep and shallow networks is akin to code abstraction, where programmers factor logic into functions to streamline and simplify code.
Reusable functions simplify code by allowing it to express more complex logic in the same amount of code. For instance, the inner product is a widely used operation in fields like data analytics, machine learning, and signal processing.
In a single project, routines such as linear regression and PCA often rely on the inner product. Encapsulating the inner product in a reusable function allows higher-level functions to call it directly, avoiding redundant implementation and simplifying the workflow.
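As a minimal sketch of this idea (the function names and routines below are purely illustrative, not from any particular library), a single `inner_product` helper can be reused by both a linear-regression prediction and a PCA-style projection:

```python
import numpy as np

def inner_product(a, b):
    """Reusable building block: the inner product of two vectors."""
    return float(np.dot(a, b))

def linear_regression_predict(weights, features, bias=0.0):
    """Linear regression prediction built on top of inner_product."""
    return inner_product(weights, features) + bias

def pca_project(x, principal_component):
    """PCA-style projection of x onto a principal component,
    again reusing inner_product instead of re-implementing it."""
    return inner_product(x, principal_component)

# Both higher-level functions call the same low-level helper.
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.2, 0.1])
print(linear_regression_predict(w, x))            # 0.4
print(pca_project(x, np.array([1.0, 0.0, 0.0])))  # 1.0
```

The low-level helper is written once and shared, so each higher-level routine only has to describe what is new about its own task.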
This concept underpins the hierarchical structure of deep neural networks, which prioritize more layers over more neurons per layer. Early layers act as fundamental building blocks, similar to reusable sub-functions in code (e.g., the inner product). Later layers combine these foundations to form increasingly complex functions.
Let's take cat recognition as an example:
The process of recognizing a cat involves identifying shapes such as triangles for ears and circles for eyes. To discern these shapes, the neural network first learns to recognize lines and curves (edges). In this context, "Cat Recognition" corresponds to the final output layer of the network, while detecting something like a "45° Line" is handled by earlier hidden layers that process edge information.
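The sketch below mirrors this hierarchy in code. It is only an analogy under strong simplifying assumptions: real networks learn these detectors from data rather than having them hand-written, and the filter, thresholds, and function names here are hypothetical.

```python
import numpy as np

# Hypothetical "early layer": detect a 45-degree edge by correlating
# a 3x3 patch with an oriented edge filter.
EDGE_45 = np.array([[ 1.0,  0.0, -1.0],
                    [ 0.0,  1.0,  0.0],
                    [-1.0,  0.0,  1.0]])

def detect_edge_45(patch):
    return float(np.sum(patch * EDGE_45)) > 0.5

# Hypothetical "middle layers": shapes are built from edge detections.
def detect_triangle(patches):
    # e.g. an ear: several oriented edges in the right arrangement
    return sum(detect_edge_45(p) for p in patches) >= 2

def detect_circle(patches):
    # e.g. an eye: edges at many orientations (greatly simplified)
    return sum(detect_edge_45(p) for p in patches) >= 3

# Hypothetical "output layer": the cat detector reuses the shape detectors,
# just as a top-level function reuses lower-level helpers.
def detect_cat(ear_patches, eye_patches):
    return detect_triangle(ear_patches) and detect_circle(eye_patches)

patch = np.eye(3)              # a toy patch with a diagonal "edge"
print(detect_edge_45(patch))   # True: the diagonal response is strong
```

Each level only calls the level below it, so the edge detectors are written (or, in a real network, learned) once and reused everywhere they are needed.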
A shallow neural network is like a programmer writing all code in a single script without nesting functions. Lacking a hierarchical structure, shallow networks must reimplement lower-level feature extraction for each higher-level task, resulting in redundant neurons and limiting the network's learning potential.
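Continuing the earlier inner-product sketch (again with purely illustrative names), the "single script" style looks like this: every routine re-implements the same low-level logic inline, just as a shallow network must re-learn low-level features for each higher-level task.

```python
# Without a shared helper, each routine re-implements the inner product inline.
def linear_regression_predict_flat(weights, features, bias=0.0):
    total = 0.0
    for w, x in zip(weights, features):
        total += w * x          # inner product re-implemented here...
    return total + bias

def pca_project_flat(x, principal_component):
    total = 0.0
    for a, b in zip(x, principal_component):
        total += a * b          # ...and re-implemented again here
    return total
```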
Let’s consider a practical example involving a Convolutional Neural Network (CNN) designed for car recognition. The figure above illustrates how different layers in a neural network process information at varying levels of abstraction.
In the initial layers, the network focuses on extracting fundamental, low-level features such as edges, lines, and simple geometric patterns. These are the building blocks of visual information, analogous to the inner product in coding: simple, reusable components that can be combined to form more complex structures. For instance, detecting edges at various angles helps the network establish the basic contours and shapes present in the image.
Later layers combine these detected patterns, and the combinations can be thought of as logic gates, such as 'AND' and 'NAND', where the presence or absence of specific patterns determines whether the object is classified as a car or not.
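The sketch below illustrates both steps under the same caveats as before: the Sobel-like kernels, the pattern names (wheels, windshield, handlebars), and the thresholds are hypothetical stand-ins for what a trained CNN would actually learn.

```python
import numpy as np

# Hypothetical low-level filters of the early layers: Sobel-like kernels
# that respond to horizontal and vertical edges.
HORIZONTAL_EDGE = np.array([[-1, -2, -1],
                            [ 0,  0,  0],
                            [ 1,  2,  1]], dtype=float)
VERTICAL_EDGE = HORIZONTAL_EDGE.T

def edge_response(patch, kernel):
    """Strength of an oriented edge in a 3x3 image patch."""
    return abs(float(np.sum(patch * kernel)))

# Hypothetical higher-layer combination, written as logic gates:
# the presence or absence of mid-level patterns decides the class.
def combine_patterns(has_wheels, has_windshield, has_handlebars):
    # AND: a car needs both wheels and a windshield;
    # NAND-style exclusion: wheels plus handlebars suggest a motorcycle instead.
    return has_wheels and has_windshield and not has_handlebars

# Toy usage: a patch with a strong horizontal edge.
patch = np.array([[0.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0],
                  [1.0, 1.0, 1.0]])
print(edge_response(patch, HORIZONTAL_EDGE))  # 4.0
print(combine_patterns(True, True, False))    # True: classified as a car
```

The point is the structure, not the specific rules: low-level responses are computed once and reused, and the final decision is a simple combination of whether each mid-level pattern was found.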