Modern deep learning models are endlessly stretching the boundaries of size and complexity, from massive language models with hundreds of billions of parameters to specialized architectures like mixtures of experts. Despite the clear gains in performance, these larger and more intricate models inevitably pose significant challenges in terms of training costs, memory usage, and deployment feasibility.
In this tutorial, we’ll discuss a systematic way to analyze and design neural network architectures by:
- Introducing the “Four Quadrants” framework, which classifies parameterization strategies along two major axes (Dense vs. Partial, Independent vs. Shared).
- Exploring three key “dimensions” along which parameters are arranged and reused (Data Features, Data Sequences/Relations, and Model Depth).
By understanding these design axes and dimensions, you can make more informed decisions that balance expressiveness (capacity to capture complex patterns) and efficiency (speed, scalability, and resource usage).
The Four Quadrants: Quadrant Axes
- Dense vs. Partial:
- Dense: All parameters participate in processing all tasks, inputs, or components. Classic MLPs and Transformers exemplify dense parameter usage.
- Partial: Parameters are split into specialized subsets dedicated to specific tasks, modalities, or inputs. Mixture-of-experts (MoE), adapter layers, and certain CNNs with local receptive fields are examples.
(Figure: an illustrative sketch of the four quadrants, intended as an example rather than an exact depiction.)
- Independent vs. Shared:
- Independent: Each connection or unit has its own dedicated parameters. This allows maximum flexibility but can lead to large parameter counts.
- Shared: The same parameters are reused in multiple parts of the model. This reduces the parameter footprint and can improve generalization (e.g., convolutional filters that shift across an image).
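To keep the two axes straight, here is a small sketch (in Python) of how the example architectures mentioned above might sit in the four quadrants. The placement of each example is a judgment call for orientation only, not a fixed taxonomy; the weight-tying example in the dense/shared cell is an assumption drawn from the "Model Depth" dimension above.

```python
# One possible reading of how the example architectures map onto the two axes.
# The exact placement is debatable and meant only for orientation.
quadrants = {
    ("dense",   "independent"): ["classic MLP", "standard Transformer layer"],
    ("dense",   "shared"):      ["one block reused across model depth (weight tying)"],
    ("partial", "independent"): ["mixture-of-experts (each expert has its own weights)"],
    ("partial", "shared"):      ["CNN with local receptive fields and shared filters"],
}

for (participation, weights), examples in quadrants.items():
    print(f"{participation:>7} / {weights:<11} -> {', '.join(examples)}")
```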
Dense vs. Partial Parameterization
In a dense setup, every weight is potentially involved in processing any given input. Transformer encoders, for instance, use the same attention layers for all tokens, and each weight in these layers remains active for every token processed.
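As a minimal PyTorch sketch (the layer sizes are arbitrary assumptions), the snippet below builds a standard Transformer encoder layer and runs a batch of token embeddings through it; every weight in the attention and feed-forward sublayers participates in processing every token.

```python
import torch
import torch.nn as nn

# A dense Transformer encoder layer: the same attention and feed-forward
# weights process every token of every input.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)

tokens = torch.randn(8, 32, 256)   # (batch, sequence length, embedding dim)
out = layer(tokens)                # every parameter is active for each token

total = sum(p.numel() for p in layer.parameters())
print(out.shape, f"{total:,} parameters, all used densely")
```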
- Pros:
- Straightforward to implement and reason about.
- Each parameter can theoretically learn any aspect of the data.
- Cons:
- Potential over-parameterization if the data is heterogeneous.
- Harder to scale if you need specialized processing for different tasks or modalities.
In partial architectures, the model explicitly divides or gates subsets of parameters for specific inputs or tasks. Examples include:
- Adapter Layers in Transformers, where small additional modules are inserted and specialized for each task.
- CNN Receptive Fields, where each filter only “sees” a local region of the image.
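As one concrete (and deliberately simplified) illustration of the adapter idea, the PyTorch sketch below adds a small task-specific bottleneck on top of a frozen shared layer. The bottleneck size and the per-task dictionary are assumptions for illustration, not a reference implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """A small bottleneck module added on top of a frozen base layer.

    Only these few parameters are trained per task, while the shared
    base weights stay untouched: a partial, per-task parameterization.
    """
    def __init__(self, d_model: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))   # residual adapter

# Hypothetical usage: one adapter per task around a shared, frozen layer.
base = nn.Linear(256, 256)
for p in base.parameters():
    p.requires_grad = False          # the shared backbone stays frozen

adapters = {"task_a": Adapter(256), "task_b": Adapter(256)}

x = torch.randn(4, 256)
y = adapters["task_a"](base(x))      # only task_a's adapter would receive gradients
print(y.shape)
```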
- Pros:
- Can reduce the effective number of parameters used at a time.
- Allows more targeted learning for different tasks or data types.
- Cons:
- Requires mechanisms (routing or gating) to decide which subset of parameters to use.
- Potential under-utilization if the gating is not well-tuned.
Independent vs. Shared Weights
An architecture with fully independent weights has a unique parameter for each connection or neuron. While this maximizes representational power, it also inflates the parameter count dramatically. A simple MLP with a weight matrix of size d_in × d_out already requires d_in · d_out independent parameters for a single layer, and stacking wider or deeper layers multiplies that count.
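To see how quickly independent weights add up compared with shared ones, the short sketch below (layer sizes are arbitrary assumptions) counts the parameters of a fully connected layer versus a convolutional layer that reuses one small filter bank across the whole image.

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

# Independent weights: one parameter per input-output connection.
fc = nn.Linear(in_features=32 * 32 * 3, out_features=256)

# Shared weights: the same 256 filters of size 3x3 are reused at every spatial position.
conv = nn.Conv2d(in_channels=3, out_channels=256, kernel_size=3, padding=1)

print(f"fully connected: {count_params(fc):,} parameters")   # ~787k
print(f"convolutional:   {count_params(conv):,} parameters")  # ~7k
```

Mapping a 32×32 RGB input to 256 features with independent weights costs roughly 787k parameters, while 256 shared 3×3 filters cost only about 7k, which is the efficiency argument for weight sharing in a nutshell.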