Handling sequential data presents a significant challenge due to its increased dimensionality. An oversimplified yet intuitive analogy is to consider channels—such as RGB in images—as the feature dimension (e.g., columns in tabular data). For images, the width and height add two more dimensions, while for time-series signals, time serves as the sequential dimension.
The inclusion of these sequential dimensions dramatically expands the number of features for each sample. For instance, a $320\times 240$ RGB image, when represented in a traditional tabular format, would result in $320 \times 240 \times 3 = 230{,}400$ individual features. Managing such high-dimensional data is already demanding for conventional machine learning models, necessitating specialized techniques such as the dual formulation in support vector machines (SVMs). This problem is often known as the curse of dimensionality.
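To make the flattening concrete, here is a minimal NumPy sketch; the image array is randomly generated purely for illustration:

```python
import numpy as np

# A 320x240 RGB image: height x width x channels.
image = np.random.rand(240, 320, 3)

# Flattening into a single tabular row yields one feature per value.
features = image.reshape(-1)
print(features.shape)  # (230400,) -> 320 * 240 * 3 = 230,400 features
```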
In this section, we will delve into the specific challenges posed by fully connected layers, particularly when applying a traditional neural network architecture like the Multilayer Perceptron (MLP) to high-dimensional sequential data. We will explore why MLPs struggle with such data and discuss the implications of the explosion in feature space on computational complexity, memory requirements, and model performance.
Let's review the inner product concept. It involves multiplying corresponding elements of two vectors and summing the results to measure their similarity. Although the inner product is typically applied to 1D vectors, we use 2D data here for illustration. Note that the following equations do not represent matrix multiplication.
For example, the inner product between a $3\times 3$ image vector and a $3\times 3$ weight matrix is computed as follows,
$$
\begin{aligned}
\left<\underbrace{\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}}_{\text{image vector}},
\underbrace{\begin{bmatrix} 1 & 2 & 1 \\ 2 & 1 & 2 \\ 1 & 2 & 1 \end{bmatrix}}_{\text{weights}}\right>
&=\sum\begin{bmatrix} 1 & 4 & 3 \\ 8 & 5 & 12 \\ 7 & 16 & 9 \end{bmatrix}\\
&=1+4+3+\cdots+9 = 65
\end{aligned}
$$
This process quantifies the similarity, like a score, between the image vector and the weights, as discussed in the first intuition class. In this example, only a single set of weights is used, and the total number of multiplications equals the number of elements: $3 \times 3 = 9$.
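The same computation in NumPy, reproducing the worked example above:

```python
import numpy as np

image = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
weights = np.array([[1, 2, 1],
                    [2, 1, 2],
                    [1, 2, 1]])

# Element-wise multiply, then sum: the inner product as a similarity score.
score = np.sum(image * weights)
print(score)  # 65, from 9 multiplications (3 x 3 elements)
```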
Let’s consider a more complex situation with the MNIST handwritten digits as input.
Consider an image of size $28 \times 28$ pixels with a single grayscale channel. A fully connected layer with three sets of weights produces three output values.
The total number of weights, represented by the gray lines in the figure on the right, is substantial. The black dots correspond to the input image pixels, and each output node (blue dot) connects to every pixel in the image. As a result, the number of weights per output node is $28 \times 28 = 784$. With three output nodes, the total number of weights becomes $3 \times 784 = 2352$.
https://yyhtbs-yye.github.io/#/plotlyrender?data=https://raw.githubusercontent.com/yyhtbs-yye/plotly_json/refs/heads/main/fc_layer_3d_visualization_300px.json
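As a sketch, the same layer in NumPy: one row of 784 weights per output node, with random placeholder values standing in for the pixels and weights:

```python
import numpy as np

# Flattened 28x28 MNIST image: 784 input features.
x = np.random.rand(28 * 28)

# One row of 784 weights per output node; 3 output nodes in total.
W = np.random.rand(3, 28 * 28)

outputs = W @ x   # three output values, one per set of weights
print(W.size)     # 2352 weights = 3 * 784
```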
Let's analyze the complexity of a fully connected layer. Suppose the input length is $M$ and the output size is $Q$. A fully connected layer requires $Q$ sets of weights, each of length $M$. Consequently, the total number of weights is $M \times Q$, and the number of multiplications is the same, $M \times Q$.
For an image, if we take the input size to be $H \times W$ and the hidden feature map size to be $U \times V$, the total number of weights is $H \times W \times U \times V$.
For example, if we have a $320\times 240$ video frame as input, and the hidden feature map is also $320 \times 240$, the total number of weights is:
$$ 320\times 240\times 320 \times 240 = 5{,}898{,}240{,}000 \approx 5.9\text{ billion weights (parameters).} $$
This is comparable to the number of parameters in a modern large language model (Llama 3.1, for example, has an 8-billion-parameter variant). Note that this is just a single layer, not an entire neural network.
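A quick sanity check of these counts, using the same $M \times Q$ formula from above:

```python
def fc_weight_count(input_size: int, output_size: int) -> int:
    """Number of weights in a fully connected layer: M * Q."""
    return input_size * output_size

print(fc_weight_count(28 * 28, 3))            # 2352 (MNIST example)
print(fc_weight_count(320 * 240, 320 * 240))  # 5,898,240,000 (~5.9 billion)
```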