Modern deep learning models are endlessly stretching the boundaries of size and complexity, from massive language models with hundreds of billions of parameters to specialized architectures like mixtures of experts. Despite the clear gains in performance, these larger and more intricate models inevitably pose significant challenges in terms of training costs, memory usage, and deployment feasibility.
In this tutorial, we’ll discuss a systematic way to analyze and design neural network architectures by:
- Introducing the “Four Quadrants” framework, which classifies parameterization strategies along two major axes (Dense vs. Partial, Independent vs. Shared).
- Exploring three key “dimensions” along which parameters are arranged and reused (Data Features, Data Sequences/Relations, and Model Depth).
By understanding these design axes and dimensions, you can make more informed decisions that balance expressiveness (capacity to capture complex patterns) and efficiency (speed, scalability, and resource usage).
The Four Quadrants: Quadrant Axes
- Dense vs. Partial:
- Dense: All parameters participate in processing all tasks, inputs, or components. Classic MLPs and Transformers exemplify dense parameter usage.
- Partial: Parameters are split into specialized subsets dedicated to specific tasks, modalities, or inputs. Mixture-of-experts (MoE), adapter layers, and certain CNNs with local receptive fields are examples.
(Figure: an illustrative sketch of the four quadrants, intended as an example rather than an exact depiction.)
- Independent vs. Shared:
- Independent: Each connection or unit has its own dedicated parameters. This allows maximum flexibility but can lead to large parameter counts.
- Shared: The same parameters are reused in multiple parts of the model. This reduces the parameter footprint and can improve generalization (e.g., convolutional filters that shift across an image).
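To keep the two axes straight, here is a small sketch (in Python) of how the example architectures mentioned above might sit in the four quadrants. The placement of each example is a judgment call for orientation only, not a fixed taxonomy; the weight-tying example in the dense/shared cell is an assumption drawn from the "Model Depth" dimension above.

```python
# One possible reading of how the example architectures map onto the two axes.
# The exact placement is debatable and meant only for orientation.
quadrants = {
    ("dense",   "independent"): ["classic MLP", "standard Transformer layer"],
    ("dense",   "shared"):      ["one block reused across model depth (weight tying)"],
    ("partial", "independent"): ["mixture-of-experts (each expert has its own weights)"],
    ("partial", "shared"):      ["CNN with local receptive fields and shared filters"],
}

for (participation, weights), examples in quadrants.items():
    print(f"{participation:>7} / {weights:<11} -> {', '.join(examples)}")
```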
Dense vs. Partial Parameterization
In a dense setup, every weight is potentially involved in processing any given input. Transformer encoders, for instance, use the same attention layers for all tokens, and each weight in these layers remains active for every token processed.
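As a minimal PyTorch sketch (the layer sizes are arbitrary assumptions), the snippet below builds a standard Transformer encoder layer and runs a batch of token embeddings through it; every weight in the attention and feed-forward sublayers participates in processing every token.

```python
import torch
import torch.nn as nn

# A dense Transformer encoder layer: the same attention and feed-forward
# weights process every token of every input.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)

tokens = torch.randn(8, 32, 256)   # (batch, sequence length, embedding dim)
out = layer(tokens)                # every parameter is active for each token

total = sum(p.numel() for p in layer.parameters())
print(out.shape, f"{total:,} parameters, all used densely")
```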
- Pros:
- Straightforward to implement and reason about.
- Each parameter can theoretically learn any aspect of the data.
- Cons:
- Potential over-parameterization if the data is heterogeneous.
- Harder to scale if you need specialized processing for different tasks or modalities.
In partial architectures, the model explicitly divides or gates subsets of parameters for specific inputs or tasks. Examples include:
- Adapter Layers in Transformers, where small additional modules are inserted and specialized for each task.
- CNN Receptive Fields, where each filter only “sees” a local region of the image.
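As one concrete (and deliberately simplified) illustration of the adapter idea, the PyTorch sketch below adds a small task-specific bottleneck on top of a frozen shared layer. The bottleneck size and the per-task dictionary are assumptions for illustration, not a reference implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """A small bottleneck module added on top of a frozen base layer.

    Only these few parameters are trained per task, while the shared
    base weights stay untouched: a partial, per-task parameterization.
    """
    def __init__(self, d_model: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))   # residual adapter

# Hypothetical usage: one adapter per task around a shared, frozen layer.
base = nn.Linear(256, 256)
for p in base.parameters():
    p.requires_grad = False          # the shared backbone stays frozen

adapters = {"task_a": Adapter(256), "task_b": Adapter(256)}

x = torch.randn(4, 256)
y = adapters["task_a"](base(x))      # only task_a's adapter would receive gradients
print(y.shape)
```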
- Pros:
- Can reduce the effective number of parameters used at a time.
- Allows more targeted learning for different tasks or data types.
- Cons:
- Requires mechanisms (routing or gating) to decide which subset of parameters to use.
- Potential under-utilization if the gating is not well-tuned.
Independent vs. Shared Weights
An architecture with fully independent weights has a unique parameter for each connection or neuron. While this maximizes representational power, it also inflates the parameter count dramatically. A simple MLP with a weight matrix of size d_in × d_out already requires d_in · d_out independent parameters for a single layer, and stacking wider or deeper layers multiplies that count.
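To see how quickly independent weights add up compared with shared ones, the short sketch below (layer sizes are arbitrary assumptions) counts the parameters of a fully connected layer versus a convolutional layer that reuses one small filter bank across the whole image.

```python
import torch.nn as nn

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

# Independent weights: one parameter per input-output connection.
fc = nn.Linear(in_features=32 * 32 * 3, out_features=256)

# Shared weights: the same 256 filters of size 3x3 are reused at every spatial position.
conv = nn.Conv2d(in_channels=3, out_channels=256, kernel_size=3, padding=1)

print(f"fully connected: {count_params(fc):,} parameters")   # ~787k
print(f"convolutional:   {count_params(conv):,} parameters")  # ~7k
```

Mapping a 32×32 RGB input to 256 features with independent weights costs roughly 787k parameters, while 256 shared 3×3 filters cost only about 7k, which is the efficiency argument for weight sharing in a nutshell.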