Sequential data, such as time series, language, or sensor readings, is typically represented as a matrix whose dimensions correspond to sequence length (time steps) and feature channels. When both the sequence length $T$ and the feature dimension $D$ are large, however, feeding this data directly into a traditional Multi-Layer Perceptron (MLP) leads to an explosion in the number of parameters and in computational cost.
One might be tempted to flatten the input into a single vector and pass it through fully connected layers. This direct MLP approach, however, produces a first-layer weight matrix whose size is proportional to $T \times D$.
Figure placeholder: a $T \times D$ array is flattened into a vector of length $T \cdot D$ and fed to a neural network that outputs $\hat{y}$.
For large $T$ and $D$ this weight matrix becomes impractically large: the high parameter count inflates memory usage, slows down training, and makes computation inefficient.
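To make the blow-up concrete, here is a minimal PyTorch sketch of the flatten-and-feed approach; the sizes `T`, `D`, and `hidden` are illustrative values, not taken from the text.

```python
import torch
import torch.nn as nn

# Illustrative sizes (not from the text): 1000 time steps, 128 features per step.
T, D, hidden = 1000, 128, 256

# Naive approach: flatten the (T, D) matrix and feed it to fully connected layers.
flat_mlp = nn.Sequential(
    nn.Flatten(),                # (batch, T, D) -> (batch, T*D)
    nn.Linear(T * D, hidden),    # first-layer weight matrix is hidden x (T*D)
    nn.ReLU(),
    nn.Linear(hidden, 1),        # e.g. a scalar prediction y_hat
)

y_hat = flat_mlp(torch.randn(8, T, D))                              # shape (8, 1)
first_layer_params = sum(p.numel() for p in flat_mlp[1].parameters())
print(first_layer_params)        # 32,768,256 parameters in the first layer alone
```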
In short, when high-dimensional sequential data is represented as a $T \times D$ matrix with large $T$ and $D$, it is impractical to process the entire input in a single coupled layer.
Therefore, we can adopt a strategy that decouples the processing of the sequential and feature dimensions.
The notion of "orthogonality" in this context highlights that the operations on the feature and sequential dimensions are carried out independently or in an alternating fashion. This strategic separation prevents the computational challenges associated with coupling the two dimensions into a single large layer, thereby improving efficiency, scalability, and interpretability of the overall model.
In the first feature processing stage, a shared MLP (or linear transformation) is applied across the feature dimension for each individual time step. This operation transforms each $D$-dimensional feature vector into a lower-dimensional representation, say of size $D_{\text{low}}$. Mathematically, for each time step $t$, the transformation is given by
$$ \mathbf{z}[t] = \sigma(\mathbf{W}_D \mathbf{x}[t] + \mathbf{b}_D), $$
where $\mathbf{x}[t] \in \mathbb{R}^D$ is the input feature vector, $\mathbf{W}_D \in \mathbb{R}^{D_{\text{low}} \times D}$ is the weight matrix, $\mathbf{b}_D \in \mathbb{R}^{D_{\text{low}}}$ is the bias term, and $\sigma$ denotes an element-wise activation function. This reduction step simplifies the high-dimensional features into a more manageable form without losing critical information.
The outputs $\mathbf{z}[t]$ collected over all time steps form a 2D sequence-feature matrix $\mathbf{Z}\in\mathbb{R}^{D_{\text{low}}\times T}$. Each row $Z[d]$ contains all time samples of a particular feature $d$, while each column $\mathbf{z}[t]$ contains all features at a specific time step $t$.
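A minimal sketch of this feature-processing stage, assuming PyTorch, a single unbatched sequence, ReLU as $\sigma$, and illustrative sizes and names (`feature_mlp`, `D_low`): applying `nn.Linear` to a $(T, D)$ tensor acts on the last dimension, so the same $\mathbf{W}_D$ and $\mathbf{b}_D$ are shared across all time steps.

```python
import torch
import torch.nn as nn

T, D, D_low = 1000, 256, 32        # illustrative sizes, not from the text

x = torch.randn(T, D)              # one sequence: T time steps, D features each

# Shared transform W_D, b_D applied independently at every time step t.
feature_mlp = nn.Linear(D, D_low)
z = torch.relu(feature_mlp(x))     # z[t] = sigma(W_D x[t] + b_D), shape (T, D_low)

Z = z.T                            # rearrange to D_low x T, matching the matrix Z above
print(Z.shape)                     # torch.Size([32, 1000])
```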
In the second sequence processing stage, once the feature vectors have been transformed, the resulting representation $\mathbf{Z}$ is further processed along the sequence dimension $T$. Here, another MLP is employed to capture temporal or sequential interactions among the time steps, applied separately to each of the $D_{\text{low}}$ rows of $\mathbf{Z}$. This module may keep or reduce the sequence dimension (e.g., aggregate information across time). A simple formulation of this sequential operation is
$$ \mathbf{h}[d] = \sigma(\mathbf{W}_T Z[d]^\top + \mathbf{b}_T), $$
where $Z[d]^\top \in \mathbb{R}^{T}$ is the time series of feature $d$, $\mathbf{W}_T \in \mathbb{R}^{T_{\text{out}} \times T}$ and $\mathbf{b}_T \in \mathbb{R}^{T_{\text{out}}}$ are shared across all features, and $T_{\text{out}}$ is the kept or reduced sequence length.
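A corresponding sketch of the sequence-processing stage under the same assumptions (PyTorch, ReLU, illustrative sizes; `seq_mlp` and `T_out` are illustrative names): the same $\mathbf{W}_T$ and $\mathbf{b}_T$ are applied to every row $Z[d]$.

```python
import torch
import torch.nn as nn

T, D_low, T_out = 1000, 32, 128    # illustrative sizes, not from the text

Z = torch.randn(D_low, T)          # output of the feature stage: D_low x T

# Shared transform W_T, b_T applied to each row Z[d] (all T samples of feature d).
seq_mlp = nn.Linear(T, T_out)      # W_T has shape T_out x T, b_T has shape T_out
H = torch.relu(seq_mlp(Z))         # h[d] = sigma(W_T Z[d]^T + b_T), computed row-wise
print(H.shape)                     # torch.Size([32, 128])
```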
To deepen the model, these two stages are alternated multiple times. This iterative alternation fosters increasingly rich interactions between feature-level and sequential representations, enabling the model to progressively refine its understanding of complex patterns. This mirrors the earlier discussion of depth: by dynamically exchanging information across the two stages, the model enhances its expressiveness and strengthens its ability to capture intricate dependencies, temporal relationships, and latent structures inherent in the data.
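The sketch below alternates the two stages over several layers in the spirit described above; `MixerBlock`, `depth`, and all sizes are illustrative choices rather than a definitive implementation, and transposes are used so that each stage always acts on the last tensor dimension.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One feature-stage + sequence-stage pair acting on a (T, D_low) tensor."""
    def __init__(self, T: int, D_low: int):
        super().__init__()
        self.feature_mlp = nn.Linear(D_low, D_low)  # mixes features at each time step
        self.seq_mlp = nn.Linear(T, T)              # mixes time steps for each feature

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.feature_mlp(z))         # (T, D_low): act along features
        z = torch.relu(self.seq_mlp(z.T)).T         # (D_low, T): act along time, then back
        return z

T, D, D_low, depth = 1000, 256, 32, 4               # illustrative sizes, not from the text

embed = nn.Linear(D, D_low)                          # initial feature reduction
blocks = nn.Sequential(*[MixerBlock(T, D_low) for _ in range(depth)])

x = torch.randn(T, D)
z = blocks(torch.relu(embed(x)))                     # alternated refinement, shape (T, D_low)
print(z.shape)                                       # torch.Size([1000, 32])
```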