Sequential data, such as time series, language, or sensor readings, is typically represented as a matrix whose dimensions correspond to sequence length (time steps) and feature channels. When both the sequence length $T$ and the feature dimension $D$ are large, however, feeding this data directly into a traditional Multi-Layer Perceptron (MLP) leads to an explosion in the number of parameters and in computational cost.
One might be tempted to flatten the input into a single vector and pass it through fully connected layers. This direct MLP approach, however, produces a first-layer weight matrix whose size is proportional to $T \times D$.
Figure placeholder: a $T \times D$ array is flattened into a vector of length $T \cdot D$ and fed to a neural network that outputs $\hat{y}$.
For large $T$ and $D$ this weight matrix becomes impractically large: the high parameter count inflates memory usage, slows down training, and makes computation inefficient.
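To make the blow-up concrete, here is a minimal PyTorch sketch of the flatten-and-feed approach; the sizes `T`, `D`, and `hidden` are illustrative values, not taken from the text.

```python
import torch
import torch.nn as nn

# Illustrative sizes (not from the text): 1000 time steps, 128 features per step.
T, D, hidden = 1000, 128, 256

# Naive approach: flatten the (T, D) matrix and feed it to fully connected layers.
flat_mlp = nn.Sequential(
    nn.Flatten(),                # (batch, T, D) -> (batch, T*D)
    nn.Linear(T * D, hidden),    # first-layer weight matrix is hidden x (T*D)
    nn.ReLU(),
    nn.Linear(hidden, 1),        # e.g. a scalar prediction y_hat
)

y_hat = flat_mlp(torch.randn(8, T, D))                              # shape (8, 1)
first_layer_params = sum(p.numel() for p in flat_mlp[1].parameters())
print(first_layer_params)        # 32,768,256 parameters in the first layer alone
```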
In short, when high-dimensional sequential data is represented as a $T \times D$ matrix with large $T$ and $D$, it is impractical to process the entire input in a single coupled layer.
Therefore, we can adopt a strategy that decouples the processing of the sequential and feature dimensions.
The notion of "orthogonality" in this context highlights that the operations on the feature and sequential dimensions are carried out independently or in an alternating fashion. This strategic separation prevents the computational challenges associated with coupling the two dimensions into a single large layer, thereby improving efficiency, scalability, and interpretability of the overall model.
In the first feature processing stage, a shared MLP (or linear transformation) is applied across the feature dimension for each individual time step. This operation transforms each $D$-dimensional feature vector into a lower-dimensional representation, say of size $D_{\text{low}}$. Mathematically, for each time step $t$, the transformation is given by
$$ \mathbf{z}[t] = \sigma(\mathbf{W}_D \mathbf{x}[t] + \mathbf{b}_D), $$
where $\mathbf{x}[t] \in \mathbb{R}^D$ is the input feature vector, $\mathbf{W}_D \in \mathbb{R}^{D_{\text{low}} \times D}$ is the weight matrix, $\mathbf{b}_D \in \mathbb{R}^{D_{\text{low}}}$ is the bias term, and $\sigma$ denotes an element-wise activation function. This reduction step simplifies the high-dimensional features into a more manageable form without losing critical information.
The outputs $\mathbf{z}[t]$ collected over all time steps form a 2D sequence-feature matrix $\mathbf{Z}\in\mathbb{R}^{D_{\text{low}}\times T}$. Each row $Z[d]$ contains all time samples of a particular feature $d$, while each column $\mathbf{z}[t]$ contains all features at a specific time step $t$.
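A minimal sketch of this feature-processing stage, assuming PyTorch, a single unbatched sequence, ReLU as $\sigma$, and illustrative sizes and names (`feature_mlp`, `D_low`): applying `nn.Linear` to a $(T, D)$ tensor acts on the last dimension, so the same $\mathbf{W}_D$ and $\mathbf{b}_D$ are shared across all time steps.

```python
import torch
import torch.nn as nn

T, D, D_low = 1000, 256, 32        # illustrative sizes, not from the text

x = torch.randn(T, D)              # one sequence: T time steps, D features each

# Shared transform W_D, b_D applied independently at every time step t.
feature_mlp = nn.Linear(D, D_low)
z = torch.relu(feature_mlp(x))     # z[t] = sigma(W_D x[t] + b_D), shape (T, D_low)

Z = z.T                            # rearrange to D_low x T, matching the matrix Z above
print(Z.shape)                     # torch.Size([32, 1000])
```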
In the second sequence processing stage, once the feature vectors have been transformed, the resulting representation $\mathbf{Z}$ is further processed along the sequence dimension $T$. Here, another MLP is employed to capture temporal or sequential interactions among the time steps, applied separately to each of the $D_{\text{low}}$ rows of $\mathbf{Z}$. This module may keep or reduce the sequence dimension (e.g., aggregate information across time). A simple formulation of this sequential operation is
$$ \mathbf{h}[d] = \sigma(\mathbf{W}_T Z[d]^\top + \mathbf{b}_T), $$
where $Z[d]^\top \in \mathbb{R}^{T}$ is the time series of feature $d$, $\mathbf{W}_T \in \mathbb{R}^{T_{\text{out}} \times T}$ and $\mathbf{b}_T \in \mathbb{R}^{T_{\text{out}}}$ are shared across all features, and $T_{\text{out}}$ is the kept or reduced sequence length.
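A corresponding sketch of the sequence-processing stage under the same assumptions (PyTorch, ReLU, illustrative sizes; `seq_mlp` and `T_out` are illustrative names): the same $\mathbf{W}_T$ and $\mathbf{b}_T$ are applied to every row $Z[d]$.

```python
import torch
import torch.nn as nn

T, D_low, T_out = 1000, 32, 128    # illustrative sizes, not from the text

Z = torch.randn(D_low, T)          # output of the feature stage: D_low x T

# Shared transform W_T, b_T applied to each row Z[d] (all T samples of feature d).
seq_mlp = nn.Linear(T, T_out)      # W_T has shape T_out x T, b_T has shape T_out
H = torch.relu(seq_mlp(Z))         # h[d] = sigma(W_T Z[d]^T + b_T), computed row-wise
print(H.shape)                     # torch.Size([32, 128])
```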
To deepen the model, these two stages are alternated multiple times. This iterative alternation fosters increasingly rich interactions between feature-level and sequential representations, enabling the model to progressively refine its understanding of complex patterns. This mirrors the earlier discussion of depth: by dynamically exchanging information across the two stages, the model enhances its expressiveness and strengthens its ability to capture intricate dependencies, temporal relationships, and latent structures inherent in the data.
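The sketch below alternates the two stages over several layers in the spirit described above; `MixerBlock`, `depth`, and all sizes are illustrative choices rather than a definitive implementation, and transposes are used so that each stage always acts on the last tensor dimension.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One feature-stage + sequence-stage pair acting on a (T, D_low) tensor."""
    def __init__(self, T: int, D_low: int):
        super().__init__()
        self.feature_mlp = nn.Linear(D_low, D_low)  # mixes features at each time step
        self.seq_mlp = nn.Linear(T, T)              # mixes time steps for each feature

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.feature_mlp(z))         # (T, D_low): act along features
        z = torch.relu(self.seq_mlp(z.T)).T         # (D_low, T): act along time, then back
        return z

T, D, D_low, depth = 1000, 256, 32, 4               # illustrative sizes, not from the text

embed = nn.Linear(D, D_low)                          # initial feature reduction
blocks = nn.Sequential(*[MixerBlock(T, D_low) for _ in range(depth)])

x = torch.randn(T, D)
z = blocks(torch.relu(embed(x)))                     # alternated refinement, shape (T, D_low)
print(z.shape)                                       # torch.Size([1000, 32])
```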