The linear layer, essentially a matrix multiplication, is widely used in neural networks. Here are several scenarios demonstrating why linear layers may be beneficial. Note that we can only discuss potential functionalities that a linear layer might offer, not what it will offer, since neural networks are not inherently interpretable.

Aggregation Before Making a Decision

Matrix multiplication enables aggregating data. Imagine a time-sequenced feature set $X = [x^{(1)}, x^{(2)}, ..., x^{(T)}]$ with dimensions $(T, D)$, where each $x^{(t)}$ of dimension $(D)$ is a row (feature) vector at time $t$. A weight matrix $\mathbf{W}_1=\begin{bmatrix} w_{1}^{(1)} & w_{2}^{(1)} & \ldots & w_{D}^{(1)} \end{bmatrix}^T$ of size $(D,1)$ can condense these temporal features into a unified descriptor.

$$ Z = X @ \mathbf{W}_1=\begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \cdots & x_D^{(1)} \\ x_1^{(2)} & x_2^{(2)} & \cdots & x_D^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ x_1^{(T)} & x_2^{(T)} & \cdots & x_D^{(T)} \end{bmatrix}\begin{bmatrix} w_{1}^{(1)} \\ w_{2}^{(1)} \\ \vdots \\ w_{D}^{(1)} \end{bmatrix} $$

The product $Z = X @ \mathbf{W}_1$ results in a column vector where each element is the dot product of a row from $X$ and $\mathbf{W}_1$:

$$ Z = \begin{bmatrix}x^{(1)} @ \mathbf{W}_1 \\ x^{(2)} @ \mathbf{W}_1 \\ \vdots \\ x^{(T)} @ \mathbf{W}_1\end{bmatrix}= \begin{bmatrix}\sum_{i=1}^{D} x_i^{(1)} w_{i}^{(1)} \\ \sum_{i=1}^{D} x_i^{(2)} w_{i}^{(1)} \\ \vdots \\ \sum_{i=1}^{D} x_i^{(T)} w_{i}^{(1)}\end{bmatrix} $$

The vector $Z$ encapsulates the time series information into aggregated feature descriptors for each time step. To distill these descriptors into a single, comprehensive metric, we perform a secondary transformation by transposing $Z$ and multiplying it with a second weight matrix $\mathbf{W}_2=\begin{bmatrix} w_{1}^{(2)} & w_{2}^{(2)} & \ldots & w_{T}^{(2)} \end{bmatrix}^T$ of dimension $(T,1)$:

$$ \begin{aligned}h = Z^T @ \mathbf{W}_2&= \begin{bmatrix} \sum_{i=1}^{D} x_i^{(1)} w_{i}^{(1)} & \sum_{i=1}^{D} x_i^{(2)} w_{i}^{(1)} & \cdots & \sum_{i=1}^{D} x_i^{(T)} w_{i}^{(1)} \end{bmatrix}\begin{bmatrix} w_{1}^{(2)} \\ w_{2}^{(2)} \\ \vdots \\ w_{T}^{(2)} \end{bmatrix}\\&=\sum_{t=1}^{T} \left( \sum_{i=1}^{D} x_i^{(t)} w_{i}^{(1)} \right) w_{t}^{(2)} \end{aligned} $$

The resulting scalar $h$ aggregates the information across all time steps into a succinct summary, useful in applications such as sentiment analysis, where a single output must be derived from a sequence of inputs.
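As a concrete illustration, here is a minimal NumPy sketch of the two-stage aggregation described above. The shapes ($T=4$, $D=3$) and the random weights are arbitrary assumptions, chosen only to show how $Z = X @ \mathbf{W}_1$ and $h = Z^T @ \mathbf{W}_2$ collapse a $(T, D)$ sequence into a single scalar.

```python
import numpy as np

T, D = 4, 3                 # arbitrary sequence length and feature dimension
X = np.random.randn(T, D)   # time-sequenced features, one row per time step
W1 = np.random.randn(D, 1)  # first weight matrix, aggregates across features
W2 = np.random.randn(T, 1)  # second weight matrix, aggregates across time steps

Z = X @ W1                  # shape (T, 1): one aggregated descriptor per time step
h = Z.T @ W2                # shape (1, 1): a single scalar summary of the sequence

print(Z.shape, h.shape)     # (4, 1) (1, 1)
print(h.item())             # the aggregated value h
```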

Aggregation ≈ Signed Distance

In models like Logistic Regression or SVM, and also in the final layer of classification models, the linear layer plays a dual role. It not only categorizes samples into one of two distinct classes based on their position relative to a dividing plane, but it also quantifies the distance from this plane. This signed distance, $\frac{w \cdot x + b}{\|w\|}$ for the given $w$ and $b$ in the model, captures not just the side (sign) on which the sample lies, but also how far it is from the decision boundary, providing a measure of classification confidence.
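A small sketch of this idea, with made-up weights $w$, bias $b$, and sample points: the sign of the distance gives the predicted class, and its magnitude a notion of confidence.

```python
import numpy as np

w = np.array([2.0, -1.0])   # hypothetical learned weight vector
b = 0.5                     # hypothetical learned bias

def signed_distance(x, w, b):
    """Signed distance of x from the hyperplane w·x + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

for x in [np.array([1.0, 1.0]), np.array([-1.0, 2.0])]:
    d = signed_distance(x, w, b)
    label = 1 if d > 0 else 0   # side of the plane -> predicted class
    print(x, round(d, 3), label)
```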

Aggregation ≈ Similarity Measurement

If we examine the multiplication of a row vector (which can be treated as a matrix) by a column vector, this operation can be interpreted as the inner product, which is the sum of elementwise multiplications between the two vectors.

Within the framework of matrix multiplication in a linear layer, we focus on the input row vector $x$ and a specific column of the weight matrix $\mathbf{W}$, denoted $\mathbf{W}_{:, i}$. The dot product between $x$ and this column is computed as follows:

$$ x \cdot \mathbf{W}_{:, i} = \sum_{j} x_{j} \cdot W_{j, i} $$

This sum of element-wise products gives us a similarity measure between the input vector $x$ and the $i$-th feature captured by $\mathbf{W}_{:, i}$.
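A minimal sketch of this equivalence, with an arbitrary input $x$ and weight matrix $\mathbf{W}$: the $i$-th output of the linear layer is exactly the sum of element-wise products between $x$ and the $i$-th column of $\mathbf{W}$.

```python
import numpy as np

x = np.array([1.0, 2.0, -1.0])  # arbitrary input row vector, shape (D,)
W = np.random.randn(3, 4)       # arbitrary weight matrix, shape (D, n)

i = 2
via_sum = np.sum(x * W[:, i])   # sum of element-wise products with column i
via_layer = (x @ W)[i]          # the i-th output of the linear layer

print(np.isclose(via_sum, via_layer))  # True: both compute x · W[:, i]
```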

If we normalize both $x$ and $\mathbf{W}_{:, i}$ to unit vectors, their dot product directly gives us the cosine of the angle between them:

$$ \cos{\theta_{i}} = \frac{x \cdot \mathbf{W}_{:, i}}{\|x\| \, \|\mathbf{W}_{:, i}\|} = x \cdot \mathbf{W}_{:, i} \quad \text{when } \|x\| = \|\mathbf{W}_{:, i}\| = 1 $$

This formula shows that when $x$ and $\mathbf{W}_{:, i}$ are unit vectors, their dot product simplifies to just the cosine of the angle between them, providing a clear geometric interpretation of similarity.
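Continuing the sketch above (again with arbitrary $x$ and $\mathbf{W}$), normalizing both vectors makes the plain dot product coincide with the cosine similarity:

```python
import numpy as np

x = np.random.randn(3)   # arbitrary input vector
W = np.random.randn(3, 4)
i = 1

# General cosine similarity between x and the i-th column of W
cos_theta = (x @ W[:, i]) / (np.linalg.norm(x) * np.linalg.norm(W[:, i]))

# After normalizing to unit vectors, the plain dot product gives the same value
x_hat = x / np.linalg.norm(x)
w_hat = W[:, i] / np.linalg.norm(W[:, i])

print(np.isclose(cos_theta, x_hat @ w_hat))  # True
```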

Furthermore, let's extend this concept to measure the similarity of $x$ with every column in $\mathbf{W}$. The weight matrix $\mathbf{W}$ can be viewed as composed of column vectors $\mathbf{W}_{:, 1}, \mathbf{W}_{:, 2}, ..., \mathbf{W}_{:, n}$, each representing a different feature or aspect that the neural network is learning to detect or utilize: