Matrix multiplication is a fundamental operation that underlies many layers in neural networks.

Variations of this operation enable the network to capture both static and input-dependent relationships within the data.

Linear Transformation of Input

In a neural network's fully connected (dense) layer, the input undergoes a linear transformation of the form:

$$ \hat{\mathbf{y}}=\mathbf{W}\mathbf{x} + \mathbf{b} $$

where $\mathbf{W}$ is the weight matrix, $\mathbf{x}$ is the input vector, and $\mathbf{b}$ is the bias vector. This is a fundamental operation in deep learning that maps the input data into a new feature space. Let's express this using variables $\mathbf{W}$ and $\mathbf{x}$:

$$ \mathbf{W} = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix}, \quad \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} $$

The linear transformation becomes:

$$ \mathbf{W}\mathbf{x} + \mathbf{b} = \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} $$

Expanding this yields:

$$ \mathbf{W}\mathbf{x} + \mathbf{b} = \begin{bmatrix}w_{11}x_1 + w_{12}x_2 \\w_{21}x_1 + w_{22}x_2\end{bmatrix} + \begin{bmatrix}b_1 \\b_2\end{bmatrix} = \begin{bmatrix}w_{11}x_1 + w_{12}x_2 + b_1 \\w_{21}x_1 + w_{22}x_2 + b_2\end{bmatrix} $$

This linear transformation helps the model learn relationships between the input features. When combined with non-linear activation functions (e.g., ReLU), it enables the network to approximate more complex patterns and decision boundaries beyond just linear ones.
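As a quick sanity check, the symbolic expansion above can be reproduced numerically. The sketch below uses NumPy with arbitrary example values for $\mathbf{W}$, $\mathbf{x}$, and $\mathbf{b}$ (the specific numbers are illustrative, not from the text):

```python
import numpy as np

# Example 2x2 case matching the symbolic expansion above (values are arbitrary).
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # weight matrix W
x = np.array([0.5, -1.0])    # input vector x
b = np.array([0.1, 0.2])     # bias vector b

# Linear transformation: y_hat = Wx + b
y_hat = W @ x + b

# Element-wise check against the expanded form:
# [w11*x1 + w12*x2 + b1, w21*x1 + w22*x2 + b2]
expected = np.array([1.0 * 0.5 + 2.0 * (-1.0) + 0.1,
                     3.0 * 0.5 + 4.0 * (-1.0) + 0.2])
assert np.allclose(y_hat, expected)

# Applying a non-linear activation such as ReLU gives the layer's output.
a = np.maximum(y_hat, 0.0)
print(y_hat)  # [-1.4 -2.3]
print(a)      # [0. 0.]
```

The `@` operator performs the same row-by-column products written out in the expansion, so the two computations agree exactly.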


Attention Scheme using Adaptive Weights

In attention mechanisms, the adaptive adjustment of weights based on the input is a critical concept. Unlike static weights, attention layers compute weights $\mathbf{W}(\mathbf{x})$ and biases $\mathbf{B}(\mathbf{x})$ as functions of the input $\mathbf{x}$, so the resulting matrix multiplication is non-linear in $\mathbf{x}$.
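To make the contrast with static weights concrete, here is a minimal sketch of input-dependent weighting, assuming a single-query dot-product scheme (the shapes, the `softmax` helper, and the random example values are illustrative assumptions, not details from the text):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical example: 3 input vectors of dimension 2 and one query.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 2))   # inputs acting as keys/values
q = rng.normal(size=2)        # query derived from the current input

# Unlike a dense layer, the weights here are computed FROM the input
# rather than stored as fixed parameters: each score is the dot
# product of the query with one input row.
scores = X @ q
weights = softmax(scores)     # adaptive weights; change whenever X or q changes

# The output is a weighted combination of the inputs.
output = weights @ X
```

Because `weights` is recomputed for every input, the overall map from `X` to `output` is non-linear in the input, even though each individual step is a matrix multiplication.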

Consider a simplified attention mechanism of the following form: