Notation WARNING: The symbol $x$ denotes an input feature vector; when its length is one, $x$ reduces to a scalar. To refer to the $i$-th sample in the dataset, we write $x^{(i)}$; to refer to the $j$-th feature within a particular sample, we write $x_j$.
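
As a concrete illustration of this indexing convention, here is a minimal NumPy sketch with made-up values; note that the math notation is 1-based while NumPy arrays are 0-based:

```python
import numpy as np

# Hypothetical dataset: N = 3 samples, D = 2 features each.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

x_1 = X[0]      # x^(1): the first sample, a length-2 feature vector
x_1_2 = X[0, 1] # x_2 of the first sample: its second feature, a scalar

print(x_1)    # [1. 2.]
print(x_1_2)  # 2.0
```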

Linear Model is Matrix Multiplication

Linear models form the backbone of many machine learning algorithms. They operate through matrix multiplication: a weight matrix $\mathbf{W}$ multiplies an input vector $x$, and a bias scalar $b$ is then added to introduce an offset. This operation is compactly written as:

$$ z = \mathbf{W}x + b $$
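
To make the formula concrete, here is a minimal NumPy sketch (with arbitrarily chosen values) that evaluates $z = \mathbf{W}x + b$ for a model with 2 inputs and 2 outputs:

```python
import numpy as np

# Hypothetical weights and bias of a linear model with 2 inputs and 2 outputs.
W = np.array([[0.5, -1.0],
              [2.0,  0.3]])
b = 0.1  # scalar bias, broadcast over the output vector

x = np.array([1.0, 2.0])  # one input feature vector

z = W @ x + b  # matrix-vector product plus bias
print(z)       # [-1.4  2.7]
```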

2-by-2 Matrix Multiplication Example

Let the weight matrix $\mathbf{W}$ be represented as follows, where $w_{ij}$ are the elements of the matrix:

$$ \mathbf{W} = \begin{bmatrix}w_{11} & w_{12} \\w_{21} & w_{22} \\\end{bmatrix} $$

Let the input vector $x$ be:

$$ x = \begin{bmatrix}x_1 \\x_2 \\\end{bmatrix} $$

And let the bias $b$ be a scalar value.

The multiplication of a 2-by-2 matrix $\mathbf{W}$ and a column vector $x$ is computed as:

$$ \mathbf{W}x = \begin{bmatrix}w_{11} & w_{12} \\w_{21} & w_{22} \\\end{bmatrix}\begin{bmatrix}x_1 \\x_2 \\\end{bmatrix}=\begin{bmatrix}w_{11}x_1 + w_{12}x_2 \\w_{21}x_1 + w_{22}x_2 \\\end{bmatrix} $$

After calculating the matrix-vector product, we add the bias $b$ to each element of the resulting vector:

$$ z = \mathbf{W}x + b = \begin{bmatrix}w_{11}x_1 + w_{12}x_2 \\w_{21}x_1 + w_{22}x_2 \\\end{bmatrix} + b=\begin{bmatrix}w_{11}x_1 + w_{12}x_2 + b \\w_{21}x_1 + w_{22}x_2 + b \\\end{bmatrix} $$

In this final expression, $z$ is the output vector obtained by applying the weight matrix to the input and adding the bias $b$: a linear transformation followed by a bias addition, the core computation of linear models in machine learning.
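
The sketch below (arbitrary numbers, NumPy assumed) checks the element-by-element expansion above against the matrix-multiplication form; the two agree because broadcasting adds $b$ to each output component:

```python
import numpy as np

# Arbitrary values for the 2-by-2 example above.
w11, w12, w21, w22 = 1.0, 2.0, 3.0, 4.0
x1, x2 = 5.0, 6.0
b = 0.5

W = np.array([[w11, w12],
              [w21, w22]])
x = np.array([x1, x2])

# Expanded form, written out element by element as in the derivation.
z_manual = np.array([w11 * x1 + w12 * x2 + b,
                     w21 * x1 + w22 * x2 + b])

# Same computation via matrix multiplication; b is broadcast to each element.
z_matmul = W @ x + b

assert np.allclose(z_manual, z_matmul)
print(z_matmul)  # [17.5 39.5]
```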

Processing a 2D Array Using Right Multiplication

In common machine learning practice, particularly within neural networks, we frequently process several input feature samples $x^{(1)}, \dots, x^{(N)}$ at once. To handle these multiple inputs efficiently, they are stacked into a matrix $X$ that collects the features of all samples:

$$ X=\begin{bmatrix} \cdots x^{(1)} \cdots \\ \cdots x^{(2)} \cdots \\\vdots\\ \cdots x^{(N)} \cdots \end{bmatrix} $$

The matrix $X$ has shape $[N, D]$, where $N$ is the number of samples and $D$ is the number of features per sample. Each row of the matrix is the feature vector of one sample.
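
A minimal sketch of this layout, assuming NumPy and made-up values, where each row of `X` is one sample's feature vector:

```python
import numpy as np

# Three hypothetical samples with two features each.
X = np.stack([np.array([1.0, 2.0]),   # x^(1)
              np.array([3.0, 4.0]),   # x^(2)
              np.array([5.0, 6.0])])  # x^(3)

print(X.shape)  # (3, 2): N = 3 rows (samples), D = 2 columns (features)
```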

The standard way deep neural networks apply the weights to the batched inputs $X$ is right multiplication, as illustrated below: