Besides Boolean circuits, there's another perspective for understanding why neural networks work: template matching. Since template matching is built on the inner product, we first need to discuss what an inner product is. The inner product is a fundamental component of all linear models (such as Logistic Regression and SVM), appearing as $\mathbf{w} \cdot \mathbf{x}$.

Inner Product and Cosine Similarity

The inner product can be interpreted in many ways, but here we focus on its role as a measure of similarity. First, we need to understand cosine similarity:

$$ \cos{\theta} = \frac{\mathbf{x} \cdot \mathbf{w}}{|\mathbf{x}||\mathbf{w}|} $$

Cosine similarity measures the angle between two vectors. If $\theta=90^\circ$ ($\cos\theta = 0$), the vectors are perpendicular and share no similarity. If $\theta=0^\circ$ ($\cos\theta = 1$) or $\theta=180^\circ$ ($\cos\theta = -1$), the vectors point in the same or opposite directions, indicating strong similarity or strong inverse similarity.

[Figure: two vectors $\mathbf{x}$ and $\mathbf{w}$ with the angle $\theta$ between them]
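
As a quick check, here is a minimal NumPy sketch of the three cases above; the vectors are made up purely for illustration:

```python
import numpy as np

def cosine_similarity(x, w):
    """Cosine of the angle between two vectors."""
    return np.dot(x, w) / (np.linalg.norm(x) * np.linalg.norm(w))

x = np.array([1.0, 0.0])
print(cosine_similarity(x, np.array([2.0, 0.0])))   # same direction:  1.0
print(cosine_similarity(x, np.array([0.0, 3.0])))   # perpendicular:   0.0
print(cosine_similarity(x, np.array([-1.0, 0.0])))  # opposite:       -1.0
```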

Assuming both vectors have length 1 (i.e., $|\mathbf{x}|=1$ and $|\mathbf{w}|=1$), the inner product itself is exactly the cosine similarity. This is an idealized scenario, however: even if we normalize the input $\mathbf{x}$, we can't guarantee that the intermediate hidden activations or each layer's parameters $\mathbf{w}$ are normalized. Nonetheless, this suggests that even without normalization, the inner product reflects cosine similarity to some extent.
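
To see the equivalence concretely, here is a small sketch (with random vectors, purely for illustration) showing that once both vectors are scaled to unit length, the plain inner product and the cosine similarity coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
w = rng.normal(size=4)

cos_theta = np.dot(x, w) / (np.linalg.norm(x) * np.linalg.norm(w))

# Scale both vectors to unit length; their inner product is now cos(theta).
x_hat = x / np.linalg.norm(x)
w_hat = w / np.linalg.norm(w)

print(np.isclose(np.dot(x_hat, w_hat), cos_theta))  # True
```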

Inner Product and Template Matching

Now, let's discuss why the inner product can be used for template matching. Think of the parameter vectors $\mathbf{w}$ as templates. For example, consider two templates, $[w_{11}, w_{12}, w_{13}]$ and $[w_{14}, w_{15}, w_{16}]$, as shown below:

[Figure: a hidden layer whose two units use the templates $[w_{11}, w_{12}, w_{13}]$ and $[w_{14}, w_{15}, w_{16}]$ to match the upper and lower parts of the input]

For the upper part of the input region, we measure the similarity between $[x_{1}, x_{2}, x_{3}]$ and $[w_{11}, w_{12}, w_{13}]$. If they are similar, the inner product is large and the sigmoid outputs a value near 1; if not, the inner product is close to or below 0 and the sigmoid output is pushed toward 0. The process is the same for the lower part. The biases $b_{11}$ and $b_{12}$ shift the sigmoid, effectively setting the matching threshold (assume a bias of $-0.5$ for concreteness).
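
The following sketch mimics one such template neuron; the template, bias, and inputs are all invented values for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Invented template for the upper input region, kept near unit length.
w_template = np.array([1.0, 0.9, 0.8])
w_template /= np.linalg.norm(w_template)
b = -0.5  # the bias shifts the sigmoid, setting the matching threshold

x_match    = np.array([0.9, 0.8, 0.7])    # resembles the template
x_mismatch = np.array([-0.9, 0.1, -0.6])  # points away from the template

print(sigmoid(np.dot(x_match, w_template) + b))     # high activation (~0.71)
print(sigmoid(np.dot(x_mismatch, w_template) + b))  # low activation  (~0.21)
```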

We can also visualize this process using MNIST handwritten digits as a toy example (though it's not entirely accurate):

[Figure: matching an MNIST image of the digit 9 against learned templates of its lower and upper parts]

For the upper path, this means measuring the similarity between the original image and a learned template of the bottom part of the digit 9. For the lower path, it means measuring the similarity between the original image and another learned template, this one of the upper part of the digit 9.
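
One way to make this concrete: after training, reshape each row of the first-layer weight matrix back into a 28×28 image and look at it. The sketch below fakes the weights with random numbers (there is no trained model here), but the reshaping trick is the same one you would apply to real learned templates:

```python
import numpy as np
import matplotlib.pyplot as plt

# Stand-in for a trained first-layer weight matrix of shape
# (hidden_units, 784); with a real MNIST model, each row reshaped
# to 28x28 often resembles a stroke or part of a digit.
W = np.random.default_rng(0).normal(size=(2, 784))

fig, axes = plt.subplots(1, 2)
for ax, row in zip(axes, W):
    ax.imshow(row.reshape(28, 28), cmap="gray")
    ax.axis("off")
plt.show()
```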