Okay, some might argue that CNNs (Convolutional Neural Networks) inherently cannot handle image scaling and rotation. How so? Suppose every dog a CNN was trained on is roughly the same size; it can identify those as dogs. But what happens when you enlarge one of those images? Can the network still recognize the dog?

It might not be able to. You might wonder: why not? Aren't the shapes exactly the same? Is the CNN really that inept? Yes, it is. Although the two images look identical in shape, once you stretch them into vectors, the numerical values inside those vectors differ. So even though the shapes seem very similar to the human eye, to the CNN's network they are vastly different inputs. In practice, then, CNNs do not handle image scaling and rotation well: if a network has learned to recognize objects at one size, enlarging those objects can break recognition completely. CNNs are not as robust as one might think.

This is precisely why Data Augmentation is often necessary in image recognition tasks. Data Augmentation means taking small sections of each training image and enlarging them, so the CNN sees patterns at various sizes, and rotating the images, so the CNN learns what an object looks like after rotation. This yields better results, as sketched in the example below.
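For intuition, here is a minimal sketch of that kind of augmentation pipeline using torchvision. The specific crop size, scale range, and rotation angle are illustrative assumptions, not prescribed values:

```python
import torch
from torchvision import transforms

# Randomly crop a region, rescale it, and rotate it, so the CNN
# sees the same objects at many sizes and orientations.
augment = transforms.Compose([
    # Take a random section of the image and enlarge it to 224x224.
    transforms.RandomResizedCrop(size=224, scale=(0.5, 1.0)),
    # Rotate by a random angle in [-30, 30] degrees.
    transforms.RandomRotation(degrees=30),
    # Convert the PIL image to a tensor for the network.
    transforms.ToTensor(),
])

# Usage (assuming a hypothetical file "dog.png"):
# from PIL import Image
# img = Image.open("dog.png")
# x = augment(img)  # a (3, 224, 224) tensor, randomly cropped and rotated
```

Because a fresh random crop and rotation are drawn every epoch, the network effectively trains on many size/orientation variants of each image.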

If CNNs cannot handle scaling and rotation, is there a network architecture that can? Indeed, there is. One such architecture is the Spatial Transformer Layer: https://arxiv.org/pdf/1506.02025.pdf
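The idea is that a small "localisation" network looks at the input, predicts an affine transform (which can express scaling and rotation), and resamples the image accordingly before the rest of the CNN sees it. Below is a minimal PyTorch sketch of this idea; the layer sizes and the localisation network's structure are my own illustrative assumptions, not the exact configuration from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """A small localisation net predicts a 2x3 affine matrix,
    which is used to warp the input before further processing."""
    def __init__(self):
        super().__init__()
        # Localisation network: outputs the 6 affine parameters.
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
            nn.Linear(8 * 4 * 4, 6),
        )
        # Initialise to the identity transform, so training
        # starts from "do nothing to the image".
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)           # affine parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)  # warped image
```

Because the grid sampling is differentiable, the localisation network is trained end to end with the rest of the CNN and can learn to undo scaling and rotation on its own.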