Convolutional layers replace the dense weight matrix of a fully connected layer with a small filter that is slid across the input. The same filter weights are applied at every spatial location.
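To make the sliding-filter idea concrete, here is a minimal sketch in plain Python. The function name `conv2d` and the toy `image` and `kernel` values are illustrative assumptions; the sketch handles a single channel with no padding and stride 1, and like most deep learning libraries it computes a cross-correlation (the kernel is not flipped).

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (both 2D lists), no padding, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The same kernel weights are reused at every position (i, j).
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical toy data, chosen only for illustration.
image = [
    [1, 2, 3, 0],
    [4, 5, 6, 0],
    [7, 8, 9, 0],
]
kernel = [
    [1, 0],
    [0, -1],
]
result = conv2d(image, kernel)
print(result)  # [[-4, -4, 3], [-4, -4, 6]]
```

The layer has only four weights regardless of how large the image is; a real layer would add channels, a bias, padding, and a stride.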
Two key inductive biases
1. Translation equivariance: a cat is a cat whether it's in the top-left or bottom-right of the image. Sharing weights across positions means that when the input shifts, the network's response shifts with it.
2. Locality: nearby pixels matter more for recognizing a feature than distant ones. Filters are small (typically 3×3), so each output depends on a small patch.
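Translation equivariance can be checked directly on a toy example. The sketch below uses a 1D signal for brevity; `conv1d`, `shift_right`, and the sample values are assumptions made for illustration. Convolving a shifted signal should give the same result as shifting the convolved signal, up to boundary effects (the example keeps the nonzero values away from the edges).

```python
def conv1d(signal, kernel):
    """Valid 1D cross-correlation with stride 1."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def shift_right(signal, steps):
    """Shift right by `steps`, zero-padding on the left and dropping the tail."""
    return [0] * steps + signal[: len(signal) - steps]


kernel = [1, -1]                     # a simple edge detector
signal = [0, 0, 3, 3, 0, 0, 0, 0]    # a bump near the left

original = conv1d(signal, kernel)
shifted = conv1d(shift_right(signal, 2), kernel)

# Shifting the input by 2 shifts the response by 2.
print(shifted == shift_right(original, 2))  # True
```

The kernel also looks at only two neighbouring samples at a time, which is the locality prior in miniature.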
Because weights are shared and filters are small, a CNN has orders of magnitude fewer parameters than a fully connected network on the same input.
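A quick back-of-the-envelope count shows the size of the gap. The layer sizes below (a 32×32 RGB input mapped to 16 feature maps of the same spatial size) are an assumed example, not figures from any particular network, and the helper names are hypothetical.

```python
def dense_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a conv layer with square kernels."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


# Assumed example: 32x32 RGB input, 16 output channels at the same resolution.
height, width, in_channels, out_channels = 32, 32, 3, 16

dense = dense_params(height * width * in_channels, height * width * out_channels)
conv = conv_params(in_channels, out_channels, kernel_size=3)

print(dense)          # 50348032
print(conv)           # 448
print(dense // conv)  # 112384
```

The conv layer's count does not depend on `height` or `width` at all, which is why the gap widens further on larger images.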
Receptive field
The receptive field of an output neuron is the input region it depends on. Stacking stride-1 convolutions grows the receptive field linearly with depth. With repeated downsampling (strided convs or pooling), it grows exponentially, so by the end of a deep CNN each output neuron can "see" most of the input image.
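The linear versus exponential growth can be verified with the standard receptive-field recurrence: each layer adds `(kernel_size - 1) * jump` input pixels, where `jump` is the spacing, in input pixels, between neighbouring units of the current layer. The sketch below is a minimal version assuming one spatial dimension and no dilation; `receptive_field` is a hypothetical helper name.

```python
def receptive_field(layers):
    """Receptive field size for `layers`, a list of (kernel_size, stride) pairs
    ordered from input to output."""
    rf = 1    # a single input pixel sees itself
    jump = 1  # input-pixel distance between adjacent units at the current layer
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# 3-wide kernels, depth 1 to 5.
stride1 = [receptive_field([(3, 1)] * n) for n in range(1, 6)]
stride2 = [receptive_field([(3, 2)] * n) for n in range(1, 6)]

print(stride1)  # [3, 5, 7, 9, 11]    grows by 2 per layer
print(stride2)  # [3, 7, 15, 31, 63]  roughly doubles per layer
```

Pooling layers can be passed in the same way, as a `(kernel_size, stride)` pair such as `(2, 2)`.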
Architectural milestones
- LeNet (1998): first practical CNN, digit recognition.
- AlexNet (2012): kicked off the deep learning revolution. ReLU + dropout + GPU training.
- VGG (2014): deep, uniform stacks of 3×3 convs, showing that depth alone buys accuracy.
- ResNet (2015): skip connections let you train networks with 100+ layers. Single most important architectural idea after the conv itself.
- EfficientNet (2019): principled scaling of depth/width/resolution. The standard "what should I scale?" baseline.
- ConvNeXt (2022): modernized CNNs match vision transformers when given the same training recipe — a good reminder that the architecture is less important than people sometimes claim.
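The skip-connection idea from the ResNet entry fits in a few lines. The following is a toy sketch, not ResNet's actual block (which wraps convolutions, batch normalization, and ReLU): `residual_block`, `stack`, and `zero_transform` are hypothetical names, and the "layer" is a stand-in that outputs zeros, mimicking a branch that has not learned anything yet.

```python
def residual_block(x, transform):
    """Return x + transform(x), elementwise, for a list of floats."""
    fx = transform(x)
    return [a + b for a, b in zip(x, fx)]


def zero_transform(x):
    """Stand-in for a learned branch that currently outputs zeros."""
    return [0.0] * len(x)


def stack(x, transform, depth, residual):
    """Apply `transform` `depth` times, with or without the skip connection."""
    for _ in range(depth):
        x = residual_block(x, transform) if residual else transform(x)
    return x


x = [1.0, -2.0, 0.5]
with_skip = stack(x, zero_transform, depth=100, residual=True)
without_skip = stack(x, zero_transform, depth=100, residual=False)

print(with_skip == x)  # True: the input passes through 100 blocks unchanged
print(without_skip)    # [0.0, 0.0, 0.0]: the signal is lost
```

With the skip path, each block only has to learn a correction to its input, and the identity route keeps the signal intact through very deep stacks.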
CNN vs ViT in 2026
Vision transformers won on big datasets; CNNs are still better at small-to-medium data, faster on edge devices, and easier to train. Most production systems use a CNN backbone unless the task specifically demands global attention.