Convolutional Networks — Section 5: Deep Learning Architectures

Convolutional layers replace fully connected layers with a small filter that's slid across the input. The same filter weights are applied at every spatial location.
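
As a minimal sketch of the sliding-filter idea, the pure-Python function below applies one small kernel at every valid position of a tiny single-channel image. The name `conv2d_valid` and the example values are illustrative assumptions, not part of any library. Like most deep learning frameworks, it computes a cross-correlation (the kernel is not flipped).

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists), stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # The same kernel weights are reused at every position (i, j).
            acc = 0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 3x3 input and 2x2 filter, chosen only for illustration.
image = [
    [1, 2, 0],
    [0, 1, 3],
    [4, 0, 1],
]
kernel = [
    [1, 2],
    [0, 1],
]
feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[6, 5], [2, 8]]
```

The layer has only four weights here, no matter how large the image is. A real layer would add channels, a bias, and usually padding.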

Two key inductive biases

1. Translation equivariance: a cat is a cat whether it's in the top-left or bottom-right of the image. Sharing weights across positions makes the network's response shift when the input shifts.
2. Locality: nearby pixels matter more for recognizing a feature than distant ones. Filters are small (typically $3 \times 3$) so each output depends on a small patch.
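
Translation equivariance can be checked numerically. The sketch below uses a 1D signal for brevity; `conv1d_valid` and `shift_right` are hypothetical helpers written for this example. It assumes the pattern stays away from the borders, where zero padding would otherwise break the exact equality.

```python
def conv1d_valid(signal, kernel):
    """Apply `kernel` at every valid position of `signal` (stride 1, no padding)."""
    k = len(kernel)
    return [
        sum(signal[i + d] * kernel[d] for d in range(k))
        for i in range(len(signal) - k + 1)
    ]


def shift_right(values, steps):
    """Shift a list right by `steps`, zero-filling the left and dropping the tail."""
    return [0] * steps + values[:len(values) - steps]


signal = [0, 0, 1, 2, 1, 0, 0, 0]   # a small bump surrounded by zeros
kernel = [1, 0, -1]                 # a local filter that sees 3 samples

response = conv1d_valid(signal, kernel)
shifted_response = conv1d_valid(shift_right(signal, 2), kernel)

# Shifting the input by 2 shifts the response by 2: equivariance.
print(shifted_response == shift_right(response, 2))  # True
```

The filter also illustrates locality: each output value depends on only three neighbouring input samples.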

These priors mean a CNN has orders of magnitude fewer parameters than a fully connected network on the same input.
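
To make the parameter gap concrete, here is a back-of-the-envelope count for a single layer. The sizes are assumptions chosen for illustration: a $32 \times 32$ RGB input mapped to a $32 \times 32$ output with 16 channels.

```python
def dense_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus biases of a square-kernel convolutional layer."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


# Assumed sizes: 32x32x3 input, 32x32x16 output.
in_features = 32 * 32 * 3     # 3072 input values
out_features = 32 * 32 * 16   # 16384 output values

dense = dense_params(in_features, out_features)
conv = conv_params(3, 3, 16)  # 3x3 kernel, 3 input channels, 16 filters

print(dense)          # 50348032
print(conv)           # 448
print(dense // conv)  # 112384
```

Under these assumptions the convolutional layer uses about five orders of magnitude fewer parameters, and its count does not change if the image gets larger.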

Receptive field

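The receptive field of an output neuron is the input region it depends on. Stacking $n$ stride-1 $k \times k$ convolutions gives a receptive field of $1 + n(k - 1)$ pixels per side, so it grows linearly with depth. With downsampling (strided convs or pooling), each later layer's contribution is multiplied by the accumulated stride, so the receptive field grows roughly exponentially with the number of downsampling stages. By the end of a deep CNN, each output neuron can "see" most of the input image.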
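
The growth rates can be checked with the standard recurrence: each layer adds $(k - 1)$ times the product of all earlier strides. The function below is a minimal sketch that assumes square kernels and ignores dilation and border effects; `receptive_field` is a name chosen for this example.

```python
def receptive_field(layers):
    """Receptive field (pixels per side) of a stack of conv/pool layers.

    `layers` is a list of (kernel_size, stride) tuples, ordered input to output.
    """
    rf = 1    # a single input pixel sees itself
    jump = 1  # distance in input pixels between adjacent outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


plain = receptive_field([(3, 1)] * 5)    # five 3x3 convs, stride 1
strided = receptive_field([(3, 2)] * 5)  # five 3x3 convs, stride 2

print(plain)    # 11
print(strided)  # 63
```

With stride 1 the five layers reach 11 pixels, matching $1 + n(k - 1)$. With stride 2 at every layer the same depth reaches 63 pixels, roughly doubling per layer.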

Architectural milestones

LeNet (1998): first practical CNN, digit recognition.
AlexNet (2012): kicked off the deep learning revolution. ReLU + dropout + GPU training.
VGG (2014): pure stacks of $3 \times 3$ convs.
ResNet (2015): skip connections let you train networks with 100+ layers. Single most important architectural idea after the conv itself.
EfficientNet (2019): principled scaling of depth/width/resolution. The standard "what should I scale?" baseline.
ConvNeXt (2022): modernized CNNs match vision transformers when given the same training recipe — a good reminder that the architecture is less important than people sometimes claim.

CNN vs ViT in 2026

Vision transformers won on big datasets; CNNs are still better at small-to-medium data, faster on edge devices, and easier to train. Most production systems use a CNN backbone unless the task specifically demands global attention.
