A multi-layer perceptron (MLP) is the simplest neural network: a stack of linear layers separated by nonlinear activations.
Why nonlinearity is non-negotiable
Compose two linear layers — the result is still a linear function of , just with a different weight matrix. Without nonlinear , depth gives you nothing. The activation breaks linearity and lets the network represent functions that aren't expressible as a single matrix multiplication.
Universal approximation
Cybenko (1989) showed an MLP with one hidden layer of sufficient width can approximate any continuous function on a compact set to arbitrary accuracy. This sounds great until you ask "how wide?" — the answer can be exponential in input dimension. Depth changes the constants dramatically. Multi-layer networks express many functions with polynomial width that single-hidden-layer networks would need exponential width to represent.
In practice you don't use a 1-hidden-layer net for anything serious. You use depth because depth is a sample-efficient way to compose features.
When MLPs are the right tool
- Tabular data after embedding categorical features (often loses to GBMs unless data is huge)
- Output heads on top of pretrained encoders (the last few layers of almost every modern network)
- Implicit neural representations (function fitting in low dimensions)
When they're wrong
- Raw image input — use CNNs
- Sequence input — use transformers or RNNs
- Any data with strong structural priors (graphs, point clouds, sets) — use a specialized architecture
The MLP is a building block, not usually the whole architecture.