Multi-Layer Perceptrons — Section 4: Neural Networks

A multi-layer perceptron (MLP) is the simplest neural network: a stack of linear layers separated by nonlinear activations.

$$h_1 = \phi(W_1 x + b_1), \quad h_2 = \phi(W_2 h_1 + b_2), \quad \dots, \quad \hat{y} = W_L h_{L-1} + b_L$$
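
The recursion above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `mlp_forward`, the layer sizes, the random initialization, and the choice of tanh for $\phi$ are all assumptions made for the example.

```python
import numpy as np

def mlp_forward(x, weights, biases, phi=np.tanh):
    """Apply phi(W h + b) for each hidden layer, then a final linear layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = phi(W @ h + b)                 # h_k = phi(W_k h_{k-1} + b_k)
    return weights[-1] @ h + biases[-1]    # y_hat = W_L h_{L-1} + b_L (no activation)

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]  # input dim, two hidden widths, output dim (illustrative)
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

x = rng.normal(size=sizes[0])
y_hat = mlp_forward(x, weights, biases)
print(y_hat.shape)  # (2,)
```

Note that the last layer is left linear, matching the formula for $\hat{y}$; a task-specific output nonlinearity (softmax, sigmoid) would be applied on top if needed.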

Why nonlinearity is non-negotiable

Compose two linear layers and the result is still a linear (strictly, affine) function of $x$: $W_2 (W_1 x + b_1) + b_2 = (W_2 W_1) x + (W_2 b_1 + b_2)$, which is a single layer with a different weight matrix and bias. Without a nonlinear $\phi$, depth gives you nothing. The activation breaks linearity and lets the network represent functions that cannot be expressed as a single matrix multiplication.
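
A quick numerical check of this collapse, as a sketch with arbitrary random weights (the shapes and the use of tanh as the nonlinearity are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(2, 5)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked layers with no activation in between...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...equal one layer with merged weights and bias.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b
print(np.allclose(two_layers, one_layer))  # True

# Inserting a nonlinearity breaks the equivalence with that merged layer.
with_tanh = W2 @ np.tanh(W1 @ x + b1) + b2
print(np.allclose(with_tanh, one_layer))
```

The first comparison holds for any weights, because it is just associativity of matrix multiplication. The second depends on the particular values, but for generic weights the two outputs differ.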

Universal approximation

Cybenko (1989) showed that an MLP with one hidden layer of sigmoidal units and sufficient width can approximate any continuous function on a compact set to arbitrary accuracy. This sounds great until you ask "how wide?": the answer can be exponential in the input dimension. Depth changes the scaling dramatically, not just the constants: multi-layer networks can express, with polynomial width, many functions that single-hidden-layer networks would need exponential width to represent.
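
To get a feel for the "sufficient width" part in one dimension, here is a rough sketch that fits $\sin(2\pi x)$ on $[0, 1]$ with a single tanh hidden layer of increasing width. It is a deliberate simplification: the hidden weights are drawn at random and frozen, and only the output layer is solved by least squares, so this is a random-feature fit rather than a trained MLP. The target function, the sampling ranges, and the widths are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 400)          # compact set: the interval [0, 1]
target = np.sin(2 * np.pi * x)

max_width = 64
# Random, frozen hidden units tanh(slope * (x - center)).
centers = rng.uniform(0.0, 1.0, size=max_width)
slopes = rng.uniform(2.0, 10.0, size=max_width) * rng.choice([-1.0, 1.0], size=max_width)

def fit_rmse(width):
    """Least-squares fit of the output layer using the first `width` hidden units."""
    hidden = np.tanh(slopes[:width] * (x[:, None] - centers[:width]))  # (400, width)
    design = np.column_stack([hidden, np.ones_like(x)])                # add output bias
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(np.sqrt(np.mean((design @ coef - target) ** 2)))

errors = {width: fit_rmse(width) for width in (2, 8, 64)}
for width, rmse in errors.items():
    print(f"width={width:3d}  rmse={rmse:.2e}")
```

Because each wider model reuses the narrower model's hidden units, the fit error on the grid cannot get worse as width grows. This says nothing about how width must scale in high dimensions, which is where the exponential blow-up appears.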

In practice you don't use a 1-hidden-layer net for anything serious. You use depth because depth is a sample-efficient way to compose features.

When MLPs are the right tool

Tabular data after embedding categorical features (though gradient-boosted trees, GBMs, often win unless the dataset is very large)
Output heads on top of pretrained encoders (the last few layers of almost every modern network)
Implicit neural representations (function fitting in low dimensions)
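
As one illustration of the tabular case, the sketch below embeds a single categorical column, concatenates it with numeric columns, and feeds the result through a small MLP. Everything here is hypothetical: the feature counts, the embedding size, the ReLU activation, and the untrained random weights are assumptions chosen only to show the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)

n_categories, embed_dim = 10, 4        # hypothetical categorical feature
n_numeric, hidden, n_out = 3, 16, 1    # hypothetical sizes

embedding = rng.normal(scale=0.1, size=(n_categories, embed_dim))  # lookup table
W1 = rng.normal(scale=0.1, size=(hidden, embed_dim + n_numeric))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(n_out, hidden))
b2 = np.zeros(n_out)

def predict(category_ids, numeric):
    """Embed category ids, concatenate with numeric columns, apply a 2-layer MLP."""
    emb = embedding[category_ids]                      # (batch, embed_dim)
    features = np.concatenate([emb, numeric], axis=1)  # (batch, embed_dim + n_numeric)
    h = np.maximum(0.0, features @ W1.T + b1)          # ReLU hidden layer
    return h @ W2.T + b2                               # linear output

category_ids = np.array([0, 3, 9])
numeric = rng.normal(size=(3, n_numeric))
print(predict(category_ids, numeric).shape)  # (3, 1)
```

An output head on a pretrained encoder has the same shape: replace `features` with the encoder's output vector and keep the MLP.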

When they're wrong

Raw image input — use CNNs
Sequence input — use transformers or RNNs
Any data with strong structural priors (graphs, point clouds, sets) — use a specialized architecture
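
One way to see the cost of ignoring structure is to count parameters. The arithmetic below compares a single dense layer on a flattened image with a single convolutional layer. The image size, hidden width, kernel size, and channel count are illustrative assumptions, not recommendations.

```python
# Dense layer on a flattened RGB image (sizes are illustrative).
height, width, channels = 224, 224, 3
hidden_units = 1000
dense_params = height * width * channels * hidden_units + hidden_units  # weights + biases

# One 3x3 convolution with 64 output channels on the same image.
kernel, out_channels = 3, 64
conv_params = kernel * kernel * channels * out_channels + out_channels  # weights + biases

print(dense_params)  # 150529000
print(conv_params)   # 1792
```

The convolution shares its small kernel across every spatial position, which is exactly the structural prior the dense layer lacks and would have to learn from data.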

The MLP is a building block, not usually the whole architecture.