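For $K$ classes, model $P(y = k \mid x) \propto \exp(w_k^\top x)$ and normalize:
$$P(y = k \mid x) = \frac{\exp(w_k^\top x)}{\sum_{j=1}^{K} \exp(w_j^\top x)}$$
This is the softmax function. It reduces to logistic regression when $K = 2$.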
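The two-class reduction can be checked numerically. The sketch below assumes a linear score per class (a dot product of a weight vector with the features); the function names and the example numbers are illustrative, not taken from any library.

```python
import math

def softmax_probs(weights, x):
    """Class probabilities for a linear softmax model.

    weights: list of K weight vectors; x: feature vector (both plain lists).
    """
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    exps = [math.exp(s) for s in scores]  # naive form, fine for small scores
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Two classes with made-up weights and features.
w1, w2 = [0.5, -1.0], [-0.25, 0.75]
x = [2.0, 1.0]

p = softmax_probs([w1, w2], x)

# With K = 2, P(y = 1 | x) is a sigmoid of the score difference (w1 - w2) . x
diff = sum((a - b) * x_i for a, b, x_i in zip(w1, w2, x))
print(p[0], sigmoid(diff))  # the two values agree up to rounding
```

Only the difference of the two weight vectors matters here, which is why binary logistic regression gets by with a single weight vector.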
Loss
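Categorical cross-entropy:
$$L = -\sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{1}[y_i = k] \, \log P(y = k \mid x_i)$$
Convex in each $w_k$ (and jointly in all of them), same story as binary logistic.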
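As a small illustration of the loss, the sketch below sums the negative log-probability assigned to each example's true class. The probability table and labels are made-up values, and `cross_entropy` is a hypothetical helper name.

```python
import math

def cross_entropy(probs, labels):
    """Categorical cross-entropy summed over examples.

    probs[i][k] is the predicted probability of class k for example i;
    labels[i] is the index of the true class for example i.
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, labels))

# Two examples, three classes (illustrative numbers only).
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
labels = [0, 1]

loss = cross_entropy(probs, labels)

# Baseline: a model that always predicts the uniform distribution.
uniform_loss = cross_entropy([[1 / 3, 1 / 3, 1 / 3]] * 2, labels)
print(loss, uniform_loss)
```

Only the probability of the true class enters the sum, because the indicator zeroes out every other term.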
The redundancy
The model has $K$ weight vectors $w_1, \dots, w_K$ but only $K - 1$ are identifiable: adding the same vector $c$ to every $w_k$ doesn't change any probability (the common factor $\exp(c^\top x)$ cancels in normalization). Conventions:
- Fix one class as the reference (say $w_K = 0$). Standard in statistics.
- Train all $K$ with regularization that breaks the symmetry. Standard in deep learning: every output unit gets a separate weight vector, and weight decay handles the redundancy.
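The redundancy is easy to see numerically. This sketch shifts every weight vector by the same arbitrary vector, and separately applies the reference-class convention by subtracting the last class's vector from all of them; the weights, features, and shift are made-up values.

```python
import math

def softmax_probs(weights, x):
    """Class probabilities for a linear softmax model (plain lists)."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = [[0.5, -1.0], [-0.25, 0.75], [1.5, 0.0]]  # K = 3 classes
x = [2.0, 1.0]
c = [3.0, -2.0]  # arbitrary shift applied to every class

# Add the same vector to every class's weights.
shifted = [[w_i + c_i for w_i, c_i in zip(w, c)] for w in weights]

# Reference-class convention: subtract the last vector so it becomes zero.
reference = [[w_i - r_i for w_i, r_i in zip(w, weights[-1])] for w in weights]

p_original = softmax_probs(weights, x)
p_shifted = softmax_probs(shifted, x)
p_reference = softmax_probs(reference, x)
print(p_original, p_shifted, p_reference)  # identical up to rounding
```

All three parameterizations describe the same model, so the data alone cannot pick one; a convention or a regularizer has to.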
Numerical stability
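Compute softmax with the logsumexp trick: subtract $\max_j z_j$ from every logit $z_k$ before exponentiating. The shift cancels in the normalization, so the probabilities are unchanged, and the largest exponent becomes 0. Otherwise large inputs cause overflow. Mainstream ML frameworks do this internally, but understand it because the same trick shows up in attention mechanisms, beam search, and HMM forward-backward.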
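A minimal sketch of the trick in plain Python, with a naive version for contrast. The function names are illustrative; in practice a library routine would normally be used instead.

```python
import math

def logsumexp(z):
    """log(sum(exp(z))) computed without overflow by factoring out max(z)."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def stable_softmax(z):
    """Softmax via exp(z_k - logsumexp(z)); every exponent is <= 0."""
    lse = logsumexp(z)
    return [math.exp(v - lse) for v in z]

def naive_softmax(z):
    """Direct formula; math.exp overflows for large inputs."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

big = [1000.0, 1001.0, 1002.0]
p = stable_softmax(big)

# The naive version fails on the same input: math.exp(1000.0) raises.
try:
    naive_softmax(big)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print(p, naive_overflowed)
```

Because only differences between logits matter, `stable_softmax(big)` gives the same probabilities as the logits `[0.0, 1.0, 2.0]` would.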
When softmax is wrong
When labels can co-occur. Multi-label classification (a movie is both "comedy" AND "romance") needs independent sigmoids with binary cross-entropy, not a softmax — softmax forces probabilities to sum to 1 across classes.
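To make the contrast concrete, the sketch below scores one hypothetical movie against three genres. The logits and targets are invented for illustration, and the helper names are not from any particular library.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def softmax(z):
    m = max(z)  # shift by the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def binary_cross_entropy(logits, targets):
    """Sum of per-label binary cross-entropies; targets are 0/1, several may be 1."""
    loss = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        loss -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return loss

# Hypothetical genre logits: comedy, romance, horror.
logits = [3.0, 2.5, -4.0]
targets = [1, 1, 0]  # the movie is both a comedy and a romance

independent = [sigmoid(z) for z in logits]  # each label judged on its own
coupled = softmax(logits)                   # labels compete for a total of 1

print(independent, coupled, binary_cross_entropy(logits, targets))
```

With independent sigmoids both "comedy" and "romance" can be confidently on at once, while under softmax at most one class can exceed 0.5.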