Softmax for Multiclass — Section 2: Linear Models

For $K$ classes, model $P(y=k | x) \propto \exp(\beta_k^T x)$ and normalize:

$$P(y=k|x) = \frac{e^{\beta_k^T x}}{\sum_{j=1}^K e^{\beta_j^T x}}$$

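This is the softmax function applied to the scores $\beta_k^T x$. It reduces to logistic regression when $K = 2$: dividing numerator and denominator by $e^{\beta_2^T x}$ gives $P(y=1|x) = \sigma((\beta_1 - \beta_2)^T x)$, where $\sigma$ is the logistic sigmoid.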
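A minimal sketch of the formula above in plain Python, checking the two-class case numerically. The score values are made up for illustration, and this naive version has no overflow guard.

```python
import math

def softmax(scores):
    """Exponentiate each score and normalize so the results sum to 1 (naive form)."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic sigmoid used by binary logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical scores beta_1^T x and beta_2^T x for a K = 2 problem.
s1, s2 = 1.3, -0.4
p = softmax([s1, s2])

# With two classes, P(y=1|x) depends only on the score difference.
assert abs(p[0] - sigmoid(s1 - s2)) < 1e-12
```

Only the difference $\beta_1 - \beta_2$ matters here, which is why binary logistic regression gets by with a single weight vector.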

Loss

Categorical cross-entropy, where $\hat{p}_{ik}$ is the predicted $P(y=k|x_i)$ for example $i$:

$$L = -\sum_i \sum_k \mathbb{1}[y_i = k] \log \hat{p}_{ik} = -\sum_i \log \hat{p}_{i, y_i}$$

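The loss is jointly convex in all the $\beta_k$, as in binary logistic regression, so any local minimum is global. Without a constraint or penalty it is not strictly convex, so the minimizer is not unique.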
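The second form of the loss says that only the probability assigned to the true class matters for each example. A small sketch, assuming scores are given as one list per example and labels are 0-based class indices (the notes index classes from 1); the function names are illustrative.

```python
import math

def softmax(scores):
    """Max-shifted softmax over one example's scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(score_rows, labels):
    """Sum over examples of -log p_hat[i][y_i]."""
    loss = 0.0
    for scores, y in zip(score_rows, labels):
        loss -= math.log(softmax(scores)[y])
    return loss

# Sanity check: equal scores give uniform probabilities 1/K,
# so each example contributes log(K) regardless of its label.
uniform_loss = cross_entropy([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0, 2])
assert abs(uniform_loss - 2 * math.log(3)) < 1e-12
```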

The redundancy

The model has $K$ vectors $\beta_k$ but only $K-1$ are identifiable: adding the same vector $c$ to every $\beta_k$ leaves every probability unchanged, because the common factor $e^{c^T x}$ cancels between numerator and denominator. Conventions:

Fix one class as the reference (say $\beta_K = 0$). Standard in statistics.
Train all $K$ with regularization that breaks the symmetry. Standard in deep learning: every output unit gets its own weight vector, and weight decay picks one solution out of the equivalent ones.
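The invariance behind both conventions can be checked numerically. In this sketch the weights, the input, and the shift vector `c` are arbitrary made-up values; subtracting the last class's vector from all of them produces the reference-class parameterization with that vector equal to zero.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict(betas, x):
    """Softmax probabilities for one input x given a list of K weight vectors."""
    scores = [dot(b, x) for b in betas]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

betas = [[0.5, -1.0], [2.0, 0.3], [-0.7, 1.1]]  # hypothetical K = 3 weights
x = [1.5, -2.0]                                  # hypothetical input
c = [10.0, -3.0]                                 # arbitrary shift vector

# Adding c to every beta_k changes no probability.
shifted = [[bi + ci for bi, ci in zip(b, c)] for b in betas]

# Reference-class form: subtract the last vector so it becomes exactly zero.
referenced = [[bi - ri for bi, ri in zip(b, betas[-1])] for b in betas]

base = predict(betas, x)
assert all(abs(p - q) < 1e-9 for p, q in zip(base, predict(shifted, x)))
assert all(abs(p - q) < 1e-9 for p, q in zip(base, predict(referenced, x)))
```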

Numerical stability

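Compute softmax with the max-subtraction trick, the same shift that underlies logsumexp: subtract $\max_j \beta_j^T x$ from every $\beta_k^T x$ before exponentiating. The probabilities are unchanged because the shift cancels in the normalization, and the largest exponent becomes 0, so no term can overflow. Without the shift, large scores overflow. Mainstream ML frameworks do this internally, but it is worth understanding because the same trick shows up in attention mechanisms, beam search, and HMM forward-backward.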
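A stdlib-only sketch of the shift, with function names chosen here for illustration. Frameworks implement the same idea on arrays; this version only shows the arithmetic.

```python
import math

def logsumexp(scores):
    """log(sum_j exp(s_j)), computed as m + log(sum_j exp(s_j - m)) with m = max."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def stable_softmax(scores):
    """Softmax with the maximum subtracted first, so every exponent is <= 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(scores):
    """Log-probabilities without ever forming a huge exponential."""
    lse = logsumexp(scores)
    return [s - lse for s in scores]

big = [1000.0, 1001.0, 1002.0]

# The naive route fails: math.exp(1000.0) raises OverflowError.
try:
    math.exp(big[0])
    naive_ok = True
except OverflowError:
    naive_ok = False

# The shifted version is fine and matches the same scores moved down by 1000.
probs = stable_softmax(big)
```

Working with `log_softmax` directly is usually preferable when the next step is a cross-entropy loss, since it avoids taking the log of a probability that has underflowed to 0.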

When softmax is wrong

When labels can co-occur. Multi-label classification (a movie is both "comedy" AND "romance") needs $K$ independent sigmoids with binary cross-entropy, not a softmax — softmax forces probabilities to sum to 1 across classes.
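To make the contrast concrete, here is a sketch of the independent-sigmoid setup with binary cross-entropy. The three genre labels and their scores are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(scores, targets):
    """Sum of per-label losses; targets are 0/1 and any number of them may be 1."""
    loss = 0.0
    for s, t in zip(scores, targets):
        p = sigmoid(s)
        loss -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return loss

# Hypothetical scores for the labels comedy, romance, horror.
scores = [3.0, 2.5, -4.0]
targets = [1, 1, 0]  # the movie is both comedy and romance

# Each label gets its own probability; they are not forced to sum to 1.
probs = [sigmoid(s) for s in scores]
loss = binary_cross_entropy(scores, targets)
```

Here both "comedy" and "romance" receive probabilities above 0.9, which a softmax over the same three scores could not produce.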