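Mutual information measures how much knowing one variable reduces uncertainty about another:
$$I(X;Y) = H(X) - H(X \mid Y).$$
Equivalently, $I(X;Y) = D_{\mathrm{KL}}\big(p(x,y) \,\|\, p(x)\,p(y)\big)$, the KL divergence between the joint and the product of marginals. $I(X;Y) = 0$ iff $X$ and $Y$ are independent.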
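To see why the entropy form and the KL form agree, the KL divergence can be expanded term by term. This is a sketch that assumes discrete $X$ and $Y$ with joint pmf $p(x,y)$ and marginals $p(x)$, $p(y)$; the continuous case is analogous, with integrals and differential entropies.

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(p(x,y) \,\|\, p(x)\,p(y)\big)
  &= \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \\
  &= \sum_{x,y} p(x,y) \log \frac{p(x \mid y)}{p(x)} \\
  &= -\sum_{x} p(x) \log p(x) + \sum_{x,y} p(x,y) \log p(x \mid y) \\
  &= H(X) - H(X \mid Y).
\end{aligned}
```

Since a KL divergence is zero exactly when its two arguments coincide, this also shows that the mutual information vanishes exactly when $p(x,y) = p(x)\,p(y)$, i.e. under independence.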
Unlike correlation, mutual information captures non-linear dependencies. Two variables can have $\rho = 0$ but high $I(X;Y)$, e.g. $Y = X^2$ with $X$ distributed symmetrically about zero.
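The gap between correlation and mutual information can be checked exactly on a small discrete case. The sketch below assumes $X$ uniform on $\{-2,-1,0,1,2\}$ and $Y = X^2$; the helper names `mutual_information` and `correlation` are illustrative, not from any library.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(X;Y) in bits, given a joint pmf as {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )

def correlation(joint):
    """Pearson correlation of X and Y under the same joint pmf."""
    ex = sum(p * x for (x, _), p in joint.items())
    ey = sum(p * y for (_, y), p in joint.items())
    cov = sum(p * (x - ex) * (y - ey) for (x, y), p in joint.items())
    var_x = sum(p * (x - ex) ** 2 for (x, _), p in joint.items())
    var_y = sum(p * (y - ey) ** 2 for (_, y), p in joint.items())
    return cov / math.sqrt(var_x * var_y)

# Assumed toy distribution: X uniform on {-2, ..., 2}, Y = X^2.
joint = {(x, x * x): 0.2 for x in (-2, -1, 0, 1, 2)}

rho = correlation(joint)        # zero up to floating-point rounding
mi = mutual_information(joint)  # equals H(Y), about 1.52 bits

print(f"correlation = {rho:.3f}, mutual information = {mi:.3f} bits")
```

Because $Y$ is a deterministic function of $X$ here, $H(Y \mid X) = 0$ and the mutual information equals $H(Y)$, while the symmetry of $X$ makes the covariance vanish.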
Applications: feature selection (drop features $X_i$ with low $I(X_i; Y)$ with respect to the target $Y$), causal-graph learning, signal-processing decoding, and any setting where you suspect non-linear dependence.
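As an illustration of the feature-selection use case, here is a minimal sketch that scores discrete features by a plug-in (empirical-frequency) estimate of their mutual information with the target and keeps those above a threshold. The function names, the toy data, and the threshold of 0.1 bits are all assumptions made for this example; plug-in estimates are biased upward on small samples, so a real pipeline would likely want a bias-corrected or library estimator.

```python
import math
from collections import Counter

def empirical_mi(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written in terms of counts
    return sum(
        (c / n) * math.log2(c * n / (cx[x] * cy[y]))
        for (x, y), c in cxy.items()
    )

def select_features(features, target, threshold):
    """Score each feature column and keep those with MI >= threshold."""
    scores = {name: empirical_mi(col, target) for name, col in features.items()}
    kept = [name for name, score in scores.items() if score >= threshold]
    return scores, kept

# Hypothetical toy data: one perfect feature, one noisy, one unrelated.
target = [0, 0, 1, 1, 0, 0, 1, 1]
features = {
    "copy_of_target": [0, 0, 1, 1, 0, 0, 1, 1],
    "noisy":          [0, 0, 1, 1, 0, 1, 1, 0],
    "unrelated":      [0, 1, 0, 1, 0, 1, 0, 1],
}

scores, kept = select_features(features, target, threshold=0.1)
for name, score in scores.items():
    print(f"{name}: {score:.3f} bits")
print("kept:", kept)
```

On this toy data the copied feature scores the full entropy of the target, the unrelated one scores zero, and the noisy one falls in between, so only the unrelated feature is dropped.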