Anomaly Detection — Section 7: Unsupervised Learning

Anomaly detection finds points that don't conform to the bulk of the data. Useful for fraud, fault detection, intrusion detection, and data quality checks.

Statistical methods

Z-score: flag points with $|z| > 3$, where $z = (x - \bar{x})/s$. Assumes roughly Gaussian data; the mean and SD are themselves inflated by outliers, so extreme points can mask one another.
IQR rule: flag points outside $[Q_1 - 1.5 \cdot \text{IQR},\ Q_3 + 1.5 \cdot \text{IQR}]$, where $\text{IQR} = Q_3 - Q_1$. Robust; basis of box plot whiskers.
Modified z-score: uses the median and MAD (median absolute deviation) in place of the mean and SD, so outliers cannot inflate the statistics that set the threshold.
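
The three rules can be compared side by side on a small sample. This is a minimal sketch using NumPy; the function names and the sample data are made up for illustration, and the 0.6745 scaling constant with a 3.5 cutoff for the modified z-score is a commonly cited convention rather than something these notes specify.

```python
import numpy as np

def zscore_flags(x, threshold=3.0):
    # Classic z-score: distance from the mean in units of (population) SD.
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

def iqr_flags(x, k=1.5):
    # Box-plot rule: anything beyond k * IQR from the quartiles.
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def modified_zscore_flags(x, threshold=3.5):
    # Median/MAD version; 0.6745 and 3.5 are assumed conventions.
    # Note: this divides by zero if more than half the values are identical.
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    m = 0.6745 * (x - med) / mad
    return np.abs(m) > threshold

# Nine ordinary readings and one obvious outlier (illustrative data).
x = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 25.0])

print("z-score :", x[zscore_flags(x)])
print("IQR     :", x[iqr_flags(x)])
print("modified:", x[modified_zscore_flags(x)])
```

On this sample the plain z-score flags nothing, while the IQR rule and the modified z-score both flag 25.0. With the population SD, the largest possible $|z|$ in a sample of $n$ points is $\sqrt{n-1}$, which is exactly 3 for $n = 10$, so a single outlier can never exceed the cutoff in a sample this small.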

Isolation Forest

Builds random trees where each split is on a random feature with a random threshold. Anomalies tend to be isolated quickly (few splits to a leaf); normal points sit in dense regions and take more splits. The anomaly score is derived from the average path length across trees: the shorter the path, the more anomalous the point. Fast, handles high dimensions, doesn't need labels.
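
A minimal sketch of this with scikit-learn's IsolationForest, assuming synthetic 2-D data (a Gaussian blob plus ten scattered far-away points) that is invented here purely for illustration. The contamination value of 0.02 is likewise an assumption matched to the synthetic data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic data: 500 "normal" points and 10 far-away points in the corners.
X_normal = rng.normal(0.0, 1.0, size=(500, 2))
signs = rng.choice([-1.0, 1.0], size=(10, 2))
X_outliers = rng.uniform(5.0, 9.0, size=(10, 2)) * signs
X = np.vstack([X_normal, X_outliers])

# contamination is the assumed fraction of anomalies; it sets the label cutoff.
forest = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
forest.fit(X)

# score_samples returns "higher = more normal"; negate so higher = more anomalous.
scores = -forest.score_samples(X)
labels = forest.predict(X)  # -1 for anomaly, +1 for normal

print("mean score, normal  :", scores[:500].mean())
print("mean score, outliers:", scores[500:].mean())
print("points flagged      :", int((labels == -1).sum()))
```

No labels are used anywhere in fitting; the split between normal and outlier rows is only kept so the two groups' scores can be compared afterwards.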

One-Class SVM

Fits a boundary around the "normal" data and flags points outside. Works for moderate dimensions; expensive on large datasets because kernel SVM training scales poorly with the number of samples. Less popular than tree-based methods in recent years.
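
As a rough illustration, the sketch below fits scikit-learn's OneClassSVM on a training set that is assumed to contain only normal points, then scores two new points. The data, the RBF kernel, and nu=0.05 (an upper bound on the fraction of training points treated as outliers) are illustrative choices, not recommendations from these notes.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)

# Assumed-clean training data: one Gaussian blob.
X_train = rng.normal(0.0, 1.0, size=(300, 2))

# Two new points: one near the centre, one far outside the blob.
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])

# Scaling first matters because the RBF kernel is distance-based.
model = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
)
model.fit(X_train)

pred = model.predict(X_new)            # +1 inside the boundary, -1 outside
dist = model.decision_function(X_new)  # signed distance; negative = outside

print(pred, dist)
```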

Autoencoders (NN-based)

Train a neural network to reconstruct inputs. Normal points reconstruct well; anomalies reconstruct poorly. The reconstruction error is the anomaly score. Useful for high-dimensional structured data (images, sequences) where simpler methods struggle.
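
The idea can be sketched without a deep-learning framework. Here scikit-learn's MLPRegressor, trained to reproduce its own input through a 2-unit bottleneck, stands in for a real autoencoder; this is a swapped-in simplification for tabular toy data, not how one would handle images or sequences. The data (8 features driven by 2 hidden factors) and the layer sizes are assumptions made for the example.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Normal data: 8 observed features generated from 2 latent factors plus noise.
latent = rng.normal(size=(1000, 2))
mixing = rng.normal(size=(2, 8))
X_normal = latent @ mixing + 0.05 * rng.normal(size=(1000, 8))

# Anomalies: unstructured noise that does not follow the 2-factor pattern.
X_anom = rng.normal(scale=3.0, size=(20, 8))

scaler = StandardScaler().fit(X_normal)
Xn = scaler.transform(X_normal)
Xa = scaler.transform(X_anom)

# Encoder 8 -> 16 -> 2, decoder 2 -> 16 -> 8; the target is the input itself.
autoencoder = MLPRegressor(
    hidden_layer_sizes=(16, 2, 16),
    activation="tanh",
    max_iter=500,
    random_state=0,
)
autoencoder.fit(Xn, Xn)

def reconstruction_error(model, X):
    # Per-row mean squared error between the input and its reconstruction.
    return np.mean((model.predict(X) - X) ** 2, axis=1)

err_normal = reconstruction_error(autoencoder, Xn)
err_anom = reconstruction_error(autoencoder, Xa)

print("mean error, normal   :", err_normal.mean())
print("mean error, anomalies:", err_anom.mean())
```

The bottleneck is what makes this work: the network can only pass two numbers through the middle layer, so it learns the structure the normal data shares and fails on inputs that lack it.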

Local Outlier Factor (LOF)

Density-based. For each point, measure how much sparser its neighborhood is compared to its neighbors' neighborhoods. Points in sparse regions surrounded by dense regions get high LOF scores.
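
A short sketch with scikit-learn's LocalOutlierFactor shows why the comparison is local. The synthetic data is an assumption for illustration: a tight cluster, a legitimately loose cluster, and one lone point sitting just outside the tight cluster.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

dense = rng.normal(0.0, 0.3, size=(200, 2))   # tight cluster
sparse = rng.normal(8.0, 1.5, size=(50, 2))   # loose but legitimate cluster
lone = np.array([[2.0, 2.0]])                 # isolated point near the tight cluster
X = np.vstack([dense, sparse, lone])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)  # -1 for outlier, +1 for inlier

# scikit-learn stores the negated factor; flip the sign so higher = more anomalous.
scores = -lof.negative_outlier_factor_

print("LOF of lone point       :", scores[-1])
print("median LOF, dense points:", np.median(scores[:200]))
print("median LOF, sparse points:", np.median(scores[200:250]))
```

Points in the loose cluster are far from each other in absolute terms, but so are their neighbors, so their LOF stays near 1. The lone point's neighbors all live in the tight cluster, which is what drives its score up.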

Cold-start vs adaptive

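Some methods need a clean "normal" training set; others tolerate or even thrive on contamination. Isolation Forest tolerates contaminated training data; in scikit-learn, its "contamination" parameter states the expected fraction of anomalies and is used to set the score cutoff. Autoencoders are sensitive to outliers during training: if anomalies are common in the training set, the network learns to reconstruct them too, and their error no longer stands out.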

Pick a threshold

All methods produce a score; you need a cutoff. Choose based on:

A held-out labeled validation set if available (find threshold maximizing F1 or recall@K)
A fixed alert budget (for example, flag the top 1% of scores); this fixes the volume of alerts, not their precision
Domain knowledge of the expected anomaly rate
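
Two of these options are easy to sketch with NumPy: scanning cutoffs for the best F1 on a labeled validation set, and taking a quantile when only a flag rate is known. The function names and the tiny example arrays are hypothetical, and higher scores are assumed to mean more anomalous.

```python
import numpy as np

def best_f1_threshold(scores, y_true):
    """Try each observed score as a cutoff; return (threshold, F1) for the best one."""
    best_t, best_f1 = None, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

def top_fraction_threshold(scores, fraction=0.01):
    """Cutoff such that roughly `fraction` of the scores lie above it."""
    return np.quantile(scores, 1.0 - fraction)

# Labeled validation set (illustrative): 1 marks a known anomaly.
val_scores = np.array([0.10, 0.20, 0.15, 0.30, 0.90, 0.25, 0.80, 0.05])
val_labels = np.array([0, 0, 0, 0, 1, 0, 1, 0])
t_f1, f1 = best_f1_threshold(val_scores, val_labels)
print("best F1 cutoff:", t_f1, "F1:", f1)

# Unlabeled scores: flag the top 1%.
rng = np.random.default_rng(0)
unlabeled_scores = rng.normal(size=1000)
t_top = top_fraction_threshold(unlabeled_scores, fraction=0.01)
print("top-1% cutoff:", t_top, "flagged:", int((unlabeled_scores > t_top).sum()))
```

The quantile approach always flags about the same number of points regardless of whether any real anomalies are present, so it is best read as a workload limit rather than a statement about how many anomalies exist.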