Anomaly detection finds points that don't conform to the bulk of the data. Useful for fraud, fault detection, intrusion detection, and data quality checks.
Statistical methods
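- Z-score: flags points with $|z| > 3$ (a common cutoff), where $z = (x - \mu)/\sigma$, as outliers. Assumes Gaussian data; sensitive to the outliers themselves, because they inflate the mean and SD used to compute $z$.
- IQR rule: flags points outside $[Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR}]$. Robust; basis of box plot whiskers.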
- Modified z-score: uses median and MAD instead of mean and SD — robust to outliers contaminating the threshold.
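The three rules can be compared on one small sample. The sketch below uses only the Python standard library; the function names, the toy data, and the cutoffs (3 for the z-score, 1.5 for the IQR multiplier, 3.5 and the 0.6745 scaling for the modified z-score) are conventional choices assumed here for illustration.

```python
import statistics

def zscore_outliers(values, cutoff=3.0):
    """Flag values whose |z| exceeds the cutoff, using the mean and sample SD."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [abs(v - mean) / sd > cutoff for v in values]

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v < q1 - k * iqr or v > q3 + k * iqr for v in values]

def modified_zscore_outliers(values, cutoff=3.5):
    """Flag values using the median and MAD instead of the mean and SD."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    # 0.6745 rescales the MAD so it is comparable to an SD under normality.
    # Note: this sketch does not handle MAD == 0 (more than half the values equal).
    return [0.6745 * abs(v - med) / mad > cutoff for v in values]

# Toy sample (made up): nine ordinary readings and one obvious outlier.
data = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]

z_flags = zscore_outliers(data)             # nothing flagged: 50 inflates the SD
iqr_flags = iqr_outliers(data)              # only 50 flagged
mod_flags = modified_zscore_outliers(data)  # only 50 flagged
```

On this sample the plain z-score misses the outlier entirely, because the value 50 pulls the mean up and inflates the SD so far that its own z stays below 3. The two robust rules flag it.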
Isolation Forest
Builds random trees where each split picks a random feature and a random threshold. Anomalies tend to be isolated quickly (few splits to a leaf); normal points sit in dense regions and take more splits. Scores are based on the average path length across trees: the shorter the path, the more anomalous the point. Fast, handles high dimensions, doesn't need labels.
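A minimal sketch of scoring with an Isolation Forest, assuming scikit-learn's `IsolationForest` is the implementation in use (the notes do not name a library). The data is synthetic: a Gaussian cluster plus two planted outliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))   # dense cluster
outliers = np.array([[8.0, 8.0], [-9.0, 7.0]])           # planted anomalies
X = np.vstack([normal, outliers])                        # outliers are rows 300, 301

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)

# score_samples is derived from average path length; in scikit-learn's
# convention, LOWER values mean MORE anomalous.
scores = forest.score_samples(X)
most_anomalous = np.argsort(scores)[:2]
print(sorted(most_anomalous.tolist()))
```

No labels are used anywhere: the forest is fit on the mixed data and the planted points surface because they are isolated in few splits.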
One-Class SVM
Fits a boundary around the "normal" data and flags points outside it. Works for moderate dimensions; expensive for big data, since kernel SVM training scales roughly quadratically or worse with the number of samples. Less popular than tree-based methods in recent years.
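A short sketch of the boundary idea, assuming scikit-learn's `OneClassSVM` with an RBF kernel. The value `nu=0.05` (an upper bound on the fraction of training points treated as outliers) and the test points are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 2))          # synthetic "normal" data only

model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(train)

test_points = np.array([[0.0, 0.0],        # center of the training cloud
                        [6.0, 6.0]])       # far outside it
labels = model.predict(test_points)        # +1 = inside boundary, -1 = outside
print(labels.tolist())
```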
Autoencoders (NN-based)
Train a neural network to reconstruct inputs. Normal points reconstruct well; anomalies reconstruct poorly. The reconstruction error is the anomaly score. Useful for high-dimensional structured data (images, sequences) where simpler methods struggle.
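The reconstruction-error score can be illustrated without a deep learning framework. The sketch below is a stand-in, not a trained neural network: it uses a linear autoencoder with a 2-unit bottleneck, whose optimal weights span the same subspace as the top principal components, so they can be computed directly with an SVD. The toy data and the `mixing` matrix are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" data: 5-D points lying near a 2-D subspace.
mixing = np.array([[1.0, 0.5, 0.0, -0.5, 1.0],
                   [0.0, 1.0, 1.0, 0.5, -1.0]])
latent = rng.normal(size=(500, 2))
train = latent @ mixing + 0.05 * rng.normal(size=(500, 5))

mean = train.mean(axis=0)
# The top-2 right singular vectors act as tied encoder/decoder weights.
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
W = vt[:2]                                  # shape (2, 5)

def reconstruction_error(x):
    """Squared error between x and its encode-then-decode reconstruction."""
    code = (x - mean) @ W.T                 # encode: 5 dims -> 2 dims
    recon = code @ W + mean                 # decode: 2 dims -> 5 dims
    return np.sum((x - recon) ** 2, axis=-1)

on_manifold = np.array([1.0, -1.0]) @ mixing   # looks like the training data
off_manifold = on_manifold + 3.0 * vt[4]       # pushed along a direction the model discards

err_normal = reconstruction_error(on_manifold)
err_anomaly = reconstruction_error(off_manifold)
```

The point that follows the training structure reconstructs almost exactly, while the point displaced off the learned subspace keeps nearly all of that displacement as error. A nonlinear autoencoder applies the same scoring rule with a learned, curved manifold in place of the subspace.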
Local Outlier Factor (LOF)
Density-based. For each point, compare the density of its neighborhood with the densities of its neighbors' neighborhoods. A score near 1 means the point is about as dense as its neighbors; points in sparse regions surrounded by dense regions get scores well above 1.
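A sketch of why the local comparison matters, assuming scikit-learn's `LocalOutlierFactor`. The synthetic data has a tight cluster, a loose cluster, and one point sitting just outside the tight cluster; `n_neighbors=20` is an assumed setting.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.3, size=(100, 2))     # tight cluster
sparse = rng.normal(loc=15.0, scale=2.0, size=(100, 2))   # loose cluster
lonely = np.array([[2.5, 2.5]])                           # near, but outside, the tight cluster
X = np.vstack([dense, sparse, lonely])                    # lonely is row 200

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                 # -1 = outlier, +1 = inlier

# scikit-learn stores the NEGATED factor; flip the sign to get LOF scores.
lof_scores = -lof.negative_outlier_factor_
print(int(np.argmax(lof_scores)), labels[200])
```

Points in the loose cluster are far from each other in absolute terms, yet their scores stay low because their neighbors are equally spread out. The lone point scores high because its nearest neighbors live in a much denser region.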
Cold-start vs adaptive
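Some methods need a clean "normal" training set; others tolerate contamination or are designed for it. Isolation Forest works on contaminated data; in scikit-learn, its "contamination" parameter sets the expected fraction of anomalies and thereby the decision threshold. Autoencoders are sensitive to outliers during training: given enough of them, the network learns to reconstruct anomalies too, which shrinks their reconstruction error.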
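A small sketch of what the contamination setting does in practice, assuming scikit-learn's `IsolationForest`. The data is synthetic and the two rates are arbitrary examples; the parameter does not change the trees, only where the cutoff on the scores is placed.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))              # unlabeled synthetic data

flagged_counts = {}
for rate in (0.01, 0.05):
    forest = IsolationForest(contamination=rate, random_state=0).fit(X)
    # predict returns -1 for points scoring below the contamination-derived cutoff
    flagged_counts[rate] = int((forest.predict(X) == -1).sum())

print(flagged_counts)
```

The number of flagged training points tracks the requested fraction closely, which is why a wrong guess for the rate translates directly into too many or too few alerts.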
Pick a threshold
All methods produce a score; you need a cutoff. Choose based on:
- A held-out labeled validation set if available (find the threshold that maximizes F1 or recall@K)
- A fixed alert budget (e.g., flag the top 1% of scores)
- Domain knowledge of the expected anomaly rate
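Two of these options can be sketched in plain Python: an F1-maximizing cutoff from labeled validation scores, and a cutoff that flags a fixed top fraction. The helper names and the validation data are hypothetical, and the sketch assumes higher scores mean more anomalous.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 when every score >= threshold is flagged (labels: 1 = anomaly)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def best_f1_threshold(scores, labels):
    """Try each observed score as a cutoff; keep the one with the highest F1."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))

def top_fraction_threshold(scores, fraction):
    """Cutoff that flags roughly the top `fraction` of scores."""
    k = max(1, round(len(scores) * fraction))
    return sorted(scores, reverse=True)[k - 1]

# Made-up validation set: two true anomalies among ten points.
val_scores = [0.10, 0.20, 0.15, 0.90, 0.30, 0.80, 0.25, 0.05, 0.70, 0.12]
val_labels = [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]

f1_cutoff = best_f1_threshold(val_scores, val_labels)   # 0.8
budget_cutoff = top_fraction_threshold(val_scores, 0.2)  # 0.8 (top 2 of 10)
```

The two cutoffs coincide here only because the toy data has exactly 20% anomalies and they are perfectly ranked. On real data they usually differ, and the labeled-set approach is preferable when labels exist.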