PCA finds orthogonal directions (principal components) in feature space along which the data varies most. The first PC captures the most variance, the second captures the most variance among directions orthogonal to the first, and so on. Projecting onto the top $k$ PCs gives a $k$-dimensional summary that retains as much variance as any linear projection of that dimension can.
The math
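PCs are eigenvectors of the covariance matrix $\Sigma = \frac{1}{n-1} X^\top X$, where $X$ is the centered data matrix with $n$ rows. The eigenvalues give the variance along each PC. In practice, PCA is computed via SVD of $X$ directly, which is more numerically stable than forming $\Sigma$ explicitly.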
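To make the equivalence concrete, here is a minimal NumPy sketch that computes the PCs both ways and compares them. The toy data and variable names are illustrative assumptions, not part of any particular library's PCA implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (illustrative only): 200 samples, 3 correlated features
X_raw = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                              [0.0, 1.0, 0.3],
                                              [0.0, 0.0, 0.2]])
X = X_raw - X_raw.mean(axis=0)           # center each feature
n = X.shape[0]

# Route 1: eigendecomposition of the covariance matrix
cov = X.T @ X / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]        # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix
U, S, Vt = np.linalg.svd(X, full_matrices=False)
svd_vars = S**2 / (n - 1)                # variance along each PC

# Columns of eigvecs and rows of Vt are the same PCs, up to sign
print(eigvals)
print(svd_vars)
```

The two routes agree on the variances, and on the directions up to a sign flip, since an eigenvector and its negation describe the same axis.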
Two main uses
1. Compression / visualization: project to 2-3 PCs to plot high-dimensional data
2. Decorrelation / noise removal: keep only the top PCs, dropping low-variance directions that may be noise
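Both uses can be sketched with the same projection. The helpers `pca_project` and `pca_reconstruct` below are hypothetical names introduced for illustration, and the data is synthetic: a 2-D signal embedded in 10 dimensions with a little noise added.

```python
import numpy as np

def pca_project(X, k):
    """Return (scores, components, mean) for the top-k PCs of X."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                  # shape (k, n_features)
    scores = Xc @ components.T           # shape (n_samples, k)
    return scores, components, mean

def pca_reconstruct(scores, components, mean):
    """Map k-dimensional scores back to the original feature space."""
    return scores @ components + mean

rng = np.random.default_rng(1)
# Synthetic data (an assumption for this sketch)
latent = rng.normal(size=(300, 2)) * np.array([5.0, 3.0])
mixing = rng.normal(size=(2, 10))
clean = latent @ mixing
noisy = clean + 0.1 * rng.normal(size=clean.shape)

# Use 1: two coordinates per sample, ready for a scatter plot
scores, components, mean = pca_project(noisy, k=2)

# Use 2: reconstruct from the top PCs only, discarding low-variance directions
denoised = pca_reconstruct(scores, components, mean)

err_noisy = np.mean((noisy - clean) ** 2)
err_denoised = np.mean((denoised - clean) ** 2)
print(err_noisy, err_denoised)
```

In this setup the reconstruction is closer to the clean signal than the noisy input is, because most of the noise lives in the eight directions that were dropped. With real data, whether the discarded directions are noise or signal is a judgment call.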
Choosing $k$
Plot cumulative explained variance against the number of components and keep enough PCs to capture, say, 90-95% of the variance. If PCA is a preprocessing step for a downstream model, choose $k$ by cross-validating that model's performance instead.
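The variance-threshold rule can be sketched in a few lines of NumPy. The function name `choose_k` and the 95% default are assumptions made for this example.

```python
import numpy as np

def choose_k(X, threshold=0.95):
    """Smallest k whose top-k PCs reach `threshold` cumulative explained variance."""
    Xc = X - X.mean(axis=0)
    S = np.linalg.svd(Xc, compute_uv=False)   # singular values, descending
    explained = S**2 / np.sum(S**2)           # explained variance ratio per PC
    cumulative = np.cumsum(explained)
    # First index where the cumulative ratio reaches the threshold
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative)), cumulative

rng = np.random.default_rng(0)
# Illustrative data: five independent features with very different spreads
X = rng.normal(size=(1000, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

k, cumulative = choose_k(X, threshold=0.95)
print(k, cumulative)
```

Plotting `cumulative` against `range(1, len(cumulative) + 1)` gives the curve described above; the threshold simply reads off where that curve crosses the chosen level.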
Center and standardize
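PCA is sensitive to feature scale: a feature measured in millions will dominate one measured in tenths. Standardize before PCA unless the features already share a unit and a comparable range. The "centering" step (subtracting the mean) is mathematically required; sklearn's PCA does it automatically, but it does not rescale features, so standardization is a separate step.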
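A small sketch of the scale problem, using scikit-learn's `PCA`, `StandardScaler`, and `make_pipeline`. The two features (an income in dollars and a rate in tenths) are made-up data chosen to exaggerate the effect.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# Hypothetical features that share a common driver but differ hugely in scale
shared = rng.normal(size=n)
income = 1_000_000 * (shared + 0.5 * rng.normal(size=n))
rate = 0.1 * (shared + 0.5 * rng.normal(size=n))
X = np.column_stack([income, rate])

# PCA on raw features: centered automatically, but not rescaled
raw_pca = PCA(n_components=2).fit(X)

# PCA after standardizing each feature to zero mean and unit variance
scaled = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)
scaled_pca = scaled.named_steps["pca"]

raw_pc1 = raw_pca.components_[0]
scaled_pc1 = scaled_pca.components_[0]
print("raw PC1 loadings:   ", raw_pc1)
print("scaled PC1 loadings:", scaled_pc1)
```

On the raw data the first PC is almost entirely the income axis. After standardizing, both features load equally on it, reflecting their shared driver rather than their units.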
Limitations
- Linear only — finds directions of maximum LINEAR variance. Nonlinear structure (curves, manifolds) needs t-SNE, UMAP, or kernel PCA.
- Each PC is a linear combination of all input features, so components can be hard to interpret. Sparse PCA addresses this by forcing many coefficients to exactly zero.
- Doesn't necessarily preserve cluster structure: the top PCs might separate two groups or mix them, depending on the data.
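The last limitation is easy to reproduce. In the sketch below (synthetic data, hypothetical helper `separation`), two groups differ only along a low-variance feature, so the first PC follows the high-variance feature and mixes them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Two synthetic groups: 2 units apart along feature 1,
# but both spread widely (std 10) along feature 0
group_a = np.column_stack([rng.normal(0, 10, n), rng.normal(-1, 0.2, n)])
group_b = np.column_stack([rng.normal(0, 10, n), rng.normal(+1, 0.2, n)])
X = np.vstack([group_a, group_b])
labels = np.array([0] * n + [1] * n)

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                        # column j holds the scores on PC j+1

def separation(z, labels):
    """Gap between group means, in units of pooled within-group std."""
    a, b = z[labels == 0], z[labels == 1]
    pooled = np.sqrt((a.var() + b.var()) / 2)
    return abs(a.mean() - b.mean()) / pooled

sep_pc1 = separation(scores[:, 0], labels)
sep_pc2 = separation(scores[:, 1], labels)
print(sep_pc1, sep_pc2)
```

Here the groups overlap almost completely on PC1 and are cleanly separated on PC2, so keeping only the top PC would discard exactly the direction that distinguishes them. PCA never sees the labels; it ranks directions by variance alone.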
t-SNE and UMAP
For visualization specifically, t-SNE and UMAP often beat PCA. They optimize local structure (similar points stay close) at the cost of global geometry, so don't read distances between clusters as meaningful. UMAP is typically faster than t-SNE and tends to preserve more global structure.
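As a rough sketch of how the two approaches are called side by side, the snippet below embeds the same synthetic data with scikit-learn's `PCA` and `TSNE`. The data and the perplexity value are assumptions for illustration; UMAP is omitted because it lives in a separate third-party package rather than in scikit-learn.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic data: three blobs of 50 points each in 20 dimensions
centers = rng.normal(scale=5.0, size=(3, 20))
X = np.vstack([c + rng.normal(size=(50, 20)) for c in centers])

# Linear projection onto the top two PCs
pca_2d = PCA(n_components=2).fit_transform(X)

# Nonlinear embedding; perplexity must be smaller than the number of samples
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

print(pca_2d.shape, tsne_2d.shape)
```

Both results are `(n_samples, 2)` arrays that can be passed to a scatter plot. Unlike a fitted PCA, scikit-learn's `TSNE` has no `transform` method for new points, so the embedding has to be recomputed when the data changes.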