Clustering and Dimensionality Reduction — Section 19: Machine Learning Fundamentals

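Clustering partitions data into groups of similar points. K-means is the workhorse: pick the number of clusters $k$, then alternate between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points, until the assignments stop changing. It is fast and simple, but it implicitly assumes roughly spherical, similarly sized clusters, and it converges only to a local optimum that depends on the initial centroids. When clusters are non-spherical or the cluster count is unknown, try DBSCAN, hierarchical clustering, or Gaussian mixture models.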
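The assign-then-update loop of K-means can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `kmeans`, its parameters, the random initialization from data points, and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize centroids as k distinct random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    # Final assignment so the labels match the returned centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids

# Hypothetical data: two well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(X, k=2)
```

Because the result depends on the starting centroids, practical implementations usually run several initializations and keep the partition with the lowest within-cluster sum of squares.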

Dimensionality reduction projects high-dimensional data into a low-dimensional space while preserving as much structure as possible. PCA finds the orthogonal linear directions of greatest variance. $t$-SNE and UMAP are non-linear methods that preserve local neighborhood structure, which makes them well suited to visualization. Autoencoders learn non-linear projections with neural networks.
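To make the PCA case concrete, here is a small sketch that computes principal components from the singular value decomposition of the centered data. The function name `pca` and the synthetic data (three correlated features driven by one latent variable) are assumptions for illustration only.

```python
import numpy as np

def pca(X, n_components):
    """Minimal PCA sketch via SVD: returns (scores, components, explained_variance)."""
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # rows are principal directions
    # Variance along each direction is the squared singular value over (n - 1).
    explained_variance = S[:n_components] ** 2 / (len(X) - 1)
    scores = Xc @ components.T                   # low-dimensional coordinates
    return scores, components, explained_variance

# Hypothetical data: 3 features that mostly vary along a single line.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t, -t]) + rng.normal(scale=0.05, size=(200, 3))

Z, components, explained_variance = pca(X, n_components=2)
# Share of total variance captured by the first component.
ratio = explained_variance[0] / X.var(axis=0, ddof=1).sum()
```

In this constructed example nearly all of the variance lies along the first component, so a single coordinate summarizes the three original features with little loss.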

In quant work, clustering is used to identify market regimes (bull, bear, high-volatility), to group assets that behave similarly, and to flag anomalous trading patterns. Dimensionality reduction makes large covariance matrices manageable and yields compact factor representations.
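One way a covariance matrix becomes "manageable" is a low-rank factor approximation: keep the top few principal components as statistical factors and treat what remains as asset-specific variance. The sketch below assumes synthetic returns generated from three hidden factors; the sizes, volatilities, and variable names are all hypothetical, and this is one simple construction rather than a standard risk model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, n_assets, n_factors = 500, 20, 3

# Hypothetical returns: a few common factors plus idiosyncratic noise.
factor_returns = rng.normal(0.0, 0.01, size=(n_days, n_factors))
loadings = rng.normal(0.0, 1.0, size=(n_factors, n_assets))
noise = rng.normal(0.0, 0.005, size=(n_days, n_assets))
returns = factor_returns @ loadings + noise

# Sample covariance and its eigendecomposition, sorted largest first.
cov = np.cov(returns, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)           # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Factor loadings scaled so that B @ B.T is the rank-3 systematic covariance.
B = eigvecs[:, :n_factors] * np.sqrt(eigvals[:n_factors])
systematic = B @ B.T
# Keep only the diagonal of the residual as asset-specific variance.
specific = np.diag(np.diag(cov - systematic))
cov_factor = systematic + specific

# Fraction of total variance explained by the retained factors.
explained = eigvals[:n_factors].sum() / eigvals.sum()
```

The approximation stores `n_assets * n_factors + n_assets` numbers instead of a full `n_assets * n_assets` matrix, and it reproduces each asset's sample variance exactly because the residual diagonal is added back.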