K-means clusters points into K groups by alternating between assigning points to the nearest cluster center and recomputing centers as the mean of their assigned points. The most-used clustering algorithm — fast, simple, and good when clusters are roughly spherical.
The algorithm
1. Initialize K cluster centers (randomly or with K-means++) 2. Assign each point to its nearest center 3. Update each center to be the mean of its assigned points 4. Repeat 2-3 until assignments stop changing
Converges to a local optimum in for iterations.
K-means++
Random initialization can give bad clusterings. K-means++ picks initial centers more carefully: start with one random point, then iteratively pick the next center proportional to its squared distance from the nearest existing center. Provably -competitive with the global optimum.
Choosing K
The hard problem. Methods:
- Elbow method: plot within-cluster sum of squares vs K; look for an "elbow" where the marginal gain flattens
- Silhouette score: measures how similar each point is to its cluster vs neighboring clusters; higher is better
- Gap statistic: compares within-cluster dispersion to that expected under a null distribution
In practice, K-means is often used as part of a larger pipeline where K is dictated by downstream needs.
Assumptions and failure modes
K-means assumes clusters are roughly spherical, similar size, and similar density. Fails on:
- Non-convex shapes (concentric rings, half-moons)
- Very different cluster sizes
- Strong noise / outliers (mean is sensitive)
- High-dimensional data (Euclidean distance becomes uninformative)
For these, use DBSCAN, spectral clustering, or Gaussian mixtures.
Standardize features
K-means uses Euclidean distance, so features with larger scales dominate. Always standardize (or domain-normalize) before clustering.