Measures of Central Tendency — Section 2: Descriptive Statistics

Three summary numbers compete for "the middle of the data": mean, median, and mode. They agree for symmetric unimodal distributions, diverge for skewed or multimodal ones.

Mean

The arithmetic average $\frac{1}{n} \sum_{i=1}^{n} x_i$. It minimizes the sum of squared deviations $\sum (x_i - c)^2$ over all candidate centers $c$. Sensitive to outliers: one wildly large value can drag the mean far from the bulk of the distribution.
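
A minimal sketch of both properties, using a made-up sample; the helper names `mean` and `squared_error` are illustrative, not from any library. It checks the mean against two nearby candidate centers on squared error, then appends a single extreme value to show how far the mean moves.

```python
def mean(xs):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(xs) / len(xs)


def squared_error(xs, c):
    """Total squared distance of the data from a candidate center c."""
    return sum((x - c) ** 2 for x in xs)


data = [2, 3, 3, 4, 5, 7]        # made-up sample
with_outlier = data + [200]      # same sample plus one extreme value

m = mean(data)                   # 24 / 6
m_out = mean(with_outlier)       # 224 / 7

# Among the mean and two nearby candidates, the mean has the smallest squared error.
candidates = [m - 0.5, m, m + 0.5]
best = min(candidates, key=lambda c: squared_error(data, c))

print(m, m_out, best)
```

One added observation moves the mean from 4.0 to 32.0, a value larger than every original data point.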

Median

The middle value of the sorted data (or the average of the two middle values for even $n$). Minimizes the sum of absolute deviations $\sum |x_i - c|$. Robust to outliers: doubling the largest observation doesn't move the median at all.
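
The definition translates almost directly into code. The sketch below is one straightforward way to write it (the standard library's `statistics.median` does the same job); the sample values are made up.

```python
def median(xs):
    """Middle value of the sorted data; average of the two middle values for even n."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


odd = [7, 1, 3, 5, 9]          # sorted: 1 3 5 7 9
even = [7, 1, 3, 5]            # sorted: 1 3 5 7
stretched = [7, 1, 3, 5, 18]   # `odd` with its largest value doubled

print(median(odd), median(even), median(stretched))
```

Doubling the largest value changes the mean of `odd` but leaves its median at 5, because the middle position in the sorted order is unaffected.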

Mode

The most frequently occurring value. Useful for categorical data and multimodal distributions; less commonly reported for continuous data.
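
A short sketch of mode-finding that returns every tied value, which matters for multimodal data. The function name `modes` and the sample data are illustrative; `statistics.multimode` in the standard library behaves similarly.

```python
from collections import Counter


def modes(xs):
    """Return all values that share the highest frequency, in first-seen order."""
    counts = Counter(xs)
    top = max(counts.values())   # raises ValueError on empty input
    return [value for value, count in counts.items() if count == top]


colours = ["red", "blue", "red", "green", "blue", "red"]   # categorical data
bimodal = [1, 1, 2, 3, 3]                                   # two values tie

print(modes(colours))
print(modes(bimodal))
```

Returning a list rather than a single value avoids silently picking one peak when the data has several.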

When to use mean vs median

For roughly symmetric data with no outliers, use the mean: it is statistically more efficient (lower sampling variance for near-normal data) and fits naturally into linear-algebra methods such as least squares. For skewed data (income, response times, file sizes), the median is a better summary of "what's typical." Mean income exceeds median income because the long right tail of high earners pulls the mean up.
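
To see the gap on skewed data, the sketch below draws synthetic "incomes" from a log-normal distribution. The distribution, its parameters, and the seed are assumptions chosen only to produce a right-skewed sample, not real income data.

```python
import random
import statistics

rng = random.Random(0)   # fixed seed so the run is repeatable

# Synthetic right-skewed sample (assumed log-normal, parameters arbitrary).
incomes = [rng.lognormvariate(10, 1) for _ in range(10_000)]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# Fraction of the sample that falls below the mean.
share_below_mean = sum(x < mean_income for x in incomes) / len(incomes)

print(round(mean_income), round(median_income), round(share_below_mean, 2))
```

In a sample like this, well over half of the observations sit below the mean, which is why the mean is a poor description of the typical case.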

Trimmed mean

A compromise: drop the smallest $k$% and the largest $k$% of observations, then average the rest. More robust than the mean, more efficient than the median when the data is "mostly clean but with some outliers."
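
A minimal sketch of a trimmed mean. It assumes the percentage is cut from each end separately and that the number of dropped observations is rounded down; other conventions exist, and the function name `trimmed_mean` is illustrative.

```python
def trimmed_mean(xs, percent):
    """Mean after dropping `percent` % of the observations from each end."""
    if not 0 <= percent < 50:
        raise ValueError("percent must be in [0, 50)")
    s = sorted(xs)
    k = int(len(s) * percent / 100)   # observations cut from each tail, rounded down
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)


data = [2, 3, 3, 4, 5, 7, 8, 9, 10, 200]   # made-up sample with one outlier

plain = trimmed_mean(data, 0)      # no trimming: the ordinary mean
trimmed = trimmed_mean(data, 10)   # drops the 2 and the 200

print(plain, trimmed)
```

With 10% trimmed from each end the result is 6.125, close to the median of 6, while the untrimmed mean is pulled up to about 25.1 by the single outlier.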

Geometric and harmonic mean

For multiplicative data (growth rates, ratios), the geometric mean ($n$-th root of the product) is the right average. For rates (speed, throughput) that each apply to the same amount of work or distance, the harmonic mean ($n / \sum 1/x_i$) is correct. Using the arithmetic mean here is a classic error: averaging 60 and 30 mph gives 45, but the harmonic mean is 40, which is the right "average speed" for equal distances.
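
The sketch below checks both claims with the standard library's `statistics.geometric_mean` and `statistics.harmonic_mean`. The growth factors and the 60-mile leg length are made-up values for illustration.

```python
import math
import statistics

# Growth factors: +50% one period, -50% the next (made-up values).
growth = [1.5, 0.5]
arith = statistics.mean(growth)           # suggests "no change on average"
geo = statistics.geometric_mean(growth)   # per-period factor that compounds to the true total
total_growth = math.prod(growth)          # actual combined factor over both periods

# Speeds over two legs of equal distance.
speeds = [60, 30]                         # mph
harm = statistics.harmonic_mean(speeds)

# Cross-check from first principles: total distance divided by total time.
leg = 60                                  # miles per leg, arbitrary
total_time = sum(leg / v for v in speeds)
trip_speed = len(speeds) * leg / total_time

print(arith, geo, total_growth)
print(harm, trip_speed)
```

The arithmetic mean of the growth factors is 1.0 even though the value actually shrank to 0.75 of its start; compounding the geometric mean over two periods reproduces that 0.75. Likewise, the harmonic mean matches the distance-over-time calculation because the slower leg takes twice as long.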