Feature Engineering — Section 6: Model Evaluation

Garbage in, garbage out. Modern ML methods are powerful but they can't infer information that isn't in the features. Feature engineering — turning raw data into model inputs that expose structure — is usually the highest-leverage step.

Categorical encoding

One-hot: one column per category. Standard for low-cardinality categoricals. Produces very wide, sparse matrices for high-cardinality ones.
Label encoding: integer per category. Don't use for non-ordinal variables in linear models or NNs, because the integers imply a false ordering. OK for tree-based models.
Target encoding: replace category with its mean target value. Powerful but prone to leakage; compute each row's encoding out-of-fold (K-fold within the training data) so that a row's own target never feeds its encoding.
Embeddings: learned vector per category. Used in neural nets; useful for high-cardinality (user IDs, product IDs).
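
To make the leakage point concrete, here is a minimal sketch of out-of-fold target encoding in plain Python. The function name `kfold_target_encode`, the interleaved fold assignment, and the fallback to the mean of the other folds for unseen categories are all assumptions made for illustration; a real pipeline would usually shuffle rows first and often adds smoothing.

```python
from collections import defaultdict

def kfold_target_encode(categories, targets, n_folds=3):
    """Out-of-fold mean-target encoding for one categorical column (illustrative)."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        total, total_count = 0.0, 0
        # Accumulate target statistics from every row NOT in the held-out fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                total_count += 1
        fallback = total / total_count  # mean target of the other folds
        # Encode the held-out rows using only those statistics.
        for i in range(fold, n, n_folds):
            c = categories[i]
            encoded[i] = sums[c] / counts[c] if counts[c] else fallback
    return encoded

cats = ["a", "a", "b", "a", "b", "c"]
ys = [1, 0, 1, 1, 0, 1]
enc = kfold_target_encode(cats, ys, n_folds=3)
```

Each row's encoding is computed without looking at that row's own target, which is the property that limits leakage. On a toy dataset this small the estimates are very noisy, which is why smoothing toward a global mean is a common addition.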

Numerical transformations

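Standardize (z-score): mean 0, std 1. Matters for regularized linear models, SVMs, and NNs, where feature scale affects the penalty, the distances, or the optimization. Fit the mean and std on training data only. Not needed for tree-based models.
Log transform: for right-skewed positive variables (income, prices). Helps linear models and stabilizes variance. Use log(1 + x) when zeros occur.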
Quantile / rank transform: convert to uniform [0, 1] via empirical CDF. Robust to outliers, removes skew.
Binning: discretize a continuous variable. Adds non-linearity to linear models cheaply.
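
The four transformations above can be sketched with the standard library alone. The helper names (`standardize`, `rank_transform`), the sample `incomes`, and the bin `edges` are made up for illustration; statistics are fit on a `train` list and then applied, mirroring how a real pipeline would avoid peeking at test data.

```python
import bisect
import math
import statistics

def standardize(train, values):
    """Z-score `values` using the mean and population std of `train` only."""
    mu = statistics.fmean(train)
    sigma = statistics.pstdev(train)
    return [(v - mu) / sigma for v in values]

def rank_transform(train, values):
    """Map values into [0, 1] via the empirical CDF of `train`."""
    sorted_train = sorted(train)
    n = len(sorted_train)
    return [bisect.bisect_right(sorted_train, v) / n for v in values]

# Hypothetical right-skewed sample with one large outlier.
incomes = [20_000, 30_000, 40_000, 50_000, 1_000_000]

z = standardize(incomes, incomes)
logged = [math.log1p(v) for v in incomes]   # log(1 + x), safe at zero
ranks = rank_transform(incomes, incomes)

# Binning: index of the interval each value falls into (edges are assumptions).
edges = [25_000, 45_000, 100_000]
bins = [bisect.bisect_right(edges, v) for v in incomes]
```

Note how the outlier dominates the z-scores (the four ordinary incomes end up nearly indistinguishable), while the rank transform spaces all five values evenly.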

Date / time features

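A timestamp is almost never useful raw. Extract: hour of day, day of week, day of month, month, year, holiday flag. For cyclical features such as hour of day, use paired sin/cos transforms ( $\sin(2\pi \cdot \text{hour} / 24)$ and $\cos(2\pi \cdot \text{hour} / 24)$ ) so that 23:00 and 00:00 end up close together instead of 23 units apart.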
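
A minimal sketch of this extraction using only the standard library follows. The function names `time_features` and `cyclic_distance` are hypothetical, and the holiday flag is left out because it needs an external calendar.

```python
import math
from datetime import datetime

def time_features(ts):
    """Extract calendar parts plus a sin/cos encoding of the hour."""
    hour_angle = 2 * math.pi * ts.hour / 24
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),   # Monday = 0
        "day_of_month": ts.day,
        "month": ts.month,
        "year": ts.year,
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
    }

def cyclic_distance(a, b):
    """Euclidean distance between two feature dicts in (sin, cos) hour space."""
    return math.hypot(a["hour_sin"] - b["hour_sin"], a["hour_cos"] - b["hour_cos"])

late = time_features(datetime(2024, 1, 1, 23))   # 23:00
midnight = time_features(datetime(2024, 1, 2, 0))  # 00:00 the next day
noon = time_features(datetime(2024, 1, 2, 12))
```

In the encoded space, `cyclic_distance(late, midnight)` is much smaller than `cyclic_distance(midnight, noon)`, whereas the raw hour values 23 and 0 would look maximally far apart.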

Interaction features

Sometimes the relationship is in the cross: "income × debt" matters where neither alone does. Options: polynomial features, explicit cross-features (column1 * column2), or relying on trees to discover the interaction.
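
As an illustration of explicit cross-features, the sketch below adds a product column for every pair of numeric features in a row. The helper name `add_pairwise_products` and the sample values are assumptions; this corresponds to degree-2, interaction-only polynomial features.

```python
from itertools import combinations

def add_pairwise_products(row):
    """Return a copy of `row` with one product feature per pair of columns."""
    out = dict(row)
    # Sort the names so the generated feature names are deterministic.
    for a, b in combinations(sorted(row), 2):
        out[f"{a}*{b}"] = row[a] * row[b]
    return out

row = {"income": 50.0, "debt": 20.0, "age": 40.0}
crossed = add_pairwise_products(row)
```

The number of pairs grows quadratically with the number of columns, so in practice it is usually better to cross only the pairs that domain knowledge suggests.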

Text and image features

Pre-NN era: TF-IDF for text, HOG / SIFT for images. NN era: use pre-trained embeddings or fine-tune. For tabular pipelines that include text, both the classical and the neural approaches still have a place; sometimes simpler is better.
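
For reference, here is a minimal TF-IDF sketch in plain Python. It uses one simple variant (term frequency = count / document length, idf = log(N / df)) with whitespace tokenization; libraries typically use smoothed or normalized variants, so the numbers will not match theirs. The function name `tfidf` and the sample documents are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

docs = ["the cat sat", "the dog sat", "the cat ran"]
vecs = tfidf(docs)
```

A term that appears in every document ("the") gets weight zero, while a term unique to one document ("dog") gets the highest weight in that document.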

The 80/20 rule

A common rule of thumb for real tabular ML projects: roughly 80% of the gain comes from feature engineering and roughly 20% from algorithm/hyperparameter tuning. The exact split varies by problem, but time invested in understanding the data and crafting features usually beats time invested in the latest gradient boosting variant.