Garbage in, garbage out. Modern ML methods are powerful, but they can't infer information that isn't in the features. Feature engineering, meaning turning raw data into model inputs that expose structure, is usually the highest-leverage step.
Categorical encoding
- One-hot: one binary column per category. Standard for low-cardinality categoricals. Produces very wide, sparse matrices for high-cardinality columns.
- Label encoding: one integer per category. Don't use for non-ordinal variables in linear models or NNs, because the integers imply a false ordering. OK for tree-based models.
- Target encoding: replace each category with its mean target value. Powerful but prone to leakage; compute the encodings out-of-fold (K-fold) within the training data, so that a row's own target never contributes to its encoded value.
- Embeddings: learned vector per category. Used in neural nets; useful for high-cardinality (user IDs, product IDs).
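
The out-of-fold idea behind target encoding can be sketched as follows. This is a minimal illustration assuming NumPy is available; the function name `target_encode_oof` and the `smoothing` parameter (which shrinks rare categories toward the overall mean) are illustrative choices, not something the text above prescribes.

```python
import numpy as np

def target_encode_oof(categories, targets, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold mean target encoding with additive smoothing.

    Each row is encoded from statistics of the other folds only, so a
    row's own target never contributes to its encoded value.
    `smoothing` must be > 0 so unseen categories fall back to the prior.
    """
    categories = np.asarray(categories)
    targets = np.asarray(targets, dtype=float)
    n = len(categories)
    encoded = np.empty(n, dtype=float)

    # Shuffle row indices once, then cut them into n_splits folds.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_splits)

    for holdout in folds:
        fit_mask = np.ones(n, dtype=bool)
        fit_mask[holdout] = False  # True = row used to fit the statistics
        prior = targets[fit_mask].mean()

        # Per-category (sum, count) computed on the fitting rows only.
        stats = {}
        for cat, y in zip(categories[fit_mask], targets[fit_mask]):
            total, count = stats.get(cat, (0.0, 0))
            stats[cat] = (total + y, count + 1)

        for i in holdout:
            total, count = stats.get(categories[i], (0.0, 0))
            # Shrink toward the prior; unseen categories get the prior itself.
            encoded[i] = (total + smoothing * prior) / (count + smoothing)
    return encoded
```

For test or production rows, one would typically fit the per-category statistics once on the full training set and apply them unchanged.
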
Numerical transformations
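- Standardize (z-score): mean 0, std 1. Important for scale-sensitive models such as regularized linear models, SVMs, and NNs. Not needed for tree-based models. Fit the mean and std on the training data only and reuse them for validation and test data.
- Log transform: for right-skewed positive variables (income, prices). Helps linear models and stabilizes variance. Use log(1 + x) if zeros can occur.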
- Quantile / rank transform: convert to uniform [0, 1] via empirical CDF. Robust to outliers, removes skew.
- Binning: discretize a continuous variable. Adds non-linearity to linear models cheaply.
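
The four transformations above could look roughly like this in NumPy. The function names are illustrative, the rank transform shown here maps to (0, 1] rather than exactly [0, 1], and the binning uses equal-frequency (quantile) edges, which is only one of several reasonable choices.

```python
import numpy as np

def standardize(x, mean=None, std=None):
    """Z-score; pass training-set mean/std when transforming new data."""
    x = np.asarray(x, dtype=float)
    mean = x.mean() if mean is None else mean
    std = x.std() if std is None else std
    return (x - mean) / std

def log_transform(x):
    """log(1 + x) for non-negative, right-skewed values."""
    return np.log1p(np.asarray(x, dtype=float))

def rank_transform(x):
    """Map values to (0, 1] via the empirical CDF (ties broken by position)."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x)) + 1  # ranks 1..n
    return ranks / len(x)

def quantile_bin(x, n_bins=4):
    """Assign each value to one of n_bins equal-frequency bins (0..n_bins-1)."""
    x = np.asarray(x, dtype=float)
    # Interior quantile edges, e.g. 25%, 50%, 75% for four bins.
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(x, edges)

# Made-up incomes with one extreme outlier.
incomes = np.array([20_000, 35_000, 40_000, 52_000, 61_000, 75_000, 90_000, 1_200_000])
z = standardize(incomes)          # dominated by the outlier
logged = log_transform(incomes)   # outlier pulled much closer to the rest
ranks = rank_transform(incomes)   # outlier is simply the top rank
bins = quantile_bin(incomes)      # two values per bin
```
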
Date / time features
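A timestamp is almost never useful raw. Extract: hour of day, day of week, day of month, month, year, holiday flag. For seasonality, use sin/cos transforms (for example, sin(2π · hour / 24) and cos(2π · hour / 24)), so that cyclically adjacent values such as 23:00 and 00:00 stay close together.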
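
A small sketch of this extraction using only the Python standard library. The helper names `cyclical` and `time_features` are hypothetical, and the holiday flag is left out because it depends on a calendar the text does not specify.

```python
import math
from datetime import datetime

def cyclical(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def time_features(ts: datetime) -> dict:
    """Calendar fields plus cyclical encodings of hour and weekday."""
    hour_sin, hour_cos = cyclical(ts.hour, 24)
    dow_sin, dow_cos = cyclical(ts.weekday(), 7)
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),  # Monday = 0
        "day_of_month": ts.day,
        "month": ts.month,
        "year": ts.year,
        "hour_sin": hour_sin,
        "hour_cos": hour_cos,
        "dow_sin": dow_sin,
        "dow_cos": dow_cos,
    }

features = time_features(datetime(2024, 1, 1, 23, 30))
```

Both the sine and the cosine are needed: with only one of them, two different hours can map to the same value.
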
Interaction features
Sometimes the relationship is in the cross: "income × debt" matters, but neither matters alone. The options are polynomial features, explicit cross-features (column1 * column2), or relying on tree-based models to discover the interactions themselves.
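
Explicit cross-features are simple to build by hand. The sketch below, assuming NumPy, appends the product of every pair of columns; `add_pairwise_products` is an illustrative name, and the number of new columns grows quadratically with the number of inputs, so in practice one would usually restrict it to a few candidate columns.

```python
import numpy as np
from itertools import combinations

def add_pairwise_products(X, names):
    """Append the product of every pair of columns to X."""
    X = np.asarray(X, dtype=float)
    new_cols, new_names = [], []
    for i, j in combinations(range(X.shape[1]), 2):
        new_cols.append(X[:, i] * X[:, j])
        new_names.append(f"{names[i]}*{names[j]}")
    # column_stack turns each 1-D product into a new column next to X.
    return np.column_stack([X] + new_cols), list(names) + new_names

# Made-up rows: income, debt, age.
X = [[50_000, 10_000, 30],
     [80_000, 60_000, 45]]
X_aug, names_aug = add_pairwise_products(X, ["income", "debt", "age"])
```
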
Text and image features
Pre-NN era: TF-IDF for text, HOG / SIFT for images. NN era: use pre-trained embeddings or fine-tune a pre-trained model. For tabular pipelines that include text, both the classical and the neural approaches still have a place; sometimes simpler is better.
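
To make the simpler option concrete, here is a bare-bones TF-IDF in pure Python. It uses one common variant (term frequency normalized by document length, idf = log(N / df)) and whitespace tokenization; library implementations typically add smoothing and vector normalization, so their numbers will differ.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

vectors = tfidf(["the cat sat", "the dog sat", "the cat ran"])
```

A term that appears in every document, such as "the" here, gets weight zero, which is the point of the idf factor.
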
The 80/20 rule
As a rule of thumb, in real tabular ML projects roughly 80% of the gain comes from feature engineering and roughly 20% from algorithm choice and hyperparameter tuning. Time invested in understanding the data and crafting features beats time invested in the latest gradient boosting variant.