Supervised and unsupervised learning, neural networks, optimization, deep learning architectures, and modern ML systems — the foundations for ML interviews.
Inputs, outputs, hypothesis classes, and what 'learning' actually means.
How squared error, cross-entropy, hinge, and Huber implicitly define what the model optimizes for.
Why infinite capacity doesn't fix everything — the geometry of generalization error.
Why minimizing training loss is not the same as minimizing test loss.
The model everything else is compared against. Closed-form solution, assumptions, when it works.
Ridge, lasso, elastic net — and why the geometry of the penalty determines what kind of solution you get.
Linear models for binary classification — sigmoid, log-odds, and why it's still the workhorse.
Generalizing logistic regression to K classes — softmax, cross-entropy, and the redundancy you need to remove.
Recursive splits, impurity measures, and why a single tree is rarely the right answer.
Bagging plus feature subsampling — variance reduction through deliberate decorrelation.
Sequentially fitting trees to the residuals (more generally, the negative gradient of the loss): the canonical recipe for tabular data.
Second-order optimization, regularized objective, and the engineering tricks that made GBMs production-grade.
The basic feedforward architecture — fully connected layers, activations, and the universal approximation theorem.
Reverse-mode automatic differentiation — the algorithm that makes deep learning possible.
Why ReLU won, what initialization schemes do, and how the two interact.
Dropout, batch norm, weight decay, data augmentation — what they actually do and when each helps.
Translation equivariance, parameter sharing, and the receptive field — why CNNs were the right answer for images.
Modeling sequences with shared parameters across time — and what made plain RNNs unworkable for long sequences.
Query, key, value — and why content-based addressing changed sequence modeling.
Architecture, positional encodings, layer norm placement, and why this design dominated.
SGD, Adam, AdamW: what each one actually does and how they differ.
Warmup, cosine, step decay: how to schedule the learning rate, the single most impactful hyperparameter after batch size.
FP16/BF16 forward and backward: a speedup of around 2x on supporting hardware and roughly half the activation memory, if you do it right.
Data, model, and pipeline parallelism — how to train models that don't fit on one GPU.
K-fold, stratified, group, and time-series CV — and the leakage pitfalls each is designed to avoid.
Grid, random, Bayesian, and Hyperband — when each makes sense.
Holding out users vs. holding out predictions — what's actually being measured and what isn't.
When predicted probabilities aren't probabilities, and why your 0.5 cutoff is rarely the right one.
Why training-serving skew happens and the systems built to prevent it.
Batch vs online, latency budgets, batching, and the operational realities of running models in production.
Input drift, prediction drift, performance decay — what to alert on and what to ignore.
From notebook to production — the engineering glue that ties training, deployment, and monitoring together.