The hard parts of ML in production aren't the model. They're the pipelines: how data gets fetched, how the model gets trained, how it gets deployed, how it gets monitored, how it gets rolled back when something goes wrong.
The minimum viable pipeline
1. Data: a reliable source of training data (warehouse query, Parquet snapshot, feature store).
2. Training: a script that reads data, trains a model, evaluates it, writes the model and metrics to a model registry.
3. Deployment: a process that takes a registered model and rolls it out to serving infrastructure (Kubernetes pods, SageMaker endpoint, custom).
4. Monitoring: tracks predictions, drift, performance.
5. Rollback: a one-button path to the previous model.
If any of these is missing, you don't have a production ML system; you have a science project.
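To make the registry, deployment, and rollback steps concrete, here is a minimal sketch in Python. The `InMemoryRegistry` class and its method names are hypothetical stand-ins invented for illustration; a real system would back this with an actual model registry and serving layer rather than in-memory lists.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ModelVersion:
    version: int
    artifact: Any               # stand-in for a serialized model
    metrics: Dict[str, float]   # evaluation metrics written at training time


@dataclass
class InMemoryRegistry:
    """Hypothetical in-memory stand-in for a model registry plus deploy history."""
    versions: List[ModelVersion] = field(default_factory=list)
    history: List[int] = field(default_factory=list)  # deployed versions, oldest first

    def register(self, artifact: Any, metrics: Dict[str, float]) -> ModelVersion:
        """Store a trained model with its metrics under the next version number."""
        mv = ModelVersion(len(self.versions) + 1, artifact, dict(metrics))
        self.versions.append(mv)
        return mv

    def deploy(self, version: int) -> None:
        """Mark a registered version as live."""
        if not any(v.version == version for v in self.versions):
            raise KeyError(f"version {version} is not registered")
        self.history.append(version)

    @property
    def live(self) -> Optional[int]:
        """The currently deployed version, or None if nothing is deployed."""
        return self.history[-1] if self.history else None

    def rollback(self) -> int:
        """The 'one button': drop the current deployment and return to the previous one."""
        if len(self.history) < 2:
            raise RuntimeError("no previous deployment to roll back to")
        self.history.pop()
        return self.history[-1]
```

The point of keeping a deployment history separate from the version list is that rollback becomes a lookup, not a retraining job.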
Tooling landscape
- Orchestration: Airflow, Prefect, Dagster, Kubeflow Pipelines. Define the steps as a DAG with clear retries and dependencies.
- Experiment tracking: MLflow, Weights & Biases, Neptune. Log hyperparameters, metrics, artifacts. Critical for reproducibility.
- Model registry: MLflow Model Registry, Vertex AI Model Registry, SageMaker Model Registry. Versioned storage of trained models with metadata.
- Feature stores: Feast, Tecton (covered earlier).
- CI/CD: GitHub Actions, GitLab CI/CD, Buildkite. Lint, test, deploy. Same tools as backend engineering; same value.
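As an illustration of what "a DAG with clear retries and dependencies" means, the sketch below runs tasks in dependency order using only the Python standard library. It is not how Airflow, Prefect, Dagster, or Kubeflow Pipelines are implemented or configured; `run_dag` and its parameters are hypothetical names chosen for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Set


def run_dag(tasks: Dict[str, Callable[[], None]],
            deps: Dict[str, Set[str]],
            max_retries: int = 2) -> List[str]:
    """Run tasks in dependency order, retrying each failing task.

    `deps` maps a task name to the set of tasks that must finish first.
    A task that still fails after `max_retries` retries aborts the run.
    """
    completed: List[str] = []
    for name in TopologicalSorter(deps).static_order():
        task = tasks[name]
        for attempt in range(max_retries + 1):
            try:
                task()
                break  # success: stop retrying
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted: surface the failure
        completed.append(name)
    return completed
```

A call such as `run_dag(tasks, {"train": {"fetch"}, "evaluate": {"train"}, "register": {"evaluate"}})` would run the four steps in that order. Real orchestrators add scheduling, backoff between retries, logging, and parallelism on top of this basic idea.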
Reproducibility
Treat ML experiments like software builds. Pin everything:
- Data version (snapshot timestamp or hash)
- Code commit
- Hyperparameter config
- Dependency versions (requirements.txt, conda env)
- Random seed (yes, set it)
If you can't reproduce yesterday's run today, you can't reason about whether changes are improvements.
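One way to pin everything in the list above is to write a run manifest alongside each trained model. The sketch below is a minimal version under stated assumptions: the field names, the `build_manifest` and `seeded_rng` helpers, and the choice of SHA-256 are all illustrative, not a standard format.

```python
import hashlib
import json
import random
from typing import Any, Dict


def fingerprint(data: bytes) -> str:
    """Content hash used to identify a data snapshot or a manifest."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(data: bytes, code_commit: str, config: Dict[str, Any],
                   dependencies: Dict[str, str], seed: int) -> Dict[str, Any]:
    """Record every input that determines the outcome of a training run."""
    manifest: Dict[str, Any] = {
        "data_sha256": fingerprint(data),
        "code_commit": code_commit,
        "config": config,
        "dependencies": dependencies,
        "seed": seed,
    }
    # Derive a run ID from the pinned inputs: same inputs, same ID.
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["run_id"] = fingerprint(canonical)[:12]
    return manifest


def seeded_rng(manifest: Dict[str, Any]) -> random.Random:
    """Build a random generator from the pinned seed."""
    return random.Random(manifest["seed"])
```

Two runs with identical manifests should produce identical results; if they don't, some input is still unpinned. This sketch seeds only Python's `random` module, so any framework-specific generators (NumPy, PyTorch, and so on) would need seeding as well.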
Deployment strategies
- Blue-green: spin up the new version alongside the old; flip traffic instantly. Easy rollback.
- Canary: route 1% of traffic to the new version, then 5%, then 25%. Catches issues with limited blast radius.
- Shadow: send the new model the same inputs as the old, log its predictions, never serve them. Lets you compare without user impact.
Pick based on risk: for high-stakes changes, run shadow first, then canary, then a full blue-green cutover; blue-green alone is fine for low-risk swaps of an already-validated model.
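The canary and shadow strategies can be sketched in a few lines of Python. This is an illustration under assumptions, not a production router: `bucket`, `route_canary`, and `serve_with_shadow` are hypothetical names, models are plain callables, and real deployments usually split traffic at the load balancer or service mesh instead of in application code.

```python
import hashlib
from typing import Any, Callable, List, Tuple


def bucket(request_id: str) -> float:
    """Map a request ID to a stable value from 0.00 to 99.99."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") % 10_000) / 100


def route_canary(request_id: str, canary_percent: float) -> str:
    """Send roughly `canary_percent` of traffic to the new model.

    Hash-based bucketing keeps the assignment sticky: a request ID routed
    to the new model at 5% is still routed there at 25%.
    """
    return "new" if bucket(request_id) < canary_percent else "old"


def serve_with_shadow(features: Any,
                      old_model: Callable[[Any], Any],
                      new_model: Callable[[Any], Any],
                      shadow_log: List[Tuple[Any, Any]]) -> Any:
    """Serve the old model's prediction; log the new model's for comparison."""
    served = old_model(features)
    try:
        shadow_log.append((served, new_model(features)))
    except Exception as exc:
        # A failing shadow model must never affect the served response.
        shadow_log.append((served, exc))
    return served
```

Raising `canary_percent` from 1 to 5 to 25 widens the same set of requests instead of reshuffling it, which makes before-and-after comparisons easier to read.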