How Do You Monitor and Maintain ML Models in Production (MLOps)?
Concept
Shipping a model is the starting line, not the finish. MLOps ensures models remain accurate, reliable, and cost-effective after deployment by monitoring data, predictions, and business outcomes — and by closing the loop with retraining and governance.
Production is messy: upstream schemas change, user behavior drifts, and infra fails. A robust monitoring plan detects issues early and provides clear runbooks to fix them.
1) What to Monitor (Beyond Accuracy)
A. Data & Features
- Schema & type checks: column presence, dtypes, allowed domains.
- Data quality: missing rates, invalid values, out-of-range values, deduplication (a minimal check sketch follows these lists).
- Distribution shift (covariate drift): compare live feature distributions vs. training baseline.
B. Predictions
- Score distribution: mean/quantiles, saturation at 0 or 1, unexpected spikes.
- Calibration: predicted probabilities vs. observed frequencies (when labels arrive).
- Fairness slices: performance across cohorts (e.g., region, device).
C. Business KPIs
- Conversion, revenue lift, fraud catch rate, SLA latency, cost per prediction.
D. System Health
- Latency, throughput, error rates (4xx/5xx), GPU/CPU utilization, queue lag.
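A minimal sketch of the schema and data-quality checks above, assuming live features arrive as a pandas DataFrame; `FEATURE_SPEC`, its columns, and the thresholds are illustrative, not a prescribed format.

```python
# Sketch of schema / data-quality checks; FEATURE_SPEC and thresholds are illustrative.
import pandas as pd

FEATURE_SPEC = {
    "amount":      {"dtype": "float64", "min": 0.0, "max": 50_000.0, "max_missing": 0.01},
    "device_type": {"dtype": "object",  "domain": {"ios", "android", "web"}, "max_missing": 0.0},
}

def check_features(df: pd.DataFrame) -> list[str]:
    issues = []
    for col, spec in FEATURE_SPEC.items():
        if col not in df.columns:                        # schema: column presence
            issues.append(f"missing column: {col}")
            continue
        if str(df[col].dtype) != spec["dtype"]:          # schema: dtype check
            issues.append(f"{col}: dtype {df[col].dtype} != {spec['dtype']}")
        missing = df[col].isna().mean()                  # quality: missing rate
        if missing > spec["max_missing"]:
            issues.append(f"{col}: missing rate {missing:.2%}")
        if "min" in spec:                                # quality: out-of-range values
            bad = ((df[col] < spec["min"]) | (df[col] > spec["max"])).mean()
            if bad > 0:
                issues.append(f"{col}: {bad:.2%} out of range")
        if "domain" in spec:                             # quality: allowed domain
            bad = (~df[col].dropna().isin(spec["domain"])).mean()
            if bad > 0:
                issues.append(f"{col}: {bad:.2%} outside domain")
    return issues
```

Run this on each scoring batch (or a sample of online traffic) and page or block the pipeline when the issue list is non-empty.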
2) Detecting Drift (MDX-safe formulas)
Use simple, explainable statistics that are easy to compute online.
Population Stability Index (PSI):
PSI = sum_over_bins( (p_live - p_base) * ln( p_live / p_base ) )
# Rule of thumb:
# < 0.1 : stable
# 0.1–0.25: moderate shift
# > 0.25 : significant shift (investigate)
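A minimal PSI sketch in Python (numpy only); the binning choice (deciles from the training baseline) and the epsilon floor are implementation details you may tune.

```python
# PSI over shared bins; bin edges come from the training baseline.
import numpy as np

def psi(baseline: np.ndarray, live: np.ndarray, n_bins: int = 10, eps: float = 1e-6) -> float:
    # Quantile edges from the baseline so both samples share the same bins;
    # live values are clipped into the baseline range before counting.
    edges = np.quantile(baseline, np.linspace(0.0, 1.0, n_bins + 1))
    p_base = np.histogram(baseline, bins=edges)[0] / len(baseline)
    p_live = np.histogram(np.clip(live, edges[0], edges[-1]), bins=edges)[0] / len(live)
    p_base = np.clip(p_base, eps, None)   # avoid log(0) and division by zero
    p_live = np.clip(p_live, eps, None)
    return float(np.sum((p_live - p_base) * np.log(p_live / p_base)))
```

For example, alert when psi(train_amounts, live_amounts) exceeds 0.25.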
Jensen–Shannon Divergence (JSD):
JSD(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M)
where M = 0.5 * (P + Q)
# Bounded and symmetric; good for comparing histograms.
# The Jensen–Shannon distance is sqrt(JSD).
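If scipy is available, scipy.spatial.distance.jensenshannon computes the distance (the square root of the divergence above) directly; the histograms below are illustrative.

```python
# JS distance between two normalized histograms; scipy returns sqrt(divergence).
import numpy as np
from scipy.spatial.distance import jensenshannon

p = np.array([0.10, 0.40, 0.30, 0.20])     # baseline histogram (sums to 1)
q = np.array([0.05, 0.35, 0.35, 0.25])     # live histogram (sums to 1)
js_distance = jensenshannon(p, q, base=2)  # in [0, 1] with base-2 logs
js_divergence = js_distance ** 2
```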
Calibration (Brier score):
Brier = mean( (y_hat - y_true)^2 )
# Lower is better; track by time and by cohort.
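A quick way to track both the Brier score and a reliability check with scikit-learn; the arrays here are illustrative.

```python
# Brier score plus a simple reliability (calibration) curve check.
import numpy as np
from sklearn.metrics import brier_score_loss
from sklearn.calibration import calibration_curve

y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_prob = np.array([0.2, 0.7, 0.9, 0.3, 0.6, 0.1, 0.4, 0.8])

brier = brier_score_loss(y_true, y_prob)                    # lower is better
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=4)
# For a well-calibrated model, frac_pos tracks mean_pred in each bin.
```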
When labels are delayed (common in fraud or credit), track proxy metrics (agreement with a strong baseline, rule-based triggers) until ground truth arrives.
3) Retraining & Release Strategies
A. Triggers
- Time-based: retrain weekly or monthly.
- Data-based: PSI or JSD exceeds threshold; schema change detected.
- Performance-based: drop in AUC/F1, rise in calibration error, or KPI degradation.
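In code, these triggers often reduce to a simple OR; a sketch with illustrative thresholds (weekly cadence, PSI above 0.25, a 2-point AUC drop), not a standard.

```python
# Sketch of combined retrain triggers; names and thresholds are illustrative.
from datetime import datetime, timedelta, timezone

def should_retrain(last_trained: datetime, psi_max: float,
                   auc_live: float, auc_baseline: float) -> bool:
    # Any single trigger is enough; assumes last_trained is timezone-aware (UTC).
    time_trigger = datetime.now(timezone.utc) - last_trained > timedelta(days=7)  # time-based
    drift_trigger = psi_max > 0.25                                                # data-based
    perf_trigger = auc_live < auc_baseline - 0.02                                 # performance-based
    return time_trigger or drift_trigger or perf_trigger
```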
B. Pipelines
- Continuous Training (CT): scheduled feature/label extraction → train → validate → push to a model registry.
- Champion/Challenger: shadow a new model; promote only if it beats current champion in online A/B or canary.
C. Validation Gates
- Holdout metrics (AUC, logloss, RMSE).
- Canary metrics: latency, error rate, tail p95/p99.
- Safety checks: calibration within tolerance, fairness deltas within policy.
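One way these gates might look in code; every metric name and threshold below is illustrative and should come from your own SLOs and policy.

```python
# Minimal promotion-gate sketch; metric names and thresholds are illustrative.
def passes_gates(challenger: dict, champion: dict) -> bool:
    checks = [
        challenger["auc"] >= champion["auc"] + 0.002,     # holdout: must beat the champion
        challenger["p99_latency_ms"] <= 150,              # canary: tail latency budget
        challenger["error_rate"] <= 0.001,                # canary: serving errors
        abs(challenger["calibration_error"]) <= 0.02,     # safety: calibration tolerance
        challenger["max_fairness_delta"] <= 0.01,         # safety: fairness policy
    ]
    return all(checks)

# Promote only if every gate passes during the canary window, e.g.:
# if passes_gates(challenger_metrics, champion_metrics): promote the challenger in the registry.
```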
4) Reference Architecture (Production-grade)
- Feature layer: batch features in a lake/warehouse (Parquet) plus online features in a low-latency store (e.g., Redis). Keep offline/online parity via a feature store.
- Serving layer: real-time inference service (FastAPI, gRPC) with autoscaling; batch scoring for backfills (a minimal serving sketch follows this list).
- Observability: metrics (Prometheus), logs (ELK), traces (OpenTelemetry), model metrics (WhyLabs, Arize, Fiddler, custom).
- Registry & lineage: MLflow or SageMaker Model Registry; artifact versioning; dataset hashes; reproducible training runs.
- Orchestration: Airflow/Prefect/Dagster for CT; event triggers on data arrival.
- Governance: access control, PII handling, approval workflows, rollback buttons.
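A minimal sketch of the serving-plus-observability slice, assuming FastAPI and prometheus_client; the model call is stubbed and the metric names are illustrative.

```python
# Minimal serving + metrics sketch; the model object and version label are placeholders.
from fastapi import FastAPI
from prometheus_client import Counter, Histogram, make_asgi_app

app = FastAPI()
app.mount("/metrics", make_asgi_app())          # endpoint scraped by Prometheus

PREDICTIONS = Counter("predictions_total", "Predictions served", ["model_version"])
LATENCY = Histogram("prediction_latency_seconds", "Inference latency")

@app.post("/predict")
def predict(features: dict) -> dict:
    with LATENCY.time():                        # records serving latency
        score = 0.5                             # placeholder for model.predict_proba(...)
    PREDICTIONS.labels(model_version="v1").inc()
    return {"score": score, "model_version": "v1"}
```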
5) Real-World Scenarios
Fraud Detection (Payments)
- Problem: adversaries adapt; feature distributions shift overnight.
- Approach: hourly PSI on key signals (transaction amount, device risk score). If PSI > 0.25, page on-call and throttle risky cohorts; nightly retrain with recent labels.
- Result: reduced false negatives during attack spikes while staying within the latency SLO.
Search/Ranking (Marketplace)
- Problem: seasonal shifts and supply shocks degrade CTR.
- Approach: calibration monitoring + weekly champion/challenger online test; counterfactual evaluation with logged propensities to pre-screen candidates.
- Result: sustained CTR lift without regressions during promotions.
6) Fairness, Safety, and Compliance
- Track metrics by cohort: TPR/FPR parity, equalized odds, calibration gaps (a cohort sketch follows this list).
- Maintain model cards: intended use, training data, known limitations, caveats.
- Keep audit trails: who trained what, with which data, and when.
- Enforce kill-switch and safe defaults if monitoring fails.
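A small pandas sketch for per-cohort TPR/FPR tracking; the score, label, and region column names are assumptions about your scored, labeled log.

```python
# Per-cohort TPR / FPR sketch; column names ("score", "label", "region") are illustrative.
import pandas as pd

def cohort_rates(df: pd.DataFrame, cohort_col: str = "region", threshold: float = 0.5) -> pd.DataFrame:
    df = df.assign(pred=(df["score"] >= threshold).astype(int))
    def rates(g: pd.DataFrame) -> pd.Series:
        pos, neg = g["label"] == 1, g["label"] == 0
        return pd.Series({
            "tpr": (g.loc[pos, "pred"] == 1).mean(),   # true positive rate
            "fpr": (g.loc[neg, "pred"] == 1).mean(),   # false positive rate
            "n": len(g),                               # small cohorts may yield NaN rates
        })
    return df.groupby(cohort_col).apply(rates)

# Alert if max(tpr) - min(tpr) across cohorts exceeds the policy threshold.
```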
7) Runbooks (What to Do When Things Break)
- Alarm: PSI high on device_type
  - Check recent product rollouts, bot traffic, geo mix shift.
  - Hotfix: enable robust rules; gradually reweight features; consider a temporary threshold increase.
  - Schedule a fast retrain on the last 7–14 days of data.
- Alarm: Latency p99 up
  - Inspect feature-fetch timings; cache hot features; batch external calls.
  - Fall back to a distilled, lighter model for peak traffic windows.
- Alarm: Calibration off
  - Refit Platt scaling or isotonic regression on the latest data; patch via the registry without a full model retrain (sketch below).
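For the calibration runbook, a sketch of refitting an isotonic calibrator on recent labeled traffic with scikit-learn; the arrays are illustrative, and the base model itself is not retrained.

```python
# Fit an isotonic mapping on recent scored-and-labeled traffic, apply it at serving time.
import numpy as np
from sklearn.isotonic import IsotonicRegression

recent_scores = np.array([0.10, 0.30, 0.55, 0.70, 0.90, 0.95])
recent_labels = np.array([0,    0,    1,    0,    1,    1   ])

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(recent_scores, recent_labels)

calibrated = calibrator.predict(np.array([0.65, 0.85]))  # applied on top of raw model scores
```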
8) Best Practices (Hard-earned)
- Test like prod: same feature code for training and serving; avoid training-serving skew.
- Alert sanity: actionable alerts only; include links to dashboards and playbooks.
- Slice everything: averages lie — always monitor by cohort.
- Shadow before ship: run challengers in shadow to de-risk releases.
- Budget & carbon: track inference cost per 1k requests and carbon footprint; distill when over budget.
Tips for Application
- When to discuss: system design, production ML, or “why did your model degrade in the wild?”
- Interview Tip: quantify your ops impact:
“We added PSI-based drift alarms and a weekly CT job; incident count fell 60%, and time-to-restore dropped from 6h to 45m. Canary + shadow prevented two bad rollouts.”
Key takeaway:
Great ML isn’t just about a high offline metric — it’s about operating the model: monitoring the right signals, reacting fast, and retraining safely so performance holds up in the real world.