
How Do You Monitor and Maintain ML Models in Production (MLOps)?

Hard · Common · Major: Data Science · Companies: Airbnb, Uber

Concept

Shipping a model is the starting line, not the finish. MLOps ensures models remain accurate, reliable, and cost-effective after deployment by monitoring data, predictions, and business outcomes — and by closing the loop with retraining and governance.

Production is messy: upstream schemas change, user behavior drifts, and infra fails. A robust monitoring plan detects issues early and provides clear runbooks to fix them.


1) What to Monitor (Beyond Accuracy)

A. Data & Features

  • Schema & type checks: column presence, dtypes, allowed domains (see the sketch after this list).
  • Data quality: missing rates, invalid values, out-of-range, deduplication.
  • Distribution shift (covariate drift): compare live feature distributions vs. training baseline.
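
A minimal sketch of these checks in pandas; the expected columns, dtypes, and thresholds below are illustrative, not a fixed schema:

import pandas as pd

EXPECTED_DTYPES = {"amount": "float64", "device_type": "object"}  # illustrative schema

def check_batch(df: pd.DataFrame) -> list[str]:
    issues = []
    # Schema & type checks: column presence and dtypes.
    for col, dtype in EXPECTED_DTYPES.items():
        if col not in df.columns:
            issues.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            issues.append(f"unexpected dtype for {col}: {df[col].dtype}")
    # Data-quality checks: missing rate and out-of-range values.
    if "amount" in df.columns:
        if df["amount"].isna().mean() > 0.05:      # illustrative 5% threshold
            issues.append("amount missing rate above 5%")
        if (df["amount"] < 0).any():
            issues.append("negative amounts present")
    return issues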

B. Predictions

  • Score distribution: mean/quantiles, saturation at 0 or 1, unexpected spikes.
  • Calibration: predicted probabilities vs. observed frequencies (when labels arrive).
  • Fairness slices: performance across cohorts (e.g., region, device).

C. Business KPIs

  • Conversion, revenue lift, fraud catch rate, SLA latency, cost per prediction.

D. System Health

  • Latency, throughput, error rates (4xx/5xx), GPU/CPU utilization, queue lag.

2) Detecting Drift

Use simple, explainable statistics that are easy to compute online.

Population Stability Index (PSI):

PSI = sum_over_bins( (p_live - p_base) * ln( p_live / p_base ) )

# Rule of thumb:
#   < 0.1    : stable
#   0.1–0.25 : moderate shift
#   > 0.25   : significant shift (investigate)
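
A short Python sketch of PSI over shared histogram bins; the bin count and epsilon smoothing are implementation choices, not part of the definition:

import numpy as np

def psi(baseline: np.ndarray, live: np.ndarray, bins: int = 10, eps: float = 1e-6) -> float:
    # Bin edges come from the training baseline so both windows share the same bins.
    edges = np.histogram_bin_edges(baseline, bins=bins)
    p_base, _ = np.histogram(baseline, bins=edges)
    p_live, _ = np.histogram(live, bins=edges)
    # Normalize to proportions; epsilon avoids log/zero issues in empty bins.
    p_base = p_base / p_base.sum() + eps
    p_live = p_live / p_live.sum() + eps
    return float(np.sum((p_live - p_base) * np.log(p_live / p_base)))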

Jensen–Shannon Divergence (JSD):

JSD(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M)
where M = 0.5 * (P + Q)

# Bounded and symmetric; good for comparing histograms.
# The JS distance is the square root of this divergence.
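
SciPy exposes this as a distance (the square root of the divergence above), which is usually all a drift alarm needs; the histograms below are illustrative:

import numpy as np
from scipy.spatial.distance import jensenshannon

p_base = np.array([0.2, 0.5, 0.3])   # baseline feature histogram (proportions)
p_live = np.array([0.1, 0.4, 0.5])   # live feature histogram (proportions)
jsd = jensenshannon(p_base, p_live, base=2)   # bounded in [0, 1] with base 2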

Calibration (Brier score):

Brier = mean( (y_hat - y_true)^2 )

# Lower is better; track it over time and by cohort.
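
scikit-learn provides this directly; the labels and probabilities below are placeholders for a window of labeled traffic:

import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 1, 1, 0, 1])            # delayed ground-truth labels
y_prob = np.array([0.1, 0.8, 0.6, 0.3, 0.9])  # model probabilities for the same rows
print(brier_score_loss(y_true, y_prob))       # lower is better; compute per cohort too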

When labels are delayed (common in fraud or credit), track proxy metrics (agreement with a strong baseline, rule-based triggers) until ground truth arrives.


3) Retraining & Release Strategies

A. Triggers

  • Time-based: retrain weekly or monthly.
  • Data-based: PSI or JSD exceeds threshold; schema change detected.
  • Performance-based: drop in AUC/F1, rise in calibration error, or KPI degradation (a combined trigger check is sketched after this list).
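
Combining the triggers above into one retrain decision might look like the sketch below; the thresholds and 30-day cadence are illustrative policy, not fixed rules:

def should_retrain(psi_value: float, auc_now: float, auc_baseline: float,
                   days_since_train: int) -> bool:
    drift_trigger = psi_value > 0.25                  # data-based
    perf_trigger = (auc_baseline - auc_now) > 0.02    # performance-based
    time_trigger = days_since_train >= 30             # time-based
    return drift_trigger or perf_trigger or time_trigger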

B. Pipelines

  • Continuous Training (CT): scheduled feature/label extraction → train → validate → push to a model registry.
  • Champion/Challenger: shadow a new model; promote only if it beats current champion in online A/B or canary.

C. Validation Gates

  • Holdout metrics (AUC, logloss, RMSE).
  • Canary metrics: latency, error rate, tail p95/p99.
  • Safety checks: calibration within tolerance, fairness deltas within policy (a gate sketch follows this list).
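
A promotion gate can be expressed as one boolean check over the challenger's holdout and canary metrics; the metric names and thresholds below are illustrative:

def passes_gates(m: dict) -> bool:
    return (
        m["auc"] >= m["champion_auc"]              # holdout quality vs. champion
        and m["p99_latency_ms"] <= 150             # canary tail-latency budget
        and m["error_rate"] <= 0.001               # canary reliability
        and abs(m["calibration_delta"]) <= 0.02    # safety: calibration tolerance
        and m["max_fairness_gap"] <= 0.05          # safety: cohort parity policy
    )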

4) Reference Architecture (Production-grade)

  • Feature layer: batch features in a lake/warehouse (Parquet) plus online features in a low-latency store (e.g., Redis). Keep offline/online parity via a feature store.
  • Serving layer: real-time inference service (FastAPI, gRPC) with autoscaling; batch scoring for backfills (a minimal endpoint is sketched after this list).
  • Observability: metrics (Prometheus), logs (ELK), traces (OpenTelemetry), model metrics (WhyLabs, Arize, Fiddler, custom).
  • Registry & lineage: MLflow or SageMaker Model Registry; artifact versioning; dataset hashes; reproducible training runs.
  • Orchestration: Airflow/Prefect/Dagster for CT; event triggers on data arrival.
  • Governance: access control, PII handling, approval workflows, rollback buttons.
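
A skeletal real-time endpoint in FastAPI, assuming a registry artifact saved with joblib and two illustrative features; in practice the model load, feature fetch, and logging hook would come from your own stack:

import joblib
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")   # hypothetical artifact pulled from the registry

class Features(BaseModel):
    amount: float
    device_risk: float

@app.post("/predict")
def predict(f: Features) -> dict:
    score = float(model.predict_proba(np.array([[f.amount, f.device_risk]]))[0, 1])
    # Emit the score to your metrics pipeline here so score distributions are monitored.
    return {"score": score}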

5) Real-World Scenarios

Fraud Detection (Payments)

  • Problem: adversaries adapt; feature distributions shift overnight.
  • Approach: hourly PSI on key signals (transaction amount, device risk score). If PSI > 0.25, page on-call and throttle risky cohorts; nightly retrain with recent labels.
  • Result: reduced false negatives during attack spikes while staying within the latency SLO.

Search/Ranking (Marketplace)

  • Problem: seasonal shifts and supply shocks degrade CTR.
  • Approach: calibration monitoring + weekly champion/challenger online test; counterfactual evaluation with logged propensities to pre-screen candidates.
  • Result: sustained CTR lift without regressions during promotions.

6) Fairness, Safety, and Compliance

  • Track metrics by cohort: TPR/FPR parity, equalized odds, calibration gaps (a per-cohort sketch follows this list).
  • Maintain model cards: intended use, training data, known limitations, caveats.
  • Keep audit trails: who trained what, with which data, and when.
  • Enforce kill-switch and safe defaults if monitoring fails.
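
Per-cohort TPR/FPR can be computed with a simple groupby; the column names here (y_true, y_pred, cohort) are assumptions about your logging schema:

import pandas as pd

def cohort_rates(df: pd.DataFrame) -> pd.DataFrame:
    # df has columns y_true, y_pred (0/1) and cohort (e.g., region or device).
    def rates(g: pd.DataFrame) -> pd.Series:
        tp = ((g.y_pred == 1) & (g.y_true == 1)).sum()
        fn = ((g.y_pred == 0) & (g.y_true == 1)).sum()
        fp = ((g.y_pred == 1) & (g.y_true == 0)).sum()
        tn = ((g.y_pred == 0) & (g.y_true == 0)).sum()
        return pd.Series({"tpr": tp / max(tp + fn, 1), "fpr": fp / max(fp + tn, 1)})
    return df.groupby("cohort").apply(rates)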

7) Runbooks (What to Do When Things Break)

  1. Alarm: PSI high on device_type
  • Check recent product rollouts, bot traffic, geo mix shift.
  • Hotfix: enable robust rules; gradually reweight features; consider temporarily raising decision thresholds.
  • Schedule a fast retrain on the last 7–14 days of data.
  2. Alarm: Latency p99 up
  • Inspect feature fetch timings; cache hot features; batch external calls.
  • Fall back to a distilled, lighter model during peak traffic windows.
  3. Alarm: Calibration off
  • Refit Platt scaling/isotonic on the latest labeled data; patch via the registry without a full retrain (a recalibration sketch follows).
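
A recalibration patch of this kind is a small fit on recent labeled traffic; the arrays below are placeholders:

import numpy as np
from sklearn.isotonic import IsotonicRegression

recent_scores = np.array([0.1, 0.3, 0.4, 0.6, 0.8, 0.9])  # raw model scores (placeholder)
recent_labels = np.array([0, 0, 1, 0, 1, 1])               # observed outcomes (placeholder)

iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(recent_scores, recent_labels)
calibrated = iso.predict(np.array([0.25, 0.75]))  # patched probabilities, no full retrain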

8) Best Practices (Hard-earned)

  • Test like prod: same feature code for training and serving; avoid training-serving skew.
  • Alert sanity: actionable alerts only; include links to dashboards and playbooks.
  • Slice everything: averages lie — always monitor by cohort.
  • Shadow before ship: run challengers in shadow to de-risk releases.
  • Budget & carbon: track inference cost per 1k requests and footprint; distill when over budget.

Tips for Application

  • When to discuss: system design, production ML, or “why did your model degrade in the wild?”
  • Interview Tip: quantify your ops impact:

    “We added PSI-based drift alarms and a weekly CT job; incident count fell 60%, and time-to-restore dropped from 6h to 45m. Canary + shadow prevented two bad rollouts.”


Key takeaway:
Great ML isn’t just about a high offline metric — it’s about operating the model: monitoring the right signals, reacting fast, and retraining safely so performance holds up in the real world.