concept · #ML #DevOps #Data #Governance #Platform

Machine Learning Operations (MLOps)

MLOps connects ML development, deployment and operations through shared processes, automation and governance so that models run reliably in production.

Emerging
High

Classification

  • High
  • Organizational
  • Organizational
  • Intermediate

Technical context

  • Kubernetes / container orchestration
  • CI/CD systems (e.g. GitHub Actions, GitLab CI)
  • Monitoring and observability tools (e.g. Prometheus)

Principles & goals

  • Version data, models and pipelines (a minimal sketch follows this list)
  • Automate tests and validation
  • Separate infrastructure from business logic
Iterate
Enterprise, Domain, Team
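
To make the versioning principle concrete, a minimal sketch (assuming plain files on disk and a JSON metadata record; the paths and record layout are illustrative, not a prescribed format) shows how a training run can tie data, hyperparameters and the resulting model together via content hashes:

```python
import hashlib
import json
from pathlib import Path

def file_sha256(path: Path) -> str:
    """Content hash of a file, used as an immutable version identifier."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def record_run(data_path: Path, params: dict, model_path: Path, out_dir: Path) -> Path:
    """Write a small metadata record that ties data, config and model together."""
    record = {
        "data_sha256": file_sha256(data_path),
        "params": params,                      # hyperparameters used for this run
        "model_sha256": file_sha256(model_path),
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"run-{record['model_sha256'][:12]}.json"
    out_path.write_text(json.dumps(record, indent=2, sort_keys=True))
    return out_path
```

With such a record stored per run, any production model can be traced back to the exact data and configuration that produced it.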

Use cases & scenarios

Compromises

  • Insufficient monitoring leads to gradual quality degradation
  • Over-automation without governance increases failure risk
  • Data access or privacy breaches
  • Version everything (code, data, models, config)
  • Automate tests at data, model and integration levels (see the validation sketch below)
  • Define clear metrics and alert thresholds for production
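
As an illustration of the last two practices, the following sketch gates a deployment on simple data- and model-level checks; the schema fields, metric names and threshold values are assumptions made for the example, not part of MLOps itself:

```python
def validate_data(rows: list[dict]) -> list[str]:
    """Cheap data-level checks: non-empty batch, required fields present."""
    errors = []
    if not rows:
        errors.append("empty training batch")
    required = {"user_id", "label"}          # assumed schema for this sketch
    for i, row in enumerate(rows):
        missing = required - row.keys()
        if missing:
            errors.append(f"row {i}: missing fields {sorted(missing)}")
    return errors

def validate_model(metrics: dict, thresholds: dict) -> list[str]:
    """Model-level checks: every monitored metric must clear its threshold."""
    return [
        f"{name}={metrics.get(name)} below threshold {limit}"
        for name, limit in thresholds.items()
        if metrics.get(name, 0.0) < limit
    ]

if __name__ == "__main__":
    problems = validate_model({"precision": 0.91, "recall": 0.78},
                              {"precision": 0.90, "recall": 0.80})
    # A non-empty list blocks the deployment step in the pipeline.
    print(problems)
```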

I/O & resources

Inputs

  • Training and production data
  • Model definitions and hyperparameters
  • Infrastructure and deployment configs

Outputs

  • Production model endpoints
  • Monitoring and audit dashboards
  • Versioned artifacts and metadata

Description

Machine Learning Operations (MLOps) is a practice that unifies ML model development, deployment and maintenance across teams. It combines data engineering, CI/CD, monitoring and governance to productionize models reliably. MLOps defines roles, pipelines and automation to ensure reproducibility, scalability and continuous improvement in ML systems.

Benefits

  • Faster, reproducible model deployments
  • Improved monitoring and drift detection
  • Better governance and traceability

Drawbacks

  • High initial integration effort
  • Requires specialized skills
  • Complexity with heterogeneous data sources

Metrics

  • Deployment frequency

    Number of model deployments per time unit.

  • Model performance

    Business-relevant metrics such as precision, recall or AUC in production.

  • MTTR for models

    Average time to recover from model or pipeline failures.
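
The first and third metric can be derived from plain event logs. A minimal sketch, with made-up timestamps standing in for data from the CI/CD system and the incident tracker:

```python
from datetime import datetime, timedelta

# Illustrative event logs; in practice these come from the deployment pipeline
# and the incident tracking system.
deployments = [datetime(2024, 5, d) for d in (2, 9, 16, 23, 30)]
incidents = [  # (failure detected, service restored)
    (datetime(2024, 5, 10, 9, 0), datetime(2024, 5, 10, 11, 30)),
    (datetime(2024, 5, 24, 14, 0), datetime(2024, 5, 24, 14, 45)),
]

window = timedelta(days=30)
deploy_frequency = len(deployments) / (window.days / 7)   # deployments per week

mttr = sum(((end - start) for start, end in incidents), timedelta()) / len(incidents)

print(f"deployment frequency: {deploy_frequency:.1f} per week")
print(f"MTTR: {mttr}")
```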

E‑commerce platform — live recommendations

Rollout of recommendation models using canary deployments and real-time monitoring.
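
A canary rollout needs a stable way to split traffic between the current and the candidate model. One common approach, sketched below with hypothetical endpoint names, is deterministic hash-based assignment; in a real deployment this logic usually lives in the serving layer or service mesh:

```python
import hashlib

def choose_model(user_id: str, canary_share: float = 0.05) -> str:
    """Deterministically route a small share of users to the canary model.

    Hash-based assignment keeps a user on the same variant across requests,
    which keeps per-variant metrics comparable during the rollout.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "recommender-canary" if bucket < canary_share * 10_000 else "recommender-stable"

# Example: count how traffic splits for a batch of users.
assignments = [choose_model(f"user-{i}") for i in range(1_000)]
print(assignments.count("recommender-canary"), "of 1000 requests go to the canary")
```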

Financial services — fraud detection

Continuous validation and retraining to minimize false positives.
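
A sketch of the promotion decision, assuming recent labelled transactions plus predictions from both the production model and the retrained candidate are available; the tolerance value and label encoding are illustrative:

```python
def false_positive_rate(labels: list[int], preds: list[int]) -> float:
    """Share of legitimate transactions (label 0) that were flagged as fraud."""
    negatives = [(y, p) for y, p in zip(labels, preds) if y == 0]
    if not negatives:
        return 0.0
    return sum(p for _, p in negatives) / len(negatives)

def should_promote(labels, prod_preds, candidate_preds, tolerance=0.002) -> bool:
    """Promote the retrained model only if it does not raise the FPR noticeably."""
    return false_positive_rate(labels, candidate_preds) <= (
        false_positive_rate(labels, prod_preds) + tolerance
    )

# Illustrative recent labels and predictions from both models.
labels = [0, 0, 1, 0, 1, 0, 0, 0]
prod   = [0, 1, 1, 0, 1, 0, 0, 0]
cand   = [0, 0, 1, 0, 1, 0, 0, 1]
print(should_promote(labels, prod, cand))
```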

SaaS provider — automated feature pipelines

Feature versioning, tests and reproducible training runs as standard practice.
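
Feature versioning can start as simply as deriving a stable identifier from the feature definitions themselves, so every training run records exactly which feature logic produced its inputs. The definitions below are placeholders, not a real feature store API:

```python
import hashlib
import json

# Illustrative feature definitions; in practice these might be SQL snippets
# or references into a feature store.
FEATURES = {
    "orders_last_30d": "count(orders where ts > now() - 30d)",
    "avg_basket_value": "avg(order_total) over last 90d",
}

def feature_set_version(features: dict[str, str]) -> str:
    """Stable identifier for a feature set: any change to a definition changes it."""
    canonical = json.dumps(features, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# Recorded alongside each training run to keep it reproducible.
print("feature set version:", feature_set_version(FEATURES))
```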

1. Define roles, responsibilities and SLAs
2. Establish versioning for data, models and pipelines
3. Set up CI/CD, monitoring and retraining loops
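
Step 3 can start small: a periodic check that compares live metrics against the alert thresholds and triggers the retraining pipeline when one falls below its limit. The functions below are placeholders for calls into the monitoring stack and the CI/CD system, and the threshold is illustrative:

```python
ALERT_THRESHOLDS = {"auc": 0.85}   # illustrative alert threshold

def fetch_production_metrics() -> dict:
    """Placeholder for a query against the monitoring stack (e.g. Prometheus)."""
    return {"auc": 0.83}

def trigger_retraining(reason: str) -> None:
    """Placeholder for starting the retraining pipeline in the CI/CD system."""
    print(f"retraining triggered: {reason}")

def check_and_retrain() -> None:
    """One iteration of the retraining loop; a scheduler runs this periodically."""
    metrics = fetch_production_metrics()
    for name, limit in ALERT_THRESHOLDS.items():
        if metrics.get(name, 0.0) < limit:
            trigger_retraining(f"{name}={metrics[name]:.3f} fell below {limit}")

check_and_retrain()
```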

⚠️ Technical debt & bottlenecks

  • Unversioned models and feature sets
  • Monolithic pipelines without modularity
  • Missing rollback and canary strategies
  • Data quality and availability
  • Model drift and monitoring gaps
  • Deployment and latency bottlenecks
  • Deploying models directly to production without monitoring
  • Retraining solely on recent labels without validation
  • Ignoring governance and leaving critical data exposed
  • Using accuracy as the sole quality criterion
  • Detecting model drift only after business metrics suffer (see the drift-detection sketch at the end of this section)
  • Underestimating data dependencies
  • Data engineering and feature engineering
  • Machine learning and model validation
  • DevOps skills: infrastructure, CI/CD, SRE
  • Reproducibility of training runs
  • Scalability of training and inference workloads
  • Security and compliance for data and models
  • Regulatory requirements and data protection
  • Limited availability of ML specialists
  • Heterogeneous infrastructure landscape
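
Closing the drift and monitoring gap named above does not require heavy tooling to begin with. A common starting point is the population stability index (PSI) over individual features, sketched here for a categorical feature with made-up values; a frequent rule of thumb treats PSI above roughly 0.25 as significant drift:

```python
import math
from collections import Counter

def psi(expected: list[str], actual: list[str]) -> float:
    """Population stability index over a categorical feature.

    Compares the distribution seen at training time ('expected') with recent
    production traffic ('actual').
    """
    categories = set(expected) | set(actual)
    e_counts, a_counts = Counter(expected), Counter(actual)
    score = 0.0
    for c in categories:
        e = max(e_counts[c] / len(expected), 1e-6)   # avoid log(0)
        a = max(a_counts[c] / len(actual), 1e-6)
        score += (a - e) * math.log(a / e)
    return score

# Illustrative feature values: device type at training time vs. last week.
training = ["mobile"] * 700 + ["desktop"] * 300
recent   = ["mobile"] * 850 + ["desktop"] * 150
print(f"PSI = {psi(training, recent):.3f}")
```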