concept · #Machine Learning · #Platform · #Observability · #Reliability

Model Serving

Concepts and practices for exposing trained machine learning models to production traffic, focusing on scalability, versioning and observability.

Established
Medium

Classification

  • Medium
  • Technical
  • Architectural
  • Intermediate

Technical context

  • CI/CD pipeline (e.g., Jenkins, GitHub Actions)
  • Feature store and model registry
  • Observability stack (Prometheus, Grafana, Jaeger)

Principles & goals

  • Separation of training and serving runtime
  • Versioning and reproducibility of models
  • Measurable SLIs and automated rollbacks
Run
Enterprise, Domain, Team

Use cases & scenarios

Compromises

  • Retired models can cause silent failures for API clients
  • Undetected model regressions in live traffic
  • Security risks from untrusted model artifacts or dependencies
  • Clearly defined API contracts and versioning
  • Automated canary releases with monitoring
  • Use‑case tailored resource profiles and caching
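The pairing of canary releases with monitoring and automated rollback can be illustrated with a small decision function. This is a minimal sketch, not a prescribed implementation: the `WindowStats` type, the function name `canary_decision` and all three thresholds are assumptions chosen for illustration and would need tuning against real SLIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Aggregated metrics for one model version over a monitoring window."""
    requests: int
    errors: int
    p99_latency_ms: float

    @property
    def error_rate(self) -> float:
        # Avoid division by zero for a version that has seen no traffic yet.
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(
    baseline: WindowStats,
    canary: WindowStats,
    min_requests: int = 500,              # assumed minimum sample size
    max_error_rate_delta: float = 0.01,   # assumed absolute tolerance
    max_latency_ratio: float = 1.2,       # assumed relative tolerance
) -> str:
    """Return 'wait', 'rollback' or 'promote' for the canary version."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic to judge the canary
    if canary.error_rate - baseline.error_rate > max_error_rate_delta:
        return "rollback"
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"
```

A deployment controller could call such a function once per monitoring window and shift traffic back to the baseline version whenever it returns "rollback".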

I/O & resources

  • Trained model artifact (e.g., SavedModel, ONNX)
  • Payload schema and API contract
  • Resource profiles and SLA targets
  • Production prediction API with versioning
  • Monitoring metrics, logs and traces
  • Deployment audit and reproduction artifacts
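A payload schema such as the one listed among the inputs is typically enforced at the serving boundary, before a request reaches the model. The sketch below assumes a hypothetical JSON payload with an `instances` field holding fixed-length numeric feature rows; the field name, the function `validate_payload` and the `n_features` parameter are illustrative assumptions, not part of any specific serving framework.

```python
import math
from typing import Any, List


def validate_payload(payload: Any, n_features: int) -> List[List[float]]:
    """Return validated feature rows or raise ValueError for malformed input."""
    if not isinstance(payload, dict) or "instances" not in payload:
        raise ValueError("payload must be an object with an 'instances' field")
    instances = payload["instances"]
    if not isinstance(instances, list) or not instances:
        raise ValueError("'instances' must be a non-empty list")

    rows: List[List[float]] = []
    for i, row in enumerate(instances):
        if not isinstance(row, list) or len(row) != n_features:
            raise ValueError(f"instance {i}: expected {n_features} features")
        clean: List[float] = []
        for value in row:
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError(f"instance {i}: non-numeric feature")
            try:
                number = float(value)
            except OverflowError:
                raise ValueError(f"instance {i}: feature out of range") from None
            if not math.isfinite(number):
                raise ValueError(f"instance {i}: non-finite feature")
            clean.append(number)
        rows.append(clean)
    return rows
```

Rejecting malformed requests with a clear error at this point keeps NaN values, wrong shapes and unexpected types from surfacing later as opaque model failures.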

Description

Model serving describes the systems and infrastructure that expose trained machine learning models to production traffic, handling scaling, versioning, routing and observability. It includes serving APIs, model lifecycle management and resource orchestration. The goal is reliable, low‑latency inference and reproducible deployment pipelines.

  • Fast, scalable delivery of predictions
  • Clear separation of responsibilities in the ML lifecycle
  • Improved observability and fault detection in production

  • Increased infrastructure effort and operating costs
  • Complexity in model compatibility and serialization
  • Latency limits for very resource‑intensive models

  • P95/P99 latency

    95th or 99th percentile of response times, used to measure tail latency rather than the average case.

  • Throughput (requests per second)

    Number of successful inference requests served per time unit.

  • Error rate

    Proportion of failed or erroneous requests relative to total traffic.
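The three metrics above can be computed from raw request records as follows. This is a minimal sketch using the nearest-rank percentile method; the `RequestRecord` type and the function names are assumptions for illustration. In practice these values usually come from the observability stack rather than from hand-written code.

```python
import math
from typing import Dict, NamedTuple, Sequence


class RequestRecord(NamedTuple):
    latency_ms: float
    ok: bool  # True if the inference request succeeded


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def compute_slis(records: Sequence[RequestRecord], window_seconds: float) -> Dict[str, float]:
    """Compute tail latency, throughput and error rate for one time window."""
    if not records:
        raise ValueError("no records in window")
    latencies = [r.latency_ms for r in records]
    successes = sum(1 for r in records if r.ok)
    return {
        "p95_latency_ms": percentile(latencies, 95),
        "p99_latency_ms": percentile(latencies, 99),
        # Throughput counts successful requests only, as defined above.
        "throughput_rps": successes / window_seconds,
        "error_rate": (len(records) - successes) / len(records),
    }
```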

TensorFlow Serving for image classification

Uses TensorFlow Serving to deploy and version a CNN model behind a gRPC API.

KServe on Kubernetes for A/B testing

KServe uses Knative/Ingress routing for canary rollouts and traffic splits.

Batch inference with Spark and ONNX

Scaled offline predictions for reporting, with ONNX Runtime running inside a Spark job.
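The traffic splits mentioned in the A/B testing example can be understood through a generic weighted router. The sketch below is not how KServe or Knative implement routing; it only illustrates the idea of sending a fixed percentage of requests to each model version. Hashing the request ID makes the assignment deterministic, so the same ID always reaches the same version. The function name `pick_version` and the version labels are hypothetical.

```python
import hashlib
from typing import Mapping


def pick_version(request_id: str, weights: Mapping[str, int]) -> str:
    """Deterministically map a request ID to a version by percentage weights."""
    if any(w < 0 for w in weights.values()) or sum(weights.values()) != 100:
        raise ValueError("weights must be non-negative and sum to 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    cumulative = 0
    for version, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return version
    raise AssertionError("unreachable when weights sum to 100")
```

For example, `pick_version(request_id, {"v1": 90, "v2": 10})` would route roughly one request in ten to the hypothetical canary version `v2`.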

  1. Validate, serialize and register model in registry
  2. Build serving image and deploy as a version
  3. Configure routing, autoscaling and observability
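The first of the steps above, registering a serialized model, can be sketched with a toy in-memory registry. It is an illustration under stated assumptions rather than the API of any real model registry: the class and method names are hypothetical, versions are simple incrementing integers, and a SHA-256 checksum stands in for fuller reproducibility metadata.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: int
    sha256: str  # checksum of the serialized artifact
    fmt: str     # e.g. "onnx" or "savedmodel"


class ModelRegistry:
    """Toy in-memory registry; a real one would persist entries durably."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[ModelVersion]] = {}

    def register(self, name: str, artifact: bytes, fmt: str) -> ModelVersion:
        if not artifact:
            raise ValueError("empty artifact")
        versions = self._versions.setdefault(name, [])
        entry = ModelVersion(
            name, len(versions) + 1, hashlib.sha256(artifact).hexdigest(), fmt
        )
        versions.append(entry)
        return entry

    def get(self, name: str, version: Optional[int] = None) -> ModelVersion:
        versions = self._versions[name]
        if version is None:
            return versions[-1]  # latest registered version
        if not 1 <= version <= len(versions):
            raise KeyError(f"{name} has no version {version}")
        return versions[version - 1]

    def verify(self, name: str, version: int, artifact: bytes) -> bool:
        """Check that an artifact matches the checksum recorded at registration."""
        return self.get(name, version).sha256 == hashlib.sha256(artifact).hexdigest()
```

Verifying the checksum when a serving image loads the artifact is one way to catch a mismatch between what was registered and what is actually deployed.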

⚠️ Technical debt & bottlenecks

  • Legacy models without registered metadata
  • Ad‑hoc serving scripts instead of standardized images
  • Missing automated rollback mechanisms

  • GPU/CPU resources
  • Network latency
  • Serialization / I/O

  • Live model changes without staging tests
  • Allocating high resources for rare batch jobs
  • Ignoring drift monitoring after deployment
  • Underestimating serialization incompatibilities
  • Lack of protection against malformed model inputs
  • Insufficient canaries before full rollout

  • Fundamentals in machine learning and model formats
  • Knowledge of containerization and orchestration
  • Experience with monitoring and SLI definitions

  • Application SLAs for latency and availability
  • Model lifecycle and versioning
  • Observability and tracing for inference paths

  • Compatibility between model formats
  • Data privacy and compliance requirements
  • Limited hardware (e.g., no GPUs available)