Model Serving
Concepts and practices for exposing trained machine learning models to production traffic, focusing on scalability, versioning and observability.
Classification
- Complexity: Medium
- Impact area: Technical
- Decision type: Architectural
- Organizational maturity: Intermediate
Compromises
Risks
- Retired models can cause silent failures for API clients
- Undetected model regressions in live traffic
- Security risks from untrusted model artifacts or dependencies
Mitigations
- Clearly defined API contracts and versioning
- Automated canary releases with monitoring
- Resource profiles and caching tailored to the use case
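The first risk and the first mitigation above can be illustrated together. The following is a minimal sketch, assuming an in-process router: `ModelRouter`, `ModelRetiredError` and the version strings are hypothetical names, not part of any serving framework. It shows one way to make a retired version fail loudly and to echo the serving version in every response.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


class ModelRetiredError(Exception):
    """Raised when a client asks for a version that has been retired."""


@dataclass
class ModelRouter:
    # Hypothetical in-memory routing table: version string -> predict callable.
    active: Dict[str, Callable[[List[float]], float]] = field(default_factory=dict)
    retired: Set[str] = field(default_factory=set)

    def register(self, version: str, predict_fn: Callable[[List[float]], float]) -> None:
        self.active[version] = predict_fn

    def retire(self, version: str) -> None:
        # Remember retired versions so they fail explicitly instead of silently.
        self.active.pop(version, None)
        self.retired.add(version)

    def predict(self, version: str, features: List[float]) -> dict:
        if version in self.retired:
            raise ModelRetiredError(f"model version {version!r} is retired")
        if version not in self.active:
            raise KeyError(f"unknown model version {version!r}")
        # Echo the version so clients and logs can attribute each prediction.
        return {"model_version": version, "prediction": self.active[version](features)}
```

In an HTTP API the retired case would typically map to an explicit error status rather than a fallback to another version, so that clients notice the contract change.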
I/O & resources
Inputs
- Trained model artifact (e.g., SavedModel, ONNX)
- Payload schema and API contract
- Resource profiles and SLA targets
Outputs
- Production prediction API with versioning
- Monitoring metrics, logs and traces
- Deployment audit and reproduction artifacts
Description
Model serving describes the systems and infrastructure that expose trained machine learning models to production traffic, handling scaling, versioning, routing and observability. It includes serving APIs, model lifecycle management and resource orchestration. The goal is reliable, low‑latency inference and reproducible deployment pipelines.
✔ Benefits
- Fast, scalable delivery of predictions
- Clear separation of responsibilities in the ML lifecycle
- Improved observability and fault detection in production
✖ Limitations
- Increased infrastructure effort and operating costs
- Complexity in model compatibility and serialization
- Latency limits for very resource‑intensive models
Trade-offs
Metrics
- P99 latency
99th percentile of response times (often tracked alongside the 95th) as a measure of tail latency.
- Throughput (requests per second)
Number of successful inference requests served per time unit.
- Error rate
Proportion of failed or erroneous requests relative to total traffic.
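The three metrics above can be computed from a window of request records. This is a minimal sketch, assuming each record is a `(latency_ms, succeeded)` pair and using the nearest-rank percentile definition; the function names and the record layout are illustrative, and production systems usually derive these values from histogram metrics instead of raw samples.

```python
import math
from typing import Dict, List, Tuple


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests: List[Tuple[float, bool]], window_seconds: float) -> Dict[str, float]:
    """Summarize (latency_ms, succeeded) pairs observed during one time window."""
    latencies = [latency for latency, _ in requests]
    successes = sum(1 for _, ok in requests if ok)
    failures = len(requests) - successes
    return {
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        # Throughput counts successful requests only, matching the definition above.
        "throughput_rps": successes / window_seconds,
        "error_rate": failures / len(requests),
    }
```

A percentile describes tail behavior, not the single slowest request, which is why P99 is usually read together with the maximum and the error rate.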
Examples & implementations
TensorFlow Serving for image classification
TensorFlow Serving deploys and versions a CNN model behind a gRPC API.
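As an illustration of version-aware requests, the sketch below builds a call against TensorFlow Serving's REST predict endpoint (`/v1/models/<name>[/versions/<n>]:predict` with an `instances` body). It uses REST rather than the gRPC API named above only to stay dependency-free. The host, port and model name are assumptions for the example, and the function builds the request without sending it.

```python
import json
from typing import Any, List, Optional, Tuple


def build_predict_request(
    host: str,
    model_name: str,
    instances: List[Any],
    version: Optional[int] = None,
) -> Tuple[str, bytes]:
    """Return (url, body) for a TensorFlow Serving REST predict call."""
    # Pinning a version makes the client independent of the server's default version policy.
    version_part = f"/versions/{version}" if version is not None else ""
    url = f"http://{host}/v1/models/{model_name}{version_part}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return url, body


# Hypothetical host and model name; send with any HTTP client as a POST
# with the header Content-Type: application/json.
url, body = build_predict_request("localhost:8501", "resnet", [[0.0, 1.0]], version=3)
```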
KServe on Kubernetes for A/B testing
KServe uses Knative/Ingress routing for canary rollouts and traffic splits.
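The idea behind a percentage-based traffic split can be sketched without any platform. The code below is not KServe's API or implementation; it is a minimal illustration, assuming each request carries a stable identifier, of how hashing that identifier sends a fixed share of traffic to a canary. `pick_variant` and the variant names are hypothetical.

```python
import hashlib


def pick_variant(request_id: str, canary_percent: int) -> str:
    """Deterministically route a request to 'canary' or 'stable'."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    # Hash the id into one of 100 buckets; the same id always lands in the same bucket.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "stable"
```

Hashing instead of random sampling keeps a given caller on the same variant across requests, which makes canary metrics easier to compare with the stable baseline.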
Batch inference with Spark and ONNX
Offline predictions for reporting, scaled out by running ONNX Runtime inside a Spark job.
Implementation steps
Validate, serialize and register model in registry
Build serving image and deploy as a version
Configure routing, autoscaling and observability
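The first step can be sketched with the standard library alone. This is a minimal example under stated assumptions: the "registry" is a local directory, the artifact is serialized with `pickle` as a stand-in for a real format such as SavedModel or ONNX, and `register_model` and its `validate` callback are hypothetical names. Pickle files must never be loaded from untrusted sources.

```python
import hashlib
import json
import pickle
from pathlib import Path
from typing import Any, Callable, Dict


def register_model(
    model: Any,
    name: str,
    version: str,
    registry_dir: str,
    validate: Callable[[Any], bool],
) -> Dict[str, str]:
    """Validate, serialize and register a model under registry_dir/name/version."""
    if not validate(model):
        raise ValueError("model failed validation; not registering")
    payload = pickle.dumps(model)  # stand-in for exporting SavedModel or ONNX
    target = Path(registry_dir) / name / version
    target.mkdir(parents=True, exist_ok=False)  # refuse to overwrite an existing version
    (target / "model.pkl").write_bytes(payload)
    metadata = {
        "name": name,
        "version": version,
        "format": "pickle",
        # The content hash lets the serving image verify it loads the registered artifact.
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    (target / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return metadata
```

Treating versions as immutable and storing a hash next to the artifact supports the audit and reproduction outputs a serving pipeline is expected to produce.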
⚠️ Technical debt & bottlenecks
Technical debt
- Legacy models without registered metadata
- Ad‑hoc serving scripts instead of standardized images
- Missing automated rollback mechanisms
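An automated rollback needs an explicit decision rule. The sketch below is one possible rule, not a standard: it compares a canary window against the stable baseline, and all thresholds (`max_error_ratio`, `max_latency_ratio`, `error_floor`, `min_requests`) are assumed values that would have to be tuned per service.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    requests: int
    errors: int
    p99_ms: float


def should_roll_back(
    canary: WindowStats,
    baseline: WindowStats,
    max_error_ratio: float = 2.0,    # assumed: canary may have at most 2x the baseline error rate
    max_latency_ratio: float = 1.5,  # assumed: canary P99 may be at most 1.5x the baseline P99
    error_floor: float = 0.01,       # assumed: ignore error rates below 1%
    min_requests: int = 100,         # assumed: minimum evidence before deciding
) -> bool:
    """Return True when the canary has regressed enough to trigger a rollback."""
    if canary.requests < min_requests:
        return False  # not enough traffic yet to judge
    canary_error = canary.errors / canary.requests
    baseline_error = baseline.errors / max(baseline.requests, 1)
    error_regressed = canary_error > max(baseline_error * max_error_ratio, error_floor)
    latency_regressed = canary.p99_ms > baseline.p99_ms * max_latency_ratio
    return error_regressed or latency_regressed
```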
Known bottlenecks
Misuse examples
- Live model changes without staging tests
- Allocating high resources for rare batch jobs
- Ignoring drift monitoring after deployment
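Drift monitoring, the last item above, can start with a simple distribution comparison. The following sketch computes the Population Stability Index (PSI) for a single numeric feature between a reference sample and live traffic. PSI is one common choice among several drift measures; the bin count and the epsilon used for empty bins are assumptions of this example.

```python
import math
from typing import List


def population_stability_index(
    expected: List[float],
    actual: List[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI = sum over bins of (actual_share - expected_share) * ln(actual_share / expected_share)."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all reference values are equal

    def shares(values: List[float]) -> List[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            counts[min(max(index, 0), bins - 1)] += 1  # clamp out-of-range values to edge bins
        # Replace empty bins with eps so the logarithm stays defined.
        return [max(count / len(values), eps) for count in counts]

    expected_shares, actual_shares = shares(expected), shares(actual)
    return sum(
        (a - e) * math.log(a / e) for e, a in zip(expected_shares, actual_shares)
    )
```

A value near zero means the live distribution matches the reference. A frequently quoted rule of thumb treats values above roughly 0.2 as a significant shift, but the alert threshold should be validated per feature.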
Typical traps
- Underestimating serialization incompatibilities
- Lack of protection against malformed model inputs
- Insufficient canaries before full rollout
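Protection against malformed inputs can be sketched as a validation step that runs before the model is called. The payload layout (`{"instances": [[...], ...]}`), the feature count and the batch limit below are assumptions for illustration; a real service would derive them from its payload schema and API contract.

```python
import math
from typing import Any, List

EXPECTED_FEATURES = 4  # hypothetical feature count for this example
MAX_BATCH_SIZE = 64    # hypothetical per-request batch limit


def validate_payload(payload: Any) -> List[List[float]]:
    """Return cleaned feature rows, or raise ValueError describing the first problem found."""
    if not isinstance(payload, dict) or "instances" not in payload:
        raise ValueError("payload must be an object with an 'instances' field")
    instances = payload["instances"]
    if not isinstance(instances, list) or not instances:
        raise ValueError("'instances' must be a non-empty list")
    if len(instances) > MAX_BATCH_SIZE:
        raise ValueError(f"at most {MAX_BATCH_SIZE} instances per request")
    rows = []
    for i, row in enumerate(instances):
        if not isinstance(row, list) or len(row) != EXPECTED_FEATURES:
            raise ValueError(f"instance {i} must be a list of {EXPECTED_FEATURES} numbers")
        cleaned = []
        for value in row:
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise ValueError(f"instance {i} contains a non-numeric value")
            try:
                number = float(value)
            except OverflowError:
                raise ValueError(f"instance {i} contains a value that is too large")
            if not math.isfinite(number):
                raise ValueError(f"instance {i} contains NaN or infinity")
            cleaned.append(number)
        rows.append(cleaned)
    return rows
```

Rejecting bad input at the API boundary turns a potential model crash or silent garbage prediction into a clear client error.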
Constraints
- Compatibility between model formats
- Data privacy and compliance requirements
- Limited hardware (e.g., no GPUs available)