Inference

Model Serving

Model serving is the infrastructure layer that deploys a trained ML model to handle real-time or batch prediction requests, managing scaling, load balancing, versioning, and reliability in production. It bridges the gap between an offline-trained model artifact and a live, queryable API endpoint.

Model serving encompasses the hardware, software, and operational processes needed to expose a trained model's inference capability as a reliable, scalable service. The core components are the serving runtime (the process that loads weights and executes forward passes), the API layer (HTTP/REST or gRPC endpoints), a request queue and scheduler, autoscaling logic, and a model registry for versioning and rollback. For LLMs, the serving runtime also manages the KV cache, the stored attention keys and values for in-progress sequences. Its size grows with both context length and the number of concurrent sequences, so under long contexts or heavy concurrency it can consume a large share of GPU memory, sometimes more than the weights themselves.
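To make the KV cache's memory footprint concrete, the sketch below estimates its size for a standard transformer that stores one key and one value vector per layer, per KV head, per token. The model shape used in the example (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical configuration chosen for round numbers, not a figure for any particular model.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # The leading 2 accounts for storing both a key and a value vector.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, num_sequences: int,
                   bytes_per_value: int = 2) -> int:
    """Total KV cache size for num_sequences in flight, each seq_len tokens long."""
    per_token = kv_cache_bytes_per_token(
        num_layers, num_kv_heads, head_dim, bytes_per_value
    )
    return per_token * seq_len * num_sequences


if __name__ == "__main__":
    # Hypothetical model shape, assumed purely for illustration.
    per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    total = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                           seq_len=8192, num_sequences=16)
    print(f"{per_token / 1024:.0f} KiB per token")
    print(f"{total / 1024**3:.0f} GiB for 16 sequences of 8192 tokens")
```

Under these assumptions each token costs 128 KiB, and 16 concurrent sequences of 8,192 tokens need 16 GiB of cache before any weights are counted. Real runtimes add paging and fragmentation overhead that this estimate ignores.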

A model artifact is loaded onto accelerators by a serving runtime such as NVIDIA Triton Inference Server, TensorRT-LLM, vLLM, or a provider's proprietary stack. The runtime handles batching, memory allocation, and kernel execution. An API gateway in front routes traffic, applies rate limits, and authenticates requests. Kubernetes or an equivalent orchestrator manages horizontal scaling, spinning up additional replicas under load and tearing them down when traffic subsides, and performs rolling updates so that a new model version can be deployed without downtime. Observability (latency percentiles, error rates, GPU utilization, queue depth) feeds autoscaling decisions and alerting.
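The scaling decision described above can be sketched with the proportional rule that the Kubernetes Horizontal Pod Autoscaler documents: desired replicas = ceil(current replicas × current metric / target metric). The function below is a simplified illustration of that rule only. It omits the tolerance band, stabilization windows, and readiness handling that the real controller applies, and the function and parameter names are assumptions made for this example.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Proportional scaling rule, clamped to [min_replicas, max_replicas].

    current_metric is the observed average per replica (for example,
    queue depth per replica); target_metric is the value to hold it at.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Two replicas averaging 80 queued requests against a target of 50.
    print(desired_replicas(2, current_metric=80, target_metric=50))
    # Four replicas that are nearly idle scale back toward the minimum.
    print(desired_replicas(4, current_metric=10, target_metric=50))
```

Because the rule is proportional, a metric at 1.6 times its target asks for roughly 1.6 times as many replicas, rounded up.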

Over a product's lifetime, serving often costs more than training: training is largely a one-time expense, while inference cost accrues with every request. Decisions about hardware selection, quantization precision, batching strategy, replication factor, and whether to use spot or on-demand instances directly determine operating costs and whether latency SLAs can be met. A model that cannot be reliably and economically served at scale has no practical product value regardless of its benchmark performance.
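A rough way to see how these decisions turn into operating cost is to convert an accelerator's hourly price and sustained throughput into a cost per million generated tokens. The sketch below does only that arithmetic. The price, throughput, and utilization figures in the example are hypothetical placeholders, not measurements or quotes.

```python
def cost_per_million_tokens(hourly_price_usd: float,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """USD per one million tokens for a single accelerator.

    utilization is the fraction of each hour the accelerator spends
    doing useful work; idle capacity raises the effective cost.
    """
    if tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("throughput must be positive and utilization in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_price_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical figures: $4.00 per GPU-hour, 2,000 tokens/s sustained.
    full = cost_per_million_tokens(4.00, 2000)
    half = cost_per_million_tokens(4.00, 2000, utilization=0.5)
    print(f"${full:.2f} per 1M tokens at full utilization")
    print(f"${half:.2f} per 1M tokens at 50% utilization")
```

The same formula shows why batching and quantization matter: anything that raises sustained tokens per second on the same hardware lowers the cost per token proportionally.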

As of 2026, the landscape includes general-purpose cloud ML serving platforms (AWS SageMaker, Google Vertex AI, Azure ML), LLM-specialized open-source runtimes (vLLM, Ollama, LMDeploy), and fully managed proprietary APIs (OpenAI, Anthropic Claude API, Google Gemini API). Multi-LoRA serving, in which a single base model hosts hundreds of fine-tuned adapters that are swapped in per request, has matured, allowing companies to serve many specialized variants at close to the hardware cost of one base deployment.
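One way to picture per-request adapter swapping is a small cache that keeps a bounded number of adapters resident and evicts the least recently used one when a new adapter is needed. The sketch below is a toy illustration of that idea, not how any particular runtime implements multi-LoRA serving. The AdapterCache class, its capacity parameter, and the load_adapter callback are all hypothetical names introduced for this example.

```python
from collections import OrderedDict
from typing import Any, Callable


class AdapterCache:
    """Keeps at most `capacity` adapters resident, evicting the least recently used."""

    def __init__(self, capacity: int, load_adapter: Callable[[str], Any]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_adapter = load_adapter  # hypothetical loader, e.g. reads weights from disk
        self._resident: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, adapter_id: str) -> Any:
        """Return the adapter for a request, loading it if it is not resident."""
        if adapter_id in self._resident:
            # Mark as most recently used.
            self._resident.move_to_end(adapter_id)
            return self._resident[adapter_id]
        if len(self._resident) >= self.capacity:
            # Evict the least recently used adapter to make room.
            self._resident.popitem(last=False)
        weights = self.load_adapter(adapter_id)
        self._resident[adapter_id] = weights
        return weights

    def resident_ids(self) -> list:
        """Adapter IDs from least to most recently used."""
        return list(self._resident)


if __name__ == "__main__":
    cache = AdapterCache(capacity=2, load_adapter=lambda name: f"weights:{name}")
    for request_adapter in ["legal", "medical", "legal", "finance"]:
        cache.get(request_adapter)
    print(cache.resident_ids())
```

The base model's weights stay loaded throughout; only the comparatively small adapters move in and out, which is why many variants can share one deployment.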

Example

A fintech startup deploys its document-extraction model using vLLM on a Kubernetes cluster with two NVIDIA H100 GPUs, configuring horizontal pod autoscaling to add replicas when request queue depth exceeds 50, and rolling back to the previous model version automatically if error rate rises above 1%.
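The rollback rule in this example can be sketched as a sliding-window error-rate check. The code below is an illustrative assumption about how such a rule might be expressed, not the startup's actual configuration. The 1% threshold comes from the example, while the window size, class name, and version labels are hypothetical.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the error rate over the most recent `window` requests."""

    def __init__(self, window: int = 1000, max_error_rate: float = 0.01):
        self.outcomes = deque(maxlen=window)  # True = success, False = error
        self.max_error_rate = max_error_rate

    def record(self, ok: bool) -> None:
        self.outcomes.append(ok)

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def should_roll_back(self) -> bool:
        # Roll back only when the rate rises strictly above the threshold.
        return self.error_rate() > self.max_error_rate


def active_version(monitor: ErrorRateMonitor, current: str, previous: str) -> str:
    """Pick which model version should receive traffic."""
    return previous if monitor.should_roll_back() else current


if __name__ == "__main__":
    monitor = ErrorRateMonitor(window=100)
    for _ in range(97):
        monitor.record(True)
    for _ in range(3):
        monitor.record(False)
    print(monitor.error_rate(), active_version(monitor, "v2", "v1"))
```

A production rule would usually also require a minimum number of samples before acting, so that a single early failure does not trigger a rollback.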
