Inference

Inference is the process of applying a trained machine learning model to new input data to produce predictions or outputs. It is the deployment-time operation, distinct from training, in which no model parameters are updated.

In machine learning, inference refers to the forward-pass computation in which a trained model receives an input and produces an output — a classification label, a generated text sequence, an embedding vector, or any other model output. Unlike training, inference does not modify model weights; it is a read-only operation against a fixed set of parameters. In production systems, inference runs continuously and at scale, serving end users or downstream applications.

Inference for a transformer-based language model involves tokenizing the input text, looking up token embeddings, and then, in each transformer layer, computing multi-head self-attention across the context window and passing activations through feed-forward layers. Generative models repeat this process, sampling one token at a time until a stopping criterion is met, such as an end-of-sequence token or a length limit (autoregressive decoding). Key optimizations include quantization (reducing weight precision from 16-bit or 32-bit floating point to 8-bit or 4-bit integers), KV-cache reuse (storing the key and value tensors computed for earlier tokens so that each decoding step only processes the newest token), and request batching (grouping concurrent requests to maximize GPU utilization).
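The autoregressive loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a real transformer: `TOY_LOGITS` and `next_token_logits` are hypothetical stand-ins for a model forward pass, and the loop picks the highest-scoring token (greedy decoding) instead of sampling from the distribution.

```python
EOS = "<eos>"

# Hypothetical bigram "model": maps the last token to scores for the next token.
# A real model would compute these scores with a transformer forward pass.
TOY_LOGITS = {
    "the": {"cat": 2.0, "dog": 1.5, EOS: 0.1},
    "cat": {"sat": 2.5, EOS: 0.5},
    "sat": {EOS: 3.0, "the": 0.2},
}


def next_token_logits(tokens):
    """Return next-token scores given the tokens so far (toy stand-in)."""
    return TOY_LOGITS.get(tokens[-1], {EOS: 1.0})


def greedy_decode(prompt_tokens, max_new_tokens=8):
    """Append one token per step until EOS or the length limit is reached."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = next_token_logits(tokens)
        best = max(logits, key=logits.get)  # greedy choice: highest score
        if best == EOS:                     # stopping criterion 1: end of sequence
            break
        tokens.append(best)
    return tokens                           # stopping criterion 2: length limit


if __name__ == "__main__":
    print(greedy_decode(["the"]))
```

With the toy table, `greedy_decode(["the"])` returns `["the", "cat", "sat"]`. Model weights are never modified inside the loop, which is what makes inference a read-only operation.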

Inference economics dominate the cost structure of deployed AI products. In most large-scale deployments, cumulative inference costs exceed training costs because inference runs continuously for every request, while training happens once or at periodic intervals. Latency (time to first token, total generation time) and throughput (tokens per second per accelerator) are the primary performance metrics. These pressures have driven investment in inference-optimized hardware, including Groq's LPU, Cerebras wafer-scale processors, and NVIDIA H200 and Blackwell GPUs, as well as algorithmic techniques such as speculative decoding, in which a smaller draft model proposes several candidate tokens and the main model verifies them in a single parallel pass.
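The idea behind speculative decoding can be illustrated with a simplified sketch. It assumes greedy verification (a drafted token is accepted only if it equals the main model's own top choice), which is a simplification of the probabilistic acceptance rule used in practice. The names `speculative_step`, `draft_next`, and `target_next` are hypothetical, and both "models" are toy functions that replay fixed sequences.

```python
def speculative_step(tokens, draft_next, target_next, k=4):
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1. The draft model proposes k tokens one after another (cheap).
    proposal = []
    context = list(tokens)
    for _ in range(k):
        tok = draft_next(context)
        proposal.append(tok)
        context.append(tok)

    # 2. The main (target) model checks every proposed position. A real system
    #    scores all positions in one batched forward pass; this sketch loops.
    accepted = []
    context = list(tokens)
    for tok in proposal:
        expected = target_next(context)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        context.append(tok)

    # 3. Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(context))
    return accepted


# Toy models (assumptions for illustration): each replays a fixed sequence.
TARGET_SEQ = ["a", "b", "c", "d", "e", "f"]
DRAFT_SEQ = ["a", "b", "x", "d", "e", "f"]  # disagrees at position 2


def target_next(context):
    return TARGET_SEQ[len(context)]


def draft_next(context):
    return DRAFT_SEQ[len(context)]


if __name__ == "__main__":
    print(speculative_step([], draft_next, target_next, k=4))
```

In this toy run the round yields `["a", "b", "c"]`: two draft tokens are accepted and the third is replaced by the main model's choice, so one verification round produces three tokens. The output always matches what the main model would have produced on its own; the potential gain is fewer sequential passes through the large model.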

As of 2026, inference serving is a mature discipline with dedicated open-source frameworks including vLLM, TensorRT-LLM, and SGLang. Major providers offer inference APIs priced per million tokens. On-device inference — running models locally on smartphones, laptops, or embedded hardware without cloud connectivity — has become practical with quantized sub-10B-parameter models that fit within consumer DRAM, enabling privacy-preserving and low-latency applications.
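A rough calculation shows why quantization matters for on-device use. The sketch below estimates the memory needed for model weights alone; the helper name `weight_memory_gb` and the 7-billion-parameter example are assumptions for illustration, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # Hypothetical 7-billion-parameter model at three precisions.
    for bits in (16, 8, 4):
        print(f"7B parameters at {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

For this example the weights take about 14.0 GB at 16-bit, 7.0 GB at 8-bit, and 3.5 GB at 4-bit, which is the difference between exceeding and fitting within the memory of many consumer devices.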

Example

When a user submits a prompt to a cloud-hosted language model API, the request is routed to a GPU server that runs inference: the tokenized prompt passes through the model's transformer layers, and output tokens are streamed back to the client as they are generated.
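The streaming behavior in this example can be sketched with a Python generator. Everything here is a hypothetical stand-in: `toy_next_token` replays a canned reply in place of a model forward pass, whitespace splitting replaces a real tokenizer, and no network transport is involved.

```python
from typing import Callable, Iterator, List

EOS = "<eos>"


def toy_next_token(tokens: List[str], prompt_len: int) -> str:
    """Hypothetical stand-in for a forward pass: replays a canned reply."""
    reply = ["Inference", "runs", "here"]
    index = len(tokens) - prompt_len  # number of tokens generated so far
    return reply[index] if index < len(reply) else EOS


def stream_completion(
    prompt: str,
    next_token: Callable[[List[str], int], str] = toy_next_token,
    max_new_tokens: int = 16,
) -> Iterator[str]:
    """Yield each output token as soon as it is generated."""
    tokens = prompt.split()  # stand-in tokenizer: split on whitespace
    prompt_len = len(tokens)
    for _ in range(max_new_tokens):
        tok = next_token(tokens, prompt_len)
        if tok == EOS:
            return
        tokens.append(tok)
        yield tok            # the client receives this token immediately


if __name__ == "__main__":
    # The client consumes tokens one at a time instead of waiting for the end.
    for tok in stream_completion("What is inference ?"):
        print(tok, end=" ", flush=True)
    print()
```

Because the generator yields after every decoding step, the client can display the first token well before the full response is complete, which is why time to first token is tracked separately from total generation time.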
