Inference

Quantization

Quantization is the technique of representing a neural network's weights — and optionally its activations — in lower-precision numeric formats such as INT8 or INT4 instead of the default FP16 or BF16, reducing memory footprint and accelerating inference at the cost of a small, usually acceptable accuracy degradation.

Neural network quantization reduces the number of bits used to store model parameters. A standard large language model stores weights in 16-bit bfloat16 (BF16), consuming 2 bytes per parameter, so a 70-billion-parameter model requires roughly 140 GB at this precision. Quantizing to 8-bit integers (INT8) halves memory usage to approximately 70 GB; 4-bit quantization (INT4 or NF4) reduces it to roughly 35 GB. That puts the weights of a 70B model within reach of a pair of consumer-grade NVIDIA RTX 4090 GPUs (24 GB VRAM each, 48 GB combined) or a single A100 80 GB, with the remaining memory available for activations and the KV cache. Memory reduction translates directly to lower hosting costs and enables deployment on hardware that would otherwise be insufficient.
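The arithmetic behind these figures can be reproduced with a short Python sketch. It counts weight storage only, uses decimal gigabytes (1 GB = 10^9 bytes), and ignores the per-block scale factors that real quantized formats add; the function name `weight_memory_gb` is illustrative, not taken from any library.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_70B = 70e9  # 70 billion parameters

# Weight-only footprint at each precision mentioned in the text.
footprints = {
    name: weight_memory_gb(PARAMS_70B, bits)
    for name, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]
}

for name, gb in footprints.items():
    print(f"{name}: {gb:.0f} GB")
```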

Two primary methodologies exist. Post-training quantization (PTQ) quantizes an already-trained model without further gradient updates: GPTQ (2022) uses approximate second-order information to minimize quantization error layer by layer, while AWQ (Activation-aware Weight Quantization, 2023) uses activation statistics to identify the small fraction of salient weights and protects them through per-channel scaling, preserving accuracy at low bit-widths without retraining. Quantization-aware training (QAT) inserts simulated quantization into the training loop so the model learns to compensate for precision loss during gradient descent; it typically yields higher accuracy than PTQ at the same bit-width, at the cost of additional training compute. Two formats are especially common for open-weight models: NF4 (4-bit NormalFloat, a data type optimized for normally distributed weights and used in bitsandbytes) and GGUF (the container format used by llama.cpp for CPU and mixed CPU/GPU inference).
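To make the basic PTQ step concrete, here is a minimal sketch of symmetric round-to-nearest quantization with an absolute-maximum scale. This is the simple baseline that methods such as GPTQ and AWQ improve on, not an implementation of either; the function names and the sample weights are hypothetical.

```python
def quantize_absmax(weights, bits=8):
    """Symmetric round-to-nearest quantization of one row of weights."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest integer level and clamp to the range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate floating-point weights."""
    return [v * scale for v in q]


row = [0.424, -1.27, 0.031, 0.877, -0.513]     # made-up weights

q8, s8 = quantize_absmax(row, bits=8)
q4, s4 = quantize_absmax(row, bits=4)

# Worst-case reconstruction error at each bit-width.
err8 = max(abs(a - b) for a, b in zip(row, dequantize(q8, s8)))
err4 = max(abs(a - b) for a, b in zip(row, dequantize(q4, s4)))
print(f"INT8 max error: {err8:.4f}, INT4 max error: {err4:.4f}")
```

With this scheme the reconstruction error of any weight is at most half the scale, so the error grows as the number of levels shrinks. Production methods reduce it further by using one scale per channel or per small block of weights rather than one per row.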

Quantization is the primary enabler of local and on-device LLM inference. Without it, even a 7-billion-parameter model at FP16 requires roughly 14 GB of memory for its weights, exceeding the capacity of most laptop GPUs and mobile accelerators. At cloud scale, the decode phase is usually limited by memory bandwidth rather than compute, because every generated token requires reading the model weights. INT8 weight quantization halves that weight traffic relative to 16-bit storage and can therefore roughly double decode throughput per GPU when decoding is bandwidth-bound. Accuracy loss is typically negligible at INT8 and small but measurable at INT4 on most benchmarks; going to 2-bit or 1-bit incurs larger degradation and remains an active research frontier.
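The bandwidth argument can be illustrated with a rough upper bound. The sketch below assumes batch size 1, that all weights are read once per generated token, and that KV-cache traffic and compute time are negligible; the 1000 GB/s bandwidth figure is a hypothetical round number, not a measurement of any particular GPU.

```python
def decode_tokens_per_second_upper_bound(num_params, bits_per_weight, bandwidth_gb_s):
    """Bandwidth-bound ceiling on decode speed: one full weight read per token."""
    weight_bytes = num_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


ASSUMED_BANDWIDTH_GB_S = 1000.0   # hypothetical memory bandwidth
PARAMS_7B = 7e9

fp16_bound = decode_tokens_per_second_upper_bound(PARAMS_7B, 16, ASSUMED_BANDWIDTH_GB_S)
int8_bound = decode_tokens_per_second_upper_bound(PARAMS_7B, 8, ASSUMED_BANDWIDTH_GB_S)

print(f"FP16 ceiling: {fp16_bound:.1f} tokens/s")
print(f"INT8 ceiling: {int8_bound:.1f} tokens/s")
print(f"Speedup: {int8_bound / fp16_bound:.1f}x")
```

Real systems fall short of this ceiling, and the gain narrows at large batch sizes where decoding becomes compute-bound.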

As of 2026, INT8 weight quantization is essentially universal in cloud inference deployments. The open-source community distributes nearly all major open-weight models (Llama 3, Mistral, Qwen 2.5, Gemma 2) as GGUF-quantized files on Hugging Face. Apple's MLX framework leverages 4-bit quantization for on-device inference on Apple Silicon. Microsoft Research's BitNet b1.58 (2024) demonstrated competitive accuracy with ternary weights (−1, 0, +1), and Qualcomm has shipped dedicated INT4 inference accelerators in mobile SoCs. KV-cache quantization, which independently quantizes the stored attention key and value tensors from FP16 to INT8 or INT4, has also become standard practice in production serving stacks including vLLM and TensorRT-LLM.
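The payoff of KV-cache quantization is easy to estimate, since the cache holds one key and one value vector per layer, per KV head, per token. The following sketch uses a hypothetical model shape (80 layers, 8 KV heads, head dimension 128) and a hypothetical workload (16 sequences of 8192 tokens); it ignores the small overhead of quantization scale factors.

```python
def kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bits_per_value):
    """Approximate KV-cache size in decimal gigabytes."""
    # Factor of 2: one key tensor and one value tensor per layer.
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return values * bits_per_value / 8 / 1e9


# Hypothetical model shape and serving workload, for illustration only.
shape = dict(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=8192, batch_size=16)

kv_sizes = {
    name: kv_cache_gb(bits_per_value=bits, **shape)
    for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]
}

for name, gb in kv_sizes.items():
    print(f"{name} KV cache: {gb:.1f} GB")
```

Under these assumptions the cache shrinks from about 43 GB at FP16 to about 21 GB at INT8, memory that a serving stack can spend on longer contexts or larger batches.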

Example

A developer runs Llama 3 70B locally on a single NVIDIA RTX 4090 using 4-bit GGUF quantization via llama.cpp. The quantized model occupies approximately 38 GB, slightly more than the 35 GB of a pure 4-bit encoding because block-wise formats also store scale factors, and it is split between the GPU's 24 GB of VRAM and system memory through partial GPU offload. This compares with the ~140 GB required at full BF16 precision and enables practical local inference with only a minor reduction in benchmark accuracy.
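How many layers to offload in a setup like this can be estimated with simple arithmetic. The sketch below assumes the file size is spread evenly across 80 transformer layers (ignoring embeddings and the output head) and reserves a hypothetical 4 GB of VRAM for the KV cache and scratch buffers; the function name and the reserve figure are illustrative.

```python
def layers_that_fit(model_gb, num_layers, vram_gb, reserved_gb):
    """Estimate how many equally sized layers fit in the usable VRAM."""
    per_layer_gb = model_gb / num_layers
    usable_gb = max(0.0, vram_gb - reserved_gb)
    return min(num_layers, int(usable_gb // per_layer_gb))


gpu_layers = layers_that_fit(
    model_gb=38.0,     # quantized model size from the example
    num_layers=80,     # transformer layers in a 70B Llama-style model
    vram_gb=24.0,      # RTX 4090
    reserved_gb=4.0,   # hypothetical headroom for KV cache and buffers
)
print(f"Offload about {gpu_layers} of 80 layers to the GPU")
```

The result is a starting point for llama.cpp's `--n-gpu-layers` option; the layers that do not fit stay in system memory and run on the CPU, so the practical value is best found by lowering the number until the model loads without running out of VRAM.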
