How to compress a language model 3x: a guide to FP8, GPTQ, and SmoothQuant
A new guide shows how to compress a language model using llmcompressor. Three quantization methods were tested: FP8 (fast, low precision), GPTQ (high precision), and SmoothQuant.

The open-source tool llmcompressor compresses pre-trained language models to a size suitable for production use. A new practical guide shows how to apply quantization to ready-made instruction-tuned models and how to choose the right method for your scenario.
What is model quantization
Quantization means reducing the precision of the numbers a model works with. Instead of standard 16-bit (FP16) or 32-bit (FP32) numbers, a model can work with 8-bit values (INT8 or FP8) or even 4-bit values. This makes the model smaller and faster, but can degrade answer quality. There are two approaches: quantization-aware training (QAT) and post-training quantization (PTQ). QAT is more accurate but requires retraining the model on data. PTQ is faster: it is applied to an already trained model before deployment, with no retraining.
llmcompressor specializes in PTQ. Compression therefore takes hours rather than the weeks retraining can require. An engineer loads a ready-made model into llmcompressor, chooses a quantization method, and gets a compressed version ready to run on weaker hardware.
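To make the idea concrete, here is a minimal sketch of symmetric 8-bit rounding in plain Python. It is an illustration of the general principle, not the guide's code or llmcompressor's implementation; the function names and the sample weights are made up for this example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One shared scale for the whole tensor; fall back to 1.0 for an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.05, 0.9]  # illustrative values, not real model weights
q, scale = quantize_int8(weights)   # q == [42, -127, 5, 90]
restored = dequantize(q, scale)

# Rounding to the nearest step bounds the error of each value by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight is stored as one byte plus a single shared scale, which is where the size savings come from. The rounding error is the source of the quality loss.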
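The workflow above might look roughly like the sketch below. It assumes a recent llmcompressor release that exposes `oneshot` at the top level (older releases import it from `llmcompressor.transformers`). The `METHODS` table, `build_recipe`, and `compress` are hypothetical helpers written for this illustration. The calibration dataset name, sample count, and smoothing strength follow the library's published examples and are not the guide's settings.

```python
# llmcompressor is imported inside the functions, so this file can be read
# and loaded even where the library is not installed.

# Hypothetical lookup table: method name -> preset scheme and requirements.
METHODS = {
    "fp8": {"scheme": "FP8_DYNAMIC", "needs_calibration": False, "smooth": False},
    "gptq": {"scheme": "W4A16", "needs_calibration": True, "smooth": False},
    "smoothquant_gptq": {"scheme": "W8A8", "needs_calibration": True, "smooth": True},
}


def build_recipe(method):
    """Return the list of llmcompressor modifiers for one of the three methods."""
    from llmcompressor.modifiers.quantization import GPTQModifier, QuantizationModifier
    from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

    cfg = METHODS[method]
    # Quantize the Linear layers but leave the output head in full precision.
    common = dict(targets="Linear", scheme=cfg["scheme"], ignore=["lm_head"])
    if not cfg["needs_calibration"]:
        return [QuantizationModifier(**common)]
    # Illustrative smoothing strength; tune it for your own model.
    recipe = [SmoothQuantModifier(smoothing_strength=0.8)] if cfg["smooth"] else []
    recipe.append(GPTQModifier(**common))
    return recipe


def compress(model_id, method, output_dir, dataset="open_platypus"):
    """Run one-shot post-training quantization and save the compressed model."""
    from llmcompressor import oneshot

    kwargs = dict(model=model_id, recipe=build_recipe(method), output_dir=output_dir)
    if METHODS[method]["needs_calibration"]:
        # Calibration settings are assumptions, not values from the guide.
        kwargs.update(dataset=dataset, max_seq_length=2048, num_calibration_samples=512)
    oneshot(**kwargs)
```

A call such as `compress("your-org/your-instruct-model", "fp8", "model-fp8")` would then produce the compressed checkpoint. The model ID here is a placeholder.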
Three quantization methods compared
The guide thoroughly tests three different approaches on the same base model:
- FP8 dynamic quantization: the simplest option. It runs in minutes and needs no calibration data. Both weights and activations are converted to an 8-bit floating-point format. Downside: the worst accuracy of the three, with up to 5% loss in answer quality.
- GPTQ (W4A16): compresses only the model weights to 4 bits, while activations stay in the original 16 bits. Requires calibration on a small data sample. A good balance between speed and quality.
- SmoothQuant with GPTQ (W8A8): the most accurate of the three. Weights and activations are both 8-bit, but activation outliers are first rescaled so that part of the quantization difficulty moves into the weights. It is slower than the others and needs more calibration data, but results stay close to the original, with less than 1% loss.
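The rescaling idea behind SmoothQuant can be shown in a few lines of plain Python. This is a toy sketch of the published technique, not llmcompressor's implementation. The matrices are invented, and the code assumes every channel has a nonzero maximum. Each input channel j gets a factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha). Activations are divided by it and weights multiplied by it, so the layer output is unchanged.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def peak(matrix):
    """Largest absolute value in a matrix."""
    return max(abs(v) for row in matrix for v in row)


def smooth(x, w, alpha=0.5):
    """Shift activation outliers into the weights, one factor per input channel."""
    act_max = [max(abs(v) for v in col) for col in zip(*x)]  # columns of X
    w_max = [max(abs(v) for v in row) for row in w]          # rows of W
    s = [a ** alpha / m ** (1 - alpha) for a, m in zip(act_max, w_max)]
    x_smooth = [[v / sj for v, sj in zip(row, s)] for row in x]
    w_smooth = [[v * sj for v in row] for row, sj in zip(w, s)]
    return x_smooth, w_smooth, s


# Toy data: the second activation channel carries large outliers.
x = [[0.5, 40.0], [-0.3, -64.0]]   # tokens x input channels
w = [[0.8, -0.4], [0.01, 0.02]]    # input channels x output channels

x_smooth, w_smooth, s = smooth(x, w)
original_out = matmul(x, w)
smoothed_out = matmul(x_smooth, w_smooth)  # equal to original_out up to float error
```

After smoothing, the activation range is far narrower, so 8-bit rounding wastes fewer levels on a handful of outliers. The weights absorb the difference.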
Each method was tested on a real task: text generation based on user queries. The guide measures file size on disk, generation speed (latency and throughput), and model perplexity, a metric of how well the model predicts held-out test text (lower is better).
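For reference, perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes it from a list of per-token log-probabilities. The helper names are hypothetical, and the numbers in the usage lines are made up for illustration, not results from the guide.

```python
import math


def perplexity(token_logprobs):
    """exp(mean negative log-likelihood), with natural-log probabilities per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_increase(baseline_ppl, quantized_ppl):
    """Fractional perplexity increase of a quantized model over its baseline."""
    return (quantized_ppl - baseline_ppl) / baseline_ppl


# A model that assigns probability 0.25 to every token has perplexity 4:
# it is as uncertain as a uniform choice among four options.
uniform_ppl = perplexity([math.log(0.25)] * 8)

# Made-up example: a baseline perplexity of 10.0 rising to 10.1 is a 1% increase.
delta = relative_increase(10.0, 10.1)
```

Comparing the quantized model's perplexity with the original's on the same test text gives one simple signal of how much quality the compression cost.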
Benchmark results
Disk size can shrink 3-4 times. The full reduction comes from 4-bit weights, while an 8-bit version takes roughly half the space of the 16-bit original. Inference speed tends to rise with compression, most noticeably on mobile devices where battery life is critical. Accuracy depends on the chosen method: FP8 loses up to 5% in answer quality, SmoothQuant less than 1%. For production scenarios where every percent of accuracy matters, SmoothQuant is the choice, even if it is slower. For idea generation, drafts, and auxiliary tasks, FP8 is suitable, and the computation savings justify the quality loss.
The practical conclusion from the guide: if you need speed and low costs, choose FP8. If accuracy is critical and you're willing to spend more time on inference, choose SmoothQuant.
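The size figures follow from simple arithmetic, sketched below. The 7-billion parameter count, the group size of 128, and the 16-bit scales are assumptions chosen for illustration, not figures from the guide. The estimate also ignores layers that are usually left unquantized, such as embeddings and the output head.

```python
def weight_bytes(num_params, bits, group_size=None, scale_bits=16):
    """Approximate storage for quantized weights, in bytes."""
    total_bits = num_params * bits
    if group_size:
        # Grouped schemes store one scale per group of weights.
        total_bits += (num_params / group_size) * scale_bits
    return total_bits / 8


params = 7_000_000_000  # hypothetical model size

fp16_size = weight_bytes(params, 16)                 # 14.0 GB
int8_size = weight_bytes(params, 8)                  # 7.0 GB
int4_size = weight_bytes(params, 4, group_size=128)  # about 3.6 GB

ratio_8bit = fp16_size / int8_size  # 2.0
ratio_4bit = fp16_size / int4_size  # just under 4
```

Under these assumptions, 8-bit formats halve the footprint, and the 3-4x reduction is reached only with 4-bit weights, with the per-group scales accounting for the gap below an exact 4x.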
Who needs this
This tool and approach are useful for companies that want to run their language model in production:
- on the edge (on the user's device), without sending data to the cloud
- in a private cloud with limited hardware and budget
- at scale: the smaller the model, the cheaper the batch processing and the cloud bills
Startups and large companies alike are already actively using quantization. Meta launched Llama 2 with official int8 quantization support. The bitsandbytes library, integrated into Hugging Face Transformers, simplifies quantization for engineers. Now llmcompressor allows doing this with fine-grained control over the method.
What it means
Quantization is moving from the category of experiments to a standard ML-pipeline tool. Tools like llmcompressor close the last mile: an engineer can choose a trade-off between size, speed, and quality in hours instead of weeks of experimentation. For the industry as a whole, this means large language models become more accessible, cheaper to operate, and better for privacy, because you can run them locally without the cloud.