Training

QLoRA

QLoRA is a fine-tuning method that quantizes a frozen base model to 4-bit precision while training LoRA adapters at higher precision, enabling large language models to be fine-tuned on a single consumer or professional GPU.

QLoRA (Quantized Low-Rank Adaptation) is an extension of LoRA introduced by Tim Dettmers and colleagues at the University of Washington in a 2023 paper (NeurIPS 2023). Standard LoRA reduces trainable parameters but still requires loading the full base model into GPU memory at 16-bit precision, making models above roughly 13 billion parameters impractical to fine-tune on a single GPU. QLoRA solves this by storing the frozen base model weights in 4-bit quantized format, while the LoRA adapter weights are trained and updated at 16-bit bfloat precision.

QLoRA relies on three technical innovations. NormalFloat 4-bit (NF4) is a quantization data type designed for weight tensors that follow an approximately normal distribution, achieving better fidelity than standard 4-bit integer quantization at the same bit width. Double quantization further compresses the per-block quantization constants themselves, recovering additional memory at negligible accuracy cost. Paged optimizers use NVIDIA's unified memory system to handle optimizer state spikes by transparently paging data between GPU and CPU RAM, preventing out-of-memory crashes during gradient steps. Together these techniques reduce the memory footprint of a 65-billion-parameter model from over 130 GB in float16 to under 48 GB, enabling fine-tuning on a single 80 GB A100 GPU.

The democratizing effect was significant. Before QLoRA, fine-tuning models larger than 13 billion parameters required multi-GPU server clusters. After its release, 33B and 65B models became fine-tunable on a single professional GPU, and smaller models on consumer RTX 4090 hardware. The original paper demonstrated this by producing Guanaco, a series of instruction-tuned models that closely matched GPT-3.5 on a human preference benchmark despite being fine-tuned on a single GPU in under 24 hours.

By 2026, QLoRA is integrated into major fine-tuning libraries including bitsandbytes, Axolotl, and Unsloth, and is routinely used in both research and production workflows. The technique has been extended to vision-language and multimodal architectures. Residual limitations include a small but measurable accuracy penalty relative to full-precision LoRA, particularly at very low ranks or with highly compressed smaller models; practitioners mitigate this by using slightly higher ranks, 8-bit quantization where memory allows, or mixed-precision intermediate layers.

Example

A university research group uses QLoRA to fine-tune a 13-billion-parameter Mistral model on a curated dataset of scientific papers, loading the 4-bit quantized base model in approximately 7 GB of VRAM and training LoRA adapters in bfloat16 on a single RTX 4090—hardware that would be completely insufficient for full-precision fine-tuning of a model this size.

Related terms

Latest news on this topic

← Glossary