Training

Knowledge Distillation

Knowledge distillation is a compression technique in which a small student model is trained to match the output distribution of a larger teacher model, producing a compact model that retains much of the teacher's accuracy.

Knowledge distillation is a model compression and training method in which a smaller, more efficient student network is taught to replicate the behavior of a larger, more capable teacher network. Rather than training the student solely on hard one-hot ground-truth labels, it learns to match the teacher's full softmax output distribution, which encodes richer information about inter-class relationships and the teacher's learned uncertainty.

The technique was formalized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in the paper "Distilling the Knowledge in a Neural Network" (presented at the NIPS 2014 Deep Learning Workshop and posted to arXiv in 2015). The key mechanism is temperature scaling: dividing the output logits by a temperature parameter T > 1 before applying softmax produces softer probability distributions that assign meaningful probability mass to near-miss classes. The same temperature is applied to both teacher and student during training, and T is set back to 1 at inference. These soft targets carry more information than one-hot labels: a dog image receiving 2% probability under the cat class conveys structural similarity between the two categories that a one-hot label cannot. The student's loss is typically a weighted combination of cross-entropy (or, equivalently for optimization, KL divergence) against the soft teacher targets and standard cross-entropy against hard ground-truth labels, with the soft term commonly multiplied by T^2 so that its gradient magnitude stays comparable as T changes. Extensions such as feature-level distillation and attention-transfer distillation additionally align intermediate-layer activations and attention maps between teacher and student, which can further improve transfer quality.
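
The temperature-scaled softmax and the combined loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the logit values, the temperature of 4, and the weighting factor `alpha` are assumptions chosen for readability, and the function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: divide logits by T, then normalize."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0)

def distillation_loss(student_logits, teacher_logits, true_class,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    temperature and alpha are illustrative defaults, not recommended values.
    """
    # Soft term: both distributions are softened with the same temperature.
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_term = cross_entropy(soft_teacher, soft_student)

    # Hard term: ordinary cross-entropy against the true class at T = 1.
    hard_term = -math.log(softmax(student_logits)[true_class])

    # Multiplying the soft term by T^2 keeps its gradient scale comparable
    # to the hard term when the temperature changes.
    return alpha * temperature ** 2 * soft_term + (1 - alpha) * hard_term

# Illustrative logits for the classes (dog, cat, car); not real model outputs.
teacher_logits = [6.0, 2.0, -1.0]
student_logits = [3.0, 1.5, 0.0]

print("teacher, T=1:", softmax(teacher_logits, 1.0))
print("teacher, T=4:", softmax(teacher_logits, 4.0))
print("loss:", distillation_loss(student_logits, teacher_logits, true_class=0))
```

Comparing the two printed distributions shows the effect of temperature: at T = 1 nearly all probability sits on the top class, while at T = 4 the near-miss class receives a visibly larger share for the student to learn from.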

Knowledge distillation matters because the largest models are impractical for latency-sensitive or resource-constrained environments such as mobile devices, embedded hardware, and cost-efficient cloud inference endpoints. Distillation bridges this gap: a distilled student typically outperforms a model of the same size trained independently on labels alone, because soft targets provide a richer training signal. DistilBERT (Hugging Face, 2019) demonstrated that a 66M-parameter student retains approximately 97% of BERT-base's GLUE performance while running 60% faster at inference with 40% fewer parameters.

By 2026, distillation is applied at scale across NLP, vision, and speech. In the LLM era it takes new forms. In early 2025 DeepSeek released distilled variants of its DeepSeek-R1 reasoning model, ranging from 1.5B to 70B parameters, built on open Qwen and Llama base models, and fine-tuned on long reasoning traces generated by the full model. These variants achieve competitive scores on mathematical and coding benchmarks at a fraction of the inference cost. Google's Gemini Nano models, designed for on-device inference on Pixel phones, were distilled from larger Gemini checkpoints. Apple's on-device models shipped in iOS 18 similarly rely on distillation to compress foundation model capabilities into the tight memory and power envelopes of mobile hardware.
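
Training a student on teacher-generated reasoning traces amounts to building a supervised fine-tuning set from the teacher's outputs, rather than matching its logits. The sketch below illustrates that idea only; it is not a description of any vendor's actual pipeline. `teacher_generate` is a hypothetical stub standing in for a call to a large model, the prompts and traces are invented, and the step that discards traces with a wrong final answer is an assumed quality filter.

```python
def teacher_generate(prompt):
    """Hypothetical stand-in for sampling a reasoning trace from a teacher model."""
    canned = {
        "What is 12 * 7?": "12 * 7 = 12 * 5 + 12 * 2 = 60 + 24 = 84. Answer: 84",
        # Deliberately wrong trace, to show the filter below in action.
        "What is 9 + 16?": "9 + 16 = 26. Answer: 26",
    }
    return canned[prompt]

def final_answer(trace):
    """Extract the text after the last 'Answer:' marker."""
    return trace.rsplit("Answer:", 1)[-1].strip()

def build_distillation_set(problems):
    """Collect (prompt, completion) pairs whose final answer matches the reference."""
    dataset = []
    for prompt, reference in problems:
        trace = teacher_generate(prompt)
        if final_answer(trace) == reference:  # assumed quality filter
            dataset.append({"prompt": prompt, "completion": trace})
    return dataset

# Invented problems with reference answers.
problems = [("What is 12 * 7?", "84"), ("What is 9 + 16?", "25")]
sft_data = build_distillation_set(problems)
print(len(sft_data))  # 1: only the trace with the correct final answer is kept
```

The student would then be fine-tuned on `sft_data` with an ordinary next-token objective, so the teacher's full reasoning text, not just its final answer, becomes the training target.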

Example

Hugging Face trained DistilBERT by distilling from the 110M-parameter BERT-base model, using temperature-softened teacher outputs as soft targets alongside a masked language modeling loss and a cosine embedding loss on hidden states. The resulting 66M-parameter student runs 60% faster at inference while scoring approximately 97% of BERT-base's performance across the GLUE benchmark suite.
