Distillation

Also known as · knowledge distillation · model distillation

Training a smaller 'student' model to mimic a larger 'teacher' model.

Knowledge distillation trains a smaller, cheaper 'student' model to reproduce the behavior of a larger 'teacher' model. Instead of (or in addition to) learning from raw data, the student learns from the teacher's outputs, capturing much of its capability at a fraction of the size and cost.
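As a rough illustration of what "learning from the teacher's outputs" can look like, here is a minimal sketch in plain Python. It assumes the classic soft-target setup: both models emit raw scores (logits), the scores are softened with a temperature, and the student is penalized by the KL divergence from the teacher's distribution. The function names, the example logits, and the `alpha` and `temperature` defaults are all hypothetical choices for illustration, not a specific library's API or any particular model's recipe.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The temperature**2 factor is a common convention that keeps the loss
    scale comparable across temperatures (an assumption of this sketch).
    """
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

def combined_loss(student_logits, teacher_logits, true_label,
                  alpha=0.5, temperature=2.0):
    """Blend the teacher signal with ordinary cross-entropy on the raw label."""
    hard = -math.log(softmax(student_logits)[true_label])
    soft = distillation_loss(student_logits, teacher_logits, temperature)
    return alpha * soft + (1 - alpha) * hard

# Hypothetical logits for a single 3-class example.
teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]   # close to the teacher
poor_student = [0.5, 4.0, 1.0]   # disagrees with the teacher

print(distillation_loss(good_student, teacher))
print(distillation_loss(poor_student, teacher))
```

A student whose outputs track the teacher's gets a loss near zero, while one that disagrees gets a larger loss. In real training this quantity would be averaged over many examples and minimized with gradient descent; `combined_loss` covers the "in addition to raw data" case by mixing in the usual label-based loss.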

It's one reason many fast, inexpensive models perform better than their size would suggest: they've been distilled from larger frontier models. For many everyday tasks, a distilled model is good enough and far cheaper to run.

Distillation, quantization, and architectural efficiency together explain a broad industry trend: capability that required a giant model last year often runs on a much smaller one today.

Learn more in Module 17 — Quantization & Efficiency →
