Quantization

Also known as · quantized · low-precision

Compressing a model by storing its numbers at lower precision to cut cost.

Quantization shrinks a model by representing its parameters with fewer bits, for example converting 16-bit weights to 8-bit or 4-bit. Memory use falls roughly in proportion to the bit width, and inference usually gets faster because less data has to move through memory, often with surprisingly little loss in quality.
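To make the idea concrete, here is a minimal sketch of one common scheme, symmetric per-tensor 8-bit quantization, written in plain Python. It is an illustration under stated assumptions, not how any particular library does it: the function names `quantize_int8` and `dequantize` and the sample weights are made up for this example, and real toolkits typically work per channel or per block on tensors rather than on Python lists.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale.

    Assumption for this sketch: symmetric, per-tensor quantization.
    """
    max_abs = max(abs(w) for w in weights)
    # The scale is the real-valued step size represented by one integer step.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest step and clamp to the signed 8-bit range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.5, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as a small integer plus one shared scale, so the per-weight error is bounded by half the step size. Very small weights (like `0.003` here) round to zero, which hints at why accuracy suffers when the step size gets too coarse.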

It's a key technique for making large models practical: a quantized model can run on cheaper hardware, or even on a laptop or phone, where the full-precision version wouldn't fit. The trade-off is that pushing precision too low eventually degrades accuracy, so there's a sweet spot.
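The hardware argument comes down to simple arithmetic, sketched below. The 7-billion-parameter model size is a hypothetical figure chosen for illustration, and the helper `weight_memory_gb` is invented for this example. The estimate covers weights only and ignores activations, caches, and the small overhead of storing quantization scales.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Hypothetical 7-billion-parameter model at three precisions.
NUM_PARAMS = 7e9
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
# 16-bit: 14.0 GB
# 8-bit: 7.0 GB
# 4-bit: 3.5 GB
```

Under these assumptions, halving the bit width halves the weight memory, which is the difference between needing a data-center accelerator and fitting on a consumer device.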

Quantization is one of the main levers — alongside distillation and better serving software — for driving down the cost of running models in production.

Learn more in Module 17 — Quantization & Efficiency →
