Quantization
Also known as · quantized · low-precision
Compressing a model by storing its numbers at lower precision to cut cost.
Quantization shrinks a model by representing its parameters with fewer bits — for example, converting 16-bit numbers to 8-bit or 4-bit. This cuts the memory the model needs and speeds up inference, often with surprisingly little loss in quality.
It's a key technique for making large models practical: a quantized model can run on cheaper hardware, or even on a laptop or phone, where the full-precision version wouldn't fit. The trade-off is that pushing precision too low eventually degrades accuracy, so there's a sweet spot.
Quantization is one of the main levers — alongside distillation and better serving software — for driving down the cost of running models in production.