Inference
Also known as · serving · generation
Running a trained model to generate output — what happens on every prompt.
Inference is the act of using a trained model: you send a prompt, the model runs a forward pass over it, and it generates a response one token at a time. Every API call, every chat message, and every code completion is an inference request.
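To make "token by token" concrete, here is a minimal sketch of a generation loop. It assumes a toy stand-in for a model: `TOY_MODEL` is a hypothetical lookup table of next-token scores, not a real neural network, and `generate` is an illustrative helper that always picks the highest-scoring token (greedy decoding).

```python
# Hypothetical toy "model": a fixed table mapping the last token to
# scores for each possible next token. A real model computes these
# scores with a neural network; the table is only for illustration.
TOY_MODEL = {
    "<start>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"cat": 1.0},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "dog": {"<end>": 1.0},
    "sat": {"<end>": 1.0},
}


def generate(prompt_tokens, max_new_tokens=10):
    """Greedy decoding: append the highest-scoring next token until <end>."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = TOY_MODEL[tokens[-1]]            # one "model step" per token
        next_token = max(scores, key=scores.get)  # pick the most likely token
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens


output = generate(["<start>"])
print(output)
```

Each pass through the loop is one model step, so longer responses cost more steps. The table is only read, never written, which mirrors how inference leaves the parameters unchanged.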
Unlike training, inference doesn't change the model: the parameters stay fixed. But inference is where the ongoing cost lives. At scale, serving millions of requests can cost a company far more over time than the one-time training run, which is why so much engineering goes into making inference faster and cheaper.
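The arithmetic behind that comparison can be sketched in a few lines. All figures here are made-up placeholders chosen for easy arithmetic, not measurements from any real system, and `break_even_days` is a hypothetical helper name.

```python
import math


def break_even_days(training_cost, requests_per_day, cost_per_1k_requests):
    """Days of serving until cumulative inference cost reaches the training cost."""
    daily_inference_cost = requests_per_day / 1000 * cost_per_1k_requests
    return math.ceil(training_cost / daily_inference_cost)


# Placeholder inputs (assumptions for illustration only).
days = break_even_days(
    training_cost=1_000_000,     # one-time training spend
    requests_per_day=2_000_000,  # serving volume
    cost_per_1k_requests=10,     # inference cost per 1,000 requests
)
print(days)
```

Training cost is paid once, while inference cost accrues every day the model is served, so past the break-even point every further saving per request compounds.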
Techniques like quantization, caching, and specialized hardware all exist to drive down the cost and latency of inference while keeping any loss in output quality small.
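As one illustration, quantization stores weights at lower precision. The sketch below assumes a simple symmetric 8-bit scheme with a single scale per list of weights. Production systems use more elaborate variants, and the function names and sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


# Made-up example weights, for illustration only.
weights = [0.12, -0.5, 0.33, 0.91, -0.07]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding moves each weight by at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, max_error)
```

Each weight now fits in one byte instead of a four-byte float, which cuts memory and bandwidth at the price of a small, bounded rounding error. That is the cost-versus-quality trade-off in miniature.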