Inference

Also known as: serving, generation

Running a trained model to generate output — what happens on every prompt.

Inference is the act of using a trained model: you send a prompt, the model runs its computation over that prompt with its parameters held fixed, and it generates a response one token at a time. Every API call, every chat message, every code completion is an inference request.
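To make "token by token" concrete, here is a minimal sketch of a generation loop. The "model" is a hypothetical lookup table (`TOY_MODEL`) standing in for a real network, and the names `next_token` and `generate` are illustrative, not any library's API. A real model would compute a probability distribution over the next token instead of a fixed lookup.

```python
# Hypothetical stand-in for a trained model: a fixed table that maps the
# most recent token to the next one. It is never modified during generation.
TOY_MODEL = {
    "<start>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "<end>",
}


def next_token(tokens):
    """Predict one token from the tokens so far (here, only the last one)."""
    return TOY_MODEL.get(tokens[-1], "<end>")


def generate(prompt_tokens, max_new_tokens=10):
    """Append one predicted token per step until <end> or the step limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == "<end>":
            break
        tokens.append(token)
    return tokens


print(generate(["<start>", "the"]))
```

Each pass through the loop is one prediction step, and the output of one step becomes part of the input to the next.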

Unlike training, inference doesn't change the model: the parameters stay fixed. But inference is where the ongoing cost lives. Training is a one-time expense, while inference cost recurs with every request, so at scale a company serving millions of requests can spend far more on inference than on the training run. That is why so much engineering goes into making inference faster and cheaper.
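The cost argument above is simple arithmetic, sketched below. Every number here is invented purely for illustration (the training cost, the per-request price, and the traffic level are assumptions, not figures from any real system), and the function name is hypothetical.

```python
def break_even_requests(training_cost_usd, cost_per_1000_requests_usd):
    """Number of requests at which cumulative inference cost equals training cost.

    Integer arithmetic is used to avoid floating-point rounding surprises.
    """
    return training_cost_usd * 1000 // cost_per_1000_requests_usd


# Made-up inputs for illustration only.
TRAINING_COST_USD = 1_000_000        # one-time training run
COST_PER_1000_REQUESTS_USD = 2       # recurring serving cost
REQUESTS_PER_DAY = 10_000_000        # assumed traffic

requests_needed = break_even_requests(TRAINING_COST_USD, COST_PER_1000_REQUESTS_USD)
days_to_break_even = requests_needed // REQUESTS_PER_DAY

print(requests_needed, days_to_break_even)
```

Under these assumed numbers, inference spending passes the training bill within weeks, and it keeps growing for as long as the model is served.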

Techniques like quantization (storing weights at lower numeric precision), caching (reusing work that has already been computed), and specialized hardware all exist to drive down the cost and latency of inference without hurting quality too much.
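As one example of these techniques, the sketch below shows the basic idea behind quantization: storing weights as small integers plus a single scale factor, at the price of a small rounding error. It assumes a simple symmetric 8-bit scheme in plain Python; the function names are illustrative, and production systems use more elaborate schemes than this.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.031, 1.0]   # made-up example weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The rounding error per weight is at most half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight now fits in one byte instead of four (for 32-bit floats), which reduces memory use and data movement. The "without hurting quality too much" caveat corresponds to `max_error` staying small.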

Learn more in Module 2 — Training vs. Inference →
