Inference
In one line: Running a trained model to get an answer. Distinct from training, which is teaching the model in the first place.
What is Inference?
Inference is the act of running a trained model - sending it your prompt and receiving an answer. It is what happens every time you ask ChatGPT a question, generate an image, run a code completion, or use Perplexity to search the web. Inference is the 'production' phase of AI - the part that end users experience.
Inference vs training
Training is the up-front (and enormously expensive) process of teaching the model. Training GPT-4 reportedly cost $50-100M in compute and took months. Fine-tuning a smaller model might cost $50-200. Inference, by contrast, often costs a fraction of a cent per query, depending on the model and the length of the prompt and answer. The distinction matters for pricing: AI APIs bill for inference by the token (both input and output), not for the training that has already happened.
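To make the split concrete, here is a toy sketch, assuming a one-parameter linear model rather than anything resembling a real language model. The function names `train` and `infer` are illustrative only. Training loops over the data many times to find a weight; inference is a single cheap calculation with that weight frozen.

```python
# Toy illustration only: a one-parameter model, not how an LLM is actually trained.

def train(xs, ys, epochs=200, lr=0.01):
    """Training: repeatedly adjust the weight to reduce error (the expensive step)."""
    w = 0.0
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w


def infer(w, x):
    """Inference: one forward pass with the weight frozen."""
    return w * x


# Train once on data that follows y = 2x ...
weight = train([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])

# ... then run inference as many times as needed without retraining.
prediction = infer(weight, 10.0)
print(round(prediction, 2))
```

The training loop touches every data point 200 times, while each inference call is one multiplication. Real models differ in scale by many orders of magnitude, but the same asymmetry is why training is paid for once and inference is billed per use.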
Related: training data, fine-tuning, parameters.
Inference speed comparison
| Model | Approx. tokens/sec | Time to first token |
|---|---|---|
| GPT-4o | 100-150 TPS | <500ms |
| GPT-4.1 | 80-120 TPS | <600ms |
| Claude Sonnet 4 | 80-120 TPS | <600ms |
| Gemini 2.0 Flash | 150-200 TPS | <400ms |
| DeepSeek R1 | 30-60 TPS | 1-3s (reasoning overhead) |
Voice interfaces need very low latency (under 300ms to first token). Chat interfaces can tolerate slightly slower starts. Reasoning models are inherently slower because they generate extended thinking before answering.
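As a rough worked example, total response time can be approximated as time to first token plus output length divided by generation speed. The sketch below assumes midpoints of the ranges in the table above (an assumption, not a measurement) and a hypothetical 500-token answer; `response_time_seconds` is an illustrative helper name.

```python
def response_time_seconds(ttft_s, tokens_per_sec, output_tokens):
    """Approximate wall-clock time: wait for the first token, then stream the rest."""
    return ttft_s + output_tokens / tokens_per_sec


# Assumed values taken from the table above (midpoints or upper bounds).
fast = response_time_seconds(ttft_s=0.4, tokens_per_sec=175, output_tokens=500)
reasoning = response_time_seconds(ttft_s=2.0, tokens_per_sec=45, output_tokens=500)

print(f"Fast model:      {fast:.1f} s")       # about 3.3 s
print(f"Reasoning model: {reasoning:.1f} s")  # about 13.1 s
```

For long answers the tokens-per-second figure dominates; for short answers and voice use, time to first token matters more.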
Inference cost guide
| Model | Input cost (per 1M tokens) | Output cost (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4.1 | $2.00 | $8.00 |
| Claude Sonnet 4 | $3.00 | $15.00 |
| Gemini 2.0 Flash | $0.10 | $0.40 |
| DeepSeek R1 | $0.55 | $2.19 |
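
The per-token prices translate into a per-query cost with simple arithmetic. The sketch below uses two rows from the table as-is and assumes a hypothetical query of 1,000 input tokens and 500 output tokens; `query_cost_usd` is an illustrative helper, not part of any provider's API.

```python
# (input price, output price) in USD per 1M tokens, copied from the table above.
PRICES_PER_MILLION = {
    "Claude Sonnet 4": (3.00, 15.00),
    "DeepSeek R1": (0.55, 2.19),
}


def query_cost_usd(model, input_tokens, output_tokens):
    """Cost of one query: tokens times price, scaled down from per-million pricing."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


for model in PRICES_PER_MILLION:
    cost = query_cost_usd(model, input_tokens=1_000, output_tokens=500)
    print(f"{model}: ${cost:.4f}")
# Claude Sonnet 4: $0.0105
# DeepSeek R1: $0.0016
```

Output tokens cost several times more than input tokens, so long answers usually drive the bill more than long prompts do.
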
AskAI.free's subscription plans abstract token costs into a monthly allowance. Use the token counter to estimate your usage before choosing a plan. See pricing for full details.
Inference optimization
Running inference efficiently at scale requires several techniques that reduce cost and latency without retraining the model:
- Quantisation - Reducing model precision from float32 to int8 or int4, cutting memory use and speeding computation with a small quality trade-off.
- KV-cache reuse - Caching the key-value pairs from the attention layers for repeated prefixes (like a fixed system prompt), avoiding redundant computation.
- Speculative decoding - Using a small fast model to draft several tokens ahead and a larger model to verify them together in one pass, which cuts the number of slow sequential steps and so reduces latency.
- Batching - Processing multiple requests simultaneously on the same GPU to improve hardware utilisation and reduce per-query cost.
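
To illustrate the first technique in the list, here is a minimal sketch of symmetric int8 quantisation in plain Python. It assumes a single shared scale for all weights, which is a simplification; production libraries typically use per-channel or per-group scales and more careful calibration. The function names are illustrative.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The rounding error per weight is at most half a quantisation step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(max_error <= scale / 2 + 1e-12)
```

Each weight now fits in one byte instead of four, at the price of a small rounding error. Note how the very small weight (0.003) rounds to zero: this is the kind of quality trade-off the list item refers to.
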
Inference example
If you are using AskAI.free, a practical way to understand inference is to ask a model to explain it, then ask for a concrete example in your own workflow. For example: "Explain inference for someone using AI to write, code, research, or create images."
This turns the term from a dictionary definition into a decision-making tool: you can see when it affects prompt quality, model choice, output reliability, privacy, cost, or how much context the AI can use.
Why Inference matters
Inference matters because its speed and per-token cost shape which model you choose, how you prompt it, and how much you pay. If you understand this term, you can ask better questions, spot weak answers faster and choose the right model or tool for the job.
Inference also connects to nearby ideas. A Jailbreak is an attack carried out through prompts at inference time, and the Knowledge cutoff is fixed during training, so running inference does not teach the model anything new. Check those terms next if you want the full picture.
Common mistake with Inference
The most common mistake is using the term as a label without changing behavior. When inference comes up, ask what action should change: the prompt, the model, the input length, the evidence you request, or the way you verify the answer.
See it in action - Ask any AI about inference on AskAI.free.