Glossary

Inference

In one line: Running a trained model to get an answer. Distinct from training, which is teaching the model in the first place.

What is Inference?

Inference is the act of running a trained model - sending it your prompt and receiving an answer. It is what happens every time you ask ChatGPT a question, generate an image, run a code completion, or use Perplexity to search the web. Inference is the 'production' phase of AI - the part that end users experience.

Inference vs training

Training is the up-front (and enormously expensive) process of teaching the model. Training GPT-4 reportedly cost $50-100M in compute and took months. Fine-tuning a smaller model might cost $50-200. Inference, by contrast, typically costs a fraction of a cent per query, but that cost recurs with every request. The distinction matters for pricing: AI APIs bill for inference by the token, with separate rates for input and output tokens, not for the training that already happened.

Related: training data, fine-tuning, parameters.

Inference speed comparison

Model              Approx. tokens/sec   Time to first token
ChatGPT 4o         100-150              <500 ms
ChatGPT 4.1        80-120               <600 ms
Claude Sonnet 4    80-120               <600 ms
Gemini 2.0 Flash   150-200              <400 ms
DeepSeek R1        30-60                1-3 s (reasoning overhead)

Voice interfaces need very low latency (under 300 ms to first token). Chat interfaces can tolerate slightly slower starts. Reasoning models are inherently slower because they generate an extended run of thinking tokens before the visible answer begins.
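To see how the two columns combine, here is a minimal sketch of a response-time estimate: the wait for the first token plus the time to stream the remaining tokens. The function name and the 500-token answer length are illustrative assumptions, and the inputs are rough midpoints of the ranges in the table above, not measurements.

```python
def estimate_response_seconds(output_tokens: int,
                              tokens_per_sec: float,
                              ttft_seconds: float) -> float:
    """Rough end-to-end time: wait for the first token, then stream the rest."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return ttft_seconds + output_tokens / tokens_per_sec


# Hypothetical 500-token answer, using midpoints from the table above.
fast = estimate_response_seconds(500, tokens_per_sec=175, ttft_seconds=0.4)
slow = estimate_response_seconds(500, tokens_per_sec=45, ttft_seconds=2.0)
print(f"fast model: {fast:.1f}s, reasoning model: {slow:.1f}s")
```

With these assumed inputs the fast model finishes in roughly 3 seconds and the reasoning model in roughly 13, which suggests that for long answers throughput matters more than time to first token.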

Inference cost guide

Model              Input cost (per 1M tokens)   Output cost (per 1M tokens)
ChatGPT 4o         $2.50                        $10.00
ChatGPT 4.1        $2.00                        $8.00
Claude Sonnet 4    $3.00                        $15.00
Gemini 2.0 Flash   $0.075                       $0.30
DeepSeek R1        $0.55                        $2.19
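As a rough illustration of per-token billing, the sketch below turns two rows of the table above into a per-request cost. The function name and the token counts are hypothetical, and the prices are simply copied from the table, so treat the result as an estimate rather than a quote.

```python
# USD per 1M tokens as (input, output), copied from the table above.
PRICES_PER_MILLION = {
    "Claude Sonnet 4": (3.00, 15.00),
    "Gemini 2.0 Flash": (0.075, 0.30),
}


def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: each token type is billed at its own rate."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical request: a 1,000-token prompt and a 500-token answer.
for model in PRICES_PER_MILLION:
    print(f"{model}: ${request_cost_usd(model, 1_000, 500):.6f}")
```

For this assumed request the two models come out at about one cent and about two hundredths of a cent respectively, which is where "fractions of a cent per query" comes from.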

AskAI.free's subscription plans abstract token costs into a monthly allowance. Use the token counter to estimate your usage before choosing a plan. See pricing for full details.

Inference optimization

Running inference efficiently at scale requires several techniques that reduce cost and latency without retraining the model:

  • Quantisation - Storing model weights at lower precision (for example, int8 or int4 instead of 16- or 32-bit floats), cutting memory use and speeding computation with a small quality trade-off.
  • KV-cache reuse - Caching the key-value pairs from the attention layers for repeated prefixes (like a fixed system prompt), so they are not recomputed on every request.
  • Speculative decoding - Using a small, fast model to draft several tokens ahead and the larger model to verify them in a single parallel pass, speeding up generation.
  • Batching - Processing multiple requests simultaneously on the same GPU to improve hardware utilisation and reduce per-query cost.
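To make the quantisation item concrete, here is a minimal sketch of symmetric int8 quantisation in plain Python. It is a simplified illustration, not how any particular inference engine implements it: real systems work on large tensors, often with per-channel or per-group scales, and the function names and sample weights here are made up for the example.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantisation: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale factor; fall back to 1.0 if every weight is zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.031, 0.5, -0.004]  # made-up sample values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, f"scale={scale:.4f}", f"max_error={max_error:.4f}")
```

Each weight now needs one byte instead of four, and the rounding error is at most half the scale factor. That error is the "small quality trade-off": the smallest weight in the sample rounds to zero.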

Inference example

If you are using AskAI.free, a practical way to understand inference is to ask a model to explain it, then ask for a concrete example in your own workflow. For example: "Explain inference for someone using AI to write, code, research, or create images."

This turns the term from a dictionary definition into a decision-making tool: you can see when it affects prompt quality, model choice, output reliability, privacy, cost, or how much context the AI can use.

Why Inference matters

Inference matters because it is the part of AI you actually pay for and wait on: its speed and per-token cost shape which model or tool is right for the job. If you understand this term, you can ask better questions, spot weak answers faster and compare AI systems on more than their marketing.

Inference also connects to other glossary terms such as Jailbreak and Knowledge cutoff, so check those next if you want the full picture.

Common mistake with Inference

The most common mistake is using the term as a label without changing behavior. When inference comes up, ask what action should change: the prompt, the model, the input length, the evidence you request, or the way you verify the answer.

See it in action - Ask any AI about inference on AskAI.free.
