Glossary

Attention

In one line: The mathematical mechanism that lets transformers focus on different parts of the input when generating each output token.

What is Attention?

Attention is the mechanism at the core of modern transformer-based AI models. When a model like Claude or ChatGPT generates the next token, attention lets it weigh every other token in the context window, deciding which ones matter most right now. It's a large part of why these models can reference something you wrote 50,000 words earlier in a document without losing the thread.

The landmark 2017 paper Attention Is All You Need introduced the transformer, an architecture built around stacked attention layers. It replaced recurrent neural networks (RNNs), which processed tokens one at a time. The shift was dramatic: transformers could be parallelised across GPUs, could be trained on far more data, and handled long-range dependencies far more reliably.

The key innovation

For each token, attention computes three vectors: a Query ('what am I looking for?'), a Key ('what do I offer?'), and a Value ('what information do I carry?'). The model scores every Key against the current Query (typically with a dot product), applies a softmax to turn the scores into weights that sum to 1, then takes a weighted sum of all Values. The result is a rich representation that blends information from whichever tokens were most relevant, not just the nearest ones.
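The score, softmax, and weighted-sum steps can be sketched in a few lines of plain Python. This is a minimal illustration, not how any production model is implemented: the function names and the toy vectors are made up for the example, and the division by the square root of the vector size is the scaling used in the 2017 paper rather than something described above.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query attention: score every Key, softmax, blend the Values."""
    scale = math.sqrt(len(query))  # scaling from the 2017 paper (assumption here)
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy, hand-picked vectors (hypothetical, not taken from a trained model).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [2.0, 0.0]

weights, output = attend(query, keys, values)
print(weights)  # the first and third Keys match the Query best
print(output)   # so the output leans towards their Values
```

Real models compute this for every token at once using matrix operations on a GPU; the loop-based version here only makes the three steps visible.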

Multi-head attention

Modern LLMs run many attention computations in parallel, each called a 'head'. Different heads learn to track different relationships:

One head might track grammatical subject-verb agreement.
Another might resolve coreference, figuring out what 'it' refers to.
Others capture semantic similarity, negation, temporal order, and more.

The outputs of all heads are concatenated and projected, giving the model a much richer representation than any single attention pass could provide. The largest GPT-3 model, for example, uses 96 attention heads in each layer.
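The run-in-parallel, concatenate, then project flow can be sketched as follows. Everything here is an assumption made for illustration: two heads, four-dimensional token vectors, and hand-picked projection matrices where a trained model would use learned weights. The first head only looks at the first two dimensions of each token and the second head at the last two.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def head_attention(queries, keys, values):
    """Scaled dot-product attention for one head over a whole sequence."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

def multi_head_attention(tokens, heads, w_out):
    """Run each head, concatenate the per-token outputs, then project."""
    per_head = []
    for w_q, w_k, w_v in heads:
        queries = [matvec(w_q, t) for t in tokens]
        keys = [matvec(w_k, t) for t in tokens]
        values = [matvec(w_v, t) for t in tokens]
        per_head.append(head_attention(queries, keys, values))
    concatenated = [[x for head in per_head for x in head[i]]
                    for i in range(len(tokens))]
    return [matvec(w_out, c) for c in concatenated]

# Hypothetical hand-picked weights; a real model learns these.
first_half = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
second_half = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
heads = [(first_half, first_half, first_half),
         (second_half, second_half, second_half)]
w_out = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # identity projection

tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
result = multi_head_attention(tokens, heads, w_out)
print(result)  # one 4-dimensional vector per input token
```

Because each head has its own projections, each one can weight the same tokens differently, which is what lets different heads specialise.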

Self-attention vs cross-attention

Type	What it attends to	Where used
Self-attention	Tokens in the same sequence (only the current and earlier tokens in a text-generating decoder)	Every transformer decoder layer (text generation)
Cross-attention	A separate input sequence	Vision-language models attending to image patches; translation models attending to the source language
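The difference between the two rows comes down to where the Queries and Keys come from, plus a mask in the text-generation case. The sketch below is a simplification under stated assumptions: it uses the token vectors directly as Queries and Keys (real models apply learned projections first), and the function name and toy sequences are invented for the example.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys, causal=False):
    """Return one row of attention weights per query."""
    scale = math.sqrt(len(keys[0]))
    rows = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # hide later positions
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / scale)
        rows.append(softmax(scores))
    return rows

# Hypothetical toy sequences.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
source = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]

# Self-attention in a decoder: Queries and Keys from the same sequence,
# with a causal mask so no token can look ahead.
self_weights = attention_weights(sequence, sequence, causal=True)

# Cross-attention: Queries from one sequence, Keys from another.
cross_weights = attention_weights(sequence, source)

print(self_weights[0])        # [1.0, 0.0, 0.0]: the first token sees only itself
print(len(cross_weights[0]))  # 4: one weight per source token
```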

Attention vs older RNN architectures

Characteristic	RNN / LSTM	Attention (Transformer)
Processing order	Sequential: one token at a time	Parallel: all tokens at once
Long-range dependencies	Degrades over distance as training signals weaken	Any two tokens connect directly, regardless of distance
Training speed	Slow: can't parallelise across time steps	Fast: fully parallelisable on GPUs
Memory cost	O(n) in sequence length	O(n²) in sequence length: the quadratic scaling challenge

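The quadratic scaling cost of attention is the main reason very long context windows require architectural tricks like sparse attention, sliding window attention, or linear approximations. Models with very long contexts, such as Gemini with its 2M-token window, are generally assumed to rely on efficiency techniques of this kind to remain practical at that scale, although the exact methods are not fully public.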
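A quick count shows why the quadratic term bites. The sketch below compares how many Query-Key scores full attention computes with how many a sliding window computes. It assumes a causal window in which each token sees itself and the tokens just before it; the function names and the window size of 128 are illustrative choices, not a description of any particular model.

```python
def full_attention_pairs(n):
    """Every token scores every token: n * n entries."""
    return n * n

def sliding_window_pairs(n, window):
    """Each token scores itself and at most window - 1 preceding tokens."""
    return sum(min(i + 1, window) for i in range(n))

print(full_attention_pairs(8))     # 64
print(sliding_window_pairs(8, 3))  # 21

# Doubling the sequence length quadruples the work for full attention,
# but only roughly doubles it for a fixed-size window.
for n in (1_000, 2_000, 4_000):
    print(n, full_attention_pairs(n), sliding_window_pairs(n, 128))
```

The trade-off is that a token outside the window is no longer directly visible, which is why such techniques are usually combined with other ways of carrying information across long distances.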

For a hands-on feel for how token boundaries affect what attention 'sees', try the token counter to inspect how your text is split before it reaches the model.

Attention example

If you are using AskAI.free, a practical way to understand attention is to ask a model to explain it, then ask for a concrete example in your own workflow. For example: "Explain attention for someone using AI to write, code, research, or create images."

This turns the term from a dictionary definition into a decision-making tool: you can see when it affects prompt quality, model choice, output reliability, privacy, cost, or how much context the AI can use.

Why Attention matters

Attention matters because it changes how you choose, prompt, compare or trust AI systems. If you understand this term, you can ask better questions, spot weak answers faster and choose the right model or tool for the job.

A common mistake is treating attention as isolated jargon. It usually connects to nearby ideas like BERT and Chain of thought, so check those next if you want the full picture.

Common mistake with Attention

The most common mistake is using the term as a label without changing behavior. When attention comes up, ask what action should change: the prompt, the model, the input length, the evidence you request, or the way you verify the answer.

See it in action: ask any AI about attention on AskAI.free.