Attention
In one line: The mathematical mechanism that lets transformers focus on different parts of the input when generating each output token.
What is Attention?
Attention is the mechanism at the core of modern AI language models. When a transformer-based model like Claude or ChatGPT generates the next token, attention lets it weigh every other token in the context window, deciding which ones matter most right now. It's why these models can reference something you wrote 50,000 words earlier in a document without losing the thread.
The landmark 2017 paper Attention Is All You Need introduced the transformer architecture, which is built around stacked attention layers (interleaved with small feed-forward networks) and replaced recurrent neural networks (RNNs) that processed tokens one at a time. The shift was dramatic: transformers could be parallelised across GPUs, trained on far more data, and handled long-range dependencies far more reliably.
The key innovation
For each token, attention computes three vectors: a Query ('what am I looking for?'), a Key ('what do I offer?'), and a Value ('what information do I carry?'). The model scores every Key against the current Query (typically with a dot product), applies a softmax to get weights that sum to 1, then takes a weighted sum of all Values. The result is a rich representation that blends information from whichever tokens were most relevant, not just the nearest ones.
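The score, softmax, and weighted-sum steps above can be sketched in a few lines of plain Python. This is a minimal illustration for a single Query, not a production implementation: the vectors are made-up toy numbers, the function names `softmax` and `attend` are chosen here for illustration, and dividing the scores by the square root of the vector size follows the scaled dot-product form from the original paper, which the text above does not spell out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query attention: score each Key, softmax, weighted sum of Values."""
    scale = math.sqrt(len(query))  # scaling assumed from the original paper
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy 2-dimensional vectors for three tokens (made-up numbers).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]

weights, output = attend(query, keys, values)
print(weights, output)
```

Here the first and third Keys line up with the Query equally well, so they receive equal weight, and the second Key receives less. In a real model the Query, Key, and Value vectors come from learned projections of each token's embedding rather than being written by hand.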
Multi-head attention
Modern LLMs run many attention computations in parallel, each called a 'head'. Different heads learn to track different relationships:
- One head might track grammatical subject-verb agreement.
- Another might resolve coreference, figuring out which 'it' refers to.
- Others capture semantic similarity, negation, temporal order, and more.
The outputs of all heads are concatenated and projected, giving the model a much richer representation than any single attention pass could provide. Large models use many heads per layer: the 175B-parameter GPT-3, for example, uses 96. The architecture of GPT-4-class models has not been published.
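The concatenate-and-project step can be sketched as follows. This is a simplified illustration under stated assumptions: real models give each head its own learned Query, Key, and Value projection matrices, whereas this sketch simply slices each vector into equal parts, and the output projection `w_out` is an identity matrix used as a placeholder. The names `split_heads` and `multi_head_attend` are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head attention for one Query; returns the blended Value vector."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

def split_heads(vector, num_heads):
    """Cut one d_model-sized vector into num_heads equal slices (a simplification)."""
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]

def multi_head_attend(query, keys, values, num_heads, w_out):
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(k, num_heads) for k in keys]
    v_heads = [split_heads(v, num_heads) for v in values]
    concatenated = []
    for h in range(num_heads):
        # Each head attends independently over its own slice of every token.
        head_out = attend(q_heads[h], [k[h] for k in k_heads], [v[h] for v in v_heads])
        concatenated.extend(head_out)
    # Output projection: multiply the concatenated vector by w_out (one row per output).
    return [sum(w * x for w, x in zip(row, concatenated)) for row in w_out]

d_model, num_heads = 4, 2
identity = [[1.0 if i == j else 0.0 for j in range(d_model)] for i in range(d_model)]
tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]  # toy data
result = multi_head_attend(tokens[2], tokens, tokens, num_heads, identity)
print(result)
```

Because each head sees a different slice, the two heads can assign different weights to the same tokens, which is the point of running several heads side by side.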
Self-attention vs cross-attention
| Type | What it attends to | Where used |
|---|---|---|
| Self-attention | Tokens in the same sequence (in a decoder, only the current and earlier tokens) | Every transformer decoder layer (text generation) |
| Cross-attention | A separate input sequence | Vision-language models attending to image patches; translation models attending to the source-language sentence |
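The difference between the two rows comes down to which positions a token is allowed to look at. The sketch below shows one common way to express that, a boolean mask applied before the softmax. The function names and the scores are made up for illustration, and real implementations usually add a large negative number to blocked scores rather than filtering them out.

```python
import math

def masked_softmax(scores, allowed):
    """Softmax over the allowed positions only; blocked positions get weight 0."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    peak = max(kept)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

def causal_mask(length):
    """Decoder self-attention: token i may attend to itself and earlier tokens."""
    return [[j <= i for j in range(length)] for i in range(length)]

def full_mask(num_queries, num_keys):
    """Cross-attention: every query may attend to every position of the other sequence."""
    return [[True] * num_keys for _ in range(num_queries)]

scores = [0.5, 1.0, 0.2, 0.9]  # made-up scores for the token at position 1
mask = causal_mask(4)
weights = masked_softmax(scores, mask[1])
print(weights)
```

With the causal mask, the token at position 1 puts all of its weight on positions 0 and 1, and later positions receive exactly zero. With `full_mask`, nothing is blocked.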
Attention vs older RNN architectures
| Characteristic | RNN / LSTM | Attention (Transformer) |
|---|---|---|
| Processing order | Sequential: one token at a time | Parallel: all tokens at once |
| Long-range dependencies | Degrades over distance as training signals weaken | Direct connection between any two tokens, regardless of distance |
| Training speed | Slow: can't parallelise across time steps | Fast: fully parallelisable on GPUs |
| Memory cost | O(n) with sequence length | O(n²), the quadratic scaling challenge |
The quadratic scaling cost of attention is the main reason very long context windows require architectural tricks like sparse attention, sliding window attention, or linear approximations. Models with very long contexts, such as Gemini with its window of up to 2M tokens, need some form of efficiency technique to remain practical at that scale, although the specifics are not always published.
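A little arithmetic makes the quadratic cost concrete. The sketch below counts how many Query-Key scores are computed with full attention versus a sliding window in which each token looks only at itself and a fixed number of preceding tokens. The window size of 512 is an arbitrary value chosen for illustration, not a figure from any particular model.

```python
def full_attention_pairs(n):
    """Number of Query-Key scores when every token attends to every token."""
    return n * n

def sliding_window_pairs(n, window):
    """Scores when each token attends only to itself and the previous window - 1 tokens."""
    return sum(min(i + 1, window) for i in range(n))

for n in (1_000, 10_000, 100_000):
    print(n, full_attention_pairs(n), sliding_window_pairs(n, window=512))
```

Multiplying the sequence length by 10 multiplies the full-attention count by 100, while the sliding-window count grows roughly in step with the length. The trade-off is that a windowed token can no longer see distant tokens directly.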
For a hands-on feel for how token boundaries affect what attention 'sees', try the token counter to inspect how your text is split before it reaches the model.
Attention example
If you are using AskAI.free, a practical way to understand attention is to ask a model to explain it, then ask for a concrete example in your own workflow. For example: "Explain attention for someone using AI to write, code, research, or create images."
This turns the term from a dictionary definition into a decision-making tool: you can see when it affects prompt quality, model choice, output reliability, privacy, cost, or how much context the AI can use.
Why Attention matters
Attention matters because it changes how you choose, prompt, compare or trust AI systems. If you understand this term, you can ask better questions, spot weak answers faster and choose the right model or tool for the job.
Attention rarely stands alone as a piece of jargon. It connects to nearby ideas like BERT and Chain of thought, so check those next if you want the full picture.
Common mistake with Attention
The most common mistake is using the term as a label without changing behavior. When attention comes up, ask what action should change: the prompt, the model, the input length, the evidence you request, or the way you verify the answer.
See it in action: ask any AI about attention on AskAI.free.