Glossary

Transformer

In one line: The neural network architecture, introduced in 2017, that underpins virtually every major modern LLM, including the models behind ChatGPT, Claude, and Gemini.

What is a Transformer?

The transformer is the neural network architecture introduced in the 2017 paper Attention Is All You Need by Google researchers (Vaswani et al.). Every major modern LLM - the models behind ChatGPT, Claude, and Gemini, as well as DeepSeek R1 and Llama - is built on the transformer architecture. It is, without exaggeration, the engine behind the entire modern AI boom.

What made transformers revolutionary

Before 2017, the dominant architecture for language tasks was the Recurrent Neural Network (RNN). RNNs processed text one token at a time in sequence, which created two fundamental problems: they were slow to train (each step depends on the previous one, so the steps cannot be parallelised), and they struggled to retain information from many tokens earlier. Transformers replaced sequential processing with self-attention, which computes relationships between all pairs of tokens at once. The gains were dramatic:

  • Training on GPUs became highly parallelisable - Each layer processes the entire sequence at once.
  • Long-range context became tractable - The model can directly attend to any earlier token.
  • Scaling became predictable - More compute and data reliably produced better models, enabling the era of foundation models.
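The attention step can be sketched in a few lines of plain Python. This is a minimal, single-head illustration under simplifying assumptions: the toy vectors are made up, and the queries, keys, and values are the raw token vectors themselves, whereas a real transformer first passes them through learned linear projections. The function names are illustrative only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, with no masking.

    Each argument is a list of equal-length float vectors, one per token.
    Returns (outputs, weights); weights[i][j] is how strongly token i
    attends to token j.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:  # each query is independent of the others
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        w = softmax(scores)
        out = [
            sum(w[j] * values[j][c] for j in range(len(values)))
            for c in range(len(values[0]))
        ]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

# Three toy 2-dimensional token vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(x, x, x)
```

Because no query's result depends on another query's result, the loop over queries could run for all tokens at the same time, which is the property that makes training parallelisable on GPUs.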

Transformer architecture simplified

  1. Tokenizer - Converts raw text to token IDs using byte-pair encoding (BPE) or a similar scheme.
  2. Embedding layer - Maps each token ID to a high-dimensional vector (e.g. 4,096 dimensions).
  3. Positional encoding - Injects position information into each vector, because attention itself is order-agnostic.
  4. Stacked attention + feedforward layers - The core computation block, repeated 32–128 times in modern models. Each layer refines the representations.
  5. Output head - Projects the final representation back to vocabulary probabilities. The model samples from these to produce the next token.
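Steps 3 and 5 are small enough to sketch directly. The example below assumes the fixed sinusoidal encoding from the original paper for step 3 (many later models use learned or rotary position schemes instead) and a plain softmax with temperature for step 5. The four-word vocabulary and the logit values are hypothetical.

```python
import math
import random

def sinusoidal_position(pos, d_model):
    """Fixed positional encoding for one position (step 3).

    Even indices use sine and odd indices use cosine, with each
    sine/cosine pair sharing one frequency.
    """
    vec = []
    for i in range(d_model):
        angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def softmax(logits, temperature=1.0):
    """Turn output-head scores into probabilities (step 5)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, vocab, rng, temperature=1.0):
    """Sample one token according to the softmax probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

vocab = ["cat", "dog", "the", "sat"]  # hypothetical 4-token vocabulary
logits = [2.0, 1.0, 0.1, -1.0]        # made-up output-head scores

probs = softmax(logits)
next_token = sample_next_token(logits, vocab, random.Random(0))
position_zero = sinusoidal_position(0, 4)
```

A lower temperature concentrates probability on the highest-scoring token, while a higher temperature spreads it out, which is why temperature settings change how varied a model's output feels.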

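Parameters are the weights learned during training, chiefly in the embedding layer, the attention and feedforward layers, and the output head (steps 2, 4, and 5); the positional encodings in step 3 are learned in some models and fixed in others. A 70-billion-parameter model has 70 billion such weights.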
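To get a feel for where those weights sit, here is a rough back-of-the-envelope count. It is a sketch under stated assumptions, not an exact formula for any particular model: each layer is taken to hold 4 × d_model² attention weights plus 8 × d_model² feedforward weights (a feedforward hidden size of 4 × d_model), and biases, layer norms, and any separate output-head matrix are ignored. The layer count, width, and vocabulary size are hypothetical.

```python
def approx_decoder_params(n_layers, d_model, vocab_size):
    """Rough parameter count for a decoder-only transformer.

    Per layer: 4 * d_model^2 for the query, key, value, and output
    projections, plus 8 * d_model^2 for a feedforward block with
    hidden size 4 * d_model. Small terms are ignored.
    """
    per_layer = 12 * d_model ** 2
    embeddings = vocab_size * d_model
    return n_layers * per_layer + embeddings

# Hypothetical configuration: 32 layers, 4,096 dimensions, 32,000 tokens.
total = approx_decoder_params(n_layers=32, d_model=4096, vocab_size=32000)
print(f"{total:,}")  # 6,573,522,944
```

Under these assumptions almost all of the weights sit in the stacked layers rather than the embedding table, so making a model deeper or wider is what drives the parameter count up.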

Encoder vs decoder vs encoder-decoder

| Type | Used for | Examples |
|---|---|---|
| Decoder-only | Text generation - Predicts the next token autoregressively | GPT-4, Claude, Llama, DeepSeek |
| Encoder-only | Understanding - Produces rich embeddings of input text | BERT, RoBERTa, sentence transformers |
| Encoder-decoder | Sequence-to-sequence tasks - Translation, summarisation | T5, BART, mT5 |
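One concrete difference between these families is which tokens each position is allowed to attend to. The sketch below is illustrative only (the helper name is hypothetical): a decoder uses a causal mask so that a token sees only itself and earlier tokens, while an encoder lets every token see the whole input. Encoder-decoder models combine both, with the decoder also attending to the encoder's output.

```python
def attention_mask(seq_len, causal):
    """Return mask[i][j] == True where token i may attend to token j."""
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]

decoder_mask = attention_mask(4, causal=True)   # decoder-only style
encoder_mask = attention_mask(4, causal=False)  # encoder-only style

for row in decoder_mask:
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

The triangular pattern is what makes next-token prediction a fair task: during training the model cannot look ahead at the token it is being asked to predict.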

The scaling insight

What made transformers uniquely important was not just the architecture but what happened when researchers made them bigger. Scaling laws showed that model performance on language tasks improves predictably as you increase model size, training compute, and data volume together. This gave labs a reliable roadmap: invest in compute, collect more data, and capability grows. GPT stands for Generative Pre-trained Transformer, and every generation of OpenAI's flagship models has exploited this scaling insight. Anthropic's Claude models follow the same underlying architecture.
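"Improves predictably" usually means the loss follows something close to a power law. The sketch below shows the shape of such a curve for model size alone, which is a simplification of the combined size, compute, and data picture described above. The constants n_c and alpha are placeholders chosen for illustration, not fitted values from any study.

```python
def power_law_loss(n_params, n_c=1e13, alpha=0.08):
    """Illustrative loss curve L(N) = (n_c / N) ** alpha.

    n_c and alpha are placeholder constants, not measured results.
    """
    return (n_c / n_params) ** alpha

for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> loss {power_law_loss(n):.3f}")
```

With a curve of this shape, every tenfold increase in parameters multiplies the loss by the same factor, which is why such trends appear as straight lines on a log-log plot and can be extrapolated before a larger model is trained.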

Transformer example

If you are using AskAI.free, a practical way to understand the transformer is to ask a model to explain it, then ask for a concrete example from your own workflow. For example: "Explain the transformer architecture for someone using AI to write, code, research, or create images."

This turns the term from a dictionary definition into a decision-making tool: you can see when it affects prompt quality, model choice, output reliability, privacy, cost, or how much context the AI can use.

Why Transformer matters

The transformer matters because its design shapes how you choose, prompt, compare, and trust AI systems. If you understand the term, you can ask better questions, spot weak answers faster, and choose the right model or tool for the job.

A common mistake is treating "transformer" as isolated jargon. It connects to nearby ideas such as Vision model and Zero-shot, so check those next if you want the full picture.

Common mistake with Transformer

The most common mistake is using the term as a label without changing behavior. When the term comes up, ask what action should change: the prompt, the model, the input length, the evidence you request, or the way you verify the answer.
