Glossary

Training data

In one line: The text an LLM is taught on. Typically trillions of tokens scraped from the web, books, code repos and more.

What is Training data?

Training data is the text, code, images, and other media that a foundation model learns from during pre-training. Modern LLMs are trained on staggering volumes of data: trillions of tokens drawn from across the public internet and licensed collections, representing what may be the largest corpora of human-produced text ever assembled. The quality, diversity, and balance of this data shape almost everything the model knows, believes, and gets wrong.

Scale of training data

Model	Estimated data size	Key sources
GPT-3 (2020)	~300 billion tokens	Common Crawl, WebText, Books, Wikipedia
GPT-4 (2023)	~10–13 trillion tokens (est.)	Common Crawl, books, code, curated web
Llama 3 (2024)	15 trillion tokens	Common Crawl, code repos, multilingual web
Claude (Anthropic)	Undisclosed	Curated web, books, code (Constitutional AI is a fine-tuning method applied afterwards, not a data source)
Gemini (Google)	Undisclosed	Web, books, code, YouTube captions, multimodal data

What goes into training data

Common Crawl - A petabyte-scale scrape of the public web, with new crawls released roughly monthly. Forms the backbone of most pre-training datasets but requires heavy filtering to remove spam, duplicates, and low-quality content.
Books - Digitised books via BookCorpus, Project Gutenberg, and licensed datasets. Improves long-form coherence and reasoning.
Code - GitHub repositories, Stack Overflow, documentation. Makes models dramatically better at programming tasks.
Wikipedia - High-quality factual content in 300+ languages.
Scientific papers - arXiv, PubMed, licensed academic journals.
Synthetic data - AI-generated text used to supplement real data, particularly for tasks where human-labelled data is scarce.
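
The filtering step mentioned under Common Crawl can be sketched in a few lines. This is a minimal illustration only, not any lab's actual pipeline: the function names, the sample pages, and the thresholds (`min_words`, `max_repeat_ratio`) are all assumptions made for the example. Production pipelines typically add near-duplicate detection, language identification, and learned quality classifiers.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def looks_low_quality(text: str, min_words: int = 5, max_repeat_ratio: float = 0.5) -> bool:
    """Hypothetical heuristic: too short, or dominated by one repeated word."""
    words = normalize(text).split()
    if len(words) < min_words:
        return True
    most_common = max(words.count(w) for w in set(words))
    return most_common / len(words) > max_repeat_ratio


def clean_corpus(docs: list[str]) -> list[str]:
    """Drop exact duplicates (after normalization) and low-quality documents."""
    seen: set[str] = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen or looks_low_quality(doc):
            continue
        seen.add(digest)
        kept.append(doc)
    return kept


# Made-up sample pages for illustration.
raw_pages = [
    "Transformers process tokens in parallel using attention.",
    "Transformers  process tokens in parallel using ATTENTION.",  # duplicate after normalization
    "buy buy buy buy buy buy now",                                 # spam-like repetition
    "ok",                                                          # too short
    "Python lists are dynamic arrays that grow as needed.",
]
cleaned = clean_corpus(raw_pages)
print(cleaned)
```

Of the five sample pages, only the first and last survive: one is a duplicate once case and spacing are normalized, one is repetitive spam, and one is too short.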

Why data mix matters

The composition of training data directly determines a model's strengths and blind spots. A model trained on more code is better at programming. One with more multilingual text is better at translation. Models trained on medical literature perform better on clinical tasks. Fine-tuning after pre-training can correct some gaps, but it cannot fully compensate for missing knowledge in the base training data. Data quality is increasingly seen as more valuable than raw quantity: carefully curated datasets often outperform larger but noisier ones.
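
One way to picture a data mix is as a set of sampling weights over sources. The sketch below assumes a made-up mixture (the `MIX` weights are hypothetical; labs rarely publish theirs) and draws source labels in proportion to those weights, which is roughly how a training loop might decide where each batch comes from.

```python
import random
from collections import Counter

# Hypothetical mixture weights, chosen only for illustration.
MIX = {"web": 0.60, "code": 0.20, "books": 0.10, "wikipedia": 0.05, "papers": 0.05}


def sample_sources(mix: dict[str, float], n: int, seed: int = 0) -> Counter:
    """Draw n source labels in proportion to the mixture weights."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=n))


counts = sample_sources(MIX, n=10_000)
for name, count in counts.most_common():
    print(f"{name:10s} {count / 10_000:.1%}")
```

Raising the weight on "code" in this sketch means the model sees proportionally more code during training, which is the mechanism behind the claim that a code-heavy mix produces a better programming model.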

Controversies and challenges

Training data sits at the centre of several live debates:

Copyright - Authors, news organisations, and code authors have filed lawsuits arguing that scraping their work without permission or compensation violates copyright law. Outcomes will shape how future datasets are built.
Bias - Training data that over-represents Western, English-speaking perspectives produces models that reflect those biases. This affects everything from which names the model associates with professions to how it renders non-English languages. See alignment for how labs try to correct this.
Synthetic data risks - Using AI-generated text as training data can create feedback loops where models reinforce their own errors. Researchers call this 'model collapse.'
Knowledge cutoff - Training data has a fixed end date, and the model knows nothing about events after it unless that information is supplied at query time. See knowledge cutoff.
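
The feedback loop described under synthetic data risks can be illustrated with a toy simulation. This is a deliberately simplified sketch, not how collapse is measured in real LLMs: the "model" here is just a normal distribution, and each generation is fitted only to a small sample drawn from the previous generation. The function name and all parameters are assumptions for the example.

```python
import random
import statistics


def simulate_collapse(generations: int = 200, sample_size: int = 10, seed: int = 0) -> list[float]:
    """Return the fitted standard deviation after each generation."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 stands in for the "real" data distribution
    history = [sigma]
    for _ in range(generations):
        # The current model generates a small synthetic dataset...
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # ...and the next model is fitted to that synthetic data alone.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


history = simulate_collapse()
print(f"start: {history[0]:.3f}  end: {history[-1]:.3g}")
```

Because each generation sees only a finite sample of the previous one, the fitted spread tends to shrink over time and the tails of the original distribution are lost. That loss of diversity is the intuition behind the term, even though real models and datasets are far more complex.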

A common follow-up question: do your own chats become training data? On AskAI.free they do not - the FAQ explains why conversations are never used for training.

Training data example

If you are using AskAI.free, a practical way to understand training data is to ask a model to explain it, then ask for a concrete example in your own workflow. For example: "Explain training data for someone using AI to write, code, research, or create images."

This turns the term from a dictionary definition into a decision-making tool: you can see when it affects prompt quality, model choice, output reliability, privacy, cost, or how much context the AI can use.

Why Training data matters

Training data matters because it changes how you choose, prompt, compare or trust AI systems. If you understand this term, you can ask better questions, spot weak answers faster and choose the right model or tool for the job.

Training data is not isolated jargon. It connects to nearby ideas like Transformer and Vision model, so check those next if you want the full picture.

Common mistake with Training data

The most common mistake is using the term as a label without changing behavior. When training data comes up, ask what action should change: the prompt, the model, the input length, the evidence you request, or the way you verify the answer.

See it in action - Ask any AI about training data on AskAI.free.
