
Training data

In one line: The text an LLM learns from. Typically trillions of tokens scraped from the web, books, code repos, and more.

Training data is the text an LLM is trained on. Modern foundation models are trained on staggering amounts of text:

  • GPT-3: ~300B tokens
  • GPT-4: estimated ~13T tokens
  • Llama 3: 15T+ tokens
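To get a feel for that scale, here is a rough back-of-the-envelope sketch. It assumes the common (approximate) heuristic of ~4 characters of English text per token — real tokenizers vary by language and content:

```python
# Back-of-the-envelope: how much raw text do these token counts represent?
# Assumes ~4 characters (~4 bytes of plain ASCII) per token — a rough heuristic.
CHARS_PER_TOKEN = 4

def tokens_to_terabytes(tokens: float) -> float:
    """Approximate raw-text size in terabytes for a given token count."""
    return tokens * CHARS_PER_TOKEN / 1e12  # bytes -> TB

for name, tokens in [("GPT-3", 300e9), ("GPT-4 (est.)", 13e12), ("Llama 3", 15e12)]:
    print(f"{name}: ~{tokens_to_terabytes(tokens):.1f} TB of text")
# GPT-3: ~1.2 TB · GPT-4 (est.): ~52.0 TB · Llama 3: ~60.0 TB
```

Even under this crude estimate, 15T tokens is tens of terabytes of raw text — far more than any person could read in many lifetimes.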

Sources include the public web (Common Crawl), books, code repositories (GitHub), academic papers, Wikipedia, and various licensed datasets. The mix matters: a model trained heavily on code is better at code; one trained on more multilingual text is better at translation.

Training data is also the source of much controversy: copyright lawsuits over scraped works, the ethics of web scraping, and model bias inherited from biased source data.

See it in action — ask any AI about training data on AskAI.free.

Try it free →