
Training data

In one line: The text an LLM learns from. Typically trillions of tokens scraped from the web, books, code repos, and more.

Training data is the text an LLM is trained on. Modern foundation models are trained on staggering amounts of text:

  • GPT-3: ~300B tokens
  • GPT-4: estimated ~13T tokens
  • Llama 3: 15T+ tokens
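To get a feel for that scale, here is a rough back-of-the-envelope sketch. It assumes the common (approximate) heuristic of ~4 characters of English text per token — real tokenizers vary by language and content:

```python
# Back-of-the-envelope: how much raw text do these token counts represent?
# Assumes ~4 characters (~4 bytes of plain ASCII) per token — a rough heuristic.
CHARS_PER_TOKEN = 4

def tokens_to_terabytes(tokens: float) -> float:
    """Approximate raw-text size in terabytes for a given token count."""
    return tokens * CHARS_PER_TOKEN / 1e12  # bytes -> TB

for name, tokens in [("GPT-3", 300e9), ("GPT-4 (est.)", 13e12), ("Llama 3", 15e12)]:
    print(f"{name}: ~{tokens_to_terabytes(tokens):.1f} TB of text")
# GPT-3: ~1.2 TB · GPT-4 (est.): ~52.0 TB · Llama 3: ~60.0 TB
```

Even under this crude estimate, 15T tokens is tens of terabytes of raw text — far more than any person could read in many lifetimes.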

Sources include the public web (Common Crawl), books, code repositories (GitHub), academic papers, Wikipedia, and various licensed datasets. The mix matters: a model trained heavily on code is better at code; one trained on more multilingual text is better at translation.

Training data is also the source of much controversy: copyright lawsuits over scraped works, the ethics of web scraping, and model bias inherited from biased source data.

See it in action — ask any AI about training data on AskAI.free.

Try it free →