Tokenizer
In one line: The component that converts text into tokens (and back). Different models use different tokenizers, which is why the same sentence has different token counts in GPT and Claude.
A tokenizer is the component that converts text into tokens before it is sent to an LLM (and converts the model's output tokens back into text).
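A minimal sketch of that round trip, using tiktoken, OpenAI's open-source tokenizer library (the sample sentence is illustrative):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4-family tokenizer

text = "Tokenizers split text into subword units."
tokens = enc.encode(text)          # text -> list of integer token IDs
print(tokens)
print(enc.decode(tokens) == text)  # IDs -> text; the round trip is lossless: True
```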
Different models use different tokenizers:
- GPT-4 family uses cl100k_base
- GPT-4o uses o200k_base
- Claude uses Anthropic's own tokenizer (BPE-based)
- Gemini uses SentencePiece
The same text produces a different token count depending on the model. English token counts are roughly comparable across tokenizers; non-English text can take 2-3x more tokens, and code can also use more tokens than prose.
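To see the difference directly, the sketch below counts tokens for the same strings with the GPT-4 and GPT-4o tokenizers via tiktoken; the sample strings are illustrative, and exact counts vary with the text:

```python
import tiktoken

samples = {
    "english": "The quick brown fox jumps over the lazy dog.",
    "japanese": "素早い茶色の狐はのろまな犬を飛び越える。",
    "code": "def add(a, b):\n    return a + b",
}

# Same strings, two tokenizers: token counts differ per model family.
for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    for label, text in samples.items():
        print(f"{name:12s} {label:8s} {len(enc.encode(text))} tokens")
```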
See it in action: ask any AI about tokenizers on AskAI.free.