
Tokenization

Breaking text into tokens — the atoms of language models


Before an AI can read your text, it chops it into small pieces called tokens. Think of it like cutting a sentence into puzzle pieces — but instead of splitting at every space, the AI uses a smarter system called Byte Pair Encoding.
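To see how BPE builds those pieces, here is a minimal toy training loop (a sketch of the core idea, not any production tokenizer): it repeatedly finds the most frequent adjacent pair of symbols in a corpus and merges that pair into a new, longer token.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word pre-split into characters
corpus = {tuple("hello"): 5, tuple("hell"): 2, tuple("help"): 3}
for _ in range(3):                      # run 3 merge steps
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged", pair)
```

After three merges, frequent fragments like "hell" have become single tokens, which is exactly why common words end up as one piece.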

Common words like "the" or "hello" stay as one piece. Unusual words get broken into smaller chunks. A typical LLM has a vocabulary of about 100,000 tokens — that's its entire alphabet. This matters because AI models charge per token, can only handle a limited number at once, and — as you'll see later — even struggle to spell because they think in tokens, not letters.

Common word = 1 token: "hello"

vs.

Unusual word = many tokens: "hellooooo"

But why not just use whole words? Older models did exactly that, and whenever they hit a word not in their dictionary, it simply became UNKNOWN: all meaning lost. Subword tokenization fixes this: even if the model has never seen "Giga-awesome", it knows "Giga", "-", and "awesome" separately.
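To make the fallback concrete, here is a toy greedy longest-match tokenizer over a small hypothetical vocabulary (real BPE applies learned merge rules instead, but the effect on unseen words is similar):

```python
def tokenize(text, vocab):
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

# Hypothetical vocabulary, chosen for illustration
vocab = {"Giga", "awesome", "-", "hello", "o"}
print(tokenize("Giga-awesome", vocab))  # ['Giga', '-', 'awesome']
print(tokenize("hellooooo", vocab))     # ['hello', 'o', 'o', 'o', 'o']
```

No word ever becomes UNKNOWN: in the worst case it degrades into single characters, never into lost meaning.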

It's also far more efficient. LLMs can only process a limited number of tokens at once. Splitting by characters would use ~5x more tokens than subwords — so smarter tokenization means the model can read much more text in one go.
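A quick back-of-the-envelope illustration of that ratio (the subword segmentation below is hypothetical, chosen just to show the arithmetic):

```python
text = "hello hello world"
char_tokens = len(text)                   # character-level: 17 tokens
subwords = ["hello", " hello", " world"]  # hypothetical subword split: 3 tokens
assert "".join(subwords) == text          # both cover the same text
print(char_tokens / len(subwords))        # ~5.7x more tokens at character level
```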

Why this matters for your wallet: API pricing is per-token. Because the tokenizer was trained mostly on English, a prompt in Russian or Arabic uses roughly 3x more tokens than the same meaning in English — and costs 3x more.
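A sketch of the cost math with made-up numbers (the price and token counts below are assumptions for illustration, not real API rates):

```python
price_per_1k_tokens = 0.002                # assumed price in USD, not a real rate
english_tokens = 500                       # assumed prompt length in English
multilingual_tokens = english_tokens * 3   # the ~3x overhead described above

cost_en = english_tokens / 1000 * price_per_1k_tokens
cost_ru = multilingual_tokens / 1000 * price_per_1k_tokens
print(f"English: ${cost_en:.4f}, Russian/Arabic: ${cost_ru:.4f}")
```

Same meaning, three times the token count, three times the bill.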

Your turn — try it out!

Tokenizer Sandbox

Type anything and see how the AI breaks it into tokens. Try emojis, code, other languages, or misspelled words!

Example: "Hello world! 🌍" → 5 tokens ("Hello", " world", "!", " ", "🌍"), 15 characters.