= "The dog eats the apples." text
This post is largely inspired by Understanding GPT tokenizers by Simon Willison.
Large Language Models don’t work with words, they work with tokens. They take text, convert it into tokens (integers), then predict which tokens should come next.
To explain this I will use the transformers_js_py
library which allows us to work with LLMs in the browser through WebAssembly.
Let’s consider a text we want to tokenize:
Each LLM has its own tokenizer, so we need to specify which model we are going to use:
= "Qwen/Qwen2.5-0.5B-Instruct" model_id
Finally, we can encode it with the model’s tokenizer by running the following cell:
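The notebook cell itself uses transformers_js_py in the browser. If you prefer to follow along locally, a minimal sketch with the standard transformers Python library (not the exact in-browser cell, just an equivalent I am assuming here) would look roughly like this, reusing the text and model_id defined above:

```python
from transformers import AutoTokenizer

# Download the tokenizer files for the model chosen above from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Convert the text into a list of token ids (integers).
token_ids = tokenizer.encode(text)
print(token_ids)
```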
You should see as output the following list of integers:
token_ids = [785, 5562, 49677, 279, 40676, 13]
This list of integers corresponds to the token ids.
You can encode other texts by running the following code:
"El perro come las manzanas.") tokenizer.encode(
You can also modify the model_id to see how the tokens change (search for other models on Hugging Face). For example:
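In the local sketch, switching to the Mistral 7B v0.3 tokenizer would look roughly like this (the exact Hub repository id below is my guess from the variable name that follows, and Mistral checkpoints may be gated behind an agreement on Hugging Face):

```python
from transformers import AutoTokenizer

# Assumed Hub id for Mistral 7B v0.3; the repository may require accepting its terms.
mistral_model_id = "mistralai/Mistral-7B-v0.3"
mistral_tokenizer = AutoTokenizer.from_pretrained(mistral_model_id)

token_ids_Mistral_7B_v0_3 = mistral_tokenizer.encode("The dog eats the apples.")
print(token_ids_Mistral_7B_v0_3)
```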
With this model, you should see as output the following list of token ids:
token_ids_Mistral_7B_v0_3 = [1, 1183, 4682, 1085, 2217, 1040, 1747, 3583, 29491]
You can observe that even though the text is the same, these tokens are very different from the previous ones:
token_ids = [785, 5562, 49677, 279, 40676, 13]
We can also do the reverse operation: take the token ids and convert them back into text:
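With the same local sketch, decoding (and checking the round trip) would look roughly like:

```python
# Turn the token ids back into a string.
decoded_text = tokenizer.decode(token_ids)
print(decoded_text)

# The round trip should recover the original text (no special tokens were added here).
assert decoded_text == text
```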
Encoding a text and then decoding it should give back the original text.
Playing with tokenizers reveals all sorts of interesting facts.
Most common English words are assigned a single token, as demonstrated by the encoding above:
- “The”: 785
- “ dog”: 5562
- “ eats”: 49677
- “ the”: 279
- “ apples”: 40676
- “.”: 13
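You can inspect this word-to-id mapping yourself. In the local sketch, convert_ids_to_tokens returns the raw token strings; byte-level BPE tokenizers like this one typically render a leading space as “Ġ”:

```python
# Map each id back to the raw token string it represents.
tokens = tokenizer.convert_ids_to_tokens(token_ids)
for token, token_id in zip(tokens, token_ids):
    print(repr(token), token_id)
```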
Capitalization is important: “The” with a capital T corresponds to token 785, but “the” with a lowercase t is 1782, and “ the” with both a leading space and a lowercase t is token 279.
Many words also have a token that incorporates a leading space. This makes for much more efficient encoding of full sentences, since they can be encoded without needing to spend a token on each whitespace character.
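A quick way to see both effects with the same sketch:

```python
# Same word, different capitalization and leading space -> different token ids.
for variant in ["The", "the", " the"]:
    ids = tokenizer.encode(variant, add_special_tokens=False)
    print(repr(variant), ids)
```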
Numbers get their own tokens:
- “0”: 15
- “1”: 16
- “2”: 17
- …
- “9”: 24
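You can confirm this digit by digit:

```python
# Each digit should come back as a single token id.
for digit in "0123456789":
    print(digit, tokenizer.encode(digit, add_special_tokens=False))
```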
Languages other than English suffer from less efficient tokenization.
“El perro come las manzanas” in Spanish is encoded like this:
- “El”: 6582
- “ per”: 817
- “ro”: 299
- “ come”: 2525
- “ las”: 5141
- “ man”: 883
- “z”: 89
- “ anas”: 25908
- “.”: 13
“Le chien mange les pommes” in French is encoded like this:
- “Le”: 2304
- “ ch”: 521
- “ien”: 3591
- “ mange”: 59434
- “ les”: 3541
- “ pom”: 29484
- “ mes”: 8828
- “.”: 13
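You can measure this difference directly by counting how many tokens each sentence needs with the same tokenizer; from the breakdowns above, the English sentence takes 6 tokens, the Spanish one 9, and the French one 8:

```python
sentences = {
    "English": "The dog eats the apples.",
    "Spanish": "El perro come las manzanas.",
    "French": "Le chien mange les pommes.",
}

# Count tokens per sentence with the tokenizer loaded earlier.
for language, sentence in sentences.items():
    n_tokens = len(tokenizer.encode(sentence, add_special_tokens=False))
    print(f"{language}: {n_tokens} tokens")
```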
There are all sorts of other interesting things, like glitch tokens.
The majority of tokenizers are trained with the byte-pair encoding algorithm.
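As a toy illustration (not the actual training code of any of these tokenizers), a single BPE training step finds the most frequent adjacent pair of symbols and merges it into a new symbol; repeating this builds up the vocabulary:

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all sequences and return the most common one."""
    pairs = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Start from individual characters and apply a few merges.
corpus = [list("the dog eats the apples")]
for _ in range(5):
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    corpus = [merge_pair(seq, pair) for seq in corpus]
    print("merged", pair, "->", " | ".join(corpus[0]))
```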
Many researchers think we should work directly with bytes and shouldn’t have tokenizers at all, and they are actively trying to remove them (so far without much success).