Tokens

Language models cannot process raw text directly: they operate on numbers, not on characters. Text is therefore split into short sequences of characters, and each sequence is “translated” into an integer. These units are called tokens.
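As a rough illustration of this translation step, the sketch below maps pieces of text to integers and back. The vocabulary, its IDs, and the function names are made up for this example; real models use learned vocabularies with many thousands of entries.

```python
# Hypothetical toy vocabulary: each known piece of text has one integer ID.
TOY_VOCAB = {"plant": 0, "s": 1, " grow": 2}

# Reverse lookup, used to turn integers back into text.
ID_TO_PIECE = {token_id: piece for piece, token_id in TOY_VOCAB.items()}


def pieces_to_ids(pieces):
    """Translate a list of text pieces into their integer IDs."""
    return [TOY_VOCAB[piece] for piece in pieces]


def ids_to_text(token_ids):
    """Translate integer IDs back into text by joining their pieces."""
    return "".join(ID_TO_PIECE[token_id] for token_id in token_ids)


ids = pieces_to_ids(["plant", "s", " grow"])
print(ids)               # [0, 1, 2]
print(ids_to_text(ids))  # plants grow
```

The model itself only ever sees the list of integers; the mapping back to text happens outside the model.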

A token may be a whole word, such as “plant”, but it may also be only a chunk of a word. Tokens often include whitespace. See the example below:

How much wood would a woodchuck chuck if a woodchuck could chuck wood?

In the rendered example, each token is highlighted in red, orange, yellow, or green. Almost all tokens start with a leading whitespace. Furthermore, the word “woodchuck” is split into two tokens: “ wood” and “chuck”. The question mark is a token of its own.
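The split described above can be reproduced with a small sketch. It assumes a hand-picked list of pieces (hypothetical, chosen only to match this sentence) and uses greedy longest-match splitting, which is a simplification: real tokenizers typically learn their vocabulary from data, for example with byte-pair encoding.

```python
# Hypothetical pieces, hand-picked for this one sentence (not a real vocabulary).
TOY_PIECES = ["How", " much", " wood", " would", " a", "chuck",
              " chuck", " if", " could", "?"]


def split_into_tokens(text, pieces=TOY_PIECES):
    """Split text greedily, always taking the longest piece that matches."""
    tokens = []
    pos = 0
    while pos < len(text):
        matches = [p for p in pieces if text.startswith(p, pos)]
        if not matches:
            raise ValueError(f"no piece matches at position {pos}")
        longest = max(matches, key=len)
        tokens.append(longest)
        pos += len(longest)
    return tokens


sentence = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
tokens = split_into_tokens(sentence)
print(tokens)

# Count how many tokens carry a leading whitespace.
with_space = sum(1 for t in tokens if t.startswith(" "))
print(with_space, "of", len(tokens))  # 12 of 16
```

Because the whitespace is stored inside the token, joining the tokens back together restores the original sentence exactly, with no need to guess where the spaces belong.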