Language models can only work with data that is in a digestible format. Under the hood, such models consist largely of big matrices of floating-point numbers, and matrices operate on numbers, not on characters. Text is therefore split into chunks of characters, so-called tokens, each of which is mapped to an integer ID.
A token may be a single word, such as “plant”, but it may also be a fragment of a word. Tokens often include a leading whitespace. See the example below:
How much wood would a woodchuck chuck if a woodchuck could chuck wood?
We can see here that almost every token (each token colored in either red, orange, yellow, or green) starts with a leading whitespace. Furthermore, the word “woodchuck” is split into two tokens, “ wood” and “chuck”, and the question mark is a token of its own.
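This splitting behavior can be sketched in a few lines of Python. The snippet below is a minimal illustration only: it uses a tiny hand-picked vocabulary (`TOY_VOCAB`, an assumption for this example, not the vocabulary of any real model) and greedy longest-match lookup. Production tokenizers such as BPE learn their vocabularies from data, but the greedy matching shown here is enough to reproduce the effects described above, including “woodchuck” splitting into “ wood” and “chuck”.

```python
# Toy vocabulary mapping token strings to integer IDs. Note the leading
# whitespace baked into most entries, as in the colored example above.
TOY_VOCAB = {
    "How": 0, " much": 1, " wood": 2, " would": 3, " a": 4,
    "chuck": 5, " if": 6, " could": 7, " chuck": 8, "?": 9,
}

def tokenize(text: str) -> list[int]:
    """Greedily match the longest vocabulary entry at each position."""
    ids = []
    pos = 0
    while pos < len(text):
        match = None
        for token, token_id in TOY_VOCAB.items():
            if text.startswith(token, pos) and (
                match is None or len(token) > len(match[0])
            ):
                match = (token, token_id)
        if match is None:
            raise ValueError(f"no token matches at position {pos}")
        ids.append(match[1])
        pos += len(match[0])
    return ids

sentence = "How much wood would a woodchuck chuck if a woodchuck could chuck wood?"
print(tokenize(sentence))
# " woodchuck" becomes two IDs: 2 (" wood") followed by 5 ("chuck"),
# and "?" gets its own ID (9).
```

Note that “ chuck” (with a leading space, ID 8) and “chuck” (without, ID 5) are distinct tokens: the space-less variant only appears when the characters are glued to a preceding word, as in “woodchuck”.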