
(De)-Tokenize

With the tokenize endpoint, you can use our own tokenizer to tokenize your texts for further use. You can also turn token IDs back into text with the detokenize endpoint.

Code Example

import os
from aleph_alpha_client import Client, TokenizationRequest, DetokenizationRequest
# If you are using a Windows machine, you also need to install the python-dotenv package and run the two lines below.
# from dotenv import load_dotenv
# load_dotenv()

client = Client(token=os.getenv("AA_TOKEN"))
prompt_text = "An apple a day keeps the doctor away"
params = {
    "prompt": prompt_text,
    "tokens": True,
    "token_ids": True,
}
tokenization_request = TokenizationRequest(**params)
tokenization_response = client.tokenize(request=tokenization_request, model="luminous-base")
tokens = tokenization_response.tokens
token_ids = tokenization_response.token_ids

print("Your prompt consists of the following tokens: {}".format(
[item for item in zip(tokens, token_ids)]
))
# prints:
# Your prompt consists of the following tokens: [('ĠAn', 556), ('Ġapple', 48741), ('Ġa', 247), ('Ġday', 2983), ('Ġkeeps', 28063), ('Ġthe', 301), ('Ġdoctor', 10510), ('Ġaway', 5469)]
# Note: The string token representations shown here consist of characters (e.g. Ġ) that may denote spaces or other special characters.

detokenization_request = DetokenizationRequest(token_ids=token_ids)
detokenization_response = client.detokenize(request=detokenization_request, model="luminous-base")
detokenized_text = detokenization_response.result

print(f"Detokenized: '{detokenized_text}'")
# prints:
# Detokenized: ' An apple a day keeps the doctor away'
# Note: Due to our tokenizer's properties, detokenization will add a whitespace in front.
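A common next step with the returned token_ids is checking the prompt length before sending further requests. The snippet below is only a sketch that continues the example above: the limit of 2048 tokens is an assumed placeholder, not a documented value, so consult the model documentation for the actual context size.

# Sketch: use the token count to validate prompt length before further requests.
# MAX_CONTEXT_TOKENS is an assumed placeholder, not a documented limit.
MAX_CONTEXT_TOKENS = 2048

if len(token_ids) > MAX_CONTEXT_TOKENS:
    raise ValueError(
        f"Prompt is {len(token_ids)} tokens long, "
        f"which exceeds the assumed limit of {MAX_CONTEXT_TOKENS}."
    )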

Using the Tokenizer Locally

You can also directly access the tokenizer we use in our models. This allows you, for example, to count the number of tokens in your prompt accurately without calling the API.

import os
from aleph_alpha_client import Client

client = Client(token=os.getenv("AA_TOKEN"))

# Fetch the tokenizer used by the model; encoding and decoding then run locally.
tokenizer = client.tokenizer("luminous-supreme")
text = "Friends, Romans, countrymen, lend me your ears;"

tokens = tokenizer.encode(text)
and_back_to_text = tokenizer.decode(tokens.ids)

print("Tokens:", tokens.ids)
print("Back to text from ids:", and_back_to_text)

# Tokens: [37634, 15, 51399, 15, 6326, 645, 15, 75938, 489, 867, 47317, 30]
# Back to text from ids: Friends, Romans, countrymen, lend me your ears;
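
Because the tokenizer runs locally, counting tokens does not require an API call. Continuing the example above:

token_count = len(tokens.ids)
print(f"The prompt contains {token_count} tokens.")
# prints: The prompt contains 12 tokens.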

If you need more information on the parameters you can use, please check out our HTTP API documentation (tokenization, detokenization).
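
For orientation, a direct HTTP call might look like the sketch below. It assumes a POST /tokenize endpoint at https://api.aleph-alpha.com that accepts the same fields as TokenizationRequest; please verify the exact paths and fields in the HTTP API documentation linked above.

import os
import requests

# Sketch only: the endpoint path and request fields are assumptions based on the
# client usage above; check the HTTP API documentation for the exact schema.
response = requests.post(
    "https://api.aleph-alpha.com/tokenize",
    headers={"Authorization": f"Bearer {os.getenv('AA_TOKEN')}"},
    json={
        "model": "luminous-base",
        "prompt": "An apple a day keeps the doctor away",
        "tokens": True,
        "token_ids": True,
    },
)
response.raise_for_status()
print(response.json())  # expected to contain "tokens" and "token_ids"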