Announcing constrained decoding to ensure JSON format
Overview
JSON is a widely used format for machine-to-machine communication. While prompting a model to respond in JSON often yields the correct format, there is no guarantee that the generated output will always be valid. Invalid JSON can cause applications to fail if they rely on strict adherence to the format. To address this, constrained decoding has been implemented at the inference level, ensuring that the model's output strictly complies with the JSON format.
Limitations
The current implementation comes with the following limitations:
- Slower compared to standard, unconstrained completions (a few tokens per second)
- Does not currently support schemas. While the output will be valid JSON, it is not yet possible to enforce specific keys, value types, or structures within the JSON.
- As with regular completions, if the maximum token limit is too small, the output (in this case, the JSON) may be truncated and therefore incomplete. To avoid this, set the maximum_tokens parameter to a sufficiently large value. One way to guard against truncation programmatically is sketched below.
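Since a truncated completion will not parse as JSON, one way to guard against truncation is to attempt parsing and retry with a larger token budget on failure. A minimal sketch, where complete_json is a hypothetical helper wrapping the request shown in the Example section below:
import json

def complete_with_retry(prompt, max_tokens=200, retries=2):
    for _ in range(retries + 1):
        # complete_json is a hypothetical wrapper around the /complete/json
        # request shown below; it returns the completion string.
        completion = complete_json(prompt, maximum_tokens=max_tokens)
        try:
            return json.loads(completion)  # complete output parses cleanly
        except json.JSONDecodeError:
            max_tokens *= 2  # likely truncated; retry with a larger budget
    raise ValueError("Completion still truncated after retries.")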
Usage
A separate endpoint has been created for this purpose:
<api-url>/complete/json
Requests follow the same format as for the /complete endpoint. However, extended sampling parameters (e.g., repetition or frequency penalties, temperature) are currently disabled.
It is nonetheless recommended to prompt the model for JSON output. Without such an instruction, the output will still be forced into valid JSON, but the result may not look as intended (e.g., "\n\n [1] \n\n"). Therefore, always append something like "Answer only in valid json format." to the end of the prompt.
Example
The following Python script assumes that the API token has been set as an environment variable AA_API_TOKEN (or set in the script directly) and that the required packages are installed (pip install requests) in the respective Python environment.
import requests
import os

# Dedicated endpoint for JSON-constrained completions
url = "https://api.aleph-alpha.com/complete/json"
token = os.getenv("AA_API_TOKEN")
model = "llama-3.1-8b-instruct"
prompt = "Describe your favorite Harry Potter character. Answer only in valid json format."

headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Authorization": f"Bearer {token}",
}

data = {
    "model": model,
    # Llama 3.1 chat template, with the JSON instruction included in the user turn
    "prompt": f"<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful AI assistant<|eot_id|><|start_header_id|>user<|end_header_id|>{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>",
    # Set large enough that the JSON is not truncated (see Limitations)
    "maximum_tokens": 200,
    "minimum_tokens": 20,
}

response = requests.post(url, headers=headers, json=data)
print(response.json()["completions"][0]["completion"])
When using an on-premise stack instead of the public API, replace https://api.aleph-alpha.com with your local API URL.
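For instance, the base URL can be read from the environment so the same script runs against both stacks. A small sketch, where the variable name AA_API_URL is a hypothetical choice, not part of the API:
import os

# AA_API_URL is a hypothetical variable name; falls back to the public API when unset
base_url = os.getenv("AA_API_URL", "https://api.aleph-alpha.com")
url = f"{base_url}/complete/json"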
Running the script above might, for instance, yield:
{
"name": "Luna Lovegood",
"description": "Quirky and dreamy Ravenclaw student with a unique perspective on the wizarding world",
"traits": [
"Optimistic",
"Loyal",
"Unconventional",
"Brave"
],
"abilities": [
"Magical prowess",
"Divination skills"
],
"relationships": [
"Close friendship with Ginny Weasley",
"Respectful admiration for Professor Dumbledore"
]
}
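Since the endpoint guarantees syntactically valid JSON, the completion can be parsed directly into a Python object for downstream use. A small sketch building on the response object from the script above; note that, because schemas are not yet supported, the specific keys (name, traits, etc.) are illustrative and not guaranteed:
import json

completion = response.json()["completions"][0]["completion"]
character = json.loads(completion)  # guaranteed to parse unless truncated

# Keys are not enforced by the API, so access them defensively
print(character.get("name"))
print(", ".join(character.get("traits", [])))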