Announcing constrained decoding to ensure JSON format
Overview
JSON is a widely used format for machine-to-machine communication. While prompting a model to respond in JSON often yields the correct format, there is no guarantee that the generated output will always be valid. Invalid JSON can cause applications to fail if they rely on strict adherence to the format. To address this, constrained decoding has been implemented at the inference level, ensuring that the model's output strictly complies with the JSON format.
Limitations
The current implementation comes with the following limitations:
- Slower compared to standard, unconstrained completions (a few tokens per second)
- Does not currently support schemas. While the output will be valid JSON, it is not yet possible to enforce specific keys, value types, or structures within the JSON.
- As with regular completions, if the maximum token limit is too small, the output (in this case, the JSON) may be truncated and therefore incomplete. To avoid this, set the maximum_tokens parameter to a sufficiently large value. One way to guard against truncation programmatically is sketched below.
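Since a truncated completion will not parse as JSON, one way to guard against truncation is to attempt parsing and retry with a larger token budget on failure. A minimal sketch, where complete_json is a hypothetical helper wrapping the request shown in the Example section below:
import json

def complete_with_retry(prompt, max_tokens=200, retries=2):
    for _ in range(retries + 1):
        # complete_json is a hypothetical wrapper around the /complete/json
        # request shown below; it returns the completion string.
        completion = complete_json(prompt, maximum_tokens=max_tokens)
        try:
            return json.loads(completion)  # complete output parses cleanly
        except json.JSONDecodeError:
            max_tokens *= 2  # likely truncated; retry with a larger budget
    raise ValueError("Completion still truncated after retries.")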
Usage
A separate endpoint has been created for this purpose:
<api-url>/complete/json
Requests follow the same format as for the /complete endpoint. However, extended sampling parameters (e.g., repetition or frequency penalties, temperature) are currently disabled.
It is nonetheless recommended to prompt the model for JSON output. Without such an instruction, the output will still be forced into valid JSON, but the result may not look as intended (e.g., "\n\n [1] \n\n"). Therefore, always append something like "Answer only in valid json format." to the end of the prompt.
Example
The following Python script assumes that the API token has been set as an environment variable AA_API_TOKEN (or set in the script directly) and that the required packages are installed (pip install requests) in the respective Python environment.
import requests
import os

# Dedicated endpoint for JSON-constrained completions
url = "https://api.aleph-alpha.com/complete/json"
token = os.getenv("AA_API_TOKEN")
model = "llama-3.1-8b-instruct"
prompt = "Describe your favorite Harry Potter character. Answer only in valid json format."

headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Authorization": f"Bearer {token}",
}

data = {
    "model": model,
    # Llama 3.1 chat template, with the JSON instruction included in the user turn
    "prompt": f"<|begin_of_text|><|start_header_id|>system<|end_header_id|> You are a helpful AI assistant<|eot_id|><|start_header_id|>user<|end_header_id|>{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>",
    # Set large enough that the JSON is not truncated (see Limitations)
    "maximum_tokens": 200,
    "minimum_tokens": 20,
}

response = requests.post(url, headers=headers, json=data)
print(response.json()["completions"][0]["completion"])
When using an on-premise stack instead of the public API, replace https://api.aleph-alpha.com with your local API URL.
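For instance, the base URL can be read from the environment so the same script runs against both stacks. A small sketch, where the variable name AA_API_URL is a hypothetical choice, not part of the API:
import os

# AA_API_URL is a hypothetical variable name; falls back to the public API when unset
base_url = os.getenv("AA_API_URL", "https://api.aleph-alpha.com")
url = f"{base_url}/complete/json"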
Running the script above might, for instance, yield:
{
"name": "Luna Lovegood",
"description": "Quirky and dreamy Ravenclaw student with a unique perspective on the wizarding world",
"traits": [
"Optimistic",
"Loyal",
"Unconventional",
"Brave"
],
"abilities": [
"Magical prowess",
"Divination skills"
],
"relationships": [
"Close friendship with Ginny Weasley",
"Respectful admiration for Professor Dumbledore"
]
}
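Since the endpoint guarantees syntactically valid JSON, the completion can be parsed directly into a Python object for downstream use. A small sketch building on the response object from the script above; note that, because schemas are not yet supported, the specific keys (name, traits, etc.) are illustrative and not guaranteed:
import json

completion = response.json()["completions"][0]["completion"]
character = json.loads(completion)  # guaranteed to parse unless truncated

# Keys are not enforced by the API, so access them defensively
print(character.get("name"))
print(", ".join(character.get("traits", [])))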