Multimodality
Multimodal capabilities allow large language models to process and understand multiple types of input data simultaneously. The Pharia Inference API now supports combining text and images in the input prompt. In this article, we will see how to send both text and images to language models via the OpenAI-compatible chat endpoint. This enables LLMs to analyze visual content, understand context from images, and provide comprehensive responses that consider both textual and visual information.
- Input images need to be base64 encoded and have a resolution of 4000x4000 pixels or smaller (see the encoding sketch after this list).
- Any number of input images is supported as long as the combined request size does not exceed 128MiB.
- External links to images are not supported.
- Generating images as output is not supported.
- Generating audio as output is not supported.
- Transcription of input audio is supported via the /transcribe endpoint; see Transcribe audio.
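Before sending a raw HTTP request, an image can be downscaled to stay within the resolution limit and base64 encoded into a data URL. The following Python sketch uses Pillow; the helper name and file path are illustrative:
import base64
import io

from PIL import Image

MAX_SIDE = 4000  # input images must be 4000x4000 pixels or smaller


def to_data_url(path: str) -> str:
    """Downscale an image if needed and return it as a base64 PNG data URL."""
    image = Image.open(path)
    if max(image.size) > MAX_SIDE:
        image.thumbnail((MAX_SIDE, MAX_SIDE))  # preserves the aspect ratio
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")  # re-encode as PNG for a uniform MIME type
    encoded = base64.b64encode(buffer.getvalue()).decode("ascii")
    return f"data:image/png;base64,{encoded}"


# The resulting string can be used as the image_url in the curl example below.
print(to_data_url("files/cat.jpg")[:80])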
Usage
In order to leverage the multimodal capabilities, we need to deploy a vision language model that
supports images as input. Note that the multimodal_enabled flag must be set to true in
the model package configuration for the Inference API to accept requests containing images
(see Worker Deployment for more details).
Once a vision model is deployed, we will send a chat request containing an image to the
Inference API and ask the model to describe it. To do so, we will use both the
Aleph-Alpha Client and plain curl.
Aleph-Alpha Client
To send an image and have the model describe it, consider the following Python code:
import os

from aleph_alpha_client.aleph_alpha_client import Client
from aleph_alpha_client.chat import ChatRequest, Message, Role
from PIL import Image

client = Client(
    host="https://inference-api.pharia.example.com",
    token=os.environ["PHARIA_TOKEN"],
)

# example vision model
model = "qwen2.5-vl-32b-instruct"

messages = [
    Message(
        role=Role.System,
        content="You are a helpful assistant.",
    ),
    Message(
        role=Role.User,
        # user content can mix plain text and PIL images
        content=[
            "Describe the following image. Answer in 20 words or less.",
            Image.open("files/cat.jpg"),
        ],
    ),
]

chat_request = ChatRequest(model=model, messages=messages)
response = client.chat(request=chat_request, model=model)

content = response.message.content
assert " cat " in content  # naive sanity check that the description mentions the cat
print(content)
An example response (which will vary depending on the model and input image) might be:
The image shows a black-and-white cat lounging on a couch near a window.
curl
The same can be achieved with curl by sending a base64-encoded image in the request body:
curl -L 'https://inference-api.pharia.example.com/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H "Authorization: Bearer $PHARIA_TOKEN" \
-d '{
"model": "qwen2.5-vl-32b-instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": [
{
"type": "text",
"text": "What is in this image? Answer in 20 words or less."
},
{
"type": "image_url",
"image_url": {
"url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="
}
}
]
}
]
}' | jq
This will yield a response like:
{
"id": "ccc7e586-03bc-4bc5-b9f5-bb27fe76b1d9",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"content": "The image shows a red cross symbol, commonly associated with medical aid, emergency services, or the International Red Cross."
},
"logprobs": null
}
],
"created": 1755068442,
"model": "qwen2.5-vl-32b-instruct",
"system_fingerprint": "empty",
"object": "chat.completion",
"usage": {
"prompt_tokens": 16421,
"completion_tokens": 24,
"total_tokens": 16445
}
}
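Since the endpoint is OpenAI-compatible, the same request can also be sent with the official openai Python package. The following is a minimal sketch; the base URL, token variable, and file path mirror the examples above and may differ in your deployment:
import base64
import os

from openai import OpenAI

# Point the OpenAI client at the Pharia Inference API.
client = OpenAI(
    base_url="https://inference-api.pharia.example.com/v1",
    api_key=os.environ["PHARIA_TOKEN"],
)

# Base64 encode a local image into a data URL.
with open("files/cat.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

response = client.chat.completions.create(
    model="qwen2.5-vl-32b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image? Answer in 20 words or less."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        },
    ],
)
print(response.choices[0].message.content)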
The above examples work for any vision model served by the vLLM worker. If you want to use the
luminous worker type with, e.g., luminous-base, you have to use the /v1/complete endpoint instead, as sketched below.
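A minimal sketch of such a completion request with the Aleph-Alpha Client follows; the model name and file path are placeholders, and image support depends on the model actually deployed:
import os

from aleph_alpha_client import Client, CompletionRequest, Image, Prompt, Text

client = Client(
    host="https://inference-api.pharia.example.com",
    token=os.environ["PHARIA_TOKEN"],
)

# A multimodal prompt mixes text and image items.
prompt = Prompt(
    [
        Text.from_text("Describe the following image. Answer in 20 words or less."),
        Image.from_file("files/cat.jpg"),
    ]
)

request = CompletionRequest(prompt=prompt, maximum_tokens=64)
response = client.complete(request=request, model="luminous-base")
print(response.completions[0].completion)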