Multimodality

What is multimodality?

Multimodality is the concept of allowing models to process and understand multiple types of input data simultaneously. For example, an input prompt can contain both text and images.

The PhariaInference API supports multimodal inputs. In this article, we describe how to send both text and images to models using the OpenAI-compatible chat endpoint. This enables models to analyse visual content, understand context from images, and provide comprehensive responses that consider both textual and visual information.

Technical details

  • Input images must be base64-encoded and have a resolution of 4000x4000 pixels or smaller (see the validation sketch after this list).

  • Any number of input images is supported as long as the combined request size does not exceed 128 MiB.

  • External links to images are not supported.

  • Generating images as output is not supported.

  • Generating audio as output is not supported.

  • Transcription of input audio is supported with the /transcribe endpoint.
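
The image constraints above can be checked client-side before sending a request. The following is a minimal sketch, assuming Pillow is installed; the file path, helper name, and pre-checks are illustrative and only approximate the server-side validation:

import base64
from pathlib import Path

from PIL import Image

MAX_DIMENSION = 4000                    # pixels per side
MAX_REQUEST_BYTES = 128 * 1024 * 1024   # 128 MiB combined request size

def encode_image(path: str) -> str:
    """Validate an image against the limits above and return it base64-encoded."""
    with Image.open(path) as image:
        width, height = image.size
        if width > MAX_DIMENSION or height > MAX_DIMENSION:
            raise ValueError(f"{path} is {width}x{height}; the maximum is 4000x4000 pixels")

    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    # Rough pre-check only: the 128 MiB limit applies to the combined request body,
    # not to a single image.
    if len(encoded) > MAX_REQUEST_BYTES:
        raise ValueError(f"{path} alone exceeds the 128 MiB request limit")
    return encoded

image_b64 = encode_image("files/cat.jpg")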

How it works

To use multimodality, we need to deploy a vision language model that supports images as input. We also need to set the multimodal_enabled flag to true in the model package configuration for the PhariaInference API to enable requests containing images (see Deploying workers).

Once a vision model is deployed, we send a chat request containing an image to the PhariaInference API and ask the model to describe it. To do so, we use both the Aleph-Alpha client and plain curl.

Aleph-Alpha client

To send an image and have the model describe it, consider the following Python code:

import os

from aleph_alpha_client.aleph_alpha_client import Client
from aleph_alpha_client.chat import ChatRequest, Message, Role
from PIL import Image

# Authenticate against the PhariaInference API
client = Client(
    host="https://inference-api.pharia.example.com",
    token=os.environ["PHARIA_TOKEN"],
)
# example vision model
model = "qwen2.5-vl-32b-instruct"

# A user message can mix text and images in a single content list
messages = [
    Message(
        role=Role.System,
        content="You are a helpful assistant.",
    ),
    Message(
        role=Role.User,
        content=[
            "Describe the following image. Answer in 20 words or less.",
            Image.open("files/cat.jpg"),
        ],
    ),
]
chat_request = ChatRequest(model=model, messages=messages)

response = client.chat(request=chat_request, model=model)
content = response.message.content
assert " cat " in content
print(content)

The response varies depending on the model and input image, but a typical response might be the following:

The image shows a black-and-white cat lounging on a couch near a window.

curl

The same can be achieved with curl by sending a base64-encoded image in the request body:

curl -L 'https://inference-api.pharia.example.com/v1/chat/completions' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H "Authorization: Bearer $PHARIA_TOKEN" \
-d '{
    "model": "qwen2.5-vl-32b-instruct",
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this image? Answer in 20 words or less."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg=="
                    }
                }
            ]
        }
    ]
}' | jq
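
The base64 string in this example encodes a tiny placeholder image. To build the data URL for your own file, a small helper such as the following can be used; this is a sketch, and the file path and the image/png fallback are illustrative:

import base64
import mimetypes
from pathlib import Path

def to_data_url(path: str) -> str:
    """Return a data URL suitable for the image_url field of a chat message."""
    mime_type = mimetypes.guess_type(path)[0] or "image/png"  # fallback is an assumption
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"

print(to_data_url("files/cat.jpg"))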

The response to the curl request might look like the following:

{
  "id": "ccc7e586-03bc-4bc5-b9f5-bb27fe76b1d9",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The image shows a red cross symbol, commonly associated with medical aid, emergency services, or the International Red Cross."
      },
      "logprobs": null
    }
  ],
  "created": 1755068442,
  "model": "qwen2.5-vl-32b-instruct",
  "system_fingerprint": "empty",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 16421,
    "completion_tokens": 24,
    "total_tokens": 16445
  }
}

The examples above work for any vision model served by the vLLM worker. If you want to use the luminous worker type, such as luminous-base, you need to use the /v1/complete endpoint instead.
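
For such models, a completion request with an image in the prompt might look like the following with the Python client. This is a sketch only, assuming a deployed luminous-base model; the prompt text and maximum_tokens value are illustrative:

import os

from aleph_alpha_client import Client, CompletionRequest, Image, Prompt, Text

client = Client(
    host="https://inference-api.pharia.example.com",
    token=os.environ["PHARIA_TOKEN"],
)

# Multimodal prompts for the completion endpoint are built from typed items
prompt = Prompt([
    Image.from_file("files/cat.jpg"),
    Text.from_text("Describe the image. Answer in 20 words or less."),
])
request = CompletionRequest(prompt=prompt, maximum_tokens=64)

response = client.complete(request, model="luminous-base")
print(response.completions[0].completion)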