Basic Principles

Some of our models have multimodal capabilities. This means that they understand not only text, but also images. This capability opens up a host of possibilities that conventional machine learning methods can only achieve with costly data labeling and training. There are many applications for this technology, from object detection to image classification. Just click on the "Multimodal" tab in the Playground to get started!


Similar to text-only usage, our multimodal models try to find the best continuation for a given input. If prompted, the model will return a text completion. Currently, we do not offer the possibility to generate images using our multimodal functionality.

Model generation is affected by a variety of factors, such as:

  • Nature and content of your prompt items
  • Order of the prompt items
  • Model parameters (most notably temperature)

Try playing around with these factors to get your desired completion. In this section of the documentation, we will lay out some tips, tricks & examples on how to master multimodal model usage.


You can insert images either by uploading them from your file system (via the "Image" button) or by dragging and dropping. All images fed into the model must be square; the "Model Visible Area" visualizes the part of the image that the model sees. Feel free to move or resize the visible area to focus on different parts of your image. Before generation, images are automatically resized to 384x384 pixels, and each image consumes 144 tokens.
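The sizing and token rules above can be sketched in plain Python. Note that the centered default crop below is an assumption for illustration; in the Playground you can drag the Model Visible Area wherever you like:

```python
MODEL_INPUT_SIZE = 384   # images are resized to 384x384 pixels before generation
TOKENS_PER_IMAGE = 144   # fixed token cost per image

def visible_area(width: int, height: int) -> tuple[int, int, int, int]:
    """Return a centered square crop box (left, top, right, bottom).

    This mimics a default placement of the "Model Visible Area";
    in the Playground you can move and resize it freely.
    """
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)

def prompt_image_tokens(num_images: int) -> int:
    """Token budget consumed by the images in a prompt."""
    return num_images * TOKENS_PER_IMAGE

# A 640x480 photo gets a centered 480x480 crop, later resized to 384x384.
print(visible_area(640, 480))   # (80, 0, 560, 480)
print(prompt_image_tokens(2))   # 288
```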


In contrast to language-only models, multimodal models can be prompted with both text and images in any order. Prompt items will always be considered in the order in which you insert them: from top (first) to bottom (last). For example, you may start with a snippet of text, insert an image below, and then finish up with some more text. Simply add new prompt items and drag them around until you're satisfied.

Luckily, you can even apply many of the tips & tricks you learned for "conventional" prompting, such as...

  • Prompt structure. Try to structure your prompt in a clear and unambiguous manner. For example, use the "Q: [...], A: [...]"-scheme or separators, such as line breaks or "###".
  • Model Parameters. The model parameters bar on the left side works the same way as for text-only tasks! See the section on model parameters for more information.
  • Few-shot learning. By providing a handful of examples of how to accomplish a task, you can improve the performance of the model. You can provide either multiple few-shot texts for one image or multiple image-text pairs. Note, however, that few-shot prompting works more robustly with text-only prompts.
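Putting these tips together, a few-shot multimodal prompt might be laid out as below. The snippet is purely illustrative: it uses plain tuples rather than the real client's prompt types, and the file names are made up.

```python
# Illustrative only: prompt items as (kind, value) tuples, to show ordering
# and a "Q: [...] A: [...]" few-shot scheme with "###" separators.
def text(s: str) -> tuple:
    return ("text", s)

def image(path: str) -> tuple:
    return ("image", path)

# Items are consumed top-to-bottom, exactly as inserted in the Playground.
prompt = [
    # Few-shot example 1: an image followed by a labeled Q/A pair
    image("cat.jpg"),
    text("Q: What animal is shown? A: A cat.\n###\n"),
    # Few-shot example 2
    image("dog.jpg"),
    text("Q: What animal is shown? A: A dog.\n###\n"),
    # The actual query: the model continues the text after "A:"
    image("unknown.jpg"),
    text("Q: What animal is shown? A:"),
]

print([kind for kind, _ in prompt])
# ['image', 'text', 'image', 'text', 'image', 'text']
```

The pattern alternates image and text so each few-shot example reads as a self-contained unit, with the open-ended "A:" at the very end for the model to complete.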

Other Tasks

Multimodality also works for other non-completion tasks that are available via our Python client. Here is a list of all tasks that can be used with multimodality.

  • Complete. Provide any combination of texts and images as input to generate a text completion.
  • Evaluate. Provide any combination of texts and images as input to score an expected completion against.
  • Embed. Embed a prompt that may contain any combination of texts and images.
  • Semantic Embed. Semantically embed a prompt that may contain any combination of texts and images. This type of embedding is optimized for search and comparisons.
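As a side note on Semantic Embed: embeddings optimized for search are typically compared with cosine similarity. Below is a minimal, self-contained sketch with made-up vectors; real embeddings are returned by the client and are much longer.

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Compare two embedding vectors; higher means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up 4-dimensional embeddings for illustration only.
query_emb = [0.1, 0.3, -0.2, 0.4]
doc_emb = [0.2, 0.1, -0.1, 0.5]
print(round(cosine_similarity(query_emb, doc_emb), 3))  # 0.885
```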

For more information on these tasks, along with coding examples, see the Tasks section.