
Basic Principles

Some of our models have multimodal capabilities. This means that they understand not only text but also images. This capability opens up a host of possibilities that conventional machine learning methods can only achieve with costly data labeling and training. There are many applications for this technology, from object detection to image classification. Just click on the "Multimodal" tab of the complete endpoint in the Research Playground to get started!

Currently, only Luminous-extended and Luminous-base support multimodal inputs. You cannot use our control models multimodally.

Completion

Similar to text-only usage, multimodal models try to find the best continuation for a given input. If prompted, the model will return a text completion. Currently, we do not offer the possibility to generate multimodal output.

The generation is affected by a variety of factors, such as:

  • Nature and content of your prompt items
  • Order of the prompt items
  • Model parameters (most notably temperature)

Try playing around with these factors to improve the completion. In this section of the documentation, we lay out some tips, tricks & examples on how to master multimodal model usage.
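
For example, when using the Python client you can set these parameters explicitly on the completion request. The following is a minimal sketch, assuming the aleph-alpha-client Python package; the API token, file path, and parameter values are placeholders, and exact names may differ between client versions.

```python
# Minimal sketch: requesting a completion for a multimodal prompt and tuning
# the temperature. Token and file path are placeholders.
from aleph_alpha_client import Client, CompletionRequest, Prompt, Text, Image

client = Client(token="YOUR_API_TOKEN")

request = CompletionRequest(
    prompt=Prompt([
        Image.from_file("photo.jpg"),                          # image prompt item
        Text.from_text("Q: What is shown in this picture? A:"),
    ]),
    maximum_tokens=64,   # upper bound on the length of the completion
    temperature=0.3,     # lower values make the output more deterministic
)

response = client.complete(request, model="luminous-extended")
print(response.completions[0].completion)
```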

Images

All images fed into the LLM must be square. Accordingly, the "Model Visible Area" visualizes the part of the image that is visible to the LLM. Feel free to move or resize the visible area to focus on different parts of your image. Before generation, images are automatically resized to 384x384 pixels. Each image corresponds to 144 tokens.
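
The square crop and resizing happen automatically, but if you call the API programmatically and want to control which part of a non-square image ends up in the visible area, you can crop it yourself beforehand. The following is a minimal sketch using the Pillow library (an assumption on our part, not a requirement of the API); the file paths are placeholders.

```python
# Minimal sketch: center-cropping an image to a square and resizing it to
# 384x384 pixels before sending it to the model. Paths are placeholders.
from PIL import Image

img = Image.open("photo.jpg")
side = min(img.width, img.height)          # side length of the largest centered square
left = (img.width - side) // 2
top = (img.height - side) // 2
square = img.crop((left, top, left + side, top + side))
square = square.resize((384, 384))         # matches the size the model works with
square.save("photo_square.jpg")
```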

Prompting

In contrast to language-only models, multimodal models can be prompted with both text and images in any order. Prompt items will always be considered in the order in which you insert them: from top (first) to bottom (last). For example, you may start with a snippet of text, insert an image below, and then finish up with some more text. When using the Research Playground, simply add new prompt items and drag them around until you're satisfied. You can insert images either by uploading them from your file system (accessible via the "Image" button) or by dragging and dropping.
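
Outside the Playground, the same top-to-bottom ordering applies when you assemble a prompt programmatically: prompt items are passed as a list and considered in list order. The following is a minimal sketch, assuming the aleph-alpha-client Python package; class names may differ between versions and the file path is a placeholder.

```python
# Minimal sketch: a multimodal prompt with text, an image, and more text,
# considered from top (first) to bottom (last).
from aleph_alpha_client import Prompt, Text, Image

prompt = Prompt([
    Text.from_text("Here is a picture of my cat."),     # first: introductory text
    Image.from_file("cat.jpg"),                          # second: the image
    Text.from_text("Describe what the cat is doing:"),  # last: the instruction
])
```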

When it comes to prompting techniques, you can even apply many of the tips & tricks you learned for "conventional" prompting, such as...

  • Prompt structure. Try to structure your prompt in a clear and unambiguous manner. For example, use the "Q: [...], A: [...]" scheme or separators, such as line breaks or "###".
  • Few-shot learning. By providing a handful of examples of how to accomplish a task, you can improve the performance of the model. You can provide either multiple few-shot texts for one image or multiple image-text pairs (see the sketch after this list). Note, however, that few-shot prompting works more robustly with text-only prompts.
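
A few-shot multimodal prompt combining both techniques could look like the following sketch, again assuming the aleph-alpha-client Python package; the images, labels, and paths are placeholders.

```python
# Minimal sketch: few-shot prompting with multiple image-text pairs, using the
# "Q: [...], A: [...]" scheme and "###" as a separator. Paths and labels are placeholders.
from aleph_alpha_client import Prompt, Text, Image

prompt = Prompt([
    Image.from_file("dog.jpg"),
    Text.from_text("Q: Which animal is shown? A: A dog.\n###\n"),
    Image.from_file("horse.jpg"),
    Text.from_text("Q: Which animal is shown? A: A horse.\n###\n"),
    Image.from_file("unknown.jpg"),
    Text.from_text("Q: Which animal is shown? A:"),
])
```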

Other Tasks

Multimodality also works for other non-completion tasks that are available via our Python client. Here is a list of all tasks that can be used with multimodality; an example for Semantic Embed follows below.

  • Complete: Provide any combination of texts and images as input to generate a text completion.
  • Evaluate: Provide any combination of texts and images as input to score an expected completion against.
  • Embed: Embed a prompt that may contain any combination of texts and images.
  • Semantic Embed: Semantically embed a prompt that may contain any combination of texts and images. This type of embedding is optimized for search and comparisons.
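
As an example for the last task, a semantic embedding of a text-image prompt can be requested roughly as follows. This is a minimal sketch assuming the aleph-alpha-client Python package; the token, paths, and representation choice are placeholders, and exact names may differ between client versions.

```python
# Minimal sketch: semantically embedding a multimodal prompt for search.
# Token and file path are placeholders.
from aleph_alpha_client import (
    Client,
    Prompt,
    Text,
    Image,
    SemanticEmbeddingRequest,
    SemanticRepresentation,
)

client = Client(token="YOUR_API_TOKEN")

request = SemanticEmbeddingRequest(
    prompt=Prompt([
        Image.from_file("product.jpg"),
        Text.from_text("A red running shoe on a white background."),
    ]),
    representation=SemanticRepresentation.Document,  # embed as a searchable document
)

response = client.semantic_embed(request, model="luminous-base")
print(len(response.embedding))  # dimensionality of the embedding vector
```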