Multimodal models
DeepInfra hosts multimodal models that combine vision and language capabilities. These models take both images and text as input and produce text as output.
Currently, we host several such models; the examples below use meta-llama/Llama-3.2-90B-Vision-Instruct.
Let's consider an example image, the one referenced by the image URL in the requests below. If you ask "What's in this image?", the model will answer something like this:
In this image, a large, colorful animal, possibly a llama, is standing alone in a barren, red and orange landscape, close to a large volcano. The setting appears to be an artistic painting, possibly inspired by South American culture or a fantasy world with volcanoes. The llama is situated at the center of the scene, drawing attention to the contrasting colors and the fiery backdrop of the volcano. The overall atmosphere of the image suggests a sense of danger and mystery amidst the volcanic landscape.
Images can be passed to the model in two ways: as a publicly accessible URL, or as base64-encoded data embedded in the request. Here is an example of a request that passes the image by URL:
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
    "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"
            }
          },
          {
            "type": "text",
            "text": "What’s in this image?"
          }
        ]
      }
    ]
  }'
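The same request can be sent through the OpenAI Python SDK by passing the image URL directly. This is a minimal sketch using the same model and image as the curl example above:

from openai import OpenAI

# Client pointed at the DeepInfra OpenAI-compatible endpoint
openai = OpenAI(
    api_key="<your-DeepInfra-API-token>",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"
                    }
                },
                {"type": "text", "text": "What’s in this image?"}
            ]
        }
    ]
)

print(chat_completion.choices[0].message.content)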
Passing images as base64-encoded data is convenient when you have them available locally. Here is an example:
from openai import OpenAI
import base64
import requests

# Create an OpenAI client with your DeepInfra token and endpoint
openai = OpenAI(
    api_key="<your-DeepInfra-API-token>",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Download the image and encode it as base64
image_url = "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"
base64_image = base64.b64encode(requests.get(image_url).content).decode("utf-8")

chat_completion = openai.chat.completions.create(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        # The source image is a .webp file, so use the matching MIME type
                        "url": f"data:image/webp;base64,{base64_image}"
                    }
                },
                {
                    "type": "text",
                    "text": "What’s in this image?"
                }
            ]
        }
    ]
)

print(chat_completion.choices[0].message.content)
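Since base64 encoding is typically used for images already on disk, here is a minimal sketch that reads a local file instead of downloading one; the path local_image.png is a placeholder:

import base64
from openai import OpenAI

openai = OpenAI(
    api_key="<your-DeepInfra-API-token>",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Read a local image and base64-encode it (the file name is a placeholder)
with open("local_image.png", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")

chat_completion = openai.chat.completions.create(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}},
                {"type": "text", "text": "What’s in this image?"},
            ],
        }
    ],
)

print(chat_completion.choices[0].message.content)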
The API also allows passing multiple images in a single request.
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
    "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"
            }
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://shared.deepinfra.com/models/meta-llama/Llama-2-7b-chat-hf/cover_image.10373e7a429dd725e0eb9e57cd20aeb815426c077217b27d9aedce37bd5c2173-s600.webp"
            }
          },
          {
            "type": "text",
            "text": "What’s in this image?"
          }
        ]
      }
    ]
  }'
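The equivalent multi-image request with the Python SDK just adds more image_url entries to the content list. A sketch, reusing the two cover images from the curl example above:

from openai import OpenAI

openai = OpenAI(
    api_key="<your-DeepInfra-API-token>",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                # Each image is its own image_url entry in the content list
                {"type": "image_url", "image_url": {"url": "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"}},
                {"type": "image_url", "image_url": {"url": "https://shared.deepinfra.com/models/meta-llama/Llama-2-7b-chat-hf/cover_image.10373e7a429dd725e0eb9e57cd20aeb815426c077217b27d9aedce37bd5c2173-s600.webp"}},
                {"type": "text", "text": "What’s in these images?"},
            ],
        }
    ],
)

print(chat_completion.choices[0].message.content)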
Images are tokenized and passed to the model as input. The number of tokens consumed by the images and text is reported in the response under "usage": {"prompt_tokens": <tokens-for-images-and-text>, ...}.
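Continuing from the Python examples above, where chat_completion holds the response, the count can be read from the usage field:

# prompt_tokens covers both the image(s) and the text in the request
print(chat_completion.usage.prompt_tokens)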
Different models work with different image resolutions. You can still pass images of other resolutions; the model will rescale them automatically. Read the documentation of each model to learn its supported image resolutions.
The detail argument may also affect how images are processed; check the model's documentation for whether it is supported.
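For reference, in the OpenAI-style request format the detail field sits alongside the image URL. This is a sketch only; whether a given DeepInfra model honors the field should be checked in its documentation:

# Hypothetical content entry using the OpenAI-style "detail" field ("low", "high" or "auto");
# support is model-dependent on DeepInfra
image_content = {
    "type": "image_url",
    "image_url": {
        "url": "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp",
        "detail": "low",
    },
}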