Multimodal models
DeepInfra hosts multimodal models that combine vision and language capabilities. These models take both images and text as input and produce text as output.
Currently, we host several such models; the examples below use meta-llama/Llama-3.2-90B-Vision-Instruct.
Let's consider an example image, the one referenced by the image URL in the requests below. If you ask "What's in this image?", the model will answer something like this:
In this image, a large, colorful animal, possibly a llama, is standing alone in a barren, red and orange landscape, close to a large volcano. The setting appears to be an artistic painting, possibly inspired by South American culture or a fantasy world with volcanoes. The llama is situated at the center of the scene, drawing attention to the contrasting colors and the fiery backdrop of the volcano. The overall atmosphere of the image suggests a sense of danger and mystery amidst the volcanic landscape.
Images can be passed to the model in two ways: as a publicly accessible URL, or as base64-encoded data embedded in the request. Here is an example of a request that passes the image by URL:
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
    "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"
            }
          },
          {
            "type": "text",
            "text": "What’s in this image?"
          }
        ]
      }
    ]
  }'
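The same request can be sent through the OpenAI Python SDK by passing the image URL directly. This is a minimal sketch using the same model and image as the curl example above:

from openai import OpenAI

# Client pointed at the DeepInfra OpenAI-compatible endpoint
openai = OpenAI(
    api_key="<your-DeepInfra-API-token>",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"
                    }
                },
                {"type": "text", "text": "What’s in this image?"}
            ]
        }
    ]
)

print(chat_completion.choices[0].message.content)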
Passing images as base64-encoded data is convenient when you have them available locally. Here is an example:
from openai import OpenAI
import base64
import requests

# Create an OpenAI client with your DeepInfra token and endpoint
openai = OpenAI(
    api_key="<your-DeepInfra-API-token>",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Download the image and encode it as base64
image_url = "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"
base64_image = base64.b64encode(requests.get(image_url).content).decode("utf-8")

chat_completion = openai.chat.completions.create(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        # The source image is a .webp file, so use the matching MIME type
                        "url": f"data:image/webp;base64,{base64_image}"
                    }
                },
                {
                    "type": "text",
                    "text": "What’s in this image?"
                }
            ]
        }
    ]
)

print(chat_completion.choices[0].message.content)
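Since base64 encoding is typically used for images already on disk, here is a minimal sketch that reads a local file instead of downloading one; the path local_image.png is a placeholder:

import base64
from openai import OpenAI

openai = OpenAI(
    api_key="<your-DeepInfra-API-token>",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Read a local image and base64-encode it (the file name is a placeholder)
with open("local_image.png", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")

chat_completion = openai.chat.completions.create(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}},
                {"type": "text", "text": "What’s in this image?"},
            ],
        }
    ],
)

print(chat_completion.choices[0].message.content)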
The API also allows passing multiple images in a single request.
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -d '{
    "model": "meta-llama/Llama-3.2-90B-Vision-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"
            }
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://shared.deepinfra.com/models/meta-llama/Llama-2-7b-chat-hf/cover_image.10373e7a429dd725e0eb9e57cd20aeb815426c077217b27d9aedce37bd5c2173-s600.webp"
            }
          },
          {
            "type": "text",
            "text": "What’s in this image?"
          }
        ]
      }
    ]
  }'
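The equivalent multi-image request with the Python SDK just adds more image_url entries to the content list. A sketch, reusing the two cover images from the curl example above:

from openai import OpenAI

openai = OpenAI(
    api_key="<your-DeepInfra-API-token>",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="meta-llama/Llama-3.2-90B-Vision-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                # Each image is its own image_url entry in the content list
                {"type": "image_url", "image_url": {"url": "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"}},
                {"type": "image_url", "image_url": {"url": "https://shared.deepinfra.com/models/meta-llama/Llama-2-7b-chat-hf/cover_image.10373e7a429dd725e0eb9e57cd20aeb815426c077217b27d9aedce37bd5c2173-s600.webp"}},
                {"type": "text", "text": "What’s in these images?"},
            ],
        }
    ],
)

print(chat_completion.choices[0].message.content)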
Images are tokenized and passed to the model as input. The number of tokens consumed by the images and text is reported in the response under "usage": {"prompt_tokens": <tokens-for-images-and-text>, ...}.
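Continuing from the Python examples above, where chat_completion holds the response, the count can be read from the usage field:

# prompt_tokens covers both the image(s) and the text in the request
print(chat_completion.usage.prompt_tokens)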
Different models work with different image resolutions. You can still pass images of other resolutions; the model will rescale them automatically. Read the documentation of each model to learn its supported image resolutions.
The detail argument may also affect how images are processed; check the model's documentation for whether it is supported.
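For reference, in the OpenAI-style request format the detail field sits alongside the image URL. This is a sketch only; whether a given DeepInfra model honors the field should be checked in its documentation:

# Hypothetical content entry using the OpenAI-style "detail" field ("low", "high" or "auto");
# support is model-dependent on DeepInfra
image_content = {
    "type": "image_url",
    "image_url": {
        "url": "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp",
        "detail": "low",
    },
}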