Documentation

Using LLaVa model on DeepInfra

LLaVa is a multimodal model that combines a vision encoder with a large language model. It is an image-text-to-text model. You can read more about it here: Large Language and Vision Assistant (LLaVA)

Currently, we host:

  • llava-hf/llava-1.5-7b-hf

Quick start

Let's consider this image:

Example image

If you ask "What’s in this image?"

The model will answer something like this:

In this image, a large, colorful animal, possibly a llama, is standing alone in a barren, red and orange landscape, close to a large volcano. The setting appears to be an artistic painting, possibly inspired by South American culture or a fantasy world with volcanoes. The llama is situated at the center of the scene, drawing attention to the contrasting colors and the fiery backdrop of the volcano. The overall atmosphere of the image suggests a sense of danger and mystery amidst the volcanic landscape.

Images can be passed to the model in two ways:

  1. by passing a link to the image (e.g. https://example.com/image1.jpg)
  2. by passing a base64-encoded image directly in the request

Here is an example request that passes the image by URL:

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(deepctl auth token)" \
  -d '{
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"
            }
          },
          {
            "type": "text",
            "text": "What’s in this image?"
          }
        ]
      }
    ]
  }'

Example of uploading a base64-encoded image

Uploading images as base64 is convenient when you have them available locally. The example below downloads an image and encodes it; a sketch for a file already on disk follows after it:

from openai import OpenAI
import base64
import requests

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="<your-DeepInfra-API-token>",
    base_url="https://api.deepinfra.com/v1/openai",
)

image_url = "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"
base64_image = base64.b64encode(requests.get(image_url).content).decode("utf-8")

chat_completion = openai.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                },
                {
                    "type": "text",
                    "text": "What’s in this image?"
                }
            ]
        }
    ]
)

print(chat_completion.choices[0].message.content)
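
If the image is already on disk, you can skip the download and encode the file directly. Below is a minimal sketch assuming a local file named image.jpg (a placeholder path); adjust the MIME type in the data URI to match your file format:

from openai import OpenAI
import base64

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="<your-DeepInfra-API-token>",
    base_url="https://api.deepinfra.com/v1/openai",
)

# image.jpg is a placeholder path; point this at your own file
with open("image.jpg", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")

chat_completion = openai.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        # the MIME type should match your file (image/png, image/webp, ...)
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                },
                {
                    "type": "text",
                    "text": "What's in this image?"
                }
            ]
        }
    ]
)

print(chat_completion.choices[0].message.content)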

Passing multiple images

The API allows passing multiple images, too. LLaVa can still perform reasonably well with several images, even though it wasn't trained on multi-image datasets.

curl "https://api.deepinfra.com/v1/openai/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(deepctl auth token)" \
  -d '{
    "model": "llava-hf/llava-1.5-7b-hf",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"
            }
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://shared.deepinfra.com/models/meta-llama/Llama-2-7b-chat-hf/cover_image.10373e7a429dd725e0eb9e57cd20aeb815426c077217b27d9aedce37bd5c2173-s600.webp"
            }
          },
          {
            "type": "text",
            "text": "What’s in this image?"
          }
        ]
      }
    ]
  }'
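
The same request can also be sent through the OpenAI Python client by adding several image_url parts to the content list. Here is a sketch, reusing the client setup shown earlier:

from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="<your-DeepInfra-API-token>",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Two image parts followed by the text instruction (see the ordering caveat below)
chat_completion = openai.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"
                    }
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://shared.deepinfra.com/models/meta-llama/Llama-2-7b-chat-hf/cover_image.10373e7a429dd725e0eb9e57cd20aeb815426c077217b27d9aedce37bd5c2173-s600.webp"
                    }
                },
                {
                    "type": "text",
                    "text": "What's in these images?"
                }
            ]
        }
    ]
)

print(chat_completion.choices[0].message.content)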

Calculating costs

For LLaVa 1.5, each image adds 576 input tokens. This is also reported in the response under "usage": {"prompt_tokens": <tokens-for-images-and-text>, ...}.

The LLaVa 1.5 model is designed to work only with a 336 x 336 image resolution, which is why each image always takes the same number of tokens. You can still pass larger or smaller images; the model will rescale them automatically.
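
To estimate the prompt size up front, you can do the arithmetic yourself and compare it with the prompt_tokens value reported by the API. A rough sketch (the text token count below is just an illustrative number):

# Each LLaVa 1.5 image costs a fixed 576 prompt tokens, regardless of its original size
IMAGE_TOKENS = 576
num_images = 2
text_tokens = 20  # illustrative; the real count depends on your prompt

estimated_prompt_tokens = num_images * IMAGE_TOKENS + text_tokens
print(estimated_prompt_tokens)  # 1172

# The exact value comes back from the API, e.g. with the OpenAI client:
# chat_completion.usage.prompt_tokens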

Limitations and Caveats

  • Supported image types are jpg, png, and webp.
  • Images must be smaller than 20 MB.
  • LLaVa 1.5 only works at a 336 x 336 resolution (it will rescale every image to this size).
  • Since LLaVa 1.5 has only one resolution, passing image fidelity via the detail argument has no effect. This might change in the future with newer versions of the model.
  • Put your text instructions after the image content (i.e. follow this order inside content: "content": [{"type": "image_url"...}, ..., {"type": "text"...}]). The model doesn't perform as well when the text comes before the images in the message content.