LLaVa model
LLaVa is a multimodal model that combines vision and language capabilities: it takes images and text as input and produces text as output. You can read more about it here: Large Language and Vision Assistant (LLaVA)
Currently, we host:
Let's consider this image:
If you ask "What’s in this image?", the model will answer something like this:
In this image, a large, colorful animal, possibly a llama, is standing alone in a barren, red and orange landscape, close to a large volcano. The setting appears to be an artistic painting, possibly inspired by South American culture or a fantasy world with volcanoes. The llama is situated at the center of the scene, drawing attention to the contrasting colors and the fiery backdrop of the volcano. The overall atmosphere of the image suggests a sense of danger and mystery amidst the volcanic landscape.
Images can be passed to the model in two ways: as an image URL, or as a base64-encoded data URL. Here is an example of a request that passes the image by URL:
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"
}
},
{
"type": "text",
"text": "What’s in this image?"
}
]
}
]
}'
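The same request can be made with the OpenAI Python client pointed at the DeepInfra endpoint. A minimal sketch equivalent to the curl call above (substitute your own API token):

from openai import OpenAI

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="<your-DeepInfra-API-token>",
    base_url="https://api.deepinfra.com/v1/openai",
)

# The image is passed as a plain URL; the text prompt follows the image
chat_completion = openai.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"
                    },
                },
                {"type": "text", "text": "What's in this image?"},
            ],
        }
    ],
)

print(chat_completion.choices[0].message.content)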
Passing images as base64 data URLs is convenient when you have images available locally. Here is an example:
from openai import OpenAI
import base64
import requests

# Create an OpenAI client with your deepinfra token and endpoint
openai = OpenAI(
    api_key="<your-DeepInfra-API-token>",
    base_url="https://api.deepinfra.com/v1/openai",
)

# Download the image and encode it as base64
image_url = "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"
base64_image = base64.b64encode(requests.get(image_url).content).decode("utf-8")

chat_completion = openai.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    },
                },
                {
                    "type": "text",
                    "text": "What’s in this image?"
                },
            ],
        }
    ],
)

print(chat_completion.choices[0].message.content)
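If the image is already on disk, you can read and encode it directly instead of downloading it first. A minimal sketch, assuming a local file named image.jpg (the filename is just an illustration):

import base64

# Read a local image and encode it as base64 (image.jpg is a hypothetical path)
with open("image.jpg", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")

# Use it in the same message content format as above
image_part = {
    "type": "image_url",
    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
}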
The API also allows passing multiple images. LLaVa can still perform reasonably well with several images in one request, even though it wasn't trained on multi-image datasets.
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPINFRA_TOKEN" \
-d '{
"model": "llava-hf/llava-1.5-7b-hf",
"messages": [
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {
"url": "https://shared.deepinfra.com/models/llava-hf/llava-1.5-7b-hf/cover_image.ed4fba7a25b147e7fe6675e9f760585e11274e8ee72596e6412447260493cd4f-s600.webp"
}
},
{
"type": "image_url",
"image_url": {
"url": "https://shared.deepinfra.com/models/meta-llama/Llama-2-7b-chat-hf/cover_image.10373e7a429dd725e0eb9e57cd20aeb815426c077217b27d9aedce37bd5c2173-s600.webp"
}
},
{
"type": "text",
"text": "What’s in this image?"
}
]
}
]
}'
For LLaVa 1.5, each image takes an additional 576 input tokens. This is also reported in the response under "usage": {"prompt_tokens": <tokens-for-images-and-text>, ...}.
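With the Python client you can read these numbers from the response object; a small sketch reusing the chat_completion from the example above:

# prompt_tokens includes 576 tokens per image plus the tokens of your text
print(chat_completion.usage.prompt_tokens)
print(chat_completion.usage.completion_tokens)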
The LLaVa 1.5 model is designed to work only with a 336 x 336 image resolution, which is why each image always takes the same number of tokens. You can still pass larger or smaller images; the model will rescale them automatically.
Because the resolution is fixed, the detail argument is not useful, though that might change in the future with newer versions of the model. Put images before the text in the message content (i.e. "content": [{"type": "image_url"...}, ..., {"type": "text"...}]). The model doesn't do well when text comes before images in the message content.