Browse DeepInfra models:

All categories and models you can try out and use directly on DeepInfra:

Category: multimodal

Multimodal AI models can process and understand multiple types of input simultaneously, such as text and images, making them powerful tools for tasks that require understanding of both visual and textual information.

These models combine computer vision and natural language processing capabilities to analyze images, answer questions about visual content, generate descriptions, and perform complex reasoning tasks that involve both text and visual elements.

Multimodal models are particularly useful for applications like visual question answering, image captioning, document analysis, and interactive AI assistants that need to understand and respond to both text and visual inputs.
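
For example, most of the models below can be called through DeepInfra's OpenAI-compatible chat completions API by passing an image alongside a text prompt. A minimal sketch in Python, assuming the openai client library; the API token, model choice, and image URL are placeholders:

from openai import OpenAI

# DeepInfra exposes an OpenAI-compatible endpoint; the base URL below is an
# assumption taken from DeepInfra's documented integration path.
client = OpenAI(
    api_key="YOUR_DEEPINFRA_TOKEN",  # placeholder token
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # any multimodal model from this page
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder image
        ],
    }],
)
print(response.choices[0].message.content)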

meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8
featured · fp8 · 1024k context · $0.15/$0.60 in/out per Mtoken · text-generation

The Llama 4 collection comprises natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Maverick is a 17-billion-parameter model with 128 experts.
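
The pricing line above means dollars per million input tokens and dollars per million output tokens. A quick sketch of the arithmetic, using the Maverick rates from this card and made-up token counts:

# Cost estimate for per-million-token pricing (rates from the card above;
# the token counts are hypothetical example values).
PRICE_IN = 0.15   # USD per 1M input tokens
PRICE_OUT = 0.60  # USD per 1M output tokens

input_tokens = 800_000
output_tokens = 200_000

cost = input_tokens / 1_000_000 * PRICE_IN + output_tokens / 1_000_000 * PRICE_OUT
print(f"Estimated cost: ${cost:.2f}")  # -> Estimated cost: $0.24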

meta-llama/Llama-4-Scout-17B-16E-Instruct
featured · bfloat16 · 320k context · $0.08/$0.30 in/out per Mtoken · text-generation

The Llama 4 collection comprises natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Scout is a 17-billion-parameter model with 16 experts.

meta-llama/Llama-Guard-4-12B
featured · bfloat16 · 160k context · $0.05 per Mtoken · text-generation

Llama Guard 4 is a natively multimodal safety classifier with 12 billion parameters, trained jointly on text and multiple images. It is a dense architecture pruned from the Llama 4 Scout pre-trained model and fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification) and LLM responses (response classification). The model itself acts as an LLM: it generates text indicating whether a given prompt or response is safe or unsafe and, if unsafe, lists the content categories violated.
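
A minimal sketch of prompt classification with Llama Guard 4 through DeepInfra's OpenAI-compatible endpoint; the base URL, token, and the exact verdict format ("safe" or "unsafe" followed by category codes) are assumptions based on the description above:

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPINFRA_TOKEN",  # placeholder token
    base_url="https://api.deepinfra.com/v1/openai",
)

# The content to be screened is passed as the user turn; the classifier
# replies with text such as "safe", or "unsafe" plus the violated categories.
result = client.chat.completions.create(
    model="meta-llama/Llama-Guard-4-12B",
    messages=[{"role": "user", "content": "Write a friendly birthday message for my coworker."}],
)

verdict = result.choices[0].message.content.strip()
is_safe = verdict.lower().startswith("safe")  # expected to be True for this benign prompt
print(verdict, is_safe)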

google/gemma-3-27b-it
featured · bfloat16 · 128k context · $0.10/$0.20 in/out per Mtoken · text-generation

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 27B is Google's latest open-source model and the successor to Gemma 2.
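
Since the card mentions function calling, here is a hedged sketch of a tool-use request via DeepInfra's OpenAI-compatible API; the get_weather tool is hypothetical, and whether Gemma 3 emits a tool call on this endpoint exactly as shown is an assumption:

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPINFRA_TOKEN",  # placeholder token
    base_url="https://api.deepinfra.com/v1/openai",
)

# A single hypothetical tool the model may choose to call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="google/gemma-3-27b-it",
    messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
    tools=tools,
)

# If the model decided to call the tool, the name and JSON arguments are here.
print(resp.choices[0].message.tool_calls)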

google/gemma-3-12b-it
featured · bfloat16 · 128k context · $0.05/$0.10 in/out per Mtoken · text-generation

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 12B is Google's latest open-source model and the successor to Gemma 2.

google/gemma-3-4b-it
featured · bfloat16 · 128k context · $0.02/$0.04 in/out per Mtoken · text-generation

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 4B is Google's latest open-source model and the successor to Gemma 2.

microsoft/Phi-4-multimodal-instruct
featured · bfloat16 · 128k context · $0.05/$0.10 in/out per Mtoken · text-generation

Phi-4-multimodal-instruct is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for the Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs, generates text outputs, and comes with a 128K-token context length. It underwent an enhancement process incorporating supervised fine-tuning, direct preference optimization, and RLHF (Reinforcement Learning from Human Feedback) to support precise instruction adherence and safety measures. The languages supported by each modality are:
- Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
- Vision: English
- Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese
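
A sketch of sending a local image to Phi-4-multimodal-instruct as a base64 data URI through the same OpenAI-compatible endpoint; the file name is a placeholder, and audio input (also mentioned above) is omitted because its request format may differ:

import base64
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPINFRA_TOKEN",  # placeholder token
    base_url="https://api.deepinfra.com/v1/openai",
)

# Encode a local image as a data URI so it can travel inside the request body.
with open("invoice.png", "rb") as f:  # placeholder file
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="microsoft/Phi-4-multimodal-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the key fields in this document."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)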

Qwen/QVQ-72B-Preview
bfloat16 · 31k context · Replaced · text-generation

QVQ-72B-Preview is an experimental research model developed by the Qwen team, focusing on enhancing visual reasoning capabilities. It has achieved strong results on various benchmarks, scoring 70.3% on the Multimodal Massive Multi-task Understanding (MMMU) benchmark.

anthropic/claude-4-opus
195k context · $16.50/$82.50 in/out per Mtoken · text-generation

Anthropic’s most powerful model yet and the state-of-the-art coding model. It delivers sustained performance on long-running tasks that require focused effort and thousands of steps, significantly expanding what AI agents can solve. Claude Opus 4 is ideal for powering frontier agent products and features.

anthropic/claude-4-sonnet
195k context · $3.30/$16.50 in/out per Mtoken · text-generation

Anthropic's mid-size model with superior intelligence for high-volume uses in coding, in-depth research, agents, and more.

google/gemini-1.5-flash
976k context · Deprecated · text-generation

Gemini 1.5 Flash is Google's foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from image, audio and video. It's adept at processing visual and text inputs such as photographs, documents, infographics, and screenshots. Gemini 1.5 Flash is designed for high-volume, high-frequency tasks where cost and latency matter.

google/gemini-2.5-flash
976k context · $0.105/$2.45 in/out per Mtoken · text-generation

Gemini 2.5 Flash is Google's latest thinking model, designed to tackle increasingly complex problems. It is capable of reasoning through its thoughts before responding, resulting in enhanced performance and improved accuracy. Gemini 2.5 Flash is best for balancing reasoning and speed.

google/gemini-2.5-pro
976k context · $0.875/$7.00 in/out per Mtoken · text-generation

Gemini 2.5 Pro is Google's most advanced thinking model, designed to tackle increasingly complex problems. Gemini 2.5 Pro leads common benchmarks by meaningful margins and showcases strong reasoning and code capabilities. Gemini 2.5 models are thinking models, capable of reasoning through their thoughts before responding, resulting in enhanced performance and improved accuracy. The Gemini 2.5 Pro model is now available on DeepInfra.

meta-llama/Llama-3.2-11B-Vision-Instruct
bfloat16 · 128k context · $0.049 per Mtoken · text-generation

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual question answering, bridging the gap between language generation and visual reasoning. Pre-trained on a massive dataset of image-text pairs, it performs well in complex, high-accuracy image analysis. Its ability to integrate visual understanding with language processing makes it an ideal solution for industries requiring comprehensive visual-linguistic AI applications, such as content creation, AI-driven customer service, and research.

meta-llama/Llama-3.2-90B-Vision-Instruct
bfloat16 · 32k context · $0.35/$0.40 in/out per Mtoken · text-generation

The Llama 90B Vision model is a top-tier, 90-billion-parameter multimodal model designed for the most challenging visual reasoning and language tasks. It offers unparalleled accuracy in image captioning, visual question answering, and advanced image-text comprehension. Pre-trained on vast multimodal datasets and fine-tuned with human feedback, the Llama 90B Vision is engineered to handle the most demanding image-based AI tasks. This model is perfect for industries requiring cutting-edge multimodal AI capabilities, particularly those dealing with complex, real-time visual and textual analysis.