Browse DeepInfra models:

All categories and models you can try out and use directly on DeepInfra:

text-generation

automatic-speech-recognition

zero-shot-image-classification

text-generation

PaddleOCR-VL-0.9B

PaddlePaddle/PaddleOCR-VL-0.9B

PaddleOCR-VL is a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.

$0.14 in, $0.80 out / 1M

text-generation

Qwen3-VL-235B-A22B-Instruct

Qwen/Qwen3-VL-235B-A22B-Instruct

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities.

text-generation

Qwen3-VL-30B-A3B-Instruct

Qwen/Qwen3-VL-30B-A3B-Instruct

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date. This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities.

$0.15 in, $0.60 out / 1M

text-generation

olmOCR-2-7B-1025

allenai/olmOCR-2-7B-1025

olmOCR is a specialized AI tool that converts PDF documents into clean, structured text while preserving important formatting and layout information. What makes olmOCR particularly valuable for developers is its ability to handle challenging PDFs that traditional OCR tools struggle with—including complex layouts, poor-quality scans, handwritten text, and documents with mixed content types. Built on a fine-tuned 7B vision-language model, olmOCR provides enterprise-grade PDF processing at a fraction of the cost of proprietary solutions.

$0.09 in, $0.19 out / 1M

text-generation

claude-3-7-sonnet-latest

anthropic/claude-3-7-sonnet-latest

$0.33 cached, $3.30 in, $16.50 out / 1M

text-generation

anthropic/claude-4-opus

Anthropic’s most powerful model yet and the state-of-the-art coding model. It delivers sustained performance on long-running tasks that require focused effort and thousands of steps, significantly expanding what AI agents can solve. Claude Opus 4 is ideal for powering frontier agent products and features.

$16.50 in, $82.50 out / 1M

text-generation

claude-4-sonnet

anthropic/claude-4-sonnet

Anthropic's mid-size model with superior intelligence for high-volume uses in coding, in-depth research, agents, & more.

$3.30 in, $16.50 out / 1M

text-generation

deepseek-ai/DeepSeek-OCR

DeepSeek-OCR is an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. It consists of two components: DeepEncoder as the encoder and DeepSeek3B-MoE-A570M as the decoder. DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios, so that the number of vision tokens stays manageable. Experiments show that when the number of text tokens is within 10 times the number of vision tokens (i.e., a compression ratio < 10x), the model can achieve a decoding (OCR) precision of 97%. Even at a compression ratio of 20x, OCR accuracy remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs.

$0.03 in, $0.10 out / 1M
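The compression ratio in the DeepSeek-OCR description is the number of text tokens divided by the number of vision tokens. A minimal sketch of that arithmetic follows; the function names and the token counts are hypothetical, chosen only to illustrate the 10x and 20x figures quoted above, and are not part of any DeepSeek or DeepInfra API.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Return text tokens per vision token (hypothetical helper)."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


def max_text_tokens(vision_tokens: int, ratio: float) -> int:
    """Largest text-token count that stays at or below the given ratio."""
    return int(vision_tokens * ratio)


# Illustrative numbers only: a page encoded into 100 vision tokens.
print(compression_ratio(1000, 100))  # 10.0, the edge of the ~97% precision regime
print(compression_ratio(2000, 100))  # 20.0, where accuracy is reported at about 60%
print(max_text_tokens(100, 10))      # 1000
```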

text-generation

gemini-1.5-flash

google/gemini-1.5-flash

Gemini 1.5 Flash is Google's foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from image, audio and video. It's adept at processing visual and text inputs such as photographs, documents, infographics, and screenshots. Gemini 1.5 Flash is designed for high-volume, high-frequency tasks where cost and latency matter.

text-generation

gemini-1.5-flash-8b

google/gemini-1.5-flash-8b

text-generation

gemini-2.5-flash

google/gemini-2.5-flash

Gemini 2.5 Flash is Google's latest thinking model, designed to tackle increasingly complex problems. It is capable of reasoning through its thoughts before responding, resulting in enhanced performance and improved accuracy. Gemini 2.5 Flash is best for balancing reasoning and speed.

$0.30 in, $2.50 out / 1M

text-generation

google/gemini-2.5-pro

Gemini 2.5 Pro is Google's most advanced thinking model, designed to tackle increasingly complex problems. Gemini 2.5 Pro leads common benchmarks by meaningful margins and showcases strong reasoning and code capabilities. Gemini 2.5 models are thinking models, capable of reasoning through their thoughts before responding, resulting in enhanced performance and improved accuracy. The Gemini 2.5 Pro model is now available on DeepInfra.

$1.25 in, $10.00 out / 1M

text-generation

google/gemma-3-12b-it

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 12B is Google's latest open-source model and the successor to Gemma 2.

$0.04 in, $0.13 out / 1M

text-generation

google/gemma-3-27b-it

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 27B is Google's latest open-source model and the successor to Gemma 2.

$0.08 in, $0.16 out / 1M

text-generation

google/gemma-3-4b-it

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 4B is Google's latest open-source model and the successor to Gemma 2.

$0.04 in, $0.08 out / 1M

text-generation

Llama-3.2-11B-Vision-Instruct

meta-llama/Llama-3.2-11B-Vision-Instruct

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual question answering, bridging the gap between language generation and visual reasoning. Pre-trained on a massive dataset of image-text pairs, it performs well in complex, high-accuracy image analysis. Its ability to integrate visual understanding with language processing makes it an ideal solution for industries requiring comprehensive visual-linguistic AI applications, such as content creation, AI-driven customer service, and research.

$0.245 / 1M tokens

text-generation

Llama-4-Maverick-17B-128E-Instruct-FP8

meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8

The Llama 4 collection comprises natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Maverick is a 17 billion parameter model with 128 experts.

$0.15 in, $0.60 out / 1M

text-generation

Llama-4-Scout-17B-16E-Instruct

meta-llama/Llama-4-Scout-17B-16E-Instruct

The Llama 4 collection comprises natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Scout is a 17 billion parameter model with 16 experts.

$0.08 in, $0.30 out / 1M

text-generation

Llama-Guard-4-12B

meta-llama/Llama-Guard-4-12B

Llama Guard 4 is a natively multimodal safety classifier with 12 billion parameters, trained jointly on text and multiple images. It uses a dense architecture pruned from the Llama 4 Scout pre-trained model and fine-tuned for content safety classification. As with previous versions, it can classify content in both LLM inputs (prompt classification) and LLM responses (response classification). It acts as an LLM itself: it generates text that indicates whether a given prompt or response is safe or unsafe and, if unsafe, lists the content categories violated.

$0.18 / 1M tokens

text-generation

Mistral-Small-3.2-24B-Instruct-2506

mistralai/Mistral-Small-3.2-24B-Instruct-2506

Mistral-Small-3.2-24B-Instruct is a drop-in upgrade over the 3.1 release, with markedly better instruction following, roughly half the infinite-generation errors, and a more robust function-calling interface—while otherwise matching or slightly improving on all previous text and vision benchmarks.

$0.075 in, $0.20 out / 1M

text-generation

NVIDIA-Nemotron-Nano-12B-v2-VL

nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL

NVIDIA Nemotron 2 Nano VL extends the Nemotron family into multi-modal reasoning and document intelligence. This auto-regressive vision-language model enables multi-image reasoning, video understanding, visual Q&A and document analysis and summarization. Optimized for enterprise AI workflows, it powers multimodal agentic systems such as visual copilots, document assistants, and knowledge automation pipelines.

$0.20 in, $0.60 out / 1M
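The prices on these cards read as US dollars per one million tokens, with separate rates for input ("in") and output ("out"). Under that assumption, a rough cost estimate for a workload can be sketched as below. The function name and the token counts are hypothetical; cached-token rates and cards that list a single price are not modeled here.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    price_in_per_million: float,
    price_out_per_million: float,
) -> float:
    """Estimate cost in USD, assuming prices are quoted per 1M tokens."""
    total = (
        input_tokens * price_in_per_million
        + output_tokens * price_out_per_million
    )
    return total / 1_000_000


# Illustrative workload at the "$0.20 in, $0.60 out / 1M" rate listed above:
# 2M input tokens and 500k output tokens.
cost = estimate_cost_usd(2_000_000, 500_000, 0.20, 0.60)
print(f"${cost:.2f}")  # $0.70
```

Output tokens are priced several times higher than input tokens on most cards, so workloads that generate long responses are dominated by the "out" rate.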


© 2026 Deep Infra. All rights reserved.
