

Browse DeepInfra models:

All categories and models you can try out and use directly on DeepInfra:

text-generation

automatic-speech-recognition

zero-shot-image-classification

featured

text-generation

DeepSeek-V4-Flash-0731

deepseek-ai/DeepSeek-V4-Flash-0731 cover image

DeepSeek-V4-Flash-0731 is the official release of DeepSeek-V4-Flash, superseding the preview version, with substantially enhanced agentic capabilities. DeepSeek-V4-Flash-0731 outperforms DeepSeek-V4-Pro (Preview) on benchmarks listed below despite its far smaller activated parameter count, and is broadly competitive with the strongest proprietary models available.

$0.018 cached, $0.09 in, $0.18 out / 1M
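The price lines in this listing quote three rates per 1M tokens: cached input, regular input, and output. The sketch below shows how such a line could be turned into a per-request estimate. It assumes that cached tokens are a subset of the input tokens and are billed at the cached rate instead of the input rate; the listing does not spell out the billing rules, and `TokenPricing` and `estimate_cost` are hypothetical names for illustration, not part of any DeepInfra API.

```python
from dataclasses import dataclass

TOKENS_PER_UNIT = 1_000_000  # the listing quotes prices per 1M tokens


@dataclass(frozen=True)
class TokenPricing:
    """USD rates per 1M tokens, copied from a listing price line."""
    cached_per_m: float
    input_per_m: float
    output_per_m: float


def estimate_cost(pricing: TokenPricing, input_tokens: int,
                  output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the USD cost of one request.

    Assumption (not stated in the listing): cached tokens are part of
    the input and are billed at the cached rate instead of the input rate.
    """
    if min(input_tokens, output_tokens, cached_tokens) < 0:
        raise ValueError("token counts must be non-negative")
    if cached_tokens > input_tokens:
        raise ValueError("cached_tokens cannot exceed input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    total = (cached_tokens * pricing.cached_per_m
             + uncached_tokens * pricing.input_per_m
             + output_tokens * pricing.output_per_m)
    return total / TOKENS_PER_UNIT


# Rates from the DeepSeek-V4-Flash-0731 price line above.
DEEPSEEK_V4_FLASH_0731 = TokenPricing(cached_per_m=0.018,
                                      input_per_m=0.09,
                                      output_per_m=0.18)

if __name__ == "__main__":
    cost = estimate_cost(DEEPSEEK_V4_FLASH_0731,
                         input_tokens=200_000,
                         output_tokens=20_000,
                         cached_tokens=150_000)
    print(f"estimated cost: ${cost:.6f}")
```

Entries that list no cached rate can be modelled by setting `cached_per_m` equal to `input_per_m`, which makes the cached and uncached paths cost the same.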

featured

text-generation

zai-org/GLM-5.2 cover image

GLM-5.2 is Z-AI's latest flagship model for long-horizon tasks. It marks a substantial leap in long-horizon task capability over its predecessor GLM-5.1 and, for the first time, delivers that capability on a solid 1M-token context.

$0.14 cached, $0.75 in, $2.40 out / 1M

featured

text-generation

moonshotai/Kimi-K2.7-Code cover image

Kimi K2.7 Code is a coding-focused agentic model built upon Kimi K2.6. With substantial improvements on real-world long-horizon coding tasks, it strengthens end-to-end task completion across complex software engineering workflows while improving token efficiency, reducing thinking-token usage by approximately 30% compared with Kimi K2.6.

$0.15 cached, $0.74 in, $3.50 out / 1M

featured

text-generation

NVIDIA-Nemotron-3-Ultra-550B-A55B

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B cover image

Nemotron 3 Ultra is built for frontier reasoning, orchestration, coding agents, deep research, and complex enterprise workflows. It delivers up to 5x faster inference and up to 30% lower cost for agentic workloads while supporting up to 1M token context.

$0.10 cached, $0.50 in, $2.20 out / 1M

featured

text-generation

DeepSeek-V4-Flash

deepseek-ai/DeepSeek-V4-Flash cover image

DeepSeek V4 Flash is an efficiency-focused MoE model with 284B total parameters (13B active) and a 1M-token context window. It's tuned for fast inference and high-throughput use cases while still holding up on reasoning and coding tasks.

$0.018 cached, $0.09 in, $0.18 out / 1M

featured

text-generation

DeepSeek-V4-Pro

deepseek-ai/DeepSeek-V4-Pro cover image

DeepSeek V4 Pro is an MoE model with 1.6T total parameters (49B active) and a 1M-token context window. It's built for advanced reasoning, coding, and long-running agent tasks, and performs well on knowledge, math, and software engineering benchmarks.

$0.10 cached, $1.30 in, $2.60 out / 1M

featured

text-generation

moonshotai/Kimi-K2.6 cover image

Kimi K2.6 is an open-source, native multimodal agentic model that advances practical capabilities in long-horizon coding, coding-driven design, proactive autonomous execution, and swarm-based task orchestration.

$0.15 cached, $0.75 in, $3.50 out / 1M

featured

text-generation

XiaomiMiMo/MiMo-V2.5-Pro cover image

MiMo-V2.5-Pro is an open-source Mixture-of-Experts (MoE) language model with 1.02T total parameters and 42B active parameters. It uses the hybrid attention architecture and 3-layer Multi-Token Prediction (MTP) introduced in [MiMo-V2-Flash](https://github.com/XiaomiMiMo/MiMo-V2-Flash).

$0.20 cached, $1.00 in, $3.00 out / 1M

featured

text-generation

Qwen3.6-35B-A3B

Qwen/Qwen3.6-35B-A3B cover image

Qwen3.6-35B-A3B is Alibaba's latest flagship Mixture-of-Experts model, with 35B total parameters and only 3B activated per token (256 experts, 8 routed + 1 shared). Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.

$0.10 in, $0.95 out / 1M

featured

text-generation

zai-org/GLM-5.1 cover image

GLM-5.1 is Z-AI's next-generation flagship model for agentic engineering, with significantly stronger coding capabilities than its predecessor. It achieves state-of-the-art performance on SWE-Bench Pro and leads GLM-5 by a wide margin on NL2Repo (repo generation) and Terminal-Bench 2.0 (real-world terminal tasks).

$0.205 cached, $1.05 in, $3.50 out / 1M

featured

text-generation

Qwen3.5-397B-A17B

Qwen/Qwen3.5-397B-A17B cover image

Qwen3.5-397B-A17B is Alibaba's most capable Qwen3.5 model, a Mixture-of-Experts architecture with 397B total parameters and 17B activated per token. It features a 262K token context window (extensible to 1M with YaRN), thinking/reasoning mode, tool calling with MCP integration, and support for 201 languages. It sets state-of-the-art results on reasoning, coding, math, and multimodal benchmarks.

$0.22 cached, $0.45 in, $3.00 out / 1M

featured

text-generation

gemma-4-26B-A4B-it

google/gemma-4-26B-A4B-it cover image

An efficient MoE variant of Gemma 4. Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input and generating text output.

$0.07 in, $0.34 out / 1M

featured

text-generation

google/gemma-4-31B-it cover image

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input and generating text output.

$0.13 in, $0.38 out / 1M

featured

text-generation

NVIDIA-Nemotron-3-Super-120B-A12B

nvidia/NVIDIA-Nemotron-3-Super-120B-A12B cover image

NVIDIA Nemotron 3 Super is a hybrid Mixture-of-Experts (MoE) model engineered for high compute efficiency and accuracy in multi-agent applications and specialized agentic systems. It is optimized to run many collaborating agents per application on a single GPU, delivering high accuracy for reasoning, tool use, and instruction following.

$0.085 in, $0.40 out / 1M

featured

text-generation

zai-org/GLM-5 cover image

GLM-5 is an advanced, open-source large language model designed for developers tackling the toughest challenges. It excels at long-context reasoning, multi-step tool orchestration, and complex systems engineering, making it the ideal choice for powering sophisticated agents and applications that require high-level cognitive tasks.

$0.12 cached, $0.60 in, $2.08 out / 1M

featured

Qwen/Qwen3-TTS cover image

Qwen3-TTS is an advanced text-to-speech model by Alibaba's Qwen team, delivering stable, expressive, and low-latency speech generation across 10 languages.

Key capabilities:

- 9 preset voices: Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, and Sohee, covering diverse genders, ages, and accents
- Voice cloning: clone any voice from a short (~3s) audio sample via the voice_id parameter
- Instruction control: adjust tone, emotion, and speaking style with natural language (e.g. "speak slowly and calmly", "excited tone")
- 10 languages: English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese
- Streaming support: real-time PCM streaming with ~97ms first-byte latency
- Multiple output formats: WAV, MP3, FLAC, PCM

Built on a 1.7B parameter architecture using discrete multi-codebook language modeling for end-to-end speech synthesis without cascading errors. Uses a custom 12Hz acoustic tokenizer that preserves paralinguistic information and environmental audio details.

$20.00 per 1M characters

featured

Qwen3-TTS-VoiceDesign

Qwen/Qwen3-TTS-VoiceDesign cover image

Qwen3-TTS-VoiceDesign is a voice design variant of Qwen3-TTS by Alibaba's Qwen team. Instead of selecting from preset voices, you describe the voice you want in natural language, and the model generates speech in that voice.

Key capabilities:

- Natural language voice control: describe any voice with free text (e.g. "a deep male voice with a calm, authoritative presence", "a young cheerful female with a warm and friendly tone")
- 10 languages: English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese
- Streaming support: real-time PCM streaming
- Multiple output formats: WAV, MP3, FLAC, PCM

Built on the same 1.7B parameter architecture as Qwen3-TTS, using discrete multi-codebook language modeling and a custom 12Hz acoustic tokenizer for high-quality end-to-end speech synthesis.

$20.00 per 1M characters

featured

text-generation

Qwen/Qwen3-Max cover image

The latest flagship model in the Qwen family. State-of-the-art results across a comprehensive suite of benchmarks — including knowledge, reasoning, coding, instruction following, human preference alignment, agent tasks, and multilingual understanding.

$0.24 cached, $1.20 in, $6.00 out / 1M

featured

text-generation

Qwen3-Max-Thinking

Qwen/Qwen3-Max-Thinking cover image

The latest flagship reasoning model in the Qwen3 family, further enhanced by innovations such as adaptive tool use and advanced test-time scaling techniques.

$0.24 cached, $1.20 in, $6.00 out / 1M

featured

text-generation

moonshotai/Kimi-K2.5 cover image

Kimi K2.5 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It seamlessly integrates vision and language understanding with advanced agentic capabilities, instant and thinking modes, as well as conversational and agentic paradigms.

$0.07 cached, $0.45 in, $2.25 out / 1M

featured

text-generation

zai-org/GLM-4.7-Flash cover image

GLM-4.7-Flash is a 30B-A3B MoE model. As the strongest model in the 30B class, GLM-4.7-Flash offers a new option for lightweight deployment that balances performance and efficiency.

$0.01 cached, $0.06 in, $0.40 out / 1M

featured

text-generation

deepseek-ai/DeepSeek-V3.2 cover image

DeepSeek-V3.2 is a large language model designed to harmonize high computational efficiency with strong reasoning and agentic tool-use performance. It introduces DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism that reduces training and inference cost while preserving quality in long-context scenarios. A scalable reinforcement learning post-training framework further improves reasoning, with reported performance in the GPT-5 class, and the model has demonstrated gold-medal results on the 2025 IMO and IOI. V3.2 also uses a large-scale agentic task synthesis pipeline to better integrate reasoning into tool-use settings, boosting compliance and generalization in interactive environments.

$0.13 cached, $0.26 in, $0.38 out / 1M

featured

black-forest-labs/FLUX-2-klein-4b

black-forest-labs/FLUX-2-klein-4b cover image

The fastest model of the Flux 2 family. Frontier visual intelligence — state-of-the-art image generation and editing from Black Forest Labs

$0.014 x (width / 1024) x (height / 1024)
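The FLUX price lines scale a base price by image area relative to a 1024 x 1024 image. A minimal sketch of that arithmetic follows; the function name `flux_image_cost` and the constants are hypothetical names chosen for illustration, and the sketch assumes the formula is applied per generated image with no rounding or minimum charge, which the listing does not confirm.

```python
# Base prices in USD, taken from the FLUX-2-klein price lines in this listing.
FLUX_2_KLEIN_4B_BASE = 0.014
FLUX_2_KLEIN_9B_BASE = 0.015

REFERENCE_SIDE = 1024  # the formula normalizes width and height by 1024


def flux_image_cost(width: int, height: int, base_price: float) -> float:
    """Apply base_price x (width / 1024) x (height / 1024).

    Assumption: the price applies per generated image, without
    rounding or a minimum charge.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return base_price * (width / REFERENCE_SIDE) * (height / REFERENCE_SIDE)


if __name__ == "__main__":
    for width, height in [(512, 512), (1024, 1024), (2048, 1024)]:
        cost = flux_image_cost(width, height, FLUX_2_KLEIN_4B_BASE)
        print(f"{width}x{height}: ${cost:.4f}")
```

Because the cost is proportional to pixel count, doubling one side doubles the price and doubling both sides quadruples it.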

featured

black-forest-labs/FLUX-2-klein-9b

black-forest-labs/FLUX-2-klein-9b cover image

The Flux 2 family model with the best quality-to-latency ratio, aimed at production apps. Frontier visual intelligence: state-of-the-art image generation and editing from Black Forest Labs.

$0.015 x (width / 1024) x (height / 1024)

SOC 2 Certified

ISO 27001 Certified


Latest Models

deepseek-ai/DeepSeek-V4-Flash-0731 thinkingmachines/Inkling-Small google/nano-banana-2-lite google/nano-banana-2 google/nano-banana-pro



© 2026 DeepInfra. All rights reserved.
