Machine Learning Models and Infrastructure | Deep Infra

🚀 New models by Bria.ai, generate and edit images at scale 🚀

Fast, Simple, Reliable, Low-Cost AI Inference

Accelerate your AI with developer-friendly APIs designed for performance and cost-efficiency.

Qwen (text-generation): $0.10/M in • $0.28/M out

google (text-generation): gemini-2.5-flash, $0.30/M in • $2.50/M out

Qwen (text-generation): $0.08/M in • $0.29/M out

NVIDIA (gpu-rental): On-Demand DGX B200 GPUs, $2.49 / instance-hour

Qwen (text-generation): $0.06/M in • $0.24/M out

moonshotai (text-generation): Kimi-K2-Instruct-0905, $0.50/M in • $2.00/M out

deepseek-ai (text-generation): DeepSeek-R1-0528, $0.50/M in • $2.15/M out

mistralai (automatic-speech-recognition): Voxtral-Small-24B-2507, $0.00300 / minute

deepseek-ai (text-generation): DeepSeek-V3-0324, $0.25/M in • $0.88/M out

Qwen (text-generation): Qwen3-235B-A22B-Thinking-2507, $0.30/M in • $2.90/M out

mistralai (text-generation): Mistral-Small-3.2-24B-Instruct-2506, $0.075/M in • $0.20/M out

deepseek-ai (text-generation): DeepSeek-V3.1-Terminus, $0.27/M in • $1.00/M out

allenai (text-generation): $0.27/M in • $1.50/M out

zai-org (text-generation): $0.60/M in • $1.90/M out

meta-llama (text-generation): Llama-4-Maverick-17B-128E-Instruct-FP8, $0.15/M in • $0.60/M out

openai (text-generation): $0.03/M in • $0.14/M out

Abacus.AI

Hugging Face

interface.ai

Scale to trillions of tokens without breaking the bank

Low pay-as-you-go pricing: no long-term contracts, no hidden fees, no surprises. Startup or enterprise, we can scale with you. We are here for you with simple APIs and hands-on technical support.
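As a rough illustration of how per-token pricing adds up at scale, the sketch below estimates the bill for a workload from per-million-token rates. The function name `estimate_cost` and the token counts are hypothetical; the rates are the $0.27/M in and $1.00/M out listed on this page for DeepSeek-V3.1-Terminus. Actual billing may differ, so treat this as back-of-the-envelope arithmetic only.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)


def estimate_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    """Estimate USD cost from token counts and per-million-token prices.

    Prices are passed as strings so Decimal keeps them exact.
    This helper is illustrative and not part of any DeepInfra SDK.
    """
    cost_in = Decimal(input_tokens) / MILLION * Decimal(price_in_per_m)
    cost_out = Decimal(output_tokens) / MILLION * Decimal(price_out_per_m)
    return cost_in + cost_out


# Hypothetical workload: 2 billion input tokens and 500 million output tokens.
cost = estimate_cost(2_000_000_000, 500_000_000, "0.27", "1.00")
print(f"Estimated cost: ${cost:.2f}")
```

Using `Decimal` rather than floats avoids rounding drift when many small charges are summed.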

Inference Tailored to You

An inference partner that meets your needs. Whether you're optimizing for cost, latency, throughput, or scale, we design the solution around your priorities. DeepInfra provides 100+ models to choose from.

Zero Retention. Compliant. Secure.

With our zero-retention policy, your inputs, your outputs, and your user data stay private. DeepInfra is SOC 2 and ISO 27001 certified. We follow best practices in information security and privacy.

Our Hardware. Our Data Centers. Your Performance Edge.

DeepInfra runs on our own cutting-edge, inference-optimized infrastructure in secure US-based data centers, giving you better performance and reliability.

Models

Explore our Featured Models

text-generation

Qwen/Qwen3-Coder-30B-A3B-Instruct

Qwen3-Coder-30B-A3B-Instruct is a high-performance code generation model optimized for agentic coding and complex programming tasks. With 30.5B total parameters and 3.3B activated through Mixture-of-Experts architecture, it delivers exceptional efficiency. The model features native support for 256K token context (extendable to 1M), making it ideal for repository-scale code understanding. It excels at tool calling, browser automation, and multi-step coding workflows.

Precision: fp8 | Context: 256k tokens | Price: $0.07 in, $0.27 out per 1M tokens

text-generation

zai-org/GLM-4.6

Compared with GLM-4.5, GLM-4.6 brings several key improvements:

Longer context window: The context window has been expanded from 128K to 200K tokens, enabling the model to handle more complex agentic tasks.

Superior coding performance: The model achieves higher scores on code benchmarks and demonstrates better real-world performance in applications such as Claude Code, Cline, Roo Code, and Kilo Code, including improvements in generating visually polished front-end pages.

Advanced reasoning: GLM-4.6 shows a clear improvement in reasoning performance and supports tool use during inference, leading to stronger overall capability.

More capable agents: GLM-4.6 exhibits stronger performance in tool-using and search-based agents, and integrates more effectively within agent frameworks.

Refined writing: GLM-4.6 aligns better with human preferences in style and readability, and performs more naturally in role-playing scenarios.

Precision: fp8 | Context: 198k tokens | Price: $0.60 in, $1.90 out per 1M tokens

text-generation

deepseek-ai/DeepSeek-V3.2-Exp

DeepSeek-V3.2-Exp is an intermediate step toward the next-generation architecture of the DeepSeek models by introducing DeepSeek Sparse Attention—a sparse attention mechanism designed to explore and validate optimizations for training and inference efficiency in long-context scenarios.

Context: 160k tokens | Price: $0.27 in, $0.40 out per 1M tokens

text-generation

deepseek-ai/DeepSeek-V3.1-Terminus

DeepSeek-V3.1 Terminus is an update to DeepSeek V3.1 that maintains the model's original capabilities while addressing issues reported by users, including language consistency and agent capabilities, and further optimizes the model's performance in coding and search agents. It is a large hybrid reasoning model (671B parameters, 37B active) that supports both thinking and non-thinking modes. It extends the DeepSeek-V3 base with a two-phase long-context training process. Users can control the reasoning behaviour with the reasoning enabled boolean (see our docs). The model improves tool use, code generation, and reasoning efficiency, achieving performance comparable to DeepSeek-R1 on difficult benchmarks while responding more quickly. It supports structured tool calling, code agents, and search agents, making it suitable for research, coding, and agentic workflows.

Precision: fp4 | Context: 160k tokens | Price: $0.216 cached, $0.27 in, $1.00 out per 1M tokens
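The price listing above gives three rates per 1M tokens: cached, in, and out. The sketch below shows how such a three-tier price could be applied, assuming "cached" is the rate for input tokens served from a prompt cache (the page does not define the term, so this is an assumption). The function name `estimate_cost_with_cache` and the token counts are hypothetical.

```python
from decimal import Decimal

MILLION = Decimal(1_000_000)


def estimate_cost_with_cache(cached_in, uncached_in, out_tokens,
                             price_cached="0.216", price_in="0.27",
                             price_out="1.00"):
    """Estimate USD cost when part of the input is billed at a cached rate.

    Assumption: cached input tokens are billed at price_cached and all
    other input tokens at price_in. Defaults are the rates quoted above.
    """
    cost = Decimal(cached_in) / MILLION * Decimal(price_cached)
    cost += Decimal(uncached_in) / MILLION * Decimal(price_in)
    cost += Decimal(out_tokens) / MILLION * Decimal(price_out)
    return cost


# Hypothetical request mix: 800k cached input, 200k fresh input, 100k output.
total = estimate_cost_with_cache(800_000, 200_000, 100_000)
# Same traffic with no cache hits, for comparison.
no_cache = estimate_cost_with_cache(0, 1_000_000, 100_000)
print(f"with cache: ${total}, without cache: ${no_cache}")
```

Under this assumption, the saving grows with the share of input tokens that hit the cache, so workloads with long shared prompts would benefit most.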

text-generation

Qwen/Qwen3-Next-80B-A3B-Instruct

Over the past few months, we have observed increasingly clear trends toward scaling both total parameters and context lengths in the pursuit of more powerful and agentic artificial intelligence (AI). We are excited to share our latest advancements in addressing these demands, centered on improving scaling efficiency through innovative model architecture. We call these next-generation foundation models Qwen3-Next.

Precision: bfloat16 | Context: 256k tokens | Price: $0.14 in, $1.10 out per 1M tokens

text-generation

Qwen/Qwen3-Next-80B-A3B-Thinking

Over the past few months, we have observed increasingly clear trends toward scaling both total parameters and context lengths in the pursuit of more powerful and agentic artificial intelligence (AI). We are excited to share our latest advancements in addressing these demands, centered on improving scaling efficiency through innovative model architecture. We call these next-generation foundation models Qwen3-Next.

Precision: bfloat16 | Context: 256k tokens | Price: $0.14 in, $1.20 out per 1M tokens

text-generation

moonshotai/Kimi-K2-Instruct-0905

Kimi K2 0905 is the September update of Kimi K2 0711. It is a large-scale Mixture-of-Experts (MoE) language model developed by Moonshot AI, featuring 1 trillion total parameters with 32 billion active per forward pass. It supports long-context inference up to 256k tokens, extended from the previous 128k. This update improves agentic coding with higher accuracy and better generalization across scaffolds, and enhances frontend coding with more aesthetic and functional outputs for web, 3D, and related tasks. Kimi K2 is optimized for agentic capabilities, including advanced tool use, reasoning, and code synthesis. It excels across coding (LiveCodeBench, SWE-bench), reasoning (ZebraLogic, GPQA), and tool-use (Tau2, AceBench) benchmarks. The model is trained with a novel stack incorporating the MuonClip optimizer for stable large-scale MoE training.

Precision: fp4 | Context: 256k tokens | Price: $0.40 cached, $0.50 in, $2.00 out per 1M tokens

text-generation

deepseek-ai/DeepSeek-V3.1

DeepSeek-V3.1 is post-trained on top of DeepSeek-V3.1-Base, which is built upon the original V3 base checkpoint through a two-phase long-context extension approach, following the methodology outlined in the original DeepSeek-V3 report. We have expanded our dataset by collecting additional long documents and substantially extending both training phases. The 32K extension phase has been increased 10-fold to 630B tokens, while the 128K extension phase has been extended by 3.3x to 209B tokens. Additionally, DeepSeek-V3.1 is trained using the UE8M0 FP8 scale data format to ensure compatibility with microscaling data formats.

Precision: fp4 | Context: 160k tokens | Price: $0.216 cached, $0.27 in, $1.00 out per 1M tokens

text-generation

openai/gpt-oss-120b

gpt-oss-120b is an open-weight, 117B-parameter Mixture-of-Experts (MoE) language model from OpenAI designed for high-reasoning, agentic, and general-purpose production use cases. The model supports configurable reasoning depth, full chain-of-thought access, and native tool use, including function calling, browsing, and structured output generation.

Precision: fp4 | Context: 128k tokens | Price: $0.05 in, $0.27 out per 1M tokens

text-generation

openai/gpt-oss-20b

gpt-oss-20b is an open-weight 21B parameter model released by OpenAI under the Apache 2.0 license. It uses a Mixture-of-Experts (MoE) architecture with 3.6B active parameters per forward pass, optimized for lower-latency inference. The model is trained in OpenAI’s Harmony response format and supports reasoning level configuration, fine-tuning, and agentic capabilities including function calling, tool use, and structured outputs.

Precision: fp4 | Context: 128k tokens | Price: $0.03 in, $0.14 out per 1M tokens

text-generation

allenai/olmOCR-7B-0825

olmOCR is a specialized AI tool that converts PDF documents into clean, structured text while preserving important formatting and layout information. What makes olmOCR particularly valuable for developers is its ability to handle challenging PDFs that traditional OCR tools struggle with—including complex layouts, poor-quality scans, handwritten text, and documents with mixed content types. Built on a fine-tuned 7B vision-language model, olmOCR provides enterprise-grade PDF processing at a fraction of the cost of proprietary solutions.

Precision: fp8 | Context: 16k tokens | Price: $0.27 in, $1.50 out per 1M tokens

text-generation

Qwen/Qwen3-Coder-480B-A35B-Instruct-Turbo

Qwen3-Coder-480B-A35B-Instruct is Qwen3's most agentic code model, with strong performance on agentic coding, agentic browser use, and other foundational coding tasks, achieving results comparable to Claude Sonnet.

Precision: fp4 | Context: 256k tokens | Price: $0.29 in, $1.20 out per 1M tokens

Live AI Inference Metrics

End-to-end insights into speed, scale, stability and spend

Metrics shown: tokens per second (M), time to first token (ms), requests per second, exaFLOPS

Host your models on our servers

Low cost, high privacy to ensure you run your operations smoothly

SOC 2 Certified

ISO 27001 Certified

Have questions or need a custom solution?

Latest Models

zai-org/GLM-4.6, deepseek-ai/DeepSeek-V3.2-Exp, anthropic/claude-3-7-sonnet-latest, deepseek-ai/DeepSeek-V3.1, moonshotai/Kimi-K2-Instruct-0905

Featured Models

moonshotai/Kimi-K2-Instruct-0905, allenai/olmOCR-7B-0825, deepseek-ai/DeepSeek-V3, deepseek-ai/DeepSeek-V3.2-Exp, Qwen/Qwen3-Next-80B-A3B-Thinking

Built With Love in Palo Alto

© 2025 Deep Infra. All rights reserved.
