DeepInfra raises $107M Series B to scale the inference cloud — read the announcement

We've been following NVIDIA Nemotron work closely, and we're excited to make Nemotron 3 Ultra and Nemotron 3.5 Content Safety available on DeepInfra from day 0. These aren't just more models to add to the catalog. Nemotron is built around a specific idea about how agentic AI should work, and we think that idea is right.
Most benchmarks still measure model quality in isolation. But if you're building agentic systems that plan, call tools, delegate work, loop, and eventually complete a task, then you need ot measure of task completion.
"The right measure isn't simply model quality. It's the speed of task completion."
That philosophy shows up most clearly in Nemotron 3 Ultra, which is designed to deliver up to 5x faster inference and up to 30% lower cost for long-running agent workflows.
The broader Nemotron family extends that same idea across the agent stack. Instead of one model that tries to do everything, each model is purpose-built for a specific role—reasoning, speech, safety, and more—so developers can use the right one for each job.
550B · 55B active · 1M context · BF16 + NVFP4
Nemotron 3 Ultra is built for, frontier reasoning, orchestration, coding agents, deep research, and complex enterprise workflows. It delivers up to 5x faster inference and up to 30% lower cost for agentic workloads while supporting up to 1M token context.
4B · multimodal · 23 categories · 12 languages
A compact safety model that handles text, images, and custom policies. It outputs a safe/unsafe classification plus a reasoning trace, and can be used as an inference-time guardrail, as a judge for LLM safety testing and evaluation, or with the accompanying training dataset to post-train models for safer behavior. Designed to run as a guardrail layer in your pipeline without adding a lot of latency.
These two complement each other naturally. Nemotron 3 Ultra does the heavy lifting, while the safety models keeps the agents things in check. Both are available via our standard API, same as everything else on DeepInfra.
0.6B · Streaming · ~40 language-locales
Real-time streaming ASR built for voice agents. Cache-aware architecture means true chunk-by-chunk processing — no recomputation, no buffering lag — designed for high-concurrency live workloads. Supports 40 language locales with native punctuation and capitalization, runtime-configurable latency modes, and word boosting for domain-specific vocabulary. The voice layer for your agent stack, available on DeepInfra now.
All three models are live right now on DeepInfra and available through our standard API. If you've used DeepInfra before, nothing changes, same API, same setup. If you're new, it takes about two minutes to get a key and run your first call.
→ Explore models: models page
→ View docs: DeepInfra docs
How to OpenAI Whisper with per-sentence and per-word timestamp segmentation using DeepInfraWhisper is a Speech-To-Text model from OpenAI.
Use OpenAI API clients with LLaMasGetting started
# create a virtual environment
python3 -m venv .venv
# activate environment in current shell
. .venv/bin/activate
# install openai python client
pip install openai
Choose a model
meta-llama/Llama-2-70b-chat-hf
[meta-llama/L...
DeepInfra Launches Access to NVIDIA Nemotron Models for Vision, Retrieval, and AI SafetyDeepInfra is serving the new, open NVIDIA Nemotron vision language and OCR AI models from day zero of their release. As a leading inference provider committed to performance and cost-efficiency, we're making these cutting-edge models available at the industry's best prices, empowering developers to build specialized AI agents without compromising on budget or performance.© 2026 DeepInfra. All rights reserved.