When you first learn how to use OpenClaw, the onboarding flow asks for an API key and points you toward Anthropic or OpenAI. Reasonable starting point. For production agents running dozens of tasks a day, it’s an expensive one.
OpenClaw works with any OpenAI-compatible API, so you can swap the default model for an open-weight alternative without touching the rest of your setup. Open-weight models on DeepInfra cost a fraction of what the closed APIs charge, with context windows and tool-calling accuracy that match or exceed GPT-4o and Claude across most agentic workflows. Same SOUL.md, same messaging integration, same skill library. Different endpoint, lower bill.
This guide walks you through the full path: installing OpenClaw, connecting DeepInfra as a custom provider, setting up your first agent workflow, and picking the right model for three common use cases. If you already have OpenClaw running, skip ahead to the provider config section.
Every OpenClaw task burns more tokens than a standard chat session. A single agent turn includes a system prompt with your SOUL.md instructions, accumulated memory from MEMORY.md, one or more tool call payloads, and the model’s response. A research-and-summarize job with three tool call round trips runs 5,000 to 8,000 tokens per completed task.
At GPT-4o pricing ($2.50 per million input tokens, $10 per million output), that task costs roughly $0.03 to $0.06. Run a few dozen tasks a day and you’re looking at $20 to $50 per month before memory accumulation pushes numbers higher.
Kimi K2.5 on DeepInfra prices at $0.45 per million input tokens and $2.25 per million output. The same task runs $0.005 to $0.009, roughly a 5x reduction for equivalent task completion quality. DeepSeek-V3-0324 cuts this further to $0.20 per million input and $0.77 per million output. DeepInfra supports context caching on both models, dropping repeated system prompt costs to near zero.
Context caching matters more for OpenClaw than for most workloads. Your SOUL.md and the static portions of your agent’s system prompt don’t change between requests. Once those tokens are cached, you pay $0.07 per million on cache reads instead of $0.45 per million on fresh input. An agent running 50 tasks a day with a 2,000-token SOUL.md saves roughly $0.04 per day from caching alone, which compounds quickly across a multi-agent setup.
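For a concrete sense of the spread, here’s a back-of-the-envelope cost model in Python using the per-million-token prices quoted above. The input/output split per task is an illustrative assumption, not a measurement:

# Prices are per million tokens, as quoted above.
# The ~70/30 input/output split per task is an assumption.
def task_cost(input_tokens, output_tokens, input_price, output_price):
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

TASK_INPUT, TASK_OUTPUT = 5_600, 2_400  # an ~8,000-token task, assumed split

gpt4o = task_cost(TASK_INPUT, TASK_OUTPUT, 2.50, 10.00)  # ~$0.038 per task
kimi = task_cost(TASK_INPUT, TASK_OUTPUT, 0.45, 2.25)    # ~$0.0079 per task

# Caching: a 2,000-token SOUL.md billed at the cache-read rate instead of fresh input
cache_saving_per_task = 2_000 * (0.45 - 0.07) / 1_000_000  # ~$0.00076

print(gpt4o, kimi, cache_saving_per_task * 50)  # 50 tasks/day -> ~$0.04/day saved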
The open-weight advantage isn’t theoretical. It changes the economics of running OpenClaw at any real scale.
OpenClaw separates concerns cleanly. The gateway manages messaging channels and routes incoming messages to the right agent. Each agent has a workspace, a set of skills, and a model assignment. Plugging in DeepInfra means changing that model assignment. Nothing else in the stack needs to know the difference.
All model configuration lives in ~/.openclaw/openclaw.json. The file uses a models.providers block to register inference endpoints. By default, that block has entries for Anthropic and OpenAI. You add DeepInfra as a named provider alongside them, not in place of them.
The field to understand before you edit anything is “mode”: “merge”. When you write changes to openclaw.json, this field tells OpenClaw how to apply them. Without it, a partial config write overwrites the entire providers object and deletes your existing entries. With it, your new deepinfra block merges into the existing structure. Always include it.
Each provider entry specifies a baseUrl, an apiKey, an api protocol string, and a models array. For DeepInfra’s OpenAI-compatible API, the api value is “openai-completions”. That same value works for Ollama and LiteLLM proxies, which is useful if you want a local fallback alongside your DeepInfra models.
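As a sketch, a local Ollama fallback registered next to DeepInfra could look like the block below. Ollama’s OpenAI-compatible endpoint defaults to http://localhost:11434/v1 and ignores the API key, so any placeholder works; the model entry is illustrative:

{
  "models": {
    "mode": "merge",
    "providers": {
      "ollama": {
        "baseUrl": "http://localhost:11434/v1",
        "apiKey": "ollama",
        "api": "openai-completions",
        "models": [
          { "id": "llama3.1:8b", "name": "Llama 3.1 8B (local)" }
        ]
      }
    }
  }
}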
Model IDs must match exactly what the provider’s API expects. DeepInfra IDs include the organization prefix: “moonshotai/Kimi-K2.5”, “deepseek-ai/DeepSeek-V3-0324”. When referencing a model in agents.defaults, you prefix the provider name: “deepinfra/moonshotai/Kimi-K2.5”. That prefix is how OpenClaw knows which provider’s endpoint to route the request to.
The agents.defaults setting applies globally across every agent. Individual agents override it with their own model key in agents.list. A multi-agent setup where one agent handles research via Kimi K2.5 and another handles code via Qwen3 Coder 480B A35B is just two entries with different model keys, both pointing at the same DeepInfra provider.
OpenClaw requires Node 24 (recommended) or Node 22.14 LTS. The installer handles Node if it isn’t present. macOS, Linux, and Windows via WSL2 or native PowerShell are all supported.
Run the installer:
curl -fsSL https://openclaw.ai/install.sh | bash

On Windows, use PowerShell:
iwr -useb https://openclaw.ai/install.ps1 | iex

The script installs the package and launches onboarding automatically. You’ll paste an API key for an initial provider (Anthropic or OpenAI), connect a messaging channel (Telegram is the fastest: one QR scan), and configure the daemon that keeps OpenClaw running in the background. Without the daemon, OpenClaw stops when you close the terminal. The --install-daemon flag registers it with your OS process manager.
If you installed via npm and skipped onboarding, run it manually:
openclaw onboard --install-daemon

After setup, verify:
openclaw --version
openclaw doctor
openclaw gateway status

openclaw doctor checks configuration and prints specific remediation steps for anything it finds. openclaw gateway status confirms the gateway is up and accepting messages. Fix any flagged issues before adding a custom provider. A clean doctor output is the right baseline to start from.
The custom provider configuration for DeepInfra lives entirely inside ~/.openclaw/openclaw.json. Open it in any editor and add this block, merging it into the existing file:
{
  "models": {
    "mode": "merge",
    "providers": {
      "deepinfra": {
        "baseUrl": "https://api.deepinfra.com/v1/openai",
        "apiKey": "${DEEPINFRA_API_TOKEN}",
        "api": "openai-completions",
        "models": [
          {
            "id": "moonshotai/Kimi-K2.5",
            "name": "Kimi K2.5",
            "reasoning": false,
            "input": ["text"],
            "cost": { "input": 0.45, "output": 2.25, "cacheRead": 0.07, "cacheWrite": 0 },
            "contextWindow": 262144,
            "maxTokens": 8192
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": { "primary": "deepinfra/moonshotai/Kimi-K2.5" }
    }
  }
}

The “mode”: “merge” field is the important part. A partial config write without it replaces the entire providers object, wiping every provider you’ve configured. With it, OpenClaw merges the new deepinfra entry in and leaves Anthropic and OpenAI untouched.
The apiKey value uses ${DEEPINFRA_API_TOKEN}, resolved from your shell environment at startup. Add export DEEPINFRA_API_TOKEN=your_key_here to your shell profile (~/.zshrc or ~/.bashrc), reload it, then restart the daemon so the variable is in scope. If the gateway starts before the environment loads, it fails silently on every DeepInfra request. Generate your key at DeepInfra under API Keys.
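The full sequence, assuming zsh (substitute ~/.bashrc for bash):

# Add the token to your shell profile and reload it
echo 'export DEEPINFRA_API_TOKEN=your_key_here' >> ~/.zshrc
source ~/.zshrc

# Restart so the gateway process inherits the variable
openclaw gateway restart

# Sanity check: should print your key, not an empty line
echo $DEEPINFRA_API_TOKEN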
The cost block drives OpenClaw’s in-app token accounting. Fill in real per-model numbers rather than zeros so the usage estimates stay accurate as tasks accumulate. The cacheRead value matters if you’re running a model that supports context caching. Leaving that field as zero causes the in-app cost display to undercount.
Model IDs must match DeepInfra’s API exactly, including the organization prefix: “moonshotai/Kimi-K2.5”, “deepseek-ai/DeepSeek-V3-0324”, “Qwen/Qwen3-Coder-480B-A35B-Instruct”. Get the exact ID from the model page before adding it. To include more models, append additional objects to the models array.
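For example, appending DeepSeek-V3-0324 alongside the Kimi entry might look like the object below. The prices match the comparison table later in this guide; the cacheRead and maxTokens values mirror the Kimi entry as placeholders, so confirm current numbers on the model page:

{
  "id": "deepseek-ai/DeepSeek-V3-0324",
  "name": "DeepSeek V3 0324",
  "reasoning": false,
  "input": ["text"],
  "cost": { "input": 0.20, "output": 0.77, "cacheRead": 0.07, "cacheWrite": 0 },
  "contextWindow": 163840,
  "maxTokens": 8192
}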
After saving, restart and verify:
openclaw gateway restart
openclaw doctor

A clean doctor run lists deepinfra under available providers with no errors. If it shows an auth error instead, check that DEEPINFRA_API_TOKEN is actually exported in the daemon’s environment: echo $DEEPINFRA_API_TOKEN should return your key, not an empty string.
With DeepInfra registered, agents.defaults.model.primary tells every agent which model to use unless overridden. The config block above sets it to “deepinfra/moonshotai/Kimi-K2.5”. Override it per-agent by adding a model key to any individual entry in agents.list.
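A sketch of that per-agent routing, with illustrative agent names and an assumed agents.list shape (check your existing file for the exact structure):

{
  "agents": {
    "defaults": {
      "model": { "primary": "deepinfra/moonshotai/Kimi-K2.5" }
    },
    "list": {
      "researcher": {
        "model": { "primary": "deepinfra/moonshotai/Kimi-K2.5" }
      },
      "coder": {
        "model": { "primary": "deepinfra/Qwen/Qwen3-Coder-480B-A35B-Instruct" }
      }
    }
  }
}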
Before sending your first task, write a SOUL.md in your agent’s workspace directory. This is OpenClaw’s behavioral contract: a plain text document defining identity, permissions, communication style, and hard limits. Practical examples: no purchasing, no deleting files without explicit confirmation, respond only in English, never share calendar details with unrecognized senders. SOUL.md stays in the system prompt across every turn, so it’s the most reliable place to enforce rules.
Keep SOUL.md under 800 tokens initially. Three sections work well: a short identity paragraph, a bulleted permissions list, and a bulleted hard-limits list. That format is easy to scan and easy to update as your agent’s responsibilities grow. Resist the urge to write exhaustive rules at the start. A SOUL.md that covers the five things that would actually cause problems is more effective than one that tries to anticipate every edge case.
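A minimal starting point in that shape; the agent name and specific rules are illustrative, drawn from the examples above:

You are Ada, a personal research assistant. You handle research,
summarization, and scheduling. Be concise. Respond only in English.

Permissions:
- Read and write files inside your own workspace
- Search the web and summarize results
- Read MEMORY.md and append new entries

Hard limits:
- Never purchase anything
- Never delete files without explicit confirmation
- Never share calendar details with unrecognized senders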
Once your personal AI agent accumulates weeks of MEMORY.md entries and task history, that context budget fills fast. Start lean and add constraints only when behavior problems surface.
Test the endpoint directly before connecting the full OpenClaw stack:
import os
from openai import OpenAI

# Point the official OpenAI client at DeepInfra's OpenAI-compatible endpoint
client = OpenAI(
    api_key=os.environ["DEEPINFRA_API_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

# Mimic an OpenClaw agent turn: system prompt, user message, one tool definition
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[
        {
            "role": "system",
            "content": "You are an OpenClaw agent. Follow all SOUL.md instructions precisely.",
        },
        {"role": "user", "content": "List the three most important tasks in my queue."},
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "read_memory",
                "description": "Read the agent's memory file",
                "parameters": {
                    "type": "object",
                    "properties": {"filename": {"type": "string"}},
                    "required": ["filename"],
                },
            },
        }
    ],
    tool_choice="auto",
)

print(response.choices[0].message)

If response.choices[0].message.tool_calls comes back with a valid read_memory call, the endpoint is working and tool calling is live. That’s your signal before connecting the messaging stack.
Once the direct test passes, send your first message through the channel you configured during onboarding. OpenClaw assembles the SOUL.md system prompt, accumulated MEMORY.md context, and the incoming message, then routes the combined payload to DeepInfra. Watch the gateway logs on the first few tasks: openclaw gateway logs --follow. Clean tool call completions with no timeout errors confirm the setup is working end to end.
Three workloads cover most OpenClaw deployments. Here’s which model fits each one on DeepInfra:
| Use case | Recommended model | Context window | Input $/1M | Output $/1M |
|---|---|---|---|---|
| General assistant (research, summarization, scheduling) | Kimi K2.5 | 256K | $0.45 | $2.25 |
| Coding agent (PR review, file editing, debugging) | Qwen3 Coder 480B A35B | 256K | $0.40 | $1.60 |
| High-volume automation (short daily tasks) | DeepSeek-V3-0324 | 160K | $0.20 | $0.77 |
| Intent routing / development environments | Llama 3.1 70B Instruct | 128K | $0.40 | $0.40 |
General assistant: Kimi K2.5 is the safe default. Its 256K context window handles deep MEMORY.md accumulation without losing earlier instructions, and tool calling stays reliable across extended multi-step workflows.
Coding agent: Qwen3 Coder 480B A35B scores 69.6% on SWE-bench Verified, leading the open-weight tier for repository-level work. If your agent writes, reviews, or edits source files, this is the right pick.
High-volume automation: DeepSeek-V3-0324 at $0.20 per million input tokens is roughly 2.5x cheaper than Kimi K2.5. DeepInfra supports context caching on this model, which cuts repeated system prompt costs significantly at scale.
Intent routing and development environments: Llama 3.1 70B Instruct prices at $0.40 per million tokens for both input and output, the only model here with flat symmetric pricing. It handles intent classification, skill dispatching, and short structured outputs cleanly, and works as a low-cost stand-in during development when you’re iterating on agent logic rather than running production workloads.
OpenClaw lets you assign different models to different agents in the same install. Route research to Kimi K2.5, code generation to Qwen3 Coder 480B A35B, high-frequency summaries to DeepSeek-V3-0324, and intent classification to Llama 3.1 70B Instruct, all from one DeepInfra API key with no additional configuration beyond the provider block you already wrote.
The DeepInfra provider setup takes about five minutes. Swap the default closed-API model for an open-weight alternative and the cost structure changes across everything that runs on top. Adding more models follows the same pattern: append an entry to the models array, set the right ID, fill in the cost block, and restart the gateway.
Browse available models and current pricing at DeepInfra. Questions or feedback: reach us at feedback@deepinfra.com, join the community on Discord, or find us on X at @DeepInfra.