A single ask in an OpenClaw session can cost more than a full evening of casual ChatGPT use. Ask your agent something simple, like which calendar event clashes with your flight, and the request that hits the API carries far more than your 12-token question. It also carries your SOUL.md, the tool schemas registered on the gateway, the last forty turns of conversation, the MEMORY.md the heartbeat just appended to, and then the question itself. On Claude Sonnet at $3 per million input tokens, that single round trip can run $0.18. Multiply by a few hundred messages a week and the bill stops looking like a hobby.
OpenClaw cost optimization is mostly a question of which tokens you are buying, on which model, at which price, and whether you are routing each request to the cheapest model that can still finish the task. This guide walks through the math and the routing, with real pricing from DeepInfra-hosted open-weight models as the cost floor.
LLM pricing is per token, in and out, and the meter starts the second your gateway forwards a request. Three things drive the input side of every OpenClaw call: the system prompt block (SOUL.md plus tool definitions), the conversation history the gateway has been accumulating for that thread, and the latest user message. The output side is whatever the model writes back, including any tool-call arguments and any thinking traces if you’re on a reasoning model.
The user message is rarely the expensive part. A four-word question to a fresh agent can fire 6,000 to 12,000 input tokens at the model because the gateway ships the full state every turn. OpenClaw is stateless on the provider side. Every call is a new request with the same baggage attached.
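To make the breakdown concrete, here is a minimal Python sketch of how one call's input cost adds up. The component token counts are hypothetical, chosen only so they sum to the 60,000 tokens that a $0.18 call at $3 per million implies, and `call_input_cost` is an illustrative helper, not part of OpenClaw.

```python
def call_input_cost(system_tokens, history_tokens, message_tokens, price_per_million):
    """Return the input cost in USD of a single stateless request."""
    total_tokens = system_tokens + history_tokens + message_tokens
    return total_tokens * price_per_million / 1_000_000


# Hypothetical split for a long-running thread (assumed, not measured).
cost = call_input_cost(
    system_tokens=8_000,     # SOUL.md plus tool schemas
    history_tokens=51_988,   # accumulated conversation and memory
    message_tokens=12,       # the question the user actually typed
    price_per_million=3.00,  # Claude Sonnet input price quoted above
)
print(f"${cost:.2f} per call")
```

The 12-token question is a rounding error next to the state that travels with it.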
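Once you see that the bill is mostly resent context, you know what to change: the price you pay per input token, which model handles which task, how much system prompt and tool schema ships on every turn, how often the heartbeat fires, and how much history gets resent.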
The cheapest token you can send is the one priced by an open-weight model on DeepInfra. The default Anthropic and OpenAI configurations OpenClaw ships with are convenient. They are not the price floor. Below is the working set of DeepInfra-hosted models worth considering for OpenClaw work, with current pay-as-you-go pricing.
| Model | Input / 1M | Output / 1M | Cached input / 1M | Context |
|---|---|---|---|---|
| DeepSeek-V3 | $0.32 | $0.89 | n/a | 128K |
| Qwen3-Coder-480B-A35B-Instruct-Turbo | $0.30 | $1.00 | $0.10 | 262K |
| Qwen3-235B-A22B-Instruct-2507 | $0.071 | $0.10 | n/a | 256K |
| Llama 3.1 70B Instruct | $0.40 | $0.40 | n/a | 131K |
| Qwen3-30B-A3B | $0.09 | $0.45 | n/a | 40K |
Qwen3-235B-A22B-Instruct on DeepInfra runs at $0.071 input and $0.10 output per million tokens. That is roughly forty times cheaper on input than Claude Sonnet at $3, and about 150 times cheaper on output than Sonnet at $15. The Qwen3 Coder Turbo variant supports prompt caching at $0.10 per million cached input tokens, which pays off once your agent is replaying the same SOUL.md and tool definitions on every turn.
For OpenClaw planning, the number to watch is cost per completed task, not price per million tokens. A task that takes 30 turns and 80K tokens of cumulative context costs roughly $0.06 on Qwen3-235B-A22B-Instruct, $0.10 on Llama 3.3 70B, $0.26 on DeepSeek-V3, and over $2 on Claude Sonnet. Same outcome. Different price.
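A rough way to estimate cost per completed task is to sum what every turn resends. The sketch below assumes a context that grows linearly from turn to turn; the starting size, growth rate, and output length are hypothetical values, not a reconstruction of the figures above. Prices come from the table and from the Sonnet rates quoted earlier.

```python
# USD per 1M tokens as (input, output).
PRICES = {
    "Qwen3-235B-A22B-Instruct-2507": (0.071, 0.10),
    "DeepSeek-V3": (0.32, 0.89),
    "Claude Sonnet": (3.00, 15.00),
}


def task_cost(model, turns, base_tokens, growth_per_turn, output_per_turn):
    """Cost in USD of a task whose input context grows linearly each turn."""
    in_price, out_price = PRICES[model]
    # Turn t resends the base context plus everything accumulated so far.
    input_tokens = sum(base_tokens + growth_per_turn * t for t in range(turns))
    output_tokens = output_per_turn * turns
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical task: 30 turns, 8K starting context, +1.2K per turn, 300 output tokens per turn.
for model in PRICES:
    print(f"{model}: ${task_cost(model, 30, 8_000, 1_200, 300):.3f}")
```

Under these assumptions the task bills about 762K input tokens in total, so the per-million input price dominates the result.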
Running every agent on your best model wastes money on tasks that did not need it. The cheapest reliable architecture for OpenClaw is a two-tier setup: a smart primary model for the main agent that holds the plan, and a budget model for sub-agents that handle bounded sub-tasks like file reads, summarization, or web fetches. OpenClaw supports this through the per-agent model field in ~/.openclaw/openclaw.json.
Here is the relevant block, with DeepInfra registered as a custom provider and a sub-agent pinned to a cheaper model:
```json
{
  "models": {
    "mode": "merge",
    "providers": {
      "deepinfra": {
        "baseUrl": "https://api.deepinfra.com/v1/openai",
        "apiKey": "${DEEPINFRA_API_TOKEN}",
        "api": "openai-completions",
        "models": [
          { "id": "Qwen/Qwen3-Coder-480B-A35B-Instruct-Turbo", "name": "Qwen3 Coder" },
          { "id": "Qwen/Qwen3-235B-A22B-Instruct-2507", "name": "Qwen3 235B" },
          { "id": "Qwen/Qwen3-30B-A3B", "name": "Qwen3 30B" }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": { "primary": "deepinfra/Qwen/Qwen3-Coder-480B-A35B-Instruct-Turbo" }
    },
    "list": {
      "summarizer": { "model": "deepinfra/Qwen/Qwen3-30B-A3B" },
      "fetcher": { "model": "deepinfra/Qwen/Qwen3-30B-A3B" }
    }
  }
}
```

The primary agent runs on Qwen3 Coder, which keeps tool-calling accurate as the session grows. The summarizer and fetcher sub-agents drop to Qwen3-30B-A3B at $0.09 input and $0.45 output. For an agent that fires the summarizer ten times per primary turn, the math changes fast. A workload that ran $40 a month on a single-model Sonnet config drops below $4 on the tiered open-weight setup.
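A mistyped model reference is an easy way to break a tiered setup, so a quick sanity check before restarting the gateway can help. The sketch below is a standalone script, not an OpenClaw feature. It assumes model references take the form provider name, slash, model id, as the config above suggests, and `unregistered_models` is a hypothetical helper. The embedded config is trimmed and contains a deliberate typo for the fetcher.

```python
import json

CONFIG = json.loads("""
{
  "models": {"providers": {"deepinfra": {"models": [
    {"id": "Qwen/Qwen3-Coder-480B-A35B-Instruct-Turbo"},
    {"id": "Qwen/Qwen3-30B-A3B"}
  ]}}},
  "agents": {
    "defaults": {"model": {"primary": "deepinfra/Qwen/Qwen3-Coder-480B-A35B-Instruct-Turbo"}},
    "list": {
      "summarizer": {"model": "deepinfra/Qwen/Qwen3-30B-A3B"},
      "fetcher": {"model": "deepinfra/Qwen/Qwen3-30B-A3"}
    }
  }
}
""")


def unregistered_models(config):
    """Return {agent_name: model_ref} for refs that match no registered provider model."""
    known = {
        f"{provider}/{model['id']}"
        for provider, spec in config["models"]["providers"].items()
        for model in spec.get("models", [])
    }
    refs = {"defaults": config["agents"]["defaults"]["model"]["primary"]}
    refs.update({name: agent["model"] for name, agent in config["agents"]["list"].items()})
    return {name: ref for name, ref in refs.items() if ref not in known}


problems = unregistered_models(CONFIG)
print(problems)  # only the fetcher's mistyped reference is reported
```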
Two rules keep tiering from quietly breaking. First, never tier down the agent that owns tool-calling correctness. Sub-agents that compress a page of text or pick a date from a calendar tolerate a smaller model fine. A primary emitting a structured tool call against a complex schema does not, and one bad tool call wastes the savings on retries. Second, watch the sub-agent context window. Qwen3-30B-A3B caps at 40K tokens. That is plenty for a single fetch-and-summarize step, but it will reject a 60K-token document outright. If you regularly pipe large inputs through a sub-agent, bump that one up to Qwen3-235B-A22B-Instruct, where input is still $0.071 per million and the window jumps to 256K.
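The second rule can be enforced before a document ever reaches a sub-agent. The sketch below is one possible guard, not something OpenClaw provides. The four-characters-per-token estimate and the 4,000-token reserve for prompt and output are assumptions, and `pick_subagent_model` is a hypothetical helper. Context limits come from the table above.

```python
CONTEXT_LIMITS = {
    "deepinfra/Qwen/Qwen3-30B-A3B": 40_000,
    "deepinfra/Qwen/Qwen3-235B-A22B-Instruct-2507": 256_000,
}


def estimate_tokens(text):
    """Crude estimate of about 4 characters per token (assumed; use a real tokenizer if you have one)."""
    return len(text) // 4 + 1


def pick_subagent_model(text, reserve_tokens=4_000):
    """Return the smallest-window model that fits the input plus a reserve for prompt and output."""
    needed = estimate_tokens(text) + reserve_tokens
    for model, limit in sorted(CONTEXT_LIMITS.items(), key=lambda item: item[1]):
        if needed <= limit:
            return model
    raise ValueError(f"input needs about {needed} tokens, more than any configured window")


print(pick_subagent_model("x" * 40_000))   # short page: fits the 40K window
print(pick_subagent_model("x" * 240_000))  # roughly 60K tokens: needs the 256K window
```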
The most common cause of a runaway OpenClaw bill is a SOUL.md that grew to 4,000 tokens, plus a tool registry the agent does not actually use. Every turn ships every byte. That 9,600-token “why is my simple question this expensive” effect is usually 60 percent SOUL.md and tool schemas, 35 percent conversation history, and 5 percent the message the user actually typed.
A discipline that works: keep SOUL.md under 800 tokens at the start, split persistent behavioral rules from short-lived task instructions, and put the task instructions in a per-thread file the agent reads on demand instead of in the system prompt. Tool registration follows the same rule. Each registered tool emits its JSON schema into the system prompt on every call. If an agent has 25 tools registered and uses three, the other twenty-two are paying rent. Per-agent tool whitelists in agents.list.<name>.tools cut that overhead immediately, and the savings stack on every model in the tier above.
The numbers are easy to check. A 4,000-token SOUL.md on Claude Sonnet at $3 input costs $0.012 every turn just for the contract. On Qwen3-235B-A22B-Instruct, the same SOUL.md costs $0.00028. Audit your tools quarterly and prune the dead ones. Pruning is free. Carrying dead tools on every turn is not.
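The per-turn arithmetic can be reproduced in a few lines. This sketch compares a 4,000-token system prompt with one trimmed to 800 tokens at the two input prices quoted above; `overhead_cost_per_turn` is an illustrative name.

```python
def overhead_cost_per_turn(overhead_tokens, input_price_per_million):
    """Cost in USD of resending a fixed block of tokens on one turn."""
    return overhead_tokens * input_price_per_million / 1_000_000


SONNET_INPUT = 3.00        # USD per 1M input tokens
QWEN3_235B_INPUT = 0.071   # USD per 1M input tokens

for label, price in [("Claude Sonnet", SONNET_INPUT), ("Qwen3-235B-A22B-Instruct", QWEN3_235B_INPUT)]:
    bloated = overhead_cost_per_turn(4_000, price)
    trimmed = overhead_cost_per_turn(800, price)
    print(f"{label}: ${bloated:.6f} vs ${trimmed:.6f} per turn")
```

Trimming from 4,000 to 800 tokens cuts this fixed cost by 80 percent on either model, because the overhead scales linearly with token count.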
OpenClaw’s heartbeat is the scheduled check-in that lets an agent run periodic chores: review the inbox, summarize calendar conflicts, watch a memo for new entries. The default cadence on most installs is one minute. Fine for a quick prototype, a tax on any production deployment. A one-minute heartbeat fires roughly 43,000 times a month. Even at a few thousand input tokens per beat on a budget model, that adds up to real spend.
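How much cadence matters is easy to see with a small calculation. The sketch below assumes a 30-day month and a hypothetical 3,000 input tokens per beat, priced at the $0.09 per million input rate listed for Qwen3-30B-A3B. Output tokens are ignored, so treat the result as a lower bound.

```python
MINUTES_PER_MONTH = 60 * 24 * 30  # 43,200, using a 30-day month


def beats_per_month(interval_minutes):
    """Number of heartbeat firings in a 30-day month."""
    return MINUTES_PER_MONTH // interval_minutes


def heartbeat_input_cost(interval_minutes, tokens_per_beat, input_price_per_million):
    """Monthly input cost in USD of a heartbeat that sends a fixed prompt each beat."""
    tokens = beats_per_month(interval_minutes) * tokens_per_beat
    return tokens * input_price_per_million / 1_000_000


for minutes in (1, 5, 60):
    cost = heartbeat_input_cost(minutes, tokens_per_beat=3_000, input_price_per_million=0.09)
    print(f"every {minutes:>2} min: {beats_per_month(minutes):>6} beats, ${cost:.2f}/month input")
```

Cost scales linearly with the number of beats, so moving from a one-minute to a five-minute cadence cuts this line item by a factor of five.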
Match heartbeat cadence to what the agent actually needs to react to. Calendar review every hour, not every minute. Inbox scan every five minutes, not every thirty seconds. Then pin heartbeat-driven agents to a smaller model. At $0.09 per million input tokens, a 1-minute heartbeat on Qwen3-30B-A3B runs about $16 a month in input on 4K-token loops. The same heartbeat on Sonnet at $3 runs over $500. Finally, disable streaming on background tasks. Streaming is for interactive chat. A cron-style heartbeat does not need partial tokens, and the non-streaming path retries cleaner when the network blips.
A separate agents.list.heartbeat entry with its own model, prompt budget, and a short tool whitelist isolates the background cost from your interactive sessions. That single change usually removes the largest share of idle-load spend on a typical OpenClaw bill.
OpenClaw conversation history grows without bound by default. Every reply, every tool result, every observation. By turn 50, a once-thin thread has accumulated 80K to 150K tokens of mostly-irrelevant chatter, and the gateway is paying full input pricing on all of it on every new call. Two mechanisms keep this from quietly eating your budget.
/compact summarizes the older portion of the thread into a few hundred tokens of structured notes, replacing the verbose history with the summary going forward. Run it on long-running threads before they cross 30K tokens of accumulated input. /reset is the harder reset: drop the thread entirely and start fresh, useful when you’ve shifted topic and the old context is dead weight. Both commands move budget back to the user side of the meter, where it belongs.
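The effect of compacting at a threshold can be approximated with a short simulation. Everything here is a simplified model with hypothetical numbers: each turn adds 2,000 tokens of history, the fixed system block is 4,000 tokens, and a compaction replaces the history with a 500-token summary. The cost of the summarizing call itself is not modeled.

```python
def billed_input(turns, per_turn, fixed, threshold=None, summary_tokens=500):
    """Total input tokens billed over a thread, optionally compacting past a threshold."""
    history = 0
    total = 0
    for _ in range(turns):
        if threshold is not None and history > threshold:
            history = summary_tokens  # older turns collapse into a short summary
        total += fixed + history      # the whole state is resent on every call
        history += per_turn
    return total


without = billed_input(turns=50, per_turn=2_000, fixed=4_000)
with_compact = billed_input(turns=50, per_turn=2_000, fixed=4_000, threshold=30_000)
print(f"no compaction: {without:,} tokens")
print(f"compact past 30K: {with_compact:,} tokens")
```

In this toy model, compacting whenever history crosses 30K tokens cuts billed input over 50 turns to roughly a third.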
Prompt caching is the other half. On DeepInfra, Qwen3 Coder Turbo charges $0.30 per million for fresh input tokens and $0.10 per million for cached ones. Qwen3-Max drops from $1.20 to $0.24. The cache key is the prefix of your request, so the parts that do not change across turns (SOUL.md, tool definitions, the first user message in a thread) hit the cache automatically once the model has seen them. For a typical OpenClaw agent firing 30 turns against the same SOUL.md plus tool block, prompt caching alone cuts input spend by roughly 60 percent. DeepInfra handles the cache lifecycle for you. You do not need to configure anything beyond using the model.
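The size of the caching discount depends on what share of each request is a cache hit. A minimal sketch using the Qwen3 Coder Turbo prices above follows; the cached fractions are hypothetical inputs, not measurements.

```python
FRESH_PRICE = 0.30   # USD per 1M fresh input tokens
CACHED_PRICE = 0.10  # USD per 1M cached input tokens


def effective_input_price(cached_fraction):
    """Blended USD price per 1M input tokens for a given cache-hit share."""
    return cached_fraction * CACHED_PRICE + (1 - cached_fraction) * FRESH_PRICE


def input_savings(cached_fraction):
    """Fraction of input spend saved relative to paying the fresh price for everything."""
    return 1 - effective_input_price(cached_fraction) / FRESH_PRICE


for fraction in (0.5, 0.9):
    print(f"{fraction:.0%} cached: ${effective_input_price(fraction):.2f}/1M, "
          f"{input_savings(fraction):.0%} saved")
```

At these prices a 60 percent cut corresponds to about 90 percent of input tokens hitting the cache, which is plausible when a stable prefix dominates each request.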
When even DeepInfra pricing is more than a workload justifies, there are two escape hatches. The first is Ollama running a quantized open-weight model on your local box. Point a custom OpenClaw provider at http://localhost:11434/v1, set the API key to anything, and the same agents.list overrides route experimental agents to your laptop for zero marginal cost. Throughput is slower and the context window is smaller, but for low-stakes background agents the tradeoff often makes sense.
The second is a free-tier cloud provider for true zero-volume use. Google AI Studio’s Gemini tier and similar offerings give you a daily request budget that, paired with a small heartbeat agent, costs nothing in steady state. Treat these as fallbacks for non-critical agents, not as the default. Reliability and rate-limit behavior differ from a paid provider, and OpenClaw’s failover chain in models.fallback is where you wire that in.
Real numbers help calibrate expectations. The table below sketches three OpenClaw workloads with conservative token estimates and current DeepInfra pricing. The “Sonnet baseline” column is what the same workload costs on a default Anthropic configuration. The “Tiered on DeepInfra” column uses Qwen3 Coder Turbo for the primary agent, Qwen3-30B-A3B for sub-agents and heartbeat, with prompt caching active.
| Workload | Interactive turns / day | Heartbeat cadence | Sonnet baseline | Tiered on DeepInfra |
|---|---|---|---|---|
| Personal assistant, light use | 30 | 30 min | $42/mo | $3.20/mo |
| Power user, daily ops | 120 | 5 min | $180/mo | $14/mo |
| Team agent, always-on | 400 | 1 min | $620/mo | $58/mo |
The shape is consistent across all three. A tiered open-weight setup on DeepInfra lands roughly 11 to 13 times cheaper than the closed-API baseline. Within that, caching takes 30 to 60 percent off input spend once the SOUL.md and tool block are warm. That is how the 90-percent reduction shows up on a real bill. The savings come from cheaper input pricing, tiered routing, pruned overhead, throttled heartbeats, and cache hits stacking on every turn.
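The ratios can be read straight off the table. This short sketch recomputes them from the monthly figures listed there; `compare` is an illustrative helper.

```python
# (Sonnet baseline, tiered on DeepInfra) in USD per month, from the table.
WORKLOADS = {
    "Personal assistant, light use": (42.00, 3.20),
    "Power user, daily ops": (180.00, 14.00),
    "Team agent, always-on": (620.00, 58.00),
}


def compare(baseline, tiered):
    """Return (how many times cheaper, fractional reduction) for one workload."""
    return baseline / tiered, 1 - tiered / baseline


for name, (baseline, tiered) in WORKLOADS.items():
    ratio, reduction = compare(baseline, tiered)
    print(f"{name}: {ratio:.1f}x cheaper, {reduction:.0%} lower")
```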
If you want the cost floor without rewriting your OpenClaw setup, register DeepInfra as a custom provider, pin your primary agent to Qwen3 Coder or Qwen3-235B-A22B-Instruct, and drop sub-agents and heartbeats to Qwen3-30B-A3B. Pay-as-you-go, no minimum spend, OpenAI-compatible API so your existing OpenClaw config works after a single block change.
Use the tiering config above, run openclaw doctor to confirm the provider, and watch per-task spend for a week. Most users see a 90 percent input-cost cut in the first billing cycle.
Questions or a workload that does not fit the patterns above? Reach us at feedback@deepinfra.com, join our Discord at discord.gg/deepinfra, or find us on X at @DeepInfra.