
OpenClaw has more than 350,000 GitHub stars and a skill marketplace with over 44,000 community contributions. That kind of adoption doesn't happen by accident. Still, the same teams running it in production keep running into the same complaint: the model list is fixed.
OpenClaw’s guided setup wizard covers OpenAI, Anthropic, Google, DeepSeek, and local Ollama. You can point it at any OpenAI-compatible endpoint through the config file, but that requires manual editing rather than the wizard flow. The cost case for open-weight models is real regardless: proprietary APIs charge output rates that stack up fast when your agent is hammering tools across a long context window, and you have no control over which model version the provider decides to serve on any given day. Teams that want finer control over cost and model pinning end up in the config file either way.
Open-weight models have closed the gap. GLM-5.1, Qwen3.5-397B, and Step-3.5-Flash are competitive with proprietary alternatives on coding and reasoning tasks, and DeepInfra’s OpenAI-compatible API means you can swap between them without touching your application code. The frameworks in this roundup make that configuration the default rather than an afterthought.
Security is the other reason teams start looking elsewhere. OpenClaw currently has over 400 open issues tagged as security-related, and the codebase is large enough that auditing community skills is genuinely difficult for most teams. The three frameworks below are either architecturally leaner, sandboxed at the OS level, or both.
For teams treating their agent as long-term infrastructure, the real question is which open source agent framework gives you control over the model layer, not just the messaging layer. That’s what separates the options below from OpenClaw’s defaults.
Not every replacement solves the same problem. Four things matter most for production use.
OpenAI-compatible endpoint support. If a framework lets you set a custom base_url, you can point it at DeepInfra’s API and switch between open-weight models without touching agent code. Frameworks that hardcode provider routing remove that option entirely. This single feature determines whether your model choice is yours or the framework’s.
Memory architecture. OpenClaw uses flat files. That works for one user, but falls apart under concurrent sessions or when you want semantic search across past interactions. Frameworks with multi-level memory or vector search become more capable the longer you run them. A well-designed memory layer is what separates a one-off tool from something worth building workflows around.
Platform coverage. If your agent needs to respond across Telegram, Slack, Discord, and a CLI at the same time, the framework should handle that natively. Bolting on adapters is a maintenance problem you don’t need.
Total cost of ownership. The API bill usually dominates. Architectures that use smaller, faster models for routing and memory retrieval and then send complex reasoning to a frontier model keep costs predictable. Whether that routing is configurable or baked in matters more than the platform fee.
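To make that routing pattern concrete, here is a minimal sketch against DeepInfra's OpenAI-compatible API. The length threshold is a toy heuristic, and the Step-3.5-Flash model ID is an assumption; confirm the exact slug on the DeepInfra models page.

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPINFRA_API_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

CHEAP = "stepfun-ai/Step-3.5-Flash"  # assumed model ID; verify on DeepInfra
FRONTIER = "zai-org/GLM-5.1"

def route(prompt: str) -> str:
    # Toy heuristic: short prompts go to the cheap model, anything longer
    # gets the frontier model. Real routers key on task type, not length.
    model = CHEAP if len(prompt) < 500 else FRONTIER
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```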
Hermes Agent is an open-source autonomous agent from Nous Research. It crossed 33,000 GitHub stars within weeks of launching in February 2026. The draw was a single capability: it gets better at your specific tasks over time.
After completing a complex task, Hermes’s reflection module extracts a reusable skill and writes it to ~/.hermes/skills/. The next time a similar task comes in, the agent retrieves that skill and runs it without an API call. Repeat a class of task daily for a few weeks, and those calls stop happening entirely. That’s a real reduction in costs for any agent running long-context workflows.
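The short-circuit logic is easy to picture. The sketch below is hypothetical, assuming one executable file per skill under ~/.hermes/skills/ and a naive name match; Hermes's actual retrieval is more involved.

```python
import subprocess
from pathlib import Path

SKILLS_DIR = Path.home() / ".hermes" / "skills"

def find_skill(task: str) -> Path | None:
    # Naive match: reuse a stored skill whose name appears in the task text.
    for skill in SKILLS_DIR.glob("*.py"):
        if skill.stem.replace("_", " ") in task.lower():
            return skill
    return None

skill = find_skill("summarize weekly metrics report")
if skill:
    subprocess.run(["python", str(skill)], check=True)  # replay locally, no API call
else:
    pass  # no match: fall through to a model completion
```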
Memory works in three layers. Short-term context is in-session. Medium-term notes get compressed by a background job on a schedule. Long-term memories go into a ChromaDB vector store that Hermes queries with weighted hybrid search, combining semantic similarity with keyword matching. You can ask it to summarize something you shared three weeks ago and it will find it.
That layered design is what makes Hermes work as a persistent memory agent rather than a stateless assistant. Each session builds on the last. For self-hosted AI agent deployments where continuity matters, the ChromaDB dependency is worth the setup overhead.
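Here is a minimal sketch of weighted hybrid retrieval over ChromaDB, approximating the long-term layer described above. The store path, collection name, sample document, and 0.7/0.3 weights are assumptions, not Hermes's documented internals.

```python
import chromadb

# On-disk store; path and collection name are illustrative.
client = chromadb.PersistentClient(path="hermes-memory")
memories = client.get_or_create_collection("long_term")
memories.add(
    ids=["m1"],
    documents=["Q3 churn report: churn fell after the pricing change."],
)

query = "the churn report shared three weeks ago"

# Pass 1: semantic nearest neighbors by embedding distance.
semantic = memories.query(query_texts=[query], n_results=5)

# Pass 2: same query, restricted to documents containing an exact term.
keyword = memories.query(
    query_texts=[query],
    n_results=5,
    where_document={"$contains": "churn"},
)

# Weighted merge: semantic similarity dominates, exact-term hits get a boost.
scores: dict[str, float] = {}
for result, weight in ((semantic, 0.7), (keyword, 0.3)):
    for doc_id, dist in zip(result["ids"][0], result["distances"][0]):
        scores[doc_id] = scores.get(doc_id, 0.0) + weight / (1.0 + dist)

best = max(scores, key=scores.get)
```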
Model configuration is straightforward. The config file at ~/.hermes/cli-config.yaml takes a base_url field that replaces the provider entirely. Set it once, and Hermes calls whatever OpenAI-compatible endpoint you point it at, with your API key. Wiring it to DeepInfra:
```yaml
model:
  provider: custom
  base_url: "https://api.deepinfra.com/v1/openai"
  api_key: "your-deepinfra-api-key"
  model: "zai-org/GLM-5.1"
```

Verify the connection before you hand it to the agent:
```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPINFRA_API_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)
resp = client.chat.completions.create(
    model="zai-org/GLM-5.1",
    messages=[{"role": "user", "content": "Reply with one word: ready"}],
)
print(resp.choices[0].message.content)
```

GLM-5.1 is a good default here. Its 198K context window fits the agent's full memory-plus-task payload without truncation. For lighter workloads, Step-3.5-Flash at $0.10 per million input tokens handles quick-response tasks at a fraction of the cost.
Strengths: self-improving skill system, multi-level memory with semantic search, six messaging platforms, zero telemetry, MIT license.
Limitations: ChromaDB is a required dependency for full memory support, which adds complexity to your deployment. Setup takes more work than OpenClaw’s Docker path.
Cost: free platform. A $5 to $10 per month VPS covers most personal or small-team deployments. GLM-5.1 is $1.40 per million input tokens on DeepInfra. Step-3.5-Flash is $0.10 per million for lighter workloads.
ZeroClaw is a 3.4MB Rust binary. Startup is under 10 milliseconds. Idle RAM is under 5MB. If you’re running an agent on a $4 per month VPS, a Raspberry Pi, or inside a container where memory is tight, the overhead from ChromaDB and a Python runtime adds up. ZeroClaw skips all of that.
The project has passed 20,000 GitHub stars since its February 2026 release, a fast climb for a framework this young. It positions itself as a model-agnostic AI assistant. Provider changes are a config edit, not a code change, and the binary doesn't care whether it's talking to OpenAI, a local Ollama instance, or a custom inference endpoint.
Memory without external dependencies. ZeroClaw uses a custom SQLite-based vector store with FTS5 keyword search and weighted hybrid retrieval. No ChromaDB. No separate process to babysit. Memories sit in a single file on disk, which makes backup a copy command and migration a matter of moving one file.
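The keyword half of that design takes only a few lines to illustrate. This Python sketch shows the single-file FTS5 idea; ZeroClaw's actual schema lives in Rust and isn't published here, and the vector half would be a second embedding column merged with the same weighted approach.

```python
import sqlite3

# One file on disk is the whole memory store.
db = sqlite3.connect("memory.db")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS memories USING fts5(content)")
db.execute(
    "INSERT INTO memories(content) VALUES (?)",
    ("Deployed the staging agent to the Raspberry Pi on Tuesday",),
)
db.commit()

# BM25-ranked keyword retrieval, built into SQLite's FTS5 extension.
rows = db.execute(
    "SELECT content FROM memories WHERE memories MATCH ? ORDER BY rank LIMIT 3",
    ("staging",),
).fetchall()
print(rows)
```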
Model configuration uses the custom: prefix in ZeroClaw’s TOML config at ~/.zeroclaw/config.toml:
```toml
default_provider = "custom:https://api.deepinfra.com/v1/openai"
api_key = "your-deepinfra-api-key"
default_model = "Qwen/Qwen3.5-35B-A3B"
```

Qwen3.5-35B-A3B fits this deployment profile well. At $0.20 per million input tokens, it's a 35B-parameter MoE model with only 3B active per token. You get solid reasoning at a throughput profile that makes sense for edge hardware.
Strengths: sub-10ms startup, <5MB RAM, no external vector DB, 20,000+ GitHub stars, MIT and Apache 2.0 dual license, 28+ built-in providers.
Limitations: newer than OpenClaw and Hermes Agent, so the community skill ecosystem is thinner. The CLI-first interface requires adjustment if you’re used to a GUI-driven workflow.
Cost: free platform. Runs on a $4 to $6 per month VPS. API costs start at $0.20 per million input tokens with Qwen3.5-35B-A3B on DeepInfra.
NemoClaw is NVIDIA’s reference stack for running OpenClaw in production environments where security review is part of the process. Jensen Huang announced it on March 16, 2026. The main addition is OS-level sandboxing that the other frameworks on this list don’t have.
The mechanism is OpenShell. When you run nemoclaw onboard, NemoClaw builds an isolated container for the OpenClaw runtime and routes all inference calls through OpenShell’s managed proxy. Agent code doesn’t touch your network interface directly. For teams with compliance requirements, that’s auditable architecture, not just a policy document.
Enterprise-scale agent deployment has a different set of problems than running a personal assistant on a VPS. Inference calls can leave the container boundary, community skills execute with network access, and there’s no audit trail for model requests by default. NemoClaw’s managed proxy intercepts all three before they reach the network.
Inference routing defaults to NVIDIA’s NIM APIs at https://integrate.api.nvidia.com/v1, which includes Nemotron-3-Super-120B-A12B. The config accepts a base URL override if you want a different provider. DeepInfra carries Nemotron-3-Super-120B-A12B at $0.10 per million input tokens and $0.50 per million output, which is an option if you want the model without locking into NVIDIA’s own inference stack.
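Before committing to the override, you can smoke-test the DeepInfra route with the same OpenAI-compatible client used earlier. The Nemotron model ID below is an assumed slug; confirm the exact name on the DeepInfra models page.

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPINFRA_API_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)
resp = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super-120B-A12B",  # assumed slug; verify first
    messages=[{"role": "user", "content": "Reply with one word: ready"}],
)
print(resp.choices[0].message.content)
```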
Strengths: OS-level sandboxing, managed inference proxy, built-in safety guardrails, NVIDIA enterprise backing, inherits OpenClaw’s 44,000+ skill ecosystem.
Limitations: still in early preview as of April 2026, so the config API may shift. Requires an environment that supports NVIDIA OpenShell, which rules out most low-cost VPS options. Not a lightweight choice.
Cost: open source. Infrastructure cost depends on your existing NVIDIA setup. For teams already running NVIDIA hardware or cloud instances, the platform itself adds no cost.
| | OpenClaw | Hermes Agent | ZeroClaw | NemoClaw |
|---|---|---|---|---|
| GitHub Stars | 353k | 33k | 21k | Preview |
| Custom Endpoint | Limited | Yes (base_url) | Yes (custom:) | Yes |
| Memory | File-based | Multi-level + ChromaDB | SQLite + FTS5 | Inherited |
| Sandboxing | No | No | No | OpenShell |
| Min. VPS Cost | $5/mo | $5/mo | $4/mo | NVIDIA stack |
| Best For | General | Power users | Edge / constrained | Enterprise |
Three different problems, three different answers.
Go with Hermes Agent if your workflows are repetitive and complex. The skill system earns its value when you run the same category of task every day: research pipelines, code review loops, document processing chains. Back it with GLM-5.1 on DeepInfra ($1.40 per million input tokens, 198K context) for reasoning-heavy work, or drop to Step-3.5-Flash ($0.10 per million) for the fast-response parts of your pipeline. Changing models is one line in the config, so you can dial in cost-to-quality as your usage patterns become clear.
Go with ZeroClaw if deployment constraints are the bottleneck. The 3.4MB Rust binary starts before most containers are even initialized. SQLite memory means no external services to provision. Qwen3.5-35B-A3B on DeepInfra at $0.20 per million input tokens gives you a capable MoE model at a price that fits the deployment profile. For teams building internal tools or edge infrastructure, ZeroClaw is the option where the agent runtime itself stops being a concern.
Go with NemoClaw if your team needs security sign-off before any agent ships. OpenShell sandboxing and managed inference routing give compliance teams something real to review. Nemotron-3-Super-120B-A12B is on DeepInfra at $0.10 per million input tokens if you want the model without committing to NVIDIA’s inference infrastructure.
All three support DeepInfra’s OpenAI-compatible API. You can move between frameworks without rewriting your model integration layer. The model ID and the base URL are the only things that change. Pick the one that fits your current constraint, then change the model backend when your needs shift. DeepInfra’s pay-as-you-go pricing means you’re not locked into a spend tier when you do.
Visit the DeepInfra models page to compare GLM-5.1, Step-3.5-Flash, Qwen3.5-35B-A3B, and Nemotron-3-Super across context window, pricing, and benchmark scores. Framework questions or model recommendations: email feedback@deepinfra.com, connect with the community on Discord, or follow @DeepInfra on X.