
Best Models for OpenClaw: Top Picks for Agentic Workloads
Published on 2026.04.28 by DeepInfra

When you configure OpenClaw for the first time, the model picker looks like a minor config detail. It isn’t. The model you connect decides whether your agents complete tasks reliably or fall apart halfway through a multi-step workflow. It sets what you pay per completed job, not just per token. And it determines whether your SOUL.md instructions hold up over a long session or quietly stop being respected as context fills up.

OpenClaw is a local-first autonomous AI agent that connects messaging platforms (WhatsApp, Telegram, Discord, and others) to any LLM provider via an OpenAI-compatible API. Peter Steinberger published it in late 2025 under the name Clawdbot and it crossed 100,000 GitHub stars in February 2026. Because it supports custom providers, you’re not locked into the standard cloud options. Any model served at an OpenAI-compatible endpoint is a candidate.

That opens up a much wider field than most guides cover. Open-weight models on inference providers like DeepInfra regularly beat proprietary options at a fraction of the cost for agentic workloads. Yet most articles still default to Claude and GPT-4o as though nothing else exists.

This guide covers the best models for OpenClaw as of mid-2026: real pricing, concrete tradeoffs, a comparison table, a Python example, and an example OpenClaw config. The picks are weighted on three things that actually matter for agents: reliable tool-calling, instruction adherence over long sessions, and context window behavior when SOUL.md, memory files, and task history are all adding up.

How We Evaluated: Three Constraints That Actually Matter

Price-per-million-tokens is the number every comparison article leads with. It’s also the least useful number for OpenClaw workloads. Agents don’t send one short prompt and stop. A single task usually involves a system prompt, accumulated memory from MEMORY.md, tool call payloads, and several back-and-forth turns before a result lands. Token count per completed task runs an order of magnitude higher than chatbot benchmarks suggest.

Here’s what to evaluate when choosing a model for OpenClaw.

Tool-calling accuracy. OpenClaw relies on structured function calls to trigger skills and talk to external services. A model that occasionally returns malformed JSON or drops required fields forces a retry loop. That retry doubles the token spend and degrades the user experience. Look for models with a strong track record of tool-calling accuracy on the first attempt, not models that merely claim to support the feature.
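To make the retry cost concrete, here is a minimal sketch of that validation-and-retry loop against DeepInfra's OpenAI-compatible API. The helper name and the one-retry budget are illustrative, not part of OpenClaw:

import json
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPINFRA_API_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

def complete_with_tool_validation(messages, tools, model, max_retries=1):
    # Retry when tool-call arguments fail to parse; each retry roughly
    # doubles the token spend for the turn.
    for _ in range(max_retries + 1):
        response = client.chat.completions.create(
            model=model, messages=messages, tools=tools, tool_choice="auto"
        )
        message = response.choices[0].message
        if not message.tool_calls:
            return message  # plain text reply, nothing to validate
        try:
            for call in message.tool_calls:
                json.loads(call.function.arguments)  # arguments arrive as a JSON string
            return message
        except json.JSONDecodeError:
            continue
    raise RuntimeError("tool-call JSON was malformed on every attempt")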

Instruction adherence over long sessions. SOUL.md is OpenClaw’s behavioral contract. As context fills with message history, weaker models start ignoring instructions they followed cleanly at turn one. Weight models that hold defined personas and behavioral rules at 50K tokens the same way they do at 5K.

Context retention. This is different from having a large context window. A model can accept 128K tokens and still lose track of constraints set earlier in the session. Poor context retention quickly turns into forgotten tool schemas, dropped behavioral rules from SOUL.md, or agents that repeat steps they already completed. The context window spec tells you the ceiling; community reports on long-session behavior tell you whether the model actually uses it.

Cost per completed task. A rough calculation for a representative OpenClaw task (2,000-token system prompt, three tool call round trips at 500 tokens each, 300-token final response) gives a more honest comparison than raw per-token pricing, especially with MoE models that activate fewer parameters per token.
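A sketch of that calculation, used for the per-task estimates later in this guide. The convention is an assumption: the system prompt is resent on every call and tool results count as input, while the model's own tool calls and the final reply count as output:

def task_cost(input_rate, output_rate, input_tokens, output_tokens):
    # Dollars per completed task, given $-per-million-token rates.
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000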

Best Models for OpenClaw: Kimi K2.5 Takes the Top General Spot

Kimi K2.5 is the strongest all-around pick for OpenClaw right now. It’s a 1-trillion-parameter mixture-of-experts model from Moonshot AI, with 32B parameters active per forward pass. That architecture buys you frontier-level reasoning at inference costs well below comparably capable dense models. (The older moonshotai/Kimi-K2-Instruct ID still resolves on DeepInfra but is redirected to K2.5, so pin the new ID in config to avoid surprises.)

Three things make it stand out for OpenClaw:

  • Tool calls come back clean and well-structured across extended sessions, even when the schema is complex or the call sits inside a multi-step plan.
  • The 256K context window holds up toward the tail end, which matters once your OpenClaw instance has weeks of conversation history, a thick MEMORY.md, and multiple concurrent task threads stacking up.
  • Instruction adherence stays consistent. A behavioral contract in SOUL.md at turn one is still being followed at turn forty, without the gradual scope drift you see in less instruction-tuned models.

Community data backs the pick. Kimi K2.5 topped the pricepertoken.com OpenClaw leaderboard in April 2026 by developer vote. That pattern also shows up when you look at which models appear in production OpenClaw deployments rather than just benchmark tables.

Kimi K2.5’s pricing on DeepInfra is $0.45 per million input tokens and $2.25 per million output tokens, with cached input at $0.07 per million. For the representative task scenario (2,000-token system prompt, three tool call round trips at 500 tokens each, 300-token response), that works out to roughly $0.003 to $0.005 per completed task depending on how much context accumulates across turns and whether the static system prompt is hitting the cache. Not the cheapest in this guide, but a reasonable price when completion rate matters more than shaving fractions of a cent per call.
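Plugging those rates into the task_cost sketch from the evaluation section reproduces the upper end of that range. The token totals here are assumptions, not measurements:

# ~6K input tokens (resent system prompt, tool payloads, accumulated turns);
# ~600 output tokens (three short tool calls plus the 300-token final reply).
print(task_cost(0.45, 2.25, 6_000, 600))  # ≈ $0.0041, inside the $0.003–$0.005 range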

On DeepInfra, Kimi K2.5 is available via the OpenAI-compatible API. Setup is just an API key:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPINFRA_API_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[
        {
            "role": "system",
            "content": "You are an OpenClaw agent. Follow all SOUL.md instructions precisely.",
        },
        {"role": "user", "content": "Research the top 3 competitors for Acme Corp and summarize findings."},
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "web_search",
                "description": "Search the web for current information",
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            },
        }
    ],
    tool_choice="auto",
)

print(response.choices[0].message)

Point base_url at DeepInfra and set model to moonshotai/Kimi-K2.5. If you’re migrating from an existing OpenAI provider config, no other changes are needed. The API surface is identical.

Best for Cost-Conscious Deployments: DeepSeek-V3-0324

If you’re running OpenClaw at volume (a small-business automation handling dozens of daily users, or a dev environment where you reset agents constantly), DeepSeek-V3-0324 makes the strongest cost case in this tier.

It’s a 671B MoE model with 37B active parameters per token. At $0.20 per million input tokens and $0.77 per million output tokens on DeepInfra, its input price is less than half of Kimi K2.5’s. DeepInfra supports context caching on this model too, dropping cached input tokens to $0.135 per million. For OpenClaw deployments where the system prompt and memory context stay mostly static across calls, that caching discount adds up quickly.
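To put a rough number on the caching discount, assume a 12K-token static prefix (system prompt plus mostly-static MEMORY.md) and 2,000 calls per day; both figures are assumptions for illustration:

static_prefix_tokens = 12_000  # assumed: system prompt plus mostly-static memory context
calls_per_day = 2_000          # assumed: several calls per task at volume
# Cached input bills at $0.135/M instead of the full $0.20/M rate.
daily_savings = static_prefix_tokens * calls_per_day * (0.20 - 0.135) / 1_000_000
print(f"${daily_savings:.2f}/day")  # $1.56/day, roughly $47/month, scaling linearly with volume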

Reasoning and instruction following are solid. DeepSeek-V3-0324 handles SOUL.md compliance well, runs multi-step plans reliably, and returns clean tool calls in most cases. Where it trails Kimi K2.5 is at the far end of a long context window and in complex nested tool-calling chains. For straightforward agentic workflows, those edge cases rarely come up.

The practical call: start with Kimi K2.5 as your primary model, then test DeepSeek-V3-0324 against your actual task distribution. If your agents mostly do research, summarization, and light automation rather than deeply nested multi-step orchestration, the cost savings are likely worth the capability gap.
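A sketch of that side-by-side test: run the same prompts through both models and count how often each produces a usable tool call on the first attempt. The task list, tool schema, and success criterion are placeholders for your own:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPINFRA_API_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

TASKS = [
    "Summarize the three most recent support tickets.",
    "Find this week's top story about open-weight LLMs.",
]
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web for current information",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }
]

def first_try_tool_rate(model):
    # Fraction of tasks where the model emits a tool call on the first attempt.
    hits = sum(
        bool(
            client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": task}],
                tools=TOOLS,
                tool_choice="auto",
            ).choices[0].message.tool_calls
        )
        for task in TASKS
    )
    return hits / len(TASKS)

for model in ("moonshotai/Kimi-K2.5", "deepseek-ai/DeepSeek-V3-0324"):
    print(model, first_try_tool_rate(model))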

Best for Coding Agents: Qwen3 Coder 480B A35B

If your OpenClaw instance handles code generation, pull request review, debugging, or anything that ends with writing or editing source files, Qwen3 Coder 480B A35B is the right pick. It’s a 480B MoE model from Alibaba’s Qwen team, with 35B parameters active per token and a native 256K context window. It scores 69.6% on SWE-bench Verified without test-time scaling, leading the open-weight tier.

SWE-bench measures the ability to resolve real GitHub issues in real codebases. That benchmark maps closely to what code-focused OpenClaw agents actually do. The gap between Qwen3 Coder 480B A35B and the next open-weight contender is real, not rounding error.

Pricing on DeepInfra is $0.40 per million input tokens and $1.60 per million output tokens. For a coding agent task (larger system prompt, more tool call round trips for file reads and edits), expect roughly $0.009 to $0.014 per completed task depending on output length.

One tradeoff to name upfront: Qwen3 Coder 480B A35B is optimized hard for code tasks. Instruction adherence on non-coding work and conversational coherence over long sessions are slightly weaker than Kimi K2.5’s. If your OpenClaw agent is a generalist that handles code occasionally, Kimi K2.5 is the better default. If coding is the primary workload, Qwen3 Coder 480B A35B is the stronger choice.

Best Cost Floor for Simple Routing: Llama 3.1 70B Instruct

Llama 3.1 70B Instruct is the cost floor in this guide, and that’s the right frame for evaluating it. At $0.40 per million tokens for both input and output, it’s the only model here where input and output pricing are equal. For OpenClaw workflows that live at the low end of complexity, that flat pricing matters.

Where it fits best is intent classification, simple dispatching, and single-turn summarization tasks where the agent doesn’t need to hold a multi-step plan or maintain state across a long session. An OpenClaw instance that routes incoming messages to the right skill, or generates short notification summaries from structured data, rarely needs anything more capable. The model handles those cases cleanly.
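A minimal router sketch along those lines. The intent labels and the fallback choice are illustrative, not an OpenClaw convention:

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPINFRA_API_TOKEN"],
    base_url="https://api.deepinfra.com/v1/openai",
)

INTENTS = {"research", "code", "notify"}

def route(message: str) -> str:
    # Single cheap classification call; downstream skills do the real work.
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-70B-Instruct",
        messages=[
            {
                "role": "system",
                "content": "Classify the user message as exactly one of: research, code, notify. Reply with the label only.",
            },
            {"role": "user", "content": message},
        ],
        max_tokens=5,
        temperature=0,
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in INTENTS else "research"  # safe fallback label

print(route("Can you review the PR I just opened?"))  # expected: code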

The 128K context window is smaller than Kimi K2.5’s or Qwen3 Coder’s, and instruction adherence over long sessions is weaker. SOUL.md compliance tends to drift earlier than it does on the larger models. For complex tool-calling chains or extended agentic workflows, that limitation shows up quickly and the cost savings stop making sense.

The practical use case is twofold: development and staging environments where you’re testing agent logic and don’t want to burn budget on Kimi K2.5 inference, or production pipelines where the routing layer is cleanly separated from the reasoning layer. Llama 3.1 70B Instruct handles the routing; Kimi K2.5 or Qwen3 Coder handles what the routing points to.

Best Models for OpenClaw: Side-by-Side Comparison

All four are available on DeepInfra via the OpenAI-compatible API.

Model | Context | Input $/1M | Output $/1M | Best for
--- | --- | --- | --- | ---
Kimi K2.5 | 256K | $0.45 | $2.25 | General agents, long sessions, complex tool use
DeepSeek-V3-0324 | 160K | $0.20 | $0.77 | High-volume automation, cost-sensitive workloads
Qwen3 Coder 480B A35B | 256K | $0.40 | $1.60 | Coding agents, PR review, file-editing workflows
Llama 3.1 70B Instruct | 128K | $0.40 | $0.40 | Simple task routing, high-volume low-complexity

Picking the Right Model for Your OpenClaw Use Case

The table is a useful reference, but the decision is simpler in practice.

If your agent handles research, summarization, or multi-step coordination across tools: Use Kimi K2.5. Long context and consistent instruction adherence give you the most headroom as your SOUL.md grows and your agent builds memory across sessions. It’s the safe default for many OpenClaw setups.

If you’re running high-frequency automation at volume: Try DeepSeek-V3-0324 first. The cost gap relative to Kimi K2.5 is significant when you’re processing hundreds of tasks per day. Context caching on DeepInfra cuts system prompt costs further as usage scales. Run both models against the same task set and compare completion rates before you commit.

If your OpenClaw agents write, review, or modify code: Qwen3 Coder 480B A35B is the clear pick. Nothing in the open-weight tier comes close on SWE-bench, and that score translates directly to real performance on file-editing and PR review tasks.

If you’re in early development or want a cost floor for simple routing: Llama 3.1 70B Instruct at $0.40 per million tokens (input and output) handles intent classification, simple dispatching, and low-stakes summarization without spending on capability you don’t need.

One thing to remember is that OpenClaw supports swapping models per agent or per task type within the same install. You don’t have to pick one model for everything. This is especially useful for multi-agent pipelines where different agents have different workloads. A practical setup routes research and planning to Kimi K2.5, code generation to Qwen3 Coder 480B A35B, and high-volume notification summaries to Llama 3.1 70B, all off a single DeepInfra API key.
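A sketch of what that per-agent split can look like in openclaw.json, mirroring the defaults syntax in the config shown in the next section. The named agent keys ("coder", "notifier") are assumptions for illustration; check your OpenClaw version's docs for the exact shape:

"agents": {
  "defaults": {
    "model": { "primary": "deepinfra/moonshotai/Kimi-K2.5" }
  },
  "coder": {
    "model": { "primary": "deepinfra/Qwen/Qwen3-Coder-480B-A35B-Instruct" }
  },
  "notifier": {
    "model": { "primary": "deepinfra/meta-llama/Meta-Llama-3.1-70B-Instruct" }
  }
}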

Getting Started on DeepInfra

Adding any of these models to OpenClaw takes about five minutes with a DeepInfra account. Create an API key in the DeepInfra dashboard, then register DeepInfra as an OpenClaw custom provider by adding the following block to ~/.openclaw/openclaw.json:

{
  "models": {
    "mode": "merge",
    "providers": {
      "deepinfra": {
        "baseUrl": "https://api.deepinfra.com/v1/openai",
        "apiKey": "${DEEPINFRA_API_TOKEN}",
        "api": "openai-completions",
        "models": [
          {
            "id": "moonshotai/Kimi-K2.5",
            "name": "Kimi K2.5",
            "reasoning": false,
            "input": ["text"],
            "cost": { "input": 0.45, "output": 2.25, "cacheRead": 0.07, "cacheWrite": 0 },
            "contextWindow": 262144,
            "maxTokens": 8192
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": { "primary": "deepinfra/moonshotai/Kimi-K2.5" }
    }
  }
}

"mode": "merge" keeps existing providers intact instead of replacing them. The cost block is what OpenClaw reads for its in-app token accounting, so fill it in per model rather than leaving zeros. Swap the id, name, and cost fields for whichever model fits your workload: deepseek-ai/DeepSeek-V3-0324, Qwen/Qwen3-Coder-480B-A35B-Instruct, or meta-llama/Meta-Llama-3.1-70B-Instruct. All four run on DeepInfra’s pay-as-you-go plan with no minimum spend, no contracts, and zero data retention on inference requests.
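For example, a DeepSeek-V3-0324 entry following the same schema would look like the block below. The cost figures come from the comparison table above; maxTokens is an assumption, so match it to your workload:

{
  "id": "deepseek-ai/DeepSeek-V3-0324",
  "name": "DeepSeek V3 0324",
  "reasoning": false,
  "input": ["text"],
  "cost": { "input": 0.20, "output": 0.77, "cacheRead": 0.135, "cacheWrite": 0 },
  "contextWindow": 163840,
  "maxTokens": 8192
}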

DeepInfra runs on bare-metal infrastructure in US data centers with SOC 2 and ISO 27001 compliance. That matters if you’re connecting OpenClaw to email accounts, calendars, or CRM data.

Current pricing and specs are listed on each model’s page on DeepInfra.

If any of these models fit one of your workflows, the fastest path forward is signing up at DeepInfra: no contracts, no minimum spend, and you can send your first request within five minutes. Questions or model tradeoff discussions: reach out at feedback@deepinfra.com, join the community on Discord, or find us on X at @DeepInfra.
