NVIDIA Nemotron 3 Super - blazing-fast agentic AI, ready to deploy today!

Did you just finetune your favorite model and are wondering where to run it? Well, we have you covered. Simple API and predictable pricing.
Use a private repo, if you wish, we don't mind. Create a hf access token just for the repo for better security.
You can use the Web UI to create a new deployment.
We also offer HTTP API:
curl -X POST https://api.deepinfra.com/deploy/llm -d '{
"model_name": "test-model",
"gpu": "A100-80GB",
"num_gpus": 2,
"max_batch_size": 64,
"hf": {
"repo": "meta-llama/Llama-2-7b-chat-hf"
},
"settings": {
"min_instances": 1,
"max_instances": 1,
}
}' -H 'Content-Type: application/json' \
-H "Authorization: Bearer YOUR_API_KEY"
curl -X POST \
-d '{"input": "Hello"}' \
-H 'Content-Type: application/json' \
-H "Authorization: Bearer YOUR_API_KEY" \
'https://api.deepinfra.com/v1/inference/github-username/di-model-name'
For in depth tutorial check Custom LLM Docs.
Nemotron 3 Nano Explained: NVIDIA’s Efficient Small LLM and Why It Matters<p>The open-source LLM space has exploded with models competing across size, efficiency, and reasoning capability. But while frontier models dominate headlines with enormous parameter counts, a different category has quietly become essential for real-world deployment: small yet high-performance models optimized for edge devices, private on-prem systems, and cost-sensitive applications. NVIDIA’s Nemotron family brings together open […]</p>
DeepSeek V3.2 API Benchmarks: Latency, Throughput & Cost<p>About DeepSeek V3.2 DeepSeek V3.2 is a state-of-the-art large language model that unifies conversational speed and deep reasoning in a single 685B parameter Mixture of Experts (MoE) architecture with 37B parameters activated per token. It is built around three key technical breakthroughs: DeepSeek V3.2 achieved gold-medal performance in the 2025 International Mathematical Olympiad (IMO) and […]</p>
© 2026 Deep Infra. All rights reserved.