Deploy Custom LLMs on DeepInfra
Published on 2024.03.01 by Iskren Chernev

Did you just fine-tune your favorite model and are wondering where to run it? Well, we have you covered: a simple API and predictable pricing.

Put your model on Hugging Face

Use a private repo if you wish; we don't mind. For better security, create a Hugging Face access token scoped to just that repo.
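
If your fine-tuned weights are still on your machine, here is a minimal sketch of uploading them with the huggingface-cli tool that ships with the huggingface_hub package. The repo name, local path, and token are placeholders; double-check the flags against the current CLI docs.

# install the Hugging Face CLI (part of the huggingface_hub package)
pip install -U "huggingface_hub[cli]"

# upload ./my-finetuned-model; creates the repo as private if it doesn't exist
huggingface-cli upload your-username/my-finetuned-model ./my-finetuned-model . \
    --repo-type model \
    --private \
    --token hf_YOUR_SCOPED_TOKEN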

Create custom deployment

Via Web

You can use the Web UI to create a new deployment.

[Screenshot: Custom LLM Web UI]

Via HTTP

We also offer an HTTP API:

curl -X POST https://api.deepinfra.com/deploy/llm -d '{
    "model_name": "test-model",
    "gpu": "A100-80GB",
    "num_gpus": 2,
    "max_batch_size": 64,
    "hf": {
        "repo": "meta-llama/Llama-2-7b-chat-hf"
    },
    "settings": {
        "min_instances": 1,
        "max_instances": 1,
    }
}' -H 'Content-Type: application/json' \
    -H "Authorization: Bearer YOUR_API_KEY"
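If you'd rather keep the payload in a file, the same call works with curl's @file syntax. A minimal variant, assuming the JSON body above is saved as deploy.json (jq is optional and only pretty-prints the response; for a private repo, see the Custom LLM Docs for where to pass the access token you created earlier):

# same deploy request, reading the JSON body from deploy.json
curl -s -X POST https://api.deepinfra.com/deploy/llm \
    -d @deploy.json \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer YOUR_API_KEY" | jq .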

Use it

curl -X POST \
    -d '{"input": "Hello"}' \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer YOUR_API_KEY" \
    'https://api.deepinfra.com/v1/inference/github-username/di-model-name'
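The response comes back as JSON. Here is a sketch for extracting just the completion, assuming the common text-generation response shape with a results[0].generated_text field; verify the actual response structure before relying on it:

# pull out only the generated text (response field name is an assumption)
curl -s -X POST \
    -d '{"input": "Hello"}' \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer YOUR_API_KEY" \
    'https://api.deepinfra.com/v1/inference/github-username/di-model-name' \
    | jq -r '.results[0].generated_text'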

For an in-depth tutorial, check the Custom LLM Docs.
