Reliable JSON-Only Responses with DeepInfra LLMs

Published on 2026.02.02 by DeepInfra

When large language models are used inside real applications, their role changes fundamentally. Instead of chatting with users, they become infrastructure components: extracting information, transforming text, driving workflows, or powering APIs.

In these scenarios, natural language is no longer the desired output. What applications need is structured data — and very often, that structure is JSON. Yet anyone who has worked with LLMs knows the problem: even when explicitly instructed, models frequently return explanations, markdown, or subtly invalid JSON. A single stray character is enough to break a production system.

This article explains how to reliably force DeepInfra-hosted LLMs to return valid JSON only, without fragile prompt hacks or complex post-processing. The approach is simple, robust, and well-suited for production environments.

We will walk through the reasoning, the design principles, and a complete Python example using DeepInfra’s OpenAI-compatible API.

Why JSON-Only Output Matters in Practice

In early experiments, developers often tolerate imperfect outputs. A human can easily ignore extra text or mentally parse half-structured responses. Software cannot.

Once an LLM is part of a backend system, its output might be:

Parsed by json.loads()
Stored in a database
Passed to another service
Used to trigger business logic
Validated against a schema

In all of these cases, anything other than valid JSON is a failure. Even a well-meaning model response like:

Sure! Here is the JSON you requested:
{ "value": 42 }

is unusable without brittle string manipulation. These problems multiply as systems grow more complex.
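
To make the failure concrete, here is a minimal sketch of what happens when the chatty response above is handed to json.loads(). The helper name try_parse and the chatty_response string are illustrative only; the string simply reproduces the example from this section rather than real model output.

```python
import json

# The well-meaning response from above: valid JSON wrapped in a friendly sentence.
chatty_response = 'Sure! Here is the JSON you requested:\n{ "value": 42 }'

def try_parse(raw: str):
    """Return the parsed value, or None if raw is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None

print(try_parse(chatty_response))    # None: the leading sentence breaks parsing
print(try_parse('{ "value": 42 }'))  # {'value': 42}
```

The JSON itself is fine in both cases. Only the surrounding text makes the first response unparseable.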

The solution is not “better prompting” alone. It requires hard constraints at the API level.

Constraining the Model

DeepInfra supports the OpenAI-compatible Chat Completions API, including structured response formats. The most important feature for JSON-only output is:

response_format = { "type": "json_object" }

This does something fundamentally different from prompt instructions. Instead of asking the model to behave, you are restricting what it is allowed to produce.

When json_object is enabled, the model is constrained to emit a single valid JSON object. It cannot wrap the response in explanations, markdown, or conversational text. This dramatically increases reliability.

In practice, this single parameter eliminates the majority of JSON parsing issues developers encounter.

The Importance of Keeping Schemas Small

An instinct when working with structured output is to define a detailed schema that mirrors the full complexity of your application. While this feels safe, it often backfires.

Large schemas increase the cognitive load on the model. Each required field is another opportunity for failure. Each enum or strict type constraint raises the probability that something will be violated.

A more effective strategy is to start with a tiny schema. Ask the model only for the minimum information you need at that moment. If additional details are required, they can be requested later or derived in your application code.

For example, instead of asking for a fully populated object with dozens of fields, you might only ask for two or three core values. Missing or unknown information can safely be represented as null.

This approach leads to higher success rates, simpler validation, and more flexible systems.
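
A tiny schema is also easy to check in plain application code. The sketch below assumes a three-field schema with the same field names used in the Python example later in this article. TINY_SCHEMA and validate are hypothetical names introduced for illustration, not part of any DeepInfra or OpenAI library.

```python
# Hypothetical three-field schema: field name -> accepted Python types.
# None is always accepted, because unknown values are represented as null.
TINY_SCHEMA = {
    "category": (str,),
    "priority": (int, float),
    "deadline_days": (int, float),
}

def validate(data, schema=TINY_SCHEMA) -> list:
    """Return a list of problems; an empty list means the data fits the schema."""
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    for field, types in schema.items():
        if field not in data:
            problems.append(f"missing field: {field}")
            continue
        value = data[field]
        if value is None:
            continue  # null is the agreed way to say "unknown"
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, types):
            problems.append(f"wrong type for {field}: {type(value).__name__}")
    return problems

print(validate({"category": None, "priority": 1, "deadline_days": 5}))
# []
print(validate({"category": "ops", "priority": "high"}))
# ['wrong type for priority: str', 'missing field: deadline_days']
```

With only three fields, the whole validator fits on one screen. A schema with dozens of fields would need a dedicated validation library and would give the model far more ways to fail.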

Designing Prompts for Structured Output

Even when the model is technically constrained to return JSON, the system prompt still plays a crucial role in shaping its behavior. The goal of the prompt is not to explain the task in detail, but to clearly define what kind of entity the model is supposed to be.

For structured output, the model should not think of itself as a conversational partner. Instead, it should behave like a backend service: it receives an input, performs a transformation, and returns a result in a strictly defined format. When the model adopts this mental frame, it becomes far less likely to add explanations, commentary, or unnecessary language.

In practice, this means that brevity is a strength. Short, explicit instructions consistently outperform long, descriptive prompts. Overly verbose prompts often introduce ambiguity, encourage natural language elaboration, or distract from the primary constraint: returning structured data only.

A useful mental model is to imagine the LLM as a stateless microservice in your architecture. It does not chat, it does not clarify unless explicitly asked, and it does not justify its output. It simply returns data.

Below is an example of a system prompt that works well in production when combined with response_format=json_object:

You are a backend service.
Your task is to extract structured data from input text.
Return ONLY a valid JSON object.
Do not add explanations, comments, or extra text.
If a value cannot be determined, use null.

This prompt is intentionally minimal. It clearly defines the model’s role, reinforces the JSON-only requirement, and specifies how uncertainty should be handled. Combined with decoder-level constraints, this approach reliably produces clean, machine-readable output suitable for direct use in applications.
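
One way to keep prompts this short in practice is to assemble them from a small field list instead of writing prose by hand. The following is a rough sketch under that assumption. build_messages and its fields argument are hypothetical helpers, and the user-message wording is only one possible phrasing.

```python
SYSTEM_PROMPT = (
    "You are a backend service.\n"
    "Your task is to extract structured data from input text.\n"
    "Return ONLY a valid JSON object.\n"
    "Do not add explanations, comments, or extra text.\n"
    "If a value cannot be determined, use null."
)

def build_messages(text: str, fields: dict) -> list:
    """Assemble a short system + user message pair for an extraction request.

    `fields` maps each field name to a brief type hint, e.g. "string or null".
    """
    field_lines = "\n".join(f"- {name} ({hint})" for name, hint in fields.items())
    user_content = (
        "Extract the following fields from the text:\n\n"
        f"{field_lines}\n\n"
        f"Text:\n{text}"
    )
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]

messages = build_messages(
    "Please fix this within 5 days.",
    {"category": "string or null", "deadline_days": "number or null"},
)
print(messages[1]["content"])
```

Because the user message is generated from the field list, the prompt cannot quietly grow into a long description, and adding a field is a one-line change.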

Handling the Last 1%: A Single Safe Retry

No system is perfect. Network issues, partial responses, or rare model edge cases can still occur. Instead of assuming perfection, production systems should include a minimal safety mechanism.

A proven pattern is the one safe retry:

Call the model
Attempt to parse the response as JSON
If parsing fails, retry once with a stronger instruction
If it fails again, surface an error

This approach avoids infinite loops while still covering nearly all transient failures. In practice, a single retry is enough to handle the vast majority of issues.
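
The four steps above can be sketched independently of any particular API client. In the sketch below, call_model is a hypothetical callable standing in for the real request: it is told whether this is the retry and returns the raw response text. The fake replies at the bottom exist only to show the control flow.

```python
import json
from typing import Callable, Optional

def parse_with_one_retry(call_model: Callable[[bool], Optional[str]]) -> dict:
    """Call the model, parse JSON, and retry exactly once on failure.

    `call_model` is a hypothetical callable: it receives True on the retry
    (so it can add a stronger instruction) and returns the raw response text.
    """
    last_error = None
    for is_retry in (False, True):  # at most two attempts, so no unbounded loop
        raw = call_model(is_retry)
        try:
            parsed = json.loads(raw)
        except (json.JSONDecodeError, TypeError) as exc:
            last_error = exc  # invalid JSON, or an empty (None) reply
            continue
        if isinstance(parsed, dict):
            return parsed
        last_error = ValueError("response was valid JSON but not an object")
    raise ValueError("Failed to obtain valid JSON from the model.") from last_error

# Simulated model: the first reply is chatty, the second is clean JSON.
fake_replies = iter(['Sure! {"value": 42}', '{"value": 42}'])
print(parse_with_one_retry(lambda is_retry: next(fake_replies)))  # {'value': 42}
```

Passing the request in as a callable keeps the retry logic testable without network access, and the fixed two-element loop makes it impossible to retry more than once.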

A Complete Python Example Using DeepInfra

Let’s put all of this together in a concrete example. In this scenario, we want to extract a small set of structured attributes from arbitrary user text. The domain is intentionally generic so the pattern can be reused across many applications.

We will use a modern instruction-tuned model hosted on DeepInfra and enforce JSON-only output.

Python Example

import json
from openai import OpenAI

# Initialize the DeepInfra client
client = OpenAI(
    api_key="YOUR_DEEPINFRA_API_KEY",
    base_url="https://api.deepinfra.com/v1/openai"
)

# Use a DeepInfra model
MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"

SYSTEM_PROMPT = """
You are a backend service.
Return ONLY valid JSON.
Do not add explanations or extra text.
If a value cannot be determined, use null.
"""

def extract_structured_data(text: str) -> dict:
    """
    Calls an LLM and returns the parsed JSON object.
    Uses a single safe retry if JSON parsing fails.
    """

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": f"""
Extract the following fields from the text:

- category (string or null)
- priority (number or null)
- deadline_days (number or null)

Text:
\"\"\"{text}\"\"\"
"""
        }
    ]

    def request():
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            response_format={"type": "json_object"},
            temperature=0.2
        )
        return response.choices[0].message.content

    # First attempt
    raw = request()
    try:
        return json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        # Invalid JSON, or empty (None) message content: fall through to the retry
        pass

    # One safe retry with stronger constraint
    messages.insert(
        1,
        {
            "role": "system",
            "content": "IMPORTANT: The response must be valid JSON only."
        }
    )

    raw_retry = request()
    try:
        return json.loads(raw_retry)
    except (json.JSONDecodeError, TypeError) as exc:
        raise ValueError("Failed to obtain valid JSON from the model.") from exc

Example Usage

input_text = "This task is high priority and should be done within the next 5 days."

result = extract_structured_data(input_text)
print(result)

Output

{'category': None, 'priority': 1, 'deadline_days': 5}

The output is clean, machine-readable, and safe to pass directly into downstream systems.

Why This Pattern Is Production-Ready

What makes this approach robust is not any single trick, but the combination of constraints and simplicity.

The model is constrained at the decoder level to produce JSON. The schema is intentionally small. The system prompt frames the model as infrastructure rather than a conversational partner. Finally, a single retry handles rare failures without introducing complexity.

This combination has proven to be reliable across a wide range of real-world use cases, from data extraction to workflow automation.

Conclusion

Structured output is one of the most important capabilities when using LLMs in serious applications. Without it, systems become fragile and difficult to maintain.

DeepInfra’s support for response_format=json_object makes it possible to treat LLMs as predictable components, not just creative text generators. When combined with small schemas and minimal retry logic, the result is a clean and dependable integration pattern.

If you are building APIs, automation pipelines, internal tools, or AI-powered services, this approach should be your default starting point.

From here, you can layer on validation, enrichment, or multi-step workflows — but reliable JSON output is the foundation everything else depends on.
