Reliable JSON-Only Responses with DeepInfra LLMs
Published on 2026.02.02 by DeepInfra

When large language models are used inside real applications, their role changes fundamentally. Instead of chatting with users, they become infrastructure components: extracting information, transforming text, driving workflows, or powering APIs.

In these scenarios, natural language is no longer the desired output. What applications need is structured data — and very often, that structure is JSON. Yet anyone who has worked with LLMs knows the problem: even when explicitly instructed, models frequently return explanations, markdown, or subtly invalid JSON. A single stray character is enough to break a production system.

This article explains how to reliably force DeepInfra-hosted LLMs to return valid JSON only, without fragile prompt hacks or complex post-processing. The approach is simple, robust, and well-suited for production environments.

We will walk through the reasoning, the design principles, and a complete Python example using DeepInfra’s OpenAI-compatible API.

Why JSON-Only Output Matters in Practice

In early experiments, developers often tolerate imperfect outputs. A human can easily ignore extra text or mentally parse half-structured responses. Software cannot.

Once an LLM is part of a backend system, its output might be:

  • Parsed by json.loads()
  • Stored in a database
  • Passed to another service
  • Used to trigger business logic
  • Validated against a schema

In all of these cases, anything other than valid JSON is a failure. Even a well-meaning model response like:

Sure! Here is the JSON you requested:
{ "value": 42 }

is unusable without brittle string manipulation. These problems multiply as systems grow more complex.
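To make the failure concrete, here is a minimal sketch of what happens when a reply like the one above reaches json.loads(). The chatty_response string and the try_parse helper are made up for illustration; they are not real API output or part of any library.

```python
import json

# Hypothetical model reply: valid JSON preceded by a friendly sentence.
chatty_response = 'Sure! Here is the JSON you requested:\n{ "value": 42 }'

def try_parse(raw: str):
    """Return the parsed value, or None if raw is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None

print(try_parse(chatty_response))    # None: the leading sentence breaks parsing
print(try_parse('{ "value": 42 }'))  # {'value': 42}
```

The JSON itself is fine in both cases; only the surrounding text differs, and that alone is enough to make the first call fail.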

The solution is not “better prompting” alone. It requires hard constraints at the API level.

Constraining the Model

DeepInfra supports the OpenAI-compatible Chat Completions API, including structured response formats. The most important feature for JSON-only output is:

response_format = { "type": "json_object" }

This does something fundamentally different from prompt instructions. Instead of asking the model to behave, you are restricting what it is allowed to produce.

When json_object is enabled, the model is constrained to emit a single valid JSON object. It cannot wrap the response in explanations, markdown, or conversational text. This dramatically increases reliability.

In practice, this single parameter eliminates the majority of JSON parsing issues developers encounter.

The Importance of Keeping Schemas Small

An instinct when working with structured output is to define a detailed schema that mirrors the full complexity of your application. While this feels safe, it often backfires.

Large schemas increase the cognitive load on the model. Each required field is another opportunity for failure. Each enum or strict type constraint raises the probability that something will be violated.

A more effective strategy is to start with a tiny schema. Ask the model only for the minimum information you need at that moment. If additional details are required, they can be requested later or derived in your application code.

For example, instead of asking for a fully populated object with dozens of fields, you might only ask for two or three core values. Missing or unknown information can safely be represented as null.

This approach leads to higher success rates, simpler validation, and more flexible systems.
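One way to keep validation simple with a tiny schema is a plain type check in application code. The sketch below is only an illustration: TINY_SCHEMA and validate_tiny are hypothetical names, and the three fields are borrowed from the Python example later in this article.

```python
import json

# Hypothetical tiny schema: field name -> accepted Python types.
# None is always accepted, matching the "use null when unknown" rule.
TINY_SCHEMA = {
    "category": (str,),
    "priority": (int, float),
    "deadline_days": (int, float),
}

def validate_tiny(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload is acceptable."""
    problems = []
    for field, types in TINY_SCHEMA.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
            continue
        value = payload[field]
        if value is None:
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, types):
            problems.append(f"wrong type for {field}: {type(value).__name__}")
    return problems

good = json.loads('{"category": null, "priority": 1, "deadline_days": 5}')
bad = json.loads('{"category": 7, "priority": "high"}')

print(validate_tiny(good))  # []
print(validate_tiny(bad))
```

With only three nullable fields, the whole check fits in a few lines. A larger schema would likely call for a dedicated validation library instead.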

Designing Prompts for Structured Output

Even when the model is technically constrained to return JSON, the system prompt still plays a crucial role in shaping its behavior. The goal of the prompt is not to explain the task in detail, but to clearly define what kind of entity the model is supposed to be.

For structured output, the model should not think of itself as a conversational partner. Instead, it should behave like a backend service: it receives an input, performs a transformation, and returns a result in a strictly defined format. When the model adopts this mental frame, it becomes far less likely to add explanations, commentary, or unnecessary language.

In practice, this means that brevity is a strength. Short, explicit instructions consistently outperform long, descriptive prompts. Overly verbose prompts often introduce ambiguity, encourage natural language elaboration, or distract from the primary constraint: returning structured data only.

A useful mental model is to imagine the LLM as a stateless microservice in your architecture. It does not chat, it does not clarify unless explicitly asked, and it does not justify its output. It simply returns data.

Below is an example of a system prompt that works well in production when combined with response_format=json_object:

You are a backend service.
Your task is to extract structured data from input text.
Return ONLY a valid JSON object.
Do not add explanations, comments, or extra text.
If a value cannot be determined, use null.

This prompt is intentionally minimal. It clearly defines the model’s role, reinforces the JSON-only requirement, and specifies how uncertainty should be handled. Combined with decoder-level constraints, this approach reliably produces clean, machine-readable output suitable for direct use in applications.

Handling the Last 1%: A Single Safe Retry

No system is perfect. Network issues, partial responses, or rare model edge cases can still occur. Instead of assuming perfection, production systems should include a minimal safety mechanism.

A proven pattern is the one safe retry:

  1. Call the model
  2. Attempt to parse the response as JSON
  3. If parsing fails, retry once with a stronger instruction
  4. If it fails again, surface an error

This approach avoids infinite loops while still covering nearly all transient failures. In practice, a single retry is enough to handle the vast majority of issues.
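The four steps above can be sketched independently of any particular API. In this sketch, call_model is a hypothetical callable standing in for the real request, and fake_model simulates one bad reply followed by a good one so the snippet runs without network access.

```python
import json
from typing import Callable

def parse_with_one_retry(call_model: Callable[[bool], str]) -> dict:
    """
    call_model(strict) is a hypothetical callable that returns the raw model
    text; strict=True means "send the stronger instruction".
    """
    last_error = None
    for strict in (False, True):  # first attempt, then exactly one retry
        raw = call_model(strict)
        try:
            parsed = json.loads(raw)
        except (json.JSONDecodeError, TypeError) as exc:  # TypeError covers raw=None
            last_error = exc
            continue
        if isinstance(parsed, dict):
            return parsed
        last_error = ValueError("response was valid JSON but not an object")
    raise ValueError("Failed to obtain valid JSON from the model.") from last_error

# Fake model for demonstration: fails once, then succeeds.
replies = iter(['Sure! {"value": 42}', '{"value": 42}'])
calls = []

def fake_model(strict: bool) -> str:
    calls.append(strict)
    return next(replies)

print(parse_with_one_retry(fake_model))  # {'value': 42}
```

Because the loop iterates over a fixed two-element tuple, it cannot run more than twice, which is what keeps the retry "safe".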

A Complete Python Example Using DeepInfra

Let’s put all of this together in a concrete example. In this scenario, we want to extract a small set of structured attributes from arbitrary user text. The domain is intentionally generic so the pattern can be reused across many applications.

We will use a modern instruction-tuned model hosted on DeepInfra and enforce JSON-only output.

Python Example

import json
from openai import OpenAI

# Initialize the DeepInfra client
client = OpenAI(
    api_key="YOUR_DEEPINFRA_API_KEY",
    base_url="https://api.deepinfra.com/v1/openai"
)

# Use a DeepInfra model
MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"

SYSTEM_PROMPT = """
You are a backend service.
Return ONLY valid JSON.
Do not add explanations or extra text.
If a value cannot be determined, use null.
"""

def extract_structured_data(text: str) -> dict:
    """
    Calls an LLM with JSON mode enabled and returns the parsed JSON object.
    Retries once if parsing fails, then raises ValueError.
    """

    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": f"""
Extract the following fields from the text:

- category (string or null)
- priority (number or null)
- deadline_days (number or null)

Text:
\"\"\"{text}\"\"\"
"""
        }
    ]

    def request():
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            response_format={"type": "json_object"},
            temperature=0.2
        )
        return response.choices[0].message.content

    # First attempt
    raw = request()
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass

    # One safe retry with stronger constraint
    messages.insert(
        1,
        {
            "role": "system",
            "content": "IMPORTANT: The response must be valid JSON only."
        }
    )

    raw_retry = request()
    try:
        return json.loads(raw_retry)
    except json.JSONDecodeError as exc:
        # Chain the original error so the parse failure stays visible in tracebacks
        raise ValueError("Failed to obtain valid JSON from the model.") from exc

Example Usage

input_text = "This task is high priority and should be done within the next 5 days."

result = extract_structured_data(input_text)
print(result)

Output (exact values may vary by model)

{'category': None, 'priority': 1, 'deadline_days': 5}

The output is clean, machine-readable, and safe to pass directly into downstream systems.

Why This Pattern Is Production-Ready

What makes this approach robust is not any single trick, but the combination of constraints and simplicity.

The model is constrained at the decoder level to produce JSON. The schema is intentionally small. The system prompt frames the model as infrastructure rather than a conversational partner. Finally, a single retry handles rare failures without introducing complexity.

This combination has proven to be reliable across a wide range of real-world use cases, from data extraction to workflow automation.

Conclusion

Structured output is one of the most important capabilities when using LLMs in serious applications. Without it, systems become fragile and difficult to maintain.

DeepInfra’s support for response_format=json_object makes it possible to treat LLMs as predictable components, not just creative text generators. When combined with small schemas and minimal retry logic, the result is a clean and dependable integration pattern.

If you are building APIs, automation pipelines, internal tools, or AI-powered services, this approach should be your default starting point.

From here, you can layer on validation, enrichment, or multi-step workflows — but reliable JSON output is the foundation everything else depends on.
