NVIDIA Nemotron 3 Super is a hybrid Mixture-of-Experts (MoE) model engineered for highest compute efficiency and accuracy in multi-agent applications and specialized agentic systems. It is optimized to run many collaborating agents per application on a single GPU, delivering high accuracy for reasoning, tool use, and instruction following.

NVIDIA-Nemotron-3-Super-120B-A12B

GLM-5 is an advanced, open-source large language model designed for developers tackling the toughest challenges. It excels at long-context reasoning, multi-step tool orchestration, and complex systems engineering, making it the ideal choice for powering sophisticated agents and applications that require high-level cognitive tasks.

GLM-5

  Qwen3-TTS is an advanced text-to-speech model by Alibaba's Qwen team, delivering stable, expressive, and low-latency speech generation across 10 languages.                                                                                                                                                                                                                                                                                                                                           Key capabilities:                                                                                                                                                                                                                                  - 9 preset voices — Vivian, Serena, Uncle_Fu, Dylan, Eric, Ryan, Aiden, Ono_Anna, Sohee — covering diverse genders, ages, and accents                                                                                                              - Voice cloning — clone any voice from a short (~3s) audio sample via the voice_id parameter   - Instruction control — adjust tone, emotion, and speaking style with natural language (e.g. "speak slowly and calmly", "excited tone")   - 10 languages — English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese   - Streaming support — real-time PCM streaming with ~97ms first-byte latency   - Multiple output formats — WAV, MP3, FLAC, PCM    Built on a 1.7B parameter architecture using discrete multi-codebook language modeling for end-to-end speech synthesis without cascading errors. Uses a custom 12Hz acoustic tokenizer that preserves paralinguistic information and   environmental audio details.

Qwen3-TTS

● Qwen3-TTS-VoiceDesign is a voice design variant of Qwen3-TTS by Alibaba's Qwen team. Instead of selecting from preset voices, you describe the voice you want in natural language — and the model generates speech in that voice.                                                                                                                                                                                                                                                                     Key capabilities:                                                                                                                                                                                                                                  - Natural language voice control — describe any voice with free text (e.g. "a deep male voice with a calm, authoritative presence", "a young cheerful female with a warm and friendly tone")   - 10 languages — English, Chinese, Japanese, Korean, German, French, Russian, Spanish, Italian, Portuguese                                                                                                                                         - Streaming support — real-time PCM streaming   - Multiple output formats — WAV, MP3, FLAC, PCM    Built on the same 1.7B parameter architecture as Qwen3-TTS, using discrete multi-codebook language modeling and a custom 12Hz acoustic tokenizer for high-quality end-to-end speech synthesis.

Qwen3-TTS-VoiceDesign

MiniMax M2.5 is SOTA in coding, agentic tool use and search, office work, and a range of other economically valuable tasks, boasting scores of 80.2% in SWE-Bench Verified, 51.3% in Multi-SWE-Bench, and 76.3% in BrowseComp (with context management).

MiniMax-M2.5

The latest flagship model in the Qwen family. State-of-the-art results across a comprehensive suite of benchmarks — including knowledge, reasoning, coding, instruction following, human preference alignment, agent tasks, and multilingual understanding.

Qwen3-Max

The latest flagship reasoning model in the Qwen3 family. Further enhanced by multiple innovations like adaptive tool-use and advanced test-time scaling techniques

Qwen3-Max-Thinking

Kimi K2.5 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It seamlessly integrates vision and language understanding with advanced agentic capabilities, instant and thinking modes, as well as conversational and agentic paradigms.

Kimi-K2.5

GLM-4.7-Flash is a 30B-A3B MoE model. As the strongest model in the 30B class, GLM-4.7-Flash offers a new option for lightweight deployment that balances performance and efficiency.

GLM-4.7-Flash

NVIDIA Nemotron 3 Nano is an open small reasoning model optimized for fast, cost-efficient inference in agentic and production workloads. Built with a hybrid Mixture-of-Experts (MoE) and Mamba-Transformer architecture, it delivers strong multi-step reasoning, high token throughput, stable latency with predictable cost, and efficient deployment for agent-based systems. Designed for real-world AI systems where reasoning can generate significantly more tokens per prompt, Nemotron Nano reduces compute cost while maintaining strong reasoning quality.

Nemotron-3-Nano-30B-A3B

DeepSeek-V3.2 is a large language model designed to harmonize high computational efficiency with strong reasoning and agentic tool-use performance. It introduces DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism that reduces training and inference cost while preserving quality in long-context scenarios. A scalable reinforcement learning post-training framework further improves reasoning, with reported performance in the GPT-5 class, and the model has demonstrated gold-medal results on the 2025 IMO and IOI. V3.2 also uses a large-scale agentic task synthesis pipeline to better integrate reasoning into tool-use settings, boosting compliance and generalization in interactive environments.

DeepSeek-V3.2

Chatterbox is a family of three state-of-the-art, open-source text-to-speech models by Resemble AI.  We are excited to introduce Chatterbox-Turbo, our most efficient model yet. Built on a streamlined 350M parameter architecture, Turbo delivers high-quality speech with less compute and VRAM than our previous models. We have also distilled the speech-token-to-mel decoder, previously a bottleneck, reducing generation from 10 steps to just one, while retaining high-fidelity audio output.  Paralinguistic tags are now native to the Turbo model, allowing you to use [cough], [laugh], [chuckle], and more to add distinct realism. While Turbo was built primarily for low-latency voice agents, it excels at narration and creative workflows.  If you like the model but need to scale or tune it for higher accuracy, check out our competitively priced TTS service (link).

chatterbox-turbo

The fastest model of the Flux 2 family. Frontier visual intelligence — state-of-the-art image generation and editing from Black Forest Labs

FLUX-2-klein-4b

The best quality-to-latency ratio, production apps model of the Flux 2 family. Frontier visual intelligence — state-of-the-art image generation and editing from Black Forest Labs

FLUX-2-klein-9b

Developed by Anthropic, Claude is a family of highly performant, trustworthy AI models built for complex reasoning, advanced coding, and nuanced language understanding. The latest Claude 4 generation delivers breakthrough capabilities in analytical thinking, with Claude 4 Opus setting new standards for intelligence and Claude 4 Sonnet providing exceptional performance with remarkable efficiency.

Claude models excel at understanding context, following complex instructions, and maintaining coherent conversations across extended interactions. With advanced features like extended thinking for deeper reasoning, prompt caching that reduces costs by up to 90%, vision capabilities for image analysis, and robust safety measures, Claude is designed for enterprise applications that demand both sophistication and reliability.

Available with comprehensive API features including streaming responses, batch processing for 50% cost savings, multilingual support across dozens of languages, and flexible context windows up to 200K tokens (1M in beta), Claude is perfect for building intelligent applications like customer support agents, content analysis systems, coding assistants, and complex reasoning workflows that require both accuracy and trustworthiness.

Claude AI family: Claude 4 Opus for complex reasoning, Claude 4 Sonnet for balanced performance, plus advanced capabilities like extended thinking, prompt caching, vision analysis, and enterprise-grade safety APIs.

Claude AI APIs via DeepInfra

claude

Developed by Anthropic, Claude is a family of highly performant, trustworthy AI models built for complex reasoning, advanced coding, and nuanced language understanding

DeepInfra provides access to Anthropic's latest Claude models, featuring the most advanced reasoning capabilities and balanced performance options, all with enterprise-grade safety and reliability.

Claude

DeepSeek develops advanced foundation models optimized for computational efficiency and strong generalization across diverse tasks. The architecture incorporates recent advances in transformer-based systems, delivering robust performance in both zero-shot and fine-tuned scenarios. Models are pretrained on rigorously filtered multilingual corpora with specialized optimizations for mathematical reasoning and algorithmic tasks. The inference stack achieves competitive throughput while maintaining low latency, making it suitable for production deployment. Researchers and engineers can leverage these models for tasks ranging from natural language processing to complex analytical problem-solving.

deepseek

DeepSeek's models are a suite of advanced AI systems that prioritize efficiency, scalability, and real-world applicability.

DeepSeek

Developed by Black Forest Labs (the original creators behind Stable Diffusion), Flux is a family of state-of-the-art image generation and editing models that deliver exceptional visual quality with breakthrough prompt accuracy and photorealism. Built on advanced 12 billion parameter architecture, Flux models excel at understanding exactly what you want to create or modify.
FLUX 2 represents the next generation of the Flux family, introducing significantly improved image quality, faster generation, and more precise prompt following. The lineup includes Max for ultimate quality, Pro for production workflows, and Klein variants (9B and 4B) optimized for speed and efficiency — making high-quality image generation accessible at every scale.
Flux offers specialized variants for every need: from open-source to commercial licensing, Flux is perfect for developers building creative applications, product visualization tools, and next-generation image editing experiences.

Flux AI image generation family: FLUX.1 Kontext for in-context editing, FLUX.1 Pro/Dev for text-to-image synthesis, plus comprehensive editing tools and state-of-the-art visual generation APIs.

Flux Image Generation APIs via DeepInfra

flux

Developed by Black Forest Labs, Flux is a family of state-of-the-art image generation and editing models that deliver exceptional visual quality with breakthrough prompt accuracy and photorealism.

DeepInfra provides access to Black Forest Labs' complete Flux ecosystem, offering everything from lightning-fast generation to sophisticated in-context editing capabilities with industry-leading prompt adherence and visual quality.

Flux

Developed by Google DeepMind, Gemini is a family of state-of-the-art thinking models with native multimodal capabilities, designed for advanced reasoning, complex problem-solving, and comprehensive understanding across text, audio, video, and images. Built with revolutionary thinking architecture, Gemini models reason through problems step-by-step before responding, delivering enhanced accuracy and performance for sophisticated applications.

Gemini 2.5 Pro sets new standards for complex reasoning and coding excellence, while Gemini 2.5 Flash provides optimal price-performance for high-volume tasks. With massive context windows up to 1 million tokens, native multimodal processing that handles hours of video and audio, and transparent reasoning capabilities that show step-by-step thinking processes, Gemini excels at document analysis, code generation, scientific research, and agentic workflows.

Perfect for building intelligent applications that require deep reasoning, multimodal understanding, long-context processing, and transparent AI decision-making with Google's enterprise-grade reliability and performance.


Gemini AI family: Advanced thinking models with native multimodal processing for text, audio, video, and image understanding APIs

Gemini AI Model APIs via DeepInfra

gemini

Developed by Google DeepMind, Gemini is a family of state-of-the-art thinking models with native multimodal capabilities

DeepInfra provides access to Google's latest Gemini models, featuring advanced thinking capabilities, native multimodal processing, and industry-leading performance for complex reasoning and development tasks.

Gemini

Developed by Meta, Llama (Large Language Model Meta AI) is a family of state-of-the-art open-weight models designed for efficiency and performance. The latest versions feature Mixture-of-Experts (MoE) architectures, enabling cost-effective inference by dynamically activating subsets of parameters. With support for multimodal inputs (text + images) and extended context windows (up to 10M tokens), Llama excels in tasks like code generation, multilingual understanding, and long-form reasoning. The models support FP8 quantization and batch inference, ensuring low-latency, high-throughput performance for production workloads. With permissive licensing and robust tooling (e.g., Llama Guard for safety), Llama is ideal for developers seeking powerful, customizable AI with minimal overhead.

llama

The Llama 4 collection of models are natively multimodal AI models that enable text and multimodal experiences.

Llama 4

Meta Llama 3 are a collection of pretrained and instruction tuned generative text models in 8B, 70B and 405B sizes.

Llama 3

Llama

Developed by Mistral AI, a leading French research lab, Mistral is a family of open-source AI models built for multilingual excellence, advanced reasoning, and cost-effective performance. These models excel at complex reasoning, mathematics, coding, and specialized tasks while offering complete transparency and deployment freedom through open-source licensing.

Mistral Small 3.2 delivers breakthrough efficiency with native fluency in European languages, while specialized variants handle specific needs: Devstral for coding, Voxtral for audio processing, and Mixtral for high-performance tasks. With Apache 2.0 licensing, extensive context windows up to 128K tokens, and comprehensive customization options, Mistral provides enterprise-grade capabilities without vendor lock-in.

Perfect for building multilingual applications, coding assistants, and reasoning systems where you need both powerful performance and complete control over your AI deployment.

Mistral AI model family: Mistral Small 3.2 for efficient performance, Mixtral for specialized tasks, Devstral for coding, plus multilingual reasoning, mathematics, and open-source flexibility APIs.

Mistral AI Model APIs via DeepInfra

mistral

Developed by Mistral AI, a leading French research lab, Mistral is a family of open-source AI models built for multilingual excellence, advanced reasoning, and cost-effective performance

DeepInfra provides access to Mistral AI's comprehensive open-source model ecosystem, from efficient small models to specialized coding and audio processing variants, all with complete Apache 2.0 licensing freedom.

Mistral

Voxtral is a family of audio models with state-of-the-art speech to text capabilities.

Voxtral

[NVIDIA Nemotron<sup style='font-size:0.6em'>TM</sup>](https://www.nvidia.com/en-us/ai-data-science/foundation-models/nemotron/) is a family of open models, datasets, and training recipes engineered for high performance, efficiency and customization. Nemotron models support synthetic data workflows and supervised fine-tuning — and are equally optimized for real-time inference, reasoning agents, and production AI systems.

nemotron

NVIDIA Nemotron is a family of open models customized for efficiency, accuracy, and specialized workloads.

The Nemotron family spans Nano, Super, and specialized instruct variants, enabling you to balance accuracy, reasoning depth, latency, and cost for your specific workload.
- **Nano** for maximum efficiency and stable inference
- **Super** for multi-agent systems and advance reasoning
- **Instruct variants** for instruction-following and conversational workloads

Nemotron

Developed by Alibaba Group's Qwen Team, Qwen is a family of state-of-the-art large language and multimodal models designed for comprehensive AI capabilities and multilingual performance. The latest Qwen3 generation features balanced model architectures including reintroduced Mixture-of-Experts (MoE) variants (Qwen3-30B-A3B and Qwen3-235B-A22B) alongside dense models up to 32B parameters, enabling efficient resource utilization through dynamic parameter activation. 

With support for 119 languages and dialects, hybrid thinking modes that seamlessly alternate between reasoning and instruction-following without model switching, and extended context windows (up to 1M tokens in Qwen3-2507), Qwen excels in tasks like multilingual understanding, code generation, agentic workflows, and complex problem-solving. The models utilize advanced Byte-level Byte Pair Encoding with a 151,646-token vocabulary, structured ChatML formatting for conversational interactions, and robust tool calling capabilities with parallel execution support. 

Available in both proprietary and open-weight versions with flexible licensing, comprehensive model variants (Base, Instruct, Thinking, and hybrid modes), and enhanced Model Context Protocol support, Qwen is ideal for developers seeking powerful, multilingual AI systems with sophisticated reasoning capabilities and minimal deployment complexity.

Qwen AI model family: Qwen3 language models, specialized coding & reasoning models, plus state-of-the-art embedding & reranking APIs for search and RAG applications.

Qwen Model APIs via DeepInfra

qwen

Qwen series offers a comprehensive suite of dense and mixture-of-experts models.

DeepInfra provides access to Qwen's latest generation of large language models, offering both specialized coding models and general-purpose AI systems with advanced reasoning capabilities.

Qwen

Built for low-latency, high-concurrency, cost-sensitive use cases, with flexible deployment, four-tier thinking, and multimodal

Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS providers.  Our model enables highly natural speech generation from text prompts when given a speaker embedding or audio prefix, and can accurately perform speech cloning when given a reference clip spanning just a few seconds. The conditioning setup also allows for fine control over speaking rate, pitch variation, audio quality, and emotions such as happiness, fear, sadness, and anger. The model outputs speech natively at 44kHz.

Text Embeddings by Weakly-Supervised Contrastive Pre-training. Model has 24 layers and 1024 out dim. 

Llama 3.3-70B is a multilingual LLM trained on a massive dataset of 15 trillion tokens, fine-tuned for instruction-following and conversational dialogue. The model is designed to be helpful, safe, and flexible, with a focus on responsible deployment and mitigating potential risks such as bias, toxicity, and misinformation. It achieves state-of-the-art performance on various benchmarks, including conversational tasks, language translation, and text generation.

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3-12B is Google's latest open source model, successor to Gemma 2

Light and fast. Remove the background of your videos to bring the foreground elements to focus. No more unwanted distractions.

Reflection Llama-3.1 70B is trained with a new technique called Reflection-Tuning that teaches a LLM to detect mistakes in its reasoning and correct course.  The model was trained on synthetic data.

DeepSeek-Prover-V2, an open-source large language model designed for formal theorem proving in Lean 4, with initialization data collected through a recursive theorem proving pipeline powered by DeepSeek-V3. The cold-start training procedure begins by prompting DeepSeek-V3 to decompose complex problems into a series of subgoals. The proofs of resolved subgoals are synthesized into a chain-of-thought process, combined with DeepSeek-V3's step-by-step reasoning, to create an initial cold start for reinforcement learning. 

We present a sentence similarity model based on the Sentence Transformers architecture, which maps sentences to a 384-dimensional dense vector space. The model uses a pre-trained BERT encoder and applies mean pooling on top of the contextualized word embeddings to obtain sentence embeddings. We evaluate the model on the Sentence Embeddings Benchmark.

olmOCR is a specialized AI tool that converts PDF documents into clean, structured text while preserving important formatting and layout information. What makes olmOCR particularly valuable for developers is its ability to handle challenging PDFs that traditional OCR tools struggle with—including complex layouts, poor-quality scans, handwritten text, and documents with mixed content types. Built on a fine-tuned 7B vision-language model, olmOCR provides enterprise-grade PDF processing at a fraction of the cost of proprietary solutions.

Veo 3 Fast is a speed-optimized version of the Veo 3 model, designed for rapid video creation. While maintaining high quality, it delivers results in a fraction of the time, making it ideal for quick iterations and dynamic content generation.

We present a sentence transformation model that generates semantically similar sentences. Our model is based on the Sentence-Transformers architecture and was trained on a large dataset of sentence pairs. We evaluate the effectiveness of our model by measuring its ability to generate similar sentences that are close to the original sentence in meaning.

A zero-shot-image-classification model released by OpenAI.
The clip-vit-large-patch14-336 model was trained from scratch on an unknown dataset and achieves unspecified results on the evaluation set. The model's intended uses and limitations, as well as its training and evaluation data, are not provided. The training procedure used an unknown optimizer and precision, and the framework versions included Transformers 4.21.3, TensorFlow 2.8.2, and Tokenizers 0.12.1.

Compared with the Plus series, it significantly reduces the “AI-like” feel in generated images, enhancing their realism. It delivers more lifelike material textures for human subjects, finer and more detailed natural textures, and more visually appealing text rendering.

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation, trained on 680k hours of labeled data without fine-tuning. It's a Transformer based encoder-decoder model, trained on English-only or multilingual data, predicting transcriptions in the same or different language as the audio. Whisper checkpoints come in five configurations of varying model sizes.

Bria RMBG 2.0 enables seamless removal of backgrounds from images, ideal for professional editing tasks. Trained exclusively on licensed data for safe and risk-free commercial use.

Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.

Mistral-Small-3.2-24B-Instruct is a drop-in upgrade over the 3.1 release, with markedly better instruction following, roughly half the infinite-generation errors, and a more robust function-calling interface—while otherwise matching or slightly improving on all previous text and vision benchmarks.

Gemma is a family of lightweight, state-of-the-art open models from Google. Gemma-2-27B delivers the best performance for its size class, and even offers competitive alternatives to models more than twice its size. 

CodeGemma is a collection of lightweight open code models built on top of Gemma. CodeGemma models are text-to-text and text-to-code decoder-only models and are available as a 7 billion pretrained variant that specializes in code completion and code generation tasks, a 7 billion parameter instruction-tuned variant for code chat and instruction following and a 2 billion parameter pretrained variant for fast code completion.

Multi-reference visual intelligence with unprecedented detail, color precision, and spatial reasoning.  The most advanced image generation and editing model. Generate photorealistic images with precise control.

The Llama 4 collection of models are natively multimodal AI models that enable text and multimodal experiences. These models leverage a mixture-of-experts architecture to offer industry-leading performance in text and image understanding. Llama 4 Scout, a 17 billion parameter model with 16 experts

Meta developed and released the Meta Llama 3.1 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8B, 70B and 405B sizes

BGE-M3 is a versatile text embedding model that supports multi-functionality, multi-linguality, and multi-granularity, allowing it to perform dense retrieval, multi-vector retrieval, and sparse retrieval in over 100 languages and with input sizes up to 8192 tokens. The model can be used in a retrieval pipeline with hybrid retrieval and re-ranking to achieve higher accuracy and stronger generalization capabilities. BGE-M3 has shown state-of-the-art performance on several benchmarks, including MKQA, MLDR, and NarritiveQA, and can be used as a drop-in replacement for other embedding models like DPR and BGE-v1.5.

Phi-4-multimodal-instruct is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs, generating text outputs, and comes with 128K token context length. The model underwent an enhancement process, incorporating both supervised fine-tuning, direct preference optimization and RLHF (Reinforcement Learning from Human Feedback) to support precise instruction adherence and safety measures. The languages that each modal supports are the following: - Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian - Vision: English - Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese

Stable Diffusion is a latent text-to-image diffusion model. Generate realistic images given text description

Qwen2.5-Coder-7B is a powerful code-specific large language model with 7.61 billion parameters. It's designed for code generation, reasoning, and fixing tasks. The model covers 92 programming languages and has been trained on 5.5 trillion tokens of data, including source code, text-code grounding, and synthetic data.

The llama-nemotron-rerank-vl-1b-v2 is a 1.7B parameter multimodal reranking model designed to evaluate and order the relevance of document images and text against specific user queries. It excels at understanding complex visual content like charts, tables, and infographics.

Nemotron-4-340B-Instruct is a chat model intended for use for the English language, designed for Synthetic Data Generation

CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.

Identify and segment objects across video frames using a text prompt. The easiest way to create a mask to modify your videos.

P-Image-Edit is a high-precision image editing model that applies complex transformations, insertions, removals, and style adjustments in under a second. It delivers state-of-the-art accuracy, clean boundaries, and reliable prompt alignment, making multi-step edits fast, consistent, and production-ready.

Bria 3.2 is the next-generation commercial-ready text-to-image model. With just 4 billion parameters, it provides exceptional aesthetics and text rendering, evaluated to be on par to leading open-source models, and outperforming other licensed models.

QwQ is the reasoning model of the Qwen series. Compared with conventional instruction-tuned models, QwQ, which is capable of thinking and reasoning, can achieve significantly enhanced performance in downstream tasks, especially hard problems. QwQ-32B is the medium-sized reasoning model, which is capable of achieving competitive performance against state-of-the-art reasoning models, e.g., DeepSeek-R1, o1-mini.

Euryale 70B v2.1 is a model focused on creative roleplay from Sao10k

The Dolphin 2.6 Mixtral 8x7b model is a finetuned version of the Mixtral-8x7b model, trained on a variety of data including coding data, for 3 days on 4 A100 GPUs. It is uncensored and requires trust_remote_code. The model is very obedient and good at coding, but not DPO tuned. The dataset has been filtered for alignment and bias. The model is compliant with user requests and can be used for various purposes such as generating code or engaging in general chat.

WizardLM-2 8x22B is Microsoft AI's most advanced Wizard model. It demonstrates highly competitive performance compared to those leading proprietary models.

Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford  et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting. Whisper large-v3-turbo is a finetuned version of a pruned Whisper large-v3. In other words, it's the exact same model, except that the number of decoding layers have reduced from 32 to 4. As a result, the model is way faster, at the expense of a minor quality degradation.

The Llama 90B Vision model is a top-tier, 90-billion-parameter multimodal model designed for the most challenging visual reasoning and language tasks. It offers unparalleled accuracy in image captioning, visual question answering, and advanced image-text comprehension. Pre-trained on vast multimodal datasets and fine-tuned with human feedback, the Llama 90B Vision is engineered to handle the most demanding image-based AI tasks.  This model is perfect for industries requiring cutting-edge multimodal AI capabilities, particularly those dealing with complex, real-time visual and textual analysis.

The Qwen3 Embedding model series is the latest proprietary model of the Qwen family, specifically designed for text embedding and ranking tasks. Building upon the dense foundational models of the Qwen3 series, it provides a comprehensive range of text embeddings and reranking models in various sizes (0.6B, 4B, and 8B).

Over the past few months, we have observed increasingly clear trends toward scaling both total parameters and context lengths in the pursuit of more powerful and agentic artificial intelligence (AI). We are excited to share our latest advancements in addressing these demands, centered on improving scaling efficiency through innovative model architecture. We call this next-generation foundation models Qwen3-Next.

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation, trained on 680k hours of labelled data without the need for fine-tuning. It is a Transformer based encoder-decoder model, trained on either English-only or multilingual data, and is available in five configurations of varying model sizes. The models were trained on the tasks of speech recognition and speech translation, predicting transcriptions in the same or different languages as the audio.

Latest version of the Airoboros model fine-tunned version of llama-2-70b using the Airoboros dataset. This model is currently running jondurbin/airoboros-l2-70b-2.2.1 

09/04 🔥 Introducing Chatterbox Multilingual in 23 Languages!  We're excited to introduce Chatterbox and Chatterbox Multilingual, Resemble AI's production-grade open source TTS models. Chatterbox Multilingual supports Arabic, Danish, German, Greek, English, Spanish, Finnish, French, Hebrew, Hindi, Italian, Japanese, Korean, Malay, Dutch, Norwegian, Polish, Portuguese, Russian, Swedish, Swahili, Turkish, Chinese out of the box. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations.

Meet Qwen3-VL — the most powerful vision-language model in the Qwen series to date.  This generation delivers comprehensive upgrades across the board: superior text understanding & generation, deeper visual perception & reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities.

Whisper is a set of multi-lingual, robust speech recognition models trained by OpenAI that achieve state-of-the-art results in many languages. Whisper models were trained to predict approximate timestamps on speech segments (most of the time with 1-second accuracy), but they cannot originally predict word timestamps. This version has implementation to predict word timestamps and provide a more accurate estimation of speech segments when transcribing with Whisper models.

Optimized specifically for multimodal agent scenarios. It features enhanced agent capabilities, upgraded multimodal comprehension, and more flexible context management.

Compared with GLM-4.5, GLM-4.6 brings several key improvements:  Longer context window: The context window has been expanded from 128K to 200K tokens, enabling the model to handle more complex agentic tasks. Superior coding performance: The model achieves higher scores on code benchmarks and demonstrates better real-world performance in applications such as Claude Code、Cline、Roo Code and Kilo Code, including improvements in generating visually polished front-end pages. Advanced reasoning: GLM-4.6 shows a clear improvement in reasoning performance and supports tool use during inference, leading to stronger overall capability. More capable agents: GLM-4.6 exhibits stronger performance in tool using and search-based agents, and integrates more effectively within agent frameworks. Refined writing: Better aligns with human preferences in style and readability, and performs more naturally in role-playing scenarios.

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual question answering, bridging the gap between language generation and visual reasoning. Pre-trained on a massive dataset of image-text pairs, it performs well in complex, high-accuracy image analysis.  Its ability to integrate visual understanding with language processing makes it an ideal solution for industries requiring comprehensive visual-linguistic AI applications, such as content creation, AI-driven customer service, and research.

FLUX.1 [schnell] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions. This model offers cutting-edge output quality and competitive prompt following, matching the performance of closed source alternatives. Trained using latent adversarial diffusion distillation, FLUX.1 [schnell] can generate high-quality images in only 1 to 4 steps. 

Enhanced industrial design and geometric reasoning, improved character consistency, reduced offset issues, and integrated LoRA capabilities

Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It addresses the limitations of previous approaches by decoupling visual encoding into separate pathways, while still utilizing a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder’s roles in understanding and generation, but also enhances the framework’s flexibility. Janus-Pro surpasses previous unified model and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus-Pro make it a strong candidate for next-generation unified multimodal models.

Bria Erase Foreground precisely removes main subjects or foreground objects from images. Built entirely on licensed data, it is safe and optimized for professional and commercial use.

A sentence similarity model that can be used for various NLP tasks such as text classification, sentiment analysis, named entity recognition, question answering, and more. It utilizes the CoSENT architecture, which consists of a transformer encoder and a pooling module, to encode input texts into vectors that capture their semantic meaning. The model was trained on the nli_zh dataset and achieved high performance on various benchmark datasets.

The new top-tier image model from Black Forest Labs, significantly pushing image quality and editing consistency

An all-round image generation model that supports joint text–image reasoning, multi-image creative fusion, commercial-grade consistency, aesthetic style transfer, and precise control of framing and lighting, significantly enhancing consistency, controllability, and expressiveness in image generation.

The llama-nemotron-embed-vl-1b-v2 is a high-performance multimodal embedding model designed to transform text queries and document images into dense vector representations for advanced retrieval systems. It excels at understanding complex visual content like charts, tables, and infographics.

We introduce DeepSeek-R1, which incorporates cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks. 

The Phi-3-Medium-4K-Instruct is a powerful and lightweight language model with 14 billion parameters, trained on high-quality data to excel in instruction following and safety measures. It demonstrates exceptional performance across benchmarks, including common sense, language understanding, and logical reasoning, outperforming models of similar size.

New model named Chatterbox by Resemble AI's first production-grade open source TTS model. Licensed under MIT, Chatterbox has been benchmarked against leading closed-source systems like ElevenLabs, and is consistently preferred in side-by-side evaluations.  Whether you're working on memes, videos, games, or AI agents, Chatterbox brings your content to life. It's also the first open source TTS model to support emotion exaggeration control, a powerful feature that makes your voices stand out.

Wan2.6 text to image, Upgraded visual quality, aesthetics, and instruction-following deliver precise style control, realistic portraits, long-text understanding, and broad historical/cultural IP coverage, enabling high-quality, highly expressive visual generation.

HiggsAudioV2.5 is a high-quality neural text-to-speech (TTS) model designed for natural-sounding voice generation across a wide range of use cases. It focuses on clarity, stable prosody, and consistent pacing, making it suitable for both short prompts and longer narration.

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. It was trained on 680k hours of labelled data and demonstrated a strong ability to generalise to many datasets and domains without fine-tuning. Whisper checks pens are available in five configurations of varying model sizes, including a smallest configuration trained on English-only data and a largest configuration trained on multilingual data. This one is English-only.

Llama Guard 3 is a Llama-3.1-8B pretrained model, fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification). It acts as an LLM – it generates text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated.

The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is a instruct fine-tuned version of the Mistral-7B-v0.2 generative text model using a variety of publicly available conversation datasets.

Anthropic's mid-size model with superior intelligence for high-volume uses in coding, in-depth research, agents, & more.

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and currently offer three different sizes of models, including GTE-large, GTE-base, and GTE-small. The GTE models are trained on a large-scale corpus of relevance text pairs, covering a wide range of domains and scenarios. This enables the GTE models to be applied to various downstream tasks of text embeddings, including information retrieval, semantic textual similarity, text reranking, etc.

Turn any image into a video. Intelligent shot scheduling supports multi-shot storytelling, generating multi-shot narrative videos with consistent subjects, scenes, and atmosphere

ByteDance's Seedance 1.5 Pro is a professional video model using V2A native generation for integrated, synced audio-visual output, enhancing efficiency of professional video creation.

Real-time AI video generation from text, images, and audio. Supports up to 1080p at 48 FPS with built-in audio generation, draft mode for 4x faster previews, and prompt upsampling.

A LLM-based embedding model with in-context learning capabilities that achieves SOTA performance on BEIR and AIR-Bench. It leverages few-shot examples to enhance task performance.

Veo 3.1 is the latest text-to-video model from Google that generates high-fidelity, cinematic videos with synchronized audio from a simple text prompt. It excels at creating realistic and imaginative scenes with a deep understanding of natural language and visual dynamics.

The latest image model, delivering better editing consistency, improved multi-image fusion, finer detail control, natural small text and faces, and harmonious, aesthetic visuals.

FIBO is an open-source, JSON-native text-to-image model trained on detailed structured descriptions (over 1,000+ words per image), providing fine-grained control over light, composition, and camera parameters.

Phi-4-reasoning-plus is a state-of-the-art open-weight reasoning model finetuned from Phi-4 using supervised fine-tuning on a dataset of chain-of-thought traces and reinforcement learning. The supervised fine-tuning dataset includes a blend of synthetic prompts and high-quality filtered data from public domain websites, focused on math, science, and coding skills as well as alignment data for safety and Responsible AI. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning. Phi-4-reasoning-plus has been trained additionally with Reinforcement Learning, hence, it has higher accuracy but generates on average 50% more tokens, thus having higher latency.

This is the instruction fine-tuned version of Mixtral-8x22B - the latest and largest mixture of experts large language model (LLM) from Mistral AI. This state of the art machine learning model uses a mixture 8 of experts (MoE) 22b models. During inference 2 experts are selected. This architecture allows large models to be fast and cheap at inference.

Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without fine-tuning. The primary intended users of these models are AI researchers studying robustness, generalisation, and capabilities of the current model.

Anthropic’s most powerful model yet and the state-of-the-art coding model. It delivers sustained performance on long-running tasks that require focused effort and thousands of steps, significantly expanding what AI agents can solve. Claude Opus 4 is ideal for powering frontier agent products and features.

Automatically identify and segment foreground objects across video frames and generate a mask. No prompts, just a video.

DeepSeek-V3.1 is post-trained on the top of DeepSeek-V3.1-Base, which is built upon the original V3 base checkpoint through a two-phase long context extension approach, following the methodology outlined in the original DeepSeek-V3 report. We have expanded our dataset by collecting additional long documents and substantially extending both training phases. The 32K extension phase has been increased 10-fold to 630B tokens, while the 128K extension phase has been extended by 3.3x to 209B tokens. Additionally, DeepSeek-V3.1 is trained using the UE8M0 FP8 scale data format to ensure compatibility with microscaling data formats.

MiniMax-M2 is a Mini model built for Max coding & agentic workflows with just 10 billion activated parameters

A Mythomax/MLewd_13B-style merge of selected 70B models  A multi-model merge of several  LLaMA2 70B finetunes for roleplaying and creative work. The goal was to create a model that combines creativity with intelligence for an enhanced experience.

Phi-4 is a model built upon a blend of synthetic datasets, data from filtered public domain websites, and acquired academic books and Q&A datasets. The goal of this approach was to ensure that small capable models were trained with data focused on high quality and advanced reasoning.

WizardLM-2 7B is the smaller variant of Microsoft AI's latest Wizard model. It is the fastest and achieves comparable performance with existing 10x larger open-source leading models

Bria GenFill enables high-quality object addition or visual transformation. Trained exclusively on licensed data for safe and risk-free commercial use.

Gemini 2.5 Pro is Google's the most advanced thinking model, designed to tackle increasingly complex problems. Gemini 2.5 Pro leads common benchmarks by meaningful margins and showcases strong reasoning and code capabilities.  Gemini 2.5 models are thinking models, capable of reasoning through their thoughts before responding, resulting in enhanced performance and improved accuracy.  The Gemini 2.5 Pro model is now available on DeepInfra.

You can use cURL or any other http client to run inferences:

```bash
curl -X POST \
    -d '{"text": "The quick brown fox jumps over the lazy dog"}'  \
    -H "Authorization: bearer $DEEPINFRA_TOKEN"  \
    -H 'Content-Type: application/json'  \
    'https://api.deepinfra.com/v1/inference/Zyphra/Zonos-v0.1-hybrid'
```

which will give you back something similar to:

```json
{
  "audio": null,
  "input_character_length": 0,
  "output_format": "",
  "words": [
    {
      "end": 1.0,
      "start": 0.0,
      "text": "Hello"
    },
    {
      "end": 5.0,
      "start": 4.0,
      "text": "World"
    }
  ],
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0,
    "output_length": 0
  }
}

```


You can use our command-line tool [deepctl](/docs/advanced/deepctl) to run
inferences:

```bash
deepctl infer \
    -m 'Zyphra/Zonos-v0.1-hybrid'  \
    -i 'text=The quick brown fox jumps over the lazy dog'
```

which will give you back something similar to:

```json
{
  "audio": null,
  "input_character_length": 0,
  "output_format": "",
  "words": [
    {
      "end": 1.0,
      "start": 0.0,
      "text": "Hello"
    },
    {
      "end": 5.0,
      "start": 4.0,
      "text": "World"
    }
  ],
  "request_id": null,
  "inference_status": {
    "status": "unknown",
    "runtime_ms": 0,
    "cost": 0.0,
    "tokens_generated": 0,
    "tokens_input": 0,
    "output_length": 0
  }
}

```


DeepInfra supports custom voices.

## Create voice

The following creates a voice using the `curl` command.

```bash
curl -X POST "https://api.deepinfra.com/v1/voices/add" \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F "audio=@hello.wav" \
  -F "name=John Doe" \
  -F "description=John Doe's voice"
```

which will return something similar to

```json
{
  "user_id": "gh:10000000", 
  "voice_id": "abcd1234abcd1234abcd",
  "name": "John Doe",
  "description": "John Doe's voice",
  "created_at": 1723851387,
  "updated_at": 1723851387
}
```


We try to be eleventlabs python library compatible. Please reach out to feedback@deepinfra.com if you encounter any issues.
```python
from elevenlabs import ElevenLabs, play

# Initialize the ElevenLabs client with overridden api_key and base_url
client = ElevenLabs(api_key="$DEEPINFRA_TOKEN", base_url="https://api.deepinfra.com")

# Define the voice data
voice_name = "John Doe"
voice_description = "John Doe's voice"
audio_file_path = "test_audio.wav"

# Create the voice by cloning using the ElevenLabs client
cloned_voice = client.clone(
    name=voice_name,
    description=voice_description,
    files=[audio_file_path],
    labels="",
)

# Use the voice_id to generate speech
audio = client.generate(
    text="Hello, how are you?",
    voice=cloned_voice.voice_id,
    model="deepinfra/tts",
    output_format="wav",
)

play(audio)
```


The following creates a voice using the `axios` library in JavaScript.

```javascript
const axios = require('axios');
const FormData = require('form-data');
const fs = require('fs');

// Define the API endpoint
const url = "https://api.deepinfra.com/v1/voices/add";

// Create a FormData instance
const formData = new FormData();

// Append the audio file, name, and description to the form data
formData.append('files', fs.createReadStream('test_audio.wav'));
formData.append('name', 'John Doe');
formData.append('description', "John Doe's voice");

// Set the headers, including authorization and content type
const headers = {
    "Authorization": "Bearer $DEEPINFRA_TOKEN",
    ...formData.getHeaders()
};

// Send the POST request
axios.post(url, formData, { headers: headers })
    .then(response => {
        console.log("Voice created successfully!");
        console.log("Response:", response.data);
    })
    .catch(error => {
        console.error("Failed to create voice.");
        console.error("Status Code:", error.response.status);
        console.error("Response:", error.response.data);
    });
```

## Read voice

The following reads a voice using the `curl` command.

```bash
curl -X GET "https://api.deepinfra.com/v1/voices/abcd1234abcd1234abcd" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN"
```

which will return something similar to

```json
{
  "user_id": "gh:10000000", 
  "voice_id": "abcd1234abcd1234abcd",
  "name": "John Doe",
  "description": "John Doe's voice",
  "created_at": 1723851387,
  "updated_at": 1723851387
}
```


## Read voice

The following reads a voice using the `elevenlabs` library in Python. If you encounter any issues, please contact us at feedback@deepinfra.com.
```python
from elevenlabs import ElevenLabsClient

client = ElevenLabsClient(api_key="$DEEPINFRA_TOKEN", base_url="https://api.deepinfra.com")

voice = client.voices.get(voice_id="abcd1234abcd1234abcd")
print(voice)
```


## Read voice

The following reads a voice using JavaScript with the `fetch` API.

```javascript
const url = "https://api.deepinfra.com/v1/voices/abcd1234abcd1234abcd";
const headers = {
  "Content-Type": "application/json",
  "Authorization": "Bearer $DEEPINFRA_TOKEN"
};

fetch(url, { method: "GET", headers: headers })
  .then(response => response.json())
  .then(data => console.log(data))
  .catch(error => console.error('Error:', error));
```

## Update voice

The following updates a voice using the `curl` command.

```bash
curl -X POST "https://api.deepinfra.com/v1/voices/abcd1234abcd1234abcd/edit" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -F "name=Jane Doe" \
  -F "description=Jane Doe's voice"
```

which will return something similar to

```json
{
  "user_id": "gh:10000000", 
  "voice_id": "abcd1234abcd1234abcd",
  "name": "Jane Doe",
  "description": "Jane Doe's voice",
  "created_at": 1723851387,
  "updated_at": 1723851387
}
```


## Update voice
We support elevenlabs client for python. If you encounter any issues, please contact us at feedback@deepinfra.com.
```python
from elevenlabs import ElevenLabsClient

client = ElevenLabsClient(api_key="$DEEPINFRA_TOKEN", base_url="https://api.deepinfra.com")

client.voices.edit(voice_id="abcd1234abcd1234abcd", name="Jane Doe", description="Jane Doe's voice")
```


## Update voice

Update a voice using the `axios` library in JavaScript.

```javascript
const axios = require('axios');
const FormData = require('form-data');

const url = "https://api.deepinfra.com/v1/voices/abcd1234abcd1234abcd/edit";
const formData = new FormData();
formData.append('name', 'John Doe');
formData.append('description', "John Doe's voice");

const headers = {
    "Authorization": "Bearer $DEEPINFRA_TOKEN",
    ...formData.getHeaders()
};

axios.post(url, formData, { headers: headers })
    .then(response => {
        console.log("Voice updated successfully!");
        console.log("Response:", response.data);
    })
    .catch(error => {
        console.error("Failed to update voice.");
        console.error("Status Code:", error.response.status);
        console.error("Response:", error.response.data);
    });
```


## Delete voice

The following deletes a voice using the `curl` command.

```bash
curl -X DELETE "https://api.deepinfra.com/v1/voices/abcd1234abcd1234abcd" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN"
```

which will return 200 OK status code.


## Delete voice


```python
from elevenlabs import ElevenLabsClient

client = ElevenLabsClient(api_key="$DEEPINFRA_TOKEN", base_url="https://api.deepinfra.com")

client.voices.delete(voice_id="abcd1234abcd1234abcd")
```


## Delete voice

The following deletes a voice using the `fetch` API in JavaScript.

```javascript
const url = "https://api.deepinfra.com/v1/voices/abcd1234abcd1234abcd";
const headers = {
  "Content-Type": "application/json",
  "Authorization": `Bearer $DEEPINFRA_TOKEN`
};

fetch(url, {
  method: "DELETE",
  headers: headers
})
.then(response => {
  if (response.ok) {
    console.log("Voice deleted successfully.");
  } else {
    throw new Error(`Failed to delete voice: ${response.status}`);
  }
})
.catch(error => console.error(error));
```


## List voices
The following lists voices using the `curl` command.

```bash
curl -X GET "https://api.deepinfra.com/v1/voices" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
```

which will return something similar to

```json
{
  "voices": [
    {
      "user_id": "gh:10000000", 
      "voice_id": "abcd1234abcd1234abcd",
      "name": "John Doe",
      "description": "John Doe's voice",
      "created_at": 1723851387,
      "updated_at": 1723851387
    },
    {
      "user_id": "gh:10000000",
      "voice_id":"abcd1234abcd1234abc1",
      "name": "Jane Doe",
      "description": "Jane Doe's voice",
      "created_at": 1723680057,
      "updated_at": 1723680057
    }
  ]
}
```


## List voices

The following lists voices using the `elevenlabs` client library in Python. If you encounter any issues, please reach out to feedback@deepinfra.com.

```python
from elevenlabs import ElevenLabs

client = ElevenLabs(api_key="$DEEPINFRA_TOKEN", base_url="https://api.deepinfra.com")

response = client.voices.get_all()

for voice in response.voices:
    print(voice.voice_id)
```


## List voices

The following lists voices using the `fetch` API in JavaScript.

```javascript
const url = "https://api.deepinfra.com/v1/voices";
const headers = {
  "Content-Type": "application/json",
  "Authorization": `Bearer $DEEPINFRA_TOKEN`
};

fetch(url, {
  method: "GET",
  headers: headers
})
.then(response => {
  if (response.ok) {
    return response.json();
  } else {
    throw new Error(`Failed to list voices: ${response.status}`);
  }
})
.then(data => console.log(data))
.catch(error => console.error(error));
```


The DeepInfra OpenAI-compatible Speech API endpoint enables users to effortlessly convert text into speech audio. This document outlines how to integrate and utilize this endpoint to quickly create speech from text inputs, leveraging various audio output formats.

## Create speech

Use the following example of pythong code to generate an audio file from your text input:

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="https://api.deepinfra.com/v1/openai",
                api_key="$DEEPINFRA_TOKEN")

speech_file_path = Path(__file__).parent / "speech.mp3"
with client.audio.speech.with_streaming_response.create(
  model="Zyphra/Zonos-v0.1-hybrid",
  voice="random",
  input="The quick brown fox jumped over the lazy dog.",
  response_format="mp3",
) as response:
  response.stream_to_file(speech_file_path)
```

The API returns the generated audio file in the requested format (e.g., `mp3`, `pcm`). The example above saves the audio output directly as `speech.mp3`.

### Supported Audio Formats

| Format | Description                           | Recommended Usage                          |
|--------|---------------------------------------|--------------------------------------------|
| pcm    | Raw, uncompressed audio               | *Best for lowest latency streaming*        |
| mp3    | Compressed audio suitable for storage | General use and file distribution          |
| opus   | Compressed audio ideal for streaming  | Efficient streaming and real-time playback |
| flac   | Lossless audio compression            | High-quality archival storage              |
| wav    | Uncompressed audio                    | High-quality audio processing tasks        |

### Performance Recommendations

- **For lowest latency (fastest initial audio chunk):**
  - Use the `pcm` output format.

- **For general purposes:**
  - Use the `mp3` or `opus` formats for optimal balance between quality and file size.



The DeepInfra OpenAI-compatible Speech API endpoint enables users to effortlessly convert text into speech audio. This document outlines how to integrate and utilize this endpoint to quickly create speech from text inputs, leveraging various audio output formats.

## Create speech

Use the following example of js code to generate an audio file from your text input:

```javascript
import fs from "fs";
import path from "path";
import OpenAI from "openai";

const openai = new OpenAI(base_url="https://api.deepinfra.com/v1/openai",
                          api_key="$DEEPINFRA_TOKEN");

const speechFile = path.resolve("./speech.mp3");

async function main() {
  const mp3 = await openai.audio.speech.create({
    model: "Zyphra/Zonos-v0.1-hybrid",
    voice: "random",
    input: "The quick brown fox jumped over the lazy dog.",
    response_format: "mp3",
  });
  console.log(speechFile);
  const buffer = Buffer.from(await mp3.arrayBuffer());
  await fs.promises.writeFile(speechFile, buffer);
}
main();
```

The API returns the generated audio file in the requested format (e.g., `mp3`, `pcm`). The example above saves the audio output directly as `speech.mp3`.

### Supported Audio Formats

| Format | Description                           | Recommended Usage                          |
|--------|---------------------------------------|--------------------------------------------|
| pcm    | Raw, uncompressed audio               | *Best for lowest latency streaming*        |
| mp3    | Compressed audio suitable for storage | General use and file distribution          |
| opus   | Compressed audio ideal for streaming  | Efficient streaming and real-time playback |
| flac   | Lossless audio compression            | High-quality archival storage              |
| wav    | Uncompressed audio                    | High-quality audio processing tasks        |

### Performance Recommendations

- **For lowest latency (fastest initial audio chunk):**
  - Use the `pcm` output format.

- **For general purposes:**
  - Use the `mp3` or `opus` formats for optimal balance between quality and file size.



The DeepInfra OpenAI-compatible Speech API endpoint enables users to effortlessly convert text into speech audio. This document outlines how to integrate and utilize this endpoint to quickly create speech from text inputs, leveraging various audio output formats.

## Create speech

Use the following example `curl` request to generate an audio file from your text input:

```bash
curl https://api.deepinfra.com/v1/openai/audio/speech \
  -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Zyphra/Zonos-v0.1-hybrid",
    "input": "The quick brown fox jumped over the lazy dog.",
    "voice": "random",
    "response_format": "mp3"
  }' \
  --output speech.mp3
```

The API returns the generated audio file in the requested format (e.g., `mp3`, `pcm`). The example above saves the audio output directly as `speech.mp3`.

### Supported Audio Formats

| Format | Description                           | Recommended Usage                          |
|--------|---------------------------------------|--------------------------------------------|
| pcm    | Raw, uncompressed audio               | *Best for lowest latency streaming*        |
| mp3    | Compressed audio suitable for storage | General use and file distribution          |
| opus   | Compressed audio ideal for streaming  | Efficient streaming and real-time playback |
| flac   | Lossless audio compression            | High-quality archival storage              |
| wav    | Uncompressed audio                    | High-quality audio processing tasks        |

### Performance Recommendations

- **For lowest latency (fastest initial audio chunk):**
  - Use the `pcm` output format.

- **For general purposes:**
  - Use the `mp3` or `opus` formats for optimal balance between quality and file size.



The DeepInfra ElevenLabs-compatible Speech API endpoint allows users to seamlessly convert text inputs into high-quality speech audio. This document provides guidance on integrating and using this endpoint to generate realistic audio files efficiently.

Use the following examples of `py` request to generate an audio file from your text input:

## Create Speech (non-streaming)

```bash
from elevenlabs import ElevenLabs

client = ElevenLabs(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/",
)
client.text_to_speech.convert(
    voice_id="random",
    output_format="mp3",
    text="The quick brown fox jumped over the lazy dog.",
    model_id="Zyphra/Zonos-v0.1-hybrid",
)
```

## Create Speech with Streaming

```bash
from elevenlabs import ElevenLabs

client = ElevenLabs(
    api_key="$DEEPINFRA_TOKEN",
)
client.text_to_speech.convert_as_stream(
    voice_id="random",
    output_format="pcm",
    text="The quick brown fox jumped over the lazy dog.",
    model_id="Zyphra/Zonos-v0.1-hybrid",
)
```

The API returns the generated audio in the requested format, such as `mp3` or `pcm`. The example above saves the audio directly as `speech.mp3`.

### Supported Audio Formats

| Format | Description                           | Recommended Usage                          |
|--------|---------------------------------------|--------------------------------------------|
| pcm    | Raw, uncompressed audio               | *Lowest latency streaming scenarios*       |
| mp3    | Compressed audio suitable for storage | General use and easy file sharing          |
| opus   | Compressed audio ideal for streaming  | Real-time audio streaming applications     |
| flac   | Lossless audio compression            | Archival storage and high-fidelity needs   |
| wav    | Uncompressed audio                    | Audio processing and editing tasks         |

### Performance Recommendations

- **Lowest latency:** Using the `pcm` output format for streaming applications is **HIGHLY RECOMMENDED**.
- **General use:** Use `mp3` or `opus` formats for the best trade-off between quality and file size.



The DeepInfra ElevenLabs-compatible Speech API endpoint allows users to seamlessly convert text inputs into high-quality speech audio. This document provides guidance on integrating and using this endpoint to generate realistic audio files efficiently.

Use the following examples of `js` request to generate an audio file from your text input:

## Create Speech (non-streaming)

```bash
import { ElevenLabsClient } from "elevenlabs";

const client = new ElevenLabsClient({ apiKey: "$DEEPINFRA_TOKEN", base_url: "https://api.deepinfra.com/" });
await client.textToSpeech.convert("random", {
    output_format: "mp3",
    text: "The quick brown fox jumped over the lazy dog.",
    model_id: "Zyphra/Zonos-v0.1-hybrid"
});
```

## Create Speech with Streaming

```bash
import { ElevenLabsClient } from "elevenlabs";

const client = new ElevenLabsClient({ apiKey: "$DEEPINFRA_TOKEN" });
await client.textToSpeech.convert("random", {
    output_format: "pcm",
    text: "The quick brown fox jumped over the lazy dog.",
    model_id: "Zyphra/Zonos-v0.1-hybrid"
});
```

The API returns the generated audio in the requested format, such as `mp3` or `pcm`. The example above saves the audio directly as `speech.mp3`.

### Supported Audio Formats

| Format | Description                           | Recommended Usage                          |
|--------|---------------------------------------|--------------------------------------------|
| pcm    | Raw, uncompressed audio               | *Lowest latency streaming scenarios*       |
| mp3    | Compressed audio suitable for storage | General use and easy file sharing          |
| opus   | Compressed audio ideal for streaming  | Real-time audio streaming applications     |
| flac   | Lossless audio compression            | Archival storage and high-fidelity needs   |
| wav    | Uncompressed audio                    | Audio processing and editing tasks         |

### Performance Recommendations

- **Lowest latency:** Using the `pcm` output format for streaming applications is **HIGHLY RECOMMENDED**.
- **General use:** Use `mp3` or `opus` formats for the best trade-off between quality and file size.



The DeepInfra ElevenLabs-compatible Speech API endpoint allows users to seamlessly convert text inputs into high-quality speech audio. This document provides guidance on integrating and using this endpoint to generate realistic audio files efficiently.

Use the following examples of `curl` request to generate an audio file from your text input:

## Create Speech (non-streaming)

```bash
curl -X POST "https://api.deepinfra.com/v1/text-to-speech/random" \
     -H "xi-api-key: $DEEPINFRA_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
  "text": "The quick brown fox jumped over the lazy dog.",
  "model_id": "Zyphra/Zonos-v0.1-hybrid",
  "output_format": "mp3",
}' --output speech.mp3
```

## Create Speech with Streaming

```bash
curl -X POST "https://api.deepinfra.com/v1/text-to-speech/random/stream" \
     -H "xi-api-key: $DEEPINFRA_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
  "text": "The quick brown fox jumped over the lazy dog.",
  "model_id": "Zyphra/Zonos-v0.1-hybrid",
  "output_format": "pcm",
}' --output speech.pcm
```

The API returns the generated audio in the requested format, such as `mp3` or `pcm`. The example above saves the audio directly as `speech.mp3`.

### Supported Audio Formats

| Format | Description                           | Recommended Usage                          |
|--------|---------------------------------------|--------------------------------------------|
| pcm    | Raw, uncompressed audio               | *Lowest latency streaming scenarios*       |
| mp3    | Compressed audio suitable for storage | General use and easy file sharing          |
| opus   | Compressed audio ideal for streaming  | Real-time audio streaming applications     |
| flac   | Lossless audio compression            | Archival storage and high-fidelity needs   |
| wav    | Uncompressed audio                    | Audio processing and editing tasks         |

### Performance Recommendations

- **Lowest latency:** Using the `pcm` output format for streaming applications is **HIGHLY RECOMMENDED**.
- **General use:** Use `mp3` or `opus` formats for the best trade-off between quality and file size.



The service tier used for processing the request. When set to 'priority', the request will be processed with higher priority (only applies to models that support it).

service_tier

text

preset_voice

Voice ID to use for the speech. Either preset_voice or voice_id should be provided

Zonos-v0.1-hybrid

Create Voice HTTP/cURL API

Create voice

Input fields

Input Schema

Output Schema