

ByteDance/Seed-2.0-code (Partner)

Pricing (per 1M tokens): $0.50 input · $3.00 output · $0.10 cached

A coding model optimized for real-world development environments, with reliable tool use in common IDEs such as Claude Code. It delivers strong front-end performance and supports Skills.
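For a concrete sense of what the per-token rates above mean in practice, here is a minimal cost sketch. The `estimate_cost` helper is hypothetical, not part of any DeepInfra SDK; it only applies the listed Seed-2.0-code prices.

```python
# Hypothetical cost estimator using the listed Seed-2.0-code rates
# ($0.50 input / $3.00 output / $0.10 cached, all per 1M tokens).
PRICE_PER_M = {"in": 0.50, "out": 3.00, "cached": 0.10}

def estimate_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Return the estimated USD cost of one request.

    Cached tokens are the subset of input tokens served as cache hits;
    they are billed at the cheaper cached rate instead of the input rate.
    """
    uncached = input_tokens - cached_tokens
    return (
        uncached * PRICE_PER_M["in"]
        + output_tokens * PRICE_PER_M["out"]
        + cached_tokens * PRICE_PER_M["cached"]
    ) / 1_000_000

# 10k prompt tokens (8k of them cache hits) plus 2k completion tokens:
print(f"${estimate_cost(10_000, 2_000, cached_tokens=8_000):.6f}")  # → $0.007800
```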

Public · 256,000 context · JSON · Function calling · Multimodal

Model Information

Seed-2.0-Code is optimized for enterprise-grade coding scenarios. Building on the strong agentic and VLM capabilities of Seed 2.0, it further strengthens code generation and software engineering performance. It delivers particularly strong front-end results and is also specifically optimized for the multilingual coding needs commonly found in enterprise environments, making it well suited for integration with a wide range of AI coding tools.
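As a sketch of how such an integration might look, the snippet below builds a chat-completion payload in the OpenAI-compatible format that DeepInfra exposes. The endpoint URL and the actual request (left commented out) are assumptions of this sketch, not tested calls.

```python
# Sketch: a chat request payload for an OpenAI-compatible endpoint.
# The DeepInfra base URL below is an assumption of this example.
import json

payload = {
    "model": "ByteDance/Seed-2.0-code",
    "messages": [
        {"role": "system", "content": "You are a senior software engineer."},
        {"role": "user", "content": "Write a Python function that reverses a linked list."},
    ],
}

# To actually send it (requires a DeepInfra API token):
# import requests
# resp = requests.post(
#     "https://api.deepinfra.com/v1/openai/chat/completions",
#     headers={"Authorization": f"Bearer {DEEPINFRA_TOKEN}"},
#     json=payload,
# )

print(json.dumps(payload)[:40])
```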

New Features

  • Built on a powerful foundation model: Based on the Seed 2.0 foundation model, Seed-2.0-Code provides strong agentic capabilities, including support for skill invocation and custom tool use. It also offers robust VLM capabilities, enabling image-grounded tool use such as Browser Use.
  • Optimized for programming tasks: The model is specifically enhanced for working with a broad set of programming languages and developer tools, and performs strongly on long-horizon, multi-turn coding tasks.
  • Strong complex instruction following: It can reliably understand long and complex programming instructions and consistently deliver outputs that match user expectations.
  • Implicit caching support: Caching is enabled by default with no additional configuration required, making it suitable for agentic workflows and other complex, long-context scenarios. The model may automatically cache shared prefixes of requests; however, cache hits are not guaranteed. In general, a cache hit is only triggered once at least 1,024 tokens have accumulated. Cache-hit segments are billed at the list price for cached input tokens.
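The 1,024-token cache threshold can be illustrated with a rough check of whether a shared request prefix (e.g. a long system prompt with repository context) is long enough to be eligible for a cache hit. The 4-characters-per-token ratio used here is a crude heuristic, not the model's real tokenizer.

```python
# Rough illustration of the implicit-cache threshold: a shared request
# prefix only becomes cacheable once at least 1,024 tokens accumulate.
# The 4-chars-per-token ratio is a crude heuristic, not a real tokenizer.
CACHE_MIN_TOKENS = 1_024

def rough_token_count(text: str) -> int:
    return max(1, len(text) // 4)

def prefix_may_cache(shared_prefix: str) -> bool:
    """True if the shared prefix is long enough to be cache-eligible."""
    return rough_token_count(shared_prefix) >= CACHE_MIN_TOKENS

short_system = "You are a coding assistant."
long_system = "Repository context:\n" + "def handler():\n    ...\n" * 400

print(prefix_may_cache(short_system))  # → False: far below 1,024 tokens
print(prefix_may_cache(long_system))   # → True: prefix exceeds the threshold
```

In practice this is why agentic workflows that reuse a large, stable prefix across turns benefit most from implicit caching: every repeated segment past the threshold is billed at the cached rate instead of the input rate.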