Model Details Meta developed and released the Meta Llama 3 family of large language models (LLMs), a collection of pretrained and instruction tuned generative text models in 8 and 70B sizes.

Meta-Llama-3-70B-Instruct

Meta-Llama-3-8B-Instruct

This is the instruction fine-tuned version of Mixtral-8x22B - the latest and largest mixture of experts large language model (LLM) from Mistral AI. This state of the art machine learning model uses a mixture 8 of experts (MoE) 22b models. During inference 2 experts are selected. This architecture allows large models to be fast and cheap at inference.

Mixtral-8x22B-Instruct-v0.1

WizardLM-2 8x22B is Microsoft AI's most advanced Wizard model. It demonstrates highly competitive performance compared to those leading proprietary models.

WizardLM-2-8x22B

WizardLM-2 7B is the smaller variant of Microsoft AI's latest Wizard model. It is the fastest and achieves comparable performance with existing 10x larger open-source leading models

WizardLM-2-7B

Gemma is an open-source model designed by Google. This is Gemma 1.1 7B (IT), an update over the original instruction-tuned Gemma release. Gemma 1.1 was trained using a novel RLHF method, leading to substantial gains on quality, coding capabilities, factuality, instruction following and multi-turn conversation quality.

gemma-1.1-7b-it

Mixtral is mixture of expert large language model (LLM) from Mistral AI. This is state of the art machine learning model using a mixture 8 of experts (MoE) 7b models. During inference 2 expers are selected. This architecture allows large models to be fast and cheap at inference. The Mixtral-8x7B outperforms Llama 2 70B on most benchmarks.

Mixtral-8x7B-Instruct-v0.1

The Mistral-7B-Instruct-v0.2 Large Language Model (LLM) is a instruct fine-tuned version of the Mistral-7B-v0.2 generative text model using a variety of publicly available conversation datasets.

Mistral-7B-Instruct-v0.2

LLaMa 2 is a collections of LLMs trained by Meta. This is the 70B chat optimized version. This endpoint has per token pricing.

Llama-2-70b-chat-hf

The Dolphin 2.6 Mixtral 8x7b model is a finetuned version of the Mixtral-8x7b model, trained on a variety of data including coding data, for 3 days on 4 A100 GPUs. It is uncensored and requires trust_remote_code. The model is very obedient and good at coding, but not DPO tuned. The dataset has been filtered for alignment and bias. The model is compliant with user requests and can be used for various purposes such as generating code or engaging in general chat.

dolphin-2.6-mixtral-8x7b

A Mythomax/MLewd_13B-style merge of selected 70B models  A multi-model merge of several  LLaMA2 70B finetunes for roleplaying and creative work. The goal was to create a model that combines creativity with intelligence for an enhanced experience.

lzlv_70b_fp16_hf

OpenChat is a library of open-source language models that have been fine-tuned with C-RLFT, a strategy inspired by offline reinforcement learning. These models can learn from mixed-quality data without preference labels and have achieved exceptional performance comparable to ChatGPT. The developers of OpenChat are dedicated to creating a high-performance, commercially viable, open-source large language model and are continuously making progress towards this goal.

openchat_3.5

LLaVa is a multimodal model that supports vision and language models combined.

llava-1.5-7b-hf

Latest version of the Airoboros model fine-tunned version of llama-2-70b using the Airoboros dataset. This model is currently running jondurbin/airoboros-l2-70b-2.2.1 

airoboros-70b

SDXL consists of an ensemble of experts pipeline for latent diffusion: In a first step, the base model is used to generate (noisy) latents, which are then further processed with a refinement model (available here: https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/) specialized for the final denoising steps. Note that the base model can be used as a standalone module.

sdxl

Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 7B fine-tuned model, optimized for dialogue use cases and converted for the Hugging Face Transformers format. 

Llama-2-7b-chat-hf

Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.

whisper-large

BGE embedding is a general Embedding Model. It is pre-trained using retromae and trained on large-scale pair data using contrastive learning. Note that the goal of pre-training is to reconstruct the text, and the pre-trained model cannot be used for similarity calculation directly, it needs to be fine-tuned

bge-large-en-v1.5

mistralai/Mixtral-8x7B-Instruct-v0.1

Mixtral-8x22B is the latest and largest mixture of expert large language model (LLM) from Mistral AI. This is state of the art machine learning model using a mixture 8 of experts (MoE) 22b models. During inference 2 expers are selected. This architecture allows large models to be fast and cheap at inference.  This model is not instruction tuned. 

mistralai/Mixtral-8x22B-v0.1

microsoft/WizardLM-2-8x22B

cognitivecomputations/dolphin-2.6-mixtral-8x7b

openchat/openchat_3.5

meta-llama/Meta-Llama-3-8B-Instruct

meta-llama/Llama-2-7b-chat-hf

The Mistral-7B-Instruct-v0.1 Large Language Model (LLM) is a instruct fine-tuned version of the Mistral-7B-v0.1 generative text model using a variety of publicly available conversation datasets.

mistralai/Mistral-7B-Instruct-v0.1

mistralai/Mistral-7B-Instruct-v0.2

microsoft/WizardLM-2-7B

google/gemma-1.1-7b-it

meta-llama/Llama-2-13b-chat-hf

Gryphe/MythoMax-L2-13b

Faster version of Gryphe/MythoMax-L2-13b running on multiple H100 cards in fp8 precision. Up to 160 tps. 

Gryphe/MythoMax-L2-13b-turbo

01-ai/Yi-34B-Chat

Code Llama is a state-of-the-art LLM capable of generating code, and natural language about code, from both code and natural language prompts. This particular instance is the 34b instruct variant

codellama/CodeLlama-34b-Instruct-hf

Phind-CodeLlama-34B-v2 is an open-source language model that has been fine-tuned on 1.5B tokens of high-quality programming-related data and achieved a pass@1 rate of 73.8% on HumanEval. It is multi-lingual and proficient in Python, C/C++, TypeScript, Java, and more. It has been trained on a proprietary dataset of instruction-answer pairs instead of code completion examples.  The model is instruction-tuned on the Alpaca/Vicuna format to be steerable and easy-to-use. It accepts the Alpaca/Vicuna instruction format and can generate one completion for each prompt.

Phind/Phind-CodeLlama-34B-v2

meta-llama/Meta-Llama-3-70B-Instruct

meta-llama/Llama-2-70b-chat-hf

deepinfra/airoboros-70b

lizpreciatior/lzlv_70b_fp16_hf

BAAI/bge-large-en-v1.5

BAAI/bge-base-en-v1.5

Text Embeddings by Weakly-Supervised Contrastive Pre-training. Model has 24 layers and 1024 out dim. 

intfloat/e5-large-v2

intfloat/e5-base-v2

The GTE models are trained by Alibaba DAMO Academy. They are mainly based on the BERT framework and currently offer three different sizes of models, including GTE-large, GTE-base, and GTE-small. The GTE models are trained on a large-scale corpus of relevance text pairs, covering a wide range of domains and scenarios. This enables the GTE models to be applied to various downstream tasks of text embeddings, including information retrieval, semantic textual similarity, text reranking, etc.

Model	Context	$ per 1M input tokens	$ per 1M output tokens
mixtral-8x7B-chat	32k	$0.24	$0.24
mixtral-8x22B	64k	$0.65	$0.65
wizardLM-2-8x22B	64k	$0.65	$0.65
Dolphin-2.6-mixtral-8x7b	32k	$0.24	$0.24

Model	Context	$ per 1M input tokens	$ per 1M output tokens
Llama-3-8B-Instruct	8k	$0.08	$0.08
OpenChat-3.5	8k	$0.07	$0.07
Llama-2-7b-chat	4k	$0.07	$0.07
Mistral-7B-v2	32k	$0.07	$0.07
WizardLM-2-7B	32k	$0.07	$0.07
Gemma-7b	8k	$0.07	$0.07

Model	Context	$ per 1M input tokens	$ per 1M output tokens
Llama-2-13b-chat	4k	$0.13	$0.13
MythoMax-L2-13b	4k	$0.13	$0.13
MythoMax-L2-13b-turbo	4k	$0.13	$0.13

Model	Context	$ per 1M input tokens	$ per 1M output tokens
Yi-34B-Chat	4k	$0.60	$0.60
CodeLlama-34b-Instruct	4k	$0.60	$0.60
Phind-CodeLlama-34B-v2	4k	$0.60	$0.60

Model	Context	$ per 1M input tokens	$ per 1M output tokens
Llama-3-70B-Instruct	8k	$0.59	$0.79
Llama-2-70b-chat	4k	$0.64	$0.80
Airoboros-70b	4k	$0.70	$0.90
Lzlv-70b	4k	$0.59	$0.79

Simple Pricing, Deep Infrastructure

Token Pricing

Mixtral 8x22b

Mixture of experts

7 or 8 billion parameters

13 billion parameters

34 billion parameters

70 billion parameters

Custom LLMs

Dedicated Instances and Clusters

Embeddings Pricing

Execution Time Pricing

$0.03

Hardware

Auto Scaling

Billing

GPU	Price
Nvidia A100 GPU	$2.00/GPU-hour
Nvidia H100 GPU	$4.00/GPU-hour

Model	Context	$ per 1M input tokens
bge-large-en-v1.5	512	$0.01
bge-base-en-v1.5	512	$0.005
e5-large-v2	512	$0.01
e5-base-v2	512	$0.005
gte-large	512	$0.01
gte-base	512	$0.005