Voxtral Small is an enhancement of Mistral Small 3, incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.
Learn more about Voxtral in our blog post here.
Voxtral builds upon Mistral Small 3 with powerful audio understanding capabilities.
Average word error rate (WER) over the FLEURS, Mozilla Common Voice and Multilingual LibriSpeech benchmarks:
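For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a minimal illustration of that definition. The function name `word_error_rate` and the lowercase, whitespace-based tokenization are assumptions made for this example; it is not the evaluation code behind the reported numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Before processing reference word i, prev[j] is the edit distance
    # between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Real benchmark pipelines usually normalize punctuation and casing per language before scoring, so numbers from this sketch are only comparable to themselves.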
The model can be used with the following frameworks:
vllm (recommended): See here
Transformers 🤗: See here
Notes:
Use temperature=0.2 and top_p=0.95 for chat completion (e.g. Audio Understanding) and temperature=0.0 for transcription.
We recommend using this model with vLLM.
Make sure to install vllm from "main". We recommend using uv:
uv pip install -U "vllm[audio]" --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
Doing so should automatically install mistral_common >= 1.8.1.
To check:
python -c "import mistral_common; print(mistral_common.__version__)"
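If you would rather fail fast in a script than read the version by eye, the check can be automated. The sketch below uses only the standard library; the helper names `parse_release` and `meets_minimum` are hypothetical, and the parser deliberately ignores pre-release and post-release suffixes, so a release candidate of 1.8.1 would pass. The `packaging` library's `Version` class is the more robust choice if it is available.

```python
from importlib.metadata import PackageNotFoundError, version

MINIMUM = (1, 8, 1)  # minimum mistral_common version stated above


def parse_release(version_string: str) -> tuple:
    """Turn '1.8.1' or '1.8.1.post1' into a tuple of leading integers."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'post1'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version_string: str, minimum: tuple = MINIMUM) -> bool:
    """Compare the numeric release components against the minimum."""
    return parse_release(version_string) >= minimum


if __name__ == "__main__":
    try:
        installed = version("mistral_common")
    except PackageNotFoundError:
        print("mistral_common is not installed")
    else:
        print(installed, "OK" if meets_minimum(installed) else "too old")
```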
You can test that your vLLM setup works as expected by cloning the vLLM repo:
git clone https://github.com/vllm-project/vllm && cd vllm
and then running:
python examples/offline_inference/audio_language.py --num-audios 2 --model-type voxtral
We recommend that you use Voxtral-Small-24B-2507 in a server/client setting.
vllm serve mistralai/Voxtral-Small-24B-2507 --tokenizer_mode mistral --config_format mistral --load_format mistral --tensor-parallel-size 2 --tool-call-parser mistral --enable-auto-tool-choice
Note: Running Voxtral-Small-24B-2507 on GPU requires ~55 GB of GPU RAM in bf16 or fp16.
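As a rough sanity check on the ~55 GB figure, the arithmetic below estimates the memory taken by the weights alone. The parameter count of 24e9 is an assumption read off the "24B" in the model name, and attributing the remainder to activations, KV cache and runtime overhead is a guess rather than a measured breakdown. The even split across two GPUs is likewise only an approximation.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    params = 24e9  # assumed from the "24B" in the model name
    weights = weight_memory_gb(params)  # bf16 and fp16 both use 2 bytes per parameter
    total = 55  # approximate total quoted in the note above
    print(f"weights alone: ~{weights:.0f} GB")
    print(f"left for everything else: ~{total - weights:.0f} GB")
    print(f"per GPU with --tensor-parallel-size 2: ~{total / 2:.1f} GB")
```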
Leverage the audio capabilities of Voxtral-Small-24B-2507 to chat.
Make sure that your client has mistral-common with audio installed:
pip install --upgrade "mistral_common[audio]"
from mistral_common.protocol.instruct.messages import TextChunk, AudioChunk, UserMessage, AssistantMessage, RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
from openai import OpenAI
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
bcn_file = hf_hub_download("patrickvonplaten/audio_samples", "bcn_weather.mp3", repo_type="dataset")
def file_to_chunk(file: str) -> AudioChunk:
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)
text_chunk = TextChunk(text="Which speaker is more inspiring? Why? How are they different from each other? Answer in French.")
user_msg = UserMessage(content=[file_to_chunk(obama_file), file_to_chunk(bcn_file), text_chunk]).to_openai()
print(30 * "=" + "USER 1" + 30 * "=")
print(text_chunk.text)
print("\n\n")
response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content
print(30 * "=" + "BOT 1" + 30 * "=")
print(content)
print("\n\n")
# The model could give the following answer:
# ```L'orateur le plus inspirant est le président.
# Il est plus inspirant parce qu'il parle de ses expériences personnelles
# et de son optimisme pour l'avenir du pays.
# Il est différent de l'autre orateur car il ne parle pas de la météo,
# mais plutôt de ses interactions avec les gens et de son rôle en tant que président.```
messages = [
    user_msg,
    AssistantMessage(content=content).to_openai(),
    UserMessage(content="Ok, now please summarize the content of the first audio.").to_openai()
]
print(30 * "=" + "USER 2" + 30 * "=")
print(messages[-1]["content"])
print("\n\n")
response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.2,
    top_p=0.95,
)
content = response.choices[0].message.content
print(30 * "=" + "BOT 2" + 30 * "=")
print(content)
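The `to_openai()` calls in the example above turn the message objects into plain dictionaries for the OpenAI-compatible endpoint. As a rough illustration of what such a request can look like, the sketch below builds a user message by hand. The `input_audio` content-part layout follows the OpenAI chat completions convention and is an assumption here; it may differ in detail from what mistral_common emits. The helper names `audio_part` and `build_user_message` are hypothetical.

```python
import base64


def audio_part(audio_bytes: bytes, audio_format: str) -> dict:
    """Encode raw audio bytes as an OpenAI-style `input_audio` content part."""
    return {
        "type": "input_audio",
        "input_audio": {
            "data": base64.b64encode(audio_bytes).decode("ascii"),
            "format": audio_format,
        },
    }


def build_user_message(audio_clips: list, text: str) -> dict:
    """Audio parts first, then the text instruction, as in the example above."""
    content = [audio_part(data, fmt) for data, fmt in audio_clips]
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


if __name__ == "__main__":
    fake_clip = b"\x00\x01\x02"  # placeholder bytes, not real audio
    message = build_user_message([(fake_clip, "mp3")], "Summarize this clip.")
    print(message["content"][0]["type"], "|", message["content"][-1]["text"])
```

Using the mistral_common helpers remains the safer route, since they also handle loading and validating the audio file.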
Voxtral-Small-24B-2507 has powerful transcription capabilities!
Make sure that your client has mistral-common with audio installed:
pip install --upgrade "mistral_common[audio]"
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.messages import RawAudio
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
from openai import OpenAI
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
obama_file = hf_hub_download("patrickvonplaten/audio_samples", "obama.mp3", repo_type="dataset")
audio = Audio.from_file(obama_file, strict=False)
audio = RawAudio.from_audio(audio)
req = TranscriptionRequest(model=model, audio=audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))
response = client.audio.transcriptions.create(**req)
print(response)
Voxtral has some experimental function calling support. You can try as shown below.
Make sure that your client has mistral-common with audio installed:
pip install --upgrade "mistral_common[audio]"
from mistral_common.protocol.instruct.messages import AudioChunk, UserMessage, TextChunk
from mistral_common.protocol.transcription.request import TranscriptionRequest
from mistral_common.protocol.instruct.tool_calls import Function, Tool
from mistral_common.audio import Audio
from huggingface_hub import hf_hub_download
from openai import OpenAI
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
tool = Tool(
    function=Function(
        name="get_current_weather",
        description="Get the current weather",
        parameters={
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "format": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit to use. Infer this from the user's location.",
                },
            },
            "required": ["location", "format"],
        },
    )
)
tools = [tool.to_openai()]
weather_like = hf_hub_download("patrickvonplaten/audio_samples", "fn_calling.wav", repo_type="dataset")
def file_to_chunk(file: str) -> AudioChunk:
    audio = Audio.from_file(file, strict=False)
    return AudioChunk.from_audio(audio)
audio_chunk = file_to_chunk(weather_like)
print(30 * "=" + "Transcription" + 30 * "=")
req = TranscriptionRequest(model=model, audio=audio_chunk.input_audio, language="en", temperature=0.0).to_openai(exclude=("top_p", "seed"))
response = client.audio.transcriptions.create(**req)
print(response.text) # How is the weather in Madrid at the moment?
print("\n")
print(30 * "=" + "Function calling" + 30 * "=")
audio_chunk = file_to_chunk(weather_like)
user_msg = UserMessage(content=[audio_chunk]).to_openai()
response = client.chat.completions.create(
    model=model,
    messages=[user_msg],
    temperature=0.2,
    top_p=0.95,
    tools=[tool.to_openai()]
)
print(30 * "=" + "BOT 1" + 30 * "=")
print(response.choices[0].message.tool_calls)
print("\n\n")
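The example above stops at printing the tool calls. A typical next step is to run the requested function locally and send its result back as a `tool` message. The sketch below shows one way to do that dispatch. `get_current_weather` is a hypothetical stub, `TOOL_REGISTRY` and `run_tool_calls` are names invented for this example, and the fake tool call stands in for `response.choices[0].message.tool_calls` so that the snippet runs without a server.

```python
import json
from types import SimpleNamespace


def get_current_weather(location: str, format: str) -> str:
    """Hypothetical stub; a real implementation would query a weather service."""
    return f"(stub) weather for {location} in {format}"


# Maps the tool names advertised to the model onto local callables.
TOOL_REGISTRY = {"get_current_weather": get_current_weather}


def run_tool_calls(tool_calls) -> list:
    """Execute each tool call and return OpenAI-style `tool` messages."""
    results = []
    for call in tool_calls or []:
        handler = TOOL_REGISTRY[call.function.name]
        # The OpenAI client exposes the arguments as a JSON string.
        arguments = json.loads(call.function.arguments)
        results.append(
            {"role": "tool", "tool_call_id": call.id, "content": handler(**arguments)}
        )
    return results


if __name__ == "__main__":
    fake_call = SimpleNamespace(
        id="call_0",
        function=SimpleNamespace(
            name="get_current_weather",
            arguments='{"location": "Madrid", "format": "celsius"}',
        ),
    )
    print(run_tool_calls([fake_call]))
```

Since function calling is described as experimental, it is worth validating the parsed arguments before calling the handler in anything beyond a demo.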
Voxtral is supported in Transformers natively!
Install Transformers from source:
pip install git+https://github.com/huggingface/transformers
Make sure to have mistral-common >= 1.8.1 installed with audio dependencies:
pip install --upgrade "mistral-common[audio]"
# Audio instruct: two audio clips followed by a text question.
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch
device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/mary_had_lamb.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "What sport and what nursery rhyme are referenced?"},
        ],
    }
]
inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
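For decoder-style models such as this one, `generate()` returns each prompt's token ids followed by the newly generated ones, which is why the example slices from `inputs.input_ids.shape[1]` before decoding. The sketch below shows the same idea with plain Python lists; `strip_prompt` and the token ids are made up for illustration.

```python
def strip_prompt(sequences: list, prompt_length: int) -> list:
    """Drop the first `prompt_length` token ids from every generated sequence."""
    return [sequence[prompt_length:] for sequence in sequences]


if __name__ == "__main__":
    prompt = [101, 7, 8, 9]          # made-up token ids for the prompt
    generated = [prompt + [42, 43]]  # prompt followed by two new tokens
    print(strip_prompt(generated, len(prompt)))  # [[42, 43]]
```

Without this step the decoded text would start with the rendered prompt rather than the model's answer.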
# Multi-turn: a follow-up question with a new audio clip.
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch
device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
            },
            {"type": "text", "text": "Describe briefly what you can hear."},
        ],
    },
    {
        "role": "assistant",
        "content": "The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "Ok, now compare this new audio with the previous one."},
        ],
    },
]
inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
# Text only: no audio input.
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch
device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Why should AI models be open-sourced?",
            },
        ],
    }
]
inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
# Audio only: no text instruction.
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch
device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
        ],
    }
]
inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
# Batched inference: two independent conversations in one call.
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch
device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)
conversations = [
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
                },
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
                },
                {
                    "type": "text",
                    "text": "Who's speaking in the speech and what city's weather is being discussed?",
                },
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
                },
                {"type": "text", "text": "What can you tell me about this audio?"},
            ],
        }
    ],
]
inputs = processor.apply_chat_template(conversations)
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
# Transcription: plain speech-to-text for a single audio file.
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch
device = "cuda"
repo_id = "mistralai/Voxtral-Small-24B-2507"
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)
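# Older Transformers releases spell this method `apply_transcrition_request`; use that name if this one is missing.
inputs = processor.apply_transcription_request(language="en", audio="https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3", model_id=repo_id)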
inputs = inputs.to(device, dtype=torch.bfloat16)
outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)