Building a Voice Assistant with Whisper, LLM, and TTS
Published on 2024.09.20 by Askar Aitzhan

In this tutorial, we'll walk you through the process of creating a voice assistant using three powerful AI technologies:

  1. Whisper: For speech recognition
  2. LLM: For natural language processing and conversation
  3. TTS: For text-to-speech conversion

All three models are available on DeepInfra. We will use OpenAI's Python client to interact with the LLM and ElevenLabs' Python client to interact with TTS.

Prerequisites

Before we begin, make sure you have the following installed and set up:

Create a virtual environment

python3 -m venv .venv

Activate the environment

source .venv/bin/activate

Install required libraries (PortAudio is installed here with Homebrew; on other systems, use your package manager)

brew install portaudio
pip install openai elevenlabs pyaudio numpy deepinfra scipy requests

You'll also need a DeepInfra API key.

Step 1: Speech Recognition with Whisper

First, let's use Whisper to transcribe user speech:

import pyaudio
import wave
import numpy as np
import requests
import json
import io
from scipy.io import wavfile
from openai import OpenAI
from elevenlabs import ElevenLabs, play

DEEPINFRA_API_KEY = "YOUR_DEEPINFRA_TOKEN"
WHISPER_MODEL = "distil-whisper/distil-large-v3"

def record_audio(duration=5, sample_rate=16000):
    CHUNK = 1024
    FORMAT = pyaudio.paInt16
    CHANNELS = 1

    p = pyaudio.PyAudio()

    print("Recording...")
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=sample_rate,
                    input=True,
                    frames_per_buffer=CHUNK)

    frames = []

    for i in range(0, int(sample_rate / CHUNK * duration)):
        data = stream.read(CHUNK)
        frames.append(data)

    print("Recording complete.")

    stream.stop_stream()
    stream.close()
    p.terminate()

    # Convert to numpy array
    audio = np.frombuffer(b''.join(frames), dtype=np.int16)
    return audio

def transcribe_audio(audio, sample_rate=16000):
    # Convert numpy array to WAV file in memory
    # (sample_rate must match the rate used in record_audio)
    buffer = io.BytesIO()
    wavfile.write(buffer, sample_rate, audio.astype(np.int16))
    buffer.seek(0)

    # Prepare the request
    url = f'https://api.deepinfra.com/v1/inference/{WHISPER_MODEL}'
    headers = {
        "Authorization": f"bearer {DEEPINFRA_API_KEY}"
    }
    files = {
        'audio': ('audio.wav', buffer, 'audio/wav')
    }

    # Send the request
    response = requests.post(url, headers=headers, files=files)
    
    if response.status_code == 200:
        result = json.loads(response.text)
        return result['text']
    else:
        print(f"Error: {response.status_code}")
        print(response.text)
        return None
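
A note on the recording length: the loop in record_audio reads a whole number of chunks, so the captured audio is slightly shorter than the requested duration. The sketch below repeats that arithmetic with the tutorial's defaults; the helper name recorded_seconds is hypothetical and only used for illustration.

```python
def recorded_seconds(duration=5, sample_rate=16000, chunk=1024):
    """Return (number of chunks read, seconds of audio actually captured)."""
    # Same chunk count as the range() in record_audio
    num_chunks = int(sample_rate / chunk * duration)
    return num_chunks, num_chunks * chunk / sample_rate


num_chunks, seconds = recorded_seconds()
print(num_chunks, seconds)  # 78 4.992
```

With the defaults, 78 chunks of 1024 samples are read, which is about 8 ms short of five seconds. For a voice assistant this is usually negligible.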

Usage

audio = record_audio()
transcription = transcribe_audio(audio)
print(f"Transcription: {transcription}")
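
The code above imports wave but builds the in-memory WAV file with scipy. As a possible alternative, the standard library alone can wrap raw 16-bit PCM bytes (such as b''.join(frames) inside record_audio) in a WAV container. This is a minimal sketch, not part of the original tutorial; the function name pcm16_to_wav_bytes is an assumption made for illustration.

```python
import io
import wave


def pcm16_to_wav_bytes(pcm_bytes, sample_rate=16000, channels=1):
    """Wrap raw 16-bit PCM bytes in a WAV container held in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()


# One second of silence: 16000 samples, 2 bytes each
wav_bytes = pcm16_to_wav_bytes(b"\x00\x00" * 16000)
print(len(wav_bytes))
```

The returned bytes could be wrapped in io.BytesIO and passed in the 'audio' field of the upload request in place of the scipy-generated buffer.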

Step 2: Conversing with LLM using OpenAI Client

Now, let's use the OpenAI client to interact with an LLM:

openai_client = OpenAI(api_key=DEEPINFRA_API_KEY, base_url="https://api.deepinfra.com/v1/openai")

MODEL_DI = "meta-llama/Meta-Llama-3.1-70B-Instruct"
def chat_with_llm(user_input):
    response = openai_client.chat.completions.create(
        model=MODEL_DI,
        messages=[{"role": "user", "content": user_input}],
        max_tokens=1000,
    )
    return response.choices[0].message.content

Usage

llm_response = chat_with_llm(transcription)
print(f"LLM Response: {llm_response}")
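
Note that chat_with_llm sends only the current utterance, so the model has no memory of earlier turns. One way to carry context, sketched below under the assumption that a plain Python list of message dicts is enough, is to keep a running history and trim it to the most recent exchanges. The names build_messages, record_turn, and max_turns are hypothetical and not part of the OpenAI client.

```python
def build_messages(history, user_input, system_prompt=None, max_turns=10):
    """Return the message list for the next request, keeping recent turns only."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    # Each turn is one user message plus one assistant message
    recent = history[-2 * max_turns:] if max_turns > 0 else []
    messages.extend(recent)
    messages.append({"role": "user", "content": user_input})
    return messages


def record_turn(history, user_input, assistant_reply):
    """Append a completed exchange to the running history."""
    history.append({"role": "user", "content": user_input})
    history.append({"role": "assistant", "content": assistant_reply})


history = []
record_turn(history, "What is the capital of France?", "Paris.")
messages = build_messages(history, "And of Italy?")
print([m["role"] for m in messages])  # ['user', 'assistant', 'user']
```

The resulting list could be passed as the messages argument of chat.completions.create, with record_turn called after each reply.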

Step 3: Text-to-Speech with ElevenLabs

Finally, let's use the ElevenLabs client, pointed at DeepInfra's endpoint, to turn the LLM response into speech:

client = ElevenLabs(api_key=DEEPINFRA_API_KEY, base_url="https://api.deepinfra.com")

def text_to_speech(text):
    audio = client.generate(
        text=text,
        voice="luna",
        model="deepinfra/tts"
    )
    play(audio)

Usage

text_to_speech(llm_response)

Putting It All Together

Now, let's combine all these steps into a single voice assistant function:

def voice_assistant():
    while True:
        # Record and transcribe audio
        audio = record_audio()
        transcription = transcribe_audio(audio)
        if transcription:
            print(f"You said: {transcription}")
            # Chat with LLM
            llm_response = chat_with_llm(transcription)
            print(f"Assistant: {llm_response}")
            # Convert response to speech
            text_to_speech(llm_response)
        else:
            # transcribe_audio returns None on an API error; skip this turn
            print("Could not transcribe the audio.")
        # Ask if the user wants to continue
        if input("Continue? (y/n): ").strip().lower() != 'y':
            break

Run the voice assistant

voice_assistant()

On each turn, this voice assistant records five seconds of audio, transcribes it, sends the transcript to an LLM, and speaks the response. It repeats until the user chooses to stop.

Remember to replace YOUR_DEEPINFRA_TOKEN with your actual API key.
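
Rather than pasting the key into the source file, you may prefer to read it from the environment. The sketch below assumes an environment variable named DEEPINFRA_API_KEY; both that name and the load_api_key helper are illustrative choices, not something DeepInfra requires.

```python
import os


def load_api_key(env_var="DEEPINFRA_API_KEY"):
    """Read the API key from the environment, failing early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

You could then write DEEPINFRA_API_KEY = load_api_key() at the top of the script, which keeps the token out of version control.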

By combining Whisper for speech recognition, an LLM for conversation, and TTS for natural-sounding speech, you can build a voice assistant that understands and responds to a wide range of user queries.
