Building a Voice Assistant with Whisper, LLM, and TTS
Published on 2024.09.20 by Askar Aitzhan

In this tutorial, we'll walk you through the process of creating a voice assistant using three powerful AI technologies:

  1. Whisper: For speech recognition
  2. LLM: For natural language processing and conversation
  3. TTS: For text-to-speech conversion

All three models are available on DeepInfra. We will use OpenAI's Python client to interact with the LLM and ElevenLabs' Python client to interact with TTS.

Prerequisites

Before we begin, make sure you have the following installed and set up:

Create a virtual environment

python3 -m venv .venv

Activate the environment

source .venv/bin/activate

Install required libraries

brew install portaudio  # macOS (Homebrew); PyAudio requires the PortAudio library
pip install openai elevenlabs pyaudio numpy deepinfra scipy requests

You'll also need a DeepInfra API key.
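Rather than pasting the key into the script, one option is to read it from an environment variable. The sketch below assumes a variable named DEEPINFRA_API_KEY and a hypothetical helper called load_api_key; neither name is prescribed by DeepInfra.

```python
import os


def load_api_key(env_var="DEEPINFRA_API_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this example.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

After running `export DEEPINFRA_API_KEY=...` in your shell, you could then write `DEEPINFRA_API_KEY = load_api_key()` instead of hard-coding the token.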

Step 1: Speech Recognition with Whisper

First, let's use Whisper to transcribe user speech:

import pyaudio
import wave
import numpy as np
import requests
import json
import io
from scipy.io import wavfile
from openai import OpenAI
from elevenlabs import ElevenLabs, play

DEEPINFRA_API_KEY = "YOUR_DEEPINFRA_TOKEN"
WHISPER_MODEL = "distil-whisper/distil-large-v3"

def record_audio(duration=5, sample_rate=16000):
    CHUNK = 1024
    FORMAT = pyaudio.paInt16
    CHANNELS = 1

    p = pyaudio.PyAudio()

    print("Recording...")
    stream = p.open(format=FORMAT,
                    channels=CHANNELS,
                    rate=sample_rate,
                    input=True,
                    frames_per_buffer=CHUNK)

    frames = []

    for i in range(0, int(sample_rate / CHUNK * duration)):
        data = stream.read(CHUNK)
        frames.append(data)

    print("Recording complete.")

    stream.stop_stream()
    stream.close()
    p.terminate()

    # Convert to numpy array
    audio = np.frombuffer(b''.join(frames), dtype=np.int16)
    return audio

def transcribe_audio(audio, sample_rate=16000):
    # Convert numpy array to WAV file in memory;
    # sample_rate must match the rate used in record_audio
    buffer = io.BytesIO()
    wavfile.write(buffer, sample_rate, audio.astype(np.int16))
    buffer.seek(0)

    # Prepare the request
    url = f'https://api.deepinfra.com/v1/inference/{WHISPER_MODEL}'
    headers = {
        "Authorization": f"bearer {DEEPINFRA_API_KEY}"
    }
    files = {
        'audio': ('audio.wav', buffer, 'audio/wav')
    }

    # Send the request
    response = requests.post(url, headers=headers, files=files)

    if response.status_code == 200:
        return response.json()['text']
    else:
        print(f"Error: {response.status_code}")
        print(response.text)
        return None

Usage

audio = record_audio()
transcription = transcribe_audio(audio)
print(f"Transcription: {transcription}")
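If the transcription comes back empty or wrong, it can help to listen to what was actually recorded. Here is a minimal sketch that writes the raw 16-bit samples to a WAV file using Python's standard wave module. The helper name save_wav and the file name in the usage note are hypothetical.

```python
import wave


def save_wav(path, pcm_bytes, sample_rate=16000, channels=1, sample_width=2):
    """Write raw PCM bytes to a WAV file.

    sample_width=2 matches the 16-bit samples (pyaudio.paInt16) recorded above.
    """
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(sample_width)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)
```

With the numpy array returned by record_audio, a call such as `save_wav("debug.wav", audio.tobytes())` should produce a file you can open in any audio player.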

Step 2: Conversing with LLM using OpenAI Client

Now, let's use the OpenAI client to interact with an LLM:

openai_client = OpenAI(api_key=DEEPINFRA_API_KEY, base_url="https://api.deepinfra.com/v1/openai")

MODEL_DI = "meta-llama/Meta-Llama-3.1-70B-Instruct"

def chat_with_llm(user_input):
    response = openai_client.chat.completions.create(
        model=MODEL_DI,
        messages=[{"role": "user", "content": user_input}],
        max_tokens=1000,
    )
    return response.choices[0].message.content

Usage

llm_response = chat_with_llm(transcription)
print(f"LLM Response: {llm_response}")
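Note that chat_with_llm sends only the latest user message, so the model has no memory of earlier turns. One possible way to add context is to keep a running message list and trim it so it does not grow without bound. The helper add_turn and its max_messages parameter are hypothetical names introduced for this sketch.

```python
def add_turn(history, role, content, max_messages=20):
    """Return a new history with one message appended.

    Only the newest max_messages non-system messages are kept
    (max_messages must be at least 1). A leading system message,
    if present, is always preserved.
    """
    updated = history + [{"role": role, "content": content}]
    has_system = updated[0]["role"] == "system"
    system = updated[:1] if has_system else []
    rest = updated[1:] if has_system else updated
    return system + rest[-max_messages:]


# Example: build up a short conversation
history = [{"role": "system", "content": "You are a concise voice assistant."}]
history = add_turn(history, "user", "What is the capital of France?")
```

The resulting list could be passed as `messages=history` in the chat.completions.create call, with the model's reply stored afterwards via `add_turn(history, "assistant", reply)`.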

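Step 3: Text-to-Speech with ElevenLabs

Finally, let's use the ElevenLabs client, pointed at DeepInfra, to convert the LLM response to speech: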

client = ElevenLabs(api_key=DEEPINFRA_API_KEY, base_url="https://api.deepinfra.com")

def text_to_speech(text):
    audio = client.generate(
        text=text,
        voice="luna",
        model="deepinfra/tts"
    )
    play(audio)

Usage

text_to_speech(llm_response)
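With max_tokens set to 1000, LLM responses can be fairly long. If you would rather synthesize speech one sentence at a time, a simple splitter like the one below may be enough. It is a rough heuristic, not a full sentence tokenizer, and split_sentences is a hypothetical helper name.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace.

    This is a simple heuristic: abbreviations such as "e.g. this"
    will also be split.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]
```

Each sentence could then be passed to text_to_speech in turn, for example `for sentence in split_sentences(llm_response): text_to_speech(sentence)`.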

Putting It All Together

Now, let's combine all these steps into a single voice assistant function:

def voice_assistant():
    while True:
        # Record and transcribe audio
        audio = record_audio()
        transcription = transcribe_audio(audio)
        if transcription:
            print(f"You said: {transcription}")
            # Chat with LLM
            llm_response = chat_with_llm(transcription)
            print(f"Assistant: {llm_response}")
            # Convert response to speech
            text_to_speech(llm_response)
        else:
            # transcribe_audio returns None when the API call fails
            print("Could not transcribe audio; skipping this turn.")
        # Ask if the user wants to continue
        if input("Continue? (y/n): ").lower() != 'y':
            break

Run the voice assistant

voice_assistant()

This voice assistant records a five-second clip, transcribes it, sends the text to an LLM, and speaks the response. It repeats this cycle until the user chooses to stop.
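Since the loop is driven by voice, you might also want to end it by voice rather than by typing. Here is a minimal sketch, assuming a small set of stop phrases. Both the phrase list and the helper wants_to_stop are hypothetical and can be adapted freely.

```python
import string

# Hypothetical phrase list; adjust to taste
STOP_PHRASES = {"stop", "goodbye", "exit", "quit"}


def wants_to_stop(transcription):
    """Return True if the transcription, ignoring case and punctuation,
    is exactly one of the stop phrases."""
    if not transcription:
        return False
    no_punctuation = str.maketrans("", "", string.punctuation)
    cleaned = transcription.lower().translate(no_punctuation).strip()
    return cleaned in STOP_PHRASES
```

Inside voice_assistant, a check such as `if wants_to_stop(transcription): break` placed right after transcription would end the session before the LLM is called.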

Remember to replace YOUR_DEEPINFRA_TOKEN with your actual API key.

By combining Whisper for speech recognition, an LLM for conversation, and TTS for speech synthesis, you can build a voice assistant that understands and responds to a wide range of user queries.
