How to Build a Real-Time Voice Translator with Open-Source AI

Follow this guide to build your own voice translator agent!

This article walks through building a real-time voice translator using open-source AI tools like Whisper, LLaMA-3, and Coqui XTTS powered by GMI Cloud’s scalable infrastructure. It explains how to combine speech recognition, translation, and text-to-speech models into a fully functional browser-based app.

What you’ll learn:

  • How to transcribe speech with Whisper ASR
  • Translating text using LLaMA-3 hosted on GMI Cloud
  • Generating multilingual audio with Coqui XTTS
  • Setting up GPU devices for optimal performance
  • Building and deploying an interactive Gradio interface

Real-Time Voice Translator at a Glance

| Step | Tool Used | Function | Why It Matters |
| --- | --- | --- | --- |
| Speech-to-Text | Whisper | Converts speech into text | Accurate multilingual ASR |
| Translation | LLaMA-3 | Translates text into the target language | Flexible, open-source LLM |
| Text-to-Speech | XTTS | Generates natural voice output | Fast, expressive TTS |
| Deployment | GMI Cloud | Runs models on GPU instances | Scalable, low-latency environment |

Based on an ODSC webinar by Grace Deng, Software Engineer at GMI Cloud

You can watch the original webinar recording here!

Introduction

Imagine saying “hello” in English and hearing it spoken back in Mandarin—instantly, naturally, and with personality. That’s what real-time voice translation can do, and now, with open-source tooling and scalable infrastructure, anyone can build it.

In a recent ODSC webinar, Grace Deng, Software Engineer at GMI Cloud, walked through building a voice-to-voice translator in under an hour. This guide distills the key steps and open-source tooling used, so you can follow along and deploy your own in minutes.

Why build a voice translator yourself?

Because open-source AI now makes it possible to create near-instant translation experiences without relying on proprietary APIs or expensive services.

For businesses, this means breaking language barriers in real time, whether during global meetings, customer support interactions, or live product demos. A self-built, open-source solution offers full control over data privacy, integration flexibility with existing systems, and lower operating costs compared to commercial APIs. It helps enterprises deliver inclusive, multilingual communication without sacrificing security or scalability.

What You’ll Build

You’ll create a real-time voice translator that has:

  • 🎤 Real-time speech input via microphone
  • 📝 Automatic English speech transcription using Whisper
  • 🌐 Translation of English text into Chinese using LLaMA 3
  • 🔊 Text-to-speech generation in Chinese using XTTS
  • 🚀 Deployed with Gradio for browser-based interaction

Use cases:

  • Travel assistant
  • Accessibility support
  • Live meeting translation

Cloud-Based vs On-Device Speech Translation

When should you build for the cloud, and when should you keep everything on-device?

While both deployment options have their strengths, your choice depends on use case and scale:

  • Cloud-based translation offers scalability, multilingual model support, and easy deployment through APIs like GMI Cloud’s hosted endpoints. It’s ideal for high-volume applications such as live meeting transcription or multilingual call centers.
  • On-device translation, by contrast, provides offline usability, enhanced privacy, and in some cases lower latency since audio doesn’t need to travel to a remote server. It’s perfect for travel apps or assistive tools that need to work without an internet connection.

In short, cloud setups win on scale and capability, while on-device wins on independence and privacy. Developers can even combine both for hybrid reliability.
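For teams weighing the hybrid route, here is a minimal sketch of the fallback pattern; cloud_translate and local_translate are hypothetical stand-ins for a hosted endpoint call and an on-device model, not functions defined later in this guide:

def translate_with_fallback(text, cloud_translate, local_translate, timeout_s=2.0):
    """Prefer the cloud endpoint for quality; fall back to the on-device model."""
    try:
        # Hypothetical hosted call (e.g., a wrapper around a cloud endpoint)
        return cloud_translate(text, timeout=timeout_s)
    except Exception as err:  # network failure, timeout, rate limit, ...
        print(f"Cloud translation unavailable ({err}); using on-device model")
        return local_translate(text)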

Tools You’ll Use

Core AI Models

  • Whisper (ASR): Speech-to-text transcription

  • LLaMA-3 (LLM): Text translation from English to Chinese

  • Coqui XTTS v2 (TTS): Voice synthesis with support for multilingual output

Supporting Stack

  • Transformers: HuggingFace pipelines for ASR and LLM
  • CUDA: Multi-GPU support for ASR/TTS acceleration
  • Gradio: Fast UI for real-time browser interaction
  • GMI Cloud API: Hosted endpoints for model inference at scale

Why these tools?

Each model plays a distinct role: Whisper listens, LLaMA 3 understands, XTTS speaks, and Gradio connects it all. Together, they form a modular, customizable translation stack.

Step-by-Step Guide

📁 Project Structure

voice_translator/
├── translator.py             # Main app script
├── audio.wav                 # Output audio file
├── requirements.txt          # Dependencies
└── README.md                 # Project documentation

How is the project structured?

By breaking the project into clear stages, from setup to deployment, you can replicate, modify, or scale the same workflow for your own applications.

1. Set Up Your Environment

Create your Python environment (Conda or venv), add the dependencies in your requirements.txt file, and install:

pip install -r requirements.txt
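For reference, here is one possible requirements.txt for this stack; the package names below (for example, openai-whisper for the whisper import and TTS for Coqui XTTS) are the usual PyPI names, but pin versions to whatever combination you have tested:

gradio
torch
openai-whisper
TTS
requests
transformers
numpy
librosa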

2. Import Dependencies

Here's a list of imported dependencies and why we want them.

  • gradio: For building the web-based user interface.
  • os: For interacting with the file system.
  • torch: The core PyTorch library for running ML models.
  • whisper: OpenAI's ASR model for converting speech to text.
  • TTS.api.TTS: The TTS engine from the coqui-ai/TTS library for converting text to speech.
  • requests: For HTTP requests (e.g., downloading resources).
  • transformers: HuggingFace library for using pretrained models (e.g., for translation).
  • numpy: General-purpose numerical computing.
  • librosa: Audio processing and feature extraction.

import gradio as gr
import os
import torch
import whisper
from TTS.api import TTS
import requests
import transformers
import numpy
import librosa

3. Prepare GPU Devices

This section prepares the hardware setup for running our Voice-to-Voice Translator by allocating GPU devices for two major tasks:

  • Whisper ASR (Automatic Speech Recognition) — converts spoken language into text
  • Text-to-Speech (TTS) — synthesizes translated text back into spoken audio

We aim to run each task on a separate GPU (if available) for optimal performance.

✅ Tip: When working with deep learning models like Whisper and TTS, spreading the workload across multiple GPUs can significantly improve runtime performance and reduce latency. Feel free to spread out the workload in a way that makes sense for your device. In this case, we use the last two GPUs.

num_gpus = torch.cuda.device_count()
if num_gpus >= 2:
    device_whisper = f"cuda:{num_gpus - 2}"
    device_tts = f"cuda:{num_gpus - 1}"
elif num_gpus == 1:
    device_whisper = device_tts = "cuda:0"
else:
    device_whisper = device_tts = "cpu"

print(f"Using {device_whisper} for Whisper ASR")
print(f"Using {device_tts} for Mozilla TTS")

4. Whisper ASR: Speech to Text

In this section, we prepare the automatic speech recognition (ASR) component using OpenAI's Whisper Large V3 model, powered by Hugging Face Transformers.

  • 🔁 Pipeline: The Whisper ASR model and processor are wrapped in a pipeline, which processes audio in chunks (up to 30 seconds per batch) for more efficient handling.
  • 🎧 Preprocess Audio: Before feeding the audio into the ASR model, we need to ensure the audio is in the right format. The following function resamples the audio to 16kHz (if it’s not already), converts it to mono if stereo, and ensures it's in the correct data type for processing.
  • 📝 Transcribe Audio: This function takes audio as input, preprocesses it, and passes it through the Whisper ASR pipeline to generate a transcription. The audio is first resampled and converted, then normalized, and finally sent to the model for transcription.

Why start with Whisper?

OpenAI’s Whisper Large V3 sets the benchmark for real-time speech recognition in open-source AI. It’s not just accurate; it’s robust, multilingual, and production-ready, making it ideal for enterprise-grade voice translation pipelines.

  • Exceptional speech-to-text accuracy: Whisper has been trained on over 680,000 hours of multilingual audio data, allowing it to recognize diverse accents, dialects, and noisy environments with industry-leading precision.
  • Multilingual transcription out of the box: It supports nearly 100 languages, enabling developers to build global voice translation systems without retraining or fine-tuning.
  • Noise-resilient and domain-flexible: Whether used in customer support, live meetings, or accessibility tools, Whisper maintains a low Word Error Rate (WER) even with background noise or variable recording quality.

torch_dtype = torch.float32
model_id = "openai/whisper-large-v3"

transcribe_model = transformers.AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
transcribe_model.to(device_whisper)

processor = transformers.AutoProcessor.from_pretrained(model_id)

pipe0 = transformers.pipeline(
    "automatic-speech-recognition",
    model=transcribe_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device_whisper,
)

def preprocess_audio(audio, orig_sr):
    # Ensure float32 for librosa and the ASR pipeline
    if audio.dtype != numpy.float32:
        audio = audio.astype(numpy.float32)
    # Convert stereo (samples, channels) to mono before resampling
    if audio.ndim > 1:
        audio = librosa.to_mono(audio.T)
    # Resample to the 16 kHz rate Whisper expects
    target_sr = 16000
    if orig_sr != target_sr:
        audio = librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)
    return audio

def transcribe_audio(audio) -> str:
    sr, y = audio                    # Gradio provides (sample_rate, numpy array)
    y = preprocess_audio(y, sr)
    y /= numpy.max(numpy.abs(y))     # Peak-normalize to [-1, 1]
    output_text = pipe0(
        y,
        generate_kwargs={"language": "english", "temperature": 0.5, "top_p": 0.9},
    )["text"]
    print(output_text)
    return output_text

Streaming ASR Best Practices

How can you make your speech recognition feel truly real-time?
To achieve smooth live transcription, it’s important to handle streaming intelligently:

  • Voice Activity Detection (VAD) helps the system decide when a user starts and stops speaking, preventing awkward pauses.
  • Chunking and partial results let you display or process text as it arrives—this keeps latency low for real-time translation.
  • End-of-utterance detection ensures the pipeline doesn’t cut off words mid-sentence.
  • Be mindful of language auto-detection or code-switching, as mixed-language audio can confuse even advanced ASR models.

Implementing these techniques keeps your translator responsive and natural, especially for live conversation or accessibility use cases.
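To make the chunking and end-of-utterance ideas concrete, here is a minimal, illustrative sketch of energy-based voice activity detection wrapped around the transcribe_audio function from step 4. The frame size, energy threshold, and silence count are assumptions you would tune for your microphone and environment; a production system would more likely use a dedicated VAD such as Silero VAD or WebRTC VAD.

ENERGY_THRESHOLD = 0.01      # RMS level treated as "speech" (assumed, tune per mic)
SILENCE_FRAMES_TO_END = 3    # Consecutive quiet frames that end an utterance

def is_speech(frame) -> bool:
    # Simple energy-based VAD: root-mean-square level above a threshold
    rms = numpy.sqrt(numpy.mean(frame.astype(numpy.float32) ** 2))
    return rms > ENERGY_THRESHOLD

def stream_transcribe(frames, sample_rate=16000):
    """Accumulate speech frames and emit a transcript at each end of utterance."""
    buffer, quiet = [], 0
    for frame in frames:                        # frames: iterable of numpy chunks
        if is_speech(frame):
            buffer.append(frame)
            quiet = 0
        elif buffer:
            quiet += 1
            if quiet >= SILENCE_FRAMES_TO_END:  # end-of-utterance detected
                utterance = numpy.concatenate(buffer)
                yield transcribe_audio((sample_rate, utterance))
                buffer, quiet = [], 0
    if buffer:                                  # flush whatever is left at the end
        yield transcribe_audio((sample_rate, numpy.concatenate(buffer)))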

5. Translate with LLaMA-3

In this step, we bridge the gap between transcription and speech synthesis by introducing text translation.

🔁 Translate Text Using LLaMA-3 API

To translate English transcripts into Chinese, we use a hosted LLaMA-3 model via a REST API (https://api.gmi-serving.com/v1/chat/completions). This function wraps the call to the API in Python and uses a POST request with the appropriate headers and payload.

  • The Authorization header includes a bearer token (API Key).
  • The request payload specifies:
    • The model name (meta-llama/Llama-3.3-70B-Instruct)
    • A system message directing the model to perform translation from English to Chinese, and to return only the translation.
    • A user message containing the input text.
    • Temperature (0) for deterministic output and a token limit of 500.

The response is parsed as JSON. If the response is valid and successful (status_code == 200), the function extracts the translated message and returns it. Otherwise, it handles errors gracefully and logs useful debugging information.

✅ At this point, you can transcribe English audio, translate it to Chinese, and are now ready to generate Chinese audio output.

def translate_text(text):
    url = "https://api.gmi-serving.com/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer <GMI_API_KEY>"
    }
    payload = {
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [
            {"role": "system", "content": "Translate the following English text into Chinese. Include the translation and nothing else."},
            {"role": "user", "content": text}
        ],
        "temperature": 0,
        "max_tokens": 500
    }
    response = requests.post(url, headers=headers, json=payload)
    if response.status_code == 200:
        try:
            data = response.json()
            return data['choices'][0]['message']['content']
        except (ValueError, KeyError) as e:
            print("Failed to parse response:", e)
    else:
        print("Error with the request:", response.status_code, response.text)
    return "No translation provided"

Why use LLaMA-3 for translation?

LLaMA-3 is ideal for powering production-grade translation pipelines because it combines linguistic precision with scalability and enterprise control:

  • Consistent, deterministic results: Its instruction-tuned design and support for temperature = 0 enable predictable, repeatable translations, which is crucial for regulated industries, customer support scripts, and multilingual product content.
  • Context-aware accuracy: LLaMA-3 maintains long-range semantic context and preserves tone, meaning, and domain-specific terminology, reducing post-editing time and improving overall translation quality.
  • Customizable via prompts: Teams can apply prompt engineering for glossary enforcement, brand tone, or style consistency without fine-tuning, ideal for dynamic enterprise use cases (see the sketch after this list).
  • Real-time scalability: When deployed through GMI Cloud’s scalable GPU infrastructure, LLaMA-3 delivers low-latency translation with efficient batching and multi-GPU throughput, supporting live conversations and interactive applications.
  • Cost efficiency and flexibility: As an open-source LLM, LLaMA-3 avoids vendor lock-in and offers transparent control over deployment costs, performance, and compliance (hybrid / multi-cloud AI infrastructure).
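As an illustration of the prompt-customization point above, here is a hedged sketch of how you might build the messages for the translate_text payload from step 5 with a small glossary baked into the system prompt. The glossary terms and prompt wording are assumptions, not part of the original webinar code.

# Hypothetical glossary used to keep brand and product terms consistent
GLOSSARY = {
    "GMI Cloud": "GMI Cloud",        # keep the brand name untranslated
    "inference endpoint": "推理端点",
}

def build_translation_messages(text):
    glossary_lines = "\n".join(f"- {en} -> {zh}" for en, zh in GLOSSARY.items())
    system_prompt = (
        "Translate the following English text into Chinese. "
        "Return only the translation. Always use these glossary mappings:\n"
        + glossary_lines
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": text},
    ]

You would then pass build_translation_messages(text) as the "messages" field of the payload inside translate_text.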

6. Coqui XTTS: Text to Speech

In this section, we complete the voice-to-voice translation pipeline by generating audio from the translated text and connecting all components into a single function.

🔁 Load the TTS Model

The multilingual TTS model xtts_v2 from Coqui TTS is loaded and moved to the designated TTS device (device_tts), ensuring fast inference using GPU if available.

🎙️ Convert Translated Text to Speech

We define a function text_to_speech that takes in translated text and generates a spoken audio file from it.

  • The output audio is saved as "audio.wav".
  • The function uses a predefined speaker voice, "Ana Florence", and the output language is set to "zh-cn" (Chinese).
  • tts_to_file() from Coqui TTS handles the synthesis and writes the audio to disk.

tts_model = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts_model.to(device_tts)

def text_to_speech(text):
    output_path = "audio.wav"
    # Synthesize Chinese speech with the built-in "Ana Florence" speaker voice
    tts_model.tts_to_file(
        text=text,
        file_path=output_path,
        speaker="Ana Florence",
        language="zh-cn",
    )
    return output_path

Optimizing Real-Time TTS Output

How do you make your translated voice sound natural and fast?
To improve perceived “real-time” flow:

  • Stream TTS output in small chunks so playback begins before synthesis completes.
  • Adjust prosody and intonation settings for a smoother, more expressive voice delivery.
  • Tune cross-lingual pronunciation if your target language involves tones or accents, like Mandarin.
  • Reduce ASR → MT → TTS buffering gaps by overlapping stages slightly (e.g., start translation as transcription finalizes).

Small adjustments here have a huge impact: the smoother the TTS handoff, the more your app feels like a live human translator.
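As a rough illustration of the chunked-playback idea, the sketch below splits the translated text into sentences and synthesizes each one as its own file, so playback of the first sentence can start while later ones are still being generated. It reuses tts_model from step 6; the sentence-splitting regex and file naming are assumptions.

import re

def text_to_speech_chunked(text):
    """Yield one audio file per sentence so playback can begin early."""
    # Split on Chinese or Latin sentence-ending punctuation (assumed heuristic)
    sentences = [s for s in re.split(r"(?<=[。！？.!?])\s*", text) if s]
    for i, sentence in enumerate(sentences):
        chunk_path = f"audio_chunk_{i}.wav"
        tts_model.tts_to_file(
            text=sentence,
            file_path=chunk_path,
            speaker="Ana Florence",
            language="zh-cn",
        )
        yield chunk_path  # A streaming UI could start playing this chunk immediately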

7. Wrap It All Up

🔄 End-to-End Voice Translation Pipeline

The voice_to_voice() function integrates all stages:

  1. Input: Receives a raw audio clip (user's speech).
  2. Check: Returns an empty audio component if no audio was provided.
  3. Transcribe: Converts the audio to English text using Whisper.
  4. Translate: Translates the English text to Chinese using LLaMA.
  5. Synthesize: Generates a spoken Chinese audio clip from the translated text.

Finally, it returns the path to the generated audio file.

✅ You now have a fully functioning voice-to-voice translator: English audio in → Chinese audio out!

def voice_to_voice(audio):
    if audio is None:  
        return gr.Audio(value=None)  
    output_audio = text_to_speech(translate_text(transcribe_audio(audio)))
    return output_audio

8. Launch with Gradio

In this final step, we wrap our voice-to-voice translation pipeline into a user-friendly interface using Gradio.

🛠️ Define the Gradio Interface

We create a gr.Interface instance to handle:

  • Input: Real-time audio from the user's microphone (gr.Audio(sources="microphone")).
  • Output: Generated audio in Chinese, also rendered in a microphone-style audio widget for playback.
  • Function: The voice_to_voice() function defined earlier is used as the core processing pipeline.
  • Metadata:
    • Title: "Voice-to-Voice Translator"
    • Description: Provides a step-by-step summary of the system’s behavior.
    • Live Mode: Enabled (live=True) to support real-time audio streaming.

What’s the benefit of Gradio?

It gives you a plug-and-play web interface for testing your translator instantly, without writing extra front-end code.

🚀 Launch the App

Finally, we call .launch() with share=True to:

  • Start a local server for the app.
  • Generate a public URL so you can share the demo with others online for testing or showcasing your voice-to-voice translator.

# --- Gradio UI ---
demo = gr.Interface(
    fn=voice_to_voice,
    inputs=gr.Audio(sources="microphone"),
    outputs=gr.Audio(sources=["microphone"]),
    title="Voice-to-Voice Translator",
    description="🎤 Speak in English → 📝 Get Chinese text → 🔊 Listen to Chinese speech.",
    live=True
)

demo.launch(share=True)

| Scenario | Stack Choice | Key Advantage | Note |
| --- | --- | --- | --- |
| Lightweight / edge devices | Distilled models + VITS | Low compute cost | Sacrifices accuracy |
| Multilingual research use | Whisper + LLaMA-3 + XTTS | Flexibility, open source | Runs well on 1 GPU |
| Enterprise real-time scale | Whisper + LLaMA-3 + XTTS (multi-GPU) | High concurrency | Supports thousands of users |
| High-accuracy translation | Whisper Large-v3 + M2M100 | Better translation BLEU | Requires more GPU resources |

Quality & Latency Evaluation (What to Measure)

How do you know if your translator is performing well?
Once your system runs end-to-end, it’s time to measure both accuracy and responsiveness:

  • Latency budget: aim for roughly <700 ms for ASR, <500 ms for translation, and <800 ms for TTS, keeping total round-trip under 2 seconds for natural flow.
  • Accuracy metrics: use Word Error Rate (WER) for ASR and BLEU or COMET for translation quality.
  • User testing: conduct quick Mean Opinion Score (MOS) checks to rate how natural the output voice sounds to human listeners.

Evaluating both quality and speed turns your translator from a prototype into a production-ready AI system, and it helps you answer the question “What’s the best AI voice translator?” with data.
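To make the latency budget measurable, here is a small instrumented wrapper around the pipeline functions from the earlier steps. It simply times each stage with time.perf_counter; the printed labels and millisecond formatting are my own additions rather than part of the original tutorial.

import time

def voice_to_voice_timed(audio):
    """Run the pipeline and report per-stage latency in milliseconds."""
    timings = {}

    t0 = time.perf_counter()
    english_text = transcribe_audio(audio)       # ASR stage (target: < ~700 ms)
    timings["asr_ms"] = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    chinese_text = translate_text(english_text)  # Translation stage (target: < ~500 ms)
    timings["mt_ms"] = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    output_audio = text_to_speech(chinese_text)  # TTS stage (target: < ~800 ms)
    timings["tts_ms"] = (time.perf_counter() - t0) * 1000

    total = sum(timings.values())
    print({k: round(v, 1) for k, v in timings.items()}, f"total={total:.1f} ms")
    return output_audio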

Pro Tips from Grace

  • Run ASR and TTS on separate GPUs for lower latency
  • Use clean 16kHz audio input to improve Whisper accuracy
  • Adjust temperature and prompts for translation accuracy
  • Use Coqui voice cloning for custom speakers or accessibility tuning

Why these pro tips matter for you:

Each tip maps to a concrete benefit you’ll feel when you ship:

  • Run ASR and TTS on separate GPUs
    Parallelizing Whisper (ASR) and XTTS (TTS) removes queuing delays, so turn-taking in conversations feels natural and your end-to-end latency stays within the roughly two-second budget even under load. You also gain headroom to use larger models or higher batch sizes without spiking latency.
  • Use clean 16 kHz audio input
    Feeding models the sample rate they expect reduces resampling artifacts and background hiss, which typically lowers WER for Whisper and gives XTTS cleaner prosody. Fewer mis-hearings → fewer retries → smoother UX and lower compute spend.
  • Adjust temperature and prompts for translation accuracy
    A lower temperature (e.g., 0) makes outputs deterministic, which is great for support scripts, product names, and terminology. A clear system prompt (“translate and return only the translation”) reduces post-editing and keeps style consistent across agents and shifts.
  • Use Coqui voice cloning for custom speakers or accessibility
    A consistent voice improves brand recall and listener comfort in long sessions. Matching speaker traits can reduce listener fatigue and increase comprehension, which matters in trainings, support lines, or accessibility scenarios.

These practices help you stay within the latency targets already outlined above (ASR under ~700 ms, MT under ~500 ms, TTS under ~800 ms), improve output quality you don’t have to fix later, and trim avoidable GPU minutes, delivering better UX at lower cost whether you deploy on-device or via GMI Cloud endpoints.
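As a concrete illustration of the voice-cloning tip, Coqui XTTS can condition synthesis on a short reference recording via the speaker_wav argument of tts_to_file instead of a built-in speaker name. The reference file path below is a placeholder, and you should only clone voices you have permission to use.

def text_to_speech_cloned(text, reference_wav="my_speaker_reference.wav"):
    """Synthesize Chinese speech in the voice of a short reference recording."""
    output_path = "audio_cloned.wav"
    tts_model.tts_to_file(
        text=text,
        file_path=output_path,
        speaker_wav=reference_wav,  # short clip of the target speaker (placeholder path)
        language="zh-cn",
    )
    return output_path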

Resources

Build AI Without Limits

GMI Cloud is here to help you get your AI project off the ground—fast. With GPU-ready APIs and hosted endpoints, you don’t need to wrangle infrastructure just to experiment.

Join the GMI Discord community to connect with other builders and share your creations.

Frequently Asked Questions about Building a Real-Time Voice Translator with Open-Source AI

1. What tools are used to build the real-time voice translator?

The project combines several open-source components: Whisper for English speech recognition (ASR), LLaMA-3 for translating text into Chinese, and Coqui XTTS v2 for generating Chinese speech. It’s deployed using Gradio for an interactive web interface and accelerated with CUDA for multi-GPU performance. GMI Cloud’s API provides scalable endpoints for model inference.

2. How does the real-time voice translation pipeline work from start to finish?

The process begins when a user speaks into a microphone. The audio is captured and processed by Whisper, which transcribes it into English text. Then, LLaMA-3 translates the English text into Chinese. Finally, Coqui XTTS v2 converts the translated text into spoken audio, saving the result as audio.wav. The user can instantly listen to the output in the browser through Gradio’s interface.

3. Why is multi-GPU setup recommended for this project?

Using multiple GPUs drastically reduces latency by distributing workloads. In the example, one GPU is dedicated to Whisper (speech-to-text) and another to XTTS (text-to-speech). Running these components simultaneously speeds up the translation process and ensures smoother real-time performance, which is essential for live conversations or accessibility tools.

4. How is LLaMA-3 used for translation in this setup?

The script integrates GMI Cloud’s LLaMA-3 API, which receives English text and returns a Chinese translation. It uses the meta-llama/Llama-3.3-70B-Instruct model with a temperature of 0 to ensure consistent and accurate results. The API request includes a system instruction that specifies: “Translate the following English text into Chinese and return only the translation.” The output text is then sent to the TTS component for speech synthesis.

5. What improves translation accuracy and audio naturalness?

Grace Deng’s tutorial emphasizes using clean 16kHz audio input to improve Whisper’s accuracy. Setting temperature to 0 produces stable translations without random variations. For speech output, using the language code “zh-cn” in XTTS ensures natural pronunciation, and choosing a clear voice like “Ana Florence” helps create high-quality, human-like audio.

6. How do you deploy and test the voice translator application?

The project setup is simple: a few files (translator.py, requirements.txt, and audio.wav) and dependencies installed with pip install -r requirements.txt. The Gradio interface manages input and output audio in real time. Running demo.launch(share=True) starts the app locally and generates a public link for testing, making it easy to demonstrate or share your real-time voice translator online.
