How to Build a Real-Time Voice Translator with Open-Source AI

Follow this guide to build your own voice translator agent!

May 6, 2025

Based on an ODSC webinar by Grace Deng, Software Engineer at GMI Cloud

You can watch the original webinar recording here!

Introduction

Imagine saying “hello” in English and hearing it spoken back in Mandarin—instantly, naturally, and with personality. That’s what real-time voice translation can do, and now, with open-source tooling and scalable infrastructure, anyone can build it.

In a recent ODSC webinar, Grace Deng, Software Engineer at GMI Cloud, walked through building a voice-to-voice translator in under an hour. This guide distills the key steps and open-source tooling used, so you can follow along and deploy your own in minutes.

What You’ll Build

You’ll create a real-time voice translator that has:

  • 🎤 Real-time speech input via microphone
  • 📝 Automatic English speech transcription using Whisper
  • 🌐 Translation of English text into Chinese using LLaMA 3
  • 🔊 Text-to-speech generation in Chinese using XTTS
  • 🚀 Deployed with Gradio for browser-based interaction

Use cases:

  • Travel assistant
  • Accessibility support
  • Live meeting translation

Tools You’ll Use

Core AI Models

  • Whisper (ASR): Speech-to-text transcription

  • LLaMA-3 (LLM): Text translation from English to Chinese

  • Coqui XTTS v2 (TTS): Voice synthesis with support for multilingual output

Supporting Stack

  • Transformers: HuggingFace pipelines for ASR and LLM
  • CUDA: Multi-GPU support for ASR/TTS acceleration
  • Gradio: Fast UI for real-time browser interaction
  • GMI Cloud API: Hosted endpoints for model inference at scale

Step-by-Step Guide

📁 Project Structure

voice_translator/
├── translator.py             # Main app script
├── audio.wav                 # Output audio file
├── requirements.txt          # Dependencies
└── README.md                 # Project documentation

1. Set Up Your Environment

Create your Python environment (Conda or venv), add the dependencies in your requirements.txt file, and install:

pip install -r requirements.txt
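
If you're starting from scratch, here is a minimal requirements.txt sketch covering the packages this guide imports (unpinned; pin versions as needed for your environment):

gradio
torch
openai-whisper
TTS
transformers
requests
numpy
librosa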

2. Import Dependencies

Here's a list of imported dependencies and why we want them.

  • gradio: For building the web-based user interface.
  • os: For interacting with the file system.
  • torch: The core PyTorch library for running ML models.
  • whisper: OpenAI's ASR model for converting speech to text.
  • TTS.api.TTS: The TTS engine from the coqui-ai/TTS library for converting text to speech.
  • requests: For HTTP requests (e.g., downloading resources).
  • transformers: HuggingFace library for using pretrained models (e.g., for translation).
  • numpy: General-purpose numerical computing.
  • librosa: Audio processing and feature extraction.
import gradio as gr
import os
import torch
import whisper as whisper_ai 
from TTS.api import TTS
import requests
import transformers
import numpy
import librosa 

3. Prepare GPU Devices

This section prepares the hardware setup for running our Voice-to-Voice Translator by allocating GPU devices for two major tasks:

  • Whisper ASR (Automatic Speech Recognition) — converts spoken language into text
  • Text-to-Speech (TTS) — synthesizes translated text back into spoken audio

We aim to run each task on a separate GPU (if available) for optimal performance.

✅ Tip: When working with deep learning models like Whisper and TTS, spreading the workload across multiple GPUs can significantly improve runtime performance and reduce latency. Feel free to distribute the workload however makes sense for your hardware; in this case, we use the last two GPUs.

num_gpus = torch.cuda.device_count()
if num_gpus >= 2:
    device_whisper = f"cuda:{num_gpus - 2}"
    device_tts = f"cuda:{num_gpus - 1}"
elif num_gpus == 1:
    device_whisper = device_tts = "cuda:0"
else:
    device_whisper = device_tts = "cpu"

print(f"Using {device_whisper} for Whisper ASR")
print(f"Using {device_tts} for Mozilla TTS")

4. Whisper ASR: Speech to Text

In this section, we prepare the automatic speech recognition (ASR) component using OpenAI's Whisper Large V3 model, powered by Hugging Face Transformers.

  • 🔁 Pipeline: The Whisper ASR model and processor are wrapped in a pipeline, which processes audio in chunks (up to 30 seconds per batch) for more efficient handling.
  • 🎧 Preprocess Audio: Before feeding the audio into the ASR model, we need to ensure the audio is in the right format. The following function resamples the audio to 16kHz (if it’s not already), converts it to mono if stereo, and ensures it's in the correct data type for processing.
  • 📝 Transcribe Audio: This function takes audio as input, preprocesses it, and passes it through the Whisper ASR pipeline to generate a transcription. The audio is first resampled and converted, then normalized, and finally sent to the model for transcription.
torch_dtype = torch.float32
model_id = "openai/whisper-large-v3"

transcribe_model = transformers.AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
transcribe_model.to(device_whisper)

processor = transformers.AutoProcessor.from_pretrained(model_id)

pipe0 = transformers.pipeline(
    "automatic-speech-recognition",
    model=transcribe_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device_whisper,
)

def preprocess_audio(audio, orig_sr):
    # Ensure float32 samples for librosa and the ASR pipeline
    if audio.dtype != numpy.float32:
        audio = audio.astype(numpy.float32)
    # Collapse stereo to mono (Gradio delivers arrays shaped (samples, channels))
    if len(audio.shape) > 1:
        audio = librosa.to_mono(audio.T)
    # Resample to the 16 kHz rate Whisper expects
    if orig_sr != 16000:
        audio = librosa.resample(audio, orig_sr=orig_sr, target_sr=16000)
    return audio

def transcribe_audio(audio) -> str:
    sr, y = audio
    y = preprocess_audio(y, sr)
    # Peak-normalize before sending the audio to the pipeline
    y /= numpy.max(numpy.abs(y))
    output_text = pipe0(y, generate_kwargs={"language": "english", "temperature": 0.5, "top_p": 0.9})["text"]
    print(output_text)
    return output_text
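
Before moving on, you can sanity-check the ASR step on a local clip. A quick sketch (sample.wav is a placeholder path) that builds the (sample_rate, samples) tuple Gradio will later pass to transcribe_audio:

# Optional sanity check: run a local file through the ASR step.
# "sample.wav" is a placeholder path; librosa.load returns (samples, rate),
# so we swap the order to match the (rate, samples) tuple Gradio provides.
y, sr = librosa.load("sample.wav", sr=None)
print(transcribe_audio((sr, y)))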

5. Translate with LLaMA-3

In this step, we bridge the gap between transcription and speech synthesis by introducing text translation.

🔁 Translate Text Using LLaMA-3 API

To translate English transcripts into Chinese, we use a hosted LLaMA-3 model via a REST API (https://api.gmi-serving.com/v1/chat/completions). This function wraps the call to the API in Python and uses a POST request with the appropriate headers and payload.

  • The Authorization header includes a bearer token (API Key).
  • The request payload specifies:
    • The model name (meta-llama/Llama-3.3-70B-Instruct)
    • A system message directing the model to perform translation from English to Chinese, and to return only the translation.
    • A user message containing the input text.
    • Temperature (0) for deterministic output and a token limit of 500.

The response is parsed as JSON. If the response is valid and successful (status_code == 200), the function extracts the translated message and returns it. Otherwise, it handles errors gracefully and logs useful debugging information.

✅ At this point, you can transcribe English audio, translate it to Chinese, and are now ready to generate Chinese audio output.

def translate_text(text):
    url = "https://api.gmi-serving.com/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer <GMI_API_KEY>"
    }
    payload = {
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [
            {"role": "system", "content": "Translate the following English text into Chinese. Include the translation and nothing else."},
            {"role": "user", "content": text}
        ],
        "temperature": 0,
        "max_tokens": 500
    }
    response = requests.post(url, headers=headers, json=payload)
    if response.status_code == 200:
        try:
            data = response.json()
            return data['choices'][0]['message']['content']
        except (ValueError, KeyError) as e:
            print("Failed to parse response:", e)
    else:
        print("Error with the request:", response.status_code, response.text)
    return "No translation provided"

6. Coqui XTTS: Text to Speech

In this section, we complete the voice-to-voice translation pipeline by generating audio from the translated text and connecting all components into a single function.

🔁 Load the TTS Model

The multilingual TTS model xtts_v2 from Coqui TTS is loaded and moved to the designated TTS device (device_tts), ensuring fast inference using GPU if available.

🎙️ Convert Translated Text to Speech

We define a function text_to_speech that takes in translated text and generates a spoken audio file from it.

  • The output audio is saved as "audio.wav".
  • The function uses a predefined speaker voice, "Ana Florence", and the output language is set to "zh-cn" (Chinese).
  • tts_to_file() from Coqui TTS handles the synthesis and writes the audio to disk.
tts_model = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts_model.to(device_tts)

def text_to_speech(text):
    output_path = "audio.wav"
    tts_model.tts_to_file(text=text, file_path=output_path, speaker="Ana Florence", language="zh-cn")
    return output_path

7. Wrap It All Up

🔄 End-to-End Voice Translation Pipeline

The voice_to_voice() function integrates all stages:

  1. Input: Receives a raw audio clip (user's speech).
  2. Check: Returns None if no audio was provided.
  3. Transcribe: Converts the audio to English text using Whisper.
  4. Translate: Translates the English text to Chinese using LLaMA.
  5. Synthesize: Generates a spoken Chinese audio clip from the translated text.

Finally, it returns the path to the generated audio file.

✅ You now have a fully functioning voice-to-voice translator: English audio in → Chinese audio out!

def voice_to_voice(audio):
    if audio is None:  
        return gr.Audio(value=None)  
    output_audio = text_to_speech(translate_text(transcribe_audio(audio)))
    return output_audio

8. Launch with Gradio

In this final step, we wrap our voice-to-voice translation pipeline into a user-friendly interface using Gradio.

🛠️ Define the Gradio Interface

We create a gr.Interface instance to handle:

  • Input: Real-time audio from the user's microphone (gr.Audio(sources="microphone")).
  • Output: The generated Chinese audio, rendered in an audio widget for playback.
  • Function: The voice_to_voice() function defined earlier is used as the core processing pipeline.
  • Metadata:
    • Title: "Voice-to-Voice Translator"
    • Description: Provides a step-by-step summary of the system’s behavior.
    • Live Mode: Enabled (live=True) to support real-time audio streaming.

🚀 Launch the App

Finally, we call .launch() with share=True to:

  • Start a local server for the app.
  • Generate a public URL so you can share the demo with others online for testing or showcasing your voice-to-voice translator.
# --- Gradio UI ---
demo = gr.Interface(
    fn=voice_to_voice,
    inputs=gr.Audio(sources="microphone"),
    outputs=gr.Audio(sources=["microphone"]),
    title="Voice-to-Voice Translator",
    description="🎤 Speak in English → 📝 Get Chinese text → 🔊 Listen to Chinese speech.",
    live=True
)

demo.launch(share=True)

Pro Tips from Grace

  • Run ASR and TTS on separate GPUs for lower latency
  • Use clean 16kHz audio input to improve Whisper accuracy
  • Adjust temperature and prompts for translation accuracy
  • Use Coqui voice cloning for custom speakers or accessibility tuning (see the sketch below)
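
On that last tip: XTTS v2 can clone a voice from a short reference clip instead of using a built-in speaker. A minimal sketch, assuming you have a few seconds of clean reference audio saved as my_voice.wav (a placeholder path):

# Voice cloning with XTTS v2: pass a reference clip via speaker_wav
# instead of a named built-in speaker. "my_voice.wav" is a placeholder path.
def text_to_speech_cloned(text, reference_wav="my_voice.wav"):
    output_path = "audio.wav"
    tts_model.tts_to_file(
        text=text,
        file_path=output_path,
        speaker_wav=reference_wav,
        language="zh-cn",
    )
    return output_path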

Resources

Build AI Without Limits

GMI Cloud is here to help you get your AI project off the ground—fast. With GPU-ready APIs and hosted endpoints, you don’t need to wrangle infrastructure just to experiment.

Join the GMI Discord community to connect with other builders and share your creations.

Get started today

Give GMI Cloud a try and see for yourself if it's a good fit for your AI needs.

  • 14-day trial, no long-term commitments, no setup needed
  • On-demand GPUs starting at $4.39/GPU-hour
  • Private Cloud as low as $2.50/GPU-hour