Follow this guide to build your own voice translator agent!
Based on an ODSC webinar by Grace Deng, Software Engineer at GMI Cloud
You can watch the original webinar recording here!
Imagine saying “hello” in English and hearing it spoken back in Mandarin—instantly, naturally, and with personality. That’s what real-time voice translation can do, and now, with open-source tooling and scalable infrastructure, anyone can build it.
In a recent ODSC webinar, Grace Deng, Software Engineer at GMI Cloud, walked through building a voice-to-voice translator in under an hour. This guide distills the key steps and open-source tooling used, so you can follow along and deploy your own in minutes.
You’ll create a real-time voice translator that captures English speech from the microphone, transcribes it with Whisper, translates the transcript into Chinese with a hosted LLaMA-3 model, and speaks the result back with Coqui TTS, all wrapped in a simple Gradio web interface.
Use cases range from live conversation and multilingual customer support to language practice: anywhere you want speech in one language spoken back in another. The project is laid out like this:
voice_translator/
├── translator.py # Main app script
├── audio.wav # Output audio file
├── requirements.txt # Dependencies
└── README.md # Project documentation
Create your Python environment (Conda or venv), add the dependencies in your requirements.txt file, and install:
pip install -r requirements.txt
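If you don't have a requirements.txt yet, a minimal sketch along these lines should work. The webinar's exact version pins aren't shown, so pin versions as needed; note that the Whisper and Coqui TTS packages are commonly published on PyPI as openai-whisper and TTS respectively.

gradio
torch
transformers
openai-whisper
TTS
requests
numpy
librosa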
Here's a list of imported dependencies and why we want them.
import gradio as gr              # Web UI: record microphone input, play back audio
import os                        # Standard library: paths and environment variables
import torch                     # GPU detection and tensor backend
import whisper as whisper_ai     # OpenAI Whisper package (ASR)
from TTS.api import TTS          # Coqui TTS for speech synthesis
import requests                  # REST calls to the hosted LLaMA-3 translation endpoint
import transformers              # Hugging Face pipeline for Whisper Large V3
import numpy                     # Audio array handling and normalization
import librosa                   # Resampling and mono conversion
This section prepares the hardware setup for running our Voice-to-Voice Translator by allocating GPU devices for the two major tasks: Whisper speech recognition and Coqui TTS speech synthesis.
We aim to run each task on a separate GPU (if available) for optimal performance.
✅ Tip: When working with deep learning models like Whisper and TTS, spreading the workload across multiple GPUs can significantly improve runtime performance and reduce latency. Feel free to distribute the workload in a way that makes sense for your machine. In this case, we use the last two GPUs.
num_gpus = torch.cuda.device_count()

if num_gpus >= 2:
    device_whisper = f"cuda:{num_gpus - 2}"
    device_tts = f"cuda:{num_gpus - 1}"
elif num_gpus == 1:
    device_whisper = device_tts = "cuda:0"
else:
    device_whisper = device_tts = "cpu"

print(f"Using {device_whisper} for Whisper ASR")
print(f"Using {device_tts} for Coqui TTS")
In this section, we prepare the automatic speech recognition (ASR) component using OpenAI's Whisper Large V3 model, powered by Hugging Face Transformers.
torch_dtype = torch.float32
model_id = "openai/whisper-large-v3"

transcribe_model = transformers.AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
transcribe_model.to(device_whisper)

processor = transformers.AutoProcessor.from_pretrained(model_id)

pipe0 = transformers.pipeline(
    "automatic-speech-recognition",
    model=transcribe_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device_whisper,
)
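As an optional sanity check, you can point the pipeline at any short English audio file before wiring up the rest of the app. The filename below is a placeholder, and decoding a file this way requires ffmpeg to be installed.

# Optional smoke test: transcribe a short English clip (placeholder filename)
sample_result = pipe0("sample_english.wav")
print(sample_result["text"])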
def preprocess_audio(audio, orig_sr):
    # Ensure float32, which librosa and the Whisper feature extractor expect
    if audio.dtype != numpy.float32:
        audio = audio.astype(numpy.float32)
    # Collapse stereo (samples, channels) recordings to mono
    if audio.ndim > 1:
        audio = librosa.to_mono(audio.T)
    # Resample to the 16 kHz rate Whisper expects
    target_sr = 16000
    if orig_sr != target_sr:
        audio = librosa.resample(audio, orig_sr=orig_sr, target_sr=target_sr)
    return audio

def transcribe_audio(audio) -> str:
    sr, y = audio  # Gradio microphone input arrives as (sample_rate, samples)
    y = preprocess_audio(y, sr)
    y /= numpy.max(numpy.abs(y))  # Normalize amplitude to [-1, 1]
    output_text = pipe0(
        y,
        generate_kwargs={"language": "english", "temperature": 0.5, "top_p": 0.9},
    )["text"]
    print(output_text)
    return output_text
In this step, we bridge the gap between transcription and speech synthesis by introducing text translation.
To translate English transcripts into Chinese, we use a hosted LLaMA-3 model via a REST API (https://api.gmi-serving.com/v1/chat/completions). This function wraps the call to the API in Python and uses a POST request with the appropriate headers and payload.
The response is parsed as JSON. If the response is valid and successful (status_code == 200), the function extracts the translated message and returns it. Otherwise, it handles errors gracefully and logs useful debugging information.
✅ At this point, you can transcribe English audio, translate it to Chinese, and are now ready to generate Chinese audio output.
def translate_text(text):
    url = "https://api.gmi-serving.com/v1/chat/completions"
    headers = {
        "Content-Type": "application/json",
        "Authorization": "<GMI_API_KEY>"  # your GMI Cloud API key
    }
    payload = {
        "model": "meta-llama/Llama-3.3-70B-Instruct",
        "messages": [
            {"role": "system", "content": "Translate the following English text into Chinese. Include the translation and nothing else."},
            {"role": "user", "content": text}
        ],
        "temperature": 0,
        "max_tokens": 500
    }
    # NOTE: verify=False disables TLS certificate checks; remove it unless you need it
    response = requests.post(url, headers=headers, json=payload, verify=False)
    if response.status_code == 200:
        try:
            data = response.json()
            return data['choices'][0]['message']['content']
        except (ValueError, KeyError) as e:
            print("Failed to parse response:", e)
            return "No translation provided"
    else:
        print("Error with the request:", response.status_code, response.text)
        return "No translation provided"
In this section, we complete the voice-to-voice translation pipeline by generating audio from the translated text and connecting all components into a single function.
The multilingual TTS model xtts_v2 from Coqui TTS is loaded and moved to the designated TTS device (device_tts), ensuring fast inference using GPU if available.
We define a function text_to_speech that takes in translated text and generates a spoken audio file from it.
tts_model = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts_model.to(device_tts)

def text_to_speech(text):
    output_path = "audio.wav"
    tts_model.tts_to_file(text=text, file_path=output_path, speaker="Ana Florence", language="zh-cn")
    return output_path
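To test synthesis in isolation, pass any Chinese string (the sample text here is arbitrary):

# Standalone TTS check: writes the spoken Chinese to audio.wav and prints the path
print(text_to_speech("你好，欢迎使用语音翻译器。"))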
🔄 End-to-End Voice Translation Pipeline
The voice_to_voice() function integrates all stages: transcription (transcribe_audio), translation (translate_text), and speech synthesis (text_to_speech). Finally, it returns the path to the generated audio file.
✅ You now have a fully functioning voice-to-voice translator: English audio in → Chinese audio out!
def voice_to_voice(audio):
    if audio is None:
        return gr.Audio(value=None)
    output_audio = text_to_speech(translate_text(transcribe_audio(audio)))
    return output_audio
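If you want to exercise the full pipeline without the UI, load a local recording with librosa and hand it over in Gradio's (sample_rate, samples) format (the filename is a placeholder):

# End-to-end check without the UI (placeholder filename)
y, sr = librosa.load("sample_english.wav", sr=None)  # keep the original sample rate
print(voice_to_voice((sr, y)))  # prints the path to the generated Chinese audio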
In this final step, we wrap our voice-to-voice translation pipeline into a user-friendly interface using Gradio.
We create a gr.Interface instance to handle recording from the microphone, running the voice_to_voice pipeline, and playing back the generated Chinese audio.
Finally, we call .launch() with share=True to serve the app locally and generate a temporary public link you can share with others.
# --- Gradio UI ---
demo = gr.Interface(
    fn=voice_to_voice,
    inputs=gr.Audio(sources=["microphone"]),
    outputs=gr.Audio(type="filepath"),
    title="Voice-to-Voice Translator",
    description="🎤 Speak in English → 📝 Get Chinese text → 🔊 Listen to Chinese speech.",
    live=True
)

demo.launch(share=True)
GMI Cloud is here to help you get your AI project off the ground—fast. With GPU-ready APIs and hosted endpoints, you don’t need to wrangle infrastructure just to experiment.
Join the GMI Discord community to connect with other builders and share your creations.
Give GMI Cloud a try and see for yourself if it's a good fit for your AI needs.