
Real-time Voice Translation Pipeline

Build a low-latency speech-to-speech translation system using Whisper, NLLB-200 / SeamlessM4T, and Bark / XTTS


Problem Statement

Real-time voice translation is a challenging systems problem that requires high accuracy and ultra-low latency. The system must transcribe speech, translate it into a target language, and synthesize natural-sounding speech — all within strict time constraints.

Task Goals:

  - Transcribe live speech with high accuracy
  - Translate the transcription into the target language
  - Synthesize natural-sounding speech in the target language
  - Keep end-to-end latency low (under 2 seconds) so conversational flow is preserved


Solution Overview

NEO designed a chained multi-model inference pipeline optimized for low-latency execution:

  1. Whisper (large-v3) for high-accuracy speech transcription
  2. NLLB-200 or SeamlessM4T for multilingual translation
  3. Bark or XTTS for natural speech synthesis

The pipeline uses streaming inference, batching, and asynchronous execution to meet real-time performance requirements.
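The full implementation lives in the repository; as a minimal sketch of the chained calls, the snippet below runs one English-to-Spanish pass using the openai-whisper, Hugging Face transformers, and Coqui TTS packages. The checkpoint names, language codes, and file paths are illustrative assumptions, not values taken from the project.

```python
# Minimal sketch of one pass through the chain (English speech -> Spanish speech).
# Checkpoints, language codes, and paths below are illustrative assumptions.
import whisper
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from TTS.api import TTS

# 1. Speech-to-text with Whisper large-v3
stt = whisper.load_model("large-v3")
text = stt.transcribe("input.wav")["text"]

# 2. Text translation with a distilled NLLB-200 checkpoint
mt_name = "facebook/nllb-200-distilled-600M"
tok = AutoTokenizer.from_pretrained(mt_name, src_lang="eng_Latn")
mt = AutoModelForSeq2SeqLM.from_pretrained(mt_name)
batch = tok(text, return_tensors="pt")
tokens = mt.generate(
    **batch,
    forced_bos_token_id=tok.convert_tokens_to_ids("spa_Latn"),  # target language
    max_new_tokens=256,
)
translated = tok.batch_decode(tokens, skip_special_tokens=True)[0]

# 3. Text-to-speech with Coqui XTTS v2 (voice conditioned on a short reference clip)
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
    text=translated,
    speaker_wav="reference_voice.wav",
    language="es",
    file_path="output.wav",
)
```

Swapping in SeamlessM4T or Bark changes only the middle or final step; the chained structure stays the same.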


Workflow / Pipeline

  1. Audio Ingestion: Stream microphone input in small overlapping chunks.
  2. Speech-to-Text: Whisper (large-v3) performs incremental transcription.
  3. Text Translation: NLLB-200 or SeamlessM4T translates the transcribed text into the target language.
  4. Text-to-Speech: Bark or XTTS generates natural speech output.
  5. Audio Playback: Synthesized speech is streamed back to the user in real time.
  6. Latency Optimization: Async execution, chunking, and model warm-up keep end-to-end latency under 2 seconds (see the sketch below).
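The project's orchestration code is not reproduced here; as a rough illustration of steps 1 and 6, the sketch below slices audio into overlapping chunks, connects the stages with asyncio queues, runs each blocking model call in a worker thread so stages overlap, and warms the models up before the first chunk arrives. The chunk sizes, queue depths, and the transcribe/translate/synthesize/play callables are placeholders rather than values from the repository.

```python
# Minimal sketch of the async stage layout. Assumptions throughout: 16 kHz mono
# float32 audio, 2 s chunks with 0.5 s overlap, and caller-supplied callables
# (transcribe / translate / synthesize / play) standing in for the model calls.
import asyncio
import time

import numpy as np

SAMPLE_RATE = 16_000
CHUNK_S = 2.0     # chunk length in seconds (assumption)
OVERLAP_S = 0.5   # overlap between consecutive chunks (assumption)

async def ingest(audio: np.ndarray, out_q: asyncio.Queue) -> None:
    """Slice a waveform into overlapping chunks, paced to simulate a live mic."""
    size, step = int(CHUNK_S * SAMPLE_RATE), int((CHUNK_S - OVERLAP_S) * SAMPLE_RATE)
    for start in range(0, max(len(audio) - size, 1), step):
        await out_q.put(audio[start:start + size])
        await asyncio.sleep(CHUNK_S - OVERLAP_S)   # real-time pacing
    await out_q.put(None)                          # end-of-stream marker

async def stage(name: str, fn, in_q: asyncio.Queue, out_q: asyncio.Queue) -> None:
    """Generic stage: run the blocking model call in a thread so stages overlap."""
    while True:
        item = await in_q.get()
        if item is None:
            await out_q.put(None)
            return
        t0 = time.perf_counter()
        result = await asyncio.to_thread(fn, item)
        print(f"{name}: {time.perf_counter() - t0:.2f}s")  # per-stage latency
        await out_q.put(result)

async def run(audio, transcribe, translate, synthesize, play) -> None:
    # Warm-up: one dummy pass per model so the first real chunk is not slowed
    # down by lazy initialization (weight loading, CUDA kernels, caches).
    transcribe(np.zeros(SAMPLE_RATE, dtype=np.float32))
    synthesize(translate("warm up"))

    q_audio, q_text, q_xlat, q_speech, q_done = (asyncio.Queue(maxsize=4) for _ in range(5))
    await asyncio.gather(
        ingest(audio, q_audio),
        stage("stt", transcribe, q_audio, q_text),
        stage("mt", translate, q_text, q_xlat),
        stage("tts", synthesize, q_xlat, q_speech),
        stage("playback", play, q_speech, q_done),
    )
```

Bounded queues apply backpressure, so a slow stage cannot let work pile up ahead of it and inflate latency.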

Repository & Artifacts

GitHub Repository:
Real-time Voice Translation Pipeline by NEO 

Generated Artifacts:


Technical Details


Results

The system maintains conversational flow without noticeable lag, consistent with the sub-2-second end-to-end latency target set in the workflow above.


Best Practices & Lessons Learned


Next Steps


References