
Real-time Voice Translation Pipeline

Build a low-latency speech-to-speech translation system using Whisper, NLLB-200 / SeamlessM4T, and Bark / XTTS


Problem Statement

We asked NEO to: build a speech-to-speech translation system chaining Whisper (large-v3) for transcription, NLLB-200 or SeamlessM4T for translation, and Bark or XTTS for speech synthesis, optimizing for <2 second end-to-end latency.


Solution Overview

NEO designed a chained multi-model inference pipeline optimized for low-latency execution:

  1. Whisper (large-v3) for high-accuracy speech transcription
  2. NLLB-200 or SeamlessM4T for multilingual translation
  3. Bark or XTTS for natural speech synthesis

The pipeline uses streaming inference, batching, and asynchronous execution to meet real-time performance requirements.
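The chained, asynchronous execution can be pictured as three concurrent stages connected by queues, so later chunks enter transcription while earlier chunks are still being translated or synthesized. The sketch below is a minimal Python/asyncio skeleton of that idea; the stage wrappers (transcribe_chunk, translate_text, synthesize_speech) are hypothetical placeholders for the Whisper, NLLB-200 / SeamlessM4T, and Bark / XTTS calls, not code from the generated repository.

```python
import asyncio

# Hypothetical stage wrappers (illustrative only): in the real pipeline each would
# run the corresponding model -- Whisper large-v3, NLLB-200 / SeamlessM4T,
# Bark / XTTS -- loaded once at startup and warmed up with a dummy request.
async def transcribe_chunk(chunk: bytes) -> str:
    return "partial transcript"        # placeholder for the Whisper call

async def translate_text(text: str, tgt_lang: str) -> str:
    return f"[{tgt_lang}] {text}"      # placeholder for the NLLB-200 / SeamlessM4T call

async def synthesize_speech(text: str, tgt_lang: str) -> bytes:
    return text.encode()               # placeholder for the Bark / XTTS call


async def asr_stage(audio_q: asyncio.Queue, text_q: asyncio.Queue) -> None:
    # Consume overlapping audio chunks and emit partial transcripts.
    while (chunk := await audio_q.get()) is not None:
        text = await transcribe_chunk(chunk)
        if text.strip():
            await text_q.put(text)
    await text_q.put(None)             # propagate end-of-stream


async def mt_stage(text_q: asyncio.Queue, tts_q: asyncio.Queue, tgt_lang: str) -> None:
    while (text := await text_q.get()) is not None:
        await tts_q.put(await translate_text(text, tgt_lang))
    await tts_q.put(None)


async def tts_stage(tts_q: asyncio.Queue, playback_q: asyncio.Queue, tgt_lang: str) -> None:
    while (text := await tts_q.get()) is not None:
        await playback_q.put(await synthesize_speech(text, tgt_lang))
    await playback_q.put(None)


async def run_pipeline(audio_q: asyncio.Queue, playback_q: asyncio.Queue,
                       tgt_lang: str = "fr") -> None:
    """Run the three stages concurrently: chunk N can be synthesized while
    chunk N+1 is still being transcribed, which is what keeps latency low."""
    text_q: asyncio.Queue = asyncio.Queue()
    tts_q: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(
        asr_stage(audio_q, text_q),
        mt_stage(text_q, tts_q, tgt_lang),
        tts_stage(tts_q, playback_q, tgt_lang),
    )
```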

[Diagram: Voice Translation Pipeline]


Workflow / Pipeline

Step | Description
1. Audio Ingestion | Stream microphone input in small overlapping chunks
2. Speech-to-Text | Whisper (large-v3) performs incremental transcription
3. Text Translation | NLLB-200 or SeamlessM4T translates transcribed text to the target language
4. Text-to-Speech | Bark or XTTS generates natural speech output
5. Audio Playback | Synthesized speech is streamed back to the user in real time
6. Latency Optimization | Async execution, chunking, and model warm-up keep latency <2s
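Steps 1 and 2 amount to windowing the input stream with some overlap, then transcribing each window incrementally. The snippet below is an illustrative sketch using the open-source whisper package; the 2 s chunk size, 0.5 s overlap, and English source language are assumptions for the example, not values taken from the case study.

```python
import numpy as np
import whisper  # pip install openai-whisper

SAMPLE_RATE = 16_000            # Whisper expects 16 kHz mono audio
CHUNK_S, OVERLAP_S = 2.0, 0.5   # illustrative chunk/overlap sizes (assumed)

model = whisper.load_model("large-v3")


def overlapping_chunks(stream: np.ndarray):
    """Yield overlapping windows so a word cut at one boundary reappears in the next chunk."""
    step = int((CHUNK_S - OVERLAP_S) * SAMPLE_RATE)
    size = int(CHUNK_S * SAMPLE_RATE)
    for start in range(0, max(len(stream) - size, 0) + 1, step):
        yield stream[start:start + size].astype(np.float32)


def transcribe_stream(stream: np.ndarray, language: str = "en") -> str:
    """Incrementally transcribe each chunk and stitch the partial texts together."""
    pieces = []
    for chunk in overlapping_chunks(stream):
        result = model.transcribe(chunk, language=language, fp16=True)
        pieces.append(result["text"].strip())
    return " ".join(p for p in pieces if p)
```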
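For steps 3 and 4, one possible realization pairs an NLLB-200 checkpoint from Hugging Face transformers with Coqui's XTTS v2 via the TTS package. The specific checkpoint IDs, the English-to-French language pair, and the speaker_ref.wav voice sample are assumptions made for this sketch; the case study only specifies NLLB-200 / SeamlessM4T and Bark / XTTS.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from TTS.api import TTS  # Coqui TTS, provides XTTS

# Assumed model variants -- the case study does not name specific checkpoints.
NLLB_ID = "facebook/nllb-200-distilled-600M"
XTTS_ID = "tts_models/multilingual/multi-dataset/xtts_v2"

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(NLLB_ID, src_lang="eng_Latn")
mt_model = AutoModelForSeq2SeqLM.from_pretrained(NLLB_ID).to(device)
tts = TTS(XTTS_ID).to(device)


def translate(text: str, tgt_lang: str = "fra_Latn") -> str:
    """Translate with NLLB-200, forcing the target-language BOS token."""
    inputs = tokenizer(text, return_tensors="pt").to(device)
    generated = mt_model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=128,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]


def speak(text: str, out_path: str = "out.wav") -> str:
    """Synthesize speech with XTTS; speaker_ref.wav is a hypothetical voice sample."""
    tts.tts_to_file(text=text, language="fr",
                    speaker_wav="speaker_ref.wav", file_path=out_path)
    return out_path


if __name__ == "__main__":
    fr = translate("How are you today?")
    print(fr, "->", speak(fr))
```

In a production pipeline these models would be loaded once at startup and warmed up, as noted in step 6, rather than instantiated per request.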

Repository & Artifacts


Generated Artifacts:


Technical Details


Results

The system maintains conversational flow without noticeable lag, consistent with the <2 second end-to-end latency target.


Best Practices & Lessons Learned


Next Steps


References

