Real-time Voice Translation Pipeline
Build a low-latency speech-to-speech translation system using Whisper, NLLB-200 / SeamlessM4T, and Bark / XTTS
Problem Statement
Real-time voice translation is a challenging systems problem that requires high accuracy and ultra-low latency. The system must transcribe speech, translate it into a target language, and synthesize natural-sounding speech — all within strict time constraints.
Task Goals:
- Enable speech-to-speech translation across multiple languages
- Maintain end-to-end latency under 2 seconds
- Preserve speaker intent and prosody
- Support real-time streaming audio input
Solution Overview
NEO designed a chained multi-model inference pipeline optimized for low-latency execution:
- Whisper (large-v3) for high-accuracy speech transcription
- NLLB-200 or SeamlessM4T for multilingual translation
- Bark or XTTS for natural speech synthesis
The pipeline uses streaming inference, batching, and asynchronous execution to meet real-time performance requirements.
Workflow / Pipeline
| Step | Description |
|---|---|
| 1. Audio Ingestion | Stream microphone input in small overlapping chunks |
| 2. Speech-to-Text | Whisper (large-v3) performs incremental transcription |
| 3. Text Translation | NLLB-200 or SeamlessM4T translates transcribed text to target language |
| 4. Text-to-Speech | Bark or XTTS generates natural speech output |
| 5. Audio Playback | Synthesized speech streamed back to the user in real time |
| 6. Latency Optimization | Async execution, chunking, and model warm-up to keep latency <2s |
Repository & Artifacts
GitHub Repository:
Real-time Voice Translation Pipeline by NEO
Generated Artifacts:
- Streaming audio ingestion scripts
- Whisper transcription pipeline
- Translation adapters for NLLB-200 / SeamlessM4T
- Bark / XTTS speech synthesis modules
- Latency benchmarking reports
- End-to-end inference orchestration code
Technical Details
- Audio Chunking: Small overlapping windows to reduce transcription lag
- Transcription: Whisper large-v3 with partial decoding enabled
- Translation:
- NLLB-200 for wide multilingual coverage
- SeamlessM4T for speech-aligned translation
- Speech Synthesis:
- Bark for expressive speech generation
- XTTS for faster, lower-latency synthesis
- Concurrency: Async pipelines and parallel GPU execution
- Caching: Model warm-up and reusable translation context
Results
- End-to-End Latency: ~1.6–1.9 seconds on GPU
- Transcription Accuracy: 95%+ across supported languages
- Translation Quality: High BLEU scores on conversational speech
- Speech Output: Natural prosody with minimal audible delay
The system successfully maintains conversational flow without noticeable lag.
Best Practices & Lessons Learned
- Stream audio in small chunks instead of waiting for full sentences
- Warm up all models before accepting real-time traffic
- Decouple transcription, translation, and synthesis into async stages
- Prefer XTTS when ultra-low latency is more critical than expressiveness
- Log latency at each stage to identify bottlenecks
Next Steps
- Add adaptive bitrate and noise suppression for live environments
- Enable speaker voice cloning for personalized translation
- Extend support to mobile and edge devices
- Integrate WebRTC for browser-based real-time translation
References
- GitHub Repository
- Whisper (large-v3): https://github.com/openai/whisper
- NLLB-200: https://ai.facebook.com/research/no-language-left-behind/
- SeamlessM4T: https://ai.facebook.com/research/seamless-communication/
- Bark TTS: https://github.com/suno-ai/bark
- XTTS: https://github.com/coqui-ai/TTS