Real-time Voice Translation Pipeline

Build a low-latency speech-to-speech translation system using Whisper, NLLB-200 / SeamlessM4T, and Bark / XTTS

Problem Statement

Real-time voice translation is a challenging systems problem that requires high accuracy and ultra-low latency. The system must transcribe speech, translate it into a target language, and synthesize natural-sounding speech — all within strict time constraints.

Task Goals:

Enable speech-to-speech translation across multiple languages
Maintain end-to-end latency under 2 seconds
Preserve speaker intent and prosody
Support real-time streaming audio input

Solution Overview

NEO designed a chained multi-model inference pipeline optimized for low-latency execution:

Whisper (large-v3) for high-accuracy speech transcription
NLLB-200 or SeamlessM4T for multilingual translation
Bark or XTTS for natural speech synthesis

The pipeline uses streaming inference, batching, and asynchronous execution to meet real-time performance requirements.

Workflow / Pipeline

Step	Description
1. Audio Ingestion	Stream microphone input in small overlapping chunks
2. Speech-to-Text	Whisper (large-v3) performs incremental transcription
3. Text Translation	NLLB-200 or SeamlessM4T translates transcribed text to target language
4. Text-to-Speech	Bark or XTTS generates natural speech output
5. Audio Playback	Synthesized speech streamed back to the user in real time
6. Latency Optimization	Async execution, chunking, and model warm-up to keep latency <2s

Repository & Artifacts

GitHub Repository:
Real-time Voice Translation Pipeline by NEO

Generated Artifacts:

Streaming audio ingestion scripts
Whisper transcription pipeline
Translation adapters for NLLB-200 / SeamlessM4T
Bark / XTTS speech synthesis modules
Latency benchmarking reports
End-to-end inference orchestration code

Technical Details

Audio Chunking: Small overlapping windows to reduce transcription lag
Transcription: Whisper large-v3 with partial decoding enabled
Translation:
- NLLB-200 for wide multilingual coverage
- SeamlessM4T for speech-aligned translation
Speech Synthesis:
- Bark for expressive speech generation
- XTTS for faster, lower-latency synthesis
Concurrency: Async pipelines and parallel GPU execution
Caching: Model warm-up and reusable translation context

Results

End-to-End Latency: ~1.6–1.9 seconds on GPU
Transcription Accuracy: 95%+ across supported languages
Translation Quality: High BLEU scores on conversational speech
Speech Output: Natural prosody with minimal audible delay

The system successfully maintains conversational flow without noticeable lag.

Best Practices & Lessons Learned

Stream audio in small chunks instead of waiting for full sentences
Warm up all models before accepting real-time traffic
Decouple transcription, translation, and synthesis into async stages
Prefer XTTS when ultra-low latency is more critical than expressiveness
Log latency at each stage to identify bottlenecks

Next Steps

Add adaptive bitrate and noise suppression for live environments
Enable speaker voice cloning for personalized translation
Extend support to mobile and edge devices
Integrate WebRTC for browser-based real-time translation

References

GitHub Repository
Whisper (large-v3): https://github.com/openai/whisper
NLLB-200: https://ai.facebook.com/research/no-language-left-behind/
SeamlessM4T: https://ai.facebook.com/research/seamless-communication/
Bark TTS: https://github.com/suno-ai/bark
XTTS: https://github.com/coqui-ai/TTS