
Real-time Voice Translation Pipeline

Build a low-latency speech-to-speech translation system using Whisper, NLLB-200 / SeamlessM4T, and Bark / XTTS


Problem Statement

We asked NEO to: build a speech-to-speech translation system chaining Whisper (large-v3) for transcription, NLLB-200 or SeamlessM4T for translation, and Bark or XTTS for speech synthesis, optimizing for <2 second end-to-end latency.


Solution Overview

NEO designed a chained multi-model inference pipeline optimized for low-latency execution:

  1. Whisper (large-v3) for high-accuracy speech transcription
  2. NLLB-200 or SeamlessM4T for multilingual translation
  3. Bark or XTTS for natural speech synthesis

The pipeline uses streaming inference, batching, and asynchronous execution to meet real-time performance requirements.
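The chained, asynchronous execution can be pictured as three concurrent stages connected by queues, so later chunks enter transcription while earlier chunks are still being translated or synthesized. The sketch below is a minimal Python/asyncio skeleton of that idea; the stage wrappers (transcribe_chunk, translate_text, synthesize_speech) are hypothetical placeholders for the Whisper, NLLB-200 / SeamlessM4T, and Bark / XTTS calls, not code from the generated repository.

```python
import asyncio

# Hypothetical stage wrappers (illustrative only): in the real pipeline each would
# run the corresponding model -- Whisper large-v3, NLLB-200 / SeamlessM4T,
# Bark / XTTS -- loaded once at startup and warmed up with a dummy request.
async def transcribe_chunk(chunk: bytes) -> str:
    return "partial transcript"        # placeholder for the Whisper call

async def translate_text(text: str, tgt_lang: str) -> str:
    return f"[{tgt_lang}] {text}"      # placeholder for the NLLB-200 / SeamlessM4T call

async def synthesize_speech(text: str, tgt_lang: str) -> bytes:
    return text.encode()               # placeholder for the Bark / XTTS call


async def asr_stage(audio_q: asyncio.Queue, text_q: asyncio.Queue) -> None:
    # Consume overlapping audio chunks and emit partial transcripts.
    while (chunk := await audio_q.get()) is not None:
        text = await transcribe_chunk(chunk)
        if text.strip():
            await text_q.put(text)
    await text_q.put(None)             # propagate end-of-stream


async def mt_stage(text_q: asyncio.Queue, tts_q: asyncio.Queue, tgt_lang: str) -> None:
    while (text := await text_q.get()) is not None:
        await tts_q.put(await translate_text(text, tgt_lang))
    await tts_q.put(None)


async def tts_stage(tts_q: asyncio.Queue, playback_q: asyncio.Queue, tgt_lang: str) -> None:
    while (text := await tts_q.get()) is not None:
        await playback_q.put(await synthesize_speech(text, tgt_lang))
    await playback_q.put(None)


async def run_pipeline(audio_q: asyncio.Queue, playback_q: asyncio.Queue,
                       tgt_lang: str = "fr") -> None:
    """Run the three stages concurrently: chunk N can be synthesized while
    chunk N+1 is still being transcribed, which is what keeps latency low."""
    text_q: asyncio.Queue = asyncio.Queue()
    tts_q: asyncio.Queue = asyncio.Queue()
    await asyncio.gather(
        asr_stage(audio_q, text_q),
        mt_stage(text_q, tts_q, tgt_lang),
        tts_stage(tts_q, playback_q, tgt_lang),
    )
```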

[Diagram: Voice Translation Pipeline]


Workflow / Pipeline

Step | Description
1. Audio Ingestion | Stream microphone input in small overlapping chunks
2. Speech-to-Text | Whisper (large-v3) performs incremental transcription
3. Text Translation | NLLB-200 or SeamlessM4T translates transcribed text to the target language
4. Text-to-Speech | Bark or XTTS generates natural speech output
5. Audio Playback | Synthesized speech is streamed back to the user in real time
6. Latency Optimization | Async execution, chunking, and model warm-up keep latency <2s
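Steps 1 and 2 amount to windowing the input stream with some overlap, then transcribing each window incrementally. The snippet below is an illustrative sketch using the open-source whisper package; the 2 s chunk size, 0.5 s overlap, and English source language are assumptions for the example, not values taken from the case study.

```python
import numpy as np
import whisper  # pip install openai-whisper

SAMPLE_RATE = 16_000            # Whisper expects 16 kHz mono audio
CHUNK_S, OVERLAP_S = 2.0, 0.5   # illustrative chunk/overlap sizes (assumed)

model = whisper.load_model("large-v3")


def overlapping_chunks(stream: np.ndarray):
    """Yield overlapping windows so a word cut at one boundary reappears in the next chunk."""
    step = int((CHUNK_S - OVERLAP_S) * SAMPLE_RATE)
    size = int(CHUNK_S * SAMPLE_RATE)
    for start in range(0, max(len(stream) - size, 0) + 1, step):
        yield stream[start:start + size].astype(np.float32)


def transcribe_stream(stream: np.ndarray, language: str = "en") -> str:
    """Incrementally transcribe each chunk and stitch the partial texts together."""
    pieces = []
    for chunk in overlapping_chunks(stream):
        result = model.transcribe(chunk, language=language, fp16=True)
        pieces.append(result["text"].strip())
    return " ".join(p for p in pieces if p)
```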
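For steps 3 and 4, one possible realization pairs an NLLB-200 checkpoint from Hugging Face transformers with Coqui's XTTS v2 via the TTS package. The specific checkpoint IDs, the English-to-French language pair, and the speaker_ref.wav voice sample are assumptions made for this sketch; the case study only specifies NLLB-200 / SeamlessM4T and Bark / XTTS.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from TTS.api import TTS  # Coqui TTS, provides XTTS

# Assumed model variants -- the case study does not name specific checkpoints.
NLLB_ID = "facebook/nllb-200-distilled-600M"
XTTS_ID = "tts_models/multilingual/multi-dataset/xtts_v2"

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(NLLB_ID, src_lang="eng_Latn")
mt_model = AutoModelForSeq2SeqLM.from_pretrained(NLLB_ID).to(device)
tts = TTS(XTTS_ID).to(device)


def translate(text: str, tgt_lang: str = "fra_Latn") -> str:
    """Translate with NLLB-200, forcing the target-language BOS token."""
    inputs = tokenizer(text, return_tensors="pt").to(device)
    generated = mt_model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=128,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]


def speak(text: str, out_path: str = "out.wav") -> str:
    """Synthesize speech with XTTS; speaker_ref.wav is a hypothetical voice sample."""
    tts.tts_to_file(text=text, language="fr",
                    speaker_wav="speaker_ref.wav", file_path=out_path)
    return out_path


if __name__ == "__main__":
    fr = translate("How are you today?")
    print(fr, "->", speak(fr))
```

In a production pipeline these models would be loaded once at startup and warmed up, as noted in step 6, rather than instantiated per request.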

Repository & Artifacts


Generated Artifacts:


Technical Details


Results

The system maintains conversational flow without noticeable lag, consistent with the <2 second end-to-end latency target.


Best Practices & Lessons Learned


Next Steps


References

