Multilingual ASR Pipeline with Qwen3-ASR-0.6B

Build a production-ready automatic speech recognition system supporting 52 languages with streaming inference, timestamp alignment, and high-throughput batch processing


Problem Statement

We asked NEO to build a complete ASR pipeline using Qwen3-ASR-0.6B that supports multilingual speech recognition across 52 languages and dialects, implements streaming inference for real-time transcription, provides word-level timestamp alignment, handles long-form audio, and achieves 2000x throughput at high concurrency.


Solution Overview

NEO designed a speech-to-text pipeline built from four components:

  1. Qwen3-ASR-0.6B Model for fast, accurate multilingual transcription
  2. Streaming Inference Engine enabling real-time transcription with low latency
  3. Forced Alignment Module generating word-level timestamps for precise synchronization
  4. Batch Processing System handling large-scale audio transcription workloads

The pipeline achieves production-grade accuracy with 2000x throughput while keeping latency under 100 ms for streaming applications.
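The source does not describe how the streaming engine feeds audio to the model. As a minimal sketch of one common approach, the audio can be cut into short overlapping windows that are transcribed as they arrive. The function name `stream_chunks` and the window and overlap sizes are illustrative assumptions, not values taken from the pipeline.

```python
from typing import Iterator, List, Tuple


def stream_chunks(
    samples: List[float],
    sample_rate: int = 16_000,
    chunk_s: float = 2.0,     # assumed window length, not from the source
    overlap_s: float = 0.5,   # assumed overlap between consecutive windows
) -> Iterator[Tuple[float, List[float]]]:
    """Yield (start_time_in_seconds, window) pairs over a mono signal."""
    chunk = int(chunk_s * sample_rate)
    hop = chunk - int(overlap_s * sample_rate)
    if chunk <= 0 or hop <= 0:
        raise ValueError("chunk_s must be positive and larger than overlap_s")

    start = 0
    while start < len(samples):
        yield start / sample_rate, samples[start:start + chunk]
        if start + chunk >= len(samples):
            break  # the window just emitted already reached the end
        start += hop


# Five seconds of silence at 16 kHz produce windows starting at 0.0, 1.5, 3.0 s.
offsets = [t for t, _ in stream_chunks([0.0] * (16_000 * 5))]
```

Each window would then be passed to the recognizer, with the overlap used to reconcile words that straddle a boundary. Shorter windows lower latency at the cost of less context per call.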


Workflow / Pipeline

| Step | Description |
|------|-------------|
| 1. Audio Preprocessing | Resample audio to 16 kHz, normalize volume, apply VAD filtering |
| 2. Language Detection | Automatic identification of spoken language from 52 supported options |
| 3. Speech Recognition | Qwen3-ASR-0.6B transcribes audio to text with high accuracy |
| 4. Timestamp Alignment | Qwen3-ForcedAligner-0.6B generates word/character-level timestamps |
| 5. Post-Processing | Punctuation restoration, speaker diarization, formatting |
| 6. Output Generation | JSON/SRT/VTT export with transcripts and timing information |
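To make the preprocessing step concrete, here is a minimal pure-Python sketch of resampling to 16 kHz and volume normalization. It is an illustration under stated assumptions, not the pipeline's actual code: the helper names are hypothetical, linear interpolation stands in for a proper band-limited resampler (it applies no anti-aliasing filter), and peak normalization is only one of several ways to normalize volume. VAD filtering is not shown.

```python
from typing import List

TARGET_RATE = 16_000  # sample rate named in the preprocessing step


def resample_linear(samples: List[float], src_rate: int,
                    dst_rate: int = TARGET_RATE) -> List[float]:
    """Resample a mono signal by linear interpolation (no anti-aliasing)."""
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if not samples or src_rate == dst_rate:
        return list(samples)

    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate          # source samples per output sample
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def peak_normalize(samples: List[float], peak: float = 0.95) -> List[float]:
    """Scale the signal so its largest absolute value equals `peak`."""
    max_abs = max((abs(s) for s in samples), default=0.0)
    if max_abs == 0.0:
        return list(samples)            # silence: nothing to scale
    gain = peak / max_abs
    return [s * gain for s in samples]


# One second of 48 kHz audio becomes 16,000 samples after resampling.
prepared = peak_normalize(resample_linear([0.1, -0.5, 0.25] * 16_000, 48_000))
```

In practice a dedicated audio library with a filtered resampler would be preferable, especially when downsampling from 44.1 or 48 kHz.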

Repository & Artifacts

Generated Artifacts:


Technical Details


Results

Example Transcription Output

{
  "text": "Hello, this is a test of the Qwen3 ASR system.",
  "language": "en",
  "confidence": 0.98,
  "duration": 3.52,
  "words": [
    {"word": "Hello", "start": 0.00, "end": 0.32, "confidence": 0.99},
    {"word": "this", "start": 0.48, "end": 0.68, "confidence": 0.97},
    {"word": "is", "start": 0.72, "end": 0.84, "confidence": 0.98},
    {"word": "a", "start": 0.88, "end": 0.96, "confidence": 0.95},
    {"word": "test", "start": 1.00, "end": 1.28, "confidence": 0.99},
    {"word": "of", "start": 1.32, "end": 1.48, "confidence": 0.96},
    {"word": "the", "start": 1.52, "end": 1.64, "confidence": 0.98},
    {"word": "Qwen3", "start": 1.68, "end": 2.08, "confidence": 0.97},
    {"word": "ASR", "start": 2.12, "end": 2.56, "confidence": 0.99},
    {"word": "system", "start": 2.60, "end": 3.12, "confidence": 0.98}
  ],
  "processing_time": 0.45
}
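The word-level timestamps in this output are what the SRT export in the output generation step would be built from. The following is a hedged sketch of that conversion, assuming the `words` structure shown above. The function names and the fixed five-words-per-cue grouping are illustrative choices, not the pipeline's documented behavior.

```python
from typing import Dict, List


def format_srt_time(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Dict], max_words: int = 5) -> str:
    """Group consecutive words into numbered SRT cues of up to max_words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    cues = []
    for idx in range(0, len(words), max_words):
        group = words[idx:idx + max_words]
        start = format_srt_time(group[0]["start"])   # first word opens the cue
        end = format_srt_time(group[-1]["end"])      # last word closes it
        text = " ".join(w["word"] for w in group)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


sample_words = [
    {"word": "Hello", "start": 0.00, "end": 0.32},
    {"word": "this", "start": 0.48, "end": 0.68},
    {"word": "is", "start": 0.72, "end": 0.84},
]
print(words_to_srt(sample_words))
```

A real exporter would more likely break cues on pauses or punctuation than on a fixed word count. WebVTT output differs mainly in its `WEBVTT` header line and in using a period instead of a comma before the milliseconds.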

Best Practices & Lessons Learned


Next Steps


References

