Published on: May 27, 2025
Automatic Speech Recognition (ASR) technology continues to advance, but accuracy gains for English pre-recorded content have plateaued. According to the latest State of ASR report by 3Play Media, human review remains essential to meet accessibility standards for captioning and transcription.
ASR Accuracy Plateau
Remarkable progress has been made, but the accuracy of leading ASR engines still falls short of accessibility requirements.
The gap between top-performing engines and others has widened.
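This summary does not say how the report computes its error rates. A common metric for ASR accuracy is word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the engine output into the reference transcript, divided by the number of reference words. The sketch below assumes that metric and uses a hypothetical function name and a simplified normalization (lowercasing and whitespace splitting); it is not the report's scoring method.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Simplified normalization (an assumption): lowercase, split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
print(round(word_error_rate("the cat sat on the mat",
                            "the cat sat on a mat"), 3))  # 0.167
```

Because the denominator is the reference length, WER can exceed 1.0 when an engine inserts many extra words, which is one way hallucinated text shows up in this kind of score.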
Study Scope
Evaluated 205 hours of diverse audio content, a 30% increase from last year, spanning multiple industries and use cases.
Tested eight ASR engines, plus Gemini, a multimodal large language model (LLM).
Engine Performance
Whisper X showed improved accuracy and avoided hallucinations found in earlier Whisper versions.
AssemblyAI’s Universal-2 and Whisper X outperformed Speechmatics, with all three ahead of other tested engines.
Industry Variations
ASR accuracy varies by industry, highlighting the need for tailored solutions based on content type.
Sports content remains the most challenging due to noisy environments and complex terminology, with error rates three times higher than those of the top-performing industries.
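A per-industry comparison like the one above can be sketched as follows. The industry names other than sports, the function name, and every number in `results` are invented for illustration (chosen so the worst-to-best ratio comes out near three); none of them are figures from the report. The sketch pools errors and reference words per industry before dividing, which is one reasonable aggregation choice among several.

```python
from collections import defaultdict

# Hypothetical per-file results: (industry, word errors, reference words).
# These values are made up for illustration and are NOT from the report.
results = [
    ("sports", 90, 1000),
    ("sports", 60, 1000),
    ("education", 20, 1000),
    ("education", 30, 1000),
    ("finance", 40, 1000),
]


def error_rate_by_industry(rows):
    """Pool errors and reference words per industry, then divide."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for industry, n_errors, n_words in rows:
        errors[industry] += n_errors
        words[industry] += n_words
    return {industry: errors[industry] / words[industry] for industry in errors}


rates = error_rate_by_industry(results)
best = min(rates, key=rates.get)    # industry with the lowest error rate
worst = max(rates, key=rates.get)   # industry with the highest error rate
ratio = rates[worst] / rates[best]  # how many times worse the hardest is

print(f"{worst} vs {best}: {ratio:.1f}x the error rate")
```

Pooling weights long files more heavily than short ones; averaging per-file error rates instead would treat every file equally and can give a different ranking.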
LLMs and Future Trends
Large language models are not yet ready to replace dedicated ASR engines for transcription.
Future ASR innovation will likely focus on real-time processing and support for non-English languages rather than further improving English pre-recorded content accuracy.
While ASR technologies are becoming more sophisticated, 3Play Media’s report emphasizes the ongoing necessity of human-in-the-loop workflows to ensure captioning and transcription meet accessibility standards. The report also suggests that future ASR developments will shift toward new applications and broader language capabilities.