Voxtral Transcribe 2 delivers ultra-fast, highly accurate speech-to-text with real-time transcription and speaker diarization. Built for live apps, voice agents, and meetings, it supports 13 languages, word-level timestamps, and privacy-first deployment.

Voxtral Transcribe 2 is a family of next-generation speech-to-text models for real-time and batch transcription. It consists of two models: Voxtral Mini Transcribe V2 for high-accuracy batch processing and Voxtral Realtime for live applications. The models are designed for developers building voice agents, meeting intelligence tools, and multilingual transcription services. Their core value is state-of-the-art transcription quality, latency that can drop below 200 ms, and speaker diarization, all at an industry-leading price-performance ratio.

Traditional speech-to-text solutions often struggle with latency, accuracy in noisy environments, and speaker attribution. Voice agents and live transcription apps need sub-second response times, yet many systems cannot deliver real-time results without sacrificing quality. Handling multiple speakers in meetings or call centers is error-prone and leads to garbled transcripts, and multi-language support is often limited or expensive. These pain points force developers to compromise on performance or incur high costs. Voxtral Transcribe 2 addresses them with a batch model that offers best-in-class word error rate and a streaming model with configurable delay, covering speed, accuracy, diarization, and multilingual support.

Voxtral Transcribe 2 includes speaker diarization and word-level timestamps as key features. Speaker diarization generates transcriptions with labels identifying who spoke each segment, along with precise start and end times. This makes it ideal for meeting transcription, interview analysis, and multi-party call processing. Word-level timestamps provide start and end times for every word, enabling applications like subtitle generation, audio search within recordings, and precise content alignment. These features work together to give developers full control over transcription output, allowing downstream systems to index, search, and attribute speech accurately.
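To make the subtitle use case concrete, here is a minimal sketch of turning diarized, word-level output into SRT cues. The `Word` record and its field names (`text`, `start`, `end`, `speaker`) are assumptions made for illustration, not the documented response schema, and the `max_gap` threshold is an arbitrary choice; map the fields to whatever the actual transcription response provides.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical shape of one transcribed word; the field names are
    # assumptions, not the service's documented schema.
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio
    speaker: str  # diarization label, e.g. "A" or "speaker_1"


def srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(words: list[Word], max_gap: float = 1.0) -> str:
    """Merge consecutive words into cues and render them as SRT.

    A new cue starts when the speaker changes or when the pause since
    the previous word exceeds max_gap seconds.
    """
    cues: list[dict] = []
    for w in words:
        last = cues[-1] if cues else None
        if last and last["speaker"] == w.speaker and w.start - last["end"] <= max_gap:
            last["text"] += " " + w.text
            last["end"] = w.end
        else:
            cues.append({"speaker": w.speaker, "start": w.start,
                         "end": w.end, "text": w.text})

    blocks = [
        f"{i}\n{srt_time(c['start'])} --> {srt_time(c['end'])}\n"
        f"[{c['speaker']}] {c['text']}"
        for i, c in enumerate(cues, start=1)
    ]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    sample = [
        Word("Hello", 0.0, 0.4, "A"),
        Word("there", 0.5, 0.9, "A"),
        Word("Hi", 1.2, 1.5, "B"),
    ]
    print(to_srt(sample))
```

The same grouping step also serves meeting transcripts: keeping the speaker label and the cue boundaries lets a downstream index attribute each passage to a speaker and jump to its position in the recording.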

The model also introduces context biasing, allowing users to provide up to 100 words or phrases that guide transcription toward correct spellings of names, technical terms, or domain-specific vocabulary. This is particularly useful for proper nouns and industry terminology that generic models often miss, such as medication names in healthcare or product codes in manufacturing. Additionally, noise robustness ensures the model maintains accuracy in challenging acoustic environments like factory floors, busy call centers, and field recordings. These capabilities expand the usefulness of Voxtral Transcribe 2 beyond clean speech to real-world, noisy conditions.
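Because the bias list is capped at 100 entries, it can help to clean the list on the client before sending it. The sketch below is one possible approach, not part of the product: the helper name `prepare_bias_terms` is hypothetical, the case-insensitive deduplication is a design choice made here, and how the resulting list is attached to a transcription request is not shown because the request parameter is not described in this text.

```python
MAX_BIAS_TERMS = 100  # limit stated in the text above


def prepare_bias_terms(terms: list[str], limit: int = MAX_BIAS_TERMS) -> list[str]:
    """Normalize whitespace, drop blanks and case-insensitive duplicates,
    and fail loudly if the list still exceeds the limit.

    Hypothetical helper for illustration; not an official client function.
    """
    seen: set[str] = set()
    cleaned: list[str] = []
    for term in terms:
        normalized = " ".join(term.split())  # trim and collapse inner spaces
        key = normalized.casefold()
        if not normalized or key in seen:
            continue
        seen.add(key)
        cleaned.append(normalized)  # keep the first spelling encountered

    if len(cleaned) > limit:
        raise ValueError(
            f"{len(cleaned)} bias terms supplied, but at most {limit} are allowed"
        )
    return cleaned


if __name__ == "__main__":
    print(prepare_bias_terms(["  Voxtral ", "voxtral", "", "atorvastatin  calcium"]))
```

Raising an error rather than silently truncating keeps the choice of which terms to drop with the caller, who knows which names or product codes matter most for the recording at hand.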
