Mistral's Voxtral TTS Beats ElevenLabs in Human Preference Tests

Voxtral TTS Challenges Proprietary TTS Leaders

Mistral AI has released Voxtral TTS, a 3-billion-parameter text-to-speech model with open weights that outperforms ElevenLabs Flash v2.5 in human preference tests, according to the company. The model achieved a 68.4% win rate in evaluations conducted by native speakers assessing naturalness and expressivity across nine languages.

The model runs on approximately 3GB of RAM, making it suitable for deployment on edge devices including smartphones, smartwatches, and laptops. It achieves a time-to-first-audio of 90 milliseconds for a 10-second sample (500 characters) and generates audio about six times faster than real time, rendering a 10-second clip in roughly 1.7 seconds.
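The speed figure is easy to check by hand: if generation runs six times faster than real time, a 10-second clip takes 10 / 6 seconds, a little under 1.7 seconds. The sketch below works through that arithmetic. The function name and the reading of "6x" as a speedup over real time are assumptions made for illustration, not part of Mistral's tooling.

```python
def synthesis_time_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock time to render a clip when generation runs
    `speedup` times faster than real time (hypothetical helper)."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


clip_seconds = 10.0  # clip length quoted in the article
speedup = 6.0        # "6x" figure, read here as a speedup over real time

render_seconds = synthesis_time_seconds(clip_seconds, speedup)
print(f"{render_seconds:.2f} s")  # 1.67 s
```

Note that the 90-millisecond time-to-first-audio is a separate measurement: it describes how soon playback can begin when audio is streamed, not how long the full clip takes to render.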

Technical Architecture

Voxtral TTS is built on the Ministral 3B architecture and uses a hybrid approach combining auto-regressive generation of semantic speech tokens with flow-matching for acoustic tokens. The model supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.

In additional benchmark scenarios that included ElevenLabs v3, Voxtral TTS proved competitive in explicit steering settings and consistently outperformed both ElevenLabs models (Flash v2.5 and v3) in implicit steering settings, according to Mistral's research paper. The human preference tests specifically assessed voice adaptation quality when using less than 5 seconds of reference audio, evaluating how well the model captured subtle accents, inflections, intonations, and irregularities in speech flow.

Open Source Advantage

Released under the Apache 2.0 license, Voxtral TTS represents Mistral's expansion into the voice AI space beyond its earlier transcription-focused models. The company positions the model as a low-cost alternative to proprietary TTS services from ElevenLabs, Deepgram, and OpenAI, targeting enterprise use cases including voice AI assistants, customer support, and sales applications.

Mistral's voice suite now includes both speech-to-text (transcription) and text-to-speech capabilities, with the company indicating plans for end-to-end multimodal platforms handling audio, text, and images.