Mistral AI's Voxtral TTS: A Multilingual Voice Revolution in 70 ms


Key Takeaways

1. Mistral AI has launched Voxtral TTS, a multilingual voice system that handles nine languages with a latency of 70 ms.

2. The model uses a decoder-only transformer and can clone voices in 3 to 10 seconds, although the quality varies.

3. Available through Mistral AI Studio and an API, Voxtral TTS costs $0.016 per 1,000 characters, with an open-weights version on Hugging Face.

💡 Why it matters: Voxtral TTS offers fast and adaptable speech synthesis, expanding the possibilities for natural voice interaction in multiple languages.

Voxtral TTS: A Breakthrough in Speech Synthesis

Mistral AI recently unveiled Voxtral TTS, a speech synthesis system that handles nine languages. The model adjusts tone and can clone a voice in 3 to 10 seconds. However, the quality of the generated voices may vary outside controlled demonstrations.

The system is based on a decoder-only transformer model, Ministral 3B. It first generates semantic tokens and then synthesizes speech from them, with a latency of about 70 ms. Voxtral TTS is also limited in duration: output longer than approximately two minutes is segmented.

Features and Performance

Voxtral TTS is designed to make the generated voices more natural and expressive. It supports the following languages:

French
English
German
Spanish
Italian
Portuguese
Dutch
Hindi
Arabic

The model can render different tones, such as neutral, enthusiastic, or serious, adjusting prosody and rhythm to avoid a monotonous reading.

The tool also offers a voice cloning feature. From a short audio sample, it can reproduce a timbre, an accent, and even a certain vocal "personality."

Quality and Limitations

Although the demonstrations of Voxtral TTS are convincing, the output can be uneven in practice. The generated voices retain a slight artificiality, even though accent and intonation are well reproduced. In Mistral AI's internal tests, native speakers preferred Voxtral TTS over ElevenLabs Flash v2.5, particularly for its naturalness and accent accuracy.

Technical Aspects

The Voxtral TTS model is specifically tailored for voice, first generating semantic speech tokens that describe the content and how to say it. A second module transforms these tokens into detailed audio signals.

One of the major strengths of Voxtral TTS is its speed: latency is low, and the model can generate speech up to ten times faster than real time. In practice, perceived latency will depend more on the network or audio player than on the model itself.
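To make the "ten times faster than real time" figure concrete, here is a minimal arithmetic sketch. The function name and the idea of treating the speedup as a constant factor are assumptions for illustration; real throughput will vary with hardware, load, and text.

```python
def synthesis_time_s(audio_seconds: float, speedup: float = 10.0) -> float:
    """Estimate compute time for a clip, assuming a constant speedup factor.

    speedup=10.0 reflects the article's "up to ten times faster than
    real time" and is an upper bound, not a guarantee.
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


if __name__ == "__main__":
    # A 30-second clip at the best-case factor of 10.
    print(f"{synthesis_time_s(30.0):.1f} s of compute for 30 s of audio")
```

Under this assumption, a 30-second clip needs about 3 seconds of compute, which leaves the network and the audio player as the more likely bottlenecks.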

However, quality may degrade beyond two minutes of continuous generation, so Mistral AI splits generation into blocks of 20 to 30 seconds that are assembled server-side to simulate a continuous stream.
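The block-based approach can be illustrated with a rough client-side sketch that splits text at sentence boundaries so each block stays under an estimated duration. This is not Mistral AI's server-side implementation; the function name, the sentence-splitting rule, and the speaking rate of 15 characters per second are all assumptions chosen for illustration.

```python
import re


def split_into_blocks(
    text: str,
    max_seconds: float = 30.0,
    chars_per_second: float = 15.0,  # assumed speaking rate, not a measured value
) -> list[str]:
    """Group sentences into blocks whose estimated duration fits max_seconds."""
    max_chars = int(max_seconds * chars_per_second)
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    blocks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the block.
            blocks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    sample = "One. Two. Three."
    print(split_into_blocks(sample, max_seconds=1, chars_per_second=10))
```

A single sentence longer than the budget is kept whole rather than cut mid-sentence, which trades strict duration limits for cleaner prosody at block boundaries.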

Accessibility

Voxtral TTS is available for testing in Mistral AI Studio and Le Chat, without requiring technical integration. For production use, an API is offered at $0.016 per 1,000 characters, and a version with open weights is available on Hugging Face for non-commercial use.
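For budgeting, the API price translates into a simple estimate. The sketch below assumes billing is strictly proportional to character count, with no rounding, minimum charge, or volume discount; the function name is illustrative and the actual billing rules may differ.

```python
PRICE_PER_1000_CHARS_USD = 0.016  # API price quoted in this article


def estimate_cost_usd(char_count: int) -> float:
    """Estimate API cost, assuming purely proportional per-character billing."""
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    return char_count / 1000 * PRICE_PER_1000_CHARS_USD


if __name__ == "__main__":
    # Example: a 250,000-character script.
    print(f"${estimate_cost_usd(250_000):.2f}")
```

Under that assumption, a 250,000-character script would cost about $4.00.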