Mistral Challenges ElevenLabs and OpenAI with Its Open-Source Voice Model

Key Takeaways

1. Mistral has unveiled Voxtral TTS, an open-source text-to-speech model aimed at voice assistants and customer support.

2. Voxtral TTS supports nine languages, including English, French, and Arabic, and runs on devices as small as smartwatches.

3. The model can build a custom voice from a sample of less than five seconds, capturing intonations and accents.

Why it matters: Mistral is positioning itself as a key player in voice technology, offering a flexible and cost-effective alternative to proprietary solutions.

Mistral Takes on the Voice Synthesis Market with an Open-Source Model

The French artificial intelligence company Mistral has announced the launch of an open-source voice synthesis model. Unveiled on Thursday, the model is intended for uses ranging from voice assistants to enterprise applications such as customer support. With this release, Mistral is going head-to-head with industry players such as ElevenLabs, Deepgram, and OpenAI.

The model, named Voxtral TTS, supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.

“Our clients wanted a voice synthesis model. So we developed a compact model that can be integrated into devices such as smartwatches, smartphones, and laptops. Its cost is significantly lower than existing solutions while offering top-notch performance,” explained Pierre Stock, Vice President of Scientific Operations at Mistral AI, in an interview with TechCrunch.

Advanced Features for Voice Customization

Mistral says the model can create a custom voice from a sample of less than five seconds and can reproduce vocal characteristics such as subtle accents, inflections, and intonations, as well as irregularities in speech flow. Built on the Ministral 3B model, Voxtral TTS can switch between languages without altering those vocal characteristics, which is particularly useful for applications such as dubbing or real-time translation. Stock stressed that the goal was to make the model sound as human as possible and avoid robotic output.

Designed to operate in real time, the model has a time-to-first-audio (TTFA) of 90 milliseconds for a 10-second sample containing 500 characters. With a real-time factor (RTF) of 6x, meaning it generates audio about six times faster than playback speed, it can produce a 10-second excerpt in about 1.6 seconds.
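The link between the speed-up factor and the quoted generation time is simple arithmetic. The short Python sketch below works it through, assuming the article's RTF means audio duration divided by generation time (some sources define RTF as the inverse ratio). The function names are illustrative and are not part of any Mistral API.

```python
def generation_time_s(audio_duration_s: float, speedup: float) -> float:
    """Seconds needed to synthesize a clip, given a speed-up over real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_duration_s / speedup


def implied_speedup(audio_duration_s: float, generation_s: float) -> float:
    """Speed-up over real time implied by a measured generation time."""
    if generation_s <= 0:
        raise ValueError("generation_s must be positive")
    return audio_duration_s / generation_s


clip_s = 10.0  # clip length quoted in the article

# Generation time for a 10-second clip at exactly 6x real time.
print(round(generation_time_s(clip_s, 6.0), 2))

# Speed-up implied by the quoted 1.6-second generation time.
print(round(implied_speedup(clip_s, 1.6), 2))
```

At exactly 6x, a 10-second clip takes about 1.67 seconds to generate. The quoted 1.6 seconds corresponds to a speed-up of about 6.25x, so the two reported figures agree if the 6x is taken as a rounded value.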

Towards a Comprehensive Suite of Voice Products

Earlier this year, Mistral launched two transcription models: one for processing large amounts of data and the other for real-time applications requiring low latency. With this new voice synthesis model, the company appears to be building a complete range of voice products for businesses.

“We are looking to develop an integrated platform capable of handling multimodal input streams, including audio, text, and images, while producing outputs. The main advantage is that you get much more information with an end-to-end agentic system that supports audio input or output,” said Stock.

Mistral is betting that the model's open-source nature and customization options will attract businesses that want to tailor voice models to their specific needs, and will lead them to choose Mistral over its competitors.
