Mistral Challenges ElevenLabs and OpenAI with Its Open-Source Voice Model
Mistral Takes on the Voice Synthesis Market with an Open-Source Model
French artificial intelligence company Mistral has announced an open-source voice synthesis (text-to-speech) model. Unveiled on Thursday, the model targets uses ranging from voice assistants to enterprise applications such as customer support. With it, Mistral competes directly with established players such as ElevenLabs, Deepgram, and OpenAI.
The model, named Voxtral TTS, supports nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.
“Our clients wanted a voice synthesis model. So we developed a compact model that can be integrated into devices such as smartwatches, smartphones, and laptops. Its cost is significantly lower than existing solutions while offering top-notch performance,” explained Pierre Stock, Vice President of Scientific Operations at Mistral AI, in an interview with TechCrunch.
Advanced Features for Voice Customization
Mistral said the new model can create a personalized voice from an audio sample shorter than five seconds. It can also reproduce vocal characteristics such as subtle accents, inflections, and intonations, as well as irregularities in speech flow. Built on the Ministral 3B model, Voxtral TTS can switch from one language to another without altering those vocal characteristics, which is particularly useful for applications like dubbing or real-time translation. Stock stressed that the goal was to make the model sound as human as possible and avoid robotic output.
Designed for real-time use, the model has a time-to-first-audio (TTFA) of 90 milliseconds for a 10-second sample containing 500 characters. Its quoted real-time factor (RTF) of 6x means it generates audio about six times faster than playback speed, so a 10-second clip takes roughly 1.7 seconds (10 s / 6) to render.
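To make the arithmetic concrete, here is a minimal Python sketch. It assumes the quoted 6x figure is a speed-up over real-time playback, so render time is simply audio duration divided by the speed-up. The function name render_time_seconds is hypothetical and is not part of any Mistral API.

```python
def render_time_seconds(audio_seconds: float, speedup: float) -> float:
    """Seconds needed to synthesize a clip when generation runs
    `speedup` times faster than real-time playback (assumed meaning of "6x")."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


if __name__ == "__main__":
    # Figures quoted in the article: a 10-second clip at a 6x speed-up.
    print(round(render_time_seconds(10.0, 6.0), 2))  # 1.67
```

Under this reading, any speed-up above 1 means the model produces audio faster than a listener consumes it, which is what allows playback to begin shortly after the first audio arrives.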
Towards a Comprehensive Suite of Voice Products
Earlier this year, Mistral launched two transcription models: one for processing large volumes of data and another for real-time applications that require low latency. With this new voice synthesis model, the company appears to be building a complete range of voice products for businesses.
“We are looking to develop an integrated platform capable of handling multimodal input streams, including audio, text, and images, while producing outputs. The main advantage is that you get much more information with an end-to-end agentic system that supports audio input or output,” said Stock.
Mistral is betting that open-source availability and customization will attract businesses: companies can tailor the voice model to their specific needs, which could lead them to choose Mistral over its competitors.