ByteDance Challenges Qwen2.5 with Its Innovative iLLaDA Model

ByteDance's iLLaDA language model is a diffusion model that competes with Qwen2.5.
Researchers from Renmin University and ByteDance have released iLLaDA, an 8-billion-parameter language model that generates text differently from ChatGPT. Its base version matches Qwen2.5, but its instruction-tuned version falls behind.
Almost all well-known AI language models, such as GPT, Claude, or Qwen, generate text autoregressively: token by token, from left to right, with each new token depending only on those that precede it.
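To make the left-to-right dependency concrete, here is a minimal Python sketch. The `BIGRAMS` table is a hypothetical stand-in for a trained model, not anything taken from GPT, Claude, or Qwen; it only illustrates that each step looks at what has already been generated.

```python
# Hypothetical lookup table standing in for a trained language model.
# A real model would score every vocabulary entry given the full prefix.
BIGRAMS = {
    "<bos>": "the",
    "the": "model",
    "model": "writes",
    "writes": "text",
    "text": "<eos>",
}


def generate_autoregressive(max_tokens=10):
    """Generate one token at a time, strictly left to right."""
    tokens = ["<bos>"]
    for _ in range(max_tokens):
        # The next token depends only on tokens that are already fixed.
        next_token = BIGRAMS[tokens[-1]]
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker


if __name__ == "__main__":
    print(generate_autoregressive())  # ['the', 'model', 'writes', 'text']
```

The loop needs one step per generated token, which is the sequential cost that diffusion-style generation tries to reduce.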
Diffusion language models take a different approach. They start with a sequence of masked tokens and refine them through multiple parallel passes. This is similar to how image models shape an image from noise. Each position can attend to all other positions simultaneously, making the process bidirectional.
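The refinement loop can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not iLLaDA's actual sampler: `toy_denoiser`, the `TARGET` sentence, and the confidence numbers are all hypothetical, and the rule of revealing the most confident positions on each pass is one common choice for masked diffusion decoding rather than a detail confirmed by the article.

```python
MASK = "<mask>"

# Hypothetical data: the sentence the toy "model" has memorised and a
# made-up base confidence (in percent) for each position.
TARGET = ["diffusion", "models", "fill", "in", "masked", "tokens"]
BASE_CONFIDENCE = [40, 90, 30, 80, 60, 50]


def toy_denoiser(sequence):
    """Predict (token, confidence) for every masked position in one pass.

    A real model would compute these with bidirectional attention over the
    whole sequence. Here, confidence simply rises by 20 for each neighbour
    (left or right) that has already been revealed.
    """
    predictions = {}
    for i, token in enumerate(sequence):
        if token != MASK:
            continue
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(sequence)]
        revealed = sum(1 for j in neighbours if sequence[j] != MASK)
        predictions[i] = (TARGET[i], BASE_CONFIDENCE[i] + 20 * revealed)
    return predictions


def generate_by_unmasking(tokens_per_pass=2):
    """Start fully masked and reveal the most confident tokens each pass."""
    sequence = [MASK] * len(TARGET)
    reveal_order = []
    while MASK in sequence:
        predictions = toy_denoiser(sequence)  # all positions scored at once
        best = sorted(predictions, key=lambda i: predictions[i][1],
                      reverse=True)[:tokens_per_pass]
        for i in best:
            sequence[i] = predictions[i][0]
        reveal_order.append(best)
    return sequence, reveal_order


if __name__ == "__main__":
    final, order = generate_by_unmasking()
    print(final)
    print(order)  # positions revealed per pass, not left to right
```

With two tokens revealed per pass, the six-token sequence is complete after three passes instead of six sequential steps, and positions are filled in out of order because both left and right context influence each prediction.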
iLLaDA is part of a broader movement that includes Google. In June 2026, Google DeepMind launched DiffusionGemma. This model generates text about four times faster via diffusion but scores lower on benchmarks such as MMLU and on coding tasks than the similarly sized autoregressive model Gemma 4. Google recommends it for low-latency use cases, not for quality-critical production use.
The two models follow opposite strategies. DiffusionGemma builds on the Gemma 4 architecture, a mixture-of-experts model with 25 billion parameters, and changes only the generation method to prioritize speed. iLLaDA, short for "improved LLaDA," is a dense model with 8 billion parameters, trained from scratch and focused on quality.
This raises the question of whether a diffusion model built from the ground up can truly compete with autoregressive models. A direct numerical comparison between iLLaDA and DiffusionGemma is difficult, however: Google uses benchmark variants that are partly different and harder, and DiffusionGemma operates in a different weight class.
What iLLaDA Can Do
The team pre-trained iLLaDA on 12 trillion tokens, compared to 2.3 trillion for its predecessor LLaDA, and fine-tuned it for twelve epochs. According to the paper, iLLaDA-Base significantly improves upon LLaDA, for example with a jump of 21.6 points on the BBH reasoning test. Its average score is 63.9 points, slightly ahead of Qwen2.5 7B at 63.3.
The comparison with the competing diffusion model Dream 7B also favors iLLaDA. Dream was not trained from scratch but was fine-tuned from an existing checkpoint of Qwen2.5. iLLaDA still outperforms Dream on average, 63.9 versus 61.4, even without the advantage of a solid autoregressive base. Dream only holds a slight edge on coding benchmarks.
A gap remains after instruction tuning. iLLaDA-Instruct scores 67.1 points while Qwen2.5 7B Instruct reaches 77.1, with mathematics and code accounting for most of the difference. The authors attribute this to additional reinforcement learning alignment in Qwen2.5, which iLLaDA lacks. In the paper's appendix, they also note that the model can get stuck in reasoning loops on more challenging tasks.