ByteDance Challenges Qwen2.5 with Its Innovative iLLaDA Model

Key Takeaways

1. ByteDance and Renmin University have developed iLLaDA, a language model with 8 billion parameters.

2. iLLaDA generates text by diffusion rather than autoregressively, as ChatGPT does, and is measured against Qwen2.5.

3. iLLaDA is competitive with Qwen2.5 at the base level but trails it after fine-tuning.

💡 Why it matters: ByteDance is looking to strengthen its position in the field of language models against established competitors like Qwen2.5.

ByteDance's iLLaDA language model is a diffusion model that competes with Qwen2.5.

Researchers from Renmin University and ByteDance have released iLLaDA, an 8-billion-parameter language model that generates text by diffusion rather than token by token, as ChatGPT does. It matches Qwen2.5 at the base level but falls behind after fine-tuning.

Almost all well-known AI language models, such as GPT, Claude, or Qwen, generate text autoregressively: token by token, from left to right, with each new token depending only on those that precede it.

Diffusion language models take a different approach. They start with a sequence of masked tokens and refine them through multiple parallel passes. This is similar to how image models shape an image from noise. Each position can attend to all other positions simultaneously, making the process bidirectional.
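The contrast between the two decoding styles can be sketched in a few lines of Python. This is a toy illustration only, not iLLaDA's actual sampler: `toy_denoiser` is a hypothetical stand-in for a trained model that simply reads the answer from a known target and attaches a random confidence, and the rule of keeping the most confident predictions on each pass is an assumption made for the example.

```python
import random

MASK = "<mask>"


def toy_denoiser(tokens, target, rng):
    """Hypothetical stand-in for a trained model.

    Returns {position: (predicted_token, confidence)} for every masked
    position. A real model would compute these from the whole sequence at
    once; here the token is read from `target` and the confidence is random.
    """
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def diffusion_decode(target, steps=3, seed=0):
    """Start fully masked and fill in several positions per pass."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    per_step = -(-len(target) // steps)  # ceiling division
    history = [list(tokens)]
    for _ in range(steps):
        predictions = toy_denoiser(tokens, target, rng)
        if not predictions:
            break  # nothing left to unmask
        # Keep only the most confident predictions; the rest stay masked.
        best = sorted(
            predictions, key=lambda i: predictions[i][1], reverse=True
        )[:per_step]
        for i in best:
            tokens[i] = predictions[i][0]
        history.append(list(tokens))
    return tokens, history


def autoregressive_decode(target):
    """Emit one token per step, strictly left to right."""
    tokens = []
    for tok in target:
        tokens.append(tok)
    return tokens


if __name__ == "__main__":
    sentence = "diffusion models fill in masked tokens".split()
    _, passes = diffusion_decode(sentence, steps=3)
    for state in passes:
        print(" ".join(state))
```

With six tokens and three passes, the diffusion loop needs three model calls where the left-to-right loop needs six, and positions are filled in any order rather than from left to right.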

iLLaDA is part of a broader movement that includes Google. In June 2026, Google DeepMind launched DiffusionGemma. This model generates text about four times faster via diffusion but scores lower on benchmarks such as MMLU and on coding tasks than the similarly sized autoregressive model, Gemma 4. Google recommends it for low-latency use cases, not for quality-critical production work.

DiffusionGemma and iLLaDA take opposite paths. DiffusionGemma is built on the Gemma 4 architecture, a mixture-of-experts model with 25 billion parameters, and changes only the generation method in order to prioritize speed. iLLaDA, short for "improved LLaDA," is a dense model of 8 billion parameters trained from scratch and focused on quality.

The question arises whether a diffusion model built from the ground up can truly compete with autoregressive models. However, a direct numerical comparison between the two is challenging. Google uses partially different and more difficult benchmark variants, and DiffusionGemma operates in a different weight class.

What iLLaDA Can Do

The team pre-trained iLLaDA on 12 trillion tokens, compared to 2.3 trillion for its predecessor LLaDA, and fine-tuned it for twelve epochs. According to the paper, iLLaDA-Base significantly improves upon LLaDA, with a jump of 21.6 points on the BBH reasoning test, for example. On average, it scores 63.9 points, slightly surpassing Qwen2.5 7B, which scores 63.3.

Mathematics and Sciences

The comparison with the competing diffusion model Dream 7B also favors iLLaDA. Dream was not trained from scratch but was fine-tuned from an existing checkpoint of Qwen2.5. iLLaDA still outperforms Dream on average, 63.9 versus 61.4, even without the advantage of a solid autoregressive base. Dream only holds a slight edge on coding benchmarks.

A gap remains after instruction tuning. iLLaDA-Instruct scores 67.1 points while Qwen2.5 7B Instruct reaches 77.1, with mathematics and code accounting for most of the difference. The authors attribute this to additional reinforcement learning alignment in Qwen2.5, which iLLaDA lacks. In the paper's appendix, they also note that the model can get stuck in reasoning loops on more challenging tasks.
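To put the averages quoted in this section side by side, here is a short Python sketch that computes the gaps. The figures are taken directly from the text above; the dictionary keys and the `gap` helper are names chosen for this illustration only.

```python
# Average scores as quoted in the text above.
scores = {
    "iLLaDA-Base": 63.9,
    "Qwen2.5 7B": 63.3,
    "Dream 7B": 61.4,
    "iLLaDA-Instruct": 67.1,
    "Qwen2.5 7B Instruct": 77.1,
}

# Pre-training data in trillions of tokens, as quoted in the text above.
pretraining_tokens = {"iLLaDA": 12.0, "LLaDA": 2.3}


def gap(a, b):
    """Difference in average score between models a and b, in points."""
    return round(scores[a] - scores[b], 1)


if __name__ == "__main__":
    print("Base vs Qwen2.5 7B:", gap("iLLaDA-Base", "Qwen2.5 7B"))
    print("Base vs Dream 7B:", gap("iLLaDA-Base", "Dream 7B"))
    print("Instruct vs Qwen2.5 7B Instruct:",
          gap("iLLaDA-Instruct", "Qwen2.5 7B Instruct"))
    ratio = pretraining_tokens["iLLaDA"] / pretraining_tokens["LLaDA"]
    print("Pre-training data ratio:", round(ratio, 1))
```

The base model leads Qwen2.5 7B by 0.6 points and Dream 7B by 2.5 points, while the instruct model trails Qwen2.5 7B Instruct by 10 points. iLLaDA also saw roughly 5.2 times as much pre-training data as LLaDA.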
