Alibaba and Tsinghua Innovate with HopChain for Visual AI
Vision-language models, which combine image and text understanding, often struggle with multi-step reasoning. An error made early in the process, such as miscounting objects or confusing spatial relationships, propagates through every subsequent step and leads to an incorrect final answer.
To address this, Alibaba's Qwen team, in collaboration with Tsinghua University, has developed a framework named HopChain. The system automatically generates multi-step image questions that force a model to re-examine the image at each step, so that accumulated errors are exposed rather than carried forward. The goal is to improve the accuracy of vision-language models by exposing and correcting fundamental weaknesses in their visual understanding.
A Four-Step Data Generation Process
Data creation for HopChain unfolds in four steps. First, Alibaba's Qwen3-VL-235B-A22B-Thinking vision-language model identifies the categories of objects present in an image. Next, Meta's SAM3 segmentation model locates the individual instances of these categories, so that each object is correctly identified and segmented within the image.
In the third step, multi-step image questions are constructed around combinations of three to six objects. These questions test a model's ability to reason about an image over several steps. Finally, four human annotators independently answer each question, and only questions on which they reach consensus are retained. The process produces between 60,000 and 80,000 training examples per model.
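The third and fourth steps can be sketched in a few lines of Python. This is an illustration only, not the researchers' code: the function names `object_combinations` and `keep_by_consensus` are hypothetical, the instance labels are made up, and "consensus" is assumed here to mean that all four annotators give the same answer after trimming whitespace and case.

```python
from collections import Counter
from itertools import combinations


def object_combinations(instances, min_size=3, max_size=6):
    """Yield every group of min_size..max_size segmented object instances."""
    upper = min(max_size, len(instances))
    for size in range(min_size, upper + 1):
        yield from combinations(instances, size)


def keep_by_consensus(answers, required=4):
    """Return the agreed answer if `required` annotators gave it, else None.

    Assumption: consensus means unanimous agreement among four annotators.
    """
    if not answers:
        return None
    normalized = [a.strip().lower() for a in answers]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer if votes >= required else None


# Hypothetical instance labels, as a segmentation model might produce them.
instances = ["cup_1", "cup_2", "plate_1", "fork_1"]
groups = list(object_combinations(instances))

unanimous = keep_by_consensus(["3", "3", "3 ", "3"])   # question is kept
disputed = keep_by_consensus(["3", "3", "2", "3"])     # question is dropped
```

With four instances, only groups of three or four objects are possible, so `groups` holds five combinations. A question with a single dissenting annotator returns `None` and would be discarded under the unanimity assumption.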
Promising Results with HopChain
Researchers trained two models, Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, on the questions generated by HopChain. Performance was measured across 24 benchmarks covering areas such as general image understanding, text recognition, and video comprehension. HopChain data improved 20 out of the 24 benchmarks.
For example, the EMMA score of the smaller model increased from 53 to 58, while its CharXiv score rose from 69 to 73.1. The BabyVision score of the larger model rose from 28.61 to 32.22, and its ZeroBench score doubled, going from 4 to 8. These results suggest that the generated questions are not tailored to a specific benchmark and instead support genuine generalization.
Although the training data consists entirely of images, both models also improved on five out of six video benchmarks, suggesting that the skills taught by HopChain transfer beyond static images.
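To put these figures on a common footing, the short script below computes the absolute gain in points and the relative gain in percent for each score quoted above. The scores come from the article; the `gain` helper and the rounding choices are assumptions made for illustration.

```python
# (before, after) scores as quoted in the article.
scores = {
    "EMMA (smaller model)": (53.0, 58.0),
    "CharXiv (smaller model)": (69.0, 73.1),
    "BabyVision (larger model)": (28.61, 32.22),
    "ZeroBench (larger model)": (4.0, 8.0),
}


def gain(before, after):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 1)


gains = {name: gain(before, after) for name, (before, after) in scores.items()}

for name, (points, percent) in gains.items():
    print(f"{name}: +{points} points ({percent}% relative)")
```

The comparison shows why the two measures should be read together: ZeroBench doubles in relative terms, yet its absolute gain of 4 points is smaller than the 5-point gain on EMMA.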
The Importance of Complete Question Chaining
An ablation study showed that complete question chaining is crucial for improving model performance. When questions are reduced to their final step, the average score across five representative benchmarks drops from 70.4 to 64.3. Retaining only the second half of the chain yields 66.7, still below the full-chain score.
Improvements also grow with the length of the reasoning chain. For particularly long responses, the larger model saw its accuracy increase by over 50 points. The distribution of errors confirms that HopChain helps in all areas: perception, logic, knowledge, and hallucination errors all see comparable gains.
One limitation remains: the process relies on SAM3 to recognize objects, which excludes images without segmentable objects. Visual perception itself remains a challenge for current models, as illustrated by the WorldVQA benchmark from Moonshot AI, where even the best models failed to correctly identify half of the objects. This underscores the importance of precise segmentation for the success of vision-language models.
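The three ablation settings can be pictured as different slices of the same chain. The sketch below assumes a question is stored as an ordered list of hops, which the article does not specify; `truncate_chain`, the mode names, the example hops, and the choice to split an odd-length chain at `len(hops) // 2` are all hypothetical.

```python
def truncate_chain(hops, mode):
    """Return the part of a multi-step question kept under each ablation."""
    if mode == "full":
        return list(hops)
    if mode == "second_half":
        # Assumption: for odd lengths the middle hop goes to the second half.
        return list(hops[len(hops) // 2:])
    if mode == "final_only":
        return list(hops[-1:])
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical four-hop question about one image.
hops = [
    "Find the red cup.",
    "Find the plate to its left.",
    "Count the forks on that plate.",
    "Is that count greater than the number of cups?",
]

second_half = truncate_chain(hops, "second_half")
final_only = truncate_chain(hops, "final_only")

# Average scores over five benchmarks, as reported in the article.
full_score, half_score, final_score = 70.4, 66.7, 64.3
drop_half = round(full_score - half_score, 1)    # points lost with half the chain
drop_final = round(full_score - final_score, 1)  # points lost with the last hop only
```

Under these reported averages, keeping only the second half costs 3.7 points and keeping only the final step costs 6.1 points, so each portion of the chain that is removed lowers the score further.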