Alibaba and Tsinghua Innovate with HopChain for Visual AI

Key Takeaways

1. Vision-language models often struggle with complex tasks requiring multiple reasoning steps.

2. HopChain, developed by Alibaba and Tsinghua University, generates questions to enhance the accuracy of AI models.

3. With HopChain, 20 out of 24 benchmarks showed significant improvements, demonstrating better generalization of the models.

💡 Why it matters: This advance could change how effectively AI models analyze complex images, with effects across many tech sectors.

Vision-language models, which combine image and text understanding, often struggle when tasked with multi-step reasoning. These models can make errors early in the process, such as miscounting objects or confusing spatial relationships, leading to incorrect results that propagate through all subsequent steps.
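A rough way to see why early errors are so costly is to assume, purely for illustration, that every reasoning step succeeds independently with the same probability. Neither the independence assumption nor the 95% figure comes from the HopChain work; the sketch below only shows how quickly per-step errors compound.

```python
def chain_accuracy(step_accuracy: float, num_steps: int) -> float:
    """Probability that every step in a chain is correct.

    Simplifying assumption (not from the article): steps are independent
    and share the same per-step accuracy.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_accuracy ** num_steps


# Illustrative per-step accuracy; 0.95 is a made-up value.
for steps in (1, 3, 6):
    print(f"{steps} step(s): {chain_accuracy(0.95, steps):.3f}")
```

Under these assumptions, a model that is 95% reliable at each step completes a six-step chain correctly only about 74% of the time.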

To overcome these challenges, Alibaba's Qwen team, in collaboration with Tsinghua University, has developed a framework named HopChain. The system automatically generates multi-step image questions that force models to re-examine the image carefully at each step, targeting the errors that accumulate over a chain of reasoning. This approach aims to enhance the accuracy of vision-language models by exposing and correcting fundamental weaknesses in their visual understanding.

A Four-Step Data Generation Process

The data creation for HopChain unfolds in four distinct steps. First, Alibaba's Qwen3-VL-235B-A22B-Thinking vision-language model identifies the categories of objects present in an image. Next, Meta's SAM3 segmentation model locates the individual instances of these categories, so that each object is correctly identified and segmented before any questions are written.

In the third step, multi-step image questions are constructed around combinations of three to six objects. These questions are designed to test the models' ability to reason about images in multiple steps. Finally, four human annotators independently answer each question, and only the questions on which they reach consensus are kept. This process produces between 60,000 and 80,000 training examples per model, ensuring high diversity and quality in the training data.
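The last two steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, object instances are represented as plain strings rather than SAM3 masks, and "consensus" is assumed here to mean that all annotators give the same answer after trivial normalization, which the article does not specify.

```python
from itertools import combinations
from typing import Iterator, Sequence


def object_combinations(
    instances: Sequence[str], min_size: int = 3, max_size: int = 6
) -> Iterator[tuple[str, ...]]:
    """Yield every group of min_size..max_size object instances.

    Each group is a candidate set of objects for one multi-step question.
    """
    for size in range(min_size, max_size + 1):
        yield from combinations(instances, size)


def has_consensus(answers: Sequence[str]) -> bool:
    """Assumed rule: every annotator gave the same (normalized) answer."""
    normalized = {answer.strip().lower() for answer in answers}
    return len(answers) > 0 and len(normalized) == 1


def filter_by_consensus(candidates: dict[str, list[str]]) -> list[str]:
    """Keep only the questions whose annotator answers all agree."""
    return [q for q, answers in candidates.items() if has_consensus(answers)]


# Hypothetical annotations: question text mapped to four independent answers.
annotations = {
    "How many cups are left of the red book?": ["2", "2", "2 ", "2"],
    "Which object is closest to the lamp?": ["vase", "mug", "vase", "vase"],
}
print(filter_by_consensus(annotations))
```

In this example only the first question survives, because one annotator disagreed on the second. A real pipeline would likely need a more careful answer-matching rule than lowercase string equality.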

Promising Results with HopChain

Researchers tested two models, Qwen3.5-35B-A3B and Qwen3.5-397B-A17B, using the questions generated by HopChain. Performance was measured across 24 benchmarks covering areas such as general image understanding, text recognition, and video comprehension. Training on HopChain data improved scores on 20 of the 24 benchmarks.

For example, the EMMA score of the smaller model increased from 53 to 58, while the CharXiv score rose from 69 to 73.1. The BabyVision score of the larger model progressed from 28.61 to 32.22, and the ZeroBench score doubled, going from 4 to 8. These results demonstrate that the generated questions are not tailored to a specific benchmark but allow for genuine generalization of the models.
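To put these figures side by side, the short script below computes the absolute and relative gain for each reported score. The numbers are taken directly from the paragraph above; the helper name `score_gain` is only illustrative.

```python
# (before, after) scores as reported in the article.
REPORTED_SCORES = {
    ("Qwen3.5-35B-A3B", "EMMA"): (53.0, 58.0),
    ("Qwen3.5-35B-A3B", "CharXiv"): (69.0, 73.1),
    ("Qwen3.5-397B-A17B", "BabyVision"): (28.61, 32.22),
    ("Qwen3.5-397B-A17B", "ZeroBench"): (4.0, 8.0),
}


def score_gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative_pct = absolute / before * 100.0
    return absolute, relative_pct


for (model, benchmark), (before, after) in REPORTED_SCORES.items():
    absolute, relative_pct = score_gain(before, after)
    print(
        f"{model:<18} {benchmark:<11} "
        f"+{absolute:.2f} points ({relative_pct:.1f}% relative)"
    )
```

The comparison shows why both views matter: the ZeroBench score doubles in relative terms, yet the absolute gain is four points on a benchmark where scores remain in the single digits.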

Although the training data is entirely based on images, both models also improved on five out of six video benchmarks, suggesting that the skills taught by HopChain transfer beyond static images. This indicates the models' ability to apply acquired skills to different contexts, thereby broadening their applicability.

The Importance of Complete Question Chaining

An ablation study revealed that complete question chaining is crucial for improving model performance. When questions are reduced to their final step, the average score across five representative benchmarks drops from 70.4 to 64.3. When only the second half of the chain is retained, the score reaches 66.7.
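The ablation variants can be pictured as different truncations of the same question chain. The sketch below is an assumption about how such truncation might look, with a chain modeled as a list of sub-question strings and the halfway point rounded down; the article does not describe the authors' exact procedure. The scores are those reported above.

```python
def final_step_only(hops: list[str]) -> list[str]:
    """Keep only the last sub-question of the chain."""
    return hops[-1:]


def second_half(hops: list[str]) -> list[str]:
    """Keep the later half of the chain (split point is an assumption)."""
    return hops[len(hops) // 2:]


# Average scores across five benchmarks, as reported in the article.
ABLATION_SCORES = {
    "full chain": 70.4,
    "second half only": 66.7,
    "final step only": 64.3,
}


def drop_from_full(variant: str) -> float:
    """Points lost relative to training on the full chain."""
    return round(ABLATION_SCORES["full chain"] - ABLATION_SCORES[variant], 1)


# Hypothetical four-hop question used only to illustrate the truncation.
chain = [
    "Find the red book.",
    "Which cup is nearest to it?",
    "What is printed on that cup?",
    "Is the same text visible on the poster?",
]
print(final_step_only(chain))
print(second_half(chain))
for variant in ("second half only", "final step only"):
    print(f"{variant}: -{drop_from_full(variant)} points")
```

Read this way, dropping the first half of the chain costs 3.7 points and keeping only the final step costs 6.1, which is consistent with each additional retained hop contributing to the result.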

Improvements also grow with the length of the reasoning chain: for particularly long responses, the larger model's accuracy increased by more than 50 points. The distribution of errors indicates that HopChain helps across the board, with perception, logic, knowledge, and hallucination errors all seeing comparable gains.

However, a limitation remains: the process requires SAM3 to recognize objects, excluding images without segmentable objects. Visual perception remains a challenge for current models, as illustrated by the WorldVQA benchmark from Moonshot AI, where even the best models failed to correctly identify half of the objects. This limitation underscores the importance of precise segmentation for the success of vision-language models.

Brief IA: AI news in French