GI-Bench: A Panoramic Benchmark Revealing the Knowledge-Experience Dissociation of Multimodal Large Language Models in Gastrointestinal Endoscopy Against Clinical Standards
- Yan Zhu,
- Tengfei Luo,
- Pei-yao Fu,
- Zhen Zhang,
- Zilong Wang,
- Yi-Fan Qu,
- Zifan Geng,
- Jia-qi Xu,
- L. Yao,
- Li-yun Ma,
- Wei Su,
- Wei-Feng Chen,
- Quan-Lin Li,
- Shuo Wang,
- P. Zhou
ArXiv, abs/2601.08183
Multimodal Large Language Models (MLLMs) show promise in gastroenterology, yet their performance against comprehensive clinical workflows and human benchmarks remains unverified. We aimed to systematically evaluate state-of-the-art MLLMs across a panoramic gastrointestinal endoscopy workflow and to determine their clinical utility compared with human endoscopists. We constructed GI-Bench, a benchmark encompassing 20 fine-grained lesion categories. Twelve MLLMs were evaluated across a five-stage clinical workflow: anatomical localization, lesion identification, diagnosis, findings description, and management. Model performance was benchmarked against three junior endoscopists and three residency trainees using Macro-F1, mean Intersection-over-Union (mIoU), and multi-dimensional Likert scales. Gemini-3-Pro achieved state-of-the-art performance. In diagnostic reasoning, top-tier models (Macro-F1 0.641) outperformed trainees (0.492) and rivaled junior endoscopists (0.727; p>0.05). However, a critical "spatial grounding bottleneck" persisted: human lesion localization (mIoU>0.506) significantly outperformed the best model (0.345; p<0.05). Furthermore, qualitative analysis revealed a "fluency-accuracy paradox": models generated reports with superior linguistic readability compared with humans (p<0.05) but exhibited significantly lower factual correctness (p<0.05) due to "over-interpretation" and hallucination of visual features. GI-Bench maintains a dynamic leaderboard that tracks the evolving performance of MLLMs in clinical endoscopy. The current rankings and benchmark results are available at https://roterdl.github.io/GIBench/.
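As a point of reference for the two quantitative metrics named in the abstract, the sketch below shows one plausible way Macro-F1 (per-class F1 averaged over lesion categories, used for diagnostic reasoning) and mIoU (mean Intersection-over-Union of predicted versus ground-truth bounding boxes, used for lesion localization) could be computed. This is not the authors' evaluation code; the function names, box format, and toy data are illustrative assumptions only.

```python
# Illustrative sketch of Macro-F1 and mIoU; not the GI-Bench evaluation code.
from typing import List, Tuple

def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Average per-class F1 over all classes present in the ground truth."""
    classes = sorted(set(y_true))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if (precision + recall) else 0.0)
    return sum(f1s) / len(f1s)

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) format

def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned bounding boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def mean_iou(gt_boxes: List[Box], pred_boxes: List[Box]) -> float:
    """mIoU over paired ground-truth / predicted boxes (one lesion per image)."""
    return sum(iou(g, p) for g, p in zip(gt_boxes, pred_boxes)) / len(gt_boxes)

if __name__ == "__main__":
    # Hypothetical toy example, not benchmark data.
    print(macro_f1(["polyp", "ulcer", "polyp"], ["polyp", "polyp", "polyp"]))
    print(mean_iou([(10, 10, 50, 50)], [(20, 20, 60, 60)]))
```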