GI-Bench: A Panoramic Benchmark Revealing the Knowledge-Experience Dissociation of Multimodal Large Language Models in Gastrointestinal Endoscopy Against Clinical Standards
- Yan Zhu,
- Tengfei Luo,
- Pei-yao Fu,
- Zhen Zhang,
- Zilong Wang,
- Yi-Fan Qu,
- Zifan Geng,
- Jia-qi Xu,
- L. Yao,
- Li-yun Ma,
- Wei Su,
- Wei-Feng Chen,
- Quan-Lin Li,
- Shuo Wang,
- P. Zhou
ArXiv, abs/2601.08183
Multimodal Large Language Models (MLLMs) show promise in gastroenterology, yet their performance against comprehensive clinical workflows and human benchmarks remains unverified. We aimed to systematically evaluate state-of-the-art MLLMs across a panoramic gastrointestinal endoscopy workflow and to determine their clinical utility compared with human endoscopists. We constructed GI-Bench, a benchmark encompassing 20 fine-grained lesion categories. Twelve MLLMs were evaluated across a five-stage clinical workflow: anatomical localization, lesion identification, diagnosis, findings description, and management. Model performance was benchmarked against three junior endoscopists and three residency trainees using Macro-F1, mean Intersection-over-Union (mIoU), and multi-dimensional Likert scales. Gemini-3-Pro achieved state-of-the-art performance. In diagnostic reasoning, top-tier models (Macro-F1 0.641) outperformed trainees (0.492) and rivaled junior endoscopists (0.727; p>0.05). However, a critical "spatial grounding bottleneck" persisted: human lesion localization (mIoU>0.506) significantly outperformed the best model (0.345; p<0.05). Furthermore, qualitative analysis revealed a "fluency-accuracy paradox": models generated reports with superior linguistic readability compared with humans (p<0.05) but exhibited significantly lower factual correctness (p<0.05) due to "over-interpretation" and hallucination of visual features. GI-Bench maintains a dynamic leaderboard that tracks the evolving performance of MLLMs in clinical endoscopy. The current rankings and benchmark results are available at https://roterdl.github.io/GIBench/.
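As a point of reference for the two quantitative metrics named in the abstract, the sketch below shows one plausible way Macro-F1 (per-class F1 averaged over lesion categories, used for diagnostic reasoning) and mIoU (mean Intersection-over-Union of predicted versus ground-truth bounding boxes, used for lesion localization) could be computed. This is not the authors' evaluation code; the function names, box format, and toy data are illustrative assumptions only.

```python
# Illustrative sketch of Macro-F1 and mIoU; not the GI-Bench evaluation code.
from typing import List, Tuple

def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Average per-class F1 over all classes present in the ground truth."""
    classes = sorted(set(y_true))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if (precision + recall) else 0.0)
    return sum(f1s) / len(f1s)

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) format

def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned bounding boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def mean_iou(gt_boxes: List[Box], pred_boxes: List[Box]) -> float:
    """mIoU over paired ground-truth / predicted boxes (one lesion per image)."""
    return sum(iou(g, p) for g, p in zip(gt_boxes, pred_boxes)) / len(gt_boxes)

if __name__ == "__main__":
    # Hypothetical toy example, not benchmark data.
    print(macro_f1(["polyp", "ulcer", "polyp"], ["polyp", "polyp", "polyp"]))
    print(mean_iou([(10, 10, 50, 50)], [(20, 20, 60, 60)]))
```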