Can Multimodal Large Language Models Understand Graphic Design? A Comparative Study

IEEE Transactions on Multimedia

Graphic design evaluation is inherently subjective and multidimensional, posing significant challenges for reliable and systematic assessment. To address this, we propose a novel evaluation framework structured across three hierarchical levels: recognition, semantic, and overall. This framework facilitates structured and automated evaluation, providing a unified foundation for assessing design quality with Multimodal Large Language Models (MLLMs). Building on this framework, we develop a comprehensive image-to-text benchmark to evaluate the alignment of MLLMs with human judgments across the three hierarchical levels. The benchmark consists of 8 tasks distributed across these levels and includes 1,600 meticulously annotated examples, ensuring high-quality and diverse coverage of graphic design scenarios. Using this benchmark, we systematically evaluate 19 MLLMs, including both black-box APIs (e.g., GPT-4.1, GPT-4o, Gemini-2) and open-weight models ranging from 0.5 billion to 78 billion parameters. Results show that design understanding remains challenging for MLLMs, with GPT-4.1 achieving the best overall performance (65.5%) and InternVL-v2.5 (78B) leading among open-weight models, trailing the black-box APIs by only a small margin. We further study the effectiveness of common enhancement strategies, including prompt refinement and the incorporation of few-shot examples. Finally, we validate the applicability of our framework through real-world design evaluation experiments, demonstrating consistent and interpretable results.
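As a rough illustration of the benchmark-style evaluation described above, the sketch below scores a multimodal model's answers against human annotations, broken down by hierarchical level. It is a minimal sketch, not the authors' released code: the item fields, the exact-match scoring, and the query_fn callback are all assumptions, since the abstract does not specify the scoring protocol.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

@dataclass
class BenchmarkItem:
    # Hypothetical benchmark record; field names are illustrative only.
    image_path: str   # graphic design image shown to the model
    question: str     # task prompt for one of the benchmark tasks
    gold_answer: str  # human-annotated reference answer
    level: str        # "recognition", "semantic", or "overall"

def evaluate(items: list[BenchmarkItem],
             query_fn: Callable[[str, str], str]) -> dict[str, float]:
    """Return per-level agreement between model outputs and human labels.

    query_fn wraps whichever MLLM is being tested (a black-box API call
    or a local open-weight model) and maps (image_path, question) -> answer.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        prediction = query_fn(item.image_path, item.question)
        total[item.level] += 1
        # Exact-match agreement is used here purely for illustration.
        if prediction.strip().lower() == item.gold_answer.strip().lower():
            correct[item.level] += 1
    return {level: correct[level] / total[level] for level in total}
```

Any real scoring pipeline would plug in the paper's actual per-task metrics; the exact-match comparison above only conveys the idea of aggregating model-human agreement separately for the recognition, semantic, and overall levels.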