{"id":1048632,"date":"2024-06-19T14:06:14","date_gmt":"2024-06-19T21:06:14","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?post_type=msr-project&#038;p=1048632"},"modified":"2024-08-29T14:19:45","modified_gmt":"2024-08-29T21:19:45","slug":"image-understanding-benchmark","status":"publish","type":"msr-project","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/project\/image-understanding-benchmark\/","title":{"rendered":"A Dynamic Benchmark for Image Understanding"},"content":{"rendered":"<section class=\"mb-3 moray-highlight\">\n\t<div class=\"card-img-overlay mx-lg-0\">\n\t\t<div class=\"card-background  has-background-plum card-background--full-bleed\">\n\t\t\t\t\t<\/div>\n\t\t<!-- Foreground -->\n\t\t<div class=\"card-foreground d-flex mt-md-n5 my-lg-5 px-g px-lg-0\">\n\t\t\t<!-- Container -->\n\t\t\t<div class=\"container d-flex mt-md-n5 my-lg-5 \">\n\t\t\t\t<!-- Card wrapper -->\n\t\t\t\t<div class=\"w-100 w-lg-col-5\">\n\t\t\t\t\t<!-- Card -->\n\t\t\t\t\t<div class=\"card material-md-card py-5 px-md-5\">\n\t\t\t\t\t\t<div class=\"card-body \">\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n<h2 class=\"wp-block-heading\" id=\"a-dynamic-benchmark-for-image-understanding\">A Dynamic Benchmark for Image Understanding<\/h2>\n\n\n\n<p>We have created a procedurally generatable, synthetic dataset for testing spatial reasoning, visual prompting, object recognition and detection.<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n<p>A key question for understanding multimodal model performance is how well it can understand images, in particular basic versus detailed spatial understanding. 
These capabilities are needed for models to be used in real-world tasks, such as serving as an assistant in the physical world. We have created a procedurally generatable, synthetic dataset for testing spatial reasoning, visual prompting, object recognition and detection. The datasets are challenging, and because they are procedurally generated and non-public, strong results cannot be attributed to memorization.<\/p>\n\n\n\n<p>The benchmark has four sub-tasks that test high-level and detailed image understanding using a Visual Question Answering (VQA) approach. The figure below shows the four tasks. Each task has a single-object condition and a pair-of-objects condition. For each image, we show the question that can be presented as a prompt to a multimodal language model.<\/p>\n\n\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2560\" height=\"1212\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/06\/spatialunderstanding-66734a5909a8e-scaled.jpg\" alt=\"Examples of the benchmark images and tasks\" class=\"wp-image-1048662\" style=\"object-fit:cover\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/06\/spatialunderstanding-66734a5909a8e-scaled.jpg 2560w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/06\/spatialunderstanding-66734a5909a8e-300x142.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/06\/spatialunderstanding-66734a5909a8e-1024x485.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/06\/spatialunderstanding-66734a5909a8e-768x364.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/06\/spatialunderstanding-66734a5909a8e-1536x727.jpg 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/06\/spatialunderstanding-66734a5909a8e-2048x970.jpg 2048w, 
https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/06\/spatialunderstanding-66734a5909a8e-240x114.jpg 240w\" sizes=\"auto, (max-width: 2560px) 100vw, 2560px\" \/><\/figure>\n","protected":false},"excerpt":{"rendered":"<p>We have created a procedurally generatable, synthetic dataset for testing spatial reasoning, visual prompting, object recognition and detection. A key question for understanding multimodal model performance is how well it can understand images, in particular basic vs. detailed spatial understanding of images. These capabilities are needed for models to be used in real-world tasks, such [&hellip;]<\/p>\n","protected":false},"featured_media":1048644,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556,13562],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-1048632","msr-project","type-msr-project","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-research-area-computer-vision","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Neel Joshi","user_id":33073,"people_section":"Related 
people","alias":"neel"}],"msr_research_lab":[992148],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1048632","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":6,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1048632\/revisions"}],"predecessor-version":[{"id":1081296,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1048632\/revisions\/1081296"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/1048644"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1048632"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1048632"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1048632"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1048632"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1048632"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}