{"id":717256,"date":"2021-01-14T09:47:35","date_gmt":"2021-01-14T17:47:35","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=717256"},"modified":"2021-02-26T14:42:15","modified_gmt":"2021-02-26T22:42:15","slug":"vinvl-advancing-the-state-of-the-art-for-vision-language-models","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/vinvl-advancing-the-state-of-the-art-for-vision-language-models\/","title":{"rendered":"VinVL: Advancing the state of the art for vision-language models"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2021\/01\/1400x788_Leaderboard_slideshow_nologo-1.gif\" alt=\"\"\/><\/figure>\n\n\n\n<p>Humans understand the world by perceiving and fusing information from multiple channels, such as images viewed by the eyes, voices heard by the ears, and other forms of sensory input. One of the core aspirations in AI is to develop algorithms that endow computers with a similar ability: to effectively learn from multimodal data like vision-language to make sense of the world around us. For example, vision-language (VL) systems allow searching the relevant images for a text query (or vice versa) and describing the content of an image using natural language.<\/p>\n\n\n\n<p>As illustrated in Figure 1, a typical VL system uses a modular architecture with two modules to achieve VL understanding:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>An image encoding module,<\/strong> also known as a visual feature extractor, is implemented using convolutional neural network (CNN) models to generate feature maps of input image. The CNN-based object detection model trained on the Visual Genome (VG) dataset is the most popular choice before our work.<\/li><li><strong>A vision-language fusion module<\/strong>&nbsp;maps&nbsp;the&nbsp;encoded&nbsp;image and text into vectors in the same semantic space so that their semantic similarity can be computed using cosine distance&nbsp;of&nbsp;their vectors. The module is&nbsp;typically&nbsp;implemented using a&nbsp;Transformer-based model,&nbsp;such as&nbsp;<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/objects-are-the-secret-key-to-revealing-the-world-between-vision-and-language\/\" target=\"_blank\" rel=\"noreferrer noopener\">OSCAR<\/a>.&nbsp;<\/li><\/ul>\n\n\n\n<p>Recently, vision-language pretraining (VLP) has made great progress in improving the vision-language fusion module by pretraining it on a large-scale paired image-text corpus. The most representative approach is to train large Transformer-based models on massive image-text pair data in a self-supervised manner, for example, predicting the masked elements based on their context. The pretrained vision-language fusion model can be fine-tuned to adapt to various downstream vision-language tasks. However, existing VLP methods treat the image encoding module as a black box and leave the visual feature improvement untouched since the development of the classical bottom-up region features in 2017, despite that there has been much research progress on improving image encoding and object detection.<\/p>\n\n\n\n<p>Here, we introduce recent Microsoft work on improving the image encoding module. 
Here, we introduce recent Microsoft work on improving the image encoding module. Researchers from Microsoft have developed a new object-attribute detection model for image encoding, dubbed **VinVL** (**V**isual features **in** **V**ision-**L**anguage), and performed a comprehensive empirical study showing that visual features matter significantly in VL models. Combining VinVL with state-of-the-art VL fusion modules such as OSCAR and VIVO, the Microsoft VL system sets a new state of the art on all seven major VL benchmarks and achieves the top position on the most competitive VL leaderboards, including [Visual Question Answering (VQA)](https://eval.ai/web/challenges/challenge-page/514/leaderboard/1386), [Microsoft COCO Image Captioning](https://competitions.codalab.org/competitions/3221#results), and [Novel Object Captioning (nocaps)](https://eval.ai/web/challenges/challenge-page/355/leaderboard/1011). Most notably, the Microsoft VL system significantly surpasses human performance on the nocaps leaderboard in terms of CIDEr (92.5 vs. 85.3).

**Publication:** VinVL: Making Visual Representations Matter in Vision-Language Models

Microsoft will release the VinVL model and the source code to the public. Please refer to the research paper and the [GitHub repository](https://github.com/microsoft/Oscar).
In addition, VinVL is being integrated into Azure Cognitive Services, powering a wide range of multimodal scenarios (such as Seeing AI and image captioning in Office and LinkedIn) to benefit millions of users through the Microsoft AI at Scale initiative.

*Figure 1: Illustration of the state-of-the-art modular architecture for vision-language tasks, with two modules, an image encoding module and a vision-language fusion module, which are typically trained on Visual Genome and Conceptual Captions, respectively.*

## VinVL: A generic object-attribute detection model

As opposed to classical computer vision tasks such as object detection, VL tasks require understanding a more diverse set of visual concepts and aligning them with corresponding concepts in the text modality. On one hand, the most popular object detection benchmarks (such as COCO, Open Images, and Objects365) contain annotations for up to 600 object classes, mainly focusing on objects with a well-defined shape (such as car or person) but missing visual objects that occupy amorphous regions (such as grass or sky), which are typically useful for describing an image. These limited and biased object classes make such detection datasets insufficient for training VL understanding models that are useful in real-world applications. On the other hand, although the VG dataset has annotations for more diverse and less biased object and attribute classes, it contains only 110,000 images and is statistically too small to learn a reliable image encoding model.

To train our object-attribute detection model for VL tasks, we constructed a large object detection dataset containing 2.49M images for 1,848 object classes and 524 attribute classes by merging four public object detection datasets: COCO, Open Images, Objects365, and VG. Because most of these datasets do not have attribute annotations, we adopted a pretraining and fine-tuning strategy to build the model. We first pretrained an object detection model on the merged dataset and then fine-tuned it with an additional attribute branch on VG, making it capable of detecting both objects and attributes.
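The following is a minimal, hypothetical sketch of that pretrain-then-fine-tune recipe: a detector head classifies objects for each region during pretraining, and fine-tuning attaches an extra attribute branch so that each region receives both an object and an attribute prediction. The class names, default dimensions, and warm-start step are illustrative stand-ins, not the actual VinVL training code.

```python
import torch
import torch.nn as nn

class ToyDetectionHead(nn.Module):
    """Per-region object classifier, as used when pretraining on the merged detection dataset."""
    def __init__(self, feat_dim=1024, num_objects=1848):
        super().__init__()
        self.obj_cls = nn.Linear(feat_dim, num_objects)

    def forward(self, region_feats):             # (num_regions, feat_dim)
        return self.obj_cls(region_feats)        # object logits

class ToyObjectAttributeHead(ToyDetectionHead):
    """Fine-tuning stage: reuse the object head and add an attribute branch trained on VG."""
    def __init__(self, feat_dim=1024, num_objects=1848, num_attributes=524):
        super().__init__(feat_dim, num_objects)
        self.attr_cls = nn.Linear(feat_dim, num_attributes)

    def forward(self, region_feats):
        return self.obj_cls(region_feats), self.attr_cls(region_feats)

# Pretrain the object head (training loop omitted), then warm-start the object-attribute model from it.
pretrained = ToyDetectionHead()
finetuned = ToyObjectAttributeHead()
finetuned.obj_cls.load_state_dict(pretrained.obj_cls.state_dict())

region_feats = torch.randn(5, 1024)               # features for 5 detected regions
obj_logits, attr_logits = finetuned(region_feats)
print(obj_logits.shape, attr_logits.shape)        # (5, 1848) and (5, 524)
```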
The resulting object-attribute detection model is a Faster R-CNN model with 152 convolutional layers and 133M parameters, the largest image encoding model reported for VL tasks.

Our object-attribute detection model can detect 1,594 object classes and 524 visual attributes. As a result, according to our experiments, the model can detect and encode nearly all of the semantically meaningful regions in an input image. As illustrated in Figure 2, compared with the detections of a classical object detection model (left), our model (right) detects more visual objects and attributes in an image and encodes them with richer visual features, which are crucial for a wide range of VL tasks.

*Figure 2: Detections from a classical object detection model trained on Open Images (left) and our object-attribute detection model trained on four public object detection datasets (right). Our model produces much richer semantics, such as richer visual concepts and attribute information, and its detected bounding boxes cover nearly all semantically meaningful regions.*
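In downstream use, each image is represented by what the detector returns: region features together with the detected object and attribute names, which an OSCAR-style fusion model consumes alongside the text tokens. The snippet below is a hypothetical illustration of packaging such detector outputs; the field names and the `build_fusion_input` helper are invented for illustration and are not the actual OSCAR or VinVL interface.

```python
from dataclasses import dataclass
from typing import List
import torch

@dataclass
class RegionAnnotation:
    """One detected region: bounding box, object/attribute names, and its feature vector."""
    box: List[float]         # [x1, y1, x2, y2]
    object_tag: str          # e.g., "dog"
    attribute_tag: str       # e.g., "brown"
    feature: torch.Tensor    # (feat_dim,) region feature from the image encoder

def build_fusion_input(caption: str, regions: List[RegionAnnotation]):
    """Assemble the three streams an OSCAR-style fusion model takes as input:
    caption tokens, detected object tags, and region features."""
    word_tokens = caption.lower().split()                         # stand-in for a real tokenizer
    object_tags = [f"{r.attribute_tag} {r.object_tag}" for r in regions]
    region_features = torch.stack([r.feature for r in regions])   # (num_regions, feat_dim)
    return word_tokens, object_tags, region_features

# Fake detector output for one image (illustrative values only).
regions = [
    RegionAnnotation([10, 20, 200, 180], "dog", "brown", torch.randn(2048)),
    RegionAnnotation([0, 0, 640, 120], "sky", "blue", torch.randn(2048)),
]
tokens, tags, feats = build_fusion_input("a brown dog under a blue sky", regions)
print(tokens, tags, feats.shape)
```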
## State-of-the-art performance on vision-language tasks

Since the image encoding module is fundamental to VL systems, as illustrated in Figure 1, our new image encoder can be used together with many existing VL fusion modules to improve performance on VL tasks. For example, as reported in Table 1, by simply replacing the visual features produced by the popular bottom-up model with those produced by our model, while keeping the VL fusion module (for example, OSCAR and VIVO) intact [1], we observe significant improvement on all seven established VL tasks, often outperforming previous SoTA models by a significant margin.

[1] Note that we still train the VL fusion module, but we use the same model architecture, training data, and training recipe.

*Table 1: Uniform improvements on seven VL tasks from replacing the popular bottom-up visual features with ours. The nocaps baseline is from VIVO, and the baselines for the remaining tasks are from OSCAR.*
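To make the comparison concrete, here is a hypothetical sketch of the controlled experiment: the same fusion module is trained and evaluated twice, and only the set of precomputed region features it reads changes. Everything below (the dummy feature generator, the placeholder metric, and the stand-in fusion model) is invented for illustration and does not reflect the actual experiment scripts.

```python
import torch

def fake_features(feat_dim: int, seed: int) -> dict:
    """Stand-in for precomputed region features exported by an image encoder
    (real pipelines typically read detector outputs from TSV or LMDB files)."""
    g = torch.Generator().manual_seed(seed)
    return {img_id: torch.randn(10, feat_dim, generator=g) for img_id in range(100)}

def train_and_eval_fusion(features: dict, fusion_model: torch.nn.Module) -> float:
    """Train and evaluate the (unchanged) VL fusion module on the given visual features.
    Here it just returns a dummy number so the sketch runs end to end."""
    pooled = torch.stack([feats.mean() for feats in features.values()])
    return float(pooled.abs().mean())       # placeholder for VQA accuracy, CIDEr, etc.

# Same fusion architecture, training data, and recipe; only the visual features differ.
fusion = torch.nn.Identity()                # placeholder for an OSCAR-style fusion model
bottom_up = fake_features(2048, seed=0)     # classical bottom-up features (2017)
new_features = fake_features(2048, seed=1)  # features from the new object-attribute detector

print("baseline score:", train_and_eval_fusion(bottom_up, fusion))
print("new-feature score:", train_and_eval_fusion(new_features, fusion))
```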
To account for parameter efficiency, we compare models of different sizes in Table 2. Our base model outperforms previous large models on most tasks, indicating that with better image encoding, the VL fusion module can be much more parameter efficient.

*Table 2: Oscar+, using visual features produced by our object-attribute detection model, achieves the best performance on seven established VL tasks. SoTA with subscript S, B, or L indicates the best performance achieved by small, base, or large models (sizes follow [BERT](https://arxiv.org/abs/1810.04805)), respectively. For all tables in this blog, blue indicates the best result for a task, and a gray background indicates results produced by Oscar. The previous SoTA results are collected from [ERNIE-ViL](https://arxiv.org/abs/2006.16934), [Neural State Machine (NSM)](https://arxiv.org/abs/1907.03950), VIVO, VILLA, and OSCAR.*

Our new VL model, which consists of the new object-attribute detection model as its image encoding module and OSCAR as its VL fusion module, sits comfortably atop several AI benchmarks as of December 31, 2020, including Visual Question Answering (VQA), Microsoft COCO Image Captioning, and Novel Object Captioning (nocaps). Most notably, its performance on nocaps substantially surpasses human performance in terms of CIDEr (92.5 vs. 85.3).
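As a brief aside on the metric: CIDEr scores a generated caption by its TF-IDF-weighted n-gram similarity to a set of human reference captions, so higher is better. One common way to compute it is the COCO caption evaluation toolkit; the snippet below assumes the `pycocoevalcap` package is installed and uses toy captions purely for illustration.

```python
# Assumes the COCO caption evaluation toolkit is installed: pip install pycocoevalcap
from pycocoevalcap.cider.cider import Cider

# Toy example: human reference captions and one generated caption per image id.
references = {
    "img1": ["a brown dog runs on the grass", "a dog running across a grassy field"],
    "img2": ["a man riding a red bicycle", "a person rides a red bike down the street"],
}
candidates = {
    "img1": ["a brown dog running on grass"],
    "img2": ["a man rides a red bicycle"],
}

corpus_score, per_image_scores = Cider().compute_score(references, candidates)
print("CIDEr:", corpus_score, per_image_scores)
```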
On the GQA benchmark, our model is also the first VL model to outperform NSM, which contains sophisticated reasoning components deliberately designed for that specific task.

*Leaderboard snapshots: Visual Question Answering (VQA), Microsoft COCO Image Captioning, and Novel Object Captioning (nocaps).*

## Looking forward

VinVL has demonstrated great potential in improving image encoding for VL understanding. Our newly developed image encoding model can benefit a wide range of VL tasks, as illustrated by the examples in the paper. Despite the promising results, such as surpassing human performance on image captioning benchmarks, our model is by no means at human-level VL understanding.
Interesting directions for future work include: (1) further scaling up object-attribute detection pretraining by leveraging massive image classification and tagging data, and (2) extending the methods of cross-modal VL representation learning toward perception-grounded language models that can ground visual concepts in natural language, and vice versa, as humans do.

*Acknowledgments: This research was conducted by Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Additional thanks go to the Microsoft Research Service Engineering Group for providing computing resources for large-scale modeling. The baseline models used in our experiments are based on open-source code released on GitHub; we thank all the authors who made their code public, which tremendously accelerated our progress.*