{"id":1131576,"date":"2025-02-25T11:08:21","date_gmt":"2025-02-25T19:08:21","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?p=1131576"},"modified":"2025-07-17T07:18:12","modified_gmt":"2025-07-17T14:18:12","slug":"magma-a-foundation-model-for-multimodal-ai-agents-across-digital-and-physical-worlds","status":"publish","type":"post","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/magma-a-foundation-model-for-multimodal-ai-agents-across-digital-and-physical-worlds\/","title":{"rendered":"Magma: A foundation model for multimodal AI agents across digital and physical worlds"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1.jpg\" alt=\"Gradient background transitioning from blue on the left to pink on the right. In the center, a rectangular box with \u2018MAGMA\u2019 written in bold white letters. To the left, an icon of a globe representing Earth. To the right, an icon of a computer monitor displaying a globe. Arrows connect these three elements in a circular flow, indicating interaction or data exchange between Earth, MAGMA, and the computer.\" class=\"wp-image-1131846\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1.jpg 1400w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1-768x432.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1-655x368.jpg 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1-240x135.jpg 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/figure>\n\n\n\n<p>Imagine an AI system capable of guiding a robot to manipulate physical objects as effortlessly as it navigates software menus. Such seamless integration of digital and physical tasks has long been the stuff of science fiction.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Today, Microsoft researchers are bringing that vision closer to reality with <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/ai.azure.com\/labs\/projects\/magma\" target=\"_blank\" rel=\"noopener noreferrer\">Magma<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, a multimodal AI foundation model designed to process information and generate action proposals across both digital and physical environments. It is designed to enable AI agents to interpret user interfaces and suggest actions like button clicks, while also orchestrating robotic movements and interactions in the physical world. &nbsp;<\/p>\n\n\n\n<p>Built on the foundation model paradigm, Magma is pretrained on an expansive and diverse dataset, allowing it to generalize better across tasks and environments than smaller, task-specific models. As illustrated in Figure 1, Magma synthesizes visual and textual inputs to generate meaningful actions\u2014whether executing a command in software or grabbing a tool in the physical world. This new model represents a significant step toward AI agents that can serve as versatile, general-purpose assistants.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"4667\" height=\"2100\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-1-3.png\" alt=\"Given a described goal, Magma can formulate plans and execute actions to achieve it. By effectively transferring knowledge from freely available visual and language data, Magma bridges verbal, spatial and temporal intelligence to navigate complex tasks and settings.\" class=\"wp-image-1131618\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-1-3.png 4667w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-1-3-300x135.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-1-3-1024x461.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-1-3-768x346.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-1-3-1536x691.png 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-1-3-2048x922.png 2048w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-1-3-240x108.png 240w\" sizes=\"auto, (max-width: 4667px) 100vw, 4667px\" \/><figcaption class=\"wp-element-caption\">Figure 1: Magma is one of the first foundation models that is capable of interpreting and grounding multimodal inputs within both digital and physical environments. Given a described goal, Magma can formulate plans and execute actions to achieve it. By effectively transferring knowledge from freely available visual and language data, Magma bridges verbal, spatial and temporal intelligence to navigate complex tasks and settings.<\/figcaption><\/figure>\n\n\n\n<p>Vision-Language-Action (VLA) models integrate visual perception, language comprehension, and action reasoning to enable AI systems to interpret images, process textual instructions, and propose actions. These models bridge the gap between multimodal understanding and real-world interaction. Typically pretrained on large numbers of VLA datasets, they acquire the ability to understand visual content, process language, and perceive and interact with the spatial world, allowing them to perform a wide range of tasks. However, due to the dramatic difference among various digital and physical environments, separate VLA models are trained and used for different environments. As a result, these models struggle to generalize to new tasks and environments outside of their training data. Moreover, most of these models do not leverage pretrained vision-language (VL) models or diverse VL datasets, which hampers their understanding of VL relations and generalizability. <\/p>\n\n\n\n\t<div class=\"border-bottom border-top border-gray-300 mt-5 mb-5 msr-promo text-center text-md-left alignwide\" data-bi-aN=\"promo\" data-bi-id=\"1144027\">\n\t\t\n\n\t\t<p class=\"msr-promo__label text-gray-800 text-center text-uppercase\">\n\t\t<span class=\"px-4 bg-white display-inline-block font-weight-semibold small\">PODCAST SERIES<\/span>\n\t<\/p>\n\t\n\t<div class=\"row pt-3 pb-4 align-items-center\">\n\t\t\t\t\t\t<div class=\"msr-promo__media col-12 col-md-5\">\n\t\t\t\t<a class=\"bg-gray-300 display-block\" href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/story\/ai-testing-and-evaluation-learnings-from-science-and-industry\/\" aria-label=\"AI Testing and Evaluation: Learnings from Science and Industry\" data-bi-cN=\"AI Testing and Evaluation: Learnings from Science and Industry\" target=\"_blank\">\n\t\t\t\t\t<img decoding=\"async\" class=\"w-100 display-block\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/06\/EP2-AI-TE_Hero_Feature_River_No_Text_1400x788.jpg\" alt=\"Illustrated headshots of Daniel Carpenter, Timo Minssen, Chad Atalla, and Kathleen Sullivan for the Microsoft Research Podcast\" \/>\n\t\t\t\t<\/a>\n\t\t\t<\/div>\n\t\t\t\n\t\t\t<div class=\"msr-promo__content p-3 px-5 col-12 col-md\">\n\n\t\t\t\t\t\t\t\t\t<h2 class=\"h4\">AI Testing and Evaluation: Learnings from Science and Industry<\/h2>\n\t\t\t\t\n\t\t\t\t\t\t\t\t<p id=\"ai-testing-and-evaluation-learnings-from-science-and-industry\" class=\"large\">Discover how Microsoft is learning from other domains to advance evaluation and testing as a pillar of AI governance.<\/p>\n\t\t\t\t\n\t\t\t\t\t\t\t\t<div class=\"wp-block-buttons justify-content-center justify-content-md-start\">\n\t\t\t\t\t<div class=\"wp-block-button\">\n\t\t\t\t\t\t<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/story\/ai-testing-and-evaluation-learnings-from-science-and-industry\/\" aria-describedby=\"ai-testing-and-evaluation-learnings-from-science-and-industry\" class=\"btn btn-brand glyph-append glyph-append-chevron-right\" data-bi-cN=\"AI Testing and Evaluation: Learnings from Science and Industry\" target=\"_blank\">\n\t\t\t\t\t\t\tListen now\t\t\t\t\t\t<\/a>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t\t\t\t\t<\/div><!--\/.msr-promo__content-->\n\t<\/div><!--\/.msr-promo__inner-wrap-->\n\t<\/div><!--\/.msr-promo-->\n\t\n\n\n<p>Magma, to the best of our knowledge, is one of the first VLA foundation model that can adapt to new tasks in both digital and physical environments, which helps AI-powered assistants or robots understand their surroundings and suggest appropriate actions. For example, it could enable a home assistant robot to learn how to organize a new type of object it has never encountered or help a virtual assistant generate step-by-step user interface navigation instructions for an unfamiliar task. Through Magma, we demonstrate the advantages of pretraining a single VLA model for AI agents across multiple environments while still achieving state-of-the-art results on user interface navigation and robotic manipulation tasks, outperforming previous models that are tailored to these specific domains. On VL tasks, Magma also compares favorably to popular VL models that are trained on much larger datasets.&nbsp;<\/p>\n\n\n\n<p>Building a foundation model that spans such different modalities has required us to rethink how we train and supervise AI agents. Magma introduces a novel training paradigm centered on two key innovations: Set-of-Mark (SoM) and Trace-of-Mark (ToM) annotations. These techniques developed by Microsoft Research, imbue the model with a structured understanding of tasks in both user interface navigation and robotic manipulation domains.&nbsp;<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Set-of-Mark (SoM):<\/strong> SoM is an annotated set of key objects, or interface elements that are relevant to achieving a given goal. For example, if the task is to navigate a web page, the SoM includes all the bounding boxes for clickable user interface elements. In a physical task like setting a table, the SoM could include the plate, the cup, and the position of each item on the table. By providing SoM, we give Magma a high-level hint of \u201cwhat needs attention\u201d\u2014the essential elements of the task\u2014without yet specifying the order or method.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1280\" height=\"234\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-2-2.jpg\" alt=\"Set-of-Mark prompting enables effective action grounding in images for both UI screenshot , robot manipulation and human video by having the model predict numeric marks for clickable buttons or robot arms in image space. These marks give Magma a high-level hint of \u201cwhat needs attention\u201d \u2013 the essential elements of the task.\" class=\"wp-image-1131630\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-2-2.jpg 1280w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-2-2-300x55.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-2-2-1024x187.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-2-2-768x140.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-2-2-240x44.jpg 240w\" sizes=\"auto, (max-width: 1280px) 100vw, 1280px\" \/><figcaption class=\"wp-element-caption\">Figure 2: Set-of-Mark (SoM) for Action Grounding. Set-of-Mark prompting enables effective action grounding in images for both UI screenshot (left), robot manipulation (middle) and human video (right) by having the model predict numeric marks for clickable buttons or robot arms in image space. These marks give Magma a high-level hint of \u201cwhat needs attention\u201d \u2013 the essential elements of the task&nbsp;<\/figcaption><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Trace-of-Mark (ToM):<\/strong> In ToM we extend the strategy of \u201coverlaying marks\u201d from static images to dynamic videos, by incorporating tracing lines following object movements over time. While SoM highlights key objects or interface elements relevant to a task, ToM captures how these elements change or move throughout an interaction. For example, in a physical task like moving an object on a table, ToM might illustrate the motion of a hand placing the object and adjusting its position. By providing these temporal traces, ToM offers Magma a richer understanding of how actions unfold, complementing SoM\u2019s focus on what needs attention.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"9025\" height=\"1707\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-3-4.png\" alt=\"Trace-of-Mark (ToM) for Action Planning. Trace-of-Mark supervisions for robot manipulation and human action. It compels the model to comprehend temporal video dynamics and anticipate future states before acting, while using fewer tokens than next-frame prediction to capture longer temporal horizons and action-related dynamics without ambient distractions\" class=\"wp-image-1131645\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-3-4.png 9025w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-3-4-300x57.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-3-4-1024x194.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-3-4-768x145.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-3-4-1536x291.png 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-3-4-2048x387.png 2048w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-3-4-240x45.png 240w\" sizes=\"auto, (max-width: 9025px) 100vw, 9025px\" \/><figcaption class=\"wp-element-caption\">Figure 3: Trace-of-Mark (ToM) for Action Planning. Trace-of-Mark supervisions for robot manipulation (left) and human action (right). It compels the model to comprehend temporal video dynamics and anticipate future states before acting, while using fewer tokens than next-frame prediction to capture longer temporal horizons and action-related dynamics without ambient distractions.&nbsp;<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"performance-and-evaluation\">Performance and evaluation<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"zero-shot-agentic-intelligence\">Zero-shot agentic intelligence<\/h3>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2221\" height=\"601\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-1-1.png\" alt=\"Table 1: Zero-shot evaluation on agentic intelligence. We report the results for pretrained Magma without any domain-specific finetuning. In this experiment, Magma is the only model that can conduct the full task spectrum.\" class=\"wp-image-1131813\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-1-1.png 2221w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-1-1-300x81.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-1-1-1024x277.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-1-1-768x208.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-1-1-1536x416.png 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-1-1-2048x554.png 2048w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-1-1-240x65.png 240w\" sizes=\"auto, (max-width: 2221px) 100vw, 2221px\" \/><figcaption class=\"wp-element-caption\">Table 1: Zero-shot evaluation on agentic intelligence. We report the results for pretrained Magma without any domain-specific finetuning. In this experiment, Magma is the only model that can conduct the full task spectrum.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"4284\" height=\"1271\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-4-1.png\" alt=\"Zero-shot evaluation on Google Robots and Bridge with SimplerEnv. Magma shows strong zero-shot cross-domain robustness and demonstrates impressive results in cross-embodiment manipulation simulation tasks.\" class=\"wp-image-1131657\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-4-1.png 4284w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-4-1-300x89.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-4-1-1024x304.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-4-1-768x228.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-4-1-1536x456.png 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-4-1-2048x608.png 2048w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Figure-4-1-240x71.png 240w\" sizes=\"auto, (max-width: 4284px) 100vw, 4284px\" \/><figcaption class=\"wp-element-caption\">Figure 4: Zero-shot evaluation on Google Robots and Bridge with SimplerEnv. Magma shows strong zero-shot cross-domain robustness and demonstrates impressive results in cross-embodiment manipulation simulation tasks.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"efficient-finetuning\">Efficient finetuning<\/h3>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2835\" height=\"844\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-2.png\" alt=\"Table showing Zero-shot evaluation on agentic intelligence. We report the results for pretrained Magma without any domain-specific finetuning. In this experiment, Magma is the only model that can conduct the full task spectrum.\" class=\"wp-image-1131819\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-2.png 2835w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-2-300x89.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-2-1024x305.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-2-768x229.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-2-1536x457.png 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-2-2048x610.png 2048w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-2-240x71.png 240w\" sizes=\"auto, (max-width: 2835px) 100vw, 2835px\" \/><figcaption class=\"wp-element-caption\">Table 2: Efficient finetuning on Mind2Web for web UI navigation.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1072\" height=\"255\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Magma_Figure5.jpg\" alt=\"Figure 5: Few-shot finetuning on Widow-X robot (left) and LIBERO (right). Magma achieves a significantly higher average success rate in all task suites. Additionally, removing SoM and ToM during pretraining has a negative impact on model performance.\" class=\"wp-image-1131771\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Magma_Figure5.jpg 1072w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Magma_Figure5-300x71.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Magma_Figure5-1024x244.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Magma_Figure5-768x183.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Magma_Figure5-1066x255.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Magma_Figure5-240x57.jpg 240w\" sizes=\"auto, (max-width: 1072px) 100vw, 1072px\" \/><figcaption class=\"wp-element-caption\">Figure 5: Few-shot finetuning on Widow-X robot (left) and LIBERO (right). Magma achieves a significantly higher average success rate in all task suites. Additionally, removing SoM and ToM during pretraining has a negative impact on model performance. <\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"2440\" height=\"686\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-3.jpg\" alt=\"Without task-specific data, Magma performs competitively and even outperforms some state-of-the-art approaches such as Video-Llama2 and ShareGPT4Video on most benchmarks, despite using much fewer video instruction tuning data.\" class=\"wp-image-1131831\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-3.jpg 2440w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-3-300x84.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-3-1024x288.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-3-768x216.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-3-1536x432.jpg 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-3-2048x576.jpg 2048w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Table-3-240x67.jpg 240w\" sizes=\"auto, (max-width: 2440px) 100vw, 2440px\" \/><figcaption class=\"wp-element-caption\">Table 3: Without task-specific data, Magma performs competitively and even outperforms some state-of-the-art approaches such as Video-Llama2 and ShareGPT4Video on most benchmarks, despite using much fewer video instruction tuning data.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"relation-to-broader-research\">Relation to broader research<\/h2>\n\n\n\n<p>Magma is one component of a much larger vision within Microsoft Research for the future of agentic AI systems. Across various teams and projects at Microsoft, we are collectively exploring how AI systems can detect, analyze, and respond in the world to amplify human capabilities.<\/p>\n\n\n\n<p>Earlier this month, we announced <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/autogen-v0-4-reimagining-the-foundation-of-agentic-ai-for-scale-extensibility-and-robustness\/\" target=\"_blank\" rel=\"noreferrer noopener\">AutoGen v0.4<\/a>, a fully reimagined open-source library for building advanced agentic AI systems. While AutoGen focuses on the structure and management of AI agents, Magma enhances those agents by empowering them with a new level of capability. Developers can already use AutoGen to set up an AI assistant that leverages an LLM for planning and dialogue using conventional LLMs. Now with MAGMA, if developers want to build agents that execute both physical or user interface\/browser tasks, that same assistant would call upon Magma to understand the environment, perform reasoning, and take a sequence of actions to complete the task.&nbsp;<\/p>\n\n\n\n<p>The reasoning ability of Magma can be further developed by incorporating test-time search and reinforcement learning, as described in <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/exact-improving-ai-agents-decision-making-via-test-time-compute-scaling\/\" target=\"_blank\" rel=\"noreferrer noopener\">ExACT<\/a>. ExACT shows an approach for teaching AI agents to explore more effectively, enabling them to intelligently navigate their environments, gather valuable information, evaluate options, and identify optimal decision-making and planning strategies.<\/p>\n\n\n\n<p>At the application level, we are also exploring new user experience (UX) powered by foundation models for the next generation of agentic AI systems. <a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/blog\/data-formulator-exploring-how-ai-can-help-analysts-create-rich-data-visualizations\/\" target=\"_blank\" rel=\"noreferrer noopener\">Data Formulator<\/a> is a prime example. Announced late last year, Data Formulator, is an AI-driven visualization tool developed by Microsoft Research that translates high-level analytical intents into rich visual representations by handling complex data transformations behind the scenes\u200b.&nbsp;&nbsp;<\/p>\n\n\n\n<p>Looking ahead, the integration of reasoning, exploration and action capabilities will pave the way for highly capable, robust agentic AI systems.<\/p>\n\n\n\n<p>Magma is available on <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/ai.azure.com\/labs\/projects\/magma\" target=\"_blank\" rel=\"noopener noreferrer\">Azure AI Foundry Labs<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> as well as on <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/microsoft\/Magma-8B\" target=\"_blank\" rel=\"noopener noreferrer\"><u>HuggingFace<\/u><span class=\"sr-only\"> (opens in new tab)<\/span><\/a> with an MIT license. Please refer to the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/microsoft.github.io\/Magma\/\" target=\"_blank\" rel=\"noopener noreferrer\"><u>Magma project page<\/u><span class=\"sr-only\"> (opens in new tab)<\/span><\/a> for more technical details. We invite you to test and explore these cutting-edge agentic model innovations from Microsoft Research.<\/p>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button is-style-pill\"><a data-bi-type=\"button\" class=\"wp-block-button__link has-white-color has-blue-background-color has-text-color has-background has-link-color wp-element-button\" href=\"https:\/\/ai.azure.com\/labs\/projects\/magma\">Explore Magma on Azure AI Foundry<\/a><\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Explore Magma, a foundation model that can empower AI assistants to interpret environments, plan actions, and execute tasks across digital and physical spaces. Now available, learn how it advances the field of agentic AI.<\/p>\n","protected":false},"author":38004,"featured_media":1131846,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Swadheen Shukla","user_id":"38248"},{"type":"user_nicename","value":"Jianwei Yang","user_id":"40261"},{"type":"user_nicename","value":"Reuben Tan","user_id":"43827"},{"type":"user_nicename","value":"Qianhui Wu","user_id":"40741"},{"type":"user_nicename","value":"Jianfeng Gao","user_id":"32246"}],"msr_hide_image_in_river":null,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[269148,243984,269142],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-1131576","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-approved-for-river","msr-post-option-blog-homepage-featured","msr-post-option-include-in-river"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[1057371],"related-projects":[],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Swadheen Shukla","user_id":38248,"display_name":"Swadheen Shukla","author_link":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/swads\/\" aria-label=\"Visit the profile page for Swadheen Shukla\">Swadheen Shukla<\/a>","is_active":false,"last_first":"Shukla, Swadheen","people_section":0,"alias":"swads"},{"type":"user_nicename","value":"Reuben Tan","user_id":43827,"display_name":"Reuben Tan","author_link":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/tanreuben\/\" aria-label=\"Visit the profile page for Reuben Tan\">Reuben Tan<\/a>","is_active":false,"last_first":"Tan, Reuben","people_section":0,"alias":"tanreuben"},{"type":"user_nicename","value":"Qianhui Wu","user_id":40741,"display_name":"Qianhui Wu","author_link":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/qianhuiwu\/\" aria-label=\"Visit the profile page for Qianhui Wu\">Qianhui Wu<\/a>","is_active":false,"last_first":"Wu, Qianhui","people_section":0,"alias":"qianhuiwu"},{"type":"user_nicename","value":"Jianfeng Gao","user_id":32246,"display_name":"Jianfeng Gao","author_link":"<a href=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/people\/jfgao\/\" aria-label=\"Visit the profile page for Jianfeng Gao\">Jianfeng Gao<\/a>","is_active":false,"last_first":"Gao, Jianfeng","people_section":0,"alias":"jfgao"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1-960x540.jpg\" class=\"img-object-cover\" alt=\"Gradient background transitioning from blue on the left to pink on the right. In the center, a rectangular box with \u2018MAGMA\u2019 written in bold white letters. To the left, an icon of a globe representing Earth. To the right, an icon of a computer monitor displaying a globe. Arrows connect these three elements in a circular flow, indicating interaction or data exchange between Earth, MAGMA, and the computer.\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1-960x540.jpg 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1-300x169.jpg 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1-1024x576.jpg 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1-768x432.jpg 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1-1066x600.jpg 1066w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1-655x368.jpg 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1-240x135.jpg 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1-640x360.jpg 640w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1-1280x720.jpg 1280w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2025\/02\/Project-Magma-BlogHeroFeature-1400x788-1.jpg 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"","formattedDate":"February 25, 2025","formattedExcerpt":"Explore Magma, a foundation model that can empower AI assistants to interpret environments, plan actions, and execute tasks across digital and physical spaces. Now available, learn how it advances the field of agentic AI.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/1131576","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/users\/38004"}],"replies":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/comments?post=1131576"}],"version-history":[{"count":36,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/1131576\/revisions"}],"predecessor-version":[{"id":1145106,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/posts\/1131576\/revisions\/1145106"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/1131846"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1131576"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/categories?post=1131576"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/tags?post=1131576"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1131576"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=1131576"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=1131576"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1131576"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1131576"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1131576"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=1131576"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=1131576"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}