{"id":992553,"date":"2024-08-23T03:31:35","date_gmt":"2024-08-23T10:31:35","guid":{"rendered":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/?post_type=msr-project&#038;p=992553"},"modified":"2025-05-25T23:33:02","modified_gmt":"2025-05-26T06:33:02","slug":"interactive-world-simulator","status":"publish","type":"msr-project","link":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/project\/interactive-world-simulator\/","title":{"rendered":"Interactive World Simulator"},"content":{"rendered":"\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-top is-style-default is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\"><section class=\"mb-3 moray-highlight\">\n\t<div class=\"card-img-overlay mx-lg-0\">\n\t\t<div class=\"card-background  has-background-catalina-blue card-background--full-bleed\">\n\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"960\" height=\"540\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/MSR.png\" class=\"attachment-full size-full\" alt=\"background pattern\" style=\"\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/MSR.png 960w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/MSR-300x169.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/MSR-768x432.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/MSR-655x368.png 655w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/MSR-240x135.png 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/MSR-640x360.png 640w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>\t\t<\/div>\n\t\t<!-- Foreground -->\n\t\t<div class=\"card-foreground d-flex mt-md-n5 my-lg-5 px-g px-lg-0\">\n\t\t\t<!-- Container -->\n\t\t\t<div class=\"container d-flex mt-md-n5 my-lg-5 \">\n\t\t\t\t<!-- Card wrapper -->\n\t\t\t\t<div class=\"w-100 w-lg-col-5\">\n\t\t\t\t\t<!-- Card -->\n\t\t\t\t\t<div class=\"card material-md-card py-5 px-md-5\">\n\t\t\t\t\t\t<div class=\"card-body \">\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n<h1 class=\"wp-block-heading has-text-align-left\" id=\"interactive-world-simulator\">Interactive World Simulator<\/h1>\n\n\n\n<p>Learning to simulate the visual world from large-scale videos.<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n<\/div>\n<\/div>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n\n\n<h3 class=\"wp-block-heading\" id=\"video-tokenization-1\">Video Tokenization<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>VidTok<\/strong>: a cutting-edge family of video tokenizers that excels in both continuous and discrete tokenizations. [<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/VidTok\">GitHub<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>]<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"autoregressive-video-models\">Autoregressive Video Models<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Video In-Context Learning<\/strong>: autoregressive transformers are zero-shot video imitators.<\/li>\n\n\n\n<li><strong>Diagonal Decoding<\/strong>: fast autoregressive video generation with diagonal decoding.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"4d-world-simulator\">4D World Simulator<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Compositional 3D-aware Video Generation<\/strong>: C3V generates each concept in 3D representation separately and then composes them with priors from Large Language Models (LLM) and 2D diffusion models.<\/li>\n<\/ul>\n\n\n\n<p><\/p>\n\n\n\n\n\n<h2 class=\"wp-block-heading has-text-align-center\" id=\"video-in-context-learning-autoregressive-transformers-are-zero-shot-video-imitators-1\">Video In-context Learning: Autoregressive Transformers are Zero-Shot Video Imitators<\/h2>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-buttons is-content-justification-left is-content-justification-left is-layout-flex wp-container-core-buttons-is-layout-16165478 wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/arxiv.org\/abs\/2407.07356\">Paper<\/a><\/div>\n<\/div>\n\n\n\n<p>In-context learning for vision data has been underexplored compared with that in natural language. Previous works studied image in-context learning, urging models to generate a single image guided by demonstrations. In this project, we propose and study video in-context learning, where the model starts from an existing video clip and generates diverse potential future sequences, each semantically guided by the prompted video demonstrations. To achieve this, we provide a clear definition of the task, and train an autoregressive Transformer on video datasets. We thoroughly analyze the effect of different datasets and represent frames as discrete tokens, and then model them by next token predictions.<\/p>\n\n\n\n<p>As a result, the obtained vision Transformer model is able to generate a subsequent video sequence of a query video clip, which is semantically aligned with the demonstration video. Demonstration videos are highly versatile and capable of conveying a wide range of information, such as examples for various tasks including moving or grabbing objects, or movements of the camera in an ego-centric video. This allows video in-context learning to address multiple downstream tasks, such as embodied planning and simulating, by letting a query robot imitate the actions demonstrated by other robots, as shown below.  <strong><em>Video in-context learning serves as a new and crucial interface for models to interact with the real world<\/em><\/strong>, as videos are good at describing low-level details (where language may fall short) and temporal dynamics (where images are insufficient).<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"592\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/figure2-1024x592.png\" alt=\"Model Architecture\" class=\"wp-image-1052997\" style=\"width:908px;height:auto\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/figure2-1024x592.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/figure2-300x174.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/figure2-768x444.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/figure2-1536x888.png 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/figure2-2048x1185.png 2048w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/figure2-240x139.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">The pipeline of Vid-ICL. Left: Training of Vid-ICL. The data used for training are continuous video clips and the Transformer is trained by next token prediction. Right: In-context Inference of Vid-ICL. The model is conditioned on demonstration videos and generates future frames.<\/figcaption><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"generated-samples-on-something-something-v2\">Generated Samples on Something-Something v2<\/h4>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Demonstration<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">DeLVM<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Vid-ICL (700M, pt)<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Vid-ICL (1.1B, pt)<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Vid-ICL (700M, ft)<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Vid-ICL (1.1B, ft)<\/p>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<div class=\"wp-block-group is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<figure class=\"wp-block-gallery has-nested-images columns-6 is-cropped wp-block-gallery-1 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053009\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/4-66839f45d9812.gif\" alt=\"gif placeholder\" class=\"wp-image-1053009\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053015\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-66839f9fc2758.gif\" alt=\"gif placeholder\" class=\"wp-image-1053015\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053018\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/11.gif\" alt=\"gif placeholder\" class=\"wp-image-1053018\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053021\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/11-6683a03123635.gif\" alt=\"gif placeholder\" class=\"wp-image-1053021\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053024\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/11-6683a043bc758.gif\" alt=\"gif placeholder\" class=\"wp-image-1053024\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053027\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/4-6683a056ae654.gif\" alt=\"gif placeholder\" class=\"wp-image-1053027\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053033\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/5.gif\" alt=\"gif placeholder\" class=\"wp-image-1053033\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053036\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/1.gif\" alt=\"gif placeholder\" class=\"wp-image-1053036\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053039\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/15.gif\" alt=\"gif placeholder\" class=\"wp-image-1053039\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053042\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/15-6683a2f531380.gif\" alt=\"gif placeholder\" class=\"wp-image-1053042\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053045\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/15-6683a3155d78d.gif\" alt=\"gif placeholder\" class=\"wp-image-1053045\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053048\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/5-6683a321daf6d.gif\" alt=\"gif placeholder\" class=\"wp-image-1053048\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053054\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/6.gif\" alt=\"gif placeholder\" class=\"wp-image-1053054\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053057\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/2.gif\" alt=\"gif placeholder\" class=\"wp-image-1053057\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053060\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/17.gif\" alt=\"gif placeholder\" class=\"wp-image-1053060\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053063\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/17-6683a3904eae6.gif\" alt=\"gif placeholder\" class=\"wp-image-1053063\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053069\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/17-6683a3e431613.gif\" alt=\"gif placeholder\" class=\"wp-image-1053069\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053072\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/6-6683a3ef3af43.gif\" alt=\"gif placeholder\" class=\"wp-image-1053072\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053075\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/7.gif\" alt=\"gif placeholder\" class=\"wp-image-1053075\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053078\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/3.gif\" alt=\"gif placeholder\" class=\"wp-image-1053078\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053081\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/19.gif\" alt=\"gif placeholder\" class=\"wp-image-1053081\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053084\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/19-6683a433d902c.gif\" alt=\"gif placeholder\" class=\"wp-image-1053084\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053087\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/19-6683a440dabd5.gif\" alt=\"gif placeholder\" class=\"wp-image-1053087\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053090\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/7-6683a44d84e2a.gif\" alt=\"gif placeholder\" class=\"wp-image-1053090\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053093\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/15-6683a467c30cb.gif\" alt=\"gif placeholder\" class=\"wp-image-1053093\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053096\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/12.gif\" alt=\"gif placeholder\" class=\"wp-image-1053096\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053099\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/110.gif\" alt=\"gif placeholder\" class=\"wp-image-1053099\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053102\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/110-6683a491efa62.gif\" alt=\"gif placeholder\" class=\"wp-image-1053102\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053105\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/110-6683a4a0e32aa.gif\" alt=\"gif placeholder\" class=\"wp-image-1053105\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053120\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/15-6683a54d15e55.gif\" alt=\"gif placeholder\" class=\"wp-image-1053120\" \/><\/figure>\n<\/figure>\n\n\n\n<p>Qualitative results on the generated samples by Vid-ICL on Something-Something v2 dataset, a dataset of real world human activities. . The first column shows the demonstration video. The second column shows the generated samples by DeLVM. The third to sixth columns show the generated samples by Vid-ICL with different pretraining and finetuning strategies. It shows that Vid-ICL generates more diverse and plausible samples than DeLVM, and the quality of the generated samples is improved by scaling model sizes and in-domain fine-tuning.<\/p>\n<\/div>\n<\/div>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"generated-samples-on-robotics-transformer-dataset\">Generated Samples on Robotics Transformer Dataset<\/h4>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Demonstration<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Vid-ICL (700M, pt)<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Vid-ICL (1.1B, pt)<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Vid-ICL (300M, ft)<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Vid-ICL (700M, ft)<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Vid-ICL (1.1B, ft)<\/p>\n<\/div>\n<\/div>\n\n\n\n<figure class=\"wp-block-gallery has-nested-images columns-6 is-cropped wp-block-gallery-2 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053135\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-6683a808d3bc0.gif\" alt=\"gif placeholder\" class=\"wp-image-1053135\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053138\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-6683a8180f7db.gif\" alt=\"gif placeholder\" class=\"wp-image-1053138\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053141\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-6683a825241a2.gif\" alt=\"gif placeholder\" class=\"wp-image-1053141\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053144\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-6683a83259a44.gif\" alt=\"gif placeholder\" class=\"wp-image-1053144\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053147\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-6683a83db69b1.gif\" alt=\"gif placeholder\" class=\"wp-image-1053147\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053150\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-6683a84d1869a.gif\" alt=\"gif placeholder\" class=\"wp-image-1053150\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053153\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/8.gif\" alt=\"gif placeholder\" class=\"wp-image-1053153\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053156\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/8-6683a89c68359.gif\" alt=\"gif placeholder\" class=\"wp-image-1053156\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053159\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/8-6683a8a6d6e8d.gif\" alt=\"gif placeholder\" class=\"wp-image-1053159\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053162\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/8-6683a8b1989ee.gif\" alt=\"gif placeholder\" class=\"wp-image-1053162\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053165\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/8-6683a8bc41ee6.gif\" alt=\"gif placeholder\" class=\"wp-image-1053165\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053168\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/8-6683a8c7620c6.gif\" alt=\"gif placeholder\" class=\"wp-image-1053168\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053171\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/24.gif\" alt=\"gif placeholder\" class=\"wp-image-1053171\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053174\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/24-6683a8e7e2880.gif\" alt=\"gif placeholder\" class=\"wp-image-1053174\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053177\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/24-6683a8f0d2429.gif\" alt=\"gif placeholder\" class=\"wp-image-1053177\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053180\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/24-6683a8fcd9a37.gif\" alt=\"gif placeholder\" class=\"wp-image-1053180\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053183\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/24-6683a907f3c8c.gif\" alt=\"gif placeholder\" class=\"wp-image-1053183\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053186\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/24-6683a9143829a.gif\" alt=\"gif placeholder\" class=\"wp-image-1053186\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053189\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/62.gif\" alt=\"gif placeholder\" class=\"wp-image-1053189\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053192\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/62-6683a93226921.gif\" alt=\"gif placeholder\" class=\"wp-image-1053192\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053195\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/62-6683a93ba233b.gif\" alt=\"gif placeholder\" class=\"wp-image-1053195\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053198\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/62-6683a946879bf.gif\" alt=\"gif placeholder\" class=\"wp-image-1053198\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053201\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/62-6683a9504596c.gif\" alt=\"gif placeholder\" class=\"wp-image-1053201\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053204\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/62-6683a95a4f32a.gif\" alt=\"gif placeholder\" class=\"wp-image-1053204\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053207\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/51.gif\" alt=\"gif placeholder\" class=\"wp-image-1053207\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053210\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/51-6683a9819d178.gif\" alt=\"gif placeholder\" class=\"wp-image-1053210\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053213\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/51-6683a98d8882b.gif\" alt=\"gif placeholder\" class=\"wp-image-1053213\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053216\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/51-6683a9971c096.gif\" alt=\"gif placeholder\" class=\"wp-image-1053216\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053219\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/51-6683a9a069983.gif\" alt=\"gif placeholder\" class=\"wp-image-1053219\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053222\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/51-6683a9a9ba409.gif\" alt=\"gif placeholder\" class=\"wp-image-1053222\" \/><\/figure>\n<\/figure>\n\n\n\n<p>Qualitative results on the generated samples by Vid-ICL on Robotics Transformer(RT-1) dataset, a dataset of robotic manipulation tasks. Vid-ICL offers more concise control over various robotic tasks with the demonstration video.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"contrastive-demonstration\">Contrastive Demonstration<\/h4>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Demonstration<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Generation<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Contrastive Demonstration<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Contrastive Generation<\/p>\n<\/div>\n<\/div>\n\n\n\n<figure class=\"wp-block-gallery aligncenter has-nested-images columns-4 is-cropped wp-block-gallery-3 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053234\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/1-6683abe085d38.gif\" alt=\"gif placeholder\" class=\"wp-image-1053234\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053237\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/1-6683abec3dd85.gif\" alt=\"gif placeholder\" class=\"wp-image-1053237\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053240\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/1-6683abfbb2bd7.gif\" alt=\"gif placeholder\" class=\"wp-image-1053240\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053243\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/1-6683ac07891a0.gif\" alt=\"gif placeholder\" class=\"wp-image-1053243\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053246\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/15-6683ac1b630fd.gif\" alt=\"gif placeholder\" class=\"wp-image-1053246\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053249\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/15-6683ac267f930.gif\" alt=\"gif placeholder\" class=\"wp-image-1053249\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053252\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/15-6683ac308fafb.gif\" alt=\"gif placeholder\" class=\"wp-image-1053252\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053255\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/15-6683ac3d265b2.gif\" alt=\"gif placeholder\" class=\"wp-image-1053255\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053258\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/2-6683ac5209723.gif\" alt=\"gif placeholder\" class=\"wp-image-1053258\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053261\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/2-6683ac5f2bbee.gif\" alt=\"gif placeholder\" class=\"wp-image-1053261\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053264\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/2-6683ac6aa2947.gif\" alt=\"gif placeholder\" class=\"wp-image-1053264\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053267\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/2-6683ac761ae3b.gif\" alt=\"gif placeholder\" class=\"wp-image-1053267\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053270\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/4-6683ac832df72.gif\" alt=\"gif placeholder\" class=\"wp-image-1053270\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053273\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/4-6683ac90605e5.gif\" alt=\"gif placeholder\" class=\"wp-image-1053273\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053276\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/4-6683ac9aef5fe.gif\" alt=\"gif placeholder\" class=\"wp-image-1053276\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053279\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/4-6683aca67f7cf.gif\" alt=\"gif placeholder\" class=\"wp-image-1053279\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053282\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/11-6683acb98695e.gif\" alt=\"gif placeholder\" class=\"wp-image-1053282\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053285\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/11-6683acc4ea7fd.gif\" alt=\"gif placeholder\" class=\"wp-image-1053285\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053288\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/11-6683acd0ebc70.gif\" alt=\"gif placeholder\" class=\"wp-image-1053288\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053291\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/11-6683acde3c986.gif\" alt=\"gif placeholder\" class=\"wp-image-1053291\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053294\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/13.gif\" alt=\"gif placeholder\" class=\"wp-image-1053294\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053297\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/13-6683ad0093b3c.gif\" alt=\"gif placeholder\" class=\"wp-image-1053297\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053300\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/13-6683ad1314c06.gif\" alt=\"gif placeholder\" class=\"wp-image-1053300\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053303\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/13-6683ad2d660bb.gif\" alt=\"gif placeholder\" class=\"wp-image-1053303\" \/><\/figure>\n<\/figure>\n\n\n\n<p>In this section, we give shared prompts to Vid-ICL in the second and fourth columns of each row, and a contrastive prompt in the first and third columns. Results shows an important feature of Vid-ICL that when conditioned on demonstration with contrastive semantics, Vid-ICL generates contrary samples given the same demonstration, which demonstrates Vid-ICL&#8217;s ability to&nbsp;<strong>precisely understand the demonstration dynamics and generate diverse future sequences.<\/strong><\/p>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"text-conditioned-generation\">Text Conditioned Generation<\/h4>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Text<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Demonstration<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Generation w\/o Text<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Generation w\/ Text<\/p>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:380px\">\n<p class=\"has-text-align-center\">Push [something] from left to right<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-gallery aligncenter has-nested-images columns-default is-cropped wp-block-gallery-4 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053321\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/8-6683b22db315f.gif\" alt=\"gif placeholder\" class=\"wp-image-1053321\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053324\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/20.gif\" alt=\"gif placeholder\" class=\"wp-image-1053324\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053327\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/20-6683b2471482d.gif\" alt=\"gif placeholder\" class=\"wp-image-1053327\" \/><\/figure>\n<\/figure>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:380px\">\n<p class=\"has-text-align-center\">Turning the camera left while filming [something]<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-gallery aligncenter has-nested-images columns-default is-cropped wp-block-gallery-5 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053342\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-6683b32f3b405.gif\" alt=\"gif placeholder\" class=\"wp-image-1053342\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053345\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-6683b33bed1be.gif\" alt=\"gif placeholder\" class=\"wp-image-1053345\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053348\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-6683b3451d71f.gif\" alt=\"gif placeholder\" class=\"wp-image-1053348\" \/><\/figure>\n<\/figure>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:380px\">\n<p class=\"has-text-align-center\">Moving [something] closer to [something]<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-gallery aligncenter has-nested-images columns-default is-cropped wp-block-gallery-6 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053351\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/76.gif\" alt=\"gif placeholder\" class=\"wp-image-1053351\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053354\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/76-6683b381b4f29.gif\" alt=\"gif placeholder\" class=\"wp-image-1053354\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053357\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/76-6683b389a3f7d.gif\" alt=\"gif placeholder\" class=\"wp-image-1053357\" \/><\/figure>\n<\/figure>\n<\/div>\n<\/div>\n\n\n\n<p>We show that text can be used as an additional in-context condition signals to augment the generation of Vid-ICL. After aligning Vid-ICL to the text, the generated samples are more consistent with the text instructions and demonstration videos.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"on-reinforcement-learning-tasks\">On Reinforcement Learning Tasks<\/h4>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"335\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/robodesk-6683b44fc2704-1024x335.png\" alt=\"gif placeholder\" class=\"wp-image-1053369\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/robodesk-6683b44fc2704-1024x335.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/robodesk-6683b44fc2704-300x98.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/robodesk-6683b44fc2704-768x251.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/robodesk-6683b44fc2704-1536x503.png 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/robodesk-6683b44fc2704-2048x670.png 2048w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/robodesk-6683b44fc2704-240x79.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>We demonstrate that Vid-ICL can also function as a simulator in reinforcement learning tasks by evaluating it on the RoboDesk, a test benchmark containing several reinforcement learning tasks. Vid-ICL generates future frames with videos that accomplishing the same task as the demonstration, where the generated frames inversely reflect the corresponding actions that correctly interact with the environment. Evaluating on the <em><strong>Push_red<\/strong><\/em> task, we find that Vid-ICL provides more precise control over the environment interaction.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"ethics-statement-1\">Ethics Statement<\/h4>\n\n\n\n<p>Vid-ICL is exclusively a research initiative with no current plans for product integration or public access. We are committed to adhering to Microsoft AI principles during the ongoing development of our models. The datasets utilized in this study are publicly available and have been thoroughly reviewed to ensure they do not include personally identifiable information or offensive content. Nonetheless, as these datasets are sourced from the Internet, there may still be inherent biases. To address this, we have implemented a rigorous filtering process on the training data to minimize the potential for the model to generate inappropriate content.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"citation\">Citation<\/h4>\n\n\n\n<p>@article{zhang2024video,<br>  title={Video In-Context Learning},<br>  author={Zhang, Wentao and Guo, Junliang and He, Tianyu and Zhao, Li and Xu, Linli and Bian, Jiang},<br>  journal={arXiv preprint arXiv:2407.07356},<br>  year={2024}<br>}<\/p>\n\n\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<iframe loading=\"lazy\" title=\"Video Example\" width=\"300\" height=\"300\" src=\"https:\/\/c3vv.github.io\/C3V\/ours_result\/1.mp4\" frameborder=\"0\" scrolling=\"auto\"><\/iframe>\nIn a Magician&#8217;s magical cabin alone in a serene forest, an alien walking on the floor, starting from the cabin&#8217;s door to the mow near the bottom right corner of this image.\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<iframe loading=\"lazy\" title=\"Video Example\" width=\"300\" height=\"300\" src=\"https:\/\/c3vv.github.io\/C3V\/ours_result\/2.mp4\" frameborder=\"0\" scrolling=\"auto\"><\/iframe>\nFour characters stood on the stage. In front of the stage, a man and a woman are performing China Kung Fu and dancing respectively. On the right side of the stage, a skeleton man is dancing, and behind them, a clown is performing.\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<iframe loading=\"lazy\" title=\"Video Example\" width=\"300\" height=\"300\" src=\"https:\/\/c3vv.github.io\/C3V\/ours_result\/6.mp4\" frameborder=\"0\" scrolling=\"auto\"><\/iframe>\nIn a long anime-style road with anime-blocks and little anime-grass, anime-houses and anime-tree on the side of the anime-style road, an alchemist is walking while swinging his arms.\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<iframe loading=\"lazy\" title=\"Video Example\" width=\"300\" height=\"300\" src=\"https:\/\/c3vv.github.io\/C3V\/ours_result\/5.mp4\" frameborder=\"0\" scrolling=\"auto\"><\/iframe>\nIn an idyllic park with vibrant flowers, lush greenery, and a serene lake, a person with a curly beard and a black hoodie is walking with touching  on tiptoe.\n<\/div>\n<\/div>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"abstract\">Abstract<\/h4>\n\n\n\n<p>Significant progress has been made in text-to-video generation through the use of powerful generative models and large-scale internet data. However, substantial challenges remain in precisely controlling individual concepts within the generated video, such as the motion and appearance of specific characters and the movement of viewpoints. In this work, we propose a novel paradigm that generates each concept in 3D representation separately and then composes them with priors from Large Language Models (LLM) and 2D diffusion models. Specifically, given an input textual prompt, our scheme consists of three stages: 1) We leverage LLM as the director to first decompose the complex query into several sub-prompts that indicate individual concepts within the video (e.g., scene, objects, motions), then we let LLM to invoke pre-trained expert models to obtain corresponding 3D representations of concepts. 2) To compose these representations, we prompt multi-modal LLM to produce coarse guidance on the scales and coordinates of trajectories for the objects. 3) To make the generated frames adhere to natural image distribution, we further leverage 2D diffusion priors and use Score Distillation Sampling to refine the composition. Extensive experiments demonstrate that our method can generate high-fidelity videos from text with diverse motion and flexible control over each concept.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"method\">Method<\/h4>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1120\" height=\"720\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/framework.png\" alt=\"graphical user interface, website\" class=\"wp-image-1079262\" style=\"width:828px;height:auto\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/framework.png 1120w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/framework-300x193.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/framework-1024x658.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/framework-768x494.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/framework-240x154.png 240w\" sizes=\"auto, (max-width: 1120px) 100vw, 1120px\" \/><\/figure>\n\n\n\n<p>Our method consists of three stages: 1) The input textual prompt is decomposed into individual concepts by the LLM. Then we generate each concept in the form of 3D with the corresponding pre-trained expert model. 2) We leverage knowledge in multi-modal LLM to estimate the 2D trajectory of objects step-by-step. 3) After lifting the estimated 2D trajectory into 3D as initialization, we refine the scales, locations, and rotations of objects within the 3D scene using 2D diffusion priors.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"extension-object-editing\">Extension: Object Editing<\/h4>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<iframe loading=\"lazy\" title=\"Video Example\" width=\"300\" height=\"300\" src=\"https:\/\/c3vv.github.io\/C3V\/ours_editing_object\/appearedit1_video.mp4\" frameborder=\"0\" scrolling=\"auto\"><\/iframe>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<iframe loading=\"lazy\" title=\"Video Example\" width=\"300\" height=\"300\" src=\"https:\/\/c3vv.github.io\/C3V\/ours_editing_object\/appearedit2_video.mp4\" frameborder=\"0\" scrolling=\"auto\"><\/iframe>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<iframe loading=\"lazy\" title=\"Video Example\" width=\"300\" height=\"300\" src=\"https:\/\/c3vv.github.io\/C3V\/ours_editing_object\/appearedit3_video.mp4\" frameborder=\"0\" scrolling=\"auto\"><\/iframe>\n<\/div>\n<\/div>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"extension-motion-editing\">Extension: Motion Editing<\/h4>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<iframe loading=\"lazy\" title=\"Video Example\" width=\"300\" height=\"300\" src=\"https:\/\/c3vv.github.io\/C3V\/ours_editing_motion\/motionedit1_video.mp4\" frameborder=\"0\" scrolling=\"auto\"><\/iframe>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<iframe loading=\"lazy\" title=\"Video Example\" width=\"300\" height=\"300\" src=\"https:\/\/c3vv.github.io\/C3V\/ours_editing_motion\/motionedit2_video.mp4\" frameborder=\"0\" scrolling=\"auto\"><\/iframe>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<iframe loading=\"lazy\" title=\"Video Example\" width=\"300\" height=\"300\" src=\"https:\/\/c3vv.github.io\/C3V\/ours_editing_motion\/motionedit3_video.mp4\" frameborder=\"0\" scrolling=\"auto\"><\/iframe>\n<\/div>\n<\/div>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"extension-scene-editing\">Extension: Scene Editing<\/h4>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<iframe loading=\"lazy\" title=\"Video Example\" width=\"300\" height=\"300\" src=\"https:\/\/c3vv.github.io\/C3V\/ours_editing_bg\/sceneedit1_video.mp4\" frameborder=\"0\" scrolling=\"auto\"><\/iframe>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<iframe loading=\"lazy\" title=\"Video Example\" width=\"300\" height=\"300\" src=\"https:\/\/c3vv.github.io\/C3V\/ours_editing_bg\/sceneedit2_video.mp4\" frameborder=\"0\" scrolling=\"auto\"><\/iframe>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<iframe loading=\"lazy\" title=\"Video Example\" width=\"300\" height=\"300\" src=\"https:\/\/c3vv.github.io\/C3V\/ours_editing_bg\/sceneedit3_video.mp4\" frameborder=\"0\" scrolling=\"auto\"><\/iframe>\n<\/div>\n<\/div>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"comparisons\">Comparisons<\/h4>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"565\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/1-1024x565.png\" alt=\"graphical user interface, website\" class=\"wp-image-1079415\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/1-1024x565.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/1-300x165.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/1-768x424.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/1-240x132.png 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/1.png 1280w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">In a Magician&#8217;s magical cabin alone in a serene forest, an alien is walking on the floor, starting from the cabin&#8217;s door to the mow near the bottom right corner of this image.<\/figcaption><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"552\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/2-66c872f94f0a5-1024x552.png\" alt=\"graphical user interface, website\" class=\"wp-image-1079418\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/2-66c872f94f0a5-1024x552.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/2-66c872f94f0a5-300x162.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/2-66c872f94f0a5-768x414.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/2-66c872f94f0a5-240x129.png 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/08\/2-66c872f94f0a5.png 1280w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Four characters stood on the stage. In front of the stage, a man and a woman are performing Kung Fu and dancing respectively. On the right side of the stage, a skeleton man is dancing, and behind them, a clown is performing.<\/figcaption><\/figure>\n<\/div>\n<\/div>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"ethics-statement\">Ethics Statement<\/h4>\n\n\n\n<p>C3V&nbsp;is exclusively a research initiative with no current plans for product integration or public access. We are committed to adhering to Microsoft AI principles during the ongoing development of our models. The model is trained on AI-generated content, which has been thoroughly reviewed to ensure that they do not include personally identifiable information or offensive content. Nonetheless, as these generated data are sourced from the Internet, there may still be inherent biases. To address this, we have implemented a rigorous filtering process on the data to minimize the potential for the model to generate inappropriate content.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"acknowledgments\">Acknowledgments<\/h4>\n\n\n\n<p>Our work is based on the following excellent works: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/luciddreamer-cvlab.github.io\/\">LucidDreamer: Domain-free Generation of 3D Gaussian Splatting Scenes<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/alvinliu0.github.io\/projects\/HumanGaussian\">HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/motion-x-dataset.github.io\/\">Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"bibtex\">BibTeX<\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>@article{zhu2024compositional,\n  title={Compositional 3D-aware Video Generation with LLM Director},\n  author={Zhu, Hanxin and He, Tianyu and Tang, Anni and Guo, Junliang and Chen, Zhibo and Bian, Jiang},\n  journal={Advances in Neural Information Processing Systems},\n  year={2024}\n}<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<p>This work is accomplished in Microsoft, April 2024.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n\n\n<h2 class=\"wp-block-heading has-text-align-center\" id=\"fast-autoregressive-video-generation-with-diagonal-decoding\">Fast Autoregressive Video Generation with Diagonal Decoding<\/h2>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/arxiv.org\/pdf\/2503.14070\">Paper<\/a><\/div>\n\n\n\n<div class=\"wp-block-button\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\">Code (Coming Soon)<\/a><\/div>\n<\/div>\n\n\n\n<p>Autoregressive Transformer models have demonstrated impressive performance in video generation, but their sequential token-by-token decoding process poses a major bottleneck, particularly for long videos represented by tens of thousands of tokens. Here, we propose <strong>Diagonal Decoding (DiagD)<\/strong>, a training-free inference acceleration algorithm for autoregressively pre-trained models that exploits spatial and temporal correlations in videos. Our method generates tokens along diagonal paths in the spatial-temporal token grid, enabling parallel decoding within each frame as well as partially overlapping across consecutive frames. The proposed algorithm is versatile and adaptive to various generative models and tasks, while providing flexible control over the trade-off between inference speed and visual quality. Experiments on multiple autoregressive video generation models and datasets demonstrate that DiagD achieves up to <strong>10x speedup compared to naive sequential decoding<\/strong>, while maintaining comparable visual fidelity.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"865\" height=\"502\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/front-1.png\" alt=\"Comparisons between naive Next-Token Prediction (NTP) and Diagonal Decoding (DiagD) on Cosmos autoregressive models. \" class=\"wp-image-1134475\" style=\"width:596px;height:auto\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/front-1.png 865w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/front-1-300x174.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/front-1-768x446.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/front-1-480x280.png 480w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/front-1-240x139.png 240w\" sizes=\"auto, (max-width: 865px) 100vw, 865px\" \/><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"method\">Method<\/h4>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"472\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/method-1024x472.png\" alt=\"Illustration of Diagonal Decoding\" class=\"wp-image-1134476\" style=\"width:779px;height:auto\" srcset=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/method-1024x472.png 1024w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/method-300x138.png 300w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/method-768x354.png 768w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/method-1536x708.png 1536w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/method-240x111.png 240w, https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/method.png 1629w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The motivation of our method arises from intuitive observations on consecutive frames in a video, which can be summarized in two key insights. As shown above, the first insight is that patches exhibit stronger correlations with their spatial neighbors than with sequential ones. Secondly, due to the temporal redundancy of videos, patches from consecutive frames that occupy similar relative positions are highly likely to be similar to each other. <\/p>\n\n\n\n<p>As a result, we propose Diagonal Decoding, an iterative algorithm that generates tokens along diagonal paths in the spatial-temporal token grid. Spatially, within each frame, tokens along the same diagonal are generated in parallel, leveraging the strong local dependencies between neighboring patches. And temporally, by stacking frames together, our method generates the top-left tokens of the next frame before completing the current frame, as these tokens are less likely to depend on the bottom-right tokens that have not yet been generated.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"case-cosmos-12b-autoregressive-model\">Case: Cosmos 12B Autoregressive Model<\/h4>\n\n\n\n<div class=\"wp-block-columns are-vertically-aligned-center is-not-stacked-on-mobile is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Next Token Prediction<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">DiagD (k=1)<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">DiagD (k=2)<\/p>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-not-stacked-on-mobile is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"500\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/buYzC9In_lM_415.gif\" alt=\"NTP Case 1\" class=\"wp-image-1134477\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"500\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/buYzC9In_lM_415-1.gif\" alt=\"Diagd k=1 Case 1\" class=\"wp-image-1134478\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"500\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/buYzC9In_lM_415-2.gif\" alt=\"Diagd k=2 Case 1\" class=\"wp-image-1134479\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-not-stacked-on-mobile is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"500\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/TFYdFphVyZw_1341-4.gif\" alt=\"NTP Case 2\" class=\"wp-image-1134485\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"500\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/TFYdFphVyZw_1341-3.gif\" alt=\"Diagd k=1 Case 2\" class=\"wp-image-1134484\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"500\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/TFYdFphVyZw_1341-2.gif\" alt=\"Diagd k=2 Case 2\" class=\"wp-image-1134483\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-not-stacked-on-mobile is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"500\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/twV1rn5uaI_5745.gif\" alt=\"NTP Case 3\" class=\"wp-image-1134486\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"500\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/twV1rn5uaI_5745-2.gif\" alt=\"Diagd k=1 Case 3\" class=\"wp-image-1134488\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"500\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/twV1rn5uaI_5745-1.gif\" alt=\"Diagd k=2 Case 3\" class=\"wp-image-1134487\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"case-wham-1-6b-model\">Case: WHAM 1.6B Model<\/h4>\n\n\n\n<div class=\"wp-block-columns are-vertically-aligned-center is-not-stacked-on-mobile is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Next Token Prediction<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">DiagD (k=1)<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">DiagD (k=2)<\/p>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-not-stacked-on-mobile is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"180\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/FBD3991.gif\" alt=\"NTP Case 1\" class=\"wp-image-1134493\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"180\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/FBD39991B0AE4A7AB3474ED102EEDCC5779D665B4619_250Movie.pb_step-1750_tick-14162_video-ezgif.com-video-to-gif-converter.gif\" alt=\"Diagd k=1 Case 1\" class=\"wp-image-1134489\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"180\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/FBD39991B0AE4A7AB3474ED102EEDCC5779D665B4619_250Movie.pb_step-1750_tick-14162_video-ezgif.com-video-to-gif-converter-1.gif\" alt=\"Diagd k=2 Case 1\" class=\"wp-image-1134491\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"180\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/FBE4D21.gif\" alt=\"NTP Case 2\" class=\"wp-image-1134494\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"180\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/FBE4D26593B2E7DF33F742A95AC06D4CA3AF7A7AFC3B_71Movie.pb_step-16_tick-4591_video-ezgif.com-video-to-gif-converter.gif\" alt=\"Diagd k=1 Case 2\" class=\"wp-image-1134490\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"180\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/FBE4D26593B2E7DF33F742A95AC06D4CA3AF7A7AFC3B_71Movie.pb_step-16_tick-4591_video-ezgif.com-video-to-gif-converter-1.gif\" alt=\"Diagd k=2 Case 2\" class=\"wp-image-1134492\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"case-autoregressive-model-on-minecraft\">Case: Autoregressive Model on Minecraft<\/h4>\n\n\n\n<div class=\"wp-block-columns are-vertically-aligned-center is-not-stacked-on-mobile is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Next Token Prediction<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">DiagD w\/o Finetune<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">DiagD w\/ Finetune<\/p>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-not-stacked-on-mobile is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"384\" height=\"224\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/clip_58-ezgif.com-video-to-gif-converter-3.gif\" alt=\"a screenshot of a video game\" class=\"wp-image-1134501\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"384\" height=\"224\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/clip_58-ezgif.com-video-to-gif-converter.gif\" alt=\"a screenshot of a video game\" class=\"wp-image-1134495\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"384\" height=\"224\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/clip_58-ezgif.com-video-to-gif-converter-2.gif\" alt=\"a screenshot of a video game\" class=\"wp-image-1134499\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-not-stacked-on-mobile is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"384\" height=\"224\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/clip_198-ezgif.com-video-to-gif-converter-3.gif\" alt=\"a screenshot of a video game\" class=\"wp-image-1134502\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"384\" height=\"224\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/clip_198-ezgif.com-video-to-gif-converter.gif\" alt=\"a screenshot of a video game\" class=\"wp-image-1134496\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"384\" height=\"224\" src=\"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-content\/uploads\/2024\/07\/clip_198-ezgif.com-video-to-gif-converter-2.gif\" alt=\"a screenshot of a video game\" class=\"wp-image-1134500\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"bibtex-1\">BibTex<\/h4>\n\n\n\n<p>@article{ye2025fast,<br>      title={Fast Autoregressive Video Generation with Diagonal Decoding}, <br>      author={Ye, Yang and Guo, Junliang and Wu, Haoyu and He, Tianyu and Pearce, Tim and Rashid, Tabish and Hofmann, Katja and Bian, Jiang},<br>      journal={arXiv preprint arXiv:2503.14070},<br>      year={2025},<br>}<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n","protected":false},"excerpt":{"rendered":"<p>Learning to simulate the visual world from large-scale videos. In-context learning for vision data has been underexplored compared with that in natural language. Previous works studied image in-context learning, urging models to generate a single image guided by demonstrations. In this project, we propose and study video in-context learning, where the model starts from an [&hellip;]<\/p>\n","protected":false},"featured_media":1079220,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556,13562],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-992553","msr-project","type-msr-project","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-research-area-computer-vision","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[],"related-downloads":[],"related-videos":[],"related-groups":[269241],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"slides":[],"related-researchers":[],"msr_research_lab":[],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/992553","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":66,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/992553\/revisions"}],"predecessor-version":[{"id":1140298,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/992553\/revisions\/1140298"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media\/1079220"}],"wp:attachment":[{"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/media?parent=992553"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=992553"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=992553"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=992553"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/cm-edgetun.pages.dev\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=992553"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}