About
I am Yifan Yang, a Senior Research SDE at Microsoft Research Asia (MSRA), Shanghai, where I joined in 2021. My research focuses on visual content generation, multimodal foundation models, and general-purpose agentic systems, with a particular emphasis on bridging research innovation and real-world deployment. I have published over 30 peer-reviewed papers in top-tier venues, including CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, and AAAI, and have served as an Area Chair for leading conferences such as NeurIPS, ICML, and ICLR. I have been deeply involved in the development of Microsoft’s Phi model family, including Phi-3 and Phi-4, with several of my techniques successfully transferred into core Microsoft products, including Office and Azure. My recent work, LLM2CLIP, enhances cross-modal representation learning by leveraging large language models. It has been integrated into the Phi-4-mini pretraining pipeline and was recognized with the AAAI 2026 Outstanding Paper Award. My recent research also explores multimodal agents, commercial visual content generation, text-to-audio-video generation, and structured multimodal reasoning.
If you are interested in internship opportunities or research collaborations, feel free to reach out at
📧 yifanyang@microsoft.com
First-author and Corresponding-author Publications
(* denotes co-first author, † denotes corresponding author)
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-Modality Representation
Weiquan Huang, Aoqi Wu, Yifan Yang†, et al.
AAAI 2026, Outstanding Paper Award 🏆
- World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
Weijie Wang, Xiaoxuan He, Youping Gu, Yifan Yang†, et al.
ICML 2026
- AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation
Ziwei Zhou, Zeyuan Lai, Rui Wang, Yifan Yang†, et al.
ICML 2026
- Video-in-the-loop: Span-grounded Long Video QA with Interleaved Reasoning
Chendong Wang, Donglin Bai, Yifan Yang†, et al.
ICML 2026
- A Large Language Model Powered Integrated Circuit Footprint Geometry Understanding
Yida Wang, Taiting Lu, Runze Liu, Lanqing Yang, Yifan Yang†, et al.
ICML 2026
- HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models
Ziqin Zhou, Yifan Yang†, et al.
AAAI 2026
- VidGuard-R1: AI-generated Video Detection and Explanation via Reasoning Multimodal Language Models and Reinforcement Learning
Kyoungjun Park, Yifan Yang†, et al.
ICLR 2026
- Region-adaptive Sampling for Diffusion Transformers
Ziming Liu*, Yifan Yang*, et al.
CVPR 2026
- Diffusion²: Turning 3D Environments into Radio Frequency Heatmaps
Kyoungjun Park, Yifan Yang†, et al.
CVPR 2026 Findings
- Zoomer: Adaptive Image Focus Optimization for Black-box Multimodal Large Language Models
Jiaxu Qian, Chendong Wang, Yifan Yang†, et al.
Transactions on Machine Learning Research (TMLR), 2025
- VoLUT: Efficient Volumetric Streaming Enhanced by LUT-based Super-resolution
Chendong Wang, Anlan Zhang, Yifan Yang†, et al.
MLSys 2025
- LoRaSC: Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning
Siwei Li, Yifan Yang†*, et al.
EMNLP 2024 Findings
- VIGOR: Reviving Cloud Gaming Sessions
Zhaoyuan He, Yifan Yang*, et al.
ACM CoNEXT 2024
- Nerve: Real-time Neural Video Recovery and Enhancement on Mobile Devices
Zhaoyuan He, Yifan Yang*, et al.
Proceedings of the ACM on Networking (CoNEXT), 2024
- Attentive Mask CLIP
Yifan Yang, et al.
ICCV 2023
- ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation
Yasheng Sun*, Yifan Yang*, et al.
NeurIPS 2023
- Directional Self-supervised Learning for Heavy Image Augmentations
Yalong Bai*, Yifan Yang*, et al.
CVPR 2021
Technical Reports and Major Corresponding-author Preprints
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Marah Abdin, Sam Ade Jacobs, et al., Yifan Yang
arXiv Technical Report, 2024
- Phi-4-mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Abdelrahman Abouelenin, Atabak Ashfaq, et al., Yifan Yang
arXiv Technical Report, 2025
- ReasonGen-R1: Chain-of-Thought for Autoregressive Image Generation Models through Supervised Fine-tuning and Reinforcement Learning
Yu Zhang, Yunqi Li, Yifan Yang†, et al.
arXiv:2505.24875, under review at ECCV 2026
- MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
Yan Li, Zezi Zeng, Yifan Yang†, et al.
arXiv preprint arXiv:2604.15309, 2026
- BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation
Yan Li, Zezi Zeng, Ziwei Zhou, Xin Gao, Muzhao Tian, Yifan Yang†, et al.
arXiv preprint arXiv:2603.25732, 2026
- OmniSch: A Multimodal PCB Schematic Benchmark for Structured Diagram Visual Reasoning
Taiting Lu, Kaiyuan Lin, Yuxin Tian, Yubo Wang, Muchuan Wang, Yifan Yang†, et al.
arXiv preprint arXiv:2604.00270, 2026
Collaborative Publications
- RE-TRAC: REcursive TRAjectory Compression for Deep Search Agents
Jialiang Zhu, Gongrui Zhang, Xiaolong Ma, Lin Xu, Miaosen Zhang, Ruiqi Yang, Song Wang, Kai Qiu, Zhirong Wu, Qi Dai, Ruichun Ma, Bei Liu, Yifan Yang, et al.
ICML 2026
- AdaNav: Adaptive Reasoning with Uncertainty for Vision-Language Navigation
Xin Ding, Jianyu Wei, Yifan Yang, et al.
ICML 2026
- AMID: Model-Agnostic Dataset Distillation by Adversarial Mutual Information Minimization
Aoqi Wu, Junming Liu, Yuwei Zhang, Weiquan Huang, Liang Hu, Yifan Yang, et al.
Proceedings of the ACM Web Conference 2026
- A Comprehensive Ecosystem for Open-Domain Customized Video Generation
Jingxu Zhang, Yuqian Hong, Daneul Kim, Kai Qiu, Qi Dai, Jianmin Bao, Yifan Yang, et al.
ICASSP 2026
- Unified Medical Image Pre-training in Language-Guided Common Semantic Space
Xiaoxuan He, Yifan Yang, et al.
ECCV 2024
- StreamMind: Unlocking Full Frame-rate Streaming Video Dialogue through Event-gated Cognition
Xin Ding, Hao Wu, Yifan Yang, et al.
ICCV 2025
- Efficient and Adaptive Diffusion Model Inference through Lookup Tables on Mobile Devices
Qipeng Wang, Shiqi Jiang, Yifan Yang, et al.
IEEE Transactions on Mobile Computing, 2025
- Online Video Quality Enhancement with Spatial-Temporal Look-up Tables
Zefan Qu, Xinyang Jiang, Yifan Yang, et al.
ECCV 2025
- ProLongVid: A Simple but Strong Baseline for Long-context Video Instruction Tuning
Rui Wang, Bohao Li, Yifan Yang, et al.
EMNLP 2025
- MageBench: Bridging Large Multimodal Models to Agents
Miaosen Zhang, Qi Dai, Yifan Yang, et al.
WACV 2025
- Reducio! Generating 1K Video within 16 Seconds Using Extremely Compressed Motion Latents
Rui Tian, Qi Dai, Yifan Yang, et al.
ICCV 2025
- Expand Heterogeneous Learning Systems with Selective Multi-Source Knowledge Fusion
Gengyuan Dai, Hongxu Xu, Yifan Yang, et al.
AAAI 2026
- Empowering Agentic Video Analytics Systems with Video Language Models
Yuxuan Yan, Shiqi Jiang, Yifan Yang, et al.
USENIX NSDI 2025
- DreamDistribution: Learning Prompt Distribution for Diverse In-distribution Generation
Brian Nlong Zhao, Yifan Yang, et al.
ICLR 2025
- Understanding and Improving Training-free Loss-based Diffusion Guidance
Yifei Shen, Xinyang Jiang, Yifan Yang, et al.
NeurIPS 2024
- Online Video Super-resolution with Convolutional Kernel Bypass Grafts
Jun Xiao, Xinyang Jiang, Yifan Yang, et al.
IEEE Transactions on Multimedia, 2023