Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

Yikai Zheng; Xin Ding; Yifan Yang; Shiqi Jiang; Hao Wu; Qianxi Zhang; Weijun Wang; Ting Cao; Yunxin Liu

March 2026

arXiv

Recent advances in Streaming Video Understanding have enabled a new interaction paradigm in which models respond proactively to user queries. Current proactive VideoLLMs make a triggering decision at every frame, which creates an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating Em-Garde as an effective solution for proactive video understanding under strict computational constraints.
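
The propose-match idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cosine`, `first_trigger`), the cosine-similarity score, the fixed threshold of 0.8, and the toy 3-dimensional vectors are all assumptions made for illustration. The sketch assumes that proposals have already been embedded once at query time and that each incoming frame is embedded by some encoder, so the per-frame work reduces to a cheap similarity check.

```python
import math
from typing import Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def first_trigger(
    frame_embeddings: Sequence[Vector],
    proposal_embeddings: Sequence[Vector],
    threshold: float = 0.8,  # hypothetical value, not taken from the paper
) -> Optional[Tuple[int, int]]:
    """Return (frame_index, proposal_index) for the first frame whose best
    proposal score reaches the threshold, or None if no frame matches."""
    if not proposal_embeddings:
        return None
    for t, frame in enumerate(frame_embeddings):
        scores = [cosine(frame, p) for p in proposal_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            return t, best
    return None


# Toy embeddings standing in for real encoder outputs (assumed for illustration).
proposals = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
stream = [[0.1, 0.1, 1.0], [0.5, 0.5, 0.5], [0.1, 0.95, 0.1]]
result = first_trigger(stream, proposals)
```

In this toy stream only the third frame is close enough to a proposal, so `result` is `(2, 1)`. In a sketch like this, the expensive language-level reasoning happens once per query, while the streaming loop performs only vector comparisons.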