RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

Yaoqi Chen; Jin-Kang Zhang; Baotong Lu; Qianxi Zhang; Cheng Zhang; Jing Liu; Jingjia Luo; Di Liu; Huiqiang Jiang; Qi Chen; Bailu Ding; Xiao Yan; Jiawei Jiang; Chen Chen; Mingxing Zhang; Cheng Li; Yuqing Yang; Fan Yang; Mao Yang

VLDB Endowment | May 2025, Vol 19(5)

Recent large language models (LLMs) are rapidly extending their context windows, yet inference throughput lags because of growing GPU memory and bandwidth demands. The cause is the key-value (KV) cache, an intermediate structure storing token representations: it grows linearly with context length and must be scanned linearly at every generation step to compute attention. A promising direction for accelerating long-context inference is to exploit attention's inherent sparsity by offloading the KV cache to CPU memory and retrieving only the small subset of tokens important to the current generation step. However, prior sparse attention approaches struggle to balance accuracy and retrieval cost due to varying sparsity patterns and inefficient GPU-CPU memory management. We present RetroInfer, a vector storage engine that realizes a sparsity-based KV cache for long-context inference. RetroInfer introduces an Attention-aWare VEctor index (wave index), which fundamentally improves the tradeoff between attention accuracy and retrieval cost through tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering. We also design the wave buffer, a GPU-CPU buffer manager that assigns computation and manages data across heterogeneous hardware. We evaluate RetroInfer across a range of models and workloads, demonstrating up to 4.4× decoding throughput over full attention at 120K context and up to 12.2× over sparse attention baselines at 1 million tokens, all while preserving full-attention-level accuracy.
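
To make the two premises of the abstract concrete, the following is a minimal sketch in plain Python. It is not RetroInfer's wave index or wave buffer. `kv_cache_bytes` applies the standard size formula for a KV cache, and `attention` compares full attention with a naive top-k sparse variant. The model shape used in the usage note (32 layers, 8 KV heads, head dimension 128, 16-bit elements) and all function names are assumptions chosen for illustration.

```python
import math


def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size for one sequence.

    The leading factor 2 accounts for storing both keys and values
    for every layer, KV head, and token.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values, top_k=None):
    """Single-head attention output for one query vector.

    With top_k=None every token is used (full attention). With top_k set,
    only the top_k highest-scoring tokens contribute. Note that this naive
    selection still scores every key; it illustrates sparsity only, not a
    way to avoid the linear scan.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    selected = list(range(len(keys)))
    if top_k is not None:
        selected = sorted(selected, key=lambda i: scores[i], reverse=True)[:top_k]
    weights = softmax([scores[i] for i in selected])
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, selected))
        for d in range(dim)
    ]
```

Under the assumed model shape, `kv_cache_bytes(1, 32, 8, 128)` is 131,072 bytes (128 KiB) per token, so a 120,000-token context needs about 14.6 GiB and a 1,000,000-token context about 122 GiB for a single sequence. When the attention scores are sharply peaked, `attention(..., top_k=1)` returns nearly the same vector as full attention, which is the sparsity that retrieval-based approaches rely on.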