RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

Yaoqi Chen; Jin-Kang Zhang; Baotong Lu; Qianxi Zhang; Cheng Zhang; Jing Liu; Jingjia Luo; Di Liu; Huiqiang Jiang; Qi Chen; Bailu Ding; Xiao Yan; Jiawei Jiang; Chen Chen; Mingxing Zhang; Cheng Li; Yuqing Yang; Fan Yang; Mao Yang

VLDB Endowment | May 2025, Vol 19(5)

Recent large language models (LLMs) are rapidly extending their context windows, yet inference throughput lags due to increasing GPU memory and bandwidth demands. This is because the key-value (KV) cache, an intermediate structure storing token representations, grows linearly with context length and requires an iterative linear scan for attention computation. A promising direction to accelerate long-context inference is to exploit attention’s inherent sparsity by offloading the KV cache to CPU memory and retrieving only a small subset of tokens important to the current generation step. However, prior sparse attention approaches struggle to balance accuracy and retrieval cost due to varying sparsity patterns and inefficient GPU-CPU memory management. We present RetroInfer, a vector storage engine that realizes a sparsity-based KV cache for long-context inference. RetroInfer introduces an Attention-aWare VEctor index (wave index), which fundamentally improves the tradeoff between attention accuracy and retrieval cost through tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering. We also design the wave buffer, a GPU-CPU buffer manager that assigns computation and manages data across heterogeneous hardware. We evaluate RetroInfer across a range of models and workloads, demonstrating up to 4.4X decoding throughput over full attention at 120K context and up to 12.2X over sparse attention baselines at 1 million tokens — all while preserving full-attention-level accuracy.
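To make the retrieval idea concrete, here is a minimal Python sketch of sparsity-based attention over a KV cache. It is not RetroInfer's wave index. It assumes a much simpler scheme, centroid-based block retrieval, in which keys are grouped into contiguous fixed-size blocks, each block is summarized by its mean key, and exact attention runs only over the blocks whose centroids best match the query. The function names (`attend`, `build_blocks`, `sparse_attend`), the block size, and the toy data are all hypothetical choices made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Exact scaled dot-product attention for one query over the given keys."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    return [sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))]

def build_blocks(keys, block_size):
    """Assumed index layout: contiguous blocks, each with a mean-key centroid."""
    blocks = []
    for start in range(0, len(keys), block_size):
        ids = list(range(start, min(start + block_size, len(keys))))
        centroid = [sum(keys[i][d] for i in ids) / len(ids)
                    for d in range(len(keys[0]))]
        blocks.append((centroid, ids))
    return blocks

def sparse_attend(query, keys, values, blocks, top_blocks):
    """Score only the centroids, then attend exactly over the chosen blocks."""
    ranked = sorted(blocks, key=lambda b: dot(query, b[0]), reverse=True)
    chosen = sorted(i for _, ids in ranked[:top_blocks] for i in ids)
    return attend(query, [keys[i] for i in chosen], [values[i] for i in chosen])

# Toy KV cache: 8 tokens, 2-dimensional keys, 1-dimensional values.
keys = [[0.0, 1.0], [0.0, 1.0], [5.0, 0.0], [5.0, 0.0],
        [0.0, -1.0], [0.0, -1.0], [-1.0, 0.0], [-1.0, 0.0]]
values = [[float(i)] for i in range(len(keys))]
query = [4.0, 0.0]

blocks = build_blocks(keys, block_size=2)
print("full  :", attend(query, keys, values))
print("sparse:", sparse_attend(query, keys, values, blocks, top_blocks=1))
```

In this toy setting the query aligns strongly with a single block, so attending over that block alone closely matches full attention while touching a quarter of the keys. When attention mass is spread across many blocks, such a naive scheme loses accuracy, which is the accuracy versus retrieval-cost tension the abstract describes.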