RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference
- Yaoqi Chen
- Jin-Kang Zhang
- Baotong Lu
- Qianxi Zhang
- Cheng Zhang
- Jing Liu
- Jingjia Luo
- Di Liu
- Huiqiang Jiang
- Qi Chen
- Bailu Ding
- Xiao Yan
- Jiawei Jiang
- Chen Chen
- Mingxing Zhang
- Cheng Li
- Yuqing Yang
- Fan Yang
- Mao Yang
VLDB Endowment | Vol 19(5)
Recent large language models (LLMs) are rapidly extending their context windows, yet inference throughput lags due to increasing GPU memory and bandwidth demands. This is because the key-value (KV) cache, an intermediate structure storing token representations, grows linearly with context length and requires an iterative linear scan for attention computation. A promising direction to accelerate long-context inference is to exploit attention’s inherent sparsity by offloading the KV cache to CPU memory and retrieving only a small subset of tokens important to the current generation step. However, prior sparse attention approaches struggle to balance accuracy and retrieval cost due to varying sparsity patterns and inefficient GPU-CPU memory management. We present RetroInfer, a vector storage engine that realizes a sparsity-based KV cache for long-context inference. RetroInfer introduces an Attention-aWare VEctor index (wave index), which fundamentally improves the tradeoff between attention accuracy and retrieval cost through tripartite attention approximation, accuracy-bound attention estimation, and segmented clustering. We also design the wave buffer, a GPU-CPU buffer manager that assigns computation and manages data across heterogeneous hardware. We evaluate RetroInfer across a range of models and workloads, demonstrating up to 4.4× decoding throughput over full attention at 120K context and up to 12.2× over sparse attention baselines at 1 million tokens, all while preserving full-attention-level accuracy.
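To make the sparsity idea concrete, the following is a minimal sketch of generic top-k sparse attention next to full attention for a single query vector. It is not RetroInfer's wave index or wave buffer: the function names (`full_attention`, `topk_sparse_attention`) and the exact-score selection step are assumptions made for illustration, and a real system would use an index rather than scoring every cached key.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def full_attention(query, keys, values):
    """Attend over every cached token: a linear scan of the whole KV cache."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, key) * scale for key in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def topk_sparse_attention(query, keys, values, k):
    """Attend only over the k tokens with the highest query-key scores.

    Assumption (hypothetical, for illustration): tokens are selected using
    exact scores, so this sketch still touches every key. It shows only the
    accuracy side of the tradeoff, not the retrieval-cost savings.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, key) * scale for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, top)) for d in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    dim, tokens = 8, 256
    keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(tokens)]
    values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(tokens)]
    # A query aligned with one key, so a few tokens dominate the softmax.
    query = [4.0 * x for x in keys[17]]
    dense = full_attention(query, keys, values)
    for k in (4, 32, tokens):
        sparse = topk_sparse_attention(query, keys, values, k)
        err = max(abs(a - b) for a, b in zip(dense, sparse))
        print(f"k={k:4d}  max abs error vs full attention: {err:.3e}")
```

When the attention weights are concentrated on a few tokens, a small `k` already reproduces the full-attention output closely; when they are spread out, the error grows. This is the accuracy versus retrieval cost tension the abstract describes.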