Invited Talk Mon, May 18, 2026 • 9:50 AM – 10:15 AM PDT Grand Ballroom 1

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

Yuhan Liu

Abstract

KV caches have traditionally been stored in GPU memory to accelerate the decoding phase of large language model (LLM) inference. However, it is increasingly necessary to move KV caches outside GPU devices to enable cache reuse across different queries and inference engines. Our real-world usage statistics confirm this trend: over time, the total KV cache stored by users has grown rapidly, far exceeding the capacity of GPU memory. Despite this need, an efficient solution for offloading and transferring KV caches has been lacking.

In this talk, I'll present LMCache, the first efficient open-source KV caching solution, which extracts KV caches generated by modern LLM engines (vLLM and SGLang), stores them outside GPU memory, and shares them across engines and queries. LMCache supports both cache offloading (prefix reuse across queries) and prefill-decode (PD) disaggregation (cross-engine/GPU cache transfer). Our evaluation shows that combining LMCache with vLLM achieves up to a 15x improvement in throughput across workloads such as multi-round question answering and document analysis. I'll also briefly discuss the key KV cache optimizations behind LMCache, including CacheGen for KV cache compression and CacheBlend for non-prefix KV cache sharing.

Speaker

Yuhan Liu

Yuhan Liu is a fifth-year PhD candidate at the University of Chicago, co-advised by Junchen Jiang and Shan Lu. Her research focuses on building efficient large-scale systems and networking support for ML model inference. For her research, she has been named an MIT EECS Rising Star and has received a EuroSys Best Paper Award and UChicago’s Neubauer PhD Fellowship. She also leads two open-source projects that build large-scale KV caching layers for efficient LLM inference and are used in production at over 30 companies, including Google Cloud, Amazon AWS, NVIDIA, and IBM.
