Poster
FlexInfer: Flexible LLM Inference with CPU Computations
Seonjin Na · Geonhwa Jeong · Byung Hoon Ahn · Aaron Jezghani · Jeffrey Young · Christopher Hughes · Tushar Krishna · Hyesoon Kim
Mission City Ballroom #55
LLMs have demonstrated remarkable performance across various fields, prompting data centers to adopt high-cost accelerators such as GPUs and NPUs for model training and inference. However, the immense size of these models and their key-value (KV) caches poses substantial memory capacity challenges. While offloading-based approaches utilize CPU memory to store model weights and KV caches, enabling the deployment of models that exceed GPU memory capacity, they often suffer from performance degradation due to PCIe transfer bottlenecks. To address the performance limitations of existing offloading-based LLM inference on CPU- and memory-limited single-GPU systems, this paper proposes FlexInfer. FlexInfer uses a performance estimator to dynamically select the most appropriate execution policy for each phase (prefill and decode) based on hardware configurations and runtime parameters such as sequence length and batch size. Our evaluation shows that by selecting the optimal policy for each phase, FlexInfer reduces end-to-end latency by 75% and 76% on average across two different server configurations, compared to FlexGen, a state-of-the-art offloading-based LLM inference technique.
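To make the phase-wise policy selection concrete, below is a minimal illustrative sketch, not the authors' implementation: the policy names, the `HardwareConfig`/`RuntimeParams` fields, and the toy cost model in `estimate_latency` are all hypothetical, and only capture the idea of picking, separately for prefill and decode, the execution plan with the lowest estimated latency given hardware and runtime parameters.

```python
from dataclasses import dataclass

# Hypothetical execution policies; FlexInfer's actual policy set may differ.
POLICIES = ["gpu_only", "cpu_weight_offload", "cpu_attention"]

@dataclass
class HardwareConfig:
    gpu_mem_gb: float
    pcie_gbps: float      # effective PCIe bandwidth in GB/s (assumed)
    gpu_tflops: float
    cpu_tflops: float

@dataclass
class RuntimeParams:
    batch_size: int
    seq_len: int
    model_bytes: float        # total weight size in bytes
    kv_bytes_per_token: float # per-token KV-cache footprint in bytes

def compute_work(phase: str, rt: RuntimeParams) -> float:
    """Rough FLOP proxy: prefill processes the whole prompt, decode one token."""
    tokens = rt.seq_len if phase == "prefill" else 1
    return rt.batch_size * tokens * rt.model_bytes  # proportional to params per token

def estimate_latency(policy: str, phase: str,
                     hw: HardwareConfig, rt: RuntimeParams) -> float:
    """Toy cost model (assumption): latency = compute time + PCIe transfer time."""
    kv_bytes = rt.batch_size * rt.seq_len * rt.kv_bytes_per_token
    if policy == "gpu_only":
        if rt.model_bytes + kv_bytes > hw.gpu_mem_gb * 1e9:
            return float("inf")                           # does not fit in GPU memory
        transfer = 0.0
        compute = compute_work(phase, rt) / (hw.gpu_tflops * 1e12)
    elif policy == "cpu_weight_offload":
        transfer = rt.model_bytes / (hw.pcie_gbps * 1e9)  # stream weights over PCIe
        compute = compute_work(phase, rt) / (hw.gpu_tflops * 1e12)
    else:  # "cpu_attention": run attention on CPU so the KV cache never crosses PCIe
        transfer = rt.model_bytes / (hw.pcie_gbps * 1e9)
        compute = compute_work(phase, rt) / (hw.cpu_tflops * 1e12)
    return compute + transfer

def select_policy(phase: str, hw: HardwareConfig, rt: RuntimeParams) -> str:
    """Pick the policy with the lowest estimated latency for this phase."""
    return min(POLICIES, key=lambda p: estimate_latency(p, phase, hw, rt))
```

In this sketch, `select_policy("prefill", hw, rt)` and `select_policy("decode", hw, rt)` are evaluated independently, reflecting the abstract's point that the two phases have different compute-to-transfer ratios and may therefore favor different offloading strategies.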