Poster
FlexInfer: Flexible LLM Inference with CPU Computations
Seonjin Na · Geonhwa Jeong · Byung Hoon Ahn · Aaron Jezghani · Jeffrey Young · Christopher Hughes · Tushar Krishna · Hyesoon Kim
LLMs have achieved remarkable performance across various fields, prompting data centers to adopt costly compute accelerators such as GPUs and NPUs for model training and inference. However, LLMs' large model sizes and the associated key-value (KV) caches create significant memory capacity challenges. To address this, offloading-based techniques leverage CPU memory to store model weights and the KV cache, allowing models larger than GPU memory to be served. However, these approaches often encounter performance bottlenecks due to PCIe transfer latency and fail to effectively leverage the potential of CPU computation. To address the performance limitations of existing offloading-based LLM inference in CPU- and memory-limited single-GPU systems, this paper proposes FlexInfer. FlexInfer uses a performance estimator to dynamically select the most appropriate execution policy for each phase—prefill and decode—based on their distinct characteristics. Our evaluation shows that by selecting optimal policies for these phases, FlexInfer reduces end-to-end latency by 75.2% and 77% on average across two different server configurations for various models such as OPT and LLaMA, compared to FlexGen, the state-of-the-art offloading-based LLM inference technique.
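The abstract does not detail the performance estimator's internals, so the following Python sketch only illustrates the general idea of phase-aware policy selection with a toy analytical cost model. The policy names, hardware constants, and cost formulas are hypothetical placeholders, not FlexInfer's actual estimator or policy set.

```python
"""Minimal sketch of phase-aware execution-policy selection (illustrative only).

All policies, cost models, and hardware numbers are hypothetical; FlexInfer's
real estimator and policy space may differ substantially.
"""
from dataclasses import dataclass

# Hypothetical hardware parameters; a real estimator would use measured values.
PCIE_BW_GBPS = 25.0     # CPU <-> GPU transfer bandwidth (GB/s)
GPU_TFLOPS = 150.0
CPU_TFLOPS = 3.0


@dataclass
class PhaseProfile:
    """Rough per-phase workload description (prefill or one decode step)."""
    tflops: float        # compute required (TFLOP)
    weight_gb: float     # model weights touched (GB)
    kv_gb: float         # KV-cache bytes read/written (GB)


def est_stream_to_gpu(p: PhaseProfile) -> float:
    """Latency estimate (s) when weights/KV live in CPU memory and are
    streamed over PCIe to the GPU, overlapping transfer with compute."""
    transfer = (p.weight_gb + p.kv_gb) / PCIE_BW_GBPS
    compute = p.tflops / GPU_TFLOPS
    return max(transfer, compute)


def est_compute_on_cpu(p: PhaseProfile) -> float:
    """Latency estimate (s) when the phase runs directly on the CPU;
    weights and KV cache already reside in host memory, so no PCIe traffic."""
    return p.tflops / CPU_TFLOPS


POLICIES = {
    "stream-to-gpu": est_stream_to_gpu,
    "compute-on-cpu": est_compute_on_cpu,
}


def select_policy(p: PhaseProfile) -> str:
    """Pick the policy with the lowest estimated latency for this phase."""
    return min(POLICIES, key=lambda name: POLICIES[name](p))


if __name__ == "__main__":
    # Prefill: compute-heavy, processes the whole prompt at once.
    prefill = PhaseProfile(tflops=40.0, weight_gb=12.0, kv_gb=0.5)
    # Decode: one token at a time, dominated by data movement, few FLOPs.
    decode = PhaseProfile(tflops=0.03, weight_gb=12.0, kv_gb=2.0)
    print("prefill ->", select_policy(prefill))   # GPU wins: compute-bound
    print("decode  ->", select_policy(decode))    # CPU wins: avoids PCIe transfer
```

With these toy numbers, the compute-bound prefill phase favors GPU execution, while the transfer-bound decode phase favors CPU computation, mirroring the phase-dependent trade-off the abstract describes.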