Efficient, VRAM-Constrained xLM Inference on Clients
Abstract
To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on client systems. This means efficient support for: a) interactive use (i.e., batch size 1), b) high-resolution VLM inference, c) dense and mixture-of-experts (MoE) LLMs, and d) adapting to system conditions (CPU thread count, CPU-GPU interconnect bandwidth, and VRAM budget) and inference conditions (phase of execution and context size). While recent CPU-GPU hybrid scheduling techniques show promise, to the best of our knowledge, no single product handles all of the above. In this paper, we address this problem with pipelined sharding, a novel, benchmark profile-guided CPU-GPU hybrid scheduling technique that achieves efficient, VRAM-constrained inference for both dense and MoE LLMs. Using a combination of model sharding at layer or sub-layer granularity, CPU offloading, pipelined copy-compute, and prioritized tensor placement in VRAM, it optimizes both time-to-first-token (TTFT) and tokens per second (TPS), while flexibly adapting to system and inference conditions. For efficient, high-accuracy VLM inference, we combine pipelined sharding with a llama.cpp implementation of three well-understood prior ideas (jointly called VLMOpt), namely, vision tensor CPU offloading, flash attention, and vision and language model VRAM overlap avoidance. These enhancements are targeted at improving client xLM inference in future releases of two important NVIDIA products: the In-Game Inferencing (IGI) software development kit (SDK) and the Cosmos-Reason-1 (CR1) physical AI reasoning VLM.
Highlights from our rigorous evaluation spanning multiple models and client systems include: TTFT improves by up to 6.7× and TPS by up to 30× for LLMs, and CR1 inference's VRAM demand drops by 10×, compared to their respective aggressive baselines.