TokenBlend: Accelerating Tensor Parallelism LLM Inference Through Efficient Compute-Communication Overlap
Raja Gond ⋅ Nipun Kwatra ⋅ Ramachandran Ramjee
Abstract
Distributed inference of large language models (LLMs) using tensor parallelism can introduce communication overheads of 20%, even on GPUs connected via NVLink. Several techniques have been proposed to mitigate these overheads by decomposing computations into smaller subtasks and overlapping communication with them. However, as of this writing, none of the open-source LLM serving systems (vLLM, SGLang, TensorRT-LLM) support compute-communication overlap for LLMs served using tensor parallelism. This is because these systems keep the number of tokens processed per iteration small to support low-latency serving, and decomposing such small workloads to enable communication overlap results in worse performance. We present TokenBlend, the first system to enable efficient compute-communication overlap for tensor-parallel models at per-iteration token counts as small as 1024. TokenBlend identifies RMSNorm, a previously overlooked operation, as crucial, and optimizes it together with communication by implementing a novel fused AllReduce-RMSNorm kernel. This kernel leverages the multimem feature available on modern GPUs (e.g., Hopper, Blackwell) to jointly perform communication and RMSNorm efficiently using only 2-8 SMs. Our evaluations demonstrate up to 1.28x speedup in latency and 1.19x higher throughput across multiple models and workloads. In several settings, TokenBlend delivers better performance than an equivalent model with all communication removed. The source code of TokenBlend is available at https://anonymous.4open.science/r/tokenblend-mlsys/.
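To make the fused AllReduce-RMSNorm operation concrete, the sketch below shows the numerics being fused: in tensor parallelism, each rank holds a partial activation; the AllReduce sums these partials, and RMSNorm then normalizes the result. This is a minimal NumPy illustration of the math only (the simulated "AllReduce" is a plain sum over ranks); the actual multimem GPU kernel and its 2-8 SM implementation are not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    # RMSNorm: scale x by the reciprocal root-mean-square of its last axis.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def fused_allreduce_rmsnorm(partials, weight, eps=1e-6):
    """Numerics of the fused kernel (hypothetical name).

    partials: array of shape (tp_ranks, tokens, hidden) holding each
    tensor-parallel rank's partial activations.
    """
    # Step 1: "AllReduce" -- sum partial activations across TP ranks.
    reduced = partials.sum(axis=0)
    # Step 2: apply RMSNorm immediately, without a separate pass over
    # the reduced tensor (the fusion the paper's kernel performs on-GPU).
    return rmsnorm(reduced, weight, eps)
```

In the real system the two steps execute as one kernel, so the reduced activations never make an extra round-trip through GPU memory between the communication and the normalization.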