

Session

Research Track Oral Presentation: Multimodal and Generative Models

Grand Ballroom 2

Moderator: Haoran Qiu

Tue 19 May 1 p.m. PDT — 2:15 p.m. PDT

Tue 19 May 13:00 - 13:15 PDT

ContextPilot: Fast Long-Context Inference via Context Reuse

Yinsicheng Jiang ⋅ Yeqi Huang ⋅ Liang Cheng ⋅ Cheng Deng ⋅ Xuan Sun ⋅ Luo Mai

AI applications increasingly depend on long-context inference, where LLMs consume substantial context to support stronger reasoning. Common examples include retrieval-augmented generation, agent memory layers, and multi-agent orchestration. As input contexts get longer, prefill latency becomes the main bottleneck. Yet today’s prefill acceleration techniques face a trade-off: they either preserve reasoning quality but deliver little KV-cache reuse, or improve reuse at the cost of degraded reasoning quality. We present ContextPilot, a system that accelerates prefill by introducing context reuse as a new mechanism for faster long-context inference. ContextPilot introduces a context index to identify overlapping context blocks across LLM interactions (e.g., across users and turns). It further proposes context alignment and de-duplication techniques to maximize KV-cache reuse. To preserve reasoning quality under reuse, it introduces succinct context annotations that prevent quality degradation. Finally, ContextPilot is built around a modular architecture with a clean interface that integrates with existing inference engines. Extensive evaluation shows that ContextPilot reduces LLM prefill latency by up to 3× compared to state-of-the-art methods while preserving reasoning quality. At longer context lengths, it can even improve reasoning quality. ContextPilot is open-sourced at: https://github.com/EfficientContext/ContextPilot.
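
The idea of a context index combined with de-duplication and alignment can be illustrated with a small sketch. This is not ContextPilot's implementation: the `ContextIndex` class, the `plan` method, and the hash-based block identifiers are hypothetical names chosen for illustration, and the ordering rule (previously seen blocks first, in first-seen order) is an assumed simplification of how alignment might increase the shared prefix that a KV cache can reuse.

```python
import hashlib


def block_id(text: str) -> str:
    """Content hash used as a block identifier (an assumption for this sketch)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


class ContextIndex:
    """Hypothetical index that remembers which context blocks were seen before."""

    def __init__(self):
        self.seen = {}  # block id -> order in which the block was first seen

    def plan(self, blocks):
        """Return (ordered_blocks, n_known) for one request.

        Duplicates are dropped, then blocks already in the index are moved
        to the front in first-seen order so that requests sharing blocks
        tend to share a common prefix.
        """
        unique = list(dict.fromkeys(blocks))  # de-duplicate, keep first occurrence
        known = sorted(
            (b for b in unique if block_id(b) in self.seen),
            key=lambda b: self.seen[block_id(b)],
        )
        new = [b for b in unique if block_id(b) not in self.seen]
        ordered = known + new
        for b in ordered:
            self.seen.setdefault(block_id(b), len(self.seen))
        return ordered, len(known)


index = ContextIndex()
print(index.plan(["docA", "docB"]))                  # (['docA', 'docB'], 0)
print(index.plan(["docC", "docB", "docA", "docB"]))  # (['docA', 'docB', 'docC'], 2)
```

In the second call the duplicate `docB` is dropped and the two previously seen blocks are moved ahead of the new one, so the request begins with the same `docA`, `docB` prefix as the first. Reordering context can change what the model infers from it, which is presumably the concern the abstract's context annotations address; the sketch does not model that part.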

Tue 19 May 13:15 - 13:30 PDT

db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism

Siqi Chen ⋅ Ke Hong ⋅ Tianchen Zhao ⋅ Ruiqi Xie ⋅ Zhenhua Zhu ⋅ Xudong Zhang ⋅ Yu Wang

Scaling Diffusion Transformer (DiT) inference via sequence parallelism is critical for reducing latency in visual generation, but is severely hampered by workload imbalance when applied to models employing block-wise sparse attention. When sequence parallelism is applied along the head dimension (as in Ulysses) or the block dimension (as in Ring Attention), the imbalance stems from the inherent variation in sparsity across attention heads and the irregular distribution of dense blocks within the sparse mask. In this paper, we formalize a sparse imbalance ratio to quantify the imbalance, and propose db-SP, a sparsity-aware sequence parallelism technique that tackles the challenge. db-SP contains a dual-level partitioning approach that achieves near-perfect workload balance at both the head and block levels with negligible overhead. Furthermore, to handle the evolving sparsity patterns across denoising steps and layers, db-SP dynamically determines the parallel degrees for the head and block dimensions at runtime. Experimental results demonstrate that db-SP delivers an end-to-end speedup of 1.25× and an attention-specific speedup of 1.40× over state-of-the-art sequence parallel methods on average.
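
To make the head-level imbalance concrete, the sketch below compares a naive contiguous assignment of attention heads to devices with a greedy balanced one. It is an illustration under stated assumptions, not the db-SP algorithm: the abstract does not give the definition of the sparse imbalance ratio, so `imbalance_ratio` assumes the maximum per-device load divided by the mean load, and `greedy_partition` uses the standard longest-processing-time heuristic as a stand-in for the paper's partitioning. The per-head loads (numbers of dense blocks) are made up.

```python
import heapq


def imbalance_ratio(device_loads):
    """Assumed definition: max device load over mean device load (1.0 = balanced)."""
    mean = sum(device_loads) / len(device_loads)
    return max(device_loads) / mean


def contiguous_partition(head_loads, n_devices):
    """Give each device an equal-sized contiguous run of heads, ignoring sparsity."""
    per_device = len(head_loads) // n_devices
    return [
        sum(head_loads[d * per_device:(d + 1) * per_device])
        for d in range(n_devices)
    ]


def greedy_partition(head_loads, n_devices):
    """Assign heads, heaviest first, to whichever device is currently lightest."""
    heap = [(0, d) for d in range(n_devices)]  # (current load, device id)
    heapq.heapify(heap)
    loads = [0] * n_devices
    for load in sorted(head_loads, reverse=True):
        total, d = heapq.heappop(heap)
        loads[d] = total + load
        heapq.heappush(heap, (loads[d], d))
    return loads


# Hypothetical dense-block counts for 8 attention heads, split over 2 devices.
head_loads = [8, 7, 6, 5, 4, 3, 2, 1]
naive = contiguous_partition(head_loads, 2)
balanced = greedy_partition(head_loads, 2)
print(naive, imbalance_ratio(naive))        # [26, 10], ratio about 1.44
print(balanced, imbalance_ratio(balanced))  # [18, 18] 1.0
```

The greedy heuristic does not guarantee that every device receives the same number of heads, which head-parallel schemes such as Ulysses typically require. It happens to in this example, but a real partitioner would need to enforce that constraint and would also have to balance the block dimension.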

Tue 19 May 13:30 - 13:45 PDT

SwiftGS: Algorithm and System Co-Optimization for Fast 3D Gaussian Splatting on GPUs

Lingjun Gao ⋅ Zhican Wang ⋅ Zhiwen Mo ⋅ Hongxiang Fan

Recent advances in 3D Gaussian Splatting (3DGS) have enabled high-quality and efficient novel view synthesis, demonstrating great potential in real-world applications such as robotic perception and digital-twin construction. However, 3DGS requires processing up to millions of Gaussians in parallel, imposing significant computational and memory demands that limit its deployment on resource-constrained platforms. Through systematic profiling and analysis, this paper identifies several redundancies at both the algorithmic and system implementation levels. These insights motivate us to explore several novel optimizations, including adaptive early sorting, GPU-efficient axis-shared rasterization, and dynamic thresholding. Unlike prior work that focuses only on either algorithmic improvements or systems optimization, our approach explores a joint algorithm and system co-optimization to push the performance limits of 3DGS on GPUs. Comprehensive evaluation demonstrates that our co-optimization approach, named Flash3DGS, achieves a speed-up of up to 1.41× with a negligible drop in rendering image quality compared with the gsplat baseline. Importantly, our co-optimization is orthogonal to most existing 3DGS acceleration methods, allowing for synergistic performance gains when used in combination. We plan to release our code publicly upon paper acceptance to support reproducibility and future research.

Tue 19 May 13:45 - 14:00 PDT

TriInfer: Hybrid EPD Disaggregation for Efficient Multimodal Large Language Model Inference

Xianzhe Dong ⋅ Tongxuan Liu ⋅ Yuting Zeng ⋅ Weizhe Huang ⋅ Xiaoyang Zhao ⋅ Siyu Wu ⋅ Liangyu Liu ⋅ Liu Yang ⋅ Yu Wu ⋅ Hailong Yang ⋅ Ke Zhang ⋅ Jing Li

Existing MLLM inference systems are typically designed based on the architecture of language models, coupling image processing and language processing. This design struggles to accommodate the heterogeneous demands of different stages in terms of computational resources, memory access patterns, and service-level objectives (SLOs), leading to low resource utilization and high request latency, ultimately failing to meet the service requirements of diverse inference scenarios. To address these challenges, we propose TriInfer, an efficient MLLM inference system that adopts a Hybrid Encode-Prefill-Decode (EPD) Disaggregation architecture. By scheduling the three stages (encode, prefill, and decode) onto separate heterogeneous inference instances, the system flexibly reallocates resources across stages, significantly reducing idle computation, alleviating resource bottlenecks, and improving overall system throughput and scalability. In addition, TriInfer supports a stage-level batching strategy that enhances load balancing, enables parallel execution of visual and language models, and further optimizes inference performance. Experiments under real multimodal inference workloads demonstrate that TriInfer can achieve up to 3.7× higher inference throughput compared to state-of-the-art systems (e.g., vLLM, SGLang) while meeting the 90th percentile request SLO. The source code of TriInfer will be released at https://github.com/dongxianzhe/triinfer.

Tue 19 May 14:00 - 14:15 PDT

TiDAR: Think in Diffusion, Talk in Autoregression

Jingyu Liu ⋅ Xin Dong ⋅ Zhifan Ye ⋅ Rishabh Mehta ⋅ Yonggan Fu ⋅ Vartika Singh ⋅ Ce Zhang ⋅ Pavlo Molchanov

Diffusion language models hold the promise of fast parallel generation, while autoregressive (AR) models typically excel in quality due to their causal structure aligning naturally with language modeling. This raises a fundamental question: can we achieve a synergy with high throughput, higher GPU utilization, and AR-level quality? Existing methods fail to effectively balance these two aspects, either prioritizing AR using a weaker model for sequential drafting (speculative decoding), leading to lower drafting efficiency, or using some form of left-to-right (AR-like) decoding logic for diffusion, which still suffers from quality degradation and forfeits its potential parallelizability. We introduce TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively, all within a single forward pass using specially designed structured attention masks. This design exploits the free compute density on GPUs, achieving a strong balance between drafting and verification capacity. Moreover, we design TiDAR to be serving-friendly as a standalone model. We extensively evaluate TiDAR against AR models, speculative decoding, and diffusion variants across generative and likelihood tasks at both 1.5B and 8B scales. Thanks to parallel drafting and sampling as well as efficient exact KV cache support, TiDAR outperforms speculative decoding in measured throughput and surpasses diffusion models like Dream and LLaDA in both efficiency and quality. Most notably, TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71× to 5.91× more tokens per second.