Session
Research-Track Oral Presentation: R19: Multimodal and Generative Models
Grand Ballroom 2
db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism
Siqi Chen ⋅ Ke Hong ⋅ Tianchen Zhao ⋅ Ruiqi Xie ⋅ Zhenhua Zhu ⋅ Xudong Zhang ⋅ Yu Wang
Scaling Diffusion Transformer (DiT) inference via sequence parallelism is critical for reducing latency in visual generation, but is severely hampered by workload imbalance when applied to models employing block-wise sparse attention. The imbalance stems from the inherent variation in sparsity across attention heads and the irregular distribution of dense blocks within the sparse mask, when sequence parallelism is applied along the head dimension (as in Ulysses) or the block dimension (as in Ring Attention). In this paper, we formalize a \textit{sparse imbalance ratio} to quantify the imbalance, and propose \textit{db}-SP, a sparsity-aware sequence parallelism technique that tackles this challenge. \textit{db}-SP employs a dual-level partitioning approach that achieves near-perfect workload balance at both the head and block levels with negligible overhead. Furthermore, to handle the evolving sparsity patterns across denoising steps and layers, \textit{db}-SP dynamically determines the parallel degrees for the head and block dimensions at runtime. Experimental results demonstrate that \textit{db}-SP delivers an end-to-end speedup of 1.25× and an attention-specific speedup of 1.40× over state-of-the-art sequence parallel methods on average.
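The imbalance metric and head-level rebalancing described above can be illustrated with a minimal sketch. This is not the authors' code: the function names and the greedy longest-processing-time (LPT) heuristic are assumptions chosen to show how per-head dense-block counts can be evened out across devices, with the imbalance ratio defined here as the maximum per-device load over the mean load (1.0 = perfect balance).

```python
# Illustrative sketch (hypothetical names, not db-SP's implementation):
# balance per-head sparse-attention workloads across devices.

def imbalance_ratio(loads):
    """Max per-device load divided by mean load; 1.0 means perfect balance."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean

def greedy_head_partition(head_costs, num_devices):
    """Assign each head (cost = its dense-block count) to the least-loaded device."""
    loads = [0] * num_devices
    assignment = [[] for _ in range(num_devices)]
    # Place heads in descending cost order (LPT heuristic) so large heads go first.
    for head, cost in sorted(enumerate(head_costs), key=lambda x: -x[1]):
        d = loads.index(min(loads))
        assignment[d].append(head)
        loads[d] += cost
    return assignment, loads

# Example: 8 heads with skewed sparsity, split over 4 devices.
costs = [120, 90, 30, 15, 110, 45, 60, 10]
# Naive head-dimension split: consecutive pairs of heads per device.
naive = [costs[0] + costs[1], costs[2] + costs[3],
         costs[4] + costs[5], costs[6] + costs[7]]
assignment, balanced = greedy_head_partition(costs, 4)
print(imbalance_ratio(naive), imbalance_ratio(balanced))  # 1.75 vs. 1.0
```

In this toy instance the naive split leaves one device with 1.75× the mean work, while the greedy assignment balances it exactly; the paper's dual-level scheme additionally repartitions at the block level within heads.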
Flash3DGS: Algorithm and System Co-Optimization for Fast 3D Gaussian Splatting on GPUs
Lingjun Gao ⋅ Zhican Wang ⋅ Zhiwen Mo ⋅ Hongxiang Fan
Recent advances in 3D Gaussian Splatting (3DGS) have enabled high-quality and efficient novel view synthesis, demonstrating great potential in real-world applications such as robotic perception and digital-twin construction. However, 3DGS requires processing up to millions of Gaussians in parallel, imposing significant computational and memory demands that limit its deployment on resource-constrained platforms. Through systematic profiling and analysis, this paper identifies several sources of redundancy at both the algorithmic and system-implementation levels. These insights motivate us to explore several novel optimizations, including adaptive early sorting, GPU-efficient axis-shared rasterization, and dynamic thresholding. Unlike prior work that focuses only on either algorithmic improvements or systems optimization, our approach explores a joint algorithm and system co-optimization to push the performance limits of 3DGS on GPUs. Comprehensive evaluation demonstrates that our co-optimization approach, named \textit{Flash3DGS}, achieves a speed-up of up to $1.41 \times$ with negligible loss in rendered image quality compared with the \textit{gsplat} baseline. Importantly, our co-optimization is orthogonal to most existing 3DGS acceleration methods, allowing for synergistic performance gains when used in combination. We plan to release our code publicly upon paper acceptance to support reproducibility and future research.
StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation
Tianrui Feng ⋅ Zhi Li ⋅ Shuo Yang ⋅ Haocheng Xi ⋅ Muyang Li ⋅ Xiuyu Li ⋅ Keting Yang ⋅ Kelly Peng ⋅ Song Han ⋅ Maneesh Agrawala ⋅ Kurt Keutzer ⋅ Akio Kodaira ⋅ Chenfeng Xu
Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Previous image-based streaming diffusion models have powered efficient and creative live-streaming products but have hit limits on temporal consistency due to their image-based design. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. However, offline generation systems primarily optimize throughput by batching large workloads. In contrast, live online streaming operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter. In addition, scalable multi-GPU serving for real-time streams remains largely unresolved. To address this, we present \textbf{StreamDiffusionV2}, a \emph{training-free} pipeline for interactive live streaming with video diffusion models. StreamDiffusionV2 integrates an SLO-aware batching scheduler and a block scheduler, together with a sink-token–guided rolling KV cache, a motion-aware noise controller, and other system-level optimizations. Moreover, we introduce a scalable pipeline orchestration that parallelizes the diffusion process across denoising steps and network layers, achieving near-linear FPS scaling without violating latency guarantees. The system scales seamlessly across heterogeneous GPU environments and supports flexible denoising steps (e.g., 1–4), enabling both ultra-low-latency and higher-quality modes. Without TensorRT or quantization, StreamDiffusionV2 renders the first frame within 0.5s and attains 58.28 FPS with a 14B-parameter model and 64.52 FPS with a 1.3B-parameter model on four H100 GPUs.
Even when increasing denoising steps to improve quality, it sustains 31.62 FPS (14B) and 61.58 FPS (1.3B), making state-of-the-art generative live streaming practical and accessible—from individual creators to enterprise-scale platforms.
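The sink-token–guided rolling KV cache mentioned in the abstract can be sketched as follows. This is an assumed simplification, not the released system: the class name and eviction policy are hypothetical, showing only the core idea that a few leading "sink" token entries are pinned forever while the rest of the cache is a sliding window over the most recent tokens.

```python
# Illustrative sketch (assumed design, not StreamDiffusionV2's code):
# a KV cache that pins `num_sink` leading entries and keeps a rolling
# window of the `window` most recent entries for everything else.

class RollingKVCache:
    def __init__(self, num_sink, window):
        self.num_sink = num_sink
        self.window = window
        self.entries = []  # stand-ins for per-token (key, value) pairs

    def append(self, kv):
        self.entries.append(kv)
        limit = self.num_sink + self.window
        if len(self.entries) > limit:
            # Evict the oldest non-sink entry; sink tokens are never dropped.
            del self.entries[self.num_sink]

    def context(self):
        return list(self.entries)

cache = RollingKVCache(num_sink=2, window=3)
for t in range(8):          # stream in tokens 0..7
    cache.append(t)
print(cache.context())      # [0, 1, 5, 6, 7]: sinks kept, window rolls
```

Pinning sink tokens preserves the attention anchors that long-context models rely on, so the window can stay small without degrading generation quality.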
TiDAR: Think in Diffusion, Talk in Autoregression
Jingyu Liu ⋅ Xin Dong ⋅ Zhifan Ye ⋅ Yonggan Fu ⋅ Vartika Singh ⋅ Ce Zhang ⋅ Pavlo Molchanov
Diffusion language models hold the promise of fast parallel generation, while autoregressive (AR) models typically excel in quality due to their causal structure aligning naturally with language modeling. This raises a fundamental question: can we achieve a synergy with high throughput, higher GPU utilization, and AR-level quality? Existing methods fail to effectively balance these two aspects, either prioritizing AR using a weaker model for sequential drafting (speculative decoding), leading to lower drafting efficiency, or using some form of left-to-right (AR-like) decoding logic for diffusion, which still suffers from quality degradation and forfeits its potential parallelizability. We introduce TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively - all within a single forward pass using specially designed structured attention masks. This design exploits the free compute density on GPUs, achieving a strong balance between drafting and verification capacity. Moreover, we design TiDAR to be serving-friendly as a standalone model. We extensively evaluate TiDAR against AR models, speculative decoding, and diffusion variants across generative and likelihood tasks at both 1.5B and 8B scales. Thanks to parallel drafting and sampling as well as efficient exact KV cache support, TiDAR outperforms speculative decoding in measured throughput and surpasses diffusion models like Dream and LLaDA in both efficiency and quality. Most notably, TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71× to 5.91× more tokens per second.
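The structured attention mask that lets drafting and verification share one forward pass can be sketched as a minimal example. This is an assumed simplification of the paper's design, not its actual mask: the committed (Talking) prefix attends causally, while the diffusion draft block attends to the whole prefix and bidirectionally within itself.

```python
# Illustrative sketch (assumed mask structure, not TiDAR's implementation):
# True means query position i may attend to key position j.

def hybrid_mask(prefix_len, draft_len):
    n = prefix_len + draft_len
    mask = []
    for i in range(n):
        if i < prefix_len:
            # Committed AR prefix: standard causal attention.
            row = [j <= i for j in range(n)]
        else:
            # Diffusion draft block: sees the full prefix and
            # attends bidirectionally within the draft block itself.
            row = [True] * n
        mask.append(row)
    return mask

m = hybrid_mask(prefix_len=3, draft_len=2)
for row in m:
    print(["X" if x else "." for x in row])
```

With a mask like this, one forward pass both scores the prefix tokens causally (verification) and lets all draft positions condition on each other (parallel drafting), which is the single-pass synergy the abstract describes.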
TriInfer: Hybrid EPD Disaggregation for Efficient Multimodal Large Language Model Inference
Xianzhe Dong ⋅ Tongxuan Liu ⋅ Yuting Zeng ⋅ Weizhe Huang ⋅ Siyu Wu ⋅ Liu Yang ⋅ Hailong Yang ⋅ Jing Li
Existing MLLM inference systems are typically designed based on the architecture of language models, coupling image processing and language processing. This design struggles to accommodate the heterogeneous demands of different stages in terms of computational resources, memory access patterns, and service-level objectives (SLOs), leading to low resource utilization and high request latency, ultimately failing to meet the service requirements of diverse inference scenarios. To address these challenges, we propose TriInfer, an efficient MLLM inference system that adopts a Hybrid Encode-Prefill-Decode (EPD) Disaggregation architecture. By scheduling the three stages — encode, prefill, and decode — onto separate heterogeneous inference instances, the system flexibly reallocates resources across stages, significantly reducing idle computation, alleviating resource bottlenecks, and improving overall system throughput and scalability. In addition, TriInfer supports a stage-level batching strategy that enhances load balancing, enables parallel execution of visual and language models, and further optimizes inference performance. Experiments under real multimodal inference workloads demonstrate that TriInfer can achieve up to 3.7× higher inference throughput compared to state-of-the-art systems (e.g., vLLM, SGLang) while meeting the 90th percentile request SLO.
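The stage disaggregation idea above can be sketched with a toy router. This is a hypothetical illustration, not TriInfer's API: the class and instance names are invented, showing only how encode, prefill, and decode requests can be dispatched to separate, independently sized instance pools with per-pool load balancing.

```python
# Illustrative sketch (hypothetical names, not TriInfer's code): route each
# MLLM inference stage to its own pool of instances so resources can be
# scaled per stage rather than per monolithic replica.

from collections import defaultdict

class EPDRouter:
    STAGES = ("encode", "prefill", "decode")

    def __init__(self, pools):
        # pools: stage name -> list of instance ids (hardware may differ per pool)
        self.pools = pools
        self.load = defaultdict(int)  # outstanding requests per instance

    def dispatch(self, stage):
        assert stage in self.STAGES
        # Pick the least-loaded instance inside the stage's dedicated pool.
        inst = min(self.pools[stage], key=lambda i: self.load[i])
        self.load[inst] += 1
        return inst

# Example: one encode instance, two prefill, three decode.
router = EPDRouter({"encode": ["e0"],
                    "prefill": ["p0", "p1"],
                    "decode": ["d0", "d1", "d2"]})
order = [router.dispatch(s) for s in ["encode", "prefill", "prefill", "decode"]]
print(order)  # ['e0', 'p0', 'p1', 'd0']
```

Because each stage has its own pool, the decode pool can be grown to meet per-token latency SLOs without over-provisioning the encode stage, which is the resource-reallocation flexibility the abstract claims.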