Track: Research Track Oral Presentation: Best Paper Session

Tue 19 May 8:45 - 9:05 PDT

StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation

Tianrui Feng ⋅ Zhi Li ⋅ Shuo Yang ⋅ Haocheng Xi ⋅ Muyang Li ⋅ Xiuyu Li ⋅ Lvmin Zhang ⋅ Keting Yang ⋅ Kelly Peng ⋅ Song Han ⋅ Maneesh Agrawala ⋅ Kurt Keutzer ⋅ Akio Kodaira ⋅ Chenfeng Xu

Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Previous image-based streaming diffusion models have powered efficient and creative live-streaming products but have hit limits on temporal consistency because of their image-based design. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. However, offline generation systems primarily optimize throughput by batching large workloads. In contrast, live online streaming operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter. In addition, scalable multi-GPU serving for real-time streams remains largely unresolved. To address these challenges, we present StreamDiffusionV2, a training-free pipeline for interactive live streaming with video diffusion models. StreamDiffusionV2 integrates an SLO-aware batching scheduler and a block scheduler, together with a sink-token–guided rolling KV cache, a motion-aware noise controller, and other system-level optimizations. Moreover, we introduce a scalable pipeline orchestration that parallelizes the diffusion process across denoising steps and network layers, achieving near-linear FPS scaling without violating latency guarantees. The system scales seamlessly across heterogeneous GPU environments and supports flexible denoising steps (e.g., 1–4), enabling both ultra-low-latency and higher-quality modes. Without TensorRT or quantization, StreamDiffusionV2 renders the first frame within 0.5 s and attains 58.28 FPS with a 14B-parameter model and 64.52 FPS with a 1.3B-parameter model on four H100 GPUs. Even when increasing denoising steps to improve quality, it sustains 31.62 FPS (14B) and 61.58 FPS (1.3B), making state-of-the-art generative live streaming practical and accessible, from individual creators to enterprise-scale platforms.
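The abstract names a sink-token-guided rolling KV cache but does not describe its layout. As a rough illustration of the general idea only, the sketch below keeps the first few entries (the "sink" entries) permanently and rolls a fixed-size window over the most recent ones. The class name `SinkRollingCache` and the parameters `num_sink` and `window` are hypothetical, and this is not the authors' implementation.

```python
from collections import deque


class SinkRollingCache:
    """Toy cache (hypothetical): pins the first `num_sink` entries
    and keeps only the newest `window` entries after them."""

    def __init__(self, num_sink: int, window: int) -> None:
        if num_sink < 0 or window <= 0:
            raise ValueError("num_sink must be >= 0 and window must be > 0")
        self.num_sink = num_sink
        self.sink = []                      # pinned entries, never evicted
        self.recent = deque(maxlen=window)  # rolling window of newest entries

    def append(self, entry) -> None:
        if len(self.sink) < self.num_sink:
            self.sink.append(entry)
        else:
            # A full deque with maxlen drops its oldest item on append.
            self.recent.append(entry)

    def entries(self) -> list:
        """Entries visible to attention: sink entries, then the rolling window."""
        return self.sink + list(self.recent)


cache = SinkRollingCache(num_sink=2, window=3)
for frame_id in range(8):
    cache.append(frame_id)
print(cache.entries())  # [0, 1, 5, 6, 7]
```

In this toy version the memory footprint stays bounded at `num_sink + window` entries no matter how long the stream runs, which is the property a rolling cache is meant to provide.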

Tue 19 May 9:05 - 9:25 PDT

LEANN: A Low-Storage Overhead Vector Index

Yichuan Wang ⋅ Zhifei Li ⋅ Shu Liu ⋅ Yongji Wu ⋅ Ziming Mao ⋅ Yilong Zhao ⋅ Xiao Yan ⋅ Zhiying Xu ⋅ Yang Zhou ⋅ Ion Stoica ⋅ Sewon Min ⋅ Matei Zaharia ⋅ Joseph Gonzalez

Embedding-based vector search underpins many important applications, such as recommendation and retrieval-augmented generation (RAG). It relies on vector indices to enable efficient search. However, these indices require storing high-dimensional embeddings and large index metadata, whose total size can be several times larger than the original data (e.g., text chunks). Such high storage overhead makes it difficult, or even impractical, to deploy vector search on personal devices or large-scale datasets. To tackle this problem, we propose LEANN, a storage-efficient index for vector search that recomputes embeddings on the fly instead of storing them, and compresses state-of-the-art proximity graph indices while preserving search accuracy. LEANN delivers high-quality vector search while using only a fraction of the storage (e.g., 5% of the original data) and supporting storage-efficient index construction and updates. On real-world benchmarks, LEANN reduces index size by up to 50× compared with conventional indices, while maintaining SOTA accuracy and comparable latency for RAG applications.
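To make the "recompute instead of store" idea concrete, here is a minimal sketch, assuming a proximity graph stored as plain adjacency lists and a stand-in `embed` function. Only the text chunks and the graph are kept; embeddings are computed when the search visits a node. The `embed` function (a letter-frequency vector), the `search` function, and the toy data are all hypothetical, and LEANN's graph compression is not shown.

```python
import heapq
import math


def embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: normalized letter counts."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def distance(a: list[float], b: list[float]) -> float:
    """Cosine distance between two unit vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def search(chunks, graph, query, entry=0, k=1):
    """Best-first graph search that embeds each node only when it is visited."""
    q = embed(query)
    d0 = distance(q, embed(chunks[entry]))
    visited = {entry}
    frontier = [(d0, entry)]   # min-heap of (distance, node)
    results = [(d0, entry)]    # every scored node, kept sorted by distance
    while frontier:
        d, node = heapq.heappop(frontier)
        # Stop once the closest unexplored node is worse than the k-th best result.
        if len(results) >= k and d > results[k - 1][0]:
            break
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                dn = distance(q, embed(chunks[nb]))  # recomputed, never stored
                heapq.heappush(frontier, (dn, nb))
                results.append((dn, nb))
                results.sort()
    return [chunks[i] for _, i in results[:k]]


chunks = ["aaaa", "aabb", "bbbb", "cccc"]
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(search(chunks, graph, "bbb"))  # ['bbbb']
```

The trade-off in this sketch is extra embedding computation per query in exchange for storing no vectors at all; the abstract reports that latency stays comparable for RAG workloads.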

Tue 19 May 9:25 - 9:45 PDT

BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding

Jiayi Yuan ⋅ Cameron Shinn ⋅ Kai Xu ⋅ Jingze Cui ⋅ George Klimiashvili ⋅ Guangxuan Xiao ⋅ Perkz Zheng ⋅ Bo Li ⋅ Zhou Yuxin ⋅ Zhouhai Ye ⋅ Weijie You ⋅ Tian Zheng ⋅ Dominic Brown ⋅ Pengbo Wang ⋅ Markus Hoehnerbach ⋅ Richard Cai ⋅ Julien Demouth ⋅ John D. Owens ⋅ Xia Hu ⋅ Song Han ⋅ Timmy Liu ⋅ Huizi Mao

The growing demand for long-context inference capabilities in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the self-attention mechanism. To address this challenge, we introduce BLASST, a drop-in, dynamic sparse attention mechanism that accelerates inference by using only a fixed scalar threshold to skip attention blocks. Our method targets practical inference deployment by removing the barriers to adoption present in existing works. As such, BLASST eliminates training requirements, avoids expensive pre-computation passes, accelerates both prefill and decode across all major attention variants (MHA, GQA, MQA, and MLA), provides optimized support for modern hardware, and easily integrates into existing frameworks. This is achieved by reusing online softmax statistics to identify negligible attention scores, skipping softmax, value block loads, and the subsequent matrix multiplication. We demonstrate the BLASST algorithm by delivering optimized kernels with negligible latency overhead. Our automated threshold calibration procedure reveals a simple inverse relationship between optimal threshold and context length, meaning we require only a single threshold each for prefill and decode per model. Preserving benchmark accuracy, we demonstrate a 1.52x speedup for prefill at 71.9% sparsity and a 1.48x speedup for decode at 73.2% sparsity on modern GPUs.
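The skipping rule can be illustrated for a single query with a short pure-Python sketch. It assumes one plausible reading of the abstract: a key block is skipped when its maximum score falls so far below the running maximum from online softmax that its largest weight would be under the threshold. The function name `blocked_attention`, its parameters, and the exact form of the test are assumptions, not the paper's kernel; the usual 1/sqrt(d) score scaling is omitted for brevity.

```python
import math


def blocked_attention(q, keys, values, block=2, threshold=0.0):
    """Single-query attention over key/value blocks with optional block skipping.

    Returns (output_vector, number_of_skipped_blocks).
    """
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(values[0])
    skipped = 0
    for start in range(0, len(keys), block):
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys[start:start + block]]
        block_max = max(scores)
        # Assumed skip test, done in the log domain: the block's largest
        # weight relative to the running max would be below `threshold`.
        if threshold > 0.0 and block_max - running_max < math.log(threshold):
            skipped += 1
            continue  # no exponentials, no value loads, no accumulation
        new_max = max(running_max, block_max)
        rescale = math.exp(running_max - new_max)  # online-softmax correction
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + block]):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc], skipped


q = [1.0, 0.0]
keys = [[0.0, 0.0], [1.0, 0.0], [-20.0, 0.0], [-21.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
dense, _ = blocked_attention(q, keys, values)
sparse, skipped = blocked_attention(q, keys, values, threshold=1e-3)
print(skipped)  # 1
```

In this toy case the second block is skipped and the sparse output differs from the dense one by far less than 1e-6, because the skipped scores sit about 21 units below the running maximum.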

Tue 19 May 9:45 - 10:05 PDT

ExecuTorch - A Unified PyTorch Solution to Run ML Models On-Device

Mergen Nachin ⋅ Digant Desai ⋅ Sicheng Jia ⋅ Chen Lai ⋅ Mengwei Liu ⋅ Jacob Szwejbka ⋅ Raziel Alvarez ⋅ Robert Ascani ⋅ Dave Bort ⋅ Manuel Candales ⋅ Andrew Caples ⋅ Yanan Cao ⋅ Zhengxu Chen ⋅ Soumith Chintala ⋅ Gregory Comer ⋅ Tanvir Islam ⋅ Songhao Jia ⋅ Tarun Karuturi ⋅ Jack Khuu ⋅ Abhinay Kukkadapu ⋅ Tugsbayasgalan Manlaibaatar ⋅ Andrew Or ⋅ Kimish Patel ⋅ Siddartha Pothapragada ⋅ Lucy Qiu ⋅ Supriya Rao ⋅ Orion Reblitz-Richardson ⋅ Max Ren ⋅ Scott Roy ⋅ Anthony Shoumikhin ⋅ Scott Wolchok ⋅ Guang Yang ⋅ Angela Yi ⋅ Martin Yuan ⋅ Hansong Zhang ⋅ Jack Zhang ⋅ Zhenrui Zhang ⋅ Shunting Zhang ⋅ Cemal Bilgin

Local execution of AI on edge devices is critical for privacy, low latency, and offline operation. However, deploying models on diverse hardware remains fragmented, often requiring model conversion or complete reimplementation outside the PyTorch ecosystem where the model was originally authored. We introduce ExecuTorch, a unified PyTorch-native deployment framework for edge AI. ExecuTorch enables seamless deployment of machine learning models across heterogeneous compute environments. It scales from completely embedded microcontrollers to complex system-on-chips (SoCs) with dedicated accelerators, powering devices ranging from wearables and smartphones to large compute clusters. ExecuTorch preserves PyTorch semantics while supporting customization, optimizations such as quantization, and pluggable execution "backends". Together, these features enable fast experimentation, allowing researchers to validate deployment behavior entirely within PyTorch and bridging the gap between research and production.

Grand Ballroom 1