Recorded Events
Discover all conference events with available recordings
74 recorded events
We use SlidesLive to live stream conference events and host the recordings. All recordings become freely available on our website 30 days after the conference ends and can be viewed at any time.
Filter by Event Type
Closing Remarks (1 event)
Closing Remarks
Industry (1 event)
Industry Lightning Talks
Invited Talk (4 events)
Extreme PyTorch: Inside the Most Demanding ML Workloads—and the Open Challenges in Building AI Agents to Democratize Them
Presenter:
Soumith Chintala
An AI stack: from scaling AI workloads to evaluating LLMs
Presenter:
Ion Stoica
Hardware-aware training and inference for large-scale AI
Presenter:
Animashree Anandkumar
Opening Remarks (2 events)
Opening Remarks - Young Professional Symposium
Opening Remarks
Panel Discussion (1 event)
Poster (60 events)
A Bring-Your-Own-Model Approach for ML-Driven Storage Placement in Warehouse-Scale Computers
Presenters:
Chenxi Yang
Yan Li
Martin Maas
Mustafa Uysal
Ubaid Hafeez
Arif Merchant
Richard McDougall
AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine
Presenters:
Carlo Siebenschuh
Kyle Hippe
Ozan Gokdemir
Alexander Brace
Arham Khan
Khalid Hossain
Yadu Babuji
Nicholas Chia
Venkatram Vishwanath
Arvind Ramanathan
Rick Stevens
Ian Foster
Robert Underwood
AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution
Presenters:
Zhiqiang Xie
Hao Kang
Ying Sheng
Tushar Krishna
Kayvon Fatahalian
Christos Kozyrakis
AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds
Presenters:
Yinfang Chen
Manish Shetty
Gagan Somashekar
Minghua Ma
Yogesh Simmhan
Jonathan Mace
Chetan Bansal
Rujia Wang
S R
APOLLO: SGD-like Memory, AdamW-level Performance
Presenters:
Hanqing Zhu
Zhenyu Zhang
Wenyan Cong
Xi Liu
Sem Park
Vikas Chandra
Bo Long
David Pan
Atlas Wang
Jinwon Lee
Balancing Pipeline Parallelism with Vocabulary Parallelism
Presenters:
Man Tsung Yeung
Penghui Qi
Min Lin
Xinyi Wan
COMET: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
Presenters:
Shulai Zhang
Ningxin Zheng
Haibin Lin
Ziheng Jiang
Wenlei Bao
Chengquan Jiang
Qi Hou
Weihao Cui
Size Zheng
Li-Wen Chang
Quan Chen
Xin Liu
Context Parallelism for Scalable Million-Token Inference
Presenters:
Amy Yang
Jingyi Yang
Aya Ibrahim
Xinfeng Xie
Bangsheng Tang
Grigory Sizov
Jongsoo Park
Jianyu Huang
DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling
Presenters:
Sohaib Ahmad
Qizheng Yang
Haoliang Wang
Ramesh Sitaraman
Hui Guan
Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking
Presenters:
Marco Federici
Davide Belli
Mart van Baalen
Amir Jalalirad
Andrii Skliar
Bence Major
Markus Nagel
Paul Whatmough
Efficient On-Device Machine Learning with a Biologically-Plausible Forward-Only Algorithm
Presenters:
Baichuan Huang
Amir Aminifar
Enabling Unstructured Sparse Acceleration on Structured Sparse Accelerators
Presenters:
Geonhwa Jeong
Po-An Tsai
Abhimanyu Rajeshkumar Bambhaniya
Stephen Keckler
Tushar Krishna
FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference
Presenters:
Zaifeng Pan
Yitong Ding
Yue Guan
Zheng Wang
Zhongkai Yu
Xulong Tang
Yida Wang
Yufei Ding
FedProphet: Memory-Efficient Federated Adversarial Training via Robust and Consistent Cascade Learning
Presenters:
Minxue Tang
Yitu Wang
Jingyang Zhang
Louis DiValentin
Aolin Ding
Amin Hass
Yiran Chen
Hai Li
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
Presenters:
Zihao Ye
Lequn Chen
Ruihang Lai
Wuwei Lin
Yineng Zhang
Stephanie Wang
Tianqi Chen
Baris Kasikci
Vinod Grover
Arvind Krishnamurthy
Luis Ceze
FlexAttention: A Programming Model for Generating Fused Attention Variants
Presenters:
Juechu Dong
Boyuan Feng
Driss Guessous
Yanbo Liang
Horace He
FlexInfer: Flexible LLM Inference with CPU Computations
Presenters:
Seonjin Na
Geonhwa Jeong
Byung Hoon Ahn
Aaron Jezghani
Jeffrey Young
Christopher Hughes
Tushar Krishna
Hyesoon Kim
FLStore: Efficient Federated Learning Storage for non-training workloads
Presenters:
Ahmad Faraz Khan
Samuel Fountain
Ahmed Mohamed Abdelmoniem Sayed
Ali R. Butt
Ali Anwar
Graph Learning at Scale: Characterizing and Optimizing Pre-Propagation GNNs
Presenters:
Zichao Yue
Chenhui Deng
Zhiru Zhang
GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism
Presenters:
Sandeep Polisetty
Juelin Liu
Yi Fung
Seung-Hwan Lim
Hui Guan
Marco Serafini
HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression
Presenters:
Yujin Wang
Shunan Dong
Zongle Huang
Yichen You
Liu He
Huazhong Yang
Yongpan Liu
Hongyang Jia
Interference-aware Edge Runtime Prediction with Conformal Matrix Completion
Presenters:
Tianshu Huang
Arjun Ramesh
Emily Ruppel
Nuno Pereira
Anthony Rowe
Carlee Joe-Wong
LAVA: Lifetime-Aware VM Allocation with Learned Distributions and Adaptation to Mispredictions
Presenters:
Jianheng Ling
Pratik Worah
Yawen Wang
Yunchuan Kong
Chunlei Wang
Clifford Stein
Diwakar Gupta
Jason Behmer
Logan Bush
Prakash Ramanan
Rajesh Kumar
Thomas Chestna
Yajing Liu
Ying Liu
Ye Zhao
Kathryn S. McKinley
Meeyoung Park
Martin Maas
LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers
Presenters:
Rya Sanovar
Srikant Bharadwaj
Renée St. Amant
Victor Ruehle
Saravan Rajmohan
Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers
Presenters:
Francesco Daghero
Daniele Jahier Pagliari
Francesco Conti
Luca Benini
Massimo Poncino
Alessio Burrello
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
Presenters:
Shang Yang
Junxian Guo
Haotian Tang
Qinghao Hu
Guangxuan Xiao
Jiaming Tang
Yujun Lin
Zhijian Liu
Yao Lu
Song Han
Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training
Presenters:
Mingyu Liang
Hiwot Kassa
Wenyin Fu
Brian Coutinho
Louis Feng
Christina Delimitrou
Marconi: Prefix Caching for the Era of Hybrid LLMs
Presenters:
Rui Pan
Zhuang Wang
Zhen Jia
Can Karakus
Luca Zancato
Tri Dao
Yida Wang
Ravi Netravali
MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices
Presenters:
Mohammadali Shakerdargah
Shan Lu
Chao Gao
Di Niu
MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs
Presenters:
Abhishek Moitra
Arkapravo Ghosh
Shrey Agrawal
Aporva Amarnath
Karthik Swaminathan
Priyadarshini Panda
MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators
Presenters:
Beichen Huang
Yueming Yuan
Zelei Shao
Minjia Zhang
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
Presenters:
Xuanlin Jiang
Yang Zhou
Shiyi Cao
Ion Stoica
Minlan Yu
On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions
Presenters:
Maximilian Böther
Abe Sebastian
Pranjal Awasthi
Ana Klimovic
Srikumar Ramalingam
Optimizing LLM Queries in Relational Data Analytics Workloads
Presenters:
Shu Liu
Asim Biswal
Audrey Cheng
Amog Kamsetty
Luis Gaspar Schroeder
Liana Patel
Shiyi Cao
Xiangxi Mo
Ion Stoica
Joseph Gonzalez
Matei Zaharia
Photon: Federated LLM Pre-Training
Presenters:
Lorenzo Sani
Alex Iacob
Zeyu Cao
Royson Lee
Bill Marino
Yan Gao
Wanru Zhao
Dongqi Cai
Zexi Li
Xinchi Qiu
Nic Lane
PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training
Presenters:
Daiyaan Arfeen
Zhen Zhang
Xinwei Fu
Gregory R. Ganger
Yida Wang
ProtoRAIL: A Risk-cognizant Imitation Agent for Adaptive vCPU Oversubscription In the Cloud
Presenters:
Lu Wang
Mayukh Das
Fangkai Yang
Bo Qiao
Hang Dong
Si Qin
Victor Ruehle
Chetan Bansal
Eli Cortez
Íñigo Goiri
S R
Qingwei Lin
Dongmei Zhang
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Presenters:
Yujun Lin
Haotian Tang
Shang Yang
Zhekai Zhang
Guangxuan Xiao
Chuang Gan
Song Han
Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training
Presenters:
Mingkai Zheng
Zhao Zhang
ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation
Presenters:
Zhiyu Mei
Wei Fu
Kaiwei Li
Guangju Wang
Huanchen Zhang
Yi Wu
Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving
Presenters:
Wei Gao
Xinyu Zhou
Peng Sun
Tianwei Zhang
Yonggang Wen
Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
Presenters:
Xinyi Zhang
Hanyu Zhao
Wencong Xiao
Xianyan Jia
Fei Xu
Yong Li
Wei Lin
Fangming Liu
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
Presenters:
Qianchao Zhu
Jiangfei Duan
Chang Chen
Siran Liu
Xiuhong Li
Guanyu Feng
Xin Lv
Xiao Chuanfu
Dahua Lin
Chao Yang
ScaleFusion: Scalable Inference of Spatial-Temporal Diffusion Transformers for High-Resolution Long Video Generation
Presenters:
Jiacheng Yang
Jun Wu
Zhen Zhang
Xinwei Fu
Zhiying Xu
Zhen Jia
Yida Wang
Gennady Pekhimenko
Scaling Deep Learning Training with MPMD Pipeline Parallelism
Presenters:
Anxhelo Xhebraj
Sean Lee
Hanfeng Chen
Vinod Grover
Seesaw: High-throughput LLM Inference via Model Re-sharding
Presenters:
Qidong Su
Wei Zhao
Xin Li
Muralidhar Andoorveedu
Chenhao Jiang
Zhanda Zhu
Kevin Song
Christina Giannoula
Gennady Pekhimenko
Self-Data Distillation for Recovering Quality in Pruned Large Language Models
Presenters:
Vithursan Thangarasa
Ganesh Venkatesh
Mike Lasby
Nish Sinnadurai
Sean Lie
SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling
Presenters:
Ke Hong
Xiuhong Li
Lufang Chen
Qiuli Mao
Guohao Dai
Xuefei Ning
Shengen Yan
Yun Liang
Yu Wang
SparseTransX: Efficient Training of Translation-Based Knowledge Graph Embeddings Using Sparse Matrix Operations
Presenters:
Md Saidul Hoque Anik
Ariful Azad
Supply-Chain Attacks in Machine Learning Frameworks
Presenters:
Yue Gao
Ilia Shumailov
Kassem Fawaz
SwiftVI: Time-Efficient Planning and Learning with MDPs
Presenters:
Kasper Overgaard Mortensen
Konstantinos Skitsas
Emil Morre Christensen
Mohammad Sadegh Talebi
Andreas Pavlogiannis
Davide Mottin
Panagiotis Karras
The Hidden Bloat in Machine Learning Systems
Presenters:
Huaifeng Zhang
Ahmed Ali-Eldin Hassan
ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments
Presenters:
Youhe Jiang
Fangcheng Fu
Xiaozhe Yao
Taiyi Wang
Bin Cui
Ana Klimovic
Eiko Yoneki
TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
Presenters:
Size Zheng
Jin Fang
Xuegui Zheng
Qi Hou
Wenlei Bao
Ningxin Zheng
Ziheng Jiang
Dongyang Wang
Jianxi Ye
Haibin Lin
Li-Wen Chang
Xin Liu
Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
Presenters:
Jinghan Yao
Sam Jacobs
Masahiro Tanaka
Olatunji Ruwase
Hari Subramoni
Dhabaleswar Panda
TurboAttention: Efficient Attention Approximation for High-Throughput LLMs
Presenters:
Hao Kang
Srikant Bharadwaj
James Hensman
Tushar Krishna
Victor Ruehle
Saravan Rajmohan
Venn: Resource Management For Collaborative Learning Jobs
Presenters:
Jiachen Liu
Fan Lai
Eric Ding
Yiwen Zhang
Mosharaf Chowdhury
VoLUT: Efficient Volumetric Streaming Enhanced by LUT-Based Super-Resolution
Presenters:
Chendong Wang
Anlan Zhang
Yifan Yang
Lili Qiu
Yuqing Yang
Xinyang Jiang
Feng Qian
Suman Banerjee
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models
Presenters:
Yixin Dong
Charlie Ruan
Yaxing Cai
Ziyi Xu
Yilong Zhao
Ruihang Lai
Tianqi Chen
Youmu: Efficient Columnar Data Pipeline for LLM Training
Presenters:
Tianle Zhong
Jiechen Zhao
Qiang Su
Geoffrey Fox
Poster Session (1 event)
Poster Session and Reception - Young Professional Symposium
Talk (4 events)
LMArena: An Open Platform for Crowdsourced AI Benchmarks
Presenter:
Wei-Lin Chiang