MLSys 2025 List of Accepted Papers

Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking Session 7: Quantization and Sparsity
Marco Federici · Davide Belli · Mart van Baalen · Amir Jalalirad · Andrii Skliar · Bence Major · Markus Nagel · Paul Whatmough
Mission City Ballroom #38
Youmu: Efficient Columnar Data Pipeline for LLM Training Session 5: LLM training and fine-tuning
Tianle Zhong · Jiechen Zhao · Qiang Su · Geoffrey Fox
Mission City Ballroom #17
Efficient On-Device Machine Learning with a Biologically-Plausible Forward-Only Algorithm Session 6: Edge and Cloud Systems
Baichuan Huang · Amir Aminifar
Mission City Ballroom #25
FLStore: Efficient Federated Learning Storage for non-training workloads Session 11: Federated Learning
Ahmad Faraz Khan · Samuel Fountain · Ahmed Mohamed Abdelmoniem Sayed · Ali R. Butt · Ali Anwar
Mission City Ballroom #26
AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine Session 2: Parallel and Distributed Systems
Carlo Siebenschuh · Kyle Hippe · Ozan Gokdemir · Alexander Brace · Arham Khan · Khalid Hossain · Yadu Babuji · Nicholas Chia · Venkatram Vishwanath · Arvind Ramanathan · Rick Stevens · Ian Foster · Robert Underwood
Mission City Ballroom #60
MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices Session 11: Federated Learning
Mohammadali Shakerdargah · Shan Lu · Chao Gao · Di Niu
Mission City Ballroom #44
MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs Session 12: Edge and Cloud Systems
Abhishek Moitra · Arkapravo Ghosh · Shrey Agrawal · Aporva Amarnath · Karthik Swaminathan · Priyadarshini Panda
Mission City Ballroom #45
SwiftVI: Time-Efficient Planning and Learning with MDPs Session 6: Edge and Cloud Systems
Kasper Overgaard Mortensen · Konstantinos Skitsas · Emil Morre Christensen · Mohammad Sadegh Talebi · Andreas Pavlogiannis · Davide Mottin · Panagiotis Karras
Mission City Ballroom #10
DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling Session 1: LLM and Diffusion Model Serving
Sohaib Ahmad · Qizheng Yang · Haoliang Wang · Ramesh Sitaraman · Hui Guan
Mission City Ballroom #2
Graph Learning at Scale: Characterizing and Optimizing Pre-Propagation GNNs Session 12: Edge and Cloud Systems
Zichao Yue · Chenhui Deng · Zhiru Zhang
Mission City Ballroom #15
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention Session 7: Quantization and Sparsity
Qianchao Zhu · Jiangfei Duan · Chang Chen · Siran Liu · Xiuhong Li · Guanyu Feng · Xin Lv · Xiao Chuanfu · Dahua Lin · Chao Yang
Mission City Ballroom #31
Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers Session 7: Quantization and Sparsity
Francesco Daghero · Daniele Jahier Pagliari · Francesco Conti · Luca Benini · Massimo Poncino · Alessio Burrello
Mission City Ballroom #22
SparseTransX: Efficient Training of Translation-Based Knowledge Graph Embeddings Using Sparse Matrix Operations Session 7: Quantization and Sparsity
Md Saidul Hoque Anik · Ariful Azad
Mission City Ballroom #7
ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation Session 5: LLM training and fine-tuning
Zhiyu Mei · WEI FU · Kaiwei Li · Guangju Wang · Huanchen Zhang · Yi Wu
Mission City Ballroom #61
On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions Session 9: Parallel and Distributed Systems
Maximilian Böther · Abe Sebastian · Pranjal Awasthi · Ana Klimovic · Srikumar Ramalingam
Mission City Ballroom #56
Interference-aware Edge Runtime Prediction with Conformal Matrix Completion Session 4: Reliable and Scalable Systems
Tianshu Huang · Arjun Ramesh · Emily Ruppel · Nuno Pereira · Anthony Rowe · Carlee Joe-Wong
Mission City Ballroom #47
Enabling Unstructured Sparse Acceleration on Structured Sparse Accelerators Session 3: Quantization and Sparsity
Geonhwa Jeong · Po-An Tsai · Abhimanyu Rajeshkumar Bambhaniya · Stephen Keckler · Tushar Krishna
Mission City Ballroom #27
FlexInfer: Flexible LLM Inference with CPU Computations Session 8: LLM and Diffusion Model Serving
Seonjin Na · Geonhwa Jeong · Byung Hoon Ahn · Aaron Jezghani · Jeffrey Young · Christopher Hughes · Tushar Krishna · Hyesoon Kim
Mission City Ballroom #55
Balancing Pipeline Parallelism with Vocabulary Parallelism Session 9: Parallel and Distributed Systems
Man Tsung Yeung · Penghui Qi · Min Lin · Xinyi Wan
Mission City Ballroom #52
A Bring-Your-Own-Model Approach for ML-Driven Storage Placement in Warehouse-Scale Computers Session 6: Edge and Cloud Systems
Chenxi Yang · Yan Li · Martin Maas · Mustafa Uysal · Ubaid Hafeez · Arif Merchant · Richard McDougall
Mission City Ballroom #18
Scaling Deep Learning Training with MPMD Pipeline Parallelism Session 9: Parallel and Distributed Systems
Anxhelo Xhebraj · Sean Lee · Hanfeng Chen · Vinod Grover
Mission City Ballroom #32
LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers Session 1: LLM and Diffusion Model Serving
Rya Sanovar · Srikant Bharadwaj · Renée St. Amant · Victor Ruehle · Saravan Rajmohan
Mission City Ballroom #20
TurboAttention: Efficient attention approximation for high throughputs LLM Session 8: LLM and Diffusion Model Serving
Hao Kang · Srikant Bharadwaj · James Hensman · Tushar Krishna · Victor Ruehle · Saravan Rajmohan
Mission City Ballroom #39
Self-Data Distillation for Recovering Quality in Pruned Large Language Models Session 3: Quantization and Sparsity
Vithursan Thangarasa · Ganesh Venkatesh · Mike Lasby · Nish Sinnadurai · Sean Lie
Mission City Ballroom #42
The Hidden Bloat in Machine Learning Systems Session 4: Reliable and Scalable Systems
Huaifeng Zhang · Ahmed Ali-Eldin Hassan
Mission City Ballroom #51
Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training Session 5: LLM training and fine-tuning
Mingyu Liang · Hiwot Kassa · Wenyin Fu · Brian Coutinho · Louis Feng · Christina Delimitrou
Mission City Ballroom #49
Supply-Chain Attacks in Machine Learning Frameworks Session 12: Edge and Cloud Systems
Yue Gao · Ilia Shumailov · Kassem Fawaz
Mission City Ballroom #13
ProtoRAIL: A Risk-cognizant Imitation Agent for Adaptive vCPU Oversubscription In the Cloud Session 6: Edge and Cloud Systems
Lu Wang · Mayukh Das · Fangkai Yang · Bo Qiao · Hang Dong · Si Qin · Victor Ruehle · Chetan Bansal · Eli Cortez · Íñigo Goiri · S R · Qingwei Lin · Dongmei Zhang
Mission City Ballroom #12
AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution Session 4: Reliable and Scalable Systems
Zhiqiang Xie · Hao Kang · Ying Sheng · Tushar Krishna · Kayvon Fatahalian · Christos Kozyrakis
Mission City Ballroom #46
Context Parallelism for Scalable Million-Token Inference Session 2: Parallel and Distributed Systems
Amy Yang · Jingyi Yang · Aya Ibrahim · Xinfeng Xie · Bangsheng Tang · Grigory Sizov · Jongsoo Park · Jianyu Huang
Mission City Ballroom #34
Optimizing LLM Queries in Relational Data Analytics Workloads Session 6: Edge and Cloud Systems
Shu Liu · Asim Biswal · Audrey Cheng · Amog Kamsetty · Luis Gaspar Schroeder · Liana Patel · Shiyi Cao · Xiangxi Mo · Ion Stoica · Joseph Gonzalez · Matei Zaharia
Mission City Ballroom #28
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference Session 10: LLM and Diffusion Model Serving
Xuanlin Jiang · Yang Zhou · Shiyi Cao · Ion Stoica · Minlan Yu
Mission City Ballroom #59
Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer Session 5: LLM training and fine-tuning
Jinghan Yao · Sam Jacobs · Masahiro Tanaka · Olatunji Ruwase · Hari Subramoni · Dhabaleswar Panda
Mission City Ballroom #21
FlexAttention: A Programming Model for Generating Fused Attention Variants Session 10: LLM and Diffusion Model Serving
Juechu Dong · BOYUAN FENG · Driss Guessous · Yanbo Liang · Horace He
Mission City Ballroom #3
AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds Session 4: Reliable and Scalable Systems
Yinfang Chen · Manish Shetty · Gagan Somashekar · Minghua Ma · Yogesh Simmhan · Jonathan Mace · Chetan Bansal · Rujia Wang · S R
Mission City Ballroom #4
FedProphet: Memory-Efficient Federated Adversarial Training via Robust and Consistent Cascade Learning Session 11: Federated Learning
Minxue Tang · Yitu Wang · Jingyang Zhang · Louis DiValentin · Aolin Ding · Amin Hass · Yiran Chen · Hai Li
Mission City Ballroom #24
Venn: Resource Management For Collaborative Learning Jobs Session 11: Federated Learning
Jiachen Liu · Fan Lai · Eric Ding · Yiwen Zhang · Mosharaf Chowdhury
Mission City Ballroom #50
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models Session 10: LLM and Diffusion Model Serving
Yixin Dong · Charlie Ruan · Yaxing Cai · Ziyi Xu · Yilong Zhao · Ruihang Lai · Tianqi Chen
Mission City Ballroom #54
ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments Session 10: LLM and Diffusion Model Serving
YOUHE JIANG · Fangcheng Fu · Xiaozhe Yao · Taiyi Wang · Bin CUI · Ana Klimovic · Eiko Yoneki
Mission City Ballroom #5
GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism Session 2: Parallel and Distributed Systems
Sandeep Polisetty · Juelin Liu · Yi Fung · Seung-Hwan Lim · Hui Guan · Marco Serafini
Mission City Ballroom #40
HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression Session 5: LLM training and fine-tuning
Yujin Wang · Shunan Dong · Zongle Huang · Yichen You · Liu He · Huazhong Yang · Yongpan Liu · Hongyang Jia
Mission City Ballroom #35
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention Session 7: Quantization and Sparsity
Shang Yang · Junxian Guo · Haotian Tang · Qinghao Hu · Guangxuan Xiao · Jiaming Tang · Yujun Lin · Zhijian Liu · Yao Lu · Song Han
Mission City Ballroom #19
LAVA: Lifetime-Aware VM Allocation with Learned Distributions and Adaptation to Mispredictions Session 12: Edge and Cloud Systems
Jianheng Ling · Pratik Worah · Yawen Wang · Yunchuan Kong · Chunlei Wang · Clifford Stein · Diwakar Gupta · Jason Behmer · Logan Bush · Prakash Ramanan · Rajesh Kumar · Thomas Chestna · Yajing Liu · Ying Liu · Ye Zhao · Kathryn S. McKinley · Meeyoung Park · Martin Maas
Mission City Ballroom #8
Know Where You’re Uncertain When Planning with Multimodal Foundation Models: A Formal Framework Session 4: Reliable and Scalable Systems
Neel P. Bhatt · Yunhao Yang · Rohan Siva · Daniel Milan · Ufuk Topcu · Atlas Wang
Mission City Ballroom #16
VoLUT: Efficient Volumetric streaming enhanced by LUT-based super-resolution Session 12: Edge and Cloud Systems
Chendong Wang · Anlan Zhang · Yifan Yang · Lili Qiu · Yuqing Yang · XINYANG JIANG · Feng Qian · Suman Banerjee
Mission City Ballroom #14
SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling Session 8: LLM and Diffusion Model Serving
Ke Hong · Xiuhong Li · Lufang Chen · Qiuli Mao · Guohao Dai · Xuefei Ning · Shengen Yan · Yun Liang · Yu Wang
Mission City Ballroom #58
MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators Session 3: Quantization and Sparsity
Beichen Huang · Yueming Yuan · ZELEI SHAO · Minjia Zhang
Mission City Ballroom #23
Photon: Federated LLM Pre-Training Session 11: Federated Learning
Lorenzo Sani · Alex Iacob · Zeyu Cao · Royson Lee · Bill Marino · Yan Gao · Wanru Zhao · Dongqi Cai · Zexi Li · Xinchi Qiu · Nic Lane
Mission City Ballroom #9
Seesaw: High-throughput LLM Inference via Model Re-sharding Session 8: LLM and Diffusion Model Serving
Qidong Su · Wei Zhao · Xin Li · Muralidhar Andoorveedu · Chenhao Jiang · Zhanda Zhu · Kevin Song · Christina Giannoula · Gennady Pekhimenko
Mission City Ballroom #36
PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training Session 2: Parallel and Distributed Systems
Daiyaan Arfeen · Zhen Zhang · Xinwei Fu · Gregory R. Ganger · Yida Wang
Mission City Ballroom #6
Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training Session 3: Quantization and Sparsity
Mingkai Zheng · Zhao Zhang
Mission City Ballroom #33
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving Session 3: Quantization and Sparsity
Yujun Lin · Haotian Tang · Shang Yang · Zhekai Zhang · Guangxuan Xiao · Chuang Gan · Song Han
Mission City Ballroom #1
Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling Session 2: Parallel and Distributed Systems
Xinyi Zhang · Hanyu Zhao · Wencong Xiao · Xianyan Jia · Fei Xu · Yong Li · Wei Lin · Fangming Liu
Mission City Ballroom #57
ScaleFusion: Scalable Inference of Spatial-Temporal Diffusion Transformers for High-Resolution Long Video Generation Session 8: LLM and Diffusion Model Serving
Jiacheng Yang · Jun Wu · Zhen Zhang · Xinwei Fu · Zhiying Xu · Zhen Jia · Yida Wang · Gennady Pekhimenko
Mission City Ballroom #37
APOLLO: SGD-like Memory, AdamW-level Performance Session 5: LLM training and fine-tuning
Hanqing Zhu · Zhenyu Zhang · Wenyan Cong · Xi Liu · Sem Park · Vikas Chandra · Bo Long · David Pan · Atlas Wang · Jinwon Lee
Mission City Ballroom #48
Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving Session 1: LLM and Diffusion Model Serving
Wei Gao · Xinyu Zhou · Peng Sun · Tianwei Zhang · Yonggang Wen
Mission City Ballroom #53
Marconi: Prefix Caching for the Era of Hybrid LLMs Session 10: LLM and Diffusion Model Serving
Rui Pan · Zhuang Wang · Zhen Jia · Can Karakus · Luca Zancato · Tri Dao · Yida Wang · Ravi Netravali
Mission City Ballroom #29
FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference Session 1: LLM and Diffusion Model Serving
Zaifeng Pan · Yitong Ding · Yue Guan · Zheng Wang · Zhongkai Yu · Xulong Tang · Yida Wang · Yufei Ding
Mission City Ballroom #11
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving Session 1: LLM and Diffusion Model Serving
Zihao Ye · Lequn Chen · Ruihang Lai · Wuwei Lin · Yineng Zhang · Stephanie Wang · Tianqi Chen · Baris Kasikci · Vinod Grover · Arvind Krishnamurthy · Luis Ceze
Mission City Ballroom #30
TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives Session 9: Parallel and Distributed Systems
Size Zheng · Jin Fang · Xuegui Zheng · Qi Hou · Wenlei Bao · Ningxin Zheng · Ziheng Jiang · Dongyang Wang · Jianxi Ye · Haibin Lin · Li-Wen Chang · Xin Liu
Mission City Ballroom #41
COMET: Fine-grained Computation-communication Overlapping for Mixture-of-Experts Session 9: Parallel and Distributed Systems
Shulai Zhang · Ningxin Zheng · Haibin Lin · Ziheng Jiang · Wenlei Bao · Chengquan Jiang · Qi Hou · Weihao Cui · Size Zheng · Li-Wen Chang · Quan Chen · Xin Liu
Mission City Ballroom #43