
MLSys 2025 List of Accepted Papers

Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling (Session 2: Parallel and Distributed Systems)
Xinyi Zhang · Hanyu Zhao · Wencong Xiao · Xianyan Jia · Fei Xu · Yong Li · Wei Lin · Fangming Liu
Interference-aware Edge Runtime Prediction with Conformal Matrix Completion (Session 4: Measurement and Analysis)
Tianshu Huang · Arjun Ramesh · Emily Ruppel · Nuno Pereira · Anthony Rowe · Carlee Joe-Wong
Enabling Unstructured Sparse Acceleration on Structured Sparse Accelerators (Session 3: Quantization and Sparsity)
Geonhwa Jeong · Po-An Tsai · Abhimanyu Rajeshkumar Bambhaniya · Stephen Keckler · Tushar Krishna
FlexInfer: Flexible LLM Inference with CPU Computations (Session 8: LLM and Diffusion Model Serving)
Seonjin Na · Geonhwa Jeong · Byung Hoon Ahn · Aaron Jezghani · Jeffrey Young · Christopher Hughes · Tushar Krishna · Hyesoon Kim
A Practical Cross-Layer Approach for ML-Driven Storage Placement in Warehouse-Scale Computers (Session 6: Edge and Cloud Systems)
Chenxi Yang · Yan Li · Martin Maas · Mustafa Uysal · Ubaid Hafeez · Arif Merchant · Richard McDougall
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving (Session 1: LLM and Diffusion Model Serving)
Zihao Ye · Lequn Chen · Ruihang Lai · Wuwei Lin · Yineng Zhang · Stephanie Wang · Tianqi Chen · Baris Kasikci · Vinod Grover · Arvind Krishnamurthy · Luis Ceze
AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution (Session 4: Measurement and Analysis)
Zhiqiang Xie · Hao Kang · Ying Sheng · Tushar Krishna · Kayvon Fatahalian · Christos Kozyrakis
TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives (Session 9: Parallel and Distributed Systems)
Size Zheng · Jin Fang · Xuegui Zheng · Qi Hou · Wenlei Bao · Ningxin Zheng · Ziheng Jiang · Dongyang Wang · Jianxi Ye · Haibin Lin · Li-Wen Chang · Xin Liu
MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators (Session 3: Quantization and Sparsity)
Beichen Huang · Yueming Yuan · Zelei Shao · Minjia Zhang
COMET: Fine-grained Computation-communication Overlapping for Mixture-of-Experts (Session 9: Parallel and Distributed Systems)
Shulai Zhang · Ningxin Zheng · Haibin Lin · Ziheng Jiang · Wenlei Bao · Chengquan Jiang · Qi Hou · Weihao Cui · Size Zheng · Li-Wen Chang · Quan Chen · Xin Liu
LAVA: Lifetime-Aware VM Allocation with Learned Distributions and Adaptation to Mispredictions (Session 12: Edge and Cloud Systems)
Jianheng Ling · Pratik Worah · Yawen Wang · Yunchuan Kong · Chunlei Wang · Clifford Stein · Diwakar Gupta · Jason Behmer · Logan Bush · Prakash Ramanan · Rajesh Kumar · Thomas Chestna · Yajing Liu · Ying Liu · Ye Zhao · Kathryn S. McKinley · Meeyoung Park · Martin Maas
Balancing Pipeline Parallelism with Vocabulary Parallelism (Session 9: Parallel and Distributed Systems)
Man Tsung Yeung · Penghui Qi · Min Lin · Xinyi Wan
Context Parallelism for Scalable Million-Token Inference (Session 2: Parallel and Distributed Systems)
Amy Yang · Jingyi Yang · Aya Ibrahim · Xinfeng Xie · Bangsheng Tang · Grigory Sizov · Jongsoo Park · Jianyu Huang
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving (Session 3: Quantization and Sparsity)
Yujun Lin · Haotian Tang · Shang Yang · Zhekai Zhang · Guangxuan Xiao · Chuang Gan · Song Han
Efficient On-Device Machine Learning with a Biologically-Plausible Forward-Only Algorithm (Session 6: Edge and Cloud Systems)
Baichuan Huang · Amir Aminifar
HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression (Session 5: LLM Training and Fine-Tuning)
Yujin Wang · Shunan Dong · Zongle Huang · Yichen You · Liu He · Huazhong Yang · Yongpan Liu · Hongyang Jia
FedProphet: Memory-Efficient Federated Adversarial Training via Robust and Consistent Cascade Learning (Session 11: Federated Learning)
Minxue Tang · Yitu Wang · Jingyang Zhang · Louis DiValentin · Aolin Ding · Amin Hass · Yiran Chen · Hai Li
Marconi: Prefix Caching for the Era of Hybrid LLMs (Session 10: LLM and Diffusion Model Serving)
Rui Pan · Zhuang Wang · Zhen Jia · Can Karakus · Luca Zancato · Tri Dao · Yida Wang · Ravi Netravali
Optimizing LLM Queries in Relational Data Analytics Workloads (Session 6: Edge and Cloud Systems)
Shu Liu · Asim Biswal · Audrey Cheng · Amog Kamsetty · Luis Gaspar Schroeder · Liana Patel · Shiyi Cao · Xiangxi Mo · Ion Stoica · Joseph Gonzalez · Matei Zaharia
ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation (Session 5: LLM Training and Fine-Tuning)
Zhiyu Mei · Wei Fu · Kaiwei Li · Guangju Wang · Huanchen Zhang · Yi Wu
Scaling Deep Learning Training with MPMD Pipeline Parallelism (Session 9: Parallel and Distributed Systems)
Anxhelo Xhebraj · Sean Lee · Hanfeng Chen · Vinod Grover
Self-Data Distillation for Recovering Quality in Pruned Large Language Models (Session 3: Quantization and Sparsity)
Vithursan Thangarasa · Ganesh Venkatesh · Mike Lasby · Nish Sinnadurai · Sean Lie
Supply-Chain Attacks in Machine Learning Frameworks (Session 12: Edge and Cloud Systems)
Yue Gao · Ilia Shumailov · Kassem Fawaz
Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training (Session 5: LLM Training and Fine-Tuning)
Mingyu Liang · Hiwot Kassa · Wenyin Fu · Brian Coutinho · Louis Feng · Christina Delimitrou
The Hidden Bloat in Machine Learning Systems (Session 4: Measurement and Analysis)
Huaifeng Zhang · Ahmed Ali-Eldin Hassan
Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving (Session 1: LLM and Diffusion Model Serving)
Wei Gao · Xinyu Zhou · Peng Sun · Tianwei Zhang · Yonggang Wen
SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling (Session 8: LLM and Diffusion Model Serving)
Ke Hong · Xiuhong Li · Lufang Chen · Qiuli Mao · Guohao Dai · Xuefei Ning · Shengen Yan · Yun Liang · Yu Wang
TurboAttention: Efficient Attention Approximation for High-Throughput LLMs (Session 8: LLM and Diffusion Model Serving)
Hao Kang · Srikant Bharadwaj · James Hensman · Tushar Krishna · Victor Ruehle · Saravan Rajmohan
Photon: Federated LLM Pre-Training (Session 11: Federated Learning)
Lorenzo Sani · Alex Iacob · Zeyu Cao · Royson Lee · Bill Marino · Yan Gao · Wanru Zhao · Dongqi Cai · Zexi Li · Xinchi Qiu · Nic Lane
ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments (Session 10: LLM and Diffusion Model Serving)
Youhe Jiang · Fangcheng Fu · Xiaozhe Yao · Taiyi Wang · Bin Cui · Ana Klimovic · Eiko Yoneki
Venn: Resource Management for Collaborative Learning Jobs (Session 11: Federated Learning)
Jiachen Liu · Fan Lai · Eric Ding · Yiwen Zhang · Mosharaf Chowdhury
SPA: Scaling Graph Neural Network Training on Large Graphs via Probabilistic Splitting (Session 2: Parallel and Distributed Systems)
Sandeep Polisetty · Juelin Liu · Yi Fung · Seung-Hwan Lim · Hui Guan · Marco Serafini
Youmu: Efficient Columnar Data Pipeline for LLM Training (Session 5: LLM Training and Fine-Tuning)
Tianle Zhong · Jiechen Zhao · Qiang Su · Geoffrey Fox
MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices (Session 11: Federated Learning)
Mohammadali Shakerdargah · Shan Lu · Chao Gao · Di Niu
On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions (Session 9: Parallel and Distributed Systems)
Maximilian Böther · Abe Sebastian · Pranjal Awasthi · Ana Klimovic · Srikumar Ramalingam
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference (Session 10: LLM and Diffusion Model Serving)
Xuanlin Jiang · Yang Zhou · Shiyi Cao · Ion Stoica · Minlan Yu
Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking (Session 7: Quantization and Sparsity)
Marco Federici · Davide Belli · Mart van Baalen · Amir Jalalirad · Andrii Skliar · Bence Major · Markus Nagel · Paul Whatmough
VoLUT: Efficient Volumetric Streaming Enhanced by LUT-based Super-Resolution (Session 12: Edge and Cloud Systems)
Chendong Wang · Anlan Zhang · Yifan Yang · Lili Qiu · Yuqing Yang · Xinyang Jiang · Feng Qian · Suman Banerjee
FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference (Session 1: LLM and Diffusion Model Serving)
Zaifeng Pan · Yitong Ding · Yue Guan · Zheng Wang · Zhongkai Yu · Xulong Tang · Yida Wang · Yufei Ding
AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine (Session 2: Parallel and Distributed Systems)
Carlo Siebenschuh · Kyle Hippe · Ozan Gokdemir · Alexander Brace · Arham Khan · Khalid Hossain · Yadu Babuji · Nicholas Chia · Venkatram Vishwanath · Arvind Ramanathan · Rick Stevens · Ian Foster · Robert Underwood
Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training (Session 3: Quantization and Sparsity)
Mingkai Zheng · Zhao Zhang
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models (Session 10: LLM and Diffusion Model Serving)
Yixin Dong · Charlie Ruan · Yaxing Cai · Ziyi Xu · Yilong Zhao · Ruihang Lai · Tianqi Chen
SwiftVI: Time-Efficient Planning and Learning with MDPs (Session 6: Edge and Cloud Systems)
Kasper Overgaard Mortensen · Konstantinos Skitsas · Emil Morre Christensen · Mohammad Sadegh Talebi · Andreas Pavlogiannis · Davide Mottin · Panagiotis Karras
FLStore: Efficient Federated Learning Storage for Non-Training Workloads (Session 11: Federated Learning)
Ahmad Faraz Khan · Samuel Fountain · Ahmed Mohamed Abdelmoniem Sayed · Ali R. Butt · Ali Anwar
APOLLO: SGD-like Memory, AdamW-level Performance (Session 5: LLM Training and Fine-Tuning)
Hanqing Zhu · Zhenyu Zhang · Wenyan Cong · Xi Liu · Sem Park · Vikas Chandra · Bo Long · David Pan · Atlas Wang · Jinwon Lee
LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers (Session 1: LLM and Diffusion Model Serving)
Rya Sanovar · Srikant Bharadwaj · Renée St. Amant · Victor Ruehle · Saravan Rajmohan
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention (Session 7: Quantization and Sparsity)
Shang Yang · Junxian Guo · Haotian Tang · Qinghao Hu · Guangxuan Xiao · Jiaming Tang · Yujun Lin · Zhijian Liu · Yao Lu · Song Han
ScaleFusion: Scalable Inference of Spatial-Temporal Diffusion Transformers for High-Resolution Long Video Generation (Session 8: LLM and Diffusion Model Serving)
Jiacheng Yang · Jun Wu · Zhen Zhang · Xinwei Fu · Zhiying Xu · Zhen Jia · Yida Wang · Gennady Pekhimenko
Seesaw: High-throughput LLM Inference via Model Re-sharding (Session 8: LLM and Diffusion Model Serving)
Qidong Su · Wei Zhao · Xin Li · Muralidhar Andoorveedu · Chenhao Jiang · Zhanda Zhu · Kevin Song · Christina Giannoula · Gennady Pekhimenko
MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs (Session 12: Edge and Cloud Systems)
Abhishek Moitra · Arkapravo Ghosh · Shrey Agrawal · Aporva Amarnath · Karthik Swaminathan · Priyadarshini Panda
FlexAttention: A Programming Model for Generating Fused Attention Variants (Session 10: LLM and Diffusion Model Serving)
Juechu Dong · Boyuan Feng · Driss Guessous · Yanbo Liang · Horace He
DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling (Session 1: LLM and Diffusion Model Serving)
Sohaib Ahmad · Qizheng Yang · Haoliang Wang · Ramesh Sitaraman · Hui Guan
AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds (Session 4: Measurement and Analysis)
Yinfang Chen · Manish Shetty · Gagan Somashekar · Minghua Ma · Yogesh Simmhan · Jonathan Mace · Chetan Bansal · Rujia Wang · S R
PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training (Session 2: Parallel and Distributed Systems)
Daiyaan Arfeen · Zhen Zhang · Xinwei Fu · Gregory R. Ganger · Yida Wang
SparseTransX: Efficient Training of Translation-Based Knowledge Graph Embeddings Using Sparse Matrix Operations (Session 7: Quantization and Sparsity)
Md Saidul Hoque Anik · Ariful Azad
ProtoRAIL: A Risk-cognizant Imitation Agent for Adaptive vCPU Oversubscription in the Cloud (Session 6: Edge and Cloud Systems)
Lu Wang · Mayukh Das · Fangkai Yang · Bo Qiao · Hang Dong · Si Qin · Victor Ruehle · Chetan Bansal · Eli Cortez · Íñigo Goiri · S R · Qingwei Lin · Dongmei Zhang
Graph Learning at Scale: Characterizing and Optimizing Pre-Propagation GNNs (Session 12: Edge and Cloud Systems)
Zichao Yue · Chenhui Deng · Zhiru Zhang
Know Where You’re Uncertain When Planning with Multimodal Foundation Models: A Formal Framework (Session 4: Measurement and Analysis)
Neel P. Bhatt · Yunhao Yang · Rohan Siva · Daniel Milan · Ufuk Topcu · Atlas Wang
Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer (Session 5: LLM Training and Fine-Tuning)
Jinghan Yao · Sam Jacobs · Masahiro Tanaka · Olatunji Ruwase · Hari Subramoni · Dhabaleswar Panda
Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers (Session 7: Quantization and Sparsity)
Francesco Daghero · Daniele Jahier Pagliari · Francesco Conti · Luca Benini · Massimo Poncino · Alessio Burrello
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention (Session 7: Quantization and Sparsity)
Qianchao Zhu · Jiangfei Duan · Chang Chen · Siran Liu · Xiuhong Li · Guanyu Feng · Xin Lv · Xiao Chuanfu · Dahua Lin · Chao Yang