MLSys 2024 List of Accepted Papers

Accelerating ReLU for MPC-Based Private Inference with a Communication-Efficient Sign Estimation Privacy and security
Kiwan Maeng (Pennsylvania State University) · G. Edward Suh (Meta AI)
FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics LLM 2
Ke Hong (Tsinghua University) · Guohao Dai (Shanghai Jiao Tong University) · Jiaming Xu (Shanghai Jiao Tong University) · Qiuli Mao (Tsinghua University) · Xiuhong Li (Peking University) · Jun Liu (Shanghai Jiao Tong University) · Kangdi Chen (Infinigence) · Yuhan Dong (Tsinghua University) · Yu Wang (Tsinghua University)
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation Measurement and Analysis
Yifei Xu (UCLA) · Yuning Chen (University of California, Merced) · Xumiao Zhang (University of Michigan) · Xianshang Lin (None) · Pan Hu (Alibaba Group) · Yunfei Ma (None) · Songwu Lu (University of California-Los Angeles) · Wan Du (University of California, Merced) · Zhuoqing Mao (University of Michigan) · Ennan Zhai (None) · Dennis Cai (None)
LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning Federated Learning
Shixiong Qi (University of California, Riverside) · K. K. Ramakrishnan (University of California, Riverside) · Myungjin Lee (Cisco Research)
Amey Agrawal (Georgia Tech) · Nitin Kedia (Microsoft Research India) · Jayashree Mohan (Research, Microsoft) · Ashish Panwar (Research, Microsoft) · Nipun Kwatra (Microsoft Research India) · Bhargav Gulavani (Indian Institute of Technology Bombay) · Ramachandran Ramjee (Microsoft Research) · Alexey Tumanov (Georgia Tech)
FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms ML for Systems
Haoran Qiu (UIUC) · Weichao Mao (University of Illinois, Urbana Champaign) · Archit Patke (University of Illinois at Urbana-Champaign) · Shengkun Cui (None) · Chen Wang (IBM Research) · Hubertus Franke (IBM Research) · Zbigniew Kalbarczyk (University of Illinois at Urbana-Champaign) · Tamer Basar (University of Illinois, Urbana Champaign) · Ravi Iyer (University of Illinois)
UniDM: A Unified Framework for Data Manipulation with Large Language Models ML for Systems
Yichen Qian (Alibaba Group) · Yongyi He (University of Science and Technology of China) · Rong Zhu (Harbin Institute of Technology) · Jintao Huang (University of Science and Technology of China) · Zhijian Ma (Alibaba Group) · Haibin Wang (Alibaba Group) · Yaohua Wang (Tsinghua University) · Xiuyu Sun (Shanghai Academy of Artificial Intelligence for Science) · Defu Lian (University of Science and Technology of China) · Bolin Ding (Alibaba Group) · Jingren Zhou (Alibaba Group)
FedTrans: Efficient Federated Learning via Multi-Model Transformation Federated Learning
Yuxuan Zhu (University of Illinois Urbana-Champaign) · Jiachen Liu (University of Michigan) · Mosharaf Chowdhury (University of Michigan, Ann Arbor) · Fan Lai (University of Illinois at Urbana-Champaign)
ACROBAT: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time Performance and Memory
Pratik Fegade (Carnegie Mellon University) · Tianqi Chen (CMU) · Phillip Gibbons (CMU) · Todd Mowry (Carnegie Mellon University)
JIT-Q: Just-in-time Quantization with Processing-In-Memory for Efficient ML Training Quantization and Compression 2
Mohamed Ibrahim (AMD Inc.) · Shaizeen Aga (AMD) · Ada Li (AMD) · Suchita Pati (Advanced Micro Devices Inc) · Mahzabeen Islam (AMD)
Proteus: Preserving Model Confidentiality during Graph Optimizations Privacy and security
Yubo Gao (University of Toronto) · Maryam Haghifam (Eaigle) · Christina Giannoula (University of Toronto) · Renbo Tu (University of Toronto) · Gennady Pekhimenko (University of Toronto) · Nandita Vijaykumar (Department of Computer Science, University of Toronto)
HeteroSwitch: Characterizing and Taming System-Induced Data Heterogeneity in Federated Learning Federated Learning
Gyudong Kim (Korea University) · Mehdi Ghasemi (Arizona State University) · Soroush Heidari (Arizona State University) · Seungryong Kim (Korea University) · Young Geun Kim (Korea University) · Sarma Vrudhula (Arizona State University) · Carole-Jean Wu (Meta)
Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference LLM 2
Muhammad Adnan (University of British Columbia) · Akhil Arunkumar (d-Matrix Corporation) · Gaurav Jain (University of Wisconsin - Madison) · Prashant Nair (University of British Columbia) · Ilya Soloveychik (School of Engineering and Applied Sciences, Harvard University) · Purushotham Kamath (dMatrix)
Fine-Tuning Language Models Using Formal Methods Feedback: A Use Case in Autonomous Systems Large Language Models 1
Yunhao Yang (University of Texas at Austin) · Neel P. Bhatt (University of Texas at Austin) · Tyler Ingebrand (University of Texas at Austin) · William Ward (University of Texas at Austin) · Steven Carr (University of Texas at Austin) · Atlas Wang (University of Texas at Austin) · Ufuk Topcu (University of Texas, Austin)
Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving Quantization and Compression 1
Yilong Zhao (Shanghai Jiaotong University) · Chien-Yu Lin (University of Washington) · Kan Zhu (University of Washington) · Zihao Ye (University of Washington) · Lequn Chen (University of Washington) · Size Zheng (Peking University) · Luis Ceze (University of Washington and OctoML) · Arvind Krishnamurthy (University of Washington) · Tianqi Chen (CMU) · Baris Kasikci (University of Michigan)
vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs Performance and Memory
Size Zheng (Peking University) · Renze Chen (Peking University) · Meng Li (Peking University) · Zihao Ye (University of Washington) · Luis Ceze (University of Washington and OctoML) · Yun Liang (Peking University)
AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration Quantization and Compression 1
Ji Lin (MIT/OpenAI) · Jiaming Tang (Shanghai Jiao Tong University) · Haotian Tang (MIT) · Shang Yang (Massachusetts Institute of Technology) · Wei-Ming Chen (Massachusetts Institute of Technology) · Wei-Chen Wang (MIT) · Guangxuan Xiao (MIT) · Xingyu Dang (Institute for Interdisciplinary Information Sciences, Tsinghua University) · Chuang Gan · Song Han (MIT)
SLoRA: Scalable Serving of Thousands of LoRA Adapters Large Language Models 1
Ying Sheng (Stanford University) · Shiyi Cao (University of California, Berkeley) · Dacheng Li (University of California, Berkeley) · Coleman Hooper (University of California, Berkeley) · Nicholas Lee (University of California, Berkeley) · Shuo Yang (Shanghai Jiaotong University) · Christopher Chou (University of California, Berkeley) · Banghua Zhu (University of California Berkeley) · Lianmin Zheng (UC Berkeley) · Kurt Keutzer (EECS, UC Berkeley) · Joseph Gonzalez (UC Berkeley) · Ion Stoica (University of California, Berkeley)
COMET: Neural Cost Model Explanation Framework Measurement and Analysis
Isha Chaudhary (Univ of Illinois Urbana-Champaign) · Alex Renda (Massachusetts Institute of Technology) · Charith Mendis (University of Illinois at Urbana-Champaign) · Gagandeep Singh (University of Illinois, Urbana Champaign)
Lancet: Accelerating Mixture-of-Experts Training by Overlapping Weight Gradient Computation and All-to-All Communication Parallel and Distributed 2
Chenyu Jiang (The University of Hong Kong) · Ye Tian (The University of Hong Kong) · Zhen Jia (Amazon) · Chuan Wu (The University of Hong Kong) · Yida Wang (Amazon) · Shuai Zheng (Amazon Web Services)
Jingtian Dang (Carnegie Mellon University) · Jianming Tong (Georgia Tech) · Anupam Golder (Intel) · Cong "Callie" Hao (Georgia Institute of Technology) · Arijit Raychowdhury (Georgia Institute of Technology) · Tushar Krishna (Georgia Institute of Technology)
L-GreCo: Layerwise-adaptive Gradient Compression For Efficient Data-parallel Deep Learning Parallel and Distributed 1
Ilia Markov (Institute of Science and Technology Austria) · Kaveh Alim (Massachusetts Institute of Technology) · Elias Frantar (Google DeepMind) · Dan Alistarh (Institute of Science and Technology)
Torch2Chip: An End-to-end Customizable Deep Neural Network Compression and Deployment Toolkit for Prototype Hardware Accelerator Design Quantization and Compression 2
Jian Meng (Cornell University) · Yuan Liao (Cornell University) · Anupreetham Anupreetham (Arizona State University) · Ahmed Hasssan (Arizona State University) · Shixing Yu (Cornell University) · Han-sok Suh (Cornell University) · Xiaofeng Hu (None) · Jae-sun Seo (Cornell Tech)
Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache Large Language Models 1
Zhenyu Zhang (University of Texas at Austin) · Shiwei Liu (University of Oxford) · Runjin Chen (University of Texas at Austin) · Bhavya Kailkhura (Lawrence Livermore National Laboratory) · Beidi Chen (FAIR/CMU) · Atlas Wang (University of Texas at Austin)
Efficient Post-training Quantization with FP8 Formats Quantization and Compression 2
Haihao Shen (Intel) · Naveen Mellempudi (Intel) · Xin He (Fudan University) · Qun Gao (Intel) · Chang Wang (Intel) · Mengni Wang (Shanghai Jiaotong University)
SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models Performance and Memory
Zhixu Du (Duke University) · Shiyu Li (Duke University) · Yuhao Wu (Duke University) · Xiangyu Jiang (Clemson University) · Jingwei Sun (Duke University) · Qilin Zheng (Duke University) · Yongkai Wu (Clemson University) · Ang Li (University of Maryland, College Park) · Hai Li (Duke University) · Yiran Chen (Duke University)
QMoE: Sub-1-Bit Compression of Trillion Parameter Models Quantization and Compression 1
Elias Frantar (Google DeepMind) · Dan Alistarh (Institute of Science and Technology)
DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines Parallel and Distributed 1
Ye Tian (The University of Hong Kong) · Zhen Jia (Amazon) · Ziyue Luo (The Ohio State University) · Yida Wang (Amazon) · Chuan Wu (The University of Hong Kong)
Punica: Multi-Tenant LoRA Serving Large Language Models 1
Lequn Chen (University of Washington) · Zihao Ye (University of Washington) · Yongji Wu (Duke University) · Danyang Zhuo (Duke University) · Luis Ceze (University of Washington and OctoML) · Arvind Krishnamurthy (University of Washington)
Does Compressing Activations Help Model Parallel Training? Measurement and Analysis
Song Bian (University of Wisconsin - Madison) · Dacheng Li (University of California, Berkeley) · Hongyi Wang (Carnegie Mellon University) · Eric Xing (Mohamed bin Zayed University of AI) · Shivaram Venkataraman (University of Wisconsin, Madison)
Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large Scale Recommendation Parallel and Distributed 2
Liang Luo (Meta) · Buyun Zhang (Facebook) · Michael Tsang (Meta) · Yinbin Ma (Meta) · Ching-Hsiang Chu (Meta) · Yuxin Chen (Meta) · Shen Li (Meta) · Yuchen Hao (Meta) · Yanli Zhao (Facebook) · Guna Lakshminarayanan (Meta) · Ellie Wen (Meta) · Jongsoo Park (Meta) · Dheevatsa Mudigere (NVIDIA) · Maxim Naumov (Meta)
HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices Parallel and Distributed 2
ZHAO XUANLEI (National University of Singapore) · Bin Jia (National University of Singapore) · Haotian Zhou (None) · Ziming Liu (National University of Singapore) · Shenggan Cheng (National University of Singapore) · Yang You (National University of Singapore)
Distributed Matrix-Based Sampling for Graph Neural Network Training Parallel and Distributed 1
Alok Tripathy (UC Berkeley / Lawrence Berkeley National Lab) · Katherine Yelick (University of California-Berkeley) · Aydin Buluc (Lawrence Berkeley National Lab)
VQPy: An Object-Oriented Approach to Modern Video Analytics ML for Systems
Shan Yu (UCLA) · Zhenting Zhu (University of California, Los Angeles) · Yu Chen (Nanjing University) · Hanchen Xu (UCLA) · Pengzhan Zhao (UCLA) · Yang Wang (Intel) · Arthi Padmanabhan (Harvey Mudd College) · Hugo Latapie (Cisco) · Harry Xu (BreezeML Inc. and UCLA)
On Latency Predictors for Neural Architecture Search ML for Systems
Yash Akhauri (Google) · Mohamed Abdelfattah (Cornell University)
Prompt Cache: Modular Attention Reuse for Low-Latency Inference LLM 2
In Gim (Yale University) · Guojun Chen (Yale University) · Seung-seob Lee (Yale University) · Nikhil Sarda (Google Labs) · Anurag Khandelwal (Yale University) · Lin Zhong (Rice University)
Schrodinger's FP: Training Neural Networks with Dynamic Floating-Point Containers Quantization and Compression 2
Milos Nikolic (University of Toronto) · Enrique Torres Sanchez (University of Toronto) · Jiahui Wang (University of Toronto) · Ali Hadi Zadeh (University of Toronto) · Mostafa Mahmoud (University of Toronto) · Ameer Abdelhadi (University of Toronto, University of St. Michael's College) · Kareem Ibrahim (University of Toronto) · Andreas Moshovos (University of Toronto)