

MLSys 2024 List of Accepted Papers

Accelerating ReLU for MPC-Based Private Inference with a Communication-Efficient Sign Estimation (Privacy and Security)
Kiwan Maeng · G. Edward Suh
DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines (Parallel and Distributed 1)
Ye Tian · Zhen Jia · Ziyue Luo · Yida Wang · Chuan Wu
Lancet: Accelerating Mixture-of-Experts Training by Overlapping Weight Gradient Computation and All-to-All Communication (Parallel and Distributed 2)
Chenyu Jiang · Ye Tian · Zhen Jia · Chuan Wu · Yida Wang · Shuai Zheng
AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration (Quantization and Compression 1)
Ji Lin · Jiaming Tang · Haotian Tang · Shang Yang · Wei-Ming Chen · Wei-Chen Wang · Guangxuan Xiao · Xingyu Dang · Chuang Gan · Song Han
COMET: Neural Cost Model Explanation Framework (Measurement and Analysis)
Isha Chaudhary · Alex Renda · Charith Mendis · Gagandeep Singh
FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms (ML for Systems)
Haoran Qiu · Weichao Mao · Archit Patke · Shengkun Cui · Chen Wang · Hubertus Franke · Zbigniew Kalbarczyk · Tamer Basar · Ravi Iyer
ACROBAT: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time (Performance and Memory)
Pratik Fegade · Tianqi Chen · Phillip Gibbons · Todd Mowry
SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models (Performance and Memory)
Zhixu Du · Shiyu Li · Yuhao Wu · Xiangyu Jiang · Jingwei Sun · Qilin Zheng · Yongkai Wu · Ang Li · Hai Li · Yiran Chen
LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning (Federated Learning)
Shixiong Qi · K. K. Ramakrishnan · Myungjin Lee
Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving (Quantization and Compression 1)
Yilong Zhao · Chien-Yu Lin · Kan Zhu · Zihao Ye · Lequn Chen · Size Zheng · Luis Ceze · Arvind Krishnamurthy · Tianqi Chen · Baris Kasikci
Efficient Post-training Quantization with FP8 Formats (Quantization and Compression 2)
Haihao Shen · Naveen Mellempudi · Xin He · Qun Gao · Chang Wang · Mengni Wang
Vidur: A Large-Scale Simulation Framework for LLM Inference (Measurement and Analysis)
Amey Agrawal · Nitin Kedia · Jayashree Mohan · Ashish Panwar · Nipun Kwatra · Bhargav Gulavani · Ramachandran Ramjee · Alexey Tumanov
VQPy: An Object-Oriented Approach to Modern Video Analytics (ML for Systems)
Shan Yu · Zhenting Zhu · Yu Chen · Hanchen Xu · Pengzhan Zhao · Yang Wang · Arthi Padmanabhan · Hugo Latapie · Harry Xu
Prompt Cache: Modular Attention Reuse for Low-Latency Inference (Large Language Models 2)
In Gim · Guojun Chen · Seung-seob Lee · Nikhil Sarda · Anurag Khandelwal · Lin Zhong
Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference (Large Language Models 2)
Muhammad Adnan · Akhil Arunkumar · Gaurav Jain · Prashant Nair · Ilya Soloveychik · Purushotham Kamath
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation (Measurement and Analysis)
Yifei Xu · Yuning Chen · Xumiao Zhang · Xianshang Lin · Pan Hu · Yunfei Ma · Songwu Lu · Wan Du · Zhuoqing Mao · Ennan Zhai · Dennis Cai
vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs (Performance and Memory)
Size Zheng · Renze Chen · Meng Li · Zihao Ye · Luis Ceze · Yun Liang
JIT-Q: Just-in-time Quantization with Processing-In-Memory for Efficient ML Training (Quantization and Compression 2)
Mohamed Ibrahim · Shaizeen Aga · Ada Li · Suchita Pati · Mahzabeen Islam
Fine-Tuning Language Models Using Formal Methods Feedback: A Use Case in Autonomous Systems (Large Language Models 1)
Yunhao Yang · Neel P. Bhatt · Tyler Ingebrand · William Ward · Steven Carr · Atlas Wang · Ufuk Topcu
Does Compressing Activations Help Model Parallel Training? (Measurement and Analysis)
Song Bian · Dacheng Li · Hongyi Wang · Eric Xing · Shivaram Venkataraman
SLoRA: Scalable Serving of Thousands of LoRA Adapters (Large Language Models 1)
Ying Sheng · Shiyi Cao · Dacheng Li · Coleman Hooper · Nicholas Lee · Shuo Yang · Christopher Chou · Banghua Zhu · Lianmin Zheng · Kurt Keutzer · Joseph Gonzalez · Ion Stoica
Distributed Matrix-Based Sampling for Graph Neural Network Training (Parallel and Distributed 1)
Alok Tripathy · Katherine Yelick · Aydin Buluc
UniDM: A Unified Framework for Data Manipulation with Large Language Models (ML for Systems)
Yichen Qian · Yongyi He · Rong Zhu · Jintao Huang · Zhijian Ma · Haibin Wang · Yaohua Wang · Xiuyu Sun · Defu Lian · Bolin Ding · Jingren Zhou
FedTrans: Efficient Federated Learning via Multi-Model Transformation (Federated Learning)
Yuxuan Zhu · Jiachen Liu · Mosharaf Chowdhury · Fan Lai
HeteroSwitch: Characterizing and Taming System-Induced Data Heterogeneity in Federated Learning (Federated Learning)
Gyudong Kim · Mehdi Ghasemi · Soroush Heidari · Seungryong Kim · Young Geun Kim · Sarma Vrudhula · Carole-Jean Wu
L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient Data-Parallel Deep Learning (Parallel and Distributed 1)
Ilia Markov · Kaveh Alim · Elias Frantar · Dan Alistarh
Torch2Chip: An End-to-end Customizable Deep Neural Network Compression and Deployment Toolkit for Prototype Hardware Accelerator Design (Quantization and Compression 2)
Jian Meng · Yuan Liao · Anupreetham Anupreetham · Ahmed Hasssan · Shixing Yu · Han-sok Suh · Xiaofeng Hu · Jae-sun Seo
Accurate Low-Degree Polynomial Approximation of Non-Polynomial Operators for Fast Private Inference in Homomorphic Encryption (Privacy and Security)
Jingtian Dang · Jianming Tong · Anupam Golder · Cong "Callie" Hao · Arijit Raychowdhury · Tushar Krishna
Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache (Large Language Models 1)
Zhenyu Zhang · Shiwei Liu · Runjin Chen · Bhavya Kailkhura · Beidi Chen · Atlas Wang
Proteus: Preserving Model Confidentiality during Graph Optimizations (Privacy and Security)
Yubo Gao · Maryam Haghifam · Christina Giannoula · Renbo Tu · Gennady Pekhimenko · Nandita Vijaykumar
Punica: Multi-Tenant LoRA Serving (Large Language Models 1)
Lequn Chen · Zihao Ye · Yongji Wu · Danyang Zhuo · Luis Ceze · Arvind Krishnamurthy
On Latency Predictors for Neural Architecture Search (ML for Systems)
Yash Akhauri · Mohamed Abdelfattah
QMoE: Sub-1-Bit Compression of Trillion Parameter Models (Quantization and Compression 1)
Elias Frantar · Dan Alistarh
FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics (Large Language Models 2)
Ke Hong · Guohao Dai · Jiaming Xu · Qiuli Mao · Xiuhong Li · Jun Liu · Kangdi Chen · Yuhan Dong · Yu Wang
Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large Scale Recommendation (Parallel and Distributed 2)
Liang Luo · Buyun Zhang · Michael Tsang · Yinbin Ma · Ching-Hsiang Chu · Yuxin Chen · Shen Li · Yuchen Hao · Yanli Zhao · Guna Lakshminarayanan · Ellie Wen · Jongsoo Park · Dheevatsa Mudigere · Maxim Naumov
HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices (Parallel and Distributed 2)
Xuanlei Zhao · Bin Jia · Haotian Zhou · Ziming Liu · Shenggan Cheng · Yang You
Schrodinger's FP: Training Neural Networks with Dynamic Floating-Point Containers (Quantization and Compression 2)
Milos Nikolic · Enrique Torres Sanchez · Jiahui Wang · Ali Hadi Zadeh · Mostafa Mahmoud · Ameer Abdelhadi · Kareem Ibrahim · Andreas Moshovos