MLSys 2024 Papers
SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
Does Compressing Activations Help Model Parallel Training?
FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics
Vidur: A Large-Scale Simulation Framework for LLM Inference
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation
LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning
FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms
UniDM: A Unified Framework for Data Manipulation with Large Language Models
ACROBAT: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time
FedTrans: Efficient Federated Learning via Multi-Model Transformation
JIT-Q: Just-in-time Quantization with Processing-In-Memory for Efficient ML Training
SLoRA: Scalable Serving of Thousands of LoRA Adapters
HeteroSwitch: Characterizing and Taming System-Induced Data Heterogeneity in Federated Learning
Fine-Tuning Language Models Using Formal Methods Feedback: A Use Case in Autonomous Systems
Distributed Matrix-Based Sampling for Graph Neural Network Training
Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving
vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs
AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration
VQPy: An Object-Oriented Approach to Modern Video Analytics
COMET: Neural Cost Model Explanation Framework
Schrodinger's FP: Training Neural Networks with Dynamic Floating-Point Containers
Lancet: Accelerating Mixture-of-Experts Training by Overlapping Weight Gradient Computation and All-to-All Communication
Accelerating ReLU for MPC-Based Private Inference with a Communication-Efficient Sign Estimation
L-GreCo: Layerwise-adaptive Gradient Compression For Efficient Data-parallel Deep Learning
Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference
Torch2Chip: An End-to-end Customizable Deep Neural Network Compression and Deployment Toolkit for Prototype Hardware Accelerator Design
Accurate Low-Degree Polynomial Approximation of Non-Polynomial Operators for Fast Private Inference in Homomorphic Encryption
Prompt Cache: Modular Attention Reuse for Low-Latency Inference
Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache
Proteus: Preserving Model Confidentiality during Graph Optimizations
Efficient Post-training Quantization with FP8 Formats
On Latency Predictors for Neural Architecture Search
QMoE: Sub-1-Bit Compression of Trillion Parameter Models
DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines
Punica: Multi-Tenant LoRA Serving
Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large Scale Recommendation
HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices