MLSys 2024 Papers
SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
Does Compressing Activations Help Model Parallel Training?
FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics
Vidur: A Large-Scale Simulation Framework for LLM Inference
CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation
LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning
FLASH: Fast Model Adaptation in ML-Centric Cloud Platforms
UniDM: A Unified Framework for Data Manipulation with Large Language Models
ACROBAT: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time
FedTrans: Efficient Federated Learning via Multi-Model Transformation
JIT-Q: Just-in-time Quantization with Processing-In-Memory for Efficient ML Training
SLoRA: Scalable Serving of Thousands of LoRA Adapters
HeteroSwitch: Characterizing and Taming System-Induced Data Heterogeneity in Federated Learning
Fine-Tuning Language Models Using Formal Methods Feedback: A Use Case in Autonomous Systems
Distributed Matrix-Based Sampling for Graph Neural Network Training
Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving
vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs
AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration
VQPy: An Object-Oriented Approach to Modern Video Analytics
COMET: Neural Cost Model Explanation Framework
Schrodinger's FP: Training Neural Networks with Dynamic Floating-Point Containers
Lancet: Accelerating Mixture-of-Experts Training by Overlapping Weight Gradient Computation and All-to-All Communication
Accelerating ReLU for MPC-Based Private Inference with a Communication-Efficient Sign Estimation
L-GreCo: Layerwise-adaptive Gradient Compression For Efficient Data-parallel Deep Learning
Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference
Torch2Chip: An End-to-end Customizable Deep Neural Network Compression and Deployment Toolkit for Prototype Hardware Accelerator Design
Accurate Low-Degree Polynomial Approximation of Non-Polynomial Operators for Fast Private Inference in Homomorphic Encryption
Prompt Cache: Modular Attention Reuse for Low-Latency Inference
Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache
Proteus: Preserving Model Confidentiality during Graph Optimizations
Efficient Post-training Quantization with FP8 Formats
On Latency Predictors for Neural Architecture Search
QMoE: Sub-1-Bit Compression of Trillion Parameter Models
DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines
Punica: Multi-Tenant LoRA Serving
Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large Scale Recommendation
HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices