MLSys 2025 Papers
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling
FlexAttention: A Programming Model for Generating Fused Attention Variants
AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds
ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments
PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training
SparseTransX: Efficient Training of Translation-Based Knowledge Graph Embeddings Using Sparse Matrix Operations
LAVA: Lifetime-Aware VM Allocation with Learned Distributions and Adaptation to Mispredictions
Photon: Federated LLM Pre-Training
SwiftVI: Time-Efficient Planning and Learning with MDPs
FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference
ProtoRAIL: A Risk-cognizant Imitation Agent for Adaptive vCPU Oversubscription In the Cloud
Supply-Chain Attacks in Machine Learning Frameworks
VoLUT: Efficient Volumetric streaming enhanced by LUT-based super-resolution
Graph Learning at Scale: Characterizing and Optimizing Pre-Propagation GNNs
Know Where You’re Uncertain When Planning with Multimodal Foundation Models: A Formal Framework
Youmu: Efficient Columnar Data Pipeline for LLM Training
A Practical Cross-Layer Approach for ML-Driven Storage Placement in Warehouse-Scale Computers
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers
Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers
MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators
FedProphet: Memory-Efficient Federated Adversarial Training via Robust and Consistent Cascade Learning
Efficient On-Device Machine Learning with a Biologically-Plausible Forward-Only Algorithm
FLStore: Efficient Federated Learning Storage for non-training workloads
Enabling Unstructured Sparse Acceleration on Structured Sparse Accelerators
Optimizing LLM Queries in Relational Data Analytics Workloads
Marconi: Prefix Caching for the Era of Hybrid LLMs
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
Scaling Deep Learning Training with MPMD Pipeline Parallelism
Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training
Context Parallelism for Scalable Million-Token Inference
HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression
Seesaw: High-throughput LLM Inference via Model Re-sharding
ScaleFusion: Scalable Inference of Spatial-Temporal Diffusion Transformers for High-Resolution Long Video Generation
Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking
TurboAttention: Efficient Attention Approximation for High Throughputs LLMs
SPA: Scaling Graph Neural Network Training on Large Graphs via Probabilistic Splitting
TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
Self-Data Distillation for Recovering Quality in Pruned Large Language Models
COMET: Fine-grained Computation-communication Overlapping for Mixture-of-Experts
MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices
MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs
AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution
Interference-aware Edge Runtime Prediction with Conformal Matrix Completion
APOLLO: SGD-like Memory, AdamW-level Performance
Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training
Venn: Resource Management For Collaborative Learning Jobs
The Hidden Bloat in Machine Learning Systems
Balancing Pipeline Parallelism with Vocabulary Parallelism
Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models
FlexInfer: Flexible LLM Inference with CPU Computations
On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions
Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine
ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation