MLSys 2025 Accepted Papers
Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking
Youmu: Efficient Columnar Data Pipeline for LLM Training
Efficient On-Device Machine Learning with a Biologically-Plausible Forward-Only Algorithm
FLStore: Efficient Federated Learning Storage for non-training workloads
AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine
MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices
MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs
SwiftVI: Time-Efficient Planning and Learning with MDPs
DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling
Graph Learning at Scale: Characterizing and Optimizing Pre-Propagation GNNs
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers
SparseTransX: Efficient Training of Translation-Based Knowledge Graph Embeddings Using Sparse Matrix Operations
ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation
On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions
Interference-aware Edge Runtime Prediction with Conformal Matrix Completion
Enabling Unstructured Sparse Acceleration on Structured Sparse Accelerators
FlexInfer: Flexible LLM Inference with CPU Computations
Balancing Pipeline Parallelism with Vocabulary Parallelism
A Bring-Your-Own-Model Approach for ML-Driven Storage Placement in Warehouse-Scale Computers
Scaling Deep Learning Training with MPMD Pipeline Parallelism
LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers
TurboAttention: Efficient Attention Approximation for High-Throughput LLMs
Self-Data Distillation for Recovering Quality in Pruned Large Language Models
The Hidden Bloat in Machine Learning Systems
Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training
Supply-Chain Attacks in Machine Learning Frameworks
ProtoRAIL: A Risk-cognizant Imitation Agent for Adaptive vCPU Oversubscription In the Cloud
AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution
Context Parallelism for Scalable Million-Token Inference
Optimizing LLM Queries in Relational Data Analytics Workloads
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
FlexAttention: A Programming Model for Generating Fused Attention Variants
AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds
FedProphet: Memory-Efficient Federated Adversarial Training via Robust and Consistent Cascade Learning
Venn: Resource Management For Collaborative Learning Jobs
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models
ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments
GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism
HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
LAVA: Lifetime-Aware VM Allocation with Learned Distributions and Adaptation to Mispredictions
Know Where You’re Uncertain When Planning with Multimodal Foundation Models: A Formal Framework
VoLUT: Efficient Volumetric Streaming Enhanced by LUT-based Super-Resolution
SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling
MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators
Photon: Federated LLM Pre-Training
Seesaw: High-throughput LLM Inference via Model Re-sharding
PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training
Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
ScaleFusion: Scalable Inference of Spatial-Temporal Diffusion Transformers for High-Resolution Long Video Generation
APOLLO: SGD-like Memory, AdamW-level Performance
Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving
Marconi: Prefix Caching for the Era of Hybrid LLMs
FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
COMET: Fine-grained Computation-communication Overlapping for Mixture-of-Experts