MLSys 2025 Accepted Papers
Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking
Youmu: Efficient Columnar Data Pipeline for LLM Training
Efficient On-Device Machine Learning with a Biologically-Plausible Forward-Only Algorithm
FLStore: Efficient Federated Learning Storage for non-training workloads
AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine
MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices
MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs
SwiftVI: Time-Efficient Planning and Learning with MDPs
DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling
Graph Learning at Scale: Characterizing and Optimizing Pre-Propagation GNNs
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
Lightweight Software Kernels and Hardware Extensions for Efficient Sparse Deep Neural Networks on Microcontrollers
SparseTransX: Efficient Training of Translation-Based Knowledge Graph Embeddings Using Sparse Matrix Operations
ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation
On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions
Interference-aware Edge Runtime Prediction with Conformal Matrix Completion
Enabling Unstructured Sparse Acceleration on Structured Sparse Accelerators
FlexInfer: Flexible LLM Inference with CPU Computations
Balancing Pipeline Parallelism with Vocabulary Parallelism
A Bring-Your-Own-Model Approach for ML-Driven Storage Placement in Warehouse-Scale Computers
Scaling Deep Learning Training with MPMD Pipeline Parallelism
LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers
TurboAttention: Efficient Attention Approximation for High-Throughput LLMs
Self-Data Distillation for Recovering Quality in Pruned Large Language Models
The Hidden Bloat in Machine Learning Systems
Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training
Supply-Chain Attacks in Machine Learning Frameworks
ProtoRAIL: A Risk-cognizant Imitation Agent for Adaptive vCPU Oversubscription In the Cloud
AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution
Context Parallelism for Scalable Million-Token Inference
Optimizing LLM Queries in Relational Data Analytics Workloads
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer
FlexAttention: A Programming Model for Generating Fused Attention Variants
AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds
FedProphet: Memory-Efficient Federated Adversarial Training via Robust and Consistent Cascade Learning
Venn: Resource Management For Collaborative Learning Jobs
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models
ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments
GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism
HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
LAVA: Lifetime-Aware VM Allocation with Learned Distributions and Adaptation to Mispredictions
Know Where You’re Uncertain When Planning with Multimodal Foundation Models: A Formal Framework
VoLUT: Efficient Volumetric Streaming Enhanced by LUT-based Super-Resolution
SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling
MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators
Photon: Federated LLM Pre-Training
Seesaw: High-throughput LLM Inference via Model Re-sharding
PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training
Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling
ScaleFusion: Scalable Inference of Spatial-Temporal Diffusion Transformers for High-Resolution Long Video Generation
APOLLO: SGD-like Memory, AdamW-level Performance
Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving
Marconi: Prefix Caching for the Era of Hybrid LLMs
FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives
COMET: Fine-grained Computation-communication Overlapping for Mixture-of-Experts