MLSys 2026 Papers
ExecuTorch - A Unified PyTorch Solution to Run ML Models On-Device
VeriMoA: A Mixture-of-Agents Framework for Spec-to-HDL Generation
Unified LLM Model for Power, Performance, and Area Prediction from Hardware Code
SONAR: Benchmarking Topology and Collaboration in Decentralized Learning
Locality-Aware Beam Scheduling for Efficient Test-Time Compute with a Consumer-grade GPU
StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation
Spira: Exploiting Voxel Data Structural Properties for Efficient Sparse Convolution in Point Cloud Networks
Search Your Block Floating Point Scales!
Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems
Pylo: Towards Accessible Learned Optimizers in PyTorch
SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models
Shannonic: Efficient Entropy-Optimal Compression for ML Workloads
ContextPilot: Fast Long-Context Inference via Context Reuse
Attribution-based Sparse Activation in Large Language Models
IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference
Reparo: Loss-Resilient Generative Codec for Video Conferencing
Speculative Decoding: Performance or Illusion?
Zorse: Optimizing LLM Training Efficiency on Heterogeneous GPU Clusters
Massive-Scale Out-Of-Core UMAP on the GPU
PRISM: Parametrically Refactor Inference for Speculative Decoding Draft Models
Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
MAC-Attention: A Match–Amend–Complete Scheme for Fast and Accurate Attention Computation
BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching
MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs
Zero redundancy distributed learning with differential privacy
FlashAgents: Accelerating Multi-Agent LLM Systems via Streaming Prefill Overlap
BEAM: Joint Resource–Power Optimization for Energy-Efficient LLM Inference under SLO Constraints
PROMPTS: PeRformance Optimization via Multi-Agent Planning for LLM Training and Serving
CSLE: A Reinforcement Learning Platform for Autonomous Security Management
FreeScale: Distributed Training for Sequence Recommendation Models with Minimal Scaling Cost
ProTrain: Efficient LLM Training via Automatic Memory Management
AgenticCache: Cache-Driven Asynchronous Planning for Embodied AI Agents
GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving
GriNNder: Breaking the Memory Capacity Wall in Full-Graph GNN Training with Storage Offloading
Breaking the Ice: Analyzing Cold Start Latency in vLLM
SwiftGS: Algorithm and System Co-Optimization for Fast 3D Gaussian Splatting on GPUs
DriftBench: Measuring and Predicting Infrastructure Drift in LLM Serving Systems
TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments
CDLM: Consistency Diffusion Language Models for Faster Sampling
Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
LEANN: A Low-Storage Overhead Vector Index
fabric-lib: RDMA Point-to-Point Communication for LLM Systems
Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework
RaidServe: High-performance Resilient Serving
PARROT: Persuasion and Agreement Robustness Rating of Output Truth — A Sycophancy Robustness Benchmark for LLMs
PLA-Serve: A Prefill-Length-Aware LLM Serving System
FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management
CRAFT: Fine-Grained Cost-Aware Expert Replication For Efficient Mixture-of-Experts Serving
Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
ProToken: Token-Level Attribution for Federated Large Language Models
FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error
The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents
Meeting SLOs, Slashing Hours: Automated Enterprise LLM Optimization with OptiKIT
Efficient Long-Context Language Model Training by Core Attention Disaggregation
Rethinking DVFS for Mobile LLMs: Unified Energy-Aware Scheduling with CORE
DreamDDP: Accelerating Low-Bandwidth Geo-Distributed LLM Training with Layer-wise Partial Synchronization
Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading
AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization
SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips
Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel
MorphServe: Efficient and Workload-Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing
Beyond the Buzz: A Pragmatic Take on Inference Disaggregation
Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP
DisAgg: Distributed Aggregators for Efficient Secure Aggregation
veScale-FSDP: Flexible and High-Performance FSDP at Scale
HipKittens: Fast and Furious AMD Kernels
From Tokens to Layers: Redefining Stall-Free Scheduling for MoE Serving with Layered Prefill
ML Fleet Efficiency: Improving TPU Systems at Scale with ML Productivity Goodput
ProfInfer: An eBPF-based Fine-Grained LLM Inference Profiler
Blueprint, Bootstrap, and Bridge: A Security Look at NVIDIA GPU Confidential Computing
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models
Scaling Up Large Language Models Serving Systems for Semantic Job Search
Beat the long tail: Distribution-Aware Speculative Decoding for RL Training
SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding
TriInfer: Hybrid EPD Disaggregation for Efficient Multimodal Large Language Model Inference
FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling
Practical Adversarial Multi-Armed Bandits with Sublinear Runtime
Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
NEST: Network- and Memory-Aware Device Placement for Distributed Deep Learning
FaaScale: Unlocking Fast LLM Scaling for Serverless Inference
CATWILD: Compiler Autotuning for TPU workloads in the Wild
FlexTrain: Scalable Hybrid-Parallel Training with Elastic Resource Utilization and Consistent Accuracy
Virtual Machine NUMA Placement at Scale: Learning the Norm, Shielding the Tail
Wave: A Symbolic Python DSL And Compiler for High-Performance Machine Learning
ZK-APEX: Zero-Knowledge Approximate Personalized Unlearning with Executable Proofs
Optimizing Deployment Configurations for LLM Inference
AIRS: Scaling Live Inference in Resource Constrained Environments
Efficient, VRAM-Constrained xLM Inference on Clients
A Lightweight High-Throughput Collective-Capable NoC for Large-Scale ML Accelerators
XProf: An Open, Scalable, and Extensible Profiling System for the Modern ML Stack
Hawkeye: Reproducing GPU-Level Non-Determinism
BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models
GUARD: Scalable Straggler Detection and Node Health Management for Large-Scale Training
FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems
CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training
Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token
OPKV: A High-Throughput Plugin-Driven Framework for Recallable Sparsity in Paged KV Cache Systems
Dataflow Is All You Need
ADS: An Agentic Detection System for Enterprise Agentic AI Security
BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding
AXLearn: Modular, Hardware-Agnostic Large Model Training
Sparing Strategies to Minimize Reliability Impact On Large Training Jobs
Demystifying the Mixture of Experts Serving Tax
Cost-aware Duration Prediction for Software Upgrades in Datacenters
Agentic Operator Generation for ML ASICs
SHIP: SRAM-Based Huge Inference Pipelines for Fast LLM Serving
DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling
FLoRIST: Singular Value Thresholding for Efficient and Accurate Federated Fine-Tuning of Large Language Models
Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference
When Enough is Enough: Rank-Aware Early Termination for Vector Search
db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism
Once-for-All Channel Mixers (HyperTinyPW): Generative Compression for TinyML
EarthSight: A Distributed Framework for Low-Latency Satellite Intelligence
Ontology-Guided Long-Term Agent Memory for Conversational RAG
HELIOS: Adaptive Model and Early-Exit Selection for Efficient LLM Inference Serving
ReSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems
OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents
When Machine Learning Isn’t Sure: Building Resilient ML-Based Computer Systems by Embracing Uncertainty
ApproxMLIR: Accuracy-Aware Compiler for Compound ML System
Hippocampus: An Efficient and Scalable Memory Module for Agentic AI
Using Span Queries to Optimize Cache and Attention Locality
BOute: Cost-Efficient LLM Serving with Heterogeneous LLMs and GPUs via Multi-Objective Bayesian Optimization
TiDAR: Think in Diffusion, Talk in Autoregression
HexiScale: Facilitating Large Language Model Training over Heterogeneous Hardware
Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem
ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels
Automated Algorithm Design for Auto-Tuning Optimizers
MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
PLayer-FL: A Principled Approach to Personalized Layer-wise Cross-Silo Federated Learning
SAKURAONE: An Open Ethernet-Based AI HPC System and Its Observed Workload Dynamics in a Single-Tenant LLM Development Environment
G-HEMP: Fast Multi-GPU Private Inference for Large-Scale GCNs with Homomorphic Encryption