{"count": 135, "next": null, "previous": null, "results": [{"id": 3608, "uid": "ed3d2c21991e3bef5e069713af9fa6ca", "name": "NodeSweep: Practical Straggler Detection and Health Monitoring for Large-Scale Foundation Model Training", "authors": [{"id": 27162, "fullname": "Guanliang Liu", "url": "http://mlsys.org/api/miniconf/users/27162?format=json", "institution": "Amazon"}, {"id": 21189, "fullname": "Zoe Zeng", "url": "http://mlsys.org/api/miniconf/users/21189?format=json", "institution": "Amazon"}, {"id": 27778, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27778?format=json", "institution": null}, {"id": 17887, "fullname": "Cong Cheng", "url": "http://mlsys.org/api/miniconf/users/17887?format=json", "institution": "Amazon"}, {"id": 27779, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27779?format=json", "institution": null}, {"id": 27780, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27780?format=json", "institution": null}, {"id": 27781, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27781?format=json", "institution": null}, {"id": 27782, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27782?format=json", "institution": null}, {"id": 27783, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27783?format=json", "institution": null}, {"id": 27784, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27784?format=json", "institution": null}, {"id": 27785, "fullname": "Alexander Zhipa", "url": "http://mlsys.org/api/miniconf/users/27785?format=json", "institution": "Amazon"}, {"id": 27163, "fullname": "Ashvin Nihalani", "url": "http://mlsys.org/api/miniconf/users/27163?format=json", "institution": "Amazon"}, {"id": 27786, "fullname": "Binxuan Huang", "url": "http://mlsys.org/api/miniconf/users/27786?format=json", "institution": "Amazon"}, {"id": 27787, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27787?format=json", "institution": null}, {"id": 27788, "fullname": "", 
"url": "http://mlsys.org/api/miniconf/users/27788?format=json", "institution": null}], "abstract": "As foundation model training scales to thousands of GPUs, maintaining consistent node performance becomes increasingly critical. Traditional health-checking methods such as NCCL or burn-in tests often fail to capture subtle performance degradations that can significantly impact large-scale training efficiency. In this paper, we present a comprehensive node health monitoring framework that integrates real-time performance tracking with a novel offline node sweep mechanism. Our approach effectively identifies problematic nodes that traditional methods overlook, especially under complex communication patterns common in distributed training. Extensive evaluations on production workloads show that our method improves mean FLOPS utilization (MFU) by up to 1.7\u00d7, reduces run-to-run variance from 20% to 1%, and increases the mean time to failure (MTTF) while reducing human intervention time. These improvements translate to substantial gains in training efficiency. 
The proposed solution is both practical and scalable, making it particularly valuable for production-scale foundation model training.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3608", "url": null, "sourceid": 98, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=JFEwQ821MS", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 902, "modified": "2026-03-23T21:52:47.116357-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=JFEwQ821MS", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3548, "uid": "f7177163c833dff4b38fc8d2872f1ec6", "name": "SchedFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling", "authors": [{"id": 27415, "fullname": "Yi Pan", "url": "http://mlsys.org/api/miniconf/users/27415?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 27416, "fullname": "Yile Gu", "url": "http://mlsys.org/api/miniconf/users/27416?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 25662, "fullname": "Luo Jinbin", "url": "http://mlsys.org/api/miniconf/users/25662?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 27417, "fullname": "Yibo Wu", "url": "http://mlsys.org/api/miniconf/users/27417?format=json", "institution": "University of Washington; University of Wisconsin 
Madison"}, {"id": 27418, "fullname": "Ziren Wang", "url": "http://mlsys.org/api/miniconf/users/27418?format=json", "institution": "Tsinghua University"}, {"id": 27419, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27419?format=json", "institution": null}, {"id": 20912, "fullname": "Ziyi Xu", "url": "http://mlsys.org/api/miniconf/users/20912?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 27420, "fullname": "Shengkai Lin", "url": "http://mlsys.org/api/miniconf/users/27420?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 17972, "fullname": "Stephanie Wang", "url": "http://mlsys.org/api/miniconf/users/17972?format=json", "institution": "UW &amp; Anyscale"}, {"id": 17670, "fullname": "Baris Kasikci", "url": "http://mlsys.org/api/miniconf/users/17670?format=json", "institution": "University of Michigan"}], "abstract": "Intra-device parallelism addresses resource under-utilization in ML inference and training by overlapping the execution of operators with different resource usage.  However, its wide adoption is hindered by a fundamental conflict with the static, sequential programming model of existing frameworks. Integrating these strategies requires invasive, model-specific code overhauls, representing an intractable engineering cost. This is further amplified by the high sensitivity of strategies to execution contexts (e.g., workload, model architecture, hardware), forcing developers to implement and maintain multiple specialized solutions. To address this, we propose SchedFlow, a framework that enables the transparent and flexible integration of intra-device parallelism by decoupling the logical model definition from the physical execution schedule. SchedFlow introduces a flexible frontend with annotations for graph partitioning and a programmable interface for defining custom intra-device parallelism strategies. 
Its efficient backend manages complex control/data-flow asynchronously, uses custom memory management to eliminate copy overheads, and preserves compatibility with optimizations like CUDA Graphs and TorchInductor. We demonstrate that SchedFlow can integrate four representative parallelism strategies into three state-of-the-art ML systems (vLLM, SGLang, HuggingFace Transformer) with minimal code changes, achieving up to a 1.24x throughput improvement.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3548", "url": null, "sourceid": 44, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=i0yqC9954S", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 842, "modified": "2026-03-23T21:52:44.736245-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=i0yqC9954S", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3509, "uid": "02e74f10e0327ad868d138f2b4fdd6f0", "name": "From Tokens to Layers: Redefining Stall-Free Scheduling for LLM Serving with Layered Prefill", "authors": [{"id": 26188, "fullname": "Gunjun Lee", "url": "http://mlsys.org/api/miniconf/users/26188?format=json", "institution": "Seoul National University"}, {"id": 27174, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27174?format=json", "institution": null}, {"id": 27175, "fullname": "", "url": 
"http://mlsys.org/api/miniconf/users/27175?format=json", "institution": null}, {"id": 27176, "fullname": "Younjoo Lee", "url": "http://mlsys.org/api/miniconf/users/27176?format=json", "institution": "Seoul National University"}, {"id": 27177, "fullname": "Jung Ho Ahn", "url": "http://mlsys.org/api/miniconf/users/27177?format=json", "institution": "Seoul National University"}], "abstract": "Large Language Model (LLM) inference in production must meet stringent service-level objectives for both time-to-first-token (TTFT) and time-between-token (TBT) while maximizing throughput under fixed compute, memory, and interconnect budgets. Modern serving systems adopt stall-free scheduling techniques such as chunked prefill, which splits long prompt processing along the token dimension and interleaves prefill with ongoing decode iterations. While effective at stabilizing TBT, chunked prefill incurs substantial overhead in Mixture-of-Experts (MoE) models: redundant expert weight loads increase memory traffic by up to \\textbf{39\\%} and inflate energy consumption. We propose \\textbf{layered prefill}, a new scheduling paradigm that treats transformer layer groups as the primary scheduling unit. By vertically partitioning the model into contiguous layer groups and interleaving prefill and decode across the groups, layered prefill sustains stall-free decoding while eliminating chunk-induced MoE weight reloads. It reduces off-chip bandwidth demand, lowering TTFT by up to \\textbf{70\\%}, End-to-End latency by \\textbf{41\\%} and per-token energy by up to \\textbf{22\\%}. Evaluations show that layered prefill consistently improves the TTFT--TBT Pareto frontier over chunked prefill, reducing expert-load traffic and energy cost while maintaining stall-free decoding. 
Overall, shifting the scheduling axis from tokens to layers unlocks a new operating regime for high-efficiency, energy-aware LLM serving in co-located environments.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3509", "url": null, "sourceid": 27, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=yyDbI3HXco", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 803, "modified": "2026-03-23T21:52:43.267572-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=yyDbI3HXco", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3553, "uid": "069059b7ef840f0c74a814ec9237b6ec", "name": "FlexTrain: Scalable Hybrid-Parallel Training with Elastic Resource Utilization and Consistent Accuracy", "authors": [{"id": 25944, "fullname": "Weilin Cai", "url": "http://mlsys.org/api/miniconf/users/25944?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 27470, "fullname": "Diandian Gu", "url": "http://mlsys.org/api/miniconf/users/27470?format=json", "institution": "Bytedance Seed"}, {"id": 27471, "fullname": "Jun Wang", "url": "http://mlsys.org/api/miniconf/users/27471?format=json", "institution": "ByteDance Inc."}, {"id": 27472, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27472?format=json", "institution": null}, {"id": 19049, "fullname": "Jiayi 
Huang", "url": "http://mlsys.org/api/miniconf/users/19049?format=json", "institution": "HKUST(GZ)"}], "abstract": "Large language model (LLM) training has become a critical workload in shared GPU clusters. However, our observations reveal that these clusters suffer from significant underutilization. To address this inefficiency, various elastic training techniques have been developed to dynamically adjust GPU allocations to harness idle resources. Despite their potential, these methods have seen limited deployment in production environments due to three major challenges: accuracy inconsistency, excessive profiling overhead, and limited flexibility. In this paper, we propose FlexTrain, an elastic training system that achieves consistent model accuracy, high training efficiency, and effective resource utilization. FlexTrain prioritizes adjustments to the pipeline parallelism (PP) degree to preserve deterministic computation and maintain accuracy consistency, while also supporting data parallelism (DP) scaling to further enhance throughput under relaxed consistency requirements. It generates optimal PP schedules, predicts training performance under different configurations, and makes scaling decisions based on job submission intervals, scaling overhead, and expected throughput gains. 
Evaluation results show that FlexTrain can achieve up to 1.73x speedup for elastic jobs while preserving consistent accuracy, and up to 2.27x when accuracy consistency is relaxed, compared to CompanyX's current scheduling strategy.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3553", "url": null, "sourceid": 126, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=h2yhNcbwSL", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 847, "modified": "2026-03-23T21:52:44.957563-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=h2yhNcbwSL", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3600, "uid": "54229abfcfa5649e7003b83dd4755294", "name": "Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems", "authors": [{"id": 27737, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27737?format=json", "institution": null}, {"id": 27738, "fullname": "Luka Grbcic", "url": "http://mlsys.org/api/miniconf/users/27738?format=json", "institution": "Lawrence Berkeley National Laboratory"}, {"id": 27739, "fullname": "Samuel Williams", "url": "http://mlsys.org/api/miniconf/users/27739?format=json", "institution": "Lawrence Berkeley National Lab"}, {"id": 27740, "fullname": "Costin Iancu", "url": "http://mlsys.org/api/miniconf/users/27740?format=json", "institution": "Ambassador University"}], 
"abstract": "Maximizing performance on available GPU hardware is an ongoing challenge for modern AI inference systems. Traditional approaches include writing custom GPU kernels and using specialized model compilers to tune high-level code for specific GPU targets. Recent work shows that LLM-based multi-agent systems can effectively perform such tuning, often outperforming existing compilers and eliminating the need for manual kernel development. However, the dynamics of multi-agent systems for this task remain unexplored. In this work, we present a logical framework for comparing multi-agent PyTorch optimization systems. Our evaluation shows that exploit-heavy strategies perform best when paired with error-fixing agents, and that performance correlates with the granularity of optimization steps. The best implementation achieves an average 2.88\u00d7 speedup on an H100 GPU across diverse tasks in KernelBench, a benchmark suite covering a range of machine learning architectures in PyTorch.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3600", "url": null, "sourceid": 91, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=MJxhiX3sSd", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 894, "modified": "2026-03-23T21:52:46.841467-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=MJxhiX3sSd", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, 
"related_events": [], "related_events_ids": []}, {"id": 3617, "uid": "eccbc87e4b5ce2fe28308fd9f2a7baf3", "name": "FLoRIST: Singular Value Thresholding for Efficient and Accurate Federated Fine-Tuning of Large Language Models", "authors": [{"id": 27825, "fullname": "Hariharan Ramesh", "url": "http://mlsys.org/api/miniconf/users/27825?format=json", "institution": "University of Arizona"}, {"id": 23907, "fullname": "Jyotikrishna Dass", "url": "http://mlsys.org/api/miniconf/users/23907?format=json", "institution": "University of Arizona"}], "abstract": "Integrating Low-Rank Adaptation (LoRA) into federated learning offers a promising solution for parameter-efficient fine-tuning of Large Language Models (LLMs) without sharing local data. However, several methods designed for federated LoRA present significant challenges in balancing communication efficiency, model accuracy, and computational cost, particularly among heterogeneous clients. These methods either rely on simplistic averaging of local adapters, which introduces aggregation noise, require transmitting large stacked local adapters, leading to poor communication efficiency, or necessitate reconstructing memory-dense global weight-update matrix and performing computationally expensive decomposition to design client-specific low-rank adapters. In this work, we propose FLoRIST, a federated fine-tuning framework that achieves mathematically accurate aggregation without incurring high communication or computational overhead. Instead of constructing the full global weight-update matrix at the server, FLoRIST employs an efficient decomposition pipeline by performing singular value decomposition on stacked local adapters separately. This approach operates within a compact intermediate space to represent the accumulated information from local LoRAs. We introduce tunable singular value thresholding for server-side optimal rank selection to construct a pair of global low-rank adapters shared by all clients. 
Extensive empirical evaluations across multiple datasets and LLMs demonstrate that FLoRIST consistently strikes the best balance between superior communication efficiency and competitive performance in both homogeneous and heterogeneous setups.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3617", "url": null, "sourceid": 3, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=GTZRs756YJ", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 911, "modified": "2026-03-23T21:52:47.495721-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=GTZRs756YJ", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3558, "uid": "7f6ffaa6bb0b408017b62254211691b5", "name": "AIRS: Scaling Live Inference in Resource Constrained Environments", "authors": [{"id": 27552, "fullname": "Nilesh Jagnik", "url": "http://mlsys.org/api/miniconf/users/27552?format=json", "institution": "Google"}, {"id": 27553, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27553?format=json", "institution": null}, {"id": 27554, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27554?format=json", "institution": null}, {"id": 27555, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27555?format=json", "institution": null}, {"id": 27556, "fullname": "Harshvardhan GM", "url": 
"http://mlsys.org/api/miniconf/users/27556?format=json", "institution": "Google LLC"}], "abstract": "Advancements in large language models (LLMs) have made them increasingly useful for complex reasoning tasks which previously required domain experts. One such task is quality evaluation of query responses produced by a search engine. Evaluation generates metrics necessary to study the quality, impact, and usefulness of product changes and features. Typically, to compute evaluation metrics, human experts are asked to rate various attributes of search responses. This process is generally quite expensive and requires several days to complete. As an alternative, LLMs are now being used to perform rating tasks with lower costs and latency. In addition, many new metrics are being developed to evaluate Google's new AI-based offerings, which require ratings too. As a result, there is much higher demand for LLM rating prediction tasks in comparison with the allocated TPU (Tensor Processing Unit) budget. A larger portion of the company's TPU resources are reserved for serving live user traffic. In this paper, we present the AI Rater Service (AIRS), an inference pipeline that employs several software engineering techniques to generate AI ratings with high reliability and low latency. 
AIRS maximizes LLM inference throughput by optimizing TPU resource utilization across various evaluation workflows, while minimizing latency for higher priority tasks.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3558", "url": null, "sourceid": 112, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=g1RWik4Gy1", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 852, "modified": "2026-03-23T21:52:45.165663-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=g1RWik4Gy1", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3515, "uid": "70efdf2ec9b086079795c442636b55fb", "name": "Ontology-Guided Long-Term Memory for Conversational RAG", "authors": [{"id": 27197, "fullname": "Shuang Cao", "url": "http://mlsys.org/api/miniconf/users/27197?format=json", "institution": "Hill Research"}, {"id": 27198, "fullname": "Rui Li", "url": "http://mlsys.org/api/miniconf/users/27198?format=json", "institution": "Hill Research"}], "abstract": "Retrieval-augmented generation (RAG) enables LLMs to ground responses in external knowledge, but long-term, multi-session conversations still suffer from implicit recall failures: when current user queries lack lexical overlap with earlier facts (e.g., preferences), standard dense retrieval and long-context prompting often miss the most relevant memories. 
We present a dialogue-aware RAG system that jointly addresses what to store and how to retrieve under constraints. Our design extracts durable user facts into a lightweight memory graph, enriches queries with conversational cues, performs hybrid retrieval, and uses a budget-aware router to balance quality and serving cost. On our Implicit Preference Recall benchmark, the system lifts Recall@10 to 0.70 (vs. 0.58 for dense-only) and improves nDCG@10 from 0.41 to 0.51. The system also reduces cross-modality disagreement by 47% and achieves an 81% cost reduction compared to long-context methods.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3515", "url": null, "sourceid": 17, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=wpZHLPz4N0", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 809, "modified": "2026-03-23T21:52:43.477104-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=wpZHLPz4N0", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3637, "uid": "642e92efb79421734881b53e1e1b18b6", "name": "FlexScale: Flexible and High-Performance FSDP at Scale", "authors": [{"id": 27954, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27954?format=json", "institution": null}, {"id": 27955, "fullname": "Youjie Li", "url": "http://mlsys.org/api/miniconf/users/27955?format=json", 
"institution": "ByteDance veScale"}, {"id": 27956, "fullname": "Zhiqi Lin", "url": "http://mlsys.org/api/miniconf/users/27956?format=json", "institution": "ByteDance Inc."}, {"id": 27957, "fullname": "Jiacheng Yang", "url": "http://mlsys.org/api/miniconf/users/27957?format=json", "institution": "ByteDance Inc."}, {"id": 27958, "fullname": "Cong Xie", "url": "http://mlsys.org/api/miniconf/users/27958?format=json", "institution": "ByteDance"}, {"id": 27959, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27959?format=json", "institution": null}, {"id": 27960, "fullname": "ZHENG ZHONG", "url": "http://mlsys.org/api/miniconf/users/27960?format=json", "institution": "Bytedance"}, {"id": 27961, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27961?format=json", "institution": null}, {"id": 27962, "fullname": "Hongyu Zhu", "url": "http://mlsys.org/api/miniconf/users/27962?format=json", "institution": "ByteDance Inc."}, {"id": 27963, "fullname": "Zhi Zhang", "url": "http://mlsys.org/api/miniconf/users/27963?format=json", "institution": "ByteDance Inc."}, {"id": 20969, "fullname": "Xin Liu", "url": "http://mlsys.org/api/miniconf/users/20969?format=json", "institution": "ByteDance Inc."}, {"id": 27964, "fullname": "Yanghua Peng", "url": "http://mlsys.org/api/miniconf/users/27964?format=json", "institution": "ByteDance Inc."}], "abstract": "Fully Sharded Data Parallel (FSDP), also known as ZeRO, is widely used for training large-scale models, featuring its flexibility and minimal intrusion on model code. However, current FSDP systems struggle with structure-aware training methods\u2014e.g., block-wise quantized training\u2014and with optimizers such as Shampoo and Muon used in cutting-edge models (e.g., Gemini, Kimi K2). FSDP's fixed element- or row-wise sharding formats conflict with the block-structured computations. 
In addition, today\u2019s implementations fall short in communication and memory efficiency, limiting scaling to tens of thousands of GPUs. We introduce FlexScale, a redesigned FSDP framework that couples a flexible sharding format, RaggedShard, with a structure-aware planning algorithm to deliver both flexibility and performance at scale. FlexScale natively supports efficient data placement required by FSDP, and accommodates non-element-wise optimizers and block-wise quantization. As a result, FlexScale achieves 5$\\sim$66\\% higher throughput and 16$\\sim$30\\% lower memory usage than existing FSDP systems, while scaling efficiently to 30K GPUs. FlexScale has been battle-tested in production and will be open-sourced to the MLSys community upon acceptance.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3637", "url": null, "sourceid": 48, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=3Lj8R0F48P", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 931, "modified": "2026-03-23T21:52:48.342230-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=3Lj8R0F48P", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3621, "uid": "1afa34a7f984eeabdbb0a7d494132ee5", "name": "OPKV: A High-Throughput Plugin-Driven Framework for Recallable Sparsity in Paged KV Cache Systems", "authors": [{"id": 27839, 
"fullname": "Huazheng Lao", "url": "http://mlsys.org/api/miniconf/users/27839?format=json", "institution": "Southeast University"}, {"id": 25856, "fullname": "Xiaofeng Li", "url": "http://mlsys.org/api/miniconf/users/25856?format=json", "institution": "Southeast University"}, {"id": 25589, "fullname": "Rui Xu", "url": "http://mlsys.org/api/miniconf/users/25589?format=json", "institution": ""}, {"id": 25641, "fullname": "Long Chen", "url": "http://mlsys.org/api/miniconf/users/25641?format=json", "institution": "Southeast University"}, {"id": 27840, "fullname": "Xia Zhu", "url": "http://mlsys.org/api/miniconf/users/27840?format=json", "institution": "Southeast University"}, {"id": 27841, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27841?format=json", "institution": null}], "abstract": "Long-context large language model (LLM) inference faces severe KV cache inflation, making GPU memory a key bottleneck. Existing recallable sparsity methods mitigate memory pressure by offloading non-critical key\u2013value (KV) pairs to CPU memory and recalling them on demand, they are intrusive to KV cache management in the existing inference frameworks and fail to cope with the linearly increasing recall overhead under high batches. To address these limitations, we propose OPKV, a high-throughput plugin-driven framework that seamlessly integrates recallable sparsity into paged KV cache systems and performs unified recall optimization.  OPKV introduces a plugin interface that decouples sparsity logic from model and cache management, and applies object reaggregation and hot page hit algorithms to reduce the recall overhead based on the observation of spatial discreteness and temporal locality in critical KV selection. In addition, a local intra-iteration metadata manager is implemented to perform millisecond-level page retrieval and cache eviction. 
Experimental results show that OPKV helps SoTA methods attain 1.36-1.77x higher decoding throughput across different batch sizes.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3621", "url": null, "sourceid": 131, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=EB5bgzv4qA", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 915, "modified": "2026-03-23T21:52:47.660267-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=EB5bgzv4qA", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3570, "uid": "735b90b4568125ed6c3f678819b6e058", "name": "ZK-APEX: ZERO-KNOWLEDGE APPROXIMATE PERSONALIZED UNLEARNING WITH EXECUTABLE PROOFS", "authors": [{"id": 27601, "fullname": "Mohammad M Maheri", "url": "http://mlsys.org/api/miniconf/users/27601?format=json", "institution": "Imperial College London"}, {"id": 27602, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27602?format=json", "institution": null}, {"id": 27603, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27603?format=json", "institution": null}, {"id": 11279, "fullname": "Hamed Haddadi", "url": "http://mlsys.org/api/miniconf/users/11279?format=json", "institution": "Imperial College London"}], "abstract": "Machine unlearning removes the influence of specified data from trained models to satisfy privacy, copyright, and 
safety requirements (e.g., the \u201cright to be forgotten\u201d). In practice, providers distribute a global model to edge devices, each of which locally personalizes the model based on its private data. However, since clients may ignore or falsify deletion requests, providers must verify correct unlearning for these distributed models without accessing private parameters. This is particularly challenging for personalized models, which must forget designated samples without degrading local utility, while ensuring that verification remains efficient and scalable on resource-constrained edge devices. We formalize personalized unlearning and develop a zero-shot approximate unlearning algorithm that works directly on the personalized model without retraining. Our novel method, ZK-APEX, combines provider-side sparse masking for targeted removal with client-side Group-OBS compensation computed from a block-wise empirical Fisher. This technique yields a curvature-aware update designed for low-overhead execution and proof generation. Using modern Halo2 ZK-SNARKs, we prove operator compliance by showing that the unlearned model exactly matches the committed output of the prescribed transformation, without revealing personalized model parameters or data. On Vision Transformer (ViT) classification models, our approach recovers approximately 99\\% Top-1 personalization accuracy while enforcing effective forgetting. We further evaluate the unlearning algorithm on a generative model, OPT125M, trained on the CodeParrot code dataset, achieving $\\sim$70\\% recovery of original accuracy. ZK-SNARK proof generation for the ViT case completes in $\\approx$2~hours, which is more than $10^7\\times$ faster than retraining-based verification, with peak memory under 0.7~GB and proof sizes about 400~MB. 
Together, these results establish the first verifiable personalized unlearning framework practical for deployment on resource constrained edge devices.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3570", "url": null, "sourceid": 67, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=bLx6orLvQM", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 864, "modified": "2026-03-23T21:52:45.678302-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=bLx6orLvQM", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3585, "uid": "19ca14e7ea6328a42e0eb13d585e4c22", "name": "AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization", "authors": [{"id": 18114, "fullname": "Genghan Zhang", "url": "http://mlsys.org/api/miniconf/users/18114?format=json", "institution": "Stanford University"}, {"id": 27646, "fullname": "Shaowei Zhu", "url": "http://mlsys.org/api/miniconf/users/27646?format=json", "institution": "Amazon"}, {"id": 27647, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27647?format=json", "institution": null}, {"id": 27171, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27171?format=json", "institution": null}, {"id": 27648, "fullname": "Allen Nie", "url": "http://mlsys.org/api/miniconf/users/27648?format=json", "institution": "Google DeepMind"}, 
{"id": 15476, "fullname": "Zhen Jia", "url": "http://mlsys.org/api/miniconf/users/15476?format=json", "institution": "Amazon"}, {"id": 17626, "fullname": "Nandita Vijaykumar", "url": "http://mlsys.org/api/miniconf/users/17626?format=json", "institution": "Department of Computer Science, University of Toronto"}, {"id": 11990, "fullname": "Yida Wang", "url": "http://mlsys.org/api/miniconf/users/11990?format=json", "institution": "Amazon"}, {"id": 15013, "fullname": "Kunle Olukotun", "url": "http://mlsys.org/api/miniconf/users/15013?format=json", "institution": "Stanford"}], "abstract": "We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI acclerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. We build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels with varying complexity extracted from real-world LLM workloads to evaluate the effectiveness of AccelOpt. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from $49\\%$ to $61\\%$ on Trainium 1 and from $45\\%$ to $59\\%$ on Trainium 2 for NKIBench kernels. 
Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being $26\\times$ cheaper.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3585", "url": null, "sourceid": 36, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=SBS4NJHYjZ", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 879, "modified": "2026-03-23T21:52:46.269486-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=SBS4NJHYjZ", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3629, "uid": "33e75ff09dd601bbe69f351039152189", "name": "Dataflow Is All You Need", "authors": [{"id": 27864, "fullname": "Darshan Gandhi", "url": "http://mlsys.org/api/miniconf/users/27864?format=json", "institution": "Sambanova Systems"}, {"id": 27865, "fullname": "Pushkar Nandkar", "url": "http://mlsys.org/api/miniconf/users/27865?format=json", "institution": "Sambanova Systems"}, {"id": 27866, "fullname": "David Koeplinger", "url": "http://mlsys.org/api/miniconf/users/27866?format=json", "institution": "SambaNova"}, {"id": 25920, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/25920?format=json", "institution": null}, {"id": 27166, "fullname": "Romy Tsoupidi", "url": "http://mlsys.org/api/miniconf/users/27166?format=json", "institution": "Sambanova Systems"}, {"id": 
27867, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27867?format=json", "institution": null}, {"id": 27868, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27868?format=json", "institution": null}, {"id": 18777, "fullname": "Tuowen Zhao", "url": "http://mlsys.org/api/miniconf/users/18777?format=json", "institution": "SambaNova Systems, Inc."}, {"id": 27869, "fullname": "Reid Goodbar", "url": "http://mlsys.org/api/miniconf/users/27869?format=json", "institution": "SambaNova Systems"}, {"id": 27870, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27870?format=json", "institution": null}, {"id": 27871, "fullname": "Leon Zhang", "url": "http://mlsys.org/api/miniconf/users/27871?format=json", "institution": "Sambanova Systems"}, {"id": 27872, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27872?format=json", "institution": null}, {"id": 27873, "fullname": "John Long", "url": "http://mlsys.org/api/miniconf/users/27873?format=json", "institution": "Sambanova Systems"}, {"id": 27167, "fullname": "Han Wang", "url": "http://mlsys.org/api/miniconf/users/27167?format=json", "institution": "SambaNova"}, {"id": 27874, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27874?format=json", "institution": null}, {"id": 27875, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27875?format=json", "institution": null}, {"id": 27876, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27876?format=json", "institution": null}, {"id": 18774, "fullname": "Yun Du", "url": "http://mlsys.org/api/miniconf/users/18774?format=json", "institution": null}, {"id": 27877, "fullname": "H\u00e5kan Zeffer", "url": "http://mlsys.org/api/miniconf/users/27877?format=json", "institution": "SambaNova Systems, Inc"}, {"id": 27878, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27878?format=json", "institution": null}, {"id": 27879, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27879?format=json", 
"institution": null}], "abstract": "The autoregressive decode phase of token generation is often the performance bottleneck in modern AI workflows, thanks to powerful open-source models with large context windows coupled with techniques like chain-of-thought reasoning. Decoding is memory bandwidth bound: the speed of token generation is limited by the memory bandwidth utilized to read weights and KV cache values. However, GPUs only use as little as 21\\% of the available bandwidth on weights and KV caches. Asynchronous execution is hard on GPUs, which creates CPU scheduling overheads, kernel synchronization overheads, and inadequate compute-communication overlap. While prior work attempts to address these overheads with kernel fusion and asynchronous execution on GPUs, they mostly focus on a single GPU and do not generalize across different types of model architectures.  We argue that to truly mitigate these overheads, \\emph{Dataflow Is All You Need}. Dataflow architectures execute subgraphs of operations asynchronously on one or more chips, thereby naturally mitigating the overhead faced on GPUs. In this paper, we chronicle a co-design approach to achieve peak decoding performance on a dataflow architecture -- the SambaNova SN40 Reconfigurable Dataflow Unit (RDU).  We describe three key optimizations enabled by dataflow -- \\emph{\\textbf{KernelLooping}}, \\emph{\\textbf{BatchStreaming}}, and \\emph{\\textbf{ScheduleOffloading}} -- that generalize over models that are small, large, dense, MoEs, hybrids, and with different attention mechanisms. Collectively, these optimizations deliver more than \\textbf{75\\%} of the theoretical peak roofline performance for a wide range of popular open-source models. We study speculative decoding in detail and demonstrate a speed-up of more than \\textbf{6$\\times$} with speculative decoding. 
Finally, we also show that speculative decoding runs \\textbf{1.7$\\times$} faster on 16 SN40 RDUs than DGX H100 despite having comparable HBM bandwidth. The techniques described in this paper and the models used in the evaluation are deployed in a production AI inference cloud at cloud.sambanova.ai.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3629", "url": null, "sourceid": 28, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=7wOOhxkuN8", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 923, "modified": "2026-03-23T21:52:47.993911-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=7wOOhxkuN8", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3573, "uid": "a1d0c6e83f027327d8461063f4ac58a6", "name": "RagInfer: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval", "authors": [{"id": 27607, "fullname": "Chien-Yu Lin", "url": "http://mlsys.org/api/miniconf/users/27607?format=json", "institution": "Meta"}, {"id": 26295, "fullname": "Keisuke Kamahori", "url": "http://mlsys.org/api/miniconf/users/26295?format=json", "institution": "University of Washington"}, {"id": 27608, "fullname": "Yiyu Liu", "url": "http://mlsys.org/api/miniconf/users/27608?format=json", "institution": "Harvard University"}, {"id": 27609, "fullname": "Xiaoxiang Shi", "url": 
"http://mlsys.org/api/miniconf/users/27609?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 27610, "fullname": "Madhav Kashyap", "url": "http://mlsys.org/api/miniconf/users/27610?format=json", "institution": "University of Washington"}, {"id": 27416, "fullname": "Yile Gu", "url": "http://mlsys.org/api/miniconf/users/27416?format=json", "institution": "Department of Computer Science, University of Washington"}, {"id": 27611, "fullname": "Rulin Shao", "url": "http://mlsys.org/api/miniconf/users/27611?format=json", "institution": "University of Washington"}, {"id": 12026, "fullname": "Zihao Ye", "url": "http://mlsys.org/api/miniconf/users/12026?format=json", "institution": "University of Washington"}, {"id": 17683, "fullname": "Kan Zhu", "url": "http://mlsys.org/api/miniconf/users/17683?format=json", "institution": "University of Washington"}, {"id": 27612, "fullname": "Rohan Kadekodi", "url": "http://mlsys.org/api/miniconf/users/27612?format=json", "institution": "University of Washington"}, {"id": 17972, "fullname": "Stephanie Wang", "url": "http://mlsys.org/api/miniconf/users/17972?format=json", "institution": "UW &amp; Anyscale"}, {"id": 11122, "fullname": "Arvind Krishnamurthy", "url": "http://mlsys.org/api/miniconf/users/11122?format=json", "institution": "University of Washington"}, {"id": 11020, "fullname": "Luis Ceze", "url": "http://mlsys.org/api/miniconf/users/11020?format=json", "institution": "University of Washington and NVIDIA"}, {"id": 17670, "fullname": "Baris Kasikci", "url": "http://mlsys.org/api/miniconf/users/17670?format=json", "institution": "University of Michigan"}], "abstract": "Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, creating a significant system challenge: achieving high throughput and low latency is difficult, especially when GPU memory is limited. 
To address these challenges, we propose RAGInfer, an efficient inference system that reduces latency and improves throughput with minimal GPU memory requirements. The core innovation of RAGInfer is \\emph{lookahead retrieval}, a prefetching mechanism that predicts required data and transfers them from CPU to GPU in parallel with LLM generation. In addition, RAGInfer adopts a prefetching scheduler and a cache-aware scheduler to support efficient multi-GPU inference with minimal overhead. Evaluations show RAGInfer achieves up to a 1.53$\\times$ average end-to-end latency reduction (single-query) and 1.83$\\times$ higher average throughput (batched), as well as good scalability in throughput. This confirms the practical utility of RAGInfer for faster and more memory-efficient deployments of RAG applications.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3573", "url": null, "sourceid": 42, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=YsOyCpMUYD", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 867, "modified": "2026-03-23T21:52:45.805417-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=YsOyCpMUYD", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3630, "uid": "c0c7c76d30bd3dcaefc96f40275bdc0a", "name": "ADS: AN AGENTIC DETECTION SYSTEM FOR ENTERPRISE AGENTIC AI SECURITY", "authors": 
[{"id": 27880, "fullname": "Chenning Li", "url": "http://mlsys.org/api/miniconf/users/27880?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 17735, "fullname": "Pan Hu", "url": "http://mlsys.org/api/miniconf/users/17735?format=json", "institution": "Alibaba Group"}, {"id": 27881, "fullname": "Justin Xu", "url": "http://mlsys.org/api/miniconf/users/27881?format=json", "institution": "University of Oxford"}, {"id": 27882, "fullname": "Baris Ozbas", "url": "http://mlsys.org/api/miniconf/users/27882?format=json", "institution": "Uber"}, {"id": 27883, "fullname": "Olivia Liu", "url": "http://mlsys.org/api/miniconf/users/27883?format=json", "institution": ""}, {"id": 25578, "fullname": "Caroline Van", "url": "http://mlsys.org/api/miniconf/users/25578?format=json", "institution": "Uber Technologies"}, {"id": 27884, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27884?format=json", "institution": null}, {"id": 27885, "fullname": "Wei Zhou", "url": "http://mlsys.org/api/miniconf/users/27885?format=json", "institution": "Uber"}, {"id": 12450, "fullname": "Mohammad Alizadeh", "url": "http://mlsys.org/api/miniconf/users/12450?format=json", "institution": "MIT CSAIL"}, {"id": 27886, "fullname": "Pengyu Zhang", "url": "http://mlsys.org/api/miniconf/users/27886?format=json", "institution": "Uber"}], "abstract": "We present ADR (Agentic AI Detection and Response), the first large-scale, production-proven enterprise framework for securing AI agents operating through the Model Context Protocol (MCP). We identify three persistent challenges in this domain: (1) limited observability, as existing telemetry fails to capture reasoning and tool-execution chains; (2) insufficient robustness, given vast, dynamic enterprise contexts and extreme class imbalance; and (3) high detection costs, as LLM-based inference is computationally expensive. 
ADR addresses these challenges via three components: the ADR Sensor for high-fidelity agentic telemetry, the ADR Explorer for continuous red teaming and hard-example generation, and the ADR Detector for scalable, two-tier online detection combining fast triage with context-aware reasoning. On ADR-Bench (302 tasks, 17 techniques, 133 MCP servers), ADR achieves zero false positives while detecting 67% of attacks\u2014outperforming three state-of-the-art baselines (ALRPHFS, GuardAgent, LlamaFirewall) by 2\u20134\u00d7. On AgentDojo (public prompt injection benchmark), ADR detects all attacks with only three false alarms out of 93 tasks. Over ten months of telemetry, ADR sustained reliable detection in production, uncovering credential exposures and enabling a shift-left prevention layer with 97.2% precision. ADR\u2019s source code and benchmark will be publicly available.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3630", "url": null, "sourceid": 50, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=7B91Naeszw", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 924, "modified": "2026-03-23T21:52:48.031481-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=7B91Naeszw", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3572, "uid": "d9d4f495e875a2e075a1a4a6e1b9770f", "name": "BOute: 
Cost-Efficient LLM Serving with Heterogeneous LLMs and GPUs via Multi-Objective Bayesian Optimization", "authors": [{"id": 21099, "fullname": "YOUHE JIANG", "url": "http://mlsys.org/api/miniconf/users/21099?format=json", "institution": "University of Cambridge"}, {"id": 27606, "fullname": "Fangcheng Fu", "url": "http://mlsys.org/api/miniconf/users/27606?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 10630, "fullname": "Eiko Yoneki", "url": "http://mlsys.org/api/miniconf/users/10630?format=json", "institution": "University of Cambridge"}], "abstract": "The rapid growth of large language model (LLM) deployments has made cost-efficient serving systems essential. Recent efforts to enhance system cost-efficiency adopt two main perspectives: (\\textbf{\\underline{i}}) An \\textit{algorithmic} perspective that exploits heterogeneous model capabilities to route simpler queries to lower-cost models and complex queries to higher-cost models (i.e., heterogeneous query routing); and (\\textbf{\\underline{ii}}) a \\textit{systems} perspective that utilizes heterogeneous GPU resources as cost-effective alternatives to homogeneous high-end GPUs (i.e., heterogeneous model deployment). However, algorithm-system co-design for cost-efficient LLM serving necessitates sophisticated management: (\\textbf{\\underline{i}}) Determining optimal query routing strategies under latency and quality requirements, (\\textbf{\\underline{ii}}) configuring model deployment across heterogeneous GPUs with appropriate resource allocation and parallelism strategies, and (\\textbf{\\underline{iii}}) co-optimizing routing and deployment decisions to maximize overall system performance. To address these challenges, we present BOute, a \\textit{quality-aware scheduling system} that jointly exploits heterogeneous model and GPU capabilities for cost-efficient LLM serving. 
BOute employs a \\textit{multi-objective Bayesian optimization (MOBO) framework} to co-optimize the routing strategy and model deployment, thereby maximizing the cost-efficiency of the serving system while guaranteeing response quality. Evaluation results demonstrate that BOute outperforms state-of-the-art LLM serving systems by up to 157\\% (59\\% on average) under \\textit{identical} cost budgets and quality requirements, or reduces serving costs by 15\\%-61\\% (38\\% on average) while maintaining the \\textit{same} performance targets, validating its effectiveness in achieving cost-efficient LLM serving.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3572", "url": null, "sourceid": 46, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=ZVQb92umqX", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 866, "modified": "2026-03-23T21:52:45.770781-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=ZVQb92umqX", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3626, "uid": "43ec517d68b6edd3015b3edc9a11367b", "name": "BEAM: Joint Resource\u2013Power Optimization for Energy-Efficient LLM Inference under SLO constraints", "authors": [{"id": 27852, "fullname": "Hyunjae Lee", "url": "http://mlsys.org/api/miniconf/users/27852?format=json", "institution": "Korea Advanced Institute of Science &amp; 
Technology"}, {"id": 27853, "fullname": "Sangjin Choi", "url": "http://mlsys.org/api/miniconf/users/27853?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 27854, "fullname": "Seungjae Lim", "url": "http://mlsys.org/api/miniconf/users/27854?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 27855, "fullname": "Youngjin Kwon", "url": "http://mlsys.org/api/miniconf/users/27855?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}], "abstract": "Large Language Model (LLM) serving is rapidly becoming one of the most power-intensive workloads in modern datacenters. Unlike training, where throughput dominates, inference must satisfy strict per-request latency targets such as Time-to-First-Token (TTFT) and Time-Between-Tokens (TBT). Once an SLO is met, the remaining latency slack between the earliest possible completion and the deadline offers an opportunity for energy savings. Existing systems, however, exploit only one dimension of this trade-off: batching improves resource efficiency, while DVFS improves power efficiency. These two axes are tightly coupled, and optimizing one while fixing the other yields only a local optimum. We present BEAM, a fine-grained controller that dynamically co-optimizes resource and power efficiency under per-request SLOs. BEAM continuously allocates the available latency slack across both dimensions by jointly tuning GPU frequency, chunk size, and microbatch count in real time. Its event-driven design responds instantly to request arrivals and completions, while a lightweight predictive model enables sub-millisecond decision making with negligible overhead. 
Implemented atop the vLLM runtime, BEAM reduces end-to-end GPU energy consumption by up to 51\\% compared to vLLM.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3626", "url": null, "sourceid": 81, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=BfNBXM8CCT", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 920, "modified": "2026-03-23T21:52:47.855700-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=BfNBXM8CCT", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3628, "uid": "1679091c5a880faf6fb5e6087eb1b2dc", "name": "GriNNder: Breaking the Memory Capacity Wall in Full-Graph GNN Training with Storage Offloading", "authors": [{"id": 27858, "fullname": "Jaeyong Song", "url": "http://mlsys.org/api/miniconf/users/27858?format=json", "institution": "Seoul National University"}, {"id": 27859, "fullname": "Seongyeon Park", "url": "http://mlsys.org/api/miniconf/users/27859?format=json", "institution": "Seoul National University"}, {"id": 27860, "fullname": "Hongsun Jang", "url": "http://mlsys.org/api/miniconf/users/27860?format=json", "institution": "Seoul National University"}, {"id": 27861, "fullname": "Jaewon Jung", "url": "http://mlsys.org/api/miniconf/users/27861?format=json", "institution": "Seoul National University"}, {"id": 26223, "fullname": "Hunseong Lim", "url": 
"http://mlsys.org/api/miniconf/users/26223?format=json", "institution": "Seoul National University"}, {"id": 27862, "fullname": "Junguk Hong", "url": "http://mlsys.org/api/miniconf/users/27862?format=json", "institution": "Seoul National University"}, {"id": 27863, "fullname": "Jinho Lee", "url": "http://mlsys.org/api/miniconf/users/27863?format=json", "institution": "Seoul National University"}], "abstract": "Full-graph training of graph neural networks (GNNs) is widely used as it enables direct validation of algorithmic improvements by preserving complete neighborhood information.  However, it typically requires multiple GPUs or servers, incurring substantial hardware and inter-device communication costs. While existing single-server methods reduce infrastructure requirements, they remain constrained by GPU and host memory capacity as graph sizes increase. To address this limitation, we introduce **GriNNder**, which is the first work to leverage storage devices to enable full-graph training even with limited memory. Because modern NVMe SSDs offer multi-terabyte capacities and bandwidths exceeding 10 GB/s, they provide an appealing option when memory resources are scarce. Yet, directly applying storage-based methods from other domains fails to address the unique access patterns and data dependencies in full-graph GNN training. GriNNder tackles these challenges by *structured storage offloading (SSO)*, a framework that manages the GPU-host-storage hierarchy through coordinated *cache*, *(re)gather*, and *bypass* mechanisms. To realize the framework, we devise (i) a partition-wise caching strategy for host memory that exploits the observation on cross-partition dependencies, (ii) a regathering strategy for gradient computation that eliminates redundant storage operations, and (iii) a lightweight partitioning scheme that mitigates the memory requirements of existing graph partitioners. 
In experiments performed over various models and datasets, GriNNder achieves up to 9.78$\\times$ speedup over state-of-the-art baselines and throughput comparable to distributed systems, enabling previously infeasible large-scale full-graph training even on a single GPU.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3628", "url": null, "sourceid": 6, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=8SNPzGRldN", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 922, "modified": "2026-03-23T21:52:47.956516-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=8SNPzGRldN", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3587, "uid": "38b3eff8baf56627478ec76a704e9b52", "name": "RAGBoost: Efficient Retrieval-Augmented Generation with Accuracy-Preserving Context Reuse", "authors": [{"id": 25934, "fullname": "Yinsicheng Jiang", "url": "http://mlsys.org/api/miniconf/users/25934?format=json", "institution": "University of Edinburgh"}, {"id": 27651, "fullname": "Yeqi Huang", "url": "http://mlsys.org/api/miniconf/users/27651?format=json", "institution": "University of Edinburgh, University of Edinburgh"}, {"id": 27652, "fullname": "Liang Cheng", "url": "http://mlsys.org/api/miniconf/users/27652?format=json", "institution": "University of Edinburgh"}, {"id": 25084, "fullname": "Cheng Deng", "url": 
"http://mlsys.org/api/miniconf/users/25084?format=json", "institution": "The University of Edinburgh"}, {"id": 27653, "fullname": "Xuan Sun", "url": "http://mlsys.org/api/miniconf/users/27653?format=json", "institution": "University of Edinburgh, University of Edinburgh"}, {"id": 27654, "fullname": "Luo Mai", "url": "http://mlsys.org/api/miniconf/users/27654?format=json", "institution": "Edinburgh University"}], "abstract": "Retrieval-augmented generation (RAG) enhances large language models (LLMs) with retrieved context but often suffers from downgraded prefill performance as modern applications demand longer and more complex inputs. Existing caching techniques either preserve accuracy with low cache reuse or improve reuse at the cost of degraded reasoning quality. We present RAGBoost, an efficient RAG system that achieves high cache reuse without sacrificing accuracy through accuracy-preserving context reuse. RAGBoost detects overlapping retrieved items across concurrent sessions and multi-turn interactions, using efficient context indexing, ordering, and de-duplication to maximize reuse, while lightweight contextual hints maintain reasoning fidelity.  
It integrates seamlessly with existing inference engines (SGLang and vLLM) and improves performance by 1.5\u20133\u00d7 over state-of-the-art methods (CacheBlend, RadixCache, LMCache, HiCache, and RAGCache), while preserving or even enhancing reasoning accuracy across diverse RAG and agentic AI workloads.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3587", "url": null, "sourceid": 101, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=RnKvDy1jv2", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 881, "modified": "2026-03-23T21:52:46.341828-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=RnKvDy1jv2", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3551, "uid": "ac627ab1ccbdb62ec96e702f07f6425b", "name": "CATWILD: Compiler Autotuning for TPU workloads in the Wild", "authors": [{"id": 27432, "fullname": "Ignacio Cano", "url": "http://mlsys.org/api/miniconf/users/27432?format=json", "institution": "Google"}, {"id": 11105, "fullname": "Yu Wang", "url": "http://mlsys.org/api/miniconf/users/11105?format=json", "institution": "Harvard University"}, {"id": 27183, "fullname": "Phitchaya Phothilimthana", "url": "http://mlsys.org/api/miniconf/users/27183?format=json", "institution": "OpenAI"}, {"id": 12264, "fullname": "Mike Burrows", "url": 
"http://mlsys.org/api/miniconf/users/12264?format=json", "institution": "Google Brain"}, {"id": 27433, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27433?format=json", "institution": null}, {"id": 27434, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27434?format=json", "institution": null}, {"id": 27435, "fullname": "Alexander Wertheim", "url": "http://mlsys.org/api/miniconf/users/27435?format=json", "institution": "Google"}, {"id": 27436, "fullname": "Chao Wang", "url": "http://mlsys.org/api/miniconf/users/27436?format=json", "institution": ""}, {"id": 27437, "fullname": "David Liu", "url": "http://mlsys.org/api/miniconf/users/27437?format=json", "institution": "Google"}, {"id": 27438, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27438?format=json", "institution": null}, {"id": 27181, "fullname": "Arissa Wongpanich", "url": "http://mlsys.org/api/miniconf/users/27181?format=json", "institution": "Google"}, {"id": 27439, "fullname": "Christof Angermueller", "url": "http://mlsys.org/api/miniconf/users/27439?format=json", "institution": "Google"}, {"id": 27440, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27440?format=json", "institution": null}, {"id": 27441, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27441?format=json", "institution": null}, {"id": 27146, "fullname": "Vineetha Govindaraj", "url": "http://mlsys.org/api/miniconf/users/27146?format=json", "institution": "Deepmind"}, {"id": 12263, "fullname": "Amit Sabne", "url": "http://mlsys.org/api/miniconf/users/12263?format=json", "institution": "Google"}, {"id": 27442, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27442?format=json", "institution": null}, {"id": 27443, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27443?format=json", "institution": null}, {"id": 27444, "fullname": "Berkin Ilbeyi", "url": "http://mlsys.org/api/miniconf/users/27444?format=json", "institution": "Google"}, {"id": 27445, "fullname": 
"Ryan Lefever", "url": "http://mlsys.org/api/miniconf/users/27445?format=json", "institution": "Google"}, {"id": 27446, "fullname": "Mehrdad Khani", "url": "http://mlsys.org/api/miniconf/users/27446?format=json", "institution": "Google"}, {"id": 27447, "fullname": "Subhankar Shah", "url": "http://mlsys.org/api/miniconf/users/27447?format=json", "institution": "Google"}, {"id": 27448, "fullname": "Ankit Sinha", "url": "http://mlsys.org/api/miniconf/users/27448?format=json", "institution": "Google"}, {"id": 27449, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27449?format=json", "institution": null}, {"id": 27450, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27450?format=json", "institution": null}, {"id": 27451, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27451?format=json", "institution": null}, {"id": 27452, "fullname": "Nikhil Sarda", "url": "http://mlsys.org/api/miniconf/users/27452?format=json", "institution": "Research, Google"}, {"id": 27453, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27453?format=json", "institution": null}, {"id": 27454, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27454?format=json", "institution": null}, {"id": 27455, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27455?format=json", "institution": null}, {"id": 27456, "fullname": "Emily Donahue", "url": "http://mlsys.org/api/miniconf/users/27456?format=json", "institution": "Cornell University"}, {"id": 27457, "fullname": "Sami Abu-El-Haija", "url": "http://mlsys.org/api/miniconf/users/27457?format=json", "institution": "Research, Google"}, {"id": 27458, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27458?format=json", "institution": null}, {"id": 27459, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27459?format=json", "institution": null}, {"id": 27460, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27460?format=json", "institution": null}, {"id": 27182, 
"fullname": "", "url": "http://mlsys.org/api/miniconf/users/27182?format=json", "institution": null}, {"id": 27461, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27461?format=json", "institution": null}, {"id": 27462, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27462?format=json", "institution": null}, {"id": 27463, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27463?format=json", "institution": null}, {"id": 11194, "fullname": "Naveen Kumar", "url": "http://mlsys.org/api/miniconf/users/11194?format=json", "institution": "Google"}], "abstract": "Compilers play a fundamental role at achieving peak performance for machine learning (ML) workloads. However, given the diverse nature of workloads and accelerators, compilers' heuristics and analytical cost models often result in sub-optimal performance, and thus waste precious datacenter resources. Furthermore, the multitude of tunable parameters and their complex interplay often make it impossible for human experts to manually find optimal configurations. In this paper, we present CATWILD, a system that automatically optimizes ML jobs in Google's TPU fleet using compiler autotuning techniques. We describe CATWILD\u2019s design and implementation, and evaluate its performance using a handful of representative metrics. We further report experiences and lessons learned from its five-year development and operation. To the best of our knowledge, CATWILD represents the first ML compiler autotuning solution deployed in datacenters at scale. 
Its successful rollout yielded substantial benefits, optimizing over 70% of daily TPU training jobs and achieving significant chip savings.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3551", "url": null, "sourceid": 99, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=hB3nov3gIP", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 845, "modified": "2026-03-23T21:52:44.871029-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=hB3nov3gIP", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3511, "uid": "d1fe173d08e959397adf34b1d77e88d7", "name": "ML Fleet Efficiency: Improving TPU Systems at Scale with ML Productivity Goodput", "authors": [{"id": 27181, "fullname": "Arissa Wongpanich", "url": "http://mlsys.org/api/miniconf/users/27181?format=json", "institution": "Google"}, {"id": 11197, "fullname": "Tayo Oguntebi", "url": "http://mlsys.org/api/miniconf/users/11197?format=json", "institution": "Google LLC"}, {"id": 27182, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27182?format=json", "institution": null}, {"id": 11105, "fullname": "Yu Wang", "url": "http://mlsys.org/api/miniconf/users/11105?format=json", "institution": "Harvard University"}, {"id": 27183, "fullname": "Phitchaya Phothilimthana", "url": "http://mlsys.org/api/miniconf/users/27183?format=json", "institution": 
"OpenAI"}, {"id": 27184, "fullname": "Ritwika Mitra", "url": "http://mlsys.org/api/miniconf/users/27184?format=json", "institution": "Google"}, {"id": 27185, "fullname": "Zongwei Zhou", "url": "http://mlsys.org/api/miniconf/users/27185?format=json", "institution": null}, {"id": 11194, "fullname": "Naveen Kumar", "url": "http://mlsys.org/api/miniconf/users/11194?format=json", "institution": "Google"}, {"id": 10754, "fullname": "Vijay Janapa Reddi", "url": "http://mlsys.org/api/miniconf/users/10754?format=json", "institution": "Harvard University"}], "abstract": "Machine learning (ML) infrastructures operating at warehouse scale present unique performance characterization challenges beyond traditional high-performance computing metrics. This paper introduces a systematic framework for analyzing ML fleet efficiency, demonstrated on Google's production TPU infrastructure comprising thousands of accelerators running diverse workloads. Our fleet-wide analysis reveals performance dependencies spanning the entire ML system stack, from hardware to model architecture, data pipelines, frameworks, compilers, and schedulers. We identify critical gaps in conventional utilization-based performance metrics and propose \"ML Productivity Goodput\" (MPG) to capture fleet-wide efficiency across heterogeneous ML environments. MPG decomposes efficiency into scheduling, runtime, and program components, enabling precise identification of bottlenecks at specific system layers. Applied to Google's production TPU workloads, our segmented analysis identified optimization opportunities across the stack: scheduling goodput exceeding 95% for all job sizes through careful preemption tuning, runtime improvements via framework modernization and asynchronous checkpointing, and program-level gains through compiler optimizations like communication-computation overlap. 
This establishes MPG as a practical methodology for managing large-scale ML computing infrastructure.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3511", "url": null, "sourceid": 79, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=y31QSL9yMG", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 805, "modified": "2026-03-23T21:52:43.351317-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=y31QSL9yMG", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3517, "uid": "6ea9ab1baa0efb9e19094440c317e21b", "name": "ProfInfer: An eBPF-based Fine-Grained LLM Inference Profiler", "authors": [{"id": 26249, "fullname": "Bohua Zou", "url": "http://mlsys.org/api/miniconf/users/26249?format=json", "institution": "Technical University of Munich"}, {"id": 27201, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27201?format=json", "institution": null}, {"id": 27202, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27202?format=json", "institution": null}, {"id": 27203, "fullname": "Weihao Xu", "url": "http://mlsys.org/api/miniconf/users/27203?format=json", "institution": "TUM"}, {"id": 25570, "fullname": "Binqi Sun", "url": "http://mlsys.org/api/miniconf/users/25570?format=json", "institution": "Technical University of Munich"}, {"id": 27204, "fullname": "", "url": 
"http://mlsys.org/api/miniconf/users/27204?format=json", "institution": null}, {"id": 27205, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27205?format=json", "institution": null}], "abstract": "As large language models (LLMs) move from research to production, understanding how inference engines behave in real time has become both essential and elusive. Unlike general-purpose engines such as ONNX Runtime, today\u2019s LLM inference systems offer little operator-level visibility, leaving developers blind to where time and resources go. Even basic questions\u2014is this workload memory-bound or compute-bound?\u2014often remain unanswered. To close this gap, we develop a fine-grained, non-intrusive profiling framework for modern LLM inference engines, exemplified by llama.cpp but applicable to similar runtime architectures. Built on extended Berkeley Packet Filter (eBPF) technology, our system dynamically attaches probes to runtime functions across multiple layers\u2014without modifying or recompiling the source. It transforms collected traces into rich visualizations of operators, graphs, timelines, and hardware counter trends, exposing how dense inference, Mixture-of-Experts routing, and operator offloading behave in practice. 
With less than 4% runtime overhead and high profiling fidelity, our framework makes LLM inference both transparent and diagnosable, turning performance profiling into a practical tool for optimization, scheduling, and resource-aware deployment.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3517", "url": null, "sourceid": 29, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=tYHWS7YPof", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 811, "modified": "2026-03-23T21:52:43.546541-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=tYHWS7YPof", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3598, "uid": "2838023a778dfaecdc212708f721b788", "name": "FreeScale: Distributed Training for Sequence Recommendation Models with Minimal Scaling Cost", "authors": [{"id": 27718, "fullname": "Chenhao Feng", "url": "http://mlsys.org/api/miniconf/users/27718?format=json", "institution": ""}, {"id": 25617, "fullname": "Haoli Zhang", "url": "http://mlsys.org/api/miniconf/users/25617?format=json", "institution": "Meta"}, {"id": 27158, "fullname": "Shakhzod Ali-zade", "url": "http://mlsys.org/api/miniconf/users/27158?format=json", "institution": "Meta Platforms, Inc."}, {"id": 15737, "fullname": "Yanli Zhao", "url": "http://mlsys.org/api/miniconf/users/15737?format=json", "institution": null}, {"id": 11119, 
"fullname": "Liang Luo", "url": "http://mlsys.org/api/miniconf/users/11119?format=json", "institution": "Meta"}, {"id": 27719, "fullname": "Jennifer Cao", "url": "http://mlsys.org/api/miniconf/users/27719?format=json", "institution": "Facebook"}, {"id": 27720, "fullname": "Lisen Deng", "url": "http://mlsys.org/api/miniconf/users/27720?format=json", "institution": "Meta"}, {"id": 27721, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27721?format=json", "institution": null}, {"id": 27722, "fullname": "Chenyu Zhao", "url": "http://mlsys.org/api/miniconf/users/27722?format=json", "institution": "Facebook"}, {"id": 27723, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27723?format=json", "institution": null}, {"id": 27724, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27724?format=json", "institution": null}, {"id": 27725, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27725?format=json", "institution": null}, {"id": 27726, "fullname": "Tiantu Xu", "url": "http://mlsys.org/api/miniconf/users/27726?format=json", "institution": "Meta"}, {"id": 27727, "fullname": "Yi Zhang", "url": "http://mlsys.org/api/miniconf/users/27727?format=json", "institution": "Facebook"}, {"id": 27728, "fullname": "Evgenii Kolpakov", "url": "http://mlsys.org/api/miniconf/users/27728?format=json", "institution": ""}, {"id": 27729, "fullname": "Siqi Yan", "url": "http://mlsys.org/api/miniconf/users/27729?format=json", "institution": "Facebook"}, {"id": 23924, "fullname": "Chuanhao Zhuge", "url": "http://mlsys.org/api/miniconf/users/23924?format=json", "institution": "Meta"}, {"id": 27730, "fullname": "Min Ni", "url": "http://mlsys.org/api/miniconf/users/27730?format=json", "institution": "Northwestern University"}, {"id": 27731, "fullname": "Bi Xue", "url": "http://mlsys.org/api/miniconf/users/27731?format=json", "institution": "Thinking Machines Lab"}, {"id": 27732, "fullname": "Qunshu Zhang", "url": 
"http://mlsys.org/api/miniconf/users/27732?format=json", "institution": "Facebook"}, {"id": 16149, "fullname": "Shen Li", "url": "http://mlsys.org/api/miniconf/users/16149?format=json", "institution": "Meta"}], "abstract": "Modern industrial Deep Learning Recommendation Models typically extract user preferences through the analysis of sequential interaction histories, subsequently generating predictions based on these derived interests. The inherent heterogeneity in data characteristics frequently result in substantial under-utilization of computational resources during large-scale training, primarily due to computational bubbles caused by severe stragglers and slow blocking communications. This paper introduces FreeScale, a solution designed to (1) mitigate the strag- gler problem through meticulously load balanced input samples (2) minimize the blocking communication by overlapping prioritized embedding communications with computations (3) resolve the GPU resource competition during computation and communication overlapping by communicating through SM-Free techniques. 
Empirical evaluation demonstrates that FreeScale achieves up to 90.3% reduction in computational bubbles when applied to real-world workloads running on 256 H100 GPUs.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3598", "url": null, "sourceid": 51, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=MY0BIdK4hn", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 892, "modified": "2026-03-23T21:52:46.752798-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=MY0BIdK4hn", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3524, "uid": "3416a75f4cea9109507cacd8e2f2aefc", "name": "Using Span Queries to Optimize Cache and Attention Locality", "authors": [{"id": 27258, "fullname": "Paul Castro", "url": "http://mlsys.org/api/miniconf/users/27258?format=json", "institution": "International Business Machines"}, {"id": 25924, "fullname": "Nick Mitchell", "url": "http://mlsys.org/api/miniconf/users/25924?format=json", "institution": "IBM Research"}, {"id": 27259, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27259?format=json", "institution": null}, {"id": 27260, "fullname": "Thomas Parnell", "url": "http://mlsys.org/api/miniconf/users/27260?format=json", "institution": "IBM Research"}, {"id": 27261, "fullname": "Mudhakar Srivatsa", "url": 
"http://mlsys.org/api/miniconf/users/27261?format=json", "institution": "International Business Machines"}, {"id": 27262, "fullname": "Antoni Viros i Martin", "url": "http://mlsys.org/api/miniconf/users/27262?format=json", "institution": "International Business Machines"}], "abstract": "Clients are evolving beyond chat completion, and now include a variety of innovative inference-time scaling and deep reasoning techniques. At the same time, inference servers remain heavily optimized for chat completion. Prior work has shown that large improvements to KV cache hit rate are possible if inference servers evolve towards these non-chat use cases. However, they offer solutions that are also optimized for a single use case, RAG. In this paper, we introduce the \\emph{span query} to generalize the interface to the inference server. We demonstrate that chat, RAG, inference-time scaling, and agentic workloads can all be expressed as span queries. We show how the critical distinction that had been assumed by prior work lies in whether the order of the inputs matter --- do they \\emph{commute}? In chat, they do not. In RAG, they often do. This paper introduces span queries, which are expression trees of inference calls, linked together with commutativity constraints. We describe span query syntax and semantics. We show how they can be automatically optimized to improve KV cache locality. We show how a small change to vLLM (affecting only 492 lines) can enable high-performance execution of span queries. Using this stack, we demonstrate that span queries can achieve 10-20x reductions in TTFT for two distinct non-chat use cases. Finally, we show that span queries can also be optimized to improve \\emph{attention locality}, so as to avoid the so-called lost-in-the-middle problem. 
We demonstrate that an attention-optimized span query on a 2b parameter model vastly outperforms the accuracy of a stock inference server using an 8b model.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3524", "url": null, "sourceid": 41, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=qcGGSXpFcM", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 818, "modified": "2026-03-23T21:52:43.795108-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=qcGGSXpFcM", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3594, "uid": "e2ef524fbf3d9fe611d5a8e90fefdc9c", "name": "Agentic Operator Generation for ML ASICs", "authors": [{"id": 27682, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27682?format=json", "institution": null}, {"id": 27683, "fullname": "Aram Markosyan", "url": "http://mlsys.org/api/miniconf/users/27683?format=json", "institution": "Facebook"}, {"id": 27684, "fullname": "Aman Dontula", "url": "http://mlsys.org/api/miniconf/users/27684?format=json", "institution": "Meta"}, {"id": 27685, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27685?format=json", "institution": null}, {"id": 27686, "fullname": "Zacharias Fisches", "url": "http://mlsys.org/api/miniconf/users/27686?format=json", "institution": "Facebook"}, {"id": 27687, "fullname": "Dmitrii Pedchenko", 
"url": "http://mlsys.org/api/miniconf/users/27687?format=json", "institution": "Meta FAIR"}, {"id": 27688, "fullname": "Keyur Muzumdar", "url": "http://mlsys.org/api/miniconf/users/27688?format=json", "institution": "Meta (FAIR)"}, {"id": 27689, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27689?format=json", "institution": null}, {"id": 27690, "fullname": "Mark Saroufim", "url": "http://mlsys.org/api/miniconf/users/27690?format=json", "institution": "Facebook"}, {"id": 27691, "fullname": "Joe Isaacson", "url": "http://mlsys.org/api/miniconf/users/27691?format=json", "institution": "PyTorch (Meta)"}, {"id": 27692, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27692?format=json", "institution": null}, {"id": 27693, "fullname": "Warren Hunt", "url": "http://mlsys.org/api/miniconf/users/27693?format=json", "institution": "Facebook"}, {"id": 27694, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27694?format=json", "institution": null}, {"id": 27695, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27695?format=json", "institution": null}, {"id": 27696, "fullname": "Gabriel Synnaeve", "url": "http://mlsys.org/api/miniconf/users/27696?format=json", "institution": "Meta"}, {"id": 27697, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27697?format=json", "institution": null}, {"id": 27698, "fullname": "Jacob Kahn", "url": "http://mlsys.org/api/miniconf/users/27698?format=json", "institution": "FAIR, Meta AI"}, {"id": 27527, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27527?format=json", "institution": null}], "abstract": "We present TritorX, an agentic AI system designed to generate functionally correct Triton PyTorch ATen kernels at scale for emerging accelerator platforms. TritorX integrates open-source large language models with a custom linter, JIT compilation, and a PyTorch OpInfo-based test harness. 
This pipeline operates both on deployed Meta Training and Inference Accelerator (MTIA) silicon and in hardware simulation environments for next-generation devices. In contrast to previous kernel-generation approaches that prioritize performance for a limited set of high-usage kernels, TritorX prioritizes coverage. Our system emphasizes correctness and generality across the entire operator set, including diverse data types, shapes, and argument patterns. In our experiments, TritorX successfully generated kernels and wrappers for 481 unique ATen operators that pass all corresponding PyTorch OpInfo tests (over 20,000 in total). TritorX paves the way for overnight generation of complete PyTorch ATen backends for new accelerator platforms.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3594", "url": null, "sourceid": 97, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=O3Bx0nNGnW", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 888, "modified": "2026-03-23T21:52:46.615166-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=O3Bx0nNGnW", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3544, "uid": "37693cfc748049e45d87b8c7d8b9aacd", "name": "NEST: Network- and Memory-Aware Device Placement for Distributed Deep Learning", "authors": [{"id": 26264, "fullname": "Irene Wang", "url": 
"http://mlsys.org/api/miniconf/users/26264?format=json", "institution": "Georgia Institute of Technology"}, {"id": 27364, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27364?format=json", "institution": null}, {"id": 11122, "fullname": "Arvind Krishnamurthy", "url": "http://mlsys.org/api/miniconf/users/11122?format=json", "institution": "University of Washington"}, {"id": 27358, "fullname": "Divya Mahajan", "url": "http://mlsys.org/api/miniconf/users/27358?format=json", "institution": "Georgia Institute of Technology"}], "abstract": "The growing scale of deep learning demands distributed training frameworks that jointly reason about parallelism, memory, and network topology. Prior works often rely on heuristic or topology-agnostic search, handling communication and memory separately. Without per-device memory awareness, these methods typically ensure feasibility post hoc by sharding parameters and activations across many devices, increasing synchronization, inflating communication, and underutilizing compute, limiting scalability and efficiency on real datacenter networks. We present NEST, a network-, compute-, and memory-aware device placement framework that unifies model parallelism, topology modeling, and memory feasibility via structured dynamic programming. NEST\u2019s DP operates on operator graphs with tensor and expert parallel configurations, explicit allreduce latencies across hierarchical or arbitrary networks, and memory/compute profiles. By factoring parallelism across tensor, pipeline, data, and expert dimensions, NEST defines a principled search space for hybrid strategies while jointly optimizing co-location, network latency, and memory feasibility. 
Evaluations across diverse hardware and networks show NEST achieves up to 2.35 times higher throughput, better memory efficiency, and improved scalability over state-of-the-art baselines, providing a foundation for co-designing parallelization strategies and datacenter interconnects for next-generation AI infrastructure.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3544", "url": null, "sourceid": 23, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=jpIoO2zSKA", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 838, "modified": "2026-03-23T21:52:44.568592-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=jpIoO2zSKA", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3638, "uid": "c9f0f895fb98ab9159f51fd0297e236d", "name": "Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference", "authors": [{"id": 25502, "fullname": "Mengtian Yang", "url": "http://mlsys.org/api/miniconf/users/25502?format=json", "institution": "The University of Texas at Austin"}, {"id": 27169, "fullname": "Zhekun Zhang", "url": "http://mlsys.org/api/miniconf/users/27169?format=json", "institution": "ByteDance Inc."}, {"id": 25517, "fullname": "Mingheng Wu", "url": "http://mlsys.org/api/miniconf/users/25517?format=json", "institution": "ByteDance"}, {"id": 27965, "fullname": "jianwen 
yan", "url": "http://mlsys.org/api/miniconf/users/27965?format=json", "institution": ""}, {"id": 27966, "fullname": "Hanshi Sun", "url": "http://mlsys.org/api/miniconf/users/27966?format=json", "institution": "ByteDance Seed"}, {"id": 20944, "fullname": "Li-Wen Chang", "url": "http://mlsys.org/api/miniconf/users/20944?format=json", "institution": "ByteDance Inc."}], "abstract": "Deploying large-scale LLM training and inference with optimal performance is exceptionally challenging due to a complex design space of parallelism strategies, system optimizations, and hardware configurations. Accurate and rapid performance simulation is critical for guiding optimization efforts and system studies by validating \u201cwhat-if\u201d hypotheses. To address this, we introduce Charon, a unified, modular, and fine-grained simulator for accurately predicting LLM performance. Experiments show Charon achieves high accuracy across different models and configurations, with an overall prediction error consistently under 5.35%, and even under 3.74% for training with over 10,000 GPUs. 
In a practical inference deployment case, Charon discovered a configuration that improved system throughput by 275% over a manually-tuned baseline, demonstrating its significant real-world value.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3638", "url": null, "sourceid": 8, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=19O6GAS7Su", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 932, "modified": "2026-03-23T21:52:48.378560-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=19O6GAS7Su", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3522, "uid": "c16a5320fa475530d9583c34fd356ef5", "name": "Scaling Up Large Language Models Serving Systems for Semantic Job Search", "authors": [{"id": 27235, "fullname": "Kayhan Behdin", "url": "http://mlsys.org/api/miniconf/users/27235?format=json", "institution": "LinkedIn"}, {"id": 19093, "fullname": "Qingquan Song", "url": "http://mlsys.org/api/miniconf/users/19093?format=json", "institution": null}, {"id": 25809, "fullname": "Sriram Vasudevan", "url": "http://mlsys.org/api/miniconf/users/25809?format=json", "institution": "LinkedIn Corporation"}, {"id": 27236, "fullname": "Jian Sheng", "url": "http://mlsys.org/api/miniconf/users/27236?format=json", "institution": "LinkedIn"}, {"id": 27237, "fullname": "Xiaojing Ma", "url": 
"http://mlsys.org/api/miniconf/users/27237?format=json", "institution": "LinkedIn"}, {"id": 27238, "fullname": "Zhengze Zhou", "url": "http://mlsys.org/api/miniconf/users/27238?format=json", "institution": "LinkedIn"}, {"id": 26224, "fullname": "Chuanrui Zhu", "url": "http://mlsys.org/api/miniconf/users/26224?format=json", "institution": "LinkedIn"}, {"id": 27239, "fullname": "Guoyao Li", "url": "http://mlsys.org/api/miniconf/users/27239?format=json", "institution": "xAI"}, {"id": 19228, "fullname": "Chanh Nguyen", "url": "http://mlsys.org/api/miniconf/users/19228?format=json", "institution": "LinkedIn"}, {"id": 27240, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27240?format=json", "institution": null}, {"id": 27241, "fullname": "Hejian Sang", "url": "http://mlsys.org/api/miniconf/users/27241?format=json", "institution": "LinkedIn"}, {"id": 24223, "fullname": "Ata Fatahi", "url": "http://mlsys.org/api/miniconf/users/24223?format=json", "institution": "LinkedIn Inc"}, {"id": 27242, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27242?format=json", "institution": null}, {"id": 27243, "fullname": "Xiaoqing Wang", "url": "http://mlsys.org/api/miniconf/users/27243?format=json", "institution": "LinkedIn"}, {"id": 26222, "fullname": "Qing Lan", "url": "http://mlsys.org/api/miniconf/users/26222?format=json", "institution": "LinkedIn"}, {"id": 27244, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27244?format=json", "institution": null}, {"id": 27245, "fullname": "Qi Guo", "url": "http://mlsys.org/api/miniconf/users/27245?format=json", "institution": "LinkedIn"}, {"id": 27246, "fullname": "Caleb Johnson", "url": "http://mlsys.org/api/miniconf/users/27246?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 17756, "fullname": "Zhipeng Wang", "url": "http://mlsys.org/api/miniconf/users/17756?format=json", "institution": "LinkedIn Corporation"}, {"id": 27247, "fullname": "", "url": 
"http://mlsys.org/api/miniconf/users/27247?format=json", "institution": null}], "abstract": "Large Language Models (LLMs) have demonstrated impressive quality when applied to predictive tasks such as relevance ranking and semantic search. However, deployment of such LLMs remains prohibitively expensive for industry applications with strict latency and throughput requirements. In this work, we present lessons and efficiency insights from developing a purely text-based decoder-only Small Language Model (SLM) for a semantic search application at LinkedIn. Particularly, we discuss model compression techniques such as pruning that allow us to reduce the model size by up to 40% while maintaining the accuracy. Additionally, we present context compression techniques that allow us to reduce the input context length by more than 10x with minimal loss of accuracy. Finally, we present practical lessons from optimizing the serving infrastructure for deploying such a system on GPUs at scale, serving millions of requests per second. 
Taken together, this allows us to increase our system\u2019s throughput by 10x in a real-world deployment, while meeting our quality bar.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3522", "url": null, "sourceid": 31, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=re82zZczHj", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 816, "modified": "2026-03-23T21:52:43.728713-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=re82zZczHj", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3525, "uid": "a3f390d88e4c41f2747bfa2f1b5f87db", "name": "Automated Algorithm Design for Auto-Tuning Optimizers", "authors": [{"id": 27263, "fullname": "Floris-Jan Willemsen", "url": "http://mlsys.org/api/miniconf/users/27263?format=json", "institution": "Leiden University LIACS"}, {"id": 27264, "fullname": "Niki van Stein", "url": "http://mlsys.org/api/miniconf/users/27264?format=json", "institution": "LIACS, Leiden University"}, {"id": 27143, "fullname": "Ben van Werkhoven", "url": "http://mlsys.org/api/miniconf/users/27143?format=json", "institution": "Leiden University"}], "abstract": "Automatic performance tuning (auto-tuning) is essential for optimizing high-performance applications, where vast and irregular search spaces make manual exploration infeasible. 
While auto-tuners traditionally rely on classical approaches such as evolutionary, annealing, or surrogate-based optimizers, designing algorithms that efficiently find near-optimal configurations robustly across diverse tasks is challenging. We propose a new paradigm: using large language models (LLMs) to automatically generate optimization algorithms tailored to auto-tuning problems. We introduce a framework that prompts LLMs with problem descriptions and search space characteristics to synthesize, test, and iteratively refine specialized optimizers. These generated algorithms are evaluated on four real-world auto-tuning applications across six hardware platforms and compared against the state-of-the-art in two contemporary auto-tuning frameworks.  The evaluation demonstrates that providing additional application- and search space-specific information in the generation stage results in an average performance improvement of 30.7% and 14.6%, respectively.  In addition, our results show that LLM-generated optimizers can rival, and in various cases outperform, existing human-designed algorithms, with our best-performing generated optimization algorithms achieving an average 72.4% improvement over state-of-the-art optimizers for auto-tuning.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3525", "url": null, "sourceid": 68, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=qKlHJCbY6m", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 819, "modified": "2026-03-23T21:52:43.828851-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, 
"is_live_content": false, "uri": "https://openreview.net/forum?id=qKlHJCbY6m", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3627, "uid": "3988c7f88ebcb58c6ce932b957b6f332", "name": "ProToken: Token-Level Attribution for Federated Large Language Models", "authors": [{"id": 27165, "fullname": "Waris Gill", "url": "http://mlsys.org/api/miniconf/users/27165?format=json", "institution": "Virginia Tech"}, {"id": 27856, "fullname": "Ahmad Humayun", "url": "http://mlsys.org/api/miniconf/users/27856?format=json", "institution": "Virginia Polytechnic Institute and State University"}, {"id": 20986, "fullname": "Ali Anwar", "url": "http://mlsys.org/api/miniconf/users/20986?format=json", "institution": "University of Minnesota"}, {"id": 27857, "fullname": "Muhammad Ali Gulzar", "url": "http://mlsys.org/api/miniconf/users/27857?format=json", "institution": "Virginia Tech"}], "abstract": "Federated Learning (FL) enables collaborative training of Large Language Models (LLMs) across distributed data sources while preserving privacy. However, when federated LLMs are deployed in critical applications, it remains unclear which client(s) contributed to specific generated responses, hindering debugging, malicious client identification, fair reward allocation, and trust verification. We present ProToken, a novel Provenance methodology for Token-level attribution in federated LLMs that addresses client attribution during autoregressive text generation while maintaining FL privacy constraints. 
ProToken leverages two key insights to enable provenance at each token: (1) transformer architectures concentrate task-specific signals in later blocks, enabling strategic layer selection for computational tractability, and (2) gradient-based relevance weighting filters out irrelevant neural activations, focusing attribution on neurons that directly influence token generation. We evaluate ProToken across 16 configurations spanning four LLM architectures (Gemma, Llama, Qwen, SmolLM) and four domains (medical, financial, mathematical, coding). ProToken achieves 98.62% average attribution accuracy in correctly localizing responsible client(s), and maintains high accuracy as the number of clients scales, validating its practical viability for real-world deployment settings.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3627", "url": null, "sourceid": 137, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=8WXUjbFr0Z", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 921, "modified": "2026-03-23T21:52:47.894190-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=8WXUjbFr0Z", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3597, "uid": "98dce83da57b0395e163467c9dae521b", "name": "Shannonic: Efficient Entropy-Optimal Compression for ML Workloads", "authors": [{"id": 17719, "fullname": 
"Kareem Ibrahim", "url": "http://mlsys.org/api/miniconf/users/17719?format=json", "institution": "University of Toronto"}, {"id": 27717, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27717?format=json", "institution": null}, {"id": 17666, "fullname": "Andreas Moshovos", "url": "http://mlsys.org/api/miniconf/users/17666?format=json", "institution": "University of Toronto"}], "abstract": "We present Shannonic, a lossless compression method for machine learning tensors that achieves near-entropy-optimal compression, minimal state footprint, and high throughput.  Shannonic uses an off-line pre-processing step to partition the tensor value space into optimally selected subranges and generates encoding/decoding tables that encode each value as a (range index, offset) pair where the range is entropy encoded using the asymmetric numeral systems (ANS) method. We formally prove and empirically show that Shannonic achieves higher compression efficiency than standard ANS. For a variety of 8b-quantized models, Shannonic's codec uses just 530B of state and achieves coding efficiency within 1\\% of the Shannon limit. 
Shannonic enables 1.3-3.1$\\times$ faster federated learning over bandwidth-constrained networks and 29-32\\% latency reduction in edge-cloud LLM inference.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3597", "url": null, "sourceid": 93, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=NhMxI0GbB8", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 891, "modified": "2026-03-23T21:52:46.716165-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=NhMxI0GbB8", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3557, "uid": "c7e1249ffc03eb9ded908c236bd1996d", "name": "Optimizing Deployment Configurations for LLM Inference", "authors": [{"id": 27501, "fullname": "Sungmin Cho", "url": "http://mlsys.org/api/miniconf/users/27501?format=json", "institution": "Meta"}, {"id": 27148, "fullname": "Jaewon Lee", "url": "http://mlsys.org/api/miniconf/users/27148?format=json", "institution": "Facebook"}, {"id": 27502, "fullname": "Chunqiang Tang", "url": "http://mlsys.org/api/miniconf/users/27502?format=json", "institution": "Meta Platforms"}, {"id": 27503, "fullname": "Yejin Lee", "url": "http://mlsys.org/api/miniconf/users/27503?format=json", "institution": "META"}, {"id": 13318, "fullname": "Geonhwa Jeong", "url": "http://mlsys.org/api/miniconf/users/13318?format=json", "institution": "Meta"}, {"id": 
27504, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27504?format=json", "institution": null}, {"id": 27505, "fullname": "Scott Batura", "url": "http://mlsys.org/api/miniconf/users/27505?format=json", "institution": null}, {"id": 27506, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27506?format=json", "institution": null}, {"id": 27507, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27507?format=json", "institution": null}, {"id": 27508, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27508?format=json", "institution": null}, {"id": 27509, "fullname": "Sijia Chen", "url": "http://mlsys.org/api/miniconf/users/27509?format=json", "institution": "Facebook"}, {"id": 27510, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27510?format=json", "institution": null}, {"id": 27511, "fullname": "Bradley Davis", "url": "http://mlsys.org/api/miniconf/users/27511?format=json", "institution": null}, {"id": 27512, "fullname": "Summer Deng", "url": "http://mlsys.org/api/miniconf/users/27512?format=json", "institution": "Facebook"}, {"id": 27513, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27513?format=json", "institution": null}, {"id": 27149, "fullname": "Emad El-Haraty", "url": "http://mlsys.org/api/miniconf/users/27149?format=json", "institution": "Facebook"}, {"id": 27514, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27514?format=json", "institution": null}, {"id": 27515, "fullname": "Lu Fang", "url": "http://mlsys.org/api/miniconf/users/27515?format=json", "institution": "Facebook"}, {"id": 21303, "fullname": "Lu Fang", "url": "http://mlsys.org/api/miniconf/users/21303?format=json", "institution": "Meta"}, {"id": 27516, "fullname": "Joshua Fromm", "url": "http://mlsys.org/api/miniconf/users/27516?format=json", "institution": "Facebook"}, {"id": 27517, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27517?format=json", "institution": null}, {"id": 27518, "fullname": "", 
"url": "http://mlsys.org/api/miniconf/users/27518?format=json", "institution": null}, {"id": 27519, "fullname": "Liangpeng Guo", "url": "http://mlsys.org/api/miniconf/users/27519?format=json", "institution": "Meta Platforms"}, {"id": 27520, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27520?format=json", "institution": null}, {"id": 27521, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27521?format=json", "institution": null}, {"id": 20937, "fullname": "Jianyu Huang", "url": "http://mlsys.org/api/miniconf/users/20937?format=json", "institution": "Research, Meta"}, {"id": 20923, "fullname": "Aya Ibrahim", "url": "http://mlsys.org/api/miniconf/users/20923?format=json", "institution": "Facebook"}, {"id": 27522, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27522?format=json", "institution": null}, {"id": 16156, "fullname": "Hongyi Jia", "url": "http://mlsys.org/api/miniconf/users/16156?format=json", "institution": "Meta"}, {"id": 27523, "fullname": "Changkyu Kim", "url": "http://mlsys.org/api/miniconf/users/27523?format=json", "institution": "Facebook"}, {"id": 27524, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27524?format=json", "institution": null}, {"id": 27525, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27525?format=json", "institution": null}, {"id": 27526, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27526?format=json", "institution": null}, {"id": 27527, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27527?format=json", "institution": null}, {"id": 27150, "fullname": "Xiaozhu Meng", "url": "http://mlsys.org/api/miniconf/users/27150?format=json", "institution": "Facebook"}, {"id": 27528, "fullname": "Vlad Tiberiu Mihailescu", "url": "http://mlsys.org/api/miniconf/users/27528?format=json", "institution": "Meta"}, {"id": 27529, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27529?format=json", "institution": null}, {"id": 15117, "fullname": "Maxim 
Naumov", "url": "http://mlsys.org/api/miniconf/users/15117?format=json", "institution": "Meta"}, {"id": 27530, "fullname": "Michal Ostrowski", "url": "http://mlsys.org/api/miniconf/users/27530?format=json", "institution": ""}, {"id": 27531, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27531?format=json", "institution": null}, {"id": 27532, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27532?format=json", "institution": null}, {"id": 16190, "fullname": "Sarunya Pumma", "url": "http://mlsys.org/api/miniconf/users/16190?format=json", "institution": "Meta"}, {"id": 27533, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27533?format=json", "institution": null}, {"id": 27534, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27534?format=json", "institution": null}, {"id": 27535, "fullname": "Jeremy Francis Reizenstein", "url": "http://mlsys.org/api/miniconf/users/27535?format=json", "institution": "Meta AI"}, {"id": 27536, "fullname": "Rajasi Saha", "url": "http://mlsys.org/api/miniconf/users/27536?format=json", "institution": "Facebook"}, {"id": 27537, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27537?format=json", "institution": null}, {"id": 27538, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27538?format=json", "institution": null}, {"id": 27539, "fullname": "Ruan Silva", "url": "http://mlsys.org/api/miniconf/users/27539?format=json", "institution": "Meta"}, {"id": 27540, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27540?format=json", "institution": null}, {"id": 27151, "fullname": "Jon Swenson", "url": "http://mlsys.org/api/miniconf/users/27151?format=json", "institution": "Facebook"}, {"id": 27541, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27541?format=json", "institution": null}, {"id": 27542, "fullname": "Chris Thi", "url": "http://mlsys.org/api/miniconf/users/27542?format=json", "institution": ""}, {"id": 27543, "fullname": "", "url": 
"http://mlsys.org/api/miniconf/users/27543?format=json", "institution": null}, {"id": 27152, "fullname": "Yunfan Wang", "url": "http://mlsys.org/api/miniconf/users/27152?format=json", "institution": "Facebook"}, {"id": 27544, "fullname": "Pengchao Wang", "url": "http://mlsys.org/api/miniconf/users/27544?format=json", "institution": "Meta Inc."}, {"id": 23937, "fullname": "Wenchen Wang", "url": "http://mlsys.org/api/miniconf/users/23937?format=json", "institution": null}, {"id": 27545, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27545?format=json", "institution": null}, {"id": 10964, "fullname": "Bram Wasti", "url": "http://mlsys.org/api/miniconf/users/10964?format=json", "institution": "Facebook"}, {"id": 27546, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27546?format=json", "institution": null}, {"id": 27547, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27547?format=json", "institution": null}, {"id": 20938, "fullname": "Jingyi Yang", "url": "http://mlsys.org/api/miniconf/users/20938?format=json", "institution": "Facebook"}, {"id": 27548, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27548?format=json", "institution": null}, {"id": 27549, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27549?format=json", "institution": null}, {"id": 27550, "fullname": "Jing Zhang", "url": "http://mlsys.org/api/miniconf/users/27550?format=json", "institution": "Facebook"}, {"id": 27153, "fullname": "Yi Zhen", "url": "http://mlsys.org/api/miniconf/users/27153?format=json", "institution": "Meta"}, {"id": 27551, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27551?format=json", "institution": null}], "abstract": "Meta's Large Language Models (LLMs)---the Llama model family---serve nearly one billion monthly active users. 
Deploying these models for inference involved navigating a complex design space that spanned diverse hardware options (e.g., H100, H200, MI300X), multiple parallelism strategies (tensor, pipeline, expert, context, and data parallelism), and nuanced runtime choices (e.g., continuous batching versus prefill-decode disaggregation)---all while leveraging workload-specific characteristics and meeting stringent service level objectives (SLOs). This paper presents insights we gained from developing and applying a systematic approach to analyze millions of deployment configurations and identify those that maximize throughput while meeting latency SLOs. We share lessons learned from our experience operating Llama inference at scale, including trade-offs among runtime designs, the phase-specific nature of parallelism strategies, opportunities for leveraging hardware heterogeneity, platform scaling behaviors, and system-level implications of model architectures such as Mixture-of-Experts (MoE). We hope our production experience offers practical insights for the broader LLM inference community.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3557", "url": null, "sourceid": 87, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=gEbKQeIdxB", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 851, "modified": "2026-03-23T21:52:45.120481-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=gEbKQeIdxB", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": 
true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3593, "uid": "fc490ca45c00b1249bbe3554a4fdf6fb", "name": "MorphServe: Efficient and Workload-Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing", "authors": [{"id": 27401, "fullname": "Zhaoyuan Su", "url": "http://mlsys.org/api/miniconf/users/27401?format=json", "institution": "University of Virginia, Charlottesville"}, {"id": 27679, "fullname": "Zeyu Zhang", "url": "http://mlsys.org/api/miniconf/users/27679?format=json", "institution": "University of Virginia, Charlottesville"}, {"id": 27403, "fullname": "Tingfeng Lan", "url": "http://mlsys.org/api/miniconf/users/27403?format=json", "institution": "University of Virginia, Charlottesville"}, {"id": 27405, "fullname": "Zirui Wang", "url": "http://mlsys.org/api/miniconf/users/27405?format=json", "institution": "University of Virginia, Charlottesville"}, {"id": 27680, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27680?format=json", "institution": null}, {"id": 27681, "fullname": "Juncheng Yang", "url": "http://mlsys.org/api/miniconf/users/27681?format=json", "institution": "Harvard University"}, {"id": 27406, "fullname": "Yue Cheng", "url": "http://mlsys.org/api/miniconf/users/27406?format=json", "institution": "University of Virginia, Charlottesville"}], "abstract": "Efficiently serving large language models (LLMs) under dynamic and bursty workloads remains a key challenge for real-world deployment. Existing serving frameworks and static model compression techniques fail to adapt to workload fluctuations, leading to either service-level objective (SLO) violations under full-precision serving or persistent accuracy degradation with static quantization. We present MorphServe, a dynamic, workload-aware LLM serving framework based on morphological adaptation. 
MorphServe introduces two asynchronous, token-level runtime mechanisms: quantized layer swapping, which selectively replaces less impactful layers with quantized alternatives during high-load periods, and pressure-aware KV cache resizing, which dynamically adjusts KV cache capacity in response to memory pressure. These mechanisms enable state-preserving transitions with minimum runtime overhead and are fully compatible with modern scheduling and attention techniques. Extensive experiments on Vicuna and Llama family models with real-world workloads demonstrate that MorphServe reduces average SLO violations by 92.45% and improves the P95 TTFT latency by 2.2\u20133.9$\\times$ compared to full-precision serving, without compromising generation quality. These results establish MorphServe as a practical and elastic solution for LLM deployment in dynamic environments.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3593", "url": null, "sourceid": 65, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=PDu13oOl4G", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 887, "modified": "2026-03-23T21:52:46.576422-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=PDu13oOl4G", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3560, "uid": "7f1de29e6da19d22b51c68001e7e0e54", "name": "Learning from Less: 
Measuring the Effectiveness of RLVR in Low Data and Compute Regimes", "authors": [{"id": 27558, "fullname": "Justin Bauer", "url": "http://mlsys.org/api/miniconf/users/27558?format=json", "institution": "Snorkel AI"}, {"id": 27559, "fullname": "Thomas Walshe", "url": "http://mlsys.org/api/miniconf/users/27559?format=json", "institution": "Reflection AI"}, {"id": 27560, "fullname": "Derek Pham", "url": "http://mlsys.org/api/miniconf/users/27560?format=json", "institution": "Snorkel AI"}, {"id": 27561, "fullname": "Harit Vishwakarma", "url": "http://mlsys.org/api/miniconf/users/27561?format=json", "institution": "University of Wisconsin, Madison"}, {"id": 27562, "fullname": "Armin Parchami", "url": "http://mlsys.org/api/miniconf/users/27562?format=json", "institution": "Snorkel AI"}, {"id": 27563, "fullname": "Frederic Sala", "url": "http://mlsys.org/api/miniconf/users/27563?format=json", "institution": "University of Wisconsin, Madison"}, {"id": 27564, "fullname": "Paroma Varma", "url": "http://mlsys.org/api/miniconf/users/27564?format=json", "institution": "Snorkel AI"}], "abstract": "Fine-tuning Large Language Models (LLMs) typically relies on large quantities of high-quality annotated data, or questions with well-defined ground truth answers in the case of Reinforcement Learning with Verifiable Rewards (RLVR). While previous work has explored the benefits to model reasoning capabilities of scaling both the data and compute used for RLVR, these results lack applicability in many real-world settings where annotated data and accessible compute may be scarce. In this work, we present a comprehensive empirical study of open-source Small Language Model (SLM) performance after RLVR in low data regimes. Across three novel datasets covering number counting problems, graph reasoning, and spatial reasoning, we characterize how model performance scales with dataset size, diversity, and complexity. 
We demonstrate that (1) procedural datasets allow for fine-grained evaluation and training dataset development with controllable properties (size, diversity, and complexity), (2) RLVR enables models trained on lower complexity tasks to generalize to higher complexity tasks, and (3) training on mixed complexity datasets offers the greatest benefits in low data regimes, providing up to 5$\\times$ sample efficiency versus training on easy tasks. These findings inspire future work on the development of data scaling laws for RLVR and the use of procedural data generators to further understand effective data development for efficient LLM fine-tuning.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3560", "url": null, "sourceid": 135, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=fV4t4kYvgi", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 854, "modified": "2026-03-23T21:52:45.251890-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=fV4t4kYvgi", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3607, "uid": "fe9fc289c3ff0af142b6d3bead98a923", "name": "BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models", "authors": [{"id": 23955, "fullname": "Zhengyang Wang", "url": "http://mlsys.org/api/miniconf/users/23955?format=json", "institution": "University of 
California, Santa Barbara"}, {"id": 23956, "fullname": "Ziyue Liu", "url": "http://mlsys.org/api/miniconf/users/23956?format=json", "institution": "University of California Santa Barbara"}, {"id": 27774, "fullname": "Ruijie Zhang", "url": "http://mlsys.org/api/miniconf/users/27774?format=json", "institution": "University of California, Santa Barbara"}, {"id": 27161, "fullname": "Avinash Maurya", "url": "http://mlsys.org/api/miniconf/users/27161?format=json", "institution": "Argonne National Laboratory"}, {"id": 27775, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27775?format=json", "institution": null}, {"id": 27776, "fullname": "Paul Hovland", "url": "http://mlsys.org/api/miniconf/users/27776?format=json", "institution": "Argonne National Laboratory"}, {"id": 27777, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27777?format=json", "institution": null}, {"id": 24510, "fullname": "zheng Zhang", "url": "http://mlsys.org/api/miniconf/users/24510?format=json", "institution": "University of California, Santa Barbara"}], "abstract": "The scale of transformer model pre-training is constrained by the increasing computation and communication cost. Low-rank bottleneck architectures offer a promising solution to significantly reduce the training time and memory footprint with minimum impact on accuracy. Despite algorithmic efficiency, bottleneck architectures scale poorly under standard tensor parallelism. Simply applying 3D parallelism designed for full-rank methods leads to excessive communication and poor GPU utilization. To address this limitation, we propose BOOST, an efficient training framework tailored for large-scale low-rank bottleneck architectures. BOOST introduces a novel Bottleneck-aware Tensor Parallelism, and combines optimizations such as online-RMSNorm, linear layer grouping, and low-rank activation checkpointing to achieve end-to-end training speedup. 
Evaluations on different low-rank bottleneck architectures demonstrate that BOOST achieves 1.46\u20131.91$\\times$ speedup over full-rank model baselines and 1.87\u20132.27$\\times$ speedup over low-rank model with naively integrated 3D parallelism, with improved GPU utilization and reduced communication overhead.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3607", "url": null, "sourceid": 83, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=JhN5hldx4V", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 901, "modified": "2026-03-23T21:52:47.082168-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=JhN5hldx4V", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3641, "uid": "92cc227532d17e56e07902b254dfad10", "name": "SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models", "authors": [{"id": 27979, "fullname": "Jiayi Tian", "url": "http://mlsys.org/api/miniconf/users/27979?format=json", "institution": "University of California, Santa Barbara"}, {"id": 27980, "fullname": "Seyedarmin Azizi", "url": "http://mlsys.org/api/miniconf/users/27980?format=json", "institution": "University of Southern California"}, {"id": 24084, "fullname": "Yequan Zhao", "url": "http://mlsys.org/api/miniconf/users/24084?format=json", "institution": "University 
of California Santa Barbara"}, {"id": 27981, "fullname": "Erfan Potraghloo", "url": "http://mlsys.org/api/miniconf/users/27981?format=json", "institution": "University of Southern California"}, {"id": 27982, "fullname": "Sean McPherson", "url": "http://mlsys.org/api/miniconf/users/27982?format=json", "institution": "Intel"}, {"id": 27983, "fullname": "Sharath Nittur Sridhar", "url": "http://mlsys.org/api/miniconf/users/27983?format=json", "institution": "Intel Labs"}, {"id": 23955, "fullname": "Zhengyang Wang", "url": "http://mlsys.org/api/miniconf/users/23955?format=json", "institution": "University of California, Santa Barbara"}, {"id": 24510, "fullname": "zheng Zhang", "url": "http://mlsys.org/api/miniconf/users/24510?format=json", "institution": "University of California, Santa Barbara"}, {"id": 27347, "fullname": "Massoud Pedram", "url": "http://mlsys.org/api/miniconf/users/27347?format=json", "institution": "University of Southern California"}, {"id": 27984, "fullname": "Souvik Kundu", "url": "http://mlsys.org/api/miniconf/users/27984?format=json", "institution": "Intel"}], "abstract": "Large reasoning models (LRMs) often incur significant key-value (KV) cache overhead, which grows linearly with the verbose chain-of-thought (CoT) reasoning process. This creates both memory and throughput bottlenecks, limiting their efficient deployment. Towards reducing KV cache size during inference, we first investigate the effectiveness of existing KV cache eviction methods for CoT reasoning. Interestingly, we find that due to unstable token-wise scoring and the reduced effective KV budget caused by padding tokens, state-of-the-art (SoTA) eviction methods fail to maintain accuracy in the multi-batch setting. Additionally, these methods often generate longer sequences than the original model, as semantic-unaware token-wise eviction leads to repeated revalidation during reasoning. 
To address these issues, we present \\textbf{SkipKV}, a \\textbf{\\textit{training-free}} KV compression method for selective \\textit{eviction} and \\textit{generation} that operates via coarse-grained sentence-level sequence removal for efficient CoT reasoning. Specifically, it introduces a \\textit{sentence-scoring metric} to identify and remove highly similar sentences while maintaining semantic coherence. To suppress redundant generation, SkipKV dynamically adjusts a steering vector to update the hidden activation states during inference, enforcing concise responses from the LRM. Extensive evaluations on multiple reasoning benchmarks demonstrate the effectiveness of SkipKV in achieving up to $\\mathbf{26.7}\\%$ higher accuracy than the alternatives at a similar compression budget. Additionally, compared to SoTA, SkipKV yields up to $\\mathbf{1.6}\\times$ shorter generation lengths while improving throughput by up to $\\mathbf{1.7}\\times$.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3641", "url": null, "sourceid": 92, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=0EsV9SIm8p", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 935, "modified": "2026-03-23T21:52:48.481281-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=0EsV9SIm8p", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": 
[]}, {"id": 3576, "uid": "4c56ff4ce4aaf9573aa5dff913df997a", "name": "DriftBench: Measuring and Predicting Infrastructure Drift in LLM Serving Systems", "authors": [{"id": 27155, "fullname": "Gianluigi Vitale", "url": "http://mlsys.org/api/miniconf/users/27155?format=json", "institution": "Universitas Mercatorum"}], "abstract": "Production LLM deployments lack systematic methods to assess output consistency risks when infrastructure changes. We present DriftBench, a measurement and prediction framework comprising 236,985 prompt-response pairs across 105 configurations spanning 5 models, 4 GPU platforms, 3 frameworks, 3 precisions. We develop the Portability Risk Index (PRI), achieving $R^2$=0.987 on held-out test data ($R^2$ ranges from 0 to 1, with higher values indicating better predictive accuracy) with held-out-dimension generalization: hardware $R^2$=0.909, precision $R^2$=0.763. We discover a fundamental dichotomy: hardware/precision changes exhibit systematic drift ($R^2 \\geq 0.76$) enabling predict-once deployment, while framework/model changes show idiosyncratic drift ($R^2 < 0.48$) requiring re-measurement. Production validation blocked a +9.23pp drift upgrade affecting 1 in 5 queries, demonstrating operational value. Our contribution is measurement and risk assessment; we do not propose drift mitigation techniques, as this remains an open challenge for future work. 
Verification: https://anonymous.4open.science/r/reviewer-verification-5F4E/ | DriftBench CLI: https://anonymous.4open.science/r/driftbench-7FEC/", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3576", "url": null, "sourceid": 121, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=Xfzzp6grRP", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 870, "modified": "2026-03-23T21:52:45.920644-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=Xfzzp6grRP", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3508, "uid": "17e62166fc8586dfa4d1bc0e1742c08b", "name": "CRAFT: Cost-aware Expert Replica Allocation with Fine-Grained Layerwise Estimations", "authors": [{"id": 27141, "fullname": "Adrian Zhao", "url": "http://mlsys.org/api/miniconf/users/27141?format=json", "institution": "University of Toronto"}, {"id": 27170, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27170?format=json", "institution": null}, {"id": 27171, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27171?format=json", "institution": null}, {"id": 27172, "fullname": "Lingfan Yu", "url": "http://mlsys.org/api/miniconf/users/27172?format=json", "institution": "Amazon"}, {"id": 27173, "fullname": "Haozheng Fan", "url": "http://mlsys.org/api/miniconf/users/27173?format=json", "institution": "Amazon"}, {"id": 20941, 
"fullname": "Jun Wu", "url": "http://mlsys.org/api/miniconf/users/20941?format=json", "institution": "Amazon"}, {"id": 11990, "fullname": "Yida Wang", "url": "http://mlsys.org/api/miniconf/users/11990?format=json", "institution": "Amazon"}, {"id": 17626, "fullname": "Nandita Vijaykumar", "url": "http://mlsys.org/api/miniconf/users/17626?format=json", "institution": "Department of Computer Science, University of Toronto"}], "abstract": "Mixture-of-Experts (MoE) has recently emerged as the mainstream architecture for efficiently scaling large language models while maintaining near-constant computational cost. Expert parallelism distributes parameters by partitioning experts across devices, but this introduces token-level load imbalance during inference. Expert replication is a widely adopted load-balancing technique in serving frameworks that alleviates load imbalance in large-scale deployments by replicating experts with high loads. In this work, we demonstrate that existing replication schemes often _over-replicate_, with many replicas providing marginal improvement. Replicas consume substantial GPU memory, which may lead to resource contention and throughput degradation. We present CRAFT, an efficient expert replication framework that maximizes load balance under a given memory budget by performing fine-grained, per-layer replication based on the estimated replication benefit. CRAFT can be seamlessly integrated into existing serving frameworks without any additional training or model changes. 
Our evaluation shows that CRAFT increases end-to-end serving throughput by $1.14\\times$ on average (up to $1.2\\times$) over existing replication techniques in large-scale deployments with models ranging from hundreds of billions to a trillion parameters.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3508", "url": null, "sourceid": 43, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=zdRvzU9ZCe", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 802, "modified": "2026-03-23T21:52:43.222124-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=zdRvzU9ZCe", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3578, "uid": "4e732ced3463d06de0ca9a15b6153677", "name": "Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading", "authors": [{"id": 13393, "fullname": "Jianming Tong", "url": "http://mlsys.org/api/miniconf/users/13393?format=json", "institution": "Georgia Tech/Google"}, {"id": 27626, "fullname": "Hanshen Xiao", "url": "http://mlsys.org/api/miniconf/users/27626?format=json", "institution": "Purdue University"}, {"id": 27627, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27627?format=json", "institution": null}, {"id": 19218, "fullname": "Hao Kang", "url": "http://mlsys.org/api/miniconf/users/19218?format=json", "institution": "Georgia Institute of 
Technology"}, {"id": 27628, "fullname": "Ashish Sirasao", "url": "http://mlsys.org/api/miniconf/users/27628?format=json", "institution": "Amd inc"}, {"id": 27629, "fullname": "Ziqi Zhang", "url": "http://mlsys.org/api/miniconf/users/27629?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 27630, "fullname": "G. Edward Suh", "url": "http://mlsys.org/api/miniconf/users/27630?format=json", "institution": "NVIDIA"}, {"id": 11662, "fullname": "Tushar Krishna", "url": "http://mlsys.org/api/miniconf/users/11662?format=json", "institution": "Georgia Institute of Technology"}], "abstract": "Multi-user virtual reality (VR) applications such as football and concert experiences rely on real-time avatar reconstruction to enable immersive interaction. However, rendering avatars for numerous participants on each headset incurs prohibitive computational overhead, fundamentally limiting scalability. This work introduces a framework, Privatar, to offload avatar reconstruction from the headset to untrusted devices within the same local network while safeguarding sensitive facial features against adversaries capable of intercepting offloaded data.  Privatar builds on two insights. (1) **System level**. We observe that identity-bearing information in facial inputs is highly skewed across frequencies, and propose **Horizontal Partitioning (HP)** to keep the most identifying frequency components on-device and offload only low-identifiability components. HP offloads local computation while preserving privacy against expression identification attacks. (2) **Privacy accounting level**. For **individually** offloaded, **multi-dimensional** signals without aggregation, worst-case local Differential Privacy requires prohibitive noise, ruining utility. We observe that users\u2019 expression statistics are **stable over time**, and hence propose Distribution-Aware Minimal Perturbation (DAMP). 
DAMP minimizes noise based on each user\u2019s expression distribution to significantly reduce its effects on utility and accuracy, retaining a formal privacy guarantee.  On a Meta Quest Pro, Privatar supports up to 2.37$\\times$ more concurrent users at 5.7~6.5\\% higher reconstruction loss and ~9\\% energy overhead, providing a better Throughput-Loss Pareto frontier than SotA quantization, sparsity, and local reconstruction baselines. Privatar further provides a provable privacy guarantee and stays robust against both an empirical attack and an NN-based Expression Identification Attack, demonstrating its resilience in practice. Our code is open-sourced at https://anonymous.4open.science/r/Privatar-372A.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3578", "url": null, "sourceid": 26, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=WjJfnNhY65", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 872, "modified": "2026-03-23T21:52:46.003516-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=WjJfnNhY65", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3519, "uid": "34173cb38f07f89ddbebc2ac9128303f", "name": "MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces", "authors": [{"id": 27214, "fullname": "Srinivas", "url": 
"http://mlsys.org/api/miniconf/users/27214?format=json", "institution": "NVIDIA"}, {"id": 27215, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27215?format=json", "institution": null}, {"id": 27216, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27216?format=json", "institution": null}, {"id": 27217, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27217?format=json", "institution": null}, {"id": 27218, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27218?format=json", "institution": null}, {"id": 27219, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27219?format=json", "institution": null}, {"id": 27220, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27220?format=json", "institution": null}, {"id": 27221, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27221?format=json", "institution": null}, {"id": 27142, "fullname": "Hanjiang Wu", "url": "http://mlsys.org/api/miniconf/users/27142?format=json", "institution": "Georgia Institute of Technology"}, {"id": 16517, "fullname": "Changhai Man", "url": "http://mlsys.org/api/miniconf/users/16517?format=json", "institution": "Georgia Institute of Technology"}, {"id": 27222, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27222?format=json", "institution": null}, {"id": 27223, "fullname": "Huan Xu", "url": "http://mlsys.org/api/miniconf/users/27223?format=json", "institution": "Georgia Institute of Technology"}, {"id": 14298, "fullname": "William Won", "url": "http://mlsys.org/api/miniconf/users/14298?format=json", "institution": "Georgia Institute of Technology"}, {"id": 27224, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27224?format=json", "institution": null}, {"id": 25869, "fullname": "Winston Liu", "url": "http://mlsys.org/api/miniconf/users/25869?format=json", "institution": "Keysight Technologies"}, {"id": 27225, "fullname": "Andrey Balogh", "url": 
"http://mlsys.org/api/miniconf/users/27225?format=json", "institution": ""}, {"id": 27226, "fullname": "Dan Mihailescu", "url": "http://mlsys.org/api/miniconf/users/27226?format=json", "institution": "Keysight Technologies"}, {"id": 27227, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27227?format=json", "institution": null}, {"id": 26285, "fullname": "Vinay Ramakrishnaiah", "url": "http://mlsys.org/api/miniconf/users/26285?format=json", "institution": "AMD"}, {"id": 27228, "fullname": "Spandan More", "url": "http://mlsys.org/api/miniconf/users/27228?format=json", "institution": "AMD"}, {"id": 27229, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27229?format=json", "institution": null}, {"id": 20878, "fullname": "Louis Feng", "url": "http://mlsys.org/api/miniconf/users/20878?format=json", "institution": "University of California, Davis"}, {"id": 27230, "fullname": "Ashwin Ramachandran", "url": "http://mlsys.org/api/miniconf/users/27230?format=json", "institution": ""}, {"id": 14774, "fullname": "Puneet Sharma", "url": "http://mlsys.org/api/miniconf/users/14774?format=json", "institution": "HP Labs"}, {"id": 27231, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27231?format=json", "institution": null}, {"id": 10754, "fullname": "Vijay Janapa Reddi", "url": "http://mlsys.org/api/miniconf/users/10754?format=json", "institution": "Harvard University"}, {"id": 14827, "fullname": "David Kanter", "url": "http://mlsys.org/api/miniconf/users/14827?format=json", "institution": "MLCommons"}, {"id": 11662, "fullname": "Tushar Krishna", "url": "http://mlsys.org/api/miniconf/users/11662?format=json", "institution": "Georgia Institute of Technology"}], "abstract": "We present Chakra, an open and portable ecosystem for performance benchmarking and co-design. The core component of Chakra is an open and interoperable graph-based representation of distributed AI/ML workloads, called Chakra Execution Traces~(ET). 
These ETs represent key operations, such as compute, memory, and communication, data and control dependencies, timing, and resource constraints. Additionally, Chakra includes a complementary set of tools and capabilities to enable the collection, analysis, generation, and adoption of Chakra ETs by a broad range of simulators, emulators, and replay tools. We present an analysis of Chakra traces collected on production AI clusters and demonstrate their value via real-world case studies. Chakra has been adopted by MLCommons and has active contributions and engagement across the industry, including NVIDIA, AMD, Meta, Keysight, HPE, and Scala, among others.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3519", "url": null, "sourceid": 30, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=s2WcSv2Hzt", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 813, "modified": "2026-03-23T21:52:43.622745-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=s2WcSv2Hzt", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3580, "uid": "da4fb5c6e93e74d3df8527599fa62642", "name": "Zero redundancy distributed learning with differential privacy", "authors": [{"id": 27633, "fullname": "Zhiqi Bu", "url": "http://mlsys.org/api/miniconf/users/27633?format=json", "institution": "FAIR MSL"}, {"id": 27634, 
"fullname": "Justin Chiu", "url": "http://mlsys.org/api/miniconf/users/27634?format=json", "institution": "University of Washington"}, {"id": 27635, "fullname": "Ruixuan Liu", "url": "http://mlsys.org/api/miniconf/users/27635?format=json", "institution": "Emory University"}, {"id": 27636, "fullname": "Sheng Zha", "url": "http://mlsys.org/api/miniconf/users/27636?format=json", "institution": "Amazon"}, {"id": 27637, "fullname": "George Karypis", "url": "http://mlsys.org/api/miniconf/users/27637?format=json", "institution": "University of Minnesota, Minneapolis"}], "abstract": "Deep learning using large models has achieved great success in a wide range of domains. However, training these models with billions of parameters is very challenging in terms of training speed, memory cost, and communication efficiency, especially under the privacy-preserving regime with differential privacy (DP). On the one hand, the efficiency of DP optimization is comparable to that of standard non-DP optimization on a single GPU, but existing DP distributed learning is significantly inefficient on multiple GPUs. On the other hand, the Zero Redundancy Optimizer (ZeRO) is a state-of-the-art solution for standard distributed learning, but it is technically complicated to make compatible with DP. In this work, we develop a new systematic solution, DP-ZeRO, (I) to scale up the trainable DP model size, e.g. to GPT-100B, (II) to obtain the same computation and communication efficiency as the standard ZeRO, and (III) to enable mixed-precision DP training. Our DP-ZeRO, like the standard ZeRO, has the potential to train models of arbitrary size and exhibits excellent training efficiency on large models. 
Code at \\url{https://anonymous.4open.science/r/fast-differential-privacy-3B50}.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3580", "url": null, "sourceid": 120, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=VGacNNZfgo", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 874, "modified": "2026-03-23T21:52:46.076093-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=VGacNNZfgo", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3537, "uid": "b6d767d2f8ed5d21a44b0e5886680cb9", "name": "FlashAgents: Accelerating Multi-Agent LLM Systems via Streaming Prefill Overlap", "authors": [{"id": 25922, "fullname": "Taosong Fang", "url": "http://mlsys.org/api/miniconf/users/25922?format=json", "institution": "Institute of Software Chinese Academy of Sciences"}, {"id": 27257, "fullname": "Zhen Zheng", "url": "http://mlsys.org/api/miniconf/users/27257?format=json", "institution": "Microsoft"}, {"id": 27342, "fullname": "Zhengzhao Ma", "url": "http://mlsys.org/api/miniconf/users/27342?format=json", "institution": "University of the Chinese Academy of Sciences"}, {"id": 25829, "fullname": "Yaojie Lu", "url": "http://mlsys.org/api/miniconf/users/25829?format=json", "institution": null}, {"id": 27343, "fullname": "Hongyu Lin", "url": "http://mlsys.org/api/miniconf/users/27343?format=json", 
"institution": "Institute of Software, Chinese Academy of Sciences"}, {"id": 27344, "fullname": "Xianpei Han", "url": "http://mlsys.org/api/miniconf/users/27344?format=json", "institution": "Institute of Software, CAS"}, {"id": 27345, "fullname": "Le Sun", "url": "http://mlsys.org/api/miniconf/users/27345?format=json", "institution": "Institute of Software, Chinese Academy of Sciences"}], "abstract": "Large Language Models (LLMs) are increasingly deployed as collaborating agents in Multi-Agent Systems (MAS), where sequential agent interactions create significant latency bottlenecks. Traditional serving systems require each downstream agent to wait for complete upstream generation before starting prefill, leaving substantial idle time during inter-agent transitions. We present FlashAgents, a system that accelerates multi-agent workflows through token-level streaming and prefix-aware coordination. FlashAgents introduces Inter-agent streaming and incremental prefill, which streams tokens between agents and performs incremental prefill to overlap downstream prefill with upstream decode, reducing inter-agent latency. For concurrent workloads, an intra-turn prefix cache built on radix trees detects and eliminates redundant prefill across requests sharing common instruction templates, avoiding recomputation of shared prefixes within the same processing turn. 
Implemented on SGLang, FlashAgents achieves up to 46\\% end-to-end latency reduction on real workflows and 3.5$\\times$ speedup in controlled two-agent benchmarks, demonstrating consistent improvements across diverse models and interaction patterns.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3537", "url": null, "sourceid": 22, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=m14PPUfgEc", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 831, "modified": "2026-03-23T21:52:44.294163-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=m14PPUfgEc", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3534, "uid": "a5771bce93e200c36f7cd9dfd0e5deaa", "name": "ApproxMLIR : Accuracy-Aware Compiler for Compound ML System", "authors": [{"id": 27335, "fullname": "Hao Ren", "url": "http://mlsys.org/api/miniconf/users/27335?format=json", "institution": "University of Illinois Urbana-Champaign"}, {"id": 27336, "fullname": "Yi Mu", "url": "http://mlsys.org/api/miniconf/users/27336?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 16274, "fullname": "Sasa Misailovic", "url": "http://mlsys.org/api/miniconf/users/16274?format=json", "institution": "UIUC"}], "abstract": "Many compound AI systems are inherently \u201capproximate\u201d because the ML components (e.g. 
a large language model) are probabilistic models and the non-ML components (e.g. retrieval-augmented generation) are heuristic. Such systems benefit from trading off result quality for improved performance. While extensive work exists on approximating ML and non-ML components individually, the wide deployment of LLMs in compound systems presents significant opportunities for end-to-end, accuracy-aware compilation. However, tailoring approximations across these different components is challenging to implement. This difficulty comes from their reliance on different software stacks for compilation and execution, as well as deployment on different hardware. To address these issues, we present ApproxMLIR, a reusable accuracy-aware compilation toolchain. ApproxMLIR introduces the approx MLIR dialect that serves as a unified and centralized interface for defining approximations, and approx-opt, a reusable MLIR-based optimizer that applies approximate transformations on ML and non-ML components. We evaluate ApproxMLIR on three compound AI systems, which combine LLMs with information retrieval tasks and tool calling. 
The evaluation shows that ApproxMLIR can effectively represent many common approximation choices, discover profitable points in the accuracy-performance space, and consistently achieve higher speedups compared to static approximation strategies.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3534", "url": null, "sourceid": 38, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=nKm25GWbuB", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 828, "modified": "2026-03-23T21:52:44.161765-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=nKm25GWbuB", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3513, "uid": "1c383cd30b7c298ab50293adfecb7b18", "name": "GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving", "authors": [{"id": 25189, "fullname": "Shakya Jayakody", "url": "http://mlsys.org/api/miniconf/users/25189?format=json", "institution": "University of Central Florida"}, {"id": 17878, "fullname": "Youpeng Zhao", "url": "http://mlsys.org/api/miniconf/users/17878?format=json", "institution": "University of Central Florida"}, {"id": 27193, "fullname": "Chinmay Dhanraj Nehate", "url": "http://mlsys.org/api/miniconf/users/27193?format=json", "institution": "University of Central Florida"}, {"id": 26288, "fullname": "Jun Wang", "url": 
"http://mlsys.org/api/miniconf/users/26288?format=json", "institution": "University of Central Florida"}], "abstract": "The rise of million-token, agent-based applications has placed unprecedented demands on large language model (LLM) inference services.  The long-running nature of these tasks increases their susceptibility to hardware and software faults, leading to costly job failures, wasted resources, and degraded user experience. The stateful key-value (KV) cache, which grows with the sequence length, presents a central challenge as it is a critical and vulnerable component in distributed serving systems. In this work, we propose \\textbf{GhostServe}, a novel checkpointing solution to facilitate fault-tolerant LLM serving. Specifically, GhostServe protects the streaming KV cache \\textit{in the shadow} by applying erasure coding to generate and store the parity shards in host memory. In the event of device failures, GhostServe enables fast reconstruction of the lost KV cache, allowing the inference process to resume seamlessly without costly full recomputation or state replication. 
Evaluations demonstrate that GhostServe reduces checkpointing latency by up to 2.7$\\times$ and recovery latency by 2.1$\\times$ over existing methods, paving the way for reliable and high-availability LLM serving at scale.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3513", "url": null, "sourceid": 35, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=xKjYiUgeOK", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 807, "modified": "2026-03-23T21:52:43.414974-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=xKjYiUgeOK", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3588, "uid": "d09bf41544a3365a46c9077ebb5e35c3", "name": "G-HEMP: FAST MULTI-GPU PRIVATE INFERENCE FOR LARGE-SCALE GCNS WITH HOMOMORPHIC ENCRYPTION", "authors": [{"id": 27655, "fullname": "Ran Ran", "url": "http://mlsys.org/api/miniconf/users/27655?format=json", "institution": "North Carolina State University"}, {"id": 27656, "fullname": "Zhaoting Gong", "url": "http://mlsys.org/api/miniconf/users/27656?format=json", "institution": "North Carolina State University"}, {"id": 27657, "fullname": "Zhaowei Li", "url": "http://mlsys.org/api/miniconf/users/27657?format=json", "institution": "North Carolina State University"}, {"id": 27658, "fullname": "Xianting Lu", "url": "http://mlsys.org/api/miniconf/users/27658?format=json", 
"institution": "North Carolina State University"}, {"id": 27659, "fullname": "Jiajia Li", "url": "http://mlsys.org/api/miniconf/users/27659?format=json", "institution": "North Carolina State University"}, {"id": 27660, "fullname": "Wujie Wen", "url": "http://mlsys.org/api/miniconf/users/27660?format=json", "institution": "North Carolina State University"}], "abstract": "Homomorphic Encryption (HE) offers a promising solution for privacy-preserving Graph Convolutional Networks (GCN) inference in untrusted cloud environments by enabling computation directly on encrypted data. This capability is particularly valuable in applications such as recommendation systems, financial analysis, and bioinformatics, where the data is subject to strict privacy requirements. However, applying HE to large-scale GCN inference introduces substantial computational and memory overhead, which significantly limits scalability and runtime performance. Although prior works have demonstrated promising results with CPU-based implementations, these approaches remain constrained in terms of throughput and scalability due to redundant HE operations and high memory demands. In this work, we present G-HEMP, the first framework that leverages the power of multi-GPU systems to accelerate large-scale private GCN inference. G-HEMP introduces two key innovations: (i) a block-diagonal parallel packing technique that eliminates redundant data replication for encrypted adjacency matrices, achieving up to 4.41\u00d7 latency speedup over traditional feature-wise packing; and (ii) a multi-GPU workload partitioning strategy that reduces peak memory usage by 50% and improves inference latency by up to 1.98\u00d7. By combining these techniques, the number of HE operations is significantly reduced, and the encrypted computation can be partitioned and efficiently distributed across multiple GPUs to maximize throughput and hardware utilization. 
Our G-HEMP framework is model-agnostic and scales seamlessly with large GCN inference tasks. Together, these contributions enable scalable and efficient privacy-preserving GCN inference, advancing the practicality of HE-based GCN analytics on modern heterogeneous hardware.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3588", "url": null, "sourceid": 75, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=RSTrFSPIMy", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 882, "modified": "2026-03-23T21:52:46.377970-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=RSTrFSPIMy", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3610, "uid": "a97da629b098b75c294dffdc3e463904", "name": "BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching", "authors": [{"id": 27257, "fullname": "Zhen Zheng", "url": "http://mlsys.org/api/miniconf/users/27257?format=json", "institution": "Microsoft"}, {"id": 27793, "fullname": "Xin Ji", "url": "http://mlsys.org/api/miniconf/users/27793?format=json", "institution": "Microsoft"}, {"id": 25922, "fullname": "Taosong Fang", "url": "http://mlsys.org/api/miniconf/users/25922?format=json", "institution": "Institute of Software Chinese Academy of Sciences"}, {"id": 27794, "fullname": "Fanghao Zhou", "url": 
"http://mlsys.org/api/miniconf/users/27794?format=json", "institution": "Microsoft Corp."}, {"id": 27642, "fullname": "Chuanjie Liu", "url": "http://mlsys.org/api/miniconf/users/27642?format=json", "institution": "Microsoft"}, {"id": 27795, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27795?format=json", "institution": null}], "abstract": "Large language models (LLMs) increasingly play an important role in a wide range of information processing and management tasks in industry. Many of these tasks are performed in large batches or even offline, for which the key performance indicator is throughput. These tasks typically exhibit prefix sharing, where different prompt inputs partially share a common prefix. However, existing LLM inference engines tend to optimize for streaming requests and are limited in supporting large batched tasks with the prefix sharing characteristic. Existing solutions use an LRU-based cache to reuse the KV context of common prefixes between requests, but KV context that is about to be reused may be prematurely evicted under this implicit cache management. Besides, streaming-oriented systems do not leverage request-batch information and cannot optimally mix decoding tokens with prefill chunks in batched scenarios, and thus fail to saturate the GPU. We propose BatchLLM to address the above problems. BatchLLM explicitly identifies common prefixes globally; requests sharing the same prefix are scheduled together to maximize reuse of the KV context. BatchLLM reorders the requests, scheduling those with a larger ratio of decoding first to better mix decoding tokens with subsequent prefill chunks, and applies memory-centric token batching to enlarge the token-batch sizes, which helps to increase GPU utilization. 
Extensive evaluation shows that BatchLLM outperforms vLLM and SGLang by $1.3\\times$ to $10.8\\times$ on a set of microbenchmarks and a typical industry workload under different hardware environments.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3610", "url": null, "sourceid": 107, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=IuVHde07l6", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 904, "modified": "2026-03-23T21:52:47.187677-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=IuVHde07l6", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3535, "uid": "ad61ab143223efbc24c7d2583be69251", "name": "SAKURAONE: An Open Ethernet\u2013Based AI HPC System and Its Observed Workload Dynamics in a Single-Tenant LLM Development Environment", "authors": [{"id": 27145, "fullname": "Fumikazu KONISHI", "url": "http://mlsys.org/api/miniconf/users/27145?format=json", "institution": "SAKURA internet inc."}, {"id": 26137, "fullname": "Yuuki Tsubouchi", "url": "http://mlsys.org/api/miniconf/users/26137?format=json", "institution": "SAKURA internet Inc."}, {"id": 27337, "fullname": "Hirofumi Tsuruta", "url": "http://mlsys.org/api/miniconf/users/27337?format=json", "institution": "SAKURA internet Inc."}], "abstract": "SAKURAONE is a managed high performance computing (HPC) cluster developed and 
operated by the SAKURA Internet Research Center. It builds on the \\emph{KOKARYOKU PHY} bare metal GPU platform and is optimized for advanced workloads, including large language model (LLM) training.   In ISC 2025 TOP500, SAKURAONE is ranked \\textbf{49th} by HPL and is the only top 100 system that uses a fully open networking stack\u2014\\textbf{800~GbE} with \\textbf{SONiC}\u2014demonstrating the scalability of vendor-neutral technology.   Measured performance is 33.95~PFLOP/s (HPL~Rmax), 396.295~TFLOP/s (HPCG), and 339.86~PFLOP/s on HPL-MxP with FP8. The system consists of 100 nodes, each with eight NVIDIA H100 GPUs and a 2~PB all-flash Lustre file system, interconnected via a rail-optimized 800~GbE leaf\u2013spine fabric with RoCEv2.   Through exclusive use by a single research project, we observed the characteristics of development-related jobs. Consistent with previous HPC studies, small-scale jobs dominated in number, while a few large-scale jobs accounted for most GPU resource time. As the project progressed, resource use shifted from large-scale to mid-scale jobs, reflecting a transition from initial large-scale training to iterative refinement. 
These observations illustrate the real-world utilization dynamics of GPU clusters under unified project workloads.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3535", "url": null, "sourceid": 74, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=n7o6C3p3wk", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 829, "modified": "2026-03-23T21:52:44.206854-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=n7o6C3p3wk", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3609, "uid": "c8ffe9a587b126f152ed3d89a146b445", "name": "LLMInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems", "authors": [{"id": 17934, "fullname": "Shanli Xing", "url": "http://mlsys.org/api/miniconf/users/17934?format=json", "institution": "University of Washington"}, {"id": 27789, "fullname": "Vivian Zhai", "url": "http://mlsys.org/api/miniconf/users/27789?format=json", "institution": "Carnegie Mellon University"}, {"id": 25656, "fullname": "Alexander Jiang", "url": "http://mlsys.org/api/miniconf/users/25656?format=json", "institution": "Carnegie Mellon University"}, {"id": 23351, "fullname": "Yixin Dong", "url": "http://mlsys.org/api/miniconf/users/23351?format=json", "institution": "Carnegie Mellon University"}, {"id": 27790, "fullname": "Yong Wu", "url": 
"http://mlsys.org/api/miniconf/users/27790?format=json", "institution": "Nvidia"}, {"id": 12026, "fullname": "Zihao Ye", "url": "http://mlsys.org/api/miniconf/users/12026?format=json", "institution": "University of Washington"}, {"id": 25650, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/25650?format=json", "institution": null}, {"id": 27673, "fullname": "Yingyi Huang", "url": "http://mlsys.org/api/miniconf/users/27673?format=json", "institution": "Nvidia, CMU"}, {"id": 21035, "fullname": "Yineng Zhang", "url": "http://mlsys.org/api/miniconf/users/21035?format=json", "institution": "Baseten"}, {"id": 27791, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27791?format=json", "institution": null}, {"id": 27792, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27792?format=json", "institution": null}, {"id": 11020, "fullname": "Luis Ceze", "url": "http://mlsys.org/api/miniconf/users/11020?format=json", "institution": "University of Washington and NVIDIA"}, {"id": 11984, "fullname": "Tianqi Chen", "url": "http://mlsys.org/api/miniconf/users/11984?format=json", "institution": "CMU"}], "abstract": "Recent advances show that large language models (LLMs) can act as autonomous agents capable of generating GPU kernels, but integrating these AI-generated kernels into real-world inference systems remains challenging. LLMInfer-Bench addresses this gap by establishing a standardized, closed-loop framework that connects kernel generation, benchmarking, and deployment. At its core, LLMInfer Trace provides a unified schema describing kernel definitions, workloads, implementations, and evaluations, enabling consistent communication between agents and systems. 
Built on real serving traces, LLMInfer-Bench includes a curated dataset, a robust correctness- and performance-aware benchmarking framework, a public leaderboard to track LLM agents\u2019 GPU programming capabilities, and a dynamic substitution mechanism (apply()) that seamlessly injects the best-performing kernels into production LLM engines such as SGLang and vLLM. Using LLMInfer-Bench, we further evaluate the performance and limitations of LLM agents, compare the trade-offs among different GPU programming languages, and provide insights for future agent design. LLMInfer-Bench thus establishes a practical, reproducible pathway for continuously improving AI-generated kernels and deploying them safely into large-scale LLM inference systems.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3609", "url": null, "sourceid": 124, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=IyryZno8Hh", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 903, "modified": "2026-03-23T21:52:47.151524-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=IyryZno8Hh", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3606, "uid": "73278a4a86960eeb576a8fd4c9ec6997", "name": "Hawkeye: Reproducing GPU-Level Non-Determinism", "authors": [{"id": 27770, "fullname": "", "url": 
"http://mlsys.org/api/miniconf/users/27770?format=json", "institution": null}, {"id": 27771, "fullname": "Dan Boneh", "url": "http://mlsys.org/api/miniconf/users/27771?format=json", "institution": "Stanford University"}, {"id": 27772, "fullname": "Ilan Komargodski", "url": "http://mlsys.org/api/miniconf/users/27772?format=json", "institution": null}, {"id": 27773, "fullname": "Megha Srivastava", "url": "http://mlsys.org/api/miniconf/users/27773?format=json", "institution": "Stanford University"}], "abstract": "We present Hawkeye, a system for analyzing and reproducing GPU-level arithmetic operations on CPUs. Using our framework, an auditor can re-execute a full model training or inference workflow executed on NVIDIA GPUs on a CPU, without any precision loss and without introducing any additional operations or slowdown on the GPU side. This is in stark contrast to prior approaches to verifiable machine learning that introduced significant computational overhead for the model provider. The main technical contribution underlying Hawkeye is a systematic algorithmic framework for the numerical treatment within NVIDIA's Tensor Cores: rounding, subnormal number handling, and the order of (non-associative) accumulation during matrix multiplication. Our framework consists of a sequence of carefully crafted tests that reduce the (otherwise exponential-size) search space of potential options for each operation. We test and evaluate our framework on a variety of GPU architectures (including Ampere and Hopper), as well as all available precision types (FP16, BF16). 
In all test cases, our framework recovers the exact implementation of operations underlying matrix multiplication, and therefore allows for the full reproduction of model training and inference workflows on a CPU.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3606", "url": null, "sourceid": 113, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=JnmgsTFQQv", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 900, "modified": "2026-03-23T21:52:47.044074-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=JnmgsTFQQv", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3566, "uid": "65ded5353c5ee48d0b7d48c591b8f430", "name": "PRISM: PARAMETRICALLY RESTRUCTURED INFERENCE FOR SPECULATIVE SAMPLING DRAFT MODELS", "authors": [{"id": 25914, "fullname": "Xuliang Wang", "url": "http://mlsys.org/api/miniconf/users/25914?format=json", "institution": "University of Waterloo"}, {"id": 27581, "fullname": "Yuetao Chen", "url": "http://mlsys.org/api/miniconf/users/27581?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 27582, "fullname": "Maochan Zhen", "url": "http://mlsys.org/api/miniconf/users/27582?format=json", "institution": "Central China Institute of Artificial Intelligence"}, {"id": 27583, "fullname": "Fang LIU", "url": "http://mlsys.org/api/miniconf/users/27583?format=json", 
"institution": "CIAI"}, {"id": 27584, "fullname": "Xinzhou Zheng", "url": "http://mlsys.org/api/miniconf/users/27584?format=json", "institution": "University of Science and Technology of China"}, {"id": 27585, "fullname": "Xingwu Liu", "url": "http://mlsys.org/api/miniconf/users/27585?format=json", "institution": "Dalian University of Technology"}, {"id": 27586, "fullname": "Hong Xu", "url": "http://mlsys.org/api/miniconf/users/27586?format=json", "institution": "The Chinese University of Hong Kong"}, {"id": 27587, "fullname": "Ming Li", "url": "http://mlsys.org/api/miniconf/users/27587?format=json", "institution": "University of Waterloo"}], "abstract": "Large Language Models (LLMs), constrained by their auto-regressive nature, have long suffered from expensive and slow decoding. Speculative sampling methods, capable of alleviating the memory bandwidth bottleneck, have attracted attention from both the system and AI research communities. The demand for high predictive performance has created a growing trend of training parametrically larger and more powerful draft models, which also introduces growing computation overhead. While existing works balance trade-offs to find a sweet spot, in this paper we dive further into this effectiveness-and-efficiency dilemma, addressing the issue with architectural innovation. By disaggregating the computation of each predictive step across different parameter sets, we restructure the computational paths of the draft models, successfully decoupling representation capacity from inference cost and making the model scalable and fast at the same time. We conduct extensive experiments showing that our PRISM drafter outperforms SoTA draft architectures on acceptance length and end-to-end throughput when trained with the same dataset. We also show that PRISM scales exceptionally well on large datasets while some other architectures fail. 
On average, PRISM speculative decoding can achieve more than 2.6x end-to-end speedup when integrated with an already highly optimized inference engine.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3566", "url": null, "sourceid": 132, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=cvU2HuuxEf", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 860, "modified": "2026-03-23T21:52:45.514870-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=cvU2HuuxEf", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3622, "uid": "3295c76acbf4caaed33c36b1b5fc2cb1", "name": "ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels", "authors": [{"id": 27842, "fullname": "Stuart H. 
Sul", "url": "http://mlsys.org/api/miniconf/users/27842?format=json", "institution": "Stanford University"}, {"id": 27192, "fullname": "Simran Arora", "url": "http://mlsys.org/api/miniconf/users/27192?format=json", "institution": "Computer Science Department, Stanford University"}, {"id": 27843, "fullname": "Benjamin Spector", "url": "http://mlsys.org/api/miniconf/users/27843?format=json", "institution": "Stanford University"}, {"id": 11444, "fullname": "Christopher R\u00e9", "url": "http://mlsys.org/api/miniconf/users/11444?format=json", "institution": "Stanford University"}], "abstract": "Inter-GPU communication has become a major bottleneck for modern AI workloads as models scale and improvements in hardware compute throughput outpace improvements in interconnect bandwidth. Existing systems mitigate this through compute-communication overlap but often fail to meet theoretical peak performance across heterogeneous workloads and new accelerators. Instead of operator-specific techniques, we ask whether a small set of simple, reusable principles can systematically guide the design of optimal multi-GPU kernels. We present ParallelKittens (PK), a minimal CUDA framework that drastically simplifies the development of overlapped multi-GPU kernels. PK extends the ThunderKittens framework and embodies the principles of multi-GPU kernel design through eight core primitives and a unified programming template, derived from a comprehensive analysis of the factors that govern multi-GPU performance\u2014data-transfer mechanisms, resource scheduling, and design overheads. 
With fewer than 50 lines of device code, PK achieves up to $2.33\\times$ speedup for data- and tensor-parallel workloads, $4.08\\times$ for sequence-parallel workloads, and $1.22\\times$ for expert-parallel workloads.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3622", "url": null, "sourceid": 66, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=Cv5e5uRXFb", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 916, "modified": "2026-03-23T21:52:47.696063-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=Cv5e5uRXFb", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3635, "uid": "28dd2c7955ce926456240b2ff0100bde", "name": "AXLearn: Modular, Hardware-Agnostic Large Model Training", "authors": [{"id": 27168, "fullname": "Mark Lee", "url": "http://mlsys.org/api/miniconf/users/27168?format=json", "institution": "Meta"}, {"id": 27918, "fullname": "Tom Gunter", "url": "http://mlsys.org/api/miniconf/users/27918?format=json", "institution": "Apple"}, {"id": 27919, "fullname": "Chang Lan", "url": "http://mlsys.org/api/miniconf/users/27919?format=json", "institution": "Apple"}, {"id": 27920, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27920?format=json", "institution": null}, {"id": 16539, "fullname": "Hanzhi Zhou", "url": "http://mlsys.org/api/miniconf/users/16539?format=json", 
"institution": "ByteDance"}, {"id": 27921, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27921?format=json", "institution": null}, {"id": 27922, "fullname": "Sneha Bangalore", "url": "http://mlsys.org/api/miniconf/users/27922?format=json", "institution": ""}, {"id": 27923, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27923?format=json", "institution": null}, {"id": 27924, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27924?format=json", "institution": null}, {"id": 27925, "fullname": "Xianzhi Du", "url": "http://mlsys.org/api/miniconf/users/27925?format=json", "institution": "Apple"}, {"id": 27926, "fullname": "Philipp Dufter", "url": "http://mlsys.org/api/miniconf/users/27926?format=json", "institution": "Apple"}, {"id": 27927, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27927?format=json", "institution": null}, {"id": 27928, "fullname": "Ruixuan Hou", "url": "http://mlsys.org/api/miniconf/users/27928?format=json", "institution": "Apple Inc"}, {"id": 27929, "fullname": "Haoshuo Huang", "url": "http://mlsys.org/api/miniconf/users/27929?format=json", "institution": "Apple"}, {"id": 27930, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27930?format=json", "institution": null}, {"id": 27931, "fullname": "Xiang Kong", "url": "http://mlsys.org/api/miniconf/users/27931?format=json", "institution": "Apple"}, {"id": 27932, "fullname": "Jinhao Lei", "url": "http://mlsys.org/api/miniconf/users/27932?format=json", "institution": ", Columbia University"}, {"id": 27933, "fullname": "Tao Lei", "url": "http://mlsys.org/api/miniconf/users/27933?format=json", "institution": "Apple"}, {"id": 27934, "fullname": "Meng Li", "url": "http://mlsys.org/api/miniconf/users/27934?format=json", "institution": "Apple"}, {"id": 15067, "fullname": "Li Li", "url": "http://mlsys.org/api/miniconf/users/15067?format=json", "institution": "Apple"}, {"id": 27935, "fullname": "Jiarui Lu", "url": 
"http://mlsys.org/api/miniconf/users/27935?format=json", "institution": "Apple"}, {"id": 27936, "fullname": "Zhiyun Lu", "url": "http://mlsys.org/api/miniconf/users/27936?format=json", "institution": "Apple"}, {"id": 27937, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27937?format=json", "institution": null}, {"id": 27938, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27938?format=json", "institution": null}, {"id": 27939, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27939?format=json", "institution": null}, {"id": 27940, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27940?format=json", "institution": null}, {"id": 27941, "fullname": "Zhucheng Tu", "url": "http://mlsys.org/api/miniconf/users/27941?format=json", "institution": "Apple"}, {"id": 27942, "fullname": "Chong Wang", "url": "http://mlsys.org/api/miniconf/users/27942?format=json", "institution": "Meta"}, {"id": 27943, "fullname": "Jianyu Wang", "url": "http://mlsys.org/api/miniconf/users/27943?format=json", "institution": "Apple"}, {"id": 27944, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27944?format=json", "institution": null}, {"id": 27945, "fullname": "Zirui Wang", "url": "http://mlsys.org/api/miniconf/users/27945?format=json", "institution": "Google Deepmind"}, {"id": 27946, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27946?format=json", "institution": null}, {"id": 27947, "fullname": "Sam Wiseman", "url": "http://mlsys.org/api/miniconf/users/27947?format=json", "institution": "Apple"}, {"id": 27948, "fullname": "Guoli Yin", "url": "http://mlsys.org/api/miniconf/users/27948?format=json", "institution": "Apple"}, {"id": 27949, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27949?format=json", "institution": null}, {"id": 12530, "fullname": "Xiyou Zhou", "url": "http://mlsys.org/api/miniconf/users/12530?format=json", "institution": "OctoML"}, {"id": 17608, "fullname": "Danyang Zhuo", "url": 
"http://mlsys.org/api/miniconf/users/17608?format=json", "institution": "Duke University"}, {"id": 27950, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27950?format=json", "institution": null}, {"id": 27951, "fullname": "Ruoming Pang", "url": "http://mlsys.org/api/miniconf/users/27951?format=json", "institution": "OpenAI"}], "abstract": "AXLearn is a production system which facilitates scalable and high-performance training of large deep learning models. Compared to other state-of-art deep learning systems, AXLearn has a unique focus on modularity and support for hardware-agnostic training. AXLearn's internal interfaces between software components follow strict encapsulation, allowing different components to be assembled to facilitate rapid model development and experimentation on different hardware infrastructure. AXLearn maintains constant complexity as we scale the components in the system, compared to linear or quadratic complexity in state-of-the-art training systems. This allows integrating features such as Rotary Position Embeddings (RoPE) into AXLearn across hundred of modules with just 10 lines of code, compared to hundreds as required in other systems. At the same time, AXLearn maintains equivalent performance compared to state-of-the-art training systems. 
Finally, we share our experience in the development and operation of AXLearn at Apple.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3635", "url": null, "sourceid": 77, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=41x11EB3bc", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 929, "modified": "2026-03-23T21:52:48.241206-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=41x11EB3bc", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3512, "uid": "2a38a4a9316c49e5a833517c45d31070", "name": "HipKittens: Fast and Furious AMD Kernels", "authors": [{"id": 27186, "fullname": "William Hu", "url": "http://mlsys.org/api/miniconf/users/27186?format=json", "institution": "Stanford University"}, {"id": 25851, "fullname": "Drew Wadsworth", "url": "http://mlsys.org/api/miniconf/users/25851?format=json", "institution": ""}, {"id": 27187, "fullname": "Sean Siddens", "url": "http://mlsys.org/api/miniconf/users/27187?format=json", "institution": "AMD"}, {"id": 27188, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27188?format=json", "institution": null}, {"id": 27189, "fullname": "Daniel Fu", "url": "http://mlsys.org/api/miniconf/users/27189?format=json", "institution": "University of California, San Diego"}, {"id": 27190, "fullname": "", "url": 
"http://mlsys.org/api/miniconf/users/27190?format=json", "institution": null}, {"id": 27191, "fullname": "Muhammad Osama", "url": "http://mlsys.org/api/miniconf/users/27191?format=json", "institution": "AMD"}, {"id": 11444, "fullname": "Christopher R\u00e9", "url": "http://mlsys.org/api/miniconf/users/11444?format=json", "institution": "Stanford University"}, {"id": 27192, "fullname": "Simran Arora", "url": "http://mlsys.org/api/miniconf/users/27192?format=json", "institution": "Computer Science Department, Stanford University"}], "abstract": "AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak performance AMD kernels are written in raw assembly. To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++ embedded and PyTorch-inspired domain-specific languages like ThunderKittens (TK) to simplify high performance AI kernel development on NVIDIA hardware. We explore the extent to which such primitives \u2014 for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers \u2014 are NVIDIA-specific or general. We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that tile-based abstractions used in prior DSLs generalize to AMD GPUs, however we need to rethink the algorithms that instantiate these abstractions for AMD. We validate the HK primitives across CDNA3 and CDNA4 AMD platforms. In evaluations, HK kernels compete with AMD\u2019s hand-optimized assembly kernels for GEMMs and attention, and consistently outperform compiler baselines. Moreover, assembly is difficult to scale to the breadth of AI workloads; reflecting this, in some settings HK outperforms all available baselines by $1.2 \u2212 2.4\\times$ ($d = 64$ attention, GQA non-causal backwards, memory-bound kernels). 
These findings help pave the way for a single, tile-based software layer for high-performance AI kernels across GPU vendors.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3512", "url": null, "sourceid": 88, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=xxSSrndQrI", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 806, "modified": "2026-03-23T21:52:43.387482-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=xxSSrndQrI", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3528, "uid": "67c6a1e7ce56d3d6fa748ab6d9af3fd7", "name": "TiDAR: Think in Diffusion, Talk in Autoregression", "authors": [{"id": 27286, "fullname": "Jingyu Liu", "url": "http://mlsys.org/api/miniconf/users/27286?format=json", "institution": "University of Chicago"}, {"id": 27287, "fullname": "Xin Dong", "url": "http://mlsys.org/api/miniconf/users/27287?format=json", "institution": "NVIDIA"}, {"id": 27288, "fullname": "Zhifan Ye", "url": "http://mlsys.org/api/miniconf/users/27288?format=json", "institution": "Georgia Institute of Technology"}, {"id": 27289, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27289?format=json", "institution": null}, {"id": 27290, "fullname": "Yonggan Fu", "url": "http://mlsys.org/api/miniconf/users/27290?format=json", "institution": "NVIDIA"}, {"id": 27291, "fullname": "vartika 
singh", "url": "http://mlsys.org/api/miniconf/users/27291?format=json", "institution": "State University of New York, Buffalo"}, {"id": 18868, "fullname": "Ce Zhang", "url": "http://mlsys.org/api/miniconf/users/18868?format=json", "institution": null}, {"id": 27292, "fullname": "Pavlo Molchanov", "url": "http://mlsys.org/api/miniconf/users/27292?format=json", "institution": "NVIDIA Research"}], "abstract": "Diffusion language models hold the promise of fast parallel generation, while autoregressive (AR) models typically excel in quality due to their causal structure aligning naturally with language modeling. This raises a fundamental question: can we achieve a synergy with high throughput, higher GPU utilization, and AR level quality? Existing methods fail to effectively balance these two aspects, either prioritizing AR using a weaker model for sequential drafting (speculative decoding), leading to lower drafting efficiency, or using some form of left-to-right (AR-like) decoding logic for diffusion, which still suffers from quality degradation and forfeits its potential parallelizability.  We introduce TIDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively - all within a single forward pass using specially designed structured attention masks. This design exploits the free compute density on GPUs, achieving a strong balance between drafting and verification capacity. Moreover, we design TIDAR to be serving-friendly as a standalone model.   We extensively evaluate TIDAR against AR models, speculative decoding, and diffusion variants across generative and likelihood tasks at both 1.5B and 8B scales. Thanks to parallel drafting and sampling as well as efficient exact KV cache support, TIDAR outperforms speculative decoding in measured throughput and surpasses diffusion models like Dream and Llada in both efficiency and quality. 
Most notably, TIDAR is the first architecture to close the quality gap with AR models while delivering 4.71\u00d7 to 5.91\u00d7 more tokens per second.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3528", "url": null, "sourceid": 47, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=onfxEjoE4L", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 822, "modified": "2026-03-23T21:52:43.945409-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=onfxEjoE4L", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3575, "uid": "d3d9446802a44259755d38e6d163e820", "name": "db-SP: Accelerating Sparse Attention for Visual Generative Models with Dual-Balanced Sequence Parallelism", "authors": [{"id": 25683, "fullname": "Siqi Chen", "url": "http://mlsys.org/api/miniconf/users/25683?format=json", "institution": "Tsinghua University"}, {"id": 15899, "fullname": "Ke Hong", "url": "http://mlsys.org/api/miniconf/users/15899?format=json", "institution": "Tsinghua University"}, {"id": 27616, "fullname": "Tianchen Zhao", "url": "http://mlsys.org/api/miniconf/users/27616?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 27617, "fullname": "Ruiqi Xie", "url": "http://mlsys.org/api/miniconf/users/27617?format=json", "institution": "Tsinghua University"}, {"id": 27618, "fullname": "Zhenhua 
Zhu", "url": "http://mlsys.org/api/miniconf/users/27618?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 27619, "fullname": "Xudong Zhang", "url": "http://mlsys.org/api/miniconf/users/27619?format=json", "institution": "Tsinghua University, Tsinghua University"}, {"id": 17647, "fullname": "Yu Wang", "url": "http://mlsys.org/api/miniconf/users/17647?format=json", "institution": "Tsinghua University, Tsinghua University"}], "abstract": "Scaling Diffusion Transformer (DiT) inference via sequence parallelism is critical for reducing latency in visual generation, but is severely hampered by workload imbalance when applied to models employing block-wise sparse attention. The imbalance stems from the inherent variation in sparsity across attention heads and the irregular distribution of dense blocks within the sparse mask, when sequence parallelism is applied along the head dimension (as in Ulysses) or the block dimension (as in Ring Attention). In this paper, we formalize a \\textit{sparse imbalance ratio} to quantify the imbalance, and propose \\textit{db}-SP, a sparsity-aware sequence parallelism technique that tackles the challenge. \\textit{db}-SP contains a dual-level partitioning approach that achieves near-perfect workload balance at both the head and block levels with negligible overhead. Furthermore, to handle the evolving sparsity patterns across denoising steps and layers, \\textit{db}-SP dynamically determines the parallel degrees for the head and block dimensions at runtime. 
Experimental results demonstrate that \textit{db}-SP delivers an end-to-end speedup of 1.25\u00d7 and an attention-specific speedup of 1.40\u00d7 over state-of-the-art sequence parallel methods on average.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3575", "url": null, "sourceid": 10, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=XgKteNxNe0", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 869, "modified": "2026-03-23T21:52:45.881732-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=XgKteNxNe0", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3549, "uid": "182be0c5cdcd5072bb1864cdee4d3d6e", "name": "When Machine Learning Isn\u2019t Sure: Building Resilient ML-Based Computer Systems by Embracing Uncertainty", "authors": [{"id": 27421, "fullname": "Varun Gohil", "url": "http://mlsys.org/api/miniconf/users/27421?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 27422, "fullname": "Nevena Stojkovic", "url": "http://mlsys.org/api/miniconf/users/27422?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 27423, "fullname": "Noman Bashir", "url": "http://mlsys.org/api/miniconf/users/27423?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 27424, "fullname": "", "url": 
"http://mlsys.org/api/miniconf/users/27424?format=json", "institution": null}, {"id": 27425, "fullname": "Gaurang Upasani", "url": "http://mlsys.org/api/miniconf/users/27425?format=json", "institution": "Google"}, {"id": 27426, "fullname": "David Lo", "url": "http://mlsys.org/api/miniconf/users/27426?format=json", "institution": "Google"}, {"id": 27427, "fullname": "Parthasarathy Ranganathan", "url": "http://mlsys.org/api/miniconf/users/27427?format=json", "institution": "Google"}, {"id": 20861, "fullname": "Christina Delimitrou", "url": "http://mlsys.org/api/miniconf/users/20861?format=json", "institution": "Cornell University"}], "abstract": "Machine learning (ML) models are increasingly used in computer systems but often suffer from poor generalizability, leading to costly failures on out-of-distribution (OOD) data. We propose an uncertainty-aware framework that improves system resilience by quantifying prediction uncertainty at runtime and rejecting unreliable outputs before they cause harm. When a prediction is uncertain, the system gracefully degrades to a safe fallback strategy. We evaluate the framework across three case studies, server provisioning, cluster management, and storage I/O admission, and find that the best uncertainty estimator is not universal but depends on how its properties align with each task\u2019s design and resource constraints. Similarly, the optimal fallback workflow (e.g., a lightweight and parallel vs. resource-intensive and sequential ) depends on task\u2019s runtime latency constraints. 
Together, these findings offer a practical path towards building more reliable ML-driven computer systems.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3549", "url": null, "sourceid": 33, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=i0iOQL2MF5", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 843, "modified": "2026-03-23T21:52:44.765164-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=i0iOQL2MF5", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3533, "uid": "6974ce5ac660610b44d9b9fed0ff9548", "name": "TriInfer: Hybrid EPD Disaggregation for Efficient Multimodal Large Language Model Inference", "authors": [{"id": 27326, "fullname": "Xianzhe Dong", "url": "http://mlsys.org/api/miniconf/users/27326?format=json", "institution": "University of Science and Technology of China"}, {"id": 27327, "fullname": "Tongxuan Liu", "url": "http://mlsys.org/api/miniconf/users/27327?format=json", "institution": "JD.com"}, {"id": 27328, "fullname": "Yuting Zeng", "url": "http://mlsys.org/api/miniconf/users/27328?format=json", "institution": "University of Science and Technology of China"}, {"id": 27144, "fullname": "Weizhe Huang", "url": "http://mlsys.org/api/miniconf/users/27144?format=json", "institution": "University of Science and Technology of China"}, {"id": 25671, "fullname": "", "url": 
"http://mlsys.org/api/miniconf/users/25671?format=json", "institution": null}, {"id": 27329, "fullname": "Siyu Wu", "url": "http://mlsys.org/api/miniconf/users/27329?format=json", "institution": "Beihang University"}, {"id": 27330, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27330?format=json", "institution": null}, {"id": 27331, "fullname": "Liu Yang", "url": "http://mlsys.org/api/miniconf/users/27331?format=json", "institution": ""}, {"id": 26165, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/26165?format=json", "institution": null}, {"id": 27332, "fullname": "Hailong Yang", "url": "http://mlsys.org/api/miniconf/users/27332?format=json", "institution": null}, {"id": 27333, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27333?format=json", "institution": null}, {"id": 27334, "fullname": "Jing Li", "url": "http://mlsys.org/api/miniconf/users/27334?format=json", "institution": "University of Science and Technology of China"}], "abstract": "Existing MLLM inference systems are typically designed based on the architecture of language models, coupling image processing and language processing. This design struggles to accommodate the heterogeneous demands of different stages in terms of computational resources, memory access patterns, and service-level objectives (SLOs), leading to low resource utilization and high request latency, ultimately failing to meet the service requirements of diverse inference scenarios. To address these challenges, we propose TriInfer, an efficient MLLM inference system that adopts a Hybrid Encode-Prefill-Decode (EPD) Disaggregation architecture. By scheduling the three stages \u2014 encode, prefill, and decode \u2014 onto separate heterogeneous inference instances, the system flexibly reallocates resources across stages, significantly reducing idle computation, alleviating resource bottlenecks, and improving overall system throughput and scalability. 
In addition, TriInfer supports a stage-level batching strategy that enhances load balancing, enables parallel execution of visual and language models, and further optimizes inference performance. Experiments under real multimodal inference workloads demonstrate that TriInfer can achieve up to 3.7\u00d7 higher inference throughput compared to state-of-the-art systems (e.g., vLLM, SGLang) while meeting the 90th percentile request SLO.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3533", "url": null, "sourceid": 103, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=nNovi8fvGN", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 827, "modified": "2026-03-23T21:52:44.125266-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=nNovi8fvGN", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3569, "uid": "9bf31c7ff062936a96d3c8bd1f8f2ff3", "name": "EarthSight: A Distributed Framework for Low-Latency Satellite Intelligence", "authors": [{"id": 27599, "fullname": "Ansel Erol", "url": "http://mlsys.org/api/miniconf/users/27599?format=json", "institution": "Georgia Institute of Technology"}, {"id": 27600, "fullname": "Seungjun Lee", "url": "http://mlsys.org/api/miniconf/users/27600?format=json", "institution": "Korea Advanced Institute of Science &amp; Technology"}, {"id": 27358, "fullname": "Divya 
Mahajan", "url": "http://mlsys.org/api/miniconf/users/27358?format=json", "institution": "Georgia Institute of Technology"}], "abstract": "Low-latency delivery of satellite imagery is essential for time-critical applications such as disaster response, intelligence, and infrastructure monitoring. However, traditional pipelines rely on downlinking all captured images before analysis, introducing delays of hours to days due to restricted communication bandwidth. To address these bottlenecks, emerging systems perform onboard machine learning to prioritize which images to transmit. However, these solutions typically treat each satellite as an isolated compute node, limiting scalability and efficiency. Redundant inference across satellites and tasks further strains onboard power and compute costs, constraining mission scope and responsiveness. We present EarthSight, a distributed runtime framework that redefines satellite image intelligence as a \\emph{distributed decision problem} between orbit and ground. EarthSight introduces three core innovations: (1) \\emph{multi-task inference} on satellites using shared backbones to amortize computation across multiple vision tasks; (2) a \\emph{ground-station query scheduler} that aggregates user requests, predicts priorities, and assigns compute budgets to incoming imagery; and (3) \\emph{dynamic filter ordering}, which integrates model selectivity, accuracy, and execution cost to reject low-value images early and conserve resources. EarthSight leverages global context from ground stations and resource-aware adaptive decisions in orbit to enable constellations to perform scalable, low-latency image analysis within strict downlink bandwidth and onboard power budgets. 
Evaluations using a previously established satellite simulator show that EarthSight reduces average compute time per image by 1.9$\\times$ and lowers 90th percentile end-to-end latency from first contact to delivery from 51 to 21 minutes compared to the state-of-the-art baseline.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3569", "url": null, "sourceid": 15, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=c3O6DnhUYm", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 863, "modified": "2026-03-23T21:52:45.635713-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=c3O6DnhUYm", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3631, "uid": "d82c8d1619ad8176d665453cfb2e55f0", "name": "BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding", "authors": [{"id": 27887, "fullname": "Jiayi Yuan", "url": "http://mlsys.org/api/miniconf/users/27887?format=json", "institution": "Rice University"}, {"id": 15623, "fullname": "Cameron Shinn", "url": "http://mlsys.org/api/miniconf/users/15623?format=json", "institution": "UC Davis"}, {"id": 27888, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27888?format=json", "institution": null}, {"id": 27889, "fullname": "Jingze Cui", "url": "http://mlsys.org/api/miniconf/users/27889?format=json", "institution": "Shanghai Jiaotong 
University"}, {"id": 27890, "fullname": "George Klimiashvili", "url": "http://mlsys.org/api/miniconf/users/27890?format=json", "institution": "NVIDIA"}, {"id": 27891, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27891?format=json", "institution": null}, {"id": 27892, "fullname": "Perkz Zheng", "url": "http://mlsys.org/api/miniconf/users/27892?format=json", "institution": "NVIDIA"}, {"id": 27893, "fullname": "Bo Li", "url": "http://mlsys.org/api/miniconf/users/27893?format=json", "institution": "NVIDIA"}, {"id": 27894, "fullname": "Zhou Yuxin", "url": "http://mlsys.org/api/miniconf/users/27894?format=json", "institution": "NVIDIA"}, {"id": 27895, "fullname": "Zhouhai Ye", "url": "http://mlsys.org/api/miniconf/users/27895?format=json", "institution": "NVIDIA"}, {"id": 27896, "fullname": "Weijie You", "url": "http://mlsys.org/api/miniconf/users/27896?format=json", "institution": "NVIDIA"}, {"id": 27897, "fullname": "Richard Cai", "url": "http://mlsys.org/api/miniconf/users/27897?format=json", "institution": "NVIDIA"}, {"id": 27898, "fullname": "Julien Demouth", "url": "http://mlsys.org/api/miniconf/users/27898?format=json", "institution": "University of Lorraine"}, {"id": 27899, "fullname": "John D. 
Owens", "url": "http://mlsys.org/api/miniconf/users/27899?format=json", "institution": "UC Davis"}, {"id": 27900, "fullname": "Xia Hu", "url": "http://mlsys.org/api/miniconf/users/27900?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 27901, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27901?format=json", "institution": null}, {"id": 25625, "fullname": "Timmy Liu", "url": "http://mlsys.org/api/miniconf/users/25625?format=json", "institution": "Nvidia"}, {"id": 12477, "fullname": "Huizi Mao", "url": "http://mlsys.org/api/miniconf/users/12477?format=json", "institution": "Stanford University"}], "abstract": "The growing demand for long-context inference capabilities in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the standard attention mechanism. To address this challenge, we introduce BLASST, a drop-in sparse attention method that dynamically prunes the attention matrix without any pre-computation or proxy scores. Our method uses a fixed threshold and existing information from online softmax to identify negligible attention scores, skipping softmax computation, Value block loading, and the subsequent matrix multiplication. This fits seamlessly into existing FlashAttention kernel designs with negligible latency overhead. The approach is applicable to both prefill and decode stages across all attention variants (MHA, GQA, MQA, and MLA), providing a unified solution for accelerating long-context inference. We develop an automated calibration procedure that reveals a simple inverse relationship between the optimal threshold and context length, enabling robust deployment across diverse scenarios. Maintaining high accuracy, we demonstrate a 1.62$\\times$ speedup for prefill at 74.7\\% sparsity and a 1.40$\\times$ speedup for decode at 73.2\\% sparsity on modern GPUs. 
Furthermore, we explore sparsity-aware training as a natural extension, showing that models can be trained to be inherently more robust to sparse attention patterns, pushing the accuracy-sparsity frontier even further.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3631", "url": null, "sourceid": 53, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=6INSBXTQ4x", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 925, "modified": "2026-03-23T21:52:48.064377-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=6INSBXTQ4x", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3561, "uid": "32bb90e8976aab5298d5da10fe66f21d", "name": "Breaking the Ice: Analyzing Cold Start Latency in vLLM", "authors": [{"id": 25925, "fullname": "Huzaifa Shaaban Kabakibo", "url": "http://mlsys.org/api/miniconf/users/25925?format=json", "institution": "Paderborn University"}, {"id": 27154, "fullname": "Animesh Trivedi", "url": "http://mlsys.org/api/miniconf/users/27154?format=json", "institution": "International Business Machines"}, {"id": 27565, "fullname": "Lin Wang", "url": "http://mlsys.org/api/miniconf/users/27565?format=json", "institution": "Paderborn University"}], "abstract": "As scalable inference services become popular, the cold start latency of an inference engine becomes important. 
Today, vLLM has evolved into the de-facto inference engine of choice for many inference workloads. Although popular, due to its complexity and rapid evolution, there has not been a systematic study on the startup latency of its engine. With major architectural innovations under it (e.g., the `V1` API, introduction of `torch.compile`), in this paper, we present the first detailed performance characterization of vLLM startup latency. We break down the startup process into six foundational steps and demonstrate that this process is predominantly CPU-bound. Each step exhibits consistent and interpretable scaling trends with respect to model- and system-level parameters, enabling fine-grained attribution of latency sources. Building on these insights, we develop a lightweight analytical model that accurately predicts vLLM\u2019s startup latency for a given hardware configuration, providing actionable guidance for serverless scheduling and resource planning in large-scale inference environments.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3561", "url": null, "sourceid": 72, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=eoEobeKTNZ", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 855, "modified": "2026-03-23T21:52:45.294986-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=eoEobeKTNZ", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, 
"related_events": [], "related_events_ids": []}, {"id": 3556, "uid": "c9e1074f5b3f9fc8ea15d152add07294", "name": "Attribution-based Sparse Activation in Large Language Models", "authors": [{"id": 27495, "fullname": "Jifeng Song", "url": "http://mlsys.org/api/miniconf/users/27495?format=json", "institution": "University of Pittsburgh"}, {"id": 27496, "fullname": "Xiangyu Yin", "url": "http://mlsys.org/api/miniconf/users/27496?format=json", "institution": "University of Pittsburgh"}, {"id": 27497, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27497?format=json", "institution": null}, {"id": 27498, "fullname": "Kai Huang", "url": "http://mlsys.org/api/miniconf/users/27498?format=json", "institution": "University of Pittsburgh"}, {"id": 27499, "fullname": "Weichen Liu", "url": "http://mlsys.org/api/miniconf/users/27499?format=json", "institution": "University of Pittsburgh"}, {"id": 27500, "fullname": "Wei Gao", "url": "http://mlsys.org/api/miniconf/users/27500?format=json", "institution": "University of Pittsburgh"}], "abstract": "LLM inference is computationally expensive due to the LLM's large parameter sizes. Existing techniques reduce the computing cost via model retraining, but cannot adapt well to different downstream tasks or varying input data at runtime. To avoid such retraining efforts for runtime adaptability, a better option is \\emph{sparse activation} that selectively deactivates an input-dependent set of neurons in inference, but current methods of \\emph{lossless} sparse activation only deactivate neurons with zero output magnitudes, and are ineffective on recent LLMs with higher parameter efficiency. In this paper, we present a new technique of attribution-based sparse activation, which is a \\emph{lossy} sparse activation technique that deactivates neurons with low attribution scores and aims to achieve the best tradeoff between model accuracy and computing costs. 
To ensure optimal sparse activation, we quantified the large errors of existing attribution metrics when used for sparse activation, due to the interdependency among attribution scores of different neurons, and further proposed a new attribution metric that can provably correct such errors. Experiments show that our technique can achieve 70\\% model sparsity in difficult generative tasks such as question answering and text summarization with <5\\% model accuracy loss. Such high model sparsity enables us to reduce the computing latency and memory use of LLM inference by 35\\% and 40\\%, respectively.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3556", "url": null, "sourceid": 104, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=gJFigZeb5D", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 850, "modified": "2026-03-23T21:52:45.078334-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=gJFigZeb5D", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3592, "uid": "07e1cd7dca89a1678042477183b7ac3f", "name": "Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel", "authors": [{"id": 16930, "fullname": "Hongyi Jin", "url": "http://mlsys.org/api/miniconf/users/16930?format=json", "institution": "Carnegie Mellon University"}, {"id": 15229, "fullname": "Bohan Hou", "url": 
"http://mlsys.org/api/miniconf/users/15229?format=json", "institution": "Carnegie Mellon University"}, {"id": 27670, "fullname": "Guanjie Wang", "url": "http://mlsys.org/api/miniconf/users/27670?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 16412, "fullname": "Ruihang Lai", "url": "http://mlsys.org/api/miniconf/users/16412?format=json", "institution": "Carnegie Mellon University"}, {"id": 27671, "fullname": "Jinqi Chen", "url": "http://mlsys.org/api/miniconf/users/27671?format=json", "institution": "School of Computer Science, Carnegie Mellon University"}, {"id": 12026, "fullname": "Zihao Ye", "url": "http://mlsys.org/api/miniconf/users/12026?format=json", "institution": "University of Washington"}, {"id": 20904, "fullname": "Yaxing Cai", "url": "http://mlsys.org/api/miniconf/users/20904?format=json", "institution": "NVIDIA"}, {"id": 23351, "fullname": "Yixin Dong", "url": "http://mlsys.org/api/miniconf/users/23351?format=json", "institution": "Carnegie Mellon University"}, {"id": 27672, "fullname": "Xinhao Cheng", "url": "http://mlsys.org/api/miniconf/users/27672?format=json", "institution": "Carnegie Mellon University"}, {"id": 25596, "fullname": "Zhihao Zhang", "url": "http://mlsys.org/api/miniconf/users/25596?format=json", "institution": "Carnegie Mellon University"}, {"id": 20906, "fullname": "Yilong Zhao", "url": "http://mlsys.org/api/miniconf/users/20906?format=json", "institution": "University of California, Berkeley"}, {"id": 27673, "fullname": "Yingyi Huang", "url": "http://mlsys.org/api/miniconf/users/27673?format=json", "institution": "Nvidia, CMU"}, {"id": 27674, "fullname": "Lijie Yang", "url": "http://mlsys.org/api/miniconf/users/27674?format=json", "institution": "Princeton University"}, {"id": 27675, "fullname": "Jinchen Jiang", "url": "http://mlsys.org/api/miniconf/users/27675?format=json", "institution": "Tsinghua University"}, {"id": 27676, "fullname": "Gabriele Oliaro", "url": 
"http://mlsys.org/api/miniconf/users/27676?format=json", "institution": "Carnegie Mellon University"}, {"id": 27677, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27677?format=json", "institution": null}, {"id": 27678, "fullname": "Xupeng Miao", "url": "http://mlsys.org/api/miniconf/users/27678?format=json", "institution": "Purdue University"}, {"id": 12410, "fullname": "Vinod Grover", "url": "http://mlsys.org/api/miniconf/users/12410?format=json", "institution": "NVIDIA"}, {"id": 12052, "fullname": "Todd Mowry", "url": "http://mlsys.org/api/miniconf/users/12052?format=json", "institution": "Carnegie Mellon University"}, {"id": 16147, "fullname": "Zhihao Jia", "url": "http://mlsys.org/api/miniconf/users/16147?format=json", "institution": "Carnegie Mellon University and Amazon"}, {"id": 11984, "fullname": "Tianqi Chen", "url": "http://mlsys.org/api/miniconf/users/11984?format=json", "institution": "CMU"}], "abstract": "Modern GPU workloads, especially large language model (LLM) inference, suffer from kernel launch overheads and coarse synchronization that limit inter-kernel parallelism. Recent megakernel techniques fuse multiple operators into a single persistent kernel to eliminate launch gaps and expose inter-kernel parallelism, but struggle to handle dynamic shapes and data-dependent computation in real workloads. We present Event Tensor, a unified compiler abstraction for dynamic megakernels. Event Tensor encodes dependencies between tiled tasks, and enables first-class support for both shape and data-dependent dynamism. Built atop this abstraction, our Event Tensor Compiler (ETC) applies static and dynamic scheduling transformations to generate high-performance persistent kernels. 
Evaluations show that ETC achieves state-of-the-art LLM serving latency while significantly reducing system warmup overhead.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3592", "url": null, "sourceid": 119, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=PJqFhAbUHa", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 886, "modified": "2026-03-23T21:52:46.536288-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=PJqFhAbUHa", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3545, "uid": "a5bfc9e07964f8dddeb95fc584cd965d", "name": "ExecuTorch - A Unified PyTorch Solution to Run ML Models On-Device", "authors": [{"id": 25593, "fullname": "Chen Lai", "url": "http://mlsys.org/api/miniconf/users/25593?format=json", "institution": "Meta"}, {"id": 27365, "fullname": "Cemal Bilgin", "url": "http://mlsys.org/api/miniconf/users/27365?format=json", "institution": "MSL"}, {"id": 27366, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27366?format=json", "institution": null}, {"id": 27367, "fullname": "Gregory Comer", "url": "http://mlsys.org/api/miniconf/users/27367?format=json", "institution": ""}, {"id": 27368, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27368?format=json", "institution": null}, {"id": 27369, "fullname": "", "url": 
"http://mlsys.org/api/miniconf/users/27369?format=json", "institution": null}, {"id": 27370, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27370?format=json", "institution": null}, {"id": 27371, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27371?format=json", "institution": null}, {"id": 27372, "fullname": "Mengwei Liu", "url": "http://mlsys.org/api/miniconf/users/27372?format=json", "institution": "Meta Platforms Inc"}, {"id": 27373, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27373?format=json", "institution": null}, {"id": 27374, "fullname": "Songhao Jia", "url": "http://mlsys.org/api/miniconf/users/27374?format=json", "institution": "Meta"}, {"id": 27375, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27375?format=json", "institution": null}, {"id": 27376, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27376?format=json", "institution": null}, {"id": 27377, "fullname": "Digant Desai", "url": "http://mlsys.org/api/miniconf/users/27377?format=json", "institution": "Meta Platforms"}, {"id": 27378, "fullname": "Hansong Zhang", "url": "http://mlsys.org/api/miniconf/users/27378?format=json", "institution": "Meta Platforms"}, {"id": 27379, "fullname": "Manuel Candales", "url": "http://mlsys.org/api/miniconf/users/27379?format=json", "institution": "Meta"}, {"id": 27380, "fullname": "Scott Roy", "url": "http://mlsys.org/api/miniconf/users/27380?format=json", "institution": "Meta"}, {"id": 27381, "fullname": "Sicheng Jia", "url": "http://mlsys.org/api/miniconf/users/27381?format=json", "institution": "Meta, Inc."}, {"id": 27382, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27382?format=json", "institution": null}, {"id": 27383, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27383?format=json", "institution": null}, {"id": 27384, "fullname": "Yanan Cao", "url": "http://mlsys.org/api/miniconf/users/27384?format=json", "institution": "Meta Platforms"}, {"id": 27385, 
"fullname": "", "url": "http://mlsys.org/api/miniconf/users/27385?format=json", "institution": null}, {"id": 27386, "fullname": "Shunting Zhang", "url": "http://mlsys.org/api/miniconf/users/27386?format=json", "institution": ""}, {"id": 27387, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27387?format=json", "institution": null}, {"id": 27388, "fullname": "Angela Yi", "url": "http://mlsys.org/api/miniconf/users/27388?format=json", "institution": "Stanford University"}, {"id": 27389, "fullname": "Zhenrui Zhang", "url": "http://mlsys.org/api/miniconf/users/27389?format=json", "institution": "Facebook"}, {"id": 27390, "fullname": "Andrew Or", "url": "http://mlsys.org/api/miniconf/users/27390?format=json", "institution": "Facebook"}, {"id": 27391, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27391?format=json", "institution": null}, {"id": 27392, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27392?format=json", "institution": null}, {"id": 27393, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27393?format=json", "institution": null}, {"id": 27394, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27394?format=json", "institution": null}, {"id": 27395, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27395?format=json", "institution": null}, {"id": 27396, "fullname": "Supriya Rao", "url": "http://mlsys.org/api/miniconf/users/27396?format=json", "institution": "Facebook"}, {"id": 27397, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27397?format=json", "institution": null}, {"id": 16164, "fullname": "Soumith Chintala", "url": "http://mlsys.org/api/miniconf/users/16164?format=json", "institution": "Meta"}], "abstract": "Local execution of AI on edge devices is critical for privacy, low latency, and offline operation. 
However, deploying models on diverse hardware remains fragmented, often requiring model conversion or complete implementation outside the PyTorch ecosystem where the model was originally authored. We introduce ExecuTorch, a unified PyTorch-native deployment framework for edge AI. ExecuTorch enables seamless deployment of machine learning models across heterogeneous compute environments. It scales from completely embedded microcontrollers to complex system-on-chips (SoCs) with dedicated accelerators, powering devices ranging from wearables and smartphones to large compute clusters. ExecuTorch preserves PyTorch semantics while allowing customization, support for optimizations like quantization, and pluggable execution ''backends''. These features together enable fast experimentation, allowing researchers to validate deployment behavior entirely within PyTorch, bridging the gap between research and production.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3545", "url": null, "sourceid": 37, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=jmE5nwC9kb", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 839, "modified": "2026-03-23T21:52:44.608714-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=jmE5nwC9kb", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3616, "uid": 
"65b9eea6e1cc6bb9f0cd2a47751a186f", "name": "REPARO: LOSS-RESILIENT GENERATIVE CODEC FOR VIDEO CONFERENCING", "authors": [{"id": 27821, "fullname": "Tianhong Li", "url": "http://mlsys.org/api/miniconf/users/27821?format=json", "institution": "Meta"}, {"id": 27822, "fullname": "Vibhaalakshmi Sivaraman", "url": "http://mlsys.org/api/miniconf/users/27822?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 27823, "fullname": "Pantea Karimi", "url": "http://mlsys.org/api/miniconf/users/27823?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 27824, "fullname": "Lijie Fan", "url": "http://mlsys.org/api/miniconf/users/27824?format=json", "institution": "Google DeepMind"}, {"id": 12450, "fullname": "Mohammad Alizadeh", "url": "http://mlsys.org/api/miniconf/users/12450?format=json", "institution": "MIT CSAIL"}, {"id": 13201, "fullname": "Dina Katabi", "url": "http://mlsys.org/api/miniconf/users/13201?format=json", "institution": "MIT"}], "abstract": "Packet loss during video conferencing often results in poor quality and video freezing. Retransmitting lost packets is often impractical due to the need for real-time playback, and using Forward Error Correction (FEC) for packet recovery is challenging due to the unpredictable and bursty nature of Internet losses. Excessive redundancy leads to inefficiency and wasted bandwidth, while insufficient redundancy results in undecodable frames, causing video freezes and quality degradation in subsequent frames.  We introduce Reparo \u2014 a loss-resilient video conferencing framework based on generative deep learning models to address these issues. Our approach generates missing information when a frame or part of a frame is lost. This generation is conditioned on the data received thus far, considering the model's understanding of how people and objects appear and interact within the visual realm. 
Experimental results, using publicly available video conferencing datasets, show that Reparo outperforms state-of-the-art FEC-based video conferencing solutions in terms of both video quality (measured through PSNR, SSIM, and LPIPS) and the occurrence of video freezes.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3616", "url": null, "sourceid": 105, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=GaBGzA7fpe", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 910, "modified": "2026-03-23T21:52:47.459053-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=GaBGzA7fpe", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3520, "uid": "698d51a19d8a121ce581499d7b701668", "name": "FarSkip-Collectives: Unhobbling Blocking Communication in Mixture of Experts Models", "authors": [{"id": 26274, "fullname": "Yonatan Dukler", "url": "http://mlsys.org/api/miniconf/users/26274?format=json", "institution": "AMD"}, {"id": 27232, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27232?format=json", "institution": null}, {"id": 27233, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27233?format=json", "institution": null}, {"id": 27234, "fullname": "Vikram Appia", "url": "http://mlsys.org/api/miniconf/users/27234?format=json", "institution": "Advanced Micro Devices"}, {"id": 12648, 
"fullname": "Emad Barsoum", "url": "http://mlsys.org/api/miniconf/users/12648?format=json", "institution": "Cerebras"}], "abstract": "Blocking communication presents a major hurdle in running MoEs efficiently in distributed settings. To address this, we present FarSkip-Collective, which modifies the architecture of modern models to enable overlapping of their computation with communication. Our approach modifies the architecture to skip connections in the model, and it is unclear a priori whether the modified model architecture can remain equally capable, especially for large state-of-the-art models and while modifying all of the model layers.  We answer this question in the affirmative and fully convert a series of state-of-the-art models varying from 16B to 109B parameters to enable overlapping of their communication while achieving accuracy on par with their original open-source releases. For example, we convert Llama 4 Scout (109B) via self-distillation and achieve average accuracy within 1% of its instruction-tuned release, averaged over a wide range of downstream evaluations.  In addition to demonstrating retained accuracy of the large modified models, we realize the benefits of FarSkip-Collective through optimized implementations that explicitly overlap communication with computation, accelerating both training and inference in existing frameworks. For inference, we demonstrate 18.5% speed-up in Time To First Token when serving Llama-4 Scout with expert parallelism in vLLM and achieve 97.6% communication-computation overlap during the prefill stage.  
During training, our approach enables 88.9% communication overlap of the all-to-all communication collectives when pre-training DeepSeek-V3 MoE layers with expert parallelism.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3520", "url": null, "sourceid": 111, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=ruOpvLzsGV", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 814, "modified": "2026-03-23T21:52:43.667545-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=ruOpvLzsGV", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3529, "uid": "8613985ec49eb8f757ae6439e879bb2a", "name": "Meeting SLOs, Slashing Hours: Automated Enterprise LLM Optimization with OptiKIT", "authors": [{"id": 25907, "fullname": "Nicholas Santavas", "url": "http://mlsys.org/api/miniconf/users/25907?format=json", "institution": "DUTH"}, {"id": 27293, "fullname": "Kareem Eissa", "url": "http://mlsys.org/api/miniconf/users/27293?format=json", "institution": "Siemens Healthineers"}, {"id": 27294, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27294?format=json", "institution": null}, {"id": 27295, "fullname": "Piotr Florek", "url": "http://mlsys.org/api/miniconf/users/27295?format=json", "institution": ""}, {"id": 27296, "fullname": "Matteo Nulli", "url": 
"http://mlsys.org/api/miniconf/users/27296?format=json", "institution": "eBay Inc."}, {"id": 27297, "fullname": "Stefan Vasilev", "url": "http://mlsys.org/api/miniconf/users/27297?format=json", "institution": "eBay Inc."}, {"id": 27298, "fullname": "Seyyed Hashemi", "url": "http://mlsys.org/api/miniconf/users/27298?format=json", "institution": "eBay Inc."}, {"id": 27299, "fullname": "Antonios Gasteratos", "url": "http://mlsys.org/api/miniconf/users/27299?format=json", "institution": "Dimocritus University of Thrace"}, {"id": 27300, "fullname": "Shahram Khadivi", "url": "http://mlsys.org/api/miniconf/users/27300?format=json", "institution": "eBay Inc."}], "abstract": "Enterprise LLM deployment faces a critical scalability challenge: organizations must optimize models systematically to scale AI initiatives within constrained compute budgets, yet the specialized expertise required for manual optimization remains a niche and scarce skillset. This challenge is particularly evident in managing GPU utilization across heterogeneous infrastructure while enabling teams with diverse workloads and limited LLM optimization experience to deploy models efficiently. We present OPTIKIT, a distributed LLM optimization framework that democratizes model compression and tuning by automating complex optimization workflows for non-expert teams. OPTIKIT provides dynamic resource allocation, staged pipeline execution with automatic cleanup, and seamless enterprise integration. In production, it delivers more than 2\u00d7 GPU throughput improvement while empowering application teams to achieve consistent performance improvements without deep LLM optimization expertise. We share both the platform design and key engineering insights into resource allocation algorithms, pipeline orchestration, and integration patterns that enable large-scale, production-grade democratization of model optimization. 
Finally, we open-source the system to enable external contributions and broader reproducibility.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3529", "url": null, "sourceid": 90, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=om4H7AI2hc", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 823, "modified": "2026-03-23T21:52:43.982129-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=om4H7AI2hc", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3539, "uid": "9b8619251a19057cff70779273e95aa6", "name": "Practical Adversarial Multi-Armed Bandits with Sublinear Runtime", "authors": [{"id": 21072, "fullname": "Kasper Overgaard Mortensen", "url": "http://mlsys.org/api/miniconf/users/21072?format=json", "institution": "Aarhus University"}, {"id": 27348, "fullname": "Ama Bembua Bainson", "url": "http://mlsys.org/api/miniconf/users/27348?format=json", "institution": "Aarhus University"}, {"id": 27349, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27349?format=json", "institution": null}, {"id": 27350, "fullname": "Kristoffer Strube", "url": "http://mlsys.org/api/miniconf/users/27350?format=json", "institution": "Kristoffer Strube Consulting"}, {"id": 27351, "fullname": "Renata Borovica-Gajic", "url": "http://mlsys.org/api/miniconf/users/27351?format=json", "institution": 
"University of Melbourne"}, {"id": 27352, "fullname": "Andrea Paudice", "url": "http://mlsys.org/api/miniconf/users/27352?format=json", "institution": "Aarhus University"}, {"id": 21058, "fullname": "Davide Mottin", "url": "http://mlsys.org/api/miniconf/users/21058?format=json", "institution": "Aarhus University"}, {"id": 21084, "fullname": "Panagiotis Karras", "url": "http://mlsys.org/api/miniconf/users/21084?format=json", "institution": "Copenhagen University"}], "abstract": "We study the Multi-Armed Bandit problem in nonstationary adversarial environments, where the identity of the optimal arm can change over time due to shifts in the loss sequence. Motivated by applications such as physical design tuning in database systems, we focus on settings with a very large number of arms and seek practical algorithms with sublinear runtime. Our main contribution is a novel algorithm, Queuing Behind the Leader (QBL), which achieves a per-iteration complexity of O(m log k), where m is the number of arms selected at each step. QBL combines limited update operations via a priority queue, a constant sampling overhead, and a balanced exploration strategy. 
We evaluate QBL extensively on state-of-the-art benchmarks and demonstrate that it consistently outperforms existing methods in both time and solution quality.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3539", "url": null, "sourceid": 130, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=lfHvcstuo2", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 833, "modified": "2026-03-23T21:52:44.377815-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=lfHvcstuo2", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3603, "uid": "2b44928ae11fb9384c4cf38708677c48", "name": "MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs", "authors": [{"id": 27752, "fullname": "Jiyuan Zhang", "url": "http://mlsys.org/api/miniconf/users/27752?format=json", "institution": "Facebook"}, {"id": 25811, "fullname": "Yining Liu", "url": "http://mlsys.org/api/miniconf/users/25811?format=json", "institution": ""}, {"id": 27729, "fullname": "Siqi Yan", "url": "http://mlsys.org/api/miniconf/users/27729?format=json", "institution": "Facebook"}, {"id": 27720, "fullname": "Lisen Deng", "url": "http://mlsys.org/api/miniconf/users/27720?format=json", "institution": "Meta"}, {"id": 27719, "fullname": "Jennifer Cao", "url": "http://mlsys.org/api/miniconf/users/27719?format=json", "institution": 
"Facebook"}, {"id": 27159, "fullname": "Shuqi Yang", "url": "http://mlsys.org/api/miniconf/users/27159?format=json", "institution": "Meta"}, {"id": 27731, "fullname": "Bi Xue", "url": "http://mlsys.org/api/miniconf/users/27731?format=json", "institution": "Thinking Machines Lab"}, {"id": 27730, "fullname": "Min Ni", "url": "http://mlsys.org/api/miniconf/users/27730?format=json", "institution": "Northwestern University"}, {"id": 16149, "fullname": "Shen Li", "url": "http://mlsys.org/api/miniconf/users/16149?format=json", "institution": "Meta"}], "abstract": "The pervasive \u201cmemory wall\u201d bottleneck is significantly amplified in modern large-scale Mixture-of-Experts (MoE) architectures. MoE's inherent architectural sparsity leads to sparse arithmetic compute and also introduces substantial activation memory overheads\u2014driven by large token routing buffers and the need to materialize and buffer intermediate tensors. This memory pressure limits the maximum batch size and sequence length that can fit on GPUs, and also results in excessive data movements that hinders performance and efficient model scaling. We present MoEBlaze, a memory-efficient MoE training framework that addresses these issues through a co-designed system approach: (i) an end-to-end token dispatch and MoE training method with optimized data structures to eliminate intermediate buffers and activation materializing, and (ii) co-designed kernels with smart activation checkpoint to mitigate memory footprint while simultaneously achieving better performance. We demonstrate that MoEBlaze can achieve over $4\\times$ speedups and over $50\\%$ memory savings compared to existing MoE frameworks. 
MoEBlaze has been deployed in Meta recommendation production.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3603", "url": null, "sourceid": 115, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=L8qKfWWkry", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 897, "modified": "2026-03-23T21:52:46.937141-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=L8qKfWWkry", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3604, "uid": "c74d97b01eae257e44aa9d5bade97baf", "name": "XProf: An Open, Scalable, and Extensible Profiling System for the Modern ML Stack", "authors": [{"id": 27160, "fullname": "Clive Verghese", "url": "http://mlsys.org/api/miniconf/users/27160?format=json", "institution": "Google"}, {"id": 26196, "fullname": "Prasanna Rengasamy", "url": "http://mlsys.org/api/miniconf/users/26196?format=json", "institution": "google"}, {"id": 27753, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27753?format=json", "institution": null}, {"id": 26287, "fullname": "Yin Zhang", "url": "http://mlsys.org/api/miniconf/users/26287?format=json", "institution": "Google"}, {"id": 25661, "fullname": "Jiya Zhang", "url": "http://mlsys.org/api/miniconf/users/25661?format=json", "institution": "Google"}, {"id": 27754, "fullname": "", "url": 
"http://mlsys.org/api/miniconf/users/27754?format=json", "institution": null}, {"id": 26205, "fullname": "Charles Alaras", "url": "http://mlsys.org/api/miniconf/users/26205?format=json", "institution": "Google"}, {"id": 27755, "fullname": "Aditya Sharma", "url": "http://mlsys.org/api/miniconf/users/27755?format=json", "institution": "Indian Institute of Information Technology, Allahabad"}, {"id": 27756, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27756?format=json", "institution": null}, {"id": 27182, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27182?format=json", "institution": null}, {"id": 27757, "fullname": "Rushabh Lalwani", "url": "http://mlsys.org/api/miniconf/users/27757?format=json", "institution": "Google India"}, {"id": 27758, "fullname": "Sannidhya Chauhan", "url": "http://mlsys.org/api/miniconf/users/27758?format=json", "institution": ""}, {"id": 27759, "fullname": "Sai Ganesh Bandiatmakuri", "url": "http://mlsys.org/api/miniconf/users/27759?format=json", "institution": "Google"}, {"id": 27760, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27760?format=json", "institution": null}, {"id": 23805, "fullname": "Ani Udipi", "url": "http://mlsys.org/api/miniconf/users/23805?format=json", "institution": "Google"}, {"id": 27761, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27761?format=json", "institution": null}, {"id": 27762, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27762?format=json", "institution": null}, {"id": 27763, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27763?format=json", "institution": null}, {"id": 27764, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27764?format=json", "institution": null}, {"id": 27765, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27765?format=json", "institution": null}, {"id": 11194, "fullname": "Naveen Kumar", "url": "http://mlsys.org/api/miniconf/users/11194?format=json", "institution": "Google"}, 
{"id": 27766, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27766?format=json", "institution": null}, {"id": 27767, "fullname": "Sayce Falk", "url": "http://mlsys.org/api/miniconf/users/27767?format=json", "institution": "Google"}, {"id": 27768, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27768?format=json", "institution": null}, {"id": 27769, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27769?format=json", "institution": null}], "abstract": "Optimizing Large Models across thousands of accelerators requires deep system expertise. To address modern machine learning (ML) optimization needs, we present **XProf**, the ML profiler for the OpenXLA ecosystem. **XProf** delivers actionable optimization suggestions and in-depth performance analysis, empowering ML researchers and framework users to improve efficiency without specialized systems knowledge. **XProf** provides a unified, full-stack view of both host (CPU) and device (accelerator - TPUs/GPUs) performance, leveraging tools like the Roofline Model for comprehensive analysis. **XProf**\u2019s distributed architecture is designed to monitor thousands of chips with minimal workload overhead (<1%). This architecture is made pluggable through the open-source PJRT C API extension, which has facilitated its adoption by third-party accelerator vendors. **XProf** has been instrumental in achieving significant efficiency gains at Google and winning MLPerf submissions. This paper presents the design and architecture of **XProf**, showcases its differentiating tools and capabilities, and highlights its impact within Google and across the industry as a state of the art ML profiler. 
**XProf** is available as part of the OpenXLA project at https://github.com/openxla/xprof.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3604", "url": null, "sourceid": 16, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=KqRLAdGK6C", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 898, "modified": "2026-03-23T21:52:46.968487-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=KqRLAdGK6C", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3639, "uid": "a684eceee76fc522773286a895bc8436", "name": "Sparing Strategies to Minimize Reliability Impact On Large Training Jobs", "authors": [{"id": 27967, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27967?format=json", "institution": null}, {"id": 27968, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27968?format=json", "institution": null}, {"id": 27969, "fullname": "Ehsan K. 
Ardestani", "url": "http://mlsys.org/api/miniconf/users/27969?format=json", "institution": "Meta"}, {"id": 27970, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27970?format=json", "institution": null}, {"id": 27971, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27971?format=json", "institution": null}, {"id": 27972, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27972?format=json", "institution": null}, {"id": 27973, "fullname": "Zhaodong Wang", "url": "http://mlsys.org/api/miniconf/users/27973?format=json", "institution": "Facebook"}, {"id": 27974, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27974?format=json", "institution": null}, {"id": 27975, "fullname": "Xu Zhang", "url": "http://mlsys.org/api/miniconf/users/27975?format=json", "institution": "Meta Platforms"}, {"id": 27976, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27976?format=json", "institution": null}, {"id": 27977, "fullname": "Ying Zhang", "url": "http://mlsys.org/api/miniconf/users/27977?format=json", "institution": "University of Michigan - Ann Arbor"}], "abstract": "Training large language models (LLMs) on Meta\u2019s AI clusters requires running long, distributed jobs that are vulnerable to hardware failures. To maintain high availability and efficiency, production systems use sparing strategy, i.e., pre-allocating spare compute resources that can replace failed components. However, choosing the optimal sparing strategy-including compute block size, number of spare blocks, and spare GPU trays\u2014is complex and directly impacts cluster performance and reliability. We present an analytical framework with closed-form expressions to guide sparing strategy decisions, making practical, first-order recommendations for production environments. We also develop a simulation component to cross-validate the analytical model. 
Applied in Meta\u2019s hyperscale infrastructure, this model helps engineers optimize fault tolerance, minimize downtime, and maximize goodput during LLM training. Our real-world use case demonstrates how the framework informs robust, cost-effective design choices critical to Meta\u2019s AI operations.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3639", "url": null, "sourceid": 54, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=18jPgte2tM", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 933, "modified": "2026-03-23T21:52:48.414812-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=18jPgte2tM", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3527, "uid": "ec8956637a99787bd197eacd77acce5e", "name": "StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation", "authors": [{"id": 27276, "fullname": "Tianrui Feng", "url": "http://mlsys.org/api/miniconf/users/27276?format=json", "institution": "The University of Texas at Austin"}, {"id": 27277, "fullname": "Zhi Li", "url": "http://mlsys.org/api/miniconf/users/27277?format=json", "institution": "University of California, Berkeley"}, {"id": 27278, "fullname": "Shuo Yang", "url": "http://mlsys.org/api/miniconf/users/27278?format=json", "institution": "University of California, Berkeley"}, {"id": 27279, "fullname": 
"Haocheng Xi", "url": "http://mlsys.org/api/miniconf/users/27279?format=json", "institution": "University of California, Berkeley"}, {"id": 27280, "fullname": "Muyang Li", "url": "http://mlsys.org/api/miniconf/users/27280?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 24423, "fullname": "Xiuyu Li", "url": "http://mlsys.org/api/miniconf/users/24423?format=json", "institution": "UC Berkeley"}, {"id": 27281, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27281?format=json", "institution": null}, {"id": 27282, "fullname": "Keting Yang", "url": "http://mlsys.org/api/miniconf/users/27282?format=json", "institution": "Google"}, {"id": 27283, "fullname": "Kelly Peng", "url": "http://mlsys.org/api/miniconf/users/27283?format=json", "institution": "Stanford University"}, {"id": 12133, "fullname": "Song Han", "url": "http://mlsys.org/api/miniconf/users/12133?format=json", "institution": "MIT"}, {"id": 27284, "fullname": "Maneesh Agrawala", "url": "http://mlsys.org/api/miniconf/users/27284?format=json", "institution": "Stanford University"}, {"id": 11240, "fullname": "Kurt Keutzer", "url": "http://mlsys.org/api/miniconf/users/11240?format=json", "institution": "EECS, UC Berkeley"}, {"id": 27285, "fullname": "Akio Kodaira", "url": "http://mlsys.org/api/miniconf/users/27285?format=json", "institution": "Shizuku AI"}, {"id": 24298, "fullname": "Chenfeng Xu", "url": "http://mlsys.org/api/miniconf/users/24298?format=json", "institution": "UC Berkeley"}], "abstract": "Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Previous image-based streaming diffusion models have powered efficient and creative live streaming products but has hit limits on temporal consistency due to the foundation of image-based designs. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. 
However, offline generation systems primarily optimize throughput by batching large workloads. In contrast, live online streaming operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter. Besides, scalable multi-GPU serving for real-time streams remains largely unresolved so far. To address this, we present \\textbf{StreamDiffusionV2}, a \\emph{training-free} pipeline for interactive live streaming with video diffusion models. StreamDiffusionV2 integrates an SLO-aware batching scheduler and a block scheduler, together with a sink-token\u2013guided rolling KV cache, a motion-aware noise controller, and other system-level optimizations. Moreover, we introduce a scalable pipeline orchestration that parallelizes the diffusion process across denoising steps and network layers, achieving near-linear FPS scaling without violating latency guarantees. The system scales seamlessly across heterogeneous GPU environments and supports flexible denoising steps (e.g., 1\u20134), enabling both ultra-low-latency and higher-quality modes. Without TensorRT or quantization, StreamDiffusionV2 renders the first frame within 0.5s and attains 58.28 FPS with a 14B-parameter model and 64.52 FPS with a 1.3B-parameter model on four H100 GPUs. 
Even when increasing denoising steps to improve quality, it sustains 31.62 FPS (14B) and 61.58 FPS (1.3B), making state-of-the-art generative live streaming practical and accessible\u2014from individual creators to enterprise-scale platforms.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3527", "url": null, "sourceid": 102, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=p9WALNBvc6", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 821, "modified": "2026-03-23T21:52:43.907988-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=p9WALNBvc6", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3567, "uid": "9f61408e3afb633e50cdf1b20de6f466", "name": "DreamDDP: Accelerating Low-Bandwidth Geo-Distributed LLM Training with Layer-wise Partial Synchronization", "authors": [{"id": 27588, "fullname": "Zhenheng Tang", "url": "http://mlsys.org/api/miniconf/users/27588?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 27589, "fullname": "Zichen TANG", "url": "http://mlsys.org/api/miniconf/users/27589?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 27590, "fullname": "Junlin Huang", "url": "http://mlsys.org/api/miniconf/users/27590?format=json", "institution": "The Hong Kong University of Science and Technology"}, 
{"id": 27591, "fullname": "Xinglin Pan", "url": "http://mlsys.org/api/miniconf/users/27591?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 27592, "fullname": "Rudan Yan", "url": "http://mlsys.org/api/miniconf/users/27592?format=json", "institution": "The Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 27593, "fullname": "Yuxin Wang", "url": "http://mlsys.org/api/miniconf/users/27593?format=json", "institution": "Huawei Technologies Ltd."}, {"id": 27594, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27594?format=json", "institution": null}, {"id": 27595, "fullname": "Shaohuai Shi", "url": "http://mlsys.org/api/miniconf/users/27595?format=json", "institution": "Harbin Institute of Technology, Shenzhen"}, {"id": 27596, "fullname": "Xiaowen Chu", "url": "http://mlsys.org/api/miniconf/users/27596?format=json", "institution": "Hong Kong University of Science and Technology (Guangzhou)"}, {"id": 27597, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27597?format=json", "institution": null}], "abstract": "Scaling up training large language models (LLMs) in computing and data perspectives motivates distributed training across different geo-distributed data centers. Communication in geo-distributed data parallel training (DDP) with stochastic gradient descent (S-SGD) is the main bottleneck in low-bandwidth environments. Recent studies have successfully applied Local SGD to mitigate the communication overhead and geo-distributedly pre-train LLMs. However, we identify that the strict model synchronization mechanism in Local SGD prevents overlapping communication and computation, which makes the system lose opportunities to overlap communication and computation. To overcome this limitation, we expand the design space of local SGD by layer-wisely decoupling model synchronization. 
In each iteration, only partial layers are synchronized instead of the entire model after a specific number of iterations.   Leveraging this methodology, we introduce DreamDDP, a training framework to accelerate low-bandwidth distributed training with three key innovations: (1) partial local SGD with theoretical assurances of convergence rates comparable to S-SGD; (2) overlapping parameter synchronization with computation without extra GPU memory occupation; (3) identifying and exploiting three properties to schedule communication and computation based on fine-grained layer-wise profiling to reduce training time. Empirical evaluations conducted on 32 GPUs using prominent deep learning models, including ResNet-18, ResNet-50, GPT-2, and Llama-2, demonstrate that DreamDDP enhances the convergence properties of Local SGD (and Adam) and achieves speedups ranging from $1.49\\times$ to $3.91\\times$ over leading baseline methods.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3567", "url": null, "sourceid": 56, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=cnvw0mbZQp", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 861, "modified": "2026-03-23T21:52:45.555547-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=cnvw0mbZQp", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3612, "uid": 
"45c48cce2e2d7fbdea1afc51c7c6ad26", "name": "When Enough is Enough: Rank-Aware Early Termination for Vector Search", "authors": [{"id": 27807, "fullname": "Jianan Lu", "url": "http://mlsys.org/api/miniconf/users/27807?format=json", "institution": "Princeton University"}, {"id": 11918, "fullname": "Asaf Cidon", "url": "http://mlsys.org/api/miniconf/users/11918?format=json", "institution": "Columbia University"}, {"id": 11221, "fullname": "Michael None Freedman", "url": "http://mlsys.org/api/miniconf/users/11221?format=json", "institution": "Princeton University"}], "abstract": "Graph-based vector search underpins modern LLM applications such as retrieval-augmented generation (RAG), but its efficiency is increasingly constrained by disk I/O. Existing systems continue searching long after discovering the higher-ranked (i.e., most valuable) results for downstream applications. We present Terminus, a rank-aware early termination mechanism that dynamically aligns I/O spending with application utility. Terminus models per-I/O search utility using a rank-weighted function and terminates once recent I/Os yield negligible utility gains. By prioritizing I/O toward results that matter most to downstream tasks, Terminus achieves a better performance\u2013accuracy trade-off. 
It delivers up to 1.4\u00d7 higher throughput at the same accuracy target compared to existing early termination schemes, and up to 3.2\u00d7 higher throughput than a baseline without early termination, with minimal impact on RAG accuracy.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3612", "url": null, "sourceid": 9, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=IFz0pROwF1", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 906, "modified": "2026-03-23T21:52:47.283768-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=IFz0pROwF1", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3614, "uid": "a0a080f42e6f13b3a2df133f073095dd", "name": "DisAgg: Distributed Aggregators for Efficient Secure Aggregation", "authors": [{"id": 27813, "fullname": "Haaris Mehmood", "url": "http://mlsys.org/api/miniconf/users/27813?format=json", "institution": "Samsung"}, {"id": 27814, "fullname": "Giorgos Tatsis", "url": "http://mlsys.org/api/miniconf/users/27814?format=json", "institution": "CERTH"}, {"id": 27815, "fullname": "Dimitrios Alexopoulos", "url": "http://mlsys.org/api/miniconf/users/27815?format=json", "institution": "Pragma IoT Solutions"}, {"id": 27816, "fullname": "Karthikeyan Saravanan", "url": "http://mlsys.org/api/miniconf/users/27816?format=json", "institution": "Samsung"}, {"id": 25860, 
"fullname": "Jie Xi", "url": "http://mlsys.org/api/miniconf/users/25860?format=json", "institution": "Samsung R&amp;D Institute UK (SRUK)"}, {"id": 27817, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27817?format=json", "institution": null}, {"id": 27818, "fullname": "Mete Ozay", "url": "http://mlsys.org/api/miniconf/users/27818?format=json", "institution": "Samsung Research"}], "abstract": "Federated learning enables collaborative model training across distributed clients, yet vanilla FL exposes client updates to the central server.  Secure\u2011aggregation schemes protect privacy against an honest\u2011but\u2011curious server, but existing approaches often suffer from many communication rounds, heavy public\u2011key operations, or difficulty handling client dropouts.  Recent methods like One\u2011Shot Private Aggregation (OPA) cut rounds to a single server interaction per FL iteration, yet they impose substantial cryptographic and computational overhead on both server and clients.  We propose a new protocol that leverages a small committee of clients called \\textit{aggregators} to perform the aggregation itself: each client secret\u2011shares its update vector to aggregators, which locally compute partial sums and return only aggregated shares for server\u2011side reconstruction.  This design eliminates local masking and expensive homomorphic encryption, reducing endpoint computation while preserving privacy against a curious server and a limited fraction of colluding clients.  
By leveraging optimal trade-offs between communication and computation costs, extensive experiments with upto 50k users and 10k\u2011dimensional update vectors show that our protocol is at least $1.9\\times$ faster than OPA, the previous best protocol.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3614", "url": null, "sourceid": 122, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=H0BLKrOgik", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 908, "modified": "2026-03-23T21:52:47.381737-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=H0BLKrOgik", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3596, "uid": "202cb962ac59075b964b07152d234b70", "name": "Beyond the Buzz: A Pragmatic Take on Inference Disaggregation", "authors": [{"id": 27700, "fullname": "Tiyasa Mitra", "url": "http://mlsys.org/api/miniconf/users/27700?format=json", "institution": "NVIDIA"}, {"id": 27701, "fullname": "Ritika Borkar", "url": "http://mlsys.org/api/miniconf/users/27701?format=json", "institution": "NVIDIA"}, {"id": 27702, "fullname": "Nidhi Bhatia", "url": "http://mlsys.org/api/miniconf/users/27702?format=json", "institution": "NVIDIA Corporation"}, {"id": 27703, "fullname": "Shivam Raj", "url": "http://mlsys.org/api/miniconf/users/27703?format=json", "institution": "NVIDIA"}, {"id": 25864, "fullname": 
"hongkuan zhou", "url": "http://mlsys.org/api/miniconf/users/25864?format=json", "institution": "Nvidia"}, {"id": 27704, "fullname": "Yan Ru Pei", "url": "http://mlsys.org/api/miniconf/users/27704?format=json", "institution": "NVIDIA"}, {"id": 27705, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27705?format=json", "institution": null}, {"id": 27706, "fullname": "Kyle", "url": "http://mlsys.org/api/miniconf/users/27706?format=json", "institution": "NVIDIA"}, {"id": 27707, "fullname": "Ramon Matas", "url": "http://mlsys.org/api/miniconf/users/27707?format=json", "institution": "NVIDIA"}, {"id": 18519, "fullname": "Dheevatsa Mudigere", "url": "http://mlsys.org/api/miniconf/users/18519?format=json", "institution": "NVIDIA"}, {"id": 27708, "fullname": "Ritchie Zhao", "url": "http://mlsys.org/api/miniconf/users/27708?format=json", "institution": "NVIDIA"}, {"id": 27709, "fullname": "Maximilian Golub", "url": "http://mlsys.org/api/miniconf/users/27709?format=json", "institution": "NVIDIA"}, {"id": 27710, "fullname": "Arpan Dutta", "url": "http://mlsys.org/api/miniconf/users/27710?format=json", "institution": "NVIDIA"}, {"id": 27711, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27711?format=json", "institution": null}, {"id": 27712, "fullname": "Sailaja Madduri", "url": "http://mlsys.org/api/miniconf/users/27712?format=json", "institution": "NVIDIA"}, {"id": 27713, "fullname": "Dharmesh Jani", "url": "http://mlsys.org/api/miniconf/users/27713?format=json", "institution": "NVIDIA"}, {"id": 27714, "fullname": "Brian Pharris", "url": "http://mlsys.org/api/miniconf/users/27714?format=json", "institution": "NVIDIA"}, {"id": 27715, "fullname": "Itay Neeman", "url": "http://mlsys.org/api/miniconf/users/27715?format=json", "institution": "NVIDIA"}, {"id": 27716, "fullname": "Bita Darvish Rouhani", "url": "http://mlsys.org/api/miniconf/users/27716?format=json", "institution": "NVIDIA"}], "abstract": "As inference scales to multi-node deployments, 
prefill-decode disaggregation \u2014 splitting inference into distinct phases \u2014 offers a promising path to improving the throughput-interactivity Pareto frontier. Despite growing enthusiasm and a surge of open-source efforts, large-scale deployment of disaggregated serving remains limited due to the complexity of the optimization search space and system-level coordination. In this paper, we present the first systematic study of disaggregated inference at scale, evaluating hundreds of thousands of design points across diverse workloads and hardware configurations. We find that disaggregation is most effective for prefill-heavy traffic patterns and larger models. Our results highlight the critical role of dynamic rate matching and elastic scaling in achieving Pareto-optimal performance. These insights, in conjunction with the deployment flexibility offered by NVIDIA Dynamo, provide a foundation to navigate the trade-off between system throughput and interactivity in efficient disaggregated deployments.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3596", "url": null, "sourceid": 123, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=NqC5tcBsa0", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 890, "modified": "2026-03-23T21:52:46.684672-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=NqC5tcBsa0", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, 
"longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3601, "uid": "c45147dee729311ef5b5c3003946c48f", "name": "Pylo: Towards Accessible Learned Optimizers in PyTorch", "authors": [{"id": 27741, "fullname": "Paul Janson", "url": "http://mlsys.org/api/miniconf/users/27741?format=json", "institution": "Concordia University"}, {"id": 27742, "fullname": "Benjamin Th\u00e9rien", "url": "http://mlsys.org/api/miniconf/users/27742?format=json", "institution": "Mila / Universit\u00e9 de Montr\u00e9al"}, {"id": 27743, "fullname": "Quentin Anthony", "url": "http://mlsys.org/api/miniconf/users/27743?format=json", "institution": "EleutherAI"}, {"id": 27744, "fullname": "Xiaolong Huang", "url": "http://mlsys.org/api/miniconf/users/27744?format=json", "institution": "Concordia University"}, {"id": 27745, "fullname": "Abhinav Moudgil", "url": "http://mlsys.org/api/miniconf/users/27745?format=json", "institution": "Concordia University"}, {"id": 27746, "fullname": "Eugene Belilovsky", "url": "http://mlsys.org/api/miniconf/users/27746?format=json", "institution": "Concordia University"}], "abstract": "Learned optimizers have been an active research topic over the past decade, with increasing progress toward practical, general-purpose optimizers that can serve as drop-in replacements for widely used methods like Adam. However, recent advances such as VeLO, which was meta-trained for 4000 TPU-months, remain largely inaccessible to the broader community, in part due to their reliance on JAX and the absence of user-friendly packages for independently using the optimizers after meta-training. To address this gap, we introduce PyLO, a PyTorch-based library that brings learned optimizers to the remaining \u2248 80% of machine learning community via the familiar torch.optim.Optimizer interface. Unlike prior work focused on limited-scale academic tasks, our emphasis is on applying learned optimization to real-world large-scale pre-training tasks. 
Our systems contribution includes CUDA-accelerated implementations of the small fc lopt (Metz et al., 2022a) and VeLO (Metz et al., 2022b) learned optimizers, achieving substantial performance gains, with training throughput on ViT-B/16 (batch size 32) increasing from 39.36 and 49.73 to 205.59 and 191.18 samples per second, respectively. PyLO's versatility allows learned optimizers to be easily combined with existing optimization tools such as learning rate schedules and weight decay; when doing so, we find that learned optimizers benefit substantially from these additions. Our code is available at https://anonymous.4open.science/r/pylo-C91E32", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3601", "url": null, "sourceid": 116, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=M9V1n4KxSd", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 895, "modified": "2026-03-23T21:52:46.873684-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=M9V1n4KxSd", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3518, "uid": "812b4ba287f5ee0bc9d43bbf5bbe87fb", "name": "Blueprint, Bootstrap, and Bridge: A Security Look at NVIDIA GPU Confidential Computing", "authors": [{"id": 27206, "fullname": "Zhongshu Gu", "url": "http://mlsys.org/api/miniconf/users/27206?format=json", "institution": "IBM Research"}, 
{"id": 27207, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27207?format=json", "institution": null}, {"id": 27208, "fullname": "Salman Ahmed", "url": "http://mlsys.org/api/miniconf/users/27208?format=json", "institution": "IBM"}, {"id": 27209, "fullname": "Julian James stephen", "url": "http://mlsys.org/api/miniconf/users/27209?format=json", "institution": "IBM"}, {"id": 27210, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27210?format=json", "institution": null}, {"id": 27211, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27211?format=json", "institution": null}, {"id": 27212, "fullname": "Shixuan Zhao", "url": "http://mlsys.org/api/miniconf/users/27212?format=json", "institution": "The Ohio State University"}, {"id": 27213, "fullname": "Zhiqiang Lin", "url": "http://mlsys.org/api/miniconf/users/27213?format=json", "institution": "Ohio State University, Columbus"}], "abstract": "GPU Confidential Computing (GPU-CC), introduced with the NVIDIA Hopper architecture, extends confidential computing protections from CPUs to GPUs, enabling secure execution of AI workloads. For end users, enabling GPU-CC is seamless and requires no modifications to existing applications. However, behind this ease of adoption lies a proprietary and highly complex system whose opacity presents significant challenges for early adopters and system researchers seeking to understand its architecture and security landscape. In this work, we provide a security-focused look at GPU-CC by reconstructing a coherent view of the system. Our analysis begins from the GPU-CC\u2019s blueprint, focusing on the specialized architectural engines that underpin its security design. We then investigate GPU-CC\u2019s bootstrap process, which orchestrates hardware and software components to establish core security mechanisms. 
Finally, we conduct targeted experiments to evaluate whether, under the GPU-CC\u2019s threat model, data transfers via different data paths remain secure when they cross the bridge between trusted CPU and GPU domains. All security findings presented in this paper have been reported responsibly to the NVIDIA Product Security Incident Response Team (PSIRT).", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3518", "url": null, "sourceid": 95, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=t9RDCO1aL7", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 812, "modified": "2026-03-23T21:52:43.587993-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=t9RDCO1aL7", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3632, "uid": "35f4a8d465e6e1edc05f3d8ab658c551", "name": "VeriMoA: A Mixture-of-Agents Framework for Spec-to-HDL Generation", "authors": [{"id": 27902, "fullname": "Heng Ping", "url": "http://mlsys.org/api/miniconf/users/27902?format=json", "institution": "University of Southern California"}, {"id": 27903, "fullname": "Arijit Bhattacharjee", "url": "http://mlsys.org/api/miniconf/users/27903?format=json", "institution": "Iowa State University"}, {"id": 27904, "fullname": "Peiyu Zhang", "url": "http://mlsys.org/api/miniconf/users/27904?format=json", "institution": "University of Southern 
California"}, {"id": 27905, "fullname": "Shixuan Li", "url": "http://mlsys.org/api/miniconf/users/27905?format=json", "institution": "University of Southern California"}, {"id": 27906, "fullname": "Wei Yang", "url": "http://mlsys.org/api/miniconf/users/27906?format=json", "institution": "University of Southern California"}, {"id": 27907, "fullname": "Anzhe Cheng", "url": "http://mlsys.org/api/miniconf/users/27907?format=json", "institution": "University of Southern California"}, {"id": 27908, "fullname": "Xiaole Zhang", "url": "http://mlsys.org/api/miniconf/users/27908?format=json", "institution": "University of Southern California"}, {"id": 27909, "fullname": "Jesse Thomason", "url": "http://mlsys.org/api/miniconf/users/27909?format=json", "institution": "University of Southern California"}, {"id": 15789, "fullname": "Ali Jannesari", "url": "http://mlsys.org/api/miniconf/users/15789?format=json", "institution": "Iowa State University"}, {"id": 27910, "fullname": "Nesreen Ahmed", "url": "http://mlsys.org/api/miniconf/users/27910?format=json", "institution": "Cisco"}, {"id": 11952, "fullname": "Paul Bogdan", "url": "http://mlsys.org/api/miniconf/users/11952?format=json", "institution": "USC"}], "abstract": "Automation of Register Transfer Level (RTL) design can help developers meet increasing computational demands. Large Language Models (LLMs) show promise for Hardware Description Language (HDL) generation, but face challenges due to limited parametric knowledge and domain-specific constraints. While prompt engineering and fine-tuning have limitations in knowledge coverage and training costs, multi-agent architectures offer a training-free paradigm to enhance reasoning through collaborative generation. However, current multi-agent approaches suffer from two critical deficiencies: susceptibility to noise propagation and constrained reasoning space exploration. 
We propose \\textbf{VeriMoA}, a training-free mixture-of-agents (MoA) framework with two synergistic innovations. First, a \\textbf{quality-guided caching mechanism} that maintains all intermediate HDL outputs and enables quality-based ranking and selection across the entire generation process, encouraging knowledge accumulation over layers of reasoning. Second, a \\textbf{multi-path generation strategy} that leverages C++ and Python as intermediate representations, decomposing specification-to-HDL translation into two-stage processes that exploit LLM fluency in high-resource languages while promoting solution diversity. Comprehensive experiments on VerilogEval 2.0 and RTLLM 2.0 benchmarks demonstrate that VeriMoA achieves 15--30\\% improvements in Pass@1 across diverse LLM backbones, especially enabling smaller models to match larger models and fine-tuned alternatives without requiring costly training.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3632", "url": null, "sourceid": 78, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=5wgZXJ0kWA", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 926, "modified": "2026-03-23T21:52:48.099852-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=5wgZXJ0kWA", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3554, "uid": 
"ea5d2f1c4608232e07d3aa3d998e5135", "name": "Virtual Machine NUMA Placement at Scale: Learning the Norm, Shielding the Tail", "authors": [{"id": 26262, "fullname": "Yibo Zhao", "url": "http://mlsys.org/api/miniconf/users/26262?format=json", "institution": "Northeastern University"}, {"id": 27473, "fullname": "Tianyuan Wu", "url": "http://mlsys.org/api/miniconf/users/27473?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 27474, "fullname": "HUI XUE", "url": "http://mlsys.org/api/miniconf/users/27474?format=json", "institution": "Research, Microsoft"}, {"id": 27475, "fullname": "Qi Chen", "url": "http://mlsys.org/api/miniconf/users/27475?format=json", "institution": "Microsoft Research"}, {"id": 27476, "fullname": "Zhenhua Han", "url": "http://mlsys.org/api/miniconf/users/27476?format=json", "institution": "Microsoft"}, {"id": 27477, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27477?format=json", "institution": null}, {"id": 27478, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27478?format=json", "institution": null}, {"id": 27479, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27479?format=json", "institution": null}, {"id": 27480, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27480?format=json", "institution": null}, {"id": 27481, "fullname": "Jui-Hao Chiang", "url": "http://mlsys.org/api/miniconf/users/27481?format=json", "institution": "Microsoft"}, {"id": 27482, "fullname": "Mingxia Li", "url": "http://mlsys.org/api/miniconf/users/27482?format=json", "institution": "Alibaba Group"}, {"id": 16293, "fullname": "Yuqing Yang", "url": "http://mlsys.org/api/miniconf/users/16293?format=json", "institution": "Microsoft Research"}, {"id": 16145, "fullname": "Cheng Tan", "url": "http://mlsys.org/api/miniconf/users/16145?format=json", "institution": "Northeastern"}, {"id": 27483, "fullname": "Fan Yang", "url": "http://mlsys.org/api/miniconf/users/27483?format=json", 
"institution": "Research, Microsoft"}, {"id": 15807, "fullname": "Peng Cheng", "url": "http://mlsys.org/api/miniconf/users/15807?format=json", "institution": null}, {"id": 27484, "fullname": "Yongqiang Xiong", "url": "http://mlsys.org/api/miniconf/users/27484?format=json", "institution": "Microsoft Research"}, {"id": 27469, "fullname": "Lili Qiu", "url": "http://mlsys.org/api/miniconf/users/27469?format=json", "institution": "Microsoft Research Asia"}, {"id": 26227, "fullname": "Lidong Zhou", "url": "http://mlsys.org/api/miniconf/users/26227?format=json", "institution": "Microsoft"}], "abstract": "In modern data centers, servers organize memory and CPUs into Non-Uniform Memory Access (NUMA) nodes, where unequal memory-to-CPU proximity leads to varying memory latency. Hypervisors must carefully place Virtual Machines (VMs) to reduce remote memory access. Poor placements can lead to significant performance degradation\u2014sometimes up to 30%. However, achieving optimal placement at scale is challenging due to the large number of VM configurations, diverse NUMA structures, and evolving workload patterns. We present Catur, a NUMA placement system designed for large-scale cloud environments. Catur leverages reinforcement learning to learn from production data. Moreover, to address real-world challenges, Catur integrates several techniques: robust action space design to prevent model collapse, reward shaping to address learning inefficiency, drift-aware continuous training for evolving workload patterns, and speculative shielding to mitigate VM performance anomalies. 
Evaluations on production traces with 100 million VMs demonstrate that Catur reduces average resource defect by 34.2%\u201350.0% compared to state-of-the-art hypervisor policies.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3554", "url": null, "sourceid": 64, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=guCUThRvX5", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 848, "modified": "2026-03-23T21:52:44.998424-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=guCUThRvX5", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3589, "uid": "6c8349cc7260ae62e3b1396831a8398f", "name": "CSLE: A Reinforcement Learning Platform for Autonomous Security Management", "authors": [{"id": 27661, "fullname": "Kim Hammar", "url": "http://mlsys.org/api/miniconf/users/27661?format=json", "institution": "University of Melbourne"}], "abstract": "Reinforcement learning is a promising approach to autonomous and adaptive security management in networked systems. However, current reinforcement learning solutions for security management are mostly limited to simulation environments and it is unclear how they generalize to operational systems. 
In this paper, we address this limitation by presenting CSLE: a reinforcement learning platform for autonomous security management that enables experimentation under semi-operational conditions. Conceptually, CSLE encompasses two systems. First, it includes an emulation system that replicates key components of the target system in a virtualized environment. We use this system to gather measurements and logs, based on which we identify a system model, such as a Markov decision process. Second, it includes a simulation system where security strategies are efficiently learned through simulations of the system model. The learned strategies are then evaluated and refined in the emulation system to close the gap between theoretical and operational performance. We demonstrate CSLE through four use cases: flow control, replication control, segmentation control, and recovery control. Through these use cases, we show that CSLE enables near-optimal security management in a semi-operational environment.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3589", "url": null, "sourceid": 45, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=QGuRWjFsnm", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 883, "modified": "2026-03-23T21:52:46.423159-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=QGuRWjFsnm", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, 
"related_events": [], "related_events_ids": []}, {"id": 3571, "uid": "5ef059938ba799aaa845e1c2e8a762bd", "name": "MAC-Attention: a Match--Amend--Complete scheme for fast and accurate attention computation", "authors": [{"id": 19233, "fullname": "Jinghan Yao", "url": "http://mlsys.org/api/miniconf/users/19233?format=json", "institution": "The Ohio State University"}, {"id": 21021, "fullname": "Sam Jacobs", "url": "http://mlsys.org/api/miniconf/users/21021?format=json", "institution": "Microsoft"}, {"id": 27604, "fullname": "Walid Krichene", "url": "http://mlsys.org/api/miniconf/users/27604?format=json", "institution": "Microsoft"}, {"id": 27605, "fullname": "Masahiro Tanaka", "url": "http://mlsys.org/api/miniconf/users/27605?format=json", "institution": "Anyscale"}, {"id": 20998, "fullname": "Dhabaleswar Panda", "url": "http://mlsys.org/api/miniconf/users/20998?format=json", "institution": "Ohio State University"}], "abstract": "Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes via compression (lowering fidelity) or selection/eviction (restricting what remains accessible), which can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity and access-preserving alternative that accelerates decode by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with a fresh attention computed on the KV tail, via a numerically stable merge. On a match hit, the compute and bandwidth complexity is constant regardless of the context length. The method is model-agnostic, and composes with IO-aware kernels, paged-KV managers, and MQA/GQA. 
Across LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), MAC-Attention reduces KV accesses by up to 99%, cuts token generation latency by over 60% at 128K, and achieves over 14.3x attention-phase speedups (up to 2.6x end-to-end), while maintaining full-attention quality. By reusing computation rather than compressing or discarding tokens, MAC-Attention delivers long-context inference that is both fast and faithful. Code is available.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3571", "url": null, "sourceid": 118, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=b6HBRCejb7", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 865, "modified": "2026-03-23T21:52:45.719722-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=b6HBRCejb7", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3532, "uid": "14bfa6bb14875e45bba028a21ed38046", "name": "SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding", "authors": [{"id": 27322, "fullname": "Jameson Sandler", "url": "http://mlsys.org/api/miniconf/users/27322?format=json", "institution": "University of Virginia"}, {"id": 27323, "fullname": "Jacob K Christopher", "url": "http://mlsys.org/api/miniconf/users/27323?format=json", "institution": "University of Virginia, Charlottesville"}, {"id": 
27324, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27324?format=json", "institution": null}, {"id": 27325, "fullname": "Ferdinando Fioretto", "url": "http://mlsys.org/api/miniconf/users/27325?format=json", "institution": "University of Virginia"}], "abstract": "Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups.     Yet, current speculative decoding approaches remain limited by two fundamental bottlenecks: \\textbf{(1)} the autoregressive dependency during drafting which limits parallelism, and \\textbf{(2)} frequent rejections of draft tokens caused by misalignment between the draft and verify models.      This paper proposes \\emph{SpecDiff-2}, a novel framework to jointly address these two bottlenecks. It leverages discrete diffusion as a non-autoregressive drafter to address bottleneck (1) and develops novel techniques to calibrate discrete diffusion drafters with autoregressive verifiers, addressing bottleneck (2). 
Experimental results across a comprehensive benchmark suite show that \\emph{SpecDiff-2} achieves a new state-of-the-art across reasoning, coding, and mathematical benchmarks, improving tokens-per-second by up to an average of $+55\\%$ over previous baselines and obtaining up to $5.5\\times$ average speed-up over standard decoding, without any loss of accuracy.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3532", "url": null, "sourceid": 69, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=o42VU86ZsV", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 826, "modified": "2026-03-23T21:52:44.093141-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=o42VU86ZsV", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3625, "uid": "c20ad4d76fe97759aa27a0c99bff6710", "name": "IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference", "authors": [{"id": 25556, "fullname": "Wanli Zhong", "url": "http://mlsys.org/api/miniconf/users/25556?format=json", "institution": "Southern University of Science and Technology"}, {"id": 27848, "fullname": "Haibo Feng", "url": "http://mlsys.org/api/miniconf/users/27848?format=json", "institution": "Southern University of Science and Technology"}, {"id": 27849, "fullname": "Zirui Zhou", "url": 
"http://mlsys.org/api/miniconf/users/27849?format=json", "institution": "Southern University of Science and Technology"}, {"id": 27850, "fullname": "Hanyang Peng", "url": "http://mlsys.org/api/miniconf/users/27850?format=json", "institution": "Pengcheng Loboratory"}, {"id": 27851, "fullname": "Shiqi Yu", "url": "http://mlsys.org/api/miniconf/users/27851?format=json", "institution": "Southern University of Science and Technology, Shenzhen, China"}], "abstract": "Deploying Transformer models on edge devices is limited by latency and energy budgets. While INT8 quantization effectively accelerates the primary matrix multiplications, it exposes the softmax as the dominant bottleneck. This stage incurs a costly $\\mathrm{dequantize}\\rightarrow\\mathrm{softmax}\\rightarrow\\mathrm{requantize}$ detour, which can account for up to 65\\% of total attention latency and disrupts the end-to-end integer dataflow critical for edge hardware efficiency. To address this limitation, we present \\emph{IntAttention}, the first fully integer, plug-and-play attention pipeline without retraining. At the core of our approach lies \\emph{IndexSoftmax}, a hardware-friendly operator that replaces floating-point exponentials entirely within the integer domain. \\emph{IntAttention} integrates sparsity-aware clipping, a 32-entry lookup-table approximation, and direct integer normalization, thereby eliminating all datatype conversion overhead. We evaluate \\emph{IntAttention} and demonstrate consistent and substantial gains. Our method achieves up to \\textbf{3.7\u00d7} speedup and \\textbf{61\\%} energy reduction over FP16 baselines and \\textbf{2.0x} faster than conventional INT8 attention pipelines on Armv8 CPUs. 
These gains are achieved with high-fidelity accuracy comparable to baselines across diverse language and vision models, enabling practical and efficient Transformer inference on commodity edge devices.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3625", "url": null, "sourceid": 12, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=CPCRITwAaP", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 919, "modified": "2026-03-23T21:52:47.824478-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=CPCRITwAaP", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3542, "uid": "3c59dc048e8850243be8079a5c74d079", "name": "Cost-aware Duration Prediction for Software Upgrades in Datacenters", "authors": [{"id": 27360, "fullname": "Yi Ding", "url": "http://mlsys.org/api/miniconf/users/27360?format=json", "institution": "Purdue University"}, {"id": 14772, "fullname": "Henry (Hank) Hoffmann", "url": "http://mlsys.org/api/miniconf/users/14772?format=json", "institution": "The University of Chicago"}], "abstract": "Software upgrades are critical to maintaining server reliability in datacenters. While job duration prediction and scheduling have been extensively studied, the unique challenges posed by software upgrades remain largely under-explored. 
This paper presents the first in-depth investigation into software upgrade scheduling at datacenter scale. We begin by characterizing various types of upgrades and then frame the scheduling task as a constrained optimization problem. To address this problem, we introduce Zephyr, a cost-aware duration prediction framework designed to improve upgrade scheduling efficiency and throughput while meeting service-level objectives (SLOs). Zephyr accounts for asymmetric misprediction costs, strategically selects the best predictive models, and mitigates straggler-induced overestimations. Evaluations on Meta's production datacenter systems demonstrate that Zephyr significantly outperforms the existing upgrade scheduler by improving upgrade window utilization by 1.25x, increasing the number of scheduled and completed upgrades by 33% and 41%, and reducing cancellation rates by 2.4x. The code and data sets will be released after paper acceptance.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3542", "url": null, "sourceid": 21, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=l72e5oROLT", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 836, "modified": "2026-03-23T21:52:44.495969-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=l72e5oROLT", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3523, 
"uid": "e369853df766fa44e1ed0ff613f563bd", "name": "Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost", "authors": [{"id": 27248, "fullname": "Haojun Xia", "url": "http://mlsys.org/api/miniconf/users/27248?format=json", "institution": "University of Sydney"}, {"id": 23865, "fullname": "Xiaoxia Wu", "url": "http://mlsys.org/api/miniconf/users/23865?format=json", "institution": "TogtherAI"}, {"id": 25960, "fullname": "Jisen Li", "url": "http://mlsys.org/api/miniconf/users/25960?format=json", "institution": "University of illinois Urbana-Champaign"}, {"id": 27249, "fullname": "Rupert CQ Wu", "url": "http://mlsys.org/api/miniconf/users/27249?format=json", "institution": "AMD"}, {"id": 27250, "fullname": "Junxiong Wang", "url": "http://mlsys.org/api/miniconf/users/27250?format=json", "institution": "TogetherAI"}, {"id": 27251, "fullname": "Jue Wang", "url": "http://mlsys.org/api/miniconf/users/27251?format=json", "institution": "Together AI"}, {"id": 27252, "fullname": "Chenxi Li", "url": "http://mlsys.org/api/miniconf/users/27252?format=json", "institution": "Together AI"}, {"id": 27253, "fullname": "Aman Singhal", "url": "http://mlsys.org/api/miniconf/users/27253?format=json", "institution": "Together AI"}, {"id": 24132, "fullname": "Alay Dilipbhai Shah", "url": "http://mlsys.org/api/miniconf/users/24132?format=json", "institution": "Together AI"}, {"id": 27254, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27254?format=json", "institution": null}, {"id": 27255, "fullname": "Donglin Zhuang", "url": "http://mlsys.org/api/miniconf/users/27255?format=json", "institution": "The University of Sydney"}, {"id": 27256, "fullname": "Zhongzhu Zhou", "url": "http://mlsys.org/api/miniconf/users/27256?format=json", "institution": "Together.AI &amp; University of Sydney"}, {"id": 18231, "fullname": "Ben Athiwaratkun", "url": "http://mlsys.org/api/miniconf/users/18231?format=json", "institution": null}, {"id": 27257, 
"fullname": "Zhen Zheng", "url": "http://mlsys.org/api/miniconf/users/27257?format=json", "institution": "Microsoft"}, {"id": 14753, "fullname": "Shuaiwen Song", "url": "http://mlsys.org/api/miniconf/users/14753?format=json", "institution": "University of Sydney"}], "abstract": "The KV cache is a dominant memory bottleneck for LLM inference. While 4-bit KV quantization preserves accuracy, 2-bit often degrades it, especially on long-context reasoning. We close this gap via an algorithm\u2013system co-design for mixed-precision KV caching: \\emph{Kitty}. On the algorithm side, extensive experiments show that \\emph{Dynamic Channel-wise Precision Boost} \u2014 which ranks Key-cache channels by sensitivity and keeps only a small fraction at higher precision \u2014 maintains near-zero loss in accuracy drop while approaching 2-bit memory.  The main challenge is handling dynamic 4-bit channel boosts while keeping the page layout coalesced and the dequantization uniform, with no scattered reads or hard-coded masks. \\emph{Kitty} addresses these issues by decompose each mixed-precision Key page into two tensors with unified 2-bit precision. Based on this, Kitty provides a page-centric KV layout, Triton-compatible page dequantization kernels, and a lightweight runtime pipeline that preserves coalescing and avoids divergence. 
Across seven tasks and two model families (Qwen3, LLaMA3), \\emph{Kitty} cuts KV memory by nearly $8\\times$ with negligible accuracy loss, enabling up to $8\\times$ larger batches and $2.1\\times$\u2013$4.1\\times$ higher throughput under the same memory budget.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3523", "url": null, "sourceid": 34, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=r3mQiuYKIN", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 817, "modified": "2026-03-23T21:52:43.761494-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=r3mQiuYKIN", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3547, "uid": "3ef815416f775098fe977004015c6193", "name": "Search Your NVFP4 Scales!", "authors": [{"id": 27410, "fullname": "Tanmaey Gupta", "url": "http://mlsys.org/api/miniconf/users/27410?format=json", "institution": "Cornell University"}, {"id": 27411, "fullname": "Hayden Prairie", "url": "http://mlsys.org/api/miniconf/users/27411?format=json", "institution": "University of California, San Diego"}, {"id": 23865, "fullname": "Xiaoxia Wu", "url": "http://mlsys.org/api/miniconf/users/23865?format=json", "institution": "TogetherAI"}, {"id": 27412, "fullname": "Reyna Abhyankar", "url": "http://mlsys.org/api/miniconf/users/27412?format=json", "institution": "University of California, 
San Diego"}, {"id": 25635, "fullname": "Qingyang Wu", "url": "http://mlsys.org/api/miniconf/users/25635?format=json", "institution": "Together AI"}, {"id": 24490, "fullname": "Austin Silveria", "url": "http://mlsys.org/api/miniconf/users/24490?format=json", "institution": "University of California, San Diego"}, {"id": 27413, "fullname": "Pragaash Ponnusamy", "url": "http://mlsys.org/api/miniconf/users/27413?format=json", "institution": "Together AI"}, {"id": 27251, "fullname": "Jue Wang", "url": "http://mlsys.org/api/miniconf/users/27251?format=json", "institution": "Together AI"}, {"id": 18231, "fullname": "Ben Athiwaratkun", "url": "http://mlsys.org/api/miniconf/users/18231?format=json", "institution": null}, {"id": 14753, "fullname": "Shuaiwen Song", "url": "http://mlsys.org/api/miniconf/users/14753?format=json", "institution": "University of Sydney"}, {"id": 27414, "fullname": "Tri Dao", "url": "http://mlsys.org/api/miniconf/users/27414?format=json", "institution": "Princeton, TogetherAI"}, {"id": 27189, "fullname": "Daniel Fu", "url": "http://mlsys.org/api/miniconf/users/27189?format=json", "institution": "University of California, San Diego"}, {"id": 11289, "fullname": "Christopher De Sa", "url": "http://mlsys.org/api/miniconf/users/11289?format=json", "institution": "Cornell University"}], "abstract": "Quantization has emerged as a standard technique for accelerating inference for generative models by enabling faster low-precision computations and reduced memory transfers. Recently, GPU accelerators have added first-class support for microscaling Block Floating Point (BFP) formats.  Standard BFP algorithms use a fixed scale based on the maximum magnitude of the block. We observe that this scale choice can be suboptimal with respect to quantization errors. 
In this work, we propose \\textbf{ScaleSearch}, an alternative strategy for selecting these scale factors: using a fine-grained search leveraging the mantissa bits in microscaling formats to minimize the quantization error for the given distribution. \\textbf{ScaleSearch} can be integrated with existing quantization methods such as Post Training Quantization and low-precision attention, and is shown to improve their performance. Additionally, we introduce \\textbf{ScaleSearchAttention}, an accelerated NVFP4-based attention algorithm, which uses \\textbf{ScaleSearch} and adapted prior techniques to ensure near-zero performance loss for causal language modeling. Experiments show that \\textbf{ScaleSearch} improves language model weight PTQ by up to 7.5 points on GPQA (Qwen3-8B) and video generation on Mochi by up to 14 points in VQA-a over SageAttention3. \\textbf{ScaleSearchAttention} improves Wikitext-2 PPL by 0.9 points for Llama 3.1 70B.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3547", "url": null, "sourceid": 85, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=innqECyZPK", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 841, "modified": "2026-03-23T21:52:44.701711-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=innqECyZPK", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3543, 
"uid": "f899139df5e1059396431415e770c6dd", "name": "Beat the long tail: Distribution-Aware Speculative Decoding for RL Training", "authors": [{"id": 21038, "fullname": "ZELEI SHAO", "url": "http://mlsys.org/api/miniconf/users/21038?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 27361, "fullname": "Vikranth Srivatsa", "url": "http://mlsys.org/api/miniconf/users/27361?format=json", "institution": "University of California, San Diego"}, {"id": 27250, "fullname": "Junxiong Wang", "url": "http://mlsys.org/api/miniconf/users/27250?format=json", "institution": "TogetherAI"}, {"id": 24298, "fullname": "Chenfeng Xu", "url": "http://mlsys.org/api/miniconf/users/24298?format=json", "institution": "UC Berkeley"}, {"id": 23865, "fullname": "Xiaoxia Wu", "url": "http://mlsys.org/api/miniconf/users/23865?format=json", "institution": "TogtherAI"}, {"id": 27362, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27362?format=json", "institution": null}, {"id": 27254, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27254?format=json", "institution": null}, {"id": 25635, "fullname": "Qingyang Wu", "url": "http://mlsys.org/api/miniconf/users/25635?format=json", "institution": "Together AI"}, {"id": 27251, "fullname": "Jue Wang", "url": "http://mlsys.org/api/miniconf/users/27251?format=json", "institution": "Together AI"}, {"id": 18639, "fullname": "Ameen Patel", "url": "http://mlsys.org/api/miniconf/users/18639?format=json", "institution": "Together.ai"}, {"id": 26292, "fullname": "Yiying Zhang", "url": "http://mlsys.org/api/miniconf/users/26292?format=json", "institution": "UCSD and GenseeAI"}, {"id": 27363, "fullname": "Percy Liang", "url": "http://mlsys.org/api/miniconf/users/27363?format=json", "institution": "Stanford University"}, {"id": 18765, "fullname": "Tri Dao", "url": "http://mlsys.org/api/miniconf/users/18765?format=json", "institution": "Princeton University, Together AI"}, {"id": 18231, "fullname": "Ben 
Athiwaratkun", "url": "http://mlsys.org/api/miniconf/users/18231?format=json", "institution": null}, {"id": 18868, "fullname": "Ce Zhang", "url": "http://mlsys.org/api/miniconf/users/18868?format=json", "institution": null}], "abstract": "Reinforcement learning (RL) post-training has become essential for aligning large language models (LLMs), yet its efficiency is increasingly constrained by the rollout phase, where long trajectories are generated token by token. We identify a major bottleneck\u2014the long-tail distribution of rollout lengths, where a small fraction of long generations dominates wall-clock time\u2014and a complementary opportunity\u2014the availability of historical rollouts that reveal stable prompt-level patterns across training epochs. Motivated by these observations, we propose \\textbf{DAS, a Distribution-Aware Speculative decoding framework} that accelerates RL rollouts without altering model outputs. DAS integrates two key ideas: a \\textbf{self-evolving, nonparametric drafter} built from recent rollouts using an incrementally maintained suffix tree, and a \\textbf{length-aware speculation policy} that allocates more aggressive draft budgets to long trajectories that dominate makespan. This design exploits rollout history to sustain acceptance while balancing base and token-level costs during decoding. 
Experiments on math and code reasoning tasks show that DAS reduces rollout time by over 30\\% while preserving identical training curves, demonstrating that distribution-aware speculative decoding can significantly accelerate RL post-training without compromising learning quality.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3543", "url": null, "sourceid": 100, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=kMeqqPBjSl", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 837, "modified": "2026-03-23T21:52:44.534495-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=kMeqqPBjSl", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3531, "uid": "93db85ed909c13838ff95ccfa94cebd9", "name": "Efficient Long-Context Language Model Training by Core Attention Disaggregation", "authors": [{"id": 27316, "fullname": "Yonghao Zhuang", "url": "http://mlsys.org/api/miniconf/users/27316?format=json", "institution": "CMU, Carnegie Mellon University"}, {"id": 17900, "fullname": "Junda Chen", "url": "http://mlsys.org/api/miniconf/users/17900?format=json", "institution": "University of California San Diego"}, {"id": 27317, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27317?format=json", "institution": null}, {"id": 27318, "fullname": "Yi Gu", "url": 
"http://mlsys.org/api/miniconf/users/27318?format=json", "institution": "University of California, San Diego"}, {"id": 27319, "fullname": "Yibo Zhu", "url": "http://mlsys.org/api/miniconf/users/27319?format=json", "institution": "StepFun"}, {"id": 27320, "fullname": "Yimin Jiang", "url": "http://mlsys.org/api/miniconf/users/27320?format=json", "institution": "Anuttacon"}, {"id": 11118, "fullname": "Ion Stoica", "url": "http://mlsys.org/api/miniconf/users/11118?format=json", "institution": "UC Berkeley"}, {"id": 27321, "fullname": "Hao Zhang", "url": "http://mlsys.org/api/miniconf/users/27321?format=json", "institution": "University of California, San Diego"}, {"id": 16300, "fullname": "Eric Xing", "url": "http://mlsys.org/api/miniconf/users/16300?format=json", "institution": "MBZUAI, CMU, and Petuum Inc."}], "abstract": "We present core attention disaggregation (CAD), a technique that improves long-context LLM training by disaggregating the core attention (CA) -- the parameter-free $\\mathrm{softmax}(\\mathbf{QK}^{\\top})\\mathbf{V}$ computation -- and schedules it on an independent pool of resources. Existing systems co-locate core attention with other components. At long context, the quadratic growth of CA computation and near-linear growth of the rest create load imbalance -- hence stragglers across data and pipeline groups. CAD is enabled by two key observations: (i) \\emph{statelessness}: CA has no trainable parameters and minimal transient state, so balancing reduces to scheduling compute-bound tasks; and (ii) \\emph{composability}: modern attention kernels sustain high utilization on fused batches of arbitrary-length token-level shards.  CAD dynamically partitions the core attention computation into token-level tasks (CA-tasks), and dispatches them to a pool of devices specialized for CA computation (attention servers). It then rebatches CA-tasks to equalize CA compute across attention servers without loss of kernel efficiency. 
We have implemented CAD in a system called DistCA with a ping-pong scheme to completely overlap communication with compute, and in-place attention servers to improve memory utilization.  Scaling to 512 H200 GPUs and 512K context length, DistCA eliminates DP/PP stragglers, achieves near-perfect compute and memory balance, and improves end-to-end training throughput by up to 1.9\u00d7 over Megatron-LM and 1.35\u00d7 over existing load-balancing methods.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3531", "url": null, "sourceid": 86, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=oIonqkc8hM", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 825, "modified": "2026-03-23T21:52:44.055719-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=oIonqkc8hM", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3577, "uid": "a3c65c2974270fd093ee8a9bf8ae7d0b", "name": "ProTrain: Efficient LLM Training via Automatic Memory Management", "authors": [{"id": 27620, "fullname": "Hanmei Yang", "url": "http://mlsys.org/api/miniconf/users/27620?format=json", "institution": "UMass Amherst &amp; Meta"}, {"id": 27621, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27621?format=json", "institution": null}, {"id": 27622, "fullname": "Yao Fu", "url": 
"http://mlsys.org/api/miniconf/users/27622?format=json", "institution": "Advanced Micro Devices"}, {"id": 27623, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27623?format=json", "institution": null}, {"id": 27624, "fullname": "Ramine Roane", "url": "http://mlsys.org/api/miniconf/users/27624?format=json", "institution": "Advanced Micro Devices"}, {"id": 11878, "fullname": "Hui Guan", "url": "http://mlsys.org/api/miniconf/users/11878?format=json", "institution": "University of Massachusetts, Amherst; Amazon"}, {"id": 27625, "fullname": "Tongping Liu", "url": "http://mlsys.org/api/miniconf/users/27625?format=json", "institution": "XPeng Motors"}], "abstract": "Memory pressure has emerged as a dominant constraint in scaling the training of large language models (LLMs), particularly in resource-constrained environments. While modern frameworks incorporate various memory-saving techniques, they often expose low-level configuration knobs that require manual tuning and specialized system expertise. This not only adds engineering overhead but also risks suboptimal hardware utilization when misconfigured. This paper introduces ProTrain, a novel training system that automatically tailors memory management policies to the model architecture and underlying hardware resources, eliminating the need for manual intervention. The core of ProTrain is its automated memory management that abstracts complex memory management strategies into a few tunable configuration parameters, allowing searches for optimal parameter settings using cost models. ProTrain is equipped with a runtime profiler that provides precise estimates of latency, memory usage, and I/O bandwidth to build high-fidelity cost models.  ProTrain does not change the training algorithm and thus does not compromise accuracy. 
Experiments show that ProTrain improves training throughput by 1.43$\\times$ to 2.71$\\times$ compared to the state-of-the-art training systems.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3577", "url": null, "sourceid": 108, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=XDkOn0iTiH", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 871, "modified": "2026-03-23T21:52:45.964855-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=XDkOn0iTiH", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3555, "uid": "68d30a9594728bc39aa24be94b319d21", "name": "WAVE: A SYMBOLIC PYTHON DSL AND COMPILER FOR HIGH PERFORMANCE MACHINE LEARNING", "authors": [{"id": 18549, "fullname": "Harsh Menon", "url": "http://mlsys.org/api/miniconf/users/18549?format=json", "institution": "AMD"}, {"id": 27485, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27485?format=json", "institution": null}, {"id": 27486, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27486?format=json", "institution": null}, {"id": 26125, "fullname": "Martin P. 
L\u00fccke", "url": "http://mlsys.org/api/miniconf/users/26125?format=json", "institution": "AMD"}, {"id": 27188, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27188?format=json", "institution": null}, {"id": 27487, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27487?format=json", "institution": null}, {"id": 27488, "fullname": "Nithin Meganathan", "url": "http://mlsys.org/api/miniconf/users/27488?format=json", "institution": "AMD"}, {"id": 27147, "fullname": "Sanket Pandit", "url": "http://mlsys.org/api/miniconf/users/27147?format=json", "institution": "Advanced Micro Devices"}, {"id": 27489, "fullname": "William Gallard Hatch", "url": "http://mlsys.org/api/miniconf/users/27489?format=json", "institution": "AMD"}, {"id": 27490, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27490?format=json", "institution": null}, {"id": 27491, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27491?format=json", "institution": null}, {"id": 27492, "fullname": "Sahil FAIZAL", "url": "http://mlsys.org/api/miniconf/users/27492?format=json", "institution": "AMD"}, {"id": 27493, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27493?format=json", "institution": null}, {"id": 27494, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27494?format=json", "institution": null}], "abstract": "Modern ML models demand ever-greater compute, prompting hardware vendors to add specialized matrix cores to their GPUs. While these units unlock high throughput, they impose intricate programming models and addressing schemes that are difficult to manage by hand. This paper introduces Wave, a Python-embedded DSL for kernel authoring that automates these complex address computations and lets authors focus on core computation. 
In experiments, it matches or surpasses the performance of state-of-the-art kernel DSLs and libraries.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3555", "url": null, "sourceid": 84, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=gcXV1E8HRH", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 849, "modified": "2026-03-23T21:52:45.042858-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=gcXV1E8HRH", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3564, "uid": "ec5decca5ed3d6b8079e2e7e7bacc9f2", "name": "PLA-Serve: A Prefill-Length-Aware LLM Serving System", "authors": [{"id": 17078, "fullname": "Jianshu She", "url": "http://mlsys.org/api/miniconf/users/17078?format=json", "institution": "MBZUAI"}, {"id": 27572, "fullname": "Zonghang Li", "url": "http://mlsys.org/api/miniconf/users/27572?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 27573, "fullname": "HONGCHAO DU", "url": "http://mlsys.org/api/miniconf/users/27573?format=json", "institution": ""}, {"id": 27574, "fullname": "Shangyu Wu", "url": "http://mlsys.org/api/miniconf/users/27574?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 27575, "fullname": "Wenhao Zheng", "url": "http://mlsys.org/api/miniconf/users/27575?format=json", 
"institution": "University of North Carolina at Chapel Hill"}, {"id": 27576, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27576?format=json", "institution": null}, {"id": 27577, "fullname": "Zhengzhong Liu", "url": "http://mlsys.org/api/miniconf/users/27577?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 27578, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27578?format=json", "institution": null}, {"id": 27579, "fullname": "Chun Jason Xue", "url": "http://mlsys.org/api/miniconf/users/27579?format=json", "institution": "Mohamed bin Zayed University of Artificial Intelligence"}, {"id": 16309, "fullname": "Qirong Ho", "url": "http://mlsys.org/api/miniconf/users/16309?format=json", "institution": "MBZUAI"}], "abstract": "PLA-Serve identifies and disaggregates requests with different prompt lengths in LLM serving to reduce TTFT latency. While recent systems have decoupled the prefill and decode stages to improve throughput, they still rely on unified scheduling policies that fail to adapt to heterogeneous workload characteristics. We observe that prompt-length variations lead to distinct performance bottlenecks, motivating an adaptive scheduling strategy. PLA-Serve disaggregates multi-round long-prefill requests from short-prefill ones and introduces a length-aware smart batching mechanism for short-prefill workloads. It adopts a dual-queue design that supports temporal disaggregation on a single prefill instance or spatial disaggregation across multiple instances. For short-prefill batches, a batch waiting window and CUDA Graph\u2013based clustering mitigate interference from heterogeneous computation, reducing batching delay and lowering average latency. In real multi-turn workloads, PLA-Serve reduces short-prefill latency by over 30% compared to vanilla SGLang under prefill\u2013decode disaggregation, and decreases SLO violations by 28% in multi-instance deployments. 
Compared to the SGLang router with load balancing, it further lowers SLO violations by 12% in multi-GPU settings. Under high concurrency and mixed-request scenarios, PLA-Serve improves throughput by up to 35% for prefill instances, demonstrating its effectiveness in optimizing heterogeneous LLM serving workloads.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3564", "url": null, "sourceid": 127, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=dzjCkSEDyG", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 858, "modified": "2026-03-23T21:52:45.430681-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=dzjCkSEDyG", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3540, "uid": "b53b3a3d6ab90ce0268229151c9bde11", "name": "Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants", "authors": [{"id": 27353, "fullname": "Bozhi You", "url": "http://mlsys.org/api/miniconf/users/27353?format=json", "institution": "University of Texas at Austin"}, {"id": 26264, "fullname": "Irene Wang", "url": "http://mlsys.org/api/miniconf/users/26264?format=json", "institution": "Georgia Institute of Technology"}, {"id": 27354, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27354?format=json", "institution": null}, {"id": 27355, "fullname": "Abhinav Jangda", "url": 
"http://mlsys.org/api/miniconf/users/27355?format=json", "institution": "Microsoft"}, {"id": 27356, "fullname": "Ang\u00e9lica Moreira", "url": "http://mlsys.org/api/miniconf/users/27356?format=json", "institution": "Research, Microsoft"}, {"id": 27357, "fullname": "Roshan Dathathri", "url": "http://mlsys.org/api/miniconf/users/27357?format=json", "institution": "Microsoft Research"}, {"id": 27358, "fullname": "Divya Mahajan", "url": "http://mlsys.org/api/miniconf/users/27358?format=json", "institution": "Georgia Institute of Technology"}, {"id": 27359, "fullname": "Keshav Pingali", "url": "http://mlsys.org/api/miniconf/users/27359?format=json", "institution": ", University of Texas, Austin"}], "abstract": "Attention is a fundamental building block of large language models (LLMs), so there have been many efforts to implement it efficiently. For example, FlashAttention leverages tiling and kernel fusion to optimize attention. Recently, a number of variants of attention have been introduced to enhance model quality or efficiency. Supporting them efficiently remains difficult since they usually require specialized kernels or hand-tuned implementations. FlexAttention recently addressed part of this gap by using static programming templates to support FlashAttention-like kernels for a subset of attention variants.  In this paper, we introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs, without relying on static templates or predefined kernel specializations. Flashlight leverages PyTorch\u2019s compilation workflow to fuse and tile attention computations transparently, enabling efficient execution for diverse attention patterns. Not only does it support all variants expressible in the FlexAttention model but it also handles more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention.  
Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention, while offering the flexibility of native PyTorch code, enabling developers to rapidly explore  new attention models without sacrificing performance.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3540", "url": null, "sourceid": 55, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=lboOMA8XWr", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 834, "modified": "2026-03-23T21:52:44.426020-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=lboOMA8XWr", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3591, "uid": "aab3238922bcc25a6f606eb525ffdc56", "name": "Rethinking DVFS for Mobile LLMs: Unified Energy-Aware Scheduling with CORE", "authors": [{"id": 27664, "fullname": "Zongpu Zhang", "url": "http://mlsys.org/api/miniconf/users/27664?format=json", "institution": "Purdue University"}, {"id": 27665, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27665?format=json", "institution": null}, {"id": 27666, "fullname": "Y. 
Charlie Hu", "url": "http://mlsys.org/api/miniconf/users/27666?format=json", "institution": "Purdue University"}, {"id": 27667, "fullname": "Qiang Xu", "url": "http://mlsys.org/api/miniconf/users/27667?format=json", "institution": "NVIDIA"}, {"id": 27668, "fullname": "Jian Li", "url": "http://mlsys.org/api/miniconf/users/27668?format=json", "institution": "Shanghai Jiao Tong University"}, {"id": 27669, "fullname": "Haibing Guan", "url": "http://mlsys.org/api/miniconf/users/27669?format=json", "institution": "Shanghai Jiao Tong University"}], "abstract": "Despite the rapid adoption of large language models (LLMs) in mobile applications, deploying them efficiently on resource-constrained devices remains challenging due to limited compute, memory, and energy constraints. In this paper, we first evaluate the energy efficiency of state-of-the-art mobile LLM frameworks across multiple models and uncover a key inefficiency: the default governors make independent decisions which can result in 23.0\u201340.4% longer latency or 5.0\u201316.6% higher energy use compared to optimal frequency combinations. We then conduct an in-depth analysis to reveal the root cause\u2013the lack of cross-resource coordination of these governors during prefilling and decoding. Building on these findings, we present CORE, a unified, energy-aware governor that jointly coordinates CPU, GPU, and memory frequencies for mobile LLM inference. 
Experiments across diverse LLMs show that CORE reduces time-to-first-token by 7.0\u201316.9% and time-per-token by 25.4\u201336.8% on average, without increasing energy per token.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3591", "url": null, "sourceid": 14, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=PSyHQ8kVUT", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 885, "modified": "2026-03-23T21:52:46.497723-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=PSyHQ8kVUT", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3546, "uid": "1ff1de774005f8da13f42943881c655f", "name": "FaaScale: Unlocking Fast LLM Scaling for Serverless Inference", "authors": [{"id": 27398, "fullname": "Minchen Yu", "url": "http://mlsys.org/api/miniconf/users/27398?format=json", "institution": "The Chinese University of Hong Kong, Shenzhen"}, {"id": 27399, "fullname": "Rui Yang", "url": "http://mlsys.org/api/miniconf/users/27399?format=json", "institution": "University of Virginia"}, {"id": 27400, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27400?format=json", "institution": null}, {"id": 27401, "fullname": "Zhaoyuan Su", "url": "http://mlsys.org/api/miniconf/users/27401?format=json", "institution": "University of Virginia, Charlottesville"}, {"id": 27402, "fullname": "Sheng Yao", "url": 
"http://mlsys.org/api/miniconf/users/27402?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 27403, "fullname": "Tingfeng Lan", "url": "http://mlsys.org/api/miniconf/users/27403?format=json", "institution": "University of Virginia, Charlottesville"}, {"id": 27404, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27404?format=json", "institution": null}, {"id": 27405, "fullname": "Zirui Wang", "url": "http://mlsys.org/api/miniconf/users/27405?format=json", "institution": "University of Virginia, Charlottesville"}, {"id": 27406, "fullname": "Yue Cheng", "url": "http://mlsys.org/api/miniconf/users/27406?format=json", "institution": "University of Virginia, Charlottesville"}, {"id": 27407, "fullname": "Wei Wang", "url": "http://mlsys.org/api/miniconf/users/27407?format=json", "institution": "HKUST"}, {"id": 27408, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27408?format=json", "institution": null}, {"id": 27409, "fullname": "Ruichuan Chen", "url": "http://mlsys.org/api/miniconf/users/27409?format=json", "institution": "Nokia Bell Labs"}], "abstract": "Serverless computing is an attractive paradigm for cloud-based large language model (LLM) inference, but scaling LLMs on demand remains a major challenge due to high data transfer cost. We present FaaScale, a serverless LLM system that enables fast and resource-efficient model scaling. The key idea is a co-design principle\u2014pipelined multicast inference\u2014which synergizes multicast with dynamic, cross-node pipeline-parallel execution during model transfer. FaaScale implements this design through PipeCast, a model scaling scheme that adaptively multicasts model blocks and dynamically forms inference pipelines on the fly. 
Coupled with efficient memory management across GPU and host memory, FaaScale handles bursty LLM inference workloads effectively, achieving up to 5\u00d7 lower tail time-to-first-token latency and 31.3% cost reduction on real-world LLM traces.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3546", "url": null, "sourceid": 24, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=jgL8LuOVyT", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 840, "modified": "2026-03-23T21:52:44.651887-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=jgL8LuOVyT", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3602, "uid": "7f39f8317fbdb1988ef4c628eba02591", "name": "HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments", "authors": [{"id": 26252, "fullname": "Yongjun He", "url": "http://mlsys.org/api/miniconf/users/26252?format=json", "institution": "ETH Zurich"}, {"id": 12135, "fullname": "Shuai Zhang", "url": "http://mlsys.org/api/miniconf/users/12135?format=json", "institution": "Amazon Web Services"}, {"id": 27747, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27747?format=json", "institution": null}, {"id": 27748, "fullname": "Xiyuan Zhang", "url": "http://mlsys.org/api/miniconf/users/27748?format=json", "institution": "AWS"}, {"id": 27749, "fullname": "Boran 
Han", "url": "http://mlsys.org/api/miniconf/users/27749?format=json", "institution": "Amazon/AWS"}, {"id": 27750, "fullname": "Bernie Wang", "url": "http://mlsys.org/api/miniconf/users/27750?format=json", "institution": "Amazon"}, {"id": 27751, "fullname": "Huzefa Rangwala", "url": "http://mlsys.org/api/miniconf/users/27751?format=json", "institution": "Siemens"}, {"id": 27637, "fullname": "George Karypis", "url": "http://mlsys.org/api/miniconf/users/27637?format=json", "institution": "University of Minnesota, Minneapolis"}], "abstract": "As large language models (LLMs) scale and new GPUs are released ever more frequently, there is an increasing demand for LLM post-training in heterogeneous environments to fully leverage underutilized mid-range or previous-generation GPUs across regions and alleviate the shortage of homogeneous high-end GPUs in a single region. However, achieving high-performance reinforcement learning (RL) training for LLMs on such computing resources remains challenging because the workflow involves multiple models and tasks with complex computation and data dependencies. In this paper, we present HetRL, a distributed system for efficient RL training in infrastructures with heterogeneous GPUs and networks. HetRL formulates RL training scheduling in heterogeneous environments as a constrained joint optimization problem and introduces a novel scheduling algorithm that (1) decomposes the complex search space with a multi-level search framework; and (2) allocates the search budget via successive halving.
Our extensive evaluation, consuming 20,000 GPU-hours, shows that HetRL delivers up to 9.17\u00d7 (3.17\u00d7 on average) the throughput of state-of-the-art systems under various workloads and settings.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3602", "url": null, "sourceid": 61, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=LRLyuaz1W7", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 896, "modified": "2026-03-23T21:52:46.907660-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=LRLyuaz1W7", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3552, "uid": "e2c420d928d4bf8ce0ff2ec19b371514", "name": "MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training", "authors": [{"id": 27464, "fullname": "Wenxuan Li", "url": "http://mlsys.org/api/miniconf/users/27464?format=json", "institution": "Microsoft"}, {"id": 27465, "fullname": "Chengruidong Zhang", "url": "http://mlsys.org/api/miniconf/users/27465?format=json", "institution": "Alibaba Group"}, {"id": 27466, "fullname": "Huiqiang Jiang", "url": "http://mlsys.org/api/miniconf/users/27466?format=json", "institution": "Qwen"}, {"id": 27467, "fullname": "Yucheng Li", "url": "http://mlsys.org/api/miniconf/users/27467?format=json", "institution": "Alibaba Group"}, {"id": 27468, "fullname": "", "url": 
"http://mlsys.org/api/miniconf/users/27468?format=json", "institution": null}, {"id": 27469, "fullname": "Lili Qiu", "url": "http://mlsys.org/api/miniconf/users/27469?format=json", "institution": "Microsoft Research Asia"}], "abstract": "The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long-context processing. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts\u2014especially in distributed settings\u2014remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a distributed sparse index approximation algorithm, balanced sparse ring attention, and hierarchical sparse ring attention. These components are designed to synergistically address the computational imbalance and communication overheads inherent in dynamic sparse attention mechanisms when training LLMs with extensive context lengths. We demonstrate the efficacy of MTraining mainly by training Qwen2.5-3B and Llama-3.1-8B, successfully expanding their context windows from 32K/128K to 512K tokens on a cluster of 32x A100 GPUs.
Our evaluations on a comprehensive suite of downstream tasks, including RULER, PG-19, InfiniteBench, and NIAH, reveal that MTraining achieves up to a 6x higher training throughput while preserving model accuracy.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3552", "url": null, "sourceid": 71, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=h6SD2zgwGq", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 846, "modified": "2026-03-23T21:52:44.915489-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=h6SD2zgwGq", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3574, "uid": "a87ff679a2f3e71d9181a67b7542122c", "name": "Spira: Exploiting Voxel Data Structural Properties for Efficient Sparse Convolution in Point Cloud Networks", "authors": [{"id": 27613, "fullname": "Dionysios Adamopoulos", "url": "http://mlsys.org/api/miniconf/users/27613?format=json", "institution": null}, {"id": 27614, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27614?format=json", "institution": null}, {"id": 27615, "fullname": "Georgios Goumas", "url": "http://mlsys.org/api/miniconf/users/27615?format=json", "institution": "National Technical University of Athens"}, {"id": 16879, "fullname": "Christina Giannoula", "url": "http://mlsys.org/api/miniconf/users/16879?format=json", "institution": "Max Planck 
Institute for Software Systems (MPI-SWS)"}], "abstract": "Sparse Convolution (SpC) powers 3D point cloud networks widely used in autonomous driving and AR/VR. SpC builds a kernel map that stores mappings between input voxel coordinates, output coordinates, and weight offsets, then uses this map to compute feature vectors for output coordinates. Our work identifies three key properties of voxel coordinates: they are integer-valued, bounded within a limited spatial range, and geometrically continuous\u2014neighboring voxels on the same object surface are highly likely to exist at small spatial offsets from each other. Prior SpC engines do not fully exploit these properties and suffer from high pre-processing and post-processing overheads during kernel map construction. To address this, we design Spira, the first voxel-property-aware SpC engine for GPUs. Spira proposes: (i) a high-performance one-shot search algorithm that builds the kernel map with no preprocessing and high memory locality, (ii) an effective packed-native processing scheme that accesses packed voxel coordinates at low cost, (iii) a flexible dual-dataflow execution mechanism that efficiently computes output feature vectors by adapting to layer characteristics, and (iv) a network-wide parallelization strategy that builds kernel maps for all SpC layers concurrently at network start. 
Our evaluation shows that Spira significantly outperforms prior SpC engines by 1.71\u00d7 on average and up to 2.31\u00d7 for end-to-end inference, and by 2.13\u00d7 on average and up to 3.32\u00d7 for layer-wise execution across diverse layer configurations.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3574", "url": "https://github.com/SPIN-Research-Group/Spira", "sourceid": 4, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=YQMilw805Q", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 868, "modified": "2026-03-23T21:52:45.845629-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=YQMilw805Q", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3521, "uid": "e4da3b7fbbce2345d7772b0674a318d5", "name": "TokenBlend: Accelerating Tensor Parallelism LLM Inference Through Efficient Compute-Communication Overlap", "authors": [{"id": 19241, "fullname": "Raja Gond", "url": "http://mlsys.org/api/miniconf/users/19241?format=json", "institution": null}, {"id": 17734, "fullname": "Nipun Kwatra", "url": "http://mlsys.org/api/miniconf/users/17734?format=json", "institution": "Microsoft Research India"}, {"id": 16198, "fullname": "Ramachandran Ramjee", "url": "http://mlsys.org/api/miniconf/users/16198?format=json", "institution": "Microsoft Research"}], "abstract": "Distributed inference of large 
language models (LLMs) using tensor parallelism can introduce communication overheads of $20\\%$ even over GPUs connected via NVLink. Several techniques have been proposed to mitigate these overheads by decomposing computations into smaller tasks and overlapping communication with these computation subtasks. However, as of this writing, none of the open-source LLM serving systems (vLLM, SGLANG, TensorRT-LLM) support compute-communication overlap for LLMs served using tensor parallelism. This is because the number of tokens processed per iteration is kept small to support low latency serving and decomposing these smaller workloads to enable communication overlap results in worse performance.   We present TOKENBLEND, the first system to enable efficient compute-communication overlap for tensor-parallel models for token lengths as small as 1024. TOKENBLEND identifies RMSNorm, a previously overlooked operation, as crucial and optimizes it along with communication by implementing a novel fused \\textbf{AllReduce--RMSNorm} kernel. Further, this kernel leverages the multimem feature available on modern GPUs (e.g., Hopper, Blackwell) to jointly perform communication and RMSNorm efficiently using only 2--8 SMs. Our evaluations demonstrate up to $\\boldsymbol{1.28\\times}$ speedup in latency and $\\boldsymbol{1.19\\times}$ higher throughput across multiple models and workloads. In several settings, TOKENBLEND delivers \\textit{better performance than an equivalent model with all communication removed}. 
The source code of TOKENBLEND is available at https://anonymous.4open.science/r/tokenblend-mlsys/.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3521", "url": null, "sourceid": 5, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=rh2Ylffkq6", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 815, "modified": "2026-03-23T21:52:43.699125-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=rh2Ylffkq6", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3586, "uid": "8f14e45fceea167a5a36dedd4bea2543", "name": "SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips", "authors": [{"id": 25942, "fullname": "Jiahuan Yu", "url": "http://mlsys.org/api/miniconf/users/25942?format=json", "institution": "University of Illinois Urbana-Champaign"}, {"id": 27649, "fullname": "Mingtao Hu", "url": "http://mlsys.org/api/miniconf/users/27649?format=json", "institution": "University of Illinois at Urbana-Champaign"}, {"id": 27650, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27650?format=json", "institution": null}, {"id": 19015, "fullname": "Minjia Zhang", "url": "http://mlsys.org/api/miniconf/users/19015?format=json", "institution": "UIUC"}], "abstract": "Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level 
Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. We present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) with tightly coupled GPU-CPU architecture via NVLink-C2C. SuperInfer introduces RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and DuplexKV, a high-performance rotation engine that enables full-duplex transfer over NVLink-C2C. Evaluations on GH200 using various models and datasets show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining comparable TBT and throughput compared to state-of-the-art systems, demonstrating that SLO-aware scheduling and memory co-design unlocks the full potential of Superchips for responsive LLM serving.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3586", "url": null, "sourceid": 7, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=RuslSHdIHa", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 880, "modified": "2026-03-23T21:52:46.304024-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=RuslSHdIHa", "resourcetype": 
"UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3595, "uid": "6512bd43d9caa6e02c990b0a82652dca", "name": "Once-for-All Channel Mixers (HyperTinyPW): Generative Compression for TinyML", "authors": [{"id": 27699, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27699?format=json", "institution": null}], "abstract": "Deploying neural networks on microcontrollers is constrained by kilobytes of flash and SRAM, where 1x1 pointwise (PW) channel mixers often dominate memory even after INT8 quantization. We present HyperTinyPW, a compression-as-generation approach that replaces most stored PW weights with generated weights. A shared micro-MLP synthesizes PW kernels once at load time from tiny per-layer codes; the kernels are cached and then executed with standard integer operators, so the deployment stack stays unchanged. A shared latent basis across layers reduces redundancy, and keeping the first PW layer in INT8 stabilizes early morphology-sensitive mixing. Our contributions are: (1) TinyML-faithful packed-byte accounting that includes the generator, heads or factorization, per-layer codes, the kept first PW layer, and the backbone; (2) a unified evaluation protocol with a validation-tuned threshold (t*) and bootstrap confidence intervals; and (3) a deployability analysis covering integer-only inference and boot-versus-lazy synthesis trade-offs. On three ECG benchmarks (Apnea-ECG, PTB-XL, MIT-BIH), HyperTinyPW shifts the macro-F1 versus flash Pareto frontier: at about 225 kB it matches a ~1.4 MB CNN while being 6.31x smaller (84.15% fewer bytes), retaining at least 95% of large-model macro-F1. Under 32-64 kB budgets it sustains balanced detection where compact baselines degrade. 
The mechanism applies broadly to other 1D biosignals, on-device speech, and embedded sensing tasks where per-layer redundancy dominates, suggesting a wider role for compression-as-generation in resource-constrained ML systems.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3595", "url": null, "sourceid": 11, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=NrDa5Fu10D", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 889, "modified": "2026-03-23T21:52:46.652716-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=NrDa5Fu10D", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3584, "uid": "c51ce410c124a10e0db5e4b97fc2af39", "name": "RDMA Point-to-Point Communication for LLM Systems", "authors": [{"id": 27645, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27645?format=json", "institution": null}, {"id": 23368, "fullname": "Kevin Hu", "url": "http://mlsys.org/api/miniconf/users/23368?format=json", "institution": "Perplexity AI"}, {"id": 23407, "fullname": "Vladimir Zaytsev", "url": "http://mlsys.org/api/miniconf/users/23407?format=json", "institution": "Perplexity AI"}, {"id": 21008, "fullname": "Lequn Chen", "url": "http://mlsys.org/api/miniconf/users/21008?format=json", "institution": "Perplexity AI"}], "abstract": "Emerging Large Language Model (LLM) system patterns, such as 
disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point communication beyond simple collectives. Existing implementations are locked to specific Network Interface Controllers (NICs), hindering integration into inference engines and portability across hardware providers. We present TransferEngine, which bridges the functionality of common NICs to expose a uniform interface. TransferEngine exposes one-sided WriteImm operations with an ImmCounter primitive for completion notification, without ordering assumptions about the network transport, transparently managing multiple NICs per GPU. We demonstrate peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS Elastic Fabric Adapter (EFA). We showcase TransferEngine through three production systems: (1) KvCache transfer for disaggregated inference with dynamic scaling, (2) RL weight updates achieving 1.3 seconds for trillion-parameter models, and (3) MoE dispatch/combine implementation exceeding DeepEP decode latency on ConnectX-7, with the first viable latencies on EFA. 
We demonstrate that our portable point-to-point communication complements collectives while avoiding lock-in.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3584", "url": null, "sourceid": 13, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=SjVa05wEiY", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 878, "modified": "2026-03-23T21:52:46.236179-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=SjVa05wEiY", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3510, "uid": "6f4922f45568161a8cdf4ad2299f6d23", "name": "Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding", "authors": [{"id": 20906, "fullname": "Yilong Zhao", "url": "http://mlsys.org/api/miniconf/users/20906?format=json", "institution": "University of California, Berkeley"}, {"id": 21005, "fullname": "Jiaming Tang", "url": "http://mlsys.org/api/miniconf/users/21005?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 17683, "fullname": "Kan Zhu", "url": "http://mlsys.org/api/miniconf/users/17683?format=json", "institution": "University of Washington"}, {"id": 12026, "fullname": "Zihao Ye", "url": "http://mlsys.org/api/miniconf/users/12026?format=json", "institution": "University of Washington"}, {"id": 27178, "fullname": "Chi-Chih Chang", "url": 
"http://mlsys.org/api/miniconf/users/27178?format=json", "institution": "Cornell University"}, {"id": 27179, "fullname": "Chaofan Lin", "url": "http://mlsys.org/api/miniconf/users/27179?format=json", "institution": "Tsinghua University"}, {"id": 27180, "fullname": "Jongseok Park", "url": "http://mlsys.org/api/miniconf/users/27180?format=json", "institution": "University of California, Berkeley"}, {"id": 17675, "fullname": "Guangxuan Xiao", "url": "http://mlsys.org/api/miniconf/users/17675?format=json", "institution": "MIT"}, {"id": 17625, "fullname": "Mohamed Abdelfattah", "url": "http://mlsys.org/api/miniconf/users/17625?format=json", "institution": "Cornell University"}, {"id": 11143, "fullname": "Mingyu Gao", "url": "http://mlsys.org/api/miniconf/users/11143?format=json", "institution": "Tsinghua University"}, {"id": 17670, "fullname": "Baris Kasikci", "url": "http://mlsys.org/api/miniconf/users/17670?format=json", "institution": "University of Michigan"}, {"id": 12133, "fullname": "Song Han", "url": "http://mlsys.org/api/miniconf/users/12133?format=json", "institution": "MIT"}, {"id": 11118, "fullname": "Ion Stoica", "url": "http://mlsys.org/api/miniconf/users/11118?format=json", "institution": "UC Berkeley"}], "abstract": "Reasoning language models have demonstrated remarkable capabilities on challenging tasks by generating elaborate chain-of-thought (CoT) solutions. However, such lengthy generation shifts the inference bottleneck from compute-bound to memory-bound. To generate each token, the model applies full attention to all previously generated tokens, requiring memory access to an increasingly large KV-Cache. Consequently, longer generations demand more memory access for every step, leading to substantial pressure on memory bandwidth.   To address this, we introduce SpecGen, a speculative decoding framework that reuses the same model as the draft and target models (i.e., self-speculation). 
SpecGen features a novel sparse attention mechanism \\textit{PillarAttn} as the draft model, which accurately selects critical tokens via elegantly reusing information from the verification stage. Furthermore, SpecGen co-designs self-speculation with three system innovations: (1) a unified scheduler to batch token drafting and verification, (2) delayed verification for CPU/GPU overlap, and (3) dynamic KV-Cache management to maximize memory utilization. Across various models and datasets, SpecGen outperforms state-of-the-art solutions, with an up to $2.13\\times$ throughput speedup.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3510", "url": null, "sourceid": 18, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=yeqrwcWjPu", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 804, "modified": "2026-03-23T21:52:43.314747-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=yeqrwcWjPu", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3623, "uid": "1f0e3dad99908345f7439f8ffabdffc4", "name": "HELIOS : Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving", "authors": [{"id": 25962, "fullname": "Avinash Kumar", "url": "http://mlsys.org/api/miniconf/users/25962?format=json", "institution": "The University of Texas at Austin"}, {"id": 19188, "fullname": "Shashank Nag", "url": 
"http://mlsys.org/api/miniconf/users/19188?format=json", "institution": "The University of Texas at Austin"}, {"id": 27844, "fullname": "Jason Clemons", "url": "http://mlsys.org/api/miniconf/users/27844?format=json", "institution": "NVIDIA"}, {"id": 12633, "fullname": "LIZY JOHn", "url": "http://mlsys.org/api/miniconf/users/12633?format=json", "institution": "UT-Austin"}, {"id": 27845, "fullname": "Poulami Das", "url": "http://mlsys.org/api/miniconf/users/27845?format=json", "institution": "University of Texas at Austin"}], "abstract": "Early-Exit Large Language Models (EE-LLMs) enable high throughput inference by allowing tokens to exit early at intermediate layers. However, their throughput is limited by the computational and memory savings. Existing EE-LLM frameworks rely on a single model and therefore, their token generation latencies are bottlenecked by tokens that do not exit early and traverse additional layers. Moreover, early exits are only known at runtime and depend on the request. Therefore, these frameworks load the weights of all model layers even though large portions remain unused when tokens exit early. The lack of memory savings limit us from scaling the batch sizes.   We propose \\textit{HELIOS}, a framework that improves both token generation latency and batch sizes to enable high-throughput in EE-LLMs. HELIOS exploits two insights. \\textit{First}, early exits are often complimentary across models, tokens that do not exit early on one model often take an early-exit on another. HELIOS employs multiple models and dynamically switches between them to collectively maximize the number of tokens that exit early, and minimize token generation latencies. \\textit{Second}, even when a predicted token does not exit early due to poor confidence, it often remains unchanged even after additional layer traversal. 
HELIOS greedily allows such tokens to exit early and only loads the weights of the layers most likely to be used, yielding memory savings which are then re-purposed to increase batch sizes. HELIOS employs real-time profiling to accurately identify the early-exit distributions, and adaptively switches between models by tracking tokens in real-time to minimize the performance degradation caused by greedy model loading and exiting. Our evaluations show that HELIOS achieves $1.48\\times$ higher throughput and $15.14\\times$ larger batch size compared to existing EE-LLM frameworks.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3623", "url": null, "sourceid": 19, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=CV52m9NJFK", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 917, "modified": "2026-03-23T21:52:47.742568-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=CV52m9NJFK", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3642, "uid": "6364d3f0f495b6ab9dcf8d3b5c6e0b01", "name": "OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents", "authors": [{"id": 27412, "fullname": "Reyna Abhyankar", "url": "http://mlsys.org/api/miniconf/users/27412?format=json", "institution": "University of California, San Diego"}, {"id": 27985, "fullname": "Qi Qi", "url": 
"http://mlsys.org/api/miniconf/users/27985?format=json", "institution": "University of California, San Diego"}, {"id": 26292, "fullname": "Yiying Zhang", "url": "http://mlsys.org/api/miniconf/users/26292?format=json", "institution": "UCSD and GenseeAI"}], "abstract": "Generative AI is being leveraged to solve a variety of computer-use tasks involving desktop applications. State-of-the-art systems have focused solely on improving accuracy on leading benchmarks. However, these systems are practically unusable due to extremely high end-to-end latency (e.g., tens of minutes) for tasks that typically take humans just a few minutes to complete. To understand the cause behind this and to guide future developments of computer agents, we conduct the first study on the temporal performance of computer-use agents on OSWorld, the flagship benchmark in computer-use AI. We find that large model calls for planning, reflection, and judging account for most of the overall latency, and as an agent uses more steps to complete a task, each successive step can take 3x longer than steps at the beginning of a task. We then construct OSWorld-Human, a manually annotated version of the original OSWorld dataset that contains a human-determined trajectory for each task. 
We evaluate 16 agents on their efficiency using OSWorld-Human and find that even the best agents take 1.5-2.4x more steps than necessary.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3642", "url": null, "sourceid": 32, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=0Cp8l6cvyq", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 936, "modified": "2026-03-23T21:52:48.518354-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=0Cp8l6cvyq", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3618, "uid": "d67d8ab4f4c10bf22aa353e27879133c", "name": "CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training", "authors": [{"id": 27826, "fullname": "Soroush Tabesh", "url": "http://mlsys.org/api/miniconf/users/27826?format=json", "institution": "Institute of Science and Technology Austria"}, {"id": 27827, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27827?format=json", "institution": null}, {"id": 27828, "fullname": "Andrei Panferov", "url": "http://mlsys.org/api/miniconf/users/27828?format=json", "institution": "Institute of Science and Technology Austria"}, {"id": 27829, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27829?format=json", "institution": null}, {"id": 27830, "fullname": "", "url": 
"http://mlsys.org/api/miniconf/users/27830?format=json", "institution": null}], "abstract": "Despite significant work on low-bit quantization-aware training (QAT), there is still an accuracy gap between such techniques and native training. To address this, we introduce CAGE (Curvature-Aware Gradient Estimation), a new QAT method that augments the straight-through estimator (STE) gradient with a curvature-aware correction designed to counteract the loss increase induced by quantization. CAGE is derived from a multi-objective view of QAT that balances loss minimization with adherence to quantization constraints, yielding a principled correction term that depends on local curvature information.  On the theoretical side, we introduce the notion of Pareto-optimal solutions for quantized optimization, and establish that CAGE yields strong convergence guarantees in the smooth non-convex setting. In terms of implementation, our approach is optimizer-agnostic, but we provide a highly-efficient implementation that leverages Adam statistics.  
CAGE significantly improves upon the prior state-of-the-art methods in terms of accuracy, for similar computational cost: for QAT fine-tuning, it halves the compression accuracy loss relative to the prior best method, while for QAT pre-training of Llama models, its accuracy for 3-bit weights-and-activations (W3A3) matches that of 4-bit training (W4A4) with the prior best method (QuEST).", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3618", "url": null, "sourceid": 39, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=Fubm1TtWeo", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 912, "modified": "2026-03-23T21:52:47.532702-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=Fubm1TtWeo", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3640, "uid": "d645920e395fedad7bbbed0eca3fe2e0", "name": "Hippocampus: An Efficient and Scalable Memory Module for Agentic AI", "authors": [{"id": 27978, "fullname": "Yi Li", "url": "http://mlsys.org/api/miniconf/users/27978?format=json", "institution": "University of Texas at Dallas"}, {"id": 12896, "fullname": "Lianjie Cao", "url": "http://mlsys.org/api/miniconf/users/12896?format=json", "institution": "Hewlett Packard Labs"}, {"id": 15334, "fullname": "Faraz Ahmed", "url": "http://mlsys.org/api/miniconf/users/15334?format=json", "institution": "Hewlett 
Packard Labs"}, {"id": 14774, "fullname": "Puneet Sharma", "url": "http://mlsys.org/api/miniconf/users/14774?format=json", "institution": "HP Labs"}, {"id": 23337, "fullname": "Bingzhe Li", "url": "http://mlsys.org/api/miniconf/users/23337?format=json", "institution": "University of Texas at Dallas"}], "abstract": "Agentic AI require persistent memory to store user-specific histories beyond the limited context window of LLMs. Existing memory systems use dense vector databases or knowledge-graph traversal (or hybrid), incurring high retrieval latency and poor storage scalability. We introduce \\textbf{Hippocampus}, an agentic memory management system that uses compact binary signatures for semantic search and lossless token-ID streams for exact content reconstruction. Its core is a Dynamic Wavelet Matrix (DWM) that compresses and co-indexes both streams to support ultra-fast search in the compressed domain, thus avoiding costly dense-vector or graph computations. This design scales linearly with memory size, making it suitable for long-horizon agentic deployments.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3640", "url": null, "sourceid": 40, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=0sUYZh9D4a", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 934, "modified": "2026-03-23T21:52:48.449896-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=0sUYZh9D4a", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, 
"poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3550, "uid": "f457c545a9ded88f18ecee47145a72c0", "name": "Flash3DGS: Algorithm and System Co-Optimization for Fast 3D Gaussian Splatting on GPUs", "authors": [{"id": 27428, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27428?format=json", "institution": null}, {"id": 27429, "fullname": "Zhican Wang", "url": "http://mlsys.org/api/miniconf/users/27429?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 27430, "fullname": "Zhiwen Mo", "url": "http://mlsys.org/api/miniconf/users/27430?format=json", "institution": "Imperial College London"}, {"id": 27431, "fullname": "Hongxiang Fan", "url": "http://mlsys.org/api/miniconf/users/27431?format=json", "institution": "Imperial College London"}], "abstract": "Recent advances in 3D Gaussian Splatting (3DGS) have enabled high-quality and efficient novel view synthesis, demonstrating great potential in real-world applications such as robotic perception and digital-twin construction.  However, 3DGS requires processing up to millions of Gaussians in parallel, imposing significant computational and memory demands that limit its deployment on resource-constrained platforms. Through systematic profiling and analysis, this paper identifies several redundancy at both the algorithmic and system implementation levels. These insights motivate us to explore several novel optimizations, including adaptive early sorting, GPU-efficient axis-shared rasterization, and dynamic thresholding. Unlike prior work that focuses only on either algorithmic improvements or systems optimization, our approach explores a joint algorithm and system co-optimization to push the performance limits of 3DGS on GPUs. 
Comprehensive evaluation demonstrates that our co-optimization approach, named \\textit{Flash3DGS}, achieves a speed-up of up to $1.41 \\times$ with a negligible drop in rendering image quality compared with the \\textit{gsplat} baseline. Importantly, our co-optimization is orthogonal to most existing 3DGS acceleration methods, allowing for synergistic performance gains when used in combination. We plan to release our code publicly upon paper acceptance to support reproducibility and future research.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3550", "url": null, "sourceid": 49, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=i05mMLR9BX", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 844, "modified": "2026-03-23T21:52:44.827486-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=i05mMLR9BX", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3605, "uid": "9a1158154dfa42caddbd0694a4e9bdc8", "name": "HexiScale: Facilitating Large Language Model Training over Heterogeneous Hardware", "authors": [{"id": 25937, "fullname": "Ran Yan", "url": "http://mlsys.org/api/miniconf/users/25937?format=json", "institution": "The Hong Kong University of Science and Technology"}, {"id": 21099, "fullname": "YOUHE JIANG", "url": 
"http://mlsys.org/api/miniconf/users/21099?format=json", "institution": "University of Cambridge"}, {"id": 27733, "fullname": "Xiaonan Nie", "url": "http://mlsys.org/api/miniconf/users/27733?format=json", "institution": "ByteDance Inc."}, {"id": 27606, "fullname": "Fangcheng Fu", "url": "http://mlsys.org/api/miniconf/users/27606?format=json", "institution": "Shanghai Jiaotong University"}, {"id": 21049, "fullname": "Bin CUI", "url": "http://mlsys.org/api/miniconf/users/21049?format=json", "institution": "Peking University"}, {"id": 26259, "fullname": "Binhang Yuan", "url": "http://mlsys.org/api/miniconf/users/26259?format=json", "institution": "HKUST"}], "abstract": "Training large language models (LLMs) is a computationally intensive task, which is typically conducted in data centers with homogeneous high-performance GPUs. In this paper, we explore an alternative approach by deploying training computations across heterogeneous GPUs to enable better flexibility and efficiency for heterogeneous resource utilization. Toward this end, we propose a novel system, HexiScale, that can flexibly support asymmetric partition of training computations in the scope of data-, pipeline-, and tensor model parallelism. We further formalize the allocation of asymmetric partitioned training computations over a set of heterogeneous GPUs as a constrained optimization problem and propose an efficient hierarchical graph partitioning algorithm. Our approach effectively allocates training computations across heterogeneous GPUs, fully leveraging the available computational power. We compare the performance of HexiScale with state-of-the-art homogeneous and heterogeneous training systems. 
When training LLMs at different scales (from 7B to 30B), empirical results demonstrate that: (\\underline{i}) compared to state-of-the-art homogeneous baselines running over homogeneous GPUs, HexiScale achieves \\textit{similar} performance when running over heterogeneous GPUs with the \\textit{same} theoretical FLOPS; (\\underline{ii}) compared to state-of-the-art heterogeneous baselines running on the same heterogeneous clusters, HexiScale delivers $1.5\\times$ to $2.4\\times$ higher throughput.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3605", "url": null, "sourceid": 52, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=KgcqSNio0U", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 899, "modified": "2026-03-23T21:52:47.001246-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=KgcqSNio0U", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3536, "uid": "72b32a1f754ba1c09b3695e0cb6cde7f", "name": "FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling", "authors": [{"id": 27338, "fullname": "Ted Zadouri", "url": "http://mlsys.org/api/miniconf/users/27338?format=json", "institution": "University of California, Los Angeles"}, {"id": 27339, "fullname": "Markus Hoehnerbach", "url": "http://mlsys.org/api/miniconf/users/27339?format=json", "institution": 
"Meta"}, {"id": 25645, "fullname": "Jay Shah", "url": "http://mlsys.org/api/miniconf/users/25645?format=json", "institution": "Colfax Research"}, {"id": 27340, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27340?format=json", "institution": null}, {"id": 27341, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27341?format=json", "institution": null}], "abstract": "Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. While FlashAttention-3 optimized attention for Hopper GPUs through asynchronous execution and warp specialization, it primarily targets the H100 architecture. The AI industry has rapidly transitioned to deploying Blackwell-based systems such as the B200 and GB200, which exhibit fundamentally different performance characteristics due to asymmetric hardware scaling: tensor core throughput doubles while other functional units (shared memory bandwidth, exponential units) scale more slowly or remain unchanged. We develop several techniques to address these shifting bottlenecks on Blackwell GPUs: (1) redesigned pipelines that exploit fully asynchronous MMA operations and larger tile sizes, (2) software-emulated exponential and conditional softmax rescaling that reduces non-matmul operations, and (3) leveraging tensor memory to reduce shared memory traffic in the backward pass. We demonstrate that our method, FlashAttention-4, achieves up to 1.3$\\times$ speedup over cuDNN and 2.4$\\times$ over Triton on B200 GPUs with BF16, reaching up to 1605 TFLOPs/s (71\\% utilization). 
Beyond algorithmic innovations, we implement FlashAttention-4 entirely in CuTe-DSL embedded in Python, achieving 20-30$\\times$ faster compile times compared to traditional C++ template-based approaches while maintaining full expressivity.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3536", "url": null, "sourceid": 57, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=mN5RtvuYl3", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 830, "modified": "2026-03-23T21:52:44.253699-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=mN5RtvuYl3", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3516, "uid": "66f041e16a60928b05a7e228a89c3799", "name": "Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem", "authors": [{"id": 16907, "fullname": "Shuyi Lin", "url": "http://mlsys.org/api/miniconf/users/16907?format=json", "institution": "Northeastern University"}, {"id": 27199, "fullname": "Anshuman Suri", "url": "http://mlsys.org/api/miniconf/users/27199?format=json", "institution": "Northeastern University"}, {"id": 27200, "fullname": "Alina Oprea", "url": "http://mlsys.org/api/miniconf/users/27200?format=json", "institution": "Northeastern University"}, {"id": 16145, "fullname": "Cheng Tan", "url": "http://mlsys.org/api/miniconf/users/16145?format=json", "institution": 
"Northeastern"}], "abstract": "As large language models (LLMs) become increasingly deployed in safety-critical applications, the lack of systematic methods to assess their vulnerability to jailbreak attacks presents a critical security gap. We introduce the \\emph{jailbreak oracle problem}: given a model, prompt, and decoding strategy, determine whether a jailbreak response can be generated with likelihood exceeding a specified threshold. This formalization enables a principled study of jailbreak vulnerabilities. Answering the jailbreak oracle problem poses significant computational challenges, as the search space grows exponentially with response length. We present BOA, the first system designed for efficiently solving the jailbreak oracle problem. BOA employs a two-phase search strategy: (1) breadth-first sampling to identify easily accessible jailbreaks, followed by (2) depth-first priority search guided by fine-grained safety scores to systematically explore promising yet low-probability paths. 
BOA enables rigorous security assessments including systematic defense evaluation, standardized comparison of red team attacks, and model certification under extreme adversarial conditions.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3516", "url": null, "sourceid": 58, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=vr3Rrg6Xnm", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 810, "modified": "2026-03-23T21:52:43.509340-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=vr3Rrg6Xnm", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3514, "uid": "072b030ba126b2f4b2374f342be9ed44", "name": "FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error", "authors": [{"id": 25853, "fullname": "Fengjuan Wang", "url": "http://mlsys.org/api/miniconf/users/25853?format=json", "institution": "zhejianglab"}, {"id": 27194, "fullname": "Zhiyi Su", "url": "http://mlsys.org/api/miniconf/users/27194?format=json", "institution": "Zhejiang Lab, Zhejiang Lab"}, {"id": 27195, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27195?format=json", "institution": null}, {"id": 27196, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27196?format=json", "institution": null}, {"id": 25609, "fullname": "Sun Mou", "url": "http://mlsys.org/api/miniconf/users/25609?format=json", 
"institution": "Zhejiang Lab"}], "abstract": "Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computation and reduce memory footprint, existing implementations still rely on BF16-dominated dataflows with frequent quantize\u2013dequantize (Q/DQ) conversions. These redundant casts erode much of FP8\u2019s theoretical efficiency. However, naively removing these casts by keeping dataflows entirely in FP8 introduces double quantization error: tensors quantized along different dimensions accumulate inconsistent scaling factors, degrading numerical stability.  We propose FP8-Flow-MoE, an FP8 training recipe featuring a quantization-consistent FP8-centric dataflow with a scaling-aware transpose and fused FP8 operators that streamline computation and eliminate explicit cast operations from 12 to 2. Evaluations on a 671B-parameter MoE model demonstrate up to 21\\% higher throughput and 16.5~GB lower memory usage per GPU compared to BF16 and na\u00efve FP8 baselines, while maintaining stable convergence. 
We provide a plug-and-play FP8 recipe compatible with TransformerEngine and Megatron-LM, which will be open-sourced after the camera-ready release of this paper.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3514", "url": null, "sourceid": 60, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=wyH60Su6G7", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 808, "modified": "2026-03-23T21:52:43.443255-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=wyH60Su6G7", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3565, "uid": "44f683a84163b3523afe57c2e008bc8c", "name": "Locality-Aware Beam Scheduling for Efficient Test-Time Compute with a Consumer-grade GPU", "authors": [{"id": 25535, "fullname": "Hsing-Ti Wang", "url": "http://mlsys.org/api/miniconf/users/25535?format=json", "institution": "NTU ECLab"}, {"id": 26242, "fullname": "Hung-Tso Shiao", "url": "http://mlsys.org/api/miniconf/users/26242?format=json", "institution": "National Taiwan University"}, {"id": 27580, "fullname": "Chia-Lin Yang", "url": "http://mlsys.org/api/miniconf/users/27580?format=json", "institution": "Department of computer science and informational engineering, National Taiwan University"}], "abstract": "Large Language Models (LLMs) are central to modern NLP applications, yet their deployment on consumer-grade GPUs 
is constrained by limited memory capacity and bandwidth. In typical single-batch inference on local devices, the key\u2013value (KV) cache occupies only a small fraction of total memory, so prior studies have largely focused on model weights. The rise of test-time compute (TTC), however, introduces a new bottleneck: the rapidly expanding KV cache. In TTC methods such as step-wise beam search, concurrent decoding paths cause KV cache size and transfer costs to scale with exploration space, resulting in severe I/O stalls on consumer-grade GPUs. We identify two complementary forms of data locality in TTC workloads. Inter-token locality occurs within each decoding step, as consecutive tokens in the same beam access nearly identical KV cache data. Inter-beam locality arises across decoding steps, as beams that share common prefixes reuse overlapping KV segments. Building on these observations, we propose Locality-Aware Beam Scheduling, which exploits these locality patterns to reduce redundant KV cache transfers. It also employs balanced grouping with prefetching to overlap data movement with computation. 
Evaluated on OPT-6.7B, LLaMA-2-7B, and Qwen-7B, our method reduces KV cache transfer volume by over 95\\% and achieves consistent end-to-end speedups of 3.39\u00d7\u20139.72\u00d7, 3.60\u00d7\u20138.74\u00d7, and 4.17\u00d7\u20137.99\u00d7, respectively, compared to layer-wise offloading.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3565", "url": null, "sourceid": 62, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=dTo8jAXm9K", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 859, "modified": "2026-03-23T21:52:45.475684-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=dTo8jAXm9K", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3620, "uid": "03afdbd66e7929b125f8597834fa83a4", "name": "PROMPTS: PeRformance Optimization via Multi-Agent Planning for LLM Training and Serving", "authors": [{"id": 27835, "fullname": "Yuran Ding", "url": "http://mlsys.org/api/miniconf/users/27835?format=json", "institution": "Google"}, {"id": 27836, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27836?format=json", "institution": null}, {"id": 27837, "fullname": "Xiaofan Zhang", "url": "http://mlsys.org/api/miniconf/users/27837?format=json", "institution": "Google"}, {"id": 27838, "fullname": "Xinwei Chen", "url": "http://mlsys.org/api/miniconf/users/27838?format=json", "institution": 
"Google"}], "abstract": "Optimizing large-language model (LLM) training and serving on large-scale distributed systems is a significant challenge. This difficulty stems from the rapidly evolving LLM landscape, the requirement for deep domain expertise, and the need for workload-specific optimization strategies. Existing methods rely on either handcrafted optimization performed by human experts, which is tedious and time-consuming, or resource-intensive black-box searches, which lack the extensibility to keep pace with evolving models and hardware. To address this, we introduce \\textbf{PROMPTS}, a novel multi-agent framework that complements traditional search methods with expert-informed reasoning to deliver system-level optimization with far fewer shots. Key components of the proposed framework include an \\textit{Analyzer Agent} that diagnoses performance bottlenecks by synthesizing profiler data and a \\textit{Proposal Agent} that leverages a knowledge base to generate optimized sharding configurations with detailed justifications through retrieval-augmented generation (RAG).  Experimental results across eight real-world LLM workloads have demonstrated that PROMPTS can provide valid reasoning and accurate recommendations by considering LLM workload characteristics and backend hardware features, delivering performance improvements of up to \\textbf{434\\%}. These workloads spanned LLMs with Mixture-of-Experts (MoE) and dense models, system configurations from 2 TPU chips to 512-chip systems with 2D/3D Torus interconnects, and the full LLM lifecycle including pre-training, post-training, and serving.  To validate our agent's system optimization proposals, we benchmarked them against production configurations that were previously optimized by experts, either through extensive manual analysis or automated black-box searches. 
In every case, our agent independently identified this expert-validated solution within its top three recommendations from a \\textbf{single invocation}. Furthermore, the agent's top-ranked recommendation matched the production solution in \\textbf{87.5\\%} of cases, demonstrating its ability to not only find optimized configurations but also to correctly prioritize the optimization candidates.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3620", "url": null, "sourceid": 63, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=FTOfgVHcZn", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 914, "modified": "2026-03-23T21:52:47.624044-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=FTOfgVHcZn", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3590, "uid": "d2ddea18f00665ce8623e36bd4e3c7c5", "name": "PLayer-FL: A Principled Approach to Personalized Layer-wise Cross-Silo Federated Learning", "authors": [{"id": 27662, "fullname": "Ahmed Elhussein", "url": "http://mlsys.org/api/miniconf/users/27662?format=json", "institution": "Columbia University"}, {"id": 25923, "fullname": "Florent Pollet", "url": "http://mlsys.org/api/miniconf/users/25923?format=json", "institution": "Columbia University"}, {"id": 27663, "fullname": "Gamze Gursoy", "url": 
"http://mlsys.org/api/miniconf/users/27663?format=json", "institution": "Columbia University"}], "abstract": "Federated learning (FL) with non-IID data often degrades client performance below local training baselines. Partial FL addresses this by federating only early layers that learn transferable features, but existing methods rely on ad-hoc, architecture-specific heuristics. We first conduct a systematic analysis of layer-wise generalization dynamics in FL, revealing an early-emerging transition between generalizable (safe-to-federate) and task-specific (should-remain-local) layers. Building on this, we introduce Principled Layer-wise Federated Learning (PLayer-FL), which aims to deliver the benefits of federation more robustly. PLayer-FL computes a novel federation-sensitivity metric efficiently after a single training epoch to choose the optimal split point for a given task. Inspired by model pruning, the metric quantifies each layer\u2019s robustness to aggregation and highlights where federation shifts from beneficial to detrimental. We show that this metric correlates strongly with established generalization measures across diverse architectures. 
Crucially, experiments demonstrate that PLayer-FL achieves consistently competitive performance across a wide range of tasks while distributing gains more equitably and reducing client-side regressions relative to baselines.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3590", "url": null, "sourceid": 73, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=QBUy1HdKrZ", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 884, "modified": "2026-03-23T21:52:46.459439-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=QBUy1HdKrZ", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3541, "uid": "fbd7939d674997cdb4692d34de8633c4", "name": "Demystifying the Mixture of Experts Serving Tax", "authors": [{"id": 14926, "fullname": "Pratyush Patel", "url": "http://mlsys.org/api/miniconf/users/14926?format=json", "institution": "University of Washington"}, {"id": 11122, "fullname": "Arvind Krishnamurthy", "url": "http://mlsys.org/api/miniconf/users/11122?format=json", "institution": "University of Washington"}], "abstract": "Mixture-of-Experts (MoEs) enable massive model sizes but suffer from serving overheads compared to dense models with the same per-token compute costs. This MoE tax varies with the model architecture, inference phase, and parallelism strategy. 
We comprehensively study the tax for different MoE models, finding that they perform 2-3x worse than equivalent dense models. Through microbenchmarks, we analyze and categorize the underlying tax sources and show how they manifest differently under different configurations.  Our key result is that prefill and decode phases incur vastly different taxes; counterintuitively, factors like load imbalance, which harm prefill, can sometimes benefit decode. To gain deeper intuition, we propose a balls-bins-buckets performance model and study recent MoE developments like fine-grained experts and data parallel attention. We conclude by discussing existing and new techniques to reduce the MoE tax and their associated trade-offs.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3541", "url": null, "sourceid": 76, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=lELxqcgrsN", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 835, "modified": "2026-03-23T21:52:44.457499-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=lELxqcgrsN", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3633, "uid": "f033ab37c30201f73f142449d037028d", "name": "RaidServe: High-performance Resilient Serving", "authors": [{"id": 20912, "fullname": "Ziyi Xu", "url": "http://mlsys.org/api/miniconf/users/20912?format=json", 
"institution": "Shanghai Jiaotong University"}, {"id": 20911, "fullname": "Zhiqiang Xie", "url": "http://mlsys.org/api/miniconf/users/20911?format=json", "institution": "Stanford University"}, {"id": 23398, "fullname": "Swapnil Gandhi", "url": "http://mlsys.org/api/miniconf/users/23398?format=json", "institution": "Stanford"}, {"id": 20928, "fullname": "Christos Kozyrakis", "url": "http://mlsys.org/api/miniconf/users/20928?format=json", "institution": "Computer Science Department, Stanford University"}], "abstract": "Tensor parallelism (TP) enables large language models (LLMs) to scale inference efficiently across multiple GPUs, but its tight coupling makes systems fragile: a single GPU failure can halt execution, trigger costly KVCache recomputation, and introduce long-term compute and memory imbalance. We present RaidServe , a fault-tolerant TP serving system that sustains high performance under irregular GPU availability. RaidServe introduces three techniques to balance computation and memory across GPUs: (1) Cyclic KVCache Placement for even memory utilization, (2) Hybrid Attention combining tensor- and data-parallel attention to eliminate stragglers, and (3) Fine-Grained Load-Aware Routing to dynamically balance requests. It further employs proactive KVCache backup and on-demand weight recovery to avoid expensive recomputation and redundant data transfers. 
Implemented in a lightweight serving engine compatible with existing infrastructures, RaidServe achieves up to 2\u00d7 higher throughput and two orders of magnitude faster recovery than standard fault-handling methods on an 8\u00d7H100 DGX system, maintaining strong performance even with multiple GPU failures.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3633", "url": null, "sourceid": 80, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=5pl9fdbEkq", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 927, "modified": "2026-03-23T21:52:48.141148-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=5pl9fdbEkq", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3538, "uid": "9778d5d219c5080b9a6a17bef029331c", "name": "Unified LLM Model for Power, Performance, and Area Prediction from Hardware Code", "authors": [{"id": 25630, "fullname": "Armin Abdollahi", "url": "http://mlsys.org/api/miniconf/users/25630?format=json", "institution": "University of Southern California"}, {"id": 27346, "fullname": "Mehdi Kamal", "url": "http://mlsys.org/api/miniconf/users/27346?format=json", "institution": "University of Southern California"}, {"id": 27347, "fullname": "Massoud Pedram", "url": "http://mlsys.org/api/miniconf/users/27347?format=json", "institution": "University of Southern California"}], 
"abstract": "We present RocketPPA, a unified LLM-based model that predicts power, performance, and area for Verilog designs across technology nodes and optimization styles. The approach combines a large language model backbone with mixture-of-experts regression and low-rank adaptation for parameter efficiency. To improve generalization, we introduce a contrastive learning framework that encourages semantically similar designs to cluster in embedding space, providing an inductive bias that reflects the structure of the hardware design space. Trained on 15nm and 45nm nodes with area- and delay-optimized flows, the model achieves 9.4 percentage point improvement in pass rate at ten percent tolerance over prior methods, with approximately 20$\\times$ higher throughput (0.12 seconds per design). Ablations show contrastive learning contributes 2.5 points to accuracy, while leave-one-regime-out experiments demonstrate robust cross-regime generalization with minimal degradation. These results validate that combining supervised and contrastive objectives enables rapid, accurate PPA prediction across nodes and optimization styles.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3538", "url": null, "sourceid": 82, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=lpO7kxiayb", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 832, "modified": "2026-03-23T21:52:44.335537-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=lpO7kxiayb", "resourcetype": "UriEventmedia"}], 
"show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3611, "uid": "7647966b7343c29048673252e490f736", "name": "SHIP: SRAM-Based Huge Inference Pipelines for Fast LLM Serving", "authors": [{"id": 27796, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27796?format=json", "institution": null}, {"id": 27797, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27797?format=json", "institution": null}, {"id": 27798, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27798?format=json", "institution": null}, {"id": 27799, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27799?format=json", "institution": null}, {"id": 27800, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27800?format=json", "institution": null}, {"id": 27801, "fullname": "Sahil Parmar", "url": "http://mlsys.org/api/miniconf/users/27801?format=json", "institution": "NVIDIA"}, {"id": 27802, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27802?format=json", "institution": null}, {"id": 27803, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27803?format=json", "institution": null}, {"id": 27804, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27804?format=json", "institution": null}, {"id": 27805, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27805?format=json", "institution": null}, {"id": 27806, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27806?format=json", "institution": null}], "abstract": "The proliferation of large language models (LLMs) demands inference systems with both low latency and high efficiency at scale. GPU-based serving relies on HBM for model weights and KV caches, creating a memory bandwidth bottleneck during decode. 
To break through this bottleneck, we present the first large-scale, SRAM-based LLM inference deployment\u2014Groq\u2019s public cloud\u2014serving hundreds of billions of tokens daily. This paper reviews Groq\u2019s first-generation SRAM-based Huge Inference Pipelines (SHIP), highlighting: (1) a synchronous, low-diameter interconnect enabling low-latency scaling across thousands of chips; (2) optimizations for LLM serving under limited memory capacity; and (3) a large pipeline design that sustains efficiency and latency under varying prefill-to-decode ratios and context lengths. Together, these yield state-of-the-art latency while maintaining efficiency across diverse traffic scenarios\u2014key to real-world LLM serving.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3611", "url": null, "sourceid": 89, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=IZaXDwDtL1", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 905, "modified": "2026-03-23T21:52:47.224690-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=IZaXDwDtL1", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3636, "uid": "26657d5ff9020d2abefe558796b99584", "name": "Grolar: Efficient LLM Training on Heterogeneous Clusters", "authors": [{"id": 14867, "fullname": "Runsheng Guo", "url": 
"http://mlsys.org/api/miniconf/users/14867?format=json", "institution": "University of Waterloo"}, {"id": 27952, "fullname": "Utkarsh Anand", "url": "http://mlsys.org/api/miniconf/users/27952?format=json", "institution": "University of Waterloo"}, {"id": 14868, "fullname": "Khuzaima Daudjee", "url": "http://mlsys.org/api/miniconf/users/14868?format=json", "institution": "University of Waterloo"}, {"id": 27953, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27953?format=json", "institution": null}], "abstract": "Large language models (LLMs) require vast amounts of GPU compute to train, but limited availability and high costs of GPUs make homogeneous clusters impractical for many organizations. Instead, assembling heterogeneous clusters by pooling together GPUs of different generations allows them to achieve higher aggregate compute and make use of all available GPUs. However, training on heterogeneous clusters presents significant challenges. The workload must be carefully partitioned such that GPUs in the cluster with limited compute, memory, or network bandwidth do not bottleneck the training process. Existing heterogeneous training systems cannot do so efficiently since they integrate data, pipeline, and tensor parallelism in a way that trades off communication for memory overhead. Combining vanilla data parallelism with pipeline parallelism is communication-efficient but results in high memory overhead from replicating model parameters. Alternatively, using sharded data parallelism or tensor parallelism reduces memory overhead but increases communication overhead when combined with pipeline parallelism. To address this problem, we designed Grolar, a system that uses Pipeline-Efficient ZeRO DP, a novel integration of pipeline parallelism and data parallelism that is both communication- and memory-efficient. 
Grolar uses a planner to automatically find an optimized training configuration from the vast search space of possibilities on heterogeneous clusters, and our evaluation shows that Grolar achieves up to 3\u00d7 higher training throughput than state-of-the-art systems across representative heterogeneous training scenarios.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3636", "url": null, "sourceid": 96, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=40leuGH3iO", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 930, "modified": "2026-03-23T21:52:48.292004-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=40leuGH3iO", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3559, "uid": "f0935e4cd5920aa6c7c996a5ee53a70f", "name": "Speculative Decoding: Performance or Illusion?", "authors": [{"id": 12395, "fullname": "Lily Liu", "url": "http://mlsys.org/api/miniconf/users/12395?format=json", "institution": "UC Berkeley"}, {"id": 25915, "fullname": "Jiaxiang Yu", "url": "http://mlsys.org/api/miniconf/users/25915?format=json", "institution": "UC Berkeley"}, {"id": 27180, "fullname": "Jongseok Park", "url": "http://mlsys.org/api/miniconf/users/27180?format=json", "institution": "University of California, Berkeley"}, {"id": 27557, "fullname": "Alvin Cheung", "url": 
"http://mlsys.org/api/miniconf/users/27557?format=json", "institution": "University of California, Berkeley"}, {"id": 11118, "fullname": "Ion Stoica", "url": "http://mlsys.org/api/miniconf/users/11118?format=json", "institution": "UC Berkeley"}], "abstract": "Speculative decoding (SD) has become a popular technique to accelerate Large Language Model (LLM) inference, yet its real-world effectiveness remains unclear as prior evaluations rely on research prototypes and unrealistically small batch sizes. We present, to our knowledge, the first systematic study of SD on a production-grade and widely deployed inference engine (vLLM), covering multiple SD variants ($n$-gram, EAGLE/EAGLE-3, Draft-Model, Multi-Token Prediction) across diverse workloads, model scales, and batch sizes. We analyze key factors governing SD performance, and quantify a theoretical upper bound on SD speedup. Our results show that verification by the target model dominates the execution, while acceptance length varies markedly across output token positions, requests, and datasets. 
Comparing measured performance with these theoretical upper bounds reveals substantial gaps, and we leverage this observation to highlight new research opportunities that our study opens up for improving SD.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3559", "url": null, "sourceid": 106, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=fzkqtezFEi", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 853, "modified": "2026-03-23T21:52:45.206502-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=fzkqtezFEi", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3582, "uid": "2723d092b63885e0d7c260cc007e8b9d", "name": "MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design", "authors": [{"id": 27257, "fullname": "Zhen Zheng", "url": "http://mlsys.org/api/miniconf/users/27257?format=json", "institution": "Microsoft"}, {"id": 27641, "fullname": "Xiaonan Song", "url": "http://mlsys.org/api/miniconf/users/27641?format=json", "institution": "Microsoft"}, {"id": 27642, "fullname": "Chuanjie Liu", "url": "http://mlsys.org/api/miniconf/users/27642?format=json", "institution": "Microsoft"}], "abstract": "Quantization has become one of the most effective methodologies for compressing LLMs to a smaller size. 
However, existing quantization solutions still suffer from either a non-negligible accuracy drop or low system efficiency. In this paper, we propose MixLLM, which explores the optimization space of mixed-precision quantization between output features, based on the insight that different features matter differently in the model. MixLLM identifies the important output features from a global view rather than within each single layer, effectively assigning larger bit-widths to the output features that need them most to achieve high accuracy with low memory usage. We present the sweet spot of the quantization configuration via algorithm-system co-design, with both high accuracy and system efficiency. To address the system challenge, we design a two-step dequantization that makes easy use of Tensor Cores, with fast data-type conversion to reduce dequantization overhead, and present a software pipeline that maximally overlaps memory access, dequantization, and the MatMul. Extensive experiments show that with only 10\\% more bits, the perplexity increase can be reduced from about 0.5 in the SOTA to within 0.2 for Llama 3.1 70B, while the MMLU-Pro loss can be reduced from 1.92 to 0.99 relative to the SOTA across three popular models. 
Besides its superior accuracy, MixLLM also achieves state-of-the-art system efficiency.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3582", "url": null, "sourceid": 109, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=VBbMRQ4VOc", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 876, "modified": "2026-03-23T21:52:46.162606-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=VBbMRQ4VOc", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3624, "uid": "5f93f983524def3dca464469d2cf9f3e", "name": "Massive-Scale Out-Of-Core UMAP on the GPU", "authors": [{"id": 27164, "fullname": "Jinsol Park", "url": "http://mlsys.org/api/miniconf/users/27164?format=json", "institution": "NVIDIA"}, {"id": 11859, "fullname": "Corey Nolet", "url": "http://mlsys.org/api/miniconf/users/11859?format=json", "institution": "NVIDIA"}, {"id": 14733, "fullname": "Edward Raff", "url": "http://mlsys.org/api/miniconf/users/14733?format=json", "institution": "Booz Allen Hamilton"}, {"id": 27846, "fullname": "Tim Oates", "url": "http://mlsys.org/api/miniconf/users/27846?format=json", "institution": "University of Maryland, Baltimore County"}, {"id": 27847, "fullname": "Akira Naruse", "url": "http://mlsys.org/api/miniconf/users/27847?format=json", "institution": "NVIDIA"}], "abstract": "The Uniform Manifold Approximation 
and Projection (UMAP) algorithm has become a widely popular technique to reduce the dimensionality of a set of vectors, both for visualization and as a pre-processing step for follow-on machine learning tasks. UMAP is often an integral part of iterative and exploratory workflows, but the heavy amount of compute and memory required makes scaling to tens or even hundreds of gigabytes of vectors intractable on the CPU, often taking several hours to days to complete. In this paper, we show how we improved UMAP while unlocking performance that permits interactive analysis, even at massive-scale. We introduce an out-of-core strategy with optional multi-GPU support, achieving up to 74\u00d7 faster performance than the CPU baseline.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3624", "url": null, "sourceid": 110, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=CR35IJQD2J", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 918, "modified": "2026-03-23T21:52:47.789624-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=CR35IJQD2J", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3526, "uid": "5fd0b37cd7dbbb00f97ba6ce92bf5add", "name": "The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents", "authors": [{"id": 27265, "fullname": "Xingyao Wang", "url": 
"http://mlsys.org/api/miniconf/users/27265?format=json", "institution": "All Hands AI"}, {"id": 27266, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27266?format=json", "institution": null}, {"id": 27267, "fullname": "Juan Michelini", "url": "http://mlsys.org/api/miniconf/users/27267?format=json", "institution": "Universidad de la Rep\u00fablica"}, {"id": 27268, "fullname": "Calvin Smith", "url": "http://mlsys.org/api/miniconf/users/27268?format=json", "institution": "OpenHands"}, {"id": 27269, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27269?format=json", "institution": null}, {"id": 27270, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27270?format=json", "institution": null}, {"id": 27271, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27271?format=json", "institution": null}, {"id": 27272, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27272?format=json", "institution": null}, {"id": 27273, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27273?format=json", "institution": null}, {"id": 27274, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27274?format=json", "institution": null}, {"id": 27275, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27275?format=json", "institution": null}], "abstract": "Building production-ready software engineering agents requires balancing fast research iteration with operational stability, secure deployment, and reproducible execution across diverse environments. \\textbf{OpenHands V0}\u2014an open-source agent system with 64k+ GitHub stars\u2014validated community demand but revealed four key tensions: rigid sandboxing, scattered mutable configuration, blurred core\u2013application boundaries, and limited extensibility.  We present the \\textbf{OpenHands Software Agent SDK}\u2014the core of \\textbf{OpenHands V1}\u2014a complete architectural redesign that \\emph{separates agent core from downstream applications}.  
The SDK embodies four principles: (i) \\emph{optional isolation} (local-first, sandbox-on-demand); (ii) \\emph{stateless components} with immutable configuration and event-sourced state; (iii) \\emph{strict separation of concerns} between core and applications; and (iv) \\emph{two-layer composability} enabling modular deployment across four packages (SDK, Tools, Workspace, Server) and extensibility through typed, swappable components.   Built on these foundations, the SDK delivers \\emph{seamless local-to-remote execution portability}, integrated REST/WebSocket services, and visual workspaces (VS Code, VNC, browser) for human-agent collaboration.  Compared with existing SDKs from OpenAI, Claude and Google, OpenHands uniquely integrates native sandboxed execution, lifecycle control, model-agnostic multi-LLM routing, and built-in QA and security analysis.  Empirical results on SWE-Bench Verified and GAIA benchmarks demonstrate strong performance. By codifying lessons from V0, the OpenHands Agent SDK provides a practical foundation for prototyping, unlocking new classes of custom applications, \\emph{and} reliably deploying agents at scale.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3526", "url": null, "sourceid": 114, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=pzVmWs6yGq", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 820, "modified": "2026-03-23T21:52:43.861434-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=pzVmWs6yGq", "resourcetype": 
"UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3568, "uid": "3def184ad8f4755ff269862ea77393dd", "name": "Parrot: Persuasion and Agreement Robustness Rating of Output Truth", "authors": [{"id": 25980, "fullname": "Yusuf \u00c7elebi", "url": "http://mlsys.org/api/miniconf/users/25980?format=json", "institution": "Newmind AI"}, {"id": 27598, "fullname": "Mahmoud ElHussieni", "url": "http://mlsys.org/api/miniconf/users/27598?format=json", "institution": "Istanbul Medipol University"}, {"id": 25643, "fullname": "\u00d6zay Ezerceli", "url": "http://mlsys.org/api/miniconf/users/25643?format=json", "institution": "NewMind AI"}], "abstract": "This study presents PARROT (Persuasion and Agreement Robustness Rating of Output Truth), a robustness-focused framework designed to measure the degradation in accuracy that occurs under social pressure exerted on users through authority and persuasion in large language models (LLMs) the phenomenon of sycophancy (excessive conformity). PARROT (i) isolates causal effects by comparing the neutral version of the same question with an authoritatively false version using a double-blind evaluation, (ii) quantifies confidence shifts toward the correct and imposed false responses using log-likelihood-based calibration tracking, and (iii) systematically classifies failure modes (e.g., robust correct, sycophantic agreement, reinforced error, stubborn error, self-correction, etc.) using an eight-state behavioral taxonomy. We evaluated 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates. 
Findings show marked heterogeneity: advanced models (e.g., GPT-5, GPT-4.1, Claude Sonnet 4.5) exhibit low \u201cfollow rates\u201d ($\\leq11\\%$, GPT-5: 4\\%) and minimal accuracy loss, while older/smaller models show severe epistemic collapse (GPT-4: 80\\%, Qwen 2.5-1.5B: 94\\%). The danger is not limited to response changes; weak models reduce confidence in the correct response while increasing confidence in the imposed incorrect response. While international law and global knowledge at the domain level exhibit high fragility, elementary mathematics is relatively resilient. Consequently, we argue that the goal of \u201cresistance to overfitting pressure\u201d should be addressed as a primary objective alongside accuracy, harm avoidance, and privacy for safe deployment in the real world.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3568", "url": null, "sourceid": 125, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=cU2wiOnfm5", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 862, "modified": "2026-03-23T21:52:45.595554-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=cU2wiOnfm5", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3599, "uid": "d1f491a404d6854880943e5c3cd9ca25", "name": "Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP", "authors": 
[{"id": 20906, "fullname": "Yilong Zhao", "url": "http://mlsys.org/api/miniconf/users/20906?format=json", "institution": "University of California, Berkeley"}, {"id": 27733, "fullname": "Xiaonan Nie", "url": "http://mlsys.org/api/miniconf/users/27733?format=json", "institution": "ByteDance Inc."}, {"id": 17683, "fullname": "Kan Zhu", "url": "http://mlsys.org/api/miniconf/users/17683?format=json", "institution": "University of Washington"}, {"id": 27734, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27734?format=json", "institution": null}, {"id": 27735, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27735?format=json", "institution": null}, {"id": 27736, "fullname": "Hongxiang Hao", "url": "http://mlsys.org/api/miniconf/users/27736?format=json", "institution": "ByteDance Inc."}, {"id": 19070, "fullname": "Yang Zhou", "url": "http://mlsys.org/api/miniconf/users/19070?format=json", "institution": "UC Berkeley"}, {"id": 17670, "fullname": "Baris Kasikci", "url": "http://mlsys.org/api/miniconf/users/17670?format=json", "institution": "University of Michigan"}, {"id": 11118, "fullname": "Ion Stoica", "url": "http://mlsys.org/api/miniconf/users/11118?format=json", "institution": "UC Berkeley"}], "abstract": "Context parallelism (CP) has been widely adopted to support the growing context length in foundation model pretraining. However, existing designs fail to handle the large variation in sequence length from training datasets, resulting in suboptimal performance. These methods often over-shard short sequences, leading to compute inefficiency and excessive communication, or process long and short sequences separately without proper bin-packing, causing workload imbalance. In this paper, we propose FCP, a flexible context parallelism paradigm that shards and schedules sequences at block-level granularity. 
Instead of relying on rigid communication topologies such as ring, FCP enables arbitrary peer-to-peer communication, allowing flexible placement of sequence blocks across workers. By bin-packing blocks from both short and long sequences, FCP achieves both high compute efficiency and balanced workload distribution. Extensive evaluations show that FCP attains near-linear scalability on up to $256\\times$H20 and GB200 GPUs, with $1.13\\times$\u2013$2.21\\times$ improvement in the attention MFU.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3599", "url": null, "sourceid": 129, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=MPVycRsIn6", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 893, "modified": "2026-03-23T21:52:46.799467-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=MPVycRsIn6", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3634, "uid": "9fc3d7152ba9336a670e36d0ed79bc43", "name": "SONAR: Benchmarking Topology and Collaboration in Decentralized Learning", "authors": [{"id": 27911, "fullname": "Joyce Yuan", "url": "http://mlsys.org/api/miniconf/users/27911?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 27912, "fullname": "Yichuan Shi", "url": "http://mlsys.org/api/miniconf/users/27912?format=json", "institution": "Massachusetts Institute of 
Technology"}, {"id": 27913, "fullname": "Abhishek Singh", "url": "http://mlsys.org/api/miniconf/users/27913?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 27914, "fullname": "Rishi Sharma", "url": "http://mlsys.org/api/miniconf/users/27914?format=json", "institution": "EPFL - EPF Lausanne"}, {"id": 27915, "fullname": "Ramesh Raskar", "url": "http://mlsys.org/api/miniconf/users/27915?format=json", "institution": "Massachusetts Institute of Technology"}, {"id": 27916, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27916?format=json", "institution": null}, {"id": 27917, "fullname": "Martin Jaggi", "url": "http://mlsys.org/api/miniconf/users/27917?format=json", "institution": "EPFL"}], "abstract": "The performance, efficiency, and reliability of decentralized machine learning hinge on systems factors such as network topology, communication budget, and device heterogeneity\u2014yet existing frameworks treat these as fixed or opaque. Federated learning remains centrally orchestrated, while peer-to-peer (P2P) approaches lack a unified foundation for analyzing how topology and system design jointly shape learning outcomes. We present \\textbf{SONAR}, a systems framework for reproducible, topology-aware decentralized learning. SONAR unifies communication, topology, and telemetry in a layered architecture supporting multiple backends (gRPC, MPI, WebRTC), static and adaptive graphs, and per-node logging of bandwidth, latency, and collaboration dynamics. 
Using SONAR, we make three observations: (1) topology and its graph-level statistics show no consistent or linear correlation with learning performance across accuracy, robustness, and privacy metrics, underscoring the need to study topology as an independent systems variable; (2) under realistic constraints such as limited communication rounds or bandwidth, topology governs how quickly information propagates\u2014producing up to \u2248 20% performance differences between graph families; and (3) adaptive neighbor selection can induce collaborator collapse\u2014a failure mode where network diversity erodes over time. By exposing topology as a first-class experimental dimension, SONAR enables systematic, reproducible evaluation of decentralized learning across performance, efficiency, and robustness regimes.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3634", "url": null, "sourceid": 133, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=4Bqg7Xyk5t", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 928, "modified": "2026-03-23T21:52:48.197417-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=4Bqg7Xyk5t", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3619, "uid": "02522a2b2726fb0a03bb19f2d8d9524d", "name": "Stream2LLM: Overlap Context Streaming and Prefill for Reduced 
Time-to-First-Token", "authors": [{"id": 27831, "fullname": "Rajveer Bachkaniwala", "url": "http://mlsys.org/api/miniconf/users/27831?format=json", "institution": "Georgia Institute of Technology"}, {"id": 27832, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27832?format=json", "institution": null}, {"id": 27833, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27833?format=json", "institution": null}, {"id": 27358, "fullname": "Divya Mahajan", "url": "http://mlsys.org/api/miniconf/users/27358?format=json", "institution": "Georgia Institute of Technology"}, {"id": 27834, "fullname": "Kexin Rong", "url": "http://mlsys.org/api/miniconf/users/27834?format=json", "institution": "Georgia Institute of Technology"}], "abstract": "Context retrieval systems for LLM inference face a critical challenge: high retrieval latency creates a fundamental tension between waiting for complete context (poor time-to-first-token) and proceeding without it (reduced quality). Recent work mitigates this via streaming\u2013overlapping retrieval with inference\u2013but prior systems focus on single-request settings and overlook challenges in multi-tenant deployments where concurrent requests contend for GPU memory and scheduling must adapt to dynamic context arrivals. We present Stream2LLM, a system that extends vLLM to support streaming prompts with adaptive scheduling and preemption for two distinct retrieval patterns: append-mode (progressive context accumulation) and update-mode (iterative refinement with cache invalidation). Stream2LLM decouples scheduling decisions from resource acquisition, enabling flexible preemption strategies guided by hardware-specific cost models, and uses cache invalidation based on longest common prefix matching to minimize redundant computation when prompts change dynamically. 
To evaluate Stream2LLM, we collect and characterize two large-scale, real-world streaming workloads based on web crawling and approximate nearest neighbor search. Our evaluation demonstrates that streaming architecture delivers up to 11\u00d7 TTFT improvements, with cost-aware scheduling providing critical benefits under memory pressure, while maintaining throughput parity with non-streaming baselines.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3619", "url": null, "sourceid": 134, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=FuRo7Ur5Ib", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 913, "modified": "2026-03-23T21:52:47.584624-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=FuRo7Ur5Ib", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3581, "uid": "42a0e188f5033bc65bf8d78622277c4e", "name": "A Lightweight High-Throughput Collective-Capable NoC for Large-Scale ML Accelerators", "authors": [{"id": 25644, "fullname": "Luca Colagrande", "url": "http://mlsys.org/api/miniconf/users/25644?format=json", "institution": "ETH Zurich"}, {"id": 25669, "fullname": "Lorenzo Leone", "url": "http://mlsys.org/api/miniconf/users/25669?format=json", "institution": "ETH Zurich"}, {"id": 27638, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27638?format=json", "institution": null}, 
{"id": 27639, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27639?format=json", "institution": null}, {"id": 27640, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27640?format=json", "institution": null}, {"id": 21010, "fullname": "Luca Benini", "url": "http://mlsys.org/api/miniconf/users/21010?format=json", "institution": "ETHZ - ETH Zurich"}], "abstract": "The exponential increase in Machine Learning (ML) model size and complexity has driven unprecedented demand for high-performance acceleration systems. As technology scaling enables the integration of thousands of computing elements onto a single die, the boundary between distributed and on-chip systems has blurred, making efficient on-chip collective communication increasingly critical. In this work, we present a lightweight, collective-capable Network on Chip (NoC) that supports efficient barrier synchronization alongside scalable, high-bandwidth multicast and reduction operations, co-designed for the next generation of ML accelerators. We introduce Direct Compute Access (DCA), a novel paradigm that grants the interconnect fabric direct access to the cores\u2019 computational resources, enabling high-throughput in-network reductions with a small 16.5% router area overhead. Through in-network hardware acceleration, we achieve 2.9\u00d7 and 2.5\u00d7 geomean speedups on multicast and reduction operations involving between 1 and 32 KiB of data, respectively. 
Furthermore, by keeping communication off the critical path in GEMM workloads, these features allow our architecture to scale efficiently to large meshes, resulting in up to 2.1\u00d7 and 2.1\u00d7 estimated performance gains through multicast and reduction support, respectively, compared to a baseline unicast NoC architecture.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3581", "url": null, "sourceid": 136, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=VDuS8N9RCx", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 875, "modified": "2026-03-23T21:52:46.121249-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=VDuS8N9RCx", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3530, "uid": "f4b9ec30ad9f68f89b29639786cb62ef", "name": "Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework", "authors": [{"id": 27301, "fullname": "Dong Wang", "url": "http://mlsys.org/api/miniconf/users/27301?format=json", "institution": "Meta FAIR"}, {"id": 27302, "fullname": "Yang Li", "url": "http://mlsys.org/api/miniconf/users/27302?format=json", "institution": "Facebook"}, {"id": 27303, "fullname": "Ansong Ni", "url": "http://mlsys.org/api/miniconf/users/27303?format=json", "institution": "Meta AI"}, {"id": 27304, "fullname": "Ching-Feng Yeh", "url": 
"http://mlsys.org/api/miniconf/users/27304?format=json", "institution": "Facebook"}, {"id": 27305, "fullname": "Youssef Emad", "url": "http://mlsys.org/api/miniconf/users/27305?format=json", "institution": "Facebook"}, {"id": 27306, "fullname": "Xinjie Lei", "url": "http://mlsys.org/api/miniconf/users/27306?format=json", "institution": "Meta"}, {"id": 27307, "fullname": "Liam Robbins", "url": "http://mlsys.org/api/miniconf/users/27307?format=json", "institution": "FAIR"}, {"id": 27308, "fullname": "Karthik Padthe", "url": "http://mlsys.org/api/miniconf/users/27308?format=json", "institution": "Meta AI"}, {"id": 27309, "fullname": "Hu Xu", "url": "http://mlsys.org/api/miniconf/users/27309?format=json", "institution": "FAIR, Foundation"}, {"id": 27310, "fullname": "Xian Li", "url": "http://mlsys.org/api/miniconf/users/27310?format=json", "institution": "Facebook AI"}, {"id": 27311, "fullname": "Asli Celikyilmaz", "url": "http://mlsys.org/api/miniconf/users/27311?format=json", "institution": "FAIR"}, {"id": 27312, "fullname": "Ramya Raghavendra", "url": "http://mlsys.org/api/miniconf/users/27312?format=json", "institution": "Facebook"}, {"id": 27313, "fullname": "LIFEI HUANG", "url": "http://mlsys.org/api/miniconf/users/27313?format=json", "institution": "Facebook"}, {"id": 27314, "fullname": "Carole-Jean Wu", "url": "http://mlsys.org/api/miniconf/users/27314?format=json", "institution": "Meta"}, {"id": 27315, "fullname": "Shang-Wen Li", "url": "http://mlsys.org/api/miniconf/users/27315?format=json", "institution": "Facebook"}], "abstract": "Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. 
However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hardcoded for specific domains, limiting flexibility. We present \\textbf{Matrix}, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. 
In all cases, Matrix achieves $2$--$15\\times$ higher data generation throughput under identical hardware resources, without compromising output quality.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3530", "url": null, "sourceid": 94, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=ok96wGyPdI", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 824, "modified": "2026-03-23T21:52:44.017842-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=ok96wGyPdI", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3615, "uid": "76dc611d6ebaafc66cc0879c71b5db5c", "name": "FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management", "authors": [{"id": 26134, "fullname": "Nazmul Takbir", "url": "http://mlsys.org/api/miniconf/users/26134?format=json", "institution": "University of California Irvine"}, {"id": 27819, "fullname": "Hamidreza Koshkak", "url": "http://mlsys.org/api/miniconf/users/27819?format=json", "institution": "University of California, Irvine"}, {"id": 27820, "fullname": "Nikil Dutt", "url": "http://mlsys.org/api/miniconf/users/27820?format=json", "institution": "University of California, Irvine"}, {"id": 11050, "fullname": "Sangeetha Abdu Jyothi", "url": "http://mlsys.org/api/miniconf/users/11050?format=json", "institution": "UC Irvine / VMware 
research"}], "abstract": "Large Language Model (LLM) serving is increasingly constrained by the growing size of the key-value (KV) cache, which scales with both context length and generation length. Prior work shows that attention is dominated by a small subset of critical tokens, yet existing systems struggle to exploit this efficiently without degrading accuracy, especially in long generation. We make a key observation: the temporal stability of these critical tokens varies significantly across KV heads: some heads consistently focus on the same tokens, while others shift frequently. Building on this insight, we introduce FlexiCache, a hierarchical KV-cache management system that leverages the temporal stability of KV heads to reduce GPU memory usage and computation overhead, while preserving model accuracy. FlexiCache classifies KV heads as stable or unstable: it retains all KV-cache pages from unstable heads in GPU memory, whereas for stable heads, it keeps only the top-K pages on the GPU and offloads the rest to host memory. By exploiting temporal stability, FlexiCache performs periodic reranking for stable heads to fetch newly promoted top pages. 
Implemented atop vLLM, FlexiCache reduces GPU memory footprint for long-context requests by up to \\textbf{70\\%}, improves offline serving throughput by \\textbf{1.38\u20131.55\u00d7}, and lowers online token latency by \\textbf{1.6\u20132.1\u00d7}, all while maintaining accuracy in long-context, long-generation scenarios.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3615", "url": null, "sourceid": 128, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=GgX6dPJx9M", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 909, "modified": "2026-03-23T21:52:47.421836-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=GgX6dPJx9M", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3562, "uid": "7cbbc409ec990f19c78c75bd1e06f215", "name": "CDLM: CONSISTENCY DIFFUSION LANGUAGE MODELS FOR FASTER SAMPLING", "authors": [{"id": 25522, "fullname": "Minseo Kim", "url": "http://mlsys.org/api/miniconf/users/25522?format=json", "institution": "Seoul National University"}, {"id": 24298, "fullname": "Chenfeng Xu", "url": "http://mlsys.org/api/miniconf/users/24298?format=json", "institution": "UC Berkeley"}, {"id": 17672, "fullname": "Coleman Hooper", "url": "http://mlsys.org/api/miniconf/users/17672?format=json", "institution": "University of California, Berkeley"}, {"id": 27566, "fullname": "Harman Singh", 
"url": "http://mlsys.org/api/miniconf/users/27566?format=json", "institution": "University of California, Berkeley"}, {"id": 18231, "fullname": "Ben Athiwaratkun", "url": "http://mlsys.org/api/miniconf/users/18231?format=json", "institution": null}, {"id": 18868, "fullname": "Ce Zhang", "url": "http://mlsys.org/api/miniconf/users/18868?format=json", "institution": null}, {"id": 11240, "fullname": "Kurt Keutzer", "url": "http://mlsys.org/api/miniconf/users/11240?format=json", "institution": "EECS, UC Berkeley"}, {"id": 11237, "fullname": "Amir Gholami", "url": "http://mlsys.org/api/miniconf/users/11237?format=json", "institution": "UC Berkeley"}], "abstract": "Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and an inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language Models), a training-based acceleration method that simultaneously tackles both bottlenecks. CDLM integrates consistency modeling to drastically reduce the number of required sampling steps by enabling multi-token finalization. Furthermore, we enforce a block-wise causal attention mask during fine-tuning, making the model fully compatible with KV caching. Experiments show CDLM achieves 3.6\u00d7-12.8\u00d7 lower latency while maintaining competitive accuracy on math and coding tasks. 
The full training and evaluation code is available at https://anonymous.4open.science/r/Consistency_DLM_anonymous-3E88/.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3562", "url": null, "sourceid": 70, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=eB8yjR6alL", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 856, "modified": "2026-03-23T21:52:45.337875-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=eB8yjR6alL", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3563, "uid": "093f65e080a295f8076b1c5722a46aa2", "name": "LEANN: A Low-Storage Overhead Vector Index", "authors": [{"id": 27567, "fullname": "Yichuan Wang", "url": "http://mlsys.org/api/miniconf/users/27567?format=json", "institution": "University of California, Berkeley"}, {"id": 27568, "fullname": "Zhifei Li", "url": "http://mlsys.org/api/miniconf/users/27568?format=json", "institution": "UC Berkeley, University of California, Berkeley"}, {"id": 21025, "fullname": "Shu Liu", "url": "http://mlsys.org/api/miniconf/users/21025?format=json", "institution": "University of California, Berkeley"}, {"id": 17645, "fullname": "Yongji Wu", "url": "http://mlsys.org/api/miniconf/users/17645?format=json", "institution": "Duke University"}, {"id": 27569, "fullname": "Ziming Mao", "url": 
"http://mlsys.org/api/miniconf/users/27569?format=json", "institution": "University of California, Berkeley"}, {"id": 20906, "fullname": "Yilong Zhao", "url": "http://mlsys.org/api/miniconf/users/20906?format=json", "institution": "University of California, Berkeley"}, {"id": 27570, "fullname": "Xiao Yan", "url": "http://mlsys.org/api/miniconf/users/27570?format=json", "institution": "Wuhan University"}, {"id": 20979, "fullname": "Zhiying Xu", "url": "http://mlsys.org/api/miniconf/users/20979?format=json", "institution": "Amazon"}, {"id": 19070, "fullname": "Yang Zhou", "url": "http://mlsys.org/api/miniconf/users/19070?format=json", "institution": "UC Berkeley"}, {"id": 11118, "fullname": "Ion Stoica", "url": "http://mlsys.org/api/miniconf/users/11118?format=json", "institution": "UC Berkeley"}, {"id": 27571, "fullname": "Sewon Min", "url": "http://mlsys.org/api/miniconf/users/27571?format=json", "institution": "University of California, Berkeley"}, {"id": 21014, "fullname": "Matei Zaharia", "url": "http://mlsys.org/api/miniconf/users/21014?format=json", "institution": "University of California, Berkeley"}, {"id": 11239, "fullname": "Joseph Gonzalez", "url": "http://mlsys.org/api/miniconf/users/11239?format=json", "institution": "UC Berkeley"}], "abstract": "Embedding-based vector search underpins many important applications, such as recommendation and retrieval-augmented generation (RAG). It relies on vector indices to enable efficient search. However, these indices require storing high-dimensional embeddings and large index metadata, whose total size can be several times larger than the original data (e.g., text chunks). Such high storage overhead makes it difficult, or even impractical, to deploy vector search on personal devices or large-scale datasets. 
To tackle this problem, we propose LEANN, a storage-efficient index for vector search that recomputes embeddings on the fly instead of storing them, and compresses state-of-the-art proximity graph indices while preserving search accuracy. LEANN delivers high-quality vector search while using only a fraction of the storage (e.g., 5% of the original data) and supporting storage-efficient index construction and updates. On real-world benchmarks, LEANN reduces index size by up to 50\u00d7 compared with conventional indices, while maintaining SOTA accuracy and comparable latency for RAG applications.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3563", "url": null, "sourceid": 59, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=e8Dp5QkFxP", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 857, "modified": "2026-03-23T21:52:45.384589-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=e8Dp5QkFxP", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3579, "uid": "eb160de1de89d9058fcb0b968dbbbd68", "name": "Efficient, VRAM-Constrained xLM Inference on Clients", "authors": [{"id": 27156, "fullname": "Aditya Ukarande", "url": "http://mlsys.org/api/miniconf/users/27156?format=json", "institution": "NVIDIA"}, {"id": 27631, "fullname": "Deep Shekhar", "url": 
"http://mlsys.org/api/miniconf/users/27631?format=json", "institution": "Nvidia"}, {"id": 27632, "fullname": "", "url": "http://mlsys.org/api/miniconf/users/27632?format=json", "institution": null}, {"id": 27157, "fullname": "Ram Rangan", "url": "http://mlsys.org/api/miniconf/users/27157?format=json", "institution": "NVIDIA"}], "abstract": "To usher in the next round of client AI innovation, there is an urgent need to enable efficient, lossless inference of high-accuracy large language models (LLMs) and vision language models (VLMs), jointly referred to as xLMs, on client systems. This means efficient support for: a) interactive use (i.e. batch size 1), b) high resolution VLM inference, c) dense and mixture-of-experts (MoE) LLMs, and d) adapting to system conditions (CPU thread count, CPU-GPU interconnect bandwidth, and VRAM budget) and inference conditions (phase of execution and context size). While recent CPU-GPU hybrid scheduling techniques show promise, to the best of our knowledge, no single product handles all of the above. In this paper, we address this problem with pipelined sharding, a novel, benchmark profile-guided CPU-GPU hybrid scheduling technique to achieve efficient, VRAM-constrained inference for both dense and mixture-of-experts (MoE) LLMs. Using a combination of model sharding at layer or sub-layer levels, CPU offloading, pipelined copy-compute, and prioritized tensor placement in VRAM, it optimizes both time-to-first-token (TTFT) and tokens per second (TPS) metrics, while flexibly adapting to system and inference conditions. For efficient, high-accuracy VLM inference, we combine pipelined sharding with a llama.cpp implementation of three well-understood prior ideas (jointly called VLMOpt), namely, vision tensor CPU offloading, flash attention, and vision and language model VRAM overlap avoidance. 
These enhancements are targeted at improving client xLM inference in future releases of two important NVIDIA products - the In-Game Inferencing (IGI) software development kit (SDK) and the Cosmos-Reason-1 (CR1) physical AI reasoning VLM. Highlights from our rigorous evaluation spanning multiple models and client systems include: time-to-first-token (TTFT) improves by up to 6.7\u00d7 and tokens per second by up to 30\u00d7 for LLMs, and CR1 inference\u2019s VRAM demand is down by 10\u00d7, compared to their respective aggressive baselines.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3579", "url": null, "sourceid": 117, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=VKqQYg6JPb", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 873, "modified": "2026-03-23T21:52:46.042387-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=VKqQYg6JPb", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3583, "uid": "98f13708210194c475687be6106a3b84", "name": "AgenticCache: Cache-Driven Asynchronous Planning for Embodied AI Agents", "authors": [{"id": 25550, "fullname": "Hojoon Kim", "url": "http://mlsys.org/api/miniconf/users/25550?format=json", "institution": "Seoul National University"}, {"id": 27643, "fullname": "Yuheng Wu", "url": "http://mlsys.org/api/miniconf/users/27643?format=json", 
"institution": "Stanford University"}, {"id": 27644, "fullname": "Thierry Tambe", "url": "http://mlsys.org/api/miniconf/users/27644?format=json", "institution": "Stanford University"}], "abstract": "Large language models (LLMs) have recently been integrated into embodied AI agents, yet their synchronous plan-act loop imposes severe latency and cost bottlenecks. We present AgenticCache, a cache-driven asynchronous planning framework that decouples LLM reasoning from real-time execution. By identifying strong plan transition locality in embodied tasks, AgenticCache enables agents to reuse frequently occurring plan fragments and update them asynchronously through a background LLM process. This design converts idle waiting time into productive action while preserving context-aware decision quality. Across four multi-agent embodied benchmarks, AgenticCache improves task success rates by 24.34%, reduces simulation latency by 75%, and lowers token usage by 65% on average.  These results demonstrate that caching and asynchronous reasoning together offer a path toward real-time, low-cost, and cognitively inspired autonomy in LLM-based agents.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3583", "url": null, "sourceid": 20, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=UfABxFoSXH", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 877, "modified": "2026-03-23T21:52:46.202830-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=UfABxFoSXH", "resourcetype": 
"UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}, {"id": 3613, "uid": "8e296a067a37563370ded05f5a3bf3ec", "name": "NexSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems", "authors": [{"id": 27808, "fullname": "qiaoling chen", "url": "http://mlsys.org/api/miniconf/users/27808?format=json", "institution": "Nanyang Technological University"}, {"id": 25950, "fullname": "Zijun Liu", "url": "http://mlsys.org/api/miniconf/users/25950?format=json", "institution": "Tsinghua University"}, {"id": 27809, "fullname": "Peng Sun", "url": "http://mlsys.org/api/miniconf/users/27809?format=json", "institution": "Harbin Institute of Technology"}, {"id": 27810, "fullname": "Shenggui Li", "url": "http://mlsys.org/api/miniconf/users/27810?format=json", "institution": "Nanyang Technological University"}, {"id": 27811, "fullname": "Guoteng Wang", "url": "http://mlsys.org/api/miniconf/users/27811?format=json", "institution": "Shanghai Artificial Intelligence Laboratory"}, {"id": 17655, "fullname": "Ziming Liu", "url": "http://mlsys.org/api/miniconf/users/17655?format=json", "institution": "National University of Singapore"}, {"id": 20910, "fullname": "Yonggang Wen", "url": "http://mlsys.org/api/miniconf/users/20910?format=json", "institution": "Nanyang Technological University"}, {"id": 27812, "fullname": "Siyuan Feng", "url": "http://mlsys.org/api/miniconf/users/27812?format=json", "institution": "Shanghai Innovation Institute"}, {"id": 20894, "fullname": "Tianwei Zhang", "url": "http://mlsys.org/api/miniconf/users/20894?format=json", "institution": "Nanyang Technological University"}], "abstract": "Adapting large language models (LLMs) via reinforcement learning (RL) is often bottlenecked by the generation stage, which can consume over 75% of the training time. 
Speculative decoding (SD) accelerates autoregressive generation in serving systems, but its behavior under RL training remains largely unexplored. We identify three critical gaps that hinder the na\u00efve integration of SD into RL systems: diminishing speedups at large batch sizes, drafter staleness under continual actor updates, and drafter-induced policy degradation.   To address these gaps, we present NexSpec, a system that adapts SD to RL through three complementary mechanisms: dynamically tuning SD configurations, evolving the drafter via knowledge distillation, and weighting updates by rollout rewards. On Qwen models (3B\u201314B), NexSpec achieves up to 4.5x speedup while preserving reward convergence and training stability, providing a practical solution for efficient RL-based LLM adaptation.", "topic": null, "keywords": [], "decision": "Conditional Accept", "session": "", "eventtype": "Poster", "event_type": "Poster", "room_name": null, "virtualsite_url": "/virtual/2026/poster/3613", "url": null, "sourceid": 25, "sourceurl": "https://openreview.net/group?id=MLSys.org/2026/Conference", "starttime": null, "endtime": null, "starttime2": null, "endtime2": null, "diversity_event": null, "paper_url": "https://openreview.net/forum?id=HhDSxs7x2R", "paper_pdf_url": null, "children_url": null, "children": [], "children_ids": [], "parent": null, "parent_id": null, "eventmedia": [{"id": 907, "modified": "2026-03-23T21:52:47.330110-07:00", "display_section": 1, "type": "URL", "name": "OpenReview", "visible": true, "sortkey": 0, "is_live_content": false, "uri": "https://openreview.net/forum?id=HhDSxs7x2R", "resourcetype": "UriEventmedia"}], "show_in_schedule_overview": false, "visible": true, "poster_position": null, "schedule_html": "", "latitude": null, "longitude": null, "related_events": [], "related_events_ids": []}]}