Session
Poster Session 2
Evergreen Ballroom
A Lightweight High-Throughput Collective-Capable NoC for Large-Scale ML Accelerators
Luca Colagrande ⋅ Lorenzo Leone ⋅ Chen Wu ⋅ Luca Benini
The exponential increase in Machine Learning (ML) model size and complexity has driven unprecedented demand for high-performance acceleration systems. As technology scaling enables the integration of thousands of computing elements onto a single die, the boundary between distributed and on-chip systems has blurred, making efficient on-chip collective communication increasingly critical. In this work, we present a lightweight, collective-capable Network on Chip (NoC) that supports efficient barrier synchronization alongside scalable, high-bandwidth multicast and reduction operations, co-designed for the next generation of ML accelerators. We introduce Direct Compute Access (DCA), a novel paradigm that grants the interconnect fabric direct access to the cores’ computational resources, enabling high-throughput in-network reductions with a small 16.5% router area overhead. Through in-network hardware acceleration, we achieve 2.9× and 2.5× geomean speedups on multicast and reduction operations involving between 1 and 32 KiB of data, respectively. Furthermore, by keeping communication off the critical path in GEMM workloads, these features allow our architecture to scale efficiently to large meshes, resulting in up to 2.1× and 2.1× estimated performance gains through multicast and reduction support, respectively, compared to a baseline unicast NoC architecture.
Automated Algorithm Design for Auto-Tuning Optimizers
Floris-Jan Willemsen ⋅ Niki van Stein ⋅ Ben van Werkhoven
Automatic performance tuning (auto-tuning) is essential for optimizing high-performance applications, where vast and irregular search spaces make manual exploration infeasible. While auto-tuners traditionally rely on classical approaches such as evolutionary, annealing, or surrogate-based optimizers, designing algorithms that efficiently find near-optimal configurations robustly across diverse tasks is challenging. We propose a new paradigm: using large language models (LLMs) to automatically generate optimization algorithms tailored to auto-tuning problems. We introduce a framework that prompts LLMs with problem descriptions and search space characteristics to synthesize, test, and iteratively refine specialized optimizers. These generated algorithms are evaluated on four real-world auto-tuning applications across six hardware platforms and compared against the state-of-the-art in two contemporary auto-tuning frameworks. The evaluation demonstrates that providing additional application- and search space-specific information in the generation stage results in an average performance improvement of 30.7% and 14.6%, respectively. In addition, our results show that LLM-generated optimizers can rival, and in various cases outperform, existing human-designed algorithms, with our best-performing generated optimization algorithms achieving an average 72.4% improvement over state-of-the-art optimizers for auto-tuning.
BEAM: Joint Resource–Power Optimization for Energy-Efficient LLM Inference under SLO Constraints
Hyunjae Lee ⋅ Sangjin Choi ⋅ Seungjae Lim ⋅ Youngjin Kwon
Large Language Model (LLM) serving is rapidly becoming one of the most power-intensive workloads in modern datacenters. Unlike training, where throughput dominates, inference must satisfy strict per-request latency targets such as Time-to-First-Token (TTFT) and Time-Between-Tokens (TBT). Once an SLO is met, the remaining latency slack between the earliest possible completion and the deadline offers an opportunity for energy savings. Existing systems, however, exploit only one dimension of this trade-off: batching improves resource efficiency, while DVFS improves power efficiency. These two axes are tightly coupled, and optimizing one while fixing the other yields only a local optimum. We present BEAM, a fine-grained controller that dynamically co-optimizes resource and power efficiency under per-request SLOs. BEAM continuously allocates the available latency slack across both dimensions by jointly tuning GPU frequency, chunk size, and microbatch count in real time. Its event-driven design responds instantly to request arrivals and completions, while a lightweight predictive model enables sub-millisecond decision making with negligible overhead. Implemented atop the vLLM runtime, BEAM reduces end-to-end GPU energy consumption by up to 51% compared to vLLM.
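The power dimension of BEAM's trade-off (spending latency slack on a lower GPU frequency) can be illustrated with a minimal sketch. The `pick_frequency` helper and its latency predictor are hypothetical names of ours, not from the paper; the real controller also co-tunes chunk size and microbatch count.

```python
def pick_frequency(freqs, predict_tbt, slo_tbt):
    """Pick the lowest GPU frequency whose predicted token latency still
    meets the TBT SLO; lower frequency converts latency slack into energy
    savings. Falls back to the highest frequency if no setting meets the SLO."""
    for f in sorted(freqs):
        if predict_tbt(f) <= slo_tbt:
            return f
    return max(freqs)

# Toy latency model: per-token latency inversely proportional to frequency.
chosen = pick_frequency([600, 900, 1200], lambda f: 60000 / f, slo_tbt=70)
```

With the toy model above, 900 MHz is the lowest setting meeting a 70 ms TBT target, so the slack at 1200 MHz is spent on power savings instead.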
Beat the long tail: Distribution-Aware Speculative Decoding for RL Training
Zelei Shao ⋅ Vikranth Srivatsa ⋅ Junxiong Wang ⋅ Chenfeng Xu ⋅ Xiaoxia Wu ⋅ Qingyang Wu ⋅ Jue Wang ⋅ Ameen Patel ⋅ Yiying Zhang ⋅ Percy Liang ⋅ Tri Dao ⋅ Ben Athiwaratkun ⋅ Ce Zhang
Reinforcement learning (RL) post-training has become essential for aligning large language models (LLMs), yet its efficiency is increasingly constrained by the rollout phase, where long trajectories are generated token by token. We identify a major bottleneck—the long-tail distribution of rollout lengths, where a small fraction of long generations dominates wall-clock time—and a complementary opportunity—the availability of historical rollouts that reveal stable prompt-level patterns across training epochs. Motivated by these observations, we propose DAS, a Distribution-Aware Speculative decoding framework that accelerates RL rollouts without altering model outputs. DAS integrates two key ideas: a self-evolving, nonparametric drafter built from recent rollouts using an incrementally maintained suffix tree, and a length-aware speculation policy that allocates more aggressive draft budgets to the long trajectories that dominate makespan. This design exploits rollout history to sustain acceptance while balancing base and token-level costs during decoding. Experiments on math and code reasoning tasks show that DAS reduces rollout time by over 30% while preserving identical training curves, demonstrating that distribution-aware speculative decoding can significantly accelerate RL post-training without compromising learning quality.
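The nonparametric drafter idea can be sketched in a few lines. This toy `HistoryDrafter` (our illustrative name) uses a brute-force suffix match over concatenated past rollouts rather than the incrementally maintained suffix tree described above: find the longest suffix of the current context that occurred before, then propose the tokens that followed it as the draft.

```python
class HistoryDrafter:
    """Nonparametric drafter sketch: propose draft tokens by matching the
    current context suffix against tokens from past rollouts."""

    def __init__(self, max_match=8):
        self.history = []          # concatenated tokens from past rollouts
        self.max_match = max_match # longest suffix length to try

    def add_rollout(self, tokens):
        self.history.extend(tokens)

    def draft(self, context, k):
        # Try progressively shorter suffixes of the context; on the first
        # match in history, copy the k tokens that followed it as the draft.
        for m in range(min(self.max_match, len(context)), 0, -1):
            pat = context[-m:]
            for i in range(len(self.history) - m + 1):
                if self.history[i:i + m] == pat:
                    return self.history[i + m:i + m + k]
        return []  # no match: fall back to plain decoding
```

A length-aware policy would then pass a larger `k` for trajectories predicted to be long, since those dominate the rollout makespan.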
BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models
Zhengyang Wang ⋅ Ziyue Liu ⋅ Ruijie Zhang ⋅ Avinash Maurya ⋅ Paul Hovland ⋅ Zheng Zhang
The scale of transformer model pre-training is constrained by increasing computation and communication costs. Low-rank bottleneck architectures offer a promising solution, significantly reducing training time and memory footprint with minimal impact on accuracy. Despite their algorithmic efficiency, bottleneck architectures scale poorly under standard tensor parallelism: simply applying 3D parallelism designed for full-rank methods leads to excessive communication and poor GPU utilization. To address this limitation, we propose BOOST, an efficient training framework tailored for large-scale low-rank bottleneck architectures. BOOST introduces a novel Bottleneck-aware Tensor Parallelism and combines optimizations such as online RMSNorm, linear-layer grouping, and low-rank activation checkpointing to achieve end-to-end training speedup. Evaluations on different low-rank bottleneck architectures demonstrate that BOOST achieves 1.46–1.91× speedup over full-rank model baselines and 1.87–2.27× speedup over low-rank models with naively integrated 3D parallelism, with improved GPU utilization and reduced communication overhead.
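Why low-rank bottlenecks cut compute is visible from a FLOP count: factorizing a weight W (n×k) as A (n×r) times B (r×k), with r much smaller than n and k, replaces one large matmul with two thin ones. The helper names below are ours, for illustration only.

```python
def dense_flops(m, n, k):
    """FLOPs for a dense matmul of an m-by-n activation with an n-by-k weight."""
    return 2 * m * n * k

def lowrank_flops(m, n, k, r):
    """FLOPs when the weight is factored as (n-by-r) @ (r-by-k):
    two thin matmuls through an r-dimensional bottleneck."""
    return 2 * m * n * r + 2 * m * r * k
```

For n = k = 4096 and rank r = 256, the low-rank path needs roughly an eighth of the dense FLOPs per token, which is the saving the 3D-parallel system must preserve.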
BOute: Cost-Efficient LLM Serving with Heterogeneous LLMs and GPUs via Multi-Objective Bayesian Optimization
Youhe Jiang ⋅ Fangcheng Fu ⋅ Eiko Yoneki
The rapid growth of large language model (LLM) deployments has made cost-efficient serving systems essential. Recent efforts to enhance system cost-efficiency adopt two main perspectives: (i) an algorithmic perspective that exploits heterogeneous model capabilities to route simpler queries to lower-cost models and complex queries to higher-cost models (i.e., heterogeneous query routing); and (ii) a systems perspective that utilizes heterogeneous GPU resources as cost-effective alternatives to homogeneous high-end GPUs (i.e., heterogeneous model deployment). However, algorithm-system co-design for cost-efficient LLM serving necessitates sophisticated management: (i) determining optimal query routing strategies under latency and quality requirements, (ii) configuring model deployment across heterogeneous GPUs with appropriate resource allocation and parallelism strategies, and (iii) co-optimizing routing and deployment decisions to maximize overall system performance. To address these challenges, we present BOute, a quality-aware scheduling system that jointly exploits heterogeneous model and GPU capabilities for cost-efficient LLM serving. BOute employs a multi-objective Bayesian optimization (MOBO) framework to co-optimize the routing strategy and model deployment, thereby maximizing the cost-efficiency of the serving system while guaranteeing response quality. Evaluation results demonstrate that BOute outperforms state-of-the-art LLM serving systems by up to 157% (59% on average) under identical cost budgets and quality requirements, or reduces serving costs by 15%-61% (38% on average) while maintaining the same performance targets, validating its effectiveness in achieving cost-efficient LLM serving.
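The multi-objective view underlying MOBO can be illustrated with a plain Pareto-front filter over candidate (cost, quality) configurations. This sketch is our own illustration, not BOute's code: it only shows which joint routing-plus-deployment candidates are worth keeping, not how Bayesian optimization proposes them.

```python
def pareto_front(configs):
    """Keep configurations that are not dominated in (lower cost, higher
    quality). A config is dominated if another is at least as good in both
    objectives and strictly better in one."""
    def dominates(o, c):
        return (o["cost"] <= c["cost"] and o["quality"] >= c["quality"]
                and (o["cost"] < c["cost"] or o["quality"] > c["quality"]))
    return [c for c in configs if not any(dominates(o, c) for o in configs)]

candidates = [
    {"cost": 1, "quality": 0.5},   # cheap, modest quality: on the front
    {"cost": 2, "quality": 0.9},   # pricier but best quality: on the front
    {"cost": 2, "quality": 0.5},   # dominated by the second candidate
]
front = pareto_front(candidates)
```

A MOBO loop would fit surrogate models to both objectives and sample new candidates expected to expand this front.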
Breaking the Ice: Analyzing Cold Start Latency in vLLM
Huzaifa Shaaban Kabakibo ⋅ Animesh Trivedi ⋅ Lin Wang
As scalable inference services become popular, the cold start latency of an inference engine becomes important. Today, vLLM has evolved into the de-facto inference engine of choice for many inference workloads. Despite this popularity, its complexity and rapid evolution mean there has been no systematic study of its engine's startup latency. Given the major architectural changes underneath it (e.g., the V1 API and the introduction of torch.compile), we present the first detailed performance characterization of vLLM startup latency. We break down the startup process into six foundational steps and demonstrate that this process is predominantly CPU-bound. Each step exhibits consistent and interpretable scaling trends with respect to model- and system-level parameters, enabling fine-grained attribution of latency sources. Building on these insights, we develop a lightweight analytical model that accurately predicts vLLM's startup latency for a given hardware configuration, providing actionable guidance for serverless scheduling and resource planning in large-scale inference environments.
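An analytical model of this shape can be pictured as a per-step linear fit: each startup step gets a fixed cost plus a size-dependent term, and the total is their sum. The step names and coefficients below are placeholders of ours for illustration, not the paper's measured values.

```python
# Hypothetical (fixed_seconds, seconds_per_GB_of_weights) per startup step.
STEP_COEFFS = {
    "python_imports":  (1.0, 0.0),
    "config_parse":    (0.2, 0.0),
    "load_weights":    (0.5, 1.2),
    "torch_compile":   (2.0, 0.4),
    "kv_cache_init":   (0.1, 0.3),
    "warmup":          (0.5, 0.1),
}

def predict_startup(model_gb):
    """Predict total cold-start latency as the sum of per-step linear costs."""
    return sum(fixed + per_gb * model_gb for fixed, per_gb in STEP_COEFFS.values())
```

Fitting the two coefficients per step from a handful of profiled runs is what makes such a model cheap enough for serverless schedulers to consult online.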
CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training
Soroush Tabesh ⋅ Andrei Panferov
Despite significant work on low-bit quantization-aware training (QAT), there is still an accuracy gap between such techniques and native training. To address this, we introduce CAGE (Curvature-Aware Gradient Estimation), a new QAT method that augments the straight-through estimator (STE) gradient with a curvature-aware correction designed to counteract the loss increase induced by quantization. CAGE is derived from a multi-objective view of QAT that balances loss minimization with adherence to quantization constraints, yielding a principled correction term that depends on local curvature information. On the theoretical side, we introduce the notion of Pareto-optimal solutions for quantized optimization, and establish that CAGE yields strong convergence guarantees in the smooth non-convex setting. In terms of implementation, our approach is optimizer-agnostic, but we provide a highly-efficient implementation that leverages Adam statistics. CAGE significantly improves upon the prior state-of-the-art methods in terms of accuracy, for similar computational cost: for QAT fine-tuning, it halves the compression accuracy loss relative to the prior best method, while for QAT pre-training of Llama models, its accuracy for 3-bit weights-and-activations (W3A3) matches that of 4-bit training (W4A4) with the prior best method (QuEST).
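The shape of CAGE's correction can be sketched in a few lines: the STE gradient is augmented by a term proportional to local curvature (proxied here by Adam-style second-moment statistics) times the quantization error. This is our simplified reading with a uniform grid quantizer and hypothetical names; the paper's derivation is more general.

```python
def quantize(w, step=0.1):
    """Uniform grid quantizer (round-to-nearest), as a stand-in for a
    low-bit quantizer."""
    return [round(x / step) * step for x in w]

def cage_grad(grad, w, second_moment, lam=0.5, eps=1e-8):
    """STE gradient plus a curvature-aware correction: curvature is proxied
    by sqrt of the Adam second moment, and the correction pulls weights
    toward their quantization grid points."""
    q = quantize(w)
    return [g + lam * ((v ** 0.5) + eps) * (x - qx)
            for g, x, qx, v in zip(grad, w, q, second_moment)]
```

When a weight already sits on the grid the correction vanishes and the update reduces to plain STE; off-grid weights receive an extra curvature-scaled pull toward a quantized solution.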
CRAFT: Cost-aware Expert Replica Allocation with Fine-Grained Layerwise Estimations
Adrian Zhao ⋅ Lingfan Yu ⋅ Haozheng Fan ⋅ Jun Wu ⋅ Yida Wang ⋅ Nandita Vijaykumar
Mixture-of-Experts (MoE) has recently emerged as the mainstream architecture for efficiently scaling large language models while maintaining near-constant computational cost. Expert parallelism distributes parameters by partitioning experts across devices, but this introduces token-level load imbalance during inference. Expert replication is a widely adopted load-balancing technique in serving frameworks that alleviates load imbalance in large-scale deployments by replicating experts with high loads. In this work, we demonstrate that existing replication schemes often over-replicate, with many replicas providing marginal improvement. Replicas consume substantial GPU memory, which may lead to resource contention and throughput degradation. We present CRAFT, an efficient expert replication framework that maximizes load balance under a given memory budget by performing fine-grained, per-layer replication based on the estimated replication benefit. CRAFT can be seamlessly integrated into existing serving frameworks without any additional training or model changes. Our evaluation shows that CRAFT increases end-to-end serving throughput by 1.14× on average (up to 1.2×) over existing replication techniques in large-scale deployments with models ranging from hundreds of billions to a trillion parameters.
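A greedy version of budgeted, per-layer replica allocation conveys the idea: each replica goes where it most reduces the hottest expert's per-replica load, until the memory budget is spent. This sketch is our own illustration with hypothetical names, assuming tokens split uniformly across an expert's replicas; CRAFT's benefit estimator is more refined.

```python
def allocate_replicas(layer_loads, budget):
    """Greedy per-layer replica allocation: each step adds one replica
    where it most reduces that layer's max per-replica load, stopping when
    the replica budget (a proxy for the memory budget) runs out."""
    replicas = [[1] * len(loads) for loads in layer_loads]
    for _ in range(budget):
        best, best_gain = None, 0.0
        for li, loads in enumerate(layer_loads):
            per = [l / r for l, r in zip(loads, replicas[li])]
            cur = max(per)
            ei = per.index(cur)
            # Gain: drop in the hottest expert's per-replica load.
            gain = cur - loads[ei] / (replicas[li][ei] + 1)
            if gain > best_gain:
                best, best_gain = (li, ei), gain
        if best is None:
            break  # no replica yields any improvement
        li, ei = best
        replicas[li][ei] += 1
    return replicas
```

Ranking layers against each other, rather than replicating uniformly per layer, is what lets a small budget go to the layers where imbalance actually hurts.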
Mixture-of-Experts (MoEs) enable massive model sizes but suffer from serving overheads compared to dense models with the same per-token compute costs. This MoE tax varies with the model architecture, inference phase, and parallelism strategy. We comprehensively study the tax for different MoE models, finding that they perform 2-3x worse than equivalent dense models. Through microbenchmarks, we analyze and categorize the underlying tax sources and show how they manifest differently under different configurations. Our key result is that prefill and decode phases incur vastly different taxes; counterintuitively, factors like load imbalance, which harm prefill, can sometimes benefit decode. To gain deeper intuition, we propose a balls-bins-buckets performance model and study recent MoE developments like fine-grained experts and data parallel attention. We conclude by discussing existing and new techniques to reduce the MoE tax and their associated trade-offs.
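The balls-bins intuition for decode-time load imbalance can be reproduced with a tiny Monte Carlo experiment (our sketch, hypothetical names): with only a few tokens per expert, the busiest expert carries several times the average load, and the gap shrinks as the number of routed tokens grows.

```python
import random

def max_load_factor(tokens, experts, trials=200, seed=0):
    """Balls-into-bins estimate of expert load imbalance: the ratio of the
    busiest expert's load to the average load, averaged over random trials
    with uniform routing."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(trials):
        bins = [0] * experts
        for _ in range(tokens):
            bins[rng.randrange(experts)] += 1
        ratios.append(max(bins) / (tokens / experts))
    return sum(ratios) / trials
```

Small decode batches behave like the few-balls regime (high imbalance), while large prefill batches approach the mean, which is one way the prefill and decode phases end up paying different taxes.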
DreamDDP: Accelerating Low-Bandwidth Geo-Distributed LLM Training with Layer-wise Partial Synchronization
Zhenheng Tang ⋅ Zichen Tang ⋅ Junlin Huang ⋅ Xinglin Pan ⋅ Rudan Yan ⋅ Yuxin Wang ⋅ Shaohuai Shi ⋅ Xiaowen Chu
Scaling up large language model (LLM) training along both compute and data dimensions motivates distributed training across geo-distributed data centers. Communication in geo-distributed data-parallel training (DDP) with stochastic gradient descent (S-SGD) is the main bottleneck in low-bandwidth environments. Recent studies have successfully applied Local SGD to mitigate the communication overhead and pre-train LLMs in geo-distributed settings. However, we identify that the strict model synchronization mechanism in Local SGD prevents communication from being overlapped with computation. To overcome this limitation, we expand the design space of Local SGD by decoupling model synchronization layer-wise: in each iteration, only a subset of layers is synchronized, instead of the entire model after a fixed number of iterations. Leveraging this methodology, we introduce DreamDDP, a training framework that accelerates low-bandwidth distributed training with three key innovations: (1) partial Local SGD with theoretical assurances of convergence rates comparable to S-SGD; (2) overlapping parameter synchronization with computation without extra GPU memory occupation; and (3) identifying and exploiting three properties to schedule communication and computation based on fine-grained layer-wise profiling, reducing training time. Empirical evaluations conducted on 32 GPUs using prominent deep learning models, including ResNet-18, ResNet-50, GPT-2, and Llama-2, demonstrate that DreamDDP enhances the convergence properties of Local SGD (and Adam) and achieves speedups ranging from 1.49× to 3.91× over leading baseline methods.
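The layer-wise decoupling can be sketched as a static round-robin schedule: each step synchronizes one group of layers, so every layer is synchronized once every `groups` steps while the remaining groups' communication can overlap with compute. The helper below is our illustrative simplification of the profiling-driven scheduler described above.

```python
def partial_sync_schedule(num_layers, groups, steps):
    """Round-robin partial synchronization: step t syncs layer group
    t mod `groups`, so the whole model is covered every `groups` steps."""
    chunk = (num_layers + groups - 1) // groups  # ceil division
    sched = []
    for t in range(steps):
        g = t % groups
        sched.append(list(range(g * chunk, min((g + 1) * chunk, num_layers))))
    return sched
```

DreamDDP's actual scheduler additionally weights groups by per-layer communication and computation profiles rather than splitting layers evenly.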
Efficient Long-Context Language Model Training by Core Attention Disaggregation
Yonghao Zhuang ⋅ Junda Chen ⋅ Yi Gu ⋅ Yibo Zhu ⋅ Yimin Jiang ⋅ Ion Stoica ⋅ Hao Zhang ⋅ Eric Xing
We present core attention disaggregation (CAD), a technique that improves long-context LLM training by disaggregating the core attention (CA) -- the parameter-free softmax(QK^T)V computation -- and scheduling it on an independent pool of resources. Existing systems co-locate core attention with other components. At long context, the quadratic growth of CA computation and the near-linear growth of the rest create load imbalance -- and hence stragglers -- across data and pipeline groups. CAD is enabled by two key observations: (i) statelessness: CA has no trainable parameters and minimal transient state, so balancing reduces to scheduling compute-bound tasks; and (ii) composability: modern attention kernels sustain high utilization on fused batches of arbitrary-length token-level shards. CAD dynamically partitions the core attention computation into token-level tasks (CA-tasks) and dispatches them to a pool of devices specialized for CA computation (attention servers). It then rebatches CA-tasks to equalize CA compute across attention servers without loss of kernel efficiency. We have implemented CAD in a system called DistCA, with a ping-pong scheme to completely overlap communication with compute and in-place attention servers to improve memory utilization. Scaling to 512 H200 GPUs and 512K context length, DistCA eliminates DP/PP stragglers, achieves near-perfect compute and memory balance, and improves end-to-end training throughput by up to 1.9× over Megatron-LM and 1.35× over existing load-balancing methods.
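The rebatching step can be approximated by classic longest-processing-time-first scheduling, with each CA-task's cost modeled as quadratic in its sequence length. This is a minimal sketch with our own names; DistCA's partitioner additionally splits sequences into token-level shards rather than assigning whole sequences.

```python
def balance_ca_tasks(seq_lens, num_servers):
    """Greedy LPT assignment of core-attention tasks to attention servers.
    Cost model: attention compute grows quadratically with sequence length."""
    tasks = sorted(((l * l, l) for l in seq_lens), reverse=True)
    loads = [0] * num_servers
    assign = [[] for _ in range(num_servers)]
    for cost, l in tasks:
        i = loads.index(min(loads))   # least-loaded server takes the task
        loads[i] += cost
        assign[i].append(l)
    return assign, loads
```

Because CA is stateless, such reassignment carries no parameter movement: only activations for the assigned shards travel to the attention servers.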
FaaScale: Unlocking Fast LLM Scaling for Serverless Inference
Minchen Yu ⋅ Rui Yang ⋅ Zhaoyuan Su ⋅ Sheng Yao ⋅ Tingfeng Lan ⋅ Zirui Wang ⋅ Yue Cheng ⋅ Wei Wang ⋅ Ruichuan Chen
Serverless computing is an attractive paradigm for cloud-based large language model (LLM) inference, but scaling LLMs on demand remains a major challenge due to high data transfer cost. We present FaaScale, a serverless LLM system that enables fast and resource-efficient model scaling. The key idea is a co-design principle—pipelined multicast inference—which synergizes multicast with dynamic, cross-node pipeline-parallel execution during model transfer. FaaScale implements this design through PipeCast, a model scaling scheme that adaptively multicasts model blocks and dynamically forms inference pipelines on the fly. Coupled with efficient memory management across GPU and host memory, FaaScale handles bursty LLM inference workloads effectively, achieving up to 5× lower tail time-to-first-token latency and 31.3% cost reduction on real-world LLM traces.
FarSkip-Collectives: Unhobbling Blocking Communication in Mixture of Experts Models
Yonatan Dukler ⋅ Vikram Appia ⋅ Emad Barsoum
Blocking communication presents a major hurdle in running MoEs efficiently in distributed settings. To address this, we present FarSkip-Collective, which modifies the architecture of modern models to enable overlapping their computation with communication. Our approach introduces skip connections into the model, and it is unclear a priori whether the modified architecture can remain equally capable, especially for large state-of-the-art models and when modifying all of the model's layers. We answer this question in the affirmative, fully converting a series of state-of-the-art models ranging from 16B to 109B parameters to enable communication overlap while achieving accuracy on par with their original open-source releases. For example, we convert Llama 4 Scout (109B) via self-distillation and achieve average accuracy within 1% of its instruction-tuned release across a wide range of downstream evaluations. Beyond demonstrating the retained accuracy of the large modified models, we realize the benefits of FarSkip-Collective through optimized implementations that explicitly overlap communication with computation, accelerating both training and inference in existing frameworks. For inference, we demonstrate an 18.5% speed-up in Time To First Token when serving Llama 4 Scout with expert parallelism in vLLM and achieve 97.6% communication-computation overlap during the prefill stage. During training, our approach enables 88.9% overlap of the all-to-all communication collectives when pre-training DeepSeek-V3 MoE layers with expert parallelism.
FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management
Nazmul Takbir ⋅ Hamidreza Koshkak ⋅ Nikil Dutt ⋅ Sangeetha Abdu Jyothi
Large Language Model (LLM) serving is increasingly constrained by the growing size of the key-value (KV) cache, which scales with both context length and generation length. Prior work shows that attention is dominated by a small subset of critical tokens, yet existing systems struggle to exploit this efficiently without degrading accuracy, especially during long generations. We make a key observation: the temporal stability of these critical tokens varies significantly across KV heads. Some heads consistently focus on the same tokens, while others shift frequently. Building on this insight, we introduce FlexiCache, a hierarchical KV-cache management system that leverages the temporal stability of KV heads to reduce GPU memory usage and computation overhead while preserving model accuracy. FlexiCache classifies KV heads as stable or unstable: it retains all KV-cache pages from unstable heads in GPU memory, whereas for stable heads it keeps only the top-K pages on the GPU and offloads the rest to host memory. By exploiting temporal stability, FlexiCache performs periodic reranking for stable heads to fetch newly promoted top pages. Implemented atop vLLM, FlexiCache reduces the GPU memory footprint for long-context requests by up to 70%, improves offline serving throughput by 1.38–1.55×, and lowers online token latency by 1.6–2.1×, all while maintaining accuracy in long-context, long-generation scenarios.
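Head classification can be sketched with a stability score: the Jaccard overlap of each head's top-K critical-token sets across consecutive decode steps. This is an illustrative metric of ours; FlexiCache's actual criterion and thresholds may differ.

```python
def classify_heads(topk_history, threshold=0.7):
    """Label each KV head stable/unstable by the average Jaccard overlap of
    its top-K token sets between consecutive decode steps."""
    labels = {}
    for head, steps in topk_history.items():
        overlaps = []
        for a, b in zip(steps, steps[1:]):
            sa, sb = set(a), set(b)
            overlaps.append(len(sa & sb) / len(sa | sb))
        stability = sum(overlaps) / len(overlaps)
        labels[head] = "stable" if stability >= threshold else "unstable"
    return labels
```

Stable heads can then safely keep only their top-K pages on the GPU, since their critical set rarely changes between reranking intervals.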
FlexTrain: Scalable Hybrid-Parallel Training with Elastic Resource Utilization and Consistent Accuracy
Weilin Cai ⋅ Diandian Gu ⋅ Jun Wang ⋅ Jiayi Huang
Large language model (LLM) training has become a critical workload in shared GPU clusters. However, our observations reveal that these clusters suffer from significant underutilization. To address this inefficiency, various elastic training techniques have been developed to dynamically adjust GPU allocations to harness idle resources. Despite their potential, these methods have seen limited deployment in production environments due to three major challenges: accuracy inconsistency, excessive profiling overhead, and limited flexibility. In this paper, we propose FlexTrain, an elastic training system that achieves consistent model accuracy, high training efficiency, and effective resource utilization. FlexTrain prioritizes adjustments to the pipeline parallelism (PP) degree to preserve deterministic computation and maintain accuracy consistency, while also supporting data parallelism (DP) scaling to further enhance throughput under relaxed consistency requirements. It generates optimal PP schedules, predicts training performance under different configurations, and makes scaling decisions based on job submission intervals, scaling overhead, and expected throughput gains. Evaluation results show that FlexTrain can achieve up to 1.73x speedup for elastic jobs while preserving consistent accuracy, and up to 2.27x when accuracy consistency is relaxed, compared to CompanyX's current scheduling strategy.
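The core of a scaling decision reduces to comparing the one-off rescaling overhead against the time saved at the new throughput over the remaining work. The back-of-the-envelope helper below uses hypothetical names and is far simpler than FlexTrain's predictor, which also accounts for job submission intervals and PP/DP schedule changes.

```python
def should_rescale(cur_tput, new_tput, overhead_s, remaining_work):
    """Rescale only if finishing at the new throughput, after paying the
    one-off scaling overhead, beats staying at the current throughput."""
    t_stay = remaining_work / cur_tput
    t_move = overhead_s + remaining_work / new_tput
    return t_move < t_stay
```

Near the end of a job the same extra GPUs are often not worth taking, since the overhead can no longer be amortized over the remaining work.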
FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error
Fengjuan Wang ⋅ Zhiyi Su ⋅ Sun Mou
Training large Mixture-of-Experts (MoE) models remains computationally prohibitive due to their extreme compute and memory demands. Although low-precision training promises to accelerate computation and reduce memory footprint, existing implementations still rely on BF16-dominated dataflows with frequent quantize–dequantize (Q/DQ) conversions. These redundant casts erode much of FP8's theoretical efficiency. However, naively removing these casts by keeping dataflows entirely in FP8 introduces double quantization error: tensors quantized along different dimensions accumulate inconsistent scaling factors, degrading numerical stability. We propose FP8-Flow-MoE, an FP8 training recipe featuring a quantization-consistent FP8-centric dataflow with a scaling-aware transpose and fused FP8 operators that streamline computation and reduce explicit cast operations from 12 to 2. Evaluations on a 671B-parameter MoE model demonstrate up to 21% higher throughput and 16.5 GB lower memory usage per GPU compared to BF16 and naïve FP8 baselines, while maintaining stable convergence. We provide a plug-and-play FP8 recipe compatible with TransformerEngine and Megatron-LM, which will be open-sourced after the camera-ready release of this paper.
GhostServe: A Lightweight Checkpointing System in the Shadow for Fault-Tolerant LLM Serving
Shakya Jayakody ⋅ Youpeng Zhao ⋅ Chinmay Dhanraj Nehate ⋅ Jun Wang
The rise of million-token, agent-based applications has placed unprecedented demands on large language model (LLM) inference services. The long-running nature of these tasks increases their susceptibility to hardware and software faults, leading to costly job failures, wasted resources, and degraded user experience. The stateful key-value (KV) cache, which grows with the sequence length, presents a central challenge as it is a critical and vulnerable component in distributed serving systems. In this work, we propose GhostServe, a novel checkpointing solution to facilitate fault-tolerant LLM serving. Specifically, GhostServe protects the streaming KV cache "in the shadow" by applying erasure coding to generate and store the parity shards in host memory. In the event of device failures, GhostServe enables fast reconstruction of the lost KV cache, allowing the inference process to resume seamlessly without costly full recomputation or state replication. Evaluations demonstrate that GhostServe reduces checkpointing latency by up to 2.7× and recovery latency by 2.1× over existing methods, paving the way for reliable and high-availability LLM serving at scale.
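The erasure-coding idea can be demonstrated with the simplest possible code, single-parity XOR across equal-length KV shards: losing any one shard is recoverable from the survivors plus the parity held in host memory. GhostServe's actual coding scheme and shard layout are more general; the names below are ours.

```python
def make_parity(shards):
    """XOR all equal-length shards byte-wise into one parity shard."""
    parity = bytearray(len(shards[0]))
    for shard in shards:
        for i, b in enumerate(shard):
            parity[i] ^= b
    return bytes(parity)

def reconstruct(surviving, parity):
    """Rebuild the single lost shard: XOR of parity with all survivors."""
    lost = bytearray(parity)
    for shard in surviving:
        for i, b in enumerate(shard):
            lost[i] ^= b
    return bytes(lost)
```

Because parity is updated incrementally as the KV cache streams in, protection stays off the inference critical path.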
GriNNder: Breaking the Memory Capacity Wall in Full-Graph GNN Training with Storage Offloading
Jaeyong Song ⋅ Seongyeon Park ⋅ Hongsun Jang ⋅ Jaewon Jung ⋅ Hunseong Lim ⋅ Junguk Hong ⋅ Jinho Lee
Full-graph training of graph neural networks (GNNs) is widely used, as it enables direct validation of algorithmic improvements by preserving complete neighborhood information. However, it typically requires multiple GPUs or servers, incurring substantial hardware and inter-device communication costs. While existing single-server methods reduce infrastructure requirements, they remain constrained by GPU and host memory capacity as graph sizes increase. To address this limitation, we introduce GriNNder, the first work to leverage storage devices to enable full-graph training even with limited memory. Because modern NVMe SSDs offer multi-terabyte capacities and bandwidths exceeding 10 GB/s, they provide an appealing option when memory resources are scarce. Yet directly applying storage-based methods from other domains fails to address the unique access patterns and data dependencies of full-graph GNN training. GriNNder tackles these challenges with structured storage offloading (SSO), a framework that manages the GPU-host-storage hierarchy through coordinated cache, (re)gather, and bypass mechanisms. To realize the framework, we devise (i) a partition-wise caching strategy for host memory that exploits observed cross-partition dependencies, (ii) a regathering strategy for gradient computation that eliminates redundant storage operations, and (iii) a lightweight partitioning scheme that mitigates the memory requirements of existing graph partitioners. In experiments over various models and datasets, GriNNder achieves up to 9.78× speedup over state-of-the-art baselines and throughput comparable to distributed systems, enabling previously infeasible large-scale full-graph training even on a single GPU.
Grolar: Efficient LLM Training on Heterogeneous Clusters
Runsheng Guo ⋅ Utkarsh Anand ⋅ Khuzaima Daudjee ⋅ Rathijit Sen
Large language models (LLMs) require vast amounts of GPU compute to train, but limited availability and high costs of GPUs make homogeneous clusters impractical for many organizations. Instead, assembling heterogeneous clusters by pooling together GPUs of different generations allows them to achieve higher aggregate compute and make use of all available GPUs. However, training on heterogeneous clusters presents significant challenges. The workload must be carefully partitioned such that GPUs in the cluster with limited compute, memory, or network bandwidth do not bottleneck the training process. Existing heterogeneous training systems cannot do so efficiently since they integrate data, pipeline, and tensor parallelism in a way that trades off communication for memory overhead. Combining vanilla data parallelism with pipeline parallelism is communication-efficient but results in high memory overhead from replicating model parameters. Alternatively, using sharded data parallelism or tensor parallelism reduces memory overhead but increases communication overhead when combined with pipeline parallelism. To address this problem, we designed Grolar, a system that uses Pipeline-Efficient ZeRO DP, a novel integration of pipeline parallelism and data parallelism that is both communication- and memory-efficient. Grolar uses a planner to automatically find an optimized training configuration from the vast search space of possibilities on heterogeneous clusters, and our evaluation shows that Grolar achieves up to 3× higher training throughput than state-of-the-art systems across representative heterogeneous training scenarios.
HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments
Yongjun He ⋅ Shuai Zhang ⋅ Xiyuan Zhang ⋅ Boran Han ⋅ Bernie Wang ⋅ Huzefa Rangwala ⋅ George Karypis
As large language models (LLMs) scale and new GPUs are released ever more frequently, there is an increasing demand for LLM post-training in heterogeneous environments, both to fully leverage underutilized mid-range or previous-generation GPUs across regions and to alleviate the shortage of homogeneous high-end GPUs in a single region. However, achieving high-performance reinforcement learning (RL) training for LLMs on such computing resources remains challenging because the workflow involves multiple models and tasks with complex computation and data dependencies. In this paper, we present HetRL, a distributed system for efficient RL training on infrastructure with heterogeneous GPUs and networks. HetRL formulates RL training scheduling in heterogeneous environments as a constrained joint optimization problem and introduces a novel scheduling algorithm that (1) decomposes the complex search space with a multi-level search framework; and (2) allocates the search budget via successive halving. Our extensive evaluation, consuming 20,000 GPU-hours, shows that HetRL delivers up to 9.17× (3.17× on average) the throughput of state-of-the-art systems under various workloads and settings.
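Successive halving itself is easy to sketch: evaluate all candidate schedules with a small budget, keep the best half, double the budget, and repeat until one remains. The helper below is a generic sketch with our own names, not HetRL's multi-level search.

```python
def successive_halving(candidates, evaluate, budget_per_round=1):
    """Budget allocation by successive halving: cheap evaluations prune the
    pool, and surviving candidates earn progressively larger budgets."""
    pool = list(candidates)
    budget = budget_per_round
    while len(pool) > 1:
        # Higher evaluate() score is better; keep the top half.
        scored = sorted(pool, key=lambda c: evaluate(c, budget), reverse=True)
        pool = scored[: max(1, len(pool) // 2)]
        budget *= 2
    return pool[0]
```

The appeal for scheduling search is that poor configurations are discarded after only cheap trial runs, concentrating expensive profiling on the few promising ones.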
HexiScale: Facilitating Large Language Model Training over Heterogeneous Hardware
Ran Yan ⋅ YOUHE JIANG ⋅ Xiaonan Nie ⋅ Fangcheng Fu ⋅ Bin CUI ⋅ Binhang Yuan
Training large language models (LLMs) is a computationally intensive task, which is typically conducted in data centers with homogeneous high-performance GPUs. In this paper, we explore an alternative approach by deploying training computations across heterogeneous GPUs to enable better flexibility and efficiency for heterogeneous resource utilization. Toward this end, we propose a novel system, HexiScale, that can flexibly support asymmetric partition of training computations in the scope of data-, pipeline-, and tensor model parallelism. We further formalize the allocation of asymmetric partitioned training computations over a set of heterogeneous GPUs as a constrained optimization problem and propose an efficient hierarchical graph partitioning algorithm. Our approach effectively allocates training computations across heterogeneous GPUs, fully leveraging the available computational power. We compare the performance of HexiScale with state-of-the-art homogeneous and heterogeneous training systems. When training LLMs at different scales (from 7B to 30B), empirical results demonstrate that: (\underline{i}) compared to state-of-the-art homogeneous baselines running over homogeneous GPUs, HexiScale achieves \textit{similar} performance when running over heterogeneous GPUs with the \textit{same} theoretical FLOPS; (\underline{ii}) compared to state-of-the-art heterogeneous baselines running on the same heterogeneous clusters, HexiScale delivers $1.5\times$ to $2.4\times$ higher throughput.
Kitty: Accurate and Efficient 2-bit KV Cache Quantization with Dynamic Channel-wise Precision Boost
Haojun Xia ⋅ Xiaoxia Wu ⋅ Jisen Li ⋅ Tsai-chuan Wu ⋅ Junxiong Wang ⋅ Jue Wang ⋅ Chenxi Li ⋅ Aman Singhal ⋅ Alay Dilipbhai Shah ⋅ ⋅ Donglin Zhuang ⋅ Zhongzhu Zhou ⋅ Ben Athiwaratkun ⋅ Zhen Zheng ⋅ Shuaiwen Song
The KV cache is a dominant memory bottleneck for LLM inference. While 4-bit KV quantization preserves accuracy, 2-bit often degrades it, especially on long-context reasoning. We close this gap via an algorithm–system co-design for mixed-precision KV caching: \emph{Kitty}. On the algorithm side, extensive experiments show that \emph{Dynamic Channel-wise Precision Boost} — which ranks Key-cache channels by sensitivity and keeps only a small fraction at higher precision — maintains a near-zero accuracy drop while approaching 2-bit memory. The main challenge is handling dynamic 4-bit channel boosts while keeping the page layout coalesced and the dequantization uniform, with no scattered reads or hard-coded masks. \emph{Kitty} addresses these issues by decomposing each mixed-precision Key page into two tensors with a unified 2-bit precision. Based on this, Kitty provides a page-centric KV layout, Triton-compatible page dequantization kernels, and a lightweight runtime pipeline that preserves coalescing and avoids divergence. Across seven tasks and two model families (Qwen3, LLaMA3), \emph{Kitty} cuts KV memory by nearly $8\times$ with negligible accuracy loss, enabling up to $8\times$ larger batches and $2.1\times$–$4.1\times$ higher throughput under the same memory budget.
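The page decomposition Kitty describes can be sketched as follows. This is an illustrative stand-in for the paper's Triton kernels, assuming 4-bit integer codes in a few "boosted" channels and 2-bit codes elsewhere; the function names are hypothetical:

```python
import numpy as np

def split_mixed_page(page, boosted_cols):
    """Decompose a Key page whose boosted channels hold 4-bit codes
    (2-bit codes elsewhere) into two tensors of uniform 2-bit codes,
    so every kernel sees a single, coalesced 2-bit layout."""
    base = page & 0b11                                   # low 2 bits of every code
    boost = np.zeros_like(page)
    boost[:, boosted_cols] = page[:, boosted_cols] >> 2  # high 2 bits of boosted cols
    return base, boost

def merge_mixed_page(base, boost):
    """Uniform reconstruction path: recombine the two 2-bit tensors
    before the usual (scale, zero-point) dequantization, with no
    scattered reads or per-channel masks at runtime."""
    return base | (boost << 2)
```

Because both output tensors are plain 2-bit code arrays, the dequantization kernel stays uniform regardless of which channels were boosted.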
Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes
Justin Bauer ⋅ Thomas Walshe ⋅ Derek Pham ⋅ Harit Vishwakarma ⋅ Armin Parchami ⋅ Frederic Sala ⋅ Paroma Varma
Fine-tuning Large Language Models (LLMs) typically relies on large quantities of high-quality annotated data, or questions with well-defined ground truth answers in the case of Reinforcement Learning with Verifiable Rewards (RLVR). While previous work has explored the benefits to model reasoning capabilities of scaling both the data and compute used for RLVR, these results lack applicability in many real-world settings where annotated data and accessible compute may be scarce. In this work, we present a comprehensive empirical study of open-source Small Language Model (SLM) performance after RLVR in low data regimes. Across three novel datasets covering number counting problems, graph reasoning, and spatial reasoning, we characterize how model performance scales with dataset size, diversity, and complexity. We demonstrate that (1) procedural datasets allow for fine-grained evaluation and training dataset development with controllable properties (size, diversity, and complexity), (2) RLVR enables models trained on lower complexity tasks to generalize to higher complexity tasks, and (3) training on mixed complexity datasets offers the greatest benefits in low data regimes, providing up to 5$\times$ sample efficiency versus training on easy tasks. These findings inspire future work on the development of data scaling laws for RLVR and the use of procedural data generators to further understand effective data development for efficient LLM fine-tuning.
LLMInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems
Shanli Xing ⋅ Vivian Zhai ⋅ Alexander Jiang ⋅ Yixin Dong ⋅ Yong Wu ⋅ Zihao Ye ⋅ ⋅ Yingyi Huang ⋅ Yineng Zhang ⋅ ⋅ ⋅ Luis Ceze ⋅ Tianqi Chen
Recent advances show that large language models (LLMs) can act as autonomous agents capable of generating GPU kernels, but integrating these AI-generated kernels into real-world inference systems remains challenging. LLMInfer-Bench addresses this gap by establishing a standardized, closed-loop framework that connects kernel generation, benchmarking, and deployment. At its core, LLMInfer Trace provides a unified schema describing kernel definitions, workloads, implementations, and evaluations, enabling consistent communication between agents and systems. Built on real serving traces, LLMInfer-Bench includes a curated dataset, a robust correctness- and performance-aware benchmarking framework, a public leaderboard to track LLM agents’ GPU programming capabilities, and a dynamic substitution mechanism (apply()) that seamlessly injects the best-performing kernels into production LLM engines such as SGLang and vLLM. Using LLMInfer-Bench, we further evaluate the performance and limitations of LLM agents, compare the trade-offs among different GPU programming languages, and provide insights for future agent design. LLMInfer-Bench thus establishes a practical, reproducible pathway for continuously improving AI-generated kernels and deploying them safely into large-scale LLM inference systems.
Locality-Aware Beam Scheduling for Efficient Test-Time Compute with a Consumer-grade GPU
Hsing-Ti Wang ⋅ Hung-Tso Shiao ⋅ Chia-Lin Yang
Large Language Models (LLMs) are central to modern NLP applications, yet their deployment on consumer-grade GPUs is constrained by limited memory capacity and bandwidth. In typical single-batch inference on local devices, the key–value (KV) cache occupies only a small fraction of total memory, so prior studies have largely focused on model weights. The rise of test-time compute (TTC), however, introduces a new bottleneck: the rapidly expanding KV cache. In TTC methods such as step-wise beam search, concurrent decoding paths cause KV cache size and transfer costs to scale with the exploration space, resulting in severe I/O stalls on consumer-grade GPUs. We identify two complementary forms of data locality in TTC workloads. Inter-token locality occurs within each decoding step, as consecutive tokens in the same beam access nearly identical KV cache data. Inter-beam locality arises across decoding steps, as beams that share common prefixes reuse overlapping KV segments. Building on these observations, we propose Locality-Aware Beam Scheduling, which exploits these locality patterns to reduce redundant KV cache transfers. It also employs balanced grouping with prefetching to overlap data movement with computation. Evaluated on OPT-6.7B, LLaMA-2-7B, and Qwen-7B, our method reduces KV cache transfer volume by over 95\% and achieves consistent end-to-end speedups of 3.39×–9.72×, 3.60×–8.74×, and 4.17×–7.99×, respectively, compared to layer-wise offloading.
MAC-Attention: a Match--Amend--Complete scheme for fast and accurate attention computation
Jinghan Yao ⋅ Sam Jacobs ⋅ Walid Krichene ⋅ Masahiro Tanaka ⋅ Dhabaleswar Panda
Long-context decoding in LLMs is IO-bound: each token re-reads an ever-growing KV cache. Prior accelerations cut bytes via compression (lowering fidelity) or selection/eviction (restricting what remains accessible), which can degrade delayed recall and long-form generation. We introduce MAC-Attention, a fidelity- and access-preserving alternative that accelerates decoding by reusing prior attention computations for semantically similar recent queries. It starts with a match stage that performs pre-RoPE L2 matching over a short local window; an amend stage rectifies the reused attention by recomputing a small band near the match boundary; and a complete stage fuses the rectified results with fresh attention computed on the KV tail, via a numerically stable merge. On a match hit, the compute and bandwidth complexity is constant regardless of context length. The method is model-agnostic and composes with IO-aware kernels, paged-KV managers, and MQA/GQA. Across LongBench v2 (120K), RULER (120K), and LongGenBench (16K continuous generation), MAC-Attention reduces KV accesses by up to 99%, cuts token generation latency by over 60% at 128K, and achieves over 14.3x attention-phase speedups (up to 2.6x end-to-end), while maintaining full-attention quality. By reusing computation rather than compressing or discarding tokens, MAC-Attention delivers long-context inference that is both fast and faithful. Code is available.
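The "numerically stable merge" in the complete stage can be illustrated with the standard log-sum-exp combination of two softmax partials, as used by flash-attention-style kernels. The function names below are illustrative, not the paper's API:

```python
import math

def partial_attention(scores, values):
    """Unnormalized attention over one segment: returns the weighted
    value sum s, the running max m, and the softmax normalizer l."""
    m = max(scores)
    l = sum(math.exp(x - m) for x in scores)
    dim = len(values[0])
    s = [sum(math.exp(x - m) * v[d] for x, v in zip(scores, values))
         for d in range(dim)]
    return s, m, l

def stable_merge(p1, p2):
    """Combine two partials (e.g. a reused prefix and a fresh KV tail)
    without ever exponentiating an unshifted score."""
    (s1, m1, l1), (s2, m2, l2) = p1, p2
    m = max(m1, m2)
    a1, a2 = math.exp(m1 - m), math.exp(m2 - m)
    l = a1 * l1 + a2 * l2
    return [(a1 * x + a2 * y) / l for x, y in zip(s1, s2)]
```

Because each segment carries its own max and normalizer, the merge reproduces full-softmax attention exactly, which is what lets the rectified reused result and the fresh tail be fused losslessly.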
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
Zhen Zheng ⋅ Xiaonan Song ⋅ Chuanjie Liu
Quantization has become one of the most effective methodologies to compress LLMs to a smaller size. However, existing quantization solutions still suffer from either a non-negligible accuracy drop or low system efficiency. In this paper, we propose MixLLM, which explores the optimization space of mixed-precision quantization between output features, based on the insight that different features matter differently in the model. MixLLM identifies the important output features in the global view rather than within each single layer, effectively assigning larger bit-widths to the output features that need them most to achieve high accuracy with low memory usage. We present the sweet spot of quantization configuration from algorithm-system co-design, with both high accuracy and system efficiency. To address the system challenge, we design a two-step dequantization that makes easy use of the Tensor Core, fast data-type conversion to reduce dequantization overhead, and a software pipeline that overlaps memory access, dequantization, and MatMul as much as possible. Extensive experiments show that with only 10\% more bits, the perplexity increase can be reduced from about 0.5 in SOTA to within 0.2 for Llama 3.1 70B, while the MMLU-Pro loss can be reduced from 1.92 to 0.99 over the SOTA across three popular models. Besides its superior accuracy, MixLLM also achieves state-of-the-art system efficiency.
MorphServe: Efficient and Workload-Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing
Zhaoyuan Su ⋅ Zeyu Zhang ⋅ Tingfeng Lan ⋅ Zirui Wang ⋅ ⋅ Juncheng Yang ⋅ Yue Cheng
Efficiently serving large language models (LLMs) under dynamic and bursty workloads remains a key challenge for real-world deployment. Existing serving frameworks and static model compression techniques fail to adapt to workload fluctuations, leading to either service-level objective (SLO) violations under full-precision serving or persistent accuracy degradation with static quantization. We present MorphServe, a dynamic, workload-aware LLM serving framework based on morphological adaptation. MorphServe introduces two asynchronous, token-level runtime mechanisms: quantized layer swapping, which selectively replaces less impactful layers with quantized alternatives during high-load periods, and pressure-aware KV cache resizing, which dynamically adjusts KV cache capacity in response to memory pressure. These mechanisms enable state-preserving transitions with minimum runtime overhead and are fully compatible with modern scheduling and attention techniques. Extensive experiments on Vicuna and Llama family models with real-world workloads demonstrate that MorphServe reduces average SLO violations by 92.45% and improves the P95 TTFT latency by 2.2–3.9$\times$ compared to full-precision serving, without compromising generation quality. These results establish MorphServe as a practical and elastic solution for LLM deployment in dynamic environments.
MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training
Wenxuan Li ⋅ Chengruidong Zhang ⋅ Huiqiang Jiang ⋅ Yucheng Li ⋅ ⋅ Lili Qiu
The adoption of long context windows has become a standard feature in Large Language Models (LLMs), as extended contexts significantly enhance their capacity for complex reasoning and broaden their applicability across diverse scenarios. Dynamic sparse attention is a promising approach for reducing the computational cost of long-context processing. However, efficiently training LLMs with dynamic sparse attention on ultra-long contexts—especially in distributed settings—remains a significant challenge, due in large part to worker- and step-level imbalance. This paper introduces MTraining, a novel distributed methodology leveraging dynamic sparse attention to enable efficient training for LLMs with ultra-long contexts. Specifically, MTraining integrates three key components: a distributed sparse index approximation algorithm, balanced sparse ring attention, and hierarchical sparse ring attention. These components are designed to synergistically address the computational imbalance and communication overheads inherent in dynamic sparse attention mechanisms when training LLMs with extensive context lengths. We demonstrate the efficacy of MTraining mainly by training Qwen2.5-3B and Llama-3.1-8B, successfully expanding their context windows from 32K/128K to 512K tokens on a cluster of 32 A100 GPUs. Our evaluations on a comprehensive suite of downstream tasks, including RULER, PG-19, InfiniteBench, and NIAH, reveal that MTraining achieves up to a 6x higher training throughput while preserving model accuracy.
NEST: Network- and Memory-Aware Device Placement for Distributed Deep Learning
Irene Wang ⋅ ⋅ Arvind Krishnamurthy ⋅ Divya Mahajan
The growing scale of deep learning demands distributed training frameworks that jointly reason about parallelism, memory, and network topology. Prior works often rely on heuristic or topology-agnostic search, handling communication and memory separately. Without per-device memory awareness, these methods typically ensure feasibility post hoc by sharding parameters and activations across many devices, increasing synchronization, inflating communication, and underutilizing compute, limiting scalability and efficiency on real datacenter networks. We present NEST, a network-, compute-, and memory-aware device placement framework that unifies model parallelism, topology modeling, and memory feasibility via structured dynamic programming. NEST’s DP operates on operator graphs with tensor and expert parallel configurations, explicit allreduce latencies across hierarchical or arbitrary networks, and memory/compute profiles. By factoring parallelism across tensor, pipeline, data, and expert dimensions, NEST defines a principled search space for hybrid strategies while jointly optimizing co-location, network latency, and memory feasibility. Evaluations across diverse hardware and networks show NEST achieves up to 2.35 times higher throughput, better memory efficiency, and improved scalability over state-of-the-art baselines, providing a foundation for co-designing parallelization strategies and datacenter interconnects for next-generation AI infrastructure.
NexSpec: Towards Optimizing Speculative Decoding in Reinforcement Learning Systems
qiaoling chen ⋅ Zijun Liu ⋅ Peng Sun ⋅ Shenggui Li ⋅ Guoteng Wang ⋅ Ziming Liu ⋅ Yonggang Wen ⋅ Siyuan Feng ⋅ Tianwei Zhang
Adapting large language models (LLMs) via reinforcement learning (RL) is often bottlenecked by the generation stage, which can consume over 75% of the training time. Speculative decoding (SD) accelerates autoregressive generation in serving systems, but its behavior under RL training remains largely unexplored. We identify three critical gaps that hinder the naïve integration of SD into RL systems: diminishing speedups at large batch sizes, drafter staleness under continual actor updates, and drafter-induced policy degradation. To address these gaps, we present NexSpec, a system that adapts SD to RL through three complementary mechanisms: dynamically tuning SD configurations, evolving the drafter via knowledge distillation, and weighting updates by rollout rewards. On Qwen models (3B–14B), NexSpec achieves up to 4.5x speedup while preserving reward convergence and training stability, providing a practical solution for efficient RL-based LLM adaptation.
Deploying neural networks on microcontrollers is constrained by kilobytes of flash and SRAM, where 1x1 pointwise (PW) channel mixers often dominate memory even after INT8 quantization. We present HyperTinyPW, a compression-as-generation approach that replaces most stored PW weights with generated weights. A shared micro-MLP synthesizes PW kernels once at load time from tiny per-layer codes; the kernels are cached and then executed with standard integer operators, so the deployment stack stays unchanged. A shared latent basis across layers reduces redundancy, and keeping the first PW layer in INT8 stabilizes early morphology-sensitive mixing. Our contributions are: (1) TinyML-faithful packed-byte accounting that includes the generator, heads or factorization, per-layer codes, the kept first PW layer, and the backbone; (2) a unified evaluation protocol with a validation-tuned threshold (t*) and bootstrap confidence intervals; and (3) a deployability analysis covering integer-only inference and boot-versus-lazy synthesis trade-offs. On three ECG benchmarks (Apnea-ECG, PTB-XL, MIT-BIH), HyperTinyPW shifts the macro-F1 versus flash Pareto frontier: at about 225 kB it matches a ~1.4 MB CNN while being 6.31x smaller (84.15% fewer bytes), retaining at least 95% of large-model macro-F1. Under 32-64 kB budgets it sustains balanced detection where compact baselines degrade. The mechanism applies broadly to other 1D biosignals, on-device speech, and embedded sensing tasks where per-layer redundancy dominates, suggesting a wider role for compression-as-generation in resource-constrained ML systems.
OPKV: A High-Throughput Plugin-Driven Framework for Recallable Sparsity in Paged KV Cache Systems
Huazheng Lao ⋅ Xiaofeng Li ⋅ Rui Xu ⋅ Long Chen ⋅ Xia Zhu ⋅
Long-context large language model (LLM) inference faces severe KV cache inflation, making GPU memory a key bottleneck. Existing recallable sparsity methods mitigate memory pressure by offloading non-critical key–value (KV) pairs to CPU memory and recalling them on demand, but they are intrusive to KV cache management in existing inference frameworks and fail to cope with the linearly increasing recall overhead at large batch sizes. To address these limitations, we propose OPKV, a high-throughput plugin-driven framework that seamlessly integrates recallable sparsity into paged KV cache systems and performs unified recall optimization. OPKV introduces a plugin interface that decouples sparsity logic from model and cache management, and applies object reaggregation and hot-page hit algorithms to reduce recall overhead, based on the observed spatial discreteness and temporal locality of critical KV selection. In addition, a local intra-iteration metadata manager performs millisecond-level page retrieval and cache eviction. Experimental results show that OPKV helps SoTA methods attain 1.36-1.77x higher decoding throughput across batch sizes.
Practical Adversarial Multi-Armed Bandits with Sublinear Runtime
Kasper Overgaard Mortensen ⋅ Ama Bembua Bainson ⋅ ⋅ Kristoffer Strube ⋅ Renata Borovica-Gajic ⋅ Andrea Paudice ⋅ Davide Mottin ⋅ Panagiotis Karras
We study the Multi-Armed Bandit problem in nonstationary adversarial environments, where the identity of the optimal arm can change over time due to shifts in the loss sequence. Motivated by applications such as physical design tuning in database systems, we focus on settings with a very large number of arms and seek practical algorithms with sublinear runtime. Our main contribution is a novel algorithm, Queuing Behind the Leader (QBL), which achieves a per-iteration complexity of O(m log k), where m is the number of arms selected at each step. QBL combines limited update operations via a priority queue, a constant sampling overhead, and a balanced exploration strategy. We evaluate QBL extensively on state-of-the-art benchmarks and demonstrate that it consistently outperforms existing methods in both time and solution quality.
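The O(m log k) per-iteration cost can be pictured with a minimal follow-the-leader-style sketch: only the m arms played in a round have their cumulative-loss keys touched in the heap. This omits QBL's exploration and sampling machinery entirely; the function name and structure are illustrative, not the published algorithm:

```python
import heapq

def leader_rounds(loss_fn, k, m, T):
    """Each round pops the m arms with smallest cumulative loss and
    pushes them back with updated keys: O(m log k) heap operations per
    round, rather than touching all k arms."""
    heap = [(0.0, arm) for arm in range(k)]   # (cumulative loss, arm)
    heapq.heapify(heap)
    history = []
    for t in range(T):
        batch = [heapq.heappop(heap) for _ in range(m)]
        history.append([arm for _, arm in batch])
        for cum, arm in batch:                # update only the played arms
            heapq.heappush(heap, (cum + loss_fn(t, arm), arm))
    return history
```

The key point is that the k - m unplayed arms are never re-keyed, which is what keeps the per-iteration cost sublinear in the number of arms.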
ProTrain: Efficient LLM Training via Automatic Memory Management
Hanmei Yang ⋅ ⋅ Yao Fu ⋅ ⋅ Ramine Roane ⋅ Hui Guan ⋅ Tongping Liu
Memory pressure has emerged as a dominant constraint in scaling the training of large language models (LLMs), particularly in resource-constrained environments. While modern frameworks incorporate various memory-saving techniques, they often expose low-level configuration knobs that require manual tuning and specialized system expertise. This not only adds engineering overhead but also risks suboptimal hardware utilization when misconfigured. This paper introduces ProTrain, a novel training system that automatically tailors memory management policies to the model architecture and underlying hardware resources, eliminating the need for manual intervention. The core of ProTrain is its automated memory management that abstracts complex memory management strategies into a few tunable configuration parameters, allowing searches for optimal parameter settings using cost models. ProTrain is equipped with a runtime profiler that provides precise estimates of latency, memory usage, and I/O bandwidth to build high-fidelity cost models. ProTrain does not change the training algorithm and thus does not compromise accuracy. Experiments show that ProTrain improves training throughput by 1.43$\times$ to 2.71$\times$ compared to the state-of-the-art training systems.
Pylo: Towards Accessible Learned Optimizers in PyTorch
Paul Janson ⋅ Benjamin Thérien ⋅ Quentin Anthony ⋅ Xiaolong Huang ⋅ Abhinav Moudgil ⋅ Eugene Belilovsky
Learned optimizers have been an active research topic over the past decade, with increasing progress toward practical, general-purpose optimizers that can serve as drop-in replacements for widely used methods like Adam. However, recent advances such as VeLO, which was meta-trained for 4000 TPU-months, remain largely inaccessible to the broader community, in part due to their reliance on JAX and the absence of user-friendly packages for independently using the optimizers after meta-training. To address this gap, we introduce PyLO, a PyTorch-based library that brings learned optimizers to the remaining ≈ 80% of the machine learning community via the familiar torch.optim.Optimizer interface. Unlike prior work focused on limited-scale academic tasks, our emphasis is on applying learned optimization to real-world large-scale pre-training tasks. Our systems contribution includes CUDA-accelerated implementations of the small_fc_lopt (Metz et al., 2022a) and VeLO (Metz et al., 2022b) learned optimizers, achieving substantial performance gains, with training throughput on ViT-B/16 (batch size 32) increasing from 39.36 and 49.73 to 205.59 and 191.18 samples per second, respectively. PyLO's versatility allows us to easily combine learned optimizers with existing optimization tools such as learning rate schedules and weight decay; when doing so, we discover that learned optimizers can benefit substantially from them. Our code is available at https://anonymous.4open.science/r/pylo-C91E32
RaidServe: High-performance Resilient Serving
Ziyi Xu ⋅ Zhiqiang Xie ⋅ Swapnil Gandhi ⋅ Christos Kozyrakis
Tensor parallelism (TP) enables large language models (LLMs) to scale inference efficiently across multiple GPUs, but its tight coupling makes systems fragile: a single GPU failure can halt execution, trigger costly KVCache recomputation, and introduce long-term compute and memory imbalance. We present RaidServe, a fault-tolerant TP serving system that sustains high performance under irregular GPU availability. RaidServe introduces three techniques to balance computation and memory across GPUs: (1) Cyclic KVCache Placement for even memory utilization, (2) Hybrid Attention combining tensor- and data-parallel attention to eliminate stragglers, and (3) Fine-Grained Load-Aware Routing to dynamically balance requests. It further employs proactive KVCache backup and on-demand weight recovery to avoid expensive recomputation and redundant data transfers. Implemented in a lightweight serving engine compatible with existing infrastructures, RaidServe achieves up to 2× higher throughput and two orders of magnitude faster recovery than standard fault-handling methods on an 8×H100 DGX system, maintaining strong performance even with multiple GPU failures.
RDMA Point-to-Point Communication for LLM Systems
Nandor Licker ⋅ Kevin Hu ⋅ Vladimir Zaytsev ⋅ Lequn Chen
Emerging Large Language Model (LLM) system patterns, such as disaggregated inference, Mixture-of-Experts (MoE) routing, and asynchronous reinforcement fine-tuning, require flexible point-to-point communication beyond simple collectives. Existing implementations are locked to specific Network Interface Controllers (NICs), hindering integration into inference engines and portability across hardware providers. We present TransferEngine, which bridges the functionality of common NICs to expose a uniform interface. TransferEngine exposes one-sided WriteImm operations with an ImmCounter primitive for completion notification, without ordering assumptions on the network transport, transparently managing multiple NICs per GPU. We demonstrate peak throughput of 400 Gbps on both NVIDIA ConnectX-7 and AWS Elastic Fabric Adapter (EFA). We showcase TransferEngine through three production systems: (1) KvCache transfer for disaggregated inference with dynamic scaling, (2) RL weight updates achieving 1.3 seconds for trillion-parameter models, and (3) a MoE dispatch/combine implementation exceeding DeepEP decode latency on ConnectX-7, with the first viable latencies on EFA. We demonstrate that our portable point-to-point communication complements collectives while avoiding lock-in.
Search Your NVFP4 Scales!
Tanmaey Gupta ⋅ Hayden Prairie ⋅ Xiaoxia Wu ⋅ Reyna Abhyankar ⋅ Qingyang Wu ⋅ Austin Silveria ⋅ Pragaash Ponnusamy ⋅ Jue Wang ⋅ Ben Athiwaratkun ⋅ Shuaiwen Song ⋅ Tri Dao ⋅ Daniel Fu ⋅ Christopher De Sa
Quantization has emerged as a standard technique for accelerating inference for generative models by enabling faster low-precision computations and reduced memory transfers. Recently, GPU accelerators have added first-class support for microscaling Block Floating Point (BFP) formats. Standard BFP algorithms use a fixed scale based on the maximum magnitude of the block. We observe that this scale choice can be suboptimal with respect to quantization errors. In this work, we propose \textbf{ScaleSearch}, an alternative strategy for selecting these scale factors: a fine-grained search leveraging the mantissa bits in microscaling formats to minimize the quantization error for the given distribution. \textbf{ScaleSearch} can be integrated with existing quantization methods such as Post Training Quantization and low-precision attention, and is shown to improve their performance. Additionally, we introduce \textbf{ScaleSearchAttention}, an accelerated NVFP4-based attention algorithm, which uses \textbf{ScaleSearch} and adapted prior techniques to ensure near-zero performance loss for causal language modeling. Experiments show that \textbf{ScaleSearch} improves language model weight PTQ by up to 7.5 points on GPQA (Qwen3-8B) and video generation on Mochi by up to 14 points in VQA-a over SageAttention3. \textbf{ScaleSearchAttention} improves Wikitext-2 PPL by 0.9 points for Llama 3.1 70B.
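The core idea of searching scales rather than fixing them at max-magnitude can be sketched with a uniform integer grid as a stand-in for NVFP4 (the real format is non-uniform, and the paper derives candidates from the microscaling mantissa bits; everything below is illustrative):

```python
import numpy as np

def quant_error(block, scale, qmax=7):
    """Mean squared error after round-to-nearest quantization onto a
    symmetric grid of magnitude qmax (a uniform stand-in for FP4)."""
    deq = np.clip(np.round(block / scale), -qmax, qmax) * scale
    return float(np.mean((deq - block) ** 2))

def search_scale(block, qmax=7, num_candidates=16):
    """Instead of fixing scale = max|x| / qmax, evaluate nearby candidate
    scales and keep the one minimizing quantization error for this block."""
    base = float(np.abs(block).max()) / qmax
    candidates = np.concatenate(
        ([base], np.linspace(0.75 * base, 1.25 * base, num_candidates)))
    return min(candidates, key=lambda s: quant_error(block, s))
```

Since the max-magnitude scale is always among the candidates, the searched scale can only match or reduce the per-block quantization error.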
Shannonic: Efficient Entropy-Optimal Compression for ML Workloads
Kareem Ibrahim ⋅ ⋅ Andreas Moshovos
We present Shannonic, a lossless compression method for machine learning tensors that achieves near-entropy-optimal compression, minimal state footprint, and high throughput. Shannonic uses an off-line pre-processing step to partition the tensor value space into optimally selected subranges and generates encoding/decoding tables that encode each value as a (range index, offset) pair where the range is entropy encoded using the asymmetric numeral systems (ANS) method. We formally prove and empirically show that Shannonic achieves higher compression efficiency than standard ANS. For a variety of 8b-quantized models, Shannonic's codec uses just 530B of state and achieves coding efficiency within 1\% of the Shannon limit. Shannonic enables 1.3-3.1$\times$ faster federated learning over bandwidth-constrained networks and 29-32\% latency reduction in edge-cloud LLM inference.
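The (range index, offset) split at the heart of Shannonic can be sketched as follows, assuming the subrange boundaries have already been chosen by the offline pre-processing step; the entropy coding of the index (ANS in Shannonic) is omitted:

```python
import bisect

def encode(value, bounds):
    """Map a value to (range index, offset): the index is what the
    entropy coder compresses, while the offset within the subrange is
    stored in raw bits. Subrange sizing trades offset bits against
    index probability."""
    i = bisect.bisect_right(bounds, value) - 1
    return i, value - bounds[i]

def decode(index, offset, bounds):
    """Inverse mapping: subrange start plus raw offset."""
    return bounds[index] + offset
```

With boundaries `[0, 4, 16, 64, 256]`, for example, small frequent values land in narrow subranges (few offset bits, high index probability) while rare large values pay more offset bits, which is how the partition approaches the entropy limit.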
SkipKV: Selective Skipping of KV Generation and Storage for Efficient Inference with Large Reasoning Models
Jiayi Tian ⋅ Seyedarmin Azizi ⋅ Yequan Zhao ⋅ Erfan Potraghloo ⋅ Sean McPherson ⋅ Sharath Nittur Sridhar ⋅ Zhengyang Wang ⋅ zheng Zhang ⋅ Massoud Pedram ⋅ Souvik Kundu
Large reasoning models (LRMs) often incur significant key-value (KV) cache overhead, as the cache grows linearly with the verbose chain-of-thought (CoT) reasoning process. This creates both memory and throughput bottlenecks, limiting efficient deployment. Towards reducing KV cache size during inference, we first investigate the effectiveness of existing KV cache eviction methods for CoT reasoning. Interestingly, we find that due to unstable token-wise scoring and the reduced effective KV budget caused by padding tokens, state-of-the-art (SoTA) eviction methods fail to maintain accuracy in the multi-batch setting. Additionally, these methods often generate longer sequences than the original model, as semantics-unaware token-wise eviction leads to repeated revalidation during reasoning. To address these issues, we present \textbf{SkipKV}, a \textbf{\textit{training-free}} KV compression method for selective \textit{eviction} and \textit{generation} that operates via coarse-grained, sentence-level sequence removal for efficient CoT reasoning. Specifically, it introduces a \textit{sentence-scoring metric} to identify and remove highly similar sentences while maintaining semantic coherence. To suppress redundant generation, SkipKV dynamically adjusts a steering vector that updates the hidden activation states during inference, steering the LRM toward concise responses. Extensive evaluations on multiple reasoning benchmarks demonstrate the effectiveness of SkipKV in maintaining up to $\mathbf{26.7}\%$ higher accuracy than the alternatives at a similar compression budget. Additionally, compared to SoTA, SkipKV yields up to $\mathbf{1.6}\times$ shorter generation lengths while improving throughput by up to $\mathbf{1.7}\times$.
Unified LLM Model for Power, Performance, and Area Prediction from Hardware Code
Armin Abdollahi ⋅ Mehdi Kamal ⋅ Massoud Pedram
We present RocketPPA, a unified LLM-based model that predicts power, performance, and area for Verilog designs across technology nodes and optimization styles. The approach combines a large language model backbone with mixture-of-experts regression and low-rank adaptation for parameter efficiency. To improve generalization, we introduce a contrastive learning framework that encourages semantically similar designs to cluster in embedding space, providing an inductive bias that reflects the structure of the hardware design space. Trained on 15nm and 45nm nodes with area- and delay-optimized flows, the model achieves 9.4 percentage point improvement in pass rate at ten percent tolerance over prior methods, with approximately 20$\times$ higher throughput (0.12 seconds per design). Ablations show contrastive learning contributes 2.5 points to accuracy, while leave-one-regime-out experiments demonstrate robust cross-regime generalization with minimal degradation. These results validate that combining supervised and contrastive objectives enables rapid, accurate PPA prediction across nodes and optimization styles.
Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP
Yilong Zhao ⋅ Xiaonan Nie ⋅ Kan Zhu ⋅ ⋅ ⋅ Hongxiang Hao ⋅ Yang Zhou ⋅ Baris Kasikci ⋅ Ion Stoica
Context parallelism (CP) has been widely adopted to support the growing context lengths in foundation model pretraining. However, existing designs fail to handle the large variation in sequence length within training datasets, resulting in suboptimal performance. These methods either over-shard short sequences, leading to compute inefficiency and excessive communication, or process long and short sequences separately without proper bin-packing, causing workload imbalance. In this paper, we propose FCP, a flexible context parallelism paradigm that shards and schedules sequences at block-level granularity. Instead of relying on a rigid communication topology such as a ring, FCP enables arbitrary peer-to-peer communication, allowing flexible placement of sequence blocks across workers. By bin-packing blocks from both short and long sequences, FCP achieves both high compute efficiency and balanced workload distribution. Extensive evaluations show that FCP attains near-linear scalability on up to 256 H20 and GB200 GPUs, with a $1.13\times$–$2.21\times$ improvement in attention MFU.
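The block-level sharding and bin-packing idea can be sketched as follows. This is an assumption-laden toy, not FCP's actual scheduler: the block size, the longest-processing-time-first heuristic, and the load metric (token count) are illustrative stand-ins for what the paper describes.

```python
import heapq

BLOCK = 1024  # assumed block granularity in tokens

def shard(seq_lens, num_workers):
    """Split sequences into blocks and bin-pack them across workers so that
    per-worker token loads stay balanced; returns (load, worker, blocks)."""
    blocks = []
    for sid, n in enumerate(seq_lens):
        while n > 0:
            blocks.append((min(n, BLOCK), sid))
            n -= BLOCK
    blocks.sort(reverse=True)  # longest-block-first greedy heuristic
    heap = [(0, w, []) for w in range(num_workers)]  # (load, id, assigned)
    heapq.heapify(heap)
    for blen, sid in blocks:
        load, w, assigned = heapq.heappop(heap)     # least-loaded worker
        assigned.append((sid, blen))
        heapq.heappush(heap, (load + blen, w, assigned))
    return sorted(heap)

# Mix of one long, one medium, and three short sequences on 4 workers.
for load, w, assigned in shard([8192, 500, 300, 4096, 700], 4):
    print(f"worker {w}: {load} tokens")
```

Because short sequences contribute whole (small) blocks instead of being sharded across every worker, they add no extra communication, while long sequences are still split finely enough to balance the load.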
Using Span Queries to Optimize Cache and Attention Locality
Paul Castro ⋅ Nick Mitchell ⋅ Nathan Ordonez ⋅ Thomas Parnell ⋅ Mudhakar Srivatsa ⋅ Antoni Viros i Martin
Clients are evolving beyond chat completion and now employ a variety of innovative inference-time scaling and deep reasoning techniques. At the same time, inference servers remain heavily optimized for chat completion. Prior work has shown that large improvements to KV cache hit rate are possible if inference servers evolve towards these non-chat use cases; however, the proposed solutions are themselves optimized for a single use case, RAG. In this paper, we introduce the \emph{span query} to generalize the interface to the inference server. We demonstrate that chat, RAG, inference-time scaling, and agentic workloads can all be expressed as span queries. We show that the critical distinction implicitly assumed by prior work lies in whether the order of the inputs matters: do they \emph{commute}? In chat, they do not. In RAG, they often do. Span queries are expression trees of inference calls, linked together with commutativity constraints. We describe span query syntax and semantics, and show how queries can be automatically optimized to improve KV cache locality. We show how a small change to vLLM (affecting only 492 lines) enables high-performance execution of span queries. Using this stack, we demonstrate that span queries can achieve 10–20$\times$ reductions in TTFT for two distinct non-chat use cases. Finally, we show that span queries can also be optimized to improve \emph{attention locality}, so as to avoid the so-called lost-in-the-middle problem. We demonstrate that an attention-optimized span query on a 2B-parameter model vastly outperforms the accuracy of a stock inference server using an 8B model.
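A hypothetical sketch of the core abstraction: a span query as an expression tree whose nodes carry commutativity flags. When children commute, they can be reordered into a canonical order, so two queries that present the same inputs in different orders hit the same KV cache prefix. Class and method names here are illustrative, not vLLM's or the paper's API.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    text: str = ""
    commutative: bool = False        # may the children be reordered?
    children: list = field(default_factory=list)

    def canonical(self) -> str:
        """Flatten to a prompt string with a cache-friendly stable order."""
        parts = [c.canonical() for c in self.children]
        if self.commutative:
            parts.sort()             # canonical order -> higher KV hit rate
        return self.text + "".join(parts)

# RAG-style query: retrieved documents commute, the prefix does not.
q1 = Span("Answer: ", True, [Span("[docA]"), Span("[docB]")])
q2 = Span("Answer: ", True, [Span("[docB]"), Span("[docA]")])
print(q1.canonical() == q2.canonical())  # True: both map to one cache entry
```

In a chat workload the root would be non-commutative, so turn order is preserved and no reordering optimization applies, matching the paper's chat-vs-RAG distinction.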
Virtual Machine NUMA Placement at Scale: Learning the Norm, Shielding the Tail
Yibo Zhao ⋅ Tianyuan Wu ⋅ Hui Xue ⋅ Qi Chen ⋅ Zhenhua Han ⋅ ⋅ ⋅ ⋅ ⋅ Jui-Hao Chiang ⋅ Mingxia Li ⋅ Yuqing Yang ⋅ Cheng Tan ⋅ Fan Yang ⋅ Peng Cheng ⋅ Yongqiang Xiong ⋅ Lili Qiu ⋅ Lidong Zhou
In modern data centers, servers organize memory and CPUs into Non-Uniform Memory Access (NUMA) nodes, where unequal memory-to-CPU proximity leads to varying memory latency. Hypervisors must therefore carefully place Virtual Machines (VMs) to reduce remote memory access: poor placements can cause significant performance degradation, sometimes up to 30%. However, achieving optimal placement at scale is challenging due to the large number of VM configurations, diverse NUMA topologies, and evolving workload patterns. We present Catur, a NUMA placement system designed for large-scale cloud environments. Catur leverages reinforcement learning to learn from production data. To address real-world challenges, Catur integrates several techniques: a robust action-space design to prevent model collapse, reward shaping to address learning inefficiency, drift-aware continuous training for evolving workload patterns, and speculative shielding to mitigate VM performance anomalies. Evaluations on production traces with 100 million VMs demonstrate that Catur reduces the average resource defect by 34.2%–50.0% compared to state-of-the-art hypervisor policies.
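To make the underlying placement problem concrete, here is a toy illustration, not Catur's learned policy: fit a VM's vCPUs and memory onto as few NUMA nodes as possible, since spanning nodes is what forces remote memory access. The capacities and the greedy fallback are made-up assumptions.

```python
def place_vm(free, need_cpu, need_mem):
    """free: list of (free_cpus, free_mem_gb) per NUMA node.
    Returns indices of nodes used, preferring a single-node fit."""
    # Best case: one node holds the whole VM -> all memory accesses local.
    for i, (c, m) in enumerate(free):
        if c >= need_cpu and m >= need_mem:
            return [i]
    # Otherwise span nodes greedily (remote accesses now possible).
    used, cpu, mem = [], 0, 0
    for i, (c, m) in sorted(enumerate(free), key=lambda t: -t[1][0]):
        used.append(i)
        cpu, mem = cpu + c, mem + m
        if cpu >= need_cpu and mem >= need_mem:
            return sorted(used)
    return None  # VM does not fit on this host

nodes = [(8, 64), (16, 128)]
print(place_vm(nodes, 12, 96))   # [1]: fits on node 1 alone
print(place_vm(nodes, 20, 128))  # [0, 1]: must span both nodes
```

A rule like this breaks down at cloud scale, where fragmentation, heterogeneous NUMA topologies, and future arrivals matter, which is the gap the learned policy targets.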
When Machine Learning Isn’t Sure: Building Resilient ML-Based Computer Systems by Embracing Uncertainty
Varun Gohil ⋅ Nevena Stojkovic ⋅ Noman Bashir ⋅ ⋅ Gaurang Upasani ⋅ David Lo ⋅ Parthasarathy Ranganathan ⋅ Christina Delimitrou
Machine learning (ML) models are increasingly used in computer systems but often suffer from poor generalizability, leading to costly failures on out-of-distribution (OOD) data. We propose an uncertainty-aware framework that improves system resilience by quantifying prediction uncertainty at runtime and rejecting unreliable outputs before they cause harm. When a prediction is uncertain, the system gracefully degrades to a safe fallback strategy. We evaluate the framework across three case studies (server provisioning, cluster management, and storage I/O admission) and find that the best uncertainty estimator is not universal: it depends on how its properties align with each task’s design and resource constraints. Similarly, the optimal fallback workflow (e.g., lightweight and parallel vs. resource-intensive and sequential) depends on the task’s runtime latency constraints. Together, these findings offer a practical path towards building more reliable ML-driven computer systems.
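The reject-and-fallback pattern can be sketched in a few lines. All interfaces here are assumed: this toy uses ensemble disagreement as the uncertainty estimate (one of several estimators a real deployment would compare) and a conservative heuristic as the fallback.

```python
import statistics

def predict_with_fallback(models, x, fallback, max_std=0.1):
    """Return (prediction, used_fallback)."""
    preds = [m(x) for m in models]
    if statistics.pstdev(preds) > max_std:   # models disagree -> reject
        return fallback(x), True
    return statistics.mean(preds), False

# Toy ensemble: agrees in-distribution, disagrees out-of-distribution.
models = [lambda x: 0.5 * x, lambda x: 0.5 * x + 0.01, lambda x: 0.5 * x - 0.01]
ood_models = models + [lambda x: 5.0 * x]
safe = lambda x: 1.0  # conservative provisioning default

print(predict_with_fallback(models, 2.0, safe))      # ensemble mean, no fallback
print(predict_with_fallback(ood_models, 2.0, safe))  # rejects, uses safe value
```

The paper's finding maps onto the knobs here: which estimator replaces `pstdev` and how heavy the `fallback` path may be both depend on the task's latency and resource budget.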