

Session

Research-Track Oral Presentation: R18: Benchmarks

Grand Ballroom 2
Thu 21 May 2:45 p.m. PDT — 4:15 p.m. PDT


Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference

Mengtian Yang ⋅ Zhekun Zhang ⋅ Mingheng Wu ⋅ Jianwen Yan ⋅ Hanshi Sun ⋅ Li-Wen Chang

Deploying large-scale LLM training and inference with optimal performance is exceptionally challenging due to a complex design space of parallelism strategies, system optimizations, and hardware configurations. Accurate and rapid performance simulation is critical for guiding optimization efforts and system studies by validating “what-if” hypotheses. To address this, we introduce Charon, a unified, modular, and fine-grained simulator for accurately predicting LLM performance. Experiments show Charon achieves high accuracy across different models and configurations, with an overall prediction error consistently under 5.35%, and even under 3.74% for training with over 10,000 GPUs. In a practical inference deployment case, Charon discovered a configuration that improved system throughput by 275% over a manually tuned baseline, demonstrating its significant real-world value.
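The kind of “what-if” question such a simulator answers can be illustrated with a deliberately simplified cost model (a generic back-of-envelope sketch, not Charon’s actual model; all numbers are hypothetical):

```python
def step_time_ms(flops, peak_flops, comm_bytes, link_bw_bytes, overlap=True):
    """Estimate one step's time (ms) from compute and communication cost."""
    compute_ms = flops / peak_flops * 1e3
    comm_ms = comm_bytes / link_bw_bytes * 1e3
    # With compute/communication overlap, the slower of the two dominates;
    # without overlap, the two costs add up.
    return max(compute_ms, comm_ms) if overlap else compute_ms + comm_ms

# Hypothetical "what-if": how much does enabling overlap save at this scale?
with_overlap = step_time_ms(4e12, 300e12, 2e9, 400e9, overlap=True)
without_overlap = step_time_ms(4e12, 300e12, 2e9, 400e9, overlap=False)
```

A real simulator replaces each of these scalar terms with fine-grained, per-operator and per-link models, but the comparison structure is the same.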

Production LLM deployments lack systematic methods to assess output consistency risks when infrastructure changes. We present DriftBench, a measurement and prediction framework comprising 236,985 prompt-response pairs across 105 configurations spanning 5 models, 4 GPU platforms, 3 frameworks, and 3 precisions. We develop the Portability Risk Index (PRI), achieving $R^2$=0.987 on held-out test data ($R^2$ ranges from 0 to 1, with higher values indicating better predictive accuracy), with held-out-dimension generalization: hardware $R^2$=0.909, precision $R^2$=0.763. We discover a fundamental dichotomy: hardware/precision changes exhibit systematic drift ($R^2 \geq 0.76$) enabling predict-once deployment, while framework/model changes show idiosyncratic drift ($R^2 < 0.48$) requiring re-measurement. Production validation blocked a +9.23pp drift upgrade affecting 1 in 5 queries, demonstrating operational value. Our contribution is measurement and risk assessment; we do not propose drift mitigation techniques, as this remains an open challenge for future work. Verification: https://anonymous.4open.science/r/reviewer-verification-5F4E/ | DriftBench CLI: https://anonymous.4open.science/r/driftbench-7FEC/
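The $R^2$ metric used throughout these results is the standard coefficient of determination; a minimal sketch of how it is computed (illustrative only, not the authors’ evaluation code):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Perfect predictions score 1.0; always predicting the mean scores 0.0
# (hypothetical drift values, for illustration only).
drift = [0.10, 0.25, 0.40, 0.55]
perfect = r_squared(drift, drift)
baseline = r_squared(drift, [np.mean(drift)] * len(drift))
```

This is why an $R^2$ of 0.987 on held-out data indicates the PRI explains nearly all of the observed drift variance, while values below 0.48 mean the predictor does little better than the mean.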


Hawkeye: Reproducing GPU-Level Non-Determinism

Dan Boneh ⋅ Ilan Komargodski ⋅ Megha Srivastava

We present Hawkeye, a system for analyzing and reproducing GPU-level arithmetic operations on CPUs. Using our framework, an auditor can re-execute a full model training or inference workflow executed on NVIDIA GPUs on a CPU, without any precision loss and without introducing any additional operations or slowdown on the GPU side. This is in stark contrast to prior approaches to verifiable machine learning that introduced significant computational overhead for the model provider. The main technical contribution underlying Hawkeye is a systematic algorithmic framework for numerical treatment within NVIDIA's Tensor Cores: rounding, subnormal number handling, and the order of (non-associative) accumulation during matrix multiplication. Our framework consists of a sequence of carefully crafted tests that reduce the (otherwise exponential-size) search space of potential options for each operation. We test and evaluate our framework on a variety of GPU architectures (including Ampere and Hopper), as well as all available precision types (FP16, BF16). In all test cases, our framework recovers the exact implementation of operations underlying matrix multiplication, and therefore allows for the full reproduction of model training and inference workflows on a CPU.
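The non-associativity that makes accumulation order matter can be seen in a few lines (an illustrative sketch, not the authors’ code):

```python
import numpy as np

# In FP16, the ulp at 1.0 is 2^-10, so adding 2^-11 to 1.0 rounds back to
# 1.0 (round-to-nearest-even). Grouping the small terms first preserves them.
vals = np.array([1.0, 2**-11, 2**-11], dtype=np.float16)

# Left-to-right: each 2^-11 is absorbed by 1.0 and lost.
left = (vals[0] + vals[1]) + vals[2]

# Small terms first: 2^-11 + 2^-11 = 2^-10, which survives addition to 1.0.
right = vals[0] + (vals[1] + vals[2])
```

Because the two groupings give bit-different results, bit-exact CPU reproduction of a GPU matmul requires recovering exactly this kind of ordering (and rounding) choice inside the Tensor Cores.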


Massive-Scale Out-Of-Core UMAP on the GPU

Jinsol Park ⋅ Corey Nolet ⋅ Edward Raff ⋅ Tim Oates ⋅ Akira Naruse

The Uniform Manifold Approximation and Projection (UMAP) algorithm has become a widely popular technique to reduce the dimensionality of a set of vectors, both for visualization and as a pre-processing step for follow-on machine learning tasks. UMAP is often an integral part of iterative and exploratory workflows, but the heavy amount of compute and memory required makes scaling to tens or even hundreds of gigabytes of vectors intractable on the CPU, often taking several hours to days to complete. In this paper, we show how we improved UMAP while unlocking performance that permits interactive analysis, even at massive-scale. We introduce an out-of-core strategy with optional multi-GPU support, achieving up to 74× faster performance than the CPU baseline.
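The general out-of-core pattern behind such a strategy can be sketched as follows (a generic sketch of batch streaming from disk, not the paper’s implementation; names and sizes are hypothetical):

```python
import os
import tempfile
import numpy as np

# Keep the full dataset on disk and stream fixed-size batches through memory,
# so the working set stays bounded regardless of dataset size.
n, d, batch = 8192, 16, 1024
path = os.path.join(tempfile.mkdtemp(), "vectors.dat")

# Write the dataset to disk once (all ones, as a stand-in for real vectors).
on_disk = np.memmap(path, dtype=np.float32, mode="w+", shape=(n, d))
on_disk[:] = 1.0
on_disk.flush()

# Process it batch by batch; each iteration touches only `batch` rows.
data = np.memmap(path, dtype=np.float32, mode="r", shape=(n, d))
total = np.zeros(d, dtype=np.float64)
for start in range(0, n, batch):
    chunk = np.asarray(data[start:start + batch])  # copy one batch into RAM
    total += chunk.sum(axis=0)  # stand-in for per-batch (multi-)GPU work
```

In the GPU setting, each batch would be copied to device memory, processed there, and evicted, which is what lets the dataset exceed the memory of a single GPU.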


PARROT: Persuasion and Agreement Robustness Rating of Output Truth

Yusuf Çelebi ⋅ Mahmoud ElHussieni ⋅ Özay Ezerceli

This study presents PARROT (Persuasion and Agreement Robustness Rating of Output Truth), a robustness-focused framework designed to measure the degradation in accuracy that occurs under social pressure exerted on users through authority and persuasion in large language models (LLMs), i.e., the phenomenon of sycophancy (excessive conformity). PARROT (i) isolates causal effects by comparing the neutral version of the same question with an authoritatively false version using a double-blind evaluation, (ii) quantifies confidence shifts toward the correct and imposed false responses using log-likelihood-based calibration tracking, and (iii) systematically classifies failure modes (e.g., robust correct, sycophantic agreement, reinforced error, stubborn error, self-correction) using an eight-state behavioral taxonomy. We evaluated 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates. Findings show marked heterogeneity: advanced models (e.g., GPT-5, GPT-4.1, Claude Sonnet 4.5) exhibit low “follow rates” ($\leq11\%$, GPT-5: 4\%) and minimal accuracy loss, while older/smaller models show severe epistemic collapse (GPT-4: 80\%, Qwen 2.5-1.5B: 94\%). The danger is not limited to response changes; weak models reduce confidence in the correct response while increasing confidence in the imposed incorrect response. While international law and global knowledge at the domain level exhibit high fragility, elementary mathematics is relatively resilient. Consequently, we argue that the goal of “resistance to overfitting pressure” should be addressed as a primary objective alongside accuracy, harm avoidance, and privacy for safe deployment in the real world.
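The log-likelihood-based confidence tracking in (ii) can be sketched as follows (hypothetical scores for illustration; the real taxonomy and measurements are in the paper): softmax-normalize per-option log-likelihoods, then compare the probability mass on the correct answer under a neutral prompt versus an authoritatively false one.

```python
import math

def option_probs(logliks):
    """Softmax over per-option log-likelihoods (max-subtracted for stability)."""
    m = max(logliks.values())
    exps = {k: math.exp(v - m) for k, v in logliks.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

# Hypothetical scores: "A" is correct; the authority figure asserts "B".
neutral   = option_probs({"A": -1.2, "B": -2.5, "C": -3.0})
pressured = option_probs({"A": -2.4, "B": -1.1, "C": -3.0})

# A negative shift on the correct option means social pressure eroded the
# model's confidence even if its top answer had not (yet) flipped.
shift = pressured["A"] - neutral["A"]
```

Tracking this shift, rather than only the final answer, is what exposes the failure mode where weak models lose confidence in the truth while gaining confidence in the imposed falsehood.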