Track: Research Track Oral Presentation: Benchmarks and Evaluation

Thu 21 May 14:45 - 15:00 PDT

Massive-Scale Out-Of-Core UMAP on the GPU

Jinsol Park ⋅ Corey Nolet ⋅ Edward Raff ⋅ Tim Oates ⋅ Akira Naruse

The Uniform Manifold Approximation and Projection (UMAP) algorithm has become a widely popular technique for reducing the dimensionality of a set of vectors, both for visualization and as a pre-processing step for follow-on machine learning tasks. UMAP is often an integral part of iterative and exploratory workflows, but its heavy compute and memory requirements make scaling to tens or even hundreds of gigabytes of vectors intractable on the CPU, with runs often taking several hours to days to complete. In this paper, we show how we improved UMAP and unlocked performance that permits interactive analysis, even at massive scale, by introducing an out-of-core strategy with optional multi-GPU support. We observe a 22.7x speedup using a single GPU at smaller data scales where the CPU baseline runs to completion. At larger scales where the CPU was not able to complete, we project up to a 74x speedup using multiple GPUs on a single node by extrapolating measured scaling behavior.

Thu 21 May 15:00 - 15:15 PDT

Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference

Mengtian Yang ⋅ Zhekun Zhang ⋅ Mingheng Wu ⋅ Jianwen Yan ⋅ Hanshi Sun ⋅ Li-Wen Chang

Deploying large-scale LLM training and inference with optimal performance is exceptionally challenging due to a complex design space of parallelism strategies, system optimizations, and hardware configurations. Accurate and rapid performance simulation is critical for guiding optimization efforts and system studies by validating “what-if” hypotheses. To address this, we introduce Charon, a unified, modular, and fine-grained simulator for accurately predicting LLM performance. Experiments show Charon achieves high accuracy across different models and configurations, with an overall prediction error consistently under 5.35%, and even under 3.74% for training with a large-scale GPU cluster. In a practical inference deployment case, Charon discovered a configuration that improved system throughput over an engineering-tuned baseline, demonstrating its significant real-world value.

Thu 21 May 15:15 - 15:30 PDT

Hawkeye: Reproducing GPU-Level Non-Determinism

Erez Badash ⋅ Dan Boneh ⋅ Ilan Komargodski ⋅ Megha Srivastava

We present Hawkeye, a system for analyzing and reproducing GPU-level arithmetic operations. Using our framework, anyone can re-execute on a CPU the exact matrix multiplication operations underlying a machine learning model training or inference workflow that was executed on an NVIDIA GPU, without any precision loss. This is in stark contrast to prior approaches to verifiable machine learning, which either introduce significant computation overhead for the original model owner or suffer from non-robustness and quality degradation. The main technical contribution of Hawkeye is a systematic sequence of carefully crafted tests that study rounding direction, subnormal number handling, and order of (non-associative) accumulation during matrix multiplication on NVIDIA’s Tensor Cores. We test and evaluate our framework on multiple NVIDIA GPU architectures (Ampere, Hopper, and Lovelace) and precision types (FP16, BFP16, FP8). In all test cases, Hawkeye enables perfect reproduction of matrix multiplication on a CPU, paving the way for efficient and trustworthy third-party auditing of ML model training and inference.
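The role of accumulation order can be illustrated with a minimal sketch. It uses ordinary double-precision arithmetic on a CPU, not Tensor Core semantics, and the helper name `sum_in_order` is hypothetical. It only shows why floating-point addition is non-associative, which is the property that makes accumulation order matter for exact reproduction.

```python
from itertools import permutations


def sum_in_order(values):
    """Accumulate values strictly left to right, as a simple sequential reduction would."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1, 0.2, 0.3]

# The same three numbers, grouped two different ways.
left_grouped = (values[0] + values[1]) + values[2]
right_grouped = values[0] + (values[1] + values[2])

# Every ordering of the inputs, accumulated sequentially.
distinct_sums = {sum_in_order(p) for p in permutations(values)}

print(left_grouped == right_grouped)  # False: the grouping changes the rounded result
print(sorted(distinct_sums))
```

A bit-exact reproduction therefore has to match the original order of accumulation, not only the inputs and the precision.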

Thu 21 May 15:30 - 15:45 PDT

DriftBench: Measuring and Predicting Infrastructure Drift in LLM Serving Systems

Gianluigi Vitale

Production LLM deployments lack systematic methods to assess output consistency risks when infrastructure changes. We present DriftBench, a measurement and prediction framework comprising 236,985 prompt-response pairs across 105 configurations spanning 5 models, 4 GPU platforms, 3 frameworks, and 3 precisions. We develop the Portability Risk Index (PRI), achieving held-out-dimension generalization of $R^2 = 0.909$ for unseen hardware and $R^2 = 0.763$ for unseen precision ($R^2$ ranges up to 1.0; higher is better). We discover a fundamental dichotomy: hardware/precision changes exhibit systematic drift ($R^2 \geq 0.76$), enabling predict-once deployment, while framework/model changes show idiosyncratic drift ($R^2 < 0.48$), requiring re-measurement. Production validation blocked a high-drift upgrade where 23.85% of safety prompts flipped between safe and unsafe classifications (nearly 1 in 4 answers changed from safe to unsafe or unsafe to safe), demonstrating operational value. Our contribution is measurement and risk assessment; we do not propose drift mitigation techniques, as this remains an open challenge for future work.
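The two headline quantities in this abstract, $R^2$ and the share of prompts whose safety classification flips, can be sketched as follows. This is an illustrative sketch only: it does not implement the Portability Risk Index, whose definition the abstract does not give, and the function names and sample data are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def flip_rate(labels_before, labels_after):
    """Fraction of prompts whose label differs between two configurations."""
    flips = sum(1 for b, a in zip(labels_before, labels_after) if b != a)
    return flips / len(labels_before)


# Hypothetical drift scores: measured on held-out configurations vs. predicted.
measured = [0.10, 0.20, 0.30, 0.40]
predicted = [0.12, 0.18, 0.33, 0.37]
print(r_squared(measured, predicted))

# Hypothetical safety labels for the same prompts before and after an upgrade.
before = ["safe", "unsafe", "safe", "safe"]
after = ["safe", "safe", "safe", "unsafe"]
print(flip_rate(before, after))
```

A perfect predictor yields $R^2 = 1.0$, and always predicting the mean of the observations yields $R^2 = 0$, which is why values near 0.9 indicate drift that is largely predictable.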

Thu 21 May 15:45 - 16:00 PDT

PARROT: Persuasion and Agreement Robustness Rating of Output Truth — A Sycophancy Robustness Benchmark for LLMs

Özay Ezerceli ⋅ Mahmoud ElHussieni

This study presents PARROT (Persuasion and Agreement Robustness Rating of Output Truth), a robustness-focused framework designed to measure the accuracy degradation that large language models (LLMs) suffer under social pressure exerted through authority and persuasion, a phenomenon known as sycophancy (excessive conformity). PARROT (i) isolates causal effects by comparing the neutral version of a question with an authoritatively false version of the same question using a double-blind evaluation, (ii) quantifies confidence shifts toward the correct and imposed false responses using log-likelihood-based calibration tracking, and (iii) systematically classifies failure modes (e.g., robust correct, sycophantic agreement, reinforced error, stubborn error, self-correction) using an eight-state behavioral taxonomy. We evaluated 22 models using 1,302 MMLU-style multiple-choice questions across 13 domains and domain-specific authority templates. Findings show marked heterogeneity: advanced models (e.g., GPT-5, GPT-4.1, Claude Sonnet 4.5) exhibit low “follow rates” ($\leq 11\%$, GPT-5: 4%) and minimal accuracy loss, while older/smaller models show severe epistemic collapse (GPT-4: 80%, Qwen 2.5-1.5B: 94%). The danger is not limited to response changes; weak models reduce confidence in the correct response while increasing confidence in the imposed incorrect response. At the domain level, international law and global knowledge exhibit high fragility, while elementary mathematics is relatively resilient. Consequently, we argue that the goal of “resistance to overfitting pressure” should be addressed as a primary objective alongside accuracy, harm avoidance, and privacy for safe deployment in the real world.
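The paired neutral-versus-pressured comparison can be sketched as below. The abstract does not define the "follow rate" precisely, so this sketch assumes it is the fraction of questions on which the pressured answer equals the asserted false option. The `Trial` record and all function names are hypothetical, and the eight-state taxonomy is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    correct: str           # ground-truth option
    asserted: str          # false option pushed by the authority prompt
    neutral_answer: str    # model answer to the neutral question
    pressured_answer: str  # model answer to the authoritatively false version


def follow_rate(trials):
    """Assumed definition: share of trials where the model adopts the asserted false option."""
    followed = sum(1 for t in trials if t.pressured_answer == t.asserted)
    return followed / len(trials)


def accuracy(trials, pressured):
    """Accuracy on either the neutral or the pressured version of the questions."""
    hits = sum(
        1
        for t in trials
        if (t.pressured_answer if pressured else t.neutral_answer) == t.correct
    )
    return hits / len(trials)


# Hypothetical responses for four questions.
trials = [
    Trial("A", "B", "A", "A"),  # stays correct under pressure
    Trial("C", "D", "C", "D"),  # switches to the asserted false option
    Trial("B", "A", "C", "C"),  # wrong in both conditions, but does not follow
    Trial("D", "A", "D", "D"),  # stays correct under pressure
]

print(follow_rate(trials))                                # 0.25
print(accuracy(trials, False) - accuracy(trials, True))   # 0.25 accuracy loss
```

Because each question appears in both conditions, the difference between the two accuracies can be attributed to the pressure prompt and not to differences between question sets.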

Location: Grand Ballroom 2
