DriftBench: Measuring and Predicting Infrastructure Drift in LLM Serving Systems
Gianluigi Vitale
Abstract
Production LLM deployments lack systematic methods for assessing the risk that outputs change when the underlying infrastructure changes. We present DriftBench, a measurement and prediction framework comprising 236,985 prompt-response pairs across 105 configurations spanning 5 models, 4 GPU platforms, 3 frameworks, and 3 precisions. We develop the Portability Risk Index (PRI), which achieves held-out-dimension generalization of $R^2 = 0.909$ for unseen hardware and $R^2 = 0.763$ for unseen precision ($R^2$ ranges up to 1.0; higher is better). We find a fundamental dichotomy: hardware and precision changes exhibit systematic drift ($R^2 \geq 0.76$), enabling predict-once deployment, while framework and model changes show idiosyncratic drift ($R^2 < 0.48$) that requires re-measurement. Production validation blocked a high-drift upgrade in which 23.85\% of safety prompts (nearly 1 in 4) flipped between safe and unsafe classifications, demonstrating operational value. Our contribution is measurement and risk assessment; we do not propose drift mitigation techniques, which remain an open challenge for future work.
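To make the two reported quantities concrete, the following is a minimal sketch of how a classification flip rate and an $R^2$ score could be computed. It uses the standard textbook definitions and is not the paper's PRI implementation, which the abstract does not specify. The function names `flip_rate` and `r_squared` and the example labels are hypothetical.

```python
from typing import Sequence


def flip_rate(before: Sequence[str], after: Sequence[str]) -> float:
    """Fraction of prompts whose label differs between two configurations.

    Assumes both sequences are aligned prompt by prompt (hypothetical format).
    """
    if len(before) != len(after):
        raise ValueError("label sequences must have the same length")
    if not before:
        raise ValueError("need at least one prompt")
    flips = sum(b != a for b, a in zip(before, after))
    return flips / len(before)


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted):
        raise ValueError("sequences must have the same length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# Illustrative labels only; not data from the paper.
old_labels = ["safe", "unsafe", "safe", "safe"]
new_labels = ["safe", "safe", "safe", "unsafe"]
print(flip_rate(old_labels, new_labels))
```

Under this definition, a flip in either direction (safe to unsafe or unsafe to safe) counts equally. In a held-out-dimension evaluation, `predicted` would come from a model fitted without any measurements from the held-out hardware or precision.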