DriftBench: Measuring and Predicting Infrastructure Drift in LLM Serving Systems
Gianluigi Vitale
Abstract
Production LLM deployments lack systematic methods for assessing the risk to output consistency when infrastructure changes. We present DriftBench, a measurement and prediction framework built on 236,985 prompt-response pairs across 105 configurations spanning 5 models, 4 GPU platforms, 3 frameworks, and 3 precisions. We develop the Portability Risk Index (PRI), which achieves $R^2 = 0.987$ on held-out test data (higher $R^2$ indicates better predictive accuracy, with 1 a perfect fit) and generalizes to held-out dimensions: $R^2 = 0.909$ for hardware and $R^2 = 0.763$ for precision. We find a fundamental dichotomy: hardware and precision changes exhibit systematic drift ($R^2 \geq 0.76$), enabling predict-once deployment, while framework and model changes show idiosyncratic drift ($R^2 < 0.48$), requiring re-measurement. In production validation, DriftBench blocked an upgrade with +9.23 pp drift affecting 1 in 5 queries, demonstrating operational value. Our contribution is measurement and risk assessment; we do not propose drift mitigation techniques, which remain an open challenge for future work. Verification: https://anonymous.4open.science/r/reviewer-verification-5F4E/ | DriftBench CLI: https://anonymous.4open.science/r/driftbench-7FEC/
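For readers unfamiliar with the metric, the following is a minimal sketch of how a coefficient of determination such as the $R^2$ values quoted above is conventionally computed on held-out data. It is not taken from the DriftBench code base: the function name `r_squared` and the drift values in the usage example are hypothetical, and the standard definition $R^2 = 1 - SS_{res}/SS_{tot}$ is assumed.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Hypothetical helper for illustration; not part of the DriftBench CLI.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if len(observed) < 2:
        raise ValueError("need at least two observations")

    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: error left over after prediction.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the observations around their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical measured vs. predicted drift rates (percentage points)
    # for configurations held out from fitting.
    measured = [1.2, 3.4, 0.8, 9.2, 5.1]
    predicted = [1.0, 3.9, 1.1, 8.7, 5.4]
    print(f"held-out R^2 = {r_squared(measured, predicted):.3f}")
```

Under this definition a perfect predictor scores 1, a predictor that always returns the mean of the observations scores 0, and a predictor worse than that mean baseline scores below 0, which can happen on held-out data.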