Cost-aware Duration Prediction for Software Upgrades in Datacenters
Abstract
Software upgrades are critical to maintaining server reliability in datacenters. While job duration prediction and scheduling have been extensively studied, the unique challenges posed by software upgrades remain largely under-explored. This paper presents the first in-depth investigation into software upgrade scheduling at datacenter scale. We begin by characterizing various types of upgrades and then frame the scheduling task as a constrained optimization problem. To address this problem, we introduce Zephyr, a cost-aware duration prediction framework designed to improve upgrade scheduling efficiency and throughput while meeting service-level objectives (SLOs). Zephyr accounts for asymmetric misprediction costs, strategically selects the best predictive models, and mitigates straggler-induced overestimations. Evaluations on Meta's production datacenter systems demonstrate that Zephyr significantly outperforms the existing upgrade scheduler, improving upgrade window utilization by 1.25x, increasing the number of scheduled and completed upgrades by 33% and 41%, respectively, and reducing cancellation rates by 2.4x. The code and data sets will be released after paper acceptance.
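To make the notion of asymmetric misprediction costs concrete, the following minimal sketch scores duration predictions with a piecewise-linear cost and picks the candidate model with the lowest total cost on held-out data. It is an illustration only, not Zephyr's actual cost function or selection procedure: the names `asymmetric_cost` and `pick_model`, the weights, and the assumption that underestimates are penalized more heavily than overestimates are all hypothetical.

```python
def asymmetric_cost(predicted, actual, under_weight=3.0, over_weight=1.0):
    """Cost of one duration prediction.

    Hypothetical weights: an underestimate (the upgrade ran longer than
    predicted) is assumed to cost more per unit of error than an overestimate.
    """
    error = actual - predicted
    if error > 0:  # underestimate
        return under_weight * error
    return over_weight * (-error)  # overestimate, or exact (cost 0)


def pick_model(predictions_by_model, actuals, **weights):
    """Return the name of the model with the lowest total asymmetric cost."""
    totals = {
        name: sum(asymmetric_cost(p, a, **weights) for p, a in zip(preds, actuals))
        for name, preds in predictions_by_model.items()
    }
    return min(totals, key=totals.get)


# Illustrative durations in minutes (made-up values, not from the paper).
actuals = [12, 12]
candidates = {"model_a": [10, 10], "model_b": [13, 13]}
best = pick_model(candidates, actuals)
```

Under these assumed weights, a model that overestimates slightly can be preferred over one that underestimates, even when both have similar absolute error.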