When Machine Learning Isn’t Sure: Building Resilient ML-Based Computer Systems by Embracing Uncertainty
Abstract
Machine learning (ML) models are increasingly used in computer systems but often suffer from poor generalizability, leading to costly failures on out-of-distribution (OOD) data. We propose an uncertainty-aware framework that improves system resilience by quantifying prediction uncertainty at runtime and rejecting unreliable outputs before they cause harm. When a prediction is uncertain, the system gracefully degrades to a safe fallback strategy. We evaluate the framework across three case studies (server provisioning, cluster management, and storage I/O admission) and find that the best uncertainty estimator is not universal but depends on how its properties align with each task's design and resource constraints. Similarly, the optimal fallback workflow (e.g., a lightweight, parallel workflow vs. a resource-intensive, sequential one) depends on the task's runtime latency constraints. Together, these findings offer a practical path towards building more reliable ML-driven computer systems.
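The runtime pattern the abstract describes (quantify uncertainty, reject unreliable outputs, degrade to a safe fallback) can be sketched as follows. This is a minimal illustration only: the `GuardedModel` name, the scalar uncertainty score, the fixed threshold, and the toy estimator and fallback are all hypothetical placeholders, not the paper's actual estimators or workflows.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class GuardedModel:
    """Wrap an ML predictor with a runtime uncertainty gate (illustrative)."""
    predict: Callable[[Any], Any]        # ML model's prediction
    uncertainty: Callable[[Any], float]  # runtime uncertainty estimate in [0, 1]
    fallback: Callable[[Any], Any]       # safe non-ML fallback strategy
    threshold: float                     # task-specific rejection cutoff

    def decide(self, x: Any) -> Any:
        # Reject the ML output when the estimator reports high uncertainty
        # and gracefully degrade to the fallback strategy.
        if self.uncertainty(x) > self.threshold:
            return self.fallback(x)
        return self.predict(x)


# Toy usage: treat large inputs as "out of distribution" and fall back
# to a conservative default instead of trusting the model.
guard = GuardedModel(
    predict=lambda x: x * 2,
    uncertainty=lambda x: 0.9 if x > 10 else 0.1,
    fallback=lambda x: 0,
    threshold=0.5,
)
print(guard.decide(5))   # low uncertainty -> ML output: 10
print(guard.decide(42))  # high uncertainty -> fallback: 0
```

In practice, as the abstract notes, both the choice of uncertainty estimator and the fallback workflow would be selected per task based on its resource and latency constraints.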