This tutorial introduces ML-based telemetry analytics for large-scale computing systems, with the aim of improving system performance, resilience, and power efficiency. Modern large-scale computing systems (e.g., data centers and High-Performance Computing clusters) are highly parallel systems that perform numerous complex operations concurrently, and they are critical for many societal and scientific applications. As these systems support ever higher degrees of parallelism, they often suffer significant resource contention, which in turn leads to performance variability and loss of efficiency. One way to assess system performance and identify the root causes of problems is to gather and inspect telemetry data. Such telemetry (from hundreds or thousands of hardware and software sensors) and log data are readily available on any computer system today. Because this data contains billions of data points per day, manual analysis is impractical and has limited benefits, and ML is emerging as a promising approach to automate performance analytics. Computer system telemetry analytics is also a challenging application area with many open problems: labeled data is scarce, whereas unlabeled data can reach terabytes per day.
The goal of this tutorial is twofold. First, the tutorial provides an overview of telemetry data-based analytics and shows why ML-based approaches are more promising than existing methods for identifying which applications are running on compute nodes, detecting performance and other anomalies, and finding the root causes of anomalies. Second, participants will learn and experience these materials directly during hands-on activities, using open-source analytics frameworks designed by the speakers' teams at Boston University and the University of Bologna. At the end of this tutorial, participants will have a better understanding of the challenges and opportunities and will gain the skills needed to employ ML-based frameworks for solving complex problems in computer systems.
Schedule
Tue 8:00 a.m. - 8:30 a.m. | Overview: Telemetry Data-based Analytics on Large-scale Computing Systems (Talk) | Burak Aksar
Tue 8:30 a.m. - 9:00 a.m. | Supervised Methods: Anomaly and Application Detection (Talk) | Burak Aksar
Tue 9:00 a.m. - 9:15 a.m. | Break
Tue 9:15 a.m. - 9:45 a.m. | Semi-supervised Anomaly Detection in Supercomputers (Talk) | Martin Molan
Tue 9:45 a.m. - 10:15 a.m. | Deployment: Challenges and Current Status (Talk) | Burak Aksar
Tue 10:15 a.m. - 10:40 a.m. | Hands-on Activity (Q&A) | Martin Molan