This tutorial introduces the audience to ML-based telemetry analytics for large-scale computing systems, with the goal of improving system performance, resilience, and power efficiency. Modern large-scale computing systems (e.g., data centers and High-Performance Computing clusters) are highly parallel systems that perform numerous complex operations concurrently, and they are critical for many societal and scientific applications. As these systems support ever higher degrees of parallelism, they often suffer from significant resource contention, which in turn leads to performance variability and loss of efficiency. One way to assess system performance and identify the root causes of problems is to gather and inspect telemetry data. Such telemetry (from hundreds or thousands of hardware and software sensors) and log data are readily available on any computer system today. Because this system data contains billions of data points per day, manual analysis is impractical and has limited benefits. Given these limitations, ML is emerging as a promising approach to automate performance analytics. Computer system telemetry analytics is also a challenging application area with many open problems, since labeled data is scarce whereas unlabeled data can reach the scale of terabytes per day.
The goal of this tutorial is twofold. First, the tutorial provides an overview of telemetry data-based analytics and shows why ML-based approaches are more promising than existing methods for identifying which applications are running on compute nodes, detecting performance or other anomalies, and diagnosing the root causes of anomalies. Second, participants will learn and experience these materials directly during hands-on activities through the use of open-source analytics frameworks designed by the speakers' teams at Boston University and the University of Bologna. At the end of this tutorial, participants will have a better understanding of the challenges and opportunities and will have gained the skills needed to employ ML-based frameworks for solving complex problems in computer systems.
Tue 8:00 a.m. - 8:30 a.m.
Overview: Telemetry Data-based Analytics on Large-scale Computing Systems (Talk)
The first talk will provide a broad overview of the field of telemetry-based analytics, including topics like monitoring frameworks, details about the nature of large-scale computing systems, and telemetry-based analytics frameworks. We will first describe some existing rule-based or heuristic analytics methods, and special emphasis will be given to opportunities for ML-based automated techniques.
Burak Aksar
Tue 8:30 a.m. - 9:00 a.m.
Supervised Methods: Anomaly and Application Detection (Talk)
The second talk will focus on anomaly and application detection in large-scale computing systems. We will cover the motivation behind detecting performance anomalies in these systems and summarize existing synthetic performance anomaly suites that can help generate labeled anomalous data samples for supervised methods [9]. Next, we will discuss several successful supervised anomaly detection and diagnosis methods introduced in the last five years [1, 7]. We will also cover the security aspect and provide an example from a framework that can identify running applications, which is especially important for detecting unwanted applications such as cryptocurrency miners and password crackers [2]. We will conclude with the unique strengths and limitations of supervised methods, as well as their usability in production systems.
Burak Aksar
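As a rough illustration of the supervised setting described in this talk, the sketch below summarizes each telemetry window with a few statistics and assigns a label using a nearest-centroid rule. It is not the method from [1], [2], or [7]; the classifier is a deliberately simple stand-in, and the metric, the label names, and the synthetic data are all assumptions made for the example.

```python
import random
import statistics


def extract_features(window):
    """Summarize one telemetry window (a list of floats) with simple statistics."""
    return [
        statistics.fmean(window),
        statistics.pstdev(window),
        min(window),
        max(window),
    ]


def train_centroids(windows, labels):
    """Compute one mean feature vector (centroid) per label."""
    grouped = {}
    for window, label in zip(windows, labels):
        grouped.setdefault(label, []).append(extract_features(window))
    return {
        label: [statistics.fmean(column) for column in zip(*features)]
        for label, features in grouped.items()
    }


def predict(centroids, window):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    features = extract_features(window)
    return min(
        centroids,
        key=lambda label: sum(
            (a - b) ** 2 for a, b in zip(features, centroids[label])
        ),
    )


# Hypothetical synthetic data: CPU utilization (%) sampled 60 times per window.
rng = random.Random(0)
healthy = [[rng.gauss(40, 3) for _ in range(60)] for _ in range(50)]
# A synthetic contention anomaly, modeled here as a higher and noisier signal.
anomalous = [[rng.gauss(85, 8) for _ in range(60)] for _ in range(50)]

windows = healthy + anomalous
labels = ["healthy"] * len(healthy) + ["anomalous"] * len(anomalous)
centroids = train_centroids(windows, labels)

test_healthy = [rng.gauss(40, 3) for _ in range(60)]
test_anomalous = [rng.gauss(85, 8) for _ in range(60)]
print(predict(centroids, test_healthy))
print(predict(centroids, test_anomalous))
```

The need for a label on every training window is the practical limitation of this setting: on a real system, such labels usually have to come from injected synthetic anomalies or manual annotation.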
Tue 9:00 a.m. - 9:15 a.m.
Break
Tue 9:15 a.m. - 9:45 a.m.
Semi-supervised Anomaly Detection in Supercomputers (Talk)
The third talk will focus on frameworks that can leverage a limited amount of labeled data together with a large amount of unlabeled data. The main motivation behind these frameworks is that supervised ML-based frameworks require large labeled data sets, and this requirement is restrictive for many real-world application domains. We will discuss two techniques in detail. The first technique requires only non-anomalous telemetry data to detect performance anomalies on compute nodes [5]. The second technique takes advantage of a few labeled anomalous samples to classify anomaly types [3]. Finally, we will provide a glimpse of the opportunities stemming from the combination of supervised and semi-supervised approaches [6].
Martin Molan
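The idea of training only on non-anomalous telemetry can be sketched as follows: a profile of normal behavior is fitted on healthy samples, and new samples are flagged when they deviate from that profile more than any healthy training sample did. This is a minimal sketch under stated assumptions, not the technique from [5]; a per-metric z-score profile stands in for whatever learned model a real framework would use, and the metric names and values are hypothetical.

```python
import random
import statistics


def fit_normal_profile(samples):
    """Learn per-metric mean and standard deviation from healthy samples only."""
    columns = list(zip(*samples))
    means = [statistics.fmean(column) for column in columns]
    # Fall back to 1.0 to avoid division by zero for constant metrics.
    stds = [statistics.pstdev(column) or 1.0 for column in columns]
    return means, stds


def anomaly_score(profile, sample):
    """Score a sample by its largest absolute z-score across all metrics."""
    means, stds = profile
    return max(abs(x - m) / s for x, m, s in zip(sample, means, stds))


rng = random.Random(1)
# Hypothetical node metrics: [cpu_util (%), mem_used (GB), power (W)].
healthy = [
    [rng.gauss(50, 5), rng.gauss(32, 2), rng.gauss(300, 15)]
    for _ in range(500)
]
profile = fit_normal_profile(healthy)

# Threshold: the highest score observed on the healthy training data.
threshold = max(anomaly_score(profile, sample) for sample in healthy)


def is_anomalous(sample):
    """Flag samples that deviate more than any healthy training sample."""
    return anomaly_score(profile, sample) > threshold


normal_sample = [51.0, 31.5, 305.0]
leaky_sample = [52.0, 60.0, 310.0]  # memory usage far outside the healthy range
print(is_anomalous(normal_sample))
print(is_anomalous(leaky_sample))
```

No anomalous examples are needed to fit the profile or choose the threshold, which is what makes this style of detector attractive when labels are scarce. The trade-off is that it can only say that a sample is unusual, not which type of anomaly it is.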
Tue 9:45 a.m. - 10:15 a.m.
Deployment: Challenges and Current Status (Talk)
The fourth talk will focus on operational data analytics solutions that provide runtime system insights for users and system administrators in real-world production systems [4, 5, 8]. We will cover the challenges we faced while deploying ML-based solutions to production environments. We will also discuss the system components required to leverage ML-based approaches at scale, as well as open problems.
Burak Aksar
Tue 10:15 a.m. - 10:40 a.m.
Hands-on Activity (Q&A)
In the hands-on section, we will present a subset of the data collected from Marconi 100, a Tier-0 supercomputing system at CINECA. On that subset of data, we will demonstrate a semi-supervised anomaly detection approach [6].
Martin Molan