Using ML to improve computer systems has seen a significant amount of work in both academia and industry. However, deployed uses of such techniques remain rare. While many published works in this space focus on solving the underlying learning problems, we observed from an industry vantage point that some of the biggest challenges of deploying ML for Systems in practice come from non-ML systems aspects, such as feature stability, reliability, availability, ML integration into rollout processes, verification, safety guarantees, feedback loops introduced by learning, debuggability, and explainability.
The goal of this workshop is to raise awareness of these problems and bring together practitioners (both on the production systems and ML side) and academic researchers, to work towards a methodology of capturing these problems in academic research. We believe that starting this conversation between the academic and industrial research communities will facilitate the adoption of ML for Systems research in production systems, and will provide the academic community with access to new research problems that exist in real-world deployments but have seen less attention in the academic community.
The workshop will uniquely facilitate this conversation by providing a venue for lightweight sharing of anecdotes and experiences from real-world deployments, as well as giving researchers a venue for sharing early-stage work on addressing these problems.
Thu 9:00 a.m. - 9:15 a.m.
Opening Remarks
Thu 9:15 a.m. - 10:00 a.m.
ML-driven Cloud Resource Management (Invited Talk)
The variety of user workloads, application requirements, heterogeneous hardware resources, and the large number of management tasks have rendered today’s cloud fairly complex. Recent work has shown promise in using Machine Learning for efficient resource management in such dynamically changing cloud execution environments. These approaches range from offline to online learning agents. In this talk, I will focus on the challenges that arise when building such agents and those that arise when these agents are deployed in real systems. To do so, I will use SmartHarvest, a system that improves resource utilization by dynamically harvesting spare CPU cores from primary workloads to run batch workloads on cloud servers, as an example. Building on that, I will briefly talk about SOL, a framework that assists developers in building and deploying online learning agents for various use cases.
Bio: Neeraja is an assistant professor in the Department of Electrical and Computer Engineering at UT Austin. She is a Cloud Computing Systems researcher with a strong background in Machine Learning (ML). Most of her research straddles the boundaries of Systems and ML: using and developing ML techniques for systems, and building systems for ML. Before joining UT Austin, she was a postdoctoral fellow in the Computer Science department at Stanford University, and before that received her PhD in Computer Science from UC Berkeley. She had previously earned a bachelor's in Computer Engineering from the Government College of Engineering, Pune, India.
Neeraja Yadwadkar
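As a rough illustration of the harvesting idea the abstract describes (this is a hypothetical sketch, not SmartHarvest's actual algorithm), one can predict the primary workload's near-term core demand from recent observations and lend out only the cores above that prediction plus a safety buffer:

```python
# Illustrative sketch of CPU-core harvesting, NOT the actual SmartHarvest
# algorithm: predict the primary workload's near-term core demand from a
# sliding window of observations, keep a safety buffer for demand spikes,
# and lend the remaining cores to batch work.
from collections import deque

class CoreHarvester:
    def __init__(self, total_cores, window=60, buffer_cores=2):
        self.total_cores = total_cores
        self.buffer_cores = buffer_cores       # safety margin for demand spikes
        self.history = deque(maxlen=window)    # recent per-interval core usage

    def observe(self, cores_used):
        """Record the primary workload's core usage for one interval."""
        self.history.append(cores_used)

    def harvestable_cores(self):
        """Cores that can be lent to batch jobs right now."""
        if not self.history:
            return 0                           # no data yet: harvest nothing
        predicted_demand = max(self.history)   # conservative peak predictor
        spare = self.total_cores - predicted_demand - self.buffer_cores
        return max(0, spare)

harvester = CoreHarvester(total_cores=32)
for usage in [10, 12, 9, 14, 11]:
    harvester.observe(usage)
print(harvester.harvestable_cores())  # 32 - 14 - 2 = 16
```

The interesting ML-for-Systems questions start exactly where this sketch stops: replacing the conservative peak predictor with a learned one, and bounding the harm when the prediction is wrong.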
Thu 10:00 a.m. - 10:30 a.m.
Break with Refreshments
Thu 10:30 a.m. - 11:15 a.m.
The Past and Future of Machine Programming in Academia and Industry (A Retrospective and Forecast) (Invited Talk)
Machine programming (MP) is principally concerned with the automation of software development. Different from program synthesis, MP targets all aspects of software development, such as automating the debugging, testing, and profiling of code, amongst other things. In this talk, we discuss the foundations of MP and consider its impact across three views: (i) academia, (ii) established corporations, and (iii) startup ventures. We begin with “The Three Pillars of Machine Programming” and the formation of the ACM SIGPLAN Machine Programming Symposium (MAPS), both in 2017. We then discuss critical developments that occurred in MP over the last five years leading us to today, including some potential missteps. We then forecast the future of MP over the next five years, including some obvious upcoming developments (e.g., AI coding partners) and some less obvious ones (e.g., semantic reasoners, transpilation, intentional programming languages).
Justin Gottschlich
Thu 11:15 a.m. - 12:00 p.m.
Raptor: Industrial Reinforcement Learning At Scale (Invited Talk)
Bio: Jonathan Raiman is a Senior Research Scientist in the NVIDIA Applied Deep Learning Research group, working on large-scale distributed reinforcement learning and AI for systems. Previously, he was a Research Scientist at OpenAI, where he co-created OpenAI Five, a superhuman Deep Reinforcement Learning Dota 2 bot. At Baidu SVAIL, he co-created several neural text-to-speech systems (Deep Voice 1, 2, and 3), and worked on speech recognition (Deep Speech 2) and question answering (Globally Normalized Reader). He is also the creator of DeepType 1 and DeepType 2, a superhuman entity linking system. He is completing his Ph.D. at Paris Saclay, and previously obtained his master’s at MIT.
Jonathan Raiman
Thu 1:00 p.m. - 1:30 p.m.
Limitations of Data-driven based Approaches for Assuring Performance of Enterprise IT Systems (Submission Talks)
Rekha Singhal
Thu 1:00 p.m. - 1:30 p.m.
Real-World Challenges of ML-based Database Auto-tuning (Submission Talks)
Shohei Matsuura · Takashi Miyazaki
Thu 1:00 p.m. - 1:30 p.m.
Understanding Model Drift in a Large Cellular Network (Submission Talks)
Shinan Liu
Thu 1:30 p.m. - 2:15 p.m.
Panel Discussion
Benoit Steiner · Neeraja Yadwadkar · Siddhartha Sen · Jonathan Raiman
Thu 2:15 p.m. - 3:00 p.m.
CacheSack: Lessons from deploying an admission optimizer for Google datacenter flash caches (Invited Talk)
CacheSack is the admission algorithm for Google datacenter flash caches. CacheSack partitions cache traffic into categories, estimates the benefit and cost of different cache admission policies for each category, and assigns the optimal combination of admission policies while respecting resource constraints. This talk will briefly describe the design and deployment of CacheSack. We will then discuss the challenges of deploying CacheSack in production, and the lessons we learned for the future.
Bio: Arif Merchant is a Research Scientist at Google and leads the Storage Analytics group, which studies interactions between components of the storage stack. His interests include distributed storage systems, storage management, and stochastic modeling. He holds a B.Tech. degree from IIT Bombay and a Ph.D. in Computer Science from Stanford University. He is an ACM Distinguished Scientist.
Arif Merchant
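The optimization the abstract describes (one admission policy per traffic category, maximizing estimated benefit under a resource budget) can be illustrated with a toy sketch. The category names, policy names, and (benefit, cost) numbers below are hypothetical, and a brute-force search stands in for whatever solver the production system uses:

```python
# Toy illustration of the assignment problem in the CacheSack abstract (not
# the production formulation): each traffic category gets exactly one
# admission policy; each (category, policy) pair has an estimated benefit
# (e.g., disk reads saved) and cost (e.g., flash writes). Brute-force the
# best combination under a total cost budget.
from itertools import product

# Hypothetical (benefit, cost) estimates per candidate policy, per category.
categories = {
    "logs":    {"admit_all": (10, 8), "admit_on_2nd_miss": (6, 3), "never": (0, 0)},
    "indexes": {"admit_all": (20, 9), "admit_on_2nd_miss": (15, 4), "never": (0, 0)},
    "blobs":   {"admit_all": (12, 7), "admit_on_2nd_miss": (9, 5), "never": (0, 0)},
}

def best_assignment(categories, cost_budget):
    """Exhaustively pick one policy per category, maximizing total benefit
    subject to the total cost staying within budget."""
    names = list(categories)
    best, best_benefit = None, -1
    for choice in product(*(categories[n] for n in names)):
        benefit = sum(categories[n][p][0] for n, p in zip(names, choice))
        cost = sum(categories[n][p][1] for n, p in zip(names, choice))
        if cost <= cost_budget and benefit > best_benefit:
            best, best_benefit = dict(zip(names, choice)), benefit
    return best, best_benefit

assignment, benefit = best_assignment(categories, cost_budget=12)
print(assignment, benefit)
```

Brute force is exponential in the number of categories; at datacenter scale this structure is a multiple-choice knapsack, for which much more efficient relaxations exist.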
Thu 3:00 p.m. - 3:30 p.m.
PM Break with Refreshments
Thu 3:30 p.m. - 4:15 p.m.
Counterfactual Reasoning and Safeguards for ML Systems (Invited Talk)
Counterfactual reasoning is a powerful idea from reinforcement learning (RL) that allows us to evaluate new candidate policies for a system without actually deploying those policies. Traditionally, this is done by collecting randomized data from an existing policy and matching this data against the decisions of a candidate policy. In many systems, we observe that a special kind of information exists that can boost the power of counterfactual reasoning. Specifically, system policies that make threshold decisions involving a resource (e.g., time, memory, cores) naturally reveal additional, or implicit, feedback about alternative decisions. For example, if a system waits X min for an event to occur, then it automatically learns what would have happened had it waited any amount of time up to X min.
Bio: Siddhartha Sen is a Principal Researcher in the Microsoft Research New York City lab. His research trajectory started with distributed systems and data structures, evolved to incorporate machine learning, and is currently most inspired by humans. His current mission is to use AI to design human-oriented and human-inspired systems that advance human skills and empower people to achieve more. Siddhartha received his BS/MEng degrees in computer science and mathematics from MIT, then worked for three years as a developer in Microsoft’s Windows Server team before returning to academia to complete his PhD at Princeton University. Siddhartha’s work on data structures and human/AI gaming has been featured in several textbooks and podcasts.
Siddhartha Sen
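The implicit-feedback idea in the abstract can be sketched on hypothetical logged data: if the deployed policy waited `waited` minutes and recorded when (if ever) the event arrived, then for any candidate timeout t at or below `waited`, the outcome under t is fully determined by the log, so shorter-timeout policies can be evaluated offline. The log entries and function below are illustrative, not any specific system's implementation:

```python
# Sketch of counterfactual evaluation using implicit feedback from threshold
# decisions (illustrative only). Each log entry records how long the deployed
# policy waited and when the event actually arrived (None if it never arrived
# within the wait). For any candidate timeout t <= waited, the outcome under
# t is already determined by the log.

# Hypothetical log: (waited_minutes, event_arrival_minutes or None).
log = [
    (10, 3),     # waited 10 min, event arrived at minute 3
    (10, None),  # waited 10 min, event never arrived
    (10, 7),
    (10, 2),
]

def success_rate(candidate_timeout, log):
    """Offline success rate of `candidate_timeout`, using only log entries
    where the deployed policy waited at least that long."""
    usable = [(w, a) for (w, a) in log if w >= candidate_timeout]
    if not usable:
        return None  # no counterfactual coverage for this timeout
    hits = sum(1 for (_, a) in usable if a is not None and a <= candidate_timeout)
    return hits / len(usable)

print(success_rate(5, log))   # 0.5: only the events at minutes 3 and 2 arrive in time
print(success_rate(8, log))   # 0.75: events at minutes 3, 7, and 2 arrive in time
```

Note the one-sided coverage: timeouts longer than any logged wait cannot be evaluated this way, which is where randomized exploration in the deployed policy still matters.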
Thu 4:15 p.m. - 5:00 p.m.
Breakout Session