Track: Session 7: Systems

Wed 7 April 13:30 - 13:50 PDT

IOS: Inter-Operator Scheduler for CNN Acceleration

Yaoyao Ding · Ligeng Zhu · Zhihao Jia · Gennady Pekhimenko · Song Han

To accelerate CNN inference, existing deep learning frameworks focus on optimizing intra-operator parallelization. However, a single operator can no longer fully utilize the available parallelism given the rapid advances in high-performance hardware, resulting in a large gap between the peak performance and the real performance. This performance gap is more severe under smaller batch sizes. In this work, we extensively study the parallelism between operators and propose Inter-Operator Scheduler (IOS) to automatically schedule multiple operators' parallel execution through a novel dynamic programming algorithm. IOS consistently outperforms state-of-the-art libraries (e.g., TensorRT) by 1.1 to 1.5x on modern CNN benchmarks. The code to reproduce each experiment is available at: https://github.com/mit-han-lab/inter-operator-scheduler.

Wed 7 April 13:50 - 14:10 PDT

Value Learning for Throughput Optimization of Deep Learning Workloads

Benoit Steiner · Chris Cummins · Horace He · Hugh Leather

As the usage of machine learning techniques is becoming ubiquitous, the efficient execution of deep learning models is crucial to many applications. Frameworks such as Halide or TVM separate the algorithmic representation of the neural network from the schedule that determines its implementation. Finding good schedules, however, remains extremely challenging. Autotuning methods, which search the space of valid schedules and execute each candidate on the hardware, identify some of the best performing schedules, but the search can take hours, hampering the productivity of deep learning practitioners. What is needed is a method that achieves a similar performance without extensive search, delivering the needed efficiency quickly.

We model the scheduling process as a sequence of optimization choices, and present a new technique to accurately predict the expected performance of a partial schedule using a LSTM over carefully engineered features that describe each DNN operator and their current scheduling choices. Leveraging these predictions we are able to make these optimization decisions greedily, and without any executions on the target hardware, quickly identify an efficient schedule.

Our evaluation shows that our performance predictions are one order of magnitude more accurate than the state of the art. This enables us to find schedules that improve the execution performance of deep neural networks by 1.5x or more over the best autoschedulers. Moreover, our technique is two to three orders of magnitude faster than these tools, and completes in seconds instead of hours.

Wed 7 April 14:10 - 14:30 PDT

ByzShield: An Efficient and Robust System for Distributed Training

Konstantinos Konstantinidis · Aditya Ramamoorthy

Training of large scale models on distributed clusters is a critical component of the machine learning pipeline. However, this training can easily be made to fail if some workers behave in an adversarial (Byzantine) fashion whereby they return arbitrary results to the parameter server (PS). A plethora of existing papers consider a variety of attack models and propose robust aggregation and/or computational redundancy to alleviate the effects of these attacks. In this work we consider an omniscient attack model where the adversary has full knowledge about the gradient computation assignments of the workers and can choose to attack (up to) any q out of K worker nodes to induce maximal damage. Our redundancy-based method ByzShield leverages the properties of bipartite expander graphs for the assignment of tasks to workers; this helps to effectively mitigate the effect of the Byzantine behavior. Specifically, we demonstrate an upper bound on the worst case fraction of corrupted gradients based on the eigenvalues of our constructions which are based on mutually orthogonal Latin squares and Ramanujan graphs. Our numerical experiments indicate over a 36% reduction on average in the fraction of corrupted gradients compared to the state of the art. Likewise, our experiments on training followed by image classification on the CIFAR-10 dataset show that ByzShield has on average a 20% advantage in accuracy under the most sophisticated attacks. ByzShield also tolerates a much larger fraction of adversarial nodes compared to prior work.

Wed 7 April 14:30 - 14:50 PDT

FirePlace: Placing Firecraker Virtual Machines with Hindsight Imitation

Bharathan Balaji · Christopher Kakovitch · Balakrishnan Narayanaswamy

Virtual machines (VM) form the foundation of modern cloud computing as they help logically abstract per-user compute from shared physical infrastructure. Users of these services require VMs of varying sizes and configurations, which the provider places on a set of physical machines (PMs). VMs on the same physical PM share memory and CPU resources so a bad packing directly impacts the quality of user experience. We consider the placement of Firecracker VMs (a form of Micro-VMs) -- lightweight VMs that are typically used for short lived tasks. Our objective is to place each VM as it arrives, so that the peak to average ratio of resource usage across PMs is minimized. Placement is challenging as we need to consider resource use in multiple dimensions, such as CPU and memory, and because resource use changes over time. Past approaches to similar problems have suggested that one could forecast VM resource use for placement. We see that in our production traffic, Micro-VM resource use is spiky and short lived, and that forecasting algorithms are not useful. We evaluate Reinforcement Learning (RL) approaches for this task, but find that off-the-shelf RL algorithms are not always performant. We present a forecasting free algorithm, called FirePlace, that learns the placement decision using a variant of hindsight optimization, which we call hindsight imitation. We evaluate our approach using a production traffic trace of Firecracker usage AWS Lambda. FirePlace improves upon baseline algorithms by 10% on a production data trace of 100K Firecracker VMs.