Training Deep Neural Network (DNN) models in parallel on a distributed machine cluster is an important emerging workload that is increasingly communication bound. To be clear, it remains computationally intensive. But the last seven years have brought a 62× improvement in compute performance, thanks to GPUs and other hardware accelerators. Cloud network deployments have found this pace hard to match, skewing the ratio of computation to communication towards the latter. Meanwhile, in pursuit of ever higher accuracy, data scientists continue to increase model size and complexity as far as the underlying compute and memory capabilities allow.
The MLSys community has recently been tackling the communication challenges in distributed DNN training with various approaches, ranging from efficient parameter servers [3, 4] and scalable collective communication [2, 6] to in-network aggregation and gradient compression techniques [1, 7]. The overarching goal of these works has been to alleviate communication bottlenecks by reducing the time that workers spend exchanging their local gradients over the network.
In this tutorial, we will present some of the state-of-the-art approaches, with a primary focus on our own work in the area while also drawing links to the broader literature. The tutorial will include practical hands-on parts so that the audience may also learn by doing. We hope the tutorial will familiarize attendees with this timely area and stimulate discussions and new ideas. A first iteration of this tutorial took place during the ACM SIGCOMM 2021 conference. That tutorial was successful and attracted an audience of around 30 people. A recording of the event is available at https://sands.kaust.edu.sa/naddl-sigcomm21/