Poster

Scaling Distributed Training with Adaptive Summation

Saeed Maleki ⋅ Madan Musuvathi ⋅ Todd Mytkowicz ⋅ Olli Saarikivi ⋅ Tianju Xu ⋅ Vadim Eksarevskiy ⋅ Jaliya Ekanayake ⋅ Emad Barsoum

2021 Poster

[ Paper PDF]

Abstract

Data parallelism is a common way to parallelize stochastic gradient descent (SGD). However, the loss of convergence at large minibatch sizes limits the scalability of data parallelism. This paper introduces a novel method to combine gradients called Adasum that significantly improves the convergence when using large minibatches. This paper provides the intuition and formal justification of Adasum along with a convergence proof. Additionally, the paper describes an efficient implementation of Adasum and its integration into the open-source toolkit Horovod for use in both TensorFlow and PyTorch.

The paper empirically shows that Adasum improves convergence when using large minibatch sizes for multiple optimizers (Momentum-SGD, Adam, and LAMB). For BERT-Large training with a minibatch size of 64K, using both Adasum and LAMB training converges in 20% fewer epochs than with LAMB alone. This combination also allows BERT-Large training to scale to a 128K minibatch size. While one of the motivations for LAMB was the inability of the Adam optimizer to scale beyond a minibatch size of 16K, we show that Adasum helps Adam scale BERT-Large training to a 64K minibatch size. Our implementation of Adasum in Horovod has already been adopted in several production environments.

Video

The live parts of this page are not open to all registrants until 2021-04-06 00:00:00+00:00. You are seeing them because you have privileged access.

Chat is not available.