Zero redundancy distributed learning with differential privacy
Abstract
Deep learning with large models has achieved great success across a wide range of domains. However, training models with billions of parameters is very challenging in terms of training speed, memory cost, and communication efficiency, especially under the privacy-preserving regime of differential privacy (DP). On the one hand, DP optimization is comparably efficient to standard non-DP optimization on a single GPU, yet existing DP distributed learning is significantly less efficient on multiple GPUs. On the other hand, the Zero Redundancy Optimizer (ZeRO) is a state-of-the-art solution for standard distributed learning, but making it compatible with DP is technically complicated. In this work, we develop a new systematic solution, DP-ZeRO, (I) to scale up the trainable DP model size, e.g., to GPT-100B, (II) to achieve the same computation and communication efficiency as standard ZeRO, and (III) to enable mixed-precision DP training. Like standard ZeRO, our DP-ZeRO has the potential to train models of arbitrary size and exhibits excellent training efficiency on large models. Code is available at \url{https://anonymous.4open.science/r/fast-differential-privacy-3B50}.