

Poster

APOLLO: SGD-like Memory, AdamW-level Performance

Hanqing Zhu · Zhenyu Zhang · Wenyan Cong · Xi Liu · Sem Park · Vikas Chandra · Bo Long · David Pan · Atlas Wang · Jinwon Lee


Abstract:

Large language models (LLMs) demonstrate remarkable capabilities but are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden often necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput, respectively. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face key challenges: (i) reliance on costly SVD operations (e.g., GaLore, Fira); (ii) significant performance trade-offs compared to AdamW (e.g., Flora); and (iii) still-substantial memory overhead from optimizer states required to maintain competitive performance (e.g., 1/4 rank in GaLore, and a full-rank first moment in Adam-mini).

In this work, we investigate the redundancy in AdamW's learning rate adaptation rule and identify that it can be coarsened into a structured learning rate update (channel-wise or tensor-wise). Based on this insight, we propose a novel approach, Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates the channel-wise learning rate scaling with an auxiliary low-rank optimizer state based on pure random projection. The structured learning rate update rule makes APOLLO highly tolerant to further memory reduction at lower rank, halving the rank while delivering similar pre-training performance. Moreover, we propose an extreme memory-efficient version, APOLLO-Mini, which utilizes tensor-wise scaling with only a rank-1 auxiliary sub-space, achieving SGD-level memory cost while delivering better pre-training performance than Adam(W).

We conduct extensive experiments across different model architectures and tasks, showing that the APOLLO series performs on par with, or even better than, Adam(W). Meanwhile, APOLLO achieves even greater memory savings than GaLore by almost eliminating the optimizer states of AdamW. These savings translate into significant system benefits:

(1) Enhanced Throughput: APOLLO and APOLLO-Mini achieve around 3x throughput on an 8xA100-80GB setup compared to AdamW by fully utilizing memory to support 4x larger batch sizes.

(2) Improved Model Scalability: APOLLO-Mini for the first time enables pre-training the LLaMA-13B model with naive DDP on A100-80GB GPUs without requiring other system-level optimizations.

(3) Low-End GPU Friendly Pre-training: Combined with quantization, the APOLLO series for the first time enables training LLaMA-7B from scratch on a single GPU using less than 12 GB of memory.

Our code is open-sourced at https://github.com/zhuhanqing/APOLLO.
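To make the core idea concrete, below is a minimal, illustrative sketch of the mechanism the abstract describes: AdamW-style moments are kept only in a small random-projection subspace, and the resulting channel-wise (per-row) scaling factors are applied to the full-rank raw gradient. This is not the authors' implementation (see the GitHub repository above); hyperparameter names such as `rank`, `beta1`, and `beta2`, and the per-row norm-ratio scaling, are assumptions made for illustration only.

```python
# Illustrative sketch of the APOLLO idea, NOT the official implementation
# (see https://github.com/zhuhanqing/APOLLO). All names/choices here are
# assumptions for exposition.
import torch

class ApolloSketch:
    def __init__(self, param, rank=1, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        n, m = param.shape
        # Fixed random projection onto a low-rank auxiliary space (no SVD needed).
        self.proj = torch.randn(m, rank) / rank ** 0.5
        # AdamW-style moments live only in the tiny projected space: (n, rank).
        self.m = torch.zeros(n, rank)
        self.v = torch.zeros(n, rank)
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.t = 0

    @torch.no_grad()
    def step(self, param, grad):
        self.t += 1
        r = grad @ self.proj                                   # projected gradient
        self.m.mul_(self.beta1).add_(r, alpha=1 - self.beta1)  # first moment
        self.v.mul_(self.beta2).addcmul_(r, r, value=1 - self.beta2)  # second moment
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        adam_r = m_hat / (v_hat.sqrt() + self.eps)
        # Channel-wise scaling estimated in the low-rank space, then applied
        # to the full-rank raw gradient. With rank=1 this approaches the
        # tensor-wise, SGD-like memory footprint described for APOLLO-Mini.
        scale = adam_r.norm(dim=1) / (r.norm(dim=1) + self.eps)
        param -= self.lr * scale.unsqueeze(1) * grad

# Usage sketch on a single weight matrix:
W = torch.randn(64, 32)
opt = ApolloSketch(W, rank=1)
opt.step(W, torch.randn_like(W))
```

Because the auxiliary states have shape (n, rank) rather than the (n, m) states of AdamW, the optimizer memory is nearly eliminated for small ranks, which is the source of the throughput and scalability benefits listed above.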
