
Poster

APOLLO: SGD-like Memory, AdamW-level Performance

Hanqing Zhu · Zhenyu Zhang · Wenyan Cong · Xi Liu · Sem Park · Vikas Chandra · Bo Long · David Pan · Atlas Wang · Jinwon Lee


Abstract:

Large language models (LLMs) exhibit remarkable capabilities but are notoriously memory-intensive to train, largely because of the optimizer states stored by the widely used Adam optimizer. This memory burden forces practitioners to use more GPUs, smaller batch sizes, or higher-end hardware, thereby limiting model scalability and reducing training efficiency. To alleviate this, several memory-efficient optimizers have been proposed to compress the optimizer memory. However, they face critical limitations: (i) reliance on time-consuming SVD operations (GaLore and Fira); (ii) significant performance gaps relative to AdamW (Flora); and (iii) substantial memory overhead to achieve competitive performance (GaLore, Fira, Flora, and Adam-mini). In this work, we investigate the redundancy in Adam's learning-rate adaptation rule and identify that it can be coarsened into a structured learning-rate update (channel-wise or tensor-wise). Based on this insight, we propose a novel approach, Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO), which approximates the channel-wise learning-rate scaling with an auxiliary low-rank optimizer state based on pure random projection. The structured learning-rate update rule makes APOLLO highly tolerant to further memory reduction at lower rank: halving the rank delivers similar pre-training performance. Moreover, we propose an extreme memory-efficient variant, APOLLO-MINI, which uses tensor-wise scaling with only a rank-1 auxiliary subspace, achieving SGD-level memory cost with pre-training performance superior to Adam's. We conduct experiments across different tasks and model architectures to verify the effectiveness of our approach, showing that the APOLLO series outperforms Adam while achieving greater memory savings than GaLore.
Notably, APOLLO-MINI almost entirely eliminates the optimizer states of AdamW and GaLore, enabling LLaMA-7B to be trained from scratch with less than 12 GB of memory while improving throughput by 3×. This makes the APOLLO series a highly competitive solution for memory-efficient large-scale LLM training.
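To make the idea concrete, the channel-wise variant described in the abstract can be sketched roughly as follows: the full-rank gradient is projected into a small random subspace, Adam-style moments are kept only in that subspace, and the resulting per-channel norm ratio is used as a structured learning-rate scale on the original gradient. This is a minimal illustrative sketch, not the authors' implementation; the function name, argument layout, and hyperparameter defaults are assumptions.

```python
import numpy as np

def apollo_step(W, G, m, v, P, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, t=1):
    """One APOLLO-style update (illustrative sketch).

    W: (n, d) weight matrix        G: (n, d) full-rank gradient
    m, v: (r, d) low-rank moments  P: (r, n) fixed random projection
    """
    R = P @ G                                   # project gradient to rank-r subspace
    m = betas[0] * m + (1 - betas[0]) * R       # first moment, low-rank only
    v = betas[1] * v + (1 - betas[1]) * R ** 2  # second moment, low-rank only
    m_hat = m / (1 - betas[0] ** t)             # bias correction, as in Adam
    v_hat = v / (1 - betas[1] ** t)
    R_adapted = m_hat / (np.sqrt(v_hat) + eps)  # Adam-style update in the subspace
    # Channel-wise scale: norm ratio of adapted vs. raw projected gradient,
    # computed per column (channel) of the low-rank representation.
    s = np.linalg.norm(R_adapted, axis=0) / (np.linalg.norm(R, axis=0) + eps)
    W = W - lr * s * G                          # scaled full-rank gradient step
    return W, m, v
```

A tensor-wise (APOLLO-MINI-like) variant would replace the per-column norms with a single scalar norm ratio over the whole projected tensor, with `P` of rank 1, so the auxiliary state shrinks to essentially nothing.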
