Poster
Youmu: Efficient Columnar Data Pipeline for LLM Training
Tianle Zhong · Jiechen Zhao · Qiang Su · Geoffrey Fox
Abstract:
Large language model (LLM) training is extremely data-intensive, often involving trillions of tokens. Although LLM datasets are usually ingested and stored in columnar formats, they are often converted into another format for training, which incurs significant storage and maintenance costs for the extra data copies. While eliminating the conversion would save tens of terabytes of space on costly high-performance storage, this work identifies challenges that drive us to rethink the entire data pipeline. Without conversion, we find that fine-grained random access patterns suffer efficiency drops of hundreds of times: the existing pipeline cannot efficiently ingest data directly in columnar format because its I/O is coarse-grained by default, and existing solutions tend to sacrifice memory footprint by caching entire datasets. In this paper, we present Youmu, a new data pipeline that feeds fine-grained columnar data directly to GPUs, enabling cost-efficient LLM training. Meanwhile, Youmu maintains high training accuracy, reducing pretraining perplexity by 0.3-0.7 compared to the widely adopted local shuffle. Compared to state-of-the-art distributed memory-based pipelines optimized for throughput, Youmu achieves comparable throughput with $\sim$80\% less memory footprint.
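The read-amplification problem behind the "hundreds of times" efficiency drop can be sketched with a toy model (this is not Youmu's implementation; the row-group size, dataset size, and batch size below are all hypothetical). Columnar files such as Parquet store rows in compressed row groups, so fetching one randomly chosen sample forces the reader to decode an entire row group:

```python
import random

# Hypothetical Parquet-like layout: rows are packed into row groups,
# and the smallest unit of I/O is one whole row group.
ROW_GROUP = 1000       # rows per row group (assumed)
NUM_ROWS = 1_000_000   # total rows in the dataset (assumed)
BATCH = 256            # fine-grained random samples per training step (assumed)

random.seed(0)
samples = random.sample(range(NUM_ROWS), BATCH)

# Coarse-grained I/O: each random sample drags in its full row group.
groups_touched = {row // ROW_GROUP for row in samples}
rows_read = len(groups_touched) * ROW_GROUP

# Read amplification: rows actually decoded per row the trainer wanted.
amplification = rows_read / BATCH
print(f"rows requested: {BATCH}, rows read: {rows_read}, "
      f"amplification: {amplification:.0f}x")
```

With these sizes, most of the 256 samples land in distinct row groups, so the reader decodes on the order of a hundred thousand rows to serve a 256-row batch, i.e. several-hundred-fold amplification, which is consistent with the efficiency drop the abstract reports for naive fine-grained access.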