Moderator: Bilge Acun
Daochen Zha · Louis Feng · Liang Luo · Bhargav Bhushanam · Zirui Liu · Yusuo Hu · Jade Nie · Yuzhen Huang · Yuandong Tian · Arun Kejariwal · Xia Hu
Sharding a large machine learning model across multiple devices to balance the costs is important in distributed training. This is challenging because partitioning is NP-hard, and estimating the costs accurately and efficiently is difficult. In this work, we explore a "pre-train, and search" paradigm for efficient sharding. The idea is to pre-train a universal and once-for-all neural network to predict the costs of all the possible shards, which serves as an efficient sharding simulator. Built upon this pre-trained cost model, we then perform an online search to identify the best sharding plans given any specific sharding task. We instantiate this idea in deep learning recommendation models (DLRMs) and propose NeuroShard for embedding table sharding. NeuroShard pre-trains neural cost models on augmented tables to cover various sharding scenarios. Then it identifies the best column-wise and table-wise sharding plans with beam search and greedy grid search, respectively. Experiments show that NeuroShard significantly and consistently outperforms the state-of-the-art on the benchmark sharding dataset, achieving up to 23.8% improvement. When deployed in an ultra-large production DLRM with multi-terabyte embedding tables, NeuroShard achieves 11.6% improvement in embedding costs over the state-of-the-art, which translates to 6.6% end-to-end training throughput improvement. To facilitate future research of the "pre-train, and search" paradigm in ML for Systems, we open-source our code at https://github.com/daochenzha/neuroshard
Sanket Purandare · Abdul Wasay · Stratos Idreos · Animesh Jain
In this paper, we identify that modern GPUs - the key platform for developing neural networks - are being severely underutilized, with ∼ 50% utilization, that further drops as GPUs get faster. We show that state-of-the-art training techniques that employ operator fusion and larger mini-batch size to improve GPU utilization are limited by memory and do not scale with the size and number of models. Additionally, we show that using state-of-the art data swapping techniques (between GPU and host memory) to address GPU memory limitations lead to massive computation stalls as network sizes grow.We introduce μ-two, a novel compiler that maximizes GPU utilization. At the core of μ-two is an approach that leverages selective data swapping from GPU to host memory only when absolutely necessary, and maximally overlaps data movement with independent computation operations such that GPUs never have to wait for data. By collecting accurate run-time statistics and data dependencies, μ-two automatically fuses operators across different models, and precisely schedules data movement and computation operations to enable concurrent training of multiple models with minimum stall time. We show how to generate μ-two schedules for diverse neural network and GPU architectures and integrate μ-two into the PyTorch framework. Our experiments show that μ-two can achieve up to a 3× speed-up across a range of network architectures and hardware, spanning vision, natural language processing, and recommendation applications.
Pritam Damania · Shen Li · Alban Desmaison · Alisson Azzolini · Brian Vaughan · Edward Yang · Gregory Chanan · Guoqiang Jerry Chen · Hongyi Jia · Howard Huang · Joseph Spisak · Luca Wehrstedt · Lucas Hosseini · Manoj Krishnan · Omkar Salpekar · Pavel Belevich · Rohan Varma · Satendra Gera · Wanchao Liang · Shihao Xu · Soumith Chintala · Chaoyang He · Amir Ziashahabi · Salman Avestimehr · · Zachary DeVito
Distributed training technologies have advanced rapidly in the past few years and have unlocked unprecedented scalability with increasingly complex solutions. These technologies have made distributed training much more efficient and accessible, though they impose specific constraints on the training paradigm or the model structure. As a result, applications that fail to meet these constraints must rely on general-purpose distributed computing frameworks to scale out. However, without access to the internal components of deep learning frameworks, these distributed computing frameworks usually significantly fall short in terms of efficiency and usability. To address these problems, we propose PyTorch RPC as a generic and high-performance solution for distributed deep learning. Compared to generic distributed computing frameworks, PyTorch RPC natively provides essential features for implementing training applications in a distributed environment, including optimized tensor communications, remote memory management, and distributed autograd. Evaluations show that PyTorch RPC attains up to two orders of magnitude faster tensor communication compared to gRPC with one-tenth of the user code. Case studies further demonstrate that users can easily employ PyTorch RPC to build efficient reinforcement learning applications (video game solver), implement large language models (GPT3), train recommendation models (DLRM), and scale federated learning tasks (FedML).
Mark Zhao · Dhruv Choudhary · Devashish Tyagi · Ajay Somani · Max Kaplan · Sung-Han Lin · Sarunya Pumma · Jongsoo Park · Aarti Basant · Niket Agarwal · Carole-Jean Wu · Christos Kozyrakis
We present RecD (Recommendation Deduplication), a suite of end-to-end infrastructure optimizations across the Deep Learning Recommendation Model (DLRM) training pipeline. RecD addresses immense storage, preprocessing, and training overheads caused by feature duplication inherent in industry-scale DLRM training datasets. Feature duplication arises because DLRM datasets are generated from interactions. While each user session can generate multiple training samples, many features’ values do not change across these samples. We demonstrate how RecD exploits this property, end-to-end, across a deployed training pipeline. RecD optimizes data generation pipelines to decrease dataset storage and preprocessing resource demands and to maximize duplication within a training batch. RecD introduces a new tensor format, InverseKeyedJaggedTensors (IKJTs), to deduplicate feature values in each batch. We show how DLRM model architectures can leverage IKJTs to drastically increase training throughput. RecD improves the training and preprocessing throughput and storage efficiency by up to 2.48×, 1.79×, and 3.71×, respectively, in an industry-scale DLRM training system.