Moderator: Qi Lei
Hongyi Wang · Saurabh Agarwal · Dimitris Papailiopoulos
To mitigate communication overheads in distributed model training, several studies propose the use of compressed stochastic gradients, usually achieved by sparsification or quantization. Such techniques achieve high compression ratios, but in many cases incur either significant computational overheads or some accuracy loss. In this work, we present Pufferfish, a communication and computation efficient distributed training framework that incorporates the gradient compression into the model training process via training low-rank, pre-factorized deep networks. Pufferfish not only reduces communication, but also completely bypasses any computation overheads related to compression, and achieves the same accuracy as state-of-the-art, off-the-shelf deep models. Pufferfish can be directly integrated into current deep learning frameworks with minimum implementation modification. Our extensive experiments over real distributed setups, across a variety of large-scale machine learning tasks, indicate that Pufferfish achieves up to 1.64x end-to-end speedup over the latest distributed training API in PyTorch without accuracy loss. Compared to the Lottery Ticket Hypothesis models, Pufferfish leads to equally accurate, small-parameter models while avoiding the burden of ``winning the lottery''. Pufferfish also leads to more accurate and smaller models than SOTA structured model pruning methods.
Nadeen Gebara · Manya Ghobadi · Paolo Costa
We present PANAMA, a network architecture for machine learning (ML) workloads on shared clusters where a variety of training jobs co-exist.PANAMA consists of two key components: (i) an efficient in-network hardware accelerator designed to accelerate large data-parallel training transfers; and (ii) a lightweight congestion control protocol to enable fair sharing of network resources across different flows. Our congestion control protocol exploits the unique communication pattern in training to ensure large in-network aggregation transfers do not negatively impact short latency-sensitive flows. To evaluate the feasibility of PANAMA, we build an FPGA-based prototype with 10 Gbps transceivers and show that our hardware datapath achieves line-rate aggregation. Our large-scale simulations demonstrate that PANAMA improves the mean and 99%-tile completion time of latency-sensitive short flows by a factor of 2–4.5 while reducing the average training time of large jobs by a factor of 1.25.
Andrei Ivanov · Nikoli Dryden · Tal Ben-Nun · Shigang Li · Torsten Hoefler
Transformers are one of the most important machine learning workloads today. Training one is a very compute-intensive task, often taking days or weeks, and significant attention has been given to optimizing transformers. Despite this, existing implementations do not efficiently utilize GPUs. We find that data movement is the key bottleneck when training. Due to Amdahl's Law and massive improvements in compute performance, training has now become memory-bound. Further, existing frameworks use suboptimal data layouts. Using these insights, we present a recipe for globally optimizing data movement in transformers. We reduce data movement by up to 22.91% and overall achieve a 1.30x performance improvement over state-of-the-art frameworks when training a BERT encoder layer and 1.19x for the entire BERT. Our approach is applicable more broadly to optimizing deep neural networks, and offers insight into how to tackle emerging performance bottlenecks.
Giulio Zhou · Martin Maas
Storage services in data centers continuously make decisions, such as for cache admission, prefetching, and block allocation. These decisions are typically driven by heuristics based on statistical properties like temporal locality or common file sizes. The quality of decisions can be improved through application-level information such as the database operation a request belongs to. While such features can be exploited through application hints (e.g., explicit prefetches), this process requires manual work and is thus only viable for the most tuned workloads.
In this work, we show how to leverage application-level information automatically, by building on distributed traces that are already available in warehouse-scale computers. As these traces are used for diagnostics and accounting, they contain information about requests, including those to storage services. However, this information is mostly unstructured (e.g., arbitrary text) and thus difficult to use. We demonstrate how to do so automatically using machine learning, by applying ideas from natural language processing.
We show that different storage-related decisions can be learned from distributed traces, using models ranging from simple clustering techniques to neural networks. Instead of designing specific models for different storage-related tasks, we show that the same models can be used as building blocks for different tasks. Our models improve prediction accuracy by 11-33% over non-ML baselines, which translates to significantly improving the hit rate of a caching task, as well as improvements to an SSD/HDD tiering task, on production data center storage traces.