MLSys 2024 Wednesday 05/15

FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics

Poster

Ke Hong · Guohao Dai · Jiaming Xu · Qiuli Mao · Xiuhong Li · Jun Liu · kangdi chen · Yuhan Dong · Yu Wang

[ Poster Position Number ]

Abstract

As the Large Language Model (LLM) becomes increasingly important in various domains, the performance of LLM inference is crucial to massive LLM applications. However, the following challenges still remain unsolved in accelerating LLM inference: (1) Synchronized partial softmax update. The softmax operation requires a synchronized update operation among each partial softmax result, leading to ∼20% overheads for the attention computation in LLMs. (2) Under-utilized computation of flat GEMM. The shape of matrices performing GEMM in LLM inference is flat, leading to under-utilized computation and 50% performance loss after padding zeros in previous designs (e.g., cuBLAS, CUTLASS, etc.). (3) Performance loss to static dataflow. Kernel performance in LLM depends on varied input data features, hardware configurations, etc. A single and static dataflow may lead to 50.25% performance loss for GEMMs of different shapes in LLM inference.We present FlashDecoding++, a fast LLM inference engine supporting mainstream LLMs and hardware back- ends. To tackle the above challenges, FlashDecoding++ creatively proposes: (1) Asynchronized softmax with unified max value. FlashDecoding++ introduces a unified max value technique for different partial softmax computations to avoid synchronization. Based on this, the fine-grained pipelining is proposed, leading to 1.05× and 1.14× for the prefill and decoding stage in LLM …

Prompt Cache: Modular Attention Reuse for Low-Latency Inference

Poster

In Gim · Guojun Chen · Seung-seob Lee · Nikhil Sarda · Anurag Khandelwal · Lin Zhong

[ Poster Position Number ]

Abstract

We present Prompt Cache, an approach for accelerating inference for large language models (LLM) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt templates, and documents provided for context.Our key insight is that by precomputing and storing the attention states of these frequently occurring text segments on the inference server, we can efficiently reuse them when these segments appear in user prompts. Prompt Cache employs a schema to explicitly define such reusable text segments, called prompt modules. The schema ensures positional accuracy during attention state reuse and provides users with an interface to access cached states in their prompt.Using a prototype implementation, we evaluate Prompt Cache across several LLMs. We show that Prompt Cache significantly reduce latency in time-to-first-token, especially for longer prompts such as document-based question answering and recommendations. The improvements range from 8x for GPU-based inference to 60x for CPU-based inference, all while maintaining output accuracy and without the need for model parameter modifications.

Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference

Poster

Muhammad Adnan · Akhil Arunkumar · Gaurav Jain · Prashant Nair · Ilya Soloveychik · Purushotham Kamath

[ Poster Position Number ]

Abstract

Transformers have emerged as the standard architecture for Large Language Models (LLMs). In generativelanguage models, the inference process involves two main phases: prompt processing and token generation. Tokengeneration, which constitutes most of the computational load, primarily entails vector-matrix multiplicationsand interactions with the Key-Value ($\mathsf{KV}$) Cache. This phase is memory bandwidth-bound due to the overheadof transferring weights and KV cache values from memory to the computing units, which involves relativelylow compute intensity. This memory bottleneck becomes particularly prominent in applications that demandlong-context and extensive text generation, both of which are increasingly crucial for LLMs.This paper introduces an innovative approach to mitigate the challenges associated with KV cache size and memorybandwidth utilization, termed "$\mathsf{Keyformer}$". $\mathsf{Keyformer}$ capitalizes on the observation that during generativeinference, approximately 90% of the attention weight is concentrated on a select subset of tokens, which actas "key" tokens. $\mathsf{Keyformer}$’s key tokens identification takes into account the discarded tokens by utilizing anovel score function. By retaining only these "key" tokens in the $\mathsf{KV cache}$, both the $\mathsf{KV cache}$ size and memorybandwidth usage are significantly reduced while maintaining the model’s accuracy. We evaluate $\mathsf{Keyformer}$’seffectiveness using three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ various positionalembedding algorithms. Our assessment covers a range …

JIT-Q: Just-in-time Quantization with Processing-In-Memory for Efficient ML Training

Poster

Mohamed Ibrahim · Shaizeen Aga · Ada Li · Suchita Pati · Mahzabeen Islam

[ Poster Position Number ]

Abstract

Data format innovations have been critical for machine learning (ML) scaling, which in turn fuels ground-breaking ML capabilities. However, even in the presence of low-precision formats, model weights are often stored in both high-precision and low-precision during training. Furthermore, with emerging directional data-formats (e.g., MX9, MX6, etc.) multiple low-precision weight copies can be required. To lower memory capacity needs of weights, we explore just-in-time quantization (JIT-Q) where we only store high-precision weights in memory and generate low-precision weights only when needed. To perform JIT-Q efficiently, in this work, we evaluate emerging processing-in-memory (PIM) technology to execute quantization. With PIM, we can offload quantization to in-memory compute units enabling quantization to be performed without incurring costly data-movement while allowing quantization to be concurrent with accelerator computation. Our proposed PIM-offloaded quantization keeps up with GPU compute and delivers considerable capacity savings (up to 24\%) at marginal throughput loss (up to 2.4\%). Said memory capacity savings can unlock several benefits such as fitting larger model in the same system, reducing model parallelism requirement, and improving overall ML training efficiency.

Torch2Chip: An End-to-end Customizable Deep Neural Network Compression and Deployment Toolkit for Prototype Hardware Accelerator Design

Poster

Jian Meng · Yuan Liao · Anupreetham Anupreetham · Ahmed Hasssan · Shixing Yu · Han-sok Suh · Xiaofeng Hu · Jae-sun Seo

[ Poster Position Number ]

Abstract

Deep neural network (DNN) compression (e.g., quantization, pruning) has been widely investigated in variousdeep learning tasks (e.g., vision and language). The development of model compression is continuously motivatedby the evolution of various neural network accelerator designs with ASIC or FPGA. On the algorithm side, theultimate goal of quantization or pruning is accelerating the expensive DNN computations on low-power hardware.However, such a “design-and-deploy” workflow faces under-explored challenges in the current hardware-algorithmco-design community due to some unavoidable flaws. First, although the state-of-the-art quantization algorithmcan achieve ultra-low precision with negligible degradation of accuracy, the latest deep learning framework (e.g.,PyTorch) can only support non-customizable 8-bit precision, data format, and parameter extraction workflow forCNN. Secondly, the ultimate goal of quantization is enabling the computation with low-precision data (e.g., 4-bitinteger). However, the current SoTA algorithm treats the quantized integer as an intermediate result, while the finaloutput of the quantizer is the “discretized” floating-point values, ignoring the practical needs and adding additionalworkload to hardware designers for integer parameter extraction and layer fusion. Finally, the compressiontoolkits designed by the industry are constrained to their in-house product or a handful of algorithms. The limiteddegree of freedom in the current toolkit and the under-explored customization hinder the prototype ASIC orFPGA-based …

Schrodinger's FP Training Neural Networks with Dynamic Floating-Point Containers

Poster

Milos Nikolic · Enrique Torres Sanchez · Jiahui Wang · Ali Hadi Zadeh · Mostafa Mahmoud · Ameer Abdelhadi · Kareem Ibrahim · Andreas Moshovos

[ Poster Position Number ]

Abstract

The transfer of tensors from/to memory during neural network training dominates time and energy. To improve energy efficiency and performance, research has been exploring ways to use narrower data representations. So far, these attempts relied on user-directed trial-and-error to achieve convergence. We present methods that relieve users from this responsibility. Our methods dynamically adjust the size and format of the floating-point containers used for activations and weights during training, achieving adaptivity across three dimensions: i) which datatype to use, ii) on which tensor, and iii) how it changes over time. The different meanings and distributions of exponent and mantissas lead us to tailored approaches for each. We present two lossy pairs of methods to eliminate as many mantissa and exponent bits as possible without affecting accuracy. Quantum Mantissa and Quantum Exponent are machine learning compression methods that tap into the gradient descent algorithm to learn the minimal mantissa and exponent bitlengths on a per-layer granularity. They automatically learn that many tensors can use just 1 or 2 mantissa bits and 3 or 4 exponent bits. Overall, the two machine learning methods reduce the footprint by $4.73\times$. Alternatively, BitWave observes changes in the loss function during training to adjust mantissa and …

Efficient Post-training Quantization with FP8 Formats

Poster

Haihao Shen · Naveen Mellempudi · Xin He · Qun Gao · Chang Wang · Mengni Wang

[ Poster Position Number ]

Abstract

Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Towards this goal, we study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification, generation, and segmentation. We examine three different FP8 representations (E5M2, E4M3, and E3M4) to study the effects of varying degrees of trade-off between dynamic range and precision on model accuracy. Based on our extensive study, we developed a quantization workflow that generalizes across different network architectures. Our empirical results show that FP8 formats outperform INT8 in multiple aspects, including workload coverage (92.64\% vs. 65.87\%), model accuracy and suitability for a broader range of operations. Furthermore, our findings suggest that E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks.

FedTrans: Efficient Federated Learning via Multi-Model Transformation

Poster

Yuxuan Zhu · Jiachen Liu · Mosharaf Chowdhury · Fan Lai

[ Poster Position Number ]

Abstract

Federated learning (FL) aims to train machine learning (ML) models across potentially millions of edge client devices. Yet, training and customizing models for FL clients is notoriously challenging due to the heterogeneity of client data, device capabilities, and the massive scale of clients, making individualized model exploration prohibitively expensive. State-of-the-art FL solutions personalize a globally trained model or concurrently train multiple models, but they often incur suboptimal model accuracy and huge training costs. In this paper, we introduce FedTrans, a multi-model FL training framework that automatically produces and trains high-accuracy, hardware-compatible models for individual clients at scale.FedTrans begins with a basic global model, identifies accuracy bottlenecks in model architectures during training, and then employs model transformation to derive new models for heterogeneous clients on the fly. It judiciously assigns models to individual clients while performing soft aggregation on multi-model updates to minimize total training costs. Our evaluations using realistic settings show that FedTrans improves individual client model accuracy by 13\% while slashing training costs by 4$\times$ over state-of-the-art solutions.

HeteroSwitch: Characterizing and Taming System-Induced Data Heterogeneity in Federated Learning

Poster

Gyudong Kim · Mehdi Ghasemi · Soroush Heidari · Seungryong Kim · Young Geun Kim · Sarma Vrudhula · Carole-Jean Wu

[ Poster Position Number ]

Abstract

Federated Learning (FL) is a practical approach to train deep learning models collaboratively across user-end devices, protecting user privacy by retaining raw data on-device. In FL, participating user-end devices are highly fragmented in terms of hardware and software configurations. Such fragmentation introduces a new type of data heterogeneity in FL, namely system-induced data heterogeneity, as each device generates distinct data depending on its hardware and software configurations. In this paper, we first characterize the impact of system-induced data heterogeneity on FL model performance. We collect a dataset using heterogeneous devices with variations across vendors and performance tiers. By using this dataset, we demonstrate that system-induced data heterogeneity negatively impacts accuracy, and deteriorates fairness and domain generalization problems in FL. To address these challenges, we propose HeteroSwitch, which adaptively adopts generalization techniques (i.e., ISP transformation and SWAD) depending on the level of bias caused by varying HW and SW configurations. In our evaluation with a realistic FL dataset (FLAIR), HeteroSwitch reduces the variance of averaged precision by 6.3% across device types.

LIFL: A Lightweight, Event-driven Serverless Platform for Federated Learning

Poster

Shixiong Qi · K. K. Ramakrishnan · Myungjin Lee

[ Poster Position Number ]

Abstract

Federated Learning (FL) typically involves a large-scale, distributed system with individual user devices/servers training models locally and then aggregating their model updates on a trusted central server. Existing systems for FL often use an always-on server for model aggregation, which can be inefficient in terms of resource utilization. They also may be inelastic in their resource management. This is particularly exacerbated when aggregating model updates at scale in a highly dynamic environment with varying numbers of heterogeneous user devices/servers. We present LIFL, a lightweight and elastic serverless cloud platform with fine-grained resource management for efficient FL aggregation at scale. LIFL is enhanced by a streamlined, event-driven serverless design that eliminates the individual, heavyweight message broker and replaces inefficient container-based sidecars with lightweight eBPF-based proxies. We leverage shared memory processing to achieve high-performance communication for hierarchical aggregation, which is commonly adopted to speed up FL aggregation at scale. We further introduce the locality-aware placement in LIFL to maximize the benefits of shared memory processing. LIFL precisely scales and carefully reuses the resources for hierarchical aggregation to achieve the highest degree of parallelism, while minimizing aggregation time and resource consumption. Our preliminary experimental results show that LIFL achieves significant improvement in resource …

Lancet: Accelerating Mixture-of-Experts Training by Overlapping Weight Gradient Computation and All-to-All Communication

Poster

Chenyu Jiang · Ye Tian · Zhen Jia · Chuan Wu · Yida Wang · Shuai Zheng

[ Poster Position Number ]

Abstract

The Mixture-of-Expert (MoE) technique plays a crucial role in expanding the size of DNN model parameters, but it grapples with the challenge of prolonged all-to-all communication latency during training. Existing methods attempt to mitigate this issue by overlapping all-to-all with expert computation. However, this approach often falls short of achieving sufficient overlap, thereby limiting potential performance improvements. In our study, we extend the scope of this challenge by considering overlap at the broader training graph level. During the forward pass, we enable non-MoE computations to overlap with all-to-all through careful partitioning and pipelining. In the backward pass, we achieve overlap with all-to-all by scheduling gradient weight computations. We implement these techniques in Lancet, an optimization system for DNN compilers designed to automatically enhance MoE model training. Our extensive evaluation reveals that Lancet significantly reduces the time devoted to non-overlapping communication, by as much as 77%. Moreover, it achieves a notable end-to-end speedup of up to 1.3 times when compared to the state-of-the-art solutions.

Disaggregated Multi-Tower: Topology-aware Modeling Technique for Efficient Large Scale Recommendation

Poster

Liang Luo · Buyun Zhang · Michael Tsang · Yinbin Ma · Ching-Hsiang Chu · Yuxin Chen · Shen Li · Yuchen Hao · Yanli Zhao · Guna Lakshminarayanan · Ellie Wen · Jongsoo Park · Dheevatsa Mudigere · Maxim Naumov

[ Poster Position Number ]

Abstract

We study a mismatch between the deep learning recommendation models’ flat architecture, common distributedtraining paradigm and hierarchical data center topology. To address the associated inefficiencies, we proposeDisaggregated Multi-Tower (DMT), a modeling technique that consists of (1) semantic-preserving tower transform(SPTT), a novel training paradigm that decomposes the monolithic global embedding lookup process into disjointtowers to exploit data center locality; (2) Tower Module (TM), a synergistic dense component attached to eachtower to reduce model complexity and communication volume through hierarchical feature interaction; and (3)Tower Partitioner (TP), a feature partitioner to systematically create towers with meaningful feature interactionsand load balanced assignments to preserve model quality and training throughput via learned embeddings. Weshow that DMT can achieve up to 1.9× speedup compared to the state-of-the-art baselines without losing accuracyacross multiple generations of hardware at large data center scales.

HeteGen: Efficient Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices

Poster

ZHAO XUANLEI · Bin Jia · Haotian Zhou · Ziming Liu · Shenggan Cheng · Yang You

[ Poster Position Number ]

Abstract

In recent times, the emergence of Large Language Models (LLMs) has resulted in increasingly larger model size, posing challenges for inference on low-resource devices. Prior approaches have explored offloading to facilitate low-memory inference but often suffer from efficiency due to I/O bottlenecks. To achieve low-latency LLMs inference on resource-constrained devices, we introduce HeteGen, a novel approach that presents a principled framework for heterogeneous parallel computing using CPUs and GPUs. Based on this framework, HeteGen further employs heterogeneous parallel computing and asynchronous overlap for LLMs to mitigate I/O bottlenecks. Our experiments demonstrate a substantial improvement in inference speed, surpassing state-of-the-art methods by over 317\% at most.

Main Navigation

Registration Desk: Registration Check-in Desk Wed 15 May 07:00 a.m.

Poster: LLM 2 Wed 15 May 09:00 a.m.

Invited Talk: J. Zico Kolter

Poster: Quantization and Compression 2 Wed 15 May 01:30 p.m.

Poster: Federated Learning Wed 15 May 03:30 p.m.

Poster: Parallel and Distributed 2 Wed 15 May 04:30 p.m.