Registration Desk: Registration Check-in Desk Wed 15 May 07:00 a.m.
Poster: LLM 2 Wed 15 May 09:00 a.m.
Abstract
As the Large Language Model (LLM) becomes increasingly important in various domains, the performance of LLM inference is crucial to massive LLM applications. However, the following challenges still remain unsolved in accelerating LLM inference: (1) Synchronized partial softmax update. The softmax operation requires a synchronized update operation among the partial softmax results, leading to ∼20% overhead for the attention computation in LLMs. (2) Under-utilized computation of flat GEMM. The shape of matrices performing GEMM in LLM inference is flat, leading to under-utilized computation and 50% performance loss after padding zeros in previous designs (e.g., cuBLAS, CUTLASS, etc.). (3) Performance loss due to static dataflow. Kernel performance in LLM depends on varied input data features, hardware configurations, etc. A single and static dataflow may lead to 50.25% performance loss for GEMMs of different shapes in LLM inference. We present FlashDecoding++, a fast LLM inference engine supporting mainstream LLMs and hardware backends. To tackle the above challenges, FlashDecoding++ creatively proposes: (1) Asynchronized softmax with unified max value. FlashDecoding++ introduces a unified max value technique for different partial softmax computations to avoid synchronization. Based on this, the fine-grained pipelining is proposed, leading to 1.05× and 1.14× speedups for the prefill and decoding stages in LLM …
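To make the "unified max value" idea concrete, here is a minimal pure-Python sketch, not the FlashDecoding++ kernel itself. It relies only on the identity that softmax is unchanged when the same constant is subtracted from every input. The function names and the constant `phi` are assumptions made for illustration; the abstract does not say how the unified value is chosen.

```python
import math

def softmax_partial_sync(chunks):
    """Chunked softmax with per-chunk maxima: every new chunk may force a
    rescale of the running sum, which is the synchronized update."""
    running_max = -math.inf
    running_sum = 0.0
    for chunk in chunks:
        new_max = max(running_max, max(chunk))
        # Rescale what was accumulated so far to the new maximum.
        running_sum *= math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in chunk)
        running_max = new_max
    return [math.exp(x - running_max) / running_sum
            for chunk in chunks for x in chunk]

def softmax_partial_unified(chunks, phi):
    """Chunked softmax where every chunk subtracts the same constant phi
    (hypothetical parameter), so partial sums are independent and simply add."""
    partial_sums = [sum(math.exp(x - phi) for x in chunk) for chunk in chunks]
    total = sum(partial_sums)
    return [math.exp(x - phi) / total for chunk in chunks for x in chunk]

chunks = [[1.0, 2.0], [3.0, 0.5]]
sync = softmax_partial_sync(chunks)
unified = softmax_partial_unified(chunks, phi=2.0)
print(all(math.isclose(a, b) for a, b in zip(sync, unified)))
```

The sketch ignores numerical range: if `phi` is far from the actual inputs, `exp(x - phi)` can overflow or underflow, so a real implementation would need some fallback for such cases.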
Abstract
We present Prompt Cache, an approach for accelerating inference for large language models (LLMs) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt templates, and documents provided for context. Our key insight is that by precomputing and storing the attention states of these frequently occurring text segments on the inference server, we can efficiently reuse them when these segments appear in user prompts. Prompt Cache employs a schema to explicitly define such reusable text segments, called prompt modules. The schema ensures positional accuracy during attention state reuse and provides users with an interface to access cached states in their prompt. Using a prototype implementation, we evaluate Prompt Cache across several LLMs. We show that Prompt Cache significantly reduces latency in time-to-first-token, especially for longer prompts such as document-based question answering and recommendations. The improvements range from 8x for GPU-based inference to 60x for CPU-based inference, all while maintaining output accuracy and without the need for model parameter modifications.
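The reuse pattern described in this abstract can be sketched as a toy cache. This is an illustration under stated assumptions, not the Prompt Cache implementation: `PromptModuleCache`, its schema layout, and the `(token, position)` tuples standing in for real attention states are all hypothetical.

```python
from typing import Dict, List, Tuple

State = Tuple[str, int]  # (token, absolute position): stand-in for attention states

class PromptModuleCache:
    def __init__(self, schema: Dict[str, Tuple[int, List[str]]]):
        # schema maps module name -> (start position, tokens); fixing the start
        # position per module is what keeps reused states positionally consistent.
        self.schema = schema
        self.cache: Dict[str, List[State]] = {}
        self.encoded_tokens = 0  # how many tokens were actually encoded

    def _encode(self, tokens: List[str], start: int) -> List[State]:
        self.encoded_tokens += len(tokens)
        return [(tok, start + i) for i, tok in enumerate(tokens)]

    def precompute(self) -> None:
        """Encode every prompt module once, ahead of time."""
        for name, (start, tokens) in self.schema.items():
            self.cache[name] = self._encode(tokens, start)

    def build(self, module_names: List[str], user_tokens: List[str]) -> List[State]:
        """Assemble a prompt: reuse cached module states, encode only new text."""
        states: List[State] = []
        for name in module_names:
            states.extend(self.cache[name])
        next_pos = max((pos for _, pos in states), default=-1) + 1
        states.extend(self._encode(user_tokens, next_pos))
        return states

schema = {"system": (0, ["You", "are", "helpful"]), "doc": (3, ["Doc", "text"])}
pc = PromptModuleCache(schema)
pc.precompute()
states = pc.build(["system", "doc"], ["Why", "?"])
print(pc.encoded_tokens)  # 5 tokens at precompute time + 2 new user tokens = 7
```

Each later call to `build` pays only for the user-specific tokens, which is the source of the time-to-first-token savings the abstract reports.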
Abstract
Invited Talk: J. Zico Kolter
AI Robustness and Security in the Age of LLMs
Bio :
Poster: Quantization and Compression 2 Wed 15 May 01:30 p.m.
Abstract
Data format innovations have been critical for machine learning (ML) scaling, which in turn fuels ground-breaking ML capabilities. However, even in the presence of low-precision formats, model weights are often stored in both high-precision and low-precision during training. Furthermore, with emerging directional data formats (e.g., MX9, MX6, etc.) multiple low-precision weight copies can be required. To lower memory capacity needs of weights, we explore just-in-time quantization (JIT-Q) where we only store high-precision weights in memory and generate low-precision weights only when needed. To perform JIT-Q efficiently, in this work, we evaluate emerging processing-in-memory (PIM) technology to execute quantization. With PIM, we can offload quantization to in-memory compute units, enabling quantization to be performed without incurring costly data movement while allowing quantization to be concurrent with accelerator computation. Our proposed PIM-offloaded quantization keeps up with GPU compute and delivers considerable capacity savings (up to 24%) at marginal throughput loss (up to 2.4%). These memory capacity savings can unlock several benefits such as fitting a larger model in the same system, reducing model parallelism requirements, and improving overall ML training efficiency.
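The storage pattern behind JIT-Q (keep only high-precision weights, derive low-precision copies on demand) can be sketched as follows. This is a minimal illustration, assuming symmetric per-tensor INT8 quantization, which the abstract does not specify; `quantize_int8` and `JitQuantizedLayer` are hypothetical names, and the PIM offload itself is not modeled.

```python
from typing import List, Tuple

def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor INT8 quantization (an assumed scheme)."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

class JitQuantizedLayer:
    def __init__(self, weights: List[float]):
        # Only the high-precision copy stays resident in memory.
        self.weights = list(weights)

    def low_precision(self) -> Tuple[List[int], float]:
        # The low-precision copy is generated when needed and not stored.
        return quantize_int8(self.weights)

layer = JitQuantizedLayer([0.8, -1.0, 0.1])
q, scale = layer.low_precision()
reconstructed = [v * scale for v in q]
print(q, scale, reconstructed)
```

The trade-off is visible even in the toy: memory holds one copy instead of two, at the cost of rerunning the quantization each time the low-precision weights are needed.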
Abstract
Deep neural network (DNN) compression (e.g., quantization, pruning) has been widely investigated in various deep learning tasks (e.g., vision and language). The development of model compression is continuously motivated by the evolution of various neural network accelerator designs with ASIC or FPGA. On the algorithm side, the ultimate goal of quantization or pruning is accelerating the expensive DNN computations on low-power hardware. However, such a “design-and-deploy” workflow faces under-explored challenges in the current hardware-algorithm co-design community due to some unavoidable flaws. First, although the state-of-the-art quantization algorithm can achieve ultra-low precision with negligible degradation of accuracy, the latest deep learning framework (e.g., PyTorch) can only support non-customizable 8-bit precision, data format, and parameter extraction workflow for CNN. Secondly, the ultimate goal of quantization is enabling the computation with low-precision data (e.g., 4-bit integer). However, the current SoTA algorithm treats the quantized integer as an intermediate result, while the final output of the quantizer is the “discretized” floating-point values, ignoring the practical needs and adding additional workload to hardware designers for integer parameter extraction and layer fusion. Finally, the compression toolkits designed by the industry are constrained to their in-house product or a handful of algorithms. The limited degree of freedom in the current toolkit and the under-explored customization hinder the prototype ASIC or FPGA-based …
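The gap this abstract points to, between a quantizer that emits "discretized" floating-point values and hardware that needs the integers themselves, can be shown in a few lines. This is a generic sketch, assuming a signed 4-bit range and a fixed scale; the function names are hypothetical and do not come from any toolkit.

```python
def integer_quantize(x: float, scale: float, bits: int = 4) -> int:
    """Return the integer code that low-precision hardware would consume."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return max(qmin, min(qmax, round(x / scale)))

def fake_quantize(x: float, scale: float, bits: int = 4) -> float:
    """Return the 'discretized' float: the integer is only an intermediate."""
    return integer_quantize(x, scale, bits) * scale

scale = 0.25
for x in (0.8, 5.0, -3.0):
    print(x, integer_quantize(x, scale), fake_quantize(x, scale))
```

With `scale = 0.25`, the input `0.8` maps to integer `3` and to the discretized float `0.75`. A training pipeline that only exposes the second value leaves the integer (and the scale) to be recovered later, which is the extra extraction work the abstract describes.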
Abstract
Abstract
Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Towards this goal, we study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification, generation, and segmentation. We examine three different FP8 representations (E5M2, E4M3, and E3M4) to study the effects of varying degrees of trade-off between dynamic range and precision on model accuracy. Based on our extensive study, we developed a quantization workflow that generalizes across different network architectures. Our empirical results show that FP8 formats outperform INT8 in multiple aspects, including workload coverage (92.64% vs. 65.87%), model accuracy and suitability for a broader range of operations. Furthermore, our findings suggest that E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks.
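The range-versus-precision trade-off between the three formats can be worked out from the bit split alone. The sketch below assumes an IEEE-754-style convention in which the all-ones exponent is reserved for infinities and NaNs; this is an assumption, since practical FP8 variants often reclaim those encodings (the OCP E4M3 format, for example, reaches 448), so the exact maxima used in the study may differ. `fp8_properties` is a hypothetical helper.

```python
def fp8_properties(exp_bits: int, man_bits: int) -> dict:
    """Largest normal value, smallest normal value, and relative step size,
    assuming an IEEE-754-style layout with a reserved top exponent."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias       # top exponent code is reserved
    return {
        "max_normal": (2.0 - 2.0 ** -man_bits) * 2.0 ** max_exp,
        "min_normal": 2.0 ** (1 - bias),
        "epsilon": 2.0 ** -man_bits,           # spacing just above 1.0
    }

for name, (e, m) in {"E5M2": (5, 2), "E4M3": (4, 3), "E3M4": (3, 4)}.items():
    print(name, fp8_properties(e, m))
```

Under this convention E5M2 spans up to 57344 with a relative step of 0.25, while E3M4 tops out at 15.5 but resolves steps of 0.0625: each bit moved from exponent to mantissa halves the step size and sharply shrinks the range.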
Poster: Federated Learning Wed 15 May 03:30 p.m.
Abstract
Abstract
Federated Learning (FL) is a practical approach to train deep learning models collaboratively across user-end devices, protecting user privacy by retaining raw data on-device. In FL, participating user-end devices are highly fragmented in terms of hardware and software configurations. Such fragmentation introduces a new type of data heterogeneity in FL, namely system-induced data heterogeneity, as each device generates distinct data depending on its hardware and software configurations. In this paper, we first characterize the impact of system-induced data heterogeneity on FL model performance. We collect a dataset using heterogeneous devices with variations across vendors and performance tiers. By using this dataset, we demonstrate that system-induced data heterogeneity negatively impacts accuracy and exacerbates fairness and domain generalization problems in FL. To address these challenges, we propose HeteroSwitch, which adaptively adopts generalization techniques (i.e., ISP transformation and SWAD) depending on the level of bias caused by varying HW and SW configurations. In our evaluation with a realistic FL dataset (FLAIR), HeteroSwitch reduces the variance of averaged precision by 6.3% across device types.
Abstract
Federated Learning (FL) typically involves a large-scale, distributed system with individual user devices/servers training models locally and then aggregating their model updates on a trusted central server. Existing systems for FL often use an always-on server for model aggregation, which can be inefficient in terms of resource utilization. They also may be inelastic in their resource management. This is particularly exacerbated when aggregating model updates at scale in a highly dynamic environment with varying numbers of heterogeneous user devices/servers. We present LIFL, a lightweight and elastic serverless cloud platform with fine-grained resource management for efficient FL aggregation at scale. LIFL is enhanced by a streamlined, event-driven serverless design that eliminates the individual, heavyweight message broker and replaces inefficient container-based sidecars with lightweight eBPF-based proxies. We leverage shared memory processing to achieve high-performance communication for hierarchical aggregation, which is commonly adopted to speed up FL aggregation at scale. We further introduce the locality-aware placement in LIFL to maximize the benefits of shared memory processing. LIFL precisely scales and carefully reuses the resources for hierarchical aggregation to achieve the highest degree of parallelism, while minimizing aggregation time and resource consumption. Our preliminary experimental results show that LIFL achieves significant improvement in resource …
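Hierarchical aggregation, which this abstract builds on, works because a sample-weighted average can be computed from partial sums. The sketch below assumes a FedAvg-style weighted mean as the aggregation rule; the abstract does not name the rule, and both function names are hypothetical. It shows only the arithmetic, not LIFL's serverless or shared-memory machinery.

```python
from typing import List, Tuple

Update = Tuple[List[float], int]  # (model update vector, number of local samples)

def partial_aggregate(updates: List[Update]) -> Tuple[List[float], int]:
    """Leaf aggregator: return the sample-weighted sum and the sample count."""
    acc = [0.0] * len(updates[0][0])
    total = 0
    for vec, n in updates:
        for i, v in enumerate(vec):
            acc[i] += v * n
        total += n
    return acc, total

def merge_partials(partials: List[Tuple[List[float], int]]) -> List[float]:
    """Top aggregator: combine weighted sums, then normalize once."""
    acc = [0.0] * len(partials[0][0])
    total = 0
    for vec, n in partials:
        for i, v in enumerate(vec):
            acc[i] += v
        total += n
    return [a / total for a in acc]

updates = [([1.0, 2.0], 1), ([3.0, 4.0], 3), ([5.0, 6.0], 2), ([7.0, 8.0], 2)]
flat = merge_partials([partial_aggregate(updates)])
tree = merge_partials([partial_aggregate(updates[:2]), partial_aggregate(updates[2:])])
print(flat, tree)  # both are [4.25, 5.25]
```

Because the leaf aggregators are independent, they can run in parallel, and placing a leaf near the updates it consumes is where locality-aware placement would matter.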
Poster: Parallel and Distributed 2 Wed 15 May 04:30 p.m.
Abstract
The Mixture-of-Expert (MoE) technique plays a crucial role in expanding the size of DNN model parameters, but it grapples with the challenge of prolonged all-to-all communication latency during training. Existing methods attempt to mitigate this issue by overlapping all-to-all with expert computation. However, this approach often falls short of achieving sufficient overlap, thereby limiting potential performance improvements. In our study, we extend the scope of this challenge by considering overlap at the broader training graph level. During the forward pass, we enable non-MoE computations to overlap with all-to-all through careful partitioning and pipelining. In the backward pass, we achieve overlap with all-to-all by scheduling gradient weight computations. We implement these techniques in Lancet, an optimization system for DNN compilers designed to automatically enhance MoE model training. Our extensive evaluation reveals that Lancet significantly reduces the time devoted to non-overlapping communication, by as much as 77%. Moreover, it achieves a notable end-to-end speedup of up to 1.3 times when compared to the state-of-the-art solutions.
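Why partitioning and pipelining reduce exposed communication can be seen with a back-of-the-envelope model. This is a toy two-stage pipeline formula, not Lancet's scheduler or cost model: it assumes the communication and the dependent computation split into equal chunks, that a chunk's computation can start as soon as its communication finishes, and that the two resources do not interfere. All names and numbers are illustrative.

```python
def sequential_time(comm: float, comp: float) -> float:
    """No overlap: all communication, then all computation."""
    return comm + comp

def pipelined_time(comm: float, comp: float, chunks: int) -> float:
    """Two-stage pipeline over equal chunks: the first chunk's communication
    and the last chunk's computation are exposed; in between, the slower
    stage sets the pace."""
    c, p = comm / chunks, comp / chunks
    return c + (chunks - 1) * max(c, p) + p

comm, comp = 4.0, 4.0
for k in (1, 2, 4):
    total = pipelined_time(comm, comp, k)
    print(k, total, total - comp)  # chunks, total time, non-overlapped communication
```

With these illustrative numbers, four chunks bring the total from 8.0 to 5.0 time units and the non-overlapped communication from 4.0 to 1.0. The model also shows the limit of overlapping only within a fixed pair of operations: if `comm` far exceeds `comp`, most communication stays exposed no matter how finely it is chunked, which is the motivation for finding more computation to overlap with.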
Abstract
We study a mismatch between the deep learning recommendation models’ flat architecture, common distributed training paradigm and hierarchical data center topology. To address the associated inefficiencies, we propose Disaggregated Multi-Tower (DMT), a modeling technique that consists of (1) semantic-preserving tower transform (SPTT), a novel training paradigm that decomposes the monolithic global embedding lookup process into disjoint towers to exploit data center locality; (2) Tower Module (TM), a synergistic dense component attached to each tower to reduce model complexity and communication volume through hierarchical feature interaction; and (3) Tower Partitioner (TP), a feature partitioner to systematically create towers with meaningful feature interactions and load balanced assignments to preserve model quality and training throughput via learned embeddings. We show that DMT can achieve up to 1.9× speedup compared to the state-of-the-art baselines without losing accuracy across multiple generations of hardware at large data center scales.
Abstract
In recent times, the emergence of Large Language Models (LLMs) has resulted in increasingly larger model sizes, posing challenges for inference on low-resource devices. Prior approaches have explored offloading to facilitate low-memory inference but often suffer from poor efficiency due to I/O bottlenecks. To achieve low-latency LLM inference on resource-constrained devices, we introduce HeteGen, a novel approach that presents a principled framework for heterogeneous parallel computing using CPUs and GPUs. Based on this framework, HeteGen further employs heterogeneous parallel computing and asynchronous overlap for LLMs to mitigate I/O bottlenecks. Our experiments demonstrate a substantial improvement in inference speed, surpassing state-of-the-art methods by up to 317%.