

Session

Quantization and Compression 2

Mission B4 & B10

Wed 15 May 13:30 - 13:50 PDT

JIT-Q: Just-in-time Quantization with Processing-In-Memory for Efficient ML Training

Mohamed Ibrahim · Shaizeen Aga · Ada Li · Suchita Pati · Mahzabeen Islam

Data format innovations have been critical for machine learning (ML) scaling, which in turn fuels ground-breaking ML capabilities. However, even in the presence of low-precision formats, model weights are often stored in both high-precision and low-precision during training. Furthermore, with emerging directional data formats (e.g., MX9, MX6, etc.), multiple low-precision weight copies can be required. To lower the memory capacity needs of weights, we explore just-in-time quantization (JIT-Q), where we store only high-precision weights in memory and generate low-precision weights only when needed. To perform JIT-Q efficiently, in this work we evaluate emerging processing-in-memory (PIM) technology to execute quantization. With PIM, we can offload quantization to in-memory compute units, enabling quantization to be performed without incurring costly data movement while allowing it to run concurrently with accelerator computation. Our proposed PIM-offloaded quantization keeps up with GPU compute and delivers considerable capacity savings (up to 24%) at marginal throughput loss (up to 2.4%). These memory capacity savings can unlock several benefits, such as fitting larger models in the same system, reducing model parallelism requirements, and improving overall ML training efficiency.
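To make the idea concrete, below is a minimal Python sketch of just-in-time quantization in software: only the high-precision master weights stay resident, and a transient low-precision copy is produced immediately before use. The symmetric per-tensor int8 scheme and the function names are illustrative assumptions; the paper's approach offloads this step to PIM compute units and targets formats such as MX9/MX6, which this sketch does not model.

```python
# Illustrative JIT quantization sketch (not the paper's PIM implementation):
# keep only the high-precision master weights resident and materialize a
# low-precision copy on demand, right before it is consumed.
import numpy as np

def jit_quantize_int8(w_fp32: np.ndarray):
    """Produce a transient int8 copy of the master weights plus its scale."""
    scale = np.abs(w_fp32).max() / 127.0 + 1e-12          # per-tensor symmetric scale
    w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)
    return w_int8, scale

def forward(x: np.ndarray, w_fp32: np.ndarray) -> np.ndarray:
    # The low-precision weights live only for the duration of this call,
    # so no second persistent weight copy occupies memory.
    w_q, s = jit_quantize_int8(w_fp32)
    return (x @ w_q.astype(np.float32)) * s

master_w = np.random.randn(256, 256).astype(np.float32)   # high-precision copy only
y = forward(np.random.randn(8, 256).astype(np.float32), master_w)
```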

Wed 15 May 13:50 - 14:10 PDT

Torch2Chip: An End-to-end Customizable Deep Neural Network Compression and Deployment Toolkit for Prototype Hardware Accelerator Design

Jian Meng · Yuan Liao · Anupreetham Anupreetham · Ahmed Hasssan · Shixing Yu · Han-sok Suh · Xiaofeng Hu · Jae-sun Seo

Deep neural network (DNN) compression (e.g., quantization, pruning) has been widely investigated in various deep learning tasks (e.g., vision and language). The development of model compression is continuously motivated by the evolution of various neural network accelerator designs with ASIC or FPGA. On the algorithm side, the ultimate goal of quantization or pruning is accelerating the expensive DNN computations on low-power hardware. However, such a "design-and-deploy" workflow faces under-explored challenges in the current hardware-algorithm co-design community due to some unavoidable flaws. First, although the state-of-the-art quantization algorithm can achieve ultra-low precision with negligible degradation of accuracy, the latest deep learning framework (e.g., PyTorch) can only support a non-customizable 8-bit precision, data format, and parameter extraction workflow for CNNs. Secondly, the ultimate goal of quantization is enabling computation with low-precision data (e.g., 4-bit integer). However, current SoTA algorithms treat the quantized integer as an intermediate result, while the final output of the quantizer is the "discretized" floating-point values, ignoring the practical needs and adding additional workload to hardware designers for integer parameter extraction and layer fusion. Finally, the compression toolkits designed by industry are constrained to their in-house products or a handful of algorithms. The limited degree of freedom in current toolkits and the under-explored customization hinder prototype ASIC- or FPGA-based accelerator design. To resolve these challenges, we propose Torch2Chip, an open-sourced, fully customizable, and high-performance toolkit that supports user-designed compression algorithms followed by automatic model fusion and parameter extraction. Torch2Chip incorporates a hierarchical design workflow, and the user-customized compression algorithm is directly packed into the deployment-ready format for prototype chip verification with either CNNs or vision transformers (ViT). Furthermore, Torch2Chip covers a wide range of training methods to achieve high performance, from basic supervised learning to state-of-the-art (SoTA) lightweight self-supervised learning (SSL). The Torch2Chip toolkit and source codes will be released soon.
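As a rough illustration of the deployment gap the abstract describes, the sketch below contrasts a conventional "fake" quantizer, whose output is still discretized floating-point values, with an export path that hands the underlying integers and scale to a hardware prototype. All function names are hypothetical and do not reflect Torch2Chip's actual API.

```python
# Hedged sketch of the gap between training-time "fake" quantization and the
# integer parameters a prototype ASIC/FPGA accelerator actually consumes.
import numpy as np

def fake_quantize(w: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Typical training-time quantizer: output is still floating point."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale   # "discretized" floats

def export_for_hardware(w: np.ndarray, n_bits: int = 4):
    """Deployment-oriented view: expose the integers and scale for layer fusion."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    w_int = np.round(w / scale).clip(-qmax, qmax).astype(np.int8)
    return w_int, scale

w = np.random.randn(64, 64).astype(np.float32)
w_fq = fake_quantize(w)             # what many SoTA algorithms stop at
w_int, s = export_for_hardware(w)   # what the hardware designer needs
assert np.allclose(w_fq, w_int.astype(np.float32) * s)
```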

Wed 15 May 14:20 - 14:40 PDT

Schrodinger's FP: Training Neural Networks with Dynamic Floating-Point Containers

Milos Nikolic · Enrique Torres Sanchez · Jiahui Wang · Ali Hadi Zadeh · Mostafa Mahmoud · Ameer Abdelhadi · Kareem Ibrahim · Andreas Moshovos

The transfer of tensors from/to memory during neural network training dominates time and energy. To improve energy efficiency and performance, research has been exploring ways to use narrower data representations. So far, these attempts relied on user-directed trial-and-error to achieve convergence. We present methods that relieve users from this responsibility. Our methods dynamically adjust the size and format of the floating-point containers used for activations and weights during training, achieving adaptivity across three dimensions: i) which datatype to use, ii) on which tensor, and iii) how it changes over time. The different meanings and distributions of exponents and mantissas lead us to tailored approaches for each. We present two pairs of lossy methods to eliminate as many mantissa and exponent bits as possible without affecting accuracy. Quantum Mantissa and Quantum Exponent are machine learning compression methods that tap into the gradient descent algorithm to learn the minimal mantissa and exponent bitlengths on a per-layer granularity. They automatically learn that many tensors can use just 1 or 2 mantissa bits and 3 or 4 exponent bits. Overall, the two machine learning methods reduce the footprint by $4.73\times$. Alternatively, BitWave observes changes in the loss function during training to adjust mantissa and exponent bitlengths network-wide, yielding a $3.17\times$ reduction in footprint. Finally, we present an optional method, Gecko, to exploit the naturally emerging, lopsided exponent distribution to losslessly compress the resulting exponents from Quantum Exponent or BitWave and, on average, improve compression rates to $5.61\times$ and $4.53\times$, respectively.
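The sketch below illustrates the basic container-shrinking operation underlying these methods: keeping only the top k of fp32's 23 mantissa bits. The gradient-based search over per-layer bitlengths (Quantum Mantissa/Quantum Exponent) and the loss-driven network-wide adjustment (BitWave) are not shown; the function is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch of mantissa truncation for a given bitlength k; the methods
# above learn or adapt k per layer or network-wide during training.
import numpy as np

def truncate_mantissa(x: np.ndarray, k: int) -> np.ndarray:
    """Keep only the top k of fp32's 23 mantissa bits (sign/exponent untouched)."""
    assert 0 <= k <= 23
    drop = 23 - k
    mask = np.uint32((0xFFFFFFFF >> drop) << drop)    # clears the low `drop` bits
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & mask).view(np.float32)

x = np.random.randn(4).astype(np.float32)
print(truncate_mantissa(x, 2))   # values representable with only 2 mantissa bits
```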

Wed 15 May 14:40 - 15:00 PDT

Efficient Post-training Quantization with FP8 Formats

Haihao Shen · Naveen Mellempudi · Xin He · Qun Gao · Chang Wang · Mengni Wang

Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Towards this goal, we study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification, generation, and segmentation. We examine three different FP8 representations (E5M2, E4M3, and E3M4) to study the effects of varying degrees of trade-off between dynamic range and precision on model accuracy. Based on our extensive study, we developed a quantization workflow that generalizes across different network architectures. Our empirical results show that FP8 formats outperform INT8 in multiple aspects, including workload coverage (92.64% vs. 65.87%), model accuracy, and suitability for a broader range of operations. Furthermore, our findings suggest that E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks.
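A simplified sketch of the range-versus-precision trade-off among the three FP8 variants follows: more exponent bits widen dynamic range, while more mantissa bits tighten precision. This round-to-nearest model ignores subnormals and special-value encodings (which the real E4M3/E5M2 definitions handle differently) and is not the paper's quantization workflow; the function name is illustrative.

```python
# Rough model of casting fp32 values onto an ExMy grid to compare FP8 variants.
import numpy as np

FP8_FORMATS = {"E5M2": (5, 2), "E4M3": (4, 3), "E3M4": (3, 4)}  # (exp bits, mantissa bits)

def cast_to_fp8(x: np.ndarray, exp_bits: int, man_bits: int) -> np.ndarray:
    """Round-to-nearest into an ExMy grid; subnormals and NaN/Inf codes ignored."""
    bias = 2 ** (exp_bits - 1) - 1
    # Largest normal magnitude under an IEEE-like reading of the format.
    max_val = (2.0 - 2.0 ** (-man_bits)) * 2.0 ** (2 ** exp_bits - 2 - bias)
    x = np.clip(x, -max_val, max_val)
    mant, exp = np.frexp(x)                                  # x = mant * 2**exp
    mant = np.round(mant * 2.0 ** (man_bits + 1)) / 2.0 ** (man_bits + 1)
    return np.clip(np.ldexp(mant, exp), -max_val, max_val)

x = np.random.randn(5).astype(np.float32) * 10
for name, (e, m) in FP8_FORMATS.items():
    print(name, cast_to_fp8(x, e, m))   # wider exponent -> range, wider mantissa -> precision
```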