Session
Research Track Oral Presentation: Model Compression
Grand Ballroom 2
Moderator: Sanjeev Singh
CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training
Soroush Tabesh ⋅ Mher Safaryan ⋅ Andrei Panferov ⋅ Alexandra Volkova ⋅ Dan Alistarh
Despite significant work on low-bit quantization-aware training (QAT), there is still an accuracy gap between such techniques and native training. To address this, we introduce CAGE (Curvature-Aware Gradient Estimation), a new QAT method that augments the straight-through estimator (STE) gradient with a curvature-aware correction designed to counteract the loss increase induced by quantization. CAGE is derived from a multi-objective view of QAT that balances loss minimization with adherence to quantization constraints, yielding a principled correction term that depends on local curvature information. On the theoretical side, we introduce the notion of Pareto-optimal solutions for quantized optimization, and establish that CAGE yields strong convergence guarantees in the smooth non-convex setting. In terms of implementation, our approach is optimizer-agnostic, but we provide a highly efficient implementation that leverages Adam statistics. CAGE significantly improves upon the prior state-of-the-art methods in terms of accuracy, for similar computational cost: for QAT fine-tuning, it halves the compression accuracy loss relative to the prior best method, while for QAT pre-training of Llama models, its accuracy for 3-bit weights-and-activations (W3A3) matches that of 4-bit training (W4A4) with the prior best method (QuEST).
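For readers unfamiliar with the straight-through estimator that CAGE augments, the following is a minimal sketch of plain STE training on a one-dimensional toy loss. It illustrates only the STE baseline, not CAGE's curvature-aware correction, which the abstract does not specify. The uniform quantizer, the quadratic loss, and the names `quantize` and `ste_sgd` are assumptions chosen for illustration.

```python
def quantize(w, step=0.25):
    """Round a latent weight to the nearest point on a uniform grid (assumed quantizer)."""
    return round(w / step) * step


def ste_sgd(w, target, lr=0.1, steps=50, step=0.25):
    """Plain STE training on the toy loss L(q) = (q - target)^2.

    The forward pass uses the quantized weight q = quantize(w).
    The backward pass treats dq/dw as 1, so dL/dq is applied directly to w.
    """
    for _ in range(steps):
        q = quantize(w, step)
        grad_q = 2.0 * (q - target)  # gradient with respect to the quantized weight
        w -= lr * grad_q             # straight-through: reuse it as the gradient for w
    return w, quantize(w, step)


latent, quantized = ste_sgd(0.0, target=0.6)
```

With a target of 0.6 that lies between the grid points 0.5 and 0.75, the latent weight in this toy run ends up hovering near the rounding boundary and the quantized value alternates between the two neighbours, which is the kind of quantization-induced loss gap a correction term would aim to reduce.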
Shannonic: Efficient Entropy-Optimal Compression for ML Workloads
Kareem Ibrahim ⋅ Mohammadjavad Maheronnaghsh ⋅ Andreas Moshovos
We present Shannonic, a lossless compression method for machine learning tensors that achieves near-entropy-optimal compression, minimal state footprint, and high throughput. Shannonic uses an off-line pre-processing step to partition the tensor value space into optimally selected subranges and generates encoding/decoding tables that encode each value as a (range index, offset) pair where the range is entropy encoded using the asymmetric numeral systems (ANS) method. We formally prove and empirically show that Shannonic achieves higher compression efficiency than standard ANS. For a variety of 8b-quantized models, Shannonic's codec uses just 530B of state and achieves coding efficiency within 1% of the Shannon limit. Shannonic enables 1.3-3.1× faster federated learning over bandwidth-constrained networks and 29-32% latency reduction in edge-cloud LLM inference.
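The (range index, offset) representation described above can be sketched as follows. This is an illustration under stated assumptions, not Shannonic's codec: the subrange boundaries in `STARTS` are hand-picked rather than optimally selected, offsets are assumed to be stored as fixed-width raw bits, and the ANS stage is replaced by a Shannon-entropy estimate of the range-index stream. All names are hypothetical.

```python
import bisect
import math
from collections import Counter

# Hypothetical subrange start points over the 8-bit value space [0, 256).
STARTS = [0, 96, 120, 136, 160]
VALUE_SPACE = 256


def encode(value):
    """Map a value to (range index, offset within that range)."""
    idx = bisect.bisect_right(STARTS, value) - 1
    return idx, value - STARTS[idx]


def decode(idx, offset):
    """Invert encode(): the table lookup plus the offset recovers the value."""
    return STARTS[idx] + offset


def offset_bits(idx):
    """Raw bits needed for an offset in range idx (assumes fixed-width offsets)."""
    end = STARTS[idx + 1] if idx + 1 < len(STARTS) else VALUE_SPACE
    width = end - STARTS[idx]
    return math.ceil(math.log2(width)) if width > 1 else 0


def entropy_bits(symbols):
    """Shannon entropy in bits per symbol, a lower bound for any entropy coder."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def estimated_bits_per_value(values):
    """Entropy of the range-index stream plus the average raw offset width."""
    indices = [encode(v)[0] for v in values]
    avg_offset = sum(offset_bits(i) for i in indices) / len(indices)
    return entropy_bits(indices) + avg_offset
```

Narrow subranges placed where values concentrate keep the offsets short, while the small range-index alphabet keeps the entropy coder's tables small; how the subranges are actually chosen is the part this sketch leaves out.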
Once-for-All Channel Mixers (HyperTinyPW): Generative Compression for TinyML
Yassien Shaalan
Neural networks on microcontrollers are constrained by kilobytes of flash/SRAM, where 1×1 pointwise (PW) mixers often dominate memory even after INT8 quantization. We present HyperTinyPW, a compression-as-generation method that replaces most stored PW weights with generated weights: a shared micro-MLP synthesizes PW kernels once at load time from tiny per-layer codes, caches them, and executes them with standard integer operators. This preserves commodity MCU runtimes and incurs only a one-off synthesis cost; steady-state inference matches INT8 separable CNNs. Sharing a latent basis across layers removes cross-layer redundancy, while keeping PW1 in INT8 stabilizes early, morphology-sensitive mixing. We also introduce TinyML-faithful packed-byte accounting (generator, heads/factorization, codes, kept PW1, backbone) and a unified evaluation protocol with validation-tuned thresholds and bootstrap CIs. On three ECG benchmarks (Apnea-ECG, PTB-XL, MIT-BIH), HyperTinyPW improves the macro-F1 vs. flash Pareto frontier: at ∼225 kB it achieves near-iso performance to a ∼1.4 MB CNN while being 6.31× smaller (84.15% fewer bytes), retaining ≥95% of large-model macro-F1. Beyond ECG, HyperTinyPW transfers to TinyML audio: on Speech Commands keyword spotting it reaches 96.2% test accuracy (98.2% best validation), supporting that generate-and-cache channel mixing applies broadly to embedded sensing workloads where repeated linear mixers dominate memory.
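The generate-and-cache idea can be sketched as below: a shared small MLP turns a per-layer code into a 1×1 kernel once, quantizes it to INT8, and caches the result for later inference calls. The architecture, dimensions, random initialization, symmetric per-kernel scaling, and the class name `PWGenerator` are all assumptions for illustration; the abstract does not describe the generator at this level of detail.

```python
import random


class PWGenerator:
    """Hypothetical shared micro-MLP: per-layer code -> INT8 pointwise kernel."""

    def __init__(self, code_dim, hidden, out_ch, in_ch, seed=0):
        rng = random.Random(seed)
        self.out_ch, self.in_ch = out_ch, in_ch
        # Shared weights; in a real system these would be trained, not random.
        self.w1 = [[rng.gauss(0, 0.5) for _ in range(code_dim)] for _ in range(hidden)]
        self.w2 = [[rng.gauss(0, 0.5) for _ in range(hidden)]
                   for _ in range(out_ch * in_ch)]
        self.cache = {}

    def kernel(self, layer_name, code):
        """Synthesize the kernel on first use, then serve it from the cache."""
        if layer_name not in self.cache:
            hidden = [max(0.0, sum(w * c for w, c in zip(row, code)))  # ReLU layer
                      for row in self.w1]
            flat = [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]
            # Symmetric INT8 quantization with one scale per kernel (assumed).
            scale = max(abs(v) for v in flat) / 127 or 1.0
            q = [round(v / scale) for v in flat]
            rows = [q[i * self.in_ch:(i + 1) * self.in_ch] for i in range(self.out_ch)]
            self.cache[layer_name] = (rows, scale)
        return self.cache[layer_name]


gen = PWGenerator(code_dim=4, hidden=8, out_ch=3, in_ch=5)
weights, scale = gen.kernel("pw2", [0.1, -0.2, 0.3, 0.4])
```

In this sketch only the shared generator and the short codes would need to be stored in flash; the synthesized kernels live in the cache after the one-off load-time cost.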
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
Zhen Zheng ⋅ Xiaonan Song ⋅ Chuanjie Liu
Quantization has become one of the most effective methodologies to compress LLMs to a smaller size. However, existing quantization solutions still suffer from either a non-negligible accuracy drop or low system efficiency. In this paper, we propose MixLLM, which explores the optimization space of mixed-precision quantization between output features, based on the insight that different features matter differently in the model. MixLLM identifies the important output features in the global view rather than within each single layer, effectively assigning larger bit-width to the output features that need it the most to achieve high accuracy and low memory usage. We present the sweet spot of quantization configuration of algorithm-system co-design with high accuracy and system efficiency. To address the system challenge, we design a two-step dequantization that makes it easy to use Tensor Cores, and a fast data type conversion that reduces dequantization overhead, and present a software pipeline that overlaps memory access, dequantization, and the MatMul as much as possible. Extensive experiments show that with only 10% more bits, the perplexity increase can be reduced from about 0.5 in SOTA to within 0.2 for Llama 3.1 70B, while MMLU-Pro loss can be reduced from 1.92 to 0.99 over the SOTA of three popular models. Besides its superior accuracy, MixLLM also achieves state-of-the-art system efficiency. Code is released at https://github.com/microsoft/MixLLM.
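The difference between global and per-layer selection of output features can be sketched as follows. This is an illustration only: the saliency scores, the 4-bit and 8-bit widths, the `high_fraction` parameter, and the function name `assign_bits` are assumptions, and the abstract does not say how MixLLM scores feature importance.

```python
def assign_bits(saliency, high_fraction=0.1, low_bits=4, high_bits=8):
    """Give high_bits to the globally most salient output features.

    saliency maps a layer name to a list of per-output-feature scores.
    Features from all layers are ranked together, so a layer with many
    important features can receive more high-precision slots than another.
    """
    ranked = sorted(
        ((score, layer, idx)
         for layer, scores in saliency.items()
         for idx, score in enumerate(scores)),
        reverse=True,
    )
    n_high = int(len(ranked) * high_fraction)
    bits = {layer: [low_bits] * len(scores) for layer, scores in saliency.items()}
    for _, layer, idx in ranked[:n_high]:
        bits[layer][idx] = high_bits
    return bits


plan = assign_bits(
    {"layer_a": [9.0, 8.0, 7.0, 0.1, 0.2], "layer_b": [0.3, 0.1, 0.2, 0.4, 0.5]},
    high_fraction=0.2,
)
```

Here both high-precision slots go to `layer_a`, whereas a per-layer budget would have spent one of them on a low-saliency feature in `layer_b`.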
Search Your Block Floating Point Scales!
Tanmaey Gupta ⋅ Hayden Prairie ⋅ Xiaoxia Wu ⋅ Reyna Abhyankar ⋅ Qingyang Wu ⋅ Austin Silveria ⋅ Pragaash Ponnusamy ⋅ Jue Wang ⋅ Ben Athiwaratkun ⋅ Shuaiwen Song ⋅ Tri Dao ⋅ Daniel Fu ⋅ Christopher De Sa
Quantization has emerged as a standard technique for accelerating inference for generative models by enabling faster low-precision computations and reduced memory transfers. Recently, GPU accelerators have added first-class support for microscaling Block Floating Point (BFP) formats. Standard BFP algorithms use a fixed scale based on the maximum magnitude of the block. We observe that this scale choice can be suboptimal with respect to quantization errors. In this work, we propose ScaleSearch, an alternative strategy for selecting these scale factors: using a fine-grained search leveraging the mantissa bits in microscaling formats to minimize the quantization error for the given distribution. ScaleSearch can be integrated with existing quantization methods such as Post Training Quantization and low precision attention, and is shown to improve their performance. Additionally, we introduce ScaleSearchAttention, an accelerated NVFP4-based attention algorithm, which uses ScaleSearch and adapted prior techniques to ensure near-0 performance loss for causal language modeling. Experiments show that ScaleSearch reduces quantization error by 27% for NVFP4 and improves language model PTQ by up to 15 points for MATH500 (Qwen3-8B), while ScaleSearchAttention improves Wikitext-2 PPL by up to 0.77 points for Llama 3.1 70B. The proposed methods closely match baseline performance while providing quantization accuracy improvements.
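The contrast between a max-magnitude scale and a searched scale can be sketched as below for a single block quantized to the FP4 (E2M1) value grid. This is a simplified illustration, not the ScaleSearch algorithm: the candidate scales are plain multiples of the max-based scale rather than values representable in the scale format's mantissa bits, squared error is an assumed objective, and all function names are hypothetical.

```python
import math

# Magnitudes representable in FP4 E2M1 (the sign is handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block, scale):
    """Snap each x / scale to the nearest FP4 magnitude, then rescale."""
    out = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(mag * scale, x))
    return out


def sq_error(block, scale):
    """Sum of squared quantization errors for the block at the given scale."""
    return sum((x - q) ** 2 for x, q in zip(block, quantize_block(block, scale)))


def max_scale(block):
    """Standard choice: map the largest magnitude onto the top grid value 6."""
    return max(abs(x) for x in block) / 6.0 or 1.0


def search_scale(block, factors=None):
    """Try scales around the max-based one and keep the lowest-error candidate."""
    base = max_scale(block)
    factors = factors or [0.5 + 0.05 * i for i in range(21)]  # 0.5x to 1.5x (assumed)
    candidates = [base] + [base * f for f in factors]
    return min(candidates, key=lambda s: sq_error(block, s))


block = [0.1, -0.7, 1.3, 2.9, -5.0, 0.45]
best = search_scale(block)
```

Because the max-based scale is always among the candidates, the searched scale can never give a higher squared error on the block than the standard choice; how much lower it gets depends on the value distribution.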