Session
Research-Track Oral Presentation: R15: Model Compression
Grand Ballroom 2
CAGE: Curvature-Aware Gradient Estimation For Accurate Quantization-Aware Training
Soroush Tabesh · Andrei Panferov
Despite significant work on low-bit quantization-aware training (QAT), there is still an accuracy gap between such techniques and native training. To address this, we introduce CAGE (Curvature-Aware Gradient Estimation), a new QAT method that augments the straight-through estimator (STE) gradient with a curvature-aware correction designed to counteract the loss increase induced by quantization. CAGE is derived from a multi-objective view of QAT that balances loss minimization with adherence to quantization constraints, yielding a principled correction term that depends on local curvature information. On the theoretical side, we introduce the notion of Pareto-optimal solutions for quantized optimization, and establish that CAGE yields strong convergence guarantees in the smooth non-convex setting. In terms of implementation, our approach is optimizer-agnostic, but we provide a highly efficient implementation that leverages Adam statistics. CAGE significantly improves upon the prior state-of-the-art methods in accuracy at similar computational cost: for QAT fine-tuning, it halves the compression accuracy loss relative to the prior best method, while for QAT pre-training of Llama models, its accuracy for 3-bit weights-and-activations (W3A3) matches that of 4-bit training (W4A4) with the prior best method (QuEST).
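The idea of augmenting the STE gradient with a curvature-weighted pull toward the quantized point can be illustrated with a small NumPy sketch. The uniform quantizer, the `lam` weighting, and the exact shape of the correction term below are illustrative assumptions, not the paper's derivation; the diagonal curvature proxy `v` stands in for Adam's second-moment estimate.

```python
import numpy as np

def quantize(w, num_bits=3):
    # Uniform symmetric quantizer (an illustrative stand-in for the paper's quantizer).
    scale = np.max(np.abs(w)) / (2 ** (num_bits - 1) - 1) + 1e-12
    return np.round(w / scale) * scale

def cage_gradient(w, grad, v, lam=0.1, eps=1e-8):
    # STE passes the gradient through unchanged; a CAGE-style correction adds
    # a curvature-weighted term proportional to the quantization residual.
    # v: diagonal curvature proxy (e.g. Adam's second-moment estimate).
    residual = w - quantize(w)                     # quantization error of the weights
    correction = lam * v / (np.sqrt(v) + eps) * residual
    return grad + correction
```

When the weights already lie on the quantization grid the residual vanishes and the update reduces to the plain STE gradient; away from the grid, the correction biases the update toward quantization-friendly points, with the strength modulated by local curvature.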
MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
Zhen Zheng · Xiaonan Song · Chuanjie Liu
Quantization has become one of the most effective methodologies for compressing LLMs to a smaller size. However, existing quantization solutions still suffer from either a non-negligible accuracy drop or low system efficiency. In this paper, we propose MixLLM, which explores the optimization space of mixed-precision quantization between output features, based on the insight that different features matter differently to the model. MixLLM identifies the important output features in a global view rather than within each single layer, effectively assigning larger bit-widths to the output features that need them most, achieving high accuracy with low memory usage. We identify a sweet-spot quantization configuration through algorithm-system co-design that delivers both high accuracy and system efficiency. To address the system challenge, we design a two-step dequantization that makes easy use of Tensor Cores, a fast data-type conversion to reduce dequantization overhead, and a software pipeline that overlaps memory access, dequantization, and MatMul as much as possible. Extensive experiments show that with only 10\% more bits, the perplexity increase can be reduced from about 0.5 for the SOTA to within 0.2 for Llama 3.1 70B, while the MMLU-Pro loss can be reduced from 1.92 to 0.99 averaged over three popular models, relative to the SOTA. Besides its superior accuracy, MixLLM also achieves state-of-the-art system efficiency.
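The global (rather than per-layer) selection of high-precision output features can be sketched as follows. The salience scores, the 10% high-precision fraction, and the 4/8-bit split are placeholder assumptions for illustration; MixLLM's actual importance metric and bit-width choices may differ.

```python
import numpy as np

def assign_bits_globally(salience_per_layer, high_frac=0.10, low_bits=4, high_bits=8):
    # Pool salience scores from every layer, find the global top-fraction
    # threshold, and give only those output features the larger bit-width.
    # A per-layer scheme would instead threshold each layer independently.
    all_scores = np.concatenate(salience_per_layer)
    threshold = np.quantile(all_scores, 1.0 - high_frac)
    return [np.where(s >= threshold, high_bits, low_bits) for s in salience_per_layer]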
Deploying neural networks on microcontrollers is constrained by kilobytes of flash and SRAM, where 1x1 pointwise (PW) channel mixers often dominate memory even after INT8 quantization. We present HyperTinyPW, a compression-as-generation approach that replaces most stored PW weights with generated weights. A shared micro-MLP synthesizes PW kernels once at load time from tiny per-layer codes; the kernels are cached and then executed with standard integer operators, so the deployment stack stays unchanged. A shared latent basis across layers reduces redundancy, and keeping the first PW layer in INT8 stabilizes early morphology-sensitive mixing. Our contributions are: (1) TinyML-faithful packed-byte accounting that includes the generator, heads or factorization, per-layer codes, the kept first PW layer, and the backbone; (2) a unified evaluation protocol with a validation-tuned threshold (t*) and bootstrap confidence intervals; and (3) a deployability analysis covering integer-only inference and boot-versus-lazy synthesis trade-offs. On three ECG benchmarks (Apnea-ECG, PTB-XL, MIT-BIH), HyperTinyPW shifts the macro-F1 versus flash Pareto frontier: at about 225 kB it matches a ~1.4 MB CNN while being 6.31x smaller (84.15% fewer bytes), retaining at least 95% of large-model macro-F1. Under 32-64 kB budgets it sustains balanced detection where compact baselines degrade. The mechanism applies broadly to other 1D biosignals, on-device speech, and embedded sensing tasks where per-layer redundancy dominates, suggesting a wider role for compression-as-generation in resource-constrained ML systems.
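The load-time synthesis step can be sketched as follows, assuming a tanh micro-MLP generator and illustrative sizes (8-dim codes, 8 hidden units, 64-to-64 pointwise layers); quantization of the generated kernels and the low-rank heads/factorization are omitted. All names and sizes here are assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def micro_mlp(code, W1, W2):
    # Shared tiny generator: per-layer code -> flattened 1x1 (pointwise) kernel.
    return np.tanh(code @ W1) @ W2

# Illustrative sizes: 8-dim codes, 8 hidden units, 64->64 pointwise layers.
code_dim, hidden, c_in, c_out = 8, 8, 64, 64
W1 = rng.normal(size=(code_dim, hidden)) * 0.1     # stored once, shared by all layers
W2 = rng.normal(size=(hidden, c_in * c_out)) * 0.1

# Per-layer codes are the only per-layer storage for the generated layers.
codes = rng.normal(size=(16, code_dim))            # 16 generated PW layers

# Load time: synthesize each kernel once, cache it, then run inference with
# standard (e.g. INT8) operators on the cached kernels.
kernels = [micro_mlp(c, W1, W2).reshape(c_out, c_in) for c in codes]

stored = W1.size + W2.size + codes.size            # generator + codes
dense = 16 * c_in * c_out                          # cost of storing kernels directly
```

At these illustrative sizes the generator plus codes hold roughly half the parameters of the dense kernels, and the saving grows with the number of layers sharing the generator; the paper's packed-byte accounting additionally counts the kept first PW layer and the backbone.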
Search Your NVFP4 Scales!
Tanmaey Gupta · Hayden Prairie · Xiaoxia Wu · Reyna Abhyankar · Qingyang Wu · Austin Silveria · Pragaash Ponnusamy · Jue Wang · Ben Athiwaratkun · Shuaiwen Song · Tri Dao · Daniel Fu · Christopher De Sa
Quantization has emerged as a standard technique for accelerating inference for generative models by enabling faster low-precision computations and reduced memory transfers. Recently, GPU accelerators have added first-class support for microscaling Block Floating Point (BFP) formats. Standard BFP algorithms use a fixed scale based on the maximum magnitude of the block. We observe that this scale choice can be suboptimal with respect to quantization errors. In this work, we propose \textbf{ScaleSearch}, an alternative strategy for selecting these scale factors: a fine-grained search leveraging the mantissa bits in microscaling formats to minimize the quantization error for the given distribution. \textbf{ScaleSearch} can be integrated with existing quantization methods such as Post Training Quantization and low-precision attention, and is shown to improve their performance. Additionally, we introduce \textbf{ScaleSearchAttention}, an accelerated NVFP4-based attention algorithm, which uses \textbf{ScaleSearch} and adapted prior techniques to ensure near-zero performance loss for causal language modeling. Experiments show that \textbf{ScaleSearch} improves language-model weight PTQ by up to 7.5 points on GPQA (Qwen3-8B) and video generation on Mochi by up to 14 points in VQA-a over SageAttention3, while \textbf{ScaleSearchAttention} improves Wikitext-2 PPL by 0.9 points for Llama 3.1 70B.
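The core idea of searching for a better per-block scale, rather than always taking the max-magnitude scale, can be sketched as a simple error-minimizing sweep. The FP4-like 15-level grid and the 0.8-1.2 candidate window below are illustrative assumptions; the paper's search is specifically over the representable mantissa bits of the microscaling format's scale.

```python
import numpy as np

def quantize_block(x, scale, levels=7):
    # Symmetric integer grid as a stand-in for the NVFP4 element format.
    q = np.clip(np.round(x / scale), -levels, levels)
    return q * scale

def max_scale(x, levels=7):
    # Standard BFP choice: scale so the block's max magnitude is representable.
    return np.max(np.abs(x)) / levels

def search_scale(x, levels=7, num_candidates=16):
    # Fine-grained search around the max-magnitude scale: try nearby
    # candidate scales and keep the one with the lowest squared error.
    best_scale = max_scale(x)
    best_err = np.sum((x - quantize_block(x, best_scale, levels)) ** 2)
    for s in best_scale * np.linspace(0.8, 1.2, num_candidates):
        err = np.sum((x - quantize_block(x, s, levels)) ** 2)
        if err < best_err:
            best_scale, best_err = s, err
    return best_scale, best_err
```

Since the max-magnitude scale is the starting candidate, the searched scale can never do worse in squared error; it often does better because shrinking the scale slightly trades a little clipping at the tail for finer resolution over the bulk of the distribution.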
Shannonic: Efficient Entropy-Optimal Compression for ML Workloads
Kareem Ibrahim · Andreas Moshovos
We present Shannonic, a lossless compression method for machine learning tensors that achieves near-entropy-optimal compression, a minimal state footprint, and high throughput. Shannonic uses an offline pre-processing step to partition the tensor value space into optimally selected subranges and generates encoding/decoding tables that encode each value as a (range index, offset) pair, where the range is entropy encoded using the asymmetric numeral systems (ANS) method. We formally prove and empirically show that Shannonic achieves higher compression efficiency than standard ANS. For a variety of 8b-quantized models, Shannonic's codec uses just 530 B of state and achieves coding efficiency within 1\% of the Shannon limit. Shannonic enables 1.3-3.1$\times$ faster federated learning over bandwidth-constrained networks and 29-32\% latency reduction in edge-cloud LLM inference.
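The (range index, offset) representation can be sketched as follows. The equal-probability-mass partition below is a simplification of Shannonic's optimal subrange selection, and the ANS entropy coding of the range index is omitted; only the pair structure and its exact round-trip are shown.

```python
import numpy as np

def build_ranges(hist, num_ranges=4):
    # Partition the value space into contiguous subranges of roughly equal
    # probability mass (a simplified stand-in for the optimized partition).
    cdf = np.cumsum(hist) / np.sum(hist)
    edges = [0]
    for k in range(1, num_ranges):
        edges.append(int(np.searchsorted(cdf, k / num_ranges)) + 1)
    edges.append(len(hist))
    return edges

def encode(value, edges):
    # A value becomes a (range index, offset) pair: the range index would be
    # entropy coded with ANS, while the offset is stored in plain bits.
    r = int(np.searchsorted(edges, value, side='right')) - 1
    return r, value - edges[r]

def decode(r, offset, edges):
    return edges[r] + offset
```

Because only the range index needs entropy coding, the coder's state stays tiny (one table over a handful of ranges instead of the full value alphabet), which is consistent with the small state footprint the abstract reports.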