Search Your NVFP4 Scales!
Abstract
Quantization has emerged as a standard technique for accelerating inference in generative models by enabling faster low-precision computation and reduced memory transfers. Recently, GPU accelerators have added first-class support for microscaling Block Floating Point (BFP) formats. Standard BFP algorithms use a fixed scale derived from the maximum magnitude of each block. We observe that this scale choice can be suboptimal with respect to quantization error. In this work, we propose \textbf{ScaleSearch}, an alternative strategy for selecting these scale factors: a fine-grained search that leverages the mantissa bits of the microscaling scale format to minimize quantization error for the given value distribution. \textbf{ScaleSearch} can be integrated with existing quantization methods such as Post-Training Quantization (PTQ) and low-precision attention, and is shown to improve their performance. Additionally, we introduce \textbf{ScaleSearchAttention}, an accelerated NVFP4-based attention algorithm, which uses \textbf{ScaleSearch} together with adapted prior techniques to ensure near-zero performance loss for causal language modeling. Experiments show that \textbf{ScaleSearch} improves language model weight PTQ by up to 7.5 points on GPQA (Qwen3-8B) and video generation on Mochi by up to 14 points in VQA-a over SageAttention3. \textbf{ScaleSearchAttention} improves Wikitext-2 perplexity by 0.9 points for Llama 3.1 70B.
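To make the scale-search idea concrete, the following sketch (Python/NumPy, not the paper's implementation) quantizes a single 16-value NVFP4 block with several candidate scales around the standard max-magnitude baseline and keeps the scale with the lowest squared reconstruction error. The names \texttt{scale\_search}, \texttt{quantize\_fp4}, and \texttt{num\_candidates} are illustrative, and the multiplicative candidate grid is a simplifying stand-in for enumerating nearby FP8 E4M3 scale values.

\begin{verbatim}
import numpy as np

# Representable magnitudes of FP4 E2M1 (the NVFP4 element format).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x, scale):
    """Quantize a block to FP4 E2M1 with the given scale, then dequantize."""
    y = np.abs(x) / scale
    # Round each scaled element to the nearest representable FP4 magnitude.
    idx = np.argmin(np.abs(y[:, None] - FP4_GRID[None, :]), axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale

def scale_search(block, num_candidates=16):
    """Illustrative sketch: pick the candidate scale minimizing squared error."""
    # Standard choice: scale so that the block maximum maps to FP4's max (6.0).
    baseline = np.max(np.abs(block)) / FP4_GRID[-1]
    best_scale, best_err = baseline, np.inf
    # Candidate scales: perturbations of the baseline (a stand-in for
    # searching over nearby FP8 E4M3 scale values).
    for s in baseline * np.linspace(0.8, 1.2, num_candidates):
        err = np.sum((block - quantize_fp4(block, s)) ** 2)
        if err < best_err:
            best_scale, best_err = s, err
    return best_scale

block = np.random.randn(16).astype(np.float32)  # one NVFP4 block of 16 values
print(scale_search(block))
\end{verbatim}

In this toy setting, the searched scale never does worse than the max-magnitude baseline on the squared-error criterion, since the baseline itself is among the candidates evaluated.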