Hawkeye: Reproducing GPU-Level Non-Determinism
Abstract
We present Hawkeye, a system for analyzing and reproducing GPU-level arithmetic operations on CPUs. Using our framework, an auditor can re-execute, on a CPU, a full model training or inference workflow originally executed on NVIDIA GPUs, without any precision loss and without introducing additional operations or slowdown on the GPU side. This stands in stark contrast to prior approaches to verifiable machine learning, which introduced significant computational overhead for the model provider. The main technical contribution underlying Hawkeye is a systematic algorithmic framework for recovering the numerical behavior of NVIDIA's Tensor Cores: rounding, subnormal number handling, and the order of (non-associative) accumulation during matrix multiplication. Our framework consists of a sequence of carefully crafted tests that reduce the otherwise exponentially large search space of potential implementations for each operation. We evaluate our framework on a variety of GPU architectures (including Ampere and Hopper), as well as on all available precision types (FP16, BF16). In every test case, our framework recovers the exact implementation of the operations underlying matrix multiplication, and therefore allows for the full reproduction of model training and inference workflows on a CPU.
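Why accumulation order matters at all can be seen with a minimal, self-contained sketch (illustrative only, not code from the paper): IEEE-754 floating-point addition is non-associative, so two accumulation orders over the same addends can produce different bit-exact results. Bit-exact CPU reproduction of a GPU matrix multiply therefore requires knowing the exact order in which the hardware sums partial products. The function name `accumulate` is ours, chosen for illustration.

```python
def accumulate(values):
    """Left-to-right sequential accumulation, as a scalar CPU loop would do."""
    total = 0.0
    for v in values:
        total += v
    return total

# The same three addends, two accumulation orders, two different results:
# in the first order, 1.0 is absorbed by the large term before it cancels;
# in the second, the large terms cancel first and 1.0 survives.
a = accumulate([1e16, 1.0, -1e16])
b = accumulate([1e16, -1e16, 1.0])
print(a, b)  # 0.0 1.0
```

The same addends under two orders disagree even in double precision; in FP16/BF16, where Tensor Cores operate, the effect is far more pronounced, which is why identifying the hardware's accumulation order is central to exact reproduction.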