CATWILD: Compiler Autotuning for TPU Workloads in the Wild
Abstract
Compilers play a fundamental role in achieving peak performance for machine learning (ML) workloads. However, given the diverse nature of workloads and accelerators, compilers' heuristics and analytical cost models often yield sub-optimal performance and thus waste precious datacenter resources. Furthermore, the multitude of tunable parameters and their complex interplay often make it impossible for human experts to manually find optimal configurations. In this paper, we present CATWILD, a system that automatically optimizes ML jobs in Google's TPU fleet using compiler autotuning techniques. We describe CATWILD's design and implementation, evaluate its performance using a set of representative metrics, and report experiences and lessons learned from its five-year development and operation. To the best of our knowledge, CATWILD is the first ML compiler autotuning solution deployed in datacenters at scale. Its successful rollout has yielded substantial benefits, optimizing over 70% of daily TPU training jobs and achieving significant chip savings.