A growing number of applications depend on Machine Learning (ML) functionality and benefit from both higher-quality ML predictions and better timeliness (latency) at the same time. A growing body of research in the computer architecture, ML, and systems software literature focuses on reaching better latency/accuracy tradeoffs for ML models. Efforts include compression, quantization, pruning, early-exit models, and mixed DNN precision, as well as ML inference accelerator designs that minimize latency and energy while preserving delivered accuracy. All of them, however, yield improvements for a single point in the latency/accuracy tradeoff space. We make a case for applications that operate in dynamically changing deployment scenarios, where no single point is optimal. We draw on the recently proposed weight-shared SuperNet mechanism to enable serving a stream of queries that use (activate) different SubNets within this weight-shared construct. This creates an opportunity to exploit the inherent temporal locality with our proposed SubGraph Stationary (SGS) optimization. We take a hardware-software co-design approach with a real implementation of SGS in SushiAccel and the implementation of a software scheduler, SushiSched, that controls which SubNets to serve and what to cache in real time. Combined, they are vertically integrated into SUSHI, an inference serving stack that yields up to 41% improvement in latency, a 0.98% increase in served accuracy, and up to 52.6% energy savings.
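To make the temporal-locality intuition behind SGS concrete, the following is a minimal, hypothetical Python sketch of keeping weight blocks shared across consecutive SubNets resident (stationary) in a cache, so that only the delta must be fetched from off-chip memory. The names (WeightCache, load, fetch_fn, block ids) are illustrative assumptions, not SUSHI's actual interfaces, and the sketch models neither SushiAccel's hardware nor SushiSched's scheduling policy.

# Illustrative sketch only: a toy model of the SubGraph Stationary (SGS) idea.
class WeightCache:
    """Keeps weight blocks of recently served SubNets resident (stationary)."""
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.resident = {}   # block_id -> weights
        self.lru = []        # block_ids, least recently used first

    def load(self, subnet_blocks, fetch_fn):
        """Loads one SubNet's blocks; returns (cache hits, off-chip fetches)."""
        hits, misses = 0, 0
        for block_id in subnet_blocks:
            if block_id in self.resident:
                hits += 1
                self.lru.remove(block_id)
            else:
                misses += 1
                if len(self.resident) >= self.capacity:
                    evicted = self.lru.pop(0)          # evict the LRU block
                    del self.resident[evicted]
                self.resident[block_id] = fetch_fn(block_id)  # off-chip fetch
            self.lru.append(block_id)                  # mark most recently used
        return hits, misses

# Consecutive queries often activate overlapping SubNets, so shared blocks
# stay cached and only the delta is fetched -- the locality SGS exploits.
cache = WeightCache(capacity_blocks=8)
stream = [{"b0", "b1", "b2"}, {"b0", "b1", "b3"}, {"b0", "b2", "b3"}]
for subnet in stream:
    h, m = cache.load(sorted(subnet), fetch_fn=lambda b: f"weights({b})")
    print(f"hits={h} misses={m}")

After the first query warms the cache, each subsequent query in this toy stream fetches only one new block; the larger the overlap between consecutively activated SubNets, the more weight traffic stays on-chip.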
Author Information
Payman Behnam (Georgia Institute of Technology)
Alexey Tumanov (Georgia Tech)
Tushar Krishna (Georgia Institute of Technology)
Pranav Gadikar (Georgia Institute of Technology)
Yangyu Chen (Georgia Institute of Technology)
Jianming Tong (Georgia Tech)

Jianming Tong started his PhD program in Spring 2021 with a primary interest in hardware acceleration for AI/ML (MLSys'23) and Fully Homomorphic Encryption (ODIW'23). He is advised by Dr. Tushar Krishna. Jianming received his bachelor's degree in EE at Xi'an Jiaotong University in 2020 (awarded the national scholarship three times), supervised by Prof. Pengju Ren. Jianming also developed an open-source FPGA acceleration framework (ac2SLAM) for perception in robotics (FPT21). He also served as a Research Assistant at Tsinghua University, working with Prof. Yu Wang, which resulted in an open-source multi-robot exploration system (ICRA21). He has extensive prototyping experience on both ASICs and FPGAs, and has interned at Alibaba DAMO Academy and Pacific Northwest National Lab. He was a 2022 Qualcomm Innovation Fellowship finalist.
Yue Pan (University of California San Diego)
Abhimanyu Rajeshkumar Bambhaniya (Georgia Institute of Technology)
Alind Khare (Georgia Tech)
More from the Same Authors
- 2023 Poster: XRBench: An Extended Reality (XR) Machine Learning Benchmark Suite for the Metaverse »
  Hyoukjun Kwon · Krishnakumar Nair · Jamin Seo · Jason Yik · Debabrata Mohapatra · Dongyuan Zhan · Jinook Song · Peter Capak · Peizhao Zhang · Peter Vajda · Colby Banbury · Mark Mazumder · Liangzhen Lai · Ashish Sirasao · Tushar Krishna · Harshit Khaitan · Vikas Chandra · Vijay Janapa Reddi
- 2022 Tutorial: ASTRA-sim: Enabling SW/HW Co-Design Exploration for Distributed Deep Learning Training Platforms »
  Tushar Krishna
- 2022 : Introduction to Distributed DL Training »
  Tushar Krishna
- 2021 : Panel Session - Lizy John (UT Austin), David Kaeli (Northeastern University), Tushar Krishna (Georgia Tech), Peter Mattson (Google), Brian Van Essen (LLNL), Venkatram Vishwanath (ANL), Carole-Jean Wu (Facebook) »
  Tom St John · Lizy John · Tushar Krishna · Peter Mattson · Venkatram Vishwanath · Carole-Jean Wu · David Kaeli · Brian Van Essen
- 2021 : Wrap up »
  Alexey Tumanov
- 2021 Workshop: SysML4Health: Scalable Systems for ML-driven Analytics in Healthcare »
  Alexey Tumanov · Jimeng Sun · Tushar Krishna · Vivek Sarkar · Dawn Song
- 2021 : Intro and Welcome »
  Alexey Tumanov