
Payman Behnam · Alexey Tumanov · Tushar Krishna · Pranav Gadikar · Yangyu Chen · Jianming Tong · Yue Pan · Abhimanyu Rajeshkumar Bambhaniya · Alind Khare

@ Ballroom B - Position 46
in Edge

A growing number of applications depend on Machine Learning (ML) functionality and benefit from both higher-quality ML predictions and better timeliness (latency) at the same time. A growing body of research in the computer architecture, ML, and systems software literature focuses on reaching better latency/accuracy tradeoffs for ML models. Efforts include compression, quantization, pruning, early-exit models, and mixed DNN precision, as well as ML inference accelerator designs that minimize latency and energy while preserving delivered accuracy. All of them, however, yield improvements for a single point in the latency/accuracy tradeoff space. We make a case for applications that operate in dynamically changing deployment scenarios, where no single point is optimal. We draw on the recently proposed weight-shared SuperNet mechanism to serve a stream of queries that uses (activates) different SubNets within this weight-shared construct. This creates an opportunity to exploit the inherent temporal locality with our proposed SubGraph Stationary (SGS) optimization. We take a hardware-software co-design approach with a real implementation of SGS in SushiAccel and the implementation of a software scheduler, SushiSched, controlling which SubNets to serve and what to cache in real time. Combined, they are vertically integrated into SUSHI, an inference serving stack, which yields up to 41% improvement in latency, a 0.98% increase in served accuracy, and up to 52.6% saved energy.
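The intuition behind the SubGraph Stationary optimization can be illustrated with a toy sketch: consecutive queries activate SubNets that share layers of the same SuperNet, so keeping recently used weight subgraphs resident (here, in a simple LRU cache) turns the shared layers into hits. This is an illustrative model only, with hypothetical names and sizes; it is not SUSHI's actual SushiAccel/SushiSched implementation.

```python
from collections import OrderedDict

class SubGraphCache:
    """Toy LRU cache of SubNet weight tiles. Overlapping SubNets of a
    weight-shared SuperNet reuse cached layers instead of refetching them."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # layer_id -> weights (stand-in string)
        self.hits = 0
        self.misses = 0

    def fetch(self, layer_id):
        if layer_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(layer_id)  # mark as most recently used
        else:
            self.misses += 1
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
            self.cache[layer_id] = f"weights[{layer_id}]"
        return self.cache[layer_id]

# Two SubNets of a shared SuperNet, each named by the layers it activates.
subnet_a = [0, 1, 2, 3]
subnet_b = [0, 1, 2, 4]  # shares layers 0-2 with subnet_a

cache = SubGraphCache(capacity=4)
for layer in subnet_a + subnet_b:
    cache.fetch(layer)

print(cache.hits, cache.misses)  # → 3 5: shared layers 0-2 hit on reuse
```

Serving subnet_b right after subnet_a pays the fetch cost only for the one layer the two SubNets do not share, which is the temporal-locality effect SGS exploits in hardware.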

Author Information

Payman Behnam (Georgia Institute of Technology)
Alexey Tumanov (Georgia Tech)
Tushar Krishna (Georgia Institute of Technology)
Pranav Gadikar (Georgia Institute of Technology)
Yangyu Chen (Georgia Institute of Technology)
Jianming Tong (Georgia Tech)

Jianming Tong started his PhD program in Spring 2021 with a primary interest in hardware acceleration for AI/ML (MLSys'23) and Fully Homomorphic Encryption (ODIW'23). He is advised by Dr. Tushar Krishna. Jianming received his bachelor's degree in EE from Xi'an Jiaotong University in 2020 (awarded the national scholarship three times), supervised by Prof. Pengju Ren. He also developed an open-source FPGA acceleration framework (ac2SLAM) for perception in robotics (FPT'21), and served as a Research Assistant at Tsinghua University, working with Prof. Yu Wang, producing an open-source multi-robot exploration system (ICRA'21). He has extensive prototyping experience on both ASICs and FPGAs, with internships at Alibaba DAMO Academy and Pacific Northwest National Lab. He was a 2022 Qualcomm Innovation Fellowship finalist.

Yue Pan (University of California San Diego)
Abhimanyu Rajeshkumar Bambhaniya (Georgia Institute of Technology)
Alind Khare (Georgia Tech)
