

Session

Performance and Memory

Mission B4 & B7
Thu 16 May 9 a.m. PDT — 10 a.m. PDT

Thu 16 May 9:00 - 9:20 PDT

vMCU: Coordinated Memory Management and Kernel Optimization for DNN Inference on MCUs

Size Zheng · Renze Chen · Meng Li · Zihao Ye · Luis Ceze · Yun Liang

IoT devices based on microcontroller units (MCUs) provide ultra-low power consumption and ubiquitous computation for near-sensor deep neural network (DNN) models. However, the memory of an MCU is usually 2-3 orders of magnitude smaller than that of mobile devices, which makes it challenging to map DNNs onto MCUs. Previous work separates memory management from kernel implementation on MCUs and relies on coarse-grained memory management techniques such as in-place updates to reduce memory consumption. In this paper, we propose to coordinate memory management and kernel optimization for DNN inference on MCUs to enable fine-grained memory management. The key idea is to virtualize the limited memory of the MCU as a large memory pool. Each kernel divides the memory pool into kernel-specific segments and handles segment loads and stores while computing DNN layers. With this fine-grained, segment-level memory control, the memory footprints of different tensors can overlap, so they never need to be materialized at the same time, reducing overall memory consumption. Following this idea, we implement vMCU for DNN inference on MCUs. Evaluation on single layers on ARM Cortex-M4 and Cortex-M7 processors shows that vMCU reduces RAM usage by $12.0\%$ to $49.5\%$ and energy consumption by $20.6\%$ to $53.0\%$ compared to state-of-the-art work. For full DNN evaluation, vMCU reduces the memory bottleneck by $61.5\%$, enabling more models to be deployed on low-end MCUs.
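
To make the segment idea concrete, here is a minimal Python sketch of segment-level footprint overlap, assuming a simple two-tap elementwise kernel; `SEG` and `shifted_map` are hypothetical names, not vMCU's API (vMCU itself generates C kernels for Cortex-M targets). Output segments are written into pool slots whose input elements have already been consumed, so input and output share one buffer of `n + SEG` elements instead of roughly `2n`.

```python
import numpy as np

SEG = 64  # elements per segment (an illustrative tuning knob)

def shifted_map(pool, n, f):
    """Compute y[i] = f(x[i], x[i+1]) for 0 <= i < n-1, with x stored at
    pool[SEG:SEG+n] and y written to pool[0:n-1]. Each write lands on
    storage whose input elements were consumed in an earlier step, so the
    two tensors' footprints overlap safely inside one pool."""
    for start in range(0, n - 1, SEG):
        end = min(start + SEG, n - 1)
        seg = f(pool[SEG + start:SEG + end],           # load x[start:end]
                pool[SEG + start + 1:SEG + end + 1])   # load x[start+1:end+1]
        pool[start:end] = seg                          # store y[start:end]

# Usage: a 2-tap moving average over 1000 elements in one shared pool.
n = 1000
pool = np.empty(n + SEG, dtype=np.float32)
pool[SEG:] = np.random.randn(n).astype(np.float32)
shifted_map(pool, n, lambda a, b: 0.5 * (a + b))
```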

Thu 16 May 9:20 - 9:40 PDT

SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models

Zhixu Du · Shiyu Li · Yuhao Wu · Xiangyu Jiang · Jingwei Sun · Qilin Zheng · Yongkai Wu · Ang Li · Hai Li · Yiran Chen

Mixture-of-Experts (MoE) has emerged as a favorable architecture in the era of large models due to its inherent advantage: it enlarges model capacity without incurring notable computational overhead. Yet realizing this benefit often leads to ineffective GPU memory utilization, as large portions of the model parameters remain dormant during inference. Moreover, the memory demands of large models consistently outpace the memory capacity of contemporary GPUs. To address this, we introduce SiDA (**S**parsity-**i**nspired **D**ata-**A**ware), an efficient inference approach tailored to large MoE models. SiDA judiciously exploits both the system's main memory, which is now abundant and readily scalable, and GPU memory, capitalizing on the inherent sparsity of expert activation in MoE models. By adopting a data-aware perspective, SiDA achieves enhanced model efficiency with a negligible performance drop. Specifically, SiDA attains up to a $3.93\times$ throughput increase, up to $75\%$ latency reduction, and up to $80\%$ GPU memory savings in MoE inference, with a performance drop as low as $1\%$. This work paves the way for scalable and efficient deployment of large MoE models, even in memory-constrained systems.
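
The following is a minimal sketch of the expert-offloading pattern the abstract describes, assuming a PyTorch-style setting; `OffloadedMoELayer`, `prefetch`, and the predictor interface are hypothetical names, not SiDA's actual API. Experts stay in host RAM, and only those predicted to activate for the next batch are copied into a small GPU cache before the MoE layer runs.

```python
import copy
import torch

class OffloadedMoELayer:
    def __init__(self, experts, gpu_slots=4, device="cuda"):
        self.cpu_experts = list(experts)   # all experts resident in host RAM
        self.gpu_cache = {}                # expert_id -> GPU-resident copy
        self.gpu_slots = gpu_slots
        self.device = device

    def prefetch(self, predicted_ids):
        """Stage the experts a (hypothetical) predictor says the next
        batch will activate, evicting the oldest staged copy when full."""
        for eid in predicted_ids:
            if eid not in self.gpu_cache:
                if len(self.gpu_cache) >= self.gpu_slots:
                    self.gpu_cache.pop(next(iter(self.gpu_cache)))  # FIFO evict
                self.gpu_cache[eid] = copy.deepcopy(
                    self.cpu_experts[eid]).to(self.device)

    def forward(self, tokens, routed_ids):
        """Tokens whose expert was mispredicted fall back to the CPU copy,
        trading latency for correctness on cache misses."""
        outs = []
        for tok, eid in zip(tokens, routed_ids):
            expert = self.gpu_cache.get(eid, self.cpu_experts[eid])
            dev = next(expert.parameters()).device
            outs.append(expert(tok.to(dev)).to(self.device))
        return torch.stack(outs)
```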

Thu 16 May 9:40 - 10:00 PDT

ACROBAT: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time

Pratik Fegade · Tianqi Chen · Phillip Gibbons · Todd Mowry

Dynamic control flow is an important technique often used to design expressive and efficient deep learning computations for applications such as text parsing, machine translation, and early exit from deep models. However, the resulting control-flow divergence makes batching, an important performance optimization, difficult to perform manually. In this paper, we present ACRoBat, a framework that enables efficient automatic batching for dynamic deep learning computations by performing hybrid static+dynamic compiler optimizations and end-to-end tensor code generation. ACRoBat performs up to $8.5\times$ better than DyNet, a state-of-the-art framework for automatic batching, on an Nvidia GeForce RTX 3070 GPU.
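
As a toy illustration of what auto-batching buys on divergent control flow, the sketch below contrasts an eager RNN over variable-length sequences with a level-wise auto-batched version; it is a hypothetical NumPy simplification, not ACRoBat's compiler-based approach.

```python
import numpy as np

def rnn_unbatched(W_h, W_x, seqs):
    """Eager baseline: control flow diverges per sequence, so every token
    costs its own pair of small matmuls."""
    outs = []
    for seq in seqs:
        h = np.zeros(W_h.shape[0])
        for x in seq:
            h = np.tanh(h @ W_h + x @ W_x)
        outs.append(h)
    return outs

def rnn_autobatched(W_h, W_x, seqs):
    """Level-wise auto-batching: at step t, gather every sequence that
    still has a token and run all of their cell updates as one batched
    matmul, the way an auto-batching runtime merges identical ops."""
    H = np.zeros((len(seqs), W_h.shape[0]))
    for t in range(max(len(s) for s in seqs)):
        live = [i for i, s in enumerate(seqs) if t < len(s)]  # divergence survivors
        X = np.stack([seqs[i][t] for i in live])
        H[live] = np.tanh(H[live] @ W_h + X @ W_x)            # one batched kernel
    return [H[i] for i in range(len(seqs))]

# Sanity check: both versions agree on random variable-length inputs.
rng = np.random.default_rng(0)
W_h, W_x = rng.normal(size=(8, 8)), rng.normal(size=(5, 8))
seqs = [rng.normal(size=(L, 5)) for L in (3, 7, 1)]
assert all(np.allclose(a, b) for a, b in
           zip(rnn_unbatched(W_h, W_x, seqs), rnn_autobatched(W_h, W_x, seqs)))
```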