2nd On-Device Intelligence Workshop

Paul Whatmough, Vijay Janapa Reddi, Chuteng Zhou, Igor Federov, Matthew Mattina, Pete Warden, Ganesh Venkatesh, Vikas Chandra


Ubiquitous on-device artificial intelligence (AI) is the next step in transforming the myriad of mobile computing devices in our everyday lives into a new class of truly “smart” devices capable of constantly observing, learning and adapting to their environment. The 2nd on-device intelligence workshop aims to advance the state-of-the-art by bringing together researchers and practitioners to discuss the key problems, disseminate new research results, and provide practical tutorial material.



Fri 7:00 a.m. - 7:15 a.m.
Paul Whatmough, Arm Research (Opening Remarks)
Paul Whatmough
Fri 7:15 a.m. - 8:15 a.m.

Today’s AI is too big. Deep neural networks demand extraordinary levels of compute, and therefore power, for training and inference. This severely limits the practical deployment of AI in edge devices. We aim to improve the efficiency of deep learning. First, I’ll present MCUNet, which brings deep learning to IoT devices. MCUNet is a framework that jointly designs an efficient neural architecture (TinyNAS) and a lightweight inference engine (TinyEngine), enabling ImageNet-scale inference on IoT devices that have only 1MB of Flash. Next, I will talk about TinyTL, which enables on-device transfer learning and reduces the memory footprint by 7-13x. Finally, I will describe Differentiable Augmentation, which enables data-efficient GAN training and generates photo-realistic images using only 100 images, a task that used to require tens of thousands of images. We hope such TinyML techniques can make AI greener, faster, and more sustainable.

Biography Song Han is an assistant professor at MIT’s EECS. He received his PhD degree from Stanford University. His research focuses on efficient deep learning computing. He proposed the “deep compression” technique, which can reduce neural network size by an order of magnitude without losing accuracy, and its hardware implementation, the “efficient inference engine”, which first exploited pruning and weight sparsity in deep learning accelerators. His team’s work on hardware-aware neural architecture search, which brings deep learning to IoT devices, was highlighted by MIT News, Wired, Qualcomm News, VentureBeat, and IEEE Spectrum, integrated into PyTorch and AutoGluon, and received many low-power computer vision contest awards at flagship AI conferences (CVPR’19, ICCV’19 and NeurIPS’19). Song received Best Paper awards at ICLR’16 and FPGA’17, the Amazon Machine Learning Research Award, the SONY Faculty Award, the Facebook Faculty Award, and the NVIDIA Academic Partnership Award. Song was named one of the “35 Innovators Under 35” by MIT Technology Review for his contribution to the “deep compression” technique that “lets powerful artificial intelligence (AI) programs run more efficiently on low-power mobile devices.” Song received the NSF CAREER Award for “efficient algorithms and hardware for accelerated machine learning” and the IEEE “AI’s 10 to Watch: The Future of AI” award.

Fri 8:15 a.m. - 8:30 a.m.
Fri 8:30 a.m. - 9:15 a.m.

Apache TVM is a complete deep learning compilation framework -- it automatically generates fast binary code for any model, on any device, by exploring a large search space of potential optimizations. TVM itself uses machine learning to guide its code synthesis process, saving months of engineering time. The code generated by TVM can be many times faster than hand-optimized libraries -- in some cases exceeding a speedup of 30x over hand-tuned code.

In this talk, I will give an overview of Apache TVM and how we are using it at OctoML to enable model deployment on mobile and IoT devices. I’ll highlight our recent efforts on micro-TVM, TVM’s solution for deploying ML on microcontrollers.

Biography Dr. Thierry Moreau is the co-founder of OctoML Inc., a Seattle-based startup that applies state-of-the-art ML-based automation to put fast and efficient ML into production in the datacenter and on the edge. Thierry has been a key contributor to Apache TVM, the open source machine learning compiler that started at the University of Washington, where Thierry got his Ph.D. Today he works closely with top semiconductor companies to grow the range of hardware devices that TVM targets.

Thierry Moreau
Fri 9:15 a.m. - 9:30 a.m.

Compression of neural network models has become an important systems problem for practical machine learning workflows. While various compression mechanisms and algorithms have been proposed to address the issue, many solutions rely on highly specialized procedures and require substantial domain knowledge to use efficiently. To make compression accessible to a large body of users, we propose the LC toolkit, an extensible open-source library based on the ideas of the learning-compression (LC) algorithm. The software is written in Python using PyTorch and currently supports multiple forms of pruning, quantization, and low-rank compression that can be applied to the model’s parts individually or in combination to reduce the model’s size, computational requirements, or on-device inference time. The toolkit’s versatility comes from the separation of model learning from model compression in the LC algorithm: once the learning (L) step is given, any compression (C) step can be used for the model.
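The separation of the C step from the L step can be illustrated with a small sketch. It assumes, for illustration only, that a C step maps the current weights to their closest compressed form; the function names below are hypothetical and are not the LC toolkit's API. Two interchangeable C steps are shown: magnitude pruning and quantization to a fixed codebook.

```python
from typing import Callable, List

Weights = List[float]


def c_step_prune(weights: Weights, keep: int) -> Weights:
    """Magnitude pruning: keep the `keep` largest-magnitude weights, zero the rest."""
    # Indices sorted by decreasing magnitude; ties keep their original order.
    order = sorted(range(len(weights)), key=lambda i: -abs(weights[i]))
    kept = set(order[:keep])
    return [w if i in kept else 0.0 for i, w in enumerate(weights)]


def c_step_quantize(weights: Weights, codebook: Weights) -> Weights:
    """Fixed-codebook quantization: snap each weight to its nearest codebook entry."""
    return [min(codebook, key=lambda c: abs(w - c)) for w in weights]


def compress(weights: Weights, c_step: Callable[[Weights], Weights]) -> Weights:
    """Apply whichever C step is plugged in; the learning (L) step is not involved."""
    return c_step(weights)


# Hypothetical weights, standing in for the output of an L step.
w = [0.9, -0.05, 0.4, -1.2, 0.02]
pruned = compress(w, lambda v: c_step_prune(v, keep=2))
quantized = compress(w, lambda v: c_step_quantize(v, codebook=[-1.0, 0.0, 1.0]))
print(pruned)     # [0.9, 0.0, 0.0, -1.2, 0.0]
print(quantized)  # [1.0, 0.0, 0.0, -1.0, 0.0]
```

Because `compress` only sees a function from weights to weights, swapping one compression type for another does not touch the training code, which is the property the abstract attributes to the LC formulation.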

Fri 9:30 a.m. - 9:45 a.m.

Quantization is a popular technique for accelerating and compressing neural networks by utilizing low-bit arithmetic to represent weights and activations. It remains a hot area for research, with continued work on closing the accuracy gap between full- and low-precision models. We observe that researchers in this area tend to rely on custom implementations rather than the approaches built into popular machine learning libraries, because the built-in approaches are not sufficiently flexible to enable research. We are open sourcing TorchQuant, our MIT-licensed library that builds upon PyTorch by providing researchers with modular components and implementations that will accelerate their research and provide the community with consistent baselines. Using our library, we provide an example of how to quickly evaluate a research hypothesis: the “range-precision” trade-off for quantization-aware training. Our library can be found at this URL:
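One common reading of a range-precision trade-off can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not TorchQuant code: `fake_quantize` and `mean_abs_error` are hypothetical helpers, and the scheme is a simple symmetric uniform quantizer. A narrow clipping range gives a fine step size but saturates outliers, while a wide range avoids saturation at the cost of a coarser step.

```python
def fake_quantize(x: float, clip: float, bits: int) -> float:
    """Symmetric uniform quantization of x to `bits` bits over [-clip, clip]."""
    levels = 2 ** (bits - 1) - 1        # e.g. 7 positive levels for 4 bits
    step = clip / levels                # spacing between representable values
    clipped = max(-clip, min(clip, x))  # values outside the range saturate
    return round(clipped / step) * step


def mean_abs_error(values, clip: float, bits: int) -> float:
    """Average absolute difference between values and their quantized versions."""
    return sum(abs(v - fake_quantize(v, clip, bits)) for v in values) / len(values)


# Hypothetical activations: mostly small values plus two outliers.
values = [-3.0, -0.6, -0.2, 0.1, 0.3, 0.7, 2.5]
for clip in (0.5, 1.0, 3.0):
    print(clip, mean_abs_error(values, clip, bits=4))
```

Running the loop with different `clip` and `bits` values shows how the error shifts between clipping error (outliers) and rounding error (small values).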

Shyam Tailor
Fri 9:45 a.m. - 10:00 a.m.

In autonomous driving, 3D object detection is essential as it provides basic knowledge about the environment. However, as deep learning based 3D detection methods are usually computationally intensive, it is challenging to support real-time 3D detection on edge-computing devices with limited computation and memory resources. To facilitate this, we propose a compiler-aware pruning search framework to achieve real-time inference of 3D object detection on resource-limited mobile devices. Specifically, a generator is applied to sample better pruning proposals in the search space, and an evaluator is adopted to evaluate the performance of the sampled pruning proposals with Bayesian optimization. We demonstrate that the pruning search framework can achieve real-time 3D object detection on a mobile device (a Samsung Galaxy S20 phone) with state-of-the-art detection performance.

Fri 10:00 a.m. - 10:15 a.m.

With the increasing demand to efficiently deploy DNNs on mobile edge devices, it becomes much more important to reduce unnecessary computation and increase the execution speed. Prior methods towards this goal, including model compression and network architecture search (NAS), are largely performed independently and do not fully consider compiler-level optimization, which is a must-do for mobile acceleration. In this work, we propose NPS, a compiler-aware unified network pruning search, and the corresponding comprehensive compiler optimizations supporting different DNNs and different pruning schemes, which bridge the gap between weight pruning and NAS. Our framework achieves 6.7ms, 5.9ms, and 3.9ms ImageNet inference times with 77%, 75% (MobileNet-V3 level), and 71% (MobileNet-V2 level) Top-1 accuracy respectively on an off-the-shelf mobile phone, consistently outperforming prior work.

Fri 10:15 a.m. - 10:30 a.m.

Federated Learning (FL) allows edge devices to collaboratively learn a shared prediction model while keeping their training data on the device, thereby decoupling the ability to do machine learning from the need to store data in the cloud. Despite the algorithmic advancements in FL, the support for on-device training of FL algorithms on edge devices remains poor. We present one of the first explorations of on-device FL on various smartphones and embedded devices using the Flower framework. We also evaluate the system costs of on-device FL and discuss how this quantification could be used to design more efficient FL algorithms.
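The core aggregation idea behind FL can be sketched as a weighted average of client model weights, in the style of federated averaging. This is a minimal illustration in plain Python, not the Flower API; `federated_average` and the sample data are hypothetical, and the abstract does not state which aggregation algorithm the experiments use.

```python
from typing import List, Tuple


def federated_average(updates: List[Tuple[List[float], int]]) -> List[float]:
    """Average client weight vectors, weighted by each client's example count."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    return [sum(w[i] * n for w, n in updates) / total for i in range(dim)]


# Each entry: (locally trained weights, number of local training examples).
# Only these weights leave the device; the raw training data stays local.
client_updates = [
    ([1.0, 2.0], 10),
    ([3.0, 4.0], 30),
]
global_weights = federated_average(client_updates)
print(global_weights)  # [2.5, 3.5]
```

The client with 30 examples pulls the average three times as strongly as the client with 10, which is why the result sits closer to `[3.0, 4.0]`.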

Fri 10:30 a.m. - 10:45 a.m.

Laser-induced breakdown spectroscopy (LIBS) is a popular, fast elemental analysis technique used to determine the chemical composition of target samples, such as in industrial analysis of metals or in space exploration. Recently, there has been a rise in the use of machine learning (ML) techniques for LIBS data processing. However, ML for LIBS is challenging as: (i) the predictive models must be lightweight since they need to be deployed in highly resource-constrained and battery-operated portable LIBS systems; and (ii) since these systems can be remote, the models must be able to self-adapt to any domain shift in input distributions, which could be due to the lack of different types of inputs in training data or dynamic environmental/sensor noise. This on-device retraining of the model should not only be fast but also unsupervised due to the absence of new labeled data in remote LIBS systems. We introduce a lightweight multi-layer perceptron (MLP) model for LIBS that can be adapted on-device without requiring labels for new input data. It shows 89.3% average accuracy during data streaming, and up to 2.1% better accuracy compared to an MLP model that does not support adaptation. Finally, we also characterize the inference and retraining performance of our model on a Google Pixel 2 phone.

Kshitij Bhardwaj
Fri 10:45 a.m. - 11:00 a.m.

As the machine learning and systems community strives to achieve higher energy-efficiency through custom DNN accelerators and model compression techniques, there is a need for a design space exploration framework that incorporates quantization-aware processing elements into the accelerator design space while having accurate and fast power, performance, and area models. In this work, we present QAPPA, a highly parameterized quantization-aware power, performance, and area modeling framework for DNN accelerators. Our framework can facilitate future research on design space exploration of DNN accelerators for various design choices such as bit precision, processing element type, scratchpad sizes of processing elements, global buffer size, device bandwidth, number of total processing elements in the design, and DNN workloads. Our results show that different bit precisions and processing element types lead to significant differences in terms of performance per area and energy. Specifically, our proposed lightweight processing elements achieve up to 4.9x more performance per area and energy improvement when compared to an INT16-based implementation.

Fri 11:00 a.m. - 12:00 p.m.
Fri 12:00 p.m. - 12:45 p.m.

We propose a novel method for federated learning that is customized to the objective of a given edge device. In our proposed method, a server trains a global meta-model by collaborating with devices without actually sharing data. The trained global meta-model is then customized locally by each device to meet its specific objective. Different from the conventional federated learning setting, training customized models for each device is hindered by both the inherent data biases of the various devices and the requirements imposed by the federated architecture. We present an algorithm that locally de-biases model updates, while leveraging distributed data, so that each device can be effectively customized towards its objectives. Our method is fully agnostic to device heterogeneity and imbalanced data, scalable to a massive number of devices, and allows for arbitrary partial participation. Our method has built-in convergence guarantees, and on benchmark datasets we demonstrate that it outperforms other state-of-the-art methods.

Biography Venkatesh Saligrama is a faculty member in the Department of Electrical and Computer Engineering and the Department of Computer Science (by courtesy), and a founding member of the Faculty of Computing and Data Sciences at Boston University. He holds a PhD from MIT. His research interests are broadly in the area of Artificial Intelligence, and his recent work has focused on machine learning with resource constraints. He is an IEEE Fellow and a recipient of several awards, including Distinguished Lecturer for the IEEE Signal Processing Society, the Presidential Early Career Award (PECASE), the ONR Young Investigator Award, and the NSF CAREER Award. More information about his work is available at

Fri 12:45 p.m. - 1:15 p.m.

Deploying ML models on edge devices poses a big challenge, as capabilities and numeric behavior can differ on each device. We will discuss the development of the Tensor Operator Set Architecture (TOSA), a set of base operators that serve as the building blocks for complex operations. TOSA operators define the functional and numeric behavior, ensuring that deployed networks behave consistently across a variety of devices.

Biography Eric Kunze is a Senior Principal Engineer in the ML Technology group at Arm, leading a group investigating future ML solutions.

Fri 1:15 p.m. - 1:45 p.m.

Deep-learning based models have revolutionized many NLP tasks (e.g. Translation, Conversational AI, Language Modeling). There is a growing need to perform these tasks on low-resource electronic devices (e.g. mobile phones, tablets, wearables) for privacy and latency reasons. However, the large computational and memory demands of deep neural networks make it difficult to deploy them on-device as-is. They usually require significant optimizations and sometimes major model architecture changes to fit under tight memory and compute budgets.

In this talk we will share the work that Facebook is doing to bring these NLP models to user devices. We will talk about efficient building blocks and model architectures that find the right balance between model quality and compute/memory requirements on multiple NLP tasks. Finally, we will outline the biggest challenges and open problems in shipping on-device NLP models at Facebook scale.

Biography Ahmed Aly is an Engineering Manager on the AI Assistant team in Facebook Reality Labs. He leads the Language Understanding team, building efficient intent understanding and semantic parsing models that power Facebook’s Conversational AI systems. Prior to this, he was the founder and tech lead of the PyText platform. Ahmed has a Master’s degree in Computational Linguistics from the University of Washington and a B.E. in Computer Engineering from Cairo University.

Kshitiz Malik is a Software Engineer on the AI Assistant team in Facebook Reality Labs. He works on Privacy Preserving Machine Learning, Natural Language Understanding and Natural Language Generation. Kshitiz has a PhD in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign, and a B.E. in Computer Engineering from the University of Delhi.

Fri 1:45 p.m. - 2:00 p.m.

This talk will review the challenges associated with designing models that can run on memory- and compute-constrained devices. We will then summarize some of the model design techniques that are particularly useful for TinyML applications, including pruning, quantization, and black-box / gradient-based neural architecture search.

Fri 2:00 p.m. - 2:15 p.m.

How to deploy neural network models to MCUs using TensorFlow Lite for Microcontrollers and profile their latency and memory consumption.

Fri 2:15 p.m. - 2:30 p.m.
Vijay Reddi, Harvard (Closing Remarks)