Ubiquitous on-device artificial intelligence (AI) is the next step in transforming the myriad mobile computing devices in our everyday lives into a new class of truly “smart” devices capable of constantly observing, learning, and adapting to their environment. The 2nd On-Device Intelligence Workshop aims to advance the state of the art by bringing together researchers and practitioners to discuss the key problems, disseminate new research results, and provide practical tutorial material.
Fri 7:00 a.m. - 7:15 a.m.
Opening Remarks: Paul Whatmough (Arm Research)
Fri 7:15 a.m. - 8:15 a.m.
Keynote 1: Putting AI on a Diet: TinyML and Efficient Deep Learning (Song Han, MIT)
Today’s AI is too big. Deep neural networks demand extraordinary levels of compute, and therefore power, for training and inference. This severely limits the practical deployment of AI on edge devices. We aim to improve the efficiency of deep learning. First, I’ll present MCUNet, a framework that brings deep learning to IoT devices by jointly designing an efficient neural architecture (TinyNAS) and a lightweight inference engine (TinyEngine), enabling ImageNet-scale inference on IoT devices that have only 1MB of Flash. Next, I will talk about TinyTL, which enables on-device transfer learning, reducing the memory footprint by 7-13x. Finally, I will describe Differentiable Augmentation, which enables data-efficient GAN training, generating photo-realistic images using only 100 images, a task that used to require tens of thousands of images. We hope such TinyML techniques can make AI greener, faster, and more sustainable.
Biography: Song Han is an assistant professor in MIT’s EECS department. He received his PhD from Stanford University. His research focuses on efficient deep learning computing. He proposed the “deep compression” technique, which can reduce neural network size by an order of magnitude without losing accuracy, and the “efficient inference engine” hardware implementation, which first exploited pruning and weight sparsity in deep learning accelerators. His team’s work on hardware-aware neural architecture search that brings deep learning to IoT devices was highlighted by MIT News, Wired, Qualcomm News, VentureBeat, and IEEE Spectrum, integrated into PyTorch and AutoGluon, and received many low-power computer vision contest awards at flagship AI conferences (CVPR’19, ICCV’19, and NeurIPS’19). Song received Best Paper awards at ICLR’16 and FPGA’17, an Amazon Machine Learning Research Award, a SONY Faculty Award, a Facebook Faculty Award, and an NVIDIA Academic Partnership Award. Song was named to “35 Innovators Under 35” by MIT Technology Review for his contribution to the “deep compression” technique that “lets powerful artificial intelligence (AI) programs run more efficiently on low-power mobile devices.” Song received the NSF CAREER Award for “efficient algorithms and hardware for accelerated machine learning” and the IEEE “AI’s 10 to Watch: The Future of AI” award.
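To make the pruning ingredient of “deep compression” concrete, here is a minimal sketch of global magnitude pruning in PyTorch. It is an illustration only, not the speaker’s implementation (which combines pruning with quantization, Huffman coding, and iterative fine-tuning); the model below is a toy placeholder.

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, sparsity: float = 0.9):
    """Zero the smallest-magnitude weights globally; fine-tuning would
    normally follow to recover accuracy."""
    all_weights = torch.cat([p.detach().abs().flatten()
                             for p in model.parameters() if p.dim() > 1])
    threshold = torch.quantile(all_weights, sparsity)
    for p in model.parameters():
        if p.dim() > 1:  # prune conv/linear weights, leave biases/norms alone
            p.data.mul_((p.detach().abs() > threshold).float())

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
magnitude_prune(model, sparsity=0.9)  # ~90% of weights become exactly zero
```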
Fri 8:15 a.m. - 8:30 a.m.
MORNING BREAK
Fri 8:30 a.m. - 9:15 a.m.
Invited 1: Efficient ML on the Edge with Apache TVM (Thierry Moreau, OctoML)
Apache TVM is a complete deep learning compilation framework -- it automatically generates fast binary code for any model, on any device, by exploring a large search space of potential optimizations. TVM itself uses machine learning to guide its code synthesis process, saving months of engineering time. The code generated by TVM can be many times faster than hand-optimized libraries -- in some cases exceeding a speedup of 30x over hand-tuned code. In this talk, I will give an overview of Apache TVM and how we are using it at OctoML to enable model deployment on mobile and IoT devices. I’ll highlight our recent efforts on microTVM, TVM’s solution for deploying ML on microcontrollers.
Biography: Dr. Thierry Moreau is the co-founder of OctoML Inc., a Seattle-based startup that applies state-of-the-art ML-based automation to put fast and efficient ML into production in the datacenter and on the edge. Thierry has been a key contributor to Apache TVM, the open-source machine learning compiler that started at the University of Washington, where Thierry earned his Ph.D. Today he works closely with top semiconductor companies to grow the range of hardware devices that TVM targets.
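For a sense of what using TVM looks like, here is a hedged sketch of compiling an ONNX model for a 64-bit Arm device with the Relay API (circa TVM 0.8; module paths and flags vary across releases, and the file names, input name, and shape below are placeholders).

```python
import onnx
import tvm
from tvm import relay

onnx_model = onnx.load("model.onnx")  # any exported model (placeholder path)
# The dict key must match the model's actual input name.
mod, params = relay.frontend.from_onnx(onnx_model,
                                       shape={"input": (1, 3, 224, 224)})

# Cross-compile for a 64-bit Arm mobile/IoT target.
target = "llvm -mtriple=aarch64-linux-gnu"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Deployable shared library (pass cc=<cross-compiler> when building off-device).
lib.export_library("model.so")
```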
Fri 9:15 a.m. - 9:30 a.m.
Contributed 1: A Flexible, Extensible Software Framework for Model Compression Based on the LC Algorithm (Yerlan Idelbayev, University of California, Merced)
Compression of neural network models has become an important systems problem for practical machine learning workflows. While various compression mechanisms and algorithms have been proposed to address the issue, many solutions rely on highly specialized procedures and require substantial domain knowledge to use efficiently. To make compression accessible to a large body of users, we propose an extensible open-source library based on the ideas of the learning-compression (LC) algorithm: the LC toolkit. The software is written in Python using PyTorch and currently supports multiple forms of pruning, quantization, and low-rank compression that can be applied to the model’s parts individually or in combination to reduce the model’s size, computational requirements, or on-device inference time. The toolkit’s versatility comes from the separation of model learning from model compression in the LC algorithm: once the learning (L) step is given, any compression (C) step can be used for the model.
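The L/C separation is easy to see in code. Below is a minimal sketch of the penalty-form LC alternation with a low-rank C step; the toolkit itself wraps these steps in modular, composable components, and all names here are illustrative rather than the library’s API.

```python
import torch

def c_step_low_rank(W, rank):
    """C step: project a weight matrix onto the set of rank-r matrices
    (the closest rank-r matrix, via truncated SVD)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vh[:rank]

def l_step(model, loss_fn, batches, targets, mu, lr=1e-3):
    """L step: ordinary training plus a quadratic pull toward the current
    compressed weights (penalty form of the LC algorithm)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for x, y in batches:
        penalty = sum((p - t).pow(2).sum()
                      for p, t in zip(model.parameters(), targets))
        loss = loss_fn(model(x), y) + 0.5 * mu * penalty
        opt.zero_grad(); loss.backward(); opt.step()

# The LC loop alternates: recompute targets with the C step, retrain with
# the L step, and increase mu so weights and compressed weights converge.
```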
Fri 9:30 a.m. - 9:45 a.m.
Contributed 2: TorchQuant: A Hackable Quantization Library For Researchers, By Researchers (Shyam A Tailor, University of Cambridge)
Quantization is a popular technique for accelerating and compressing neural networks by utilizing low-bit arithmetic to represent weights and activations. It remains a hot area for research, with continued work on closing the accuracy gap between full- and low-precision models. We observe that researchers in this area tend to rely on custom implementations rather than the approaches built into popular machine learning libraries, as the latter are not sufficiently flexible to enable research. We are open-sourcing TorchQuant, our MIT-licensed library that builds upon PyTorch by providing researchers with modular components and implementations that will accelerate their research, and provide the community with consistent baselines. Using our library, we provide an example of how to quickly evaluate a research hypothesis: the “range-precision” trade-off for quantization-aware training. Our library can be found at https://github.com/camlsys/torchquant.
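To make the “range-precision” trade-off concrete, here is a generic fake-quantization module with a learnable clipping range and a straight-through estimator. This is a sketch of the standard QAT building block, not TorchQuant’s actual interface (see the repository for that): widening the range alpha covers more outliers but coarsens the quantization step.

```python
import torch
import torch.nn as nn

class FakeQuant(nn.Module):
    """Simulated uniform quantization for quantization-aware training."""
    def __init__(self, bits: int = 8, alpha: float = 1.0):
        super().__init__()
        self.levels = 2 ** bits - 1
        self.alpha = nn.Parameter(torch.tensor(float(alpha)))  # learnable range

    def forward(self, x):
        x = torch.min(torch.max(x, -self.alpha), self.alpha)  # clip to range
        scale = (2 * self.alpha) / self.levels                 # step size
        q = torch.round(x / scale) * scale                     # quantize
        # Straight-through estimator: quantized forward, identity backward.
        return x + (q - x).detach()

y = FakeQuant(bits=4)(torch.randn(8))  # 4-bit simulated quantization
```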
Fri 9:45 a.m. - 10:00 a.m.
Contributed 3: Towards Real-Time 3D Object Detection with Pruning Search on Edge Devices (Pu Zhao, Northeastern University)
In autonomous driving, 3D object detection is essential as it provides basic knowledge about the environment. However, as deep-learning-based 3D detection methods are usually computation-intensive, it is challenging to support real-time 3D detection on edge-computing devices with limited computation and memory resources. To address this, we propose a compiler-aware pruning search framework to achieve real-time inference for 3D object detection on resource-limited mobile devices. Specifically, a generator is applied to sample better pruning proposals in the search space, and an evaluator is adopted to evaluate the performance of the sampled pruning proposals with Bayesian optimization. We demonstrate that the pruning search framework can achieve real-time 3D object detection on a mobile device (a Samsung Galaxy S20 phone) with state-of-the-art detection performance.
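A minimal sketch of such a propose-and-evaluate loop, using scikit-optimize’s Gaussian-process optimizer as a generic stand-in for the paper’s generator/evaluator. The proxy objective below is invented purely for illustration; the real evaluator prunes the detector, fine-tunes it, and measures compiler-optimized latency on the phone.

```python
import math
from skopt import gp_minimize  # generic Bayesian optimization loop

NUM_LAYERS = 10  # the search space: one pruning ratio per layer

def evaluate_proposal(ratios):
    # Invented proxies for illustration: pretend accuracy degrades with
    # pruning while latency improves, and target a ~33 ms (30 FPS) budget.
    proxy_acc = 0.77 - 0.05 * sum(r ** 2 for r in ratios) / NUM_LAYERS
    proxy_latency_ms = 60.0 * math.prod(1.0 - 0.5 * r for r in ratios)
    return -(proxy_acc - 0.01 * max(0.0, proxy_latency_ms - 33.0))

result = gp_minimize(evaluate_proposal, [(0.0, 0.9)] * NUM_LAYERS,
                     n_calls=40, random_state=0)
print("best per-layer pruning ratios:", [round(r, 2) for r in result.x])
```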
Fri 10:00 a.m. - 10:15 a.m.
Contributed 4: A Compiler-aware Framework of Network Pruning Search Achieving Beyond Real-Time Mobile Acceleration (Yanyu Li, Northeastern University)
With the increasing demand to efficiently deploy DNNs on mobile edge devices, it becomes much more important to reduce unnecessary computation and increase execution speed. Prior methods towards this goal, including model compression and network architecture search (NAS), are largely performed independently and do not fully consider compiler-level optimization, which is a must for mobile acceleration. In this work, we propose NPS, a compiler-aware unified network pruning search, together with comprehensive compiler optimizations supporting different DNNs and different pruning schemes, which bridges the gap between weight pruning and NAS. Our framework achieves 6.7ms, 5.9ms, and 3.9ms ImageNet inference times with 77%, 75% (MobileNet-V3 level), and 71% (MobileNet-V2 level) Top-1 accuracy, respectively, on an off-the-shelf mobile phone, consistently outperforming prior work.
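One reason pruning schemes and compiler optimization must be considered together: regular sparsity patterns are what code generation can exploit. Below is a hedged sketch of block pruning, one such compiler-friendly scheme (illustrative of the idea, not the paper’s exact schemes).

```python
import torch

def block_prune_mask(W, block=(4, 1), sparsity=0.5):
    """Zero whole (bh x bw) blocks of a weight matrix instead of single
    entries; the regular pattern lets compiled kernels skip work."""
    rows, cols = W.shape
    bh, bw = block
    # Score each block by the total magnitude of its weights.
    scores = W.abs().reshape(rows // bh, bh, cols // bw, bw).sum(dim=(1, 3))
    k = int(scores.numel() * sparsity)
    threshold = scores.flatten().kthvalue(k).values
    mask = (scores > threshold).float()
    return mask.repeat_interleave(bh, 0).repeat_interleave(bw, 1)

W = torch.randn(64, 64)
W_pruned = W * block_prune_mask(W, block=(4, 1), sparsity=0.5)
```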
Fri 10:15 a.m. - 10:30 a.m.
Contributed 5: On-device federated learning with Flower (Akhil Mathur, Nokia Bell Labs)
Federated Learning (FL) allows edge devices to collaboratively learn a shared prediction model while keeping their training data on the device, thereby decoupling the ability to do machine learning from the need to store data in the cloud. Despite the algorithmic advancements in FL, the support for on-device training of FL algorithms on edge devices remains poor. We present one of the first explorations of on-device FL on various smartphones and embedded devices using the Flower framework. We also evaluate the system costs of on-device FL and discuss how this quantification could be used to design more efficient FL algorithms.
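A toy Flower client sketch to show the shape of the on-device side. The flwr API has evolved across releases, so method signatures here follow the circa-2021 NumPyClient interface; a real client would wrap an actual on-device model and dataset rather than the toy weight vector below, and the server address is a placeholder.

```python
import flwr as fl
import numpy as np

class ToyClient(fl.client.NumPyClient):
    """Holds one weight vector; 'training' nudges it toward local data."""
    def __init__(self):
        self.w = np.zeros(10, dtype=np.float32)
        self.local_data = np.random.randn(100, 10).astype(np.float32)

    def get_parameters(self):
        return [self.w]

    def fit(self, parameters, config):
        self.w = parameters[0]
        self.w = self.w + 0.1 * (self.local_data.mean(axis=0) - self.w)
        return [self.w], len(self.local_data), {}

    def evaluate(self, parameters, config):
        loss = float(((parameters[0] - self.local_data.mean(axis=0)) ** 2).mean())
        return loss, len(self.local_data), {}

# Connect to a running Flower server (address is a placeholder).
fl.client.start_numpy_client(server_address="SERVER_IP:8080",
                             client=ToyClient())
```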
Fri 10:30 a.m. - 10:45 a.m.
Contributed 6: Semi-supervised on-device neural network adaptation for remote and portable laser-induced breakdown spectroscopy (Kshitij Bhardwaj, LLNL)
Laser-induced breakdown spectroscopy (LIBS) is a popular, fast elemental analysis technique used to determine the chemical composition of target samples, such as in industrial analysis of metals or in space exploration. Recently, there has been a rise in the use of machine learning (ML) techniques for LIBS data processing. However, ML for LIBS is challenging because: (i) the predictive models must be lightweight, since they need to be deployed in highly resource-constrained and battery-operated portable LIBS systems; and (ii) since these systems can be remote, the models must be able to self-adapt to any domain shift in input distributions, which could be due to the lack of certain types of inputs in the training data or to dynamic environmental/sensor noise. This on-device retraining of the model should not only be fast but also unsupervised, due to the absence of new labeled data in remote LIBS systems. We introduce a lightweight multi-layer perceptron (MLP) model for LIBS that can be adapted on-device without requiring labels for new input data. It shows 89.3% average accuracy during data streaming, and up to 2.1% better accuracy compared to an MLP model that does not support adaptation. Finally, we also characterize the inference and retraining performance of our model on a Google Pixel 2 phone.
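The abstract does not spell out the adaptation scheme, so the sketch below shows one common label-free recipe, test-time entropy minimization, purely as a stand-in; the paper’s semi-supervised method may differ, and the tiny MLP here is a placeholder.

```python
import torch
import torch.nn.functional as F

def adapt_unsupervised(model, unlabeled_batch, steps=1, lr=1e-4):
    """Adapt to a domain shift without labels by making the model more
    confident on incoming data (prediction-entropy minimization)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        probs = F.softmax(model(unlabeled_batch), dim=1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()
        opt.zero_grad(); entropy.backward(); opt.step()

mlp = torch.nn.Sequential(torch.nn.Linear(256, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 8))
adapt_unsupervised(mlp, torch.randn(32, 256))  # a batch of unlabeled spectra
```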
Fri 10:45 a.m. - 11:00 a.m.
Contributed 7: QAPPA: Quantization-Aware Power, Performance, and Area Modeling of DNN Accelerators (Ahmet Inci, CMU)
As the machine learning and systems community strives to achieve higher energy-efficiency through custom DNN accelerators …
Fri 11:00 a.m. - 12:00 p.m.
LUNCH BREAK
Fri 12:00 p.m. - 12:45 p.m.
Keynote 2: Customizing Federated Learning to the Edge Device (Venkatesh Saligrama, Boston University)
We propose a novel method for federated learning that is customized to the objective of a given edge device. In our proposed method, a server trains a global meta-model by collaborating with devices without actually sharing data. The trained global meta-model is then customized locally by each device to meet its specific objective. Different from the conventional federated learning setting, training customized models for each device is hindered both by the inherent data biases of the various devices and by the requirements imposed by the federated architecture. We present an algorithm that locally de-biases model updates while leveraging distributed data, so that each device can be effectively customized towards its objectives. Our method is fully agnostic to device heterogeneity and imbalanced data, scalable to a massive number of devices, and allows for arbitrary partial participation. Our method has built-in convergence guarantees, and on benchmark datasets we demonstrate that it outperforms other state-of-the-art methods.
Biography: Venkatesh Saligrama is a faculty member in the Department of Electrical and Computer Engineering, the Department of Computer Science (by courtesy), and a founding member of the Faculty of Computing and Data Sciences at Boston University. He holds a PhD from MIT. His research interests are broadly in the area of Artificial Intelligence, and his recent work has focused on machine learning with resource constraints. He is an IEEE Fellow and a recipient of several awards, including being named a Distinguished Lecturer of the IEEE Signal Processing Society, the Presidential Early Career Award (PECASE), the ONR Young Investigator Award, and the NSF CAREER Award. More information about his work is available at http://sites.bu.edu/data
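A generic sketch of the local customization phase: copy the server’s global meta-model and take a few gradient steps on the device’s own data. This is illustrative only; the talk’s local de-biasing of model updates is exactly what this naive version omits.

```python
import copy
import torch

def customize(global_model, local_batches, loss_fn, lr=1e-3):
    """Device-side customization: a few SGD steps from the global
    meta-model on the device's own (possibly biased) data."""
    local_model = copy.deepcopy(global_model)  # global weights stay intact
    opt = torch.optim.SGD(local_model.parameters(), lr=lr)
    for x, y in local_batches:
        loss = loss_fn(local_model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return local_model  # personalized model for this device's objective
```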
Fri 12:45 p.m. - 1:15 p.m.
Invited 2: Enabling standardized behavior for ML operations with TOSA (Eric Kunze, Arm)
Deploying ML models on edge devices poses a significant challenge, as capabilities and numeric behavior can differ on each device. We will discuss the development of the Tensor Operator Set Architecture (TOSA), a set of base operators that serve as the building blocks for complex operations. TOSA operators define functional and numeric behavior, ensuring that deployed networks behave consistently across a variety of devices.
Biography: Eric Kunze is a Senior Principal Engineer in the ML Technology group at Arm, leading a group investigating future ML solutions.
Fri 1:15 p.m. - 1:45 p.m.
Invited 3: On-Device NLP at Facebook (Ahmed Aly and Kshitiz Malik, Facebook)
Deep-learning-based models have revolutionized many NLP tasks (e.g., translation, conversational AI, language modeling). There is a growing need to perform these tasks on low-resource electronic devices (e.g., mobile phones, tablets, wearables) for privacy and latency reasons. However, the large computational and memory demands of deep neural networks make it difficult to deploy them on-device as-is. They usually require significant optimizations, and sometimes major model architecture changes, to fit under tight memory and compute budgets. In this talk we will share the work that Facebook is doing to bring these NLP models to user devices. We will talk about efficient building blocks and model architectures that find the right balance between model quality and compute/memory requirements on multiple NLP tasks. Finally, we will outline the biggest challenges and open problems in shipping on-device NLP models at Facebook scale.
Biography: Ahmed Aly is an Engineering Manager on the AI Assistant team in Facebook Reality Labs. He leads the language understanding team, building efficient intent understanding and semantic parsing models that power Facebook’s conversational AI systems. Prior to this, he was the founder and tech lead of the PyText platform. Ahmed has a Master’s degree in Computational Linguistics from the University of Washington and a B.E. in Computer Engineering from Cairo University. Kshitiz Malik is a Software Engineer on the AI Assistant team in Facebook Reality Labs. He works on privacy-preserving machine learning, natural language understanding, and natural language generation. Kshitiz has a PhD in Electrical and Computer Engineering from the University of Illinois at Urbana-Champaign and a B.E. in Computer Engineering from the University of Delhi.
Fri 1:45 p.m. - 2:00 p.m.
Tutorial 1: TinyML Model Design (Igor Fedorov, Arm Research)
This talk will review the challenges associated with designing models that can run on memory- and compute-constrained devices. We will then summarize some of the model design techniques that are particularly useful for TinyML applications, including pruning, quantization, and black-box / gradient-based neural architecture search.
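As a worked example of the arithmetic behind one of these techniques, here is a hedged NumPy sketch of affine (asymmetric) quantization, x ≈ scale * (q - zero_point), the mapping used when lowering weights and activations to 8-bit integers:

```python
import numpy as np

def quantize(x, qmin=0, qmax=255):
    """Map float values to 8-bit codes: x ~ scale * (q - zero_point)."""
    scale = float(x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - float(x.min()) / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return q.astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

w = np.random.randn(1000).astype(np.float32)
q, s, zp = quantize(w)
# Round-trip error is bounded by half a quantization step (scale / 2).
print("max abs error:", np.abs(dequantize(q, s, zp) - w).max())
```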
Fri 2:00 p.m. - 2:15 p.m.
Tutorial 2: Deploying and Profiling TinyML Models (Colby Banbury, Harvard)
This tutorial covers how to deploy neural network models to MCUs using TensorFlow Lite for Microcontrollers and how to profile their latency and memory consumption.
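Below is a host-side sketch of the same invoke flow using the TFLite Python interpreter, handy for a first latency estimate before moving on-device; actual MCU deployment uses the C++ TensorFlow Lite for Microcontrollers runtime with a fixed tensor arena, and "model.tflite" is a placeholder path.

```python
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()  # on MCUs this is a fixed-size arena instead

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
x = np.zeros(inp["shape"], dtype=inp["dtype"])  # dummy input

interpreter.set_tensor(inp["index"], x)
start = time.perf_counter()
for _ in range(100):
    interpreter.invoke()
latency_ms = (time.perf_counter() - start) / 100 * 1e3
print(f"mean latency: {latency_ms:.2f} ms")
print("output shape:", interpreter.get_tensor(out["index"]).shape)
```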
Fri 2:15 p.m. - 2:30 p.m.
Closing Remarks: Vijay Reddi (Harvard)
Author Information
Paul Whatmough (Arm Research)
Vijay Janapa Reddi (Harvard University)
Chuteng Zhou (Arm Research)
Igor Fedorov (Arm Research)
Matthew Mattina (Arm ML Research Lab)
Pete Warden (Google)
Ganesh Venkatesh (Facebook)
Vikas Chandra (Facebook)
I lead the on-device AI research team in the AR/VR group focusing on vision and speech/NLU applications. My research interests include all aspects of HW/SW co-design for efficient on-device AI. I received a PhD in Computer Engineering from Carnegie Mellon University in 2004. I have held the positions of Visiting Scholar (2011 – 2014) and Visiting Faculty (2016 – 2017) at Stanford University. I have authored 70+ research publications and am an inventor on 40+ US and international patents. I received the ACM-SIGDA Technical Leadership Award in 2009 and was invited to the 2017 National Academy of Engineering’s Frontiers of Engineering Symposium. I am a senior member of IEEE.