| Introduction | Background | PoET-BiN | Experimental setup and results | Conclusion |
|--------------|------------|----------|--------------------------------|------------|
|              |            |          |                                |            |
|              |            |          |                                |            |

# PoET-BiN: Power Efficient Tiny Binary Neurons

Sivakumar Chidambaram<sup>1</sup>, J.M. Pierre Langlois<sup>2</sup>, Jean Pierre David<sup>1</sup>

<sup>1</sup> Department of Electrical Engineering, <sup>2</sup> Department of Computer and Software Engineering

Polytechnique Montréal, Montréal, Canada

03 March 2020


# Contents

- 1 Introduction
- 2 Background
- 3 PoET-BiN
- 4 Experimental setup and results
- 5 Conclusion


# Real-time deep learning use cases



Autonomous Driving

Source : www.alten.com/sector/automotive/next-generation-camera-based-adas-development/

Translation

Source : www.firebase.google.com/docs/ml-kit/translation

CCTV Monitoring

Source : www.munhwa.com/news/view.html?no=2019100101

**Required Attributes :** 

- Accuracy
- Latency and Throughput
- Power and Energy constraints
- Memory and Hardware costs


# Computation needs

Figure : Exponential Growth in the Training of Artificial Intelligence Programs (vertical axis : PetaFLOP/s-Day (Training))

Source : https://openai.com/blog/ai-and-compute/

Note : A petaFLOPS is a unit of computing speed equal to one quadrillion FLOPS (floating-point operations per second), a measure of computer performance.

#### Exponential rise in computations

| PoET-BiN | MLSys | 03 March 2020 |
|----------|-------|---------------|


# Current Deep Learning Software Acceleration Techniques

#### Quantized Neural Networks

- Quantization of weights and activations
- Binarization, ternarization and multi-bit quantization
- Helps generalization on unseen data
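
As a rough illustration of the three quantization styles listed above, here is a minimal sketch in plain Python. The function names, the ternarization threshold and the clipping range `w_max` are assumptions made for this example; they are not taken from the presentation.

```python
def binarize(weights):
    """Map each real-valued weight to +1 or -1 (sign function, with 0 mapped to +1)."""
    return [1 if w >= 0 else -1 for w in weights]


def ternarize(weights, threshold):
    """Map each weight to -1, 0 or +1; weights with |w| <= threshold become 0."""
    return [0 if abs(w) <= threshold else (1 if w > 0 else -1) for w in weights]


def quantize_uniform(weights, bits, w_max):
    """Uniform signed quantization of weights clipped to [-w_max, w_max] (bits >= 2)."""
    if bits < 2:
        raise ValueError("use binarize() for 1-bit quantization")
    levels = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits
    out = []
    for w in weights:
        clipped = max(-w_max, min(w_max, w))
        out.append(round(clipped / w_max * levels) * w_max / levels)
    return out


weights = [0.73, -0.05, -0.41, 0.02]
print(binarize(weights))        # [1, -1, -1, 1]
print(ternarize(weights, 0.1))  # [1, 0, -1, 0]
print(quantize_uniform(weights, 8, 1.0))
```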

#### Pruning - Remove certain neurons from the vanilla neural network

- A bagging technique that averages various randomly pruned networks
- Introduces noise into the system, which helps the network perform better on unseen data

#### Sparsification - Sparse matrix multiplication

- Removes connections between neurons
- Reduces the number of multiplications and additions
- Reduces the number of memory reads
- Implemented on hardware devices such as FPGAs, microprocessors and microcontrollers


# Hardware : FPGAs


- Go-to device for rapid prototyping of accelerators
- FPGAs consist of Adaptive Logic Modules (ALMs), programmable interconnects, I/Os and BRAMs
- ALMs are the main computational units
- Hundreds of thousands of ALMs in a typical FPGA
- Each ALM has a LookUp Table (LUT) with up to 8 inputs and up to 2 outputs
- Programmed using Hardware Description Languages (HDLs)
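
To make the role of a LUT concrete, the following sketch models a k-input, single-output LUT in software as a table of 2^k bits addressed by the input bits. The helper names and the majority function are illustrative assumptions, not part of the presentation.

```python
def make_lut(func, k):
    """Precompute the truth table of a k-input Boolean function.

    Entry i holds func applied to the bits of i (bit 0 of i is input 0).
    """
    table = []
    for i in range(2 ** k):
        bits = [(i >> j) & 1 for j in range(k)]
        table.append(func(bits) & 1)
    return table


def eval_lut(table, bits):
    """Evaluate the LUT: the input bits form the address into the table."""
    address = sum(b << j for j, b in enumerate(bits))
    return table[address]


# Hypothetical example: a 6-input majority function stored in one 64-entry LUT.
majority6 = make_lut(lambda bits: int(sum(bits) > 3), 6)
print(len(majority6))                           # 64
print(eval_lut(majority6, [1, 1, 1, 1, 0, 0]))  # 1
print(eval_lut(majority6, [1, 1, 1, 0, 0, 0]))  # 0
```

Any Boolean function of k inputs costs the same single lookup, which is why the rest of the deck tries to express classifiers as functions of a small, fixed number of binary inputs.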



Source : https://hackaday.com/2018/03/01/another-introduction-to-fpgas/


# Problem Definition and Objectives

#### Problem Definition :

- Vanilla neural networks are computation-, power- and area-intensive
- Current acceleration approaches are still computationally intensive
- Quantized neural networks and pruning are not optimized for FPGAs

#### Objectives/Contributions :

- A modified Decision Tree training algorithm to better match LUTs with a fixed number of inputs.
- The Reduced Input Neural Circuit (RINC) : a LUT-based architecture founded on modified Decision Trees and the hierarchical version of the well-known Adaboost algorithm to efficiently implement a network of binary neurons.
- A sparsely connected output layer for multiclass classification.
- The PoET-BiN architecture consisting of multiple RINC modules and a sparsely connected output layer.
- Automatic VHDL code generation of the PoET-BiN architecture for FPGA implementation.


# Binary Decision Trees



#### What are Binary DTs :

- Inputs (X) and outputs (Y) are binary
- The Decision Tree is created node by node, from the root to the leaves
- At each node, the feature that most reduces the entropy is chosen
- The tree divides the representation space to classify the data
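
The entropy-based choice of a split feature can be sketched as follows for binary inputs and binary labels. This is a generic illustration of the standard criterion, with made-up toy data; the function names are assumptions for this example.

```python
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def split_entropy(X, y, feature):
    """Weighted entropy of y after splitting the samples on one binary feature."""
    total = 0.0
    for value in (0, 1):
        subset = [label for row, label in zip(X, y) if row[feature] == value]
        total += len(subset) / len(y) * entropy(subset)
    return total


def best_feature(X, y):
    """Index of the feature whose split leaves the lowest entropy."""
    return min(range(len(X[0])), key=lambda f: split_entropy(X, y, f))


# Toy data (made up): the label is a copy of feature 1.
X = [[0, 0, 1], [1, 0, 0], [0, 1, 1], [1, 1, 0], [1, 1, 1], [0, 0, 0]]
y = [0, 0, 1, 1, 1, 0]
print(best_feature(X, y))  # 1
```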

#### Challenges :

- Classifying large datasets requires larger Decision Trees
- Larger trees result in large, complex hardware implementations with high power consumption
- To implement them effectively on FPGAs, we need small Decision Trees with at most 6 inputs, so that each fits in one LUT


# RINC-0 : Modified Decision Tree Algorithm

- Modified DT algorithm : level-based entropy reduction rather than node-based
- Decision Trees are restricted by the number of inputs (I)
- A node-wise, off-the-shelf 6-input Decision Tree would have only 7 leaf nodes
- A level-wise Decision Tree has 2<sup>6</sup> = 64 leaf nodes
- More granularity
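
One plausible reading of the level-wise idea is sketched below: a single feature is chosen per level and shared by every node of that level, so a tree with I inputs has 2^I leaves, which is exactly one I-input LUT truth table. A node-wise tree with I decision nodes has only I + 1 leaves. The greedy selection and the majority vote per leaf are assumptions of this sketch, not a transcription of the authors' training algorithm.

```python
from collections import defaultdict
from math import log2


def weighted_entropy(X, y, features):
    """Entropy of y that remains after partitioning the samples by `features`."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[tuple(row[f] for f in features)].append(label)
    total = 0.0
    for labels in groups.values():
        p = sum(labels) / len(labels)
        if 0.0 < p < 1.0:
            h = -p * log2(p) - (1 - p) * log2(1 - p)
            total += len(labels) / len(y) * h
    return total


def train_levelwise_tree(X, y, num_inputs):
    """Greedily pick one feature per level, shared by all nodes of that level."""
    chosen = []
    for _ in range(num_inputs):
        candidates = [f for f in range(len(X[0])) if f not in chosen]
        chosen.append(min(candidates,
                          key=lambda f: weighted_entropy(X, y, chosen + [f])))
    # One leaf per input combination: 2**num_inputs entries, i.e. a LUT.
    votes = defaultdict(lambda: [0, 0])
    for row, label in zip(X, y):
        votes[tuple(row[f] for f in chosen)][label] += 1
    lut = []
    for address in range(2 ** num_inputs):
        key = tuple((address >> j) & 1 for j in range(num_inputs))
        zeros, ones = votes[key]
        lut.append(int(ones > zeros))
    return chosen, lut


# Toy data (made up): 4 binary features, label = feature 0 XOR feature 2.
X = [[(i >> j) & 1 for j in range(4)] for i in range(16)]
y = [row[0] ^ row[2] for row in X]
print(train_levelwise_tree(X, y, 2))  # ([0, 2], [0, 1, 1, 0])
```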




# **RINC-1** : Incorporating Adaboost

- A single Decision Tree is a weak classifier
- Ensemble methods such as Boosting and Bagging are used to create strong classifiers from weak classifiers
- We use the well-known Adaboost algorithm



Source : https://packtpub.com/book/bigdataandbusinessintelligence/adaboost-classifier

- The weak classifiers are created serially
- All samples are given equal weights initially
- The first weak classifier is trained on the data
- The weights of the misclassified samples are increased
- Each subsequent classifier focuses on the samples that were classified incorrectly
- Each classifier is assigned a weight based on the number of correctly classified samples
- A weighted sum of all the weak classifier outputs forms the strong classifier
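
The steps above can be sketched with textbook discrete AdaBoost. For brevity this example uses single-feature stumps as weak classifiers instead of the modified Decision Trees, and it derives each classifier weight from the weighted error rate, which is the usual formulation; both choices are assumptions of the sketch.

```python
from math import exp, log


def train_stump(X, y, w):
    """Best single-feature classifier under sample weights w.

    Returns (weighted_error, feature, polarity); the stump predicts
    polarity * (+1 if x[feature] == 1 else -1).
    """
    best = None
    for f in range(len(X[0])):
        for polarity in (1, -1):
            err = sum(wi for row, yi, wi in zip(X, y, w)
                      if polarity * (1 if row[f] else -1) != yi)
            if best is None or err < best[0]:
                best = (err, f, polarity)
    return best


def adaboost(X, y, rounds):
    """Discrete AdaBoost; labels in y are +1 or -1."""
    n = len(y)
    w = [1.0 / n] * n  # equal weights initially
    ensemble = []
    for _ in range(rounds):
        err, f, pol = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * log((1 - err) / err)     # classifier weight
        ensemble.append((alpha, f, pol))
        # Raise the weights of misclassified samples, lower the others.
        w = [wi * exp(-alpha * yi * pol * (1 if row[f] else -1))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    """Sign of the weighted sum of the weak-classifier outputs."""
    score = sum(a * pol * (1 if row[f] else -1) for a, f, pol in ensemble)
    return 1 if score >= 0 else -1


# Toy data (made up): label is the majority vote of three binary features.
X = [[(i >> j) & 1 for j in range(3)] for i in range(8)]
y = [1 if sum(row) >= 2 else -1 for row in X]
model = adaboost(X, y, rounds=3)
print([predict(model, row) == yi for row, yi in zip(X, y)])
```

On this toy problem no single stump is correct on all eight samples, while the three-stump ensemble is.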


# **RINC-1** Module



- The MAC and threshold operations can be implemented in a LUT
- A RINC-1 module can group up to a maximum of P Decision Trees
- However, P Decision Trees with P<sup>2</sup> inputs in total do not match the capacity of a MAC operation in a neural network
- A neuron in a neural network can have up to 4096 inputs, compared with 36 (when P = 6) in a RINC-1 module
- Hence, we introduce the hierarchical Adaboost algorithm
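
Because the P tree outputs are binary, the weighted sum followed by a threshold takes only 2^P possible input combinations and can be tabulated ahead of time. The sketch below shows this folding; the weights are hypothetical and the encoding (bit 1 as a +1 vote, bit 0 as a -1 vote, threshold at zero) is an assumption.

```python
def mac_threshold_lut(alphas):
    """Fold a weighted sum plus threshold over P binary inputs into a truth table.

    Input bit 1 stands for a +1 vote and input bit 0 for a -1 vote.
    """
    p = len(alphas)
    table = []
    for address in range(2 ** p):
        score = sum(a if (address >> j) & 1 else -a
                    for j, a in enumerate(alphas))
        table.append(1 if score >= 0 else 0)
    return table


# Hypothetical classifier weights for P = 3 Decision Trees.
lut = mac_threshold_lut([0.55, 0.80, 1.10])
print(lut)  # [0, 0, 0, 1, 0, 1, 1, 1]
```

At inference time no multiplication or addition is performed: the P tree outputs simply address the table.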


# **RINC-2** : Hierarchical Adaboost



RINC-2 module

- The RINC-2 modules have adequate capacity to represent MAC operations
- They can only be used for binary classification
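
A back-of-the-envelope sketch of how module size grows with the hierarchy level is given below. It assumes that a RINC-0 is a single P-input LUT and that each higher level groups P modules of the level below plus one LUT for the MAC and threshold; this matches the 36 inputs quoted for RINC-1 with P = 6, but the recurrence itself is an assumption of this example and the input count is an upper bound reached only when no input is shared.

```python
def rinc_size(p, level):
    """Return (max_inputs, luts) of a RINC-`level` module built from p-input LUTs.

    Assumption: RINC-0 is one LUT; each higher level groups p modules of the
    level below and adds one LUT for the MAC and threshold.
    """
    inputs, luts = p, 1
    for _ in range(level):
        inputs, luts = inputs * p, luts * p + 1
    return inputs, luts


for level in range(3):
    print(level, rinc_size(6, level))
# 0 (6, 1)
# 1 (36, 7)
# 2 (216, 43)
```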


# Binary to Multiclass Classification



#### Current Methods :

- Multiclass DTs : costly to implement
- One-vs-All classification : leads to reduction in accuracy

#### Our Approach :

- A sparsely connected intermediate layer before the final output layer for multiclass classification
- Only P neurons of the intermediate layer are connected to each neuron in the output layer
- The neurons in the output layer need multiple bits to represent the probabilities, so they cannot be binary
- Implemented as LUTs
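
A multi-bit output neuron with only P binary inputs can also be tabulated: one P-input LUT per output bit. The sketch below illustrates this under stated assumptions (0/1 inputs, a clipping range `[lo, hi]`, uniform quantization); the weights and function names are hypothetical and the presentation does not specify the quantization scheme.

```python
def output_neuron_luts(weights, bias, bits, lo, hi):
    """Bit-plane truth tables for one output neuron with len(weights) binary inputs.

    The real-valued score is clipped to [lo, hi] and quantized to `bits` bits;
    table k stores bit k of the quantized score for every input combination.
    """
    p = len(weights)
    levels = 2 ** bits - 1
    planes = [[] for _ in range(bits)]
    for address in range(2 ** p):
        score = bias + sum(w for j, w in enumerate(weights) if (address >> j) & 1)
        clipped = max(lo, min(hi, score))
        code = round((clipped - lo) / (hi - lo) * levels)
        for k in range(bits):
            planes[k].append((code >> k) & 1)
    return planes


def read_neuron(planes, address):
    """Reassemble the multi-bit output from the bit-plane LUTs."""
    return sum(plane[address] << k for k, plane in enumerate(planes))


# Hypothetical neuron with P = 3 inputs and a 2-bit output.
planes = output_neuron_luts([1.0, -1.0, 2.0], bias=0.0, bits=2, lo=-1.0, hi=3.0)
print(read_neuron(planes, 0b101))  # inputs 0 and 2 active
print(read_neuron(planes, 0b010))  # only input 1 active
```

The predicted class would then be the output neuron with the largest multi-bit value.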


# **Experimental Setup**



TABLE - Network architecture

| ARCHITECTURE (ARCH.)                               | SYMBOL | DATASET  |
|----------------------------------------------------|--------|----------|
| LeNet<sub>FE</sub> - (512FC) - (10FC)              | M1     | MNIST    |
| VGG11<sub>FE</sub> - (4096FC) - (4096FC) - (10FC)  | C1     | CIFAR-10 |
| VGG11<sub>FE</sub> - (2048FC) - (2048FC) - (10FC)  | S1     | SVHN     |

# Results : Accuracy

- A<sub>1</sub> : Vanilla network, A<sub>2</sub> : Network with binary features, A<sub>3</sub> : Teacher network with intermediate layer, A<sub>4</sub> : PoET-BiN

TABLE - Overall classification accuracy on the MNIST, CIFAR-10 and SVHN datasets

| ARCH. | DATASET  | A <sub>1</sub> (%) | A <sub>2</sub> (%) | A <sub>3</sub> (%) | A <sub>4</sub> (%) |
|-------|----------|--------------------|--------------------|--------------------|--------------------|
| M1    | MNIST    | 99.20              | 99.06              | 98.93              | 98.15              |
| C1    | CIFAR-10 | 91.02              | 89.88              | 89.10              | 92.64              |
| S1    | SVHN     | 97.36              | 96.98              | 96.22              | 95.13              |

TABLE - Comparison with other techniques

| IMPLEMENTATION   | MNIST ACCURACY (%) | CIFAR-10 ACCURACY (%) | SVHN ACCURACY (%) |
|------------------|--------------------|-----------------------|-------------------|
| BINARYNET (2016) | 98.97              | 89.76                 | 95.06             |
| POLYBINN (2018)  | 97.52              | 91.58                 | 94.97             |
| NDF (2015)       | 99.42              | 90.46                 | 95.20             |
| OUR WORK         | 98.15              | 92.64                 | 95.13             |

- There is a reduction in accuracy for each modification introduced
- Comparable accuracy with other state-of-the-art networks
- BinaryNet : Neural Network approach
- POLYBINN : Decision Tree approach
- NDF : Hybrid approach
- Same feature extractor for fair comparisons


# Results : Power Consumption

- Measurements from Xilinx Power Analyzer tool
- Power consumption of the classification layers only

TABLE - RINC power

| DATA SET    | MNIST | CIFAR-10 | SVHN  |
|-------------|-------|----------|-------|
| DYNAMIC (W) | 0.468 | 0.300    | 0.374 |
| STATIC (W)  | 0.045 | 0.041    | 0.043 |
| TOTAL (W)   | 0.513 | 0.341    | 0.417 |

TABLE - Number of arithmetic operations

| OP.   | MNIST  | CIFAR-10 | SVHN  |
|-------|--------|----------|-------|
| Add.  | 0.26 M | 18.9 M   | 5.2 M |
| Mult. | 0.26 M | 18.9 M   | 5.2 M |

TABLE - Single arithmetic operation power (at 62.5 MHz)

| OPERATION                | DYNAMIC : CLOCK (mW) | DYNAMIC : LOGIC (mW) | DYNAMIC : SIGNAL (mW) | DYNAMIC : IO (mW) | STATIC (mW) | TOTAL (mW) |
|--------------------------|----------------------|----------------------|-----------------------|-------------------|-------------|------------|
| MULTIPLICATION (16 BITS) | 1                    | 1                    | 0                     | 20                | 36          | 58         |
| ADDITION (16 BITS)       | 1                    | 0.0                  | 1                     | 24                | 36          | 62         |
| MULTIPLICATION (32 BITS) | 2                    | 1                    | 1                     | 35                | 37          | 76         |
| ADDITION (32 BITS)       | 1                    | 0.0                  | 2                     | 48                | 37          | 88         |
| MULTIPLICATION (FP)      | 5                    | 6                    | 5                     | 46                | 37          | 98         |
| ADDITION (FP)            | 4                    | 3                    | 5                     | 34                | 37          | 83         |


# Results : Energy Consumption, Latency and Hardware Costs

- The networks were implemented on a Spartan-6 Xilinx FPGA
- Energy reduction by up to three orders of magnitude when compared to recent binary quantized neural networks

TABLE - Energy consumption

| TECHNIQUE    | MNIST ENERGY (J)     | CIFAR-10 ENERGY (J)  | SVHN ENERGY (J)      |
|--------------|----------------------|----------------------|----------------------|
| VANILLA      | $8.0 \times 10^{-5}$ | $5.7 \times 10^{-3}$ | $1.6 \times 10^{-3}$ |
| 1-bit Quant  | $2.1 \times 10^{-7}$ | $3.9 \times 10^{-5}$ | $9.2 \times 10^{-6}$ |
| 16-bit Quant | $8.5 \times 10^{-6}$ | $6.0 \times 10^{-4}$ | $1.0 \times 10^{-4}$ |
| 32-bit Quant | $1.7 \times 10^{-5}$ | $1.2 \times 10^{-3}$ | $3.6 \times 10^{-4}$ |
| POET-BIN     | $8.2 \times 10^{-9}$ | $5.4 \times 10^{-9}$ | $4.1 \times 10^{-9}$ |

TABLE - Implementation results

| DATA SET       | MNIST | CIFAR-10 | SVHN |
|----------------|-------|----------|------|
| LATENCY (NS)   | 9.11  | 9.48     | 5.85 |
| NUMBER OF LUTS | 11899 | 9650     | 2660 |

- Tens of thousands of LUTs, cannot be handcoded in VHDL
- Python library to generate VHDL from high level network information
- 8-input LUTs for MNIST and CIFAR-10, 6-input LUTs for SVHN
- Need original implementation of the other works to estimate the resource consumption for the fully connected layers for fair comparison
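
To give a flavour of what generating VHDL from Python can look like, the sketch below turns one truth table into a small VHDL entity that uses a selected signal assignment. The function name, the entity layout and the port names are hypothetical; this is not the authors' library, only an illustration of the idea.

```python
def lut_to_vhdl(name, table):
    """Emit a VHDL entity whose output y equals table[x] for a k-bit input x."""
    k = (len(table) - 1).bit_length()
    if k == 0 or len(table) != 2 ** k:
        raise ValueError("table length must be a power of two, at least 2")
    lines = [
        "library ieee;",
        "use ieee.std_logic_1164.all;",
        "",
        f"entity {name} is",
        f"  port (x : in  std_logic_vector({k - 1} downto 0);",
        "        y : out std_logic);",
        f"end entity {name};",
        "",
        f"architecture rtl of {name} is",
        "begin",
        "  with x select y <=",
    ]
    # One choice per address whose table entry is 1; everything else maps to '0'.
    for address, bit in enumerate(table):
        if bit:
            lines.append(f"    '1' when \"{address:0{k}b}\",")
    lines.append("    '0' when others;")
    lines.append("end architecture rtl;")
    return "\n".join(lines)


# Hypothetical 2-input LUT implementing XOR.
vhdl = lut_to_vhdl("rinc0_lut", [0, 1, 1, 0])
print(vhdl)
```

A full generator would repeat this for every LUT and add a top-level entity that wires them together according to the network description.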


# Conclusion

#### Conclusion

- Proposed a Power-efficient Tiny Binary Neuron architecture
- Removed all MAC operations and memory access in classification layers
- Achieved comparable accuracies to other state-of-the-art works

#### Advantages

- Reduction in energy by up to three orders of magnitude when compared to recent binary quantized neural networks
- Can be implemented in any hardware, not just FPGAs

#### **Further Work**

- Implementation for the convolutional layers
- Results for larger datasets


# Acknowledgments

#### We thank

- Ahmed Abdelsalam for his suggestions and comments throughout the project
- MITACS and ReSMiQ for partially sponsoring the project


# Thanks!

# Questions???