

### Memory-driven mixed low precision quantization for enabling deep inference networks on microcontrollers

### Manuele Rusci\*, Alessandro Capotondi, Luca Benini

\*manuele.rusci@unibo.it

Energy-Efficient Embedded Systems Laboratory

Dipartimento di Ingegneria dell'Energia Elettrica e dell'Informazione "Guglielmo Marconi" – DEI – **Università di Bologna** 

ALMA MATER STUDIORUM – UNIVERSITÀ DI BOLOGNA




### Microcontrollers for smart sensors








- Low-power (<10-100 mW) and low-cost
  - Smart devices are battery-operated
- Highly flexible (SW programmable)

### But limited resources(!)

- A few MB of memory
- Single RISC core at up to a few hundred MHz (STM32H7: 400 MHz) with DSP SIMD instructions and an optional FPU
- Currently, only tiny visual DL tasks run on MCUs (visual wake words, CIFAR10)

Source: STM32H7 datasheet

Challenge: Run 'complex' and 'big' (Imagenet-size) DL inference on an MCU?



# Deep Learning for microcontrollers

### "Efficient" topologies: Accuracy vs MAC vs Memory



- Issue 1: An integer-only model is needed for deployment on low-power MCUs
- Issue 2: 8-16 bits are not sufficient to bring 'complex' models to MCUs (memory!)

# Memory-Driven Mixed-Precision Quantization



Apply the minimum tensor-wise quantization ($\leq$ 8 bit) that fits the memory constraints with a very low accuracy drop.

### **Challenges:**

- How to define the quantization policy
- How to combine this quantization flow with the integer-only transformation



### End-to-end Flow & Contributions

**Goal**: Define a design flow to bring Imagenet-size models into an MCU device while paying a low accuracy drop.



We define a rule-based **methodology** to determine the **mixed-precision quantization policy** driven by a memory objective function.

### **Graph Optimization**

We report a latency-accuracy tradeoff on iso-memory mixed-precision networks belonging to the Imagenet MobilenetV1 family when running on an STM32H7 MCU.

We introduce the **Integer Channel-Normalization** (ICN) activation layer to generate an **integer-only deployment** graph when applying **uniform sub-byte quantization**.





### **Graph Optimization**

# INTEGER-ONLY W/ SUB-BYTE QUANTIZATION



### State of the Art

- Inference with integer-only arithmetic (Jacob, 2018)
  - Affine transformation between real values and (*uniform*) quantized parameters
  - Quantization-aware retraining
  - Folding of batch norm into conv weights + rounding of per-layer scaling parameters

A real-valued tensor (or sub-tensor) $t$ maps to its quantized tensor $T_q$ (INT-Q) through a scale $S_t$ and a zero point $Z_t$:

$$t = S_t \times (T_q - Z_t)$$
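The affine mapping above can be sketched in a few lines of NumPy. This is only an illustration: the function names, the unsigned 8-bit range, and the example scale and zero point are assumptions chosen for the demo, not values from the slides.

```python
import numpy as np

def quantize(t, scale, zero_point, bits=8):
    """Map real values to unsigned integers T_q in [0, 2**bits - 1]."""
    q = np.round(t / scale) + zero_point
    return np.clip(q, 0, 2**bits - 1).astype(np.int64)

def dequantize(t_q, scale, zero_point):
    """Recover the real-valued approximation t = S_t * (T_q - Z_t)."""
    return scale * (t_q - zero_point)

# Hypothetical tensor and parameters covering the range [-0.5, 1.0].
t = np.array([-0.5, 0.0, 0.25, 1.0])
scale, zero_point = 1.5 / 255, 85
t_q = quantize(t, scale, zero_point)
t_hat = dequantize(t_q, scale, zero_point)
print(t_q, t_hat)
```

With values inside the covered range, the reconstruction error stays within half a quantization step, which is why the number of bits directly controls the accuracy of the representation.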

- Almost lossless with 8 bits on image classification and detection problems. Used by TF Lite.
- 4-bit MobilenetV1: training collapses when folding batch norm into the convolution weights
- Does not support per-channel (PC) weight quantization

Integer-Only MobilenetV1\_224\_1.0

| Quantization Method | Top-1 (%) | Weights (MB) |
|---------------------|-----------|--------------|
| Full-Precision      | 70.9      | 16.8         |
| w8a8                | 70.1      | 4.06         |
| w4a4                | 0.1       | 2.05         |

(Jacob, 2018) Jacob, Benoit, et al. "Quantization and training of neural networks for efficient integer-arithmetic-only inference." CVPR 2018



# Integer-Channel Normalization (ICN)



$$Y_q = quant_{act} \left( \frac{\phi - \mu}{\sigma} \cdot \gamma + \beta \right)$$

where $\phi = \sum w \cdot x$ and $\mu, \sigma, \gamma, \beta$ are the channel-wise batch norm parameters.

Replacing each real-valued tensor with $t = S_t \times (T_q - Z_t)$ gives the integer accumulator

$$\Phi = \sum (W_q - Z_w) \cdot (X_q - Z_x)$$

$S_w$ is a scalar if PL, else an array; $S_i, S_o$ are scalars.

$$Y_q = Z_y + quant_{act} \left( \frac{S_i S_w}{S_o} \frac{\gamma}{\sigma} \left( \Phi + \left[ \frac{1}{S_i S_w} \left( B - \mu + \frac{\beta \sigma}{\gamma} \right) \right] \right) \right)$$

$$Y_q = Z_y + quant_{act} \left( M_0 2^{N_0} \left( \Phi + B_q \right) \right)$$

Here $M_0 2^{N_0}$ approximates the factor $\frac{S_i S_w}{S_o} \frac{\gamma}{\sigma}$, $B_q$ stands for the bracketed term, and $M_0, N_0, B_q$ are channel-wise integer params.
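One way to obtain channel-wise integer parameters of this form is sketched below. It is a minimal illustration, not the authors' implementation: the function names, the 15-bit mantissa (`frac_bits`), the round-to-nearest shift, and the clipping used for $quant_{act}$ are all assumptions.

```python
import numpy as np

def icn_params(s_i, s_w, s_o, gamma, sigma, mu, beta, bias, frac_bits=15):
    """Derive per-channel integers (M0, N0, Bq) from float parameters."""
    scale = (s_i * s_w / s_o) * (gamma / sigma)    # real multiplier per channel
    mant, exp = np.frexp(scale)                    # scale = mant * 2**exp, 0.5 <= |mant| < 1
    m0 = np.round(mant * (1 << frac_bits)).astype(np.int64)
    n0 = exp.astype(np.int64) - frac_bits          # scale ~= m0 * 2**n0
    b_q = np.round((bias - mu + beta * sigma / gamma) / (s_i * s_w)).astype(np.int64)
    return m0, n0, b_q

def icn_apply(phi, m0, n0, b_q, z_y=0, bits=4):
    """Integer-only activation: Z_y + clip(M0 * 2**N0 * (Phi + Bq)).

    phi holds one integer accumulator per channel.
    """
    shift = -n0
    # Assumption of this sketch: every channel multiplier needs a right shift.
    assert np.all(shift > 0)
    acc = (phi + b_q) * m0
    acc = (acc + (1 << (shift - 1))) >> shift      # round-to-nearest right shift
    return z_y + np.clip(acc, 0, 2**bits - 1)
```

Because `s_w`, `gamma`, `sigma`, `mu`, and `beta` are arrays with one entry per output channel, the same code covers the PL case (constant `s_w`) and the PC case (varying `s_w`).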

#### Integer-Only MobilenetV1\_224\_1.0

| Quantization Method | Top-1 (%) | Weights (MB) |
|------------------------|-------|-----------------|
| Full-Precision         | 70.9  | 16.8            |
| PL+ICN w4a4            | 61.75 | 2.10            |
| PC+ICN w4a4            | 66.41 | 2.12            |

### Integer Channel-Normalization (ICN) activation function

The ICN formulation holds for either PL or PC quantization of weights.





Device-aware Fine-Tuning

# MIXED-PRECISION QUANTIZATION POLICY

# Deployment of an integer-only graph



### Problem

Can this graph fit the memory constraints of our MCU device?








**Goal**: Maximize memory utilization

[M1]: $size(w_0) + size(w_1) + size(w_2) + size(w_3) < M_{ROM}$

 $\delta = 5\%$ 
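A rule-based policy driven by constraint [M1] could look like the greedy sketch below. This is a hypothetical illustration, not the rule set from the slides: the function name, the 8 -> 4 -> 2 bit sequence, and the tie-breaking rule (among layers whose memory share is within $\delta$ of the largest, cut the one with the lowest index) are assumptions.

```python
def cut_weight_bits(num_params, rom_bytes, delta=0.05, start_bits=8, min_bits=2):
    """Lower per-layer weight precision until the total footprint is below rom_bytes."""
    bits = [start_bits] * len(num_params)

    def footprint(precisions):
        # Bytes needed by each layer at its current precision.
        return [n * q / 8 for n, q in zip(num_params, precisions)]

    while sum(footprint(bits)) >= rom_bytes:
        sizes = footprint(bits)
        total = sum(sizes)
        shares = [s / total for s in sizes]
        candidates = [i for i, q in enumerate(bits) if q > min_bits]
        if not candidates:
            raise ValueError("cannot satisfy the memory constraint")
        top = max(shares[i] for i in candidates)
        # Assumed tie-break: first layer whose share is within delta of the largest.
        k = min(i for i in candidates if shares[i] > top - delta)
        bits[k] //= 2                              # 8 -> 4 -> 2
    return bits

# Hypothetical four-layer network and a 6000-byte ROM budget.
print(cut_weight_bits([1000, 4000, 4000, 1000], rom_bytes=6000))
```

The loop stops at the first assignment that satisfies the constraint, so the remaining layers keep the highest precision the budget allows.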



### Weights Quantization Policy























### Experimental Results on MobilenetV1

Iso-memory MobilenetV1 models with 2MB FLASH and 512kB RAM.

| Model    | Params (M) | Full-Prec Top-1 (%) | Mix-PC Top-1 (%) | Mix-PL Top-1 (%) |
|----------|---------|-----------|--------|--------|
| 224_1.0  | 4.24    | 70.9      | 64.3   | 59.6   |
| 192_1.0  | 4.24    | 70.0      | 65.9   | 61.9   |
| 224_0.75 | 2.59    | 68.4      | 68.0   | 67.0   |
| 192_0.75 | 2.59    | 67.2      | 67.2   | 64.8   |
| 224_0.5  | 1.34    | 63.3      | 63.5   | 63.1   |
| 192_0.5  | 1.34    | 61.7      | 62.0   | 59.5   |
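A back-of-the-envelope calculation shows how tight the FLASH budget is for each model width. The sketch assumes 2 MB means $2 \cdot 2^{20}$ bytes and ignores everything stored besides the weights, so the numbers are only indicative.

```python
def average_bit_budget(num_params, flash_bytes):
    """Average number of bits available per weight within the FLASH budget."""
    return flash_bytes * 8 / num_params

FLASH_BYTES = 2 * 2**20  # assumption: 2 MB read as 2 MiB

# Parameter counts (in millions) taken from the table above.
for width, mparams in [("1.0", 4.24), ("0.75", 2.59), ("0.5", 1.34)]:
    budget = average_bit_budget(mparams * 1e6, FLASH_BYTES)
    print(f"width {width}: {budget:.2f} bits per weight on average")
```

Under these assumptions the width-1.0 models must average fewer than 4 bits per weight, while the width-0.5 models fit even at 8 bits, which is in line with the accuracy gaps in the table.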

Mix-PC and Mix-PL are integer-only models.

Quantization-aware fine-tuning recipe:

- Init with pre-trained params
- 8 h on 4 NVIDIA Tesla P100
- ADAM, lr=1e-4 (5e-5 at epoch 5, 1e-5 at epoch 8)
- Frozen batch norm stats after 1 epoch
- Asymmetric quantization on weights, either PC (min/max) or PL (PACT)
- Asymmetric quantization on activations (PACT)
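The per-channel min/max option from the recipe can be illustrated with the NumPy sketch below. It is a simplified forward-only version under stated assumptions: the function name is hypothetical, weights are laid out as (output channels, ...), and the gradient handling needed for quantization-aware training is left out.

```python
import numpy as np

def quantize_weights_minmax(w, bits=4):
    """Asymmetric per-channel quantization: w ~= scale * (q - zero)."""
    flat = w.reshape(w.shape[0], -1)               # one row per output channel
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    levels = 2**bits - 1
    scale = np.maximum(hi - lo, 1e-12) / levels    # guard against constant channels
    zero = np.round(-lo / scale)                   # integer zero point per channel
    q = np.clip(np.round(flat / scale) + zero, 0, levels)
    return (q.reshape(w.shape).astype(np.int64),
            scale.ravel(),
            zero.ravel().astype(np.int64))

# Hypothetical weights whose channels have very different ranges.
rng = np.random.default_rng(0)
w = rng.normal(size=(3, 8)) * np.array([[0.1], [1.0], [5.0]])
q, scale, zero = quantize_weights_minmax(w, bits=4)
w_hat = scale[:, None] * (q - zero[:, None])
print(np.abs(w_hat - w).max(axis=1))
```

Each channel gets its own scale, so a channel with a small range is not forced onto the coarse grid of a channel with a large one.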

Open source: https://github.com/mrusci/training-mixed-precision-quantized-networks




The larger models show a higher accuracy drop because they need more aggressive bit cuts to fit the memory budget.











**Deployments on MCUs** 

# LATENCY-ACCURACY TRADE-OFF ON A STM32H7 MCU



### Latency-Accuracy Trade Off

### Experiments run on an STM32H743 (400 MHz clock)



- The implementation is based on a software library for mixed-precision inference (built on CMSIS-NN):
  - CMix-NN:
    - https://github.com/EEESlab/CMix-NN
  - UINT2 and UINT4 are software-emulated
  - MAC 2x16 bits

PC models lie on the Pareto frontier

But PC is slower than PL by 20-30%

$$\begin{aligned}
\Phi &= \sum (X_q - Z_x) \cdot (W_q - Z_w) \\
&= \sum X_{im2col} \cdot (W_q - Z_w) && \text{(PC)} \\
&= \sum X_{im2col} \cdot W_q - \sum X_{im2col} \cdot Z_w && \text{(PL)}
\end{aligned}$$
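The two forms of the accumulator can be compared with a short NumPy sketch. The function names and array shapes are assumptions for illustration, and `x_cols` stands for $X_{im2col}$ with the activation zero point already subtracted. One plausible reading of the gap between PC and PL is that the PL form moves the weight offset out of the inner loop.

```python
import numpy as np

def phi_per_channel(x_cols, w_q, z_w):
    """PC form: subtract each channel's zero point before the MAC.

    x_cols: (K, P) im2col buffer, w_q: (C, K) weights, z_w: (C,) zero points.
    """
    return (w_q - z_w[:, None]) @ x_cols

def phi_per_layer(x_cols, w_q, z_w):
    """PL form: scalar z_w, so one column sum is shared by all output channels."""
    return w_q @ x_cols - z_w * x_cols.sum(axis=0, keepdims=True)

# Hypothetical 4-bit weights and an im2col buffer with signed offsets.
rng = np.random.default_rng(1)
x_cols = rng.integers(-8, 8, size=(27, 5))
w_q = rng.integers(0, 16, size=(6, 27))
z = 7
same = np.array_equal(phi_per_layer(x_cols, w_q, z),
                      phi_per_channel(x_cols, w_q, np.full(6, z)))
print(same)
```

Both functions give identical results when every channel shares the same zero point; only the amount of work inside the multiply-accumulate differs.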

Overall, +8% accuracy with respect to the best 8-bit integer-only MobilenetV1 that fits the device (Jacob, 2018)



### Wrap-up

- We proposed an **end-to-end methodology** to train and deploy 'complex' DL models on **tiny MCUs**.
  - **sub-byte** uniform quantization
  - mixed-precision settings
  - a memory-driven rule-based method for determining the quantization policy
  - integer-only transformation with **ICN** activation layers
  - mixed precision **software** library for MCU
- Deployment of a MobilenetV1 with 68% Imagenet accuracy on an MCU with 2 MB FLASH and 512 kB RAM.