Once-for-All Channel Mixers (HyperTinyPW): Generative Compression for TinyML
Abstract
Neural networks on microcontrollers are constrained by kilobytes of flash/SRAM, where 1×1 pointwise (PW) mixers often dominate memory even after INT8 quantization. We present HYPERTINYPW, a compression-as-generation method that replaces most stored PW weights with generated weights: a shared micro-MLP synthesizes PW kernels once at load time from tiny per-layer codes, caches them, and executes them with standard integer operators. This preserves commodity MCU runtimes and incurs only a one-off synthesis cost; steady-state inference matches INT8 separable CNNs. Sharing a latent basis across layers removes cross-layer redundancy, while keeping PW1 in INT8 stabilizes early, morphology-sensitive mixing. We also introduce TinyML-faithful packed-byte accounting (generator, heads/factorization, codes, kept PW1, backbone) and a unified evaluation protocol with validation-tuned thresholds and bootstrap CIs. On three ECG benchmarks (Apnea-ECG, PTB-XL, MIT-BIH), HYPERTINYPW improves the macro-F1 vs. flash Pareto front: at ∼225 kB it achieves near-iso performance to a ∼1.4 MB CNN while being 6.31× smaller (84.15% fewer bytes), retaining ≥95% of large-model macro-F1. Beyond ECG, HYPERTINYPW transfers to TinyML audio: on Speech Commands keyword spotting it reaches 96.2% test accuracy (98.2% best validation), supporting the view that generate-and-cache channel mixing applies broadly to embedded sensing workloads where repeated linear mixers dominate memory.
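The generate-and-cache idea described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the two-layer ReLU generator, the direct (unfactorized) output head, the symmetric per-layer INT8 scale, the layer names, and every function name and size below are assumptions chosen only to show the control flow, namely that kernels are synthesized once from per-layer codes, cached, and then used by an ordinary integer pointwise operator.

```python
import random

def make_generator(code_dim, hidden_dim, out_dim, seed=0):
    """Hypothetical shared micro-MLP: two weight matrices as nested lists."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(code_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(out_dim)]
    return w1, w2

def synthesize(generator, code):
    """Map one per-layer code to a flat float kernel (ReLU hidden layer)."""
    w1, w2 = generator
    hidden = [max(0.0, sum(w * c for w, c in zip(row, code))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def quantize_int8(values):
    """Symmetric per-layer quantization to [-127, 127] (an assumed scheme)."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid a zero scale
    return [max(-127, min(127, round(v / scale))) for v in values], scale

def build_cache(generator, codes, c_out, c_in):
    """Run once at load time: synthesize and quantize every generated PW kernel."""
    cache = {}
    for name, code in codes.items():
        flat, scale = quantize_int8(synthesize(generator, code))
        kernel = [flat[o * c_in:(o + 1) * c_in] for o in range(c_out)]
        cache[name] = (kernel, scale)
    return cache

def pointwise(kernel, x_q):
    """Steady-state 1x1 mixing for one time step: integer multiply-accumulate."""
    return [sum(w * x for w, x in zip(row, x_q)) for row in kernel]

# Illustrative sizes only: 4-dim codes, 4 input and 4 output channels.
generator = make_generator(code_dim=4, hidden_dim=8, out_dim=16)
codes = {"pw2": [0.1, -0.2, 0.3, 0.4], "pw3": [0.5, 0.1, -0.3, 0.2]}
cache = build_cache(generator, codes, c_out=4, c_in=4)
kernel, scale = cache["pw2"]
acc = pointwise(kernel, [12, -7, 3, 25])  # int32-style accumulators
```

After `build_cache` returns, inference touches only the cached integer kernels, so the generator's cost is paid once per load rather than per input. In this sketch the stored state is the generator weights plus the per-layer codes; whether that is smaller than the dense kernels depends on the layer count and sizes, which is what the packed-byte accounting in the abstract is meant to report.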