Once-for-All Channel Mixers (HyperTinyPW): Generative Compression for TinyML
Abstract
Deploying neural networks on microcontrollers is constrained by kilobytes of flash and SRAM, where 1×1 pointwise (PW) channel mixers often dominate memory even after INT8 quantization. We present HyperTinyPW, a compression-as-generation approach that replaces most stored PW weights with generated weights. A shared micro-MLP synthesizes PW kernels once at load time from tiny per-layer codes; the kernels are cached and then executed with standard integer operators, so the deployment stack stays unchanged. A shared latent basis across layers reduces redundancy, and keeping the first PW layer in INT8 stabilizes early morphology-sensitive mixing. Our contributions are: (1) TinyML-faithful packed-byte accounting that includes the generator, heads or factorization, per-layer codes, the kept first PW layer, and the backbone; (2) a unified evaluation protocol with a validation-tuned threshold (t*) and bootstrap confidence intervals; and (3) a deployability analysis covering integer-only inference and boot-versus-lazy synthesis trade-offs. On three ECG benchmarks (Apnea-ECG, PTB-XL, MIT-BIH), HyperTinyPW shifts the macro-F1 versus flash Pareto frontier: at about 225 kB it matches a ~1.4 MB CNN while being 6.31× smaller (84.15% fewer bytes), retaining at least 95% of large-model macro-F1. Under 32–64 kB budgets it sustains balanced detection where compact baselines degrade. The mechanism applies broadly to other 1D biosignals, on-device speech, and embedded sensing tasks where per-layer redundancy dominates, suggesting a wider role for compression-as-generation in resource-constrained ML systems.
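The core mechanism described above (a shared micro-MLP that synthesizes each pointwise kernel from a tiny per-layer code at load time, with the result quantized to INT8 and cached for standard integer execution) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all dimensions, the two-layer tanh generator, and the symmetric per-tensor INT8 scheme are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): per-layer code width, generator
# hidden width, and the shape of each 1x1 pointwise kernel (C_out x C_in).
CODE_DIM, HIDDEN, C_OUT, C_IN = 8, 32, 16, 16
NUM_PW_LAYERS = 3  # layers whose kernels are generated rather than stored

# Shared micro-MLP generator: one set of weights reused for every PW layer,
# so storage is generator + tiny codes instead of full per-layer kernels.
W1 = rng.normal(0.0, 0.1, (HIDDEN, CODE_DIM))
W2 = rng.normal(0.0, 0.1, (C_OUT * C_IN, HIDDEN))

def generate_pw_kernel(code: np.ndarray) -> np.ndarray:
    """Synthesize one float pointwise kernel from its per-layer code."""
    h = np.tanh(W1 @ code)                # shared latent features
    return (W2 @ h).reshape(C_OUT, C_IN)  # flat output folded into a kernel

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization, so the cached kernel can
    run through standard integer operators unchanged."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

# Boot-time synthesis: generate, quantize, and cache every PW kernel once;
# inference afterwards only reads the INT8 cache.
layer_codes = [rng.normal(size=CODE_DIM) for _ in range(NUM_PW_LAYERS)]
cache = [quantize_int8(generate_pw_kernel(c)) for c in layer_codes]

# Rough byte accounting in the spirit of the abstract: stored bytes are the
# shared generator plus the tiny codes, versus storing each kernel directly.
stored = (W1.size + W2.size + NUM_PW_LAYERS * CODE_DIM) * 1  # ~1 B/param if INT8
dense = NUM_PW_LAYERS * C_OUT * C_IN * 1
```

Note that in this toy configuration the generator itself dominates; the savings in the paper arise because one generator is amortized over many layers with large `C_out * C_in` kernels, while each code stays tiny.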