Invited Talk 2 in Workshop: Personalized Recommendation Systems and Algorithms
A Memory-centric Approach in Designing System Architectures for Personalized Recommendations
Minsoo Rhu
To satisfy machine learning (ML) practitioners’ insatiable demand for higher processing power, computer architects have been at the forefront of developing “accelerated” computing solutions for ML (e.g., Google TPUs or NVIDIA Tensor Cores) that have changed the landscape of the computing industry. The market has been in an arms race to design the fastest and most energy-efficient ML accelerator, with progress oftentimes quantified in TOPS (tera-operations per second) and TOPS/Watt. As a result, the past five years have seen remarkable improvements in the raw compute throughput and energy efficiency delivered by the latest ML accelerators.
Ironically, because computer architects have done such an amazing job addressing the computation bottlenecks of ML, the compute primitives accelerated by conventional, GEMM (general matrix multiplication) optimized ML accelerators are becoming relatively less of a concern in several emerging ML applications. In particular, personalized recommendation models for consumer-facing products (e.g., e-commerce, ads) employ “sparse” embedding layers, which stand out for their high memory capacity and bandwidth demands, rendering conventional dense-optimized TPUs/GPUs suboptimal for training and deploying recommendation models. In this talk, I will share our memory-centric approach to designing system architectures for recommendation models, which overcomes several key challenges of prior, compute-centric AI systems.
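As a rough illustration of why sparse embedding layers stress memory rather than compute, the back-of-the-envelope sketch below compares the arithmetic intensity (FLOPs per byte moved) of a sum-pooled embedding lookup with that of a dense GEMM, and estimates the footprint of a single embedding table. All sizes (table rows, embedding dimension, lookups per sample, matrix shapes) and function names are hypothetical and chosen only for illustration; they are not taken from the talk, and the model ignores caching and data reuse.

```python
# Back-of-the-envelope model; every size below is a hypothetical assumption.

def embedding_table_bytes(num_rows, dim, bytes_per_elem=4):
    """Memory footprint of one embedding table (fp32 by default)."""
    return num_rows * dim * bytes_per_elem

def embedding_pooling_intensity(lookups_per_sample, dim, bytes_per_elem=4):
    """FLOPs per byte for gathering `lookups_per_sample` rows and summing them.

    Summing L vectors of length `dim` takes (L - 1) * dim additions, while
    every gathered row has to be read from memory once.
    """
    flops = (lookups_per_sample - 1) * dim
    bytes_read = lookups_per_sample * dim * bytes_per_elem
    return flops / bytes_read

def gemm_intensity(m, n, k, bytes_per_elem=4):
    """FLOPs per byte for C = A @ B with A (m x k) and B (k x n).

    Assumes each matrix crosses the memory interface exactly once.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved

if __name__ == "__main__":
    table_gb = embedding_table_bytes(100_000_000, 64) / 1e9
    print(f"one table: {table_gb:.1f} GB")
    print(f"embedding pooling: {embedding_pooling_intensity(80, 64):.3f} FLOP/byte")
    print(f"1024^3 GEMM: {gemm_intensity(1024, 1024, 1024):.1f} FLOP/byte")
```

Under these assumptions the pooled lookup stays below 0.25 FLOP/byte regardless of embedding dimension, while the GEMM performs well over a hundred FLOPs per byte, which is the gap that leaves dense-optimized accelerators waiting on memory.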