

Contributed Talk 8
in
Workshop: Personalized Recommendation Systems and Algorithms

Accelerated Learning by Exploiting Popular Choices

Muhammad Adnan


Abstract:

Recommendation models are widely used learning models that suggest relevant items to users in e-commerce and online-advertising applications. Current recommendation models include deep-learning-based (DLRM) and time-based sequence (TBSM) models. These models use massive embedding tables to store numerical representations of items' and users' categorical features (memory bound) while also using neural networks to generate outputs (compute bound). Due to these conflicting compute and memory requirements, the training process for recommendation models is split across the CPU and GPU, for embedding and neural-network execution respectively. Such a training process naively assigns the same level of importance to each embedding entry. This paper observes that accesses into the embedding tables are heavily skewed across training inputs, with certain entries being accessed up to 10000× more often than others. This paper leverages these skewed embedding-table accesses to use GPU resources efficiently during training. To this end, it proposes a Frequently Accessed Embeddings (FAE) framework that provides three key features. First, FAE exposes a dynamic knob to the software, based on the GPU memory capacity and the input popularity index, to efficiently estimate and vary the size of the hot portions of the embedding tables. These hot embedding entries can then be stored locally on each GPU. FAE uses statistical techniques to determine the knob, which is a threshold on embedding accesses, without profiling the entire input dataset. Second, FAE pre-processes the inputs to segregate hot inputs (which access only hot embedding entries) from cold inputs into collections of hot and cold mini-batches. This ensures that each training mini-batch is either entirely hot or entirely cold, which is necessary to obtain most of the benefits.
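The first two features — picking an access-count threshold that fits the GPU's memory and splitting inputs into all-hot and all-cold mini-batches — can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the `gpu_capacity_entries` parameter, and the simple count-and-sort logic (in place of FAE's sampling-based statistical estimation) are assumptions for illustration.

```python
import numpy as np

def estimate_hot_threshold(sampled_accesses, gpu_capacity_entries):
    """Pick an access-count threshold above which an embedding entry is
    'hot', from a sampled subset of inputs (hypothetical sketch; FAE
    derives this statistically without profiling the full dataset)."""
    # Count how often each embedding index appears in the sample.
    indices, counts = np.unique(sampled_accesses, return_counts=True)
    order = np.argsort(counts)[::-1]            # most-accessed entries first
    # Keep only as many popular entries as fit in GPU memory.
    kept = order[:gpu_capacity_entries]
    threshold = counts[kept].min() if len(kept) else 0
    hot_set = set(indices[kept].tolist())
    return threshold, hot_set

def segregate_minibatches(inputs, hot_set, batch_size):
    """Split inputs into all-hot and all-cold mini-batches, so each
    batch runs either entirely on GPU or on the baseline CPU-GPU path."""
    hot = [x for x in inputs if set(x) <= hot_set]    # every lookup is hot
    cold = [x for x in inputs if not set(x) <= hot_set]
    make = lambda xs: [xs[i:i + batch_size] for i in range(0, len(xs), batch_size)]
    return make(hot), make(cold)
```

With a skewed access sample, the most popular indices fill the GPU budget and any input touching a cold index falls back to the cold path.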
Third, at runtime FAE generates a dynamic schedule for the hot and cold training mini-batches that minimizes data-transfer latency between CPU and GPU executions while maintaining model accuracy. The framework then uses the GPU(s) for hot mini-batches and the baseline CPU-GPU mode for cold mini-batches. Overall, our framework speeds up the training of recommendation models on the Kaggle, Terabyte, and Alibaba datasets by 2.34× compared to a baseline that uses Intel Xeon CPUs and Nvidia Tesla V100 GPUs, while maintaining accuracy.
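The third feature, the runtime scheduler, can be sketched as follows. This is a deliberately simplified stand-in, assuming a seeded proportional shuffle: processing all hot batches first would skew the input distribution seen by the optimizer and hurt accuracy, so hot and cold batches are interleaved; the paper's actual scheduler additionally minimizes CPU-GPU transfer latency, which is not modeled here.

```python
import random

def schedule_minibatches(hot_batches, cold_batches, seed=0):
    """Interleave hot (GPU-resident) and cold (baseline CPU-GPU)
    mini-batches into one training order (hypothetical sketch; the
    real scheduler also optimizes data-transfer latency)."""
    tagged = [("hot", b) for b in hot_batches] + [("cold", b) for b in cold_batches]
    random.Random(seed).shuffle(tagged)         # preserve the mixed distribution
    return tagged

def run_epoch(schedule, train_hot, train_cold):
    """Dispatch each mini-batch to the execution mode its tag selects."""
    for kind, batch in schedule:
        if kind == "hot":
            train_hot(batch)     # embeddings already resident on the GPU
        else:
            train_cold(batch)    # baseline: embeddings on CPU, MLP on GPU
```

Here `train_hot` and `train_cold` are placeholders for the two training paths the abstract describes.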

Paper Link: https://arxiv.org/pdf/2103.00686.pdf

Slides: http://prashantnair.bitbucket.io/arxiv/hotembeddings_slides.pdf