Workshop

Personalized Recommendation Systems and Algorithms

Udit Gupta, Carole-Jean Wu, Gu-Yeon Wei, David Brooks

Abstract:

Personalized recommendation is the task of recommending content to users based on their preferences and history. Providing personalized content is crucial for many emerging applications including health care, fitness, education, food, and entertainment. Today, accurate and efficient recommendation of items powers many Internet services such as online search, marketing, e-commerce, and video streaming. In fact, recent estimates show that recommendation systems drive many Internet businesses. In 2018, recommendation systems were estimated to drive up to 35% of Amazon’s revenue, 75% of movies watched on Netflix, and 60% of videos watched on YouTube. In addition, serving personalized recommendation models accounts for 80% of all AI inference cycles in Facebook’s datacenter.

While the machine learning and systems research community has devoted significant effort to optimize AI and in particular deep neural networks, the majority of work studies AI-enabled perception, speech recognition, and natural language processing. As a result, efforts across machine learning and systems researchers have primarily focused on convolutional neural networks (CNNs) and recurrent neural networks (RNNs). However, not all services use CNNs and RNNs. In fact, as deep learning forms the backbone of many Internet services, AI for personalized recommendation is arguably one of the most impactful, widely used, and understudied applications of DNNs.

In addition to their importance, modern deep learning solutions for personalized recommendation impose unique compute, memory access, and storage requirements compared to CNNs and RNNs. However, in 2019, fewer than 2% of research papers were devoted to optimizing systems for recommendation engines.

To address this underinvestment from the research community, we propose a venue to discuss, share, and foster research into personalized recommendation systems and algorithms.

Schedule

Fri 6:15 a.m. - 6:30 a.m.
Welcome to the 3rd PeRSonAl workshop (Introduction)
Udit Gupta, Carole-Jean Wu
Fri 6:30 a.m. - 7:00 a.m.

As machine learning is increasingly being deployed in real-world applications, it has become critical to ensure that stakeholders understand and trust these models. End users must have a clear understanding of the model behavior so they can diagnose errors and potential biases in these models, and decide when and how to employ them. However, the most accurate models deployed in practice are not interpretable, making it difficult for users to understand where the predictions are coming from, and thus, difficult to trust. Recent work on explanation techniques in machine learning offers an attractive solution: these techniques provide intuitive explanations for “any” machine learning model by approximating complex machine learning models with simpler ones. In this talk, I will discuss several popular post hoc explanation methods, and shed light on their advantages and shortcomings. I will conclude the tutorial by discussing implications for recommender systems and highlighting open research problems in the field.

Himabindu Lakkaraju
Fri 7:00 a.m. - 7:30 a.m.

To satisfy machine learning (ML) practitioners’ insatiable demand for higher processing power, computer architects have been at the forefront of developing “accelerated” computing solutions for ML (e.g., Google TPUs or NVIDIA Tensor Cores) that changed the landscape of the computing industry. With performance often quantified in TOPS (tera-operations per second) and TOPS/Watt, the market has been in an arms race to design the fastest and most energy-efficient ML accelerator. As such, the past five years have seen a remarkable improvement in the raw compute throughput and energy efficiency delivered by the latest ML accelerators.

Ironically, because computer architects have done such an amazing job addressing the computation bottlenecks of ML, compute primitives accelerated by conventional ML accelerators optimized for GEMM (general matrix multiplication) are becoming relatively less of a concern in several emerging ML applications. In particular, personalized recommendation models for consumer-facing products (e.g., e-commerce, ads) employ “sparse” embedding layers, which stand out for their high memory capacity and bandwidth demands, rendering conventional dense-optimized TPUs/GPUs suboptimal for training and deploying recommendation models. In this talk, I will share our memory-centric approach to designing system architectures for recommendation models, overcoming several key challenges of prior, compute-centric AI systems.
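
The contrast between dense and sparse layers can be made concrete with a rough roofline-style estimate. The sketch below is not from the talk; it assumes fp32 values, sum pooling over the looked-up rows, and ideal caching for GEMM (each matrix moved exactly once), and all function names are hypothetical.

```python
def embedding_bag_intensity(lookups, dim, bytes_per_elem=4):
    """FLOPs per byte for one sum-pooled embedding lookup (assumed model)."""
    flops = (lookups - 1) * dim                 # element-wise adds to pool the rows
    bytes_read = lookups * dim * bytes_per_elem  # every looked-up row is read once
    return flops / bytes_read


def gemm_intensity(m, n, k, bytes_per_elem=4):
    """FLOPs per byte for an (m x k) @ (k x n) GEMM with ideal data reuse."""
    flops = 2 * m * n * k                        # one multiply and one add per term
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved


def table_bytes(rows, dim, bytes_per_elem=4):
    """Memory capacity needed to hold one embedding table."""
    return rows * dim * bytes_per_elem
```

Under these assumptions an embedding lookup stays below 0.25 FLOPs per byte regardless of table size, while a large square GEMM reaches hundreds of FLOPs per byte, which is one way to see why embedding layers stress memory capacity and bandwidth rather than compute throughput.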

Minsoo Rhu
Fri 7:30 a.m. - 7:45 a.m.
MERCI: Efficient Embedding Reduction on Commodity Hardware via Sub-Query Memoization (Contributed Talk 1)
Yejin Lee
Fri 7:45 a.m. - 8:00 a.m.
Erasure Coding Based Fault Tolerance for Recommendation Model Training (Contributed Talk 2)
Kaige Liu
Fri 8:00 a.m. - 8:15 a.m.
Elliot: A Comprehensive and Rigorous Framework For Reproducible Recommender Systems Evaluation (Contributed Talk 3)
Vito W Anelli, Claudio Pomo
Fri 8:15 a.m. - 8:30 a.m.
Optimizing Deep Learning Recommender Systems Training on CPU Cluster Architectures (Contributed Talk 4)
Dhiraj Kalamkar
Fri 8:30 a.m. - 8:45 a.m.
Main-Memory Acceleration for Bandwidth-Bound Deep Learning Inference (Contributed Talk 5)
Benjamin Cho, Mattan Erez
Fri 8:45 a.m. - 9:00 a.m.

Neural personalized recommendation is the cornerstone of a wide collection of cloud services and products, constituting a significant share of the compute demand on cloud infrastructure. Thus, improving the execution efficiency of neural recommendation directly translates into infrastructure capacity savings. In this paper, we devise a novel end-to-end modeling infrastructure, DeepRecInfra, that adopts an algorithm and system co-design methodology to custom-design systems for recommendation use cases. Leveraging the insights from the recommendation characterization, a new dynamic scheduler, DeepRecSched, is proposed to maximize latency-bounded throughput by taking into account characteristics of inference query size and arrival patterns, recommendation model architectures, and underlying hardware systems. By doing so, system throughput is doubled across the eight industry-representative recommendation models. Finally, design, deployment, and evaluation in an at-scale production datacenter show over 30% latency reduction across a wide variety of recommendation models running on hundreds of machines.

Udit Gupta
Fri 10:00 a.m. - 11:00 a.m.

In this talk we'll explore three lines of work at the intersection of recommender systems and natural language processing. We'll start by introducing "traditional" recommender systems that leverage text as side-information, either to improve predictive performance or to aid interpretability. Second we'll discuss recent methodological advances in recommendation that borrow methods from NLP as a means of modeling interaction sequences (e.g. models based on word2vec, RNNs, Transformer, etc.). Finally we'll discuss personalized language generation, which borrows ideas from recommender systems to capture patterns of variation in text (subjectivity, context, etc.) and is driving emerging applications such as personalized dialog systems and conversational recommendation.

Julian McAuley
Fri 11:00 a.m. - 11:30 a.m.

The hardware and software that led to the revolution of deep learning were built during the era of computer vision. Differences in architecture and data between that domain and recommenders made the HW/SW stack a poor fit for deep learning based recommender systems, and the experience of many who explored recommendation on the GPU early on, myself included, was bad. In this talk we'll explore changes in GPU hardware within the last generation that make it much better suited to the recommendation problem, along with improvements on the software side that take advantage of optimizations only possible in the recommendation domain. A new era of faster ETL, training, and inference is coming to the RecSys space, and this talk will walk through some of the patterns of optimization that guide the tools we're building to make recommenders faster and easier to use on the GPU.

Even Oldridge
Fri 12:00 p.m. - 12:30 p.m.

This talk will present the low-precision techniques, analysis and tool chain we explored to optimize the performance of production scale recommendation models while maintaining the stringent accuracy requirements. We also share the unique challenges and learnings from the deployment of Facebook’s production recommendation models in low precision on existing hardware platforms including CPUs and accelerators. We hope that the methodologies we are sharing are applicable to many ML domains and low precision architectures in general.

Summer Deng
Fri 12:30 p.m. - 1:00 p.m.

This talk will focus on practical and real-world considerations involved with maximizing training speed of deep learning recommender engines. Training deep learning recommenders at scale introduces an interesting set of challenges, because of potential imbalances in compute and communication resources in many training platforms. Our experience in benchmarking the DLRM workload for MLPerf on TensorFlow/TPUs will be used as an exemplar case. In addition, we will use the lessons learned to suggest best practices for efficient design points when tuning recommender architectures.

Tayo Oguntebi
Fri 1:00 p.m. - 1:15 p.m.
Cross-Stack Workload Characterization of Deep Recommendation Systems (Contributed Talk 7)
Samuel Hsia
Fri 1:15 p.m. - 1:30 p.m.

Recommendation models are commonly used learning models that suggest relevant items to a user for e-commerce and online advertisement-based applications. Current recommendation models include deep-learning based (DLRM) and time-based sequence (TBSM) models. These models use massive embedding tables to store numerical representations of items’ and users’ categorical variables (memory bound) while also using neural networks to generate outputs (compute bound). Due to these conflicting compute and memory requirements, the training process for recommendation models is divided across CPU and GPU for embedding and neural network executions, respectively. Such a training process naively assigns the same level of importance to each embedding entry. This paper observes that some training inputs and their accesses into the embedding tables are heavily skewed, with certain entries being accessed up to 10000× more often. This paper leverages skewed embedding table accesses to use GPU resources efficiently during training. To this end, this paper proposes a Frequently Accessed Embeddings (FAE) framework that provides three key features. First, it exposes a dynamic knob to the software based on the GPU memory capacity and the input popularity index, to efficiently estimate and vary the size of the hot portions of the embedding tables. These hot embedding tables can then be stored locally on each GPU. FAE uses statistical techniques to determine the knob, which is a threshold on embedding accesses, without profiling the entire input dataset. Second, FAE pre-processes the inputs to segregate hot inputs (which only access hot embedding entries) and cold inputs into a collection of hot and cold mini-batches. This ensures that a training mini-batch is either entirely hot or entirely cold, to obtain most of the benefits. Third, at runtime FAE generates a dynamic schedule for the hot and cold training mini-batches that minimizes data transfer latency between CPU and GPU executions while maintaining the model accuracy. The framework execution then uses the GPU(s) for hot input mini-batches and a baseline CPU-GPU mode for cold input mini-batches. Overall, our framework speeds up the training of the recommendation models on Kaggle, Terabyte, and Alibaba datasets by 2.34× as compared to a baseline that uses Intel Xeon CPUs and NVIDIA Tesla V100 GPUs, while maintaining accuracy.
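
The hot/cold segregation step described above can be illustrated with a minimal sketch. It is not the FAE implementation: it counts accesses over the full input instead of using the statistical estimation the abstract mentions, takes the threshold as a given parameter, and uses hypothetical function names throughout.

```python
from collections import Counter


def classify_hot_entries(inputs, threshold):
    """Return embedding indices accessed at least `threshold` times.

    Simplification: counts over the whole input, whereas the abstract says
    FAE estimates the threshold without profiling the entire dataset.
    """
    counts = Counter(idx for sample in inputs for idx in sample)
    return {idx for idx, count in counts.items() if count >= threshold}


def split_hot_cold(inputs, hot_entries):
    """A sample is hot only if every index it touches is a hot entry."""
    hot, cold = [], []
    for sample in inputs:
        if all(idx in hot_entries for idx in sample):
            hot.append(sample)
        else:
            cold.append(sample)
    return hot, cold


def make_minibatches(samples, batch_size):
    """Group samples so each mini-batch is entirely hot or entirely cold."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
```

In this sketch, hot mini-batches only touch entries that could be kept resident in GPU memory, while cold mini-batches would fall back to the baseline CPU-GPU path.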

Paper Link: https://arxiv.org/pdf/2103.00686.pdf

Slides: http://prashantnair.bitbucket.io/arxiv/hotembeddings_slides.pdf

Muhammad Adnan
Fri 1:30 p.m. - 1:45 p.m.

Recommendation systems have high memory capacity and bandwidth requirements. Disaggregated memory is an upcoming technology that can improve memory utilization, increase memory capacity and bandwidth and allow independent compute and memory scaling. We investigate the use of disaggregated memory for recommenders and we posit that this new technology can provide significant benefits for recommenders’ training and inference.

Talha Imran
Fri 1:45 p.m. - 2:00 p.m.

Modern deep learning models can represent arbitrary objects as vectors, also known as embeddings. Software applications can use these deep learning models and their respective embeddings to power a variety of use cases, including personalization, recommendation systems, image search, anomaly detection, and more. To date, software engineers could build these systems by integrating open source k-nearest neighbor libraries with an off-the-shelf web server. However, using such a solution presents serious challenges in the face of scalability, latency, and flexibility. To address these challenges, we built Pinecone, providing similarity search as a service.
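
For reference, the baseline that such a service replaces can be sketched as an exhaustive nearest-neighbor scan over stored embeddings. This is not Pinecone's API; the function names and the choice of cosine similarity are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, index, k):
    """Return the keys of the k stored vectors most similar to `query`.

    `index` maps item keys to embedding vectors. Every vector is scored,
    so the cost grows linearly with the number of stored items.
    """
    ranked = sorted(index.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [key for key, _ in ranked[:k]]
```

The linear scan is exact but touches every vector per query, which hints at the scalability and latency challenges the abstract refers to.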

Amir Sadoughi
Fri 2:00 p.m. - 2:15 p.m.

Deep learning recommendation models have grown to the terabyte scale. Traditional serving schemes, which load an entire model onto a single server, are unable to support this scale. One approach to supporting this scale is distributed serving, or distributed inference, which divides the memory requirements of a single large model across multiple servers.

This work is a first step for the systems research community to develop novel model-serving solutions, given the huge system design space. Large-scale deep recommender systems are a novel workload and vital to study, as they consume up to 79% of all inference cycles in the data center. To that end, this work describes and characterizes scale-out deep learning recommendation inference using data-center serving infrastructure. This work specifically explores latency-bounded inference systems, in contrast to the throughput-oriented training systems of other recent works. We find that the latency and compute overheads of distributed inference are largely a result of a model's static embedding table distribution and the sparsity of input inference requests. We further evaluate three embedding table mapping strategies on three DLRM-like models and identify challenging design trade-offs in terms of end-to-end latency, compute overhead, and resource efficiency. Overall, we observe only a marginal latency overhead when data-center scale recommendation models are served in a distributed manner: P99 latency increases by only 1% in the best-case configuration. The latency overheads are largely a result of the commodity infrastructure used and the sparsity of embedding tables. Even more encouragingly, we also show how distributed inference can account for efficiency improvements in data-center scale recommendation serving.
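
To make the idea of an embedding table mapping concrete, the sketch below places tables on shards and reports which shards a request must contact. The abstract does not name its three mapping strategies, so the greedy capacity-balancing rule here is purely a hypothetical example, as are the function and table names.

```python
def assign_tables_greedy(table_sizes, num_shards):
    """Place each table, largest first, on the currently least-loaded shard.

    Hypothetical strategy for illustration; not one of the strategies
    evaluated in the talk.
    """
    loads = [0] * num_shards
    placement = {}
    # Sort by descending size, then by name so the result is deterministic.
    for name, size in sorted(table_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        shard = loads.index(min(loads))
        placement[name] = shard
        loads[shard] += size
    return placement, loads


def shards_for_request(request_tables, placement):
    """Shards that must be contacted to serve the tables a request touches."""
    return sorted({placement[table] for table in request_tables})
```

A request that touches tables on several shards pays for more remote calls, so a static placement like this one interacts directly with request sparsity when it comes to latency and compute overhead.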

Mike Lui
Fri 2:15 p.m. - 2:30 p.m.

Modern recommendation systems rely on real-valued embeddings of categorical features. Increasing the dimension of embedding vectors improves model accuracy but comes at a high cost to model size. We introduce a multi-layer embedding training (MLET) scheme that trains embeddings via a sequence of linear layers to derive a superior model accuracy vs. size trade-off. Our approach is fundamentally based on the ability of factorized linear layers to produce embeddings superior to those of a single linear layer. Harnessing recent results on the dynamics of backpropagation in linear neural networks, we explain the superior performance obtained by multi-layer embeddings by their tendency to have lower effective rank. We show that substantial advantages are obtained in the regime where the width of the hidden layer is much larger than the final embedding vector dimension. Crucially, at the conclusion of training, we convert the two-layer solution into a single-layer one: as a result, the inference-time model size is unaffected by MLET. We prototype MLET across seven different open-source recommendation models. We show that it allows a reduction in vector dimension of up to 16x, and 5.8x on average, across the models. This reduction correspondingly improves inference memory footprint while preserving model accuracy.
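
The conversion from a two-layer embedding to a single-layer one follows from linearity: looking up a row of the first matrix and multiplying by the second gives the same vector as looking up the corresponding row of their product. The sketch below illustrates only this collapse step with plain Python lists; the matrix names `w1` and `w2` and the helper functions are assumptions, and training is not shown.

```python
def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x d) matrix, both as nested lists."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def two_layer_lookup(index, w1, w2):
    """Training-time view: take the hidden row from w1, then project with w2."""
    hidden = w1[index]
    return [sum(hidden[p] * w2[p][j] for p in range(len(w2)))
            for j in range(len(w2[0]))]


def collapse(w1, w2):
    """Inference-time view: one (n x d) table equal to the product w1 @ w2."""
    return matmul(w1, w2)
```

After the collapse, only the n x d table needs to be stored, so the hidden width used during training does not affect the inference-time model size.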

Benjamin Ghaemmaghami, Zihao Deng
Fri 2:30 p.m. - 2:45 p.m.

Click-Through Rate (CTR) prediction is one of the most important machine learning tasks in recommender systems, driving personalized experience for billions of consumers. Neural architecture search (NAS), as an emerging field, has demonstrated its capabilities in discovering powerful neural network architectures, which motivates us to explore its potential for CTR predictions. Due to 1) diverse unstructured feature interactions, 2) heterogeneous feature space, and 3) high data volume and intrinsic data randomness, it is challenging to construct, search, and compare different architectures effectively for recommendation models. To address these challenges, we propose an automated interaction architecture discovery framework for CTR prediction named AutoCTR. By modularizing simple yet representative interactions as virtual building blocks and wiring them into a space of directed acyclic graphs, AutoCTR performs evolutionary architecture exploration with learning-to-rank guidance at the architecture level and achieves acceleration using a low-fidelity model. Empirical analysis demonstrates the effectiveness of AutoCTR on different datasets compared to human-crafted architectures. The discovered architecture also enjoys generalizability and transferability among different datasets.

QQ Song
Fri 2:45 p.m. - 3:00 p.m.
Closing session (Closing)
Udit Gupta, Carole-Jean Wu