Poster

FLStore: Efficient Federated Learning Storage for non-training workloads

Ahmad Faraz Khan · Samuel Fountain · Ahmed Mohamed Abdelmoniem Sayed · Ali R. Butt · Ali Anwar


Abstract: Federated Learning (FL) is an approach to privacy-preserving Machine Learning (ML) that enables model training across multiple clients without centralized data collection; an aggregator server coordinates training, aggregates model updates, and stores metadata across rounds. Beyond training, a substantial part of FL systems consists of non-training workloads such as scheduling, personalization, clustering, debugging, and incentivization. Most existing systems rely on the aggregator to handle these workloads and on cloud services for data storage. This results in high latency and increased costs, because non-training workloads operate on large volumes of metadata, including weight parameters from client updates, hyperparameters, and aggregated updates across rounds. We propose FLStore, a serverless framework for efficient FL non-training workloads and storage. FLStore unifies the data and compute planes on a serverless cache, enabling locality-aware execution via tailored caching policies that reduce latency and costs. In our evaluations, compared to a cloud object store based aggregator server, FLStore reduces average per-request latency by 71% and costs by 92.45%, with peak improvements of 99.7% and 98.8%, respectively. Compared to an in-memory cloud cache based aggregator server, FLStore reduces average latency by 64.6% and costs by 98.83%, with peak improvements of 98.8% and 99.6%, respectively. FLStore integrates seamlessly with existing FL frameworks with minimal modifications, while also being fault-tolerant and highly scalable.
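The sketch below illustrates the core idea of unifying the data and compute planes: metadata requests are routed to the serverless function instance whose local cache already holds the data, so the non-training workload executes where the data lives instead of round-tripping to a remote object store. This is a minimal illustration under assumed names (FunctionCache, LocalityAwareRouter, etc.); it is not FLStore's actual API, and FLStore's tailored caching policies may differ from the simple LRU and hash-based placement used here.

```python
# Hypothetical sketch of locality-aware routing over a serverless cache,
# in the spirit of FLStore's unified data/compute plane. All names are
# illustrative assumptions, not FLStore's real interface.

import hashlib
from collections import OrderedDict


class FunctionCache:
    """In-memory LRU cache co-located with one serverless function instance."""

    def __init__(self, capacity: int = 128):
        self.capacity = capacity
        self.store: OrderedDict[str, bytes] = OrderedDict()

    def get(self, key: str):
        if key in self.store:
            self.store.move_to_end(key)  # mark as recently used
            return self.store[key]
        return None

    def put(self, key: str, value: bytes) -> None:
        self.store[key] = value
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used


class LocalityAwareRouter:
    """Routes each metadata request to the function whose cache holds the
    data, so compute runs where the data already lives and a cache hit
    avoids the trip to the cloud object store."""

    def __init__(self, num_functions: int):
        self.caches = [FunctionCache() for _ in range(num_functions)]

    def _owner(self, key: str) -> int:
        # Deterministic assignment of metadata keys (e.g.
        # "round_42/client_7/weights") to function instances via hashing.
        digest = hashlib.sha256(key.encode()).hexdigest()
        return int(digest, 16) % len(self.caches)

    def run(self, key: str, compute, fetch_from_store):
        cache = self.caches[self._owner(key)]
        value = cache.get(key)
        if value is None:  # cache miss: fall back to the object store
            value = fetch_from_store(key)
            cache.put(key, value)
        return compute(value)  # execute the non-training workload in place


# Example: a debugging workload inspecting a cached client update.
router = LocalityAwareRouter(num_functions=4)
result = router.run(
    "round_42/client_7/weights",
    compute=lambda blob: len(blob),               # stand-in analysis step
    fetch_from_store=lambda key: b"\x00" * 1024,  # stand-in object-store read
)
print(result)  # 1024
```

Under this assumed design, repeated requests for the same round's metadata (common in debugging or clustering workloads) hit the same function's cache, which is the locality effect the reported latency and cost reductions rely on.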
