Machine learning models, especially large language models such as GPT-3 and generative models for image synthesis such as Stable Diffusion, are today trained primarily in centralized data centers, on thousands of GPUs running for weeks, if not months. Inference is not cheap either: given their staggering size, these models are often served on expensive cutting-edge GPUs hosted in a centralized data center. This centralized paradigm is not only expensive but also greatly limits accessibility for the rest of the research community. Inspired by the success of volunteer computing and federated learning projects such as SETI@home, Folding@home, and FedML, decentralized and collaborative machine learning is a promising alternative. If we could exploit under-utilized, globally geo-distributed GPUs and edge devices, we would share one of the most powerful “supercomputers” in the world and could use it for the next generation of open models!
In recent years, there has been significant progress in decentralized and collaborative learning. This includes new
theoretical and algorithmic developments (e.g., [1, 2, 3, 4]), and practical deployments including Training
Transformer Together [5] and Petals [6]. Together with recent advancements in cryptography, secure computation,
and blockchain technology, we see a path to realizing this decentralized vision for machine learning!
However, there are still many challenging technical problems in front of us, including (1) efficient training and
inference over slow networks, (2) practical verification over untrusted devices, (3) providing privacy and security
guarantees, (4) developing incentive mechanisms, and (5) real-world deployment on blockchains. Tackling these
challenges requires expertise and collaboration from many different communities, not only from machine learning,
systems, security, and privacy, but also from economics, blockchain, and Web3. This workshop aims to bring
leading experts from these different communities together, discuss how these areas can come together to enable a
decentralized learning paradigm, and lay out important directions for future work and concrete cross-community
collaborations.
The topics of this workshop include but are not limited to:
● New algorithms for decentralized and collaborative learning
● Communication-efficient learning algorithms
● System design and optimizations for decentralized learning
● Learning over untrusted and potentially malicious devices
● Verification of computation in the context of decentralized learning
● Mechanism design and implementation for decentralized learning
● Incentive schemes (e.g., environmental, economic, accessibility-related) for collaborative learning
● Security and privacy for decentralized learning
● Blockchain and Web3 technology for decentralized learning
Thu 5:55 a.m. - 6:00 a.m. | Opening Remarks
Thu 6:00 a.m. - 6:40 a.m. | Building Machine Learning Models like Open-Source Software with git-theta [Colin Raffel & Nikhil Kandpal] (Invited Talk)
Pre-trained models have become a cornerstone of machine learning because they are applicable to a huge range of downstream applications. However, these models are typically created by resource-rich research groups that unilaterally decide how a given model should be built, trained, and released, after which point it is never updated. In contrast, open-source development has demonstrated that a community of contributors can work together to iteratively build complex and widely used software. This kind of large-scale distributed collaboration is made possible by a mature set of tools, including version control and package management. This talk will discuss our research that aims to make it possible to build machine learning models the way open-source software is developed. After briefly discussing our work on merging models, model patches, and modular architectures, we will provide a thorough overview of git-theta, our version control system for model parameters. git-theta integrates into the standard git workflow, supports cheaply communicable patches, and natively handles automatic merging. The talk will conclude with a brief demo of git-theta's functionality.
Thu 6:40 a.m. - 7:20 a.m. | Contribution and Fairness-Aware Federated Learning [Han Yu] (Invited Talk)
Federated Learning (FL) is an emerging area of AI focused on training machine learning models in a privacy-preserving manner. The success of FL, especially in open collaboration settings, rests on being able to continuously attract high-quality data owners to participate. At the same time, this opens FL up to adversaries trying to exploit other parties' sensitive private information. It is important to adopt an ecosystem-management approach to building trust and controlling risk in FL. In this talk, I will share some attempts we have made at the Trustworthy Federated Ubiquitous Learning (TrustFUL) Research Lab in this general direction, including data valuation under FL settings, fair treatment of FL participants, and studying user reactions to incentive schemes developed for federated learning.
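To make the idea of data valuation in FL concrete, here is a minimal Python sketch, not the lab's actual method: clients are valued by the utility lost when their data is left out (a leave-one-out approximation of contribution), and such values can then weight a FedAvg-style aggregation. The function names and the toy coverage utility are illustrative assumptions.

```python
import numpy as np

def federated_average(updates, weights):
    """Weighted FedAvg-style combination of client model updates."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()              # normalize contributions
    return sum(w * np.asarray(u, dtype=float) for w, u in zip(weights, updates))

def leave_one_out_value(utility, clients):
    """Value each client by the utility lost when it is excluded."""
    full = utility(clients)
    return {c: full - utility([x for x in clients if x != c]) for c in clients}

# Toy utility: number of distinct examples a coalition of clients covers.
data = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2}}
coverage = lambda coalition: len(set().union(*[data[c] for c in coalition]))

values = leave_one_out_value(coverage, list(data))   # {'a': 0, 'b': 1, 'c': 0}
```

Here client "b" is the only one holding example 4, so it is the only client with positive leave-one-out value; in practice utility would be validation accuracy of the jointly trained model rather than raw coverage.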
Thu 7:40 a.m. - 8:20 a.m. | Security and Robustness of Collaborative Learning Systems [Anwar Hithnawi] (Invited Talk)
In recent years, secure collaborative machine learning paradigms have emerged as a viable option for sensitive applications. By eliminating the need to centralize data, these paradigms protect data sovereignty and reduce the risks associated with large-scale data collection. However, they also expose the learning process to active attackers, amplifying robustness issues. In this talk, I will discuss the security and robustness challenges of secure collaborative learning systems, present our efforts to mitigate some of these issues, and highlight why a definitive solution to robustness in these systems is challenging.
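One standard building block for robustness in this setting, shown here as an illustrative example rather than the speaker's approach, is a Byzantine-tolerant aggregation rule such as the coordinate-wise median, which bounds the influence of a minority of corrupted updates:

```python
import numpy as np

def coordinate_median(updates):
    """Aggregate client updates by coordinate-wise median; a minority of
    arbitrarily corrupted updates cannot drag the result far from the
    honest majority."""
    return np.median(np.stack([np.asarray(u, dtype=float) for u in updates]), axis=0)

# Four honest updates near [1, 1] plus one poisoned update.
updates = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.05, 0.95], [100.0, -100.0]]
robust = coordinate_median(updates)   # stays close to [1, 1]
naive = np.mean(updates, axis=0)      # dragged toward the outlier
```

The naive mean is pulled far off by the single poisoned update, while the median stays near the honest cluster; this robustness is exactly what becomes hard to combine with secure (encrypted) aggregation, since the server can no longer inspect individual updates.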
Thu 8:20 a.m. - 9:00 a.m. | Poisoning Web-Scale Training Datasets is Practical [Florian Tramèr] (Invited Talk)
Deep learning models are often trained on distributed, web-scale datasets crawled from the internet. We introduce two new dataset poisoning attacks that intentionally introduce malicious examples to degrade a model's performance. Our attacks are immediately practical and could, today, poison 10 popular datasets. We will discuss how the attacks work; why (we think) they haven't been exploited yet; and why defending against them comes with non-negligible costs.
Thu 11:00 a.m. - 11:40 a.m. | Example Selection for Distributed Learning [Chris De Sa] (Invited Talk)
Training example order in SGD has long been known to affect the convergence rate. Recent results show that accelerated rates are possible in a variety of cases for permutation-based sample orders, in which each example from the training set is used once before any example is reused. This talk will cover a line of work in my lab on decentralized learning and sample-ordering schemes. We will discuss the limits of the classic gossip algorithm and of random-reshuffling schemes, and explore how both can be improved to make SGD converge faster, in theory and in practice, with little overhead.
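For reference, here is a minimal sketch of the random-reshuffling order the abstract refers to; the function and toy problem are illustrative assumptions, not the talk's algorithms. Each epoch draws a fresh permutation, so every example is used exactly once before any is reused.

```python
import numpy as np

def sgd_random_reshuffle(grad, x0, data, lr=0.1, epochs=50, seed=0):
    """SGD with random reshuffling: each epoch visits every example
    exactly once, in a fresh random permutation (no replacement)."""
    rng = np.random.default_rng(seed)
    x = float(x0)
    for _ in range(epochs):
        for i in rng.permutation(len(data)):   # without-replacement order
            x -= lr * grad(x, data[i])
    return x

# Toy least-squares problem: per-example gradient of 0.5*(x - d)^2 is x - d,
# so the global minimizer is the mean of the data (here, 2.0).
data = [1.0, 2.0, 3.0]
x_final = sgd_random_reshuffle(lambda x, d: x - d, x0=0.0, data=data)
```

Compared with sampling examples with replacement, this ordering makes per-epoch progress more uniform, which is what enables the accelerated rates mentioned above.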
Thu 11:40 a.m. - 12:20 p.m. | DataComp: In search of the next generation of multimodal datasets [Ludwig Schmidt] (Invited Talk)
Multimodal datasets are a critical component of recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources, then evaluate their dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai.
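The filtering track described above amounts to ranking candidate pairs by some quality signal and keeping the best fraction. A minimal Python sketch, assuming image-text similarity scores (e.g., CLIP scores) have already been computed; the function name and data are hypothetical stand-ins:

```python
def filter_top_fraction(pairs, scores, keep_frac=0.3):
    """Keep the top fraction of image-text pairs, ranked by a
    precomputed image-text similarity score."""
    ranked = sorted(zip(scores, pairs), key=lambda t: t[0], reverse=True)
    k = max(1, int(len(ranked) * keep_frac))
    return [pair for _, pair in ranked[:k]]

candidates = ["p1", "p2", "p3", "p4", "p5"]    # stand-ins for (image, caption) pairs
clip_scores = [0.10, 0.90, 0.50, 0.30, 0.70]   # hypothetical precomputed scores
kept = filter_top_fraction(candidates, clip_scores, keep_frac=0.4)   # ['p2', 'p5']
```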
Thu 12:40 p.m. - 1:20 p.m. | Accommodating LLM training over decentralized computational resources [Binhang Yuan] (Invited Talk)
Training algorithms for large language models are often communication-heavy. As a result, these models are trained predominantly in centralized environments such as data centers with fast network connections. This strong dependency on fast interconnects is becoming the limiting factor for further scaling in the data-center setting, and it rules out alternative decentralized infrastructures such as spot instances and geo-distributed volunteer compute. In this talk, I will discuss our research on communication-efficient distributed learning and our current effort to train foundation models in a decentralized way.
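One common communication-efficient primitive in this space, given here as an illustrative example rather than the specific method from the talk, is top-k gradient sparsification: each worker transmits only its k largest-magnitude gradient entries per step, cutting communication volume dramatically over slow links.

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep only the k largest-magnitude gradient entries; the rest are
    zeroed before communication (a full implementation would also keep
    the residual in a local error-feedback buffer)."""
    grad = np.asarray(grad, dtype=float)
    idx = np.argsort(np.abs(grad))[-k:]   # indices of the top-k entries
    sparse = np.zeros_like(grad)
    sparse[idx] = grad[idx]
    return sparse

g = [0.1, -5.0, 0.3, 2.0, -0.2]
compressed = topk_sparsify(g, k=2)        # [0., -5., 0., 2., 0.]
```

Sending only the (index, value) pairs of the surviving entries reduces per-step traffic from the full model dimension to O(k), which is what makes training over slow, geo-distributed networks plausible.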