Large language models are fluent text generators, but they struggle to generate factually correct content, even when paired with tools such as information retrieval and agent programming frameworks. In this talk, I'll discuss Demonstrate-Search-Predict (DSP), a system we are developing at Stanford to let users build highly accurate applications using LLMs and external tools. DSP offers a declarative programming model, in which users write an application using control flow in Python plus calls to ML components such as an LLM or a neural information retrieval system. Given such an application and a small amount of data, DSP systematically improves the application by tuning its ML components for high-quality results: automatically generating better prompts for each model involved, fine-tuning models, and so on. We show that even with a few tens of examples, DSP can match state-of-the-art solutions on multiple knowledge-intensive tasks, and that it can then systematically improve both task performance and computational efficiency without requiring manual tuning or prompt engineering from the developer. We will also discuss and compare with other emerging approaches for turning LLMs into reliable software components.
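The declarative programming model described above (Python control flow composed with calls to ML components) can be pictured with a small, self-contained sketch. The `retrieve` and `lm` functions below are toy stand-ins, not the real DSP API, so the pipeline runs on its own:

```python
# A hypothetical sketch of a DSP-style program: ordinary Python control flow
# plus calls to ML components. `retrieve` and `lm` are stubs standing in for
# a neural retriever and an LLM; they are NOT the actual DSP interfaces.

def retrieve(question, k=2):
    # Stand-in for a neural retriever (e.g., something ColBERT-like):
    # returns the top-k passages for the question from a tiny fixed corpus.
    corpus = [
        "Paris is the capital of France.",
        "The Seine flows through Paris.",
    ]
    return corpus[:k]

def lm(prompt):
    # Stand-in for an LLM call; a real system would query a hosted model.
    return "Paris"

def answer(question):
    # Search: gather supporting passages; Predict: prompt the LM with them.
    passages = retrieve(question)
    context = "\n".join(passages)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return lm(prompt)

print(answer("What is the capital of France?"))  # -> Paris
```

The point of the declarative framing is that a system like DSP can rewrite or tune the pieces inside `answer` (prompts, demonstrations, fine-tunes) without the developer changing the program's structure.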
Matei Zaharia is an Associate Professor of Computer Science at Stanford (moving to UC Berkeley later this year) and Chief Technologist and Cofounder of Databricks. His research has spanned distributed systems, databases, security, and machine learning, with a recent focus on systems for machine learning, natural language processing, and information retrieval. Matei started and contributed to multiple widely used open source projects, including Apache Spark (his PhD project at UC Berkeley), MLflow, Dolly, Delta Lake, and ColBERT. His research has been recognized through the 2014 ACM Doctoral Dissertation Award, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE).
Modern NLP runs on Transformers. Large language models are possible because of systems successes in making Transformers bigger, faster, and longer-range. However, five years after the advent of BERT and GPT, it remains an open question whether self-attention, the central routing component of Transformers, is essential to their success in pretraining, or whether it is worth developing large-scale systems for alternative approaches. Inspired by an off-hand wager on this topic (https://www.isattentionallyouneed.com), this talk will give an overview of recent work exploring alternative approaches to routing in large-scale NLP architectures. After giving background on the best practices and context of modern NLP, I will describe alternative approaches, focusing primarily on static methods based on state-space models (SSMs) and long-range convolutions. I will conclude by discussing the current empirical results and theoretical properties of these models, as well as paths for their future systems development as competitive technologies.
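As a rough illustration of the SSM-based routing mentioned above, here is a toy linear state-space recurrence in plain NumPy. The matrices are arbitrary placeholders, not a trained parameterization such as S4; the point is only that token mixing happens through a fixed recurrence (equivalently, a long convolution) rather than through attention:

```python
# Toy linear state-space model (SSM) scan: x[t] = A x[t-1] + B u[t], y[t] = C x[t].
# Illustrative matrices only; real SSM layers (e.g., S4) use careful
# parameterizations and compute this recurrence as a long convolution.
import numpy as np

def ssm_scan(A, B, C, u):
    # u: (seq_len, d_in); returns y: (seq_len, d_out)
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        x = A @ x + B @ u_t      # state update (recurrent view)
        ys.append(C @ x)         # readout
    return np.stack(ys)

rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)              # a stable state transition
B = rng.normal(size=(4, 1))
C = rng.normal(size=(1, 4))
u = rng.normal(size=(16, 1))
y = ssm_scan(A, B, C, u)
print(y.shape)  # (16, 1)
```

Because the recurrence is linear and time-invariant, the same computation can be unrolled into one convolution over the whole sequence, which is what makes these layers attractive targets for large-scale systems work.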
Alexander "Sasha" Rush is an Associate Professor at Cornell Tech and a researcher at Hugging Face. His current research interests lie at the intersection of natural language processing and deep generative modeling, with applications in text generation, efficient inference, and controllability. In addition to academic research, he has written several popular open-source software projects supporting NLP research, data science, and virtual academic conferences such as NeurIPS and ACL. His research and open-source projects have received paper and demo awards at major NLP, visualization, and hardware conferences, an NSF CAREER Award, and a Sloan Fellowship. He tweets and blogs, mostly about coding and ML, at @srush_nlp.
Pre-trained models have become a cornerstone of machine learning because they are often applicable to a huge range of downstream applications. However, these models are typically created by resource-rich research groups that unilaterally decide how a given model should be built, trained, and released, after which point it is never updated. In contrast, open-source development has demonstrated that it is possible for a community of contributors to work together to iteratively build complex and widely used software. This kind of large-scale distributed collaboration is made possible by a mature set of tools, including version control and package management. This talk will discuss our research that aims to make it possible to build machine learning models the way open-source software is developed. After briefly discussing our work on merging models, model patches, and modular architectures, we will provide a thorough overview of git-theta, our version control system for model parameters. git-theta integrates into the standard git workflow, supports cheaply communicable patches, and natively handles automatic merging. The talk will conclude with a brief demo of git-theta's functionality.
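Of the directions mentioned above, model merging is the easiest to sketch. The snippet below shows the simplest possible variant, element-wise parameter averaging of two checkpoints; git-theta's actual merge machinery is richer than this, so treat it only as an illustration of the underlying idea:

```python
# Minimal model-merging sketch: element-wise interpolation of two checkpoints'
# parameters. This only hints at what automatic merging in a system like
# git-theta involves; it is not git-theta's implementation.
import numpy as np

def merge_checkpoints(params_a, params_b, alpha=0.5):
    # params_*: dicts mapping parameter names to arrays (a "state dict").
    assert params_a.keys() == params_b.keys()
    return {name: alpha * params_a[name] + (1 - alpha) * params_b[name]
            for name in params_a}

a = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
b = {"w": np.array([3.0, 4.0]), "b": np.array([2.0])}
merged = merge_checkpoints(a, b)
print(merged["w"], merged["b"])  # [2. 3.] [1.]
```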
Federated Learning (FL) is an emerging area of AI focused on training machine learning models in a privacy-preserving manner. The success of FL, especially in open collaboration settings, rests on being able to continuously attract high-quality data owners to participate. At the same time, this openness exposes FL to adversaries trying to exploit other parties' sensitive private information. It is therefore important to adopt an ecosystem-management approach to building trust and controlling risk in FL. In this talk, I will share some attempts we have made at the Trustworthy Federated Ubiquitous Learning (TrustFUL) Research Lab in this general direction, including data valuation under FL settings, fair treatment of FL participants, and studying user reactions to incentive schemes developed for federated learning.
Tatiana is a research scientist at Apple MLR, working on semi-supervised and unsupervised learning, speech recognition, and federated learning.
In recent years, secure collaborative machine learning paradigms have emerged as a viable option for sensitive applications. By eliminating the need to centralize data, these paradigms protect data sovereignty and reduce the risks associated with large-scale data collection. However, they also expose the learning process to active attackers, amplifying robustness issues. In this talk, I'll discuss the security and robustness challenges of secure collaborative learning systems, present our efforts to mitigate some of these issues, and highlight why a definitive solution to robustness in these systems is challenging.
Deep learning models are often trained on distributed, web-scale datasets crawled from the internet. We introduce two new dataset poisoning attacks that intentionally introduce malicious examples to degrade a model's performance. Our attacks are immediately practical and could, today, poison 10 popular datasets. We will discuss how the attacks work; why (we think) they haven't been exploited yet; and why defending against them comes with non-negligible costs.
Training example order in SGD has long been known to affect convergence rate. Recent results show that accelerated rates are possible in a variety of cases for permutation-based sample orders, in which each example from the training set is used once before any example is reused. This talk will cover a line of work in my lab on decentralized learning and sample-ordering schemes. We will discuss the limits of the classic gossip algorithm and random-reshuffling schemes and explore how both can be improved to make SGD converge faster both in theory and in practice with little overhead.
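The permutation-based ordering described above can be shown in a few lines: every epoch visits each training example exactly once in a fresh random order (random reshuffling), instead of sampling with replacement. The least-squares setup below is an illustrative toy, not an experiment from the talk:

```python
# Random-reshuffling SGD on a tiny noiseless least-squares problem:
# each epoch is a fresh permutation, so every example is used exactly once
# before any example is reused.
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 4
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true                       # consistent (noiseless) targets

def sgd_random_reshuffle(epochs=50, lr=0.05):
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):              # one pass per epoch
            grad = (X[i] @ w - y[i]) * X[i]       # per-example gradient
            w -= lr * grad
    return w

w = sgd_random_reshuffle()
print(np.linalg.norm(w - w_true))    # small: the iterates recover w_true
```

Swapping `rng.permutation(n)` for `rng.integers(0, n, size=n)` gives with-replacement sampling, the baseline against which the accelerated permutation-based rates are measured.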
Nat Jeffries is a founding engineer at Useful Sensors, where he designs privacy-preserving embedded ML sensors. He graduated from Carnegie Mellon University in 2016 with a degree in ECE. He joined Google, where he worked on embedded systems before joining Pete Warden to spin up TensorFlow Lite for Microcontrollers. He has previously spoken at TensorFlow World in São Paulo, Brazil, and guest-lectured on TinyML at Harvard.
Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources, then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai.
Born from the high-energy physics community at the Large Hadron Collider, hls4ml is an open-source Python package for machine learning inference on FPGAs (Field Programmable Gate Arrays). It creates firmware implementations of machine learning algorithms by translating models from traditional open-source machine learning packages into optimized high-level synthesis C++ that can then be customized for your use case and implemented on devices such as FPGAs and Application-Specific Integrated Circuits (ASICs). hls4ml can easily scale the implementation of a model to take advantage of the parallel processing capabilities that FPGAs offer, allowing not only low-latency, high-throughput designs but also designs sized to fit on lower-cost, resource-constrained hardware. hls4ml also supports generating accelerators with different drivers that build minimal, self-contained implementations, enabling control via Python or C/C++ with little extra development or hardware expertise.
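A key step in mapping a model onto an FPGA, as hls4ml does, is replacing floating-point arithmetic with fixed-point types (written `ap_fixed<W,I>` in the generated HLS C++). The standalone sketch below illustrates that conversion in NumPy; it mimics the idea only, not hls4ml's implementation:

```python
# Illustration of fixed-point quantization of the kind used in HLS designs:
# a signed ap_fixed<16,6>-style type has 6 integer bits (including sign) and
# 10 fractional bits, i.e., a step of 1/1024 and saturation near +/-32.
# This is a conceptual sketch, not hls4ml code.
import numpy as np

def to_fixed(x, total_bits=16, int_bits=6):
    frac_bits = total_bits - int_bits
    scale = 2.0 ** frac_bits                     # 1024 steps per unit
    lo = -(2.0 ** (int_bits - 1))                # most negative value
    hi = (2.0 ** (int_bits - 1)) - 1.0 / scale   # largest representable value
    return np.clip(np.round(x * scale) / scale, lo, hi)

x = np.array([0.1, -3.14159, 40.0])
print(to_fixed(x))  # values snapped to the 1/1024 grid; 40.0 saturates near 32
```

Choosing the width and integer bits per layer is the precision/resource trade-off that tools like hls4ml expose to the designer.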
System-Algorithm Co-Design for TinyML
There are billions of tiny IoT devices and microcontrollers worldwide. Deploying deep learning models on these tiny devices is appealing but challenging due to their limited memory (e.g., 256KB, 2-3 orders of magnitude smaller than mobile phones). In this talk, we will discuss our recent efforts to employ system-algorithm co-design to enable tinyML inference and training. We first propose MCUNet, a framework that jointly designs an efficient neural architecture (TinyNAS) and a lightweight inference engine (TinyEngine), enabling ImageNet-scale inference on microcontrollers. MCUNet is the first framework to achieve the milestone of 70% ImageNet top-1 accuracy on commercial microcontrollers. We then look into the SRAM bottleneck of CNN inference and find that the first several blocks have significantly higher memory usage than the rest. We propose MCUNetV2, featuring a generic patch-based inference schedule that operates on only a small spatial region of the feature map at a time and significantly cuts down peak memory, enabling more vision applications, such as object detection, for tinyML. Finally, we extend the framework to support on-device training: we propose a sparse update scheme that selectively updates only the important weights for transfer learning, cutting down the training cost. This algorithmic innovation is implemented by the Tiny Training Engine (TTE), which prunes the backward computation graph and offloads work from runtime to compile time. Our framework is the first practical solution for on-device transfer learning for visual recognition under 256KB of SRAM, 1000x smaller than existing frameworks. We hope our work can inspire more tinyML applications on the edge.
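The patch-based inference idea can be made concrete with a toy convolution: process one small spatial tile (plus its halo) at a time so that only a tile-sized activation buffer is ever live, while producing exactly the same output as the full-feature-map computation. This NumPy sketch illustrates the scheduling idea, not MCUNetV2's engine:

```python
# Toy patch-based inference: compute a "valid" 2D convolution tile by tile
# (each tile read with its (kh-1, kw-1) halo) and check it matches the
# whole-map result. Illustrates the MCUNetV2 scheduling idea only.
import numpy as np

def conv2d_valid(x, k):
    # Naive valid cross-correlation over the whole input.
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def conv2d_patched(x, k, patch=8):
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(0, out.shape[0], patch):
        for j in range(0, out.shape[1], patch):
            hi = min(i + patch, out.shape[0])
            wj = min(j + patch, out.shape[1])
            # Only the patch plus its halo is ever buffered at once.
            tile = x[i:hi + kh - 1, j:wj + kw - 1]
            out[i:hi, j:wj] = conv2d_valid(tile, k)
    return out

x = np.random.default_rng(0).normal(size=(32, 32))
k = np.ones((3, 3)) / 9.0
print(np.allclose(conv2d_patched(x, k), conv2d_valid(x, k)))  # True
```

The memory win comes from the early, high-resolution layers: the live buffer shrinks from the full feature map to one tile plus halo, at the cost of some recomputation in the overlapping halos.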
Training algorithms for large language models are often communication-heavy. As a result, these models are trained predominantly in centralized environments such as data centers with fast network connections. This strong dependence on fast interconnects is becoming the limiting factor for further scaling in the data center setting, and it rules out alternative decentralized infrastructures such as spot instances and geo-distributed volunteer compute. In this talk, I will discuss our research on communication-efficient distributed learning and our current effort to train foundation models in a decentralized way.
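One common communication-efficiency technique in this setting (offered here as a generic illustration, not necessarily the specific method from the talk) is top-k gradient sparsification: each worker transmits only the k largest-magnitude gradient entries as (index, value) pairs instead of the full dense vector:

```python
# Top-k gradient sparsification sketch: keep the k largest-magnitude entries
# of a gradient and zero the rest, so only k (index, value) pairs need to be
# communicated. A generic technique, illustrated on a tiny vector.
import numpy as np

def topk_sparsify(grad, k):
    idx = np.argpartition(np.abs(grad), -k)[-k:]  # indices of k largest |g_i|
    sparse = np.zeros_like(grad)
    sparse[idx] = grad[idx]
    return sparse, idx

g = np.array([0.1, -2.0, 0.05, 1.5, -0.3])
sparse, idx = topk_sparsify(g, k=2)
print(sparse)  # only the entries -2.0 and 1.5 survive
```

In practice such schemes are paired with error feedback (accumulating the dropped entries locally) so the discarded gradient mass is not lost, which is what keeps convergence close to the dense baseline.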
Ankita is a Senior Staff Engineer in Wireless R&D at Qualcomm Technologies, Inc. She has over eleven years of research and product development experience in hardware and software systems for the deep learning, computer vision, and wireless domains. She works on on-device ML initiatives for AI/ML-enabled 5G modems. She is also pursuing a doctoral degree at Stanford University, where her research is on energy-efficient agile hardware systems for deep learning and computer vision. Ankita has served as a reviewer and technical program committee member, and has published at various systems and architecture conferences.