Rethinking Pretraining: Data and Architecture
Abstract
Large language model training follows a standard pipeline: tokenization, pretraining, possibly mid-training, and post-training or alignment. Despite its wild success, we understand relatively little about this recipe and are almost certainly missing many opportunities to improve it. In this talk, I will focus on three such cases. I will describe our work on data-efficient post-training (e.g. LIMA, ALMA, and s1), where we argue that nearly all advanced model capabilities ultimately come from the pretraining data, even if effective alignment is still essential for controlling model behavior. I will also describe new methods for extracting more signal from the pretraining data, including new hierarchical architectures for byte-level language models (e.g. BLT) that are both tokenizer-free and scale better than traditional BPE-based methods, especially in the long tail. Finally, I will discuss decentralized, modular training algorithms (e.g. BTM) that better isolate and control the influence of specific data on specific model components and behaviors. Together, these methods promise to simplify training and improve scaling by centering and amplifying the influence of data in architecture design.
Speaker