Rethinking Pretraining: Data and Architecture
Large language model training follows a standard pipeline: tokenization, pretraining, possibly mid-training, and post-training or alignment. Despite its wild success, we understand relatively little about this recipe and are almost certainly missing many opportunities to improve it. In this talk, I will focus on three such cases. I’ll describe our work on data-efficient post-training (e.g. LIMA, ALMA, and s1), where we argue that nearly all advanced model capabilities ultimately come from the pretraining data, even if effective alignment is still essential for controlling model behavior. I will also describe new methods for extracting more signal from the pretraining data, including new hierarchical architectures for byte-level language models (e.g. BLT) that are both tokenizer-free and scale better than traditional BPE-based methods, especially in the long tail. Finally, I will discuss decentralized, modular training algorithms (e.g. BTM) that better isolate and control the influence of specific data on specific model components and behaviors. Together, these methods promise to simplify training and improve scaling by centering and amplifying the influence of data in architecture design.