
Invited Talk

Do we need Attention?

Alexander Rush

Ballroom C


Modern NLP runs on Transformers. Large language models are possible because of systems successes in making Transformers bigger, faster, and longer-range. However, five years after the advent of BERT and GPT, it is still an open question whether the central routing component of Transformers, self-attention, is essential to their success in pretraining, or whether it is worth developing large-scale systems for alternative approaches. Inspired by an off-hand wager on this topic, this talk will give an overview of recent work exploring alternative approaches to routing in large-scale NLP architectures. After giving background on the best practices and context of modern NLP, I will describe these alternatives, focusing primarily on static methods based on state-space models (SSMs) and long-range convolutions. I will conclude by discussing the current empirical results and theoretical properties of these models, as well as paths for their future systems development as competitive technologies.
