Invited Talk
Hardware-aware training and inference for large-scale AI
Animashree Anandkumar
Mission City Ballroom
The scaling of large language models has led to impressive gains in language understanding, but at the cost of insatiable memory and bandwidth requirements. We take a principled approach to designing optimization and quantization algorithms that reduce memory requirements without sacrificing accuracy. This includes gradient compression methods (GaLore, SignSGD) and a logarithmic number system for numerical representation. We also design fine-grained memory reduction schemes such as KV cache compression, chunking, and offloading to overcome memory bottlenecks in language models, especially in reasoning mode, where current memory requirements are massive. Such principles are broadly applicable and especially relevant to physical AI, where the memory and bandwidth requirements are even greater than those of frontier LLMs.
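As context for the gradient compression methods mentioned in the abstract, the sketch below illustrates the basic SignSGD update rule, in which each gradient coordinate is replaced by its sign before the step is applied, so only one bit per coordinate needs to be stored or communicated. This is a minimal, generic illustration in PyTorch; the class name and learning rate are illustrative and do not represent the speaker's implementation or the exact variant discussed in the talk.

    import torch

    class SignSGD(torch.optim.Optimizer):
        """Minimal illustrative SignSGD: update each parameter by the sign of
        its gradient, compressing the gradient to one bit per coordinate."""

        def __init__(self, params, lr=1e-3):
            super().__init__(params, defaults={"lr": lr})

        @torch.no_grad()
        def step(self):
            for group in self.param_groups:
                for p in group["params"]:
                    if p.grad is None:
                        continue
                    # Replace the full-precision gradient with its sign (+1 / -1 / 0)
                    # and take a fixed-size step in that direction.
                    p.add_(torch.sign(p.grad), alpha=-group["lr"])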