The Path to Inference Efficiency
Abstract
Agentic AI is moving out of demos and into daily use, creating enormous demand for efficient inference: higher throughput, lower latency, and lower cost in both dollars and joules. Meeting these targets requires rethinking the full inference stack, from the specialized silicon that runs the models, to the system software that compiles, schedules, and serves them at scale, to the model architectures that determine what must be computed in the first place. In this talk, we will examine these layers with an eye toward the next major advances in hardware architecture and toward how systems and algorithms can be co-designed to fully exploit them. Large gains in inference efficiency will come not from isolated improvements, but from treating hardware, systems, and models as an integrated stack.
Speaker