

Invited Talk May 18, 9:00 AM - 9:25 AM Grand Ballroom 1

Rethinking Open Source Contribution in the Age of AI Agents

Roger Wang
Roger Wang is a core maintainer of vLLM, the most popular open-source LLM inference engine, and the lead maintainer of vLLM-Omni, a framework extending vLLM to support omni-modality models and multimodal interactions. Roger is passionate about building infrastructure that combines technical rigor with real-world reliability and practical value.
Open source projects are seeing a surge of AI-generated pull requests, and vLLM, the inference engine behind much of today's production LLM traffic, is no exception. The cost of producing a plausible-looking PR has collapsed, while the cost of reviewing one has not. This has changed what maintainers do every day, and it has changed what it takes for a new contributor to actually contribute something of value. This talk shares a core maintainer's view on what is happening to OSS contribution patterns, with concrete examples from vLLM: PRs that look correct but miss the design intent, and fixes that paper over deeper issues. It is not a talk against AI - agents are now part of how vLLM gets built. The argument is that a human contributor's leverage has shifted away from producing code and toward understanding systems, picking the right problems, and owning what ships to production. We will close with practical thoughts on how new contributors can stand out, and what maintainers of critical infrastructure should be doing differently.
Invited Talk May 18, 9:25 AM - 9:50 AM Grand Ballroom 1

Beyond Model Serving: Cross-Stack Co-Design for Agentic Systems

Esha Choukse
Esha Choukse is a Principal Researcher in the Azure Research — Systems (AzRS) group at Microsoft. Her research focuses on efficient and sustainable AI across the computing stack, spanning AI platforms, hardware, and datacenter-scale infrastructure. She is a recipient of the ACM SIGMICRO Early Career Award for foundational contributions to hardware memory compression and to sustainable and efficient datacenter systems. Her papers have received three IEEE Micro Top Picks and an HPCA Best Paper Award. Several of her projects, including Splitwise and power stabilization in AI training datacenters, have had far-reaching impact on the research community and are deployed broadly across industry. Esha received her Ph.D. from The University of Texas at Austin in 2019 and has published extensively in leading venues including ISCA, ASPLOS, MICRO, HPCA, NSDI, and SC.
AI is moving from single-model inference to interactive, multimodal, and agentic systems. In this new regime, performance depends on co-design across the full stack, not on models or hardware alone. This talk argues for rethinking the boundary between machine learning and computer systems, and for treating accuracy and quality as dynamic system-level quantities that can be traded against latency, cost, and energy.
Invited Talk May 18, 9:50 AM - 10:15 AM Grand Ballroom 1

LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

Yuhan Liu
Yuhan Liu is a fifth-year PhD candidate at the University of Chicago, co-advised by Junchen Jiang and Shan Lu. Her research interest is in building efficient large-scale system and networking support for ML model inference. She was named an MIT EECS Rising Star and has received a EuroSys Best Paper Award and UChicago’s Neubauer PhD Fellowship for her research. She also leads two open-source projects that build large-scale KV caching layers for efficient LLM inference and are used in production at over 30 companies, including Google Cloud, Amazon AWS, NVIDIA, and IBM.
KV cache has traditionally been stored in GPU memory to accelerate the decoding phase of large language model (LLM) inference. However, it is increasingly necessary to move KV caches outside GPU devices to enable cache reuse across different queries and inference engines. Our real-world usage statistics confirm this trend: over time, the total KV cache stored by users has grown rapidly, far exceeding the capacity of GPU memory. Despite this need, there is no efficient solution for offloading and transferring KV caches.
In this talk, I'll present LMCache, the first efficient open-source KV caching solution, which extracts and stores KV caches generated by modern LLM engines (vLLM and SGLang) out of the GPU memory and shares them across engines and queries. LMCache supports both cache offloading (prefix reuse across queries) and prefill-decode (PD) disaggregation (cross-engine/GPU cache transfer). Our evaluation shows that combining LMCache with vLLM achieves up to 15x improvement in throughput across workloads such as multi-round question answering and document analysis. I'll also briefly talk about the key KV cache optimizations behind LMCache, including CacheGen for KV cache compression and CacheBlend for non-prefix KV cache sharing.
Invited Talk May 18, 10:15 AM - 10:40 AM Grand Ballroom 1

Eliciting Language Model Behaviors with Investigator Agents

Lisa Li
Xiang Lisa Li is a member of technical staff at OpenAI and an incoming assistant professor at the University of Washington. She received her PhD in Computer Science from Stanford, advised by Percy Liang and Tatsunori Hashimoto. Her research focuses on developing methods to make language models more capable and controllable.
Language models exhibit complex, diverse behaviors when prompted with free-form text, making it difficult to characterize the space of possible outputs. We study the problem of behavior elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations or harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train investigator models to map randomly-chosen target behaviors to a diverse distribution of outputs that elicit them, similar to amortized Bayesian inference. We do this through supervised fine-tuning, reinforcement learning via DPO, and a novel Frank-Wolfe training objective to iteratively discover diverse prompting strategies. Our investigator models surface a variety of effective and human-interpretable prompts leading to jailbreaks, hallucinations, and open-ended aberrant behaviors.
Keynote May 18, 1:30 PM - 2:30 PM Grand Ballroom 1

When AI Starts Writing Systems Code

Mark Saroufim
Mark Saroufim is a co-founder at Core Automation, co-founder of GPU MODE and was formerly a systems researcher at Meta working on PyTorch. His work focuses on AI infrastructure, GPU kernels, open-source systems, and AI for systems. He cares about both building better AI systems and building the open communities and benchmarks that make progress possible.
Systems are increasingly being written and optimized by AI systems. This talk focuses on kernel LLMs: models that generate GPU kernels. GPU kernels are a strong target for AI-driven optimization because they are verifiable and commercially interesting to optimize. But despite promising demos, very few AI-generated kernels are reliable enough to be used in production without significant human supervision.
We will go through examples of how we made LLM kernel evaluation more robust through open benchmarks, community feedback loops, and infrastructure built in public through GPU MODE. We will close with some thoughts on where ML systems are going, where junior researchers should spend their time, and how to build systems that last in a world where the cost of writing code is approaching zero.
Keynote May 19, 10:30 AM - 11:30 AM Grand Ballroom 1

The Next Horizon of Systems: From MLSys to System Intelligence

Lidong Zhou
Dr. Lidong Zhou is Corporate Vice President at Microsoft, Chief Scientist of the Microsoft Asia Pacific R&D Group, and Managing Director of Microsoft Research Asia. He has held research and leadership roles across Microsoft’s Silicon Valley, Redmond, and Asia labs. His work focuses on scalable, reliable, and trustworthy distributed systems, and has produced award-winning papers at SOSP, OSDI, and USENIX ATC. He has also contributed to the design of large-scale systems that power Microsoft’s search, big data, cloud, and AI infrastructure.
MLSys showed how systems can accelerate AI. The next shift is broader: AI is beginning to reshape the practice of systems itself. This emerging paradigm, which we call system intelligence, goes beyond automating programming tasks. It enables new forms of reasoning, design, validation, and evolution for complex systems while preserving rigor. In this talk, I will argue that system intelligence changes not only what systems we can build, but also how we understand systems as a discipline. It pushes us to rethink systems principles and methodology, shifting attention from code-level complexity to greater rigor in specification, design, and validation. Through our experiences with system verification, I will discuss how this shift may help give systems a stronger scientific foundation.
Keynote May 21, 10:30 AM - 11:30 AM Grand Ballroom 1

Rethinking Pretraining: Data and Architecture

Luke Zettlemoyer
Luke Zettlemoyer is a Professor in the Paul G. Allen School of Computer Science & Engineering at the University of Washington and a Senior Research Director at Meta. His research interests are in the intersections of natural language processing, machine learning, and decision making under uncertainty, with a recent emphasis on the science of training both text-based and multi-modal language models. Luke did postdoctoral research at the University of Edinburgh, earned his PhD at MIT, and was an undergraduate at NC State University. His honors include numerous paper awards, being named a Schmidt AI 2050 Senior Fellow in 2025, being elected President of the Association for Computational Linguistics (ACL) in 2024, being named a Fellow of the ACL in 2021, the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2016, an Allen Distinguished Investigator Award in 2014, and the National Science Foundation (NSF) International Research Fellowship in 2009.
Large language model training follows a standard pipeline: tokenization, pretraining, possibly mid-training, and post-training or alignment. Despite its wild success, we understand relatively little about this recipe and are almost certainly missing many opportunities to improve it. In this talk, I will focus on three such cases. I’ll describe our work on data-efficient post-training (e.g. LIMA, ALMA, and s1) where we argue that nearly all advanced model capabilities ultimately come from the pretraining data, even if effective alignment is still essential for controlling model behavior. I will also describe new methods for extracting more signal from the pretraining data, including new hierarchical architectures for byte-level language models (e.g. BLT) that are both tokenizer-free and scale better than traditional BPE-based methods, especially in the long tail. Finally, I will discuss decentralized, modular training algorithms (e.g. BTM) that better isolate and control the influence of specific data on specific model components and behaviors. Together, these methods promise to simplify training and improve scaling, by centering and amplifying the influence of data in architecture design.
Keynote May 20, 10:30 AM - 11:30 AM Grand Ballroom 1

Keynote: Amin Vahdat - SVP and Chief Technologist, AI & Infrastructure

Amin Vahdat
Amin Vahdat is the SVP and Chief Technologist, AI & Infrastructure at Google. His team is responsible for delivering industry-leading infrastructure, spanning custom silicon, data centers, networking, and supply chain and operations, that serves Alphabet, Google, and the world, along with the artificial intelligence technologies that empower ML developers and solve customers’ most pressing business challenges. In the past, he was Vice President and General Manager for Google's compute, storage, and network hardware and software infrastructure. Until 2019, he was the Technical Lead and Vice President for the Networking organization at Google.
Keynote May 22, 9:45 AM - 10:45 AM Grand Ballroom 1

The Path to Inference Efficiency

Christos Kozyrakis
Christos Kozyrakis is a computer architecture researcher at NVIDIA and the Leonard Bosack and Sandy K Lerner Professor of Engineering at Stanford University. His research focuses on hardware and software infrastructure for AI, as well as the use of AI for hardware and software design. He holds a PhD degree from the University of California at Berkeley and a BS degree from the University of Crete. He is a fellow of the ACM and the IEEE. He has received the IEEE Harry H Goode award, the ACM SIGARCH Maurice Wilkes award, the NSF Career Award, the ISCA Influential Paper Award, the ASPLOS Influential Paper Award, the HPCA Test of Time award, the SoCC Test of Time award, the Okawa Foundation Research Grant, the Noyce Family Faculty Scholarship, the Willard R. and Inez Kerr Bell Faculty Scholarship, and faculty awards from IBM, Google, and Microsoft.
Agentic AI is moving out of demos and into daily use, creating enormous demand for efficient inference: higher throughput, lower latency, and better efficiency in both dollars and joules. Meeting these targets requires rethinking the full inference stack, from the specialized silicon that runs the models, to the system software that compiles, schedules, and serves them at scale, to the model architectures that determine what must be computed in the first place. In this talk, we will examine these layers with an eye toward the next major advances in hardware architecture, and how systems and algorithms can be co-designed to fully exploit them. Large gains in inference efficiency will come not from isolated improvements, but from treating hardware, systems, and models as an integrated stack.
