AI is moving from single-model inference to interactive, multimodal, and agentic systems. In this new regime, performance depends on co-design across the full stack, not on models or hardware alone. This talk argues for rethinking the boundary between machine learning and computer systems, and for treating accuracy and quality as dynamic system-level quantities that can be traded against latency, cost, and energy.
The KV cache has traditionally been stored in GPU memory to accelerate the decoding phase of large language model (LLM) inference. However, it is increasingly necessary to move KV caches off the GPU, both to reuse caches across queries and to share them across inference engines. Our real-world usage statistics confirm this trend: over time, the total KV cache stored by users has grown rapidly, far exceeding GPU memory capacity. Despite this need, no efficient solution exists for offloading and transferring KV caches.
In this talk, I'll present LMCache, the first efficient open-source KV caching solution, which extracts KV caches generated by modern LLM engines (vLLM and SGLang), stores them outside GPU memory, and shares them across engines and queries. LMCache supports both cache offloading (prefix reuse across queries) and prefill-decode (PD) disaggregation (cross-engine/GPU cache transfer). Our evaluation shows that combining LMCache with vLLM achieves up to a 15x throughput improvement on workloads such as multi-round question answering and document analysis. I'll also briefly cover the key KV cache optimizations behind LMCache, including CacheGen for KV cache compression and CacheBlend for non-prefix KV cache sharing.
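The prefix-reuse idea above can be sketched as a store keyed by chained hashes of token chunks, so a cached chunk matches only when its entire preceding prefix matches. This is a minimal illustrative sketch, not LMCache's actual API; the class and method names are hypothetical, and real systems store tensor blocks in CPU memory or on disk rather than opaque Python objects.

```python
import hashlib

class PrefixKVStore:
    """Toy CPU-side KV store keyed by token-prefix hashes (illustrative only)."""

    def __init__(self, chunk_size=256):
        self.chunk_size = chunk_size
        self.store = {}  # chained prefix hash -> KV block (opaque payload here)

    def _chunk_hashes(self, token_ids):
        # Hash each full chunk together with the hash of everything before it,
        # so a chunk only matches when its whole prefix matches too.
        hashes, running = [], b""
        usable = len(token_ids) - len(token_ids) % self.chunk_size
        for i in range(0, usable, self.chunk_size):
            chunk = token_ids[i:i + self.chunk_size]
            running = hashlib.sha256(running + repr(chunk).encode()).digest()
            hashes.append(running)
        return hashes

    def put(self, token_ids, kv_blocks):
        # Store one KV block per full chunk of the prompt.
        for h, block in zip(self._chunk_hashes(token_ids), kv_blocks):
            self.store[h] = block

    def get_longest_prefix(self, token_ids):
        # Return cached KV blocks for the longest matching token prefix;
        # the engine then only needs to prefill the remaining suffix.
        blocks = []
        for h in self._chunk_hashes(token_ids):
            if h not in self.store:
                break
            blocks.append(self.store[h])
        return blocks, len(blocks) * self.chunk_size
```

A new query that shares a 512-token prefix with a cached one would skip prefill for those 512 tokens and compute attention only over the suffix.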
Language models exhibit complex, diverse behaviors when prompted with free-form text, making it difficult to characterize the space of possible outputs. We study the problem of behavior elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations or harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train investigator models to map randomly-chosen target behaviors to a diverse distribution of outputs that elicit them, similar to amortized Bayesian inference. We do this through supervised fine-tuning, reinforcement learning via DPO, and a novel Frank-Wolfe training objective to iteratively discover diverse prompting strategies. Our investigator models surface a variety of effective and human-interpretable prompts leading to jailbreaks, hallucinations, and open-ended aberrant behaviors.
When AI Starts Writing Systems Code
Software systems are increasingly being written and optimized by AI. This talk focuses on kernel LLMs: models that generate GPU kernels. GPU kernels are a strong target for AI-driven optimization because they are verifiable and commercially valuable to optimize. Yet despite promising demos, very few AI-generated kernels are reliable enough for production use without significant human supervision.
We will walk through examples of how we made LLM kernel evaluation more robust via open benchmarks, community feedback loops, and infrastructure built in public through GPU MODE. We will close with some thoughts on where ML systems are going, where junior researchers should spend their time, and how to build systems that last in a world where the cost of writing code is approaching zero.
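The core of robust kernel evaluation is checking a candidate kernel against a trusted reference across many randomized inputs before ever timing it. The sketch below uses NumPy callables as stand-ins for compiled GPU kernels, purely to show the harness logic; real benchmarks also vary dtypes, strides, and shapes adversarially, and the function names here are illustrative, not from any specific benchmark suite.

```python
import numpy as np

def check_kernel(candidate, reference, shapes, trials=5,
                 rtol=1e-4, atol=1e-5, seed=0):
    """Toy correctness harness: compare a candidate kernel against a
    reference implementation on several random inputs.

    `candidate` and `reference` are plain callables standing in for
    compiled GPU kernels; a real harness would launch on-device and
    only benchmark speed after all correctness trials pass.
    """
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        args = [rng.standard_normal(s).astype(np.float32) for s in shapes]
        out, ref = candidate(*args), reference(*args)
        if not np.allclose(out, ref, rtol=rtol, atol=atol):
            return False  # wrong result on at least one random input
    return True
```

Running many randomized trials matters because LLM-generated kernels often pass a single hand-picked test while silently mishandling edge shapes or numerical ranges.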