AI is moving from single-model inference to interactive, multimodal, and agentic systems. In this new regime, performance depends on co-design across the full stack, not on models or hardware alone. This talk argues for rethinking the boundary between machine learning and computer systems, and for treating accuracy and quality as dynamic system-level quantities that can be traded against latency, cost, and energy.
The KV cache has traditionally been stored in GPU memory to accelerate the decoding phase of large language model (LLM) inference. However, it is increasingly necessary to move KV caches off the GPU, both to reuse caches across queries and to share them across inference engines. Our real-world usage statistics confirm this trend: over time, the total KV cache stored by users has grown rapidly, far exceeding GPU memory capacity. Despite this need, no efficient solution exists for offloading and transferring KV caches.
In this talk, I'll present LMCache, the first efficient open-source KV caching solution, which extracts KV caches generated by modern LLM engines (vLLM and SGLang), stores them outside GPU memory, and shares them across engines and queries. LMCache supports both cache offloading (prefix reuse across queries) and prefill-decode (PD) disaggregation (cross-engine/GPU cache transfer). Our evaluation shows that combining LMCache with vLLM achieves up to a 15x throughput improvement on workloads such as multi-round question answering and document analysis. I'll also briefly cover the key KV cache optimizations behind LMCache, including CacheGen for KV cache compression and CacheBlend for non-prefix KV cache sharing.
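The prefix-reuse idea above can be sketched as a store keyed by chained hashes of token chunks, so a cached chunk matches only when its entire preceding prefix matches. This is a minimal illustrative sketch, not LMCache's actual API; the class and method names are hypothetical, and real systems store tensor blocks in CPU memory or on disk rather than opaque Python objects.

```python
import hashlib

class PrefixKVStore:
    """Toy CPU-side KV store keyed by token-prefix hashes (illustrative only)."""

    def __init__(self, chunk_size=256):
        self.chunk_size = chunk_size
        self.store = {}  # chained prefix hash -> KV block (opaque payload here)

    def _chunk_hashes(self, token_ids):
        # Hash each full chunk together with the hash of everything before it,
        # so a chunk only matches when its whole prefix matches too.
        hashes, running = [], b""
        usable = len(token_ids) - len(token_ids) % self.chunk_size
        for i in range(0, usable, self.chunk_size):
            chunk = token_ids[i:i + self.chunk_size]
            running = hashlib.sha256(running + repr(chunk).encode()).digest()
            hashes.append(running)
        return hashes

    def put(self, token_ids, kv_blocks):
        # Store one KV block per full chunk of the prompt.
        for h, block in zip(self._chunk_hashes(token_ids), kv_blocks):
            self.store[h] = block

    def get_longest_prefix(self, token_ids):
        # Return cached KV blocks for the longest matching token prefix;
        # the engine then only needs to prefill the remaining suffix.
        blocks = []
        for h in self._chunk_hashes(token_ids):
            if h not in self.store:
                break
            blocks.append(self.store[h])
        return blocks, len(blocks) * self.chunk_size
```

A new query that shares a 512-token prefix with a cached one would skip prefill for those 512 tokens and compute attention only over the suffix.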
Language models exhibit complex, diverse behaviors when prompted with free-form text, making it difficult to characterize the space of possible outputs. We study the problem of behavior elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations or harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train investigator models to map randomly-chosen target behaviors to a diverse distribution of outputs that elicit them, similar to amortized Bayesian inference. We do this through supervised fine-tuning, reinforcement learning via DPO, and a novel Frank-Wolfe training objective to iteratively discover diverse prompting strategies. Our investigator models surface a variety of effective and human-interpretable prompts leading to jailbreaks, hallucinations, and open-ended aberrant behaviors.
When AI Starts Writing Systems Code
Software systems are increasingly being written and optimized by AI. This talk focuses on kernel LLMs: models that generate GPU kernels. GPU kernels are a strong target for AI-driven optimization because they are verifiable and commercially valuable to optimize. Yet despite promising demos, very few AI-generated kernels are reliable enough for production use without significant human supervision.
We will walk through examples of how we made LLM kernel evaluation more robust via open benchmarks, community feedback loops, and infrastructure built in public through GPU MODE. We will close with some thoughts on where ML systems are going, where junior researchers should spend their time, and how to build systems that last in a world where the cost of writing code is approaching zero.
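The core of robust kernel evaluation is checking a candidate kernel against a trusted reference across many randomized inputs before ever timing it. The sketch below uses NumPy callables as stand-ins for compiled GPU kernels, purely to show the harness logic; real benchmarks also vary dtypes, strides, and shapes adversarially, and the function names here are illustrative, not from any specific benchmark suite.

```python
import numpy as np

def check_kernel(candidate, reference, shapes, trials=5,
                 rtol=1e-4, atol=1e-5, seed=0):
    """Toy correctness harness: compare a candidate kernel against a
    reference implementation on several random inputs.

    `candidate` and `reference` are plain callables standing in for
    compiled GPU kernels; a real harness would launch on-device and
    only benchmark speed after all correctness trials pass.
    """
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        args = [rng.standard_normal(s).astype(np.float32) for s in shapes]
        out, ref = candidate(*args), reference(*args)
        if not np.allclose(out, ref, rtol=rtol, atol=atol):
            return False  # wrong result on at least one random input
    return True
```

Running many randomized trials matters because LLM-generated kernels often pass a single hand-picked test while silently mishandling edge shapes or numerical ranges.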