Oral Thu, May 21, 2026 • 1:00 PM – 1:15 PM PDT

SHIP: SRAM-Based Huge Inference Pipelines for Fast LLM Serving

Andrew Bitar ⋅ Aravind Vayalapra ⋅ Baorui Zhou ⋅ Matthew Boyd ⋅ Charlie Wang ⋅ Sahil Parmar ⋅ Eugene Sha ⋅ Gautam Rayaprolu ⋅ Peter Hicks ⋅ Alex Bowe ⋅ Roberto DiCecco ⋅ Santosh Raghavan ⋅ Evan Patrick ⋅ Josip Smolcic ⋅ David Han ⋅ Kris Kang ⋅ Andy Rock ⋅ Josh Hay ⋅ Mohamed Eldafrawy ⋅ Mikhail Kandel ⋅ Daulet Zhanguzin ⋅ Omar Kilani ⋅ Liming Gong ⋅ Andrew Paprotskyi ⋅ Arash Taheri-Dezfouli ⋅ Josh Fender ⋅ Andrew Ling

[Slides] [OpenReview]

Abstract

The proliferation of large language models (LLMs) demands inference systems with both low latency and high efficiency at scale. GPU-based serving relies on HBM for model weights and KV caches, creating a memory bandwidth bottleneck during decode. To break through this bottleneck, we present the first large-scale, SRAM-based LLM inference deployment—Groq’s public cloud—serving hundreds of billions of tokens daily. This paper reviews Groq’s first-generation SRAM-based Huge Inference Pipelines (SHIP), highlighting: (1) a synchronous, low-diameter interconnect enabling low-latency scaling across thousands of chips; (2) optimizations for LLM serving under limited memory capacity; and (3) a large pipeline design that sustains efficiency and latency under varying prefill-to-decode ratios and context lengths. Together, these yield state-of-the-art latency while maintaining efficiency across diverse traffic scenarios—key to real-world LLM serving.
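The decode bottleneck and the capacity constraint mentioned in the abstract can be illustrated with a back-of-envelope estimate. The sketch below is not taken from the paper. It assumes batch size 1, that each decode step streams every weight once, and that compute and interconnect time are negligible. The function names and all numeric values are hypothetical placeholders chosen for round arithmetic, not measurements of any real GPU or Groq chip.

```python
import math


def decode_step_time_s(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step if all weights are read once per token.

    Assumes the step is purely memory-bandwidth bound (no compute or
    communication cost), which is a simplification.
    """
    return weight_bytes / bandwidth_bytes_per_s


def chips_to_hold(weight_bytes: float, on_chip_bytes_per_chip: float) -> int:
    """Minimum number of chips whose on-chip memory can hold the weights.

    Ignores KV cache, activations, and any replication.
    """
    return math.ceil(weight_bytes / on_chip_bytes_per_chip)


# Hypothetical placeholder values, NOT figures from the paper.
weight_bytes = 8e9            # e.g. 8B parameters at 1 byte each
off_chip_bandwidth = 1e12     # 1 TB/s, stand-in for an off-chip memory system
on_chip_bandwidth = 10e12     # 10 TB/s, stand-in for an on-chip memory system
on_chip_capacity = 200e6      # 200 MB of on-chip memory per chip

slow = decode_step_time_s(weight_bytes, off_chip_bandwidth)
fast = decode_step_time_s(weight_bytes, on_chip_bandwidth)
chips = chips_to_hold(weight_bytes, on_chip_capacity)

print(f"off-chip bound: {slow * 1e3:.2f} ms per token")
print(f"on-chip bound:  {fast * 1e3:.2f} ms per token")
print(f"chips needed to hold weights on-chip: {chips}")
```

Under these assumptions, higher memory bandwidth lowers the per-token latency floor, while small per-chip capacity forces the model to be spread across many chips. That trade-off is the reason the abstract emphasizes both the interconnect and the pipeline design.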
