Oral Fri, May 22, 2026 • 8:15 AM – 8:30 AM PDT

Dataflow Is All You Need

Darshan Gandhi ⋅ Pushkar Nandkar ⋅ David Koeplinger ⋅ Nasim Farahini ⋅ Romy Tsoupidi ⋅ Samuel Rydh ⋅ Matheen Musaddiq ⋅ Tuowen Zhao ⋅ Reid Goodbar ⋅ Nathan Sheeley ⋅ Leon Zhang ⋅ Matthew Shaffer ⋅ John Long ⋅ Han Wang ⋅ Angela Wang ⋅ Arjun Sabnis ⋅ Joshua Brot ⋅ Yun Du ⋅ Håkan Zeffer ⋅ Mingran Wang ⋅ Raghu Prabhakar


Abstract

The autoregressive decode phase of token generation is often a major performance bottleneck in modern AI workflows due to its memory-bandwidth-bound characteristics, which are amplified by powerful open-source models with large context windows and techniques such as chain-of-thought reasoning. Popular GPU architectures extract as little as 21\% of the available memory bandwidth when loading weights and KV caches, as the scope of asynchronous execution is limited by CPU scheduling overheads, kernel synchronization overheads, and inadequate compute-communication overlap. While prior work attempts to address these overheads with kernel fusion and asynchronous execution on GPUs, it mostly focuses on a single GPU and does not generalize across different model architectures. We argue that to truly mitigate these overheads, $\textit{Dataflow Is All You Need}$. In this paper, we chronicle a co-design approach to achieve peak decoding performance on a dataflow architecture -- the SambaNova SN40 Reconfigurable Dataflow Unit (RDU), substantiated by three key dataflow-enabled optimizations -- $\textit{KernelLooping}$, $\textit{BatchStreaming}$, and $\textit{ScheduleOffloading}$ -- that generalize to models that are small, large, dense, MoEs, or hybrids, and that use different attention mechanisms. Collectively, these optimizations deliver more than $\textbf{75}$\% of the theoretical peak roofline performance for a wide range of popular open-source models and demonstrate a speedup of more than $\textbf{6$\times$}$ when using popular speculative decoding techniques. Finally, we show that speculative decoding is $\textbf{1.7$\times$}$ faster on 16 SN40 chips than on a DGX H100 despite both systems having comparable HBM bandwidth. The techniques described in this paper and the models used in the evaluation are deployed in a production AI inference cloud at cloud.sambanova.ai.
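To make the roofline framing in the abstract concrete, the following is a minimal sketch of how a memory-bandwidth roofline for decode can be estimated. It is not the paper's methodology. It assumes batch size 1, that each decode step streams all weights and the KV cache from memory exactly once, and it does not model speculative decoding. The function names and every number below are hypothetical placeholders chosen for easy arithmetic, not measurements from the paper.

```python
def decode_roofline_tokens_per_s(weight_bytes, kv_cache_bytes, bandwidth_bytes_per_s):
    """Upper bound on decode steps per second for a memory-bound model.

    Assumes (hypothetically) that each step reads every weight byte and
    every KV-cache byte from memory once, at batch size 1.
    """
    bytes_per_step = weight_bytes + kv_cache_bytes
    return bandwidth_bytes_per_s / bytes_per_step


def roofline_fraction(measured_tokens_per_s, roofline_tokens_per_s):
    """Fraction of the memory-bandwidth roofline that a measurement achieves."""
    return measured_tokens_per_s / roofline_tokens_per_s


# Hypothetical inputs, for illustration only.
weight_bytes = 8e9 * 2        # 8B parameters stored in 16-bit precision
kv_cache_bytes = 4e9          # assumed KV-cache footprint for one long context
bandwidth = 2e12              # assumed aggregate memory bandwidth, bytes/s

peak = decode_roofline_tokens_per_s(weight_bytes, kv_cache_bytes, bandwidth)
fraction = roofline_fraction(75.0, peak)  # 75.0 tokens/s is a made-up measurement

print(f"roofline: {peak:.1f} tokens/s, achieved fraction: {fraction:.2f}")
```

Under these assumptions, a system's achieved fraction falls below 1.0 whenever scheduling, synchronization, or communication time is not hidden behind the memory traffic, which is the kind of overhead the abstract describes.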
