Dataflow Is All You Need
Darshan Gandhi ⋅ Pushkar Nandkar ⋅ David Koeplinger ⋅ Romy Tsoupidi ⋅ Tuowen Zhao ⋅ Reid Goodbar ⋅ Leon Zhang ⋅ John Long ⋅ Han Wang ⋅ Yun Du ⋅ Håkan Zeffer ⋅ Raghu Prabhakar
Abstract
The autoregressive decode phase of token generation is often the performance bottleneck in modern AI workflows, driven by powerful open-source models with large context windows coupled with techniques like chain-of-thought reasoning. Decoding is memory bandwidth bound: the speed of token generation is limited by the rate at which weights and KV cache values can be read from memory. However, GPUs can utilize as little as 21\% of the available memory bandwidth for reading weights and KV caches. Asynchronous execution is difficult on GPUs, leading to CPU scheduling overheads, kernel synchronization overheads, and inadequate compute-communication overlap. While prior works attempt to address these overheads with kernel fusion and asynchronous execution on GPUs, they mostly focus on a single GPU and do not generalize across different model architectures. We argue that to truly mitigate these overheads, \emph{Dataflow Is All You Need}. Dataflow architectures execute subgraphs of operations asynchronously on one or more chips, thereby naturally avoiding the overheads faced on GPUs. In this paper, we chronicle a co-design approach to achieving peak decoding performance on a dataflow architecture -- the SambaNova SN40 Reconfigurable Dataflow Unit (RDU). We describe three key optimizations enabled by dataflow -- \emph{\textbf{KernelLooping}}, \emph{\textbf{BatchStreaming}}, and \emph{\textbf{ScheduleOffloading}} -- that generalize across models that are small or large, dense, MoE, or hybrid, and that use different attention mechanisms. Collectively, these optimizations deliver more than \textbf{75\%} of the theoretical peak roofline performance for a wide range of popular open-source models. We study speculative decoding in detail and demonstrate a speed-up of more than \textbf{6$\times$}. Finally, we show that speculative decoding runs \textbf{1.7$\times$} faster on 16 SN40 RDUs than on a DGX H100 despite comparable HBM bandwidth.
The techniques described in this paper and the models used in the evaluation are deployed in a production AI inference cloud at cloud.sambanova.ai.