Demystifying the Mixture of Experts Serving Tax
Abstract
Mixture-of-Experts (MoE) models enable massive parameter counts but suffer serving overheads relative to dense models with the same per-token compute cost. This MoE tax varies with model architecture, inference phase, and parallelism strategy. We comprehensively study the tax across a range of MoE models, finding that they perform 2-3x worse than compute-equivalent dense models. Through microbenchmarks, we analyze and categorize the underlying sources of the tax and show how they manifest under different configurations. Our key result is that the prefill and decode phases incur vastly different taxes; counterintuitively, factors like load imbalance that harm prefill can sometimes benefit decode. To build deeper intuition, we propose a balls-bins-buckets performance model and use it to study recent MoE developments such as fine-grained experts and data-parallel attention. We conclude by discussing existing and new techniques for reducing the MoE tax, along with their trade-offs.
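The phase-dependent effect of load imbalance can be illustrated with a classic balls-into-bins experiment (a minimal sketch, not the paper's full balls-bins-buckets model): tokens are routed to experts uniformly at random under assumed top-1 routing, and the busiest expert's load relative to the mean approximates the slowdown under expert parallelism. With many tokens per step (prefill-like), loads concentrate near the mean; with few tokens per step (decode-like), the busiest expert sits far above it. The function name and batch sizes below are illustrative assumptions.

```python
import random

def max_load_ratio(num_tokens, num_experts, seed=0):
    """Route each token to one expert uniformly at random (top-1 routing)
    and return the busiest expert's load divided by the mean load.
    Under expert parallelism, the busiest expert gates step latency."""
    rng = random.Random(seed)
    loads = [0] * num_experts
    for _ in range(num_tokens):
        loads[rng.randrange(num_experts)] += 1
    mean = num_tokens / num_experts
    return max(loads) / mean

# Prefill-like step: many tokens, so per-expert loads cluster near the mean.
print(f"prefill (8192 tokens, 64 experts): {max_load_ratio(8192, 64):.2f}x")
# Decode-like step: few tokens, so the busiest expert is far above the mean.
print(f"decode  (  16 tokens, 64 experts): {max_load_ratio(16, 64):.2f}x")
```

The decode-like ratio is much larger than the prefill-like one, matching the intuition that imbalance is a very different phenomenon at small batch sizes, where most experts receive zero tokens.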