Cross-device Federated Learning (FL) is a distributed learning paradigm with several challenges that differentiate it from traditional distributed learning: variability in the system characteristics on each device, and millions of clients coordinating with a central server being primary ones. Most FL systems described in the literature are synchronous in nature --- they perform a synchronized aggregation of model updates from individual clients. Scaling synchronous FL is challenging since increasing the number of clients training in parallel leads to diminishing returns in training speed, analogous to large-batch training. Moreover, synchronous FL can be slow due to stragglers. In this work, we describe the design of a production asynchronous FL system to tackle the aforementioned issues, sketch some of the system design challenges and their solutions, and touch upon principles that emerged from building the production system for millions of clients. Empirically, we demonstrate that asynchronous FL is significantly faster than synchronous FL when training across millions of devices. In particular, in high concurrency settings, asynchronous FL converges 5$\times$ faster while being nearly 8$\times$ more resource efficient than synchronous FL.