SAKURAONE: An Open Ethernet-Based AI HPC System and Its Observed Workload Dynamics in a Single-Tenant LLM Development Environment
Abstract
SAKURAONE is a managed high-performance computing (HPC) cluster developed and operated by the SAKURA Internet Research Center. It builds on the \emph{KOKARYOKU PHY} bare-metal GPU platform and is optimized for advanced workloads, including large language model (LLM) training. In the ISC 2025 TOP500 list, SAKURAONE is ranked \textbf{49th} by HPL and is the only top-100 system that uses a fully open networking stack, \textbf{800~GbE} with \textbf{SONiC}, demonstrating the scalability of vendor-neutral technology. Measured performance is 33.95~PFLOP/s (HPL~Rmax), 396.295~TFLOP/s (HPCG), and 339.86~PFLOP/s on HPL-MxP with FP8. The system consists of 100 nodes, each with eight NVIDIA H100 GPUs, and a 2~PB all-flash Lustre file system, all interconnected via a rail-optimized 800~GbE leaf–spine fabric with RoCEv2. Because the system was used exclusively by a single research project, we were able to observe the characteristics of development-related jobs. Consistent with previous HPC studies, small-scale jobs dominated in number, while a few large-scale jobs accounted for most GPU resource time. As the project progressed, resource use shifted from large-scale to mid-scale jobs, reflecting a transition from initial large-scale training to iterative refinement. These observations illustrate the real-world utilization dynamics of GPU clusters under unified project workloads.
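
To put the headline figures in proportion, the following minimal sketch derives a few ratios from the numbers quoted in the abstract alone. The constant names and the function `derived_figures` are hypothetical, introduced only for illustration. The per-GPU value additionally assumes that all 100 nodes (800 GPUs) took part in the HPL run, which the abstract does not state.

```python
# Figures quoted in the abstract; units are given in each constant's name.
NODES = 100
GPUS_PER_NODE = 8
HPL_RMAX_PFLOPS = 33.95
HPCG_TFLOPS = 396.295
HPL_MXP_PFLOPS = 339.86


def derived_figures() -> dict[str, float]:
    """Return ratios computed purely from the abstract's numbers."""
    total_gpus = NODES * GPUS_PER_NODE
    hpl_tflops = HPL_RMAX_PFLOPS * 1000.0  # PFLOP/s -> TFLOP/s
    return {
        "total_gpus": total_gpus,
        # Assumption: every GPU in the system participated in the HPL run.
        "hpl_tflops_per_gpu": hpl_tflops / total_gpus,
        # Ratio of the FP8 mixed-precision result to the HPL Rmax result.
        "mxp_over_hpl": HPL_MXP_PFLOPS / HPL_RMAX_PFLOPS,
        # HPCG throughput as a fraction of HPL Rmax.
        "hpcg_over_hpl": HPCG_TFLOPS / hpl_tflops,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:.4g}")
```

Under these assumptions, the HPL-MxP result is roughly ten times the HPL Rmax, and HPCG reaches a little over one percent of it.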