With NVIDIA InfiniBand Adaptive Routing & Self-Healing Networking Technology
From research to business operations, the data center is central to the computational power, storage throughput, and application performance necessary to support an organization’s mission. That’s why data center infrastructure design is paramount, and why the network technology connecting it all together plays such a key role in every aspect of that mission-critical endeavor.
The performance advantages of NVIDIA InfiniBand technology are second to none, with many attributes that set it apart from any other network, including hardware-based RDMA, purpose-built acceleration engines, in-network computing, and advanced capabilities for managing network congestion and quality of service. Unlike proprietary network solutions, InfiniBand is based on open industry standards, providing backward and forward compatibility across generations and protecting data center investments.
A Profound Impact on the Data Center’s Mission
High performance computing (HPC) data centers serve an increasing number of applications and users, from research and simulations to AI training, digital twins, real-time processing, industrial HPC, and more. The growing mix of workloads and users puts further pressure on the data center interconnect, which may result in network congestion, similar to a highway during rush hour. For the data center, this slowdown results in reduced application performance.
NVIDIA InfiniBand technology includes enhanced capabilities, such as quality of service (QoS), self-healing networking technology, and adaptive routing, that overcome congested hotspots and efficiently support many applications and users at once.
Optimized Data Center Efficiency
InfiniBand adaptive routing technology overcomes network congestion and maximizes overall cluster performance by spreading traffic across all network links, increasing link utilization and bandwidth. Adaptive routing determines the optimal path that a data packet should follow through the network to reach its destination. By allowing packets to avoid congested areas, NVIDIA’s adaptive routing technology improves network resource utilization, increasing efficiency and performance, often up to 96%.
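The idea of steering packets away from busy links can be illustrated with a small toy model. The sketch below is not NVIDIA's implementation (real adaptive routing runs in switch hardware); it simply contrasts a static scheme, in which each flow is pinned to one link, with an adaptive scheme that sends each packet to the currently least-loaded link. The function names, the modulo-based static mapping, and the example flows are all assumptions made for illustration.

```python
def static_route(flow_id: int, num_links: int) -> int:
    """Static routing (toy): a flow is pinned to one link by a fixed function of its ID."""
    return flow_id % num_links


def adaptive_route(link_load: list[int]) -> int:
    """Adaptive routing (toy): choose the link that is least loaded right now."""
    return min(range(len(link_load)), key=lambda i: link_load[i])


def max_link_load(flows: list[tuple[int, int]], num_links: int, adaptive: bool) -> int:
    """Place every (flow_id, packet_count) pair and return the heaviest link load."""
    load = [0] * num_links
    for flow_id, packets in flows:
        if adaptive:
            # Each packet independently goes to the least-loaded link.
            for _ in range(packets):
                load[adaptive_route(load)] += 1
        else:
            # All packets of the flow share the single pinned link.
            load[static_route(flow_id, num_links)] += packets
    return max(load)


# Hypothetical workload: two large flows that collide under static routing.
flows = [(0, 8), (4, 8), (1, 2), (2, 2)]
static_peak = max_link_load(flows, num_links=4, adaptive=False)
adaptive_peak = max_link_load(flows, num_links=4, adaptive=True)
print(static_peak, adaptive_peak)
```

In this toy workload the two large flows land on the same link under static routing, creating a hotspot, while the adaptive scheme spreads the same 20 packets evenly over the four links. Per-packet spreading can deliver packets out of order, which a real fabric has to handle; the sketch ignores that.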
To showcase the effects of adaptive routing on real applications, BSMBench and VASP were run on the Texas Advanced Computing Center “Frontera” supercomputer.
BSMBench
BSMBench is a flexible and scalable supercomputer benchmark from computational particle physics. As an open source benchmarking tool, it includes the ability to tune the ratio of communication to computation and is used to simulate workloads such as Lattice Quantum Chromodynamics and, by extension, its parent field, Lattice Gauge Theory. These workloads make up a significant fraction of global supercomputing cycles. In this test, BSMBench demonstrated a 28% improvement with adaptive routing.
VASP
The Vienna Ab initio Simulation Package (VASP) is a computer program for atomic-scale materials modelling, such as electronic structure calculations and quantum-mechanical molecular dynamics. VASP gains a performance improvement of approximately 10% on Frontera when adaptive routing is enabled.
NVIDIA InfiniBand Self-Healing Networking Technology
IT administrators have long known that it’s best to plan for the unexpected, which is one of the fundamental drivers behind NVIDIA Self-Healing Networking technology. This unique hardware-based technology enables the network to overcome link failures and achieve network recovery 1,000 times faster than any software-based solution. NVIDIA Self-Healing Networking technology also enables switches within the network to exchange information on link status. With this capability, if a specific network link suddenly becomes inactive, the switch connected to this link broadcasts this information to the relevant switches within the network so they can adjust their adaptive routing mechanism. This avoids selecting a path that may lead to the inactive link, allowing for the fastest traffic recovery and eliminating application downtime.
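The link-status exchange described above can be pictured with a short software sketch. This is a simplified illustration only: the actual mechanism is implemented in switch hardware, and the `Switch` class, the `notify_link_down` helper, and the two-spine topology below are hypothetical names and values chosen for the example. Each switch keeps candidate paths per destination and, once told that a link is down, stops offering any next hop whose path crosses that link.

```python
def link(a: str, b: str) -> frozenset:
    """An undirected link between two nodes."""
    return frozenset((a, b))


class Switch:
    """Toy switch that tracks candidate paths and known-down links."""

    def __init__(self, name: str, paths: dict[str, list[list[str]]]):
        self.name = name
        # destination -> list of paths; each path is a list of node names
        # that starts at this switch and ends at the destination.
        self.paths = paths
        self.down: set[frozenset] = set()

    def on_link_down(self, failed: frozenset) -> None:
        """Record a link-down notification received from a peer switch."""
        self.down.add(failed)

    def usable_next_hops(self, dst: str) -> list[str]:
        """Next hops whose full path to dst avoids every known-down link."""
        hops = []
        for path in self.paths.get(dst, []):
            links = {link(a, b) for a, b in zip(path, path[1:])}
            if not links & self.down:
                hops.append(path[1])
        return hops


def notify_link_down(switches: list[Switch], a: str, b: str) -> None:
    """Model the switch next to the failure informing the relevant switches."""
    for sw in switches:
        sw.on_link_down(link(a, b))


# Hypothetical topology: leaf1 reaches leaf2 through spineA or spineB.
leaf1 = Switch("leaf1", {"leaf2": [["leaf1", "spineA", "leaf2"],
                                   ["leaf1", "spineB", "leaf2"]]})
before = leaf1.usable_next_hops("leaf2")
notify_link_down([leaf1], "spineA", "leaf2")  # the spineA-leaf2 link fails
after = leaf1.usable_next_hops("leaf2")
print(before, after)
```

After the notification, leaf1 no longer considers spineA for traffic to leaf2, even though its own link to spineA is still up. This is the point of exchanging link status: the failure is remote, so leaf1 could not detect it on its own.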
A Better Return on Investment
NVIDIA InfiniBand Self-Healing Networking technology and adaptive routing work together to drive your HPC systems toward new levels of utilization and productivity, ultimately increasing your return on investment (ROI).
To learn more about adaptive routing and NVIDIA Self-Healing Networking technology, review the NVIDIA Adaptive Routing Whitepaper, which includes a key performance analysis of the effects of adaptive routing on several microbenchmarks and applications.