Why HPC Architectures Are Converging on Low Latency iWARP
High Performance Computing (HPC) cluster architectures are moving away from proprietary, expensive networking technologies and toward Ethernet as the throughput and latency of TCP/IP continue to improve. InfiniBand, once the dominant interconnect for HPC applications that use the Message Passing Interface (MPI) and remote direct memory access (RDMA), has been supplanted as the preferred networking protocol in these environments.
Due to the rapid adoption of the x86 platform in high performance parallel computing environments, 48% of the top 500 supercomputers now use Ethernet as their standard networking technology (http://www.top500.org). Latency-sensitive HPC applications, such as those used in financial trading or modeling environments, use IP/Ethernet networks to run the same MPI/RDMA applications over iWARP. Developed by the IETF, compatible with existing Ethernet switches, and supported by the industry’s leading 10GbE adapters, iWARP is the proven low-latency RDMA-over-Ethernet solution for high-performance computing on TCP/IP.
Cost-Effective, Low Latency Clustering
Chelsio now delivers iWARP connectivity with the lowest latency available in a network interface card. In recent tests with Chelsio partners, the new T420-LL-CR 10Gb Ethernet adapter delivered an RDMA Verbs latency of 3.4 µs and an average latency of 3.7 µs. The detailed results are available in the T420 Low Latency Technical Brief. These results demonstrate the smooth scalability provided by the T420-LL-CR’s fourth-generation T4 ASIC, which is essential to maintaining low latency during periods of heavy use.
Capabilities Integrated to Deliver the Best Performance Overall
Chelsio’s second-generation iWARP design builds on the RDMA capabilities of T3, with continued MPI support on Linux with the OpenFabrics Enterprise Distribution (OFED) and on Windows HPC Server 2008. T3 is already a field-proven performer in Purdue University’s 1300-node cluster, and the T4 design reduces RDMA latency from T3’s already low 6.8 µs to about 3.7 µs through its increased pipeline speed.
This demonstrates the linear scalability of Chelsio’s RDMA architecture, which delivers latency comparable to or lower than InfiniBand DDR or QDR and continues to scale in real-world applications as connections are added.
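The T3 and T4 latency figures quoted above can be put in relative terms with a little arithmetic. The sketch below is illustrative only: the function name `latency_improvement` is hypothetical, and the inputs are simply the 6.8 µs and 3.7 µs values stated in the text.

```python
def latency_improvement(old_us: float, new_us: float) -> tuple[float, float]:
    """Return (percent reduction, speedup factor) between two latencies."""
    reduction_pct = (old_us - new_us) / old_us * 100.0
    speedup = old_us / new_us
    return reduction_pct, speedup


# Figures quoted in the text above (microseconds).
T3_LATENCY_US = 6.8
T4_LATENCY_US = 3.7

reduction_pct, speedup = latency_improvement(T3_LATENCY_US, T4_LATENCY_US)
print(f"Reduction: {reduction_pct:.1f}%  Speedup: {speedup:.2f}x")
```

On the quoted figures this works out to a reduction of roughly 46%, or a little over 1.8 times lower latency.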
Enhanced Storage Offloads
T4 offers protocol acceleration for both file and block-level storage traffic. For file storage, T4 supports full TOE under Linux and TCP Chimney under Windows. T4’s fourth-generation TOE design adds support for IPv6, which is increasingly prevalent and now a requirement for many government and wide-area applications. For block storage, T4 supports partial or full iSCSI offload of processor-intensive tasks such as protocol data unit (PDU) recovery, header and data digest, cyclic redundancy checking (CRC), and direct data placement (DDP), with support for VMware ESX.
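To give a sense of why digest calculation is worth offloading, here is a minimal software sketch of the CRC32C checksum that the iSCSI standard (RFC 3720) specifies for header and data digests. It is an illustration of the per-byte work a host CPU would otherwise do, not Chelsio's implementation. The helper `digest_matches` is a hypothetical name introduced for this example.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # Castagnoli polynomial, bit-reflected


def crc32c(data: bytes) -> int:
    """Compute CRC32C one bit at a time (slow, but easy to follow)."""
    crc = 0xFFFFFFFF  # standard initial value
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out the low bit; XOR in the polynomial if it was set.
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF  # standard final XOR


def digest_matches(payload: bytes, received_digest: int) -> bool:
    """Hypothetical helper: check a payload against a received digest."""
    return crc32c(payload) == received_digest


# Standard CRC32C check value for the ASCII string "123456789".
assert crc32c(b"123456789") == 0xE3069283
print(digest_matches(b"123456789", 0xE3069283))
```

The inner loop runs eight times for every byte of every PDU, which is why software initiators typically rely on table-driven code or CPU instructions, and why an adapter that computes the digest in hardware frees host cycles.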
Broadening Chelsio’s support for block storage, T4 adds partial and full FCoE offload. With an HBA driver, full offload provides maximum performance as well as compatibility with SAN management software. For software initiators, Chelsio supports the Open-FCoE stack, and T4 offloads certain processing tasks much as it does in iSCSI. Unlike iSCSI, FCoE requires enhanced Ethernet support for lossless transport, namely Priority-based Flow Control (PFC), Enhanced Transmission Selection (ETS), and Data Center Bridging Exchange (DCBX).
These results show how the rigorous requirements of cluster computing and storage can be met or exceeded using iWARP and the Chelsio T420-LL-CR Unified Wire Adapter. Pervasive and reliable 10Gb Ethernet with iWARP delivers a highly scalable ultra-low latency interconnect solution for all HPC environments.
Why risk investing in technologies with a marginal installed base and a limited future, when Chelsio T4-based 10GbE Unified Wire Adapters provide comprehensive support and offloading of iSCSI, FCoE, TOE, NFS/iWARP, and Lustre/iWARP for CIFS and NFS traffic, and meet all the requirements for application clustering ranging from dozens to thousands of compute nodes?
So if you prefer to use one card throughout your HPC application, choose the T420-LL-CR, the Unified Wire Adapter that does it all and does it best.