August 6, 2014

Austrian HPC Consortium Meets Demanding Inter-node Communication Challenge with Intel True Scale Fabric

Dr. Ernst J. Haunschmid
(c) Peter Wienerroither

The Vienna Scientific Cluster (VSC) is a collaboration that provides high-performance computing resources to a consortium of Austrian institutions: the University of Vienna (UNVIE), the Vienna University of Technology (TUVIE), the University of Natural Resources and Applied Life Sciences Vienna (BOKU), the Graz University of Technology (TU Graz), several universities in Austria’s southern provinces (University of Graz, University of Mining Leoben, University of Klagenfurt), and the University of Innsbruck.

VSC has three high-performance clusters that make up the computational hub of the consortium. VSC-1, built in 2009, was ranked #156 on the November 2009 Top500 list. VSC-2, completed in 2011, was ranked #56 on the June 2011 Top500 list. The latest system, VSC-3, is currently being deployed and carries high expectations for performance and energy efficiency. VSC-3 comprises 2,020 nodes based on Supermicro’s green X9DRD-iF mainboard, each fitted with two eight-core Intel Xeon E5-2650 v2 processors running at 2.6 GHz. The nodes are oil-cooled using Green Revolution Cooling’s immersion cooling technology.

Access to the VSC is granted on the basis of peer-reviewed projects.

Researchers will use the VSC-3 cluster for a wide range of applications, from genomics to climate research, using commercial and open source scientific packages including NAMD, MM5, HMMER and DMFT, among others. A substantial share of the compute resources will go to computational materials science, which has a very strong tradition in Austria. Two of the most important and widely used codes in this area were developed in Vienna: the WIEN2k package and the Vienna Ab-initio Simulation Package (VASP), which is used for ab initio electronic structure calculations and quantum mechanical molecular dynamics. Designing VSC-3 required carefully balancing compute performance, memory bandwidth, a strong communication backbone, and other factors, including the ability to manage highly parallel workloads.

Because of the very high inter-node communication demands, the VSC-3 interconnect is based on Intel’s True Scale QDR-80 design, an architecture that addresses the consortium’s needs in terms of message rate, latency, resiliency and scalability.

Prior to the VSC-3 selection, VSC benchmarked several communication fabric technologies, including Intel True Scale QDR, Intel True Scale QDR-80, Mellanox FDR, and Mellanox Connect-IB. The VSC selection committee had experience with QLogic DDR HCAs and QLogic QDR switches on VSC-1, and with Mellanox ConnectX-2 QDR HCAs and QDR switches on VSC-2.

Scalability was a key concern because of the consortium’s experience with previous clusters. During VSC-2’s first year, some codes exhibited scalability problems on the system. Unfortunately, these problems particularly affected the two most heavily used codes, VASP and WIEN2k.

The team determined that for larger numbers of MPI processes (500 to 4,000), the message rate was the limiting factor. Data showed that the Mellanox ConnectX-2 technology on VSC-2 had a much lower message rate on the Ohio State University (OSU) benchmarks, at 4-5 million messages/sec with 16 cores per node, versus 16 million messages/sec with 8 cores per node on VSC-1 and its QLogic fabric. While scalability could be improved by software optimization, in particular by using the Eigenvalue soLvers for Petaflop Applications (ELPA) library, the message rate remains the bottleneck that limits how many nodes a job on VSC-2 can use while still achieving reasonable speedup.
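
To compare the two fabrics on equal footing, the figures above can be normalized per core. The following is only a sketch of that arithmetic using the numbers quoted in this article; the function name is hypothetical, and it assumes the quoted rates are aggregate per-node rates measured with all cores of the node active.

```python
def per_core_message_rate(node_rate_millions, cores_per_node):
    """Return million messages/sec per core from a per-node aggregate rate.

    Assumption: the quoted benchmark figure is the total for one node.
    """
    return node_rate_millions / cores_per_node


# Figures quoted in the article (million messages/sec, cores per node).
vsc1 = per_core_message_rate(16.0, 8)        # VSC-1, QLogic fabric
vsc2_low = per_core_message_rate(4.0, 16)    # VSC-2, lower bound
vsc2_high = per_core_message_rate(5.0, 16)   # VSC-2, upper bound

print(f"VSC-1: {vsc1:.2f} M msgs/sec per core")
print(f"VSC-2: {vsc2_low:.2f} to {vsc2_high:.2f} M msgs/sec per core")
print(f"VSC-1 advantage: {vsc1 / vsc2_high:.1f}x to {vsc1 / vsc2_low:.1f}x")
```

Under that assumption, each VSC-1 core had roughly six to eight times the message rate available to a VSC-2 core, which is consistent with the scaling limits observed for large MPI jobs.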

In the VSC-3 benchmarks, which used VSC’s main codes, neither Mellanox FDR nor Connect-IB exhibited any advantage over a dual-rail Intel True Scale Fabric QDR-80. In some cases, even single-rail Intel True Scale Fabric QDR showed better performance than FDR.

The experience gained from these benchmarks and from the VSC-2 performance data was used in formulating the criteria and requirements in the call for tender for VSC-3. While vendors were free to choose the network technology as well as the topology, the stringent performance requirements, in particular the message rate (2.5 million messages per second per processor core), led the winning bidder, ClusterVision, to select Intel True Scale Fabric QDR-80 for VSC-3.
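
As a rough illustration of what the tender requirement implies at node level, the per-core figure can be scaled by the 16 cores in each VSC-3 node. This is a sketch under stated assumptions, not a figure from the tender itself: the constant names are hypothetical, and the even split of traffic across the two rails of a dual-rail configuration is an assumption made only for illustration.

```python
REQUIRED_RATE_PER_CORE = 2.5   # million messages/sec per core (from the tender)
CORES_PER_NODE = 2 * 8         # two eight-core processors per VSC-3 node
RAILS_PER_NODE = 2             # dual-rail fabric; even traffic split is assumed

# Aggregate message rate each node must sustain to meet the requirement.
required_per_node = REQUIRED_RATE_PER_CORE * CORES_PER_NODE

# Share per rail if the load were balanced evenly (illustrative assumption).
required_per_rail = required_per_node / RAILS_PER_NODE

print(f"Required per node: {required_per_node:.0f} M msgs/sec")
print(f"Required per rail: {required_per_rail:.0f} M msgs/sec")
```

By this estimate each node would need to sustain about 40 million messages per second, well above the per-node rates quoted for VSC-1 and VSC-2.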