Taking Measure of Supercomputer Architectures
Members of Berkeley Lab's Computing Sciences divisions are applying their expertise in running scientific codes and evaluating high-performance computers to achieve “real world” assessments of leading supercomputers around the world. Their goal is to determine which architectures are best suited for advancing computational science.
With the re-emergence of viable vector computing systems such as the Earth Simulator and the Cray X1, and with IBM and DOE's Blue Gene/L taking the top spot as the world's fastest computer, there is renewed debate about which architecture is best suited for running large-scale scientific applications.
To cut through conflicting claims, researchers from Berkeley Lab's Computational Research and NERSC Center divisions have been putting various architectures through their paces, running benchmarks as well as scientific applications key to Department of Energy programs. The team includes Lenny Oliker, Julian Borrill, Andrew Canning and John Shalf of CRD; Jonathan Carter and David Skinner of NERSC; and Stephane Ethier of the Princeton Plasma Physics Laboratory. Their evaluations have resulted in a half-dozen papers published in journals and presented at conferences in the United States, Norway, Japan and Spain.
In the initial part of their study, the team traveled to Japan in December 2004 and tested five systems by running four scientific applications key to DOE research programs. As part of the effort, the group became the first international team to conduct a performance evaluation study of the 5,120-processor Earth Simulator.
The team also assessed the performance of
the 6,080-processor IBM Power3 supercomputer, running AIX 5.1 at the NERSC Center,
the 864-processor IBM Power4 supercomputer, running AIX 5.2 at Oak Ridge National Laboratory,
the 256-processor SGI Altix 3000 system, running 64-bit Linux at ORNL,
and the 512-processor Cray X1 supercomputer, running UNICOS at ORNL.
“This effort relates to the fact that the gap between peak and actual performance for scientific codes keeps growing,” said team leader Lenny Oliker. “Because of the increasing cost and complexity of HPC systems” – high-performance computing systems – “it is critical to determine which classes of applications are best suited for a given architecture.”
The four applications and research areas selected by the team for the evaluation were
Cactus, an astrophysics code that evolves Einstein's equations from the Theory of Relativity using the Arnowitt-Deser-Misner method,
GTC, a magnetic-fusion application that uses the particle-in-cell approach to solve nonlinear gyrophase-averaged Vlasov-Poisson equations,
LBMHD, a plasma physics application that uses the Lattice-Boltzmann method to study magnetohydrodynamics,
and PARATEC, a first-principles materials science code that solves the Kohn-Sham equations of density-functional theory to obtain electronic wave functions.
“The four applications successfully ran on the Earth Simulator with high parallel efficiency,” Oliker said. “And they ran faster than on any other measured architecture – generally by a large margin.” However, Oliker added, only codes that scale well and are suited to the vector architecture may be run on the Earth Simulator. “Vector architectures are extremely powerful for the set of applications that map well to those architectures,” Oliker said. “But if even a small part of the code is not vectorized, overall performance degrades rapidly.”
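Oliker's point about unvectorized code can be illustrated with an Amdahl's-law style estimate. The sketch below is not taken from the study: the function name, the assumed 8x advantage of vector over scalar execution, and the vectorized fractions are all hypothetical values chosen only to show the shape of the effect.

```python
def effective_speedup(vector_fraction, vector_speedup):
    """Amdahl's-law estimate of overall speedup.

    vector_fraction: share of the original (scalar) runtime that vectorizes.
    vector_speedup:  how much faster the vectorized part runs (assumed value).
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if vector_speedup <= 0.0:
        raise ValueError("vector_speedup must be positive")
    scalar_fraction = 1.0 - vector_fraction
    # New runtime = untouched scalar part + vectorized part running faster.
    return 1.0 / (scalar_fraction + vector_fraction / vector_speedup)


# Hypothetical 8x vector advantage; none of these numbers come from the article.
for fraction in (1.0, 0.99, 0.95, 0.90, 0.50):
    speedup = effective_speedup(fraction, 8.0)
    print(f"{fraction:.0%} vectorized -> {speedup:.2f}x overall")
```

Under these assumed numbers, leaving a tenth of the runtime unvectorized already gives up more than a third of the ideal speedup, which is the kind of rapid degradation Oliker describes.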
One of the codes, LBMHD, ran at 67 percent of peak system performance, even when scaled up to 4,800 processors. However, as with most scientific inquiries, the answer to the question of which architecture is best is neither simple nor straightforward.
“We're at a point where no single architecture is well suited to the full spectrum of scientific applications,” Oliker said. “One size does not fit all, so we need a range of systems. It's conceivable that future supercomputers would have heterogeneous architectures within a single system, with different sections of a code running on different components.”
One of the codes the group intended to run in this study – MADCAP, the Microwave Anisotropy Dataset Computational Analysis Package – did not scale well enough to be used on the Earth Simulator. MADCAP, developed by Julian Borrill, is a parallel implementation of cosmic microwave background map-making and power spectrum estimation algorithms. Since MADCAP has high input-output requirements, its performance was hampered by the lack of a fast global file system on the Earth Simulator.
Undeterred, the team retuned MADCAP and returned to Japan to try again. The results, outlined in a paper titled “Performance characteristics of a cosmology package on leading HPC architectures” and presented at the 11th International Conference on HPC in Bangalore, India, showed that the Cray X1 had the best runtimes for MADCAP but suffered the lowest parallel efficiency. The Earth Simulator and IBM Power3 demonstrated the best scalability, and the code achieved the highest percentage of peak on the Power3. The paper concluded, “Our results highlight the complex interplay between the problem size, architectural paradigm, interconnect, and vendor-supplied numerical libraries, while isolating the I/O filesystem as the key bottleneck across all the platforms.”
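The metrics quoted in these results (runtime, parallel efficiency and percentage of peak) measure different things, which is why one machine can have the fastest runtime and the lowest efficiency. The sketch below uses common textbook definitions, not necessarily the exact ones used in the paper, and every number in it is hypothetical rather than a measured result.

```python
def percent_of_peak(sustained_gflops, peak_gflops):
    """Sustained floating-point rate as a percentage of the theoretical peak."""
    return 100.0 * sustained_gflops / peak_gflops


def parallel_efficiency(base_procs, base_time, procs, time):
    """Measured speedup over a baseline run, divided by the ideal speedup."""
    measured_speedup = base_time / time
    ideal_speedup = procs / base_procs
    return measured_speedup / ideal_speedup


# Two hypothetical machines scaling the same job from 16 to 64 processors.
# Runtimes are in seconds and are invented for illustration only.
machine_a = parallel_efficiency(16, 100.0, 64, 40.0)  # faster in absolute terms
machine_b = parallel_efficiency(16, 200.0, 64, 55.0)  # slower, but scales better

print(f"Machine A: 40 s at 64 processors, efficiency {machine_a:.2f}")
print(f"Machine B: 55 s at 64 processors, efficiency {machine_b:.2f}")
```

In this made-up case machine A finishes first at every processor count yet loses more of its ideal speedup as it scales, the same pattern the paper reports for the Cray X1 on MADCAP.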
Blue Gene/L is currently the world's fastest supercomputer, with the first Blue Gene system being installed at Lawrence Livermore National Laboratory. David Skinner is serving as Berkeley Lab's representative to a new Blue Gene/L Consortium led by Argonne National Laboratory. The consortium aims to bring together institutions active in HPC research and build a community focused on the Blue Gene family as a next step towards petascale computing. Its members will work together to develop or port Blue Gene applications and system software, conduct detailed performance analysis on applications, develop mutual training and support mechanisms, and contribute to future platform directions.
This is a reprint of an article originally published by Berkeley Lab Computing Sciences