The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing
May 04, 2007
by Tobias Gradl (Institute of Informatics at the University of Erlangen)
and Reinhold Bader (Leibniz Computing Centre)
On April 17, the Leibniz Computing Centre Munich (LRZ) announced the completion of phase 2 of the installation of the national supercomputing system HLRB II. The SGI Altix 4700 based computer has been upgraded to 9728 cores and 39 TBytes of memory. With the newly installed dual-core Intel Itanium 2 Montecito processors, the system's LINPACK performance has more than doubled to 56.5 TFlop/s. In normal operation, a single operating system image of Novell's SuSE Linux Enterprise Server 10 (SLES 10) spans 512 cores connected by NUMAlink 4. This gives the user access to 2 TBytes of shared memory at a time; no other comparable computer in the current Top500 list offers more. In benchmark tests, a single SLES 10 image ran productively on up to 1024 cores.
Besides the LINPACK benchmark, other large-scale codes have been used to put HLRB II through its paces. One of them is HHG ("Hierarchical Hybrid Grids"), a multigrid solver for the finite element method on unstructured grids. HHG was developed by Benjamin Bergen at the Chair for System Simulation (LSS), University of Erlangen-Nuremberg, Germany, under the auspices of the Bavarian KONWIHR supercomputing research consortium, and is now being maintained and refined by Tobias Gradl at LSS. It received the 2006 award of the International Supercomputing Conference (ISC). The software is designed to solve the largest possible simulations as fast as possible. Both goals are achieved through a compromise between structured and unstructured meshes: unstructured coarse grid patches are refined in a structured way, which keeps the storage needed for the operator stencils to a minimum and yields high MFlop/s rates thanks to regular memory access patterns.
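The storage argument can be illustrated with a minimal sketch. It is not HHG code: the 2D five-point Laplacian stencil, the list-of-rows grid layout, and the names STENCIL, apply_stencil and stored_coefficients are all assumptions chosen for illustration. The point is only that a regularly refined patch can reuse one small stencil at every interior point, whereas a fully unstructured mesh needs its own set of coefficients for each unknown.

```python
# Assumed example operator: a constant 2D five-point Laplacian stencil.
# On a regularly refined patch the same offsets and weights apply everywhere.
STENCIL = {(0, 0): 4.0, (-1, 0): -1.0, (1, 0): -1.0, (0, -1): -1.0, (0, 1): -1.0}


def apply_stencil(u):
    """Apply STENCIL to the interior points of a 2D grid (list of rows)."""
    rows, cols = len(u), len(u[0])
    result = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            result[i][j] = sum(
                weight * u[i + di][j + dj] for (di, dj), weight in STENCIL.items()
            )
    return result


def stored_coefficients(n, structured):
    """Operator coefficients kept in memory for the interior of an n-by-n patch."""
    interior_points = (n - 2) ** 2
    if structured:
        return len(STENCIL)  # one stencil shared by all interior points
    return interior_points * len(STENCIL)  # one row of weights per unknown


if __name__ == "__main__":
    n = 1025
    print("structured patch:  ", stored_coefficients(n, structured=True))
    print("unstructured patch:", stored_coefficients(n, structured=False))
```

The regular index arithmetic in apply_stencil is also what produces the predictable memory access pattern mentioned above; an unstructured mesh would need an indirect lookup for every neighbour.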
Using 9170 cores of HLRB II, HHG solved a finite element simulation with 307 billion unknowns in 93 seconds (7.75 seconds per V-cycle of the multigrid method). These figures are impressive (presumably a world record), and the scaling is also very good. With the problem size per processor core held constant, a V-cycle takes 4.93 seconds on 64 cores, 5.68 seconds on 4080 cores, and 6.33 seconds on 6120 cores.
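The timings above can be turned into weak-scaling efficiencies with a few lines of arithmetic. The sketch below uses only the numbers quoted in this article; the function and variable names are illustrative, and treating the 9170-core run as part of the same constant-work-per-core series is an assumption.

```python
# Seconds per V-cycle at each core count, as quoted in the article.
v_cycle_seconds = {64: 4.93, 4080: 5.68, 6120: 6.33, 9170: 7.75}


def weak_scaling_efficiency(times, baseline_cores=64):
    """Return t(baseline) / t(n) for every core count n (1.0 means ideal)."""
    baseline = times[baseline_cores]
    return {cores: baseline / seconds for cores, seconds in times.items()}


efficiency = weak_scaling_efficiency(v_cycle_seconds)

# Number of V-cycles implied by a 93 s solve at 7.75 s per cycle.
v_cycles = 93 / 7.75

# Unknowns handled by each core in the 9170-core run.
unknowns_per_core = 307e9 / 9170

if __name__ == "__main__":
    for cores in sorted(efficiency):
        print(f"{cores:5d} cores: {efficiency[cores]:.1%} of the 64-core rate")
    print(f"V-cycles in the full solve: {v_cycles:g}")
    print(f"Unknowns per core: {unknowns_per_core:,.0f}")
```

Under these assumptions the efficiency stays at roughly 87 percent on 4080 cores and 78 percent on 6120 cores, and drops to about 64 percent on 9170 cores.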
Of the 9728 total cores, 6656 constitute the "high bandwidth" part of HLRB II. In this part, every dual-core processor has its own memory channel, which it can access at full speed (8.5 GByte/s). The remaining 3072 cores form the "high density" part, in which two processors (four cores) share each memory channel. This inhomogeneity leads to varying performance figures, depending on which part of the machine a program runs on. The effect is visible in the timing results mentioned above. On up to 6120 cores, i.e., within the "high bandwidth" part, the scaling is good. When the whole computer is used, it deteriorates slightly.
How strongly memory bandwidth influences performance depends heavily on an application's memory access habits. SGI and the LRZ HPC support team around Dr. Matthias Brehm therefore eagerly awaited the first performance measurements on the new installation phase, to compare them with those from phase 1, in which a single-core Itanium 2 Madison processor could use the whole memory bandwidth of 8.5 GByte/s. As they were happy to see, the reduced memory bandwidth per core did not have as severe an impact as might have been expected. Per-core MFlop/s rates averaged over all applications appear to be at essentially the same level as in phase 1. Very memory-intensive fluid dynamics code, by contrast, shows a per-core performance decrease of up to 40 percent. HHG's performance, compared to phase 1, is reduced by a factor of 1.1-1.4 when running on the "high bandwidth" part, and by a factor of 1.9-2.0 on the "high density" part. This is more than pleasing, considering that the available memory bandwidth per core has been reduced by factors of 2 and 4, respectively.
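The comparison between lost bandwidth and lost performance can be made explicit with a short calculation. The sketch below uses only figures from this article; the dictionary names are illustrative, and reading the HHG factors as per-core slowdowns is an assumption.

```python
CHANNEL_BANDWIDTH = 8.5  # GByte/s per memory channel, as quoted in the article

# Cores sharing one memory channel in each configuration.
cores_per_channel = {
    "phase 1": 1,                  # single-core Madison
    "phase 2, high bandwidth": 2,  # one dual-core Montecito per channel
    "phase 2, high density": 4,    # two dual-core Montecitos per channel
}

# HHG slowdown relative to phase 1, as (low, high) factors from the article.
hhg_slowdown = {
    "phase 2, high bandwidth": (1.1, 1.4),
    "phase 2, high density": (1.9, 2.0),
}

bandwidth_per_core = {
    name: CHANNEL_BANDWIDTH / cores for name, cores in cores_per_channel.items()
}

if __name__ == "__main__":
    for name, (low, high) in hhg_slowdown.items():
        reduction = bandwidth_per_core["phase 1"] / bandwidth_per_core[name]
        print(
            f"{name}: {bandwidth_per_core[name]:.3f} GByte/s per core, "
            f"bandwidth down {reduction:.0f}x, HHG slower by {low}-{high}x"
        )
```

In both parts of the machine the reported slowdown is smaller than the reduction in bandwidth per core, which is consistent with HHG not being limited purely by memory bandwidth.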
HLRB II is especially well suited to multigrid solvers like HHG. Multigrid is among the most computationally efficient methods for solving partial differential equations (PDEs), but it is hard to implement on large-scale computers because its hierarchy of coarse meshes creates an inherent lack of parallelism. HLRB II provides 4 GBytes of main memory per core, much more than some comparable supercomputers. As a result, larger subdomains of the finite element mesh can be assigned to each processor, and the coarse meshes remain large enough to allow high MFlop/s rates.
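A rough calculation shows why the coarse meshes limit parallelism and why a large fine-level subdomain per core helps. The sketch below starts from the 307 billion unknowns on 9170 cores reported above; the coarsening factor of 8 per level (mesh width doubled in three dimensions) and the number of levels are assumptions for illustration, not HHG's actual configuration.

```python
def unknowns_per_core_by_level(fine_unknowns, cores, levels, coarsening=8):
    """Unknowns per core on each multigrid level, finest level first.

    Assumes every coarsening step divides the number of unknowns by the
    same factor (8 corresponds to doubling the mesh width in 3D).
    """
    finest = fine_unknowns / cores
    return [finest / coarsening**level for level in range(levels)]


per_core = unknowns_per_core_by_level(307e9, 9170, levels=8)

if __name__ == "__main__":
    for level, unknowns in enumerate(per_core):
        print(f"level {level}: about {unknowns:,.0f} unknowns per core")
```

Under these assumptions, each core holds tens of millions of unknowns on the finest level but only a handful after seven coarsening steps. The more unknowns a core can hold on the finest level, the more levels it can work through before the per-core workload becomes too small to keep it busy.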
About Leibniz Computing Centre (Leibniz-Rechenzentrum, LRZ)
Leibniz Computing Centre is a facility of the commission for information science of the Bavarian Academy of Sciences, with around 170 employees. As a modern service enterprise, LRZ constitutes a scientific computing centre for all universities in Munich and the Academy of Sciences, as well as being a national centre for scientific supercomputing and a centre for large-scale archiving of data. It is responsible for planning, upgrading and deployment of the Munich Scientific Network and acts as a state-wide competence centre for data communication networks. For further information please visit www.lrz.de.
-----
Source: Leibniz Computing Centre