In what is being called an unprecedented upgrade, the NASA Center for Climate Simulation (NCCS) is tripling the peak performance of its Discover supercomputer to more than 3.3 petaflops to power NASA’s Earth science modeling efforts.
The open procurement process included the benchmarking of NCCS codes – notably the Goddard Earth Observing System Model, Version 5 (GEOS-5) and the NASA-Unified Weather Research and Forecasting (NU-WRF) model. Based on performance and value criteria, SGI was selected to provide Rackable clusters, outfitted with 14-core Intel Xeon E5-2697 v3 “Haswell” processors.
In this photo, taken several months ago partway through the upgrade, the NCCS supercomputer had 45,600 processor cores and a peak speed of 1.995 petaflops. The visible machine “skins” depict observed and simulated images of Hurricane Sandy. The Discover supercomputer’s new SGI Rackable clusters will house a total of 64,512 processor cores. Credit: Photo by NASA/Goddard/Bill Hrybyk.
NCCS is in the process of installing the SGI Rackable hardware as three Scalable Compute Units (SCUs 10, 11, and 12), which combined offer a total of 64,512 processor cores.
Discover – which derives its name from the NASA adage of “Explore. Discover. Understand.” – comprises multiple Linux scalable units built with commodity components. The first scalable Discover unit was installed in the fall of 2006, and there have been several upgrades since then. The new clusters are replacing portions of Discover dating from 2011.
In its current form, Discover’s scalable units (SCUs 8, 9, 10, and 11) together span 67 racks, incorporating 62,400 total cores and providing 2.678 petaflops of compute power. SCU10 achieved general availability in January, and SCU11 is currently in pioneer user mode. SCU12 is scheduled to arrive in late May.
On its website, NCCS describes the three stages leading up to a successful deployment, detailing the roles of the vendor, the NCCS system administrators and benchmarking team, and the power users who put the system through its paces. One successful test involved running an ultra-high-resolution GEOS-5 simulation on the entire SCU10 cluster.
NASA’s Discover system administrator Mike Donovan observes that while a typical NCCS installation pace is one SCU per year, they are on track to stand up three SCUs in seven months. The effort requires close coordination among the NCCS technical and facilities staff and the computer vendor. The replacement of SCUs must be carefully timed in order to limit disruptions.
“We want to have the old hardware out at least a week beforehand,” said Bruce Pfaff, who leads Discover’s system administration team. “But we also want to maximize the amount of time users have with the old system and minimize the period of limited resources during the installation.”
Planning for the overhaul meant accounting for 1 megawatt of power and 400 tons of cooling capacity. Racks must be factory-configured for optimal onsite operations, and NCCS also acquired 10 nodes for its Test and Development System (TDS). There is also the matter of scrubbing data from the old hardware as part of the decommissioning process.
A highlight of the new SGI clusters is the fully non-blocking interconnect fabric, where each 28-core node can communicate directly with every other node via FDR InfiniBand rated at 56 gigabits per second. The enhancements are being driven by the science workloads, which continue to push the compute and I/O envelope. Data volumes are also rising and a high-resolution simulation at NCCS can generate several petabytes of data. To ensure sufficient storage space, NCCS is more than doubling Discover’s online disk capacity from 12.4 to 33 petabytes.
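The quoted peak figures can be roughly sanity-checked from the core counts alone. A minimal back-of-the-envelope sketch, assuming the Xeon E5-2697 v3’s 2.6 GHz base clock and 16 double-precision flops per cycle per Haswell core (two AVX2 FMA units) – figures not stated in the article, so actual configurations and sustained rates may differ:

```python
# Hedged estimate of theoretical peak for the new SGI Rackable clusters.
# Assumptions (not from the article): 2.6 GHz base clock and
# 16 double-precision flops/cycle per core (2 AVX2 FMA units x 4 doubles x 2 ops).
CORES = 64_512            # total cores across SCUs 10, 11, and 12
CLOCK_HZ = 2.6e9          # assumed base clock of the Xeon E5-2697 v3
FLOPS_PER_CYCLE = 16      # assumed per-core double-precision rate

peak_flops = CORES * CLOCK_HZ * FLOPS_PER_CYCLE
print(f"~{peak_flops / 1e15:.2f} petaflops")  # → ~2.68 petaflops
```

Under these assumptions the three new SCUs alone land in the same neighborhood as the article’s petaflop figures; the remaining capacity toward the 3.3-petaflop total comes from the other Discover units.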