Derechos, the namesake of the new supercomputer coming to the National Center for Atmospheric Research (NCAR), are fast-moving, widespread bands of thunderstorms. Indeed, NCAR itself is moving quickly and ambitiously with the new system – its third major installation since 2012. Irfan Elahi, director of NCAR’s high-performance computing division, recently spoke to HPCwire about the development and timeline for Derecho, as well as plans further into the future.
The nuts and bolts
First, the specs: Derecho, built by HPE, will be water-cooled and predominantly powered by third-generation AMD Epyc Milan CPUs and Nvidia’s 40GB A100 GPUs, with 2,488 CPU-only dual-socket nodes (256GB of memory per node) and 82 single-socket heterogeneous nodes (four A100s and 512GB of memory per node). In total, the system is equipped with 692TB of memory, 328 A100 GPUs and 5,058 Milan CPUs, all connected by HPE Slingshot v11 networking.
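For readers who want to check the totals, the short Python sketch below recomputes them from the per-node figures quoted above. The function and constant names are illustrative, not NCAR's or HPE's. The memory calculation rests on an assumption: counting the A100s' 40GB of on-board memory alongside node memory, in decimal terabytes, is one reading that reproduces the quoted 692TB. The article does not say how that figure was derived.

```python
# Per-node figures as quoted in the article.
CPU_NODES = 2488          # dual-socket, CPU-only nodes
GPU_NODES = 82            # single-socket nodes with four A100s each
CPU_NODE_MEM_GB = 256
GPU_NODE_MEM_GB = 512
GPUS_PER_GPU_NODE = 4
GPU_MEM_GB = 40           # memory on each A100


def derecho_totals():
    """Return (cpus, gpus, total_memory_tb) from the quoted node counts."""
    cpus = CPU_NODES * 2 + GPU_NODES * 1
    gpus = GPU_NODES * GPUS_PER_GPU_NODE
    host_mem_gb = CPU_NODES * CPU_NODE_MEM_GB + GPU_NODES * GPU_NODE_MEM_GB
    # Assumption: the quoted total also counts GPU on-board memory.
    gpu_mem_gb = gpus * GPU_MEM_GB
    total_tb = (host_mem_gb + gpu_mem_gb) / 1000  # decimal TB (assumption)
    return cpus, gpus, total_tb


if __name__ == "__main__":
    cpus, gpus, mem_tb = derecho_totals()
    print(f"CPUs: {cpus}, GPUs: {gpus}, memory: {mem_tb:.0f} TB")
```

Under those assumptions the CPU and GPU counts match the article exactly, and node memory alone comes to roughly 679TB before the GPU memory is added.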
This combined hardware will deliver 19.87 peak petaflops – more than triple the performance of Derecho’s predecessor, Cheyenne (5.34 peak petaflops). Cheyenne, installed in 2016, was itself preceded by the 1.26-peak petaflops Yellowstone system in 2012.
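The generational jumps can be checked the same way. The sketch below uses only the peak figures quoted above; the dictionary and function names are hypothetical, and peak petaflops are a theoretical ceiling rather than a measure of what applications will sustain.

```python
# Peak petaflops as quoted in the article.
PEAK_PFLOPS = {
    "Yellowstone": 1.26,  # 2012
    "Cheyenne": 5.34,     # 2016
    "Derecho": 19.87,
}


def peak_ratio(newer, older):
    """Ratio of peak petaflops between two of the systems above."""
    return PEAK_PFLOPS[newer] / PEAK_PFLOPS[older]


if __name__ == "__main__":
    print(f"Derecho vs. Cheyenne: {peak_ratio('Derecho', 'Cheyenne'):.2f}x")
    print(f"Cheyenne vs. Yellowstone: {peak_ratio('Cheyenne', 'Yellowstone'):.2f}x")
```

On these numbers, Derecho's peak is a little over 3.7 times Cheyenne's, consistent with "more than triple."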
Derecho’s firepower will be deployed in service of all things atmospheric and many things environmental, with Elahi noting applications ranging from severe weather (thunderstorms, tornadoes and hurricanes) and climate change to water availability, wildfires, renewable energy, subsurface flow of oil and gas, solar storms and more. “Mainly the supercomputer will enable research that will lead to more detailed and useful prediction capabilities which will have significant societal benefit,” he said, “especially by getting more resilient to climate change.”
A gathering storm
HPE’s win was announced in January of this year, but plans for Derecho had been in the works for years prior. “We kicked off this project in … late summer 2018, and we started by doing a workload analysis study,” Elahi said. “We also then created a panel … and I think that it had 43 different members – diverse both in terms of gender [and] ethnicity but also in subject matter expertise, because we wanted to … look at this subject matter expertise across the earth systems sciences, and we worked with the Science Requirement Advisory Panel to get their requirements[.]”
Through this process, NCAR developed a suite of benchmarks for measuring a new system, which was then referred to as NWSC-3 after the NCAR-Wyoming Supercomputing Center (NWSC) where NCAR’s supercomputers have been housed. With the benchmarks in hand, NCAR put out an RFI for the system, working with “four or five” potential vendors, including through a workshop that brought together researchers and vendors to strategize for system-building. After the RFP was issued and the dust settled, Elahi said that NCAR selected for the best value – not the lowest cost – and landed on HPE.
Updating the forecast
Derecho is still, however, a little ways off. First, a test system – Gust – will launch around February of 2022. Then, Elahi said, “Derecho itself is going to be delivered sometime mid-March – the first quarter of next year.” Once it’s installed and tested – a six- to seven-week process, he said – NCAR will open the system to its inaugural external users. These users will come from the Accelerated Science Discovery (ASD) program, which is soliciting proposals from researchers whose projects involve “actionable science” of relevance to NCAR’s core objectives. These ASD users, Elahi explained, would help beta test the system for a couple of months over the summer before its wider launch and help NCAR to push itself into the next generation of supercomputing. “The whole idea about ASD is these new, upcoming applications,” he said.
“After ASD, we will open it to all of the user community,” Elahi continued. However, Cheyenne – which Elahi noted has been a remarkably reliable system, with just one power outage over the last few years – will continue running until “sometime in late December of 2022.” “In order to help our users transition and migrate to the new environment, we want to provide them an overlap of six months,” Elahi said.
Derecho will be housed in the NWSC datacenter, which Elahi said was LEED Gold certified. “The most important thing, I think, is application energy efficiency,” he said, speaking to the sustainability of the new system. “Because … growth in peak or sustained computing capability is just one thing – but power efficiency is another.” Derecho, he said, will deliver roughly three to three and a half times as many flops per watt as Cheyenne.
Looking further into the future, Elahi noted that computer processing was stagnating, pointing instead to technologies like accelerators, GPUs, FPGAs and AI as the sources of greater computing power and efficiency. And, he said, NCAR would be looking to push the efficiency angle even harder for its fourth system. “One of the things we want to do for our next system is also to look at carbon footprint and sustainability as a specification,” he said.