As Moore’s law slows, HPC developers are increasingly looking for speed gains in specialized code and specialized hardware. That specialization, in turn, can make testing and deploying code trickier than ever. Now, researchers from Texas A&M University, the University of Illinois at Urbana-Champaign and the University of Texas at Austin have teamed up, with NSF funding, to build a $5 million prototype supercomputer (“ACES”) with a dynamically configurable smörgåsbord of hardware, aiming to support developers as hardware needs grow ever more diverse.
ACES (short for “Accelerating Computing for Emerging Sciences”) is presented as an “innovative composable hardware platform.” ACES will leverage a PCIe-based composable framework from Liqid to offer access to Intel’s Sapphire Rapids processors with high-bandwidth memory and to more than 20 accelerators: Intel FPGAs; NEC Vector Engines; NextSilicon co-processors; Graphcore IPUs (Intelligence Processing Units); and Intel’s forthcoming Ponte Vecchio GPUs. All this hardware will be coupled with Intel Optane memory and DDN Lustre storage and connected with Mellanox NDR 400Gbps networking.
“ACES will enable applications and workflows to dynamically integrate the different accelerators, memory, and in-network computing protocols to glean new insights by rapidly processing large volumes of data,” the NSF grant reads, “and provide researchers with a unique platform to produce complex hybrid programming models that effectively supports calculations that were not feasible before.”
“ACES takes the next step over current and planned XSEDE resources by incorporating composability, reconfigurable hardware, novel accelerators, high bandwidth memory processors and networking that are currently not available to researchers,” Honggao Liu, executive director of Texas A&M’s High Performance Research Computing (HPRC) and principal investigator for the ACES project, told HPCwire. “ACES leverages Liqid’s innovative composable infrastructure platform that unifies multi-fabric support for composability across PCIe 5.0 allowing it to dynamically pair over 20 different accelerators or Optane SSDs to a compute node based on user’s job requirements. The correct accelerators can be used based on the workflow, while unblocked resources may be freely allocated to other jobs.”
“They’ll be able to essentially build the custom environment they require on a per job basis and not be constrained to the contents of a physical server node,” added Timothy Cockerill, director of user services for the Texas Advanced Computing Center (TACC) and co-principal investigator for ACES.
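To make the per-job composition idea concrete, here is a minimal Python sketch of a shared device pool. It is an illustration only and does not represent Liqid's software or the ACES scheduler: the `ComposablePool` class, its `compose` and `release` methods, the node names and the device counts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ComposablePool:
    """Hypothetical pool of fabric-attached devices, keyed by device type."""
    free: dict[str, int]
    attached: dict[str, dict[str, int]] = field(default_factory=dict)

    def compose(self, node: str, request: dict[str, int]) -> bool:
        """Attach the requested devices to a node, all or nothing."""
        if node in self.attached:
            return False  # this sketch allows one composition per node
        if any(self.free.get(kind, 0) < count for kind, count in request.items()):
            return False  # not enough free devices of some requested type
        for kind, count in request.items():
            self.free[kind] -= count
        self.attached[node] = dict(request)
        return True

    def release(self, node: str) -> None:
        """Return a node's devices to the pool when its job ends."""
        for kind, count in self.attached.pop(node, {}).items():
            self.free[kind] += count


# Illustrative counts only, not the actual ACES inventory.
pool = ComposablePool(free={"gpu": 4, "fpga": 2, "ipu": 2})
first = pool.compose("node01", {"gpu": 3, "fpga": 1})  # fits in the pool
second = pool.compose("node02", {"gpu": 2})            # only one GPU is free
pool.release("node01")                                 # job on node01 ends
third = pool.compose("node02", {"gpu": 2})             # now the request fits
```

In this toy model, the second request fails only while the first job holds its devices. Once they are released, the same hardware can be attached to a different node, which is the behavior Liu and Cockerill describe.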
Liu said that the team hopes the ACES platform will be deployed by September 2022 and that it will be housed in a datacenter on the Texas A&M campus.
The ACES system will support researchers across a broad range of disciplines, with the team listing everything from health population informatics and agriculture sciences to climate modeling and quantum chemistry among the possible applications of the versatile hardware. Liu explained that the ACES resources will be coordinated through systems supported by the NSF.
“In this way, the ACES system will provide invaluable support to cutting-edge projects across a broad spectrum of research disciplines in the nation,” Liu said. “ACES will also leverage HPRC’s efforts that promote science and broaden participation in computing at the K-12, collegiate and professional levels to have a transformative impact nationally by focusing on training, education and outreach.”
“Exciting advances on many science frontiers will become possible by harnessing the hybrid computing resources and highly adaptable framework offered by ACES to enable increasingly complex scientific workflows driven by geospatial big data and artificial intelligence,” added Shaowen Wang, a professor of geography and geographic information science at the University of Illinois at Urbana-Champaign and co-principal investigator for ACES. (The project’s other co-principal investigators include Lisa Perez and Dhruva Chakravorty, both of HPRC at Texas A&M.)
The grant allots $5 million for the system from October 2021 through an estimated end date of September 2026, plus an additional $1 million per year for five years to operate and support it. The grant also marks another success for rising HPC star Liqid, which just last year racked up three consecutive wins under the Department of Defense’s High Performance Computing Modernization Program (HPCMP).
“A core tenet of ACES is that the compute task should be attributed to the technology that is best suited to work on it, freeing researchers to truly leverage the strengths of these technologies,” Liu told HPCwire. “By letting researchers run on processors and accelerators best suited for their workflows, ACES will benefit many research and development projects in science and engineering disciplines to glean new insights from rapidly processing large volumes of data.”