HPC up-and-comer Liqid has received its third system order from the Department of Defense’s High Performance Computing Modernization Program (HPCMP) in a month. The $20.6 million contract calls for a 17-petaflops ‘composable supercomputing’ system to be deployed at the U.S. Army Corps of Engineers’ Engineer Research and Development Center (ERDC) in Vicksburg, Mississippi, in 2021.
The system, named Wheat in recognition of Medal of Honor recipient Roy M. Wheat, will support a mix of HPC, AI and data analytics workloads to serve HPCMP’s broad user base across the military complex.
On track to be HPCMP’s most powerful system yet, Wheat will span 904 Cascade Lake AP nodes in four configurations (standard, AI/ML, visualization, and high-memory). Altogether, the system comprises 1,808 Xeon Platinum 9242 (Cascade Lake AP) CPUs, 536 Nvidia A100 GPUs, 391 terabytes of memory, 4.5 petabytes of Liqid all-flash NVMe-oF parallel file system storage (to be integrated with the existing storage file system), and HDR 200 Gbps InfiniBand networking.
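For readers who want to sanity-check the headline figures, the short Python sketch below derives a few per-node averages from the numbers quoted above. The 48-cores-per-CPU value is an assumption taken from Intel's published specification for the Xeon Platinum 9242, not from the contract announcement, and the averages say nothing about how memory or GPUs are actually distributed across the four node types.

```python
# Figures quoted in the article.
NODES = 904
CPUS = 1808
GPUS = 536
MEMORY_TB = 391

# Assumption (not stated in the article): Intel lists the
# Xeon Platinum 9242 as a 48-core part.
CORES_PER_CPU = 48


def summarize():
    """Return simple derived totals and averages for the system."""
    return {
        "cpus_per_node": CPUS / NODES,
        "total_cores": CPUS * CORES_PER_CPU,
        # Decimal terabytes to gigabytes; an average only, since
        # high-memory nodes will carry more than standard nodes.
        "avg_memory_gb_per_node": MEMORY_TB * 1000 / NODES,
        "avg_gpus_per_node": GPUS / NODES,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.2f}")
```

The CPU count works out to exactly two sockets per node, which is consistent with a dual-socket node design.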
Liqid’s composable architecture creates a flexible shared resource pool over the PCIe 4.0 bus. “This allows you to take almost any of these nodes and make some of the resources available to other node types in the system via the PCIe switch,” said George Moncrief, chief technologist at ERDC’s DoD Supercomputing Resource Center (DSRC), in an interview with HPCwire. Although the CPUs leverage PCIe Gen3, Moncrief noted that the 32 Gbps PCIe 4.0 pipes are a boon for shared bandwidth, delivered over the switched architecture.
Accommodating today’s mixed workloads, where HPC users are pulling in AI and AI users are moving to physics-informed machine learning, is a chief goal of the HPCMP.
“With this system, we wanted to be able to better support some of the emerging high performance data analytics projects better than we had with some of our more traditional supercomputers without stepping so far away from what our scientific users have been used to or to make it too different for them,” said Ben Parsons, chief technology officer for the HPCMP.
Wheat will be used by HPCMP’s broad user base, and will be a key resource for the CREATE program, which models military platforms, including ground vehicles, ships, and airframes. These application areas make heavy use of fluid dynamics modeling, finite element analysis and meshing. There are also pure AI/ML users who may be bringing in the data set from somewhere else and need the GPUs to crunch the data, said Parsons.
Parsons highlighted a new approach, where instead of using high performance scratch file systems, HPCMP will be connecting the upcoming system to a high performance center-wide file system that will be able to interface with different resources. “If we want to bring in a machine special for AI/ML, we’ll be able to also interconnect that into the same file system,” said Parsons. “And we expect that to help us better support some of the emerging jobs that we’re seeing.”
Moving away from typical erase-often scratch storage opens up the door to data management policies better suited to users who are transferring in data to do their analytics work, Parsons said.
Broomfield, Colo.-based Liqid, a newcomer to the DoD’s HPCMP datacenters, is on track to supply the DoD with three machines in 2021, at an aggregate contract value of more than $50 million. In addition to Wheat, slated to be stood up at the ERDC in early 2021, two further systems, Jean and Kay, are on track to be deployed at the Army Research Laboratory (ARL) DSRC in mid-fiscal 2021.
Moncrief and Parsons cited the parallel priorities of satisfying technical requirements and price-performance targets when fielding these multimillion-dollar high-tech instruments. They also underscored the importance of fostering diversity, to the extent possible, within the system provider ecosystem, which, as we know, is under constant M&A and consolidation pressures.
“We try to have a balanced approach to the evaluation; we’d like a good price-performance number as well as a good fit for our requirements,” said Moncrief. “And we are also looking at maximizing competition. So we try and write these requirements in such a way and evaluate all of those factors, so that as many vendors as possible have an opportunity to win.”
The technologists say they’re excited to be working with Liqid. “They’re a smaller business, and they have a lot of enthusiasm,” said Moncrief. “That’ll work to our advantage as we partner with them to make this a success.”
The HPCMP funds and oversees the operation of five DoD Supercomputing Resource Centers (DSRCs): AFRL, ARL, ERDC, Navy, and MHPCC. The Maui High Performance Computing Center (MHPCC) is considered a vanguard center, investigating new computing technologies as they become available. Of the CONUS (continental U.S.) centers, Moncrief told us that ERDC and ARL have traditionally procured new machines in even-numbered years, while the Naval Lab and Air Force Research Laboratory do so in odd-numbered years.