One of Los Alamos National Laboratory’s central missions is guarding the safety, security, and reliability of the U.S. nuclear stockpile as part of the National Nuclear Security Administration’s (NNSA) Stockpile Stewardship program. Computer simulation, of course, is a key tool in the effort, and LANL uses its Trinity supercomputer (a Cray XC40) to run simulations and to explore new technology that could speed them up.
A persistent challenge is that the high-resolution 3D simulations of the nuclear stockpile also generate extremely large datasets. Used to ensure resiliency during long-running simulations and for data analysis and visualization, these datasets are valuable but also limit application performance and throughput. As LANL points out in a description of the issue, “The application runs for extended periods of time — often several months on end. As a result, data flushing to the Lustre file system creates inefficiencies because the application stops completely during the flushing process.”
In confronting the problem, LANL had essentially two options: 1) frequent checkpointing, with short recovery periods but highly inefficient run times; or 2) a bare-minimum number of checkpoints, with efficient run times but long recovery periods.
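The tradeoff can be made concrete with the classic Young approximation for the optimal checkpoint interval — not part of the Cray case study, but a standard back-of-the-envelope rule; the checkpoint cost and MTBF values below are made-up placeholders:

```c
/* Illustrative only -- not from the LANL/Cray case study.
 * Young's approximation for the checkpoint interval that balances
 * checkpoint overhead against expected rework after a failure:
 *   t_opt ~= sqrt(2 * checkpoint_cost * MTBF)
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double checkpoint_cost = 600.0;   /* seconds to write one checkpoint (assumed) */
    double mtbf = 86400.0;            /* mean time between failures, e.g. one day (assumed) */

    double t_opt = sqrt(2.0 * checkpoint_cost * mtbf);
    printf("Optimal checkpoint interval: ~%.0f s (~%.1f h)\n",
           t_opt, t_opt / 3600.0);

    /* Faster checkpoints (a smaller checkpoint_cost) shrink t_opt, so you
     * can checkpoint more often without hurting run-time efficiency --
     * which is exactly what a fast burst buffer buys. */
    return 0;
}
```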
As outlined in a case history posted by Cray, the company proposed an idea Gary Grider, division leader of the HPC division at LANL, had begun pursuing several years prior: “That idea was to develop an SSD storage product consisting of service nodes connected directly to the Aries network, each containing two SSDs and an API/library with functions to initiate stage in/stage out and query stage state. The solution could be configured in multiple modes using the workload manager,” according to Cray.
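For flavor, here is a minimal sketch of what that stage-in/query pattern looks like against Cray’s libdatawarp C interface. The function names (dw_stage_file_in, dw_query_file_stage) follow Cray’s published documentation, but the header name, file paths, and polling loop here are assumptions, not code from the case study:

```c
/* Hedged sketch: staging an input file from Lustre into the DataWarp
 * burst buffer and polling its stage state. Function names follow
 * Cray's documented libdatawarp interface; paths are hypothetical. */
#include <stdio.h>
#include <datawarp.h>   /* Cray libdatawarp header (assumed name) */

int main(void)
{
    const char *dw_file  = "/var/dw/job/input.dat";      /* hypothetical burst-buffer path */
    const char *pfs_file = "/lustre/scratch/input.dat";  /* hypothetical Lustre path */

    /* Kick off an asynchronous stage-in from the parallel file system. */
    if (dw_stage_file_in(dw_file, pfs_file) != 0) {
        fprintf(stderr, "stage-in request failed\n");
        return 1;
    }

    /* Query stage state until the transfer completes. A real application
     * would back off between polls, or use the library's blocking wait
     * call (dw_wait_file_stage) instead. */
    int complete = 0, pending = 0, deferred = 0, failed = 0;
    do {
        if (dw_query_file_stage(dw_file, &complete, &pending,
                                &deferred, &failed) != 0 || failed) {
            fprintf(stderr, "stage query failed\n");
            return 1;
        }
    } while (!complete);

    /* The application can now read dw_file at SSD speed. */
    return 0;
}
```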
The resulting solution, noted Cray, is its Cray DataWarp applications I/O accelerator, which runs at the root level and exposes a library that applications can call in real time, thereby circumventing a secondary scheduler. LANL reports achieving its 3 TB/s bandwidth goal, a 15x improvement in I/O performance, and a reduction in checkpoint restart times to 60 seconds (down from tens of minutes).
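The checkpoint path is the other half of the story: the application writes its checkpoint to the burst buffer at SSD speed, then asks DataWarp to drain it to Lustre in the background so compute resumes immediately. The sketch below assumes the documented dw_stage_file_out call and DW_STAGE_IMMEDIATE stage type from libdatawarp; the surrounding function and paths are hypothetical:

```c
/* Hedged sketch of the post-checkpoint hand-off. Because the stage-out
 * is asynchronous, the simulation does not block on Lustre -- the
 * behavior the article attributes to DataWarp. */
#include <stdio.h>
#include <datawarp.h>   /* Cray libdatawarp header (assumed name) */

/* Called after the checkpoint file has been written to the burst buffer. */
int checkpoint_done(const char *dw_ckpt, const char *pfs_ckpt)
{
    /* Begin draining the just-written checkpoint to the parallel file
     * system in the background; DW_STAGE_IMMEDIATE starts the transfer
     * now rather than deferring it to job end. */
    if (dw_stage_file_out(dw_ckpt, pfs_ckpt, DW_STAGE_IMMEDIATE) != 0) {
        fprintf(stderr, "stage-out request failed\n");
        return -1;
    }
    return 0;   /* simulation resumes while the drain proceeds */
}
```

Invoking the library directly from the application like this is what lets the job skip a secondary scheduling step for I/O.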
LANL says achieving the improved efficiency at 1,024 nodes (32,768 cores) means more bandwidth is available. Trinity, you may recall, is a formidable machine, spec’d out at 41.5 PF (peak), 2+ PB of memory, 19,420 compute nodes, a 78 PB parallel file system, and 3.7 PB of burst buffer storage.
“This gives us an order of magnitude increase in I/O performance with extreme predictability,” according to Galen Shipman, computer science lead on the Eulerian applications project. “We’ve never had this level of performance predictability before.”
Link to case history: https://www.cray.com/sites/default/files/Cray-CS-LANL-DataWarp_1.pdf