When the Jaguar supercomputer at Oak Ridge National Laboratory morphed into Titan in 2012, it delivered a huge increase in computational power. Now ORNL’s parallel file system, called Spider, has received a similar overhaul and is emerging as Spider II.
When it goes online this fall, the Lustre-based Spider II file system will deliver more than 1TB per second of bandwidth across ORNL’s InfiniBand-based network, up from the 240GB per second delivered by the original Spider. The total storage capacity of the file system grows from 10PB to 32PB, according to a story on the Oak Ridge Leadership Computing Facility (OLCF) website.
“At that speed we expect Spider II to be safely in league with the top three parallel file systems in the world,” Sarp Oral, the task lead for File and Storage Systems projects in the Technology Integration Group, within the National Center for Computational Sciences (NCCS), told the OLCF.
Spider was unique in that it was the first center-wide shared resource that served all major OLCF platforms, including Jaguar (now Titan), the LENS visualization cluster, the Smoky development cluster, and the lab’s GridFTP servers. Data stored centrally in Spider was accessible to these and other systems, which together comprised more than 26,000 compute nodes.
The physical dimensions of Spider have increased with Spider II, which occupies 672 square feet across four rows of cabinets. Inside the cabinets are I/O servers and a high-end storage array that controls more than 20,000 disks.
The Spider II project also included an upgrade to Lustre 2.4, which should improve the file system’s scalability and metadata performance and deliver other new features that will benefit the lab.
For example, Lustre 2.4 raises the number of object storage targets available to a single shared file from 160 to 2,000. An enhancement to the distributed namespace system will support a greater number of users and improve overall metadata performance and scalability, the OLCF says, while full recoveries of Titan can be performed in a matter of minutes, “a huge reduction from previous times.”
“Spider II allows our parallel file system to keep pace with the newly increased size and computational horsepower of Titan,” Bronson Messer of the NCCS Scientific Computing Group told the OLCF. “The anticipated metadata improvement, in particular, should enable our users to produce and analyze the kind of large, complex datasets we anticipate being produced on Titan. Spider II should be both bigger, and better.”
Spider II is the result of collaboration among many parties, including the OLCF staff, DataDirect Networks (DDN), Cray, Mellanox, and Dell.