
November 12, 2012

Reliability Matters – Your HPC Workloads are Thirsty for Enterprise Quality

Nicole Hemsoth

Is your current HPC data storage solution experiencing issues with disk drives?  Are you seeing performance degradation, where HPC projects take longer to complete than they should?  Is your performance situation normal, or are there reliable ways to achieve sustained performance at large HPC scale?

To help address these and other questions you might have when evaluating your data storage infrastructure, Seagate and Xyratex co-authored a white paper, “Achieving Rapid Scale in Enterprise and Cloud Data Centers with SAS.” The paper [1] provides insight into selecting the right disk drive for your application environment and specific performance, scalability and reliability needs. Anyone currently experiencing high rates of what appear to be drive-related issues, or anyone considering purchasing or leasing high-density storage solutions, would be well advised to consider these points. Those who aim to achieve reliable, sustained performance efficiently on HPC, enterprise or mission-critical applications would also benefit from reading this paper.

The Importance of Drive Design

Design tolerances and the features built into disk drives have a direct impact on performance in multi-spindle environments. Drives that are not optimized to handle rotational vibration (RV) have been shown in testing to lose more than 50 percent of their performance.  Also, the RV mitigation features provided in enterprise-class drives will not perform as effectively without adequate RV isolation designed into the multi-drive enclosure system.  Drive RV mitigation and enclosure RV isolation must act together to deliver a well-crafted RV management solution.  If RV is not taken into account in the design of the drive and multi-drive enclosure, the force of RV can push the disk drive head off track and cause missed revolutions and delays in data transfers.  These delayed read/write operations are the root of all vibration-induced I/O degradation.
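To put rough numbers on the cost of a missed revolution, the following Python sketch models a drive in which some fraction of operations must wait one extra platter rotation because the head was pushed off track. It is an illustration only, not a model taken from the white paper: the function names, the 7,200 RPM spindle speed, the 12 ms baseline service time and the miss probabilities are all assumptions chosen for the example.

```python
def revolution_time_ms(rpm: float) -> float:
    """Time for one full platter revolution, in milliseconds."""
    return 60_000.0 / rpm


def effective_iops(base_service_ms: float, rpm: float, miss_probability: float) -> float:
    """Expected operations per second when a fraction of operations
    each wait one extra revolution before the transfer can proceed.

    Simplifying assumption: an affected operation loses exactly one
    revolution; real drives may retry more than once.
    """
    expected_ms = base_service_ms + miss_probability * revolution_time_ms(rpm)
    return 1000.0 / expected_ms


if __name__ == "__main__":
    rpm = 7200.0      # assumed spindle speed
    base_ms = 12.0    # assumed average service time with no vibration
    baseline = effective_iops(base_ms, rpm, 0.0)
    for p in (0.0, 0.25, 0.5, 1.0):
        iops = effective_iops(base_ms, rpm, p)
        loss = 1.0 - iops / baseline
        print(f"miss probability {p:.2f}: {iops:6.1f} IOPS, {loss:5.1%} below baseline")
```

Under these assumed figures, each missed revolution adds roughly 8.3 ms to an operation, so even a modest miss rate takes a visible share of throughput away from a drive whose normal service time is of the same order.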

Seagate and Xyratex point out in this new paper that “the use of lower-end commodity technologies derived from department-level and workgroup clients as well as the blending, merging and displacement of former data center and enterprise techniques underscore the need for broad industry education regarding the facts about storage technologies.”  In most cases, poor drive reliability results from deploying the wrong type of storage device within an enterprise-class system, or for a specific enterprise-class workload.  Hard disk drives, being mechanical devices, are designed with specific features and components for specific workloads.

Improper management of RV can be subtle. It can be introduced into your project by selecting a disk drive class that is inappropriate for the application load, or an enclosure that lacks design margin relative to the application and the selected disk drive.  These factors matter less if reliable, sustained performance is not a key purchasing criterion; there are plenty of archive and low-performance bulk storage applications where attention to RV is not as critical.  However, in the case of high-density HPC data storage, reliable and sustained performance at massive scale is paramount.

Since HPC storage solutions provide numerous data protection methods, improper management of RV does not automatically translate into something as obvious as data loss. Instead, it can result in a lingering performance impact, intermittent errors and escalating service costs, which are quite literally built into the storage system for given application load levels.  To overcome these avoidable design limitations, Seagate and Xyratex contrast disk drive types and point out the range of mission-critical design characteristics available with high-performance, enterprise-class, nearline SAS drives.

Drive Testing Critical to Improving Performance

In addition to selecting the right drive type, the white paper describes intensive solution and component test methods adopted by Xyratex to improve drive reliability and system robustness by detecting individual drive weaknesses or defects early and thoroughly exercising enclosure-level RV isolation design techniques.  Xyratex’ four-stage Integrated System Testing Platform (ISTP)[2] includes a highly efficient and scalable storage test that exposes, identifies and eliminates devices with inherent defects or defects resulting from manufacturing aberrations that cause time- and stress-dependent failures.  This process removes hidden quality problems and significantly reduces in-the-field component failures.  It also represents attention to drive quality and solution robustness beyond business-as-usual expectations, and offers a useful perspective on what is attainable in raising the bar on solution performance and reliability among HPC storage providers.

Xyratex’ ISTP process draws on the company's drive-testing experience: 50 percent of worldwide disk drives are produced utilizing Xyratex disk drive test and processing technologies.  Further, Xyratex is the industry’s largest OEM storage manufacturer, with over 25 years of experience and innovation in end-to-end engineering design, manufacturing and field failure analysis supporting the entire market from entry and mid-range enterprises to emerging HPC, cloud and solid state storage platforms.

Performance Solution Possibilities

The Xyratex ClusterStor™ 6000 is an example of a scale-out HPC data storage solution designed to satisfy the linear file system processing and data capacity scaling needs for state-of-the-art HPC systems, supporting hundreds of GB/s to 1TB/s Lustre® file system throughput and beyond.  ClusterStor features enterprise-class, nearline SAS drives that are tested, packaged and sourced using Xyratex’ attention to comprehensive quality and high-density solution-level robustness.

Xyratex goes above and beyond with all components of the ClusterStor high-density solution, including metadata servers, object storage servers and object storage targets that are factory-integrated, tested and supported by one company.  Xyratex’ methodical attention to integral solution quality drives ClusterStor’s seamless integration from the lowest level component to highest-level management interface, as well as its linear file system processing and capacity scaling capabilities.  In addition, Xyratex has unique partnerships with drive suppliers, providing insights into low-level drive testing as well as extensive high-density storage design experience. Accordingly, Xyratex data storage solutions are designed to routinely exceed the quality and reliability figures of other industry offerings.[2]

This white paper points out the range of mission-critical design characteristics available with enterprise-class, nearline SAS drives and provides insight into leading high-density solution design methods that raise the bar on solution performance and reliability among HPC storage providers.

The Seagate and Xyratex white paper is available here

[1] “Achieving Rapid Scale in Enterprise and Cloud Data Centers with SAS,” November 2012, Seagate & Xyratex Whitepaper, Topic: Enterprise Nearline vs. Desktop.

[2] “How Do You Get To 1TB/s? Quality.” HPC Wire, October 29, 2012.