Reliability Matters – Your HPC Workloads are Thirsty for Enterprise Quality

By Nicole Hemsoth

November 12, 2012

Is your current HPC data storage solution experiencing issues with disk drives? Are you seeing performance degradation, where HPC projects take longer to complete than they should? Is your performance situation normal, or are there reliable ways to achieve sustained performance at large HPC scale?

To help address these and other questions you might have when evaluating your data storage infrastructure, Seagate and Xyratex co-authored a white paper, “Achieving Rapid Scale in Enterprise and Cloud Data Centers with SAS.” The paper [1] provides insight into selecting the right disk drive for your application environment and your specific performance, scalability and reliability needs. Anyone currently experiencing high rates of what appear to be drive-related issues, or anyone considering purchasing or leasing high-density storage solutions, would be well advised to consider these points. Those aiming to achieve reliable, sustained performance efficiently on HPC, enterprise or mission-critical applications would also benefit from reading this paper.

The Importance of Drive Design

Design tolerances and features built into disk drives have a direct impact on performance in multi-spindle environments. Drives that are not optimized to handle rotational vibration (RV) have been shown in testing to lose more than 50 percent of their performance. Likewise, the RV mitigation features provided in enterprise-class drives will not perform as effectively without adequate RV isolation designed into the multi-drive enclosure. Drive RV mitigation and enclosure RV isolation must act together to deliver a well-crafted RV management solution. If RV is not taken into account in the design of both the drive and the multi-drive enclosure, the force of RV can push the disk drive head off track, causing missed revolutions and delays in data transfers. These delayed read/write operations are the root of all vibration-induced I/O degradation.
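To make the cost of a missed revolution concrete, here is a minimal back-of-the-envelope sketch. It is not taken from the white paper: the function names, the 7,200 RPM spindle speed, the 10 ms vibration-free service time and the idea of an average number of missed revolutions per I/O are all assumptions chosen purely for illustration.

```python
"""Toy model: how missed revolutions stretch disk I/O service time."""


def revolution_time_ms(rpm: float) -> float:
    """Return the time for one full platter revolution, in milliseconds."""
    return 60_000.0 / rpm


def effective_iops(rpm: float, base_service_ms: float, missed_revs_per_io: float) -> float:
    """Return expected I/Os per second for a single drive.

    base_service_ms    -- assumed average service time with no vibration
    missed_revs_per_io -- assumed average number of extra revolutions each
                          I/O waits because the head was pushed off track
    """
    penalty_ms = missed_revs_per_io * revolution_time_ms(rpm)
    return 1000.0 / (base_service_ms + penalty_ms)


if __name__ == "__main__":
    rpm = 7200.0    # assumed spindle speed (hypothetical value)
    base_ms = 10.0  # assumed vibration-free service time (hypothetical value)
    for missed in (0.0, 0.25, 0.5, 1.0):
        iops = effective_iops(rpm, base_ms, missed)
        print(f"{missed:.2f} missed revs per I/O -> {iops:.1f} IOPS")
```

At 7,200 RPM one revolution takes roughly 8.3 ms, so under these assumed numbers a single missed revolution nearly doubles the service time of an I/O. Real drives and enclosures behave in more complex ways; the sketch only shows why even occasional off-track events can add up to a large throughput loss.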

Seagate and Xyratex point out in this new paper that “the use of lower-end commodity technologies derived from department-level and workgroup clients as well as the blending, merging and displacement of former data center and enterprise techniques underscore the need for broad industry education regarding the facts about storage technologies.” In most cases, poor drive reliability is the result of deploying the wrong type of storage device within an enterprise-class system, or for a specific enterprise-class workload. Hard disk drives, being mechanical devices, are designed with specific features and components for specific workloads.

Improper management of RV can be subtle. It can be introduced into your project by selecting a disk drive class that is ill-suited to the application load, or an enclosure that lacks design margin for the application and the selected disk drive. These factors do not matter if reliable, sustained performance is not a key purchasing criterion; there are plenty of archive and low-performance bulk storage applications where attention to RV is not as critical. However, in the case of high-density HPC data storage, reliable and sustained performance at massive scale is paramount.

Since HPC storage solutions provide numerous data protection methods, improper management of RV does not automatically translate into something as obvious as data loss. Instead, it can result in a lingering performance impact, intermittent errors and escalating service costs, all of which are quite literally built into the storage system for given application load levels. To overcome these avoidable design limitations, Seagate and Xyratex contrast disk drive types and point out the range of mission-critical design characteristics available with high-performance, enterprise-class, nearline SAS drives.

Drive Testing Critical to Improving Performance

In addition to selecting the right drive type, the white paper describes the intensive solution and component test methods Xyratex has adopted to improve drive reliability and system robustness. These methods detect individual drive weaknesses or defects early and thoroughly exercise enclosure-level RV isolation design techniques. Xyratex’ four-stage Integrated System Testing Platform (ISTP)[2] includes a highly efficient and scalable storage test that exposes, identifies and eliminates devices with inherent defects, or with defects resulting from manufacturing aberrations, that cause time- and stress-dependent failures. This removes hidden quality problems and significantly reduces in-the-field component failures. It also represents attention to drive quality and solution robustness beyond business-as-usual expectations, and it offers a useful perspective on what is attainable in raising the bar on solution performance and reliability among HPC storage providers.

Xyratex’ ISTP process is based on the fact that 50 percent of worldwide disk drives are produced utilizing Xyratex disk drive test and processing technologies.  Further, Xyratex is the industry’s largest OEM storage manufacturer, with over 25 years of experience and innovation in end-to-end engineering design, manufacturing and field failure analysis supporting the entire market from entry and mid-range enterprises to emerging HPC, cloud and solid state storage platforms.

Performance Solution Possibilities

The Xyratex ClusterStor™ 6000 is an example of a scale-out HPC data storage solution designed to satisfy the linear file system processing and data capacity scaling needs for state-of-the-art HPC systems, supporting hundreds of GB/s to 1TB/s Lustre® file system throughput and beyond.  ClusterStor features enterprise-class, nearline SAS drives that are tested, packaged and sourced using Xyratex’ attention to comprehensive quality and high-density solution-level robustness.

Xyratex goes above and beyond with all components of the ClusterStor high-density solution, including metadata servers, object storage servers and object storage targets that are factory-integrated, tested and supported by one company.  Xyratex’ methodical attention to integral solution quality drives ClusterStor’s seamless integration from the lowest level component to highest-level management interface, as well as its linear file system processing and capacity scaling capabilities.  In addition, Xyratex has unique partnerships with drive suppliers, providing insights into low-level drive testing as well as extensive high-density storage design experience. Accordingly, Xyratex data storage solutions are designed to routinely exceed the quality and reliability figures of other industry offerings.[2]

This white paper points out the range of mission-critical design characteristics available with enterprise-class, nearline SAS drives and provides insight into leading high-density solution design methods that raise the bar on solution performance and reliability among HPC storage providers.

The Seagate and Xyratex white paper is available here

[1] “Achieving Rapid Scale in Enterprise and Cloud Data Centers with SAS,” November 2012, Seagate & Xyratex Whitepaper, Topic: Enterprise Nearline vs. Desktop.

[2] “How Do You Get To 1TB/s? Quality.” HPCwire, October 29, 2012.

 
