Reliable Memory: Coming to a GPU Near You

By Michael Feldman

September 2, 2009

GPUs are becoming more like CPUs. But in the critical area of error corrected memory, graphics hardware still lags. The lack of error correction is probably the single biggest factor that makes users of GPUs for high performance computing nervous. Some HPC applications are resistant to the occasional bad data value, but many are not. The good news is that graphics chip vendors are aware of the problem and it appears to be only a matter of time before GPUs get a memory makeover.

Before AMD and NVIDIA brought GPU computing onto the scene, graphics processors didn’t really need to be concerned with error-prone memory. If a pixel’s color is off by a bit or two, nobody is going to notice as the images go flying by. So it was natural (and cheaper) for GPU devices to be built without support for error corrected memory. In 2006, with the advent of general-purpose computing on graphics processing units, otherwise known as GPGPU, the issue of reliable memory came to the fore.

The problem is that when you’re using the GPU as a math accelerator, a memory bit that flips in a data value can silently corrupt a result. Obviously in numerical calculations, accuracy matters. That’s why all standard CPU servers today come with memory that supports Error Correcting Codes (ECC), as well as with on-chip intelligence for error checking and correction in cache and local data structures. The reason that general-purpose computing can be done on GPUs at all has to do with the relatively infrequent occurrence of these errors on standard graphics hardware. Algorithms are run many times in a typical technical computing application, so anomalous results can be averaged out, or even manually discarded.
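
For readers who want to see the principle in action, here is a minimal sketch (in Python, purely for illustration and not any vendor’s implementation) of how a Hamming-style code adds a few parity bits to a data word so that a single flipped bit can be located and repaired. Server-class ECC DIMMs apply the same idea, in a wider SECDED form, to 64-bit words.

```python
# Minimal Hamming(7,4) sketch: encode 4 data bits with 3 parity bits,
# then correct any single flipped bit. Illustration only.

def encode(d):                      # d = [d1, d2, d3, d4]
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4               # each parity bit covers an overlapping subset
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]   # codeword positions 1..7

def correct(c):                     # c = 7-bit codeword, possibly with one flipped bit
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # recompute each parity over its positions
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 * 1 + s2 * 2 + s3 * 4    # points at the bad bit position (1-based)
    if syndrome:
        c[syndrome - 1] ^= 1        # flip it back
    return [c[2], c[4], c[5], c[6]] # recovered data bits

word = [1, 0, 1, 1]
cw = encode(word)
cw[5] ^= 1                          # simulate a cosmic-ray bit flip
assert correct(cw) == word          # single-bit error is silently repaired
```

Real ECC hardware does this for every memory access, at wire speed, which is exactly why it costs die area and power.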

The only simple way to circumvent the problem on the current crop of GPUs is to run the code twice (or simultaneously on two separate devices). If the results don’t match, you assume an error occurred and you rerun the offending sequence. It’s relatively bulletproof, but you’ve cut your price-performance in half for the sake of error correction. A less brute-force method was devised by researchers at the Tokyo Institute of Technology, who developed software-based ECC for GPUs (PDF). But the preliminary results showed the performance overhead was acceptable only for compute-intensive applications, not bandwidth-intensive ones.
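
The run-it-twice approach itself is straightforward. The sketch below (plain Python, with a NumPy function standing in for the GPU kernel; the names are invented for illustration, not from any vendor SDK) shows the control flow: run the same computation twice, accept the answer only if the two copies agree, and redo the work otherwise.

```python
import numpy as np

def gpu_kernel(x):
    # Stand-in for a GPU computation; on real hardware this would be a
    # kernel launched on a specific device.
    return np.sqrt(x) * 2.0

def run_with_redundancy(x, max_retries=3):
    """Run the same kernel twice and accept the result only when both
    copies agree; otherwise assume a memory error and retry."""
    for _ in range(max_retries):
        a = gpu_kernel(x)           # first pass (e.g., device 0)
        b = gpu_kernel(x)           # second pass (e.g., device 1)
        if np.array_equal(a, b):    # agreement => accept; mismatch => suspect a bit flip
            return a
    raise RuntimeError("results never matched; suspect bad hardware")

result = run_with_redundancy(np.arange(1_000_000, dtype=np.float64))
```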

There are different categories of memory errors. The kind most people focus on are thought to be the result of cosmic rays, alpha particles in packaging material, or possibly as a side-effect of harsh environmental conditions. They are called soft (or transient) errors and most commonly occur in off-chip DRAM, but can also strike the GPU ASIC itself in local memory or data registers.

Hard (or permanent) errors can also be present on memory chips, but these are easy to detect with simple diagnostic tests. Hard errors are usually dealt with by replacing the offending memory module, but theoretically could be handled in software too. The conventional wisdom is that soft errors are much more common than hard errors, although at least one study (PDF) by Google found just the opposite.
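
Because hard errors are repeatable, a simple pattern test exposes them. The toy routine below is a rough stand-in for a real memory diagnostic: it writes known patterns into a buffer that represents device memory and flags any word that reads back differently.

```python
def find_stuck_bits(memory_size_words=1024):
    """Toy diagnostic for hard (stuck-at) errors: write known patterns,
    read them back, and flag any word that doesn't return what was written.
    'mem' is a plain list standing in for a region of device memory."""
    mem = [0] * memory_size_words
    bad_addresses = []
    for pattern in (0x00000000, 0xFFFFFFFF, 0xAAAAAAAA, 0x55555555):
        for addr in range(memory_size_words):
            mem[addr] = pattern                 # write
        for addr in range(memory_size_words):
            if mem[addr] != pattern:            # read back and compare
                bad_addresses.append((addr, pattern))
    return bad_addresses                        # empty list => no hard errors found
```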

Data errors can also occur at the memory bus interface. Here, at least, the graphics world has made some progress. GDDR5 (Graphics Double Data Rate, version 5) memory, which first appeared in 2008, was the first memory specification for graphics platforms to include an error detection facility. The motivation was the high data rates of GDDR5, which make bad data much more likely. Since GDDR5 includes an error detection protocol, a compatible memory controller is able to take corrective action, basically a retry, to compensate.

That still leaves a lot of data on the GPU board exposed. Adding ECC memory to GPU boards intended for the technical computing market is a relatively straightforward product decision since the extra cost can be passed on to the GPGPU consumer. But changing the GPU core as well as the integrated memory controller to complete the protection requires a tradeoff, since extra transistors are needed for error detection and correction on the ASIC. And because of the expense of designing and testing chips, GPUs are shared across product lines at AMD and NVIDIA.

For example, the latest AMD FireStream products use the Radeon HD 4800 core, while the current NVIDIA Tesla platforms use (presumably) the GeForce GTX 285 core. These are the same ASICs used in high-end graphics products. The challenge for the two GPGPU vendors is to figure out how to design processors that offer the data reliability of a CPU server without unduly impacting their core graphics business.

Patricia Harrell, AMD’s director of Stream Computing, admits that the need for more robust data protection in GPUs already exists. She says error corrected memory is a requirement for a number of customers, especially those looking to deploy GPUs at scale, i.e., high performance computing users with large compute clusters. Although individual memory error rates are low, as you add more GPUs (and thus more graphics memory) to the system, and run applications for longer periods of time, the chance of hitting a flipped memory bit increases proportionally.
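
The scaling argument is easy to quantify. If each GPU has some small probability p of suffering an uncorrected bit flip per hour (the figure below is made up purely for illustration), then the chance that a cluster of N GPUs running for H hours sees at least one error is 1 - (1-p)^(N*H), assuming independent errors, and that number climbs quickly:

```python
def chance_of_at_least_one_error(p_per_gpu_hour, num_gpus, hours):
    """Probability that at least one bit flip occurs somewhere in the run,
    assuming independent errors. p_per_gpu_hour is an illustrative number,
    not a measured GPU error rate."""
    return 1.0 - (1.0 - p_per_gpu_hour) ** (num_gpus * hours)

# One GPU for an hour vs. a 1,000-GPU cluster running for a day:
print(chance_of_at_least_one_error(1e-5, 1, 1))        # ~0.001%
print(chance_of_at_least_one_error(1e-5, 1000, 24))    # ~21%
```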

The AMD FireStream 9270 board already incorporates GDDR5 memory, so data protection is in place at the memory interface in this product. Whenever the memory controller sends or receives data to or from the DRAM, it buffers the data locally while the DRAM checks the integrity of the transfer and returns a status code. If the code indicates an error, the memory controller performs the retry automatically.
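
The control flow boils down to a checksum compare and a resend. The toy loop below captures just that flow for the write side; the CRC polynomial and function names are illustrative and are not taken from the actual GDDR5 EDC specification.

```python
def crc8(data, poly=0x07):
    """Toy CRC-8 over a byte sequence (polynomial chosen for illustration,
    not the actual GDDR5 EDC polynomial)."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def write_with_retry(dram_write, payload, max_retries=4):
    """Mimics the controller-side flow: buffer the data, send it, compare the
    checksum the 'DRAM' reports against the local one, and resend on mismatch.
    dram_write is a stand-in for the hardware transfer; it returns a status CRC."""
    expected = crc8(payload)
    for _ in range(max_retries):
        reported = dram_write(payload)      # transfer the burst; device returns its CRC
        if reported == expected:            # match => data arrived intact
            return True
    return False                            # persistent link or memory problem

# Example: a perfectly behaved 'DRAM' that just echoes back the correct CRC.
ok = write_with_retry(lambda burst: crc8(burst), bytes([0xDE, 0xAD, 0xBE, 0xEF]))
```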

Overall though, AMD seems to be taking a cautious approach to adding error correction to its GPUs. “It’s really important to put in the required features intelligently, and make sure you do the research and engineering to protect the data structures that are going to return the most value,” notes Harrell. If not, she says, you end up with devices that are too big and too hot, and you lose the performance advantages that made GPGPU attractive in the first place.

Harrell says that they are continuing to look at the memory protection issue, but couldn’t offer more specific guidance on AMD’s roadmap. “I think it isn’t clear if that [error correction] is going to be required for the broad market yet,” she adds.

In contrast to AMD’s wait-and-see attitude, NVIDIA appears to be fully committed to bringing error protection to GPU computing. According to Andy Keane, general manager of the GPU computing business unit at NVIDIA, it is not a matter of if, but when. From his point of view, ECC memory is a hard requirement in datacenters. “We have to respond to that by building that kind of support into our roadmap,” Keane said unequivocally. “It will be in a future GPU.”

As far as when ECC-capable Tesla products will show up, Keane wouldn’t say. It’s likely that NVIDIA’s OEM partners and GPU computing developers already have a pretty good idea of the timeline (under NDA of course), so systems and software based on high-integrity GPUs may already be in the works. In a Real World Technologies article that spells out the major costs and benefits of error corrected memory in GPUs, analyst David Kanter predicts that NVIDIA’s next GPGPU product release will include ECC.

Presumably Intel is also mulling over its options, since Larrabee, the company’s first high-end graphics processor, is scheduled to be released into the wild next year. But Intel insists the first version of Larrabee will target the traditional graphics space, making it unlikely that they would introduce ECC into the mix. Of course, the company could reverse itself and release a true HPC processor variant with ECC bells and whistles.

My sense is that ECC will come to GPU computing products sooner (1-2 years) rather than later (3-5 years). Being able to ensure data integrity in these devices will widen the aperture for HPC applications and help push GPGPU into true supercomputers. Just like double precision performance and on-board memory capacity, error correction is destined to become an important differentiator in high-end GPU computing.
