June 24, 2010
Following my blog last week about the transition to GPU computing in HPC, I ran into a couple of items that cast the subject in a somewhat different light. One was a paper written by a team of computer science researchers at Georgia Tech titled "On the Limits of GPU Acceleration" (hat tip to NERSC's John Shalf for bringing it to my attention). The other item surfaced as a result of an Intel presentation on the relative merits of CPU and GPU architectures for throughput computing, titled "Debunking the 100X GPU vs. CPU Myth." I think you can guess where this is going.
Turning first to the Georgia Tech paper, authors Richard Vuduc and four colleagues set out to compare CPU and GPU performance on three typical computations in scientific computing: iterative sparse linear solvers, sparse Cholesky factorization, and the fast multipole method. If you don't know what those are, you can look them up later. Suffice it to say that they are representative of HPC-type algorithms that are neither completely regular, like dense matrix multiplication, nor completely irregular, such as graph-intensive computations.
For these codes, Vuduc and company found that a GPU was only equivalent to one or two quad-core Nehalem CPUs performance-wise. And since a single high-end GPU draws nearly as much power as two high-end x86 CPUs, from a performance-per-watt standpoint, the GPU advantage nearly disappears. They also bring up the fact that the additional cost of transferring data between the CPU and the GPU can further narrow the built-in FLOPS advantage enjoyed by the GPU. The authors sum it up thusly:
In particular, we argue that, for a moderately complex class of “irregular” computations, even well-tuned GPGPU accelerated implementations on currently available systems will deliver performance that is, roughly speaking, only comparable to well-tuned code for general-purpose multicore CPU systems, within a roughly comparable power footprint.
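To make the performance-per-watt and data-transfer arguments concrete, here is a minimal back-of-the-envelope sketch in Python. The function names and every number in it are hypothetical, chosen only for illustration; none of them come from the Georgia Tech paper.

```python
def effective_speedup(cpu_time_s, kernel_speedup, transfer_time_s):
    """GPU-over-CPU speedup once host<->device transfer time is counted.

    All inputs are illustrative assumptions, not measured values.
    """
    gpu_time_s = cpu_time_s / kernel_speedup + transfer_time_s
    return cpu_time_s / gpu_time_s


def perf_per_watt_ratio(speedup, gpu_watts, cpu_watts):
    """Ratio of GPU performance-per-watt to CPU performance-per-watt."""
    return speedup * cpu_watts / gpu_watts


if __name__ == "__main__":
    # Hypothetical: the kernel alone runs 2x faster on the GPU, but moving
    # the data costs 1 second against a 10 second CPU run.
    print(effective_speedup(cpu_time_s=10.0, kernel_speedup=2.0,
                            transfer_time_s=1.0))   # about 1.67

    # Hypothetical: a GPU that is 2x faster than one CPU but draws twice
    # the power has no performance-per-watt advantage at all.
    print(perf_per_watt_ratio(speedup=2.0, gpu_watts=200.0,
                              cpu_watts=100.0))     # 1.0
```

The point of the sketch is only that a modest raw speedup can shrink quickly once transfer time and power draw enter the ratio.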
The GPU technology chosen was based on NVIDIA's Tesla C1060/S1070 and GTX285 systems, so the authors do admit that the results may have been very different if they had run these codes on the latest ATI hardware or the new NVIDIA Fermi card. Also, while the researchers made an attempt to tune both the CPU and GPU codes for best performance, they may have missed some important opportunities.
Presumably the Georgia Tech research was unencumbered by commercial agendas. Support for the work came from the National Science Foundation, the Semiconductor Research Corporation, and DARPA. It is worth noting, however, that Intel was also listed as a funder. Hmmm.
Which provides an interesting segue to our second item. At the International Symposium on Computer Architecture in Saint-Malo, France, Intel presented a paper that cast a few more aspersions on the lowly graphics processor. Like the Georgia Tech researchers, the Intel folks did their own CPU vs. GPU performance benchmarking, in this case, matching the Intel Core i7 960 with the NVIDIA GTX280. They used 14 different throughput computing kernels and found a mean speedup of 2.5X in favor of the GPU. The GPU did best on the GKJ kernel (collision detection), with a 14-fold performance advantage, and worst on the Sort and Solv kernels, where the CPU actually outran the GPU.
The GPU-loving folks at NVIDIA took this as good news, however, noting the 14-fold performance advantage is quite nice, thank you. In a blog post this week, NVIDIAn Andy Keane writes:
It’s a rare day in the world of technology when a company you compete with stands up at an important conference and declares that your technology is *only* up to 14 times faster than theirs. In fact in all the 26 years I’ve been in this industry, I can’t recall another time I’ve seen a company promote competitive benchmarks that are an order of magnitude slower.
Of course, the 14X value was the best kernel result for the GPU, not the average. Intel's real point was that they couldn't produce 100-fold increases in performance on the GPU, as NVIDIA claims for some apps. NVIDIA actually freely admits that not all codes will get the two orders of magnitude increase. Keane does, however, list ten examples of real codes where users did record a 100X or better performance boost compared to a CPU implementation. He also points out that for these throughput benchmarks, Intel relied on a previous generation GPU, the GTX280, and doubts that the testers even optimized the GPU code properly -- or at all.
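The gap between a best-case number and an average is easy to see with a few lines of Python. This is a sketch only: the `summarize_speedups` helper and the sample values are made up for illustration and are not Intel's per-kernel results, and I am not assuming which kind of mean Intel reported.

```python
from statistics import geometric_mean, mean


def summarize_speedups(speedups):
    """Return best-case, arithmetic-mean and geometric-mean speedup.

    `speedups` holds GPU-over-CPU ratios; values below 1.0 mean the CPU won.
    """
    return {
        "best": max(speedups),
        "arithmetic_mean": mean(speedups),
        "geometric_mean": geometric_mean(speedups),
    }


if __name__ == "__main__":
    # Hypothetical ratios: one kernel where the CPU wins, one modest GPU win,
    # and one big GPU win that dominates the headline.
    sample = [0.5, 2.0, 8.0]
    for name, value in summarize_speedups(sample).items():
        print(f"{name}: {value:.2f}")
```

With these made-up inputs the best case is 8X, while the arithmetic mean is 3.5X and the geometric mean is about 2X, which is why quoting only the top kernel paints a rosier picture than the full set.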
So what does it all mean? Well, when it comes to the CPU vs. GPU performance wars, it pays to know who's running the benchmarks -- not only in relation to vendor loyalties, but also programming skills, the software tools they used, etc. It's also worth comparing like-to-like as far as processor generations. In this regard, I think the NVIDIA Fermi GPU should be used as sort of a ground floor for all future benchmarks. To my mind, it represents the first GPU that can really be called "general-purpose" without rolling your eyes.
It's also important to keep in mind the effort required to port these parallel codes to their respective platforms. Skeptics are quick to point out that porting code to a GPU requires a significant up-front investment. But in his blog Keane reminds us that scaling codes on multicore CPUs is not a guaranteed path to delivering performance gains either. As a wise computer scientist once said: "All hardware sucks; all software sucks. Some just suck more than others."
Posted by Michael Feldman - June 24, 2010 @ 8:38 PM, Pacific Daylight Time
Michael Feldman is the editor of HPCwire.