GPU Computing II: Where the Truth Lies
Following my blog last week about the transition to GPU computing in HPC, I ran into a couple of items that cast the subject in a somewhat different light. One was a paper written by a team of computer science researchers at Georgia Tech titled “On the Limits of GPU Acceleration” (hat tip to NERSC’s John Shalf for bringing it to my attention). The other item surfaced as a result of an Intel presentation on the relative merits of CPU and GPU architectures for throughput computing, titled “Debunking the 100X GPU vs. CPU Myth.” I think you can guess where this is going.
Turning first to the Georgia Tech paper, authors Richard Vuduc and four colleagues set out to compare CPU and GPU performance on three typical computations in scientific computing: iterative sparse linear solvers, sparse Cholesky factorization, and the fast multipole method. If you don’t know what those are, you can look them up later. Suffice it to say that they are representative of HPC-type algorithms that are neither completely regular, like dense matrix multiplication, nor completely irregular, like graph-intensive computations.
For these codes, Vuduc and company found that a GPU was only equivalent to one or two quad-core Nehalem CPUs performance-wise. And since a single high-end GPU draws nearly as much power as two high-end x86 CPUs, from a performance-per-watt standpoint, the GPU advantage nearly disappears. They also point out that the additional cost of transferring data between the CPU and the GPU can further narrow the built-in FLOPS advantage enjoyed by the GPU. The authors sum it up thusly:
In particular, we argue that, for a moderately complex class of “irregular” computations, even well-tuned GPGPU accelerated implementations on currently available systems will deliver performance that is, roughly speaking, only comparable to well-tuned code for general-purpose multicore CPU systems, within a roughly comparable power footprint.
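To see how quickly the arithmetic closes the gap, here is a back-of-the-envelope sketch in Python. Every number in it is invented for illustration and none comes from the Georgia Tech paper; the function names are mine as well. It assumes the CPU-GPU transfer time simply adds to the GPU's compute time (no overlap with computation), and that performance-per-watt is just speedup divided by the power ratio.

```python
def effective_speedup(cpu_time_s, kernel_speedup, transfer_time_s):
    """Overall GPU speedup once CPU<->GPU copy time is added to the GPU run.

    Assumes transfers are not overlapped with computation.
    """
    gpu_time_s = cpu_time_s / kernel_speedup + transfer_time_s
    return cpu_time_s / gpu_time_s


def perf_per_watt_ratio(speedup, gpu_watts, cpu_watts):
    """GPU performance-per-watt relative to the CPU (1.0 means a tie)."""
    return speedup * cpu_watts / gpu_watts


# Hypothetical inputs, chosen only to illustrate the argument.
cpu_time_s = 10.0       # run time on one CPU
kernel_speedup = 2.0    # GPU kernel is worth "two CPUs"
transfer_time_s = 2.5   # time spent copying data to and from the GPU
gpu_watts = 200.0       # GPU draws about as much as two CPUs
cpu_watts = 100.0

overall = effective_speedup(cpu_time_s, kernel_speedup, transfer_time_s)
ratio = perf_per_watt_ratio(overall, gpu_watts, cpu_watts)

print(f"speedup including transfers: {overall:.2f}x")
print(f"perf-per-watt vs. CPU:       {ratio:.2f}")
```

With these made-up inputs, a 2X kernel speedup shrinks to about 1.33X once transfers are counted, and the GPU ends up behind the CPU on performance-per-watt. Set the transfer time to zero and the two come out exactly even, which is roughly the situation the authors describe.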
The GPU technology chosen was based on NVIDIA’s Tesla C1060/S1070 and GTX285 systems, so the authors do admit that the results may have been very different if they had run these codes on the latest ATI hardware or the new NVIDIA Fermi card. Also, while the researchers made an attempt to tune both the CPU and GPU codes for best performance, they may have missed some important opportunities.
Presumably the Georgia Tech research was unencumbered by commercial agendas. Support for the work came from the National Science Foundation, the Semiconductor Research Corporation, and DARPA. It is worth noting, however, that Intel was also listed as a funder. Hmmm.
Which provides an interesting segue to our second item. At the International Symposium on Computer Architecture in Saint-Malo, France, Intel presented a paper that cast a few more aspersions on the lowly graphics processor. Like the Georgia Tech researchers, the Intel folks did their own CPU vs. GPU performance benchmarking, in this case, matching the Intel Core i7 960 with the NVIDIA GTX280. They used 14 different throughput computing kernels and found a mean speedup of 2.5X in favor of the GPU. The GPU did best on the GJK kernel (collision detection), with a 14-fold performance advantage, and worst on the Sort and Solv kernels, where the CPU actually outran the GPU.
The GPU-loving folks at NVIDIA took this as good news, however, noting the 14-fold performance advantage is quite nice, thank you. In a blog post this week, NVIDIAn Andy Keane writes:
It’s a rare day in the world of technology when a company you compete with stands up at an important conference and declares that your technology is *only* up to 14 times faster than theirs. In fact in all the 26 years I’ve been in this industry, I can’t recall another time I’ve seen a company promote competitive benchmarks that are an order of magnitude slower.
Of course the 14X value was the best kernel result for the GPU, not the average. Intel’s real point was that they couldn’t produce 100-fold increases in performance on the GPU, like NVIDIA claims for some apps. NVIDIA actually freely admits that not all codes will get the two orders of magnitude increase. Keane does, however, list ten examples of real codes where users did record a 100X or better performance boost compared to a CPU implementation. He also points out that for these throughput benchmarks, Intel relied on a previous generation GPU, the GTX280, and doubts that the testers even optimized the GPU code properly, or at all.
So what does it all mean? Well, when it comes to the CPU vs. GPU performance wars, it pays to know who’s running the benchmarks: not only their vendor loyalties, but also their programming skills, the software tools they used, and so on. It’s also worth comparing like-to-like as far as processor generations. In this regard, I think the NVIDIA Fermi GPU should be used as sort of a ground floor for all future benchmarks. To my mind, it represents the first GPU that can really be called “general-purpose” without rolling your eyes.
It’s also important to keep in mind the effort required to port these parallel codes to their respective platforms. Skeptics are quick to point out that porting code to a GPU requires a significant up-front investment. But in his blog Keane reminds us that scaling codes on multicore CPUs is not a guaranteed path to delivering performance gains either. As a wise computer scientist once said: “All hardware sucks; all software sucks. Some just suck more than others.”